doc/recordmodel-grs.xml

   1  <chapter id="grs">
   2   <!-- $Id: recordmodel-grs.xml,v 1.9 2007-05-24 13:44:09 adam Exp $ -->
   3   <title>&acro.grs1; Record Model and Filter Modules</title>
   4
   5      <note>
   6       <para>
   7         The functionality of this record model has been improved and
   8         replaced by the DOM &acro.xml; record model. See
   9         <xref linkend="record-model-domxml"/>.
  10       </para>
  11      </note>
  12
  13   <para>
  14    The record model described in this chapter applies to the fundamental,
  15    structured
  16    record type <literal>grs</literal>, introduced in
  17    <xref linkend="componentmodulesgrs"/>.
  18   </para>
  19
  20
  21   <section id="grs-filters">
  22    <title>&acro.grs1; Record Filters</title>
  23    <para>
  24     Many basic subtypes of the <emphasis>grs</emphasis> type are
  25     currently available:
  26    </para>
  27
  28    <para>
  29     <variablelist>
  30      <varlistentry>
  31       <term><literal>grs.sgml</literal></term>
  32       <listitem>
  33        <para>
  34         This is the canonical input format
   35         described in <xref linkend="grs-canonical-format"/>. It uses a
   36         simple &acro.sgml;-like syntax.
  37        </para>
  38       </listitem>
  39      </varlistentry>
  40      <varlistentry>
  41       <term><literal>grs.marc.</literal><replaceable>type</replaceable></term>
  42       <listitem>
  43        <para>
  44         This allows &zebra; to read
  45         records in the ISO2709 (&acro.marc;) encoding standard.
   46         The last parameter <replaceable>type</replaceable> names the
  47         <literal>.abs</literal> file (see below)
  48         which describes the specific &acro.marc; structure of the input record as
  49         well as the indexing rules.
  50        </para>
   51        <para>The <literal>grs.marc</literal> filter uses an internal representation
   52         which is not &acro.xml; conformant. In particular, &acro.marc; tags are
   53         presented as elements with the same name, and &acro.xml; element names
   54         may not start with digits. Therefore this filter is only
   55         suitable for systems returning &acro.grs1; and &acro.marc; records. For &acro.xml;,
   56         use the <literal>grs.marcxml</literal> filter instead (see below).
  57        </para>
  58        <para>
  59          The loadable <literal>grs.marc</literal> filter module
  60          is packaged in the GNU/Debian package
   61         <literal>libidzebra2.0-mod-grs-marc</literal>.
  62        </para>
  63       </listitem>
  64      </varlistentry>
  65      <varlistentry>
  66       <term><literal>grs.marcxml.</literal><replaceable>type</replaceable></term>
  67       <listitem>
  68        <para>
  69         This allows &zebra; to read ISO2709 encoded records.
   70         The last parameter <replaceable>type</replaceable> names the
  71         <literal>.abs</literal> file (see below)
  72         which describes the specific &acro.marc; structure of the input record as
  73         well as the indexing rules.
  74        </para>
  75        <para>
  76         The internal representation for <literal>grs.marcxml</literal>
  77         is the same as for <ulink url="&url.marcxml;">&acro.marcxml;</ulink>.
   78         It is slightly more complicated to work with than
   79         <literal>grs.marc</literal>, but it is &acro.xml; conformant.
  80        </para>
  81        <para>
  82         The loadable <literal>grs.marcxml</literal> filter module
  83         is also contained in the GNU/Debian package
   84         <literal>libidzebra2.0-mod-grs-marc</literal>.
  85        </para>
  86       </listitem>
  87      </varlistentry>
  88      <varlistentry>
  89       <term><literal>grs.xml</literal></term>
  90       <listitem>
  91        <para>
  92         This filter reads &acro.xml; records and uses
  93         <ulink url="http://expat.sourceforge.net/">Expat</ulink> to
  94         parse them and convert them into ID&zebra;'s internal
  95         <literal>grs</literal> record model.
   96         Only one record per file is supported, because &acro.xml; does
   97         not allow two documents to follow each other (there is no way
   98         to know when a document is finished).
   99         This filter is only available if &zebra; is compiled with Expat support.
 100        </para>
 101        <para>
 102         The loadable <literal>grs.xml</literal> filter module
  103         is packaged in the GNU/Debian package
  104         <literal>libidzebra2.0-mod-grs-xml</literal>.
 105         </para>
 106       </listitem>
 107      </varlistentry>
 108      <varlistentry>
 109       <term><literal>grs.regx.</literal><replaceable>filter</replaceable></term>
 110       <listitem>
 111        <para>
  112         This enables a user-supplied regular expression input
 113         filter described in <xref linkend="grs-regx-tcl"/>.
 114        </para>
 115        <para>
 116         The loadable <literal>grs.regx</literal> filter module
 117         is packaged in the GNU/Debian package
  118         <literal>libidzebra2.0-mod-grs-regx</literal>.
 119        </para>
 120       </listitem>
 121      </varlistentry>
 122      <varlistentry>
 123       <term><literal>grs.tcl.</literal><replaceable>filter</replaceable></term>
 124       <listitem>
 125        <para>
  126         Similar to <literal>grs.regx</literal>, but using Tcl for rules. It is described in
  127         <xref linkend="grs-regx-tcl"/>.
 128        </para>
 129        <para>
 130         The loadable <literal>grs.tcl</literal> filter module
 131         is also packaged in the GNU/Debian package
  132         <literal>libidzebra2.0-mod-grs-regx</literal>.
 133        </para>
 134       </listitem>
 135      </varlistentry>
 136
 137     </variablelist>
 138    </para>
 139
 140    <section id="grs-canonical-format">
 141     <title>&acro.grs1; Canonical Input Format</title>
 142
 143     <para>
 144      Although input data can take any form, it is sometimes useful to
 145      describe the record processing capabilities of the system in terms of
 146      a single, canonical input format that gives access to the full
 147      spectrum of structure and flexibility in the system. In &zebra;, this
 148      canonical format is an "&acro.sgml;-like" syntax.
 149     </para>
 150
 151     <para>
 152      To use the canonical format specify <literal>grs.sgml</literal> as
 153      the record type.
 154     </para>
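
            <para>
             As a minimal sketch, the corresponding line in
             <literal>zebra.cfg</literal> might look as follows. The
             <literal>recordType</literal> setting name is an assumption
             here, since this chapter does not spell it out; check it
             against the <literal>zebra.cfg</literal> reference for your
             installation.
            </para>

            <para>

             <screen>
              # Assumed setting name: select the canonical input format
              recordType: grs.sgml
             </screen>

            </para>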
 155
 156     <para>
 157      Consider a record describing an information resource (such a record is
 158      sometimes known as a <emphasis>locator record</emphasis>).
 159      It might contain a field describing the distributor of the
 160      information resource, which might in turn be partitioned into
 161      various fields providing details about the distributor, like this:
 162     </para>
 163
 164     <para>
 165
 166      <screen>
 167       &#60;Distributor&#62;
 168         &#60;Name&#62; USGS/WRD &#60;/Name&#62;
 169         &#60;Organization&#62; USGS/WRD &#60;/Organization&#62;
 170         &#60;Street-Address&#62;
 171           U.S. GEOLOGICAL SURVEY, 505 MARQUETTE, NW
 172         &#60;/Street-Address&#62;
 173         &#60;City&#62; ALBUQUERQUE &#60;/City&#62;
 174         &#60;State&#62; NM &#60;/State&#62;
 175         &#60;Zip-Code&#62; 87102 &#60;/Zip-Code&#62;
 176         &#60;Country&#62; USA &#60;/Country&#62;
 177         &#60;Telephone&#62; (505) 766-5560 &#60;/Telephone&#62;
 178       &#60;/Distributor&#62;
 179      </screen>
 180
 181     </para>
 182
 183     <!-- There is no indentation in the example above!  -H
 184     -note-
 185      -para-
 186       The indentation used above is used to illustrate how &zebra;
 187       interprets the mark-up. The indentation, in itself, has no
 188       significance to the parser for the canonical input format, which
 189       discards superfluous whitespace.
 190      -/para-
 191     -/note-
 192     -->
 193
 194     <para>
 195      The keywords surrounded by &lt;...&gt; are
 196      <emphasis>tags</emphasis>, while the sections of text
 197      in between are the <emphasis>data elements</emphasis>.
 198      A data element is characterized by its location in the tree
 199      that is made up by the nested elements.
  200      Each element is terminated by a closing tag, beginning
  201      with <literal>&#60;/</literal> and containing the same symbolic
 202      tag-name as the corresponding opening tag.
 203      The general closing tag - <literal>&lt;/&gt;</literal> -
 204      terminates the element started by the last opening tag. The
 205      structuring of elements is significant.
 206      The element <emphasis>Telephone</emphasis>,
 207      for instance, may be indexed and presented to the client differently,
 208      depending on whether it appears inside the
 209      <emphasis>Distributor</emphasis> element, or some other,
  210      structured data element such as a <emphasis>Supplier</emphasis> element.
 211     </para>
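
            <para>
             For illustration, part of the record above could be written
             with the general closing tag in place of the named closing
             tags. This is a sketch derived from the rule just described,
             not an additional sample taken from the &zebra; distribution:
            </para>

            <para>

             <screen>
              &#60;Distributor&#62;
                &#60;Name&#62; USGS/WRD &#60;/&#62;
                &#60;City&#62; ALBUQUERQUE &#60;/&#62;
                &#60;State&#62; NM &#60;/&#62;
              &#60;/Distributor&#62;
             </screen>

            </para>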
 212
 213     <section id="grs-record-root">
 214      <title>Record Root</title>
 215
 216      <para>
 217       The first tag in a record describes the root node of the tree that
 218       makes up the total record. In the canonical input format, the root tag
 219       should contain the name of the schema that lends context to the
 220       elements of the record
 221       (see <xref linkend="grs-internal-representation"/>).
  222       The following is a GILS record that
  223       contains only a single element. Strictly speaking, that makes it an
  224       illegal GILS record, since the GILS profile includes several mandatory
  225       elements. However, &zebra; does not validate the contents of a record against
  226       the &acro.z3950; profile; it merely attempts to match up elements
  227       of a local representation with the given schema.
 228      </para>
 229
 230      <para>
 231
 232       <screen>
 233        &#60;gils&#62;
 234           &#60;title&#62;Zen and the Art of Motorcycle Maintenance&#60;/title&#62;
 235        &#60;/gils&#62;
 236       </screen>
 237
 238      </para>
 239
 240     </section>
 241
 242     <section id="grs-variants">
 243      <title>Variants</title>
 244
 245      <para>
 246       &zebra; allows you to provide individual data elements in a number of
 247       <emphasis>variant forms</emphasis>. Examples of variant forms are
 248       textual data elements which might appear in different languages, and
 249       images which may appear in different formats or layouts.
 250       The variant system in &zebra; is essentially a representation of
 251       the variant mechanism of &acro.z3950;-1995.
 252      </para>
 253
 254      <para>
 255       The following is an example of a title element which occurs in two
 256       different languages.
 257      </para>
 258
 259      <para>
 260
 261       <screen>
 262        &#60;title&#62;
 263        &#60;var lang lang "eng"&#62;
 264        Zen and the Art of Motorcycle Maintenance&#60;/&#62;
 265        &#60;var lang lang "dan"&#62;
 266        Zen og Kunsten at Vedligeholde en Motorcykel&#60;/&#62;
 267        &#60;/title&#62;
 268       </screen>
 269
 270      </para>
 271
 272      <para>
 273       The syntax of the <emphasis>variant element</emphasis> is
 274       <literal>&lt;var class type value&gt;</literal>.
 275       The available values for the <emphasis>class</emphasis> and
 276       <emphasis>type</emphasis> fields are given by the variant set
 277       that is associated with the current schema
 278       (see <xref linkend="grs-variants"/>).
 279      </para>
 280
 281      <para>
 282       Variant elements are terminated by the general end-tag &#60;/&#62;, by
 283       the variant end-tag &#60;/var&#62;, by the appearance of another variant
 284       tag with the same <emphasis>class</emphasis> and
 285       <emphasis>value</emphasis> settings, or by the
 286       appearance of another, normal tag. In other words, the end-tags for
 287       the variants used in the example above could have been omitted.
 288      </para>
 289
 290      <para>
 291       Variant elements can be nested. The element
 292      </para>
 293
 294      <para>
 295
 296       <screen>
 297        &#60;title&#62;
 298        &#60;var lang lang "eng"&#62;&#60;var body iana "text/plain"&#62;
 299        Zen and the Art of Motorcycle Maintenance
 300        &#60;/title&#62;
 301       </screen>
 302
 303      </para>
 304
 305      <para>
  306       associates two variant components with the variant list for the title
  307       element.
 308      </para>
 309
 310      <para>
 311       Given the nesting rules described above, we could write
 312      </para>
 313
 314      <para>
 315
 316       <screen>
 317        &#60;title&#62;
  318        &#60;var body iana "text/plain"&#62;
 319        &#60;var lang lang "eng"&#62;
 320        Zen and the Art of Motorcycle Maintenance
 321        &#60;var lang lang "dan"&#62;
 322        Zen og Kunsten at Vedligeholde en Motorcykel
 323        &#60;/title&#62;
 324       </screen>
 325
 326      </para>
 327
 328      <para>
 329       The title element above comes in two variants. Both have the IANA body
 330       type "text/plain", but one is in English, and the other in
 331       Danish. The client, using the element selection mechanism of &acro.z3950;,
 332       can retrieve information about the available variant forms of data
 333       elements, or it can select specific variants based on the requirements
 334       of the end-user.
 335      </para>
 336
 337     </section>
 338
 339    </section>
 340
 341    <section id="grs-regx-tcl">
 342     <title>&acro.grs1; REGX And TCL Input Filters</title>
 343
 344     <para>
 345      In order to handle general input formats, &zebra; allows the
 346      operator to define filters which read individual records in their
 347      native format and produce an internal representation that the system
 348      can work with.
 349     </para>
 350
 351     <para>
 352      Input filters are ASCII files, generally with the suffix
 353      <literal>.flt</literal>.
 354      The system looks for the files in the directories given in the
 355      <emphasis>profilePath</emphasis> setting in the
  356      <literal>zebra.cfg</literal> file.
 357      The record type for the filter is
 358      <literal>grs.regx.</literal><emphasis>filter-filename</emphasis>
 359      (fundamental type <literal>grs</literal>, file read
 360      type <literal>regx</literal>, argument
 361      <emphasis>filter-filename</emphasis>).
 362     </para>
 363
 364     <para>
 365      Generally, an input filter consists of a sequence of rules, where each
 366      rule consists of a sequence of expressions, followed by an action. The
 367      expressions are evaluated against the contents of the input record,
 368      and the actions normally contribute to the generation of an internal
 369      representation of the record.
 370     </para>
 371
 372     <para>
  373      An expression can be any of the following:
 374     </para>
 375
 376     <para>
 377      <variablelist>
 378
 379       <varlistentry>
 380        <term><literal>INIT</literal></term>
 381        <listitem>
 382         <para>
 383          The action associated with this expression is evaluated
 384          exactly once in the lifetime of the application, before any records
 385          are read. It can be used in conjunction with an action that
 386          initializes tables or other resources that are used in the processing
 387          of input records.
 388         </para>
 389        </listitem>
 390       </varlistentry>
 391       <varlistentry>
 392        <term><literal>BEGIN</literal></term>
 393        <listitem>
 394         <para>
 395          Matches the beginning of the record. It can be used to
 396          initialize variables, etc. Typically, the
 397          <emphasis>BEGIN</emphasis> rule is also used
 398          to establish the root node of the record.
 399         </para>
 400        </listitem>
 401       </varlistentry>
 402       <varlistentry>
 403        <term><literal>END</literal></term>
 404        <listitem>
 405         <para>
  406          Matches the end of the record, when all of the contents
  407          of the record have been processed.
 408         </para>
 409        </listitem>
 410       </varlistentry>
 411       <varlistentry>
 412        <term>
 413         <literal>/</literal><replaceable>reg</replaceable><literal>/</literal>
 414        </term>
 415        <listitem>
 416         <para>
 417          Matches regular expression pattern <replaceable>reg</replaceable>
 418          from the input record. The operators supported are the same
 419          as for regular expression queries. Refer to
 420          <xref linkend="querymodel-regular"/>.
 421         </para>
 422        </listitem>
 423       </varlistentry>
 424       <varlistentry>
 425        <term><literal>BODY</literal></term>
 426        <listitem>
 427         <para>
 428          This keyword may only be used between two patterns.
 429          It matches everything between (not including) those patterns.
 430         </para>
 431        </listitem>
 432       </varlistentry>
 433       <varlistentry>
 434        <term><literal>FINISH</literal></term>
 435        <listitem>
 436         <para>
  437          The action associated with this expression is evaluated
 438          once, before the application terminates. It can be used to release
 439          system resources - typically ones allocated in the
 440          <emphasis>INIT</emphasis> step.
 441         </para>
 442        </listitem>
 443       </varlistentry>
 444      </variablelist>
 445     </para>
 446
 447     <para>
 448      An action is surrounded by curly braces ({...}), and
 449      consists of a sequence of statements. Statements may be separated
 450      by newlines or semicolons (;).
 451      Within actions, the strings that matched the expressions
 452      immediately preceding the action can be referred to as
 453      $0, $1, $2, etc.
 454     </para>
 455
 456     <para>
 457      The available statements are:
 458     </para>
 459
 460     <para>
 461      <variablelist>
 462
 463       <varlistentry>
 464        <term>begin <replaceable>type [parameter ... ]</replaceable></term>
 465        <listitem>
 466         <para>
 467          Begin a new
 468          data element. The <replaceable>type</replaceable> is one of
 469          the following:
 470          <variablelist>
 471
 472           <varlistentry>
 473            <term>record</term>
 474            <listitem>
 475             <para>
 476              Begin a new record. The following parameter should be the
  477              name of the schema that describes the structure of the record, e.g.
 478              <literal>gils</literal> or <literal>wais</literal> (see below).
 479              The <literal>begin record</literal> call should precede
 480              any other use of the <replaceable>begin</replaceable> statement.
 481             </para>
 482            </listitem>
 483           </varlistentry>
 484           <varlistentry>
 485            <term>element</term>
 486            <listitem>
 487             <para>
 488              Begin a new tagged element. The parameter is the
 489              name of the tag. If the tag is not matched anywhere in the tagsets
 490              referenced by the current schema, it is treated as a local string
 491              tag.
 492             </para>
 493            </listitem>
 494           </varlistentry>
 495           <varlistentry>
 496            <term>variant</term>
 497            <listitem>
 498             <para>
 499              Begin a new node in a variant tree. The parameters are
 500              <replaceable>class type value</replaceable>.
 501             </para>
 502            </listitem>
 503           </varlistentry>
 504          </variablelist>
 505         </para>
 506        </listitem>
 507       </varlistentry>
 508       <varlistentry>
 509        <term>data <replaceable>parameter</replaceable></term>
 510        <listitem>
 511         <para>
 512          Create a data element. The concatenated arguments make
 513          up the value of the data element.
 514          The option <literal>-text</literal> signals that
 515          the layout (whitespace) of the data should be retained for
 516          transmission.
 517          The option <literal>-element</literal>
 518          <replaceable>tag</replaceable> wraps the data up in
 519          the <replaceable>tag</replaceable>.
 520          The use of the <literal>-element</literal> option is equivalent to
 521          preceding the command with a <replaceable>begin
 522           element</replaceable> command, and following
 523          it with the <replaceable>end</replaceable> command.
 524         </para>
 525        </listitem>
 526       </varlistentry>
 527       <varlistentry>
 528        <term>end <replaceable>[type]</replaceable></term>
 529        <listitem>
 530         <para>
 531          Close a tagged element. If no parameter is given,
 532          the last element on the stack is terminated.
 533          The first parameter, if any, is a type name, similar
 534          to the <replaceable>begin</replaceable> statement.
 535          For the <replaceable>element</replaceable> type, a tag
 536          name can be provided to terminate a specific tag.
 537         </para>
 538        </listitem>
 539       </varlistentry>
 540
 541       <varlistentry>
 542        <term>unread <replaceable>no</replaceable></term>
 543        <listitem>
 544         <para>
  545          Move the input pointer to the offset of the first character that
  546          matched the rule given by <replaceable>no</replaceable>.
  547          Rules are numbered from left to right, starting at zero:
  548          the first rule is number 0, the second is number 1, and so on.
 549         </para>
 550        </listitem>
 551       </varlistentry>
 552
 553      </variablelist>
 554     </para>
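
            <para>
             As a sketch of the equivalence described for the
             <literal>-element</literal> option, the two alternative rules
             below should build the same tagged element. The pattern and
             the <literal>title</literal> tag are chosen for illustration
             only; a real filter would contain one form or the other, not
             both.
            </para>

            <para>

             <screen>
              /^Subject:/ BODY /$/ { data -element title $1 }

              /^Subject:/ BODY /$/ {
                 begin element title
                 data $1
                 end element
              }
             </screen>

            </para>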
 555
 556     <para>
 557      The following input filter reads a Usenet news file, producing a
 558      record in the WAIS schema. Note that the body of a news posting is
 559      separated from the list of headers by a blank line (or rather a
  560      sequence of two newline characters).
 561     </para>
 562
 563     <para>
 564
 565      <screen>
 566       BEGIN                { begin record wais }
 567
 568       /^From:/ BODY /$/    { data -element name $1 }
 569       /^Subject:/ BODY /$/ { data -element title $1 }
 570       /^Date:/ BODY /$/    { data -element lastModified $1 }
 571       /\n\n/ BODY END      {
 572          begin element bodyOfDisplay
 573          begin variant body iana "text/plain"
 574          data -text $1
 575          end record
 576       }
 577      </screen>
 578
 579     </para>
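
            <para>
             Assuming the filter above were saved as
             <literal>usenet.flt</literal> (a hypothetical file name) in
             one of the <emphasis>profilePath</emphasis> directories, the
             matching <literal>zebra.cfg</literal> entries might look like
             the sketch below. The directory list is a placeholder, and
             the exact setting syntax is an assumption that should be
             checked against the <literal>zebra.cfg</literal> reference.
            </para>

            <para>

             <screen>
              # Placeholder directory list searched for usenet.flt
              profilePath: .:/usr/local/share/idzebra/tab

              # Fundamental type grs, file read type regx, argument usenet
              recordType: grs.regx.usenet
             </screen>

            </para>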
 580
 581     <para>
 582      If &zebra; is compiled with support for Tcl enabled, the statements
 583      described above are supplemented with a complete
 584      scripting environment, including control structures (conditional
 585      expressions and loop constructs), and powerful string manipulation
 586      mechanisms for modifying the elements of a record.
 587     </para>
 588
 589    </section>
 590
 591   </section>
 592
 593   <section id="grs-internal-representation">
 594    <title>&acro.grs1; Internal Record Representation</title>
 595
 596    <para>
 597     When records are manipulated by the system, they're represented in a
 598     tree-structure, with data elements at the leaf nodes, and tags or
 599     variant components at the non-leaf nodes. The root-node identifies the
 600     schema that lends context to the tagging and structuring of the
 601     record. Imagine a simple record, consisting of a 'title' element and
 602     an 'author' element:
 603    </para>
 604
 605    <para>
 606
 607     <screen>
 608      ROOT
 609         TITLE     "Zen and the Art of Motorcycle Maintenance"
 610         AUTHOR    "Robert Pirsig"
 611     </screen>
 612
 613    </para>
 614
 615    <para>
 616     A slightly more complex record would have the author element consist
 617     of two elements, a surname and a first name:
 618    </para>
 619
 620    <para>
 621
 622     <screen>
 623      ROOT
 624         TITLE  "Zen and the Art of Motorcycle Maintenance"
 625         AUTHOR
 626            FIRST-NAME "Robert"
 627            SURNAME    "Pirsig"
 628     </screen>
 629
 630    </para>
 631
 632    <para>
 633     The root of the record will refer to the record schema that describes
 634     the structuring of this particular record. The schema defines the
 635     element tags (TITLE, FIRST-NAME, etc.) that may occur in the record, as
 636     well as the structuring (SURNAME should appear below AUTHOR, etc.). In
 637     addition, the schema establishes element set names that are used by
 638     the client to request a subset of the elements of a given record. The
 639     schema may also establish rules for converting the record to a
 640     different schema, by stating, for each element, a mapping to a
 641     different tag path.
 642    </para>
 643
 644    <section id="grs-tagged-elements">
 645     <title>Tagged Elements</title>
 646
 647     <para>
 648      A data element is characterized by its tag, and its position in the
 649      structure of the record. For instance, while the tag "telephone
  650      number" may be used in different places in a record, we may need to
 651      distinguish between these occurrences, both for searching and
 652      presentation purposes. For instance, while the phone numbers for the
 653      "customer" and the "service provider" are both
 654      representatives for the same type of resource (a telephone number), it
 655      is essential that they be kept separate. The record schema provides
 656      the structure of the record, and names each data element (defined by
 657      the sequence of tags - the tag path - by which the element can be
 658      reached from the root of the record).
 659     </para>
 660
 661    </section>
 662
 663    <section id="grs-variant-details">
 664     <title>Variants</title>
 665
 666     <para>
 667      The children of a tag node may be either more tag nodes, a data node
 668      (possibly accompanied by tag nodes),
  669      or a tree of variant nodes. The children of variant nodes are either
 670      more variant nodes or a data node (possibly accompanied by more
 671      variant nodes). Each leaf node, which is normally a
 672      data node, corresponds to a <emphasis>variant form</emphasis> of the
 673      tagged element identified by the tag which parents the variant tree.
 674      The following title element occurs in two different languages:
 675     </para>
 676
 677     <para>
 678
 679      <screen>
  680       TITLE
  681          VARIANT LANG=ENG  "War and Peace"
  682          VARIANT LANG=DAN  "Krig og Fred"
 683      </screen>
 684
 685     </para>
 686
 687     <para>
  688      Which of the two elements is transmitted to the client by the server
 689      depends on the specifications provided by the client, if any.
 690     </para>
 691
 692     <para>
 693      In practice, each variant node is associated with a triple of class,
 694      type, value, corresponding to the variant mechanism of &acro.z3950;.
 695     </para>
 696
 697    </section>
 698
 699    <section id="grs-data-elements">
 700     <title>Data Elements</title>
 701
 702     <para>
 703      Data nodes have no children (they are always leaf nodes in the record
 704      tree).
 705     </para>
 706
 707     <!--
 708     FIXME! Documentation needs extension here about types of nodes - numerical,
 709     textual, etc., plus the various types of inclusion notes.
 710    </para>
 711     -->
 712
 713    </section>
 714
 715   </section>
 716
 717   <section id="grs-conf">
 718    <title>&acro.grs1; Record Model Configuration</title>
 719
 720    <para>
 721     The following sections describe the configuration files that govern
 722     the internal management of <literal>grs</literal> records.
 723     The system searches for the files
 724     in the directories specified by the <emphasis>profilePath</emphasis>
 725     setting in the <literal>zebra.cfg</literal> file.
 726    </para>
 727
 728    <section id="grs-abstract-syntax">
 729     <title>The Abstract Syntax</title>
 730
 731     <para>
 732      The abstract syntax definition (also known as an Abstract Record
 733      Structure, or ARS) is the focal point of the
 734      record schema description. For a given schema, the ABS file may state any
 735      or all of the following:
 736     </para>
 737
 738     <!--
 739      FIXME - Need a diagram here, or a simple explanation how it all hangs together -H
 740     -->
 741
 742     <para>
 743
 744      <itemizedlist>
 745       <listitem>
 746
 747        <para>
 748         The object identifier of the &acro.z3950; schema associated
 749         with the ARS, so that it can be referred to by the client.
 750        </para>
 751       </listitem>
 752
 753       <listitem>
 754        <para>
 755         The attribute set (which can possibly be a compound of multiple
 756         sets) which applies in the profile. This is used when indexing and
 757         searching the records belonging to the given profile.
 758        </para>
 759       </listitem>
 760
 761       <listitem>
 762        <para>
 763         The tag set (again, this can consist of several different sets).
 764         This is used when reading the records from a file, to recognize the
 765         different tags, and when transmitting the record to the client -
 766         mapping the tags to their numerical representation, if they are
 767         known.
 768        </para>
 769       </listitem>
 770
 771       <listitem>
 772        <para>
 773         The variant set which is used in the profile. This provides a
 774         vocabulary for specifying the <emphasis>forms</emphasis> of
 775         data that appear inside the records.
 776        </para>
 777       </listitem>
 778
 779       <listitem>
 780        <para>
 781         Element set names, which are a shorthand way for the client to
 782         ask for a subset of the data elements contained in a record. Element
 783         set names, in the retrieval module, are mapped to <emphasis>element
 784          specifications</emphasis>, which contain information equivalent to the
 785         <emphasis>Espec-1</emphasis> syntax of &acro.z3950;.
 786        </para>
 787       </listitem>
 788
 789       <listitem>
 790        <para>
 791         Map tables, which may specify mappings to
 792         <emphasis>other</emphasis> database profiles, if desired.
 793        </para>
 794       </listitem>
 795
 796       <listitem>
 797        <para>
 798         Possibly, a set of rules describing the mapping of elements to a
 799         &acro.marc; representation.
 800
 801        </para>
 802       </listitem>
 803
 804       <listitem>
 805        <para>
 806         A list of element descriptions (this is the actual ARS of the
 807         schema, in &acro.z3950; terms), which lists the ways in which the various
 808         tags can be used and organized hierarchically.
 809        </para>
 810       </listitem>
 811
 812      </itemizedlist>
 813
 814     </para>
 815
 816     <para>
 817      Several of the entries above simply refer to other files, which
 818      describe the given objects.
 819     </para>
 820
 821    </section>
 822
 823    <section id="grs-configuration-files">
 824     <title>The Configuration Files</title>
 825
 826     <para>
 827      This section describes the syntax and use of the various tables which
 828      are used by the retrieval module.
 829     </para>
 830
 831     <para>
 832      The number of different file types may appear daunting at first, but
 833      each type corresponds fairly clearly to a single aspect of the &acro.z3950;
 834      retrieval facilities. Further, the average database administrator,
 835      who is simply reusing an existing profile for which tables already
 836      exist, shouldn't have to worry too much about the contents of these tables.
 837     </para>
 838
 839     <para>
 840      Generally, the files are simple ASCII files, which can be maintained
    using any text editor. Blank lines, and lines beginning with a hash mark (#), are
    ignored. Any characters on a line that follow a hash mark (#) are also ignored.
 843      All other lines contain <emphasis>directives</emphasis>, which provide
 844      some setting or value to the system.
 845      Generally, settings are characterized by a single
 846      keyword, identifying the setting, followed by a number of parameters.
 847      Some settings are repeatable (r), while others may occur only once in a
 848      file. Some settings are optional (o), while others again are
 849      mandatory (m).
 850     </para>
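
   <para>
    As a minimal sketch of these conventions, the fragment below reuses
    two directives from the GILS abstract syntax excerpt shown later in
    this chapter. The comment text is illustrative only.
   </para>

   <para>
    <screen>
     # A line beginning with a hash mark is ignored, as are blank lines.

     name gils              # everything after the hash mark is ignored
     attset gils.att
    </screen>
   </para>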
 851
 852    </section>
 853
 854    <section id="abs-file">
 855     <title>The Abstract Syntax (.abs) Files</title>
 856
 857     <para>
 858      The name of this file type is slightly misleading in &acro.z3950; terms,
 859      since, apart from the actual abstract syntax of the profile, it also
 860      includes most of the other definitions that go into a database
 861      profile.
 862     </para>
 863
 864     <para>
 865      When a record in the canonical, &acro.sgml;-like format is read from a file
 866      or from the database, the first tag of the file should reference the
 867      profile that governs the layout of the record. If the first tag of the
 868      record is, say, <literal>&lt;gils&gt;</literal>, the system will look
 869      for the profile definition in the file <literal>gils.abs</literal>.
 870      Profile definitions are cached, so they only have to be read once
 871      during the lifespan of the current process.
 872     </para>
 873
 874     <para>
 875      When writing your own input filters, the
 876      <emphasis>record-begin</emphasis> command
 877      introduces the profile, and should always be called first thing when
 878      introducing a new record.
 879     </para>
 880
 881     <para>
 882      The file may contain the following directives:
 883     </para>
 884
 885     <para>
 886      <variablelist>
 887
 888       <varlistentry>
 889        <term>name <replaceable>symbolic-name</replaceable></term>
 890        <listitem>
 891         <para>
 892          (m) This provides a shorthand name or
 893          description for the profile. Mostly useful for diagnostic purposes.
 894         </para>
 895        </listitem>
 896       </varlistentry>
 897       <varlistentry>
 898        <term>reference <replaceable>OID-name</replaceable></term>
 899        <listitem>
 900         <para>
 901          (m) The reference name of the OID for the profile.
 902          The reference names can be found in the <emphasis>util</emphasis>
 903          module of &yaz;.
 904         </para>
 905        </listitem>
 906       </varlistentry>
 907       <varlistentry>
 908        <term>attset <replaceable>filename</replaceable></term>
 909        <listitem>
 910         <para>
 911          (m) The attribute set that is used for
 912          indexing and searching records belonging to this profile.
 913         </para>
 914        </listitem>
 915       </varlistentry>
 916       <varlistentry>
 917        <term>tagset <replaceable>filename</replaceable></term>
 918        <listitem>
 919         <para>
        (o) The tag set (if any) that describes
        the fields of the records.
 922         </para>
 923        </listitem>
 924       </varlistentry>
 925       <varlistentry>
 926        <term>varset <replaceable>filename</replaceable></term>
 927        <listitem>
 928         <para>
 929          (o) The variant set used in the profile.
 930         </para>
 931        </listitem>
 932       </varlistentry>
 933       <varlistentry>
 934        <term>maptab <replaceable>filename</replaceable></term>
 935        <listitem>
 936         <para>
 937          (o,r) This points to a
 938          conversion table that might be used if the client asks for the record
 939          in a different schema from the native one.
 940         </para>
 941        </listitem>
 942       </varlistentry>
 943       <varlistentry>
 944        <term>marc <replaceable>filename</replaceable></term>
 945        <listitem>
 946         <para>
 947          (o) Points to a file containing parameters
 948          for representing the record contents in the ISO2709 syntax.
 949          Read the description of the &acro.marc; representation facility below.
 950         </para>
 951        </listitem>
 952       </varlistentry>
 953       <varlistentry>
 954        <term>esetname <replaceable>name filename</replaceable></term>
 955        <listitem>
 956         <para>
 957          (o,r) Associates the
        given element set name with an element selection file. If an at sign (@) is
 959          given in place of the filename, this corresponds to a null mapping for
 960          the given element set name.
 961         </para>
 962        </listitem>
 963       </varlistentry>
 964       <varlistentry>
 965        <term>all <replaceable>tags</replaceable></term>
 966        <listitem>
 967         <para>
 968          (o) This directive specifies a list of attributes
 969          which should be appended to the attribute list given for each
 970          element. The effect is to make every single element in the abstract
 971          syntax searchable by way of the given attributes. This directive
 972          provides an efficient way of supporting free-text searching across all
 973          elements. However, it does increase the size of the index
 974          significantly. The attributes can be qualified with a structure, as in
 975          the <replaceable>elm</replaceable> directive below.
 976         </para>
 977        </listitem>
 978       </varlistentry>
 979       <varlistentry>
 980        <term>elm <replaceable>path name attributes</replaceable></term>
 981        <listitem>
 982         <para>
 983          (o,r) Adds an element to the abstract record syntax of the schema.
 984          The <replaceable>path</replaceable> follows the
 985          syntax which is suggested by the &acro.z3950; document - that is, a sequence
 986          of tags separated by slashes (&#x2f;). Each tag is given as a
        comma-separated pair of tag type and tag value, surrounded by parentheses.
 988          The <replaceable>name</replaceable> is the name of the element, and
 989          the <replaceable>attributes</replaceable>
 990          specifies which attributes to use when indexing the element in a
 991          comma-separated list.
 992          A <literal>!</literal> in place of the attribute name is equivalent
 993          to specifying an attribute name identical to the element name.
 994          A <literal>-</literal> in place of the attribute name
 995          specifies that no indexing is to take place for the given element.
 996          The attributes can be qualified with <replaceable>field
 997           types</replaceable> to specify which
 998          character set should govern the indexing procedure for that field.
 999          The same data element may be indexed into several different
1000          fields, using different character set definitions.
        See <xref linkend="fields-and-charsets"/>.
1002          The default field type is <literal>w</literal> for
1003          <emphasis>word</emphasis>.
1004         </para>
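       <para>
        As an illustration, the following lines are taken from the GILS
        excerpt shown later in this section. The comment lines are added
        here for explanation and are not part of that file.
       </para>
       <para>
        <screen>
         # "-": the element belongs to the schema but is not indexed
         elm (1,10)               rank                        -

         # An explicitly named attribute
         elm (2,6)                abstract               Abstract

         # "!": the attribute name is the same as the element name
         elm (4,51)               purpose                     !

         # A nested element: a path of three tags separated by slashes
         elm (4,70)/(4,90)/(2,7)  distributorName             !
        </screen>
       </para>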
1005        </listitem>
1006       </varlistentry>
1007
1008       <varlistentry>
1009        <term>xelm <replaceable>xpath attributes</replaceable></term>
1010        <listitem>
1011         <para>
1012          Specifies indexing for record nodes given by
1013          <replaceable>xpath</replaceable>. Unlike directive
1014          elm, this directive allows you to index attribute
1015          contents. The <replaceable>xpath</replaceable> uses
1016          a syntax similar to XPath. The <replaceable>attributes</replaceable>
        have the same syntax and meaning as for directive elm, except that the operator
1018          ! refers to the nodes selected by <replaceable>xpath</replaceable>.
1019          <!--
1020          xelm   /         !:w                 default index
1021          xelm   //        !:w                 additional index
1022          xelm   /gils/title/@att    myatt:w   index attribute @att in myatt
1023          xelm   title/@att          myatt:w   same meaning.
1024          -->
1025         </para>
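       <para>
        The following sketch is based on examples left in a comment in
        the source of this manual. They have not been verified here, and
        the attribute name <literal>myatt</literal> is a placeholder.
       </para>
       <para>
        <screen>
         # Default index
         xelm   /                   !:w
         # Additional index
         xelm   //                  !:w
         # Index the contents of attribute att of title in myatt
         xelm   /gils/title/@att    myatt:w
         # Same meaning, given without the leading path
         xelm   title/@att          myatt:w
        </screen>
       </para>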
1026        </listitem>
1027       </varlistentry>
1028       <varlistentry>
1029        <term>melm <replaceable>field$subfield attributes</replaceable></term>
1030        <listitem>
1031         <para>
1032          This directive is specifically for &acro.marc;-formatted records,
1033          ingested either in the form of &acro.marcxml; documents, or in the
1034          ISO2709/Z39.2 format using the grs.marcxml input filter. You can
1035          specify indexing rules for any subfield, or you can leave off the
1036          <replaceable>$subfield</replaceable> part and specify default rules
1037          for all subfields of the given field (note: default rules should come
1038          after any subfield-specific rules in the configuration file). The
1039          <replaceable>attributes</replaceable> have the same syntax and meaning
1040          as for the 'elm' directive above.
1041         </para>
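       <para>
        A minimal sketch of the ordering rule follows. The field numbers
        and the attribute names <literal>Title</literal>,
        <literal>Author</literal> and <literal>Any</literal> are
        assumptions chosen for illustration; they must be defined in the
        attribute set named by the <literal>attset</literal> directive.
       </para>
       <para>
        <screen>
         # Subfield-specific rules come first
         melm 245$a    Title
         melm 245$c    Author
         # Default rule for all other subfields of field 245 comes last
         melm 245      Any
        </screen>
       </para>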
1042        </listitem>
1043       </varlistentry>
1044       <varlistentry>
1045        <term>encoding <replaceable>encodingname</replaceable></term>
1046        <listitem>
1047         <para>
1048          This directive specifies character encoding for external records.
        For records such as &acro.xml; that specify their encoding within the
        file via a header, this directive is ignored.
        If this directive is not given and no encoding is set
        within external records, ISO-8859-1 encoding is assumed.
1053          </para>
1054        </listitem>
1055       </varlistentry>
1056       <varlistentry>
1057        <term>xpath <literal>enable</literal>/<literal>disable</literal></term>
1058        <listitem>
1059         <para>
1060          If this directive is followed by <literal>enable</literal>,
1061          then extra indexing is performed to allow for XPath-like queries.
1062          If this directive is not specified - equivalent to
1063          <literal>disable</literal> - no extra XPath-indexing is performed.
1064         </para>
1065        </listitem>
1066       </varlistentry>
1067
1068       <!-- Adam's version
1069       <varlistentry>
1070        <term>systag <replaceable>systemtag</replaceable> <replaceable>element</replaceable></term>
1071        <listitem>
1072         <para>
1073          This directive maps system information to an element during
1074          retrieval. This information is dynamically created. The
1075          following system tags are defined
1076          <variablelist>
1077           <varlistentry>
1078            <term>size</term>
1079            <listitem>
1080             <para>
1081              Size of record in bytes. By default this
1082              is mapped to element <literal>size</literal>.
1083             </para>
1084            </listitem>
1085           </varlistentry>
1086
1087           <varlistentry>
1088            <term>rank</term>
1089            <listitem>
1090             <para>
1091              Score/rank of record. By default this
1092              is mapped to element <literal>rank</literal>.
1093              If no score was calculated for the record (non-ranked
1094              searched) search this directive is ignored.
1095             </para>
1096            </listitem>
1097           </varlistentry>
1098
1099           <varlistentry>
1100            <term>sysno</term>
1101            <listitem>
1102             <para>
1103              &zebra;'s system number (record ID) for the
1104              record. By default this is mapped to element
1105              <literal>localControlNumber</literal>.
1106             </para>
1107            </listitem>
1108           </varlistentry>
1109          </variablelist>
1110          If you do not want a particular system tag to be applied,
1111          then set the resulting element to something undefined in the
1112          abs file (such as <literal>none</literal>).
1113         </para>
1114        </listitem>
1115       </varlistentry>
1116       -->
1117
1118       <!-- Mike's version -->
1119       <varlistentry>
1120        <term>
1121         systag
1122         <replaceable>systemTag</replaceable>
1123         <replaceable>actualTag</replaceable>
1124        </term>
1125        <listitem>
1126         <para>
1127          Specifies what information, if any, &zebra; should
1128          automatically include in retrieval records for the
        <quote>system fields</quote> that it supports.
1130          <replaceable>systemTag</replaceable> may
1131          be any of the following:
1132          <variablelist>
1133           <varlistentry>
1134            <term><literal>rank</literal></term>
1135            <listitem><para>
1136             An integer indicating the relevance-ranking score
1137             assigned to the record.
1138            </para></listitem>
1139           </varlistentry>
1140           <varlistentry>
1141            <term><literal>sysno</literal></term>
1142            <listitem><para>
1143             An automatically generated identifier for the record,
1144             unique within this database.  It is represented by the
1145             <literal>&lt;localControlNumber&gt;</literal> element in
1146             &acro.xml; and the <literal>(1,14)</literal> tag in &acro.grs1;.
1147            </para></listitem>
1148           </varlistentry>
1149           <varlistentry>
1150            <term><literal>size</literal></term>
1151            <listitem><para>
1152             The size, in bytes, of the retrieved record.
1153            </para></listitem>
1154           </varlistentry>
1155          </variablelist>
1156         </para>
1157         <para>
1158          The <replaceable>actualTag</replaceable> parameter may be
1159          <literal>none</literal> to indicate that the named element
1160          should be omitted from retrieval records.
1161         </para>
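       <para>
        As a hedged sketch, the lines below suppress the record size and
        map the system number to the
        <literal>localControlNumber</literal> element. The choice of
        target element is an assumption based on the description of
        <literal>sysno</literal> above.
       </para>
       <para>
        <screen>
         # Leave the record size out of retrieval records
         systag size   none
         # Return the system number as localControlNumber
         systag sysno  localControlNumber
        </screen>
       </para>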
1162        </listitem>
1163       </varlistentry>
1164      </variablelist>
1165     </para>
1166
1167     <note>
1168      <para>
1169       The mechanism for controlling indexing is not adequate for
1170       complex databases, and will probably be moved into a separate
1171       configuration table eventually.
1172      </para>
1173     </note>
1174
1175     <para>
1176      The following is an excerpt from the abstract syntax file for the GILS
1177      profile.
1178     </para>
1179
1180     <para>
1181
1182      <screen>
1183       name gils
1184       reference GILS-schema
1185       attset gils.att
1186       tagset gils.tag
1187       varset var1.var
1188
1189       maptab gils-usmarc.map
1190
1191       # Element set names
1192
1193       esetname VARIANT gils-variant.est  # for WAIS-compliance
1194       esetname B gils-b.est
1195       esetname G gils-g.est
1196       esetname F @
1197
1198       elm (1,10)               rank                        -
1199       elm (1,12)               url                         -
1200       elm (1,14)               localControlNumber     Local-number
1201       elm (1,16)               dateOfLastModification Date/time-last-modified
1202       elm (2,1)                title                       w:!,p:!
1203       elm (4,1)                controlIdentifier      Identifier-standard
1204       elm (2,6)                abstract               Abstract
1205       elm (4,51)               purpose                     !
1206       elm (4,52)               originator                  -
1207       elm (4,53)               accessConstraints           !
1208       elm (4,54)               useConstraints              !
1209       elm (4,70)               availability                -
1210       elm (4,70)/(4,90)        distributor                 -
1211       elm (4,70)/(4,90)/(2,7)  distributorName             !
1212       elm (4,70)/(4,90)/(2,10) distributorOrganization     !
1213       elm (4,70)/(4,90)/(4,2)  distributorStreetAddress    !
1214       elm (4,70)/(4,90)/(4,3)  distributorCity             !
1215      </screen>
1216
1217     </para>
1218
1219    </section>
1220
1221    <section id="attset-files">
1222     <title>The Attribute Set (.att) Files</title>
1223
1224     <para>
1225      This file type describes the <replaceable>Use</replaceable> elements of
1226      an attribute set.
1227      It contains the following directives.
1228     </para>
1229
1230     <para>
1231      <variablelist>
1232       <varlistentry>
1233        <term>name <replaceable>symbolic-name</replaceable></term>
1234        <listitem>
1235         <para>
1236          (m) This provides a shorthand name or
1237          description for the attribute set.
1238          Mostly useful for diagnostic purposes.
1239         </para>
1240        </listitem></varlistentry>
1241       <varlistentry>
1242        <term>reference <replaceable>OID-name</replaceable></term>
1243        <listitem>
1244         <para>
1245          (m) The reference name of the OID for
1246          the attribute set.
1247          The reference names can be found in the <replaceable>util</replaceable>
1248          module of <replaceable>&yaz;</replaceable>.
1249         </para>
1250        </listitem></varlistentry>
1251       <varlistentry>
1252        <term>include <replaceable>filename</replaceable></term>
1253        <listitem>
1254         <para>
1255          (o,r) This directive is used to
1256          include another attribute set as a part of the current one. This is
1257          used when a new attribute set is defined as an extension to another
1258          set. For instance, many new attribute sets are defined as extensions
1259          to the <replaceable>bib-1</replaceable> set.
1260          This is an important feature of the retrieval
1261          system of &acro.z3950;, as it ensures the highest possible level of
1262          interoperability, as those access points of your database which are
1263          derived from the external set (say, bib-1) can be used even by clients
1264          who are unaware of the new set.
1265         </para>
1266        </listitem></varlistentry>
1267       <varlistentry>
1268        <term>att
1269         <replaceable>att-value att-name [local-value]</replaceable></term>
1270        <listitem>
1271         <para>
1272          (o,r) This
1273          repeatable directive introduces a new attribute to the set. The
1274          attribute value is stored in the index (unless a
1275          <replaceable>local-value</replaceable> is
1276          given, in which case this is stored). The name is used to refer to the
1277          attribute from the <replaceable>abstract syntax</replaceable>.
1278         </para>
1279        </listitem></varlistentry>
1280      </variablelist>
1281     </para>
1282
1283     <para>
1284      This is an excerpt from the GILS attribute set definition.
1285      Notice how the file describing the <emphasis>bib-1</emphasis>
1286      attribute set is referenced.
1287     </para>
1288
1289     <para>
1290
1291      <screen>
1292       name gils
1293       reference GILS-attset
1294       include bib1.att
1295
1296       att 2001          distributorName
1297       att 2002          indextermsControlled
1298       att 2003          purpose
1299       att 2004          accessConstraints
1300       att 2005          useConstraints
1301      </screen>
1302
1303     </para>
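
   <para>
    The optional <replaceable>local-value</replaceable> can be sketched
    as follows. The attribute numbers and names are hypothetical, and
    the <literal>name</literal> and <literal>reference</literal>
    directives are left out because a valid reference name must come
    from &yaz;. According to the description above, the value 3001
    rather than 3002 would be stored in the index for
    <literal>donorName</literal>.
   </para>

   <para>
    <screen>
     include bib1.att

     # Stored in the index under its own value, 3001
     att 3001          shelfMark
     # Stored in the index under the local value 3001 instead of 3002
     att 3002          donorName          3001
    </screen>
   </para>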
1304
1305    </section>
1306
1307    <section id="grs-tag-files">
1308     <title>The Tag Set (.tag) Files</title>
1309
1310     <para>
1311      This file type defines the tagset of the profile, possibly by
1312      referencing other tag sets (most tag sets, for instance, will include
    tagsetG and tagsetM from the &acro.z3950; specification). The file may
1314      contain the following directives.
1315     </para>
1316
1317     <para>
1318      <variablelist>
1319
1320       <varlistentry>
1321        <term>name <emphasis>symbolic-name</emphasis></term>
1322        <listitem>
1323         <para>
1324          (m) This provides a shorthand name or
1325          description for the tag set. Mostly useful for diagnostic purposes.
1326         </para>
1327        </listitem></varlistentry>
1328       <varlistentry>
1329        <term>reference <emphasis>OID-name</emphasis></term>
1330        <listitem>
1331         <para>
1332          (o) The reference name of the OID for the tag set.
1333          The reference names can be found in the <emphasis>util</emphasis>
1334          module of <emphasis>&yaz;</emphasis>.
1335          The directive is optional, since not all tag sets
1336          are registered outside of their schema.
1337         </para>
1338        </listitem></varlistentry>
1339       <varlistentry>
1340        <term>type <emphasis>integer</emphasis></term>
1341        <listitem>
1342         <para>
1343          (m) The type number of the tagset within the schema
1344          profile (note: this specification really should belong to the .abs
1345          file. This will be fixed in a future release).
1346         </para>
1347        </listitem></varlistentry>
1348       <varlistentry>
1349        <term>include <emphasis>filename</emphasis></term>
1350        <listitem>
1351         <para>
1352          (o,r) This directive is used
1353          to include the definitions of other tag sets into the current one.
1354         </para>
1355        </listitem></varlistentry>
1356       <varlistentry>
1357        <term>tag <emphasis>number names type</emphasis></term>
1358        <listitem>
1359         <para>
1360          (o,r) Introduces a new tag to the set.
1361          The <emphasis>number</emphasis> is the tag number as used
1362          in the protocol (there is currently no mechanism for
        specifying string tags, but this would be quick
1364          work to add).
1365          The <emphasis>names</emphasis> parameter is a list of names
1366          by which the tag should be recognized in the input file format.
1367          The names should be separated by slashes (/).
1368          The <emphasis>type</emphasis> is the recommended data type of
1369          the tag.
1370          It should be one of the following:
1371
1372          <itemizedlist>
1373           <listitem>
1374            <para>
1375             structured
1376            </para>
1377           </listitem>
1378
1379           <listitem>
1380            <para>
1381             string
1382            </para>
1383           </listitem>
1384
1385           <listitem>
1386            <para>
1387             numeric
1388            </para>
1389           </listitem>
1390
1391           <listitem>
1392            <para>
1393             bool
1394            </para>
1395           </listitem>
1396
1397           <listitem>
1398            <para>
1399             oid
1400            </para>
1401           </listitem>
1402
1403           <listitem>
1404            <para>
1405             generalizedtime
1406            </para>
1407           </listitem>
1408
1409           <listitem>
1410            <para>
1411             intunit
1412            </para>
1413           </listitem>
1414
1415           <listitem>
1416            <para>
1417             int
1418            </para>
1419           </listitem>
1420
1421           <listitem>
1422            <para>
1423             octetstring
1424            </para>
1425           </listitem>
1426
1427           <listitem>
1428            <para>
1429             null
1430            </para>
1431           </listitem>
1432
1433          </itemizedlist>
1434
1435         </para>
1436        </listitem></varlistentry>
1437      </variablelist>
1438     </para>
1439
1440     <para>
1441      The following is an excerpt from the TagsetG definition file.
1442     </para>
1443
1444     <para>
1445      <screen>
1446       name tagsetg
1447       reference TagsetG
1448       type 2
1449
1450       tag       1       title           string
1451       tag       2       author          string
1452       tag       3       publicationPlace string
1453       tag       4       publicationDate string
1454       tag       5       documentId      string
1455       tag       6       abstract        string
1456       tag       7       name            string
1457       tag       8       date            generalizedtime
1458       tag       9       bodyOfDisplay   string
1459       tag       10      organization    string
1460      </screen>
1461     </para>
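
   <para>
    To illustrate the slash-separated <emphasis>names</emphasis>
    parameter, the sketch below gives two of the tags above a second,
    shorter name. The short names <literal>ti</literal> and
    <literal>au</literal> are hypothetical and do not appear in the
    TagsetG file. Under the rule described above, either name would be
    recognized in the input.
   </para>

   <para>
    <screen>
     tag       1       title/ti        string
     tag       2       author/au       string
    </screen>
   </para>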
1462
1463    </section>
1464
1465    <section id="grs-var-files">
1466     <title>The Variant Set (.var) Files</title>
1467
1468     <para>
1469      The variant set file is a straightforward representation of the
1470      variant set definitions associated with the protocol. At present, only
1471      the <emphasis>Variant-1</emphasis> set is known.
1472     </para>
1473
1474     <para>
1475      These are the directives allowed in the file.
1476     </para>
1477
1478     <para>
1479      <variablelist>
1480
1481       <varlistentry>
1482        <term>name <emphasis>symbolic-name</emphasis></term>
1483        <listitem>
1484         <para>
1485          (m) This provides a shorthand name or
1486          description for the variant set. Mostly useful for diagnostic purposes.
1487         </para>
1488        </listitem></varlistentry>
1489       <varlistentry>
1490        <term>reference <emphasis>OID-name</emphasis></term>
1491        <listitem>
1492         <para>
1493          (o) The reference name of the OID for
1494          the variant set, if one is required. The reference names can be found
1495          in the <emphasis>util</emphasis> module of <emphasis>&yaz;</emphasis>.
1496         </para>
1497        </listitem></varlistentry>
1498       <varlistentry>
1499        <term>class <emphasis>integer class-name</emphasis></term>
1500        <listitem>
1501         <para>
1502          (m,r) Introduces a new
1503          class to the variant set.
1504         </para>
1505        </listitem></varlistentry>
1506       <varlistentry>
1507        <term>type <emphasis>integer type-name datatype</emphasis></term>
1508        <listitem>
1509         <para>
        (m,r) Adds a
1511          new type to the current class (the one introduced by the most recent
1512          <emphasis>class</emphasis> directive).
1513          The type names belong to the same name space as the one used
1514          in the tag set definition file.
1515         </para>
1516        </listitem></varlistentry>
1517      </variablelist>
1518     </para>
1519
1520     <para>
1521      The following is an excerpt from the file describing the variant set
1522      <emphasis>Variant-1</emphasis>.
1523     </para>
1524
1525     <para>
1526
1527      <screen>
1528       name variant-1
1529       reference Variant-1
1530
1531       class 1 variantId
1532
1533       type      1       variantId               octetstring
1534
1535       class 2 body
1536
1537       type      1       iana                    string
1538       type      2       z39.50                  string
1539       type      3       other                   string
1540      </screen>
1541
1542     </para>
1543
1544    </section>
1545
1546    <section id="grs-est-files">
1547     <title>The Element Set (.est) Files</title>
1548
1549     <para>
1550      The element set specification files describe a selection of a subset
1551      of the elements of a database record. The element selection mechanism
1552      is equivalent to the one supplied by the <emphasis>Espec-1</emphasis>
1553      syntax of the &acro.z3950; specification.
1554      In fact, the internal representation of an element set
1555      specification is identical to the <emphasis>Espec-1</emphasis> structure,
1556      and we'll refer you to the description of that structure for most of
1557      the detailed semantics of the directives below.
1558     </para>
1559
1560     <note>
1561      <para>
1562       Not all of the Espec-1 functionality has been implemented yet.
     The fields that are mentioned below all work as expected, unless
     otherwise noted.
1565      </para>
1566     </note>
1567
1568     <para>
1569      The directives available in the element set file are as follows:
1570     </para>
1571
1572     <para>
1573      <variablelist>
1574       <varlistentry>
1575        <term>defaultVariantSetId <emphasis>OID-name</emphasis></term>
1576        <listitem>
1577         <para>
1578          (o) If variants are used in
1579          the following, this should provide the name of the variantset used
1580          (it's not currently possible to specify a different set in the
1581          individual variant request). In almost all cases (certainly all
1582          profiles known to us), the name
1583          <literal>Variant-1</literal> should be given here.
1584         </para>
1585        </listitem></varlistentry>
1586       <varlistentry>
1587        <term>defaultVariantRequest <emphasis>variant-request</emphasis></term>
1588        <listitem>
1589         <para>
1590          (o) This directive
1591          provides a default variant request for
1592          use when the individual element requests (see below) do not contain a
1593          variant request. Variant requests consist of a blank-separated list of
        variant components. A variant component is a comma-separated,
1595          parenthesized triple of variant class, type, and value (the two former
1596          values being represented as integers). The value can currently only be
1597          entered as a string (this will change to depend on the definition of
1598          the variant in question). The special value (@) is interpreted as a
1599          null value, however.
1600         </para>
1601        </listitem></varlistentry>
1602       <varlistentry>
1603        <term>simpleElement
1604         <emphasis>path ['variant' variant-request]</emphasis></term>
1605        <listitem>
1606         <para>
1607          (o,r) This corresponds to a simple element request
1608          in <emphasis>Espec-1</emphasis>.
1609          The path consists of a sequence of tag-selectors, where each of
1610          these can consist of either:
1611         </para>
1612
1613         <para>
1614          <itemizedlist>
1615           <listitem>
1616            <para>
1617             A simple tag, consisting of a comma-separated type-value pair in
           parentheses, possibly followed by a colon (:) followed by an
1619             occurrences-specification (see below). The tag-value can be a number
1620             or a string. If the first character is an apostrophe ('), this
1621             forces the value to be interpreted as a string, even if it
1622             appears to be numerical.
1623            </para>
1624           </listitem>
1625
1626           <listitem>
1627            <para>
1628             A WildThing, represented as a question mark (?), possibly
1629             followed by a colon (:) followed by an occurrences
1630             specification (see below).
1631            </para>
1632           </listitem>
1633
1634           <listitem>
1635            <para>
           A WildPath, represented as an asterisk (*). Note that the last
           element of the path should not be a WildPath (WildPaths do not
           work in this version).
1639            </para>
1640           </listitem>
1641
1642          </itemizedlist>
1643
1644         </para>
1645
1646         <para>
1647          The occurrences-specification can be either the string
1648          <literal>all</literal>, the string <literal>last</literal>, or
1649          an explicit value-range. The value-range is represented as
1650          an integer (the starting point), possibly followed by a
1651          plus (+) and a second integer (the number of elements, default
1652          being one).
1653         </para>
1654
1655         <para>
1656          The variant-request has the same syntax as the defaultVariantRequest
1657          above. Note that it may sometimes be useful to give an empty variant
1658          request, simply to disable the default for a specific set of fields
1659          (we aren't certain if this is proper <emphasis>Espec-1</emphasis>,
1660          but it works in this implementation).
1661         </para>
1662        </listitem></varlistentry>
1663      </variablelist>
1664     </para>
1665
1666     <para>
1667      The following is an example of an element specification belonging to
1668      the GILS profile.
1669     </para>
1670
1671     <para>
1672
1673      <screen>
1674       simpleelement (1,10)
1675       simpleelement (1,12)
1676       simpleelement (2,1)
1677       simpleelement (1,14)
1678       simpleelement (4,1)
1679       simpleelement (4,52)
1680      </screen>
1681
1682     </para>
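
   <para>
    As a further, purely illustrative sketch of the path syntax described
    above (the tag numbers are placeholders and are not taken from any
    shipped profile), the following requests select all occurrences of one
    element, the last occurrence of another, two occurrences of a third
    beginning at starting point 1, and an element whose tag value is forced
    to be read as a string by the leading apostrophe:
   </para>

   <screen>
    simpleelement (2,1):all
    simpleelement (1,14):last
    simpleelement (4,52):1+2
    simpleelement (3,'245)
   </screen>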
1683
1684    </section>
1685
1686    <section id="schema-mapping">
1687     <title>The Schema Mapping (.map) Files</title>
1688
1689     <para>
1690      Sometimes, the client might want to receive a database record in
1691      a schema that differs from the native schema of the record. For
1692      instance, a client might only know how to process WAIS records, while
1693      the database record is represented in a more specific schema, such as
1694      GILS. In this module, a mapping of data to one of the &acro.marc; formats is
1695      also thought of as a schema mapping (mapping the elements of the
1696      record into fields consistent with the given &acro.marc; specification, prior
    to actually converting the data to ISO2709). This use of the
1698      object identifier for &acro.usmarc; as a schema identifier represents an
1699      overloading of the OID which might not be entirely proper. However,
1700      it represents the dual role of schema and record syntax which
1701      is assumed by the &acro.marc; family in &acro.z3950;.
1702     </para>
1703
1704     <!--
1705      <emphasis>NOTE: FIXME! The schema-mapping functions are so far limited to a
1706       straightforward mapping of elements. This should be extended with
1707       mechanisms for conversions of the element contents, and conditional
1708       mappings of elements based on the record contents.</emphasis>
1709     -->
1710
1711     <para>
1712      These are the directives of the schema mapping file format:
1713     </para>
1714
1715     <para>
1716      <variablelist>
1717
1718       <varlistentry>
1719        <term>targetName <emphasis>name</emphasis></term>
1720        <listitem>
1721         <para>
1722          (m) A symbolic name for the target schema
1723          of the table. Useful mostly for diagnostic purposes.
1724         </para>
1725        </listitem></varlistentry>
1726       <varlistentry>
1727        <term>targetRef <emphasis>OID-name</emphasis></term>
1728        <listitem>
1729         <para>
1730          (m) An OID name for the target schema.
1731          This is used, for instance, by a server receiving a request to present
1732          a record in a different schema from the native one.
1733          The name, again, is found in the <emphasis>oid</emphasis>
1734          module of <emphasis>&yaz;</emphasis>.
1735         </para>
1736        </listitem></varlistentry>
1737       <varlistentry>
1738        <term>map <emphasis>element-name target-path</emphasis></term>
1739        <listitem>
1740         <para>
1741          (o,r) Adds
1742          an element mapping rule to the table.
1743         </para>
1744        </listitem></varlistentry>
1745      </variablelist>
1746     </para>
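
   <para>
    A minimal mapping file built from these directives might look like the
    following sketch. The target name and OID name shown
    (<literal>usmarc</literal>, <literal>USmarc</literal>) are assumptions
    chosen for illustration, and the <literal>map</literal> line uses
    placeholders rather than real element names and paths.
   </para>

   <screen>
    targetName usmarc
    targetRef USmarc
    map <replaceable>element-name</replaceable> <replaceable>target-path</replaceable>
   </screen>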
1747
1748    </section>
1749
1750    <section id="grs-mar-files">
1751     <title>The &acro.marc; (ISO2709) Representation (.mar) Files</title>
1752
1753     <para>
1754      This file provides rules for representing a record in the ISO2709
1755      format. The rules pertain mostly to the values of the constant-length
1756      header of the record.
1757     </para>
1758
1759     <!--
1760      NOTE: FIXME! This will be described better. We're in the process of
1761       re-evaluating and most likely changing the way that &acro.marc; records are
1762       handled by the system.</emphasis>
1763     -->
1764
1765    </section>
1766   </section>
1767
1768   <section id="grs-exchange-formats">
1769    <title>&acro.grs1; Exchange Formats</title>
1770
1771    <para>
1772     Converting records from the internal structure to an exchange format
1773     is largely an automatic process. Currently, the following exchange
1774     formats are supported:
1775    </para>
1776
1777    <para>
1778     <itemizedlist>
1779      <listitem>
1780       <para>
1781        &acro.grs1;. The internal representation is based on &acro.grs1;/&acro.xml;, so the
1782        conversion here is straightforward. The system will create
1783        applied variant and supported variant lists as required, if a record
1784        contains variant information.
1785       </para>
1786      </listitem>
1787
1788      <listitem>
1789       <para>
      &acro.xml;. The internal representation is based on &acro.grs1;/&acro.xml;, so
      the mapping is trivial. Note that &acro.xml; schemas, processing
      instructions, and comments are not part of the internal representation
      and therefore will never be part of a generated &acro.xml; record.
      Future versions of &zebra; will support them.
1795       </para>
1796      </listitem>
1797
1798      <listitem>
1799       <para>
      &acro.sutrs; (Simple Unstructured Text Record Syntax). Again, the
      mapping is fairly straightforward. Indentation
1801        is used to show the hierarchical structure of the record. All
1802        "&acro.grs1;" type records support both the &acro.grs1; and &acro.sutrs;
1803        representations.
1804        <!-- FIXME - What is &acro.sutrs; - should be expanded here -->
1805       </para>
1806      </listitem>
1807
1808      <listitem>
1809       <para>
1810        ISO2709-based formats (&acro.usmarc;, etc.). Only records with a
1811        two-level structure (corresponding to fields and subfields) can be
1812        directly mapped to ISO2709. For records with a different structuring
      (e.g., GILS), the representation in a structure like &acro.usmarc; involves a
1814        schema-mapping (see <xref linkend="schema-mapping"/>), to an
1815        "implied" &acro.usmarc; schema (implied,
1816        because there is no formal schema which specifies the use of the
1817        &acro.usmarc; fields outside of ISO2709). The resultant, two-level record is
1818        then mapped directly from the internal representation to ISO2709. See
1819        the GILS schema definition files for a detailed example of this
1820        approach.
1821       </para>
1822      </listitem>
1823
1824      <listitem>
1825       <para>
1826        Explain. This representation is only available for records
1827        belonging to the Explain schema.
1828       </para>
1829      </listitem>
1830
1831      <listitem>
1832       <para>
      Summary. This ASN.1-based structure is only available for records
      belonging to the Summary schema, or to schemas that provide a mapping
      to this schema (see the description of the schema mapping facility
      above).
1837       </para>
1838      </listitem>
1839
1840      <!-- FIXME - Is this used anywhere ? -H -->
1841      <listitem>
1842       <para>
1843        SOIF. Support for this syntax is experimental, and is currently
1844        keyed to a private Index Data OID (1.2.840.10003.5.1000.81.2). All
1845        abstract syntaxes can be mapped to the SOIF format, although nested
1846        elements are represented by concatenation of the tag names at each
1847        level.
1848       </para>
1849      </listitem>
1850
1851     </itemizedlist>
1852    </para>
1853   </section>
1854
1855   <section id="grs-extended-marc-indexing">
1856    <title>Extended indexing of &acro.marc; records</title>
1857
  <para>Extended indexing of &acro.marc; records helps when you need to index a
   combination of subfields, index only part of a field,
   or use the embedded fields of a &acro.marc; record during indexing.
  </para>
1862
  <para>Extended indexing of &acro.marc; records additionally lets you:
   <itemizedlist>

    <listitem>
     <para>index data in the LEADER of a &acro.marc; record</para>
    </listitem>

    <listitem>
     <para>index data in fixed-length control fields</para>
    </listitem>

    <listitem>
     <para>use indicator values during indexing</para>
    </listitem>

    <listitem>
     <para>index linked fields for UNI&acro.marc;-based formats</para>
    </listitem>

   </itemizedlist>
  </para>
1884
  <note><para>Compared with simple indexing, extended indexing
    may increase the indexing time for &acro.marc; records by a factor of
    about 2-3.</para></note>
1888
1889    <section id="formula">
1890     <title>The index-formula</title>
1891
   <para>We first define the term
    <emphasis>index-formula</emphasis> for &acro.marc; records. This term helps
    to explain the notation &zebra; uses for extended indexing of &acro.marc; records.
    Our definition is based on the document
    <ulink url="http://www.rba.ru/rusmarc/soft/Z39-50.htm">"The table
     of conformity for &acro.z3950; use attributes and R&acro.usmarc; fields"</ulink>.
    The document is available only in Russian.</para>
1899
   <para>
    An <emphasis>index-formula</emphasis> is a combination of
    subfields written as follows:
   </para>
1904
1905     <screen>
1906      71-00$a, $g, $h ($c){.$b ($c)} , (1)
1907     </screen>
1908
   <para>
    &zebra; supports the &acro.bib1; right-truncation attribute.
    Consequently, <emphasis>index-formula</emphasis> (1) also covers the
    following shorter forms, each defined in the same way as (1):</para>
1913
1914     <screen>
1915      71-00$a, $g, $h
1916      71-00$a, $g
1917      71-00$a
1918     </screen>
1919
1920     <note>
    <para>The original &acro.marc; record may lack some of the elements included in the <emphasis>index-formula</emphasis>.
1922      </para>
1923     </note>
1924
   <para>This notation uses the following operands:
1926      <variablelist>
1927
1928       <varlistentry>
1929        <term>#</term>
      <listitem><para>Denotes a whitespace character.</para></listitem>
1931       </varlistentry>
1932
1933       <varlistentry>
1934        <term>-</term>
      <listitem><para>The position may contain any value defined by
        the &acro.marc; format.
        For example, the <emphasis>index-formula</emphasis></para>
1938
1939         <screen>
1940          70-#1$a, $g , (2)
1941         </screen>
1942
1943         <para>includes</para>
1944
1945         <screen>
1946          700#1$a, $g
1947          701#1$a, $g
1948          702#1$a, $g
1949         </screen>
1950
1951        </listitem>
1952       </varlistentry>
1953
1954       <varlistentry>
1955        <term>{...}</term>
1956        <listitem>
       <para>Repeatable elements are enclosed in curly braces {}.
        For example, the
        <emphasis>index-formula</emphasis></para>
1960
1961         <screen>
1962          71-00$a, $g, $h ($c){.$b ($c)} , (3)
1963         </screen>
1964
1965         <para>includes</para>
1966
1967         <screen>
1968          71-00$a, $g, $h ($c). $b ($c)
1969          71-00$a, $g, $h ($c). $b ($c). $b ($c)
1970          71-00$a, $g, $h ($c). $b ($c). $b ($c). $b ($c)
1971         </screen>
1972
1973        </listitem>
1974       </varlistentry>
1975      </variablelist>
1976
1977      <note>
1978       <para>
      All other operands are the same as those used in the &acro.marc; world.
1980       </para>
1981      </note>
1982     </para>
1983    </section>
1984
1985    <section id="notation">
1986     <title>Notation of <emphasis>index-formula</emphasis> for &zebra;</title>
1987
1988
   <para>Extended indexing overloads the <literal>path</literal> of an
    <literal>elm</literal> definition in the abstract syntax file of &zebra;
    (<literal>.abs</literal> file): names beginning with
    <literal>"mc-"</literal> are interpreted by &zebra; as an
    <emphasis>index-formula</emphasis>. The database index is created and
    linked with an <emphasis>access point</emphasis> (&acro.bib1; use attribute)
    according to this formula.</para>
1996
   <para>For example, the <emphasis>index-formula</emphasis></para>
1998
1999     <screen>
2000      71-00$a, $g, $h ($c){.$b ($c)} , (4)
2001     </screen>
2002
   <para>looks like this in the <literal>.abs</literal> file:</para>
2004
2005     <screen>
2006      mc-71.00_$a,_$g,_$h_(_$c_){.$b_(_$c_)}
2007     </screen>
2008
2009
   <para>The notation of the <emphasis>index-formula</emphasis> uses the following operands:
2011      <variablelist>
2012
2013       <varlistentry>
2014        <term>_</term>
      <listitem><para>Denotes a whitespace character.</para></listitem>
2016       </varlistentry>
2017
2018       <varlistentry>
2019        <term>.</term>
      <listitem><para>The position may contain any value defined by
        the &acro.marc; format. For example, the
        <emphasis>index-formula</emphasis></para>
2023
2024         <screen>
2025          70-#1$a, $g , (5)
2026         </screen>
2027
2028         <para>matches <literal>mc-70._1_$a,_$g_</literal> and includes</para>
2029
2030         <screen>
2031          700_1_$a,_$g_
2032          701_1_$a,_$g_
2033          702_1_$a,_$g_
2034         </screen>
2035        </listitem>
2036       </varlistentry>
2037
2038       <varlistentry>
2039        <term>{...}</term>
      <listitem><para>Repeatable elements are enclosed in
        curly braces {}. For example, the
        <emphasis>index-formula</emphasis></para>
2043
2044         <screen>
        71-00$a, $g, $h ($c){.$b ($c)} , (6)
2046         </screen>
2047
2048         <para>matches
2049          <literal>mc-71.00_$a,_$g,_$h_(_$c_){.$b_(_$c_)}</literal> and
2050          includes</para>
2051
2052         <screen>
2053          71.00_$a,_$g,_$h_(_$c_).$b_(_$c_)
2054          71.00_$a,_$g,_$h_(_$c_).$b_(_$c_).$b_(_$c_)
2055          71.00_$a,_$g,_$h_(_$c_).$b_(_$c_).$b_(_$c_).$b_(_$c_)
2056         </screen>
2057        </listitem>
2058       </varlistentry>
2059
2060       <varlistentry>
2061        <term>&#60;...&#62;</term>
      <listitem><para>An embedded <emphasis>index-formula</emphasis> (for
        linked fields) is enclosed in &#60;&#62;. For example, the
        <emphasis>index-formula</emphasis>
       </para>
2066
2067         <screen>
2068          4--#-$170-#1$a, $g ($c) , (7)
2069         </screen>
2070
2071         <para>matches
2072          <literal>mc-4.._._$1&#60;70._1_$a,_$g_(_$c_)&#62;_</literal> and
2073          includes</para>
2074
2075         <screen>
2076          463_._$1&#60;70._1_$a,_$g_(_$c_)&#62;_
2077         </screen>
2078
2079        </listitem>
2080       </varlistentry>
2081      </variablelist>
2082     </para>
2083
2084     <note>
    <para>All other operands are the same as those used in the &acro.marc; world.</para>
2086     </note>
2087
2088     <section id="grs-examples">
2089      <title>Examples</title>
2090
2091      <para>
2092       <orderedlist>
2093
2094        <listitem>
2095
       <para>Indexing the LEADER</para>

       <para>Use the keyword "ldr" to index the leader. For example,
        to index the data at the 6th and 7th positions of the LEADER:</para>
2100
2101         <screen>
2102          elm mc-ldr[6] Record-type !
2103          elm mc-ldr[7] Bib-level   !
2104         </screen>
2105
2106        </listitem>
2107
2108        <listitem>
2109
       <para>Indexing data from control fields</para>

       <para>To index the date (the time the record was added to the database):</para>
2113
2114         <screen>
2115          elm mc-008[0-5] Date/time-added-to-db !
2116         </screen>
2117
       <para>or, for R&acro.usmarc; (where this data is held in field 100):</para>
2119
2120         <screen>
2121          elm mc-100___$a[0-7]_ Date/time-added-to-db !
2122         </screen>
2123
2124        </listitem>
2125
2126        <listitem>
2127
       <para>Using indicators while indexing</para>

       <para>For R&acro.usmarc;, the <emphasis>index-formula</emphasis>
        <literal>70-#1$a, $g</literal> corresponds to</para>
2132
2133         <screen>
        elm mc-70._1_$a,_$g_ Author !:w,!:p
2135         </screen>
2136
       <para>When &zebra; finds a field that matches the
        <literal>"70."</literal> pattern, it checks the indicators. In this
        case the first indicator must be whitespace and the
        second must be 1; otherwise the field is not
        indexed.</para>
2142        </listitem>
2143
2144        <listitem>
2145
       <para>Indexing embedded (linked) fields for UNI&acro.marc;-based
        formats</para>

       <para>For R&acro.usmarc;, the <emphasis>index-formula</emphasis>
        <literal>4--#-$170-#1$a, $g ($c)</literal> corresponds to</para>
2151
2152         <screen><![CDATA[
2153          elm mc-4.._._$1<70._1_$a,_$g_(_$c_)>_ Author !:w,!:p
2154          ]]></screen>
2155
       <para>Data is extracted from a record if the field matches the
        <literal>"4.._."</literal> pattern and the data in the linked field
        matches the embedded
        <emphasis>index-formula</emphasis>
        <literal>70._1_$a,_$g_(_$c_)</literal>.</para>
2161
2162        </listitem>
2163
2164       </orderedlist>
2165      </para>
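
    <para>
     Formula (4) above, with its repeatable subfield group, could be attached
     to an access point in the same way. This is only a sketch: the access
     point name <literal>Author-name-corporate</literal> and the index types
     <literal>!:w,!:p</literal> are assumptions modelled on the examples
     above, not values taken from a shipped <literal>.abs</literal> file.
    </para>

    <screen>
     elm mc-71.00_$a,_$g,_$h_(_$c_){.$b_(_$c_)} Author-name-corporate !:w,!:p
    </screen>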
2166
2167
2168     </section>
2169    </section>
2170
2171   </section>
2172
2173  </chapter>
2174  <!-- Keep this comment at the end of the file
2175  Local variables:
2176  mode: sgml
2177  sgml-omittag:t
2178  sgml-shorttag:t
2179  sgml-minimize-attributes:nil
2180  sgml-always-quote-attributes:t
2181  sgml-indent-step:1
2182  sgml-indent-data:t
2183  sgml-parent-document: "zebra.xml"
2184  sgml-local-catalogs: nil
2185  sgml-namecase-general:t
2186  End:
2187  -->