doc/recordmodel-grs.xml

   1  <chapter id="grs">
   2   <!-- $Id: recordmodel-grs.xml,v 1.4 2006-09-03 21:37:27 adam Exp $ -->
   3   <title>GRS Record Model and Filter Modules</title>
   4
   5   <para>
   6    The record model described in this chapter applies to the fundamental,
   7    structured
   8    record type <literal>grs</literal>, introduced in
   9    <xref linkend="componentmodulesgrs"/>.
  10   </para>
  11
  12
  13   <section id="grs-filters">
  14    <title>GRS Record Filters</title>
  15    <para>
  16     Many basic subtypes of the <emphasis>grs</emphasis> type are
  17     currently available:
  18    </para>
  19
  20    <para>
  21     <variablelist>
  22      <varlistentry>
  23       <term><literal>grs.sgml</literal></term>
  24       <listitem>
  25        <para>
  26         This is the canonical input format
  27         described in <xref linkend="grs-canonical-format"/>. It uses a
  28         simple SGML-like syntax.
  29        </para>
  30       </listitem>
  31      </varlistentry>
  32      <varlistentry>
  33       <term><literal>grs.marc.</literal><replaceable>type</replaceable></term>
  34       <listitem>
  35        <para>
  36         This allows Zebra to read
  37         records in the ISO2709 (MARC) encoding standard.
  38         The last parameter <replaceable>type</replaceable> names the
  39         <literal>.abs</literal> file (see below)
  40         which describes the specific MARC structure of the input record as
  41         well as the indexing rules.
  42        </para>
  43        <para>The <literal>grs.marc</literal> filter uses an internal representation
  44         that is not XML conformant. In particular, MARC tags are
  45         presented as elements with the same name, and XML element names
  46         may not start with digits. Therefore this filter is only
  47         suitable for systems returning GRS-1 and MARC records. For XML,
  48         use the <literal>grs.marcxml</literal> filter instead (see below).
  49        </para>
  50        <para>
  51          The loadable <literal>grs.marc</literal> filter module
  52          is packaged in the GNU/Debian package
  53         <literal>libidzebra2.0-mod-grs-marc</literal>.
  54        </para>
  55       </listitem>
  56      </varlistentry>
  57      <varlistentry>
  58       <term><literal>grs.marcxml.</literal><replaceable>type</replaceable></term>
  59       <listitem>
  60        <para>
  61         This allows Zebra to read ISO2709 encoded records.
  62         The last parameter <replaceable>type</replaceable> names the
  63         <literal>.abs</literal> file (see below)
  64         which describes the specific MARC structure of the input record as
  65         well as the indexing rules.
  66        </para>
  67        <para>
  68         The internal representation for <literal>grs.marcxml</literal>
  69         is the same as for <ulink url="&url.marcxml;">MARCXML</ulink>.
  70         It is slightly more complicated to work with than
  71         <literal>grs.marc</literal>, but it is XML conformant.
  72        </para>
  73        <para>
  74         The loadable <literal>grs.marcxml</literal> filter module
  75         is also contained in the GNU/Debian package
  76         <literal>libidzebra2.0-mod-grs-marc</literal>.
  77        </para>
  78       </listitem>
  79      </varlistentry>
  80      <varlistentry>
  81       <term><literal>grs.xml</literal></term>
  82       <listitem>
  83        <para>
  84         This filter reads XML records and uses
  85         <ulink url="http://expat.sourceforge.net/">Expat</ulink> to
  86         parse them and convert them into IDZebra's internal
  87         <literal>grs</literal> record model.
  88         Only one record per file is supported, because XML does
  89         not allow two documents to follow each other (there is no way
  90         to know when a document is finished).
  91         This filter is only available if Zebra is compiled with Expat support.
  92        </para>
  93        <para>
  94         The loadable <literal>grs.xml</literal> filter module
  95         is packaged in the GNU/Debian package
  96         <literal>libidzebra2.0-mod-grs-xml</literal>.
  97         </para>
  98       </listitem>
  99      </varlistentry>
 100      <varlistentry>
 101       <term><literal>grs.regx.</literal><replaceable>filter</replaceable></term>
 102       <listitem>
 103        <para>
 104         This enables a user-supplied regular expression input
 105         filter, described in <xref linkend="grs-regx-tcl"/>.
 106        </para>
 107        <para>
 108         The loadable <literal>grs.regx</literal> filter module
 109         is packaged in the GNU/Debian package
 110         <literal>libidzebra2.0-mod-grs-regx</literal>.
 111        </para>
 112       </listitem>
 113      </varlistentry>
 114      <varlistentry>
 115       <term><literal>grs.tcl.</literal><replaceable>filter</replaceable></term>
 116       <listitem>
 117        <para>
 118         Similar to <literal>grs.regx</literal>, but uses Tcl for rules; described in
 119         <xref linkend="grs-regx-tcl"/>.
 120        </para>
 121        <para>
 122         The loadable <literal>grs.tcl</literal> filter module
 123         is also packaged in the GNU/Debian package
 124         <literal>libidzebra2.0-mod-grs-regx</literal>.
 125        </para>
 126       </listitem>
 127      </varlistentry>
 128
 129     </variablelist>
 130    </para>
 131
 132    <section id="grs-canonical-format">
 133     <title>GRS Canonical Input Format</title>
 134
 135     <para>
 136      Although input data can take any form, it is sometimes useful to
 137      describe the record processing capabilities of the system in terms of
 138      a single, canonical input format that gives access to the full
 139      spectrum of structure and flexibility in the system. In Zebra, this
 140      canonical format is an "SGML-like" syntax.
 141     </para>
 142
 143     <para>
 144      To use the canonical format, specify <literal>grs.sgml</literal> as
 145      the record type.
 146     </para>
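
    <para>
     As a minimal sketch, the record type could be selected in
     <literal>zebra.cfg</literal> with the <literal>recordType</literal>
     setting, as in the fragment below. Whether the setting is given
     globally or for a particular group depends on the installation;
     the global form is assumed here.
    </para>

    <para>

     <screen>
      # zebra.cfg fragment (sketch): read records in the canonical format
      recordType: grs.sgml
     </screen>

    </para>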
 147
 148     <para>
 149      Consider a record describing an information resource (such a record is
 150      sometimes known as a <emphasis>locator record</emphasis>).
 151      It might contain a field describing the distributor of the
 152      information resource, which might in turn be partitioned into
 153      various fields providing details about the distributor, like this:
 154     </para>
 155
 156     <para>
 157
 158      <screen>
 159       &#60;Distributor&#62;
 160         &#60;Name&#62; USGS/WRD &#60;/Name&#62;
 161         &#60;Organization&#62; USGS/WRD &#60;/Organization&#62;
 162         &#60;Street-Address&#62;
 163           U.S. GEOLOGICAL SURVEY, 505 MARQUETTE, NW
 164         &#60;/Street-Address&#62;
 165         &#60;City&#62; ALBUQUERQUE &#60;/City&#62;
 166         &#60;State&#62; NM &#60;/State&#62;
 167         &#60;Zip-Code&#62; 87102 &#60;/Zip-Code&#62;
 168         &#60;Country&#62; USA &#60;/Country&#62;
 169         &#60;Telephone&#62; (505) 766-5560 &#60;/Telephone&#62;
 170       &#60;/Distributor&#62;
 171      </screen>
 172
 173     </para>
 174
 175     <!-- There is no indentation in the example above!  -H
 176     -note-
 177      -para-
 178       The indentation used above is used to illustrate how Zebra
 179       interprets the mark-up. The indentation, in itself, has no
 180       significance to the parser for the canonical input format, which
 181       discards superfluous whitespace.
 182      -/para-
 183     -/note-
 184     -->
 185
 186     <para>
 187      The keywords surrounded by &lt;...&gt; are
 188      <emphasis>tags</emphasis>, while the sections of text
 189      in between are the <emphasis>data elements</emphasis>.
 190      A data element is characterized by its location in the tree
 191      that is made up by the nested elements.
 192      Each element is terminated by a closing tag, beginning
 193      with <literal>&lt;/</literal> and containing the same symbolic
 194      tag name as the corresponding opening tag.
 195      The general closing tag, <literal>&lt;/&gt;</literal>,
 196      terminates the element started by the last opening tag. The
 197      structuring of elements is significant.
 198      The element <emphasis>Telephone</emphasis>,
 199      for instance, may be indexed and presented to the client differently,
 200      depending on whether it appears inside the
 201      <emphasis>Distributor</emphasis> element or some other
 202      structured data element, such as a <emphasis>Supplier</emphasis> element.
 203     </para>
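
    <para>
     To illustrate the general closing tag, the following shortened
     variant of the <emphasis>Distributor</emphasis> record closes the
     inner elements with <literal>&lt;/&gt;</literal> instead of
     named closing tags. By the rule described above, each
     <literal>&lt;/&gt;</literal> should terminate the element opened
     most recently, so the fragment is expected to yield the same
     structure as the fully tagged form. It is a sketch for
     illustration, not output taken from a running system.
    </para>

    <para>

     <screen>
      &#60;Distributor&#62;
        &#60;Name&#62; USGS/WRD &#60;/&#62;
        &#60;City&#62; ALBUQUERQUE &#60;/&#62;
        &#60;State&#62; NM &#60;/&#62;
      &#60;/Distributor&#62;
     </screen>

    </para>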
 204
 205     <section id="grs-record-root">
 206      <title>Record Root</title>
 207
 208      <para>
 209       The first tag in a record describes the root node of the tree that
 210       makes up the total record. In the canonical input format, the root tag
 211       should contain the name of the schema that lends context to the
 212       elements of the record
 213       (see <xref linkend="grs-internal-representation"/>).
 214       The following is a GILS record that
 215       contains only a single element (strictly speaking, that makes it an
 216       invalid GILS record, since the GILS profile includes several mandatory
 217       elements. However, Zebra does not validate the contents of a record
 218       against the Z39.50 profile; it merely attempts to match up elements
 219       of a local representation with the given schema):
 220      </para>
 221
 222      <para>
 223
 224       <screen>
 225        &#60;gils&#62;
 226           &#60;title&#62;Zen and the Art of Motorcycle Maintenance&#60;/title&#62;
 227        &#60;/gils&#62;
 228       </screen>
 229
 230      </para>
 231
 232     </section>
 233
 234     <section id="grs-variants">
 235      <title>Variants</title>
 236
 237      <para>
 238       Zebra allows you to provide individual data elements in a number of
 239       <emphasis>variant forms</emphasis>. Examples of variant forms are
 240       textual data elements which might appear in different languages, and
 241       images which may appear in different formats or layouts.
 242       The variant system in Zebra is essentially a representation of
 243       the variant mechanism of Z39.50-1995.
 244      </para>
 245
 246      <para>
 247       The following is an example of a title element which occurs in two
 248       different languages.
 249      </para>
 250
 251      <para>
 252
 253       <screen>
 254        &#60;title&#62;
 255        &#60;var lang lang "eng"&#62;
 256        Zen and the Art of Motorcycle Maintenance&#60;/&#62;
 257        &#60;var lang lang "dan"&#62;
 258        Zen og Kunsten at Vedligeholde en Motorcykel&#60;/&#62;
 259        &#60;/title&#62;
 260       </screen>
 261
 262      </para>
 263
 264      <para>
 265       The syntax of the <emphasis>variant element</emphasis> is
 266       <literal>&lt;var class type value&gt;</literal>.
 267       The available values for the <emphasis>class</emphasis> and
 268       <emphasis>type</emphasis> fields are given by the variant set
 269       that is associated with the current schema
 270       (see <xref linkend="grs-variants"/>).
 271      </para>
 272
 273      <para>
 274       Variant elements are terminated by the general end-tag &#60;/&#62;, by
 275       the variant end-tag &#60;/var&#62;, by the appearance of another variant
 276       tag with the same <emphasis>class</emphasis> and
 277       <emphasis>value</emphasis> settings, or by the
 278       appearance of another, normal tag. In other words, the end-tags for
 279       the variants used in the example above could have been omitted.
 280      </para>
 281
 282      <para>
 283       Variant elements can be nested. The element
 284      </para>
 285
 286      <para>
 287
 288       <screen>
 289        &#60;title&#62;
 290        &#60;var lang lang "eng"&#62;&#60;var body iana "text/plain"&#62;
 291        Zen and the Art of Motorcycle Maintenance
 292        &#60;/title&#62;
 293       </screen>
 294
 295      </para>
 296
 297      <para>
 298       associates two variant components with the variant list for the title
 299       element.
 300      </para>
 301
 302      <para>
 303       Given the nesting rules described above, we could write
 304      </para>
 305
 306      <para>
 307
 308       <screen>
 309        &#60;title&#62;
 310        &#60;var body iana "text/plain"&#62;
 311        &#60;var lang lang "eng"&#62;
 312        Zen and the Art of Motorcycle Maintenance
 313        &#60;var lang lang "dan"&#62;
 314        Zen og Kunsten at Vedligeholde en Motorcykel
 315        &#60;/title&#62;
 316       </screen>
 317
 318      </para>
 319
 320      <para>
 321       The title element above comes in two variants. Both have the IANA body
 322       type "text/plain", but one is in English, and the other in
 323       Danish. The client, using the element selection mechanism of Z39.50,
 324       can retrieve information about the available variant forms of data
 325       elements, or it can select specific variants based on the requirements
 326       of the end-user.
 327      </para>
 328
 329     </section>
 330
 331    </section>
 332
 333    <section id="grs-regx-tcl">
 334     <title>GRS REGX And TCL Input Filters</title>
 335
 336     <para>
 337      In order to handle general input formats, Zebra allows the
 338      operator to define filters which read individual records in their
 339      native format and produce an internal representation that the system
 340      can work with.
 341     </para>
 342
 343     <para>
 344      Input filters are ASCII files, generally with the suffix
 345      <literal>.flt</literal>.
 346      The system looks for the files in the directories given in the
 347      <emphasis>profilePath</emphasis> setting in the
 348      <literal>zebra.cfg</literal> file.
 349      The record type for the filter is
 350      <literal>grs.regx.</literal><emphasis>filter-filename</emphasis>
 351      (fundamental type <literal>grs</literal>, file read
 352      type <literal>regx</literal>, argument
 353      <emphasis>filter-filename</emphasis>).
 354     </para>
 355
 356     <para>
 357      Generally, an input filter consists of a sequence of rules, where each
 358      rule consists of a sequence of expressions, followed by an action. The
 359      expressions are evaluated against the contents of the input record,
 360      and the actions normally contribute to the generation of an internal
 361      representation of the record.
 362     </para>
 363
 364     <para>
 365      An expression can be any of the following:
 366     </para>
 367
 368     <para>
 369      <variablelist>
 370
 371       <varlistentry>
 372        <term>INIT</term>
 373        <listitem>
 374         <para>
 375          The action associated with this expression is evaluated
 376          exactly once in the lifetime of the application, before any records
 377          are read. It can be used in conjunction with an action that
 378          initializes tables or other resources that are used in the processing
 379          of input records.
 380         </para>
 381        </listitem>
 382       </varlistentry>
 383       <varlistentry>
 384        <term>BEGIN</term>
 385        <listitem>
 386         <para>
 387          Matches the beginning of the record. It can be used to
 388          initialize variables, etc. Typically, the
 389          <emphasis>BEGIN</emphasis> rule is also used
 390          to establish the root node of the record.
 391         </para>
 392        </listitem>
 393       </varlistentry>
 394       <varlistentry>
 395        <term>END</term>
 396        <listitem>
 397         <para>
 398          Matches the end of the record, after all of the contents
 399          of the record have been processed.
 400         </para>
 401        </listitem>
 402       </varlistentry>
 403       <varlistentry>
 404        <term>/pattern/</term>
 405        <listitem>
 406         <para>
 407          Matches a string of characters from the input record.
 408         </para>
 409        </listitem>
 410       </varlistentry>
 411       <varlistentry>
 412        <term>BODY</term>
 413        <listitem>
 414         <para>
 415          This keyword may only be used between two patterns.
 416          It matches everything between (not including) those patterns.
 417         </para>
 418        </listitem>
 419       </varlistentry>
 420       <varlistentry>
 421        <term>FINISH</term>
 422        <listitem>
 423         <para>
 424          The action associated with this expression is evaluated
 425          once, before the application terminates. It can be used to release
 426          system resources, typically ones allocated in the
 427          <emphasis>INIT</emphasis> step.
 428         </para>
 429        </listitem>
 430       </varlistentry>
 431      </variablelist>
 432     </para>
 433
 434     <para>
 435      An action is surrounded by curly braces ({...}), and
 436      consists of a sequence of statements. Statements may be separated
 437      by newlines or semicolons (;).
 438      Within actions, the strings that matched the expressions
 439      immediately preceding the action can be referred to as
 440      $0, $1, $2, etc.
 441     </para>
 442
 443     <para>
 444      The available statements are:
 445     </para>
 446
 447     <para>
 448      <variablelist>
 449
 450       <varlistentry>
 451        <term>begin <replaceable>type [parameter ... ]</replaceable></term>
 452        <listitem>
 453         <para>
 454          Begin a new
 455          data element. The <replaceable>type</replaceable> is one of
 456          the following:
 457          <variablelist>
 458
 459           <varlistentry>
 460            <term>record</term>
 461            <listitem>
 462             <para>
 463              Begin a new record. The following parameter should be the
 464              name of the schema that describes the structure of the record, e.g.
 465              <literal>gils</literal> or <literal>wais</literal> (see below).
 466              The <literal>begin record</literal> call should precede
 467              any other use of the <replaceable>begin</replaceable> statement.
 468             </para>
 469            </listitem>
 470           </varlistentry>
 471           <varlistentry>
 472            <term>element</term>
 473            <listitem>
 474             <para>
 475              Begin a new tagged element. The parameter is the
 476              name of the tag. If the tag is not matched anywhere in the tagsets
 477              referenced by the current schema, it is treated as a local string
 478              tag.
 479             </para>
 480            </listitem>
 481           </varlistentry>
 482           <varlistentry>
 483            <term>variant</term>
 484            <listitem>
 485             <para>
 486              Begin a new node in a variant tree. The parameters are
 487              <replaceable>class type value</replaceable>.
 488             </para>
 489            </listitem>
 490           </varlistentry>
 491          </variablelist>
 492         </para>
 493        </listitem>
 494       </varlistentry>
 495       <varlistentry>
 496        <term>data <replaceable>parameter</replaceable></term>
 497        <listitem>
 498         <para>
 499          Create a data element. The concatenated arguments make
 500          up the value of the data element.
 501          The option <literal>-text</literal> signals that
 502          the layout (whitespace) of the data should be retained for
 503          transmission.
 504          The option <literal>-element</literal>
 505          <replaceable>tag</replaceable> wraps the data up in
 506          the <replaceable>tag</replaceable>.
 507          The use of the <literal>-element</literal> option is equivalent to
 508          preceding the command with a <replaceable>begin
 509           element</replaceable> command, and following
 510          it with the <replaceable>end</replaceable> command.
 511         </para>
 512        </listitem>
 513       </varlistentry>
 514       <varlistentry>
 515        <term>end <replaceable>[type]</replaceable></term>
 516        <listitem>
 517         <para>
 518          Close a tagged element. If no parameter is given,
 519          the last element on the stack is terminated.
 520          The first parameter, if any, is a type name, similar
 521          to the <replaceable>begin</replaceable> statement.
 522          For the <replaceable>element</replaceable> type, a tag
 523          name can be provided to terminate a specific tag.
 524         </para>
 525        </listitem>
 526       </varlistentry>
 527
 528       <varlistentry>
 529        <term>unread <replaceable>no</replaceable></term>
 530        <listitem>
 531         <para>
 532          Move the input pointer to the offset of the first character that
 533          matched the rule given by <replaceable>no</replaceable>.
 534          The first rule, counting from left to right, is numbered 0,
 535          the second rule is numbered 1, and so on.
 536         </para>
 537        </listitem>
 538       </varlistentry>
 539
 540      </variablelist>
 541     </para>
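
    <para>
     As a sketch of the equivalence described for the
     <literal>-element</literal> option, the two rules below are
     intended to produce the same tagged element from a
     <literal>From:</literal> header line. The tag name
     <literal>name</literal> and the patterns are borrowed from the
     Usenet example in this section. A real filter would contain only
     one of the two rules.
    </para>

    <para>

     <screen>
      /^From:/ BODY /$/ { data -element name $1 }

      /^From:/ BODY /$/ {
         begin element name
         data $1
         end element
      }
     </screen>

    </para>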
 542
 543     <para>
 544      The following input filter reads a Usenet news file, producing a
 545      record in the WAIS schema. Note that the body of a news posting is
 546      separated from the list of headers by a blank line (or rather a
 547      sequence of two newline characters).
 548     </para>
 549
 550     <para>
 551
 552      <screen>
 553       BEGIN                { begin record wais }
 554
 555       /^From:/ BODY /$/    { data -element name $1 }
 556       /^Subject:/ BODY /$/ { data -element title $1 }
 557       /^Date:/ BODY /$/    { data -element lastModified $1 }
 558       /\n\n/ BODY END      {
 559          begin element bodyOfDisplay
 560          begin variant body iana "text/plain"
 561          data -text $1
 562          end record
 563       }
 564      </screen>
 565
 566     </para>
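
    <para>
     To put such a filter to use, it would be referenced from
     <literal>zebra.cfg</literal>. The fragment below is a sketch under
     two assumptions: that the filter above has been saved as
     <literal>usenet.flt</literal> (a hypothetical file name), and that
     the directory shown, which is only a placeholder, is where the
     file resides.
    </para>

    <para>

     <screen>
      # zebra.cfg fragment (sketch; file name and path are assumptions)
      profilePath: /path/to/filters
      recordType: grs.regx.usenet
     </screen>

    </para>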
 567
 568     <para>
 569      If Zebra is compiled with support for Tcl enabled, the statements
 570      described above are supplemented with a complete
 571      scripting environment, including control structures (conditional
 572      expressions and loop constructs), and powerful string manipulation
 573      mechanisms for modifying the elements of a record.
 574     </para>
 575
 576    </section>
 577
 578   </section>
 579
 580   <section id="grs-internal-representation">
 581    <title>GRS Internal Record Representation</title>
 582
 583    <para>
 584     When records are manipulated by the system, they're represented in a
 585     tree-structure, with data elements at the leaf nodes, and tags or
 586     variant components at the non-leaf nodes. The root-node identifies the
 587     schema that lends context to the tagging and structuring of the
 588     record. Imagine a simple record, consisting of a 'title' element and
 589     an 'author' element:
 590    </para>
 591
 592    <para>
 593
 594     <screen>
 595      ROOT
 596         TITLE     "Zen and the Art of Motorcycle Maintenance"
 597         AUTHOR    "Robert Pirsig"
 598     </screen>
 599
 600    </para>
 601
 602    <para>
 603     A slightly more complex record would have the author element consist
 604     of two elements, a surname and a first name:
 605    </para>
 606
 607    <para>
 608
 609     <screen>
 610      ROOT
 611         TITLE  "Zen and the Art of Motorcycle Maintenance"
 612         AUTHOR
 613            FIRST-NAME "Robert"
 614            SURNAME    "Pirsig"
 615     </screen>
 616
 617    </para>
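
   <para>
    In the canonical input format described in
    <xref linkend="grs-canonical-format"/>, a record with this tree
    structure might be written as shown below. The root tag
    <literal>myschema</literal> is a hypothetical placeholder for the
    name of the schema, and the tag names are simply lower-case forms
    of those shown in the tree.
   </para>

   <para>

    <screen>
     &#60;myschema&#62;
       &#60;title&#62;Zen and the Art of Motorcycle Maintenance&#60;/title&#62;
       &#60;author&#62;
         &#60;first-name&#62;Robert&#60;/first-name&#62;
         &#60;surname&#62;Pirsig&#60;/surname&#62;
       &#60;/author&#62;
     &#60;/myschema&#62;
    </screen>

   </para>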
 618
 619    <para>
 620     The root of the record will refer to the record schema that describes
 621     the structuring of this particular record. The schema defines the
 622     element tags (TITLE, FIRST-NAME, etc.) that may occur in the record, as
 623     well as the structuring (SURNAME should appear below AUTHOR, etc.). In
 624     addition, the schema establishes element set names that are used by
 625     the client to request a subset of the elements of a given record. The
 626     schema may also establish rules for converting the record to a
 627     different schema, by stating, for each element, a mapping to a
 628     different tag path.
 629    </para>
 630
 631    <section id="grs-tagged-elements">
 632     <title>Tagged Elements</title>
 633
 634     <para>
 635      A data element is characterized by its tag, and its position in the
 636      structure of the record. For instance, while the tag "telephone
 637      number" may be used in different places in a record, we may need to
 638      distinguish between these occurrences, both for searching and
 639      presentation purposes. For instance, while the phone numbers for the
 640      "customer" and the "service provider" are both
 641      representatives for the same type of resource (a telephone number), it
 642      is essential that they be kept separate. The record schema provides
 643      the structure of the record, and names each data element (defined by
 644      the sequence of tags - the tag path - by which the element can be
 645      reached from the root of the record).
 646     </para>
 647
 648    </section>
 649
 650    <section id="grs-variant-details">
 651     <title>Variants</title>
 652
 653     <para>
 654      The children of a tag node may be either more tag nodes, a data node
 655      (possibly accompanied by tag nodes),
 656      or a tree of variant nodes. The children of variant nodes are either
 657      more variant nodes or a data node (possibly accompanied by more
 658      variant nodes). Each leaf node, which is normally a
 659      data node, corresponds to a <emphasis>variant form</emphasis> of the
 660      tagged element identified by the tag which parents the variant tree.
 661      The following title element occurs in two different languages:
 662     </para>
 663
 664     <para>
 665
 666      <screen>
 667       TITLE
 668          VARIANT LANG=ENG  "War and Peace"
 669          VARIANT LANG=DAN  "Krig og Fred"
 670      </screen>
 671
 672     </para>
 673
 674     <para>
 675      Which of the two elements is transmitted to the client by the server
 676      depends on the specifications provided by the client, if any.
 677     </para>
 678
 679     <para>
 680      In practice, each variant node is associated with a triple of class,
 681      type, value, corresponding to the variant mechanism of Z39.50.
 682     </para>
 683
 684    </section>
 685
 686    <section id="grs-data-elements">
 687     <title>Data Elements</title>
 688
 689     <para>
 690      Data nodes have no children (they are always leaf nodes in the record
 691      tree).
 692     </para>
 693
 694     <!--
 695     FIXME! Documentation needs extension here about types of nodes - numerical,
 696     textual, etc., plus the various types of inclusion notes.
 697    </para>
 698     -->
 699
 700    </section>
 701
 702   </section>
 703
 704   <section id="grs-conf">
 705    <title>GRS Record Model Configuration</title>
 706
 707    <para>
 708     The following sections describe the configuration files that govern
 709     the internal management of <literal>grs</literal> records.
 710     The system searches for the files
 711     in the directories specified by the <emphasis>profilePath</emphasis>
 712     setting in the <literal>zebra.cfg</literal> file.
 713    </para>
 714
 715    <section id="grs-abstract-syntax">
 716     <title>The Abstract Syntax</title>
 717
 718     <para>
 719      The abstract syntax definition (also known as an Abstract Record
 720      Structure, or ARS) is the focal point of the
 721      record schema description. For a given schema, the ABS file may state any
 722      or all of the following:
 723     </para>
 724
 725     <!--
 726      FIXME - Need a diagram here, or a simple explanation how it all hangs together -H
 727     -->
 728
 729     <para>
 730
 731      <itemizedlist>
 732       <listitem>
 733
 734        <para>
 735         The object identifier of the Z39.50 schema associated
 736         with the ARS, so that it can be referred to by the client.
 737        </para>
 738       </listitem>
 739
 740       <listitem>
 741        <para>
 742         The attribute set (which can possibly be a compound of multiple
 743         sets) which applies in the profile. This is used when indexing and
 744         searching the records belonging to the given profile.
 745        </para>
 746       </listitem>
 747
 748       <listitem>
 749        <para>
 750         The tag set (again, this can consist of several different sets).
 751         This is used when reading the records from a file, to recognize the
 752         different tags, and when transmitting the record to the client -
 753         mapping the tags to their numerical representation, if they are
 754         known.
 755        </para>
 756       </listitem>
 757
 758       <listitem>
 759        <para>
 760         The variant set which is used in the profile. This provides a
 761         vocabulary for specifying the <emphasis>forms</emphasis> of
 762         data that appear inside the records.
 763        </para>
 764       </listitem>
 765
 766       <listitem>
 767        <para>
 768         Element set names, which are a shorthand way for the client to
 769         ask for a subset of the data elements contained in a record. Element
 770         set names, in the retrieval module, are mapped to <emphasis>element
 771          specifications</emphasis>, which contain information equivalent to the
 772         <emphasis>Espec-1</emphasis> syntax of Z39.50.
 773        </para>
 774       </listitem>
 775
 776       <listitem>
 777        <para>
 778         Map tables, which may specify mappings to
 779         <emphasis>other</emphasis> database profiles, if desired.
 780        </para>
 781       </listitem>
 782
 783       <listitem>
 784        <para>
 785         Possibly, a set of rules describing the mapping of elements to a
 786         MARC representation.
 787
 788        </para>
 789       </listitem>
 790
 791       <listitem>
 792        <para>
 793         A list of element descriptions (this is the actual ARS of the
 794         schema, in Z39.50 terms), which lists the ways in which the various
 795         tags can be used and organized hierarchically.
 796        </para>
 797       </listitem>
 798
 799      </itemizedlist>
 800
 801     </para>
 802
 803     <para>
 804      Several of the entries above simply refer to other files, which
 805      describe the given objects.
 806     </para>
 807
 808    </section>
 809
 810    <section id="grs-configuration-files">
 811     <title>The Configuration Files</title>
 812
 813     <para>
 814      This section describes the syntax and use of the various tables which
 815      are used by the retrieval module.
 816     </para>
 817
 818     <para>
 819      The number of different file types may appear daunting at first, but
 820      each type corresponds fairly clearly to a single aspect of the Z39.50
 821      retrieval facilities. Further, the average database administrator,
 822      who is simply reusing an existing profile for which tables already
 823      exist, shouldn't have to worry too much about the contents of these tables.
 824     </para>
 825
 826     <para>
 827      Generally, the files are simple ASCII files, which can be maintained
 828      using any text editor. Blank lines, and lines beginning with a hash character (#), are
 829      ignored. Any characters on a line that follow a (#) are also ignored.
 830      All other lines contain <emphasis>directives</emphasis>, which provide
 831      some setting or value to the system.
 832      Generally, settings are characterized by a single
 833      keyword, identifying the setting, followed by a number of parameters.
 834      Some settings are repeatable (r), while others may occur only once in a
 835      file. Some settings are optional (o), while others again are
 836      mandatory (m).
 837     </para>
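
   <para>
    As a minimal sketch of these conventions, the fragment below shows a
    comment line, a blank line, and trailing comments. The two directives
    are borrowed from the GILS profile purely for illustration, and the
    annotations reuse the (m), (o), and (r) markers introduced above.
   </para>

   <para>
    <screen>
     # A line beginning with a hash sign is ignored

     name gils                 # keyword "name" with one parameter (m)
     maptab gils-usmarc.map    # optional and repeatable (o,r)
    </screen>
   </para>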
 838
 839    </section>
 840
 841    <section id="abs-file">
 842     <title>The Abstract Syntax (.abs) Files</title>
 843
 844     <para>
 845      The name of this file type is slightly misleading in Z39.50 terms,
 846      since, apart from the actual abstract syntax of the profile, it also
 847      includes most of the other definitions that go into a database
 848      profile.
 849     </para>
 850
 851     <para>
 852      When a record in the canonical, SGML-like format is read from a file
 853      or from the database, the first tag of the file should reference the
 854      profile that governs the layout of the record. If the first tag of the
 855      record is, say, <literal>&lt;gils&gt;</literal>, the system will look
 856      for the profile definition in the file <literal>gils.abs</literal>.
 857      Profile definitions are cached, so they only have to be read once
 858      during the lifespan of the current process.
 859     </para>
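
   <para>
    As an illustration, a record whose first tag selects the
    <literal>gils</literal> profile might look like the sketch below.
    The element names are assumed to match names declared in
    <literal>gils.abs</literal>, and the element contents are invented
    for this example.
   </para>

   <para>
    <screen>
     &lt;gils&gt;
      &lt;title&gt;Sample record title&lt;/title&gt;
      &lt;abstract&gt;Short description of the resource.&lt;/abstract&gt;
     &lt;/gils&gt;
    </screen>
   </para>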
 860
 861     <para>
 862      When writing your own input filters, the
 863      <emphasis>record-begin</emphasis> command
 864      introduces the profile, and should always be called first thing when
 865      introducing a new record.
 866     </para>
 867
 868     <para>
 869      The file may contain the following directives:
 870     </para>
 871
 872     <para>
 873      <variablelist>
 874
 875       <varlistentry>
 876        <term>name <replaceable>symbolic-name</replaceable></term>
 877        <listitem>
 878         <para>
 879          (m) This provides a shorthand name or
 880          description for the profile. Mostly useful for diagnostic purposes.
 881         </para>
 882        </listitem>
 883       </varlistentry>
 884       <varlistentry>
 885        <term>reference <replaceable>OID-name</replaceable></term>
 886        <listitem>
 887         <para>
 888          (m) The reference name of the OID for the profile.
 889          The reference names can be found in the <emphasis>util</emphasis>
 890          module of YAZ.
 891         </para>
 892        </listitem>
 893       </varlistentry>
 894       <varlistentry>
 895        <term>attset <replaceable>filename</replaceable></term>
 896        <listitem>
 897         <para>
 898          (m) The attribute set that is used for
 899          indexing and searching records belonging to this profile.
 900         </para>
 901        </listitem>
 902       </varlistentry>
 903       <varlistentry>
 904        <term>tagset <replaceable>filename</replaceable></term>
 905        <listitem>
 906         <para>
        (o) The tag set (if any) that describes
        the fields of the records.
 909         </para>
 910        </listitem>
 911       </varlistentry>
 912       <varlistentry>
 913        <term>varset <replaceable>filename</replaceable></term>
 914        <listitem>
 915         <para>
 916          (o) The variant set used in the profile.
 917         </para>
 918        </listitem>
 919       </varlistentry>
 920       <varlistentry>
 921        <term>maptab <replaceable>filename</replaceable></term>
 922        <listitem>
 923         <para>
 924          (o,r) This points to a
 925          conversion table that might be used if the client asks for the record
 926          in a different schema from the native one.
 927         </para>
 928        </listitem>
 929       </varlistentry>
 930       <varlistentry>
 931        <term>marc <replaceable>filename</replaceable></term>
 932        <listitem>
 933         <para>
 934          (o) Points to a file containing parameters
 935          for representing the record contents in the ISO2709 syntax.
 936          Read the description of the MARC representation facility below.
 937         </para>
 938        </listitem>
 939       </varlistentry>
 940       <varlistentry>
 941        <term>esetname <replaceable>name filename</replaceable></term>
 942        <listitem>
 943         <para>
 944          (o,r) Associates the
 945          given element set name with an element selection file. If an (@) is
 946          given in place of the filename, this corresponds to a null mapping for
 947          the given element set name.
 948         </para>
 949        </listitem>
 950       </varlistentry>
 951       <varlistentry>
 952        <term>all <replaceable>tags</replaceable></term>
 953        <listitem>
 954         <para>
 955          (o) This directive specifies a list of attributes
 956          which should be appended to the attribute list given for each
 957          element. The effect is to make every single element in the abstract
 958          syntax searchable by way of the given attributes. This directive
 959          provides an efficient way of supporting free-text searching across all
 960          elements. However, it does increase the size of the index
 961          significantly. The attributes can be qualified with a structure, as in
 962          the <replaceable>elm</replaceable> directive below.
 963         </para>
 964        </listitem>
 965       </varlistentry>
 966       <varlistentry>
 967        <term>elm <replaceable>path name attributes</replaceable></term>
 968        <listitem>
 969         <para>
 970          (o,r) Adds an element to the abstract record syntax of the schema.
 971          The <replaceable>path</replaceable> follows the
 972          syntax which is suggested by the Z39.50 document - that is, a sequence
 973          of tags separated by slashes (&#x2f;). Each tag is given as a
        comma-separated pair of tag type and -value surrounded by parentheses.
 975          The <replaceable>name</replaceable> is the name of the element, and
 976          the <replaceable>attributes</replaceable>
 977          specifies which attributes to use when indexing the element in a
 978          comma-separated list.
 979          A <literal>!</literal> in place of the attribute name is equivalent
 980          to specifying an attribute name identical to the element name.
 981          A <literal>-</literal> in place of the attribute name
 982          specifies that no indexing is to take place for the given element.
 983          The attributes can be qualified with <replaceable>field
 984           types</replaceable> to specify which
 985          character set should govern the indexing procedure for that field.
 986          The same data element may be indexed into several different
 987          fields, using different character set definitions.
        See <xref linkend="fields-and-charsets"/>.
 989          The default field type is <literal>w</literal> for
 990          <emphasis>word</emphasis>.
 991         </para>
 992        </listitem>
 993       </varlistentry>
 994
 995       <varlistentry>
 996        <term>xelm <replaceable>xpath attributes</replaceable></term>
 997        <listitem>
 998         <para>
 999          Specifies indexing for record nodes given by
1000          <replaceable>xpath</replaceable>. Unlike directive
1001          elm, this directive allows you to index attribute
1002          contents. The <replaceable>xpath</replaceable> uses
1003          a syntax similar to XPath. The <replaceable>attributes</replaceable>
        have the same syntax and meaning as for directive elm, except that operator
1005          ! refers to the nodes selected by <replaceable>xpath</replaceable>.
1006          <!--
1007          xelm   /         !:w                 default index
1008          xelm   //        !:w                 additional index
1009          xelm   /gils/title/@att    myatt:w   index attribute @att in myatt
1010          xelm   title/@att          myatt:w   same meaning.
1011          -->
1012         </para>
1013        </listitem>
1014       </varlistentry>
1015       <varlistentry>
1016        <term>melm <replaceable>field$subfield attributes</replaceable></term>
1017        <listitem>
1018         <para>
1019          This directive is specifically for MARC-formatted records,
1020          ingested either in the form of MARCXML documents, or in the
1021          ISO2709/Z39.2 format using the grs.marcxml input filter. You can
1022          specify indexing rules for any subfield, or you can leave off the
1023          <replaceable>$subfield</replaceable> part and specify default rules
1024          for all subfields of the given field (note: default rules should come
1025          after any subfield-specific rules in the configuration file). The
1026          <replaceable>attributes</replaceable> have the same syntax and meaning
1027          as for the 'elm' directive above.
1028         </para>
1029        </listitem>
1030       </varlistentry>
1031       <varlistentry>
1032        <term>encoding <replaceable>encodingname</replaceable></term>
1033        <listitem>
1034         <para>
1035          This directive specifies character encoding for external records.
        For records, such as XML, that specify their encoding within the
        file via a header, this directive is ignored.
        If this directive is not given and no encoding is set
        within the external records, ISO-8859-1 encoding is assumed.
1040          </para>
1041        </listitem>
1042       </varlistentry>
1043       <varlistentry>
1044        <term>xpath <literal>enable</literal>/<literal>disable</literal></term>
1045        <listitem>
1046         <para>
1047          If this directive is followed by <literal>enable</literal>,
1048          then extra indexing is performed to allow for XPath-like queries.
1049          If this directive is not specified - equivalent to
1050          <literal>disable</literal> - no extra XPath-indexing is performed.
1051         </para>
1052        </listitem>
1053       </varlistentry>
1054
1055       <!-- Adam's version
1056       <varlistentry>
1057        <term>systag <replaceable>systemtag</replaceable> <replaceable>element</replaceable></term>
1058        <listitem>
1059         <para>
1060          This directive maps system information to an element during
1061          retrieval. This information is dynamically created. The
1062          following system tags are defined
1063          <variablelist>
1064           <varlistentry>
1065            <term>size</term>
1066            <listitem>
1067             <para>
1068              Size of record in bytes. By default this
1069              is mapped to element <literal>size</literal>.
1070             </para>
1071            </listitem>
1072           </varlistentry>
1073
1074           <varlistentry>
1075            <term>rank</term>
1076            <listitem>
1077             <para>
1078              Score/rank of record. By default this
1079              is mapped to element <literal>rank</literal>.
1080              If no score was calculated for the record (non-ranked
1081              searched) search this directive is ignored.
1082             </para>
1083            </listitem>
1084           </varlistentry>
1085
1086           <varlistentry>
1087            <term>sysno</term>
1088            <listitem>
1089             <para>
1090              Zebra's system number (record ID) for the
1091              record. By default this is mapped to element
1092              <literal>localControlNumber</literal>.
1093             </para>
1094            </listitem>
1095           </varlistentry>
1096          </variablelist>
1097          If you do not want a particular system tag to be applied,
1098          then set the resulting element to something undefined in the
1099          abs file (such as <literal>none</literal>).
1100         </para>
1101        </listitem>
1102       </varlistentry>
1103       -->
1104
1105       <!-- Mike's version -->
1106       <varlistentry>
1107        <term>
1108         systag
1109         <replaceable>systemTag</replaceable>
1110         <replaceable>actualTag</replaceable>
1111        </term>
1112        <listitem>
1113         <para>
1114          Specifies what information, if any, Zebra should
1115          automatically include in retrieval records for the
        <quote>system fields</quote> that it supports.
1117          <replaceable>systemTag</replaceable> may
1118          be any of the following:
1119          <variablelist>
1120           <varlistentry>
1121            <term><literal>rank</literal></term>
1122            <listitem><para>
1123             An integer indicating the relevance-ranking score
1124             assigned to the record.
1125            </para></listitem>
1126           </varlistentry>
1127           <varlistentry>
1128            <term><literal>sysno</literal></term>
1129            <listitem><para>
1130             An automatically generated identifier for the record,
1131             unique within this database.  It is represented by the
1132             <literal>&lt;localControlNumber&gt;</literal> element in
1133             XML and the <literal>(1,14)</literal> tag in GRS-1.
1134            </para></listitem>
1135           </varlistentry>
1136           <varlistentry>
1137            <term><literal>size</literal></term>
1138            <listitem><para>
1139             The size, in bytes, of the retrieved record.
1140            </para></listitem>
1141           </varlistentry>
1142          </variablelist>
1143         </para>
1144         <para>
1145          The <replaceable>actualTag</replaceable> parameter may be
1146          <literal>none</literal> to indicate that the named element
1147          should be omitted from retrieval records.
1148         </para>
1149        </listitem>
1150       </varlistentry>
1151      </variablelist>
1152     </para>
1153
1154     <note>
1155      <para>
1156       The mechanism for controlling indexing is not adequate for
1157       complex databases, and will probably be moved into a separate
1158       configuration table eventually.
1159      </para>
1160     </note>
1161
1162     <para>
1163      The following is an excerpt from the abstract syntax file for the GILS
1164      profile.
1165     </para>
1166
1167     <para>
1168
1169      <screen>
1170       name gils
1171       reference GILS-schema
1172       attset gils.att
1173       tagset gils.tag
1174       varset var1.var
1175
1176       maptab gils-usmarc.map
1177
1178       # Element set names
1179
1180       esetname VARIANT gils-variant.est  # for WAIS-compliance
1181       esetname B gils-b.est
1182       esetname G gils-g.est
1183       esetname F @
1184
1185       elm (1,10)               rank                        -
1186       elm (1,12)               url                         -
1187       elm (1,14)               localControlNumber     Local-number
1188       elm (1,16)               dateOfLastModification Date/time-last-modified
1189       elm (2,1)                title                       w:!,p:!
1190       elm (4,1)                controlIdentifier      Identifier-standard
1191       elm (2,6)                abstract               Abstract
1192       elm (4,51)               purpose                     !
1193       elm (4,52)               originator                  -
1194       elm (4,53)               accessConstraints           !
1195       elm (4,54)               useConstraints              !
1196       elm (4,70)               availability                -
1197       elm (4,70)/(4,90)        distributor                 -
1198       elm (4,70)/(4,90)/(2,7)  distributorName             !
1199       elm (4,70)/(4,90)/(2,10) distributorOrganization     !
1200       elm (4,70)/(4,90)/(4,2)  distributorStreetAddress    !
1201       elm (4,70)/(4,90)/(4,3)  distributorCity             !
1202      </screen>
1203
1204     </para>
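
   <para>
    The excerpt above does not use the <literal>all</literal>,
    <literal>xelm</literal>, <literal>melm</literal>,
    <literal>encoding</literal>, <literal>xpath</literal>, or
    <literal>systag</literal> directives. The following sketch shows how
    they might appear, based only on the descriptions in this section.
    It is not taken from a shipped profile: the attribute names
    <literal>any</literal>, <literal>title</literal>, and
    <literal>myatt</literal> are assumed to be defined in the attribute
    set, and the MARC field 245, the encoding name, and the chosen
    <literal>systag</literal> mappings are hypothetical.
   </para>

   <para>
    <screen>
     encoding utf-8                     # used only if the record declares none
     xpath enable                       # extra indexing for XPath-like queries

     all any                            # append attribute "any" to every element

     xelm /gils/title/@att   myatt:w    # index attribute att of title in myatt

     melm 245$a              title      # rule for one specific subfield
     melm 245                any        # default rule, after the specific one

     systag sysno            none       # omit the system number from records
     systag rank             rank       # return the ranking score as "rank"
    </screen>
   </para>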
1205
1206    </section>
1207
1208    <section id="attset-files">
1209     <title>The Attribute Set (.att) Files</title>
1210
1211     <para>
1212      This file type describes the <replaceable>Use</replaceable> elements of
1213      an attribute set.
1214      It contains the following directives.
1215     </para>
1216
1217     <para>
1218      <variablelist>
1219       <varlistentry>
1220        <term>name <replaceable>symbolic-name</replaceable></term>
1221        <listitem>
1222         <para>
1223          (m) This provides a shorthand name or
1224          description for the attribute set.
1225          Mostly useful for diagnostic purposes.
1226         </para>
1227        </listitem></varlistentry>
1228       <varlistentry>
1229        <term>reference <replaceable>OID-name</replaceable></term>
1230        <listitem>
1231         <para>
1232          (m) The reference name of the OID for
1233          the attribute set.
1234          The reference names can be found in the <replaceable>util</replaceable>
1235          module of <replaceable>YAZ</replaceable>.
1236         </para>
1237        </listitem></varlistentry>
1238       <varlistentry>
1239        <term>include <replaceable>filename</replaceable></term>
1240        <listitem>
1241         <para>
1242          (o,r) This directive is used to
1243          include another attribute set as a part of the current one. This is
1244          used when a new attribute set is defined as an extension to another
1245          set. For instance, many new attribute sets are defined as extensions
1246          to the <replaceable>bib-1</replaceable> set.
1247          This is an important feature of the retrieval
1248          system of Z39.50, as it ensures the highest possible level of
1249          interoperability, as those access points of your database which are
1250          derived from the external set (say, bib-1) can be used even by clients
1251          who are unaware of the new set.
1252         </para>
1253        </listitem></varlistentry>
1254       <varlistentry>
1255        <term>att
1256         <replaceable>att-value att-name [local-value]</replaceable></term>
1257        <listitem>
1258         <para>
1259          (o,r) This
1260          repeatable directive introduces a new attribute to the set. The
1261          attribute value is stored in the index (unless a
1262          <replaceable>local-value</replaceable> is
1263          given, in which case this is stored). The name is used to refer to the
1264          attribute from the <replaceable>abstract syntax</replaceable>.
1265         </para>
1266        </listitem></varlistentry>
1267      </variablelist>
1268     </para>
1269
1270     <para>
1271      This is an excerpt from the GILS attribute set definition.
1272      Notice how the file describing the <emphasis>bib-1</emphasis>
1273      attribute set is referenced.
1274     </para>
1275
1276     <para>
1277
1278      <screen>
1279       name gils
1280       reference GILS-attset
1281       include bib1.att
1282
1283       att 2001          distributorName
1284       att 2002          indextermsControlled
1285       att 2003          purpose
1286       att 2004          accessConstraints
1287       att 2005          useConstraints
1288      </screen>
1289
1290     </para>
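
   <para>
    The optional <replaceable>local-value</replaceable> does not appear in
    the excerpt above. A hypothetical entry using it might look like the
    line below, where the attribute value 3001, the name
    <literal>reportNumber</literal>, and the local value 27 are all
    invented for illustration. Under the rule described for the
    <literal>att</literal> directive, 27 rather than 3001 would be stored
    in the index.
   </para>

   <para>
    <screen>
     att 3001          reportNumber            27
    </screen>
   </para>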
1291
1292    </section>
1293
1294    <section id="grs-tag-files">
1295     <title>The Tag Set (.tag) Files</title>
1296
1297     <para>
1298      This file type defines the tagset of the profile, possibly by
1299      referencing other tag sets (most tag sets, for instance, will include
    tagsetG and tagsetM from the Z39.50 specification). The file may
1301      contain the following directives.
1302     </para>
1303
1304     <para>
1305      <variablelist>
1306
1307       <varlistentry>
1308        <term>name <emphasis>symbolic-name</emphasis></term>
1309        <listitem>
1310         <para>
1311          (m) This provides a shorthand name or
1312          description for the tag set. Mostly useful for diagnostic purposes.
1313         </para>
1314        </listitem></varlistentry>
1315       <varlistentry>
1316        <term>reference <emphasis>OID-name</emphasis></term>
1317        <listitem>
1318         <para>
1319          (o) The reference name of the OID for the tag set.
1320          The reference names can be found in the <emphasis>util</emphasis>
1321          module of <emphasis>YAZ</emphasis>.
1322          The directive is optional, since not all tag sets
1323          are registered outside of their schema.
1324         </para>
1325        </listitem></varlistentry>
1326       <varlistentry>
1327        <term>type <emphasis>integer</emphasis></term>
1328        <listitem>
1329         <para>
1330          (m) The type number of the tagset within the schema
1331          profile (note: this specification really should belong to the .abs
1332          file. This will be fixed in a future release).
1333         </para>
1334        </listitem></varlistentry>
1335       <varlistentry>
1336        <term>include <emphasis>filename</emphasis></term>
1337        <listitem>
1338         <para>
1339          (o,r) This directive is used
1340          to include the definitions of other tag sets into the current one.
1341         </para>
1342        </listitem></varlistentry>
1343       <varlistentry>
1344        <term>tag <emphasis>number names type</emphasis></term>
1345        <listitem>
1346         <para>
1347          (o,r) Introduces a new tag to the set.
1348          The <emphasis>number</emphasis> is the tag number as used
        in the protocol (there is currently no mechanism for
        specifying string tags, but this would be quick
        work to add).
1352          The <emphasis>names</emphasis> parameter is a list of names
1353          by which the tag should be recognized in the input file format.
1354          The names should be separated by slashes (/).
1355          The <emphasis>type</emphasis> is the recommended data type of
1356          the tag.
1357          It should be one of the following:
1358
1359          <itemizedlist>
1360           <listitem>
1361            <para>
1362             structured
1363            </para>
1364           </listitem>
1365
1366           <listitem>
1367            <para>
1368             string
1369            </para>
1370           </listitem>
1371
1372           <listitem>
1373            <para>
1374             numeric
1375            </para>
1376           </listitem>
1377
1378           <listitem>
1379            <para>
1380             bool
1381            </para>
1382           </listitem>
1383
1384           <listitem>
1385            <para>
1386             oid
1387            </para>
1388           </listitem>
1389
1390           <listitem>
1391            <para>
1392             generalizedtime
1393            </para>
1394           </listitem>
1395
1396           <listitem>
1397            <para>
1398             intunit
1399            </para>
1400           </listitem>
1401
1402           <listitem>
1403            <para>
1404             int
1405            </para>
1406           </listitem>
1407
1408           <listitem>
1409            <para>
1410             octetstring
1411            </para>
1412           </listitem>
1413
1414           <listitem>
1415            <para>
1416             null
1417            </para>
1418           </listitem>
1419
1420          </itemizedlist>
1421
1422         </para>
1423        </listitem></varlistentry>
1424      </variablelist>
1425     </para>
1426
1427     <para>
1428      The following is an excerpt from the TagsetG definition file.
1429     </para>
1430
1431     <para>
1432      <screen>
1433       name tagsetg
1434       reference TagsetG
1435       type 2
1436
1437       tag       1       title           string
1438       tag       2       author          string
1439       tag       3       publicationPlace string
1440       tag       4       publicationDate string
1441       tag       5       documentId      string
1442       tag       6       abstract        string
1443       tag       7       name            string
1444       tag       8       date            generalizedtime
1445       tag       9       bodyOfDisplay   string
1446       tag       10      organization    string
1447      </screen>
1448     </para>
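
   <para>
    Each tag in the excerpt above carries a single name. To sketch the
    slash-separated <emphasis>names</emphasis> list, a hypothetical entry
    that lets the input use either <literal>title</literal> or the
    invented alias <literal>ti</literal> for tag 1 could be written as
    follows.
   </para>

   <para>
    <screen>
     tag       1       title/ti        string
    </screen>
   </para>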
1449
1450    </section>
1451
1452    <section id="grs-var-files">
1453     <title>The Variant Set (.var) Files</title>
1454
1455     <para>
1456      The variant set file is a straightforward representation of the
1457      variant set definitions associated with the protocol. At present, only
1458      the <emphasis>Variant-1</emphasis> set is known.
1459     </para>
1460
1461     <para>
1462      These are the directives allowed in the file.
1463     </para>
1464
1465     <para>
1466      <variablelist>
1467
1468       <varlistentry>
1469        <term>name <emphasis>symbolic-name</emphasis></term>
1470        <listitem>
1471         <para>
1472          (m) This provides a shorthand name or
1473          description for the variant set. Mostly useful for diagnostic purposes.
1474         </para>
1475        </listitem></varlistentry>
1476       <varlistentry>
1477        <term>reference <emphasis>OID-name</emphasis></term>
1478        <listitem>
1479         <para>
1480          (o) The reference name of the OID for
1481          the variant set, if one is required. The reference names can be found
1482          in the <emphasis>util</emphasis> module of <emphasis>YAZ</emphasis>.
1483         </para>
1484        </listitem></varlistentry>
1485       <varlistentry>
1486        <term>class <emphasis>integer class-name</emphasis></term>
1487        <listitem>
1488         <para>
1489          (m,r) Introduces a new
1490          class to the variant set.
1491         </para>
1492        </listitem></varlistentry>
1493       <varlistentry>
1494        <term>type <emphasis>integer type-name datatype</emphasis></term>
1495        <listitem>
1496         <para>
        (m,r) Adds a
1498          new type to the current class (the one introduced by the most recent
1499          <emphasis>class</emphasis> directive).
1500          The type names belong to the same name space as the one used
1501          in the tag set definition file.
1502         </para>
1503        </listitem></varlistentry>
1504      </variablelist>
1505     </para>
1506
1507     <para>
1508      The following is an excerpt from the file describing the variant set
1509      <emphasis>Variant-1</emphasis>.
1510     </para>
1511
1512     <para>
1513
1514      <screen>
1515       name variant-1
1516       reference Variant-1
1517
1518       class 1 variantId
1519
1520       type      1       variantId               octetstring
1521
1522       class 2 body
1523
1524       type      1       iana                    string
1525       type      2       z39.50                  string
1526       type      3       other                   string
1527      </screen>
1528
1529     </para>
1530
1531    </section>
1532
1533    <section id="grs-est-files">
1534     <title>The Element Set (.est) Files</title>
1535
1536     <para>
1537      The element set specification files describe a selection of a subset
1538      of the elements of a database record. The element selection mechanism
1539      is equivalent to the one supplied by the <emphasis>Espec-1</emphasis>
1540      syntax of the Z39.50 specification.
1541      In fact, the internal representation of an element set
1542      specification is identical to the <emphasis>Espec-1</emphasis> structure,
1543      and we'll refer you to the description of that structure for most of
1544      the detailed semantics of the directives below.
1545     </para>
1546
1547     <note>
1548      <para>
1549       Not all of the Espec-1 functionality has been implemented yet.
     The fields that are mentioned below all work as expected, unless
     otherwise noted.
1552      </para>
1553     </note>
1554
1555     <para>
1556      The directives available in the element set file are as follows:
1557     </para>
1558
1559     <para>
1560      <variablelist>
1561       <varlistentry>
1562        <term>defaultVariantSetId <emphasis>OID-name</emphasis></term>
1563        <listitem>
1564         <para>
1565          (o) If variants are used in
1566          the following, this should provide the name of the variantset used
1567          (it's not currently possible to specify a different set in the
1568          individual variant request). In almost all cases (certainly all
1569          profiles known to us), the name
1570          <literal>Variant-1</literal> should be given here.
1571         </para>
1572        </listitem></varlistentry>
1573       <varlistentry>
1574        <term>defaultVariantRequest <emphasis>variant-request</emphasis></term>
1575        <listitem>
1576         <para>
1577          (o) This directive
1578          provides a default variant request for
1579          use when the individual element requests (see below) do not contain a
1580          variant request. Variant requests consist of a blank-separated list of
        variant components. A variant component is a comma-separated,
1582          parenthesized triple of variant class, type, and value (the two former
1583          values being represented as integers). The value can currently only be
1584          entered as a string (this will change to depend on the definition of
1585          the variant in question). The special value (@) is interpreted as a
1586          null value, however.
1587         </para>
1588        </listitem></varlistentry>
1589       <varlistentry>
1590        <term>simpleElement
1591         <emphasis>path ['variant' variant-request]</emphasis></term>
1592        <listitem>
1593         <para>
1594          (o,r) This corresponds to a simple element request
1595          in <emphasis>Espec-1</emphasis>.
1596          The path consists of a sequence of tag-selectors, where each of
1597          these can consist of either:
1598         </para>
1599
1600         <para>
1601          <itemizedlist>
1602           <listitem>
1603            <para>
1604             A simple tag, consisting of a comma-separated type-value pair in
           parentheses, possibly followed by a colon (:) followed by an
1606             occurrences-specification (see below). The tag-value can be a number
1607             or a string. If the first character is an apostrophe ('), this
1608             forces the value to be interpreted as a string, even if it
1609             appears to be numerical.
1610            </para>
1611           </listitem>
1612
1613           <listitem>
1614            <para>
1615             A WildThing, represented as a question mark (?), possibly
1616             followed by a colon (:) followed by an occurrences
1617             specification (see below).
1618            </para>
1619           </listitem>
1620
1621           <listitem>
1622            <para>
           A WildPath, represented as an asterisk (*). Note that the last
           element of the path should not be a WildPath (WildPaths don't
           work in this version).
1626            </para>
1627           </listitem>
1628
1629          </itemizedlist>
1630
1631         </para>
1632
1633         <para>
1634          The occurrences-specification can be either the string
1635          <literal>all</literal>, the string <literal>last</literal>, or
1636          an explicit value-range. The value-range is represented as
1637          an integer (the starting point), possibly followed by a
1638          plus (+) and a second integer (the number of elements, default
1639          being one).
1640         </para>
1641
1642         <para>
1643          The variant-request has the same syntax as the defaultVariantRequest
1644          above. Note that it may sometimes be useful to give an empty variant
1645          request, simply to disable the default for a specific set of fields
1646          (we aren't certain if this is proper <emphasis>Espec-1</emphasis>,
1647          but it works in this implementation).
1648         </para>
1649        </listitem></varlistentry>
1650      </variablelist>
1651     </para>
1652
1653     <para>
1654      The following is an example of an element specification belonging to
1655      the GILS profile.
1656     </para>
1657
1658     <para>
1659
1660      <screen>
1661       simpleelement (1,10)
1662       simpleelement (1,12)
1663       simpleelement (2,1)
1664       simpleelement (1,14)
1665       simpleelement (4,1)
1666       simpleelement (4,52)
1667      </screen>
1668
1669     </para>
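
    <para>
     As a hypothetical sketch (not taken from any shipped profile), the
     occurrences-specification and WildThing syntax described above could
     be combined with simple tags as follows. The tag values are reused
     from the GILS example purely for illustration, and the chosen
     occurrence ranges are assumptions.
    </para>

    <screen>
     simpleelement (1,10):all
     simpleelement (4,52):last
     simpleelement (2,1):1+3
     simpleelement ?:2
    </screen>

    <para>
     Under the rules given earlier, <literal>:all</literal> requests every
     occurrence, <literal>:last</literal> only the final one,
     <literal>:1+3</literal> three occurrences starting at position 1, and
     <literal>:2</literal> a single occurrence at position 2, since the
     number of elements defaults to one.
    </para>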
1670
1671    </section>
1672
1673    <section id="schema-mapping">
1674     <title>The Schema Mapping (.map) Files</title>
1675
1676     <para>
     Sometimes a client wants to receive a database record in
     a schema that differs from the native schema of the record. For
     instance, a client might only know how to process WAIS records, while
     the database record is represented in a more specific schema, such as
     GILS. In this module, a mapping of data to one of the MARC formats is
     also treated as a schema mapping: the elements of the
     record are mapped into fields consistent with the given MARC
     specification before the data is actually converted to ISO2709. This
     use of the object identifier for USMARC as a schema identifier
     overloads the OID in a way that might not be entirely proper. However,
     it reflects the dual role of schema and record syntax that
     the MARC family assumes in Z39.50.
1689     </para>
1690
1691     <!--
1692      <emphasis>NOTE: FIXME! The schema-mapping functions are so far limited to a
1693       straightforward mapping of elements. This should be extended with
1694       mechanisms for conversions of the element contents, and conditional
1695       mappings of elements based on the record contents.</emphasis>
1696     -->
1697
1698     <para>
1699      These are the directives of the schema mapping file format:
1700     </para>
1701
1702     <para>
1703      <variablelist>
1704
1705       <varlistentry>
1706        <term>targetName <emphasis>name</emphasis></term>
1707        <listitem>
1708         <para>
1709          (m) A symbolic name for the target schema
1710          of the table. Useful mostly for diagnostic purposes.
1711         </para>
1712        </listitem></varlistentry>
1713       <varlistentry>
1714        <term>targetRef <emphasis>OID-name</emphasis></term>
1715        <listitem>
1716         <para>
1717          (m) An OID name for the target schema.
1718          This is used, for instance, by a server receiving a request to present
1719          a record in a different schema from the native one.
1720          The name, again, is found in the <emphasis>oid</emphasis>
1721          module of <emphasis>YAZ</emphasis>.
1722         </para>
1723        </listitem></varlistentry>
1724       <varlistentry>
1725        <term>map <emphasis>element-name target-path</emphasis></term>
1726        <listitem>
1727         <para>
1728          (o,r) Adds
1729          an element mapping rule to the table.
1730         </para>
1731        </listitem></varlistentry>
1732      </variablelist>
1733     </para>
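
    <para>
     A minimal, hypothetical sketch of a mapping file using the three
     directives above might look like this. The schema name, OID name,
     element name and target path are illustrative assumptions, not copied
     from a file shipped with Zebra. In particular, the syntax of the
     target path is assumed to reuse the parenthesized type-value tag
     notation described for element specifications, with a slash
     separating the levels.
    </para>

    <screen>
     targetName usmarc
     targetRef  USmarc

     map title  (3,245)/(3,a)
    </screen>

    <para>
     Read this way, the single <literal>map</literal> rule would place the
     content of the <literal>title</literal> element into subfield
     <literal>a</literal> of field <literal>245</literal> of the target
     record; check the mapping files distributed with Zebra for the exact
     form before relying on it.
    </para>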
1734
1735    </section>
1736
1737    <section id="grs-mar-files">
1738     <title>The MARC (ISO2709) Representation (.mar) Files</title>
1739
1740     <para>
1741      This file provides rules for representing a record in the ISO2709
1742      format. The rules pertain mostly to the values of the constant-length
1743      header of the record.
1744     </para>
1745
1746     <!--
     NOTE: FIXME! This will be described better. We're in the process of
      re-evaluating and most likely changing the way that MARC records are
      handled by the system.
1750     -->
1751
1752    </section>
1753   </section>
1754
1755   <section id="grs-exchange-formats">
1756    <title>GRS Exchange Formats</title>
1757
1758    <para>
1759     Converting records from the internal structure to an exchange format
1760     is largely an automatic process. Currently, the following exchange
1761     formats are supported:
1762    </para>
1763
1764    <para>
1765     <itemizedlist>
1766      <listitem>
1767       <para>
1768        GRS-1. The internal representation is based on GRS-1/XML, so the
1769        conversion here is straightforward. The system will create
1770        applied variant and supported variant lists as required, if a record
1771        contains variant information.
1772       </para>
1773      </listitem>
1774
1775      <listitem>
1776       <para>
       XML. The internal representation is based on GRS-1/XML, so
       the mapping is trivial. Note that XML schemas, processing
       instructions and comments are not part of the internal representation
       and therefore will never be part of a generated XML record.
       Future versions of Zebra will support them.
1782       </para>
1783      </listitem>
1784
1785      <listitem>
1786       <para>
       SUTRS (Simple Unstructured Text Record Syntax). Again, the mapping
       is fairly straightforward. Indentation
       is used to show the hierarchical structure of the record. All
       "GRS" type records support both the GRS-1 and SUTRS
       representations.
1791        <!-- FIXME - What is SUTRS - should be expanded here -->
1792       </para>
1793      </listitem>
1794
1795      <listitem>
1796       <para>
       ISO2709-based formats (USMARC, etc.). Only records with a
       two-level structure (corresponding to fields and subfields) can be
       directly mapped to ISO2709. For records with a different structure
       (e.g., GILS), the representation in a structure like USMARC involves a
       schema-mapping (see <xref linkend="schema-mapping"/>) to an
       "implied" USMARC schema (implied,
       because there is no formal schema which specifies the use of the
       USMARC fields outside of ISO2709). The resulting two-level record is
       then mapped directly from the internal representation to ISO2709. See
       the GILS schema definition files for a detailed example of this
       approach.
1808       </para>
1809      </listitem>
1810
1811      <listitem>
1812       <para>
1813        Explain. This representation is only available for records
1814        belonging to the Explain schema.
1815       </para>
1816      </listitem>
1817
1818      <listitem>
1819       <para>
       Summary. This ASN.1-based structure is only available for records
       belonging to the Summary schema, or to schemas which provide a mapping
       to this schema (see the description of the schema mapping facility
       above).
1824       </para>
1825      </listitem>
1826
1827      <!-- FIXME - Is this used anywhere ? -H -->
1828      <listitem>
1829       <para>
1830        SOIF. Support for this syntax is experimental, and is currently
1831        keyed to a private Index Data OID (1.2.840.10003.5.1000.81.2). All
1832        abstract syntaxes can be mapped to the SOIF format, although nested
1833        elements are represented by concatenation of the tag names at each
1834        level.
1835       </para>
1836      </listitem>
1837
1838     </itemizedlist>
1839    </para>
1840   </section>
1841
1842   <section id="grs-extended-marc-indexing">
1843    <title>Extended indexing of MARC records</title>
1844
   <para>Extended indexing of MARC records helps when you need to index a
    combination of subfields, index only part of a field,
    or use the embedded fields of a MARC record during indexing.
   </para>
1849
   <para>Extended indexing of MARC records additionally allows you to:
    <itemizedlist>

     <listitem>
      <para>index data in the LEADER of a MARC record</para>
     </listitem>

     <listitem>
      <para>index data in control fields (with fixed length)</para>
     </listitem>

     <listitem>
      <para>use the values of indicators during indexing</para>
     </listitem>

     <listitem>
      <para>index linked fields for UNIMARC-based formats</para>
1867      </listitem>
1868
1869     </itemizedlist>
1870    </para>
1871
   <note><para>Compared with simple indexing, extended indexing
     may increase the indexing time for MARC records by a factor of
     about 2-3.</para></note>
1875
1876    <section id="formula">
1877     <title>The index-formula</title>
1878
    <para>First, we define the term
     <emphasis>index-formula</emphasis> for MARC records. This term helps
     to explain the notation Zebra uses for extended indexing of MARC records.
     Our definition is based on the document
     <ulink url="http://www.rba.ru/rusmarc/soft/Z39-50.htm">"The table
      of conformity for Z39.50 use attributes and RUSMARC fields"</ulink>.
     The document is available only in Russian.</para>
1886
1887     <para>
     The <emphasis>index-formula</emphasis> is a combination of
     subfields presented in the following way:
1890     </para>
1891
1892     <screen>
1893      71-00$a, $g, $h ($c){.$b ($c)} , (1)
1894     </screen>
1895
1896     <para>
     Zebra supports the Bib-1 right truncation attribute.
     In this case, <emphasis>index-formula</emphasis> (1) consists of
     the following forms, each defined in the same way as (1):</para>
1900
1901     <screen>
1902      71-00$a, $g, $h
1903      71-00$a, $g
1904      71-00$a
1905     </screen>
1906
1907     <note>
     <para>The original MARC record may lack some of the elements included in the <emphasis>index-formula</emphasis>.
1909      </para>
1910     </note>
1911
    <para>This notation uses the following operands:
1913      <variablelist>
1914
1915       <varlistentry>
1916        <term>#</term>
       <listitem><para>Denotes a whitespace character.</para></listitem>
1918       </varlistentry>
1919
1920       <varlistentry>
1921        <term>-</term>
       <listitem><para>The position may contain any value defined by
         the MARC format.
         For example, the <emphasis>index-formula</emphasis></para>
1925
1926         <screen>
1927          70-#1$a, $g , (2)
1928         </screen>
1929
1930         <para>includes</para>
1931
1932         <screen>
1933          700#1$a, $g
1934          701#1$a, $g
1935          702#1$a, $g
1936         </screen>
1937
1938        </listitem>
1939       </varlistentry>
1940
1941       <varlistentry>
1942        <term>{...}</term>
1943        <listitem>
        <para>Repeatable elements are enclosed in curly braces {}.
         For example, the
         <emphasis>index-formula</emphasis></para>
1947
1948         <screen>
1949          71-00$a, $g, $h ($c){.$b ($c)} , (3)
1950         </screen>
1951
1952         <para>includes</para>
1953
1954         <screen>
1955          71-00$a, $g, $h ($c). $b ($c)
1956          71-00$a, $g, $h ($c). $b ($c). $b ($c)
1957          71-00$a, $g, $h ($c). $b ($c). $b ($c). $b ($c)
1958         </screen>
1959
1960        </listitem>
1961       </varlistentry>
1962      </variablelist>
1963
1964      <note>
1965       <para>
       All other operands are the same as those accepted in the MARC world.
1967       </para>
1968      </note>
1969     </para>
1970    </section>
1971
1972    <section id="notation">
1973     <title>Notation of <emphasis>index-formula</emphasis> for Zebra</title>
1974
1975
    <para>Extended indexing overloads the <literal>path</literal> of an
     <literal>elm</literal> definition in the Zebra abstract syntax file
     (<literal>.abs</literal> file): names beginning with
     <literal>"mc-"</literal> are interpreted by Zebra as an
     <emphasis>index-formula</emphasis>. The database index is created and
     linked with an <emphasis>access point</emphasis> (Bib-1 use attribute)
     according to this formula.</para>
1983
    <para>For example, the <emphasis>index-formula</emphasis></para>
1985
1986     <screen>
1987      71-00$a, $g, $h ($c){.$b ($c)} , (4)
1988     </screen>
1989
    <para>looks like this in an <literal>.abs</literal> file:</para>
1991
1992     <screen>
1993      mc-71.00_$a,_$g,_$h_(_$c_){.$b_(_$c_)}
1994     </screen>
1995
1996
    <para>The notation of the <emphasis>index-formula</emphasis> uses the following operands:
1998      <variablelist>
1999
2000       <varlistentry>
2001        <term>_</term>
       <listitem><para>Denotes a whitespace character.</para></listitem>
2003       </varlistentry>
2004
2005       <varlistentry>
2006        <term>.</term>
       <listitem><para>The position may contain any value defined by
         the MARC format. For example, the
         <emphasis>index-formula</emphasis></para>
2010
2011         <screen>
2012          70-#1$a, $g , (5)
2013         </screen>
2014
2015         <para>matches <literal>mc-70._1_$a,_$g_</literal> and includes</para>
2016
2017         <screen>
2018          700_1_$a,_$g_
2019          701_1_$a,_$g_
2020          702_1_$a,_$g_
2021         </screen>
2022        </listitem>
2023       </varlistentry>
2024
2025       <varlistentry>
2026        <term>{...}</term>
       <listitem><para>Repeatable elements are enclosed in
         curly braces {}. For example, the
         <emphasis>index-formula</emphasis></para>
2030
2031         <screen>
         71-00$a, $g, $h ($c){.$b ($c)} , (6)
2033         </screen>
2034
2035         <para>matches
2036          <literal>mc-71.00_$a,_$g,_$h_(_$c_){.$b_(_$c_)}</literal> and
2037          includes</para>
2038
2039         <screen>
2040          71.00_$a,_$g,_$h_(_$c_).$b_(_$c_)
2041          71.00_$a,_$g,_$h_(_$c_).$b_(_$c_).$b_(_$c_)
2042          71.00_$a,_$g,_$h_(_$c_).$b_(_$c_).$b_(_$c_).$b_(_$c_)
2043         </screen>
2044        </listitem>
2045       </varlistentry>
2046
2047       <varlistentry>
2048        <term>&#60;...&#62;</term>
       <listitem><para>An embedded <emphasis>index-formula</emphasis> (for
         linked fields) is enclosed in &#60;&#62;. For example, the
         <emphasis>index-formula</emphasis>
2052         </para>
2053
2054         <screen>
2055          4--#-$170-#1$a, $g ($c) , (7)
2056         </screen>
2057
2058         <para>matches
2059          <literal>mc-4.._._$1&#60;70._1_$a,_$g_(_$c_)&#62;_</literal> and
2060          includes</para>
2061
2062         <screen>
2063          463_._$1&#60;70._1_$a,_$g_(_$c_)&#62;_
2064         </screen>
2065
2066        </listitem>
2067       </varlistentry>
2068      </variablelist>
2069     </para>
2070
2071     <note>
     <para>All other operands are the same as those accepted in the MARC world.</para>
2073     </note>
2074
2075     <section id="grs-examples">
2076      <title>Examples</title>
2077
2078      <para>
2079       <orderedlist>
2080
2081        <listitem>
2082
        <para>Indexing the LEADER</para>

        <para>Use the keyword "ldr" to index the leader. For example,
         to index data from the 6th and 7th positions of the LEADER:</para>
2087
2088         <screen>
2089          elm mc-ldr[6] Record-type !
2090          elm mc-ldr[7] Bib-level   !
2091         </screen>
2092
2093        </listitem>
2094
2095        <listitem>
2096
        <para>Indexing data from control fields</para>

        <para>Indexing the date (the time the record was added to the database):</para>
2100
2101         <screen>
2102          elm mc-008[0-5] Date/time-added-to-db !
2103         </screen>
2104
        <para>or, for RUSMARC (where this data is held in field 100):</para>
2106
2107         <screen>
2108          elm mc-100___$a[0-7]_ Date/time-added-to-db !
2109         </screen>
2110
2111        </listitem>
2112
2113        <listitem>
2114
        <para>Using indicators while indexing</para>

        <para>For RUSMARC, the <emphasis>index-formula</emphasis>
         <literal>70-#1$a, $g</literal> corresponds to</para>
2119
2120         <screen>
         elm mc-70._1_$a,_$g_ Author !:w,!:p
2122         </screen>
2123
        <para>When Zebra finds a field matching the
         <literal>"70."</literal> pattern, it checks the indicators. In this
         case the first indicator must be whitespace and the second
         must be 1; otherwise the field is not
         indexed.</para>
2129        </listitem>
2130
2131        <listitem>
2132
        <para>Indexing embedded (linked) fields for UNIMARC-based
         formats</para>

        <para>For RUSMARC, the <emphasis>index-formula</emphasis>
         <literal>4--#-$170-#1$a, $g ($c)</literal> corresponds to</para>
2138
2139         <screen><![CDATA[
2140          elm mc-4.._._$1<70._1_$a,_$g_(_$c_)>_ Author !:w,!:p
2141          ]]></screen>
2142
        <para>Data is extracted from the record if the field matches the
         <literal>"4.._."</literal> pattern and the data in the linked field
         matches the embedded
         <emphasis>index-formula</emphasis>
         <literal>70._1_$a,_$g_(_$c_)</literal>.</para>
2148
2149        </listitem>
2150
2151       </orderedlist>
2152      </para>
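
     <para>
      As a further, hypothetical illustration, the repeatable-element
      formula (4) from the notation section could be attached to an access
      point in the same way as the entries above. The choice of the
      <literal>Author</literal> attribute and of the
      <literal>!:w,!:p</literal> index specification is an assumption
      carried over from the earlier examples, not a tested configuration.
     </para>

     <screen>
      elm mc-71.00_$a,_$g,_$h_(_$c_){.$b_(_$c_)} Author !:w,!:p
     </screen>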
2153
2154
2155     </section>
2156    </section>
2157
2158   </section>
2159
2160  </chapter>
2161  <!-- Keep this comment at the end of the file
2162  Local variables:
2163  mode: sgml
2164  sgml-omittag:t
2165  sgml-shorttag:t
2166  sgml-minimize-attributes:nil
2167  sgml-always-quote-attributes:t
2168  sgml-indent-step:1
2169  sgml-indent-data:t
2170  sgml-parent-document: "zebra.xml"
2171  sgml-local-catalogs: nil
2172  sgml-namecase-general:t
2173  End:
2174  -->