<!doctype linuxdoc system>
<!--
- $Id: zebra.sgml,v 1.23 1996-03-26 15:59:46 adam Exp $
+ $Id: zebra.sgml,v 1.33 1996-12-11 12:07:45 adam Exp $
-->
<article>
<title>Zebra Server - Administrators's Guide and Reference
<author><htmlurl url="http://www.indexdata.dk/" name="Index Data">, <tt><htmlurl url="mailto:info@index.ping.dk" name="info@index.ping.dk"></>
-<date>$Revision: 1.23 $
+<date>$Revision: 1.33 $
<abstract>
The Zebra information server combines a versatile fielded/free-text
search engine with a Z39.50-1995 frontend to provide a powerful and flexible
<sect1>Features
<p>
-This is a listof some of the most important features of the
+This is a list of some of the most important features of the
system.
<itemize>
well as variant forms of data.
<item>
+Supports arbitrary storage formats. A system of input filters driven by
+regular expressions allows you to easily process most ASCII-based
+data formats.
+
+<item>
Supports boolean queries as well as relevance-ranking (free-text)
searching. Right truncation and masking in terms are supported, as
well as full regular expressions.
schema on the fly.
<item>
+Supports approximate matching in registers (i.e. spelling mistakes,
+etc.).
+
+<item>
Protocol support:
<itemize>
<itemize>
<item>
-*Allow the system to handle other input formats. Specifically
-MARC records and general, structured ASCII records (such as mail/news
-files) parameterized by regular expressions.
-
-<item>
*Complete the support for variants. Finalize support for the WAIS
retrieval methodology.
*Port the system to Windows NT.
<item>
-Add index and data compression to save disk space.
-
-<item>
Add more sophisticated relevance ranking mechanisms. Add support for soundex
-and stemming. Add relevance feedback support.
+and stemming. Add relevance <it/feedback/ support.
<item>
Add Explain support.
Support the Item Update extended service of the protocol.
<item>
-The Zebra search engine supports approximate string matching in the
-index. We'd like to find a way to support and control this from RPN.
-
-<item>
We want to add a management system that allows you to
control your databases and configuration tables from a graphical
interface. We'll probably use Tcl/Tk to stay platform-independent.
<sect>Compiling the software
<p>
-Zebra uses the YAZ package to implement Z39.50, so you
-have to compile YAZ before going further. Specifically, Zebra uses
-the YAZ header files in <tt>yaz/include/..</tt> and its public library
-<tt>yaz/lib/libyaz.a</tt>.
-
-As with YAZ, an ANSI C compiler is required in order to compile the Zebra
+An ANSI C compiler is required to compile the Zebra
server system — <tt/gcc/ works fine if your own system doesn't
provide an adequate compiler.
-Unpack the Zebra software. You might put Zebra in the same directory level
-as YAZ, for example if YAZ is placed in ..<tt>/src/yaz-xxx</tt>, then
-Zebra is placed in ..<tt>/src/zebra-yyy</tt>.
-
-Edit the top-level <tt>Makefile</tt> in the Zebra directory in which
-you specify the location of YAZ by setting make variables.
-The <tt>OSILIB</tt> should be empty if YAZ wasn't compiled with
-MOSI support. Some systems, such as Solaris, have separate socket
-libraries and for those systems you need to specify the
-<tt>NETLIB</tt> variable.
+Unpack the distribution archive. In some cases, you may want to edit
+the top-level <tt/Makefile/, e.g. to select a different C compiler, or
+to specify machine-specific libraries in the <bf/NETLIB/ variable.
When you are done editing the <tt>Makefile</tt> type:
<tscreen><verb>
<tag><tt>zebraidx</tt></tag> The administrative tool for the search index.
</descrip>
-<sect>Quick Start
-
+<sect>Quick Start
<p>
In this section, we will test the system by indexing a small set of sample
GILS records that are included with the software distribution. Go to the
# Files that describe the attribute sets supported.
attset: bib1.att
attset: gils.att
+
+# Name of character map file.
+charMap: scan.chr
</verb></tscreen>
Now, edit the file and set <tt>profilePath</tt> to the path of the
The 48 test records are located in the sub directory <tt>records</tt>.
To index these, type:
<tscreen><verb>
-$ ../index/zebraidx -t grs update records
+$ ../index/zebraidx -t grs.sgml update records
</verb></tscreen>
In the command above the option <tt>-t</tt> specified the record
-type — in this case <tt>grs</tt>. The word <tt>update</tt> followed
+type — in this case <tt>grs.sgml</tt>. The word <tt>update</tt> followed
by a directory root updates all files below that directory node.
If your indexing command was successful, you are now ready to
with no prefix are used.
In the configuration file, the group name is placed before the option
-name
-itself, separated by a dot (.). For instance, to set the record type
-for group <tt/public/ to <tt/grs/ (the common format for structured
+name itself, separated by a dot (.). For instance, to set the record type
+for group <tt/public/ to <tt/grs.sgml/ (the SGML-like format for structured
records) you would write:
<tscreen><verb>
-public.recordType: grs
+public.recordType: grs.sgml
</verb></tscreen>
To set the default value of the record type to <tt/text/ write:
Specifies how records with the file extension <it>name</it> should
be handled by the indexer. This option may also be specified
as a command line option (<tt>-t</tt>). Note that if you do not
- specify a <it/name/, the setting applies to all files.
-<tag><it>group</it>.recordId</tag>
+ specify a <it/name/, the setting applies to all files. In general,
+ a record type specifier consists of the following elements,
+ separated by dots: <it>fundamental-type</it>,
+ <it>file-read-type</it> and arguments. Currently, two
+ fundamental types exist, <tt>text</tt> and <tt>grs</tt>.
+ <tag><it>group</it>.recordId</tag>
Specifies how the records are to be identified when updated. See
section <ref id="locating-records" name="Locating Records">.
<tag><it>group</it>.database</tag>
Enables the <it/safe update/ facility of Zebra, and tells the system
where to place the required, temporary files. See section
<ref id="shadow-registers" name="Safe Updating - Using Shadow Registers">.
-<tag>lockPath</tag>
+<tag>lockDir</tag>
Directory in which various lock files are stored.
-<tag>tempSetPath</tag>
+<tag>keyTmpDir</tag>
+ Directory in which temporary files used during the update phase
+ of <tt/zebraidx/ are stored.
+<tag>setTmpDir</tag>
Specifies the directory that the server uses for temporary result sets.
If not specified <tt>/tmp</tt> will be used.
<tag>profilePath</tag>
searching. At least the Bib-1 set should be loaded (<tt/bib1.att/).
The <tt/profilePath/ setting is used to look for the specified files.
See section <ref id="attset-files" name="The Attribute Set Files">
+<tag>charMap</tag>
+ Specifies the filename of a character mapping. Zebra looks for this
+ file in the directories given by <tt>profilePath</tt>.
+<tag>memMax</tag>
+ Specifies the amount of internal memory to use for the zebraidx
+ program. The amount is given in megabytes; the default is 4 (4 MB).
</descrip>
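+
+As an illustration, the settings above might be combined in a
+<tt/zebra.cfg/ fragment like the following. This is only a sketch: the
+directory names and the memory size are made-up values, not defaults.
+Only the setting names and the <tt/scan.chr/ file name are taken from
+this guide.
+
+<tscreen><verb>
+# Hypothetical directories - adjust to your installation.
+lockDir: /var/zebra/lock
+keyTmpDir: /var/zebra/tmp
+setTmpDir: /var/zebra/sets
+
+# Character map file, located via profilePath.
+charMap: scan.chr
+
+# Let zebraidx use about 16 MB instead of the default 4 MB.
+memMax: 16
+</verb></tscreen>
+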
-
<sect1>Locating Records<label id="locating-records">
<p>
The default behaviour of the Zebra system is to reference the
</descrip>
-<sect>Running the Z39.50 Server (zebrasrv)
+<sect>The Z39.50 Server
+
+<sect1>Running the Z39.50 Server (zebrasrv)
<p>
<bf/Syntax/
<tag>-w <it/working-directory/</tag>Change working directory.
-<tag/-i/Run under the Internet superserver, <tt/inetd/.
+<tag>-i <it/minutes/</tag>Run under the Internet superserver, <tt/inetd/.
+
+<tag>-t <it/timeout/</tag>Set the idle session timeout (default 60 minutes).
+
+<tag>-k <it/kilobytes/</tag>Set the (approximate) maximum size of
+present response messages. Default is 1024 KB (1 MB).
</descrip>
A <it/listener-address/ consists of a transport mode followed by a
The default behavior for <tt/zebrasrv/ is to establish a single TCP/IP
listener, for the Z39.50 protocol, on port 9999.
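+
+For example, the following invocation starts the server on the default
+listener with a 30-minute idle timeout and a limit of roughly 2 MB on
+present responses. The values are arbitrary, and the command assumes
+that <tt/zebrasrv/ can be found in your shell's search path.
+
+<tscreen><verb>
+$ zebrasrv -t 30 -k 2048
+</verb></tscreen>
+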
+<sect1>Z39.50 Protocol Support and Behavior
+
+<sect2>Initialization
+
+<p>
+During initialization, the server will negotiate to version 3 of the
+Z39.50 protocol, and the option bits for Search, Present, Scan,
+NamedResultSets, and concurrentOperations will be set, if requested by
+the client. The maximum PDU size is negotiated down to a maximum of
+1 MB by default.
+
+<sect2>Search
+
+<p>
+The supported query types are 1 and 101. All operators are currently
+supported except that only proximity units of type "word" are supported
+for the proximity operator. Queries can be arbitrarily complex. Named
+result sets are supported, and result sets can be used as operands with
+no limitations. Searches may span multiple databases.
+
+The server has full support for piggy-backed present requests (see
+also the following section).
+
+<bf/Use/ attributes are interpreted according to the attribute sets which
+have been loaded in the <tt/zebra.cfg/ file, and are matched against
+specific fields as specified in the <tt/.abs/ file which describes the
+profile of the records which have been loaded. If no <bf/Use/
+attribute is provided, a default of <bf/Any/ is assumed.
+
+If a <bf/Structure/ attribute of <bf/Phrase/ is used in conjunction with a
+<bf/Completeness/ attribute of <bf/Complete (Sub)field/, the term is
+matched against the contents of a phrase (long word) register, if one
+exists for the given <bf/Use/ attribute. If <bf/Structure/=<bf/Phrase/
+is used in conjunction with <bf/Incomplete Field/ (the default value
+for <bf/Completeness/), the search is directed against the normal word
+registers, but if the term contains multiple words, the term will only
+match if all of the words are found immediately adjacent, and in the
+given order. If the <bf/Structure/ attribute is <bf/Word List/,
+<bf/Free-form Text/, or <bf/Document Text/, the term is treated as a
+natural-language, relevance-ranked query.
+
+If the <bf/Relation/ attribute is <bf/Equals/ (default), the term is
+matched in a normal fashion (modulo truncation and processing of
+individual words, if required). If <bf/Relation/ is <bf/Less Than/,
+<bf/Less Than or Equal/, <bf/Greater than/, or <bf/Greater than or
+Equal/, the term is assumed to be numerical, and a standard regular
+expression is constructed to match the given expression. If
+<bf/Relation/ is <bf/Relevance/, the standard natural-language query
+processor is invoked.
+
+For the <bf/Truncation/ attribute, <bf/No Truncation/ is the default.
+<bf/Left Truncation/ is not supported. <bf/Process #/ is supported, as
+is <bf/Regxp-1/. <bf/Regxp-2/ enables the fault-tolerant (fuzzy)
+search. As a default, a single error (deletion, insertion,
+replacement) is accepted when terms are matched against the register
+contents.
+
+Zebra interprets queries in one of the following ways:
+<descrip>
+<tag>1 Phrase search</tag>
+ Each token separated by white space is truncated according to the
+ value of the truncation attribute. If the completeness attribute
+ is <bf/complete subfield/ the search is directed to the phrase
+ register. For other completeness attribute values the term is split
+ into tokens according to the white-space specification in the
+ character map. Only records in which each token exists in the order
+ specified are matched.
+<tag>2 Word search</tag>
+ The token is truncated according to the value of the truncation
+ attribute. The completeness attribute is ignored.
+<tag>3 Ranked search</tag>
+ Each token separated by white space is truncated according to the value
+ of the truncation attribute. The completeness attribute is ignored.
+<tag>4 Numeric relation</tag>
+ The token should consist of decimal digits. The integer is matched
+ against integers in the register according to the relation attribute.
+ The truncation and completeness attributes are ignored.
+<tag>5 Document identifier</tag>
+ The token consists of exactly one document identifier. The
+ truncation and completeness attributes are ignored.
+</descrip>
+
+For ranked searches, the result set is ranked and a score
+is associated with each record. Result sets from the other
+four query types are not ranked.
+
+Combinations of the structure attribute and the relation attribute
+determine how the query is interpreted. The two following tables
+define how.
+
+<verb>
+                             Structure Attribute (4)
+                      none  phrase(1)  word(2)  word list(6)
+
+           none        1        1         2          3
+           = (3)       1        1         2          3
+           < (1)       4        4         4          4
+Relation   <= (2)      4        4         4          4
+Attribute  >= (4)      4        4         4          4
+   (2)     > (5)       4        4         4          4
+           <> (6)      -        -         -          -
+           rel (102)   3        3         3          3
+           other       1        1         2          3
+
+</verb>
+
+<verb>
+                             Structure Attribute (4)
+                      free-form-  document-  local-   string
+                      text        text       number
+                      (105)       (106)      (107)    (108)
+           none         3           3          5        1
+           = (3)        3           3          5        1
+           < (1)        4           4          5        4
+Relation   <= (2)       4           4          5        4
+Attribute  >= (4)       4           4          5        4
+   (2)     > (5)        4           4          5        4
+           <> (6)       -           -          5        -
+           rel (102)    3           3          5        3
+           other        3           3          5        1
+
+</verb>
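+
+As a reading aid, the following queries show how a few attribute
+combinations map to the interpretations listed in the tables. The use
+attribute <tt/1=4/ (title) and the search terms are placeholders only.
+
+Structure <bf/word/ (2) with no relation attribute gives a word
+search (interpretation 2):
+<verb>
+ @attr 1=4 @attr 4=2 retrieval
+</verb>
+
+Structure <bf/word list/ (6) with no relation attribute gives a ranked
+search (interpretation 3):
+<verb>
+ @attr 1=4 @attr 4=6 "information retrieval"
+</verb>
+
+Structure <bf/phrase/ (1) combined with relation <bf/relevance/ (102)
+also gives a ranked search (interpretation 3), since the relation
+row decides in this case:
+<verb>
+ @attr 1=4 @attr 4=1 @attr 2=102 "information retrieval"
+</verb>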
+
+<sect3>Regular expressions
+<p>
+
+Each term in a query is interpreted as a regular expression if
+the truncation value is either <bf/Regxp-1/ (102) or <bf/Regxp-2/ (103).
+Both query types follow the same syntax with the operands:
+<descrip>
+<tag/x/ Matches the character <it/x/.
+<tag/./ Matches any character.
+<tag><tt/[/..<tt/]/</tag> Matches the set of characters specified;
+ such as <tt/[abc]/ or <tt/[a-c]/.
+</descrip>
+and the operators:
+<descrip>
+<tag/x*/ Matches <it/x/ zero or more times. Priority: high.
+<tag/x+/ Matches <it/x/ one or more times. Priority: high.
+<tag/x?/ Matches <it/x/ zero or one time. Priority: high.
+<tag/xy/ Matches <it/x/, then <it/y/. Priority: medium.
+<tag/x|y/ Matches either <it/x/ or <it/y/. Priority: low.
+</descrip>
+The order of evaluation may be changed by using parentheses.
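+
+For instance, assuming a title register (use attribute 4), the
+following term combines parentheses and alternation so that it matches
+either <bf/retrieval/ or <bf/retrieving/. The term itself is only an
+illustration:
+<verb>
+ @attr 1=4 @attr 5=102 "retriev(al|ing)"
+</verb>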
+
+If the first character of the <bf/Regxp-2/ query is a plus character
+(<tt/+/) it marks the beginning of a section with non-standard
+specifiers. The next plus character marks the end of the section.
+Currently Zebra only supports one specifier, the error tolerance,
+which consists of one digit.
+
+Since the plus operator is normally a suffix operator the addition to
+the query syntax doesn't violate the syntax for standard regular
+expressions.
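+
+As a sketch of this syntax, the following <bf/Regxp-2/ term sets the
+error tolerance to 2, so that a misspelled term such as
+<bf/informaton/ can still match <bf/information/ in the register. The
+use attribute and the term are illustrative only:
+<verb>
+ @attr 1=4 @attr 5=103 "+2+informaton"
+</verb>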
+
+<sect3>Query examples
+<p>
+Phrase search for <bf/information retrieval/ in the title-register:
+<verb>
+ @attr 1=4 "information retrieval"
+</verb>
+
+Ranked search for the same thing:
+<verb>
+ @attr 1=4 @attr 2=102 "Information retrieval"
+</verb>
+
+Phrase search with a regular expression:
+<verb>
+ @attr 1=4 @attr 5=102 "informat.* retrieval"
+</verb>
+
+Ranked search with a regular expression:
+<verb>
+ @attr 1=4 @attr 5=102 @attr 2=102 "informat.* retrieval"
+</verb>
+
+<sect2>Present
+<p>
+The present facility is supported in a standard fashion. The requested
+record syntax is matched against the ones supported by the profile of
+each record retrieved. If no record syntax is given, SUTRS is the
+default. The requested element set name, again, is matched against any
+provided by the relevant record profiles.
+
+<sect2>Scan
+
+<p>
+The attribute combinations provided with the TermListAndStartPoint are
+processed in the same way as operands in a query (see above).
+Currently, only the term and the globalOccurrences are returned with
+the TermInfo structure.
+
+<sect2>Close
+
+<p>
+If a Close PDU is received, the server will respond with a Close PDU
+with reason=FINISHED, no matter which protocol version was negotiated
+during initialization. If the protocol version is 3 or more, the
+server will generate a Close PDU under certain circumstances,
+including a session timeout (60 minutes by default), and certain kinds of
+protocol errors. Once a Close PDU has been sent, the protocol
+association is considered broken, and the transport connection will be
+closed immediately upon receipt of further data, or following a short
+timeout.
+
<sect>The Record Model
<p>
Although it may be wise to use only a single schema within
one database, the system poses no such restrictions.
+The record model described in this chapter applies to the fundamental
+record type <tt>grs</tt> as introduced in
+section <ref id="record-types" name="Record Types">.
+
Records pass through three different states during processing in the
system.
spectrum of structure and flexibility in the system. In Zebra, this
canonical format is an &dquot;SGML-like&dquot; syntax.
+To use the canonical format, specify <tt>grs.sgml</tt> as the record
+type.
+
Consider a record describing an information resource (such a record is
sometimes known as a <it/locator record/). It might contain a field
describing the distributor of the information resource, which might in
Input filters are ASCII files, generally with the suffix <tt/.flt/.
The system looks for the files in the directories given in the
-<bf/profilePath/ setting in the <tt/zebra.cfg/ file.
+<bf/profilePath/ setting in the <tt/zebra.cfg/ files. The record type
+for the filter is <tt>grs.regx.</tt><it>filter-filename</it>
+(fundamental type <tt>grs</tt>, file read type <tt>regx</tt>, argument
+<it>filter-filename</it>).
Generally, an input filter consists of a sequence of rules, where each
rule consists of a sequence of expressions, followed by an action. The
<sect>License
<p>
-Copyright © 1995, Index Data.
+Copyright © 1995,1996 Index Data.
All rights reserved.