+ <chapter id="architecture">
+ <!-- $Id: architecture.xml,v 1.1 2006-01-18 14:00:54 marc Exp $ -->
+ <title>Overview of Zebra Architecture</title>
+
+
+ <sect1 id="local-representation">
+ <title>Local Representation</title>
+
+ <para>
+ As mentioned earlier, Zebra places few restrictions on the type of
+ data that you can index and manage. Generally, whatever the form of
+ the data, it is parsed by an input filter specific to that format, and
+ turned into an internal structure that Zebra knows how to handle. This
+ process takes place whenever the record is accessed - for indexing and
+ retrieval.
+ </para>
+
+ <para>
+ The RecordType parameter in the <literal>zebra.cfg</literal> file, or
+ the <literal>-t</literal> option to the indexer tells Zebra how to
+ process input records.
+ Two basic types of processing are available - raw text and structured
+ data. Raw text is just that, and it is selected by providing the
+ argument <emphasis>text</emphasis> to Zebra. Structured records are
+ all handled internally using the basic mechanisms described in the
+ subsequent sections.
+ Zebra can read structured records in many different formats.
+ <!--
+ How this is done is governed by additional parameters after the
+ "grs" keyword, separated by "." characters.
+ -->
+ </para>
+ </sect1>
+
+ <sect1 id="workflow">
+ <title>Indexing and Retrieval Workflow</title>
+
+ <para>
+ Records pass through three different states during processing in the
+ system.
+ </para>
+
+ <para>
+
+ <itemizedlist>
+ <listitem>
+
+ <para>
+ When records are accessed by the system, they are represented
+ in their local, or native format. This might be SGML or HTML files,
+ News or Mail archives, MARC records. If the system doesn't already
+ know how to read the type of data you need to store, you can set up an
+ input filter by preparing conversion rules based on regular
+ expressions and possibly augmented by a flexible scripting language
+ (Tcl).
+ The input filter produces as output an internal representation,
+ a tree structure.
+
+ </para>
+ </listitem>
+ <listitem>
+
+ <para>
+ When records are processed by the system, they are represented
+ in a tree-structure, constructed by tagged data elements hanging off a
+ root node. The tagged elements may contain data or yet more tagged
+ elements in a recursive structure. The system performs various
+ actions on this tree structure (indexing, element selection, schema
+ mapping, etc.),
+
+ </para>
+ </listitem>
+ <listitem>
+
+ <para>
+ Before transmitting records to the client, they are first
+ converted from the internal structure to a form suitable for exchange
+ over the network - according to the Z39.50 standard.
+ </para>
+ </listitem>
+
+ </itemizedlist>
+
+ </para>
+ </sect1>
+
+
+ <sect1 id="maincomponents">
+ <title>Main Components</title>
+ <para>
+ The Zebra system is designed to support a wide range of data management
+ applications. The system can be configured to handle virtually any
+ kind of structured data. Each record in the system is associated with
+ a <emphasis>record schema</emphasis> which lends context to the data
+ elements of the record.
+ Any number of record schemas can coexist in the system.
+ Although it may be wise to use only a single schema within
+ one database, the system poses no such restrictions.
+ </para>
+ <para>
+ The Zebra indexer and information retrieval server consists of the
+ following main applications: the <literal>zebraidx</literal>
+ indexing maintenance utility, and the <literal>zebrasrv</literal>
+ information query and retireval server. Both are using some of the
+ same main components, which are presented here.
+ </para>
+ <para>
+ This virtual package installs all the necessary packages to start
+ working with IDZebra - including utility programs, development libraries,
+ documentation and modules.
+ <literal>idzebra1.4</literal>
+ </para>
+
+ <sect2 id="componentcore">
+ <title>Core Zebra Module Containing Common Functionality</title>
+ <para>
+ - loads external filter modules used for presenting
+ the recods in a search response.
+ - executes search requests in PQF/RPN, which are handed over from
+ the YAZ server frontend API
+ - calls resorting/reranking algorithms on the hit sets
+ - returns - possibly ranked - result sets, hit
+ numbers, and the like internal data to the YAZ server backend API.
+ </para>
+ <para>
+ This package contains all run-time libraries for IDZebra.
+ <literal>libidzebra1.4</literal>
+ This package includes documentation for IDZebra in PDF and HTML.
+ <literal>idzebra1.4-doc</literal>
+ This package includes common essential IDZebra configuration files
+ <literal>idzebra1.4-common</literal>
+ </para>
+ </sect2>
+
+
+ <sect2 id="componentindexer">
+ <title>Zebra Indexer</title>
+ <para>
+ the core Zebra indexer which
+ - loads external filter modules used for indexing data records of
+ different type.
+ - creates, updates and drops databases and indexes
+ </para>
+ <para>
+ This package contains IDZebra utilities such as the zebraidx indexer
+ utility and the zebrasrv server.
+ <literal>idzebra1.4-utils</literal>
+ </para>
+ </sect2>
+
+ <sect2 id="componentsearcher">
+ <title>Zebra Searcher/Retriever</title>
+ <para>
+ the core Zebra searcher/retriever which
+ </para>
+ <para>
+ This package contains IDZebra utilities such as the zebraidx indexer
+ utility and the zebrasrv server, and their associated man pages.
+ <literal>idzebra1.4-utils</literal>
+ </para>
+ </sect2>
+
+ <sect2 id="componentyazserver">
+ <title>YAZ Server Frontend</title>
+ <para>
+ The YAZ server frontend is
+ a full fledged stateful Z39.50 server taking client
+ connections, and forwarding search and scan requests to the
+ Zebra core indexer.
+ </para>
+ <para>
+ In addition to Z39.50 requests, the YAZ server frontend acts
+ as HTTP server, honouring
+ SRW SOAP requests, and SRU REST requests. Moreover, it can
+ translate inco ming CQL queries to PQF/RPN queries, if
+ correctly configured.
+ </para>
+ <para>
+ YAZ is a toolkit that allows you to develop software using the
+ ANSI Z39.50/ISO23950 standard for information retrieval.
+ SRW/SRU
+ <literal>libyazthread.so</literal>
+ <literal>libyaz.so</literal>
+ <literal>libyaz</literal>
+ </para>
+ </sect2>
+
+ <sect2 id="componentmodules">
+ <title>Record Models and Filter Modules</title>
+ <para>
+ all filter modules which do indexing and record display filtering:
+This virtual package contains all base IDZebra filter modules. EMPTY ???
+ <literal>libidzebra1.4-modules</literal>
+ </para>
+
+ <sect3 id="componentmodulestext">
+ <title>TEXT Record Model and Filter Module</title>
+ <para>
+ Plain ASCII text filter
+ <!--
+ <literal>text module missing as deb file<literal>
+ -->
+ </para>
+ </sect3>
+
+ <sect3 id="componentmodulesgrs">
+ <title>GRS Record Model and Filter Modules</title>
+ <para>
+ Chapter <xref linkend="record-model"/>
+
+ - grs.danbib GRS filters of various kind (*.abs files)
+IDZebra filter grs.danbib (DBC DanBib records)
+ This package includes grs.danbib filter which parses DanBib records.
+ DanBib is the Danish Union Catalogue hosted by DBC
+ (Danish Bibliographic Centre).
+ <literal>libidzebra1.4-mod-grs-danbib</literal>
+
+
+ - grs.marc
+ - grs.marcxml
+ This package includes the grs.marc and grs.marcxml filters that allows
+ IDZebra to read MARC records based on ISO2709.
+
+ <literal>libidzebra1.4-mod-grs-marc</literal>
+
+ - grs.regx
+ - grs.tcl GRS TCL scriptable filter
+ This package includes the grs.regx and grs.tcl filters.
+ <literal>libidzebra1.4-mod-grs-regx</literal>
+
+
+ - grs.sgml
+ <literal>libidzebra1.4-mod-grs-sgml not packaged yet ??</literal>
+
+ - grs.xml
+ This package includes the grs.xml filter which uses Expat to
+ parse records in XML and turn them into IDZebra's internal grs node.
+ <literal>libidzebra1.4-mod-grs-xml</literal>
+ </para>
+ </sect3>
+
+ <sect3 id="componentmodulesalvis">
+ <title>ALVIS Record Model and Filter Module</title>
+ <para>
+ - alvis Experimental Alvis XSLT filter
+ <literal>mod-alvis.so</literal>
+ <literal>libidzebra1.4-mod-alvis</literal>
+ </para>
+ </sect3>
+
+ <sect3 id="componentmodulessafari">
+ <title>SAFARI Record Model and Filter Module</title>
+ <para>
+ - safari
+ <!--
+ <literal>safari module missing as deb file<literal>
+ -->
+ </para>
+ </sect3>
+
+ </sect2>
+
+ <sect2 id="componentconfig">
+ <title>Configuration Files</title>
+ <para>
+ - yazserver XML based config file
+ - core Zebra ascii based config files
+ - filter module config files in many flavours
+ - CQL to PQF ascii based config file
+ </para>
+ </sect2>
+
+ </sect1>
+
+ <!--
+
+
+ <sect1 id="cqltopqf">
+ <title>Server Side CQL To PQF Conversion</title>
+ <para>
+ The cql2pqf.txt yaz-client config file, which is also used in the
+ yaz-server CQL-to-PQF process, is used to to drive
+ org.z3950.zing.cql.CQLNode's toPQF() back-end and the YAZ CQL-to-PQF
+ converter. This specifies the interpretation of various CQL
+ indexes, relations, etc. in terms of Type-1 query attributes.
+
+ This configuration file generates queries using BIB-1 attributes.
+ See http://www.loc.gov/z3950/agency/zing/cql/dc-indexes.html
+ for the Maintenance Agency's work-in-progress mapping of Dublin Core
+ indexes to Attribute Architecture (util, XD and BIB-2)
+ attributes.
+
+ a) CQL set prefixes are specified using the correct CQL/SRW/U
+ prefixes for the required index sets, or user-invented prefixes for
+ special index sets. An index set in CQL is roughly speaking equivalent to a
+ namespace specifier in XML.
+
+ b) The default index set to be used if none explicitely mentioned
+
+ c) Index mapping definitions of the form
+
+ index.cql.all = 1=text
+
+ which means that the index "all" from the set "cql" is mapped on the
+ bib-1 RPN query "@attr 1=text" (where "text" is some existing index
+ in zebra, see indexing stylesheet)
+
+ d) Relation mapping from CQL relations to bib-1 RPN "@attr 2= " stuff
+
+ e) Relation modifier mapping from CQL relations to bib-1 RPN "@attr
+ 2= " stuff
+
+ f) Position attributes
+
+ g) structure attributes
+
+ h) truncation attributes
+
+ See
+ http://www.indexdata.com/yaz/doc/tools.tkl#tools.cql.map for config
+ file details.
+
+
+ </para>
+ </sect1>
+
+
+ <sect1 id="ranking">
+ <title>Static and Dynamic Ranking</title>
+ <para>
+ Zebra uses internally inverted indexes to look up term occurencies
+ in documents. Multiple queries from different indexes can be
+ combined by the binary boolean operations AND, OR and/or NOT (which
+ is in fact a binary AND NOT operation). To ensure fast query execution
+ speed, all indexes have to be sorted in the same order.
+
+ The indexes are normally sorted according to document ID in
+ ascending order, and any query which does not invoke a special
+ re-ranking function will therefore retrieve the result set in document ID
+ order.
+
+ If one defines the
+
+ staticrank: 1
+
+ directive in the main core Zebra config file, the internal document
+ keys used for ordering are augmented by a preceeding integer, which
+ contains the static rank of a given document, and the index lists
+ are ordered
+ - first by ascending static rank
+ - then by ascending document ID.
+
+ This implies that the default rank "0" is the best rank at the
+ beginning of the list, and "max int" is the worst static rank.
+
+ The "alvis" and the experimental "xslt" filters are providing a
+ directive to fetch static rank information out of the indexed XML
+ records, thus making _all_ hit sets orderd after ascending static
+ rank, and for those doc's which have the same static rank, ordered
+ after ascending doc ID.
+ If one wants to do a little fiddeling with the static rank order,
+ one has to invoke additional re-ranking/re-ordering using dynamic
+ reranking or score functions. These functions return positive
+ interger scores, where _highest_ score is best, which means that the
+ hit sets will be sorted according to _decending_ scores (in contrary
+ to the index lists which are sorted according to _ascending_ rank
+ number and document ID)
+
+
+ Those are defined in the zebra C source files
+
+ "rank-1" : zebra/index/rank1.c
+ default TF/IDF like zebra dynamic ranking
+ "rank-static" : zebra/index/rankstatic.c
+ do-nothing dummy static ranking (this is just to prove
+ that the static rank can be used in dynamic ranking functions)
+ "zvrank" : zebra/index/zvrank.c
+ many different dynamic TF/IDF ranking functions
+
+ The are in the zebra config file enabled by a directive like:
+
+ rank: rank-static
+
+ Notice that the "rank-1" and "zvrank" do not use the static rank
+ information in the list keys, and will produce the same ordering
+ with our without static ranking enabled.
+
+ The dummy "rank-static" reranking/scoring function returns just
+ score = max int - staticrank
+ in order to preserve the ordering of hit sets with and without it's
+ call.
+
+ Obviously, one wants to make a new ranking function, which combines
+ static and dynamic ranking, which is left as an exercise for the
+ reader .. (Wray, this is your's ...)
+
+
+ </para>
+
+
+ <para>
+ yazserver frontend config file
+
+ db/yazserver.xml
+
+ Setup of listening ports, and virtual zebra servers.
+ Note path to server-side CQL-to-PQF config file, and to
+ SRW explain config section.
+
+ The <directory> path is relative to the directory where zebra.init is placed
+ and is started up. The other pathes are relative to <directory>,
+ which in this case is the same.
+
+ see: http://www.indexdata.com/yaz/doc/server.vhosts.tkl
+
+ </para>
+
+ <para>
+c) Main "alvis" XSLT filter config file:
+ cat db/filter_alvis_conf.xml
+
+ <?xml version="1.0" encoding="UTF8"?>
+ <schemaInfo>
+ <schema name="alvis" stylesheet="db/alvis2alvis.xsl" />
+ <schema name="index" identifier="http://indexdata.dk/zebra/xslt/1"
+ stylesheet="db/alvis2index.xsl" />
+ <schema name="dc" stylesheet="db/alvis2dc.xsl" />
+ <schema name="dc-short" stylesheet="db/alvis2dc_short.xsl" />
+ <schema name="snippet" snippet="25" stylesheet="db/alvis2snippet.xsl" />
+ <schema name="help" stylesheet="db/alvis2help.xsl" />
+ <split level="1"/>
+ </schemaInfo>
+
+ the pathes are relative to the directory where zebra.init is placed
+ and is started up.
+
+ The split level decides where the SAX parser shall split the
+ collections of records into individual records, which then are
+ loaded into DOM, and have the indexing XSLT stylesheet applied.
+
+ The indexing stylesheet is found by it's identifier.
+
+ All the other stylesheets are for presentation after search.
+
+- in data/ a short sample of harvested carnivorous plants
+ ZEBRA_INDEX_DIRS=data/carnivor_20050118_2200_short-346.xml
+
+- in root also one single data record - nice for testing the xslt
+ stylesheets,
+
+ xsltproc db/alvis2index.xsl carni*.xml
+
+ and so on.
+
+- in db/ a cql2pqf.txt yaz-client config file
+ which is also used in the yaz-server CQL-to-PQF process
+
+ see: http://www.indexdata.com/yaz/doc/tools.tkl#tools.cql.map
+
+- in db/ an indexing XSLT stylesheet. This is a PULL-type XSLT thing,
+ as it constructs the new XML structure by pulling data out of the
+ respective elements/attributes of the old structure.
+
+ Notice the special zebra namespace, and the special elements in this
+ namespace which indicate to the zebra indexer what to do.
+
+ <z:record id="67ht7" rank="675" type="update">
+ indicates that a new record with given id and static rank has to be updated.
+
+ <z:index name="title" type="w">
+ encloses all the text/XML which shall be indexed in the index named
+ "title" and of index type "w" (see file default.idx in your zebra
+ installation)
+
+
+ </para>
+
+ <para>
+ Z39.50 searching:
+
+ search like this (using client-side CQL-to-PQF conversion):
+
+ yaz-client -q db/cql2pqf.txt localhost:9999
+ > format xml
+ > querytype cql2rpn
+ > f text=(plant and soil)
+ > s 1
+ > elements dc
+ > s 1
+ > elements index
+ > s 1
+ > elements alvis
+ > s 1
+ > elements snippet
+ > s 1
+
+
+ search like this (using server-side CQL-to-PQF conversion):
+ (the only difference is "querytype cql" instead of
+ "querytype cql2rpn" and the call without specifying a local
+ conversion file)
+
+ yaz-client localhost:9999
+ > format xml
+ > querytype cql
+ > f text=(plant and soil)
+ > s 1
+ > elements dc
+ > s 1
+ > elements index
+ > s 1
+ > elements alvis
+ > s 1
+ > elements snippet
+ > s 1
+
+ NEW: static relevance ranking - see examples in alvis2index.xsl
+
+ > f text = /relevant (plant and soil)
+ > elem dc
+ > s 1
+
+ > f title = /relevant a
+ > elem dc
+ > s 1
+
+
+
+SRW/U searching
+ Surf into http://localhost:9999
+
+ firefox http://localhost:9999
+
+ gives you an explain record. Unfortunately, the data found in the
+ CQL-to-PQF text file must be added by hand-craft into the explain
+ section of the yazserver.xml file. Too bad, but this is all extreme
+ new alpha stuff, and a lot of work has yet to be done ..
+
+ Searching via SRU: surf into the URL (lines broken here - concat on
+ URL line)
+
+ - see number of hits:
+ http://localhost:9999/?version=1.1&operation=searchRetrieve
+ &query=text=(plant%20and%20soil)
+
+
+ - fetch record 5-7 in DC format
+ http://localhost:9999/?version=1.1&operation=searchRetrieve
+ &query=text=(plant%20and%20soil)
+ &startRecord=5&maximumRecords=2&recordSchema=dc
+
+
+ - even search using PQF queries using the extended verb "x-pquery",
+ which is special to YAZ/Zebra
+
+ http://localhost:9999/?version=1.1&operation=searchRetrieve
+ &x-pquery=@attr%201=text%20@and%20plant%20soil
+
+ More info: read the fine manuals at http://www.loc.gov/z3950/agency/zing/srw/
+278,280d299
+ Search via SRW:
+ read the fine manual at
+ http://www.loc.gov/z3950/agency/zing/srw/
+
+
+and so on. The list of available indexes is found in db/cql2pqf.txt
+
+
+7) How do you add to the index attributes of any other type than "w"?
+I mean, in the context of making CQL queries. Let's say I want a date
+attribute in there, so that one could do date > 20050101 in CQL.
+
+Currently for example 'date-modified' is of type 'w'.
+
+The 2-seconds-of-though solution:
+
+ in alvis2index.sl:
+
+ <z:index name="date-modified" type="d">
+ <xsl:value-of
+ select="acquisition/acquisitionData/modifiedDate"/>
+ </z:index>
+
+But here's the catch...doesn't the use of the 'd' type require
+structure type 'date' (@attr 4=5) in PQF? But then...how does that
+reflect in the CQL->RPN/PQF mapping - does it really work if I just
+change the type of an element in alvis2index.sl? I would think not...?
+
+
+
+
+ Kimmo
+
+
+Either do:
+
+ f @attr 4=5 @attr 1=date-modified 20050713
+
+or do
+
+
+Either do:
+
+ f @attr 4=5 @attr 1=date-modified 20050713
+
+or do
+
+querytype cql
+
+ f date-modified=20050713
+
+ f date-modified=20050713
+
+ Search ERROR 121 4 1+0 RPN: @attrset Bib-1 @attr 5=100 @attr 6=1 @attr 3=3 @att
+r 4=1 @attr 2=3 @attr "1=date-modified" 20050713
+
+
+
+ f date-modified eq 20050713
+
+Search OK 23 3 1+0 RPN: @attrset Bib-1 @attr 5=100 @attr 6=1 @attr 3=3 @attr 4=5
+ @attr 2=3 @attr "1=date-modified" 20050713
+
+
+ </para>
+
+ <para>
+E) EXTENDED SERVICE LIFE UPDATES
+
+The extended services are not enabled by default in zebra - due to the
+fact that they modify the system.
+
+In order to allow anybody to update, use
+perm.anonymous: rw
+in zebra.cfg.
+
+Or, even better, allow only updates for a particular admin user. For
+user 'admin', you could use:
+perm.admin: rw
+passwd: passwordfile
+
+And in passwordfile, specify users and passwords ..
+admin:secret
+
+We can now start a yaz-client admin session and create a database:
+
+$ yaz-client localhost:9999 -u admin/secret
+Authentication set to Open (admin/secret)
+Connecting...OK.
+Sent initrequest.
+Connection accepted by v3 target.
+ID : 81
+Name : Zebra Information Server/GFS/YAZ
+Version: Zebra 1.4.0/1.63/2.1.9
+Options: search present delSet triggerResourceCtrl scan sort
+extendedServices namedResultSets
+Elapsed: 0.007046
+Z> adm-create
+Admin request
+Got extended services response
+Status: done
+Elapsed: 0.045009
+:
+Now Default was created.. We can now insert an XML file (esdd0006.grs
+from example/gils/records) and index it:
+
+Z> update insert 1 esdd0006.grs
+Got extended services response
+Status: done
+Elapsed: 0.438016
+
+The 3rd parameter.. 1 here .. is the opaque record id from Ext update.
+It a record ID that _we_ assign to the record in question. If we do not
+assign one the usual rules for match apply (recordId: from zebra.cfg).
+
+Actually, we should have a way to specify "no opaque record id" for
+yaz-client's update command.. We'll fix that.
+
+Elapsed: 0.438016
+Z> f utah
+Sent searchRequest.
+Received SearchResponse.
+Search was a success.
+Number of hits: 1, setno 1
+SearchResult-1: term=utah cnt=1
+records returned: 0
+Elapsed: 0.014179
+
+Let's delete the beast:
+Z> update delete 1
+No last record (update ignored)
+Z> update delete 1 esdd0006.grs
+Got extended services response
+Status: done
+Elapsed: 0.072441
+Z> f utah
+Sent searchRequest.
+Received SearchResponse.
+Search was a success.
+Number of hits: 0, setno 2
+SearchResult-1: term=utah cnt=0
+records returned: 0
+Elapsed: 0.013610
+
+If shadow register is enabled you must run the adm-commit command in
+order write your changes..
+
+ </para>
+
+
+
+ </sect1>
+-->
+
+ </chapter>
+
+ <!-- Keep this comment at the end of the file
+ Local variables:
+ mode: sgml
+ sgml-omittag:t
+ sgml-shorttag:t
+ sgml-minimize-attributes:nil
+ sgml-always-quote-attributes:t
+ sgml-indent-step:1
+ sgml-indent-data:t
+ sgml-parent-document: "zebra.xml"
+ sgml-local-catalogs: nil
+ sgml-namecase-general:t
+ End:
+ -->