<chapter id="architecture">
<!-- $Id: architecture.xml,v 1.20 2007-02-02 11:10:08 marc Exp $ -->
<title>Overview of &zebra; Architecture</title>

<section id="architecture-representation">
<title>Local Representation</title>

As mentioned earlier, &zebra; places few restrictions on the type of
data that you can index and manage. Generally, whatever the form of
the data, it is parsed by an input filter specific to that format, and
turned into an internal structure that &zebra; knows how to handle. This
process takes place whenever the record is accessed, for indexing and
retrieval.

The RecordType parameter in the <literal>zebra.cfg</literal> file, or
the <literal>-t</literal> option to the indexer, tells &zebra; how to
process input records.
Two basic types of processing are available: raw text and structured
data. Raw text is just that, and it is selected by providing the
argument <emphasis>text</emphasis> to &zebra;. Structured records are
all handled internally using the basic mechanisms described
elsewhere in this documentation.
&zebra; can read structured records in many different formats.
How this is done is governed by additional parameters after the
"grs" keyword, separated by "." characters.
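
As a rough illustration of the dotted convention just described, the
following Python sketch splits a record type value such as
<literal>grs.sgml</literal> into its leading keyword and the additional
parameters. The function name <literal>split_record_type</literal> is
hypothetical and not part of &zebra;; it only mirrors the textual
convention and makes no claim about how &zebra; parses the value internally.

```python
def split_record_type(value: str) -> tuple[str, list[str]]:
    """Split a record type such as 'grs.sgml' into ('grs', ['sgml']).

    Hypothetical helper for illustration only; not a Zebra API.
    """
    # The first component is the keyword ("text" or "grs");
    # anything after it, separated by ".", is an additional parameter.
    keyword, *params = value.strip().split(".")
    return keyword, params


if __name__ == "__main__":
    print(split_record_type("text"))
    print(split_record_type("grs.sgml"))
```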
<section id="architecture-maincomponents">
<title>Main Components</title>

The &zebra; system is designed to support a wide range of data management
applications. The system can be configured to handle virtually any
kind of structured data. Each record in the system is associated with
a <emphasis>record schema</emphasis> which lends context to the data
elements of the record.
Any number of record schemas can coexist in the system.
Although it may be wise to use only a single schema within
one database, the system imposes no such restriction.

The &zebra; indexer and information retrieval server consists of the
following main applications: the <command>zebraidx</command>
indexing maintenance utility, and the <command>zebrasrv</command>
information query and retrieval server. Both use some of the
same main components, which are presented here.

The virtual Debian package <literal>idzebra-2.0</literal>
installs all the necessary packages to start
working with &zebra;, including utility programs, development libraries,
documentation and modules.
<section id="componentcore">
<title>Core &zebra; Libraries Containing Common Functionality</title>

The core &zebra; module is the heart of the <command>zebraidx</command>
indexing maintenance utility and the <command>zebrasrv</command>
information query and retrieval server binaries. In short, the core
libraries are responsible for

<term>Dynamic Loading</term>
<para>of external filter modules, in case the application is
not compiled statically. These filter modules define indexing,
search and retrieval capabilities of the various input formats.
</para>

<term>Index Maintenance</term>
<para>&zebra; maintains Term Dictionaries and ISAM index
entries in inverted index structures kept on disk. These are
optimized for fast insert, update and delete, as well as good
search performance.
</para>

<term>Search Evaluation</term>
<para>by execution of search requests expressed in &pqf;/&rpn;
data structures, which are handed over from
the &yaz; server frontend &api;. Search evaluation includes
construction of hit lists according to boolean combinations
of simpler searches. Fast performance is achieved by careful
use of index structures, and by evaluating specific index hit
lists in the correct order.
</para>

<term>Ranking and Sorting</term>
components call resorting/re-ranking algorithms on the hit
sets. These may be pre-sorted not only by the
assigned document IDs, but also by assigned static rank
information.

<term>Record Presentation</term>
<para>returns (possibly ranked) result sets, hit
counts, and similar internal data to the &yaz; server backend &api;
for shipping to the client. Each individual filter module
implements its own specific presentation formats.
</para>

The Debian package <literal>libidzebra-2.0</literal>
contains all run-time libraries for &zebra;, the
documentation in PDF and HTML is found in
<literal>idzebra-2.0-doc</literal>, and
<literal>idzebra-2.0-common</literal>
includes common essential &zebra; configuration files.
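
The Search Evaluation item above mentions hit lists built from boolean
combinations of simpler searches. The Python sketch below shows the
general technique on sorted lists of document IDs. It is a minimal
illustration under the assumption that each simple search yields a sorted,
duplicate-free list of integers; the function names are hypothetical and
this is not &zebra;'s actual implementation.

```python
def and_hits(a: list[int], b: list[int]) -> list[int]:
    """Intersect two sorted hit lists with a linear merge (boolean AND)."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def or_hits(a: list[int], b: list[int]) -> list[int]:
    """Merge two sorted hit lists without duplicates (boolean OR)."""
    i = j = 0
    out = []
    while i < len(a) or j < len(b):
        # Take from a when b is exhausted or a's head is smaller.
        if j == len(b) or (i < len(a) and a[i] < b[j]):
            out.append(a[i])
            i += 1
        elif i == len(a) or b[j] < a[i]:
            out.append(b[j])
            j += 1
        else:  # equal heads: emit once, advance both
            out.append(a[i])
            i += 1
            j += 1
    return out


if __name__ == "__main__":
    title_hits = [1, 3, 5, 7]
    author_hits = [3, 4, 5, 8]
    print(and_hits(title_hits, author_hits))
    print(or_hits(title_hits, author_hits))
```

Because both inputs are already sorted, each combination needs only a
single pass over the two lists, which is one reason the order in which
hit lists are evaluated matters.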
<section id="componentindexer">
<title>&zebra; Indexer</title>

The <command>zebraidx</command>
indexing maintenance utility
loads external filter modules used for indexing data records of
different types, and creates, updates and drops databases and
indexes according to the rules defined in the filter modules.

The Debian package <literal>idzebra-2.0-utils</literal> contains
the <command>zebraidx</command> utility.

<section id="componentsearcher">
<title>&zebra; Searcher/Retriever</title>

This is the executable which runs the &z3950;/&sru;/&srw; server and
glues together the core libraries and the filter modules into one
Information Retrieval server application.

The Debian package <literal>idzebra-2.0-utils</literal> contains
the <command>zebrasrv</command> utility.
<section id="componentyazserver">
<title>&yaz; Server Frontend</title>

The &yaz; server frontend is
a full-fledged stateful &z3950; server taking client
connections, and forwarding search and scan requests to the
&zebra; core indexer.

In addition to &z3950; requests, the &yaz; server frontend acts
as an HTTP server, honoring
<ulink url="&url.srw;">&sru; &soap;</ulink>
and
<ulink url="&url.sru;">&sru; &rest;</ulink>
requests. Moreover, it can translate an incoming
<ulink url="&url.cql;">&cql;</ulink>
query to
<ulink url="&url.yaz.pqf;">&pqf;</ulink>
format, if
correctly configured.

<ulink url="&url.yaz;">&yaz;</ulink>
is a
toolkit that allows you to develop software using the
&ansi; &z3950;/ISO23950 standard for information retrieval.
It is packaged in the Debian packages
<literal>yaz</literal> and <literal>libyaz</literal>.
<section id="componentmodules">
<title>Record Models and Filter Modules</title>

The hard work of knowing <emphasis>what</emphasis> to index,
<emphasis>how</emphasis> to do it, and <emphasis>which</emphasis>
part of the records to send in a search/retrieve response is
implemented in
various filter modules. It is their responsibility to define the
exact indexing and record display filtering rules.

The virtual Debian package
<literal>libidzebra-2.0-modules</literal> installs all base filter
modules.
<section id="componentmodulesalvis">
<title>ALVIS &xml; Record Model and Filter Module</title>

The Alvis filter for &xml; files is an &xslt; based input
filter.
It indexes element and attribute content of any conceivable &xml; format
using full &xpath; support, a feature which the standard &zebra;
&grs1; &sgml; and &xml; filters lacked. The indexed documents are
parsed into a standard &xml; &dom; tree, which limits record size
to what fits in available memory.

The Alvis filter
uses &xslt; display stylesheets, which let
the &zebra; DB administrator associate multiple, different views with
the same &xml; document type. These views are chosen on-the-fly at
search time.

In addition, the Alvis filter configuration is not bound to the
arcane &bib1; &z3950; library catalogue indexing traditions and
folklore, and is therefore easier to understand.

Finally, the Alvis filter allows for static ranking at index
time, and for sorting hit lists according to predefined
static ranks. This imposes no overhead at all: both
search and indexing still perform in
<emphasis>O(1)</emphasis>, irrespective of document
collection size. This feature resembles Google's pre-ranking using
their PageRank algorithm.
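
The idea of ranking at index time can be sketched in general terms as
follows. This Python example is only an illustration of static
pre-ranking, not the Alvis filter's implementation: the class name, the
method names, and the assumption that a lower rank value means a better
document are all hypothetical.

```python
import bisect


class StaticRankedList:
    """Hit list kept in static-rank order as documents are indexed.

    Hypothetical illustration: entries are (static_rank, doc_id) tuples,
    and a lower static_rank is assumed to mean a better document.
    """

    def __init__(self) -> None:
        self._entries: list[tuple[int, int]] = []

    def add(self, doc_id: int, static_rank: int) -> None:
        # Ordering work is done once, at index time.
        bisect.insort(self._entries, (static_rank, doc_id))

    def top(self, k: int) -> list[int]:
        # At search time the best k documents are simply the first k
        # entries; no sorting of the hit list is needed.
        return [doc_id for _, doc_id in self._entries[:k]]


if __name__ == "__main__":
    hits = StaticRankedList()
    hits.add(doc_id=10, static_rank=3)
    hits.add(doc_id=11, static_rank=1)
    hits.add(doc_id=12, static_rank=2)
    print(hits.top(2))
```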

Details on the experimental Alvis &xslt; filter are found in
<xref linkend="record-model-alvisxslt"/>.

The Debian package <literal>libidzebra-2.0-mod-alvis</literal>
contains the Alvis filter module.
<section id="componentmodulesgrs">
<title>&grs1; Record Model and Filter Modules</title>

The &grs1; filter modules described in
<xref linkend="grs"/>
are all based on the &z3950; specifications, and it is
essential to have the reference pages on &bib1; attribute sets
at hand when configuring &grs1; filters. The GRS filters come in
different flavors, and a short introduction is needed here.
&grs1; filters of various kinds have also been called ABS filters due
to the <filename>*.abs</filename> configuration file suffix.

The <emphasis>grs.marc</emphasis> and
<emphasis>grs.marcxml</emphasis> filters are suited to parse and
index binary and &xml; versions of traditional library &marc; records
based on the ISO2709 standard. The Debian package for both
filters is
<literal>libidzebra-2.0-mod-grs-marc</literal>.

&grs1; TCL scriptable filters for extensive user configuration come
in two flavors: a regular expression filter
<emphasis>grs.regx</emphasis> using TCL regular expressions, and
a general scriptable TCL filter called
<emphasis>grs.tcl</emphasis>.
Both are included in the
<literal>libidzebra-2.0-mod-grs-regx</literal> Debian package.

A general purpose &sgml; filter is called
<emphasis>grs.sgml</emphasis>. This filter is not yet packaged,
but is planned for the
<literal>libidzebra-2.0-mod-grs-sgml</literal> Debian package.

The Debian package
<literal>libidzebra-2.0-mod-grs-xml</literal> includes the
<emphasis>grs.xml</emphasis> filter which uses <ulink
url="&url.expat;">Expat</ulink> to
parse records in &xml; and turn them into ID&zebra;'s internal &grs1; node
trees. See also the Alvis &xml;/&xslt; filter described in
<xref linkend="componentmodulesalvis"/>.
<section id="componentmodulestext">
<title>TEXT Record Model and Filter Module</title>

Plain ASCII text filter. TODO: add information here.

<section id="componentmodulessafari">
<title>SAFARI Record Model and Filter Module</title>

SAFARI filter module TODO: add information here.
<section id="architecture-workflow">
<title>Indexing and Retrieval Workflow</title>

Records pass through three different states during processing in the
system.

When records are accessed by the system, they are represented
in their local, or native format. This might be &sgml; or HTML files,
News or Mail archives, or &marc; records. If the system doesn't already
know how to read the type of data you need to store, you can set up an
input filter by preparing conversion rules based on regular
expressions and possibly augmented by a flexible scripting language.
The input filter produces as output an internal representation,
a tree structure.

When records are processed by the system, they are represented
in a tree structure, constructed from tagged data elements hanging off a
root node. The tagged elements may contain data or yet more tagged
elements in a recursive structure. The system performs various
actions on this tree structure (indexing, element selection, schema
mapping, etc.).

Before records are transmitted to the client, they are first
converted from the internal structure to a form suitable for exchange
over the network, according to the &z3950; standard.
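
The three states described above can be sketched as a small pipeline.
The Python example below is a simplified illustration, not &zebra; code:
the "tag: value" native format, the <literal>Node</literal> class, and
the functions <literal>input_filter</literal> and
<literal>to_exchange</literal> are all assumptions made for the sake of
the example, and the exchange form is plain XML text rather than an
actual &z3950; encoding.

```python
from dataclasses import dataclass, field
from xml.sax.saxutils import escape


@dataclass
class Node:
    """Tagged element of the internal tree; may hold data or children."""
    tag: str
    data: str = ""
    children: list["Node"] = field(default_factory=list)


def input_filter(native: str) -> Node:
    """State 1 -> 2: parse a hypothetical 'tag: value' native format
    into tagged elements hanging off a root node."""
    root = Node("record")
    for line in native.splitlines():
        if ":" in line:
            tag, value = line.split(":", 1)
            root.children.append(Node(tag.strip(), value.strip()))
    return root


def to_exchange(node: Node) -> str:
    """State 2 -> 3: serialize the tree recursively into an exchange
    form (XML text is used here purely for illustration)."""
    inner = escape(node.data) + "".join(to_exchange(c) for c in node.children)
    return f"<{node.tag}>{inner}</{node.tag}>"


if __name__ == "__main__":
    native_record = "title: My Book\nauthor: A. Writer"
    tree = input_filter(native_record)
    print(to_exchange(tree))
```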
<section id="special-retrieval">
<title>Retrieval of &zebra; internal record data</title>

Starting with <literal>&zebra;</literal> version 2.0.5, it is
possible to use a special element set which has the prefix
<literal>zebra::</literal>.

Using this element set will, regardless of record type, return
&zebra;'s internal index structure/data for a record.
In particular, the regular record filters are not invoked when
this element set is used.
This can in some cases make retrieval faster than regular
retrieval operations (for &marc;, &xml; etc).

<table id="special-retrieval-types">
<title>Special Retrieval Elements</title>
<tgroup cols="3">
<thead>
<row>
<entry>Element Set</entry>
<entry>Description</entry>
<entry>Syntax</entry>
</row>
</thead>
<tbody>
<row>
<entry><literal>zebra::meta::sysno</literal></entry>
<entry>Get &zebra; record system ID</entry>
<entry>&xml; and &sutrs;</entry>
</row>
<row>
<entry><literal>zebra::data</literal></entry>
<entry>Get raw record</entry>
<entry>all</entry>
</row>
<row>
<entry><literal>zebra::meta</literal></entry>
<entry>Get &zebra; record internal metadata</entry>
<entry>&xml; and &sutrs;</entry>
</row>
<row>
<entry><literal>zebra::index</literal></entry>
<entry>Get all indexed keys for record</entry>
<entry>&xml; and &sutrs;</entry>
</row>
<row>
<entry><literal>zebra::index::</literal><replaceable>f</replaceable></entry>
<entry>Get indexed keys for field <replaceable>f</replaceable> for record</entry>
<entry>&xml; and &sutrs;</entry>
</row>
<row>
<entry><literal>zebra::index::</literal><replaceable>f</replaceable>:<replaceable>t</replaceable></entry>
<entry>Get indexed keys for field <replaceable>f</replaceable>
and type <replaceable>t</replaceable> for record</entry>
<entry>&xml; and &sutrs;</entry>
</row>
</tbody>
</tgroup>
</table>

For example, to fetch the raw binary record data stored in the
&zebra; internal storage, or on the filesystem, the following
commands can be issued:
Z> f @attr 1=title my
Z> elements zebra::data

The <literal>zebra::data</literal> element set name is
defined for any record syntax, but will always fetch
the raw record data in exactly the original form. No record syntax
specific transformations will be applied to the raw record data.

Also, &zebra; internal metadata about the record can be accessed:
Z> f @attr 1=title my
Z> elements zebra::meta::sysno

displays, in <literal>&xml;</literal> record syntax, only the internal
record system number, whereas
Z> f @attr 1=title my
Z> elements zebra::meta

displays all available metadata on the record. These include the system
number, database name, indexed filename, filter used for indexing,
score and static ranking information, and finally the byte size of the record.

Sometimes, it is very hard to figure out what exactly has been
indexed, how, and in which indexes. Using the indexing stylesheet of
the Alvis filter, one can at least see which portion of the record
went into which index, but a similar aid does not exist for all
other indexing filters.

The <literal>zebra::index</literal> element set names are provided to
access information on per record indexed fields. For example, the
commands
Z> f @attr 1=title my
Z> elements zebra::index

will display all indexed tokens from all indexed fields of the
first record, and it will display in <literal>&sutrs;</literal>
record syntax, whereas
Z> f @attr 1=title my
Z> elements zebra::index::title
Z> elements zebra::index::title:p

displays in <literal>&xml;</literal> record syntax only the content
of the &zebra; string index <literal>title</literal>, or
even only the type <literal>p</literal> phrase indexed part of it.

Trying to access numeric <literal>&bib1;</literal> use
attributes, or trying to access non-existent &zebra; internal string
access points, will result in a Diagnostic 25: Specified element set
name not valid for specified database.
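
To make the naming scheme of the special element sets concrete, the
following Python sketch splits a name such as
<literal>zebra::index::title:p</literal> into its parts. The function
<literal>parse_zebra_element_set</literal> and the dictionary keys it
returns are hypothetical; the sketch only reflects the name patterns
listed in the table above and does not reproduce &zebra;'s own parsing or
validation.

```python
from typing import Optional


def parse_zebra_element_set(name: str) -> Optional[dict]:
    """Split a special element set name into its components.

    Hypothetical helper for illustration; returns None when the name
    does not carry the 'zebra::' prefix.
    """
    prefix = "zebra::"
    if not name.startswith(prefix):
        return None
    # "index::title:p" -> kind "index", rest "title:p"
    kind, _, rest = name[len(prefix):].partition("::")
    result = {"kind": kind}
    if kind == "index" and rest:
        # Optional field name, optionally followed by ":" and an index type.
        field_name, _, index_type = rest.partition(":")
        result["field"] = field_name
        if index_type:
            result["type"] = index_type
    elif kind == "meta" and rest:
        result["item"] = rest
    return result


if __name__ == "__main__":
    for example in ("zebra::data", "zebra::meta::sysno",
                    "zebra::index::title", "zebra::index::title:p", "F"):
        print(example, "->", parse_zebra_element_set(example))
```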