<chapter id="introduction">
- <!-- $Id: introduction.xml,v 1.20 2002-11-08 13:23:52 adam Exp $ -->
+ <!-- $Id: introduction.xml,v 1.40 2006-09-22 12:34:45 adam Exp $ -->
<title>Introduction</title>
- <sect1>
+ <section id="overview">
<title>Overview</title>
<para>
- <ulink url="http://indexdata.dk/zebra/">
- Zebra</ulink>
+ <ulink url="http://indexdata.dk/zebra/">Zebra</ulink>
is a high-performance, general-purpose structured text
- indexing and retrieval engine. It reads structured records in a
+ indexing and retrieval engine. It reads records in a
variety of input formats (e.g. email, XML, MARC) and provides access
to them through a powerful combination of boolean search
expressions and relevance-ranked free-text queries.
programs and toolkits, both commercial and free, which understand
this protocol. Application libraries are available to allow
bespoke clients to be written in Perl, C, C++, Java, Tcl, Visual
- Basic, Python, PHP and more - see
- <ulink url="http://zoom.z3950.org/">the ZOOM web site</ulink>
+ Basic, Python, PHP and more - see the
+ <ulink url="&url.zoom;">ZOOM web site</ulink>
for more information on some of these client toolkits.
</para>
and how to configure the server to give you the
functionality that you need.
</para>
- </sect1>
+ </section>
- <sect1 id="features">
+ <section id="features">
<title>Features</title>
<para>
<listitem>
<para>
- Very large databases: files for indexes, etc. can be
+ Very large databases: logical files can be
automatically partitioned over multiple disks.
</para>
</listitem>
<listitem>
<para>
Arbitrarily complex records. The internal data format
- is an structured format conceptually similar to XML or GRS-1,
+ is a structured format conceptually similar to XML or GRS-1,
which allows lists, nested structured data elements and
variant forms of data.
</para>
<listitem>
<para>
Zebra is written in portable C, so it runs on most Unix-like systems
- as well as Windows NT. A binary distribution for Windows NT is
- available at
- <ulink url="http://ftp.indexdata.dk/pub/zebra/win32/"/>,
- and pre-built packages are available for some Linux
+ as well as Windows (NT/2000/2003). A binary distribution for Windows
+ is available at
+ <ulink url="&url.idzebra.download.win32;"/>,
+ and pre-built packages are available for
+ <!--- some Linux
distributions:
Red Hat 7.x RPMs at
<ulink url="http://ftp.indexdata.dk/pub/zebra/RedHat7.X/"/>
and Debian packages at
- <ulink url="http://ftp.indexdata.dk/pub/zebra/debian/"/>
+ -->
+ <literal>Debian GNU/Linux</literal> at
+ <ulink url="&url.idzebra.download.debian;"/>.
</para>
</listitem>
</para>
<para>
- Z39.50 protocol support:
+ <ulink url="&url.z39.50;">Z39.50</ulink> protocol support:
</para>
<para>
Segmentation (support for very large records), Delete, Scan
(index browsing), Sort, Close and support for the ``update''
Extended Service to add or replace an existing XML record.
- <!-- Adam says:
- * Supported
- You can insert/delete/replace an XML record given an
- "external" ID. Actually this way of doing ES Update was
- meant for an OAI application that Ian Ibbotson had in
- mind to implement. The "update" command in YAZ client
- implements this on the client side. My plan is to make
- this available in ZOOM "extended" soon..
- -->
</para>
</listitem>
</itemizedlist>
</para>
+
- </sect1>
+ <para>
+ <ulink url="&url.sru;">SRU</ulink> Web Service support:
+ </para>
+ <para>
+ <itemizedlist>
+ <listitem>
+ <para>
+ The protocol operations <literal>explain</literal>,
+ <literal>searchRetrieve</literal> and <literal>scan</literal>
+ are supported.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <ulink url="&url.cql;">CQL</ulink> to internal query model RPN
+ conversion is supported.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ Multiple XML record formats
+ for data retrieval are supported, modelled on the GRS-1, SUTRS and
+ MARC record formats. Records can be mapped between record
+ schemas on the fly. Arbitrarily complex XSLT transformations
+ can be applied during record retrieval if one uses the
+ <literal>alvis</literal> filter module.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ Extended RPN queries for search/retrieve and scan are supported,
+ for controlling approximate hit counts, etc.
+ </para>
+ </listitem>
+
+ </itemizedlist>
+
+ </para>
+
+
+ </section>
- <sect1 id="apps">
- <title>Applications</title>
+ <section id="introduction-apps">
+ <title>References and Zebra-based Applications</title>
<para>
Zebra has been deployed in numerous applications, in both the
academic and commercial worlds, in application domains as diverse
Notable applications include the following:
</para>
- <sect2>
- <title>DADS - the DTV Article Database Service</title>
+
+ <section id="koha-ils">
+ <title>Koha free open-source ILS</title>
+ <para>
+ <ulink url="http://www.koha.org/">Koha</ulink> is a full-featured
+ open-source ILS, initially developed in
+ New Zealand by Katipo Communications Ltd, and first deployed in
+ January of 2000 for Horowhenua Library Trust. It is currently
+ maintained by a team of software providers and library technology
+ staff from around the globe.
+ </para>
+ <para>
+ <ulink url="http://liblime.com/">LibLime</ulink>,
+ a company that markets and supports Koha, is adding the Zebra
+ database server to the new Koha 3.0 release to drive its
+ bibliographic database.
+ </para>
+ <para>
+ In early 2005, the Koha project development team began looking at
+ ways to improve MARC support and overcome scalability limitations
+ in the Koha 2.x series. After extensive evaluations of the best
+ of the Open Source textual database engines - including MySQL
+ full-text searching, PostgreSQL, Lucene and Plucene - the team
+ selected Zebra.
+ </para>
+ <para>
+ "Zebra completely eliminates scalability limitations, because it
+ can support tens of millions of records," explained Joshua
+ Ferraro, LibLime's Technology President and Koha's Project
+ Release Manager. "Our performance tests showed search results in
+ under a second for databases with over 5 million records on a
+ modest i386 900MHz test server."
+ </para>
+ <para>
+ "Zebra also includes support for true boolean search expressions
+ and relevance-ranked free-text queries, both of which the Koha
+ 2.x series lack. Zebra also supports incremental and safe
+ database updates, which allow on-the-fly record
+ management. Finally, since Zebra has at its heart the Z39.50
+ protocol, it greatly improves Koha's support for that critical
+ library standard."
+ </para>
+ <para>
+ Although the bibliographic database will be moved to Zebra, Koha
+ 3.0 will continue to use a relational SQL-based database design
+ for the 'factual' database. "Relational database managers have
+ their strengths, in spite of their inability to handle large
+ numbers of bibliographic records efficiently," summed up Ferraro,
+ "We're taking the best from both worlds in our redesigned Koha
+ 3.0."
+ </para>
+ <para>
+ See also LibLime's newsletter article
+ <ulink url="http://www.liblime.com/newsletter/2006/01/features/koha-earns-its-stripes/">
+ Koha Earns its Stripes</ulink>.
+ </para>
+ </section>
+
+ <section id="emilda-ils">
+ <title>Emilda open source ILS</title>
<para>
+ <ulink url="http://www.emilda.org/">Emilda</ulink>
+ is a complete Integrated Library System, released under the
+ GNU General Public License. It has a
+ full-featured Web-OPAC, allowing comprehensive system management
+ from virtually any computer with an Internet connection, has a
+ template-based layout allowing anyone to alter the visual
+ appearance of Emilda, and is
+ XML-based for fast and easy portability to virtually any
+ language.
+ Currently, Emilda is used at three schools in Espoo, Finland.
+ </para>
+ <para>
+ In addition, 100% MARC compatibility has been achieved using the
+ Zebra Server from Index Data as the backend server.
+ </para>
+ </section>
+
+ <section id="reindex-ils">
+ <title>ReIndex.Net web based ILS</title>
+ <para>
+ <ulink url="http://www.reindex.net/index.php?lang=en">Reindex.net</ulink>
+ is a net-based library service offering all
+ traditional functions at a very high level plus many new
+ services. Reindex.net is a comprehensive and powerful web system
+ based on standards such as XML and Z39.50.
+ Reindex supports MARC21, danMARC or Dublin Core with
+ UTF-8 encoding.
+ </para>
+ <para>
+ Reindex.net runs on Debian GNU/Linux with Zebra and Simpleserver
+ from Index
+ Data for bibliographic data. The relational database system
+ Sybase 9 XML is used for
+ administrative data.
+ Internally MARCXML is used for bibliographical records. Update
+ utilizes Z39.50 extended services.
+ </para>
+ </section>
+
+ <section id="dads-article-database">
+ <title>DADS - the DTV Article Database
+ Service</title>
+ <para>
DADS is a huge database of more than ten million records, totalling
over ten gigabytes of data. The records are metadata about academic
journal articles, primarily scientific; about 10% of these
</para>
<para>
More information can be found at
- <ulink url="http://www.dtv.dk/help/dads/index_e.htm"/>
+ <ulink url="http://www.dtv.dk/"/> and
+ <ulink url="http://dads.dtv.dk"/>
</para>
- </sect2>
+ </section>
- <sect2>
- <title>NLI-Z39.50 - a Natural Language Interface for Libraries</title>
+ <section id="infonet-eprints">
+ <title>Infonet Eprints</title>
<para>
- Fernuniversität Hagen in Germany have developed a natural
- language interface for access to library databases.
- <ulink url="http://ki212.fernuni-hagen.de/nli/NLIintro.html"/>
- In order to evaluate this interface for recall and precision, they
- chose Zebra as the basis for retrieval effectiveness. The Zebra
- server contains a copy of the GIRT database, consisting of more
- than 76000 records in SGML format (bibliographic records from
- social science), which are mapped to MARC for presentation.
- </para>
+ The InfoNet Eprints service from the
+ <ulink url="http://www.dtv.dk/">
+ Technical Knowledge Center of Denmark</ulink>
+ provides access to documents stored in
+ eprint/preprint servers and institutional research archives around
+ the world. The service is based on Open Archives Initiative metadata
+ harvesting of selected scientific archives around the world. These
+ open archives offer free and unrestricted access to their contents.
+ </para>
<para>
- (GIRT is the German Indexing and Retrieval Testdatabase. It is a
- standard German-language test database for intelligent indexing
- and retrieval systems. See
- <ulink url="http://www.gesis.org/forschung/informationstechnologie/clef-delos.htm"/>)
- </para>
- <para>
- Evaluation will take place as part of the TREC/CLEF campaign 2003
- <ulink url="http://clef.iei.pi.cnr.it or http://www4.eurospider.ch/CLEF/"/>
+ Infonet Eprints currently holds 1.4 million records from 16 archives.
+ The online search facility is found at
+ <ulink url="http://preprints.cvt.dk"/>.
</para>
+ </section>
+
+ <section id="alvis-project">
+ <title>Alvis</title>
<para>
- For more information, contact Johannes Leveling
- <email>Johannes.Leveling@FernUni-Hagen.De</email>
- </para>
- </sect2>
+ The <ulink url="http://www.alvis.info/alvis/">Alvis</ulink> EU
+ project run under the 6th Framework (IST-1-002068-STP)
+ is building a semantic-based peer-to-peer search engine. A
+ consortium of eleven partners from six different European
+ Community countries plus Switzerland and China contribute
+ with expertise in a broad range of specialties including network
+ topologies, routing algorithms, linguistic analysis and
+ bioinformatics.
+ </para>
+ <para>
+ The Zebra information retrieval indexing machine is used inside
+ the Alvis framework to
+ manage huge collections of natural language processed and
+ enhanced XML data, coming from a topic-relevant web crawl.
+ In this application, Zebra swallows and manages 37GB of XML data
+ in about 4 hours, resulting in search times of fractions of
+ a second.
+ </para>
+ </section>
+
- <sect2>
+ <section id="uls">
<title>ULS (Union List of Serials)</title>
<para>
- The M25-Link systems team
- (<ulink url="http://www.m25lib.ac.uk/M25link/"/>)
- are involved in a project called ULS to provide a union catalogue
- for periodicals in 21 member libraries. They do this with an
- unusual architecture which they call a
+ The M25 Systems Team
+ has created a union catalogue for the periodicals of the
+ twenty-one constituent libraries of the University of London and
+ the University of Westminster
+ (<ulink url="http://www.m25lib.ac.uk/ULS/"/>).
+ They have achieved this using an
+ unusual architecture, which they describe as a
``non-distributed virtual union catalogue''.
</para>
<para>
More information can be found at
<ulink url="http://www.m25lib.ac.uk/ULS/"/>
</para>
- </sect2>
+ </section>
+
+ <section id="nli">
+ <title>NLI-Z39.50 - a Natural Language Interface for Libraries</title>
+ <para>
+ Fernuniversität Hagen in Germany have developed a natural
+ language interface for access to library databases.
+ <!-- <ulink
+ url="http://ki212.fernuni-hagen.de/nli/NLIintro.html"/> -->
+ In order to evaluate this interface for recall and precision, they
+ chose Zebra as the basis for retrieval effectiveness. The Zebra
+ server contains a copy of the GIRT database, consisting of more
+ than 76000 records in SGML format (bibliographic records from
+ social science), which are mapped to MARC for presentation.
+ </para>
+ <para>
+ (GIRT is the German Indexing and Retrieval Testdatabase. It is a
+ standard German-language test database for intelligent indexing
+ and retrieval systems. See
+ <ulink url="http://www.gesis.org/forschung/informationstechnologie/clef-delos.htm"/>)
+ </para>
+ <para>
+ Evaluation will take place as part of the TREC/CLEF campaign 2003
+ <ulink url="http://clef.iei.pi.cnr.it"/>.
+ <!-- or <ulink url="http://www4.eurospider.ch/CLEF/"/> -->
+ </para>
+ <para>
+ For more information, contact Johannes Leveling
+ <email>Johannes.Leveling@FernUni-Hagen.De</email>
+ </para>
+ </section>
- <sect2>
+ <section id="various-web-indexes">
<title>Various web indexes</title>
<para>
Zebra has been used by a variety of institutions to construct
which is populated by the Harvest-NG web-crawling software.
</para>
<para>
- For more information, contact John Gilbertson
+ For more information on Liverpool University's intranet search
+ architecture, contact John Gilbertson
<email>jgilbert@liverpool.ac.uk</email>
</para>
- </sect2>
- </sect1>
-
-
- <sect1 id="support">
- <title>Support</title>
- <para>
- You can get support for Zebra from at least three sources.
- </para>
- <para>
- First, there's the Zebra web site at
- <ulink url="http://indexdata.dk/zebra/"/>,
- which always has the most recent version available for download.
- If you have a problem with Zebra, the first thing to do is see
- whether it's fixed in the current release.
- </para>
- <para>
- Second, there's the Zebra mailing list. Its home page at
- <ulink url="http://indexdata.dk/mailman/listinfo/zebralist"/>
- includes a complete archive of all messages that have ever been
- posted on the list. The Zebra mailing list is used both for
- announcements from the authors (new
- releases, bug fixes, etc.) and general discussion. You are welcome
- to seek support there. Join by sending email to
- <email>zebra-request@indexdata.dk</email>. Put the word
- <literal>subscribe</literal> in the body of the message.
- </para>
- <para>
- Third, it's possible to buy a commercial support contract, with
- well defined service levels and response times, from Index Data.
- See
- <ulink url="http://indexdata.dk/support/?lang=en"/>
- <!-- ### compare this page with http://indexdata.dk/support2/ -->
- for details.
- </para>
- </sect1>
+ <para>
+ Kang-Jin Lee
+ has recently modified the Harvest web indexer to use Zebra as
+ its native repository engine. His comments on the switch over
+ from the old engine are revealing:
+ <blockquote>
+ <para>
+ The first results after some testing with Zebra are very
+ promising. The tests were done with around 220,000 SOIF files,
+ which occupies 1.6GB of disk space.
+ </para>
+ <para>
+ Building the index from scratch takes around one hour with Zebra
+ where [old-engine] needs around five hours. While [old-engine]
+ blocks search requests when updating its index, Zebra can still
+ answer search requests.
+ [...]
+ Zebra supports incremental indexing which will speed up indexing
+ even further.
+ </para>
+ <para>
+ While the search time of [old-engine] varies from some seconds
+ to some minutes depending how expensive the query is, Zebra
+ usually takes around one to three seconds, even for expensive
+ queries.
+ [...]
+ Zebra can search more than 100 times faster than [old-engine]
+ and can process multiple search requests simultaneously
+ </para>
+ <para>
+ I am very happy to see such nice software available under GPL.
+ </para>
+ </blockquote>
+ </para>
+ </section>
+ </section>
+
+
+ <section id="introduction-support">
+ <title>Support</title>
+ <para>
+ You can get support for Zebra from at least three sources.
+ </para>
+ <para>
+ First, there's the Zebra web site at
+ <ulink url="&url.idzebra;"/>,
+ which always has the most recent version available for download.
+ If you have a problem with Zebra, the first thing to do is see
+ whether it's fixed in the current release.
+ </para>
+ <para>
+ Second, there's the Zebra mailing list. Its home page at
+ <ulink url="&url.idzebra.mailinglist;"/>
+ includes a complete archive of all messages that have ever been
+ posted on the list. The Zebra mailing list is used both for
+ announcements from the authors (new
+ releases, bug fixes, etc.) and general discussion. You are welcome
+ to seek support there. Join by filling in the form on the list home page.
+ </para>
+ <para>
+ Third, it's possible to buy a commercial support contract, with
+ well defined service levels and response times, from Index Data.
+ See
+ <ulink url="&url.indexdata.support;"/>
+ for details.
+ </para>
+ </section>
- <sect1 id="future">
+ <section id="future">
<title>Future Directions</title>
<para>
Improved support for XML in search and retrieval. Eventually,
the goal is for Zebra to pull double duty as a flexible
information retrieval engine and high-performance XML
- repository.
- </para>
- <para>
- ### Partially done.
- </para>
- </listitem>
-
- <listitem>
- <para>
- Access to search engine through SOAP/RPC API to allow the
- construction of applications without requiring Z39.50 tools.
+ repository. The recent addition of XPath searching is one
+ example of the kind of enhancement we're working on.
</para>
<para>
- ### Partially done, thanks to the new SRW/Z39.50 gateway.
+ There is also the experimental <literal>ALVIS XSLT</literal>
+ XML input filter, which unleashes the full power of DOM-based
+ XSLT transformations during indexing and record retrieval. Work
+ on this filter has been sponsored by the ALVIS EU project
+ <ulink url="http://www.alvis.info/alvis/"/>. We expect this filter to
+ mature soon, as it is planned to be included in the version 2.0
+ release of Zebra.
</para>
</listitem>
or check the contact info at the end of this manual.
</para>
- </sect1>
+ </section>
</chapter>
<!-- Keep this comment at the end of the file
Local variables: