All sorts of minor and semi-major improvements.

author Mike Taylor <mike@indexdata.com>

Sun, 1 Dec 2002 23:26:26 +0000 (23:26 +0000)

committer Mike Taylor <mike@indexdata.com>

Sun, 1 Dec 2002 23:26:26 +0000 (23:26 +0000)
diff --git a/doc/examples.xml b/doc/examples.xml

index 10cbeb5..dc95e12 100644 (file)
--- a/doc/examples.xml
+++ b/doc/examples.xml
@@ -1,5 +1,5 @@
  <chapter id="examples">
- <!-- $Id: examples.xml,v 1.17 2002-11-08 17:00:57 mike Exp $ -->
+ <!-- $Id: examples.xml,v 1.18 2002-12-01 23:26:26 mike Exp $ -->
   <title>Example Configurations</title>
  
   <sect1>
@@ -19,23 +19,35 @@
  
      <listitem>
       <para>
-      Where to find subsidiary configuration files, including
-      <literal>default.idx</literal>
+      Where to find subsidiary configuration files, including both
+      those that are named explicitly and a few ``magic'' files such
+      as <literal>default.idx</literal>,
        which specifies the default indexing rules.
       </para>
      </listitem>
  
      <listitem>
       <para>
-      What attribute sets to recognise in searches.
+      What record schemas to support.  (Subsidiary files specify how
+      to index the contents of records in those schemas, and what
+      format to use when presenting records in those schemas to client
+      software.)
       </para>
      </listitem>
  
      <listitem>
       <para>
-      Policy details such as what record type to expect, what
-      low-level indexing algorithm to use, how to identify potential
-      duplicate records, etc.
+      What attribute sets to recognise in searches.  (Subsidiary files
+      specify how to interpret the attributes in terms
+      of the indexes that are created on the records.)
+     </para>
+    </listitem>
+
+    <listitem>
+     <para>
+      Policy details such as what type of input format to expect when
+      adding new records, what low-level indexing algorithm to use,
+      how to identify potential duplicate records, etc.
       </para>
      </listitem>
  
@@ -69,6 +81,10 @@
     <literal>dino.tree</literal>.)
     Type <literal>make records/dino.xml</literal>
     to make the XML data file.
+   (Or you could just type <literal>make</literal> to build the XML
+   data file, create the database and populate it with the taxonomic
+   records all in one shot - but then you wouldn't learn anything,
+   would you?  :-)
    </para>
    <para>
     Now we need to create a Zebra database to hold and index the XML
@@ -76,7 +92,7 @@
     Zebra indexer, <literal>zebraidx</literal>, which is
     driven by the <literal>zebra.cfg</literal> configuration file.
     For our purposes, we don't need any
-   special behaviour - we can use the defaults - so we start with a
+   special behaviour - we can use the defaults - so we can start with a
     minimal file that just tells <literal>zebraidx</literal> where to
     find the default indexing rules, and how to parse the records:
     <screen>
@@ -108,7 +124,7 @@
     XPath-based boolean queries and fetch the XML records that satisfy
     them:
     <screen>
-    $ yaz-client tcp:@:9999
+    $ yaz-client @:9999
      Connecting...Ok.
      Z&gt; find @attr 1=/Zthes/termName Sauroposeidon
      Number of hits: 1
@@ -118,6 +134,7 @@
       &lt;termId&gt;22&lt;/termId&gt;
       &lt;termName&gt;Sauroposeidon&lt;/termName&gt;
       &lt;termType&gt;PT&lt;/termType&gt;
+     &lt;termNote&gt;The tallest known dinosaur (18m)&lt;/termNote&gt;
       &lt;relation&gt;
        &lt;relationType&gt;BT&lt;/relationType&gt;
        &lt;termId&gt;21&lt;/termId&gt;
@@ -126,7 +143,7 @@
       &lt;/relation&gt;
  
        &lt;idzebra xmlns="http://www.indexdata.dk/zebra/"&gt;
-       &lt;size&gt;245&lt;/size&gt;
+       &lt;size&gt;300&lt;/size&gt;
         &lt;localnumber&gt;23&lt;/localnumber&gt;
         &lt;filename&gt;records/dino.xml&lt;/filename&gt;
        &lt;/idzebra&gt;
@@ -134,7 +151,7 @@
     </screen>
    </para>
    <para>
-   Now wasn't that easy?
+   Now wasn't that nice and easy?
    </para>
   </sect1>
  
@@ -158,7 +175,7 @@
     significantly because it ties searching semantics to the physical
     structure of the searched records.  You can't use the same search
     specification to search two databases if their internal
-   representations are different.  Consider an alternative taxonomy
+   representations are different.  Consider a different taxonomy
     database in which the records have taxon names specified
     inside a <literal>&lt;name&gt;</literal> element nested within a
     <literal>&lt;identification&gt;</literal> element
@@ -175,8 +192,8 @@
     said about implementation: in a given database, an access point
     might be implemented as an index, a path into physical records, an
     algorithm for interrogating relational tables or whatever works.
-   The key point is that the semantics of an access point are fixed
-   and well defined.
+   The only important point is that the semantics of an access
+   point are fixed and well defined.
    </para>
    <para>
     For convenience, access points are gathered into <firstterm>attribute
@@ -192,7 +209,7 @@
     In practice, the BIB-1 attribute set has tended to be a dumping
     ground for all sorts of access points, so that, for example, it
     includes some geospatial access points as well as strictly
-   bibliographic ones.  Nevertheless, the key point is that this model
+   bibliographic ones.  Nevertheless, this model
     allows a layer of abstraction over the physical representation of
     records in databases.
    </para>
@@ -210,6 +227,11 @@
     <literal>&lt;Zthes&gt;</literal> element.
    </para>
    <para>
+   ### Here's where it all goes to pieces.  The current arrangement is
+   very awkward (and somewhat embarrassing) to describe, and the new
+   arrangement hasn't actually been implemented yet.
+  </para>
+  <para>
     This is a two-step process.  First, we need to tell Zebra that we
     want to support the BIB-1 attribute set.  Then we need to tell it
     which elements of its record pertain to access point 4.
diff --git a/doc/harvest.mbox b/doc/harvest.mbox

deleted file mode 100644 (file)

index 0f38a3a..0000000
--- a/doc/harvest.mbox
+++ /dev/null
@@ -1,360 +0,0 @@
-From zebralist-admin@indexdata.dk  Sun Nov 24 23:16:24 2002
-MIME-Version: 1.0
-Envelope-to: zebra@miketaylor.org.uk
-Content-Type: text/plain;
-  charset="us-ascii"
-From: Kang-Jin Lee <lee@arco.de>
-To: zebralist@indexdata.dk
-User-Agent: KMail/1.4.3
-X-Spam-Level: 
-Subject: [Zebralist] Some progress on Harvest's move to Zebra
-Sender: zebralist-admin@indexdata.dk
-X-BeenThere: zebralist@indexdata.dk
-X-Mailman-Version: 2.0.11
-Precedence: bulk
-List-Help: <mailto:zebralist-request@indexdata.dk?subject=help>
-List-Post: <mailto:zebralist@indexdata.dk>
-List-Subscribe: <http://www.indexdata.dk/mailman/listinfo/zebralist>,
-       <mailto:zebralist-request@indexdata.dk?subject=subscribe>
-List-Id: Zebra Information Server <zebralist.indexdata.dk>
-List-Unsubscribe: <http://www.indexdata.dk/mailman/listinfo/zebralist>,
-       <mailto:zebralist-request@indexdata.dk?subject=unsubscribe>
-List-Archive: <http://www.indexdata.dk/pipermail/zebralist/>
-Date: Sun, 24 Nov 2002 20:45:19 +0100
-X-Spam-Status: No, hits=-1.0 required=5.0 tests=AWL version=2.20
-X-Spam-Level: 
-X-MIME-Autoconverted: from quoted-printable to 8bit by localhost.localdomain id gAONGNK15639
-
-Hi,
-
-I finished first steps to use Zebra as fulltext engine for Harvest
-(http://harvest.sourceforge.net/). The performance boost after
-some testing are quite impressive.
-
-Here is my article I wrote for the Harvest mailinglist.
-
-Many thanks for Zebra.
-
-------------------------------------------------------
-Hi,
-
-The first results after some testing with Zebra are very promising.
-
-The tests were done with around 220 000 SOIF files, which occupies
-1.6GB of disk space.
-
-Building the index from scratch takes around one hour with Zebra where
-Glimpse needs around five hours.
-
-While glimpse blocks search requests when updating its index, Zebra
-can still answer search requests.
-
-While the search time of glimpse varies from some seconds to some
-minutes depending how expensive the query is, Zebra usually takes
-around one to three seconds, even for expensive queries.
-
-Glimpse' index occupies around 250MB of disk space, Zebra's index
-takes around 570MB.
-
-Zebra supports incremental indexing which will speed up indexing even
-further.
-
-There are still potential for faster searches when necessary, using
-tweaks on apache.
-
-On the other hand, modeling data is not complete, yet.
-
-To sum it up:
-- Zebra indexes data five times faster than Glimpse
-- Zebra doesn't cause downtimes for indexupdate
-- Zebra's search time doesn't jump from seconds to minutes for no
-  obvious reason, but stays constant within a range of one to three
-  seconds
-- Zebra can search more than 100 times faster than Glimpse
-- Zebra can process multiple search requests simultaneously
-- Zebra can speed up indexing by using incremental indexing
-- Glimpse's index size is only around half of the Zebra's index
-
-kj
-------------------------------------------------------
-
-_______________________________________________
-Zebralist mailing list
-Zebralist@indexdata.dk
-http://www.indexdata.dk/mailman/listinfo/zebralist
-
-From mike@miketaylor.org.uk  Sun Nov 24 23:41:14 2002
-Date: Sun, 24 Nov 2002 23:41:13 GMT
-From: Mike Taylor <mike@miketaylor.org.uk>
-X-Was-To: lee@arco.de
-X-Was-CC: zebralist@indexdata.dk
-Cc: mike@localhost.localdomain
-In-reply-to: <200211242045.19196.lee@arco.de> (message from Kang-Jin Lee on
-       Sun, 24 Nov 2002 20:45:19 +0100)
-Subject: Re: [Zebralist] Some progress on Harvest's move to Zebra
-
-> Date: Sun, 24 Nov 2002 20:45:19 +0100
-> From: Kang-Jin Lee <lee@arco.de>
-> 
-> Here is my article I wrote for the Harvest mailinglist.
-
-Hi K-J,
-
-It's nice to read all this good stuff about Zebra!  I'm currently
-working on changes to the documentation for the next Zebra release,
-and I'd love to include a lightly-edited version of your message in
-the new document.  (Basically, I'd obscure the name of your old
-engine, so it's clear that we're trying to say good things about Zebra
-rather than score points off a competitor.)  Would it be OK for me to
-quote you?  If yes in principle, then I'll run the actual wording past
-you before submitting it.
-
-Thanks,
-
- _/|_   _______________________________________________________________
-/o ) \/  Mike Taylor   <mike@miketaylor.org.uk>   www.miketaylor.org.uk
-)_v__/\  "You question the worthiness of my code?  I should kill you
-        where you stand!" -- Klingon Programming Mantra
-
-From lee@arco.de Mon Nov 25 10:02:13 2002
-MIME-Version: 1.0
-Envelope-to: mike@miketaylor.org.uk
-Content-Type: text/plain;
-  charset="iso-8859-15"
-From: Kang-Jin Lee <lee@arco.de>
-To: Mike Taylor <mike@miketaylor.org.uk>
-Subject: Re: [Zebralist] Some progress on Harvest's move to Zebra
-Date: Mon, 25 Nov 2002 08:27:42 +0100
-User-Agent: KMail/1.4.3
-In-Reply-To: <200211242340.gAONefg15769@localhost.localdomain>
-X-Spam-Status: No, hits=-4.4 required=5.0 tests=IN_REP_TO version=2.20
-X-Spam-Level: 
-Content-Length: 836
-X-MIME-Autoconverted: from quoted-printable to 8bit by seatbooker.net id JAA28796
-
-Hi,
-
-On Monday 25 November 2002 00:40, you wrote:
-> > Date: Sun, 24 Nov 2002 20:45:19 +0100
-> > From: Kang-Jin Lee <lee@arco.de>
-> >
-> > Here is my article I wrote for the Harvest mailinglist.
->
-> Hi K-J,
->
-> It's nice to read all this good stuff about Zebra!  I'm currently
-> working on changes to the documentation for the next Zebra release,
-> and I'd love to include a lightly-edited version of your message in
-> the new document.  (Basically, I'd obscure the name of your old
-> engine, so it's clear that we're trying to say good things about Zebra
-> rather than score points off a competitor.)  Would it be OK for me to
-> quote you?  If yes in principle, then I'll run the actual wording past
-> you before submitting it.
-
-You are welcome to do this.
-
-I am very happy to see such a nice software available under GPL.
-
-Thanks.
-
-kj
-
-From zebralist-admin@indexdata.dk  Mon Nov 25 11:13:10 2002
-MIME-Version: 1.0
-Envelope-to: zebra@miketaylor.org.uk
-From: Pete <P.D.Mallinson@liverpool.ac.uk>
-X-X-Sender: qq15@uxa.liv.ac.uk
-To: Kang-Jin Lee <lee@arco.de>
-cc: zebralist@indexdata.dk
-Subject: Re: [Zebralist] Some progress on Harvest's move to Zebra
-In-Reply-To: <200211242045.19196.lee@arco.de>
-Content-Type: TEXT/PLAIN; charset=US-ASCII
-X-Spam-Level: 
-Sender: zebralist-admin@indexdata.dk
-X-BeenThere: zebralist@indexdata.dk
-X-Mailman-Version: 2.0.11
-Precedence: bulk
-List-Help: <mailto:zebralist-request@indexdata.dk?subject=help>
-List-Post: <mailto:zebralist@indexdata.dk>
-List-Subscribe: <http://www.indexdata.dk/mailman/listinfo/zebralist>,
-       <mailto:zebralist-request@indexdata.dk?subject=subscribe>
-List-Id: Zebra Information Server <zebralist.indexdata.dk>
-List-Unsubscribe: <http://www.indexdata.dk/mailman/listinfo/zebralist>,
-       <mailto:zebralist-request@indexdata.dk?subject=unsubscribe>
-List-Archive: <http://www.indexdata.dk/pipermail/zebralist/>
-Date: Mon, 25 Nov 2002 10:19:37 +0000 (GMT)
-X-Spam-Status: No, hits=-4.4 required=5.0 tests=IN_REP_TO version=2.20
-X-Spam-Level: 
-Content-Length: 2853
-
-On Sun, 24 Nov 2002, Kang-Jin Lee wrote:
-
->Hi,
->
->I finished first steps to use Zebra as fulltext engine for Harvest
->(http://harvest.sourceforge.net/). The performance boost after
->some testing are quite impressive.
-
-Hi ... I'd almost forgotten that the Harvest project is still active.
-
-We had a heap of challenges with our Harvest setup and with the
-time taken to index and search ... we switched to using
-Harvest-NG as the "reaper/gatherer" and modified Zebra to
-work with SOIF and our own ranking algorithm - it's been in
-service for over 6 months now.
-
-We had challenges with both speed of gathering and with
-speed of indexing and searching but most seem to be
-"managable" now.
-
-We offered our modifications to Zebra to Indexdata who
-offered to look at them since the latest release of Zebra
-is sufficiently different at the code level to make it
-non-trivial for us to apply our code modifications to
-it.
-
-
-Cheers
-
-Pete Mallinson
-
->
->Here is my article I wrote for the Harvest mailinglist.
->
->Many thanks for Zebra.
->
->------------------------------------------------------
->Hi,
->
->The first results after some testing with Zebra are very promising.
->
->The tests were done with around 220 000 SOIF files, which occupies
->1.6GB of disk space.
->
->Building the index from scratch takes around one hour with Zebra where
->Glimpse needs around five hours.
->
->While glimpse blocks search requests when updating its index, Zebra
->can still answer search requests.
->
->While the search time of glimpse varies from some seconds to some
->minutes depending how expensive the query is, Zebra usually takes
->around one to three seconds, even for expensive queries.
->
->Glimpse' index occupies around 250MB of disk space, Zebra's index
->takes around 570MB.
->
->Zebra supports incremental indexing which will speed up indexing even
->further.
->
->There are still potential for faster searches when necessary, using
->tweaks on apache.
->
->On the other hand, modeling data is not complete, yet.
->
->To sum it up:
->- Zebra indexes data five times faster than Glimpse
->- Zebra doesn't cause downtimes for indexupdate
->- Zebra's search time doesn't jump from seconds to minutes for no
->  obvious reason, but stays constant within a range of one to three
->  seconds
->- Zebra can search more than 100 times faster than Glimpse
->- Zebra can process multiple search requests simultaneously
->- Zebra can speed up indexing by using incremental indexing
->- Glimpse's index size is only around half of the Zebra's index
->
->kj
->------------------------------------------------------
->
->_______________________________________________
->Zebralist mailing list
->Zebralist@indexdata.dk
->http://www.indexdata.dk/mailman/listinfo/zebralist
->
-
-
-
-_______________________________________________
-Zebralist mailing list
-Zebralist@indexdata.dk
-http://www.indexdata.dk/mailman/listinfo/zebralist
-
-From zebralist-admin@indexdata.dk  Mon Nov 25 21:39:59 2002
-MIME-Version: 1.0
-Envelope-to: zebra@miketaylor.org.uk
-Content-Type: text/plain;
-  charset="iso-8859-1"
-From: Kang-Jin Lee <lee@arco.de>
-To: Pete <P.D.Mallinson@liverpool.ac.uk>
-Subject: Re: [Zebralist] Some progress on Harvest's move to Zebra
-User-Agent: KMail/1.4.3
-In-Reply-To: <Pine.GSO.4.44.0211251007060.15395-100000@uxa.liv.ac.uk>
-Cc: zebralist@indexdata.dk
-X-Spam-Level: 
-Sender: zebralist-admin@indexdata.dk
-X-BeenThere: zebralist@indexdata.dk
-X-Mailman-Version: 2.0.11
-Precedence: bulk
-List-Help: <mailto:zebralist-request@indexdata.dk?subject=help>
-List-Post: <mailto:zebralist@indexdata.dk>
-List-Subscribe: <http://www.indexdata.dk/mailman/listinfo/zebralist>,
-       <mailto:zebralist-request@indexdata.dk?subject=subscribe>
-List-Id: Zebra Information Server <zebralist.indexdata.dk>
-List-Unsubscribe: <http://www.indexdata.dk/mailman/listinfo/zebralist>,
-       <mailto:zebralist-request@indexdata.dk?subject=unsubscribe>
-List-Archive: <http://www.indexdata.dk/pipermail/zebralist/>
-Date: Mon, 25 Nov 2002 20:39:47 +0100
-X-Spam-Status: No, hits=-3.2 required=5.0 tests=IN_REP_TO,AWL version=2.20
-X-Spam-Level: 
-X-MIME-Autoconverted: from quoted-printable to 8bit by localhost.localdomain id gAPLdwK18535
-
-Hi,
-
-On Monday 25 November 2002 11:19, Pete wrote:
-
-> On Sun, 24 Nov 2002, Kang-Jin Lee wrote:
-
-> >I finished first steps to use Zebra as fulltext engine for Harvest
-> >(http://harvest.sourceforge.net/). The performance boost after
-> >some testing are quite impressive.
->
-> Hi ... I'd almost forgotten that the Harvest project is still active.
-
-It seems that everybody has forgotten Harvest. :-)
-
-> We had a heap of challenges with our Harvest setup and with the
-> time taken to index and search ... we switched to using
-> Harvest-NG as the "reaper/gatherer" and modified Zebra to
-> work with SOIF and our own ranking algorithm - it's been in
-> service for over 6 months now.
-
-I am very interested in your setup. Would it be possible to send
-your configuration files and modifications to me?
-I made some small modifications to soif.flt and am still wondering
-which query I should use. It would be very nice if I don't have to
-reinvent the wheel.
-
-> We had challenges with both speed of gathering and with
-> speed of indexing and searching but most seem to be
-> "managable" now.
-
-How big is your gatherer?
-
-> We offered our modifications to Zebra to Indexdata who
-> offered to look at them since the latest release of Zebra
-> is sufficiently different at the code level to make it
-> non-trivial for us to apply our code modifications to
-> it.
-
-I would like to take a look at the modifications, too.
-
-Thanks.
-
-kj
-
-
-_______________________________________________
-Zebralist mailing list
-Zebralist@indexdata.dk
-http://www.indexdata.dk/mailman/listinfo/zebralist
-
diff --git a/doc/installation.xml b/doc/installation.xml

index 05b9ab5..fd1e873 100644 (file)
--- a/doc/installation.xml
+++ b/doc/installation.xml
@@ -1,5 +1,5 @@
  <chapter id="installation">
- <!-- $Id: installation.xml,v 1.5 2002-10-08 08:09:43 mike Exp $ -->
+ <!-- $Id: installation.xml,v 1.6 2002-12-01 23:26:26 mike Exp $ -->
   <title>Installation</title>
   <para>
    An ANSI C compiler is required to compile the Zebra
@@ -11,7 +11,8 @@
    Unpack the distribution archive. The <literal>configure</literal>
    shell script attempts to guess correct values for various
    system-dependent variables used during compilation.
-  It uses those values to create a 'Makefile' in each directory of Zebra.
+  It uses those values to create a <literal>Makefile</literal> in each
+  directory of Zebra.
   </para>
   
   <para>
@@ -26,7 +27,7 @@
   <para>
    The configure script attempts to use C compiler specified by
    the <literal>CC</literal> environment variable.
-  If not set, <literal>cc</literal> or GNU C will be used.
+  If this is not set, <literal>cc</literal> or GNU C will be used.
    The <literal>CFLAGS</literal> environment variable holds
    options to be passed to the C compiler. If you're using a
    Bourne-shell compatible shell you may pass something like this:
@@ -34,27 +35,26 @@
    <screen>
    CC=/opt/ccs/bin/cc CFLAGS=-O ./configure
    </screen>
-  
-  The configure script takes a number of arguments, you can see
-  them all with
+ </para>
+ <para>
+  The configure script supports various options: you can see what they
+  are with
    <screen>
    ./configure --help
    </screen>
-
   </para>
   
   <para>
-  When configured, build the software by typing:
-  
+  Once the build environment is configured, build the software by
+  typing:
    <screen>
    make
    </screen>
- 
   </para>
   
   <para>
-  If successful, two executables are created in the sub-directory
-  <literal>index</literal>.
+  If the build is successful, two executables are created in the
+  sub-directory <literal>index</literal>:
    <variablelist>
     
     <varlistentry>
@@ -85,7 +85,7 @@
    By default this will install the Zebra executables in 
    <filename>/usr/local/bin</filename>,
    and the standard configuration files in 
-  <filename>/usr/local/share/zebra</filename>
+  <filename>/usr/local/share/idzebra</filename>.
    You can override this with the <literal>--prefix</literal> option
    to configure.
   </para>
diff --git a/doc/introduction.xml b/doc/introduction.xml

index ad1b558..475c3e5 100644 (file)
--- a/doc/introduction.xml
+++ b/doc/introduction.xml
@@ -1,15 +1,14 @@
  <chapter id="introduction">
- <!-- $Id: introduction.xml,v 1.21 2002-11-08 17:00:57 mike Exp $ -->
+ <!-- $Id: introduction.xml,v 1.22 2002-12-01 23:26:26 mike Exp $ -->
   <title>Introduction</title>
   
   <sect1>
    <title>Overview</title>
    
    <para>
-   <ulink url="http://indexdata.dk/zebra/">
-     Zebra</ulink>
+   <ulink url="http://indexdata.dk/zebra/">Zebra</ulink>
     is a high-performance, general-purpose structured text
-   indexing and retrieval engine. It reads structured records in a
+   indexing and retrieval engine. It reads records in a
     variety of input formats (eg. email, XML, MARC) and provides access
     to them through a powerful combination of boolean search
     expressions and relevance-ranked free-text queries.
@@ -49,7 +48,7 @@
  
      <listitem>
       <para>
-      Very large databases: files for indexes, etc. can be
+      Very large databases: logical files can be
        automatically partitioned over multiple disks.
       </para>
      </listitem>
@@ -57,7 +56,7 @@
      <listitem>
       <para>
        Arbitrarily complex records.  The internal data format
-      is an structured format conceptually similar to XML or GRS-1,
+      is a structured format conceptually similar to XML or GRS-1,
        which allows lists, nested structured data elements and
        variant forms of data.
       </para>
@@ -304,9 +303,45 @@
      which is populated by the Harvest-NG web-crawling software.
     </para>
     <para>
-    For more information, contact John Gilbertson
+    For more information on Liverpool University's intranet search
+    architecture, contact John Gilbertson
      <email>jgilbert@liverpool.ac.uk</email>
     </para>
+   <para>
+    Kang-Jin Lee
+    <email>lee@arco.de</email>,
+    has recently modified the Harvest-NG web crawler to use Zebra as
+    its native repository engine.  His comments on the switch over
+    from the old engine are revealing:
+    <blockquote>
+     <para>
+      The first results after some testing with Zebra are very
+      promising.  The tests were done with around 220,000 SOIF files,
+      which occupies 1.6GB of disk space.
+     </para>
+     <para>
+      Building the index from scratch takes around one hour with Zebra
+      where [old-engine] needs around five hours.  While [old-engine]
+      blocks search requests when updating its index, Zebra can still
+      answer search requests.
+      [...]
+      Zebra supports incremental indexing which will speed up indexing
+      even further.
+     </para>
+     <para>
+      While the search time of [old-engine] varies from some seconds
+      to some minutes depending how expensive the query is, Zebra
+      usually takes around one to three seconds, even for expensive
+      queries.
+      [...]
+      Zebra can search more than 100 times faster than [old-engine]
+      and can process multiple search requests simultaneously.
+     </para>
+     <para>
+      I am very happy to see such nice software available under GPL.
+     </para>
+    </blockquote>
+   </para>
    </sect2>
   </sect1>
  
@@ -331,7 +366,7 @@
     announcements from the authors (new
     releases, bug fixes, etc.) and general discussion.  You are welcome
     to seek support there.  Join by sending email to
-   <email>zebra-request@indexdata.dk</email>. Put the word
+   <email>zebra-request@indexdata.dk</email> with the word
     <literal>subscribe</literal> in the body of the message.
    </para>
    <para>
@@ -360,20 +395,17 @@
         Improved support for XML in search and retrieval. Eventually,
         the goal is for Zebra to pull double duty as a flexible
         information retrieval engine and high-performance XML
-       repository.
-     </para>
-     <para>
-       ### Partially done.
+       repository.  The recent addition of XPath searching is one
+       example of the kind of enhancement we're working on.
       </para>
      </listitem>
  
      <listitem>
       <para>
-       Access to search engine through SOAP/RPC API to allow the
+       Access to the search engine through SOAP/RPC API to allow the
         construction of applications without requiring Z39.50 tools.
-     </para>
-     <para>
-       ### Partially done, thanks to the new SRW/Z39.50 gateway.
+       This will shortly be available by means of Index Data's
+       SRW-to-Z39.50 gateway, currently in beta test.
       </para>
      </listitem>
  
@@ -388,6 +420,15 @@
  
      <listitem>
       <para>
+       Support for the use of Perl both for access to the Zebra API
+       and for building extension ``plug-ins'' such as input filters.
+       The code for this has been contributed to the source tree, and
+       is in the process of being integrated and tested.
+     </para>
+    </listitem>
+
+    <listitem>
+     <para>
         Improved free-text searching. We're first and foremost octet jockeys and
         we're actively looking for organisations or people who'd like
         to contribute experience in relevance ranking and text
diff --git a/doc/quickstart.xml b/doc/quickstart.xml

index d7f0d00..1aae924 100644 (file)
--- a/doc/quickstart.xml
+++ b/doc/quickstart.xml
@@ -1,54 +1,27 @@
  <chapter id="quick-start">
- <!-- $Id: quickstart.xml,v 1.7 2002-10-30 14:35:09 adam Exp $ -->
+ <!-- $Id: quickstart.xml,v 1.8 2002-12-01 23:26:26 mike Exp $ -->
   <title>Quick Start </title>
- 
- <!--
-  FIXME - Start with the new improved example scripts that run 
-  without any configuration file changes!
-       ### do we want this now we have "examples.html"? - mike, 15/10/02
- -->
  
   <para>
-  In this section, we will test the system by indexing a small set of sample
-  GILS records that are included with the software distribution. Go to the
-  <literal>examples/gils</literal> subdirectory of the distribution archive.
-  There you will find a configuration
-  file named <literal>zebra.cfg</literal> with the following contents:
-  
-  <screen>
-   # Where the schema files, attribute files, etc are located.
-   profilePath: ../../tab
-
-   # Files that describe the attribute sets supported.
-   attset: bib1.att
-   attset: gils.att
-   attset: explain.att
-
-   recordtype: grs.sgml
-   isam: c
-  </screen>
+  <!-- ### ulink to GILS profile: what's the URL? -->
+  In this section, we will test the system by indexing a small set of
+  sample GILS records that are included with the Zebra distribution,
+  running a Zebra server against the newly created database, and
+  searching the indexes with a client that connects to that server.
   </para>
-
- <!--  No longer necessary
- <para>
-  If necessary, edit the file and set <literal>profilePath</literal> to the path of the
-  YAZ profile tables (sub directory <literal>tab</literal> of the YAZ
-  distribution archive).
- </para>
- -->
- 
   <para>
-  The 48 test records are located in the sub directory
-  <literal>records</literal>. To index these, type:
-  
+  Go to the <literal>examples/gils</literal> subdirectory of the
+  distribution archive.  The 48 test records are located in the sub
+  directory <literal>records</literal>. To index these, type:
    <screen>
     zebraidx update records
    </screen>
   </para>
   
   <para>
-  In the command above, the word <literal>update</literal> followed
-  by a directory root updates all files below that directory node.
+  In this command, the word <literal>update</literal> is followed
+  by the name of a directory: <literal>zebraidx</literal> updates all
+  files in the hierarchy rooted at that directory.
   </para>
   
   <para>
@@ -56,7 +29,7 @@
    fire up a server. To start a server on port 2100, type:
    
    <screen>
-   zebrasrv tcp:@:2100
+   zebrasrv @:2100
    </screen>
    
   </para>
@@ -66,17 +39,18 @@
    named <literal>Default</literal>.
    The database contains records structured according to
    the GILS profile, and the server will
-  return records in either either USMARC, GRS-1, or SUTRS depending
-  on what your client asks for.
+  return records in USMARC, GRS-1, or SUTRS format depending
+  on what the client asks for.
   </para>
   
   <para>
    To test the server, you can use any Z39.50 client.
-  For instance, you can use the demo client that comes with YAZ:
+  For instance, you can use the demo command-line client that comes
+  with YAZ:
   </para>
   <para>
    <screen>
-   yaz-client tcp:localhost:2100
+   yaz-client localhost:2100
    </screen>
   </para>
   
@@ -92,8 +66,9 @@
   </para>
   
   <para>
-  The default retrieval syntax for the client is USMARC. To try other
-  formats for the same record, try:
+  The default retrieval syntax for the client is USMARC, and the
+  default element set is <literal>F</literal> (``full record''). To
+  try other formats and element sets for the same record, try:
   </para>
   <para>
    <screen>
@@ -110,8 +85,8 @@
   
   <note>
    <para>You may notice that more fields are returned when your
-   client requests SUTRS or GRS-1 records. When retrieving GILS records,
-   this is normal - not all of the GILS data elements have mappings in
+   client requests SUTRS, GRS-1 or XML records.
+   This is normal - not all of the GILS data elements have mappings in
     the USMARC record format.
    </para>
   </note>