spelling checked

[idzebra-moved-to-github.git] / doc / administration.xml
diff --git a/doc/administration.xml b/doc/administration.xml

index 773edfd..0048634 100644 (file)
--- a/doc/administration.xml
+++ b/doc/administration.xml
@@ -1,5 +1,5 @@
  <chapter id="administration">
- <!-- $Id: administration.xml,v 1.31 2006-05-01 13:07:40 marc Exp $ -->
+ <!-- $Id: administration.xml,v 1.36 2006-06-12 11:59:11 marc Exp $ -->
   <title>Administrating Zebra</title>
   <!-- ### It's a bit daft that this chapter (which describes half of
            the configuration-file formats) is separated from
@@ -402,7 +402,7 @@
    <para>
     The default behavior of the Zebra system is to reference the
     records from their original location, i.e. where they were found when you
-   ran <literal>zebraidx</literal>.
+   run <literal>zebraidx</literal>.
     That is, when a client wishes to retrieve a record
     following a search operation, the files are accessed from the place
     where you originally put them - if you remove the files (without
@@ -925,6 +925,8 @@
   <sect1 id="administration-ranking">
    <title>Relevance Ranking and Sorting of Result Sets</title>
  
+  <sect2>
+   <title>Overview</title>
     <para>
      The default ordering of a result set is left up to the server,
      which inside Zebra means sorting in ascending document ID order. 
@@ -933,7 +935,7 @@
     </para>
  
     <para> 
-    In case a good presentation ordering can be computed at
+    In cases where a good presentation ordering can be computed at
      indexing time, we can use a fixed <literal>static ranking</literal>
      scheme, which is provided for the <literal>alvis</literal>
      indexing filter. This defines a fixed ordering of hit lists,
@@ -944,12 +946,10 @@
      There are cases, however, where relevance of hit set documents is
      highly dependent on the query processed.
      Simply put, <literal>dynamic relevance ranking</literal> 
-    sortes a set of retrieved 
-    records such
-    that those most likely to be relevant to your request are
-    retrieved first. 
-    Internally, Zebra  retrieves all documents ID's that satisfy your
-    search query, and re-orders the hit list to arrange them based on
+    sorts a set of retrieved records such that those most likely to be
+    relevant to your request are retrieved first. 
+    Internally, Zebra retrieves all documents that satisfy your
+    query, and re-orders the hit list to arrange them based on
      a measurement of similarity between your query and the content of
      each record. 
     </para>
@@ -960,7 +960,7 @@
      lexicographical ordering of certain sort indexes created at
      indexing time.
     </para>
-
+  </sect2>
  
  
   <sect2 id="administration-ranking-static">
@@ -989,27 +989,24 @@
      <screen>
      staticrank: 1 
      </screen> 
-    directive in the main core Zebra config file, the internal document
-    keys used for ordering are augmented by a preceeding integer, which
+    directive in the main core Zebra configuration file, the internal document
+    keys used for ordering are augmented by a preceding integer, which
      contains the static rank of a given document, and the index lists
      are ordered 
      first by ascending static rank,
      then by ascending document <literal>ID</literal>.
-   </para>
-   <para>
-    This implies that the default rank <literal>0</literal> 
-    is the best rank at the
-    beginning of the list, and <literal>max int</literal> 
-    is the worst static rank.
+    Zero
+    is the ``best'' rank, as it occurs at the
+    beginning of the list; higher numbers represent worse scores.
     </para>
     <para>
      The experimental <literal>alvis</literal> filter provides a
      directive to fetch static rank information out of the indexed XML
-    records, thus making <emphasis>all</emphasis> hit sets orderd
+    records, thus making <emphasis>all</emphasis> hit sets ordered
      after <emphasis>ascending</emphasis> static
      rank, and for those doc's which have the same static rank, ordered
      after <emphasis>ascending</emphasis> doc <literal>ID</literal>.
-    See <xref linkend="record-model-alvisxslt"/> for the glory details.
+    See <xref linkend="record-model-alvisxslt"/> for the gory details.
     </para>
      </sect2>
  
@@ -1017,126 +1014,352 @@
   <sect2 id="administration-ranking-dynamic">
    <title>Dynamic Ranking</title>
     <para>
-    If one wants to do a little fiddeling with the static rank order,
-    one has to invoke additional re-ranking/re-ordering using dynamic 
-    reranking or score functions. These functions return positive
-    interger scores, where <emphasis>highest</emphasis> score is 
-    <emphasis>best</emphasis>, which means that the
-    hit sets will be sorted according to
-    <emphasis>decending</emphasis> 
+    In order to fiddle with the static rank order, it is necessary to
+    invoke additional re-ranking/re-ordering using dynamic
+    ranking or score functions. These functions return positive
+    integer scores, where <emphasis>highest</emphasis> score is 
+    ``best'';
+    hit sets are sorted according to <emphasis>descending</emphasis> 
      scores (in contrary
      to the index lists which are sorted according to
-    <emphasis>ascending</emphasis> rank  number and document ID).
+    ascending rank number and document ID).
     </para>
     <para>
-    Those are in the zebra config file enabled by a directive like (use
-    only one of these a time!):
+    Dynamic ranking is enabled by a directive like one of the
+    following in the zebra configuration file (use only one of these a time!):
      <screen> 
      rank: rank-1        # default TDF-IDF like
      rank: rank-static   # dummy do-nothing
-    rank: zvrank        # configurable, experimental TDF-IDF like
      </screen>
-    Notice that the <literal>rank-1</literal> and
-    <literal>zvrank</literal> do not use the static rank 
-    information in the list keys, and will produce the same ordering
-    with our without static ranking enabled.
     </para>
+ 
     <para>
-    The dummy <literal>rank-static</literal> reranking/scoring
-    function returns just 
-    <literal>score = max int - staticrank</literal>
-    in order to preserve the ordering of hit sets with and without it's
-    call.
-     Obviously, to combine static and dynamic ranking usefully, one wants
-    to make a new ranking 
-    function, which is left
-    as an exercise for the reader. 
-   </para>
-
-
-   <para>
-    Invoking dynamic ranking is done in query time (this is why we
-    call it 'dynamic ranking' in the first place ..). One has to add
+    Dynamic ranking is done at query time rather than
+    indexing time (this is why we
+    call it ``dynamic ranking'' in the first place ...)
+    It is invoked by adding
      the Bib-1 relation attribute with
-    value "relevance" to the PQF query (that is, <literal>@attr
-    2=102</literal>, see also  
+    value ``relevance'' to the PQF query (that is,
+    <literal>@attr&nbsp;2=102</literal>, see also  
      <ulink url="ftp://ftp.loc.gov/pub/z3950/defs/bib1.txt">
       The BIB-1 Attribute Set Semantics</ulink>). 
-    To find all articles with the word 'Eoraptor' in
-    the title, and present them relevance ranked, one issues the PQF query:
+    To find all articles with the word <literal>Eoraptor</literal> in
+    the title, and present them relevance ranked, issue the PQF query:
      <screen>
-     Z> f @attr 2=102 @attr 1=4 Eoraptor
+     @attr 2=102 @attr 1=4 Eoraptor
      </screen>
     </para>
- 
+
+    <sect3 id="administration-ranking-dynamic-rank1">
+     <title>Dynamically ranking using PQF queries with the 'rank-1' 
+      algorithm</title>
+
     <para>
       The default <literal>rank-1</literal> ranking module implements a 
-     TF-IDF (Term Frequecy over Inverse Document Frequency) like algorithm.
+     TF/IDF (Term Frequecy over Inverse Document Frequency) like
+     algorithm. In contrast to the usual defintion of TF/IDF
+     algorithms, which only considers searching in one full-text
+     index, this one works on multiple indexes at the same time.
+     More precisely, 
+     Zebra does boolean queries and searches in specific addressed
+     indexes (there are inverted indexes pointing from terms in the
+     dictionary to documents and term positions inside documents). 
+     It works like this:
+     <variablelist>
+      <varlistentry>
+       <term>Query Components</term>
+       <listitem>
+        <para>
+         First, the boolean query is dismantled into it's principal components,
+         i.e. atomic queries where one term is looked up in one index.
+         For example, the query
+         <screen>
+        @attr 2=102 @and @attr 1=1010 Utah @attr 1=1018 Springer
+         </screen>
+         is a boolean AND between the atomic parts
+         <screen>
+       @attr 2=102 @attr 1=1010 Utah
+         </screen>
+          and
+         <screen>
+       @attr 2=102 @attr 1=1018 Springer
+         </screen>
+         which gets processed each for itself.
+        </para>
+       </listitem>
+      </varlistentry>
+
+      <varlistentry>
+       <term>Atomic hit lists</term>
+       <listitem>
+        <para>
+         Second, for each atomic query, the hit list of documents is
+         computed.
+        </para>
+        <para>
+         In this example, two hit lists for each index  
+         <literal>@attr 1=1010</literal>  and  
+         <literal>@attr 1=1018</literal> are computed.
+        </para>
+       </listitem>
+      </varlistentry>
+
+      <varlistentry>
+       <term>Atomic scores</term>
+       <listitem>
+        <para>
+         Third, each document in the hit list is assigned a score (_if_ ranking
+         is enabled and requested in the query)  using a TF/IDF scheme.
+        </para>
+        <para>
+         In this example, both atomic parts of the query assign the magic
+         <literal>@attr 2=102</literal> relevance attribute, and are
+         to be used in the relevance ranking functions. 
+        </para>
+        <para>
+         It is possible to apply dynamic ranking on only parts of the
+         PQF query: 
+         <screen>
+          @and @attr 2=102 @attr 1=1010 Utah @attr 1=1018 Springer
+         </screen>
+         searches for all documents which have the term 'Utah' on the
+         body of text, and which have the term 'Springer' in the publisher
+         field, and sort them in the order of the relevance ranking made on
+         the body-of-text index only. 
+        </para>
+       </listitem>
+      </varlistentry>
+
+      <varlistentry>
+       <term>Hit list merging</term>
+       <listitem>
+        <para>
+         Fourth, the atomic hit lists are merged according to the boolean
+         conditions to a final hit list of documents to be returned.
+        </para>
+        <para>
+        This step is always performed, independently of the fact that
+        dynamic ranking is enabled or not.
+        </para>
+       </listitem>
+      </varlistentry>
+
+      <varlistentry>
+       <term>Document score computation</term>
+       <listitem>
+        <para>
+         Fifth, the total score of a document is computed as a linear
+         combination of the atomic scores of the atomic hit lists
+        </para>
+        <para>
+         Ranking weights may be used to pass a value to a ranking
+         algorithm, using the non-standard BIB-1 attribute type 9.
+         This allows one branch of a query to use one value while
+         another branch uses a different one.  For example, we can search
+         for <literal>utah</literal> in the 
+         <literal>@attr 1=4</literal> index with weight 30, as
+         well as in the <literal>@attr 1=1010</literal> index with weight 20:
+         <screen>
+         @attr 2=102 @or @attr 9=30 @attr 1=4 utah @attr 9=20 @attr 1=1010 city
+         </screen>
+        </para>
+        <para>
+         The default weight is
+         sqrt(1000) ~ 34 , as the Z39.50 standard prescribes that the top score
+         is 1000 and the bottom score is 0, encoded in integers.
+        </para>
+        <warning>
+         <para>
+          The ranking-weight feature is experimental. It may change in future
+          releases of zebra. 
+         </para>
+        </warning>
+       </listitem>
+      </varlistentry>
+
+      <varlistentry>
+       <term>Re-sorting of hit list</term>
+       <listitem>
+        <para>
+         Finally, the final hit list is re-ordered according to scores.
+        </para>
+       </listitem>
+      </varlistentry>
+     </variablelist>
+ 
+
+<!--
+Still need to describe the exact TF/IDF formula. Here's the info, need -->
+<!--to extract it in human readable form .. MC
+
+static int calc (void *set_handle, zint sysno, zint staticrank,
+                 int *stop_flag)
+{
+    int i, lo, divisor, score = 0;
+    struct rank_set_info *si = (struct rank_set_info *) set_handle;
+
+    if (!si->no_rank_entries)
+        return -1;   /* ranking not enabled for any terms */
+
+    for (i = 0; i < si->no_entries; i++)
+    {
+        yaz_log(log_level, "calc: i=%d rank_flag=%d lo=%d",
+                i, si->entries[i].rank_flag, si->entries[i].local_occur);
+        if (si->entries[i].rank_flag && (lo = si->entries[i].local_occur))
+            score += (8+log2_int (lo)) * si->entries[i].global_inv *
+                si->entries[i].rank_weight;
+    }
+    divisor = si->no_rank_entries * (8+log2_int (si->last_pos/si->no_entries));
+    score = score / divisor;
+    yaz_log(log_level, "calc sysno=" ZINT_FORMAT " score=%d", sysno, score);
+    if (score > 1000)
+        score = 1000;
+    /* reset the counts for the next term */
+    for (i = 0; i < si->no_entries; i++)
+        si->entries[i].local_occur = 0;
+    return score;
+}
+
+
+where lo = si->entries[i].local_occur is the local documents term-within-index frequency, si->entries[i].global_inv represents the IDF part (computed in static void *begin()), and
+si->entries[i].rank_weight is the weight assigner per index (default 34, or set in the @attr 9=xyz magic)
+
+Finally, the IDF part is computed as:
+
+static void *begin (struct zebra_register *reg,
+                    void *class_handle, RSET rset, NMEM nmem,
+                    TERMID *terms, int numterms)
+{
+    struct rank_set_info *si =
+        (struct rank_set_info *) nmem_malloc (nmem,sizeof(*si));
+    int i;
+
+    yaz_log(log_level, "rank-1 begin");
+    si->no_entries = numterms;
+    si->no_rank_entries = 0;
+    si->nmem=nmem;
+    si->entries = (struct rank_term_info *)
+        nmem_malloc (si->nmem, sizeof(*si->entries)*numterms);
+    for (i = 0; i < numterms; i++)
+    {
+        zint g = rset_count(terms[i]->rset);
+        yaz_log(log_level, "i=%d flags=%s '%s'", i,
+                terms[i]->flags, terms[i]->name );
+        if  (!strncmp (terms[i]->flags, "rank,", 5))
+        {
+            const char *cp = strstr(terms[i]->flags+4, ",w=");
+            si->entries[i].rank_flag = 1;
+            if (cp)
+                si->entries[i].rank_weight = atoi (cp+3);
+            else
+              si->entries[i].rank_weight = 34; /* sqrroot of 1000 */
+            yaz_log(log_level, " i=%d weight=%d g="ZINT_FORMAT, i,
+                     si->entries[i].rank_weight, g);
+            (si->no_rank_entries)++;
+        }
+        else
+            si->entries[i].rank_flag = 0;
+        si->entries[i].local_occur = 0;  /* FIXME */
+        si->entries[i].global_occur = g;
+        si->entries[i].global_inv = 32 - log2_int (g);
+        yaz_log(log_level, " global_inv = %d g = " ZINT_FORMAT,
+                (int) (32-log2_int (g)), g);
+        si->entries[i].term = terms[i];
+        si->entries[i].term_index=i;
+        terms[i]->rankpriv = &(si->entries[i]);
+    }
+    return si;
+}
+
+
+where g = rset_count(terms[i]->rset) is the count of all documents in this specific index hit list, and the IDF part then is
+
+ si->entries[i].global_inv = 32 - log2_int (g);
+   -->
+
     </para>
  
+
+    <para>
+    The <literal>rank-1</literal> algorithm
+    does not use the static rank 
+    information in the list keys, and will produce the same ordering
+    with or without static ranking enabled.
+    </para>
+ 
+    </sect3>
+
+    <!--
+    <sect3 id="administration-ranking-dynamic-rank1">
+     <title>Dynamically ranking PQF queries with the 'rank-static' 
+      algorithm</title>
+    <para>
+    The dummy <literal>rank-static</literal> reranking/scoring
+    function returns just 
+    <literal>score = max int - staticrank</literal>
+    in order to preserve the static ordering of hit sets that would
+    have been produced had it not been invoked.
+    Obviously, to combine static and dynamic ranking usefully,
+    it is necessary
+    to make a new ranking 
+    function; this is left
+    as an exercise for the reader. 
+   </para>
+    </sect3>
+    -->
+
+ 
     <warning>
       <para>
-      Notice that <literal>dynamic ranking</literal> is not compatible
+      <literal>Dynamic ranking</literal> is not compatible
        with <literal>estimated hit sizes</literal>, as all documents in
-      a hit set must be acessed to compute the correct placing in a
+      a hit set must be accessed to compute the correct placing in a
        ranking sorted list. Therefore the use attribute setting
-      <literal>@attr 2=102</literal> clashes with 
-      <literal>@attr 9=</literal>. 
+      <literal>@attr&nbsp;2=102</literal> clashes with 
+      <literal>@attr&nbsp;9=integer</literal>. 
       </para>
     </warning>  
  
-   <para>
-     It is possible to apply dynamic ranking on parts of the PQF query
-     allone:
-     <screen>
-     Z> f @and @attr 2=102 @attr 1=1010 Utah @attr 1=1018 Springer
-     </screen>
-     searches for all documents which have the term 'Utah' on the
-     body of text, and which have the term 'Springer' in the publisher
-     field, and sort them in the order of the relvance ranking made on
-     the body of text index only. 
-   </para>
-    <para>
-     Rank weight is a way to pass a value to a ranking algorithm - so that 
-     one APT has one value - while another as a different one. For
-     example, we can 
-     search for 'utah' in use attribute set 'title' with weight 30, as
-     well as in use attribute set 'any' with weight 20.
-     <screen>
-     Z> f @attr 2=102 @or @attr 9=30 @attr 1=4 utah @attr 9=20 utah
-     </screen>
-    </para>
-    <warning>
+   <!--
+    we might want to add ranking like this:
+    UNPUBLISHED:
+    Simple BM25 Extension to Multiple Weighted Fields
+    Stephen Robertson, Hugo Zaragoza and Michael Taylor
+    Microsoft Research
+    ser@microsoft.com
+    hugoz@microsoft.com
+    mitaylor2microsoft.com
+   -->
+
+
+    <sect3 id="administration-ranking-dynamic-cql">
+     <title>Dynamically ranking CQL queries</title>
       <para>
-      The rank weight feature is experimental. It may change in future
-      releases of zebra, and is not production mature. 
+      Dynamic ranking can be enabled during sever side CQL
+      query expansion by adding <literal>@attr&nbsp;2=102</literal>
+      chunks to the CQL config file. For example
+      <screen>
+       relationModifier.relevant               = 2=102
+      </screen>
+      invokes dynamic ranking each time a CQL query of the form 
+      <screen>
+       Z> querytype cql
+       Z> f alvis.text =/relevant house
+      </screen>
+      is issued. Dynamic ranking can also be automatically used on
+      specific CQL indexes by (for example) setting
+      <screen>
+       index.alvis.text                        = 1=text 2=102
+      </screen>
+      which then invokes dynamic ranking each time a CQL query of the form 
+      <screen>
+       Z> querytype cql
+       Z> f alvis.text = house
+      </screen>
+      is issued.
       </para>
-    </warning>
-    
-   <para>
-     Notice that <literal>dynamic ranking</literal> can be enabled in
-     sever side CQL query expansion by adding <literal>@attr
-     2=102</literal> to the CQL config file. For example
-     <screen>
-      relationModifier.relevant                = 2=102
-     </screen>
-     invokes dynamik ranking each time a CQL query of the form 
-    <screen>
-     Z> querytype cql
-     Z> f alvis.text =/relevant house
-    </screen>
-     is issued. Dynamic ranking can be enabled on specific CQL indexes
-     by (for example) setting
-     <screen>
-      index.alvis.text                        = 1=text 2=102
-     </screen>
-     which then invokes dynamik ranking each time a CQL query of the form 
-    <screen>
-     Z> querytype cql
-     Z> f alvis.text = house
-    </screen>
-     is issued.
-   </para>
+     
+    </sect3>
  
      </sect2>
  
@@ -1144,59 +1367,63 @@
   <sect2 id="administration-ranking-sorting">
    <title>Sorting</title>
     <para>
-     Sorting is enabled in the configuration of record indexing. For
-     example, to enable sorting according to the BIB-1
+     Zebra sorts efficiently using special sorting indexes
+     (type=<literal>s</literal>; so each sortable index must be known
+     at indexing time, specified in the configuration of record
+     indexing.  For example, to enable sorting according to the BIB-1
       <literal>Date/time-added-to-db</literal> field, one could add the line
       <screen>
          xelm /*/@created               Date/time-added-to-db:s
       </screen>
-     to any <literal>.abs</literal> record indexing config file, or
-     similarily, one could add an indexing element of the form
+     to any <literal>.abs</literal> record-indexing configuration file.
+     Similarly, one could add an indexing element of the form
       <screen><![CDATA[       
        <z:index name="date-modified" type="s">
         <xsl:value-of select="some/xpath"/>
        </z:index>
        ]]></screen>
-     to any <literal>alvis</literal> indexing rule.
+     to any <literal>alvis</literal>-filter indexing stylesheet.
       </para>
       <para>
-     To trigger a sorting on a pre-defined sorting index of type
-     <literal>s</literal>, we can issue a sort with BIB-1
-     embedded sort attribute set <literal>7</literal>. 
-     The embedded sort is a way to specify sort within a query - thus
-     removing the need to send a Z39.50 <literal>Sort
-     Request</literal> separately. 
+      Indexing can be specified at searching time using a query term
+      carrying the non-standard
+      BIB-1 attribute-type <literal>7</literal>.  This removes the
+      need to send a Z39.50 <literal>Sort Request</literal>
+      separately, and can dramatically improve latency when the client
+      and server are on separate networks.
+      The sorting part of the query is separate from the rest of the
+      query - the actual search specification - and must be combined
+      with it using OR.
       </para>
       <para>
-     The value after attribute type <literal>7</literal> is 
-     <literal>1</literal> (=ascending), or <literal>2</literal>
-     (=descending). 
-     The attributes+term (APT) node is separate from the rest of the
-     PQF query, and must be <literal>@or</literal>'ed. 
-     The term associated with this attribute is the sorting level,
-     where
-     <literal>0</literal> specifies the primary sort key,
-     <literal>1</literal> the secondary sort key, and so on. 
+      A sorting subquery needs two attributes: an index (such as a
+      BIB-1 type-1 attribute) specifying which index to sort on, and a
+      type-7 attribute whose value is be <literal>1</literal> for
+      ascending sorting, or <literal>2</literal> for descending.  The
+      term associated with the sorting attribute is the priority of
+      the sort key, where <literal>0</literal> specifies the primary
+      sort key, <literal>1</literal> the secondary sort key, and so
+      on.
       </para>
      <para>For example, a search for water, sort by title (ascending),
      is expressed by the PQF query
       <screen>
-     Z> f @or @attr 1=1016 water @attr 7=1 @attr 1=4 0
+     @or @attr 1=1016 water @attr 7=1 @attr 1=4 0
       </screen>
        whereas a search for water, sort by title ascending, 
       then date descending would be
       <screen>
-     Z> f @or @or @attr 1=1016 water @attr 7=1 @attr 1=4 0 @attr 7=2 @attr 1=30 1
+     @or @or @attr 1=1016 water @attr 7=1 @attr 1=4 0 @attr 7=2 @attr 1=30 1
       </screen>
      </para>
      <para>
       Notice the fundamental differences between <literal>dynamic
-     ranking</literal> and <literal>sorting</literal>: there can only
-     be one ranking function defined and configured, but there can be
-     specified multiple sorting indexes dynamically at search
-     time. Ranking does not need to use specific indexes, which means,
+     ranking</literal> and <literal>sorting</literal>: there can be
+     only one ranking function defined and configured; but multiple
+     sorting indexes can be specified dynamically at search
+     time. Ranking does not need to use specific indexes, so
       dynamic ranking can be enabled and disabled without
-     re-indexing. On the other hand, sorting indexes need to be
+     re-indexing; whereas, sorting indexes need to be
       defined before indexing.
       </para>