Added prototype recTypeClass_load_modules

diff --git a/doc/administration.xml b/doc/administration.xml
index 42e13c1..01e8f22 100644
--- a/doc/administration.xml
+++ b/doc/administration.xml
@@ -1,5 +1,5 @@
  <chapter id="administration">
- <!-- $Id: administration.xml,v 1.27 2006-03-04 21:39:20 marc Exp $ -->
+ <!-- $Id: administration.xml,v 1.33 2006-05-02 12:55:18 mike Exp $ -->
   <title>Administrating Zebra</title>
   <!-- ### It's a bit daft that this chapter (which describes half of
            the configuration-file formats) is separated from
@@ -923,7 +923,50 @@
  
  
   <sect1 id="administration-ranking">
-  <title>Static and Dynamic Ranking</title>
+  <title>Relevance Ranking and Sorting of Result Sets</title>
+
+  <sect2>
+   <title>Overview</title>
+   <para>
+    The default ordering of a result set is left up to the server,
+    which inside Zebra means sorting in ascending document ID order. 
+    This is not always the order in which humans want to browse hit sets,
+    which can be quite large. Ranking and sorting come to the rescue.
+   </para>
+
+   <para> 
+    In cases where a good presentation ordering can be computed at
+    indexing time, we can use a fixed <literal>static ranking</literal>
+    scheme, which is provided for the <literal>alvis</literal>
+    indexing filter. This defines a fixed ordering of hit lists,
+    independently of the query issued. 
+   </para>
+
+   <para>
+    There are cases, however, where relevance of hit set documents is
+    highly dependent on the query processed.
+    Simply put, <literal>dynamic relevance ranking</literal> 
+    sorts a set of retrieved 
+    records such
+    that those most likely to be relevant to your request are
+    presented first.
+    Internally, Zebra retrieves all documents that satisfy your
+    query, and re-orders the hit list to arrange them based on
+    a measurement of similarity between your query and the content of
+    each record. 
+   </para>
+
+   <para>
+    Finally, there are situations where hit sets of documents should be
+    <literal>sorted</literal> during query time according to the
+    lexicographical ordering of certain sort indexes created at
+    indexing time.
+   </para>
+  </sect2>
+
+
+ <sect2 id="administration-ranking-static">
+  <title>Static Ranking</title>
    
     <para>
      Zebra uses internally inverted indexes to look up term occurencies
@@ -954,12 +997,9 @@
      are ordered 
      first by ascending static rank,
      then by ascending document <literal>ID</literal>.
-   </para>
-   <para>
-    This implies that the default rank <literal>0</literal> 
-    is the best rank at the
-    beginning of the list, and <literal>max int</literal> 
-    is the worst static rank.
+    Zero
+    is the ``best'' rank, as it occurs at the
+    beginning of the list; higher numbers represent worse scores.
     </para>
     <para>
      The experimental <literal>alvis</literal> filter provides a
@@ -968,59 +1008,207 @@
      after <emphasis>ascending</emphasis> static
      rank, and for those doc's which have the same static rank, ordered
      after <emphasis>ascending</emphasis> doc <literal>ID</literal>.
-    See <xref linkend="record-model-alvisxslt"/> for the glory details.
+    See <xref linkend="record-model-alvisxslt"/> for the gory details.
     </para>
+    </sect2>
+
+
+ <sect2 id="administration-ranking-dynamic">
+  <title>Dynamic Ranking</title>
     <para>
-    If one wants to do a little fiddeling with the static rank order,
-    one has to invoke additional re-ranking/re-ordering using dynamic 
-    reranking or score functions. These functions return positive
-    interger scores, where <emphasis>highest</emphasis> score is 
-    <emphasis>best</emphasis>, which means that the
-    hit sets will be sorted according to
+    In order to fiddle with the static rank order, it is necessary to
+    invoke additional re-ranking/re-ordering using dynamic
+    ranking or score functions. These functions return positive
+    integer scores, where <emphasis>highest</emphasis> score is 
+    ``best'';
+    hit sets are sorted according to
      <emphasis>decending</emphasis> 
      scores (in contrary
      to the index lists which are sorted according to
-    <emphasis>ascending</emphasis> rank  number and document ID).
+    ascending rank number and document ID).
     </para>
-   <!--
-   <para>
-    Those are defined in the zebra C source files 
-    <screen>     
-    "rank-1" : zebra/index/rank1.c  
-               default TF/IDF like zebra dynamic ranking
-    "rank-static" : zebra/index/rankstatic.c
-               do-nothing dummy static ranking (this is just to prove
-               that the static rank can be used in dynamic ranking functions)  
-     "zvrank" : zebra/index/zvrank.c
-               many different dynamic TF/IDF ranking functions 
-    </screen> 
-   </para>
-   -->
     <para>
-    Those are in the zebra config file enabled by a directive like (use
-    only one of these a time!):
+    Dynamic ranking is enabled by a directive like one of the
+    following in the Zebra config file (use only one of these at a time!):
      <screen> 
-    rank: rank-1        # default
-    rank: rank-static   # dummy 
-    rank: zvrank        # TDF-IDF like
+    rank: rank-1        # default TF-IDF like
+    rank: rank-static   # dummy do-nothing
+    rank: zvrank        # configurable, experimental TF-IDF like
      </screen>
      Notice that the <literal>rank-1</literal> and
      <literal>zvrank</literal> do not use the static rank 
      information in the list keys, and will produce the same ordering
-    with our without static ranking enabled.
+    with or without static ranking enabled.
     </para>
     <para>
      The dummy <literal>rank-static</literal> reranking/scoring
      function returns just 
      <literal>score = max int - staticrank</literal>
-    in order to preserve the ordering of hit sets with and without it's
-    call.
-     Obviously, to combine static and dynamic ranking usefully, one wants
+    in order to preserve the static ordering of hit sets that would
+    have been produced had it not been invoked.
+    Obviously, to combine static and dynamic ranking usefully,
+    it is necessary
      to make a new ranking 
-    function, which is left
+    function; this is left
      as an exercise for the reader. 
     </para>
-   
+
+
+   <para>
+    Dynamic ranking is done at query time rather than
+    indexing time (this is why we
+    call it ``dynamic ranking'' in the first place).
+    It is invoked by adding
+    the Bib-1 relation attribute with
+    value ``relevance'' to the PQF query (that is,
+    <literal>@attr&nbsp;2=102</literal>, see also  
+    <ulink url="ftp://ftp.loc.gov/pub/z3950/defs/bib1.txt">
+     The BIB-1 Attribute Set Semantics</ulink>). 
+    To find all articles with the word <literal>Eoraptor</literal> in
+    the title, and present them relevance ranked, issue the PQF query:
+    <screen>
+     @attr 2=102 @attr 1=4 Eoraptor
+    </screen>
+   </para>
+ 
+   <para>
+     The default <literal>rank-1</literal> ranking module implements a
+     TF-IDF (Term Frequency, Inverse Document Frequency) like algorithm.
+   </para>
+
+   <warning>
+     <para>
+      Notice that <literal>dynamic ranking</literal> is not compatible
+      with <literal>estimated hit sizes</literal>, as all documents in
+      a hit set must be accessed to compute the correct placing in a
+      rank-sorted list. Therefore the attribute setting
+      <literal>@attr&nbsp;2=102</literal> clashes with 
+      <literal>@attr&nbsp;9=integer</literal>. 
+     </para>
+   </warning>  
+
+   <para>
+     It is possible to apply dynamic ranking on only parts of the PQF query:
+     <screen>
+     @and @attr 2=102 @attr 1=1010 Utah @attr 1=1018 Springer
+     </screen>
+     searches for all documents which have the term 'Utah' in the
+     body of text, and which have the term 'Springer' in the publisher
+     field, and sorts them in the order of the relevance ranking made on
+     the body-of-text index only.
+   </para>
+    <para>
+     Ranking weights may be used to pass a value to a ranking
+     algorithm, using the non-standard BIB-1 attribute type 9.
+     This allows one branch of a query to use one value while
+     another branch uses a different one.  For example, we can search
+     for <literal>utah</literal> in the title index with weight 30, as
+     well as in the ``any'' index with weight 20:
+     <screen>
+     @attr 2=102 @or @attr 9=30 @attr 1=4 utah @attr 9=20 utah
+     </screen>
+    </para>
+    <warning>
+     <para>
+      The ranking-weight feature is experimental. It may change in future
+      releases of Zebra, and is not yet mature enough for production use.
+     </para>
+    </warning>
+    
+   <para>
+     Notice that dynamic ranking can be enabled in server-side CQL
+     query expansion by adding <literal>@attr&nbsp;2=102</literal> to
+     the CQL config file. For example
+     <screen>
+      relationModifier.relevant                = 2=102
+     </screen>
+     invokes dynamic ranking each time a CQL query of the form 
+    <screen>
+     Z> querytype cql
+     Z> f alvis.text =/relevant house
+    </screen>
+     is issued. Dynamic ranking can also be automatically used on
+     specific CQL indexes by (for example) setting
+     <screen>
+      index.alvis.text                        = 1=text 2=102
+     </screen>
+     which then invokes dynamic ranking each time a CQL query of the form 
+    <screen>
+     Z> querytype cql
+     Z> f alvis.text = house
+    </screen>
+     is issued.
+   </para>
+
+    </sect2>
+
+
+ <sect2 id="administration-ranking-sorting">
+  <title>Sorting</title>
+   <para>
+     Zebra sorts efficiently using special sorting indexes
+     (type=<literal>s</literal>), so each sortable index must be known
+     at indexing time, specified in the configuration of record
+     indexing.  For example, to enable sorting according to the BIB-1
+     <literal>Date/time-added-to-db</literal> field, one could add the line
+     <screen>
+        xelm /*/@created               Date/time-added-to-db:s
+     </screen>
+     to any <literal>.abs</literal> record-indexing configuration file.
+     Similarly, one could add an indexing element of the form
+     <screen><![CDATA[       
+      <z:index name="date-modified" type="s">
+       <xsl:value-of select="some/xpath"/>
+      </z:index>
+      ]]></screen>
+     to any <literal>alvis</literal>-filter indexing stylesheet.
+     </para>
+     <para>
+      Sorting can be specified at search time using a query term
+      carrying the non-standard
+      BIB-1 attribute-type <literal>7</literal>.  This removes the
+      need to send a Z39.50 <literal>Sort Request</literal>
+      separately, and can dramatically improve latency when the client
+      and server are on separate networks.
+      The sorting part of the query is separate from the rest of the
+      query - the actual search specification - and must be combined
+      with it using OR.
+     </para>
+     <para>
+      A sorting subquery needs two attributes: an index (such as a
+      BIB-1 type-1 attribute) specifying which index to sort on, and a
+      type-7 attribute whose value is <literal>1</literal> for
+      ascending sorting, or <literal>2</literal> for descending.  The
+      term associated with the sorting attribute is the priority of
+      the sort key, where <literal>0</literal> specifies the primary
+      sort key, <literal>1</literal> the secondary sort key, and so
+      on.
+     </para>
+    <para>For example, a search for water, sorted by title (ascending),
+    is expressed by the PQF query
+     <screen>
+     @or @attr 1=1016 water @attr 7=1 @attr 1=4 0
+     </screen>
+      whereas a search for water, sorted by title ascending,
+     then by date descending, would be
+     <screen>
+     @or @or @attr 1=1016 water @attr 7=1 @attr 1=4 0 @attr 7=2 @attr 1=30 1
+     </screen>
+    </para>
+    <para>
+     Notice the fundamental differences between <literal>dynamic
+     ranking</literal> and <literal>sorting</literal>: there can be
+     only one ranking function defined and configured; but multiple
+     sorting indexes can be specified dynamically at search
+     time. Ranking does not need to use specific indexes, so
+     dynamic ranking can be enabled and disabled without
+     re-indexing, whereas sorting indexes need to be
+     defined before indexing.
+     </para>
+
+ </sect2>
+
+
   </sect1>
  
   <sect1 id="administration-extended-services">