From: Adam Dickmeiss <adam@indexdata.dk>
Date: Fri, 5 Oct 2012 12:57:34 +0000 (+0200)
Subject: Separate chapter about ranking
X-Git-Tag: v1.6.22~5^2~2^2~1
X-Git-Url: http://sru.miketaylor.org.uk/cgi-bin?a=commitdiff_plain;h=383965c54b667bba902d92d86b9eef4f3fa1810f;p=pazpar2-moved-to-github.git

Separate chapter about ranking
---
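Note: the two pseudocode phases that this patch adds to doc/book.xml can
be read as the following Python sketch. It is an illustration under
stated assumptions, not the Pazpar2 implementation. The function names
and the dict layout for fields and hits are hypothetical. The sketch
multiplies tf by idf, as the "TF*IDF" wording in doc/pazpar2_conf.xml
implies, and averages over the cluster only when cluster is "no", as the
attribute description states. The proximity distance d is supplied by
the caller, and doctotal / dococcur is assumed to be a floating point
division.

```python
import math


def phase_one_tf(documents, num_terms, lead=0.0, follow=0.0, length="linear"):
    """Accumulate the tf-array for one cluster (phase one).

    documents: list of documents; each document is a list of fields; each
    field is a dict {"weight": w, "length": n, "hits": [(i, pos, d), ...]}.
    This layout is hypothetical and chosen only for this sketch.
    """
    tf = [0.0] * num_terms
    for fields in documents:
        for field in fields:
            w = [0.0] * num_terms
            for i, pos, d in field["hits"]:
                # Weight shrinks with distance from the start of the field.
                w[i] += field["weight"] / (1 + math.log2(1 + lead * pos))
                if d > 0:
                    # Proximity boost: d == 1 with follow == 1 doubles w[i].
                    w[i] += w[i] * follow / (1 + math.log2(d))
            for i in range(num_terms):
                if length == "linear":
                    tf[i] += w[i] / field["length"]
                elif length == "log":
                    # As in the pseudocode; undefined for a one-term field.
                    tf[i] += w[i] / math.log2(field["length"])
                else:  # "none"
                    tf[i] += w[i]
    return tf


def phase_two_relevance(tf, dococcur, doctotal, cluster_size, cluster=True):
    """Combine the tf-array with idf into one score (phase two)."""
    relevance = 0.0
    for tf_i, occur in zip(tf, dococcur):
        # Terms that occur in no record contribute nothing.
        idf_i = math.log(1 + doctotal / occur) if occur > 0 else 0.0
        if not cluster:
            # cluster="no": average over the records in the cluster.
            tf_i = tf_i / cluster_size
        relevance += 100000 * tf_i * idf_i
    return relevance


# One record with a single four-term field of weight 2; term 0 at position 0.
docs = [[{"weight": 2, "length": 4, "hits": [(0, 0, 0)]}]]
tf = phase_one_tf(docs, num_terms=2)
score = phase_two_relevance(tf, dococcur=[1, 0], doctotal=1, cluster_size=1)
```

Passing d with each hit keeps the sketch independent of how Pazpar2
actually measures proximity, which the pseudocode does not spell out.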

diff --git a/doc/book.xml b/doc/book.xml
index 9f96d75..294031e 100644
--- a/doc/book.xml
+++ b/doc/book.xml
@@ -825,7 +825,77 @@
    
   </section>
   
-
+  <section id="relevance_ranking">
+   <title>Relevance ranking</title>
+   <para>
+    Pazpar2 uses a variant of the term frequency–inverse document frequency
+    (Tf-idf) ranking algorithm.
+   </para>
+   <para>
+    The Tf-part is straightforward to calculate and is based on the
+    documents that Pazpar2 fetches. The idf-part, however, is trickier,
+    since the corpus at hand consists ONLY of the relevant documents and
+    not the irrelevant ones. Pazpar2 does not have the full corpus -- only
+    the documents that match a particular search.
+   </para>
+   <para>
+    Computation of the Tf-part is based on the normalized documents.
+    The length, the position and the terms are thus normalized at this point.
+    Also, the computation is performed for each document received from the
+    target - before merging takes place. The result of a TF-computation is
+    added to the TF-total of a cluster. Thus, if a document occurs twice,
+    then the TF-part is doubled. That, however, can be adjusted, because the
+    TF-part may be divided by the number of documents in a cluster.
+   </para>
+   <para>
+    The algorithm used by Pazpar2 has two phases. In phase one,
+    Pazpar2 computes a tf-array. This is done as records are
+    fetched from the database. In this phase, the rank weight
+    <literal>w</literal> and the rank tweaks <literal>lead</literal>,
+    <literal>follow</literal> and <literal>length</literal> are in use.
+    
+   </para>
+   <screen><![CDATA[
+    tf[1,2,..N] = 0;
+    foreach document in a cluster
+       foreach field
+          w[1,2,..N] = 0;
+          for i = 1, .. N:  (each term)
+             foreach pos (where term i occurs in field)
+                // w is configured weight for field
+                // pos is position of term in field; d is proximity distance
+                w[i] += w / (1 + log2(1+lead*pos))
+                if (d > 0)
+                    w[i] += w[i] * follow / (1+log2(d))
+          for i = 1, .. N:  (each term; length is number of terms in field)
+             if (length strategy is "linear")
+                tf[i] += w[i] / length;
+             else if (length strategy is "log")
+                tf[i] += w[i] / log2(length);
+             else if (length strategy is "none")
+                tf[i] += w[i];
+	  ]]></screen>
+   <para>
+    In phase two, the idf-array and the final score are computed.
+    This is done for each cluster as part of each show command.
+    The rank tweak <literal>cluster</literal> is in use here.
+   </para>
+   <screen><![CDATA[
+    // dococcur[i]: number of records where term occurs
+    // doctotal: number of records
+    for i = 1, .., N (each term)
+      if (dococcur[i] > 0)
+         idf[i] = log(1 + doctotal / dococcur[i])
+      else
+         idf[i] = 0;
+
+    relevance = 0;
+    for i = 1, .., N: (each term)
+       if (cluster is "no")
+          tf[i] = tf[i] / cluster_size;
+       relevance += 100000 * tf[i] * idf[i];
+       ]]></screen>
+  </section> <!-- relevance_ranking -->
  </chapter> <!-- Using Pazpar2 -->
 
  <reference id="reference">
diff --git a/doc/pazpar2_conf.xml b/doc/pazpar2_conf.xml
index d7de16f..cbad09d 100644
--- a/doc/pazpar2_conf.xml
+++ b/doc/pazpar2_conf.xml
@@ -268,7 +268,7 @@
 	      M [F N]
 	     </literallayout>
 	     where M is an integer, used as a
-	     multiplier against the basic TF*IDF score. A value of
+	     weight against the basic TF*IDF score. A value of
 	     1 is the base, higher values give additional weight to
 	     elements of this type. The default is '0', which
 	     excludes this element from the rank calculation.
@@ -289,6 +289,8 @@
 	     The per field rank was introduced in Pazpar2 1.6.15. Earlier
 	     releases only allowed a rank value M (simple integer).
 	    </para>
+	    <para>See <xref linkend="relevance_ranking"/> for more
+	    about ranking.</para>
 	   </listitem>
 	  </varlistentry>
 	  
@@ -585,18 +587,85 @@
 	<term>rank</term>
 	<listitem>
 	 <para>
-	  Customizes the ranking (relevance) algorithm.
-	  Attribute 'cluster' is a boolean
-	  that controls whether Pazpar2 should boost ranking for merged
-	  records. Is 'yes' by default. A value of 'no' will make
-	  Pazpar2 average ranking of each record in a cluster.
+	  Customizes the ranking (relevance) algorithm. These settings are
+	  also known as rank tweaks. The rank element
+	  accepts the following attributes, all of which are optional:
 	 </para>
+	 <variablelist>
+	  <varlistentry>
+	   <term>cluster</term>
+	   <listitem>
+	    <para>
+	     Attribute 'cluster' is a boolean
+	     that controls whether Pazpar2 should boost ranking for merged
+	     records. The default is 'yes'. A value of 'no' will make
+	     Pazpar2 average the ranking of the records in a cluster.
+	    </para>
+	   </listitem>
+	  </varlistentry>
+	  <varlistentry>
+	   <term>debug</term>
+	   <listitem>
+	    <para>
+	     Attribute 'debug' is a boolean
+	     that controls whether Pazpar2 should include details
+	     about ranking for each document in the show command's
+	     response. Enable by using value "yes", disable by using
+	     value "no" (default).
+	    </para>
+	   </listitem>
+	  </varlistentry>
+	  <varlistentry>
+	   <term>follow</term>
+	   <listitem>
+	    <para>
+	     Attribute 'follow' is a floating point number greater than
+	     or equal to 0. A positive number will boost weight for terms
+	     that occur close to each other (proximity, distance).
+	     A value of 1 will double the weight if two terms are in
+	     proximity distance of 1 (next to each other). The default
+	     value of 'follow' is 0 (order will not affect weight).
+	    </para>
+	   </listitem>
+	  </varlistentry>
+	  <varlistentry>
+	   <term>lead</term>
+	   <listitem>
+	    <para>
+	     Attribute 'lead' is a floating point number.
+	     It controls whether term weight should be reduced by position
+	     from the start of a metadata field. A positive value of 'lead'
+	     will reduce the weight of a term as it appears further away
+	     from the lead of the field. Default value is 0 (no reduction of
+	     weight by position).
+	    </para>
+	   </listitem>
+	  </varlistentry>
+	  <varlistentry>
+	   <term>length</term>
+	   <listitem>
+	    <para>
+	     Attribute 'length' determines how/if term weight should be
+	     divided by the length of the metadata field. A value of "linear"
+	     will divide by length. A value of "log" will divide by log2(length).
+	     A value of "none" will leave term weight as is (no division).
+	     Default value is "linear".
+	    </para>
+	   </listitem>
+	  </varlistentry>
+	 </variablelist>
 	 <para>
-	  This configuration was added in pazpar2 1.6.18.
+	  Refer to <xref linkend="relevance_ranking"/> to see how
+	  these tweaks are used in computation of score.
+	 </para>
+	 <para>
+	  Customization of the ranking algorithm was introduced with
+	  Pazpar2 1.6.18. The semantics of some of the fields changed
+	  in versions up to 1.6.21.
 	 </para>
 	</listitem>
        </varlistentry>
-
+       
        <varlistentry id="sort-default">
 	<term>sort-default</term>
 	<listitem>