Separate chapter about ranking

author Adam Dickmeiss <adam@indexdata.dk>

Fri, 5 Oct 2012 12:57:34 +0000 (14:57 +0200)

committer Adam Dickmeiss <adam@indexdata.dk>

Fri, 5 Oct 2012 12:57:34 +0000 (14:57 +0200)
diff --git a/doc/book.xml b/doc/book.xml

index 9f96d75..294031e 100644 (file)
--- a/doc/book.xml
+++ b/doc/book.xml
@@ -825,7 +825,77 @@
     
    </section>
    
-
+  <section id="relevance_ranking">
+   <title>Relevance ranking</title>
+   <para>
+    Pazpar2 uses a variant of the term frequency–inverse document frequency
+    (Tf-idf) ranking algorithm.
+   </para>
+   <para>
+    The Tf-part is straightforward to calculate and is based on the
+    documents that Pazpar2 fetches. The idf-part, however, is trickier
+    because the corpus at hand consists ONLY of the relevant documents and
+    not the irrelevant ones. Pazpar2 does not have the full corpus -- only the
+    documents that match a particular search.
+   </para>
+   <para>
+    Computation of the Tf-part is based on the normalized documents.
+    The length, the position and the terms are thus normalized at this point.
+    The computation is also performed for each document received from the
+    target, before merging takes place. The result of a TF-computation is
+    added to the TF-total of a cluster. Thus, if a document occurs twice,
+    then the TF-part is doubled. That, however, can be adjusted, because the
+    TF-part may be divided by the number of documents in a cluster.
+   </para>
+   <para>
+    The algorithm used by Pazpar2 has two phases. In phase one
+    Pazpar2 computes a tf-array. This is done as records are
+    fetched from the database. In this phase, the rank weight
+    <literal>w</literal> and the rank tweaks <literal>lead</literal>,
+    <literal>follow</literal> and <literal>length</literal> are in use.
+    
+   </para>
+   <screen><![CDATA[
+    tf[1,2,..N] = 0;
+    foreach document in a cluster
+       foreach field
+          w[1,2,..N] = 0;
+          for i = 1, .. N:  (each term)
+             foreach pos (where term i occurs in field)
+                // w is configured weight for field, pos is position of term
+                // in field, d is proximity distance between terms (see follow)
+                w[i] += w / (1 + log2(1+lead*pos))
+                if (d > 0)
+                    w[i] += w[i] * follow / (1+log2(d))
+          for i = 1, .. N:  // length: number of terms in field
+             if (length strategy is "linear")
+                tf[i] += w[i] / length;
+             else if (length strategy is "log")
+                tf[i] += w[i] / log2(length);
+             else if (length strategy is "none")
+                tf[i] += w[i];
+         ]]></screen>
+   <para>
+    In phase two, the idf-array is computed and the final score
+    is computed. This is done for each cluster as part of each show command.
+    The rank tweak <literal>cluster</literal> is in use here.
+   </para>
+   <screen><![CDATA[
+    // dococcur[i]: number of records where term occurs
+    // doctotal: number of records
+    for i = 1, .., N: (each term)
+      if (dococcur[i] > 0)
+         idf[i] = log(1 + doctotal / dococcur[i])
+      else
+         idf[i] = 0;
+
+    relevance = 0;
+    for i = 1, .., N: (each term)
+       if (cluster is "no")
+          tf[i] = tf[i] / cluster_size;
+       relevance += 100000 * tf[i] * idf[i];
+       ]]></screen>
+  </section> <!-- relevance_ranking -->
   </chapter> <!-- Using Pazpar2 -->
  
   <reference id="reference">
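The two phases described in the new section can be sketched in Python as below. This is only an illustration under stated assumptions, not Pazpar2's implementation: the input layout (`cluster`, `hits`, the `(pos, d)` pairs) and the function names are hypothetical, and the proximity distance `d` is taken as given because the text does not say how it is derived. The sketch multiplies tf by idf, in line with the "TF*IDF score" wording in pazpar2_conf.xml, and averages over the cluster when `cluster` is "no", as the attribute description states.

```python
import math


def phase_one(cluster, n_terms, lead=0.0, follow=0.0, length="linear"):
    """Accumulate the tf-array over all fields of all documents in a cluster.

    Hypothetical input layout: cluster is a list of documents, a document is
    a list of fields, and a field is a dict with
      "weight": configured rank weight w of the field,
      "length": number of terms in the field,
      "hits":   {term_index: [(pos, d), ...]} where pos is the term position
                and d the proximity distance (0 means no proximity boost).
    """
    tf = [0.0] * n_terms
    for document in cluster:
        for field in document:
            w = [0.0] * n_terms
            for i, hits in field["hits"].items():
                for pos, d in hits:
                    # 'lead' reduces weight for terms far from the field start
                    w[i] += field["weight"] / (1 + math.log2(1 + lead * pos))
                    # 'follow' boosts weight for terms close to each other
                    if d > 0:
                        w[i] += w[i] * follow / (1 + math.log2(d))
            for i in range(n_terms):
                if length == "linear":
                    tf[i] += w[i] / field["length"]
                elif length == "log":
                    # assumes field length > 1, since log2(1) is 0
                    tf[i] += w[i] / math.log2(field["length"])
                else:  # "none"
                    tf[i] += w[i]
    return tf


def phase_two(tf, dococcur, doctotal, cluster_size, cluster="yes"):
    """Combine the tf-array with idf into one relevance score per cluster."""
    relevance = 0.0
    for i, tf_i in enumerate(tf):
        idf = math.log(1 + doctotal / dococcur[i]) if dococcur[i] > 0 else 0.0
        if cluster == "no":
            tf_i = tf_i / cluster_size  # average instead of boosting
        relevance += 100000 * tf_i * idf
    return relevance


# One document with one four-term field; term 0 occurs once at position 1.
example = [[{"weight": 1.0, "length": 4, "hits": {0: [(1, 0)]}}]]
print(phase_one(example, 1))  # [0.25]
```

With `follow=1` and a hit at distance 1, the term weight doubles before length division, which matches the description of the `follow` attribute.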
diff --git a/doc/pazpar2_conf.xml b/doc/pazpar2_conf.xml

index d7de16f..cbad09d 100644 (file)
--- a/doc/pazpar2_conf.xml
+++ b/doc/pazpar2_conf.xml
@@ -268,7 +268,7 @@
               M [F N]
              </literallayout>
              where M is an integer, used as a
-            multiplier against the basic TF*IDF score. A value of
+            weight against the basic TF*IDF score. A value of
              1 is the base, higher values give additional weight to
              elements of this type. The default is '0', which
              excludes this element from the rank calculation.
@@ -289,6 +289,8 @@
              The per field rank was introduced in Pazpar2 1.6.15. Earlier
              releases only allowed a rank value M (simple integer).
             </para>
+           <para>See <xref linkend="relevance_ranking"/> for more
+           about ranking.</para>
            </listitem>
           </varlistentry>
           
@@ -585,18 +587,85 @@
         <term>rank</term>
         <listitem>
          <para>
-         Customizes the ranking (relevance) algorithm.
-         Attribute 'cluster' is a boolean
-         that controls whether Pazpar2 should boost ranking for merged
-         records. Is 'yes' by default. A value of 'no' will make
-         Pazpar2 average ranking of each record in a cluster.
+         Customizes the ranking (relevance) algorithm. These settings
+         are also known as rank tweaks. The rank element
+         accepts the following attributes, all of which are optional:
          </para>
+        <variablelist>
+         <varlistentry>
+          <term>cluster</term>
+          <listitem>
+           <para>
+            Attribute 'cluster' is a boolean
+            that controls whether Pazpar2 should boost ranking for merged
+            records. Is 'yes' by default. A value of 'no' will make
+            Pazpar2 average ranking of each record in a cluster.
+           </para>
+          </listitem>
+         </varlistentry>
+         <varlistentry>
+          <term>debug</term>
+          <listitem>
+           <para>
+            Attribute 'debug' is a boolean
+            that controls whether Pazpar2 should include details
+            about ranking for each document in the show command's
+            response. Enable by using value "yes", disable by using
+            value "no" (default).
+           </para>
+          </listitem>
+         </varlistentry>
+         <varlistentry>
+          <term>follow</term>
+          <listitem>
+           <para>
+            Attribute 'follow' is a floating point number greater than
+            or equal to 0. A positive number will boost weight for terms
+            that occur close to each other (proximity, distance).
+            A value of 1 will double the weight if two terms are in
+            proximity distance of 1 (next to each other). The default
+            value of 'follow' is 0 (order will not affect weight).
+           </para>
+          </listitem>
+         </varlistentry>
+         <varlistentry>
+          <term>lead</term>
+          <listitem>
+           <para>
+            Attribute 'lead' is a floating point number.
+            It controls whether term weight should be reduced by position
+            from the start of a metadata field. A positive value of 'lead'
+            will reduce the weight of a term as it appears further away from
+            the lead of the field. Default value is 0 (no reduction of weight by
+            position).
+           </para>
+          </listitem>
+         </varlistentry>
+         <varlistentry>
+          <term>length</term>
+          <listitem>
+           <para>
+            Attribute 'length' determines how/if term weight should be
+            divided by the length of the metadata field. A value of "linear"
+            will divide by length. A value of "log" will divide by log2(length).
+            A value of "none" will leave term weight as is (no division).
+            Default value is "linear".
+           </para>
+          </listitem>
+         </varlistentry>
+        </variablelist>
          <para>
-         This configuration was added in pazpar2 1.6.18.
+         Refer to <xref linkend="relevance_ranking"/> to see how
+         these tweaks are used in computation of score.
+        </para>
+        <para>
+         Customization of the ranking algorithm was introduced with
+         Pazpar2 1.6.18. The semantics of some of the fields changed
+         in versions up to 1.6.21.
          </para>
         </listitem>
         </varlistentry>
-
+       
         <varlistentry id="sort-default">
         <term>sort-default</term>
         <listitem>
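To make the documented defaults of the rank tweaks concrete, here is a small Python sketch that reads a `rank` element and fills in the defaults listed in the patch. It is an illustration only: `parse_rank` and `DEFAULTS` are hypothetical names, the sample element is an assumed form of the configuration (attribute names and defaults come from the patch, the exact surrounding syntax does not), and Pazpar2 itself does not parse its configuration this way.

```python
import xml.etree.ElementTree as ET

# Defaults as documented in the patch for the rank element.
DEFAULTS = {
    "cluster": "yes",    # boost merged records; "no" averages per cluster
    "debug": "no",       # no ranking details in the show response
    "follow": 0.0,       # no proximity boost
    "lead": 0.0,         # no reduction of weight by position
    "length": "linear",  # divide term weight by field length
}


def parse_rank(xml_text):
    """Return the rank tweaks from a <rank .../> element, with defaults."""
    elem = ET.fromstring(xml_text)
    if elem.tag != "rank":
        raise ValueError("expected a rank element")
    tweaks = dict(DEFAULTS)
    for name, value in elem.attrib.items():
        if name not in DEFAULTS:
            raise ValueError("unknown rank attribute: " + name)
        if name in ("follow", "lead"):
            tweaks[name] = float(value)
        elif name == "length":
            if value not in ("linear", "log", "none"):
                raise ValueError("length must be linear, log or none")
            tweaks[name] = value
        else:  # cluster and debug are booleans written as yes/no
            if value not in ("yes", "no"):
                raise ValueError(name + " must be yes or no")
            tweaks[name] = value
    if tweaks["follow"] < 0:
        raise ValueError("follow must be greater than or equal to 0")
    return tweaks


# Only two tweaks are set; the rest fall back to the documented defaults.
print(parse_rank('<rank cluster="no" follow="1"/>'))
```

Since every attribute is optional, an empty `<rank/>` element yields exactly the defaults.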