Fixed bug #736: Updates gets slower. The problem was that duplicate

[idzebra-moved-to-github.git] / doc / recordmodel-alvisxslt.xml
diff --git a/doc/recordmodel-alvisxslt.xml b/doc/recordmodel-alvisxslt.xml

index 764190d..f9ad320 100644 (file)
--- a/doc/recordmodel-alvisxslt.xml
+++ b/doc/recordmodel-alvisxslt.xml
@@ -1,5 +1,5 @@
   <chapter id="record-model-alvisxslt">
-  <!-- $Id: recordmodel-alvisxslt.xml,v 1.1 2006-02-15 11:07:47 marc Exp $ -->
+  <!-- $Id: recordmodel-alvisxslt.xml,v 1.11 2006-11-13 14:53:40 marc Exp $ -->
    <title>ALVIS XML Record Model and Filter Module</title>
    
  
@@ -12,44 +12,437 @@
     releases of the Zebra Information Server.
    </para>
  
-  
-
+  <para> This filter has been developed under the 
+   <ulink url="http://www.alvis.info/">ALVIS</ulink> project funded by
+   the European Community under the "Information Society Technologies"
+   Program (2002-2006).
+  </para>
    
    
-  <sect1 id="record-model-alvisxslt-filter">
-   <title>ALLVIS Record Filter</title>
+  <section id="record-model-alvisxslt-filter">
+   <title>ALVIS Record Filter</title>
     <para>
-    The experimental, loadable  Alvis XM/XSLT filter module
+    The experimental, loadable  Alvis XML/XSLT filter module
     <literal>mod-alvis.so</literal> is packaged in the GNU/Debian package
      <literal>libidzebra1.4-mod-alvis</literal>.
+    It is invoked by the <filename>zebra.cfg</filename> configuration statement
+    <screen>
+     recordtype.xml: alvis.db/filter_alvis_conf.xml
+    </screen>
+    In this example on all data files with suffix 
+    <filename>*.xml</filename>, where the
+    Alvis XSLT filter configuration file is found in the
+    path <filename>db/filter_alvis_conf.xml</filename>.
+   </para>
+   <para>The Alvis XSLT filter configuration file must be
+    valid XML. It might look like this (This example is
+    used for indexing and display of OAI harvested records):
+    <screen>
+    &lt;?xml version="1.0" encoding="UTF-8"?&gt;
+      &lt;schemaInfo&gt;
+        &lt;schema name="identity" stylesheet="xsl/identity.xsl" /&gt;
+        &lt;schema name="index" identifier="http://indexdata.dk/zebra/xslt/1"
+            stylesheet="xsl/oai2index.xsl" /&gt;
+        &lt;schema name="dc" stylesheet="xsl/oai2dc.xsl" /&gt;
+        &lt;!-- use split level 2 when indexing whole OAI Record lists --&gt;
+        &lt;split level="2"/&gt;
+      &lt;/schemaInfo&gt;
+    </screen> 
+   </para>
+   <para>
+    All named stylesheets defined inside
+    <literal>schema</literal> element tags 
+    are for presentation after search, including
+    the indexing stylesheet (which is a great debugging help). The
+    names defined in the <literal>name</literal> attributes must be
+    unique, these are the literal <literal>schema</literal> or 
+    <literal>element set</literal> names used in 
+      <ulink url="http://www.loc.gov/standards/sru/srw/">SRW</ulink>,
+      <ulink url="&url.sru;">SRU</ulink> and
+    Z39.50 protocol queries.
+    The paths in the <literal>stylesheet</literal> attributes
+    are relative to zebras working directory, or absolute to file
+    system root.
+   </para>
+   <para>
+    The <literal>&lt;split level="2"/&gt;</literal> decides where the
+    XML Reader shall split the
+    collections of records into individual records, which then are
+    loaded into DOM, and have the indexing XSLT stylesheet applied.
+   </para>
+   <para>
+    There must be exactly one indexing XSLT stylesheet, which is
+    defined by the magic attribute  
+    <literal>identifier="http://indexdata.dk/zebra/xslt/1"</literal>.
     </para>
  
-   <sect2 id="record-model-alvisxslt-internal">
-    <title>ALLVIS Internal Record Representation</title>   
-    <para>FIXME</para>
-   </sect2>
-
-   <sect2 id="record-model-alvisxslt-canonical">
-    <title>ALLVIS Canonical Format</title>   
-    <para>FIXME</para>
-   </sect2>
-
-
-  </sect1>
-
-
-  <sect1 id="record-model-alvisxslt-conf">
-   <title>ALLVIS Record Model Configuration</title>
-   <para>FIXME</para>
-
-
-
-  <sect2 id="record-model-alvisxslt-exchange">
+   <section id="record-model-alvisxslt-internal">
+    <title>ALVIS Internal Record Representation</title>   
+    <para>When indexing, an XML Reader is invoked to split the input
+    files into suitable record XML pieces. Each record piece is then
+    transformed to an XML DOM structure, which is essentially the
+    record model. Only XSLT transformations can be applied during
+    index, search and retrieval. Consequently, output formats are
+    restricted to whatever XSLT can deliver from the record XML
+    structure, be it other XML formats, HTML, or plain text. In case
+    you have <literal>libxslt1</literal> running with EXSLT support,
+    you can use this functionality inside the Alvis
+    filter configuration XSLT stylesheets.
+    </para>
+   </section>
+
+   <section id="record-model-alvisxslt-canonical">
+    <title>ALVIS Canonical Indexing Format</title>   
+    <para>The output of the indexing XSLT stylesheets must contain
+    certain elements in the magic 
+     <literal>xmlns:z="http://indexdata.dk/zebra/xslt/1"</literal>
+    namespace. The output of the XSLT indexing transformation is then
+    parsed using DOM methods, and the contained instructions are
+    performed on the <emphasis>magic elements and their
+    subtrees</emphasis>.
+    </para>
+    <para>
+    For example, the output of the command
+     <screen>  
+      xsltproc xsl/oai2index.xsl one-record.xml
+     </screen> 
+     might look like this:
+     <screen>
+      &lt;?xml version="1.0" encoding="UTF-8"?&gt;
+      &lt;z:record xmlns:z="http://indexdata.dk/zebra/xslt/1" 
+           z:id="oai:JTRS:CP-3290---Volume-I" 
+           z:rank="47896"
+           z:type="update"&gt;
+       &lt;z:index name="oai_identifier" type="0"&gt;
+                oai:JTRS:CP-3290---Volume-I&lt;/z:index&gt;
+       &lt;z:index name="oai_datestamp" type="0"&gt;2004-07-09&lt;/z:index&gt;
+       &lt;z:index name="oai_setspec" type="0"&gt;jtrs&lt;/z:index&gt;
+       &lt;z:index name="dc_all" type="w"&gt;
+          &lt;z:index name="dc_title" type="w"&gt;Proceedings of the 4th 
+                International Conference and Exhibition:
+                World Congress on Superconductivity - Volume I&lt;/z:index&gt;
+          &lt;z:index name="dc_creator" type="w"&gt;Kumar Krishen and *Calvin
+                Burnham, Editors&lt;/z:index&gt;
+       &lt;/z:index&gt;
+     &lt;/z:record&gt;
+     </screen>
+    </para>
+    <para>This means the following: From the original XML file 
+     <literal>one-record.xml</literal> (or from the XML record DOM of the
+     same form coming from a splitted input file), the indexing
+     stylesheet produces an indexing XML record, which is defined by
+     the <literal>record</literal> element in the magic namespace
+     <literal>xmlns:z="http://indexdata.dk/zebra/xslt/1"</literal>.
+     Zebra uses the content of 
+     <literal>z:id="oai:JTRS:CP-3290---Volume-I"</literal> as internal
+     record ID, and - in case static ranking is set - the content of 
+     <literal>z:rank="47896"</literal> as static rank. Following the
+     discussion in <xref linkend="administration-ranking"/>
+     we see that this records is internally ordered
+     lexicographically according to the value of the string
+     <literal>oai:JTRS:CP-3290---Volume-I47896</literal>.
+     The type of action performed during indexing is defined by
+     <literal>z:type="update"&gt;</literal>, with recognized values
+     <literal>insert</literal>, <literal>update</literal>, and 
+     <literal>delete</literal>. 
+    </para>
+    <para>In this example, the following literal indexes are constructed:
+     <screen>
+       oai:identifier
+       oai:datestamp
+       oai:setspec
+       dc:all
+       dc:title
+       dc:creator
+     </screen>
+     where the indexing type is defined in the 
+     <literal>type</literal> attribute 
+     (any value from the standard configuration
+     file <filename>default.idx</filename> will do). Finally, any 
+     <literal>text()</literal> node content recursively contained
+     inside the <literal>index</literal> will be filtered through the
+     appropriate charmap for character normalization, and will be
+     inserted in the index.
+    </para>
+    <para>
+     Specific to this example, we see that the single word
+     <literal>oai:JTRS:CP-3290---Volume-I</literal> will be literal,
+     byte for byte without any form of character normalization,
+     inserted into the index named <literal>oai:identifier</literal>,
+     the text 
+     <literal>Kumar Krishen and *Calvin Burnham, Editors</literal>
+     will be inserted using the <literal>w</literal> character
+     normalization defined in <filename>default.idx</filename> into
+     the index <literal>dc:creator</literal> (that is, after character
+     normalization the index will keep the inidividual words 
+     <literal>kumar</literal>, <literal>krishen</literal>, 
+     <literal>and</literal>, <literal>calvin</literal>,
+     <literal>burnham</literal>, and <literal>editors</literal>), and
+     finally both the texts
+     <literal>Proceedings of the 4th International Conference and Exhibition:
+      World Congress on Superconductivity - Volume I</literal> 
+     and
+     <literal>Kumar Krishen and *Calvin Burnham, Editors</literal> 
+     will be inserted into the index <literal>dc:all</literal> using
+     the same character normalization map <literal>w</literal>. 
+    </para>
+    <para>
+     Finally, this example configuration can be queried using PQF
+     queries, either transported by Z39.50, (here using a yaz-client) 
+     <screen>
+      <![CDATA[
+      Z> open localhost:9999
+      Z> elem dc
+      Z> form xml
+      Z>
+      Z> f @attr 1=dc:creator Kumar
+      Z> scan @attr 1=dc:creator adam
+      Z>
+      Z> f @attr 1=dc:title @attr 4=2 "proceeding congress superconductivity"
+      Z> scan @attr 1=dc:title abc
+      ]]>
+     </screen>
+     or the proprietary
+     extentions <literal>x-pquery</literal> and
+     <literal>x-pScanClause</literal> to
+     SRU, and SRW
+     <screen>
+      <![CDATA[
+      http://localhost:9999/?version=1.1&operation=searchRetrieve&x-pquery=%40attr+1%3Ddc%3Acreator+%40attr+4%3D6+%22the
+      http://localhost:9999/?version=1.1&operation=scan&x-pScanClause=@attr+1=dc:date+@attr+4=2+a
+      ]]>
+     </screen>
+     See <xref linkend="zebrasrv-sru"/> for more information on SRU/SRW
+     configuration, and <xref linkend="gfs-config"/> or the YAZ
+     <ulink url="&url.yaz.cql;">CQL section</ulink>
+     for the details or the YAZ frontend server.
+    </para>
+    <para>
+     Notice that there are no <filename>*.abs</filename>,
+     <filename>*.est</filename>, <filename>*.map</filename>, or other GRS-1
+     filter configuration files involves in this process, and that the
+     literal index names are used during search and retrieval.
+    </para>
+   </section>
+  </section>
+
+
+  <section id="record-model-alvisxslt-conf">
+   <title>ALVIS Record Model Configuration</title>
+
+
+  <section id="record-model-alvisxslt-index">
+   <title>ALVIS Indexing Configuration</title>
+    <para>
+     As mentioned above, there can be only one indexing
+     stylesheet, and configuration of the indexing process is a synonym
+     of writing an XSLT stylesheet which produces XML output containing the
+     magic elements discussed in  
+     <xref linkend="record-model-alvisxslt-internal"/>. 
+     Obviously, there are million of different ways to accomplish this
+     task, and some comments and code snippets are in order to lead
+     our paduans on the right track to the  good side of the force.
+    </para>
+    <para>
+     Stylesheets can be written in the <emphasis>pull</emphasis> or
+     the <emphasis>push</emphasis> style: <emphasis>pull</emphasis>
+     means that the output XML structure is taken as starting point of
+     the internal structure of the XSLT stylesheet, and portions of
+     the input XML are <emphasis>pulled</emphasis> out and inserted
+     into the right spots of the output XML structure. On the other
+     side, <emphasis>push</emphasis> XSLT stylesheets are recursavly
+     calling their template definitions, a process which is commanded
+     by the input XML structure, and avake to produce some output XML
+     whenever some special conditions in the input styelsheets are
+     met. The <emphasis>pull</emphasis> type is well-suited for input
+     XML with strong and well-defined structure and semantcs, like the
+     following OAI indexing example, whereas the
+     <emphasis>push</emphasis> type might be the only possible way to
+     sort out deeply recursive input XML formats.
+    </para>
+    <para> 
+     A <emphasis>pull</emphasis> stylesheet example used to index
+     OAI harvested records could use some of the following template
+     definitions:
+     <screen>
+      <![CDATA[
+      <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
+       xmlns:z="http://indexdata.dk/zebra/xslt/1"
+       xmlns:oai="http://www.openarchives.org/OAI/2.0/" 
+       xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" 
+       xmlns:dc="http://purl.org/dc/elements/1.1/"
+       version="1.0">
+
+       <xsl:output indent="yes" method="xml" version="1.0" encoding="UTF-8"/>
+
+        <!-- disable all default text node output -->
+        <xsl:template match="text()"/>
+
+         <!-- match on oai xml record root -->
+         <xsl:template match="/">    
+          <z:record z:id="{normalize-space(oai:record/oai:header/oai:identifier)}" 
+           z:type="update">
+           <!-- you might want to use z:rank="{some XSLT function here}" --> 
+           <xsl:apply-templates/>
+          </z:record>
+         </xsl:template>
+
+         <!-- OAI indexing templates -->
+         <xsl:template match="oai:record/oai:header/oai:identifier">
+          <z:index name="oai_identifier" type="0">
+           <xsl:value-of select="."/>
+          </z:index>    
+         </xsl:template>
+
+         <!-- etc, etc -->
+
+         <!-- DC specific indexing templates -->
+         <xsl:template match="oai:record/oai:metadata/oai_dc:dc/dc:title">
+          <z:index name="dc_title" type="w">
+           <xsl:value-of select="."/>
+          </z:index>
+         </xsl:template>
+
+         <!-- etc, etc -->
+ 
+      </xsl:stylesheet>
+      ]]>
+     </screen>
+    </para>
+    <para>
+     Notice also,
+     that the names and types of the indexes can be defined in the
+     indexing XSLT stylesheet <emphasis>dynamically according to
+     content in the original XML records</emphasis>, which has
+     opportunities for great power and wizardery as well as grande
+     disaster.  
+    </para>
+    <para>
+     The following excerpt of a <emphasis>push</emphasis> stylesheet
+     <emphasis>might</emphasis> 
+     be a good idea according to your strict control of the XML
+     input format (due to rigerours checking against well-defined and
+     tight RelaxNG or XML Schema's, for example):
+     <screen>
+      <![CDATA[
+      <xsl:template name="element-name-indexes">     
+       <z:index name="{name()}" type="w">
+        <xsl:value-of select="'1'"/>
+       </z:index>
+      </xsl:template>
+      ]]>
+     </screen>
+     This template creates indexes which have the name of the working 
+     node of any input  XML file, and assigns a '1' to the index.
+     The example query 
+     <literal>find @attr 1=xyz 1</literal> 
+     finds all files which contain at least one
+     <literal>xyz</literal> XML element. In case you can not control
+     which element names the input files contain, you might ask for
+     disaster and bad karma using this technique.
+    </para>
+    <para>
+     One variation over the theme <emphasis>dynamically created
+     indexes</emphasis> will definitely be unwise:
+     <screen>
+      <![CDATA[  
+      <!-- match on oai xml record root -->
+      <xsl:template match="/">    
+       <z:record z:type="update">
+      
+        <!-- create dynamic index name from input content --> 
+        <xsl:variable name="dynamic_content">
+         <xsl:value-of select="oai:record/oai:header/oai:identifier"/>
+        </xsl:variable>
+        
+        <!-- create zillions of indexes with unknown names -->
+        <z:index name="{$dynamic_content}" type="w">
+         <xsl:value-of select="oai:record/oai:metadata/oai_dc:dc"/>
+        </z:index>          
+       </z:record>
+       
+      </xsl:template>
+      ]]>
+     </screen>
+     Don't be tempted to cross
+     the line to the dark side of the force, paduan; this leads
+     to suffering and pain, and universal
+     disentigration of your project schedule.
+    </para>
+  </section>
+
+  <section id="record-model-alvisxslt-elementset">
     <title>ALVIS Exchange Formats</title>
-   <para>FIXME</para>
-  </sect2>
-
-  </sect1>
+   <para>
+     An exchange format can be anything which can be the outcome of an
+     XSLT transformation, as far as the stylesheet is registered in
+     the main Alvis XSLT filter configuration file, see
+     <xref linkend="record-model-alvisxslt-filter"/>.
+     In principle anything that can be expressed in  XML, HTML, and
+     TEXT can be the output of a <literal>schema</literal> or 
+    <literal>element set</literal> directive during search, as long as
+     the information comes from the 
+     <emphasis>original input record XML DOM tree</emphasis>
+     (and not the transformed and <emphasis>indexed</emphasis> XML!!).
+    </para>
+    <para>
+     In addition, internal administrative information from the Zebra
+     indexer can be accessed during record retrieval. The following
+     example is a summary of the possibilities:
+     <screen>
+      <![CDATA[  
+      <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
+       xmlns:z="http://indexdata.dk/zebra/xslt/1"
+       version="1.0">
+
+       <!-- register internal zebra parameters -->       
+       <xsl:param name="id" select="''"/>
+       <xsl:param name="filename" select="''"/>
+       <xsl:param name="score" select="''"/>
+       <xsl:param name="schema" select="''"/>
+           
+       <xsl:output indent="yes" method="xml" version="1.0" encoding="UTF-8"/>
+
+       <!-- use then for display of internal information -->
+       <xsl:template match="/">
+         <z:zebra>
+           <id><xsl:value-of select="$id"/></id>
+           <filename><xsl:value-of select="$filename"/></filename>
+           <score><xsl:value-of select="$score"/></score>
+           <schema><xsl:value-of select="$schema"/></schema>
+         </z:zebra>
+       </xsl:template>
+
+      </xsl:stylesheet>
+      ]]>
+     </screen>
+    </para>
+
+  </section>
+
+  <section id="record-model-alvisxslt-example">
+   <title>ALVIS Filter OAI Indexing Example</title>
+   <para>
+     The sourcecode tarball contains a working Alvis filter example in
+     the directory <filename>examples/alvis-oai/</filename>, which
+     should get you started.  
+    </para>
+    <para>
+     More example data can be harvested from any OAI complient server,
+     see details at the  OAI 
+     <ulink url="http://www.openarchives.org/">
+      http://www.openarchives.org/</ulink> web site, and the community
+      links at 
+     <ulink url="http://www.openarchives.org/community/index.html">
+      http://www.openarchives.org/community/index.html</ulink>.
+     There is a  tutorial
+     found at
+     <ulink url="http://www.oaforum.org/tutorial/">
+      http://www.oaforum.org/tutorial/</ulink>.
+    </para>
+   </section>
+
+  </section>
  
    
   </chapter>
@@ -72,7 +465,7 @@ c)  Main "alvis" XSLT filter config file:
      <split level="1"/>
    </schemaInfo>
  
-  the pathes are relative to the directory where zebra.init is placed
+  the paths are relative to the directory where zebra.init is placed
    and is started up.
  
    The split level decides where the SAX parser shall split the
@@ -94,7 +487,7 @@ c)  Main "alvis" XSLT filter config file:
    and so on.
  
  - in db/ a cql2pqf.txt yaz-client config file 
-  which is also used in the yaz-server <ulink url="http://www.loc.gov/standards/sru/cql/">CQL</ulink>-to-PQF process
+  which is also used in the yaz-server <ulink url="&url.cql;">CQL</ulink>-to-PQF process
  
     see: http://www.indexdata.com/yaz/doc/tools.tkl#tools.cql.map