<chapter id="record-model-domxml">
- <!-- $Id: recordmodel-domxml.xml,v 1.9 2007-02-22 15:44:19 marc Exp $ -->
+ <!-- $Id: recordmodel-domxml.xml,v 1.13 2007-03-21 19:37:00 adam Exp $ -->
<title>&dom; &xml; Record Model and Filter Module</title>
-
+
<para>
The record model described in this chapter applies to the fundamental,
structured &xml;
</para>
</listitem>
<listitem>
- <para>The unique <literal>record</literal> instruction
- may have additional attributes <literal>id</literal> and
- <literal>rank</literal>, where the value of the opaque ID
- may be any string not containing the whitespace character
- <literal>' '</literal>, and the rank value must be a
+ <para>
+ The unique <literal>record</literal> instruction
+ may have additional attributes <literal>id</literal>,
+ <literal>rank</literal> and <literal>type</literal>.
+ Attribute <literal>id</literal> is the value of the opaque ID
+ and may be any string not containing the whitespace character
+ <literal>' '</literal>.
+ The <literal>rank</literal> attribute value must be a
non-negative integer. See
- <xref linkend="administration-ranking"/>
+ <xref linkend="administration-ranking"/>.
+ The <literal>type</literal> attribute specifies how the record
+ is to be treated. The following values may be given for
+ <literal>type</literal>:
+ <variablelist>
+ <varlistentry>
+ <term><literal>insert</literal></term>
+ <listitem>
+ <para>
+ The record is inserted. If the record already exists, it is
+ skipped (i.e. not replaced).
+ </para>
+ </listitem>
+ </varlistentry>
+ <varlistentry>
+ <term><literal>replace</literal></term>
+ <listitem>
+ <para>
+ The record is replaced. If the record does not already exist,
+ it is skipped (i.e. not inserted).
+ </para>
+ </listitem>
+ </varlistentry>
+ <varlistentry>
+ <term><literal>delete</literal></term>
+ <listitem>
+ <para>
+ The record is deleted. If the record does not already exist,
+ it is skipped (i.e. nothing is deleted).
+ </para>
+ </listitem>
+ </varlistentry>
+ <varlistentry>
+ <term><literal>update</literal></term>
+ <listitem>
+ <para>
+ The record is inserted or replaced depending on whether the
+ record exists or not. This is the default behavior, but it may
+ be overridden from outside the scope of the &dom;
+ filter by zebraidx commands or extended services updates.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+ Note that the value of <literal>type</literal> determines the
+ action only if the Zebra indexer is running
+ in "update" mode (i.e. <literal>zebraidx update</literal>) or if the specialUpdate
+ action of the
+ <link linkend="administration-extended-services-z3950">Extended
+ Service Update</link> is used.
+ For this reason, a specialUpdate may end up deleting records, namely
+ when a record instruction carries <literal>type="delete"</literal>.
</para>
</listitem>
<listitem>
<xref linkend="fields-and-charsets"/> for details.
</para>
</listitem>
+ <listitem>
+ <para>
+ &dom; input documents which do not result in both one
+ unique valid
+ <literal>record</literal> instruction and one or more valid
+ <literal>index</literal> instructions cannot be searched and
+ found. Therefore,
+ processing of such invalid documents is aborted, and any content of
+ the <literal>&lt;extract&gt;</literal> and
+ <literal>&lt;store&gt;</literal> pipelines is discarded.
+ A warning is issued in the logs.
+ </para>
+ </listitem>
</itemizedlist>
</para>
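+ <para>
+ As a rough illustration of the rules above, the instruction
+ document handed to the indexer for a single record might look
+ like the following sketch. The record ID, the rank and the index
+ content are hypothetical, and writing the attributes without a
+ namespace prefix is an assumption based on the attribute names
+ given above. The <literal>z</literal> prefix is assumed to be
+ bound to <literal>http://indexdata.dk/zebra/xslt/1</literal>.
+ <screen>
+ <![CDATA[
+ <!-- hypothetical result of an indexing transformation -->
+ <!-- one unique record instruction: opaque id without spaces, -->
+ <!-- non-negative integer rank, explicit update action -->
+ <z:record xmlns:z="http://indexdata.dk/zebra/xslt/1"
+           id="example-record-1" rank="42" type="update">
+   <!-- at least one index instruction is needed for the record -->
+   <!-- to be searchable -->
+   <z:index name="title:w">How to program a computer</z:index>
+ </z:record>
+ ]]>
+ </screen>
+ With <literal>type="update"</literal> the record would be
+ inserted if it is new and replaced if it already exists. A
+ document without the <literal>z:index</literal> child would fall
+ under the last rule above and be discarded with a warning.
+ </para>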
-
<para>The examples work as follows:
From the original &xml; file
<!-- OAI indexing templates -->
<xsl:template match="oai:record/oai:header/oai:identifier">
- <z:index name="oai_identifier;0">
+ <z:index name="oai_identifier:0">
<xsl:value-of select="."/>
</z:index>
</xsl:template>
]]>
</screen>
</para>
+ </section>
+
+
+ <section id="record-model-domxml-index-marc">
+ <title>&dom; Indexing &marcxml;</title>
+ <para>
+ The &dom; filter allows indexing of both binary &marc; records
+ and &marcxml; records, depending on its configuration.
+ A typical &marcxml; record might look like this:
+ <screen>
+ <![CDATA[
+ <record xmlns="http://www.loc.gov/MARC21/slim">
+ <rank>42</rank>
+ <leader>00366nam 22001698a 4500</leader>
+ <controlfield tag="001"> 11224466 </controlfield>
+ <controlfield tag="003">DLC </controlfield>
+ <controlfield tag="005">00000000000000.0 </controlfield>
+ <controlfield tag="008">910710c19910701nju 00010 eng </controlfield>
+ <datafield tag="010" ind1=" " ind2=" ">
+ <subfield code="a"> 11224466 </subfield>
+ </datafield>
+ <datafield tag="040" ind1=" " ind2=" ">
+ <subfield code="a">DLC</subfield>
+ <subfield code="c">DLC</subfield>
+ </datafield>
+ <datafield tag="050" ind1="0" ind2="0">
+ <subfield code="a">123-xyz</subfield>
+ </datafield>
+ <datafield tag="100" ind1="1" ind2="0">
+ <subfield code="a">Jack Collins</subfield>
+ </datafield>
+ <datafield tag="245" ind1="1" ind2="0">
+ <subfield code="a">How to program a computer</subfield>
+ </datafield>
+ <datafield tag="260" ind1="1" ind2=" ">
+ <subfield code="a">Penguin</subfield>
+ </datafield>
+ <datafield tag="263" ind1=" " ind2=" ">
+ <subfield code="a">8710</subfield>
+ </datafield>
+ <datafield tag="300" ind1=" " ind2=" ">
+ <subfield code="a">p. cm.</subfield>
+ </datafield>
+ </record>
+ ]]>
+ </screen>
+ </para>
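+ <para>
+ A minimal indexing stylesheet for such a record could be
+ sketched as follows. It is not taken from the Zebra
+ distribution: the choice of control field
+ <literal>001</literal> as record ID, the use of the
+ <literal>rank</literal> element as rank value, the unprefixed
+ attribute names and the prefix <literal>m</literal> for the
+ &marcxml; namespace are all assumptions made for illustration.
+ <screen>
+ <![CDATA[
+ <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
+                 xmlns:z="http://indexdata.dk/zebra/xslt/1"
+                 xmlns:m="http://www.loc.gov/MARC21/slim"
+                 version="1.0">
+
+   <!-- suppress the built-in copying of text nodes -->
+   <xsl:template match="text()"/>
+
+   <!-- one record instruction per MARCXML record (assumed mapping) -->
+   <xsl:template match="m:record">
+     <z:record id="{normalize-space(m:controlfield[@tag='001'])}"
+               rank="{normalize-space(m:rank)}">
+       <xsl:apply-templates/>
+     </z:record>
+   </xsl:template>
+
+   <!-- index the title proper as words -->
+   <xsl:template match="m:datafield[@tag='245']">
+     <z:index name="title:w">
+       <xsl:value-of select="m:subfield[@code='a']"/>
+     </z:index>
+   </xsl:template>
+
+ </xsl:stylesheet>
+ ]]>
+ </screen>
+ <literal>normalize-space()</literal> strips the padding around
+ <literal>11224466</literal>, so that the resulting ID contains
+ no whitespace, as required for the <literal>id</literal>
+ attribute.
+ </para>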
+
<para>
- Notice also,
- that the names and types of the indexes can be defined in the
+ String manipulation is easy to do in the &dom;
+ filter. For example, if you want to drop leading articles
+ when indexing sort fields, you can use the
+ &marcxml; indicator attributes to chop off leading substrings. If
+ the above &xml; example had an indicator
+ <literal>ind2="8"</literal> in the title field
+ <literal>245</literal>, i.e.
+ <screen>
+ <![CDATA[
+ <datafield tag="245" ind1="1" ind2="8">
+ <subfield code="a">How to program a computer</subfield>
+ </datafield>
+ ]]>
+ </screen>
+ one could write a template that uses this value as the start
+ position of the sorting index <literal>title:s</literal>, so that
+ the seven characters before position <literal>8</literal> are
+ dropped, like this:
+ <screen>
+ <![CDATA[
+ <xsl:template match="m:datafield[@tag='245']">
+ <xsl:variable name="chop">
+ <xsl:choose>
+ <xsl:when test="not(number(@ind2))">0</xsl:when>
+ <xsl:otherwise><xsl:value-of select="number(@ind2)"/></xsl:otherwise>
+ </xsl:choose>
+ </xsl:variable>
+
+ <z:index name="title:w title:p any:w">
+ <xsl:value-of select="m:subfield[@code='a']"/>
+ </z:index>
+
+ <z:index name="title:s">
+ <xsl:value-of select="substring(m:subfield[@code='a'], $chop)"/>
+ </z:index>
+
+ </xsl:template>
+ ]]>
+ </screen>
+ The output of the above &marcxml; and &xslt; excerpt would then be:
+ <screen>
+ <![CDATA[
+ <z:index name="title:w title:p any:w">How to program a computer</z:index>
+ <z:index name="title:s">program a computer</z:index>
+ ]]>
+ </screen>
+ and the record would be sorted in the title index under 'P', not 'H'.
+ </para>
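+ <para>
+ XPath counts string positions from 1, so
+ <literal>substring(s, 8)</literal> starts at the eighth
+ character and drops the seven before it. If the indicator is
+ instead to be read as the number of characters to skip, the
+ start position has to be one higher. The following variant is
+ a sketch under that assumption; the variable name
+ <literal>skip</literal> and the prefix bindings are
+ illustrative only.
+ <screen>
+ <![CDATA[
+ <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
+                 xmlns:z="http://indexdata.dk/zebra/xslt/1"
+                 xmlns:m="http://www.loc.gov/MARC21/slim"
+                 version="1.0">
+
+   <!-- suppress the built-in copying of text nodes -->
+   <xsl:template match="text()"/>
+
+   <xsl:template match="m:datafield[@tag='245']">
+     <!-- number of leading characters to skip; -->
+     <!-- a blank or non-numeric indicator counts as 0 -->
+     <xsl:variable name="skip">
+       <xsl:choose>
+         <xsl:when test="number(@ind2) > 0">
+           <xsl:value-of select="number(@ind2)"/>
+         </xsl:when>
+         <xsl:otherwise>0</xsl:otherwise>
+       </xsl:choose>
+     </xsl:variable>
+
+     <!-- positions are 1-based: skipping n characters means -->
+     <!-- starting at position n + 1 -->
+     <z:index name="title:s">
+       <xsl:value-of select="substring(m:subfield[@code='a'], $skip + 1)"/>
+     </z:index>
+   </xsl:template>
+
+ </xsl:stylesheet>
+ ]]>
+ </screen>
+ With <literal>ind2="7"</literal> on the title field, this
+ variant starts the sort key at <literal>program a
+ computer</literal>, and with a blank indicator the title is
+ indexed unchanged.
+ </para>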
+ </section>
+
+
+ <section id="record-model-domxml-index-wizzard">
+ <title>&dom; Indexing Wizardry</title>
+ <para>
+ The names and types of the indexes can be defined in the
indexing &xslt; stylesheet <emphasis>dynamically according to
content in the original &xml; records</emphasis>, which has
opportunities for great power and wizardry as well as grande
</para>
</section>
+ <section id="record-model-domxml-debug">
+ <title>Debugging &dom; Filter Configurations</title>
+ <para>
+ It can be very hard to debug a &dom; filter setup due to the many
+ successive &marc; syntax translations, &xml; stream splitting and
+ &xslt; transformations involved. As an aid, you always have the
+ <literal>-s</literal> command line switch to the
+ <literal>zebraidx</literal> indexing command at hand:
+ <screen>
+ zebraidx -s -c zebra.cfg update some_record_stream.xml
+ </screen>
+ This command line simulates indexing and dumps a lot of debug
+ information in the logs, telling exactly which transformations
+ have been applied, what the documents look like after each
+ transformation, and which record ids and terms are sent to the indexer.
+ </para>
+ </section>
+
+ <!--
<section id="record-model-domxml-elementset">
<title>&dom; Exchange Formats</title>
<para>
xmlns:z="http://indexdata.dk/zebra/xslt/1"
version="1.0">
- <!-- register internal zebra parameters -->
+ <!- - register internal zebra parameters - ->
<xsl:param name="id" select="''"/>
<xsl:param name="filename" select="''"/>
<xsl:param name="score" select="''"/>
<xsl:output indent="yes" method="xml" version="1.0" encoding="UTF-8"/>
- <!-- use then for display of internal information -->
+ <!- - use them for display of internal information - ->
<xsl:template match="/">
<z:zebra>
<id><xsl:value-of select="$id"/></id>
</para>
</section>
+ -->
<!--
<section id="record-model-domxml-example">