<chapter id="record-model-domxml">
- <!-- $Id: recordmodel-domxml.xml,v 1.4 2007-02-20 15:02:18 marc Exp $ -->
+ <!-- $Id: recordmodel-domxml.xml,v 1.8 2007-02-21 15:03:30 marc Exp $ -->
<title>&dom; &xml; Record Model and Filter Module</title>
<para>
<section id="record-model-domxml-filter">
- <title>&dom; Record Filter</title>
+ <title>&dom; Record Filter Architecture</title>
<para>
The &dom; &xml; filter uses a standard &dom; &xml; structure as
&marcxml; &dom; representation. Other binary document parsers
are planned to follow.
</para>
- </section>
-
-
- <section id="record-model-domxml-architecture">
- <title>&dom; &xml; filter architecture</title>
<para>
- The internal &dom; &xml; representation can be fed into four
- different pipelines, consisting of arbitraily many sucessive
- &xslt; transformations.
+ The &dom; filter architecture consists of four
+ different pipelines, each being a chain of arbitrarily many successive
+ &xslt; transformations of the internal &dom; &xml;
+ representations of documents.
</para>
+ <figure id="record-model-domxml-architecture-fig">
+ <title>&dom; &xml; filter architecture</title>
+ <mediaobject>
+ <imageobject>
+ <imagedata fileref="domfilter.pdf" format="PDF" scale="50"/>
+ </imageobject>
+ <imageobject>
+ <imagedata fileref="domfilter.png" format="PNG"/>
+ </imageobject>
+ <textobject>
+ <!-- Fall back if none of the images can be used -->
+ <phrase>
+ [Here there should be a diagram showing the &dom; &xml;
+ filter architecture, but it seems that your
+ tool chain has not been able to include the diagram in this
+ document.]
+ </phrase>
+ </textobject>
+ </mediaobject>
+ </figure>
+
+
<table id="record-model-domxml-architecture-table" frame="top">
<title>&dom; &xml; filter pipelines overview</title>
<tgroup cols="5">
<entry>first</entry>
<entry>input parsing and initial
transformations to common &xml; format</entry>
- <entry>raw &xml; record buffers, &xml; streams and
+ <entry>Input raw &xml; record buffers, &xml; streams and
binary &marc; buffers</entry>
- <entry>single &dom; &xml; documents suitable for indexing and
- internal storage</entry>
+ <entry>Common &xml; &dom;</entry>
</row>
<row>
<entry><literal>extract</literal></entry>
<entry>second</entry>
<entry>indexing term extraction
transformations</entry>
- <entry>common single &dom; &xml; format</entry>
- <entry>&zebra; internal indexing &dom; &xml; document</entry>
+ <entry>Common &xml; &dom;</entry>
+ <entry>Indexing &xml; &dom;</entry>
</row>
<row>
<entry><literal>store</literal></entry>
<entry>second</entry>
<entry> transformations before internal document
storage</entry>
- <entry>common single &dom; &xml; format</entry>
- <entry>&zebra; internal storage &dom; &xml; document</entry>
+ <entry>Common &xml; &dom;</entry>
+ <entry>Storage &xml; &dom;</entry>
</row>
<row>
<entry><literal>retrieve</literal></entry>
<entry>multiple document retrieve transformations from
storage to different output
formats are possible</entry>
- <entry>&zebra; internal storage &dom; &xml; document</entry>
- <entry>output &xml; syntax and requested format</entry>
+ <entry>Storage &xml; &dom;</entry>
+ <entry>Output &xml; syntax in requested formats</entry>
</row>
</tbody>
</tgroup>
<screen>
recordtype.xml: dom.db/filter_dom_conf.xml
</screen>
- In this example on all data files with suffix
- <filename>*.xml</filename>, where the
- &dom; &xslt; filter configuration file is found in the
+ In this example the &dom; &xml; filter is configured to work
+ on all data files with suffix
+ <filename>*.xml</filename>, where the configuration file is found in the
path <filename>db/filter_dom_conf.xml</filename>.
</para>
]]>
</screen>
</para>
-
<para>
- All named stylesheets defined inside
- <literal>schema</literal> element tags
- are for presentation after search, including
- the indexing stylesheet (which is a great debugging help). The
- names defined in the <literal>name</literal> attributes must be
- unique, these are the literal <literal>schema</literal> or
- <literal>element set</literal> names used in
- <ulink url="http://www.loc.gov/standards/sru/srw/">&srw;</ulink>,
- <ulink url="&url.sru;">&sru;</ulink> and
- &z3950; protocol queries.
+ The root &xml; element <literal><dom></literal> and all other &dom;
+ &xml; filter elements reside in the namespace
+ <literal>http://indexdata.com/zebra-2.0</literal>.
+ </para>
+ <para>
+ All pipeline definition elements - i.e. the
+ <literal><input></literal>,
+ <literal><extract></literal>,
+ <literal><store></literal>, and
+ <literal><retrieve></literal> elements - are optional.
+ Missing pipeline definitions are simply interpreted as
+ do-nothing identity pipelines.
+ </para>
+ <para>
+ All pipeline definition elements may contain zero or more
+ <literal><![CDATA[<xslt stylesheet="path/file.xsl"/>]]></literal>
+ &xslt; transformation instructions, which are performed
+ sequentially from top to bottom.
The paths in the <literal>stylesheet</literal> attributes
- are relative to zebras working directory, or absolute to file
+ are relative to &zebra;'s working directory, or absolute to the file
system root.
</para>
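+ <para>
+ As a minimal sketch of how these rules combine, the following
+ hypothetical configuration defines only an
+ <literal>extract</literal> pipeline with two chained
+ stylesheets. The stylesheet file names are assumptions made for
+ illustration only. Following the rules above, the omitted
+ <literal>input</literal>, <literal>store</literal> and
+ <literal>retrieve</literal> pipelines would then be treated as
+ identity pipelines.
+ <screen>
+ <![CDATA[
+ <?xml version="1.0" encoding="UTF-8"?>
+ <dom xmlns="http://indexdata.com/zebra-2.0">
+   <extract>
+     <!-- hypothetical stylesheet, applied first -->
+     <xslt stylesheet="db/normalize.xsl"/>
+     <!-- hypothetical stylesheet, applied to the output of the first -->
+     <xslt stylesheet="db/common2index.xsl"/>
+   </extract>
+ </dom>
+ ]]>
+ </screen>
+ </para>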
+
+
+ <section id="record-model-domxml-pipeline-input">
+ <title>Input pipeline</title>
<para>
- The <literal><split level="2"/></literal> decides where the
- &xml; Reader shall split the
- collections of records into individual records, which then are
- loaded into &dom;, and have the indexing &xslt; stylesheet applied.
+ The <literal><input></literal> pipeline definition element
+ may contain either one &xml; Reader definition
+ <literal><![CDATA[<xmlreader level="1"/>]]></literal>, used to split
+ an &xml; collection input stream into individual &xml; &dom;
+ documents at the prescribed element level,
+ or one &marc; binary
+ parsing instruction
+ <literal><![CDATA[<marc inputcharset="marc-8"/>]]></literal>, which defines
+ a conversion to &marcxml; format &dom; trees. The allowed values
+ of the <literal>inputcharset</literal> attribute depend on your
+ local <productname>iconv</productname> set-up.
</para>
<para>
- There must be exactly one indexing &xslt; stylesheet, which is
- defined by the magic attribute
- <literal>identifier="http://indexdata.dk/zebra/xslt/1"</literal>.
+ Both input parsers deliver individual &dom; &xml; documents to the
+ following chain of zero or more
+ <literal><![CDATA[<xslt stylesheet="path/file.xsl"/>]]></literal>
+ &xslt; transformations. At the end of this pipeline, the documents
+ are in the common format, used to feed both the
+ <literal><extract></literal> and
+ <literal><store></literal> pipelines.
</para>
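+ <para>
+ The two alternatives might be sketched as follows. Only one of
+ them would appear in a given configuration, and the stylesheet
+ name <filename>db/input2common.xsl</filename> is a hypothetical
+ placeholder, not a file shipped with &zebra;.
+ <screen>
+ <![CDATA[
+ <!-- variant A: split an XML collection at element level 1 -->
+ <input>
+   <xmlreader level="1"/>
+   <xslt stylesheet="db/input2common.xsl"/>
+ </input>
+
+ <!-- variant B: parse binary MARC records encoded in MARC-8 -->
+ <input>
+   <marc inputcharset="marc-8"/>
+   <xslt stylesheet="db/input2common.xsl"/>
+ </input>
+ ]]>
+ </screen>
+ </para>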
+ </section>
+
+ <section id="record-model-domxml-pipeline-extract">
+ <title>Extract pipeline</title>
+ <para>
+ The <literal><extract></literal> pipeline takes documents
+ from any common &dom; &xml; format to the &zebra; specific
+ indexing &dom; &xml; format.
+ It may consist of zero or more
+ <literal><![CDATA[<xslt stylesheet="path/file.xsl"/>]]></literal>
+ &xslt; transformations, and the outcome is handed to the
+ &zebra; core to drive the process of building the inverted
+ indexes. See
+ <xref linkend="record-model-domxml-canonical-index"/> for
+ details.
+ </para>
+ </section>
- <section id="record-model-domxml-internal">
- <title>&dom; filter internal record representation</title>
- <para>When indexing, an &xml; Reader is invoked to split the input
- files into suitable record &xml; pieces. Each record piece is then
- transformed to an &xml; &dom; structure, which is essentially the
- record model. Only &xslt; transformations can be applied during
- index, search and retrieval. Consequently, output formats are
- restricted to whatever &xslt; can deliver from the record &xml;
- structure, be it other &xml; formats, HTML, or plain text. In case
- you have <literal>libxslt1</literal> running with E&xslt; support,
- you can use this functionality inside the &dom;
- filter configuration &xslt; stylesheets.
+ <section id="record-model-domxml-pipeline-store">
+ <title>Store pipeline</title>
+ <para>
+ The <literal><store></literal> pipeline takes documents
+ from any common &dom; &xml; format to the &zebra; specific
+ storage &dom; &xml; format.
+ It may consist of zero or more
+ <literal><![CDATA[<xslt stylesheet="path/file.xsl"/>]]></literal>
+ &xslt; transformations, and the outcome is handed to the
+ &zebra; core for deposition into the internal storage system.
+ </para>
+ </section>
+
+ <section id="record-model-domxml-pipeline-retrieve">
+ <title>Retrieve pipeline</title>
+ <para>
+ Finally, there may be one or more
+ <literal><retrieve></literal> pipeline definitions, each
+ of them again consisting of zero or more
+ <literal><![CDATA[<xslt stylesheet="path/file.xsl"/>]]></literal>
+ &xslt; transformations. These are used for document
+ presentation after search, and take the internal storage &dom;
+ &xml; to the requested output formats during record present
+ requests.
</para>
+ <para>
+ Multiple
+ <literal><retrieve></literal> pipeline definitions
+ are distinguished by their unique <literal>name</literal>
+ attributes; these are the literal <literal>schema</literal> or
+ <literal>element set</literal> names used in
+ <ulink url="http://www.loc.gov/standards/sru/srw/">&srw;</ulink>,
+ <ulink url="&url.sru;">&sru;</ulink> and
+ &z3950; protocol queries.
+ </para>
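+ <para>
+ For illustration, two named retrieve pipelines could be declared
+ as sketched below. The names <literal>dc</literal> and
+ <literal>mods</literal> and both stylesheet paths are assumptions
+ chosen for this example. A client asking for the schema or element
+ set <literal>dc</literal> would then be served by the first
+ pipeline.
+ <screen>
+ <![CDATA[
+ <retrieve name="dc">
+   <!-- hypothetical stylesheet: storage format to Dublin Core -->
+   <xslt stylesheet="db/store2dc.xsl"/>
+ </retrieve>
+ <retrieve name="mods">
+   <!-- hypothetical stylesheet: storage format to MODS -->
+   <xslt stylesheet="db/store2mods.xsl"/>
+ </retrieve>
+ ]]>
+ </screen>
+ </para>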
</section>
- <section id="record-model-domxml-canonical">
- <title>&dom; Canonical Indexing Format</title>
+
+ <section id="record-model-domxml-canonical-index">
+ <title>Canonical Indexing Format</title>
+
+ <para>
+ &dom; &xml; indexing comes in two flavors: pure
+ processing-instruction governed plain &xml; documents, and - very
+ similar to the Alvis filter indexing format - &xml; documents
+ containing &xml; <literal><record></literal> and
+ <literal><index></literal> instructions from the magic
+ namespace <literal>xmlns:z="http://indexdata.com/zebra-2.0"</literal>.
+ </para>
+
+ <section id="record-model-domxml-canonical-index-pi">
+ <title>Processing-instruction governed indexing format</title>
+
+ <para>The output of the processing instruction driven
+ indexing &xslt; stylesheets must contain
+ processing instructions named
+ <literal>zebra-2.0</literal>.
+ The output of the &xslt; indexing transformation is then
+ parsed using &dom; methods, and the contained instructions are
+ performed on the <emphasis>elements and their
+ subtrees directly following the processing instructions</emphasis>.
+ </para>
+ <para>
+ For example, the output of the command
+ <screen>
+ xsltproc dom-index-pi.xsl marc-one.xml
+ </screen>
+ might look like this:
+ <screen>
+ <![CDATA[
+ <?xml version="1.0" encoding="UTF-8"?>
+ <?zebra-2.0 record id=11224466 rank=42?>
+ <record>
+ <?zebra-2.0 index control:w?>
+ <control>11224466</control>
+ <?zebra-2.0 index title:w title:p title:s any:w?>
+ <title>How to program a computer</title>
+ </record>
+ ]]>
+ </screen>
+ </para>
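+ <para>
+ The stylesheet <filename>dom-index-pi.xsl</filename> itself is not
+ reproduced here. A minimal sketch of a stylesheet producing output
+ of this shape is given below. It assumes &marcxml; input, takes the
+ record id from control field 001 and the title from field 245
+ subfield a (both choices are assumptions for this example), and
+ leaves out the <literal>rank</literal> value.
+ <screen>
+ <![CDATA[
+ <?xml version="1.0" encoding="UTF-8"?>
+ <xsl:stylesheet version="1.0"
+     xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
+     xmlns:m="http://www.loc.gov/MARC21/slim"
+     exclude-result-prefixes="m">
+   <xsl:output method="xml" encoding="UTF-8" indent="yes"/>
+
+   <!-- suppress text that no template explicitly asks for -->
+   <xsl:template match="text()"/>
+
+   <xsl:template match="m:record">
+     <xsl:variable name="id"
+         select="normalize-space(m:controlfield[@tag='001'])"/>
+     <!-- record instruction, governs the element that follows -->
+     <xsl:processing-instruction name="zebra-2.0">
+       <xsl:text>record id=</xsl:text>
+       <xsl:value-of select="$id"/>
+     </xsl:processing-instruction>
+     <record>
+       <xsl:processing-instruction
+           name="zebra-2.0">index control:w</xsl:processing-instruction>
+       <control><xsl:value-of select="$id"/></control>
+       <xsl:for-each select="m:datafield[@tag='245']/m:subfield[@code='a']">
+         <xsl:processing-instruction
+             name="zebra-2.0">index title:w title:p title:s any:w</xsl:processing-instruction>
+         <title><xsl:value-of select="."/></title>
+       </xsl:for-each>
+     </record>
+   </xsl:template>
+ </xsl:stylesheet>
+ ]]>
+ </screen>
+ </para>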
+ </section>
+
+ <section id="record-model-domxml-canonical-index-element">
+ <title>Magic element governed indexing format</title>
+
<para>The output of the indexing &xslt; stylesheets must contain
certain elements in the magic
- <literal>xmlns:z="http://indexdata.dk/zebra/xslt/1"</literal>
+ <literal>xmlns:z="http://indexdata.com/zebra-2.0"</literal>
namespace. The output of the &xslt; indexing transformation is then
parsed using &dom; methods, and the contained instructions are
performed on the <emphasis>magic elements and their
</para>
<para>
For example, the output of the command
- <screen>
- xsltproc xsl/oai2index.xsl one-record.xml
+ <screen>
+ xsltproc dom-index-element.xsl marc-one.xml
</screen>
might look like this:
<screen>
- <?xml version="1.0" encoding="UTF-8"?>
- <z:record xmlns:z="http://indexdata.dk/zebra/xslt/1"
- z:id="oai:JTRS:CP-3290---Volume-I"
- z:rank="47896"
- z:type="update">
- <z:index name="oai_identifier" type="0">
- oai:JTRS:CP-3290---Volume-I</z:index>
- <z:index name="oai_datestamp" type="0">2004-07-09</z:index>
- <z:index name="oai_setspec" type="0">jtrs</z:index>
- <z:index name="dc_all" type="w">
- <z:index name="dc_title" type="w">Proceedings of the 4th
- International Conference and Exhibition:
- World Congress on Superconductivity - Volume I</z:index>
- <z:index name="dc_creator" type="w">Kumar Krishen and *Calvin
- Burnham, Editors</z:index>
- </z:index>
- </z:record>
+ <![CDATA[
+ <?xml version="1.0" encoding="UTF-8"?>
+ <z:record xmlns:z="http://indexdata.com/zebra-2.0"
+ z:id="11224466" z:rank="42">
+ <z:index name="control:w">11224466</z:index>
+ <z:index name="title:w title:p title:s any:w">
+ How to program a computer</z:index>
+ </z:record>
+ ]]>
</screen>
</para>
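+ <para>
+ The stylesheet <filename>dom-index-element.xsl</filename> is
+ likewise not reproduced here. A minimal sketch that emits the magic
+ elements could look as follows, again assuming &marcxml; input with
+ the record id in control field 001 and the title in field 245
+ subfield a, and omitting the <literal>z:rank</literal> attribute.
+ <screen>
+ <![CDATA[
+ <?xml version="1.0" encoding="UTF-8"?>
+ <xsl:stylesheet version="1.0"
+     xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
+     xmlns:m="http://www.loc.gov/MARC21/slim"
+     xmlns:z="http://indexdata.com/zebra-2.0"
+     exclude-result-prefixes="m">
+   <xsl:output method="xml" encoding="UTF-8" indent="yes"/>
+
+   <!-- suppress text that no template explicitly asks for -->
+   <xsl:template match="text()"/>
+
+   <xsl:template match="m:record">
+     <xsl:variable name="id"
+         select="normalize-space(m:controlfield[@tag='001'])"/>
+     <!-- the attribute value template copies the id into z:id -->
+     <z:record z:id="{$id}">
+       <z:index name="control:w">
+         <xsl:value-of select="$id"/>
+       </z:index>
+       <xsl:for-each select="m:datafield[@tag='245']/m:subfield[@code='a']">
+         <z:index name="title:w title:p title:s any:w">
+           <xsl:value-of select="."/>
+         </z:index>
+       </xsl:for-each>
+     </z:record>
+   </xsl:template>
+ </xsl:stylesheet>
+ ]]>
+ </screen>
+ </para>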
+ </section>
+
+
+ <section id="record-model-domxml-canonical-index-semantics">
+ <title>Semantics of the indexing formats</title>
+
+ <para>
+ Both indexing formats are defined with equal semantics and
+ behaviour in mind.
+ </para>
+
+
<para>This means the following: From the original &xml; file
<literal>one-record.xml</literal> (or from the &xml; record &dom; of the
same form coming from a splitted input file), the indexing
<literal>insert</literal>, <literal>update</literal>, and
<literal>delete</literal>.
</para>
- <para>In this example, the following literal indexes are constructed:
+
+
+ <para>In these examples, the following literal indexes are constructed:
<screen>
- oai_identifier
- oai_datestamp
- oai_setspec
- dc_all
- dc_title
- dc_creator
+ any:w
+ control:w
+ title:w
+ title:p
+ title:s
</screen>
- where the indexing type is defined in the
- <literal>type</literal> attribute
- (any value from the standard configuration
- file <filename>default.idx</filename> will do). Finally, any
+ where the indexing type is defined after the
+ literal <literal>':'</literal> character.
+ Any value from the standard configuration
+ file <filename>default.idx</filename> will do.
+ Finally, any
<literal>text()</literal> node content recursively contained
- inside the <literal>index</literal> will be filtered through the
+ inside the <literal><z:index></literal> element, or any
+ element following an <literal>index</literal> processing instruction,
+ will be filtered through the
appropriate charmap for character normalization, and will be
- inserted in the index.
+ inserted in the named indexes.
</para>
+
+
<para>
Specific to this example, we see that the single word
<literal>oai:JTRS:CP-3290---Volume-I</literal> will be literal,
filter configuration files involves in this process, and that the
literal index names are used during search and retrieval.
</para>
+
+ </section>
+
</section>
</section>