From: Marc Cromme
Date: Wed, 21 Feb 2007 13:38:22 +0000 (+0000)
Subject: started explaining each dom filter pipeline
X-Git-Tag: ZEBRA.2.0.12~37
X-Git-Url: http://sru.miketaylor.org.uk/cgi-bin?a=commitdiff_plain;h=799d116b6a50e44e9eeccf17ed81ffb70220b4f4;p=idzebra-moved-to-github.git

started explaining each dom filter pipeline
---

diff --git a/doc/recordmodel-domxml.xml b/doc/recordmodel-domxml.xml
index 9c28187..8dfcdb6 100644
--- a/doc/recordmodel-domxml.xml
+++ b/doc/recordmodel-domxml.xml
@@ -1,5 +1,5 @@
-
+
 &dom; &xml; Record Model and Filter Module
@@ -79,26 +79,25 @@
 first input parsing and initial transformations to common &xml; format
- raw &xml; record buffers, &xml; streams and
+ Input raw &xml; record buffers, &xml; streams and
 binary &marc; buffers
- single &dom; &xml; documents suitable for indexing and
- internal storage
+ Common &xml; &dom;
 extract second indexing term extraction transformations
- common single &dom; &xml; format
- &zebra; internal indexing &dom; &xml; document
+ Common &xml; &dom;
+ Indexing &xml; &dom;
 store second transformations before internal document storage
- common single &dom; &xml; format
- &zebra; internal storage &dom; &xml; document
+ Common &xml; &dom;
+ Storage &xml; &dom;
 retrieve
@@ -106,8 +105,8 @@
 multiple document retrieve transformations from storage to different output formats are possible
- &zebra; internal storage &dom; &xml; document
- output &xml; syntax and requested format
+ Storage &xml; &dom;
+ Output &xml; syntax in requested formats
@@ -132,9 +131,9 @@
 recordtype.xml: dom.db/filter_dom_conf.xml
- In this example on all data files with suffix
- *.xml, where the
- &dom; &xslt; filter configuration file is found in the
+ In this example the &dom; &xml; filter is configured to work
+ on all data files with suffix
+ *.xml, where the configuration file is found in the
 path db/filter_dom_conf.xml.
@@ -164,33 +163,82 @@
 ]]>
-
- All named stylesheets defined inside
- schema element tags
- are for presentation after search, including
- the indexing stylesheet (which is a great debugging help). The
- names defined in the name attributes must be
- unique, these are the literal schema or
- element set names used in
- &srw;,
- &sru; and
- &z3950; protocol queries.
+ The root &xml; element <dom> and all other &dom;
+ &xml; filter elements reside in the namespace
+ http://indexdata.com/zebra-2.0.
+
+
+ All pipeline definition elements - i.e. the
+ <input>,
+ <extract>,
+ <store>, and
+ <retrieve> elements - are optional.
+ Missing pipeline definitions are just interpreted
+ as do-nothing identity pipelines.
+
+
+ All pipeline definition elements may contain zero or more
+ ]]>
+ &xslt; transformation instructions, which are performed
+ sequentially from top to bottom.
 The paths in the stylesheet attributes
- are relative to zebras working directory, or absolute to file
+ are relative to Zebra's working directory, or absolute to the file
 system root.
+
+
+
+ Input pipeline
- The <split level="2"/> decides where the
- &xml; Reader shall split the
- collections of records into individual records, which then are
- loaded into &dom;, and have the indexing &xslt; stylesheet applied.
+ The <input> pipeline definition element
+ may contain either one &xml; Reader definition
+ ]]>, used to split
+ an &xml; collection input stream into individual &xml; &dom;
+ documents at the prescribed element level,
+ or one &marc; binary
+ parsing instruction
+ ]]>, which defines
+ a conversion to &marcxml; format &dom; trees. The allowed values
+ of the inputcharset attribute depend on your
+ local iconv set-up.
- There must be exactly one indexing &xslt; stylesheet, which is
- defined by the magic attribute
- identifier="http://indexdata.dk/zebra/xslt/1".
+ Both input parsers deliver individual &dom; &xml; documents to the
+ following chain of zero or more
+ ]]>
+ &xslt; transformations. At the end of this pipeline, the documents
+ are in the common format, used to feed both the
+ <extract> and
+ <store> pipelines.
+
+
+ +
+ Extract pipeline +
+ +
+ Store pipeline +
+ +
+ Retrieve pipeline
+
+
+ All named stylesheets defined inside
+ schema element tags
+ are for presentation after search, including
+ the indexing stylesheet (which is a great debugging help). The
+ names defined in the name attributes must be
+ unique, these are the literal schema or
+ element set names used in
+ &srw;,
+ &sru; and
+ &z3950; protocol queries.
+
+
&dom; filter internal record representation