<!ENTITY % common SYSTEM "common/common.ent">
%common;
]>
-<!-- $Id: book.xml,v 1.12 2007-04-23 07:03:06 adam Exp $ -->
+<!-- $Id: book.xml,v 1.13 2007-05-25 12:30:27 marc Exp $ -->
<book id="book">
<bookinfo>
<title>Pazpar2 - User's Guide and Reference</title>
<author>
<firstname>Adam</firstname><surname>Dickmeiss</surname>
</author>
+ <author>
+ <firstname>Marc</firstname><surname>Cromme</surname>
+ </author>
<releaseinfo>&version;</releaseinfo>
<copyright>
<year>&copyright-year;</year>
</para>
</listitem>
</varlistentry>
+ <varlistentry><term><ulink url="&url.icu;">International
+ Components for Unicode (ICU)</ulink></term>
+ <listitem>
+ <para>
+ ICU provides Unicode support for non-English languages with
+ character sets outside the range of 7-bit ASCII, such as
+ Greek, Russian, German and French. Pazpar2 uses the ICU
+ Unicode character conversions, Unicode normalization, case
+ folding and other fundamental operations needed in
+ tokenization, normalization and ranking of records.
+ </para>
+ <para>
+ Compiling and linking against the ICU libraries is optional,
+ but strongly recommended for use in an international
+ environment.
+ </para>
+ </listitem>
+ </varlistentry>
</variablelist>
</para>
<para>
</para>
<screen>
apt-get install libyaz-dev
+ apt-get install libicu36-dev
</screen>
<para>
With these packages installed, the usual configure + make
<chapter id="using">
<title>Using pazpar2</title>
<para>
- This chapter provides a general introduction to the use and deployment of pazpar2.
+ This chapter provides a general introduction to the use and
+ deployment of pazpar2.
</para>
<section id="architecture">
functionality, but it isn't a requirement -- you can choose to use
pazpar2 entirely as a backend to your regular server-side scripting.
When you do use pazpar2 in conjunction
- with browser scripting (JavaScript/Ajax, Flash, applets, etc.), there are
- special considerations.
+ with browser scripting (JavaScript/Ajax, Flash, applets,
+ etc.), there are special considerations.
</para>
<para>
metasearching is really, really hard. If you want to build a
project with pazpar2, and you need access to resources with
non-standard interfaces, we can help. We run gateways to more than
- 2,000 popular, commercial databases and other resources, making it simple
+ 2,000 popular, commercial databases and other resources,
+ making it simple
to plug them directly into pazpar2. For a small annual fee per
database, we can help you establish connections to your licensed
resources. Meanwhile, you can help! If you build your own
implement it.
</para>
</section>
+
+ <section id="unicode">
+ <title>Unicode Compliance</title>
+ <para>
+ Pazpar2 is Unicode compliant and language and locale aware, but
+ only to the extent that the backend Z39.50 targets are. Just a
+ few badly behaving targets can spoil the search experience
+ considerably if, for example, Greek, Russian or other non-7-bit
+ ASCII search terms are entered. In these cases some targets
+ return records irrelevant to the query, and the result screens
+ will be cluttered with noise.
+ </para>
+ <para>
+ While noise from misbehaving targets cannot be removed, it can
+ be reduced using truly Unicode-based ranking. This is an
+ option which is available to the system administrator if ICU
+ support is compiled into Pazpar2; see
+ <xref linkend="installation"/> for details.
+ </para>
+ <para>
+ In addition, the ICU tokenization and normalization rules must
+ be defined in the master configuration file described in
+ <xref linkend="config-server"/>.
+ </para>
+ </section>
+
</chapter> <!-- Using pazpar2 -->
<reference id="reference">
<!ENTITY % common SYSTEM "common/common.ent">
%common;
]>
-<!-- $Id: pazpar2_conf.xml,v 1.23 2007-04-24 04:37:58 quinn Exp $ -->
+<!-- $Id: pazpar2_conf.xml,v 1.24 2007-05-25 12:30:27 marc Exp $ -->
<refentry id="pazpar2_conf">
<refentryinfo>
<productname>Pazpar2</productname>
</varlistentry>
<varlistentry>
+ <term>icu_chain</term>
+ <listitem>
+ <para>
+ Definition of the ICU tokenization and normalization rules,
+ which are used only if ICU support is compiled in. The 'id'
+ attribute is currently not used, and the 'locale'
+ attribute must be set to one of the locale strings
+ defined in ICU. The child elements listed below can be
+ in any order, except the 'index' element, which logically
+ belongs at the end of the list. The stated tokenization,
+ normalization and charmapping instructions are performed
+ in order from top to bottom.
+ </para>
+ <variablelist> <!-- Level 2 -->
+ <varlistentry><term>casemap</term>
+ <listitem>
+ <para>
+ The attribute 'rule' defines the direction of the
+ per-character case mapping. Allowed values are "l"
+ (lower), "u" (upper) and "t" (title).
+ </para>
+ </listitem>
+ </varlistentry>
+ <varlistentry><term>normalize</term>
+ <listitem>
+ <para>
+ Normalization and transformation of tokens follows
+ the rules defined in the 'rule' attribute. For
+ possible values we refer to the extensive ICU
+ documentation found at the
+ <ulink url="&url.icu.transform;">ICU
+ transformation</ulink> home page. Set filtering
+ principles are explained at the
+ <ulink url="&url.icu.unicode.set;">ICU set and
+ filtering</ulink> page.
+ </para>
+ </listitem>
+ </varlistentry>
+ <varlistentry><term>tokenize</term>
+ <listitem>
+ <para>
+ Tokenization is the only rule in the ICU chain
+ which splits one token into multiple tokens. The
+ 'rule' attribute may have the following values:
+ "s" (sentence), "l" (line-break), "w" (word), and
+ "c" (character), the latter probably not being
+ very useful in a running pazpar2 installation.
+ </para>
+ </listitem>
+ </varlistentry>
+ <varlistentry><term>index</term>
+ <listitem>
+ <para>
+ Finally the 'index' element instruction - without
+ any 'rule' attribute - is used to store the tokens
+ after chain processing in the relevance ranking
+ unit of Pazpar2. It will always be the last
+ instruction in the chain.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
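+ <para>
+ As a minimal sketch, the elements above could be combined
+ into a chain that breaks text into lowercased word tokens.
+ The 'id' and 'locale' values shown ("en:word" and "en") are
+ illustrative assumptions, and the two 'normalize' rules are
+ simply reused from the example configuration elsewhere on
+ this page; adapt them to the language of your records.
+ </para>
+ <screen><![CDATA[
+ <!-- hypothetical chain: 'id' is currently unused, 'locale' is assumed -->
+ <icu_chain id="en:word" locale="en">
+   <!-- strip control characters before tokenizing -->
+   <normalize rule="[:Control:] Any-Remove"/>
+   <!-- "w" splits the input into word tokens -->
+   <tokenize rule="w"/>
+   <!-- drop whitespace and punctuation from the tokens -->
+   <normalize rule="[[:WhiteSpace:][:Punctuation:]] Remove"/>
+   <!-- "l" maps every character to lower case -->
+   <casemap rule="l"/>
+   <!-- always last: hand the tokens to the relevance ranking unit -->
+   <index/>
+ </icu_chain>
+ ]]></screen>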
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
<term>service</term>
<listitem>
<para>
<listitem>
<para>
This is the name of the data element. It is matched
- against the 'type' attribute of the 'metadata' element
+ against the 'type' attribute of the
+ 'metadata' element
in the normalized record. A warning is produced if
- metdata elements with an unknown name are found in the
- normalized record. This name is also used to represent
+ metadata elements with an unknown name are
+ found in the
+ normalized record. This name is also used to
+ represent
data elements in the records returned by the
webservice API, and to name sort lists and browse
facets.
<varlistentry><term>rank</term>
<listitem>
<para>
- Specifies that this element is to be used to help rank
+ Specifies that this element is to be used to
+ help rank
records against the user's query (when ranking is
requested). The value is an integer, used as a
multiplier against the basic TF*IDF score. A value of
- 1 is the base, higher values give additional weight to
+ 1 is the base, higher values give additional
+ weight to
elements of this type. The default is '0', which
excludes this element from the rank calculation.
</para>
termlist, or browse facet. Values are tabulated from
incoming records, and a highscore of values (with
their associated frequency) is made available to the
- client through the webservice API. The possible values
+ client through the webservice API.
+ The possible values
are 'yes' and 'no' (default).
</para>
</listitem>
<!-- <zproxy host="localhost:9000"/> -->
<!-- <zproxy port="9000"/> -->
+
+ <!-- optional ICU ranking configuration example -->
+ <!--
+ <icu_chain id="el:word" locale="el">
+ <normalize rule="[:Control:] Any-Remove"/>
+ <tokenize rule="l"/>
+ <normalize rule="[[:WhiteSpace:][:Punctuation:]] Remove"/>
+ <casemap rule="l"/>
+ <index/>
+ </icu_chain>
+ -->
+
<service>
<metadata name="title" brief="yes" sortkey="skiparticle" merge="longest" rank="6"/>
<metadata name="isbn" merge="unique"/>
<settings target="*">
<!-- This file introduces default settings for pazpar2 -->
- <!-- $Id: pazpar2_conf.xml,v 1.23 2007-04-24 04:37:58 quinn Exp $ -->
+ <!-- $Id: pazpar2_conf.xml,v 1.24 2007-05-25 12:30:27 marc Exp $ -->
<!-- mapping for unqualified search -->
<set name="pz:cclmap:term" value="u=1016 t=l,r s=al"/>