<chapter id="fields-and-charsets">
- <!-- $Id: field-structure.xml,v 1.8 2006-11-28 13:05:57 marc Exp $ -->
+ <!-- $Id: field-structure.xml,v 1.12 2007-02-02 09:58:39 marc Exp $ -->
<title>Field Structure and Character Sets
</title>
<para>
In order to provide a flexible approach to national character set
- handling, Zebra allows the administrator to configure the set up the
+ handling, &zebra; allows the administrator to configure the
system to handle any 8-bit character set — including sets that
require multi-octet diacritics or other multi-octet characters. The
definition of a character set includes a specification of the
</listitem></varlistentry>
</variablelist>
</para>
+ <para>
+ The following are three excerpts from the standard
+ <filename>tab/default.idx</filename> configuration file. Notice
+ that <literal>index</literal> and <literal>sort</literal>
+ are grouping directives; each one binds the directives that
+ follow it:
+ <screen>
+ # Traditional word index
+ # Used if completeness is 'incomplete field' (@attr 6=1) and
+ # structure is word/phrase/word-list/free-form-text/document-text
+ index w
+ completeness 0
+ position 1
+ alwaysmatches 1
+ firstinfield 1
+ charmap string.chr
+
+ ...
+
+ # Null map index (no mapping at all)
+ # Used if structure=key (@attr 4=3)
+ index 0
+ completeness 0
+ position 1
+ charmap @
+
+ ...
+
+ # Sort register
+ sort s
+ completeness 1
+ charmap string.chr
+ </screen>
+ </para>
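+ <para>
+ As a minimal sketch of how this grouping works, the fragment below
+ defines a hypothetical index <literal>x</literal> followed by a
+ sort register. The name <literal>x</literal> and the comments are
+ assumptions made for illustration and are not taken from the shipped
+ <filename>tab/default.idx</filename>; only directives that appear in
+ the excerpts above are used:
+ <screen>
+ # Hypothetical index, for illustration only.
+ # The three directives below are bound to "index x".
+ index x
+ completeness 0
+ position 1
+ charmap string.chr
+
+ # A new grouping directive starts here, so the directives
+ # below are bound to "sort s" and no longer to "index x".
+ sort s
+ completeness 1
+ charmap string.chr
+ </screen>
+ </para>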
</section>
<section id="character-map-files">
<para>
The character map files are used to define the word tokenization
and character normalization performed before inserting text into
- the inverse indexes. Zebra ships with the predefined character map
+ the inverse indexes. &zebra; ships with the predefined character map
files <filename>tab/*.chr</filename>. Users are allowed to add
and/or modify maps according to their needs.
</para>
- <table id="querymodel-attribute-sets-table" frame="top">
- <title>Character maps predefined in Zebra</title>
+ <table id="character-map-table" frame="top">
+ <title>Character maps predefined in &zebra;</title>
<tgroup cols="3">
<thead>
<row>
<para>
The contents of the character map files are structured as follows:
<variablelist>
+ <varlistentry>
+ <term>encoding <replaceable>encoding-name</replaceable></term>
+ <listitem>
+ <para>
+ This directive must appear at the very beginning of the file. It
+ specifies the character encoding used throughout the file. If
+ omitted, the encoding <literal>ISO-8859-1</literal> is assumed.
+ </para>
+ <para>
+ For example, one of the test files found at
+ <literal>test/rusmarc/tab/string.chr</literal> contains the following
+ encoding directive:
+ <screen>
+ encoding koi8-r
+ </screen>
+ and the test file
+ <literal>test/charmap/string.utf8.chr</literal> is encoded
+ in UTF-8:
+ <screen>
+ encoding utf-8
+ </screen>
+ </para>
+ </listitem></varlistentry>
<varlistentry>
<term>lowercase <replaceable>value-set</replaceable></term>
<para>
In addition to specifying sort orders, space (blank) handling,
and upper/lowercase folding, you can also use the character map
- files to make Zebra ignore leading articles in sorting records,
+ files to make &zebra; ignore leading articles in sorting records,
or when doing complete field searching.
</para>
<para>