<term>estimatehits:: <replaceable>integer</replaceable></term>
<listitem>
<para>
- Controls whether &zebra; should calculate approximite hit counts and
+ Controls whether &zebra; should calculate approximate hit counts and
at which hit count it is to be enabled.
- A value of 0 disables approximiate hit counts.
- For a positive value approximaite hit count is enabled
+ A value of 0 disables approximate hit counts.
+ For a positive value approximate hit count is enabled
if it is known to be larger than <replaceable>integer</replaceable>.
</para>
<para>
<replaceable>permstring</replaceable></term>
<listitem>
<para>
- Specifies permissions (priviledge) for a user that are allowed
+ Specifies permissions (privilege) for a user that are allowed
to access &zebra; via the passwd system. There are two kinds
of permissions currently: read (r) and write(w). By default
users not listed in a permission directive are given the read
<para>
Names a file which lists database subscriptions for individual users.
- The access file should consists of lines of the form <literal>username:
+ The access file should consist of lines of the form <literal>username:
- dbnames</literal>, where dbnames is a list of database names, seprated by
+ dbnames</literal>, where dbnames is a list of database names, separated by
'+'. No whitespace is allowed in the database list.
</para>
</listitem>
<title>Static Ranking</title>
<para>
- &zebra; uses internally inverted indexes to look up term occurencies
+ &zebra; uses internally inverted indexes to look up term occurrences
in documents. Multiple queries from different indexes can be
combined by the binary boolean operations <literal>AND</literal>,
<literal>OR</literal> and/or <literal>NOT</literal> (which
<para>
The default <literal>rank-1</literal> ranking module implements a
- TF/IDF (Term Frequecy over Inverse Document Frequency) like
+ TF/IDF (Term Frequency over Inverse Document Frequency) like
- algorithm. In contrast to the usual defintion of TF/IDF
+ algorithm. In contrast to the usual definition of TF/IDF
algorithms, which only considers searching in one full-text
index, this one works on multiple indexes at the same time.
More precisely,
<sect2 id="administration-extended-services-debugging">
<title>Extended services debugging guide</title>
<para>
- When debugging ES over PHP we recomment the following order of tests:
+ When debugging ES over PHP we recommend the following order of tests:
</para>
<itemizedlist>
<literal>yaz-client</literal> like described in
<xref linkend="administration-extended-services-yaz-client"/>,
and
- remeber the <literal>-a</literal> option which tells you what
+ remember the <literal>-a</literal> option which tells you what
goes over the wire! Notice also the section on permissions:
try
<screen>
perm.anonymous: rw
</screen>
in <literal>zebra.cfg</literal> to make sure you do not run into
- permission problems (but never expose such an unsecure setup on the
+ permission problems (but never expose such an insecure setup on the
internet!!!). Then, make sure to set the general
<literal>recordType</literal> instruction, pointing correctly
to the GRS-1 filters,
</para>
<para>
The internal &acro.dom; &acro.xml; representation can be fed into four
- different pipelines, consisting of arbitraily many sucessive
+ different pipelines, consisting of arbitrarily many successive
&acro.xslt; transformations; these are for
<itemizedlist>
<listitem><para>input parsing and initial
static ranks. This imposes no overhead at all, both
search and indexing perform still
<emphasis>O(1)</emphasis> irrespectively of document
- collection size. This feature resembles Googles pre-ranking using
- their Pagerank algorithm.
+ collection size. This feature resembles Google's pre-ranking using
+ their PageRank algorithm.
</para>
<para>
Details on the experimental Alvis &acro.xslt; filter are found in
&zebra;'s internal index structure/data for a record.
In particular, the regular record filters are not invoked when
these are in use.
- This can in some cases make the retrival faster than regular
+ This can in some cases make the retrieval faster than regular
retrieval operations (for &acro.marc;, &acro.xml; etc).
</para>
<table id="special-retrieval-types">
Z> elements zebra::meta
Z> s 1+1
</screen>
- displays all available metadata on the record. These include sytem
+ displays all available metadata on the record. These include system
number, database name, indexed filename, filter used for indexing,
score and static ranking information and finally bytesize of record.
</para>
<listitem>
<para>
- What record schemas to support. (Subsidiary files specifiy how
+ What record schemas to support. (Subsidiary files specify how
to index the contents of records in those schemas, and what
format to use when presenting records in those schemas to client
software.)
<calloutlist>
<callout arearefs="attset.zthes">
<para>
- Declare Thesausus attribute set. See <filename>zthes.att</filename>.
+ Declare Thesaurus attribute set. See <filename>zthes.att</filename>.
</para>
</callout>
<callout arearefs="attset.attset">
<callout arearefs="termName">
<para>
Make <literal>termName</literal> word searchable by both
- Zthes attribute termName (1002) and &acro.bib1; atttribute title (4).
+ Zthes attribute termName (1002) and &acro.bib1; attribute title (4).
</para>
</callout>
</calloutlist>
(non-space characters) separated by single space characters
(normalized to " " on display). When completeness is
disabled, each word is indexed as a separate entry. Complete subfield
- indexing is most useful for fields which are typically browsed (eg.
+ indexing is most useful for fields which are typically browsed (e.g.,
titles, authors, or subjects), or instances where a match on a
- complete subfield is essential (eg. exact title searching). For fields
+ complete subfield is essential (e.g., exact title searching). For fields
where completeness is disabled, the search engine will interpret a
search containing space characters as a word proximity search.
</para>
to them:
<screen>
# Traditional word index
- # Used if completenss is 'incomplete field' (@attr 6=1) and
+ # Used if completeness is 'incomplete field' (@attr 6=1) and
# structure is word/phrase/word-list/free-form-text/document-text
index w
completeness 0
<para>
Curly braces {} may be used to enclose ranges of single
characters (possibly using the escape convention described in the
- preceding point), eg. {a-z} to introduce the
+ preceding point), e.g., {a-z} to introduce the
standard range of ASCII characters.
Note that the interpretation of such a range depends on
the concrete representation in your local, physical character set.
<listitem>
<para>
- paranthesises () may be used to enclose multi-byte characters -
- eg. diacritics or special national combinations (eg. Spanish
+ parentheses () may be used to enclose multi-byte characters -
+ e.g., diacritics or special national combinations (e.g., Spanish
"ll"). When found in the input stream (or a search term),
these characters are viewed and sorted as a single character, with a
sorting value depending on the position of the group in the value
<example id="indexing-marcxml-example"><title>MARCXML indexing using ICU</title>
<para>
The directory <filename>examples/marcxml</filename> includes
- a complete sample with MARCXML recordst that are DOM XML indexed
+ a complete sample with MARCXML records that are DOM XML indexed
using ICU chain rules. Study the
<filename>README</filename> in the <filename>marcxml</filename>
directory for details.
<para>
The software is regularly tested on
<ulink url="&url.debian;">Debian GNU/Linux</ulink>,
- <ulink url="&url.redhat;">Redhat Linux</ulink>,
+ <ulink url="&url.redhat;">Red Hat Linux</ulink>,
<ulink url="&url.gentoo;">Gentoo Linux</ulink>,
<ulink url="&url.suse;">SuSE Linux</ulink>,
<ulink url="&url.freebsd;">FreeBSD (i386)</ulink>,
</para>
</note>
<para>
- The attribute set defintion files may no longer contain
+ The attribute set definition files may no longer contain
redirection to other fields.
For example the following snippet of
a custom <filename>custom/bib1.att</filename>
<ulink url="http://indexdata.dk/zebra/">&zebra;</ulink>
is a high-performance, general-purpose structured text
indexing and retrieval engine. It reads records in a
- variety of input formats (eg. email, &acro.xml;, &acro.marc;) and provides access
+ variety of input formats (e.g., email, &acro.xml;, &acro.marc;) and provides access
to them through a powerful combination of boolean search
expressions and relevance-ranked free-text queries.
</para>
<entry>Predefined field types</entry>
<entry>user defined</entry>
<entry>Data fields can be indexed as phrase, as into word
- tokenized text, as numeric values, url's, dates, and raw binary
+ tokenized text, as numeric values, URLs, dates, and raw binary
data.</entry>
<entry><xref linkend="character-map-files"/> and
<xref linkend="querymodel-pqf-apt-mapping-structuretype"/>
<entry>Regular expression matching</entry>
<entry>available</entry>
<entry>Full regular expression matching and "approximate
- matching" (eg. spelling mistake corrections) are handled.</entry>
+ matching" (e.g., spelling mistake corrections) are handled.</entry>
<entry><xref linkend="querymodel-regular"/></entry>
</row>
<row>
- Why does Kete wants to use Zebra?? Speed, Scalability and easy
+ Why does Kete want to use Zebra? Speed, scalability and easy
integration with Koha. Read their
<ulink
- url="http://kete.net.nz/blog/topics/show/44-who-what-why-when-answering-some-of-the-niggly-development-questions">detailled
+ url="http://kete.net.nz/blog/topics/show/44-who-what-why-when-answering-some-of-the-niggly-development-questions">detailed
reasoning here.</ulink>
</para>
</section>
&zebra; has been used by a variety of institutions to construct
indexes of large web sites, typically in the region of tens of
millions of pages. In this role, it functions somewhat similarly
- to the engine of google or altavista, but for a selected intranet
+ to the engine of Google or AltaVista, but for a selected intranet
or a subset of the whole Web.
</para>
<para>
for &acro.marc; records. This term helps to understand the notation of extended indexing of MARC records
by &zebra;. Our definition is based on the document <ulink url="http://www.rba.ru/rusmarc/soft/Z39-50.htm">"The
table of conformity for &acro.z3950; use attributes and R&acro.usmarc; fields"</ulink>.
-The document is available only in russian language.</para>
+The document is available only in Russian.</para>
<para>The <emphasis>index-formula</emphasis> is the combination of subfields presented in such way:</para>
The <ulink url="&url.yaz.pqf;">&acro.pqf; grammar</ulink>
is documented in the &yaz; manual, and shall not be
repeated here. This textual &acro.pqf; representation
- is not transmistted to &zebra; during search, but it is in the
+ is not transmitted to &zebra; during search, but it is in the
client mapped to the equivalent &acro.z3950; binary
query parse tree.
</para>
<para>
It is possible to search
in any silly string index - if it's defined in your
- indexation rules and can be parsed by the &acro.pqf; parser.
+ indexing rules and can be parsed by the &acro.pqf; parser.
This is definitely not the recommended use of
this facility, as it might confuse your users with some very
unexpected results.
<emphasis>string</emphasis> attributes which in appearance
<emphasis>resemble XPath queries</emphasis>. There are two
problems with this approach: first, the XPath-look-alike has to
- be defined at indexation time, no new undefined
+ be defined at indexing time, no new undefined
- XPath queries can entered at search time, and second, it might
+ XPath queries can be entered at search time, and second, it might
confuse users very much that an XPath-alike index name in fact
gets populated from a possible entirely different &acro.xml; element
<literal>Word list (6)</literal>
is supported, and maps to the boolean <literal>AND</literal>
combination of words supplied. The word list is useful when
- google-like bag-of-word queries need to be translated from a GUI
+ Google-like bag-of-word queries need to be translated from a GUI
query language to &acro.pqf;. For example, the following queries
are equivalent:
<screen>
search and scan in index <literal>type="p"</literal>.
</para>
<para>
- The <literal>Complete subfield (2)</literal> is a reminiscens
+ The <literal>Complete subfield (2)</literal> is a reminiscence
from the happy <literal>&acro.marc;</literal>
binary format days. &zebra; does not support it, but maps silently
to <literal>Complete field (3)</literal>.
</para>
<para>
By setting an estimation limit size of the resultset of the &acro.apt;
- leaves, &zebra; stoppes processing the result set when the limit
+ leaves, &zebra; stops processing the result set when the limit
length is reached.
Hit counts under this limit are still precise, but hit counts over it
are estimated using the statistics gathered from the chopped
result set.
</para>
<para>
- Specifying a limit of <literal>0</literal> resuts in exact hit counts.
+ Specifying a limit of <literal>0</literal> results in exact hit counts.
</para>
<para>
For example, we might be interested in exact hit count for a, but
<entry>key (@attr 4=3)</entry>
<entry>ignored</entry>
<entry>Null bitmap ('0')</entry>
- <entry>Used for non-tokenizated and non-normalized bit sequences</entry>
+ <entry>Used for non-tokenized and non-normalized bit sequences</entry>
</row>
<row>
<entry>year (@attr 4=4)</entry>
<entry>ignored</entry>
<entry>Year ('y')</entry>
- <entry>Non-tokenizated and non-normalized 4 digit numbers</entry>
+ <entry>Non-tokenized and non-normalized 4 digit numbers</entry>
</row>
<row>
<entry>date (@attr 4=5)</entry>
<entry>ignored</entry>
<entry>Date ('d')</entry>
- <entry>Non-tokenizated and non-normalized ISO date strings</entry>
+ <entry>Non-tokenized and non-normalized ISO date strings</entry>
</row>
<row>
<entry>ignored</entry>
found. Therefore,
invalid document processing is aborted, and any content of
the <literal><extract></literal> and
- <literal><store></literal> pipelines is discarted.
+ <literal><store></literal> pipelines is discarded.
A warning is issued in the logs.
</para>
</listitem>
- <title>Debuggig &acro.dom; Filter Configurations</title>
+ <title>Debugging &acro.dom; Filter Configurations</title>
<para>
It can be very hard to debug a &acro.dom; filter setup due to the many
- sucessive &acro.marc; syntax translations, &acro.xml; stream splitting and
+ successive &acro.marc; syntax translations, &acro.xml; stream splitting and
&acro.xslt; transformations involved. As an aid, you have always the
power of the <literal>-s</literal> command line switch to the
<literal>zebraidz</literal> indexing command at your hand:
which describes the specific &acro.marc; structure of the input record as
well as the indexing rules.
</para>
- <para>The <literal>grs.marc</literal> uses an internal represtantion
+ <para>The <literal>grs.marc</literal> uses an internal representation
which is not &acro.xml; conformant. In particular &acro.marc; tags are
presented as elements with the same name. And &acro.xml; elements
may not start with digits. Therefore this filter is only
</para>
<para>
The loadable <literal>grs.xml</literal> filter module
- is packagged in the GNU/Debian package
+ is packaged in the Debian GNU/Linux package
<literal>libidzebra2.0-mod-grs-xml</literal>
</para>
</listitem>
<listitem>
<para>
Begin a new record. The following parameter should be the
- name of the schema that describes the structure of the record, eg.
+ name of the schema that describes the structure of the record, e.g.,
<literal>gils</literal> or <literal>wais</literal> (see below).
The <literal>begin record</literal> call should precede
any other use of the <replaceable>begin</replaceable> statement.
provides a default variant request for
use when the individual element requests (see below) do not contain a
variant request. Variant requests consist of a blank-separated list of
- variant components. A variant compont is a comma-separated,
+ variant components. A variant component is a comma-separated,
parenthesized triple of variant class, type, and value (the two former
values being represented as integers). The value can currently only be
entered as a string (this will change to depend on the definition of
ISO2709-based formats (&acro.usmarc;, etc.). Only records with a
two-level structure (corresponding to fields and subfields) can be
directly mapped to ISO2709. For records with a different structuring
- (eg., GILS), the representation in a structure like &acro.usmarc; involves a
+ (e.g., GILS), the representation in a structure like &acro.usmarc; involves a
schema-mapping (see <xref linkend="schema-mapping"/>), to an
"implied" &acro.usmarc; schema (implied,
because there is no formal schema which specifies the use of the
Our definition is based on the document
<ulink url="http://www.rba.ru/rusmarc/soft/Z39-50.htm">"The table
of conformity for &acro.z3950; use attributes and R&acro.usmarc; fields"</ulink>.
- The document is available only in russian language.</para>
+ The document is available only in Russian.</para>
<para>
The <emphasis>index-formula</emphasis> is the combination of
</para>
<para>
Additional OAI test records can be downloaded by running a shell
- script (you may want to abort the script when you have waitet
- longer than your coffe brews ..).
+ script (you may want to abort the script when you have waited
+ longer than your coffee brews ..).
<screen>
cd data
./fetch_OAI_data.sh
<para>
Searching and retrieving &acro.xml; records is easy. For example,
- you can point your browser to one of the following url's to
+ you can point your browser to one of the following URLs to
search for the term <literal>the</literal>. Just point your
browser at this link:
<ulink
<warning>
<para>
- These URL's woun't work unless you have indexed the example data
+ These URLs won't work unless you have indexed the example data
and started an &zebra; server as outlined in the previous section.
</para>
</warning>
<para>
In case we actually want to retrieve one record, we need to alter
- our URl to the following
+ our URL to the following
<ulink url="http://localhost:9999/?version=1.1&operation=searchRetrieve&x-pquery=the&startRecord=1&maximumRecords=1&recordSchema=dc">
http://localhost:9999/?version=1.1&operation=searchRetrieve&x-pquery=the&startRecord=1&maximumRecords=1&recordSchema=dc
</ulink>
<literal>conf/oai2dc.xsl</literal>, and
the <literal>zebra</literal> schema implemented in
<literal>conf/oai2zebra.xsl</literal>.
- The URL's for acessing both are the same, except for the different
+ The URLs for accessing both are the same, except for the different
value of the <literal>recordSchema</literal> parameter:
<ulink url="http://localhost:9999/?version=1.1&operation=searchRetrieve&x-pquery=the&startRecord=1&maximumRecords=1&recordSchema=dc">
http://localhost:9999/?version=1.1&operation=searchRetrieve&x-pquery=the&startRecord=1&maximumRecords=1&recordSchema=dc
The &acro.oai; indexing example defines many different index
names, a study of the <literal>conf/oai2index.xsl</literal>
stylesheet reveals the following word type indexes (i.e. those
- swith suffix <literal>:w</literal>):
+ with suffix <literal>:w</literal>):
<screen>
any:w
title:w
<title>Investigating the content of the indexes</title>
<para>
- How doess the magic work? What is inside the indexes? Why is a certain
+ How does the magic work? What is inside the indexes? Why is a certain
record found by a search, and another not?. The answer is in the
inverted indexes. You can easily investigate them using the
special &zebra; schema
<para>
The &acro.sru; specification mandates that the &acro.cql; query
- language is supported and properly configure. Also, the server
+ language is supported and properly configured. Also, the server
- needs to be able to emmit a proper &acro.explain; &acro.xml;
+ needs to be able to emit a proper &acro.explain; &acro.xml;
record, which is used to determine the capabilities of the
specific server instance.
</para>
<para>
- In this example configuration we expoit the similarities between
+ In this example configuration we exploit the similarities between
the &acro.explain; record and the &acro.cql; query language
- configuration, we generate the later from the former using an
+ configuration, we generate the latter from the former using an
&acro.xslt; transformation.
url="http://localhost:9999/?version=1.1&operation=scan&scanClause=dc.identifier=fish">
http://localhost:9999/?version=1.1&operation=scan&scanClause=dc.identifier=fish
</ulink>
- accesses the indexed indentifiers.
+ accesses the indexed identifiers.
</para>
<para>
- In addition, all &zebra; internal special elemen sets or record
+ In addition, all &zebra; internal special element sets or record
schema's of the form
<literal>zebra::</literal> just work right out of the box
<ulink
<para>
The default behavior for <literal>zebrasrv</literal> - if started
- as non-priviledged user - is to establish
+ as non-privileged user - is to establish
a single TCP/IP listener, for the &acro.z3950; protocol, on port 9999.
<screen>
zebrasrv @
<para>
The Virtual hosts mechanism allows a &yaz; frontend server to
support multiple backends. A backend is selected on the basis of
- the TCP/IP binding (port+listening adddress) and/or the virtual host.
+ the TCP/IP binding (port+listening address) and/or the virtual host.
</para>
<para>
A backend can be configured to execute in a particular working
<para>
Specifies listener for this server. If this attribute is not
- given, the server is accessible from all listener. In order
+ given, the server is accessible from all listeners. In order
- for the server to be used for real, howeever, the virtual host
+ for the server to be used for real, however, the virtual host
must match (if specified in the configuration).
</para>
</listitem>
<listitem>
<para>
Specifies a working directory for this backend server. If
- specifid, the &yaz; fronend changes current working directory
+ specified, the &yaz; frontend changes current working directory
to this directory whenever a backend of this type is
started (backend handler bend_start), stopped (backend handler hand_stop)
and initialized (bend_init).
<refsect1><title>DESCRIPTION</title>
<para>Zebra is a high-performance, general-purpose structured text indexing
and retrieval engine. It reads structured records in a variety of input
- formats (eg. email, &acro.xml;, &acro.marc;) and allows access to them through exact
+ formats (e.g., email, &acro.xml;, &acro.marc;) and allows access to them through exact
boolean search expressions and relevance-ranked free-text queries.
</para>
<para>
This will display the &acro.xml;-formatted &acro.sru; response that includes the
first record in the result-set found by the query
<literal>mineral</literal>. (For clarity, the &acro.sru; URL is shown
- here broken across lines, but the lines should be joined to gether
+ here broken across lines, but the lines should be joined together
to make single-line URL for the browser to submit.)
</para>
</note>
new alpha stuff, and a lot of work has yet to be done ..
</para>
<para>
- There is no linkeage whatsoever between the &acro.z3950; explain model
+ There is no linkage whatsoever between the &acro.z3950; explain model
and the &acro.sru; explain response (well, at least not implemented
in Zebra, that is ..). Zebra does not provide a means using
&acro.z3950; to obtain the ZeeRex record.