From: Adam Dickmeiss
Date: Mon, 23 Jan 2006 10:35:16 +0000 (+0000)
Subject: Added mail indexing example.
X-Git-Tag: ZEBRA.1.3.34~15
X-Git-Url: http://sru.miketaylor.org.uk/cgi-bin?a=commitdiff_plain;h=8f787a1bff5ec72f4253cac78043b3d9e404dd28;p=idzebra-moved-to-github.git

Added mail indexing example.
---

diff --git a/configure.in b/configure.in
index aa93a90..b4659ea 100644
--- a/configure.in
+++ b/configure.in
@@ -1,5 +1,5 @@
 dnl Zebra, Index Data Aps, 1995-2005
-dnl $Id: configure.in,v 1.91.2.16 2005-11-08 10:51:50 adam Exp $
+dnl $Id: configure.in,v 1.91.2.17 2006-01-23 10:35:16 adam Exp $
 dnl
 AC_INIT(include/zebraver.h)
 AM_INIT_AUTOMAKE(idzebra,1.3.32)
@@ -353,6 +353,7 @@ AC_OUTPUT([
   test/dmoz/Makefile test/xpath/Makefile test/sort/Makefile test/zsh/Makefile
   test/marcxml/Makefile test/charmap/Makefile test/codec/Makefile
   examples/Makefile examples/gils/Makefile examples/zthes/Makefile
+  examples/mail/Makefile
   idzebra.spec
 ])
 if test -x "$perlbin"; then
diff --git a/examples/Makefile.am b/examples/Makefile.am
index 94a8048..98556c0 100644
--- a/examples/Makefile.am
+++ b/examples/Makefile.am
@@ -1,5 +1,5 @@
-SUBDIRS=gils zthes
+SUBDIRS=gils zthes mail
 EXTRA_DIST = README
diff --git a/examples/mail/.cvsignore b/examples/mail/.cvsignore
new file mode 100644
index 0000000..1bca8cf
--- /dev/null
+++ b/examples/mail/.cvsignore
@@ -0,0 +1,8 @@
+*.mf
+*.LCK
+log
+zebrasrv.pid
+Makefile.in
+Makefile
+register
+shadow
diff --git a/examples/mail/Makefile.am b/examples/mail/Makefile.am
new file mode 100644
index 0000000..930910e
--- /dev/null
+++ b/examples/mail/Makefile.am
@@ -0,0 +1,7 @@
+# $Id: Makefile.am,v 1.1.2.1 2006-01-23 10:35:17 adam Exp $
+
+EXTRA_DIST = zebra.cfg mail.flt mail.abs zigmails.mbox
+
+clean:
+	rm -f *.mf *..LCK zebrasrv.pid
+
diff --git a/examples/mail/mbox.abs b/examples/mail/mbox.abs
new file mode 100644
index 0000000..0571281
--- /dev/null
+++ b/examples/mail/mbox.abs
@@ -0,0 +1,23 @@
+# $Id: mbox.abs,v 1.1.2.1 2006-01-23 10:35:17 adam Exp $
+
+# This is an abstrac-syntax for mbox mails.. It is similar to
+# wais.abs but is using xelm rather than elm and is tuned for
+# emails only.
+
+name mbox
+attset bib1.att
+
+# Allow Brief and Full to mean the whole thing
+esetname F @
+esetname B @
+
+# No X-Path indexing.
+xpath disable
+
+# Author,BodyOfText,Title,Local-Number are all Bib-1 attributes.
+
+xelm /mbox/name Author:w
+xelm /mbox/body BodyOfText:w
+xelm /mbox/title Title:w
+xelm /mbox/id Local-Number:0
+
diff --git a/examples/mail/mbox.flt b/examples/mail/mbox.flt
new file mode 100644
index 0000000..a55e7cb
--- /dev/null
+++ b/examples/mail/mbox.flt
@@ -0,0 +1,15 @@
+# $Id: mbox.flt,v 1.1.2.1 2006-01-23 10:35:17 adam Exp $
+# This reads mbox mails. This filter is similar to mail.flt but is
+# using string tags rathar than numeric tags (and tag sets).
+# We do our best to index the Message-ID so that we can uniquely identify
+# the mail being indexed.
+BEGIN { begin record mbox }
+/^From:/ BODY /$/ { data -element name $1 }
+/^[Mm][Ee][Ss][Ss][Aa][Gg][Ee]-[Ii][Dd]:/ BODY /$/ { data -element id $1 }
+/^Subject:/ BODY /$/ { data -element title $1 }
+/^Date:/ BODY /$/ { data -element date $1 }
+/^$/ BODY /^From / {
+	data -text -element body $1
+	unread 2
+	end record
+	}
diff --git a/examples/mail/zebra.cfg b/examples/mail/zebra.cfg
new file mode 100644
index 0000000..5e7f911
--- /dev/null
+++ b/examples/mail/zebra.cfg
@@ -0,0 +1,16 @@
+# Simple Zebra configuration file
+# $Id: zebra.cfg,v 1.1.2.1 2006-01-23 10:35:17 adam Exp $
+#
+# Where the schema files, attribute files, etc are located.
+profilePath: .:../../tab
+
+# Files that describe the attribute sets supported.
+attset: bib1.att
+attset: gils.att
+attset: explain.att
+
+# The Message-ID in our emails..
+recordId: (bib1,12)
+
+recordtype: grs.regx.mbox
+
diff --git a/examples/mail/zigmails.mbox b/examples/mail/zigmails.mbox
new file mode 100644
index 0000000..0c451be
--- /dev/null
+++ b/examples/mail/zigmails.mbox
@@ -0,0 +1,268 @@
+From - Tue Jul 1 11:00:58 2003
+X-UIDL: 3f014dc900000006
+X-Mozilla-Status: 0001
+X-Mozilla-Status2: 00000000
+Envelope-to: adam@indexdata.dk
+Received: from frink.w3.org ([18.29.1.71])
+	by bagel.index with esmtp (Exim 3.35 #1 (Debian))
+	id 19XGUC-0002xp-00; Tue, 01 Jul 2003 10:27:37 +0200
+Received: from frink.w3.org (localhost [127.0.0.1])
+	by frink.w3.org (8.12.9/8.12.9) with ESMTP id h618RZSn002707;
+	Tue, 1 Jul 2003 04:27:35 -0400 (EDT)
+Received: (from lists@localhost)
+	by frink.w3.org (8.12.9/8.12.9/Submit) id h618QkDn002527;
+	Tue, 1 Jul 2003 04:26:46 -0400 (EDT)
+Resent-Date: Tue, 1 Jul 2003 04:26:46 -0400 (EDT)
+Resent-Message-Id: <200307010826.h618QkDn002527@frink.w3.org>
+Received: from dr-nick.w3.org (dr-nick.w3.org [18.29.1.73])
+	by frink.w3.org (8.12.9/8.12.9) with ESMTP id h618QhSn002466
+	for ; Tue, 1 Jul 2003 04:26:43 -0400 (EDT)
+Received: from auntie.miketaylor.org.uk (pc-62-30-152-189-hr.blueyonder.co.uk [62.30.152.189])
+	by dr-nick.w3.org (8.12.3/8.12.3/Debian-6.4) with ESMTP id h618Qgvx030432
+	for ; Tue, 1 Jul 2003 04:26:43 -0400
+Received: from mike by auntie.miketaylor.org.uk with local (Exim 3.35 #1 (Debian))
+	id 19XGT6-00016e-00
+	for ; Tue, 01 Jul 2003 09:26:28 +0100
+From: Mike Taylor
+To: www-zig@w3.org
+Message-Id:
+Date: Tue, 01 Jul 2003 09:26:28 +0100
+Subject: Revised XML Proposal
+X-Archived-At: http://www.w3.org/mid/E19XGT6-00016e-00@auntie.miketaylor.org.uk
+Resent-From: www-zig@w3.org
+X-Mailing-List: archive/latest/1324
+X-Loop: www-zig@w3.org
+Sender: www-zig-request@w3.org
+Resent-Sender: www-zig-request@w3.org
+Precedence: list
+List-Id:
+List-Help:
+List-Unsubscribe:
+Resent-Bcc:
+Status:
+
+
+Ray (mostly),
+
+Regarding the revised _Requesting XML Record_ proposal at
+http://www.loc.gov/z3950/agency/proposals/request-xml.html
+
+The technical content looks good to me. I have a few quibbles with
+the wording, but they are not important.
+
+    The globally unique identifier may but need not be a
+    URI. If a URI, it may, but need not be the locator of
+    an XML schema definition or DTD. (And if a URI, it
+    need not be an HTTP URI.)
+
+The phrase "may be need not be" feels awkward to me, and would read
+better as "may be, but need not be,". And the last of these three
+sentences should use the same phrasing as the first two:
+
+    The globally unique identifier may be, but need not
+    be, a URI. If it is a URI it may be, but need not be,
+    the locator of an XML schema definition or DTD. (And
+    if a URI, it may be, but need not be, an HTTP URI.)
+
+Also:
+
+    Example 3. The identifier:
+    "http://www.editeur.org/onix/ReferenceNames" would be
+    used as the element set name to indicate the dtd at
+    http://www.editeur.org/onix/2.0/reference/onix-international.dtd.
+
+"DTD" is an acronym and so should be all-caps.
+
+And finally:
+
+    Example 4. The identifier:
+    "http://www.kb.nl/persons/theo/dcx/" would be used as
+    the element set name to indicate that records are to
+    be composed according to the definition at that URI
+    (http://www.kb.nl/persons/theo/dcx).
+
+It would be more explicitly exemplify what's going on if this said
+"... composed according to the prose definition at ..."
+
+Sorry to be picky. (But not so sorry that I won't :-)
+
+ _/|_ _______________________________________________________________
+/o ) \/ Mike Taylor http://www.miketaylor.org.uk
+)_v__/\ "Ho ho ho ... Very witty, Wilde! Very, very witty!" --
+    Monty Python's Flying Circus.
+
+-- 
+Listen to my wife's new CD of kids' music, _Child's Play_, at
+    http://www.pipedreaming.org.uk/childsplay/
+
+
+From - Wed Jul 2 00:57:47 2003
+X-UIDL: 3f0211ea00000001
+X-Mozilla-Status: 0011
+X-Mozilla-Status2: 00000000
+Envelope-to: adam@indexdata.dk
+Received: from frink.w3.org ([18.29.1.71])
+	by bagel.index with esmtp (Exim 3.35 #1 (Debian))
+	id 19XU3R-0000o8-00; Wed, 02 Jul 2003 00:56:53 +0200
+Received: from frink.w3.org (localhost [127.0.0.1])
+	by frink.w3.org (8.12.9/8.12.9) with ESMTP id h61Mn1Sn023384;
+	Tue, 1 Jul 2003 18:49:01 -0400 (EDT)
+Received: (from lists@localhost)
+	by frink.w3.org (8.12.9/8.12.9/Submit) id h61MmH7B022195;
+	Tue, 1 Jul 2003 18:48:17 -0400 (EDT)
+Resent-Date: Tue, 1 Jul 2003 18:48:17 -0400 (EDT)
+Resent-Message-Id: <200307012248.h61MmH7B022195@frink.w3.org>
+Received: from dr-nick.w3.org (dr-nick.w3.org [18.29.1.73])
+	by frink.w3.org (8.12.9/8.12.9) with ESMTP id h61MmESn022042
+	for ; Tue, 1 Jul 2003 18:48:14 -0400 (EDT)
+Received: from io.mds.rmit.edu.au (io.mds.rmit.edu.au [131.170.70.10])
+	by dr-nick.w3.org (8.12.3/8.12.3/Debian-6.4) with ESMTP id h61MmDvx015709
+	for ; Tue, 1 Jul 2003 18:48:14 -0400
+Received: by io.mds.rmit.edu.au (Postfix, from userid 301)
+	id 3B6A649B70; Wed, 2 Jul 2003 08:48:06 +1000 (EST)
+Date: Wed, 2 Jul 2003 08:48:06 +1000
+From: Alan Kent
+To: www-zig@w3.org
+Message-ID: <20030702084805.B20964@io.mds.rmit.edu.au>
+References: <3F005826.B395FDF4@loc.gov>
+Mime-Version: 1.0
+Content-Type: text/plain; charset=us-ascii
+X-Mailer: Mutt 0.95i
+In-Reply-To: <3F005826.B395FDF4@loc.gov>; from Ray Denenberg on Mon, Jun 30, 2003 at 11:32:54AM -0400
+Subject: Re: Proposal: requesting XML records
+X-Archived-At: http://www.w3.org/mid/20030702084805.B20964@io.mds.rmit.edu.au
+Resent-From: www-zig@w3.org
+X-Mailing-List: archive/latest/1325
+X-Loop: www-zig@w3.org
+Sender: www-zig-request@w3.org
+Resent-Sender: www-zig-request@w3.org
+Precedence: list
+List-Id:
+List-Help:
+List-Unsubscribe:
+Resent-Bcc:
+Status:
+
+
+On Mon, Jun 30, 2003 at 11:32:54AM -0400, Ray Denenberg wrote:
+> Thanks for the comments on this proposal. It's been updated. See:
+> http://www.loc.gov/z3950/agency/proposals/request-xml.html
+> --Ray
+
+I may have been out of things for a bit, but was there some reason
+why Comp-spec was explicitly excluded?
+
+    "If Comp-spec is used, this agreement does not apply."
+
+Is the purpose to say Comp-spec is out of scope (not defined) or that
+it is recommended that Comp-spec *should not* support these set names.
+I can understand Comp-spec being undefined/out of scope. But it would
+seem wrong to disallow it.
+
+Another observation (not a problem, just an observation), from memory
+element set names are not case sensitive (in a quick skim of standard
+I could not find this just now, but I recall seeing it previously).
+URIs on the other hand are case sensitive. I guess there is no issue
+as long as clients are told to always supply the URI with the correct
+case. (If they don't, should it still work - maybe just undefined).
+But as I said, probably not an issue in practice so not worth mentioning.
+
+Alan
+
+
+From - Wed Jul 2 22:13:36 2003
+X-UIDL: 3f033cee00000006
+X-Mozilla-Status: 0011
+X-Mozilla-Status2: 00000000
+Envelope-to: adam@indexdata.dk
+Received: from frink.w3.org ([18.29.1.71])
+	by bagel.index with esmtp (Exim 3.35 #1 (Debian))
+	id 19XnuF-0007jb-00; Wed, 02 Jul 2003 22:08:43 +0200
+Received: from frink.w3.org (localhost [127.0.0.1])
+	by frink.w3.org (8.12.9/8.12.9) with ESMTP id h62JxFSn024233;
+	Wed, 2 Jul 2003 15:59:15 -0400 (EDT)
+Received: (from lists@localhost)
+	by frink.w3.org (8.12.9/8.12.9/Submit) id h62JwYov024041;
+	Wed, 2 Jul 2003 15:58:34 -0400 (EDT)
+Resent-Date: Wed, 2 Jul 2003 15:58:34 -0400 (EDT)
+Resent-Message-Id: <200307021958.h62JwYov024041@frink.w3.org>
+Received: from dr-nick.w3.org (dr-nick.w3.org [18.29.1.73])
+	by frink.w3.org (8.12.9/8.12.9) with ESMTP id h62JwXSn024006
+	for ; Wed, 2 Jul 2003 15:58:33 -0400 (EDT)
+Received: from sun8.loc.gov (sun8.loc.gov [140.147.249.48])
+	by dr-nick.w3.org (8.12.3/8.12.3/Debian-6.4) with ESMTP id h62JwWvx008936
+	for ; Wed, 2 Jul 2003 15:58:32 -0400
+Received: from loc.gov (LSWRKC.dhcp.loc.gov [140.147.156.132])
+	by sun8.loc.gov with ESMTP id h62JwWqc015205
+	for ; Wed, 2 Jul 2003 15:58:32 -0400 (EDT)
+Message-ID: <3F033967.67747812@loc.gov>
+Date: Wed, 02 Jul 2003 15:58:32 -0400
+From: Ray Denenberg
+Organization: Library Of Congress
+X-Mailer: Mozilla 4.77 [en] (Windows NT 5.0; U)
+X-Accept-Language: en
+MIME-Version: 1.0
+To: zig
+References: <3F005826.B395FDF4@loc.gov> <20030702084805.B20964@io.mds.rmit.edu.au>
+Content-Type: text/plain; charset=us-ascii
+Content-Transfer-Encoding: 7bit
+Subject: Re: Proposal: requesting XML records
+X-Archived-At: http://www.w3.org/mid/3F033967.67747812@loc.gov
+Resent-From: www-zig@w3.org
+X-Mailing-List: archive/latest/1326
+X-Loop: www-zig@w3.org
+Sender: www-zig-request@w3.org
+Resent-Sender: www-zig-request@w3.org
+Precedence: list
+List-Id:
+List-Help:
+List-Unsubscribe:
+Resent-Bcc:
+Status:
+
+
+Alan Kent wrote:
+
+> "If Comp-spec is used, this agreement does not apply."
+>
+> Is the purpose to say Comp-spec is out of scope (not defined) or that
+> it is recommended that Comp-spec *should not* support these set names.
+
+"Out-of-scope". The proposed agreement simply would not apply to compSpec
+and there is no intent to suggest that compSpec should or should not support
+this.
+
+
+> Another observation (not a problem, just an observation), from memory
+> element set names are not case sensitive (in a quick skim of standard
+> I could not find this just now, but I recall seeing it previously).
+> URIs on the other hand are case sensitive. I guess there is no issue
+> as long as clients are told to always supply the URI with the correct
+> case. (If they don't, should it still work - maybe just undefined).
+> But as I said, probably not an issue in practice so not worth mentioning.
+
+Yes, ESNs are case insensitive and URIs are case sensitive.
+
+So, suppose "http://www. ........... xyz" and "http://www. ........... XYZ"
+identify two different schemas. Suppose a server knows the first, and the
+client sends the second. The server will treat it as the first. The only
+reasonable answer to this problem is that this never should have happened to
+begin with (these two names identifying different things). But how do you
+prevent that from happening?
+
+I think the practical answer is this: I think that the domain-name part of
+an http uri is case insensitive, at least in practice. For example,
+HTTP://WWW.LOC.GOV resolves to http://www.loc.gov, whereas
+HTTP://WWW.LOC.GOV/Z3950 does not resolves to http://www.loc.gov/z3950.
+
+So, assuming that
+(1) the naming authority part of a uri is, in practice, case insensitive; and
+
+(2) a single naming authority has enough sense not to assign conflicting
+names;
+then I think the problem should not arise.
+
+--Ray
+
+
+