1 <!-- $Id: book.xml,v 1.6 2006-04-19 15:43:40 mike Exp $ -->
3 <title>Metaproxy - User's Guide and Reference</title>
5 <firstname>Mike</firstname><surname>Taylor</surname>
8 <firstname>Adam</firstname><surname>Dickmeiss</surname>
12 <holder>Index Data</holder>
16 Metaproxy is a universal Z39.50/SRU router, proxy and encapsulated
17 metasearcher. It accepts, processes, interprets and redirects
18 requests from IR clients using standard protocols such as
19 ANSI/NISO Z39.50, SRU and SRW, as well as functioning as a limited
20 HTTP server. Metaproxy is configured by an XML file which
21 specifies how the software should function in terms of routes that
22 the request packets can take through the proxy, each step on a
23 route being an instantiation of a filter. Filters come in many
24 types, one for each operation: accepting Z39.50 packets, logging,
25 query transformation, multiplexing, etc. Further filter-types can
26 be added as loadable modules to extend Metaproxy functionality,
30 The terms under which Metaproxy will be distributed have yet to be
31 established, but it will not necessarily be open source; so users
32 should not at this stage redistribute the code without explicit
33 written permission from the copyright holders, Index Data ApS.
38 <chapter id="introduction">
39 <title>Introduction</title>
43 <ulink url="http://indexdata.dk/metaproxy/">Metaproxy</ulink>
44 is a standalone program that acts as a universal router, proxy and
45 encapsulated metasearcher for information retrieval protocols such
46 as Z39.50 and SRU/SRW. To clients, it acts as a server of these
47 protocols: it can be searched, records can be retrieved from it,
48 etc. To servers, it acts as a client: it searches in them,
retrieves records from them, etc. It satisfies its clients'
50 requests by transforming them, multiplexing them, forwarding them
51 on to zero or more servers, merging the results, transforming
52 them, and delivering them back to the client. In addition, it
53 acts as a simple HTTP server; support for further protocols can be
added in a modular fashion, through the creation of new filters.
59 Cold bananas, fish, pyjamas,
60 Mutton, beef and trout!
61 - attributed to Cole Porter.
64 Metaproxy is a more capable alternative to
65 <ulink url="http://indexdata.dk/yazproxy/">YAZ Proxy</ulink>,
66 being more powerful, flexible, configurable and extensible. Among
67 its many advantages over the older, more pedestrian work are
68 support for multiplexing (encapsulated metasearching), routing by
69 database name, authentication and authorisation and serving local
70 files via HTTP. Equally significant, its modular architecture
facilitates the creation of pluggable modules implementing further
78 <chapter id="licence">
79 <title>The Metaproxy Licence</title>
81 <emphasis role="strong">
82 No decision has yet been made on the terms under which
83 Metaproxy will be distributed.
85 It is possible that, unlike
other Index Data products, Metaproxy may not be released under a
free-software licence such as the GNU GPL. Until a decision has
been made and publicly announced, and unless it has been
delivered to you under other specific terms, please treat Metaproxy as
90 though it were proprietary software.
91 The code should not be redistributed without explicit
92 written permission from the copyright holders, Index Data ApS.
98 <chapter id="architecture">
99 <title>The Metaproxy Architecture</title>
101 The Metaproxy architecture is based on three concepts:
102 the <emphasis>package</emphasis>,
103 the <emphasis>route</emphasis>
104 and the <emphasis>filter</emphasis>.
108 <term>Packages</term>
A package is a request or response, encoded in some protocol,
issued by a client, making its way through Metaproxy, sent to or
received from a server, or sent back to the client.
116 The core of a package is the protocol unit - for example, a
117 Z39.50 Init Request or Search Response, or an SRU searchRetrieve
118 URL or Explain Response. In addition to this core, a package
119 also carries some extra information added and used by Metaproxy
123 In general, packages are doctored as they pass through
124 Metaproxy. For example, when the proxy performs authentication
125 and authorisation on a Z39.50 Init request, it removes the
126 authentication credentials from the package so that they are not
127 passed onto the back-end server; and when search-response
128 packages are obtained from multiple servers, they are merged
129 into a single unified package that makes its way back to the
138 Packages make their way through routes, which can be thought of
139 as programs that operate on the package data-type. Each
140 incoming package initially makes its way through a default
141 route, but may be switched to a different route based on various
142 considerations. Routes are made up of sequences of filters (see
151 Filters provide the individual instructions within a route, and
152 effect the necessary transformations on packages. A particular
153 configuration of Metaproxy is essentially a set of filters,
154 described by configuration details and arranged in order in one
155 or more routes. There are many kinds of filter - about a dozen
at the time of writing, with more appearing all the time - each
157 performing a specific function and configured by different
161 The word ``filter'' is sometimes used rather loosely, in two
162 different ways: it may be used to mean a particular
163 <emphasis>type</emphasis> of filter, as when we speak of ``the
auth_simple filter'' or ``the multi filter''; or it may be used
to mean a specific <emphasis>instance</emphasis> of a filter
166 within a Metaproxy configuration. For example, a single
167 configuration will often contain multiple instances of the
168 <literal>z3950_client</literal> filter. In
operational terms, each of these is a separate filter. In practice,
context always makes it clear which sense of the word ``filter''
174 Extensibility of Metaproxy is primarily through the creation of
175 plugins that provide new filters. The filter API is small and
176 conceptually simple, but there are many details to master. See
178 <link linkend="extensions">extensions</link>.
184 Since packages are created and handled by the system itself, and
185 routes are conceptually simple, most of the remainder of this
document concentrates on filters. A brief overview of the
187 filter types follows, along with some thoughts on possible future
194 <chapter id="filters">
195 <title>Filters</title>
199 <title>Introductory notes</title>
201 It's useful to think of Metaproxy as an interpreter providing a small
202 number of primitives and operations, but operating on a very
203 complex data type, namely the ``package''.
206 A package represents a Z39.50 or SRW/U request (whether for Init,
207 Search, Scan, etc.) together with information about where it came
208 from. Packages are created by front-end filters such as
209 <literal>frontend_net</literal> (see below), which reads them from
210 the network; other front-end filters are possible. They then pass
211 along a route consisting of a sequence of filters, each of which
212 transforms the package and may also have side-effects such as
213 generating logging. Eventually, the route will yield a response,
214 which is sent back to the origin.
There are many kinds of filter: some are defined statically
218 as part of Metaproxy, and others may be provided by third parties
219 and dynamically loaded. They all conform to the same simple API
220 of essentially two methods: <function>configure()</function> is
221 called at startup time, and is passed a DOM tree representing that
222 part of the configuration file that pertains to this filter
223 instance: it is expected to walk that tree extracting relevant
224 information; and <function>process()</function> is called every
time the filter has to process a package.
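<para>
 As a minimal sketch of what this means in practice, consider a
 single <literal>log</literal> filter instance, using the same
 element that appears in the configuration examples in this guide.
 The DOM subtree rooted at the <literal>filter</literal> element
 is the part of the configuration that
 <function>configure()</function> would be given for this
 instance. How the filter interprets the
 <literal>message</literal> element is not described here, so the
 comments only mark the boundaries of the subtree.
</para>
<screen><![CDATA[
<!-- Start of the subtree passed to configure() for this instance -->
<filter type="log">
  <!-- Filter-specific configuration, read while walking the tree -->
  <message>B</message>
</filter>
<!-- End of the subtree; nothing outside it pertains to this instance -->
]]></screen>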
228 While all filters provide the same API, there are different modes
229 of functionality. Some filters are sources: they create
231 (<literal>frontend_net</literal>);
232 others are sinks: they consume packages and return a result
233 (<literal>z3950_client</literal>,
234 <literal>backend_test</literal>,
235 <literal>http_file</literal>);
236 the others are true filters, that read, process and pass on the
237 packages they are fed
238 (<literal>auth_simple</literal>,
239 <literal>log</literal>,
240 <literal>multi</literal>,
241 <literal>query_rewrite</literal>,
242 <literal>session_shared</literal>,
243 <literal>template</literal>,
244 <literal>virt_db</literal>).
250 <title>Overview of filter types</title>
252 We now briefly consider each of the types of filter supported by
253 the core Metaproxy binary. This overview is intended to give a
254 flavour of the available functionality; more detailed information
255 about each type of filter is included below in the Module
259 The filters are here named by the string that is used as the
260 <literal>type</literal> attribute of a
261 <literal><filter></literal> element in the configuration
262 file to request them, with the name of the class that implements
263 them in parentheses. (The classname is not needed for normal
264 configuration and use of Metaproxy; it is useful only to
268 The filters are here listed in alphabetical order:
272 <title><literal>auth_simple</literal>
273 (mp::filter::AuthSimple)</title>
275 Simple authentication and authorisation. The configuration
276 specifies the name of a file that is the user register, which
277 lists <varname>username</varname>:<varname>password</varname>
278 pairs, one per line, colon separated. When a session begins, it
is rejected unless username and password are supplied, and match
a pair in the register. The configuration file may also specify
the name of another file that is the target register: this
lists <varname>username</varname>:<varname>dbname</varname>,<varname>dbname</varname>...
283 sets, one per line, with multiple database names separated by
284 commas. When a search is processed, it is rejected unless the
285 database to be searched is one of those listed as available to
291 <title><literal>backend_test</literal>
292 (mp::filter::Backend_test)</title>
294 A sink that provides dummy responses in the manner of the
295 <literal>yaz-ztest</literal> Z39.50 server. This is useful only
296 for testing. Seriously, you don't need this. Pretend you didn't
297 even read this section.
302 <title><literal>frontend_net</literal>
303 (mp::filter::FrontendNet)</title>
305 A source that accepts Z39.50 and SRW connections from a port
306 specified in the configuration, reads protocol units, and
307 feeds them into the next filter in the route. When the result is
received, it is returned to the origin.
313 <title><literal>http_file</literal>
314 (mp::filter::HttpFile)</title>
316 A sink that returns the contents of files from the local
317 filesystem in response to HTTP requests. (Yes, Virginia, this
318 does mean that Metaproxy is also a Web-server in its spare time. So
319 far it does not contain either an email-reader or a Lisp
320 interpreter, but that day is surely coming.)
325 <title><literal>log</literal>
326 (mp::filter::Log)</title>
328 Writes logging information to standard output, and passes on
329 the package unchanged.
334 <title><literal>multi</literal>
335 (mp::filter::Multi)</title>
337 Performs multicast searching.
339 <link linkend="multidb">the extended discussion</link>
340 of virtual databases and multi-database searching below.
345 <title><literal>query_rewrite</literal>
346 (mp::filter::QueryRewrite)</title>
348 Rewrites Z39.50 Type-1 and Type-101 (``RPN'') queries by a
349 three-step process: the query is transliterated from Z39.50
350 packet structures into an XML representation; that XML
351 representation is transformed by an XSLT stylesheet; and the
352 resulting XML is transliterated back into the Z39.50 packet
358 <title><literal>session_shared</literal>
359 (mp::filter::SessionShared)</title>
361 When this is finished, it will implement global sharing of
362 result sets (i.e. between threads and therefore between
363 clients), yielding performance improvements especially when
364 incoming requests are from a stateless environment such as a
365 web-server, in which the client process representing a session
366 might be any one of many. However:
370 This filter is not yet completed.
376 <title><literal>template</literal>
377 (mp::filter::Template)</title>
379 Does nothing at all, merely passing the packet on. (Maybe it
380 should be called <literal>nop</literal> or
381 <literal>passthrough</literal>?) This exists not to be used, but
382 to be copied - to become the skeleton of new filters as they are
383 written. As with <literal>backend_test</literal>, this is not
384 intended for civilians.
389 <title><literal>virt_db</literal>
390 (mp::filter::Virt_db)</title>
392 Performs virtual database selection: based on the name of the
393 database in the search request, a server is selected, and its
394 address added to the request in a <literal>VAL_PROXY</literal>
395 otherInfo packet. It will subsequently be used by a
396 <literal>z3950_client</literal> filter.
398 <link linkend="multidb">the extended discussion</link>
399 of virtual databases and multi-database searching below.
404 <title><literal>z3950_client</literal>
405 (mp::filter::Z3950Client)</title>
407 Performs Z39.50 searching and retrieval by proxying the
408 packages that are passed to it. Init requests are sent to the
409 address specified in the <literal>VAL_PROXY</literal> otherInfo
attached to the request: this may have been specified by the client,
411 or generated by a <literal>virt_db</literal> filter earlier in
412 the route. Subsequent requests are sent to the same address,
413 which is remembered at Init time in a Session object.
420 <title>Future directions</title>
422 Some other filters that do not yet exist, but which would be
423 useful, are briefly described. These may be added in future
424 releases (or may be created by third parties, as loadable
430 <term><literal>frontend_cli</literal> (source)</term>
433 Command-line interface for generating requests.
438 <term><literal>srw2z3950</literal> (filter)</term>
441 Translate SRW requests into Z39.50 requests.
446 <term><literal>srw_client</literal> (sink)</term>
449 SRW searching and retrieval.
454 <term><literal>sru_client</literal> (sink)</term>
457 SRU searching and retrieval.
462 <term><literal>opensearch_client</literal> (sink)</term>
465 A9 OpenSearch searching and retrieval.
475 <chapter id="multidb">
476 <title>Virtual databases and multi-database searching</title>
480 <title>Introductory notes</title>
482 Two of Metaproxy's filters are concerned with multiple-database
483 operations. Of these, <literal>virt_db</literal> can work alone
484 to control the routing of searches to one of a number of servers,
485 while <literal>multi</literal> can work with the output of
486 <literal>virt_db</literal> to perform multicast searching, merging
487 the results into a unified result-set. The interaction between
488 these two filters is necessarily complex, reflecting the real
489 complexity of multicast searching in a protocol such as Z39.50
490 that separates initialisation from searching, with the database to
491 search known only during the latter operation.
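<para>
 By way of illustration, a route in which
 <literal>virt_db</literal> works alone might be sketched as
 follows. This is a sketch under stated assumptions rather than a
 tested configuration: the route <literal>id</literal>, the order
 of the filters and the <literal>virtual</literal> wrapper element
 are assumptions, while the database names, targets, port and
 timeout are the example values used elsewhere in this guide.
</para>
<screen><![CDATA[
<route id="start">
  <!-- Source: accepts client connections -->
  <filter type="frontend_net">
    <port>@:9000</port>
  </filter>
  <!-- Selects a server according to the database name searched.
       The "virtual" wrapper element is an assumption. -->
  <filter type="virt_db">
    <virtual>
      <database>loc</database>
      <target>z3950.loc.gov:7090/voyager</target>
    </virtual>
    <virtual>
      <database>idgils</database>
      <target>indexdata.dk/gils</target>
    </virtual>
  </filter>
  <!-- Sink: proxies the package to the selected server -->
  <filter type="z3950_client">
    <timeout>30</timeout>
  </filter>
</route>
]]></screen>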
494 ### Much, much more to say!
501 <chapter id="configuration">
502 <title>Configuration: the Metaproxy configuration file format</title>
506 <title>Introductory notes</title>
508 If Metaproxy is an interpreter providing operations on packages, then
509 its configuration file can be thought of as a program for that
510 interpreter. Configuration is by means of a single file, the name
511 of which is supplied as the sole command-line argument to the
512 <command>yp2</command> program.
515 The configuration files are written in XML. (But that's just an
516 implementation detail - they could just as well have been written
517 in YAML or Lisp-like S-expressions, or in a custom syntax.)
520 Since XML has been chosen, an XML schema,
521 <filename>config.xsd</filename>, is provided for validating
522 configuration files. This file is supplied in the
523 <filename>etc</filename> directory of the Metaproxy distribution. It
524 can be used by (among other tools) the <command>xmllint</command>
525 program supplied as part of the <literal>libxml2</literal>
529 xmllint --noout --schema etc/config.xsd my-config-file.xml
532 (A recent version of <literal>libxml2</literal> is required, as
533 support for XML Schemas is a relatively recent addition.)
538 <title>Overview of XML structure</title>
540 All elements and attributes are in the namespace
541 <ulink url="http://indexdata.dk/yp2/config/1"/>.
542 This is most easily achieved by setting the default namespace on
543 the top-level element, as here:
546 <yp2 xmlns="http://indexdata.dk/yp2/config/1">
549 The top-level element is <yp2>. This contains a
550 <start> element, a <filters> element and a
551 <routes> element, in that order. <filters> is
552 optional; the other two are mandatory. All three are
556 The <start> element is empty, but carries a
557 <literal>route</literal> attribute, whose value is the name of
the route at which to start running - analogous to the name of the
559 start production in a formal grammar.
562 If present, <filters> contains zero or more <filter>
563 elements; filters carry a <literal>type</literal> attribute and
564 contain various elements that provide suitable configuration for
565 filters of that type. The filter-specific elements are described
566 below. Filters defined in this part of the file must carry an
567 <literal>id</literal> attribute so that they can be referenced
571 <routes> contains one or more <route> elements, each
of which must carry an <literal>id</literal> attribute. One of the
573 routes must have the ID value that was specified as the start
574 route in the <start> element's <literal>route</literal>
575 attribute. Each route contains zero or more <filter>
576 elements. These are of two types. They may be empty, but carry a
577 <literal>refid</literal> attribute whose value is the same as the
578 <literal>id</literal> of a filter previously defined in the
<filters> section. Alternatively, a filter within a route
580 may omit the <literal>refid</literal> attribute, but contain
581 configuration elements similar to those used for filters defined
582 in the <filters> section.
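<para>
 Putting these rules together, a complete configuration file might
 look like the following sketch. The <literal>id</literal> values
 (<literal>frontend</literal>, <literal>backend</literal>,
 <literal>start</literal>) are arbitrary names chosen for
 illustration; the filter types and their child elements are those
 shown elsewhere in this guide. The file has not been validated
 against <filename>config.xsd</filename>, so treat it as an
 outline rather than a reference configuration.
</para>
<screen><![CDATA[
<?xml version="1.0"?>
<yp2 xmlns="http://indexdata.dk/yp2/config/1">
  <!-- Names the route at which processing starts -->
  <start route="start"/>
  <!-- Optional section: filters defined here must carry an id -->
  <filters>
    <filter id="frontend" type="frontend_net">
      <port>@:9000</port>
    </filter>
    <filter id="backend" type="backend_test"/>
  </filters>
  <routes>
    <route id="start">
      <!-- Empty element referring to a filter defined above -->
      <filter refid="frontend"/>
      <!-- Filter without refid, configured in place -->
      <filter type="log">
        <message>B</message>
      </filter>
      <filter refid="backend"/>
    </route>
  </routes>
</yp2>
]]></screen>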
588 <title>Filter configuration</title>
590 All <filter> elements have in common that they must carry a
591 <literal>type</literal> attribute whose value is one of the
592 supported ones, listed in the schema file and discussed below. In
addition, <filter>s occurring in the <filters> section
594 must have an <literal>id</literal> attribute, and those occurring
595 within a route must have either a <literal>refid</literal>
596 attribute referencing a previously defined filter or contain its
597 own configuration information.
600 In general, each filter recognises different configuration
601 elements within its element, as each filter has different
602 functionality. These are as follows:
606 <title><literal>auth_simple</literal></title>
<filter type="auth_simple">
  <userRegister>../etc/example.simple-auth</userRegister>
</filter>
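<para>
 A configuration that also names a target register might be
 sketched as follows. The <literal>targetRegister</literal>
 element name and its file path are assumptions made for
 illustration, as are the sample user and database names in the
 comments; only the register formats themselves are taken from
 the description of this filter earlier in this guide.
</para>
<screen><![CDATA[
<filter type="auth_simple">
  <!-- User register: one username:password pair per line, e.g.
         alice:secret
         bob:hunter2 -->
  <userRegister>../etc/example.simple-auth</userRegister>
  <!-- Target register (element name and path assumed): one line
       per user, listing the databases that user may search, e.g.
         alice:loc,idgils
         bob:idgils -->
  <targetRegister>../etc/example.target-auth</targetRegister>
</filter>
]]></screen>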
615 <title><literal>backend_test</literal></title>
617 <filter type="backend_test"/>
622 <title><literal>frontend_net</literal></title>
<filter type="frontend_net">
  <threads>10</threads>
  <port>@:9000</port>
</filter>
632 <title><literal>http_file</literal></title>
634 <filter type="http_file">
635 <mimetypes>/etc/mime.types</mimetypes>
637 <documentroot>.</documentroot>
638 <prefix>/etc</prefix>
645 <title><literal>log</literal></title>
<filter type="log">
  <message>B</message>
</filter>
654 <title><literal>multi</literal></title>
656 <filter type="multi"/>
661 <title><literal>query_rewrite</literal></title>
<filter type="query_rewrite">
  <xslt>pqf2pqf.xsl</xslt>
</filter>
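<para>
 The stylesheet named in the <literal>xslt</literal> element
 operates on the XML representation of the query. The element
 names of that representation are not described here, so the
 following sketch assumes nothing about them: it is simply the
 standard XSLT 1.0 identity transform, which would hand the query
 back unchanged. A real stylesheet would add more specific
 templates that match parts of the query and rewrite them.
</para>
<screen><![CDATA[
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Identity template: copy every node and attribute unchanged.
       More specific templates added alongside it take precedence
       for the nodes they match. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
]]></screen>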
670 <title><literal>session_shared</literal></title>
672 <filter type="session_shared">
679 <title><literal>template</literal></title>
681 <filter type="template"/>
686 <title><literal>virt_db</literal></title>
688 <filter type="virt_db">
690 <database>loc</database>
691 <target>z3950.loc.gov:7090/voyager</target>
694 <database>idgils</database>
695 <target>indexdata.dk/gils</target>
702 <title><literal>z3950_client</literal></title>
<filter type="z3950_client">
  <timeout>30</timeout>
</filter>
714 <chapter id="moduleref">
715 <title>Module Reference</title>
717 The material in this chapter includes the man pages material
722 <chapter id="extensions">
723 <title>Writing extensions for Metaproxy</title>
727 <chapter id="classes">
728 <title>Classes in the Metaproxy source code</title>
732 <title>Introductory notes</title>
734 <emphasis>Stop! Do not read this!</emphasis>
735 You won't enjoy it at all.
738 This chapter contains documentation of the Metaproxy source code, and is
739 of interest only to maintainers and developers. If you need to
740 change Metaproxy's behaviour or write a new filter, then you will most
741 likely find this chapter helpful. Otherwise it's a waste of your
742 good time. Seriously: go and watch a film or something.
743 <citetitle>This is Spinal Tap</citetitle> is particularly good.
746 Still here? OK, let's continue.
749 In general, classes seem to be named big-endianly, so that
750 <literal>FactoryFilter</literal> is not a filter that filters
751 factories, but a factory that produces filters; and
752 <literal>FactoryStatic</literal> is a factory for the statically
753 registered filters (as opposed to those that are dynamically
759 <title>Individual classes</title>
761 The classes making up the Metaproxy application are here listed by
762 class-name, with the names of the source files that define them in
767 <title><literal>mp::FactoryFilter</literal>
768 (<filename>factory_filter.cpp</filename>)</title>
770 A factory class that exists primarily to provide the
771 <literal>create()</literal> method, which takes the name of a
772 filter class as its argument and returns a new filter of that
773 type. To enable this, the factory must first be populated by
774 calling <literal>add_creator()</literal> for static filters (this
775 is done by the <literal>FactoryStatic</literal> class, see below)
776 and <literal>add_creator_dyn()</literal> for filters loaded
782 <title><literal>mp::FactoryStatic</literal>
783 (<filename>factory_static.cpp</filename>)</title>
785 A subclass of <literal>FactoryFilter</literal> which is
786 responsible for registering all the statically defined filter
787 types. It does this by knowing about all those filters'
788 structures, which are listed in its constructor. Merely
789 instantiating this class registers all the static classes. It is
790 for the benefit of this class that <literal>struct
791 yp2_filter_struct</literal> exists, and that all the filter
792 classes provide a static object of that type.
797 <title><literal>mp::filter::Base</literal>
798 (<filename>filter.cpp</filename>)</title>
800 The virtual base class of all filters. The filter API is, on the
801 surface at least, extremely simple: two methods.
802 <literal>configure()</literal> is passed a DOM tree representing
803 that part of the configuration file that pertains to this filter
804 instance, and is expected to walk that tree extracting relevant
805 information. And <literal>process()</literal> processes a
package (see below). That surface simplicity is a bit
807 misleading, as <literal>process()</literal> needs to know a lot
808 about the <literal>Package</literal> class in order to do
814 <title><literal>mp::filter::AuthSimple</literal>,
815 <literal>Backend_test</literal>, etc.
816 (<filename>filter_auth_simple.cpp</filename>,
817 <filename>filter_backend_test.cpp</filename>, etc.)</title>
819 Individual filters. Each of these is implemented by a header and
820 a source file, named <filename>filter_*.hpp</filename> and
821 <filename>filter_*.cpp</filename> respectively. All the header
822 files should be pretty much identical, in that they declare the
823 class, including a private <literal>Rep</literal> class and a
824 member pointer to it, and the two public methods. The only extra
825 information in any filter header is additional private types and
826 members (which should really all be in the <literal>Rep</literal>
827 anyway) and private methods (which should also remain known only
828 to the source file, but C++'s brain-damaged design requires this
829 dirty laundry to be exhibited in public. Thanks, Bjarne!)
832 The source file for each filter needs to supply:
837 A definition of the private <literal>Rep</literal> class.
842 Some boilerplate constructors and destructors.
847 A <literal>configure()</literal> method that uses the
848 appropriate XML fragment.
853 Most important, the <literal>process()</literal> method that
854 does all the actual work.
861 <title><literal>mp::Package</literal>
862 (<filename>package.cpp</filename>)</title>
864 Represents a package on its way through the series of filters
865 that make up a route. This is essentially a Z39.50 or SRU APDU
866 together with information about where it came from, which is
867 modified as it passes through the various filters.
872 <title><literal>mp::Pipe</literal>
873 (<filename>pipe.cpp</filename>)</title>
875 This class provides a compatibility layer so that we have an IPC
876 mechanism that works the same under Unix and Windows. It's not
877 particularly exciting.
882 <title><literal>mp::RouterChain</literal>
883 (<filename>router_chain.cpp</filename>)</title>
890 <title><literal>mp::RouterFleXML</literal>
891 (<filename>router_flexml.cpp</filename>)</title>
898 <title><literal>mp::Session</literal>
899 (<filename>session.cpp</filename>)</title>
906 <title><literal>mp::ThreadPoolSocketObserver</literal>
907 (<filename>thread_pool_observer.cpp</filename>)</title>
914 <title><literal>mp::util</literal>
915 (<filename>util.cpp</filename>)</title>
917 A namespace of various small utility functions and classes,
918 collected together for convenience. Most importantly, includes
919 the <literal>mp::util::odr</literal> class, a wrapper for YAZ's
925 <title><literal>mp::xml</literal>
926 (<filename>xmlutil.cpp</filename>)</title>
928 A namespace of various XML utility functions and classes,
929 collected together for convenience.
936 <title>Other Source Files</title>
938 In addition to the Metaproxy source files that define the classes
939 described above, there are a few additional files which are
940 briefly described here:
944 <term><literal>metaproxy_prog.cpp</literal></term>
947 The main function of the <command>yp2</command> program.
952 <term><literal>ex_router_flexml.cpp</literal></term>
955 Identical to <literal>metaproxy_prog.cpp</literal>: it's not clear why.
960 <term><literal>test_*.cpp</literal></term>
963 Unit-tests for various modules.
969 ### Still to be described:
970 <literal>ex_filter_frontend_net.cpp</literal>,
971 <literal>filter_dl.cpp</literal>,
972 <literal>plainfile.cpp</literal>,
973 <literal>tstdl.cpp</literal>.
982 <!-- This is just a lame way to get some vertical whitespace at
983 the end of the document -->
992 <!-- Keep this comment at the end of the file
997 sgml-minimize-attributes:nil
998 sgml-always-quote-attributes:t
1001 sgml-parent-document: "main.xml"
1002 sgml-local-catalogs: nil
1003 sgml-namecase-general:t
1004 nxml-child-indent: 1