Chapter 8. ALVIS XML Record Model and Filter Module

Table of Contents

1. ALVIS Record Filter
1.1. ALVIS Internal Record Representation
1.2. ALVIS Canonical Indexing Format
2. ALVIS Record Model Configuration
2.1. ALVIS Indexing Configuration
2.2. ALVIS Exchange Formats
2.3. ALVIS Filter OAI Indexing Example

Warning

The functionality of this record model has been improved and replaced by the DOM XML record model; see Chapter 7, DOM XML Record Model and Filter Module. The Alvis XML record model is considered obsolete and will eventually be removed from future releases of the Zebra software.

The record model described in this chapter applies to the fundamental, structured XML record type alvis, introduced in Section 2.5.2, “ALVIS XML Record Model and Filter Module”.

This filter has been developed under the ALVIS project funded by the European Community under the "Information Society Technologies" Program (2002-2006).

1. ALVIS Record Filter

The experimental, loadable Alvis XML/XSLT filter module mod-alvis.so is packaged in the GNU/Debian package libidzebra1.4-mod-alvis. It is invoked by the zebra.cfg configuration statement

     recordtype.xml: alvis.db/filter_alvis_conf.xml
    

In this example, the filter is applied to all data files with suffix *.xml, and the Alvis XSLT filter configuration file is found at the path db/filter_alvis_conf.xml.

The Alvis XSLT filter configuration file must be valid XML. It might look like this (this example is used for indexing and display of OAI-harvested records):

     <?xml version="1.0" encoding="UTF-8"?>
     <schemaInfo>
     <schema name="identity" stylesheet="xsl/identity.xsl" />
     <schema name="index" identifier="http://indexdata.dk/zebra/xslt/1"
     stylesheet="xsl/oai2index.xsl" />
     <schema name="dc" stylesheet="xsl/oai2dc.xsl" />
     <!-- use split level 2 when indexing whole OAI Record lists -->
     <split level="2"/>
     </schemaInfo>
    

All named stylesheets defined inside schema element tags are for presentation after search, including the indexing stylesheet (which is a great debugging help). The names defined in the name attributes must be unique; they are the literal schema or element set names used in SRW, SRU and Z39.50 protocol queries. The paths in the stylesheet attributes are either relative to Zebra's working directory or absolute from the file system root.

The <split level="2"/> element decides where the XML Reader splits a collection of records into individual records, which are then loaded into DOM and have the indexing XSLT stylesheet applied.
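To illustrate what a split level means, here is a minimal Python sketch. It is not Zebra code: the function name split_records and the sample document are made up for illustration, and it assumes that depth is counted from 0 at the document root (as libxml2's XML reader does), so that level 2 selects the grandchildren of the root.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: an OAI-like record list (element names are illustrative).
SAMPLE = """<OAI-PMH>
  <ListRecords>
    <record><title>first</title></record>
    <record><title>second</title></record>
  </ListRecords>
</OAI-PMH>"""


def split_records(xml_text, level):
    """Return the elements found at the given depth (root = depth 0, an assumption)."""
    records = []

    def walk(elem, depth):
        if depth == level:
            records.append(elem)  # the whole subtree becomes one record
            return
        for child in elem:
            walk(child, depth + 1)

    walk(ET.fromstring(xml_text), 0)
    return records


if __name__ == "__main__":
    for rec in split_records(SAMPLE, 2):
        print(ET.tostring(rec, encoding="unicode").strip())
```

Under that assumption, each record element of the sample would be handed to the indexing stylesheet as a separate record, while the enclosing wrapper elements are ignored.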

There must be exactly one indexing XSLT stylesheet, which is defined by the magic attribute identifier="http://indexdata.dk/zebra/xslt/1".
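The two constraints stated so far (unique schema names, exactly one indexing stylesheet) can be checked before starting an indexing run. The following Python sketch is a hypothetical helper, not part of Zebra; the function name check_filter_config is an assumption, and the embedded configuration repeats the example above without its XML declaration.

```python
import xml.etree.ElementTree as ET

# The magic identifier that marks the indexing stylesheet.
INDEX_ID = "http://indexdata.dk/zebra/xslt/1"

CONFIG = """<schemaInfo>
  <schema name="identity" stylesheet="xsl/identity.xsl" />
  <schema name="index" identifier="http://indexdata.dk/zebra/xslt/1"
          stylesheet="xsl/oai2index.xsl" />
  <schema name="dc" stylesheet="xsl/oai2dc.xsl" />
  <split level="2"/>
</schemaInfo>"""


def check_filter_config(xml_text):
    """Return a list of problems found in a filter configuration (empty if none)."""
    root = ET.fromstring(xml_text)
    schemas = root.findall("schema")
    problems = []

    # Schema names double as element set names, so they must be unique.
    names = [s.get("name", "") for s in schemas]
    duplicates = sorted({n for n in names if names.count(n) > 1})
    if duplicates:
        problems.append("duplicate schema names: " + ", ".join(duplicates))

    # Exactly one schema may carry the magic indexing identifier.
    indexing = [s for s in schemas if s.get("identifier") == INDEX_ID]
    if len(indexing) != 1:
        problems.append(
            "expected exactly one indexing stylesheet, found %d" % len(indexing)
        )
    return problems


if __name__ == "__main__":
    print(check_filter_config(CONFIG) or "configuration looks consistent")
```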

1.1. ALVIS Internal Record Representation

When indexing, an XML Reader is invoked to split the input files into suitable record XML pieces. Each record piece is then transformed to an XML DOM structure, which is essentially the record model. Only XSLT transformations can be applied during indexing, search and retrieval. Consequently, output formats are restricted to whatever XSLT can deliver from the record XML structure, be it other XML formats, HTML, or plain text. If your libxslt1 installation has EXSLT support, you can use this functionality inside the Alvis filter configuration XSLT stylesheets.

1.2. ALVIS Canonical Indexing Format

The output of the indexing XSLT stylesheets must contain certain elements in the magic xmlns:z="http://indexdata.dk/zebra/xslt/1" namespace. The output of the XSLT indexing transformation is then parsed using DOM methods, and the contained instructions are performed on the magic elements and their subtrees.

For example, the output of the command

      xsltproc xsl/oai2index.xsl one-record.xml
     

might look like this:

      <?xml version="1.0" encoding="UTF-8"?>
      <z:record xmlns:z="http://indexdata.dk/zebra/xslt/1"
      z:id="oai:JTRS:CP-3290---Volume-I"
      z:rank="47896">
      <z:index name="oai_identifier" type="0">
      oai:JTRS:CP-3290---Volume-I</z:index>
      <z:index name="oai_datestamp" type="0">2004-07-09</z:index>
      <z:index name="oai_setspec" type="0">jtrs</z:index>
      <z:index name="dc_all" type="w">
      <z:index name="dc_title" type="w">Proceedings of the 4th
      International Conference and Exhibition:
      World Congress on Superconductivity - Volume I</z:index>
      <z:index name="dc_creator" type="w">Kumar Krishen and *Calvin
      Burnham, Editors</z:index>
      </z:index>
      </z:record>
     

This means the following: from the original XML file one-record.xml (or from an XML record DOM of the same form coming from a split input file), the indexing stylesheet produces an indexing XML record, which is defined by the record element in the magic namespace xmlns:z="http://indexdata.dk/zebra/xslt/1". Zebra uses the content of z:id="oai:JTRS:CP-3290---Volume-I" as the internal record ID and, if static ranking is set, the content of z:rank="47896" as the static rank. Following the discussion in Section 9, “Relevance Ranking and Sorting of Result Sets”, we see that this record is internally ordered lexicographically according to the value of the string oai:JTRS:CP-3290---Volume-I47896.

In this example, the following literal indexes are constructed:

      oai_identifier
      oai_datestamp
      oai_setspec
      dc_all
      dc_title
      dc_creator
     

where the indexing type is defined in the type attribute (any value from the standard configuration file default.idx will do). Finally, any text() node content recursively contained inside the index will be filtered through the appropriate char map for character normalization, and will be inserted in the index.

Specific to this example, the single word oai:JTRS:CP-3290---Volume-I is inserted literally, byte for byte and without any form of character normalization, into the index named oai_identifier. The text Kumar Krishen and *Calvin Burnham, Editors is inserted into the index dc_creator using the w character normalization defined in default.idx (that is, after character normalization the index keeps the individual words kumar, krishen, and, calvin, burnham, and editors). Finally, both the text Proceedings of the 4th International Conference and Exhibition: World Congress on Superconductivity - Volume I and the text Kumar Krishen and *Calvin Burnham, Editors are inserted into the index dc_all using the same character normalization map w.
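The way nested index elements feed their enclosing index can be sketched in a few lines of Python. This is an illustration only, not Zebra's implementation: the function name index_terms is made up, the regular expression is a crude stand-in for the w character map of default.idx, and stripping surrounding whitespace for type 0 is likewise an assumption. The record is an abridged copy of the example above.

```python
import re
import xml.etree.ElementTree as ET

# ElementTree spells namespaced names as {namespace-uri}local-name.
Z = "{http://indexdata.dk/zebra/xslt/1}"

RECORD = """<z:record xmlns:z="http://indexdata.dk/zebra/xslt/1"
    z:id="oai:JTRS:CP-3290---Volume-I" z:rank="47896">
  <z:index name="oai_identifier" type="0">
    oai:JTRS:CP-3290---Volume-I</z:index>
  <z:index name="dc_all" type="w">
    <z:index name="dc_title" type="w">Proceedings of the 4th
      International Conference and Exhibition:
      World Congress on Superconductivity - Volume I</z:index>
    <z:index name="dc_creator" type="w">Kumar Krishen and *Calvin
      Burnham, Editors</z:index>
  </z:index>
</z:record>"""


def index_terms(record_xml):
    """Return (record id, {index name: list of terms}) for an indexing record."""
    root = ET.fromstring(record_xml)
    terms = {}
    for elem in root.iter(Z + "index"):
        # itertext() also yields the text of nested z:index elements.
        text = "".join(elem.itertext())
        if elem.get("type") == "w":
            # Assumed approximation of word normalization: lowercase, split on
            # anything that is not an ASCII letter or digit.
            tokens = re.findall(r"[a-z0-9]+", text.lower())
        else:
            # Assumed approximation of type 0: keep the string as one term.
            tokens = [text.strip()]
        terms.setdefault(elem.get("name"), []).extend(tokens)
    return root.get(Z + "id"), terms


if __name__ == "__main__":
    record_id, terms = index_terms(RECORD)
    print(record_id)
    for name, words in terms.items():
        print(name, words)
```

In this sketch dc_all ends up holding the terms of dc_title followed by those of dc_creator, because both elements are nested inside it.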

Finally, this example configuration can be queried using PQF queries, either transported by Z39.50 (here using yaz-client)

      
      Z> open localhost:9999
      Z> elem dc
      Z> form xml
      Z>
      Z> f @attr 1=dc_creator Kumar
      Z> scan @attr 1=dc_creator adam
      Z>
      Z> f @attr 1=dc_title @attr 4=2 "proceeding congress superconductivity"
      Z> scan @attr 1=dc_title abc
      
     

or by the proprietary extensions x-pquery and x-pScanClause to SRU and SRW

      
      http://localhost:9999/?version=1.1&operation=searchRetrieve&x-pquery=%40attr+1%3Ddc_creator+%40attr+4%3D6+%22the
      http://localhost:9999/?version=1.1&operation=scan&x-pScanClause=@attr+1=dc_date+@attr+4=2+a
      
     

See the section called “The SRU Server” for more information on SRU/SRW configuration, and the section called “YAZ server virtual hosts” or the YAZ CQL section for the details of the YAZ frontend server.
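PQF queries contain characters such as @, = and spaces that must be percent-encoded in an SRU URL. As a minimal sketch, the following Python helpers build such URLs with the standard library; the helper names sru_search_url and sru_scan_url are hypothetical, and the host and port are simply those of the examples above.

```python
from urllib.parse import urlencode

BASE = "http://localhost:9999/"  # server address taken from the examples above


def sru_search_url(pqf, base=BASE):
    """Build a searchRetrieve URL that carries a PQF query in x-pquery."""
    params = {"version": "1.1", "operation": "searchRetrieve", "x-pquery": pqf}
    return base + "?" + urlencode(params)


def sru_scan_url(pqf_scan, base=BASE):
    """Build a scan URL that carries a PQF scan clause in x-pScanClause."""
    params = {"version": "1.1", "operation": "scan", "x-pScanClause": pqf_scan}
    return base + "?" + urlencode(params)


if __name__ == "__main__":
    print(sru_search_url("@attr 1=dc_creator Kumar"))
    print(sru_scan_url("@attr 1=dc_title abc"))
```

The two queries used here are the ones from the yaz-client session, so the results can be compared across both transports.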

Notice that there are no *.abs, *.est, *.map, or other GRS-1 filter configuration files involved in this process, and that the literal index names are used during search and retrieval.