3. DOM Record Model Configuration

3.1. DOM Indexing Configuration

As mentioned above, there can be only one indexing pipeline, and configuring the indexing process amounts to writing an XSLT stylesheet which produces XML output containing the magic processing instructions or elements discussed in Section 2.5, “Canonical Indexing Format”. There are many different ways to accomplish this task, so some comments and code snippets are in order to enlighten the reader.

Stylesheets can be written in the pull or the push style. In the pull style, the output XML structure is taken as the starting point for the internal structure of the XSLT stylesheet, and portions of the input XML are pulled out and inserted into the right spots of the output XML structure. Push stylesheets, on the other hand, recursively apply their template definitions, a process which is driven by the input XML structure and which produces output XML whenever a template's match condition is met in the input. The pull type is well suited for input XML with a strong and well-defined structure and semantics, like the following OAI indexing example, whereas the push type might be the only possible way to sort out deeply recursive input XML formats.

A pull stylesheet example used to index OAI harvested records could use some of the following template definitions:

      
      <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
      xmlns:z="http://indexdata.com/zebra-2.0"
      xmlns:oai="http://www.openarchives.org/OAI/2.0/"
      xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
      xmlns:dc="http://purl.org/dc/elements/1.1/"
      version="1.0">

      <!-- Example pull and magic element style Zebra indexing -->
      <xsl:output indent="yes" method="xml" version="1.0" encoding="UTF-8"/>

      <!-- disable all default text node output -->
      <xsl:template match="text()"/>

      <!-- disable all default recursive element node traversal -->
      <xsl:template match="node()"/>

      <!-- match only on oai xml record root -->
      <xsl:template match="/">
      <z:record z:id="{normalize-space(oai:record/oai:header/oai:identifier)}">
      <!-- you may use z:rank="{some XSLT function here}" -->

      <!-- explicitly calling defined templates -->
      <xsl:apply-templates/>
     </z:record>
     </xsl:template>

      <!-- OAI indexing templates -->
      <xsl:template match="oai:record/oai:header/oai:identifier">
      <z:index name="oai_identifier:0">
      <xsl:value-of select="."/>
     </z:index>
     </xsl:template>

      <!-- etc, etc -->

      <!-- DC specific indexing templates -->
      <xsl:template match="oai:record/oai:metadata/oai_dc:dc/dc:title">
      <z:index name="dc_any:w dc_title:w dc_title:p dc_title:s ">
      <xsl:value-of select="."/>
     </z:index>
     </xsl:template>

      <!-- etc, etc -->

     </xsl:stylesheet>
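
For contrast, the same dc:title index could be produced in the push style. The following is only a minimal sketch, not taken from the Zebra distribution: it assumes that dc:title elements may occur at any depth of the input, and it relies on the XSLT built-in template rules to walk the input tree.

      <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
      xmlns:z="http://indexdata.com/zebra-2.0"
      xmlns:dc="http://purl.org/dc/elements/1.1/"
      version="1.0">

      <!-- Sketch of push style Zebra indexing (illustrative assumption) -->
      <xsl:output indent="yes" method="xml" version="1.0" encoding="UTF-8"/>

      <!-- suppress default text node output; element recursion stays enabled -->
      <xsl:template match="text()"/>

      <!-- wrap the output in one record and let the input drive the recursion -->
      <xsl:template match="/">
      <z:record>
      <xsl:apply-templates/>
      </z:record>
      </xsl:template>

      <!-- fires wherever a dc:title element is met, at any depth -->
      <xsl:template match="dc:title">
      <z:index name="dc_any:w dc_title:w dc_title:p dc_title:s">
      <xsl:value-of select="."/>
      </z:index>
      </xsl:template>

      </xsl:stylesheet>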
      
     

3.2. DOM Indexing MARCXML

The DOM filter allows indexing of both binary MARC records and MARCXML records, depending on its configuration. A typical MARCXML record might look like this:

      
      <record xmlns="http://www.loc.gov/MARC21/slim">
      <rank>42</rank>
      <leader>00366nam  22001698a 4500</leader>
      <controlfield tag="001">   11224466   </controlfield>
      <controlfield tag="003">DLC  </controlfield>
      <controlfield tag="005">00000000000000.0  </controlfield>
      <controlfield tag="008">910710c19910701nju           00010 eng    </controlfield>
      <datafield tag="010" ind1=" " ind2=" ">
      <subfield code="a">   11224466 </subfield>
     </datafield>
      <datafield tag="040" ind1=" " ind2=" ">
      <subfield code="a">DLC</subfield>
      <subfield code="c">DLC</subfield>
     </datafield>
      <datafield tag="050" ind1="0" ind2="0">
      <subfield code="a">123-xyz</subfield>
     </datafield>
      <datafield tag="100" ind1="1" ind2="0">
      <subfield code="a">Jack Collins</subfield>
     </datafield>
      <datafield tag="245" ind1="1" ind2="0">
      <subfield code="a">How to program a computer</subfield>
     </datafield>
      <datafield tag="260" ind1="1" ind2=" ">
      <subfield code="a">Penguin</subfield>
     </datafield>
      <datafield tag="263" ind1=" " ind2=" ">
      <subfield code="a">8710</subfield>
     </datafield>
      <datafield tag="300" ind1=" " ind2=" ">
      <subfield code="a">p. cm.</subfield>
     </datafield>
     </record>
      
     

String manipulation is easy in the DOM filter. For example, if you want to drop leading articles when indexing sort fields, you can use the MARCXML indicator attributes to chop off leading substrings. If the above XML example had an indicator ind2="8" in the title field 245, i.e.

      
      <datafield tag="245" ind1="1" ind2="8">
      <subfield code="a">How to program a computer</subfield>
     </datafield>
      
     

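one could write a template that takes this information into account and starts the sorting index title:s at character position 8. Since the XPath substring() function counts positions from 1, this drops the first 7 characters (“How to ”), like this: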

      
      <xsl:template match="m:datafield[@tag='245']">
      <xsl:variable name="chop">
      <xsl:choose>
      <xsl:when test="not(number(@ind2))">0</xsl:when>
      <xsl:otherwise><xsl:value-of select="number(@ind2)"/></xsl:otherwise>
     </xsl:choose>
     </xsl:variable>

      <z:index name="title:w title:p any:w">
      <xsl:value-of select="m:subfield[@code='a']"/>
     </z:index>

      <z:index name="title:s">
      <xsl:value-of select="substring(m:subfield[@code='a'], $chop)"/>
     </z:index>

     </xsl:template>
      
     

The output of the above MARCXML and XSLT excerpt would then be:

      
      <z:index name="title:w title:p any:w">How to program a computer</z:index>
      <z:index name="title:s">program a computer</z:index>
      
     

and the record would be sorted in the title index under 'P', not 'H'.
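
The XPath substring() function counts positions from 1, and in MARC 21 the second indicator of field 245 gives the number of nonfiling characters to skip. Under that reading, the start position would be the indicator value plus one. The following variant is a sketch under that assumption; the variable name skip is introduced here for illustration only. With ind2="7" (the length of “How to ”) it would yield the same title:s value as shown above.

      <xsl:template match="m:datafield[@tag='245']">
      <!-- number of nonfiling characters; 0 if ind2 is blank or not numeric -->
      <xsl:variable name="skip">
      <xsl:choose>
      <xsl:when test="number(@ind2) &gt; 0"><xsl:value-of select="number(@ind2)"/></xsl:when>
      <xsl:otherwise>0</xsl:otherwise>
      </xsl:choose>
      </xsl:variable>

      <!-- substring() is 1-based, so start one position after the skipped part -->
      <z:index name="title:s">
      <xsl:value-of select="substring(m:subfield[@code='a'], $skip + 1)"/>
      </z:index>
      </xsl:template>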

3.3. DOM Indexing Wizardry

The names and types of the indexes can be defined dynamically in the indexing XSLT stylesheet according to the content of the original XML records, which opens opportunities for great power and wizardry as well as grand disaster.

The following excerpt of a push stylesheet might be a good idea if you have strict control of the XML input format (due to rigorous checking against well-defined and tight RelaxNG or XML Schema definitions, for example):

      
      <xsl:template name="element-name-indexes">
      <z:index name="{name()}:w">
      <xsl:value-of select="'1'"/>
     </z:index>
     </xsl:template>
      
     

This template creates indexes named after the current node of any input XML file, and assigns a '1' to the index. The example query find @attr 1=xyz 1 finds all files which contain at least one xyz XML element. If you cannot control which element names the input files contain, you are asking for disaster and bad karma by using this technique.
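
Since the template above is a named template, it produces output only when it is called. One possible way to drive it, sketched here under the assumption that every element of the input should be registered, is a catch-all template that calls it for each element and then recurses. Note that name() returns the qualified name including any namespace prefix; local-name() could be used instead if prefixes are unwanted in index names.

      <!-- hypothetical driver: visit every element of the input -->
      <xsl:template match="*">
      <!-- call-template keeps the current node, so name() refers to this element -->
      <xsl:call-template name="element-name-indexes"/>
      <!-- continue with the child nodes -->
      <xsl:apply-templates/>
      </xsl:template>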

One variation on the theme of dynamically created indexes is definitely unwise:

      
      <!-- match on oai xml record root -->
      <xsl:template match="/">
      <z:record>

      <!-- create dynamic index name from input content -->
      <xsl:variable name="dynamic_content">
      <xsl:value-of select="oai:record/oai:header/oai:identifier"/>
     </xsl:variable>

      <!-- create zillions of indexes with unknown names -->
      <z:index name="{$dynamic_content}:w">
      <xsl:value-of select="oai:record/oai:metadata/oai_dc:dc"/>
     </z:index>
     </z:record>

     </xsl:template>
      
     

Don't be tempted to play overly smart tricks with the power of XSLT: the above example will create zillions of indexes with unpredictable names, resulting in severe Zebra index pollution.
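
A more restrained alternative, sketched here as an assumption rather than a recommendation from the Zebra documentation, derives index names only from a fixed list of element names, so that the set of indexes stays bounded. The choice of dc:title, dc:creator and dc:subject is illustrative, and the dc prefix is assumed to be bound to the Dublin Core namespace as in the OAI example.

      <!-- only these three elements may create indexes -->
      <xsl:template match="dc:title | dc:creator | dc:subject">
      <!-- yields dc_title:w, dc_creator:w or dc_subject:w -->
      <z:index name="dc_{local-name()}:w">
      <xsl:value-of select="."/>
      </z:index>
      </xsl:template>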

3.4. Debugging DOM Filter Configurations

It can be very hard to debug a DOM filter setup due to the many successive MARC syntax translations, XML stream splitting and XSLT transformations involved. As an aid, you always have the -s command line switch of the zebraidx indexing command at hand:

      zebraidx -s -c zebra.cfg update some_record_stream.xml
     

This command line simulates indexing and dumps a lot of debug information in the logs, telling exactly which transformations have been applied, what the documents look like after each transformation, and which record ids and terms are sent to the indexer.