Turbomarc, faster XML for MARC records

Our metasearch middleware, Pazpar2, spends a lot of time doing XML transformations. When we use Pazpar2 with traditional library data sources that return MARC21, we internally convert the received records into MARCXML (if they’re not already represented as such) and then transform them into the internal Pazpar2 XML format using XSLT (more on this process here).

MARCXML is nice to look at, but it’s not an optimal format on which to perform XSL transformations.

So we did some performance testing, and we found that much of the CPU time was spent on the XSL transformation from MARCXML to our internal, normalized data format.

We decided to try out a format (we call it Turbomarc) in which the names of MARC fields and subfields are represented as XML element names rather than as attribute values, while leaving an option open for special cases. This MARCXML:

<collection xmlns="http://www.loc.gov/MARC21/slim">
<record>
  <leader>00492nam a22001455a 4500</leader>
  <controlfield tag="001">000277485</controlfield>
  <datafield tag="100" ind1="1" ind2=" ">
    <subfield code="a">Μαρούδης, Κωνσταντίνος Ιω</subfield>
  </datafield>
  <datafield tag="250" ind1=" " ind2=" ">
    <subfield code="η"> εκδ.</subfield>
  </datafield>
</record>
</collection>

becomes the following in Turbomarc:

<c xmlns="http://www.indexdata.com/turbomarc">
<r>
  <l>00492nam a22001455a 4500</l>
  <c001>000277485</c001>
  <d100 i1="1" i2=" ">
    <sa>Μαρούδης, Κωνσταντίνος Ιω</sa>
  </d100>
  <d250 i1=" " i2=" ">
    <s code="η"> εκδ.</s>
  </d250>
</r>
</c>

This shows the special case where a non-alphanumeric attribute value is not combined into an element name, but is left as an attribute (this happens very rarely in real use).
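
The mapping described above can be sketched in a few lines of Python. This is only an illustration of the naming rule, not the converter that ships with YAZ. It assumes that "alphanumeric" means ASCII letters and digits, and that a field with a non-alphanumeric tag would keep a tag attribute in the same way a subfield keeps its code attribute (the example above only shows the subfield case). The function name marcxml_to_turbomarc and the helper names are made up for this sketch.

```python
import re
import xml.etree.ElementTree as ET

MARCXML_NS = "http://www.loc.gov/MARC21/slim"
TURBOMARC_NS = "http://www.indexdata.com/turbomarc"

# Serialize Turbomarc elements without a prefix (xmlns="...").
ET.register_namespace("", TURBOMARC_NS)

# Assumption: "alphanumeric" means ASCII letters and digits only, so the
# Greek subfield code in the example falls back to the attribute form.
_ASCII_ALNUM = re.compile(r"[A-Za-z0-9]+")

SAMPLE = """<collection xmlns="http://www.loc.gov/MARC21/slim">
<record>
  <leader>00492nam a22001455a 4500</leader>
  <controlfield tag="001">000277485</controlfield>
  <datafield tag="100" ind1="1" ind2=" ">
    <subfield code="a">Μαρούδης, Κωνσταντίνος Ιω</subfield>
  </datafield>
  <datafield tag="250" ind1=" " ind2=" ">
    <subfield code="η"> εκδ.</subfield>
  </datafield>
</record>
</collection>"""


def _q(name):
    """Qualify a local name with the Turbomarc namespace."""
    return f"{{{TURBOMARC_NS}}}{name}"


def _local(tag):
    """Strip the '{namespace}' part from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def _named(parent, prefix, attr, value):
    """Add <prefix+value>, or <prefix attr="value"> when value is not alphanumeric."""
    value = value or ""
    if _ASCII_ALNUM.fullmatch(value):
        return ET.SubElement(parent, _q(prefix + value))
    child = ET.SubElement(parent, _q(prefix))
    child.set(attr, value)  # hypothetical fallback for field tags; shown in the post for subfields
    return child


def marcxml_to_turbomarc(marcxml_text):
    """Return a Turbomarc <c> element built from a MARCXML string."""
    source = ET.fromstring(marcxml_text)
    out = ET.Element(_q("c"))
    # iter() also yields the root itself, so a bare <record> works too.
    for record in source.iter(f"{{{MARCXML_NS}}}record"):
        r = ET.SubElement(out, _q("r"))
        for field in record:
            kind = _local(field.tag)
            if kind == "leader":
                ET.SubElement(r, _q("l")).text = field.text
            elif kind == "controlfield":
                _named(r, "c", "tag", field.get("tag")).text = field.text
            elif kind == "datafield":
                d = _named(r, "d", "tag", field.get("tag"))
                for n in ("1", "2"):  # simplification: only ind1 and ind2
                    indicator = field.get("ind" + n)
                    if indicator is not None:
                        d.set("i" + n, indicator)
                for sub in field:
                    if _local(sub.tag) == "subfield":
                        _named(d, "s", "code", sub.get("code")).text = sub.text
    return out


if __name__ == "__main__":
    print(ET.tostring(marcxml_to_turbomarc(SAMPLE), encoding="unicode"))
```

Note that Python's str.isalnum() would accept the Greek letter η, which is why the sketch uses an explicit ASCII pattern to reproduce the special case.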

Timing the stylesheets with xsltproc --timing showed that our transformations ran faster by a factor of 4-5. Shortening the element names improved performance only fractionally, but since everything counts, we decided to do this as well.
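
To give a feel for where the saving might come from, compare how the same author subfield is selected in the two formats. The sketch below uses Python's ElementTree path syntax rather than XSLT, so it only illustrates the shape of the expressions and is not a benchmark. The function names and the trimmed-down sample records are made up for this example.

```python
import xml.etree.ElementTree as ET

MARCXML = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="100" ind1="1" ind2=" ">
    <subfield code="a">Μαρούδης, Κωνσταντίνος Ιω</subfield>
  </datafield>
</record>"""

TURBOMARC = """<r xmlns="http://www.indexdata.com/turbomarc">
  <d100 i1="1" i2=" ">
    <sa>Μαρούδης, Κωνσταντίνος Ιω</sa>
  </d100>
</r>"""

# Prefixes are local to this sketch; only the namespace URIs matter.
NS = {
    "m": "http://www.loc.gov/MARC21/slim",
    "t": "http://www.indexdata.com/turbomarc",
}


def author_from_marcxml(xml_text):
    """MARCXML: match generic element names, then filter on attribute values."""
    record = ET.fromstring(xml_text)
    node = record.find("m:datafield[@tag='100']/m:subfield[@code='a']", NS)
    return node.text if node is not None else None


def author_from_turbomarc(xml_text):
    """Turbomarc: the field and subfield are identified by element name alone."""
    record = ET.fromstring(xml_text)
    node = record.find("t:d100/t:sa", NS)
    return node.text if node is not None else None


if __name__ == "__main__":
    print(author_from_marcxml(MARCXML))
    print(author_from_turbomarc(TURBOMARC))
```

In the MARCXML case every datafield and subfield has to be tested against an attribute predicate, while in the Turbomarc case a plain name match is enough.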

A single user probably won’t notice the difference, but it will certainly allow more throughput. We measured the average response time for different numbers of users (figure: average response time). The results show that, for a fixed average response time on a typical Pazpar2 webservice command, we can double the number of users in a high-throughput stress test. The corresponding number of ‘real’ users would be much higher.

Support for the Turbomarc format was released with version 4.0.1 of YAZ. It is available in the ZOOM layer, and thus in zoomsh, by using txml instead of xml in the show command:

ZOOM>open a-z-target
ZOOM>search water
ZOOM>show 0 1 txml

Pazpar2 supports Turbomarc from version 1.4.0 and Turbomarc will eventually become the default format; however, at present, MARCXML remains the default to avoid breaking existing applications of Pazpar2.

Another recent change in Pazpar2 is support for threading, which makes better use of multi-core systems. This is still at beta level and could be a topic for a future blog entry.

Attachment: combined_show.png (21.69 KB)

3 Comments

Turbomarc is faster also on SimpleXML

The Turbomarc format is also faster (30-40%) than MARCXML when parsed with SimpleXML in PHP. It also makes the Turbomarc SimpleXML object very flexible for (UNI)MARC record manipulation.

Yes, this is kinda clever. I know since we've been doing it in LIBRIS for the last five years :)

wow 5 years ago!... any information on the web from LIBRIS about that?