Simple MARC Manipulation Using XSLT

In my colleague Wolfram's series of blog posts on using Z39.50, he shows how easy it is to acquire MARC records from openly available sources (but do show consideration for the people who run the servers!). This inspired me to think about other ways to use simple Unix command-line tools to manipulate these records once you have them. I know some folks will give me the evil eye for even acknowledging the existence of venerable MARC, but hey, I figure if more people know how to change the format around, maybe we'll get away from it faster!

I really like the idea of using XSLT to manipulate MARC data -- it's not a perfect translation language, to be sure, but it's a standard, which means it gives us a shared language. We can share tricks, patterns, and entire stylesheets for different purposes, from data conversion to display formatting, and high-performance XSLT processors can be embedded into just about any kind of software platform. The LoC makes a few nice, reusable stylesheets available for MARCXML work -- it would be nice to see something like that evolve into a public repository of stylesheets.

As I mentioned, it is not perfect. If you need very high performance, as in an interactive metasearch tool, you can do things faster by coding directly against the MARC schema (although I must say I am deeply amazed at how fast the libxslt library is). Running a very large dataset through xsltproc on the command line can also cause memory problems; I have not experimented with streaming XSLT processors to get around this. Our Zebra indexing engine, which uses libxslt to cook incoming records for different purposes, doesn't have these memory issues because it works on each record independently. The examples below operate on a whole file of MARC records in XML.

Because we're Unix people, we'll do things in a pipe. If we have a raw MARC input, we can use yaz-marcdump from the YAZ toolkit to generate a MARCXML output:

% yaz-marcdump -f marc8 -o marcxml myrecords.mrc

There. Nice, clean XML. Much nicer to work with. Tweak the parameters if your input doesn't use the MARC-8 character set. The output will be UTF-8. Now, a data field with a single subfield looks like this:

Jack Collins
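In MARCXML terms, that is a datafield element wrapping a single subfield element. A minimal sketch of such a field follows; the tag (100), the blank indicators, and the subfield code (a) are assumptions chosen for illustration, and only the name itself comes from the example above.

```xml
<!-- Assumed for illustration: tag 100, blank indicators, subfield code a -->
<datafield tag="100" ind1=" " ind2=" ">
  <subfield code="a">Jack Collins</subfield>
</datafield>
```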

Let's begin with a stylesheet that simply passes through the data unchanged:
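A pass-through stylesheet of this kind is usually written as the standard XSLT 1.0 identity transform. The sketch below is one way to do it, not necessarily the exact stylesheet used here. It assumes the records sit in the MARCXML namespace http://www.loc.gov/MARC21/slim; the marc prefix is my own naming and is declared here so that later templates can match on it. The file name convert.xsl matches the pipe command further down.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- convert.xsl: identity transform, copies the input unchanged -->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:marc="http://www.loc.gov/MARC21/slim"
    xmlns="http://www.loc.gov/MARC21/slim"
    exclude-result-prefixes="marc">

  <xsl:output method="xml" encoding="UTF-8"/>

  <!-- Match every attribute and node, copy it, then recurse into its children -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

Because this rule matches with the lowest default priority, any more specific template added later takes precedence for the nodes it names.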

There's something about XSLT that just pleases me no end... those brackets have a verbose kind of aesthetic all their own. I know this probably brands me as slightly goofy, but I like XSLT.

To run this stylesheet in a pipe, assuming you have a pipe-friendly XSLT processor (I use xsltproc), do:

% yaz-marcdump -f marc8 -o marcxml myrecords.mrc | xsltproc convert.xsl - > converted.xml

Now, with this 'null translation' as a base, we can start to add in all kinds of fun things. To begin with, let's say we simply want to remove a field, say, field 300. We can do this by simply adding the following empty-bodied template to the stylesheet above:
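As a sketch, here is a complete stylesheet with such a template in place. It assumes the records use the MARCXML namespace http://www.loc.gov/MARC21/slim, bound here to a marc prefix of my own choosing. The empty template outranks the generic copy rule for 300 fields and produces no output for them.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:marc="http://www.loc.gov/MARC21/slim"
    xmlns="http://www.loc.gov/MARC21/slim"
    exclude-result-prefixes="marc">

  <xsl:output method="xml" encoding="UTF-8"/>

  <!-- Identity rule: copy everything that no other template claims -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Empty body: every datafield with tag 300 is dropped from the output -->
  <xsl:template match="marc:datafield[@tag='300']"/>

</xsl:stylesheet>
```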

To add a field is only slightly more involved. Here is one way to add a field to the end of every record:

My private field
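One way to write this, as a sketch: override the copy rule for record elements, copy everything the record already holds, then emit the new field before the record closes. The tag (999), the blank indicators, and the subfield code (a) are assumptions for illustration, as is the marc prefix for the MARCXML namespace.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:marc="http://www.loc.gov/MARC21/slim"
    xmlns="http://www.loc.gov/MARC21/slim"
    exclude-result-prefixes="marc">

  <xsl:output method="xml" encoding="UTF-8"/>

  <!-- Identity rule: copy everything that no other template claims -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- For each record: copy its existing content, then append one new field -->
  <xsl:template match="marc:record">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
      <!-- Assumed values: tag 999, blank indicators, subfield a -->
      <datafield tag="999" ind1=" " ind2=" ">
        <subfield code="a">My private field</subfield>
      </datafield>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

The default namespace on the stylesheet element puts the literal datafield and subfield elements in the same namespace as the copied ones, so the output needs no extra prefixes.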

We can use the string processing functions in XPath to change the contents of fields, too, or we can use values from existing fields to construct new ones. Assume that you have an OPAC that allows you to deep-link to any record using a URL like 'http://myopac.org/ID', where ID is the content of your 001 field. You'd like to generate a MARC file that contains an 856 field for each record with this URL. Here's one way of doing it:

http://myopac.org/
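A sketch of how this could look: the template below appends an 856 field to each record and builds the URL in subfield u by concatenating the base address with the record's 001 control field. The blank indicators are placeholders, the marc prefix is my own naming for the MARCXML namespace, and normalize-space() is a precaution I have added to trim stray whitespace around the identifier.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:marc="http://www.loc.gov/MARC21/slim"
    xmlns="http://www.loc.gov/MARC21/slim"
    exclude-result-prefixes="marc">

  <xsl:output method="xml" encoding="UTF-8"/>

  <!-- Identity rule: copy everything that no other template claims -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- For each record: copy its content, then append an 856 built from field 001 -->
  <xsl:template match="marc:record">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
      <!-- Indicators left blank as placeholders -->
      <datafield tag="856" ind1=" " ind2=" ">
        <subfield code="u">
          <xsl:value-of select="concat('http://myopac.org/', normalize-space(marc:controlfield[@tag='001']))"/>
        </subfield>
      </datafield>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```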

Adding the right indicators and other subfields to the 856 is left as an exercise for the reader. The 856 field is, indeed, a particularly grim example of the failings of the MARC format. Nevertheless, I hope these examples give some folks a sense that you don't need specialized tools to manipulate it: the XML standard and its many associated tools and technologies come to the rescue. The transformations can be arbitrarily complex, and may combine multiple data elements, transform values, etc. The full arsenal of XSLT is available.

To round things off, let's finish our project by turning the transformed XML output back into MARC, in case we need to work with it in that form later.

% yaz-marcdump -i marcxml -o marc -f utf8 -t marc8 converted.xml > converted.mrc

1 Comment

XSL tags should be HTML-encoded

Your article would be a whole load better if the XSL tags had been HTML-encoded -- currently nothing displays for most of it :(