Z39.50 for Dummies - Part 4

This is part 4 of the series Z39.50 for dummies.

Libraries store and exchange bibliographic data in MARC records. MARC stands for MAchine-Readable Cataloging; the format was developed at the Library of Congress (LoC) beginning in the 1960s.

A dump of the LoC catalog (and of the catalogs of other libraries) is available at the Internet Archive in the collection marcrecords. The LoC catalog dump is split into 29 files, part01.dat to part29.dat. Each file is roughly 200 MB in size.

The great news is that the data from LoC is in the public domain (already paid for by US taxpayers, thank you!) and you can use the data for your own system.

Before you can import data, you must validate, convert, or fix the bibliographic data. I will now show how you can do this with the Index Data YAZ toolkit. The YAZ toolkit contains the program yaz-marcdump to dump MARC records.

yaz-marcdump called without an option will print the records in line format:

$ yaz-marcdump part01.dat | more
00720cam 22002051 4500
001 00000002
003 DLC
005 20040505165105.0
008 800108s1899 ilu 000 0 eng
010 $a 00000002
035 $a (OCoLC)5853149
040 $a DLC $c DSI $d DLC
050 00 $a RX671 $b .A92
100 1 $a Aurand, Samuel Herbert, $d 1854-
245 10 $a Botanical materia medica and pharmacology; $b drugs considered from a botanical, pharmaceutical, physiological, therapeutical and toxicological standpoint. $c By S. H. Aurand.
260 $a Chicago, $b P. H. Mallen Company, $c 1899.
300 $a 406 p. $c 24 cm.
500 $a Homeopathic formulae.
650 0 $a Botany, Medical.
650 0 $a Homeopathy $x Materia medica and therapeutics.
[...]

First, convert the MARC21 records from MARC-8 encoding to UTF-8 encoding:

$ yaz-marcdump -f marc-8 -t utf-8 -o marc \
    part01.dat > part.mrc
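Since the dump consists of 29 files, the same conversion has to be repeated for each part. The following Python sketch shows one way to drive that loop. It is only an illustration: the function names and the partNN.mrc output names are assumptions made for this example, and it expects yaz-marcdump to be installed and on the PATH.

```python
import subprocess


def part_names(count=29):
    """Return the input file names part01.dat .. partNN.dat."""
    return [f"part{n:02d}.dat" for n in range(1, count + 1)]


def build_command(source):
    """Build the yaz-marcdump call used above for one input file."""
    return ["yaz-marcdump", "-f", "marc-8", "-t", "utf-8", "-o", "marc", source]


def convert_all(count=29):
    """Convert every part; output names (partNN.mrc) are an assumption."""
    for source in part_names(count):
        target = source.replace(".dat", ".mrc")
        with open(target, "wb") as out:
            # check=True stops the loop at the first failing conversion.
            subprocess.run(build_command(source), stdout=out, check=True)


# Show the command for the first part without running anything.
print(" ".join(build_command(part_names()[0])))
```

Calling convert_all() runs the conversions one after another; nothing is executed until you call it.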

For MARC21, position 9 of the leader tells whether the record is encoded in MARC-8 (almost always the case) or in UTF-8. A MARC21 record in UTF-8 must have position 9 set to 'a' (ASCII value 97). Because the conversion changes the encoding, the option -l for yaz-marcdump comes in handy to set this position:

$ yaz-marcdump -f marc-8 -t utf-8 -o marc \
    -l 9=97 part01.dat > part.mrc
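To make the role of leader position 9 more concrete, here is a small Python sketch that reads and sets that byte in a raw record. The helper names are made up for this example. The sample leader is the one from the first record shown earlier, with the collapsed blanks restored to give the usual 24 bytes, which is an assumption on my part.

```python
LEADER_LENGTH = 24  # a MARC21 leader is always 24 bytes long


def coding_scheme(record: bytes) -> str:
    """Classify a raw record by leader position 9."""
    if len(record) < LEADER_LENGTH:
        raise ValueError("record is shorter than a MARC leader")
    flag = record[9:10]
    if flag == b"a":
        return "UTF-8"
    if flag == b" ":
        return "MARC-8"
    return "unknown"


def mark_as_utf8(record: bytes) -> bytes:
    """Set leader position 9 to 'a' (byte value 97).

    This only flips the flag, it does NOT convert the record data.
    """
    return record[:9] + bytes([97]) + record[10:]


# Leader of the first record (blanks restored, see note above).
leader = b"00720cam  22002051  4500"
print(coding_scheme(leader))
print(coding_scheme(mark_as_utf8(leader)))
```

Such a check is a quick way to spot records whose flag and actual encoding disagree before importing them.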

If you prefer MARCXML to MARC21 records, you can convert the records:

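$ yaz-marcdump -o marcxml -f MARC-8 -t UTF-8 \
    part01.dat > part.marcxml

<collection xmlns="http://www.loc.gov/MARC21/slim">
<record>
  <leader>00720cam a22002051 4500</leader>
  <controlfield tag="001">00000002</controlfield>
  <controlfield tag="003">DLC</controlfield>
  <controlfield tag="005">20040505165105.0</controlfield>
  <controlfield tag="008">800108s1899 ilu 000 0 eng</controlfield>
  <datafield tag="010" ind1=" " ind2=" ">
    <subfield code="a">00000002</subfield>
  </datafield>
  <datafield tag="035" ind1=" " ind2=" ">
    <subfield code="a">(OCoLC)5853149</subfield>
  </datafield>
[...]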
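Once the records are in MARCXML, any XML parser can read them. As a rough sketch, the following Python snippet pulls the control number and the author out of a record with the standard library. The embedded sample is a shortened, hand-written record based on the values shown earlier, and the helper name subfield_values is an assumption for this example.

```python
import xml.etree.ElementTree as ET

# Namespace of the MARCXML "slim" schema.
MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}

SAMPLE = """<collection xmlns="http://www.loc.gov/MARC21/slim">
<record>
  <leader>00720cam a22002051  4500</leader>
  <controlfield tag="001">00000002</controlfield>
  <controlfield tag="003">DLC</controlfield>
  <datafield tag="100" ind1="1" ind2=" ">
    <subfield code="a">Aurand, Samuel Herbert,</subfield>
    <subfield code="d">1854-</subfield>
  </datafield>
</record>
</collection>"""


def subfield_values(record, tag, code):
    """Return all values of subfield `code` in data fields with `tag`."""
    path = f"marc:datafield[@tag='{tag}']/marc:subfield[@code='{code}']"
    return [el.text for el in record.findall(path, MARC_NS)]


root = ET.fromstring(SAMPLE)
for record in root.findall("marc:record", MARC_NS):
    control_number = record.findtext(
        "marc:controlfield[@tag='001']", namespaces=MARC_NS
    )
    print(control_number, subfield_values(record, "100", "a"))
```

For a complete part file, an incremental parser such as ET.iterparse is probably the better choice, because it avoids loading the whole document into memory.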

The Library of Congress catalog has over 7 million records. That is a lot of data: 5.6 GB of raw data in total. Compressed, it is only 1.7 GB, about 30% of the raw size.

To convert compressed data, run yaz-marcdump in a UNIX pipe:

$ zcat part01.dat.gz | yaz-marcdump -f MARC-8 \
    -t UTF-8 -o marcxml /dev/stdin > part01.marcxml
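If you only want to look at the compressed records without converting them, the raw format is easy to walk through. Every record starts with its own length as five ASCII digits (the 00720 at the start of the first leader), so a reader can hop from record to record. Here is a minimal Python sketch of that idea, assuming the file is a plain concatenation of well-formed records. The function names and the tiny synthetic record are made up for this example.

```python
import gzip
import io


def iter_records(stream):
    """Yield raw records from a binary stream, using the length prefix."""
    while True:
        head = stream.read(5)
        if not head:
            return  # clean end of file
        if len(head) < 5 or not head.isdigit():
            raise ValueError(f"bad record length field: {head!r}")
        length = int(head)
        body = stream.read(length - 5)
        if len(body) != length - 5:
            raise ValueError("truncated record")
        yield head + body


def count_records(source):
    """Count records in a gzip file given as a path or a file object."""
    with gzip.open(source, "rb") as stream:
        return sum(1 for _ in iter_records(stream))


# Synthetic 26-byte record: a 24-byte leader, an empty directory
# (field terminator 0x1E) and the record terminator 0x1D.
sample = b"00026cam  2200025   4500\x1e\x1d"
print(count_records(io.BytesIO(gzip.compress(sample * 3))))
```

With a real file you would call count_records("part01.dat.gz") instead of passing an in-memory buffer.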

You can search a MARC dump with the UNIX grep tool:

$ yaz-marcdump -f marc-8 -t utf-8 part01.dat | \
    grep Sausalito
260 $a Sausalito, Calif. : $b University Science Books, $c 2000.
260 $a Sausalito, Calif. : $b Math Solutions Publications, $c c2000.
260 $a Sausalito, Calif. : $b Post-Apollo Press, $c c2000.
260 $a Sausalito, Calif. : $b University Science Books, $c c2002.
260 $a Sausalito, Calif. : $b Post-Apollo Press, $c c2000.
260 $a Sausalito, CA : $b Toland Communications, $c c2000.
260 $a Sausalito, CA : $b In Between Books, $c 2001.
[...]
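grep matches the word anywhere in a line, so a title or a subject heading containing "Sausalito" would show up as well. If you want to restrict the search to a given field and subfield, a few lines of Python can do that on the line format. This is only a sketch: the function names are assumptions, and because the line format is meant for display, a value that itself contains something like " $b " would be split incorrectly.

```python
import re

# "$x value" up to the next " $y " marker or the end of the line.
SUBFIELD = re.compile(r"\$(\w) (.*?)(?= \$\w |$)")


def parse_line(line):
    """Split one line-format line into (tag, {code: [values]})."""
    tag, _, rest = line.rstrip("\n").partition(" ")
    subfields = {}
    for code, value in SUBFIELD.findall(rest):
        subfields.setdefault(code, []).append(value)
    return tag, subfields


def matches(line, tag, code, needle):
    """True if field `tag` has a subfield `code` containing `needle`."""
    line_tag, subfields = parse_line(line)
    return line_tag == tag and any(needle in v for v in subfields.get(code, []))


lines = [
    "260 $a Sausalito, Calif. : $b University Science Books, $c 2000.",
    "650 0 $a Botany, Medical.",
]
for line in lines:
    if matches(line, "260", "a", "Sausalito"):
        print(line)
```

The same matches() call could be applied to each line of the yaz-marcdump output, for example by iterating over sys.stdin.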

The yaz-marcdump tool supports the character sets UTF-8, MARC-8, ISO8859-1, ISO5426 and some other encodings. For more information, see the yaz-iconv manual pages.

In this article I showed how to validate, convert, or fix bibliographic data dumped in MARC format. Next time I will show some advanced examples of how to analyze MARC records on modern standard PC hardware.


Read the other articles of the series Z39.50 for Dummies: Part I, Part II, Part III, Part V

1 Comment

When I tried to access the LoC catalog at the Internet Archive from the link in your article, I got "The item is not available due to issues with the item's content. If you would like to report this problem as an error report, you may do so here." It looks like they removed the MARC records.