Z39.50 for Dummies - Part 5

This is part 5 of the series Z39.50 for Dummies. In part 4 I showed how to convert MARC21 records to line format or XML.

In this article I will show you how to analyze MARC data on modern PC hardware. PCs are very fast now and incredibly cheap. You can rent a quad-core Intel machine with 8GB RAM and unlimited traffic for 40 Euro/month (+VAT) in a data center.

If the computer is fast enough, you don’t have to spend too much time on complex algorithms. You can use the raw power of your computer and take a brute-force approach.

In the following example I will use the 7 million records from a dump of the Library of Congress (LoC) catalog. For details, please read the previous article Z39.50 for Dummies - Part 4.

$ for i in *.dat; do
    yaz-marcdump -f marc-8 -t utf-8 -o line "$i"
  done > loc.txt

$ du -hs loc.txt

The line dump of the LoC catalog is 4.9GB and fits into main memory - great!

# count for the last name “Calaminus”
$ egrep -c Calaminus loc.txt

4 hits, the search took 4 seconds real time

# count records with ISBN number
$ egrep -c ^020 loc.txt

There are nearly 4 million ISBN numbers (out of 7 million records). The search took 11 seconds.
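
Note that `egrep -c ^020` counts matching lines, that is 020 fields, and a single record may carry more than one ISBN. The following is a minimal sketch of how the two numbers could be told apart. It assumes that the line dump separates records with a blank line; `sample.txt` and its two records are made up for illustration and stand in for loc.txt.

```shell
# Tiny made-up stand-in for loc.txt: two records, separated by a blank line
cat > sample.txt <<'EOF'
00000nam a2200000 a 4500
001 1
020    $a 1111111111
020    $a 2222222222
245 10 $a First title

00000nam a2200000 a 4500
001 2
245 10 $a Second title
EOF

# Number of 020 fields (this is what egrep -c ^020 reports)
fields=$(grep -c '^020' sample.txt)

# Number of records with at least one 020 field.
# RS="" puts awk into paragraph mode, so every blank-line separated
# record is read as one unit and every line of it becomes one field.
records=$(awk 'BEGIN { RS = ""; FS = "\n" }
  { for (i = 1; i <= NF; i++) if ($i ~ /^020/) { n++; break } }
  END { print n + 0 }' sample.txt)

echo "020 fields: $fields, records with ISBN: $records"
```

On the real dump you would replace sample.txt with loc.txt; the awk pass reads the whole file once, just like egrep does.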

# count URLs
$ egrep -c http:// loc.txt

There are 265,540 URLs in the LoC records.

# check for subject headings for the city of
# Sausalito, California using regular expression
$ egrep -c '^[67][0-9][0-9].*Sausalito' loc.txt

There are 19 subject headings for Sausalito

# search with a typo in name (a => o)
$ egrep Sausolito loc.txt

No hits due to a typo in the name. Try it again with agrep, a grep program with approximate matching capabilities:

$ agrep -c -1 Sausolito loc.txt
282 hits, the search took 8 seconds

The examples above are for software developers and experienced librarians. They are helpful for a quick check of your bibliographic records, for data mining and analysis, or to double-check that your indexer works correctly.
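
As one more quick check in the same brute-force spirit, here is a sketch of a tag histogram that shows how often each MARC field occurs in a dump. The function name `tag_histogram` and the file `sample.txt` are made up for this example, and it assumes that every field line in the line dump starts with the three-digit tag followed by a space.

```shell
# Tiny made-up stand-in for loc.txt
cat > sample.txt <<'EOF'
00000nam a2200000 a 4500
001 1
020    $a 1111111111
020    $a 2222222222
245 10 $a First title

00000nam a2200000 a 4500
001 2
245 10 $a Second title
EOF

# Print "count tag" pairs, most frequent tag first.
# The grep keeps only field lines (three digits plus a space),
# which skips the leader lines and the blank lines between records.
tag_histogram() {
  grep -E '^[0-9]{3} ' "$1" | cut -c1-3 | sort | uniq -c | sort -rn
}

tag_histogram sample.txt
```

Run against loc.txt, the sort step has to handle one short line per field, so expect it to take noticeably longer than a plain egrep.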

If you want to set up a public system for end users, you of course need a real full-text engine such as our Zebra software.

Read the other articles of the series Z39.50 for Dummies: Part I, Part II, Part III, Part IV



Fun little tutorial. Demonstrates the need for an indexer, but pretty cool how you can do the same thing (sort of) with egrep. Very nice.