Z39.50 for Dummies - Part 5

This is part 5 of the series Z39.50 for Dummies. In part 4 I showed how to convert MARC21 records to line format or XML.

In this article I will show you how to analyze MARC data on modern PC hardware. PCs are very fast now and incredibly cheap: you can rent a quad-core Intel machine with 8GB RAM and unlimited traffic for 40 Euro/month (+VAT) in a data center.

If the computer is fast enough, you don’t have to spend much time on complex algorithms. You can use the raw power of your computer and take a brute-force approach.

In the following example I will use the 7 million records from a dump of the Library of Congress (LoC) catalog. For details, please read the previous article Z39.50 for Dummies - Part 4.

$ for i in *.dat; do
    yaz-marcdump -f marc-8 -t utf-8 -o line "$i"
  done > loc.txt

$ du -hs loc.txt
4.9G

The line dump of the LoC catalog is 4.9GB and fits into the main memory of such a machine - great!

# count for the last name “Calaminus”
$ egrep -c Calaminus loc.txt

4 hits; the search took 4 seconds of real time.

# count records with ISBN number
$ egrep -c ^020 loc.txt
3999863

There are nearly 4 million ISBN fields (out of 7 million records). The search took 11 seconds.
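
The count above is a count of lines that begin with 020, and a single record can carry more than one 020 field. The following is a minimal sketch for counting records instead. It assumes that the records in the line dump are separated by blank lines (please check this against your own dump), and the function name count_records_with_tag is made up for this illustration.

```shell
# Count records (not lines) that contain at least one field with the given tag.
# Assumption: records in the line dump are separated by blank lines.
count_records_with_tag() {
  tag="$1"
  file="$2"
  awk -v tag="$tag" '
    BEGIN { RS = ""; FS = "\n" }   # paragraph mode: one blank-line-separated block per record
    {
      # each field is one line of the record; stop at the first line with the tag
      for (i = 1; i <= NF; i++)
        if (substr($i, 1, 3) == tag) { n++; break }
    }
    END { print n + 0 }
  ' "$file"
}
```

Usage: count_records_with_tag 020 loc.txt. This reads the whole file once, so it should take a similar brute-force amount of time as the egrep runs, though I have not measured it on the LoC dump.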

# count URLs
$ egrep -c http:// loc.txt
265540
There are 265,540 URLs in the LoC records.

# check for subject headings for the city of
# Sausalito, California using regular expression
$ egrep -c '^[67][0-9][0-9].*Sausalito' loc.txt
19
There are 19 subject headings for Sausalito.
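
The regular expression matches every line whose tag lies between 600 and 799, so it can be useful to see which tags the hits actually come from. Here is a small sketch that breaks the matching lines down by tag, assuming the tag is the first three characters of each line in the line dump. The function name tag_breakdown is made up for this illustration.

```shell
# Print a count of matching lines per MARC tag.
# Assumption: the tag is the first three characters of every field line.
tag_breakdown() {
  pattern="$1"
  file="$2"
  # keep only the tag of each matching line, then count identical tags
  grep -E "$pattern" "$file" | cut -c1-3 | sort | uniq -c
}
```

Usage: tag_breakdown '^[67][0-9][0-9].*Sausalito' loc.txt. Each output line shows a count followed by a tag.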

# search with a typo in name (a => o)
$ egrep Sausolito loc.txt

No hits due to a typo in the name. Try it with agrep, a grep program with approximate matching capabilities:

$ agrep -c -1 Sausolito loc.txt
282
282 hits; the search took 8 seconds.
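
If agrep is not installed, a rough fallback is to build a regular expression that allows any single character of the word to be different. This is only a sketch: it covers substitutions but not insertions or deletions, so it is narrower than agrep -1, and it assumes the word contains no regular expression metacharacters. The function name one_substitution_regex is made up for this illustration.

```shell
# Build an extended regex that matches WORD with any one character substituted.
# Assumption: WORD contains only plain letters or digits (no regex metacharacters).
one_substitution_regex() {
  word="$1"
  printf '%s\n' "$word" | awk '{
    n = length($0)
    for (i = 1; i <= n; i++) {
      # replace the i-th character with "." (any character)
      alt = substr($0, 1, i - 1) "." substr($0, i + 1)
      out = out (i > 1 ? "|" : "") alt
    }
    print out
  }'
}
```

Usage: grep -E -c "$(one_substitution_regex Sausolito)" loc.txt. For the misspelled name this generates alternatives such as Saus.lito, which matches the correct spelling Sausalito.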

The examples above are for software developers and experienced librarians. They are helpful for a quick check of your bibliographic records, for data mining and analysis, or to double-check that your indexer works correctly.

If you want to set up a public system for end users, you of course need a real full-text engine such as our Zebra software.


Read the other articles of the series Z39.50 for Dummies: Part I, Part II, Part III, Part IV

3 Comments

Fun little tutorial. Demonstrates the need for an indexer, but pretty cool how you can do the same thing (sort of) with egrep. Very nice.