In this article I will show you how to analyze MARC data on modern PC hardware. PCs are very fast now and incredibly cheap: you can rent a quad-core Intel machine with 8GB RAM and unlimited traffic in a data center for 40 Euro/month (+VAT).
If the computer is fast enough, you don’t have to spend much time on complex algorithms. You can use the raw power of your machine and take a brute-force approach.
In the following example I will use the 7 million records from a dump of the Library of Congress (LoC) catalog. For details, please read the previous article Z39.50 for Dummies - Part 4.
$ for f in *.mrc; do
    yaz-marcdump -f marc-8 -t utf-8 -o line "$f"
  done > loc.txt
$ du -hs loc.txt
The line dump of the LoC catalog is 4.9GB in size and fits into main memory - great!
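With the whole dump in one flat text file, a quick sanity check is to count the records before searching. A sketch, assuming each record in the line dump carries exactly one 001 control field (your dump's field layout may differ):

```shell
# count records by counting 001 control fields (assumes one 001 per record)
egrep -c '^001' loc.txt
```

The result should be close to the 7 million records mentioned above.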
$ egrep -c Calaminus loc.txt
4 hits; the search took 4 seconds of real time.
$ egrep -c ^020 loc.txt
There are nearly 4 million ISBNs (out of 7 million records). The search took 11 seconds.
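Counting is one thing; you can just as easily pull the ISBNs themselves out of the dump. A sketch, assuming the line dump renders subfields as `$a …` (the sed pattern may need adjusting to your dump's exact layout):

```shell
# print the $a subfield of every 020 field, i.e. the ISBN itself
egrep '^020' loc.txt | sed -n 's/.*\$a *\([0-9Xx-]*\).*/\1/p'
```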
$ egrep -c http:// loc.txt
# search for Sausalito, California, using a regular expression
$ egrep -c '^[0-9][0-9].*Sausalito' loc.txt
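The leading digits in the pattern restrict matches to lines that start with a numeric MARC tag, so a "Sausalito" appearing in a line without a tag would not count. A tiny demo on made-up lines:

```shell
# matches: line starts with tag digits and contains Sausalito
printf '260    $a Sausalito, Calif.\n' | egrep -c '^[0-9][0-9].*Sausalito'   # → 1
# no match: no leading tag digits
printf 'published in Sausalito\n' | egrep -c '^[0-9][0-9].*Sausalito' || true  # → 0
```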
$ egrep Sausolito loc.txt
No hits, due to a typo in the name. Try it with agrep, a grep variant with approximate matching capabilities; here allowing one error:

$ agrep -1 Sausolito loc.txt
The examples above are aimed at software developers and experienced librarians. They are helpful for quick checks of your bibliographic records, for data mining and analysis, or for double-checking that your indexer works correctly.
If you want to set up a public system for end-users, you will of course need a real full-text engine such as our Zebra software.