This is part 5 of the series Z39.50 for Dummies. In part 4 I showed how to convert MARC21 records to line format or XML. In this article I will show you how to analyze MARC data on modern PC hardware.

PCs are very fast now and incredibly cheap. You can rent a quad-core Intel machine with 8GB RAM and unlimited traffic for 40 Euro/month (+VAT) in a data center. If the computer is fast enough, you don't have to spend much time on complex algorithms: you can use the raw power of the machine and take a brute-force approach.

In the following examples I will use the 7 million records from a dump of the Library of Congress (LoC) catalog. For details, please read the previous article, Z39.50 for Dummies - Part 4.

    $ for i in *.dat; do yaz-marcdump -f marc-8 -t utf-8 -o line "$i"; done > loc.txt
    $ du -hs loc.txt
    4.9G

The line dump of the LoC catalog is 4.9GB and fits into main memory - great!

    # count lines containing the last name "Calaminus"
    $ egrep -c Calaminus loc.txt
    4

4 hits; the search took 4 seconds of real time.

    # count records with an ISBN
    $ egrep -c ^020 loc.txt
    3999863

There are nearly 4 million ISBNs (out of 7 million records). The search took 11 seconds.

    # count URLs
    $ egrep -c http:// loc.txt
    265540

There are 265,540 URLs in the LoC records.

    # check for subject headings for the city of
    # Sausalito, California, using a regular expression
    $ egrep -c '^[67][0-9][0-9].*Sausalito' loc.txt
    19

There are 19 subject headings for Sausalito.

    # search with a typo in the name (a => o)
    $ egrep Sausolito loc.txt

No hits, due to the typo in the name. Try it with agrep, a grep program with approximate matching capabilities:

    $ agrep -c -1 Sausolito loc.txt
    282

282 hits; the search took 8 seconds.

The examples above are aimed at software developers and experienced librarians. They are helpful for a quick check of your bibliographic records, for data mining and analysis, or to double-check that your indexer works correctly.
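The same brute-force idea extends to simple per-field statistics. The following is only a minimal sketch, not part of the original workflow: it builds a tiny, made-up sample in what is assumed to be the yaz-marcdump line layout (a bare leader line, then one `TAG indicators $subfields` line per field), so that it runs without the 4.9GB dump. The sample records, the `count_tag` helper, and the variable names are all hypothetical. It uses `grep -E`, which is equivalent to `egrep`, and anchors on the tag plus a following space so that a leader line beginning with the same three digits is not counted by mistake.

```shell
#!/bin/sh
# Hypothetical two-record sample in (assumed) yaz-marcdump line format.
# For a real run, point "$sample" at loc.txt instead.
sample=$(mktemp)
cat > "$sample" <<'EOF'
00123nam a2200049 a 4500
001 1001
020    $a 0123456789
245 10 $a Sausalito harbor guide
651  0 $a Sausalito (Calif.) $v Guidebooks.

00124nam a2200049 a 4500
001 1002
245 10 $a Another title
856 41 $u http://example.org/
EOF

# Count field lines that start with a given three-digit tag followed by a space.
count_tag() {
    grep -c "^$1 " "$2"
}

record_count=$(count_tag 001 "$sample")    # one 001 field per record
isbn_count=$(count_tag 020 "$sample")      # records carrying an ISBN field
url_count=$(grep -c 'http://' "$sample")   # lines containing a URL
subject_count=$(grep -Ec '^[67][0-9][0-9] .*Sausalito' "$sample")

# Histogram of all tags: one "TAG COUNT" line per tag, sorted by tag.
tag_histogram=$(awk '/^[0-9][0-9][0-9] / { n[$1]++ }
                     END { for (t in n) print t, n[t] }' "$sample" | sort)

echo "records:  $record_count"
echo "ISBNs:    $isbn_count"
echo "URLs:     $url_count"
echo "subjects: $subject_count"
echo "$tag_histogram"

rm -f "$sample"
```

On the full dump, the awk pass reads the file once and reports how often every tag occurs, which gives a quick overview before deciding which fields are worth a closer look.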
If you want to set up a public system for end users, you of course need a real full-text engine such as our Zebra software.

Read the other articles of the series Z39.50 for Dummies: Part I, Part II, Part III, Part IV
Copyright Index Data LLC 2010
z39.50
Fun little tutorial. Demonstrates the need for an indexer, but pretty cool how you can do the same thing (sort of) with egrep. Very nice.