Creative Use of Pazpar2

It’s always fun to see someone do something really neat with your software. This elegantly designed search interface for Asia studies makes excellent use of Pazpar2. I particularly like the clever use of a bar chart for the date facet. Nice work!

What if it’s about the People, Stupid?

Most of the science of Information Retrieval centers around being able to find and rank the right set of documents in response to a given query. We spend much time arguing about technical details like ranking algorithms and the benefits of indexing versus broadcast searching. Every Information Professional I know both deifies and fears Google because they get it right most of the time — enough so that many people tend to assume that whatever pops to the top of a Google search MUST be right, because it’s right there, in the result screen. Producing that seemingly simple list of documents from a query is ultimately what our work and art boils down to.

But what if we’ve got it all wrong?

Web 2.0, by many definitions, is about collaborative content creation. Tim Berners-Lee says that Web 3.0 is going to be about Linked Data.

What if the most important function of the web is really to bring people together around shared goals, dreams, and interests — and those precious documents are really just so much metadata about people’s skills, knowledge, and potential?

Watch somebody study a field long enough and deep enough, and eventually it seems like it becomes more about the Big Thinkers who have characterized the field than about any individual theory or document. You want to get into their heads; understand what inspires them. If you can, you want to ask them questions and learn from them. Then you build upon it.

It’s easy to find the Einsteins, the Darwins, the Freuds. Dig a little deeper, and you find the people, themselves giants, who stood on their shoulders. But what if you want to find the person or team who knows the MOST about a very specific type of protein, about building Firefox Plugins, or about adjusting the wheel bearings on a certain kind of vintage motorcycle? This seems to happen to me all the time when I am digging into something for work or play. If you get deep enough, looking for documents doesn’t sate your appetite, and you start to look for the masters of the domain — to see what they’re up to; to get into their heads, to ask them questions, or to hire them.

We spend an awful lot of time talking about workset merging, deduplication, persistent URLs for works. But maybe we need to spend more time talking about the authors. Maybe at a certain level of research, that humble Author facet becomes what it’s all about, and we should all be spending more time worrying about the dismal state of author name normal forms and authorities in the sources we work with. Are there things we should learn about relationships among authors in published works (and mentions of authors in casual websites) that might help guide our users more quickly to the experts, the super-groups, and that one forum where most people are friendly and knowledgeable?

Index Data’s Integrated Discovery Model

We are often asked about where we stand on the discussion of central indexing versus broadcast metasearching. Our standard answer: “You probably need some of both” always calls for further explanation. Some time ago, I wrote this up for a potential business partner. If it sounds a little like a marketing spiel… guilty as charged. I hope the content will still seem interesting to some folks thinking about these issues. While our specific approach and technology may be ours alone, the technical issues described here are pretty universal.

It is our thesis that a discovery platform which bases itself wholly on either broadcast metasearching or a centralized harvest-and-index model is inherently limited, as compared to one which seamlessly integrates both. It is our goal to provide a versatile platform, capable of handling very large indexes as well as dynamically searching remote content. Our MasterKey platform is an expression of this idea.

At the core of our platform is Pazpar2, a highly optimized multi-protocol search engine, capable of searching large numbers of heterogeneous resources in parallel. It features a web services API, a data model-agnostic approach to incoming search results, and a cloud-friendly, dynamic configuration mechanism, designed to fit into a variety of application environments. Pazpar2 will search a number of resources concurrently, through multiple protocols, while normalizing and integrating results.

Through its SRU/Z39.50 capability, Pazpar2 can access standards-compliant library catalogs, commercial database providers, etc. Through gateway technologies, it can access a range of other resources: Our SimpleServer provides a simple platform on which to implement gateways to proprietary APIs; our Connector Platform enables access to searchable resources through web-based user interfaces. Facets are constructed on the fly through analysis of incoming results.

Through its integrated Solr client, Pazpar2 can access locally and remotely maintained indexes, taking direct advantage of the capabilities of the Solr search engine to produce high-quality ranked and faceted results with lightning-fast response times.

When accessing a locally operated Solr instance, configured to support an application-specific data model in tandem with Pazpar2, the data normalization step is eliminated, conserving CPU cycles. Facets can be derived directly from the Solr system, eliminating the need to analyze individual records. All local results can be integrated and merged before presentation to the user, providing a highly responsive experience. Data can be freely distributed across Solr instances, allowing for scalability management options in addition to those offered by Solr itself.

Through Pazpar2’s data normalization mechanism, remotely maintained Solr instances with arbitrary record models are also supported, as long as a normalization stylesheet for the data model is supplied. This allows for a more sophisticated level of integration with the increasing number of Solr-based applications in the field, and increases the versatility of the platform.

Combining these elements, the user of a hybrid local/remote discovery platform will experience excellent response times on local content, and will see remote results added to the display as soon as they become available (how this is accomplished is a user interface decision).

In addition to the elements described above, the MasterKey platform includes a harvester to schedule and manage harvesting jobs for local indexing purposes (HTTP bulk file download and OAI-PMH supported at present) as well as a service-oriented architecture for managing searchable resources, users, and subscriptions in a hierarchical way designed to address consortial requirements. All components are modular, sharing a service-oriented architecture, and designed to fit into as many architectural and operational contexts as possible.

On preferring open-source software

I spent most of last week up in Edinburgh, for the Open Edge conference on open-source software in libraries, attended mostly by academic librarians and their technical people. It was an interesting time, and I met a lot of interesting people. At the risk of overusing the word “interesting”, it was also of interest to see how widespread the deployment of “next-generation OPACs” like VuFind and Blacklight has become. Since both of these use Solr as their back-end database, their progressive adoption in libraries has helped to drive that of Solr, which is part of why we recently added support for Solr to our own protocol toolkit.

One reason I went to Open Edge was to give my own talk on Metasearching and local indexes: making it Just Work when two worlds collide. I got to present how our thinking about resource discovery is evolving, and our growing conviction that it’s unnecessary to insist on either metasearching or harvesting data into a big central index and searching that. We can and should (and indeed we do, for some of our customers) do both, and integrate them seamlessly.

Open Edge also gave me a strong sense that the world is changing. As recently as two or three years ago, conferences about open-source software had the tenor of “Oh, please recognise that open-source is a valid model, don’t write us off as a bunch of hobbyists”. Now that war is largely won, and it’s universally recognised, at least by the kinds of people who come to Open Edge, that open-source software is a mainstream option rather than some kind of communist weirdo alternative.

Back in the Bad Old Days (by which I guess I mean 2008), meetings like Open Edge felt very slightly clandestine … as though we were meeting in secret, or at least off on our own somewhere, because the Big Boys – the proprietary vendors – wouldn’t let us join in any reindeer games. Open-sourcers were tolerated at big events like ALA, where we could get lost in the crowd, but if we wanted to give presentations and suchlike then matters might be different. Three years on, everything is very, very different.

So …

It was against that backdrop that I listened to Ross Gardler’s talk, the last one on the final day of the conference: an introduction to discussion on the subject “Steps to building capacity in the library community”. Ross is Service Manager for JISC’s OSS Watch, Vice-President of Community Development at the Apache Software Foundation, and the organiser of the TransferSummit conference – all in all, someone with fingers on a lot of pulses, and with as much idea as anyone of which way the wind is blowing. Every two years, OSS Watch takes a survey of attitudes to open-source software in UK Higher Education and Further Education. Ross presented some findings from the most recent, as yet unpublished, 2010 survey, and compared them with those in the 2008 survey.

(In the UK, Higher Education, or HE, means universities; and Further Education, or FE, means more vocational training past the usual school-leaving age.)

The most striking part of the survey for me was the bit about policies towards software procurement: the breakdown by percentage of how many HE and FE institutions have the following policies with respect to open-source software (OSS):

• OSS is not mentioned
• Policy of not using OSS
• Explicitly considers OSS
• OSS is the preferred option

There were two “Wait, what?” moments here for me. The first one was that there even is a “Policy of not using OSS”. But it’s there, and its percentages, though low, are non-zero. In 2006, 4% of HE and 2% of FE institutions surveyed said that they had a formal policy in place not to use open-source software.

Just think about that for a moment. Someone, somewhere, came up with the idea that if your vendor gives you free access to the source code and does not charge you a licence fee, then that makes it a worse deal. In fact, not just a worse deal, but so bad a deal that it won’t even be considered. You can just imagine open-source vendors/integrators talking to the administrators:

Vendor: … And we don’t charge a licence fee, and we’ll give you the source code in case you want to make local customisations.

Administrator: Certainly not! We reject your terrible offer.

Vendor: All right, then. We’ll charge you £50,000 per year, and we won’t give you the source code.

Administrator: You’ve got yourself a deal!

It hurts, doesn’t it?

So, anyway. That was the first of my two “Wait, what?” moments. The second was a comment that Ross himself made as he was discussing these figures. The flip side of the stats that I found so incredible is that in 2006, 7% of FE institutions said that “OSS is the preferred option”, and Ross’s comment was something along the lines that he thought that was just as wrong as the converse.

Well, usually when I hear something like this, I speak up immediately, especially in a session like this Open Edge one where we’d been explicitly invited to chip in. But instead I sat there, sort of paralysed, gazing into the middle distance. I struggled to get my brain back onto the tracks, having had it knocked sideways by such a tremendous whack of (let me be frank) irrationality.

The reality is, OSS is simply not all that controversial anymore. The Open Source Initiative has been up and running for 13 years (and the Free Software Foundation for more than a quarter of a century). OSS technologies are everywhere around us. The pros and cons of the different models are well understood. For our part, we have found that open source allows us to communicate freely with a large community of developers while also working with an incredible range of commercial and public organizations. It would be nice to see people move past both the hype and the irrational fear, and recognize that this is an approach to collaboration that is here to stay, even if it isn’t right for everyone.

As my friend Matt Wedel likes to say, we’re all living in the future now. Let’s not pretend we’re still shackled by past practices. Onward!

Clustering Snippets With Carrot2

We’ve been investigating ways we might add result clustering to our metasearch tools. Here’s a short introduction to the topic and to an open source platform for experimenting in this area.

Clustering

Using a search interface that just takes some keywords often leads to miscommunication. The computer has no sense of context and users may not realise their query is ambiguous. When I wanted to represent a matrix in LaTeX typesetting markup, I searched for the keywords ‘latex’ and ‘matrix’. At that moment I wasn’t thinking along the lines of “latex cat-suits, like in the Matrix movies”. But, since one of those movies was just recently out, the search engine included some results of that sort. The information it might use to determine relevance (word frequency, number and recency of links, what people ultimately click on when performing similar searches, etc.) would not be enough to choose which interpretation of my query terms I was interested in.

To aid the user in narrowing results to just those applicable to the context they’re thinking about, a good deal of work has been done in the area of “clustering” searches. The idea is to take the usual flat, one dimensional result list and instead split it into several groups of results corresponding to different categories. Mining data to find patterns has been studied for some time and current work on automatically grouping documents has roots in this research.

One common way to represent a document, both for searching and data mining, is the vector space model. Introduced by Gerard Salton, an underappreciated pioneer of text search, it distils documents down to statistics about the words they contain. In particular, it treats the terms as dimensions so that we can apply some handy matrix math to them. A document then becomes a vector in term-space, with a magnitude along each dimension corresponding to how relevant the document is considered to be to that term. In other words, a sushi menu would be located further along in the ‘fish’ and ‘rice’ directions than in the ‘dinosaur’ direction.
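As a toy sketch of the model in Ruby (documents, terms, and the scoring are all invented for illustration):

```ruby
# Toy sketch of the vector space model: documents become vectors of term
# counts, and similarity is the cosine of the angle between two vectors.
# All documents and terms here are invented for illustration.

DOCS = {
  "sushi menu"   => "fish rice fish nori rice",
  "paleontology" => "dinosaur bones dinosaur fossil",
}

# Build a term-count vector (a Hash from term to frequency) for a text.
def vectorize(text)
  text.split.each_with_object(Hash.new(0)) { |t, v| v[t] += 1 }
end

# Cosine similarity: dot product over the product of the magnitudes.
def cosine(a, b)
  dot   = a.keys.sum { |t| a[t] * b.fetch(t, 0) }
  mag   = ->(v) { Math.sqrt(v.values.sum { |x| x * x }) }
  denom = mag.(a) * mag.(b)
  denom.zero? ? 0.0 : dot / denom
end

query = vectorize("fish rice")
DOCS.each do |name, text|
  printf("%-14s %.3f\n", name, cosine(query, vectorize(text)))
end
```

The sushi menu scores far higher against the query than the paleontology document, which never mentions fish or rice at all.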

This kind of bag-of-words model is very useful for separating documents into groups. Problem is, it’s not very clear what to call those groups. A cluster that was formed of documents with prominent features of ‘voltage’, ‘circuit’, ‘capacitor’ and ‘resistor’ is certainly forming a group around a specific topic, but none of the terms used to determine that make quite so good a label as the more general word ‘electronics’. There has consequently been a shift in document clustering research away from “data-centric” models that aim to make the clearest groupings to “description-centric” models that aim to create the best groupings that can be clearly labelled.

Another differentiator among clustering algorithms is when the clustering happens: before or after the search. When processing search results, the focus is on a post-retrieval approach, which lets the clustering algorithm work with the search algorithm rather than against it. Search algorithms determine which documents are relevant to the query. Since the goal of document clustering in this context is to help select among only relevant documents, it is best to find clusters within that set. If we instead clustered all documents, it might be that most of the relevant documents fall in the same cluster, and many clusters would not be relevant at all.

Similarly, we can leverage another part of the search system: snippet generation. Rather than just a list of documents, many search tools will include contextual excerpts containing the query terms. Here too we have a system that is winnowing the data set. Much as the retrieval algorithm limits clustering to only relevant documents, these provide candidates for the most relevant few phrases or sentences.

Carrot2

A post-retrieval clustering system designed to work with short excerpts is precisely the sort that could be put to use in federated search, at least for those targets that return snippets. It just so happens that there is an open source one to play with! It’s called Carrot2: a Java framework for document clustering, devised by Dawid Weiss and Stanislaw Osinski as an experimental platform for their clustering research and for promoting their proprietary algorithm. They provide implementations of two algorithms, which I’ll attempt to summarize below; they are best described in their paper A Survey of Web Clustering Engines, and in more detail in some of their other research.

STC

Suffix Tree Clustering (STC) is one of the first feasible snippet-based document clustering algorithms, proposed in 1998 by Zamir and Etzioni. It uses an interesting data structure, the generalised suffix tree, to efficiently build a list of the most frequently used phrases in the snippets from the search results. This tree is constructed from the snippets and their suffixes: that is, all the snippets, all the snippets without their first words, and so on down to every sequence that ends at the end of a snippet, including the last words by themselves. From this list the documents are divided into groups based on which words they begin with, with no two groups starting with the same word and no groups with only one document. In turn, each of those groups is divided by the next words in sequence.
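The suffix-and-prefix grouping can be sketched in a few lines of Ruby (snippets invented; a real implementation builds a generalised suffix tree rather than brute-forcing every phrase as this does):

```ruby
# Rough sketch of the phrase grouping behind STC: enumerate every suffix of
# every snippet, and for each phrase (a prefix of a suffix) record which
# distinct snippets contain it. A real implementation builds a generalised
# suffix tree instead of this brute-force enumeration; snippets invented.

SNIPPETS = [
  %w[open source software in libraries],
  %w[open source search software],
  %w[libraries adopt open source software],
]

phrase_docs = Hash.new { |h, k| h[k] = [] }

SNIPPETS.each_with_index do |words, doc|
  (0...words.size).each do |start|        # every suffix...
    (start...words.size).each do |stop|   # ...and every prefix of it
      phrase = words[start..stop].join(" ")
      phrase_docs[phrase] << doc unless phrase_docs[phrase].include?(doc)
    end
  end
end

# Candidate base clusters: multi-word phrases shared by 2+ snippets.
base = phrase_docs.select { |p, d| p.include?(" ") && d.size > 1 }
base.sort_by { |_, d| -d.size }.each { |p, d| puts "#{p} (#{d.size} snippets)" }
```

Here “open source” is found in all three snippets, and “open source software” in two, so both survive as candidate cluster labels while one-off phrases are discarded.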

The groups generated this way form a list of phrases, each with an associated number of snippets. Some of these phrases are selected to become the base clusters, ranked according to a set of heuristic rules. These rules take into account whether the phrases are present in too many or too few documents, and how long the phrases are after discarding uninteresting “stop” words. Heuristics are also employed to merge clusters deemed too similar. The top few clusters are then presented to the users.

This approach is very efficient and the use of frequent phrases provides better quality labels than a purely term based technique.

Lingo

The second algorithm, Lingo, was developed as the PhD thesis of Dawid Weiss, one of the partners in the Carrot2 project. Lingo constructs a “term-document matrix” where each snippet gets a column, each word a row, and the values are the frequency of that word in that snippet. It then applies a matrix factorization called singular value decomposition, or SVD. Part of the result of this is used to form collections of related terms; specifically, terms that share a latent semantic relationship. If you’re interested in how this works, the most approachable tutorial I’ve found on the topic presents it using golf scores. A more thorough treatment of how SVD is applied to text can be found here, along with some discussion of popular myths and misunderstandings of the approach.
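The starting point, the term-document matrix itself, is easy to sketch (snippets invented; a real implementation would also stem terms, drop stop words, and weight the frequencies before running the SVD):

```ruby
# Sketch of Lingo's starting point: a term-document matrix in which each
# column is a snippet, each row a word, and each cell the frequency of that
# word in that snippet. Snippets are invented; a real implementation would
# stem, remove stop words, and weight frequencies before applying SVD.

snippets = [
  "singular value decomposition of a matrix",
  "matrix factorization by singular value decomposition",
  "typesetting a matrix in latex",
]

terms = snippets.flat_map(&:split).uniq.sort

# Rows are terms, columns are snippets.
matrix = terms.map do |term|
  snippets.map { |s| s.split.count(term) }
end

terms.each_with_index do |term, i|
  puts format("%-14s %s", term, matrix[i].inspect)
end
```

Even in this tiny example you can see the structure the SVD will exploit: the rows for ‘singular’, ‘value’ and ‘decomposition’ rise and fall together, distinct from the ‘latex’/‘typesetting’ rows.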

Grouping the words in the snippets by topic helps form clusters but does not suggest a label. To this end, Lingo uses a suffix-based approach like STC to find a list of frequently occurring phrases for use as candidate labels. Since both the collections of words and the labels are in the same set of snippets, they exist in the same term-document space. As in our earlier example of ‘fish’ being closer to ‘sushi’ than to ‘dinosaur’, there is a concept of distance here. The label closest to a collection is used. Collections without a label close enough and labels without collections are discarded. The documents with snippets matching these remaining labels then form the clusters.

Lingo improves over STC by selecting not only frequent phrases, but ones that apply to groupings of documents known to be related in a way that takes into account more content than just that one phrase. STC avoids choosing phrases that are too general by discarding those that occur in too many documents, but it’s still quite possible to get very non-descriptive labels. The added precision Lingo brings with the SVD comes at a steep computational cost, as matrix operations are much slower. The authors sell a proprietary algorithm, Lingo3G, that overcomes the performance limitations.

Other approaches

Cluster analysis can be applied to a wide range of fields and all sorts of non-textual data: Pixels, genes, chemicals, criminals, anything with some qualities to group by. General techniques can usually only provide a data-driven approach to clustering text and, as discussed above, labeling is very important.

Often the best label for a cluster isn’t even in the text being clustered. Some research investigates pulling in other data to very good effect. One of the best examples of this is the idea of using Wikipedia titles. This works well because the titles of Wikipedia pages are not only relevant to the content contained therein but are in fact explicitly chosen as a label to represent that content. Search query logs are another potential source of cluster labels.

Clustering is still very much an open problem and an active research area. Even the best description-centric approaches are still no match for manually assigned subject categories.

SOLR support in ZOOM, Pazpar2, and MasterKey

We have always held that the schism between broadcast metasearching and local indexing is rather goofy – that in practice, you do whatever it takes to get the results in front of your user when and where he needs it, and the best solutions will allow for whatever approach is needed in the moment. Inspired by the increasing popularity of the SOLR/Lucene indexing server in the library world, we have just completed a project to add support for SOLR targets in the ZOOM API implementation in the YAZ library. So YAZ now supports Z39.50, SRU/SRW 1.x and the SOLR API.

Since SOLR uses HTTP as its transport protocol, triggering a SOLR search is done by setting the “sru” option to the value “solr”.

Using zoomsh, a SOLR target can be searched as follows:

ZOOM> set sru solr
ZOOM> connect http://localhost:8984/solr/select
ZOOM> find title:water
http://localhost:8984/solr/select: 498 hits
ZOOM> show 1 1
1 database=unknown syntax=XML schema=unknown
<doc>
<str name="author">Bamforth, Charles W.,</str>
<str name="author-date">1952-</str>
....
</doc>
ZOOM>

This functionality has also been integrated into our open-source metasearch engine Pazpar2, and by extension into our commercial MasterKey suite. We have experimented with using a database schema in SOLR that directly matches the Pazpar2 internal format, so that practically no conversion is required on searching/retrieval. Performance is blazing.

Even without doing any tuning, SOLR is very fast as long as it is given enough memory. The 7 million records of the Library of Congress catalog run in a memory footprint of 2GB. It does seem to need some warm-up, though; otherwise the first search is very slow.
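For the warm-up, Solr itself can fire warming queries whenever a searcher starts, via a listener in solrconfig.xml. A sketch, with the query chosen arbitrarily:

```xml
<!-- In solrconfig.xml: run a throw-away query when the first searcher
     starts, so real users never hit a completely cold cache. -->
<listener event="firstSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst>
      <str name="q">title:water</str>
      <str name="rows">10</str>
    </lst>
  </arr>
</listener>
```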

Facets

The ZOOM API (and implementation) has been extended to handle facet results generated by SOLR targets and by Z39.50 targets (via an extension of the Z39.50 protocol). A facet definition is a comma-separated attribute list in PQF syntax:

ZOOM> set facets "@attr 1=title @attr 3=5, @attr 1=subject @attr 3=2"
ZOOM> connect http://localhost:8987/solr/select
ZOOM> find title:beer
http://localhost:8987/solr/select: 586 hits
ZOOM> facets
Facets:
title:
De Beers(3)
One Fierce Beer Coaster(3)
Wilhelm Beer(3)
August Beer(2)
Beer pong(2)
subject:
Beer – United States (1)
Food adulteration and inspection(1)
ZOOM>

which specifies that we want to get up to 5 title facets and 2 subject facets on a SOLR index of wikipedia-abstracts. Pazpar2 still does its own facet count for targets that don’t return facets, but having the target return facets when possible makes them available immediately.

This means that SOLR-based databases, when accessed from Pazpar2 or a MasterKey application, feel just as fast as they do in applications based directly on SOLR. In other words, we can freely mix and match locally indexed resources and remote databases, without compromising on either type – a huge leap forward.

Of course, the extension to our ZOOM API means that anyone who uses our YAZ-based ZOOM implementation can cross-search SOLR, Z39.50, and SRU targets. Magically, this extends to ZOOM implementations in other languages that base themselves on our code, including APIs in Perl, PHP, C++, Ruby, Java, Visual Basic, Tcl, and, astonishingly, Squeal. Depending on how those APIs represent option values, changes may or may not be necessary to their implementation, but a small extension is necessary to express facet values. Time will tell whether the ZOOM maintainers decide to canonize our particular approach to handling facets.

A release of YAZ and Pazpar2 with these features is imminent.

Turbomarc, faster XML for MARC records

Our metasearch middleware, Pazpar2, spends a lot of time doing XML transformations. When we use Pazpar2 with traditional library data sources that return MARC21, we internally convert the received records into MARCXML (if they’re not already represented as such) and then transform into the internal pazpar2 XML format using XSLT (more on this process here).

MARCXML is nice to look at, but it’s not an optimal format on which to perform XSL transformations.

So we did performance testing, and we found that much of the CPU usage was around the transformation from MARCXML to our internal, normalized data format using XSL transformations.

We decided to try out a format (we call it Turbomarc) where we represent the names of MARC fields and subfields as XML element names rather than attribute values, leaving an option open for special cases. This MARCXML:

<collection xmlns="http://www.loc.gov/MARC21/slim">
  <record>
    <controlfield tag="001">000277485</controlfield>
    <datafield tag="100" ind1="1" ind2=" ">
      <subfield code="a">Μαρούδης, Κωνσταντίνος Ιω</subfield>
    </datafield>
    <datafield tag="250" ind1=" " ind2=" ">
      <subfield code="η"> εκδ.</subfield>
    </datafield>
  </record>
</collection>

will in Turbomarc be:

<c xmlns="http://www.indexdata.com/turbomarc">
  <r>
    <l>00492nam a22001455a 4500</l>
    <c001>000277485</c001>
    <d100 i1="1" i2=" ">
      <sa>Μαρούδης, Κωνσταντίνος Ιω</sa>
    </d100>
    <d250 i1=" " i2=" ">
      <s code="η"> εκδ.</s>
    </d250>
  </r>
</c>

This shows the special case where a non-alphanumeric subfield code is not combined into the element name, but is left as an attribute (this happens very rarely in real use). Using xsltproc --timing showed that our transformations were faster by a factor of 4-5. Shortening the element names only improved performance fractionally, but since everything counts, we decided to do this as well.
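The renaming rule itself is mechanical. Here is a rough Ruby sketch of it using REXML; the real conversion is implemented inside YAZ, so this is purely illustrative:

```ruby
require "rexml/document"

TURBOMARC_NS = "http://www.indexdata.com/turbomarc"

# Map a MARCXML document to Turbomarc: field tags and subfield codes move
# from attributes into element names, except that a non-alphanumeric
# subfield code stays behind as a "code" attribute (the rare special case).
# The real conversion is done inside YAZ; this sketch just shows the rule.
def turbomarc(marcxml)
  src = REXML::Document.new(marcxml)
  out = REXML::Document.new
  c = out.add_element("c", "xmlns" => TURBOMARC_NS)
  src.root.each_element do |rec|
    next unless rec.name == "record"
    r = c.add_element("r")
    rec.each_element do |f|
      case f.name
      when "leader"
        r.add_element("l").text = f.text
      when "controlfield"
        r.add_element("c#{f.attributes['tag']}").text = f.text
      when "datafield"
        d = r.add_element("d#{f.attributes['tag']}",
                          "i1" => f.attributes["ind1"],
                          "i2" => f.attributes["ind2"])
        f.each_element do |sf|
          next unless sf.name == "subfield"
          code = sf.attributes["code"]
          el = if code =~ /\A[a-zA-Z0-9]\z/
                 d.add_element("s#{code}")          # common case: s + code
               else
                 d.add_element("s", "code" => code) # rare: keep attribute
               end
          el.text = sf.text
        end
      end
    end
  end
  out.to_s
end

example = <<~XML
  <collection xmlns="http://www.loc.gov/MARC21/slim">
    <record>
      <controlfield tag="001">000277485</controlfield>
      <datafield tag="100" ind1="1" ind2=" ">
        <subfield code="a">Bamforth, Charles W.,</subfield>
      </datafield>
    </record>
  </collection>
XML

puts turbomarc(example)
```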

A single user probably won’t notice the difference, but it certainly enables more throughput. Measuring average response time for different numbers of concurrent users, we found that we can roughly double the number of users in a high-throughput stress test for a fixed average response time on a typical Pazpar2 web-service command. The corresponding number of ‘real’ users would be much higher.

Support for the Turbomarc format was released with version 4.0.1 of YAZ and is supported by the ZOOM layer, and thus by zoomsh, by using txml instead of xml in the show command:

ZOOM> open a-z-target
ZOOM> search water
ZOOM> show 0 1 txml

Pazpar2 supports Turbomarc from version 1.4.0 and Turbomarc will eventually become the default format; however, at present, MARCXML remains the default to avoid breaking existing applications of Pazpar2.

Another new change in pazpar2 is support for threading, to make better use of multi-core systems, but this is still at beta level and could be a topic for a future blog entry.

Building a simpler HTTP-to-Z39.50 gateway using Ruby-ZOOM and Thin

Inspired by Jakub’s posting yesterday, I wondered how easy it would be to build an HTTP-to-Z39.50 gateway similar to his in Ruby, my language of the moment. Different languages offer different tools and different ways of doing things, and it’s always instructive to compare.

Ruby libraries are generally distributed in the form of “gems”: packages analogous to the .deb files used by the Debian and Ubuntu GNU/Linux operating systems, so before we start installing the Ruby libraries that we’re going to need, we have to install the rubygems package that does that installation. The gems system is like Debian’s dpkg and apt-get all rolled into one, as it knows how to fetch packages as well as how to install them; and it has something of the BSD “ports” mechanism about it, too, as gem installation can include building, most usually when the gem is not pure Ruby but includes glue-code to an underlying C library – as in the case of Ruby-ZOOM.

While we’re installing the gems system, we may as well also ensure that we have Ruby itself installed, plus the header-files needed for building gems that include a C component, and Ruby’s OpenSSL and FastCGI support. On my Ubuntu 9.04 system, I used:

$ sudo apt-get install ruby1.8 ruby1.8-dev rubygems1.8 libopenssl-ruby1.8 libfcgi-dev

Once that’s been done, we can install the Ruby gems that we need for the Web-to-Z39.50 gateway. These are:

• rack is the web-server API, which works with various different specific web servers, invoking application-specific code to handle requests.
• thin is one of several specific Ruby-based web servers that we could use – it’s roughly analogous to something like Tomcat.
• zoom is the Ruby implementation of the ZOOM abstract API.

Depending on how your Ruby installation is set up, you may or may not already have the various prerequisite gems that thin needs; it’s harmless to ask for an already-installed gem to be installed, so you may as well just ask. So to install the necessary gems, then, I did:

$ sudo gem install rack thin zoom test-spec camping memcache-client mongrel

We’re all done with installing – now it’s time to write the web server. And here it is!

require 'rubygems'
require 'thin'
require 'zoom'

class ZOOMClient
  def call(env)
    req = Rack::Request.new(env)
    headers = { 'Content-Type' => 'text/plain' }
    zurl = req['zurl'] or return [ 400, headers, "no zurl specified" ]
    query = req['query'] or return [ 400, headers, "no query specified" ]
    syntax = req['syntax'] or return [ 400, headers, "no syntax specified" ]
    maxrecs = req['maxrecs'] ? Integer(req['maxrecs']) : 10

    res = []
    res << "SEARCH PARAMETERS"
    res << "zurl: #{zurl}"
    res << "query: #{query}"
    res << "syntax: #{syntax}"
    res << "maxrecs: #{maxrecs}"
    res << ""

    ZOOM::Connection.open(zurl) do |conn|
      conn.preferred_record_syntax = syntax
      rset = conn.search(query)
      res << "Showing #{maxrecs} of #{rset.size}" << ""
      [ maxrecs, rset.size ].min.times { |i| res << String(rset[i]) << "" }
    end

    [ 200, headers, res.map { |line| line + "\n" } ]
  end
end

app = Rack::URLMap.new('/zgate' => ZOOMClient.new)
Thin::Server.new(nil, 12368, app).start!

This doesn’t need wiring into an existing web-server installation: it is a web server, listening on port 12368 and ready to accept requests in the /zgate area. Despite its name, thin is not a trivial piece of software: it’s highly secure, stable, fast, and extensible. It’s also a lot easier to wire into than some other servers (cough Tomcat cough).

The sweet thing about the Rack API is just how simple it is. You provide a class for each web application, which needs only one method, call: that method accepts a request environment and returns a triple of HTTP status code, hash of headers, and content. Of course, much, much more sophisticated behaviour is possible, but it’s nice that doing the simple thing is, well, simple.
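To make that contract concrete, here is a minimal, self-contained sketch. The class name and greeting logic are my own invention, not part of the gateway above, and it doesn’t even need the rack gem installed: we build the env hash by hand and invoke call directly, exactly as a Rack-compliant server would on each request.

```ruby
# Minimal sketch of the Rack calling convention.
class HelloApp
  # Rack's contract: `call` takes the request environment hash and
  # returns [status, headers, body], where body responds to `each`.
  def call(env)
    name = env["QUERY_STRING"].to_s.empty? ? "world" : env["QUERY_STRING"]
    [200, { "Content-Type" => "text/plain" }, ["hello, #{name}\n"]]
  end
end

# Simulate a request the way a Rack server would dispatch it:
status, headers, body = HelloApp.new.call("QUERY_STRING" => "rack")
puts status                        # the HTTP status code
body.each { |chunk| print chunk }  # the response content
```

Everything else Rack offers – Rack::Request, middleware, URL mapping – is convenience layered on top of this one-method protocol.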

So, anyway – to run our server just use:

ruby web-to-z.rb

And now we can fetch URLs like http://localhost:12368/zgate/?zurl=z3950.loc.gov:7090/voyager&query=@att…

And we’re done!

Building a simple HTTP-to-Z39.50 gateway using Yaz4j and Tomcat

Yaz4j is a wrapper library over the client-specific parts of YAZ, a C-based Z39.50 toolkit, and allows you to use the ZOOM API directly from Java. The initial version of Yaz4j was written by Rob Styles from Talis, and the project is now developed and maintained at Index Data. ZOOM is a relatively straightforward API, and with a few lines of code you can write a basic application that can establish a connection to a Z39.50 server. Here we will build a very simple HTTP-to-Z39.50 gateway using yaz4j and Java Servlet technology.

COMPILING AND INSTALLING YAZ4J

Yaz4j is still an experimental piece of software and as such is not distributed via Index Data’s public Debian apt repository; there is no Windows build (yet) either. While it is possible to use the pre-built Linux binaries, users of other OSes will have to compile yaz4j from source. No need to worry (yet) – the process of compiling yaz4j is quite simple and we will be up and running in no time :).

As a prerequisite, to complete the build process you will need the JDK, Maven, SWIG, and YAZ (development package) installed on your machine. On Debian/Ubuntu you can get those easily via apt:

apt-get install sun-java6-jdk maven2 libyaz4-dev swig


Yaz4j’s source code can be checked out from our Git repository; assuming you have Git installed on your machine, you can do that with:

git clone git://git.indexdata.com/yaz4j


The compilation of both the native and the Java source code is controlled by Maven2. To build the library, invoke the following commands:

cd yaz4j
mvn install


That’s it. If the build has completed successfully, you end up with two files: an OS-independent jar archive with the Java ZOOM API classes (yaz4j/any/target/yaz4j-any-VERSION.jar) and an OS-dependent shared library (yaz4j/linux/target/libyaz4j.so or yaz4j/win32/target/yaz4j.dll) that contains all the JNI “glue” necessary to make the native calls possible from Java. If we were writing a command-line Java application, yaz4j-any-VERSION.jar would, like any other external Java library, have to be placed on your application classpath, and the native shared library would have to be added to your system shared-library path (LD_LIBRARY_PATH on Linux, PATH on Windows) or specified as a Java system property (namely java.library.path) just before your application is executed:

java -cp /path/to/yaz4j-*.jar -Djava.library.path=/path/to/libyaz4j.so MyApp


SETTING UP THE DEVELOPMENT ENVIRONMENT

Setting up a development/runtime environment for a web (servlet) application is a bit more complicated. First, you are not invoking the JVM directly; the servlet container’s (e.g. Tomcat’s) run-script is doing that for you. At this point the directory containing the shared library (.so or .dll) has to be placed on the servlet container’s shared-library load path. Unless your library is deployed to the standard system location for shared libraries (/usr/lib on Linux) or its location is already added to the path, the easiest way to do this in Tomcat is by editing the CATALINA_HOME/bin/setenv.sh script (setenv.bat on Windows; create it if it does not exist) and putting the following lines in there:

LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/path/to/libyaz4j-dir
export LD_LIBRARY_PATH


on Windows (though no Windows build is yet provided)

set PATH=%PATH%;X:\path\to\yaz4j-dll-dir


That’s one way of doing it; another would be to alter the standard set of arguments passed to the JVM before Tomcat starts and add -Djava.library.path=/path/to/lib there. Depending on the situation this might be preferable/easier (on Debian/Ubuntu you can specify JVM arguments in the /etc/default/tomcat6 file).

With the shared library installed, we need to install the pure-Java yaz4j-any*.jar with the ZOOM API classes by placing it in Tomcat’s lib directory (CATALINA_HOME/lib). As this library makes a Java system call to load the native library into the JVM, you cannot simply package it along with your web application (inside the .war file) – it would try to load the library each time you deploy the webapp, and all subsequent deployments would fail.

WRITING A SERVLET-BASED GATEWAY

With your servlet environment set up all that is left is to write the actual application (peanuts :)). At Index Data we use Maven for managing builds of our Java software components but Maven is also a great tool for quickly starting up a project. To generate a skeleton for our webapp use the Maven archetype plugin:

mvn -DarchetypeVersion=1.0.1 -Darchetype.interactive=false \
-DarchetypeArtifactId=webapp-jee5 -DarchetypeGroupId=org.codehaus.mojo.archetypes \
-Dpackage=com.indexdata.zgate -DgroupId=com.indexdata -DartifactId=zgate \
archetype:generate --batch-mode


This will generate a basic webapp project structure:

|-- pom.xml
`-- src
    |-- main
    |   |-- java
    |   |   `-- com
    |   |       `-- indexdata
    |   |           `-- zgate
    |   `-- webapp
    |       |-- WEB-INF
    |       |   `-- web.xml
    |       `-- index.jsp
    `-- test
        `-- java
            `-- com
                `-- indexdata
                    `-- zgate

Maven has already added the basic JEE APIs for web development as project dependencies; we need to do the same for yaz4j, so edit pom.xml and add the following lines in the dependencies section:

<dependency>
<groupId>org.yaz4j</groupId>
<artifactId>yaz4j-any</artifactId>
<version>VERSION</version>
<scope>provided</scope>
</dependency>

It’s crucial that the scope of this dependency is set to provided; otherwise the library would end up packaged in the .war archive, and we don’t want that.

The implementation of our simple gateway will be contained in a single servlet – ZGateServlet – which we need to place under src/main/java/com/indexdata/zgate. The gateway will work by answering HTTP GET requests and will be controlled solely by HTTP parameters; the servlet’s doGet method is shown below:

protected void doGet(HttpServletRequest request, HttpServletResponse response)
    throws ServletException, IOException {
  String zurl = request.getParameter("zurl");
  if (zurl == null || zurl.isEmpty()) {
    response.sendError(400, "Missing parameter 'zurl'");
    return;
  }
  String query = request.getParameter("query");
  if (query == null || query.isEmpty()) {
    response.sendError(400, "Missing parameter 'query'");
    return;
  }
  String syntax = request.getParameter("syntax");
  if (syntax == null || syntax.isEmpty()) {
    response.sendError(400, "Missing parameter 'syntax'");
    return;
  }
  int maxrecs = 10;
  if (request.getParameter("maxrecs") != null
      && !request.getParameter("maxrecs").isEmpty()) {
    try {
      maxrecs = Integer.parseInt(request.getParameter("maxrecs"));
    } catch (NumberFormatException nfe) {
      response.sendError(400, "Malformed parameter 'maxrecs'");
      return;
    }
  }
  response.getWriter().println("SEARCH PARAMETERS");
  response.getWriter().println("zurl: " + zurl);
  response.getWriter().println("query: " + query);
  response.getWriter().println("syntax: " + syntax);
  response.getWriter().println("maxrecs: " + maxrecs);
  response.getWriter().println();
  Connection con = new Connection(zurl, 0);
  con.setSyntax(syntax);
  try {
    con.connect();
    ResultSet set = con.search(query, Connection.QueryType.PrefixQuery);
    response.getWriter().println("Showing " + maxrecs + " of " + set.getSize());
    response.getWriter().println();
    for (int i = 0; i < set.getSize() && i < maxrecs; i++) {
      Record rec = set.getRecord(i);
      response.getWriter().print(rec.render());
    }
  } catch (ZoomException ze) {
    throw new ServletException(ze);
  } finally {
    con.close();
  }
}

With the code in-place we can try to compile the project:

mvn compile


If all is OK, the next step is to register our servlet and map it to a URL in src/main/webapp/WEB-INF/web.xml:

<servlet>
<servlet-name>ZgateServlet</servlet-name>
<servlet-class>com.indexdata.zgate.ZgateServlet</servlet-class>
</servlet>
<servlet-mapping>
<servlet-name>ZgateServlet</servlet-name>
<url-pattern>/zgate</url-pattern>
</servlet-mapping>

On top of that, we will also make sure that our servlet is automatically triggered when accessing the root path of our application:

<welcome-file-list>
<welcome-file>zgate</welcome-file>
<welcome-file>index.jsp</welcome-file>
</welcome-file-list>

Now we are ready to build our webapp:

mvn package


The resulting .war archive is located under target/zgate.war. We can deploy it on Tomcat (e.g. by using the /admin Tomcat admin console) and test it by issuing the following request with your browser or curl (assuming Tomcat is running on localhost:8080):

http://localhost:8080/zgate/?zurl=z3950.loc.gov:7090/voyager…

That’s it! You just built yourself an HTTP-to-Z39.50 gateway! Just be careful about exposing it to the outside world – it’s not very secure and could easily be exploited. The source code and the gateway’s Maven project are available in Yaz4j’s Git repository under examples/zgate. In the meantime, Index Data is working on a Debian/Ubuntu package to greatly simplify the installation of Yaz4j and the Tomcat configuration – so stay tuned! If you are interested in Windows support – e.g. a Visual Studio-based build or an installer – please let us know.

Competition and the Marketplace of Ideas

Recently, my son asked me a series of questions about the cold war, and the political/military paradigm of mutually assured destruction (MAD for short). It’s always seemed like an odd premise to me, and somehow, discussing it with a 13-year-old doesn’t make it look any more sensible. However, we came to agree that landing on the moon was a pretty cool thing. Would the lunar landing have happened, realistically, without the cold war? There’s probably not a stronger force in life than competition – indeed, life as we know it would not be possible without it: a race for resources, for growth (or procreation), for power. Is there a direct line between two single-celled organisms clustering together for warmth, and a whole nation uniting behind the wildest, craziest engineering feat just to prove a point to another nation?

These past few years, Open Source Software has emerged as a highly visible concept in libraries. It means different things to different people. Personally, when I released my first piece of open source code nearly 20 years ago (it was an adventure game, not library software), it was a way to share my code and ideas with others without possessing the infrastructure of a software company. In retrospect, it was also a way to participate in what I imagined to be the academic model of openly sharing ideas and knowledge, without the supporting scaffolding of publishing companies and research grants of which I was blissfully unaware at the time.

Index Data began releasing code under Open Source licenses shortly after we founded the company. We didn’t give a whole lot of thought to it at first; it just seemed natural. There was no established model for running a business on this premise at the time, but we knew we had no skills in selling software, and we figured this way at least people might get to see the code, maybe run it, and give us feedback (sometimes, it seemed like we craved the praise more than money). That became the beginning of an amazing 15-year journey; of searching for a business model that works for a group of geeks who enjoy writing industrial-strength code but who have little or no skills in marketing. That story probably would make a fun post in its own right.

But, recently, Open Source has gone from being something very obscure in the broader library business (when we first exhibited at ALA, in 2004, people would come up to us and ask what kind of a company name was ‘Open Source’) to the hot new thing in town, and as such, it has come to mean many different things to different people.

For libraries, it’s sometimes seen as a way to save money, or a possibility of getting new features by tweaking code (or having other people do it for them), and maybe a kick in the behind for some of the more staid, established vendors. For academic and library geeks and coders, it’s a way to get a seat at the big table; to write exciting code and influence the direction of technology at the deepest level. For some new service/support companies, it’s an opportunity to enter into the library system market without the overheads and upfront investment of creating a whole software platform from scratch. For some existing vendors, it’s perhaps seen as new competition, or a dilution of a marketplace that was already crowded. Some see a paradigm shift, an unstoppable wave towards a new way of doing business; others see a distraction, a wasted effort.

I work every day with libraries, with librarians, with library geeks and coders, with academics, and with different kinds of library service and software providers, and I have come to form a different perspective.

To my mind, the central aspect of open source software is that it allows for a different, more direct dialog between software developers and users of that software. It can break down institutional boundaries. Sometimes things get messy. Sometimes a lot of effort is wasted by groups of libraries in well-meaning efforts to build a better mousetrap. But sometimes, new ideas can be brought from the whiteboard into production literally in days, as an inspiration for others to follow. As someone who enjoys designing software tools for others to use, I get to have relationships with individual coders and geeks, as well as the CIOs of large businesses or organizations (and, quite a few times, I have been able to watch the former evolve into the latter), and our code is stronger and better for the range of challenges it is faced with, as are we as programmers.

Above the daily effort of coders coding, geeks trying out new mashups, and companies competing is a larger dialog. A kind of marketplace of ideas. Much like a real marketplace, or any economy, it is less governed by rules and regulations than by our intrinsic human desire to work together, to share, to compete, and to win. It is a messy and chaotic process, like life itself, and a process that probably spends as much time moving sideways as it does forwards or upwards.

The players in this marketplace are different kinds of organizations, groups, and people: Established software vendors; library interest groups; national and regional standards bodies; library consortia; formally collaborating groups of libraries and informally collaborating individuals. All of them breaking their backs and minds to come up with the best answers to the hardest questions, each from their own perspective and with their own experiences to guide them.

I would like to stipulate that no single organization in this marketplace of ideas holds the one true answer: The key to the future; the secret to how libraries can behave and work to carve a place for themselves in the Internet age. But this is okay, because the marketplace itself, not the individual players, is our best tool for finding the answer. It is in the open competition of ideas, thoughts, experience, and passion that we move forward as a community, and it is the challenge posed by the marketplace that drives each of us to do our best, as individuals and organizations.

It is my hope that libraries and librarianship will continue to be able to support a rich flora of ideas and approaches, to attract people willing to pour their hearts into making things better, even if they don’t always know how.