SOLR support in ZOOM, Pazpar2, and MasterKey

We have always held that the schism between broadcast metasearching and local indexing is rather goofy: in practice, you do whatever it takes to get the results in front of your users when and where they need them, and the best solutions allow for whatever approach is needed in the moment. Inspired by the increasing popularity of the SOLR/Lucene indexing server in the library world, we have just completed a project to add support for SOLR targets in the ZOOM API implementation in the YAZ library. So YAZ now supports Z39.50, SRU/SRW 1.x, and the SOLR API.

Since SOLR uses HTTP as its transport protocol, just as SRU does, a SOLR search is triggered by setting the “sru” option to the value “solr”.

Using zoomsh, a SOLR target can be searched as follows:

ZOOM> set sru solr
ZOOM> connect http://localhost:8984/solr/select
ZOOM> find title:water
http://localhost:8984/solr/select: 498 hits
ZOOM> show 1 1
1 database=unknown syntax=XML schema=unknown
<doc>
  <str name="author">Bamforth, Charles W.,</str>
  <str name="author-date">1952-</str>
….
</doc>
ZOOM>
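
The record returned in the session above is a plain SOLR <doc> element. As a rough illustration of what a client sees on the wire, the following Python sketch builds a select URL with SOLR's standard q, start and rows parameters, then pulls the hit count and the record fields out of a SOLR XML response. The sample response is made up to mirror the session, the helper names are hypothetical, and the exact request YAZ sends is an assumption rather than something taken from its source.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET


def build_select_url(base_url, query, start=0, rows=1):
    """Build a SOLR select URL using the standard q/start/rows parameters."""
    return base_url + "?" + urlencode({"q": query, "start": start, "rows": rows})


def parse_response(xml_text):
    """Return (hit count, list of records) from a SOLR XML response."""
    root = ET.fromstring(xml_text)
    result = root.find("result")
    hits = int(result.get("numFound"))
    records = []
    for doc in result.findall("doc"):
        # Each single-valued field is a typed child element (<str>, <int>, ...)
        # carrying a name attribute. Multi-valued <arr> fields are not handled.
        records.append({field.get("name"): field.text for field in doc})
    return hits, records


# Illustrative response only, shaped like the record in the zoomsh session.
SAMPLE = """<response>
  <result name="response" numFound="498" start="0">
    <doc>
      <str name="author">Bamforth, Charles W.,</str>
      <str name="author-date">1952-</str>
    </doc>
  </result>
</response>"""

url = build_select_url("http://localhost:8984/solr/select", "title:water")
hits, records = parse_response(SAMPLE)
print(url)
print(hits, records[0]["author"])
```

With ZOOM in the picture none of this is code you have to write yourself: the same find and show commands work whether the target speaks Z39.50, SRU, or SOLR.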

This functionality has also been integrated into our open-source metasearch engine Pazpar2, and by extension into our commercial MasterKey suite. We have experimented with a database schema in SOLR that directly matches Pazpar2's internal format, so practically no conversion is required at search or retrieval time. Performance is blazing.

Even without any tuning, SOLR is very fast as long as it is given enough memory. The 7 million records in the Library of Congress catalog run on a memory footprint of 2GB. It does seem to need a warm-up setting, though; otherwise the first search is very slow.

Facets

The ZOOM API (and implementation) has been extended to handle facet results generated by SOLR targets and by Z39.50 targets (via an extension of the Z39.50 protocol). A facet definition is a comma-separated list of attribute specifications in PQF syntax:

ZOOM> set facets "@attr 1=title @attr 3=5, @attr 1=subject @attr 3=2"
ZOOM> connect http://localhost:8987/solr/select
ZOOM> find title:beer
http://localhost:8987/solr/select: 586 hits
ZOOM> facets
Facets:
  title:
    De Beers(3)
    One Fierce Beer Coaster(3)
    Wilhelm Beer(3)
    August Beer(2)
    Beer pong(2)
  subject:
    Beer – United States (1)
    Food adulteration and inspection(1)
ZOOM>

This specifies that we want up to 5 title facets and 2 subject facets from a SOLR index of wikipedia-abstracts. Pazpar2 still does its own facet count for targets that don’t return facets, but having the target return facets when possible makes them available immediately.
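
To make the facet definition more concrete, here is a minimal Python sketch of how such a list could be turned into SOLR's standard facet parameters (facet, facet.field and the per-field f.<field>.facet.limit). It assumes that attribute 1 names the field and attribute 3 gives the maximum number of terms, as in the example above. The function names are hypothetical, and the mapping is only a plausible one: it is not taken from the YAZ source.

```python
import re
from urllib.parse import urlencode


def parse_facet_spec(spec):
    """Split a comma-separated facet list into (field, limit) pairs.

    In each item, '@attr 1=' names the field and '@attr 3=' the maximum
    number of facet terms; limit is None when '@attr 3=' is absent.
    """
    facets = []
    for item in spec.split(","):
        attrs = dict(re.findall(r"@attr\s+(\d+)=(\S+)", item))
        if "1" not in attrs:
            continue  # skip items that do not name a field
        limit = int(attrs["3"]) if "3" in attrs else None
        facets.append((attrs["1"], limit))
    return facets


def solr_facet_params(facets):
    """Translate (field, limit) pairs into SOLR's standard facet parameters."""
    params = [("facet", "true")]
    for field, limit in facets:
        params.append(("facet.field", field))
        if limit is not None:
            params.append((f"f.{field}.facet.limit", str(limit)))
    return params


spec = "@attr 1=title @attr 3=5, @attr 1=subject @attr 3=2"
facets = parse_facet_spec(spec)
query_string = urlencode([("q", "title:beer")] + solr_facet_params(facets))
print(facets)
print(query_string)
```

A list of pairs is used rather than a dictionary because facet.field has to be repeated once per requested field.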

This means that SOLR-based databases, when accessed from Pazpar2 or a MasterKey application, feel just as fast as they do in applications based directly on SOLR. In other words, we can freely mix and match locally indexed resources and remote databases, without compromising on either type – a huge leap forward.

Of course, the extension to our ZOOM API means that anyone who uses our YAZ-based ZOOM implementation can cross-search SOLR, Z39.50, and SRU targets. Magically, this extends to ZOOM implementations in other languages that base themselves on our code, including APIs in Perl, PHP, C++, Ruby, Java, Visual Basic, Tcl, and, astonishingly, Squeal. Depending on how those APIs represent option values, changes may or may not be necessary to their implementation, but a small extension is necessary to express facet values. Time will tell whether the ZOOM maintainers decide to canonize our particular approach to handling facets.

A release of YAZ and Pazpar2 with these features is imminent.