Turbomarc, faster XML for MARC records

Our metasearch middleware, Pazpar2, spends a lot of time doing XML transformations. When we use Pazpar2 with traditional library data sources that return MARC21, we internally convert the received records into MARCXML (if they’re not already represented as such) and then transform them into Pazpar2’s internal XML format using XSLT (more on this process here).

MARCXML is nice to look at, but it’s not an optimal format on which to perform XSL transformations.

So we did performance testing, and we found that much of the CPU time was spent on the XSL transformation from MARCXML to our internal, normalized data format.

We decided to try out a format (we call it Turbomarc) where we represent the names of MARC fields and subfields as XML element names rather than attribute values, while leaving an option open for special cases. This MARCXML:

<collection xmlns="http://www.loc.gov/MARC21/slim">
   <leader>00492nam a22001455a 4500</leader>
   <controlfield tag="001">000277485</controlfield>
   <datafield tag="100" ind1="1" ind2=" ">
     <subfield code="a">Μαρούδης, Κωνσταντίνος Ιω</subfield>
   </datafield>
   <datafield tag="250" ind1=" " ind2=" ">
     <subfield code="η"> εκδ.</subfield>
   </datafield>
</collection>

will in Turbomarc be:

<c xmlns="http://www.indexdata.com/turbomarc">
  <l>00492nam a22001455a 4500</l>
  <c001>000277485</c001>
  <d100 i1="1" i2=" ">
    <sa>Μαρούδης, Κωνσταντίνος Ιω</sa>
  </d100>
  <d250 i1=" " i2=" ">
    <s code="η"> εκδ.</s>
  </d250>
</c>

This shows the special case where a non-alphanumeric attribute value is not combined into an element name, but is left as an attribute (this happens very rarely in real use).

Using xsltproc --timing showed that our transformations were faster by a factor of 4-5. Shortening the element names only improved performance fractionally, but since everything counts, we decided to do this as well.
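The renaming itself is mechanical, so it is easy to sketch. Below is a rough Python illustration of the rules shown above (the real conversion is done inside YAZ; the function name to_turbomarc is made up for this sketch): the leader becomes l, field tags are folded into element names, and a subfield code is folded into the element name only when it is plain alphanumeric ASCII.

```python
import re
import xml.etree.ElementTree as ET

TURBO_NS = "http://www.indexdata.com/turbomarc"

def to_turbomarc(marcxml: str) -> ET.Element:
    """Sketch: convert a MARCXML collection into Turbomarc elements."""
    src = ET.fromstring(marcxml)
    out = ET.Element("{%s}c" % TURBO_NS)
    for field in src:
        local = field.tag.split("}")[-1]  # strip the MARCXML namespace
        if local == "leader":
            ET.SubElement(out, "{%s}l" % TURBO_NS).text = field.text
        elif local == "controlfield":
            # controlfield tag folded into the element name: tag 001 -> <c001>
            ET.SubElement(out, "{%s}c%s" % (TURBO_NS, field.get("tag"))).text = field.text
        elif local == "datafield":
            # datafield tag folded into the element name: tag 100 -> <d100>
            d = ET.SubElement(out, "{%s}d%s" % (TURBO_NS, field.get("tag")),
                              {"i1": field.get("ind1", " "), "i2": field.get("ind2", " ")})
            for sub in field:
                code = sub.get("code", "")
                if re.fullmatch(r"[A-Za-z0-9]", code):
                    # common case: code folded into the element name, e.g. <sa>
                    s = ET.SubElement(d, "{%s}s%s" % (TURBO_NS, code))
                else:
                    # rare special case: non-alphanumeric code kept as an attribute
                    s = ET.SubElement(d, "{%s}s" % TURBO_NS, {"code": code})
                s.text = sub.text
    return out
```

Folding names into element tags is what lets the XSLT match on element names directly instead of filtering on attribute values, which is where the speedup comes from.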

The single user probably won’t notice the difference, but it will certainly enable more throughput. Measuring average response time for different numbers of concurrent users showed that we can double the number of users in a high-throughput stress test at a fixed average response time for a typical Pazpar2 webservice command. The corresponding number of ‘real’ users would be much higher.

Support for the Turbomarc format was released with version 4.0.1 of YAZ and is supported by the ZOOM layer, and thus by zoomsh, by using txml instead of xml in the show command:

ZOOM>open a-z-target
ZOOM>search water
ZOOM>show 0 1 txml

Pazpar2 supports Turbomarc from version 1.4.0 and Turbomarc will eventually become the default format; however, at present, MARCXML remains the default to avoid breaking existing applications of Pazpar2.

Another new change in Pazpar2 is support for threading, to make better use of multi-core systems, but this is still at beta level and could be a topic for a future blog entry.

Building a simpler HTTP-to-Z39.50 gateway using Ruby-ZOOM and Thin

Inspired by Jakub’s posting yesterday, I wondered how easy it would be to build an HTTP-to-Z39.50 gateway similar to his in Ruby, my language of the moment. Different languages offer different tools and different ways of doing things, and it’s always instructive to compare.

Ruby libraries are generally distributed in the form of “gems”: packages analogous to the .deb files used by the Debian and Ubuntu GNU/Linux operating systems, so before we start installing the Ruby libraries that we’re going to need, we have to install the rubygems package that does that installation. The gems system is like Debian’s dpkg and apt-get all rolled into one, as it knows how to fetch packages as well as how to install them; and it has something of the BSD “ports” mechanism about it, too, as gem installation can include building, most usually when the gem is not pure Ruby but includes glue-code to an underlying C library – as in the case of Ruby-ZOOM.

While we’re installing the gems system, we may as well also ensure that we have Ruby itself installed, plus the header-files needed for building gems that include a C component, and Ruby’s OpenSSL and FastCGI support. On my Ubuntu 9.04 system, I used:

$ sudo apt-get install ruby1.8 ruby1.8-dev rubygems1.8 libopenssl-ruby1.8 libfcgi-dev

Once that’s been done, we can install the Ruby gems that we need for the Web-to-Z39.50 gateway. These are:

  • rack is the web-server API, which works with various specific web servers, invoking application-specific code to handle requests.
  • thin is one of several Ruby-based web servers that we could use – it’s roughly analogous to something like Tomcat.
  • zoom is the Ruby implementation of the ZOOM abstract API.

Depending on how your Ruby installation is set up, you may or may not already have the various prerequisite gems that thin needs; it’s harmless to ask for an already-installed gem to be installed, so you may as well just ask.

So to install the necessary gems, then, I did:

$ sudo gem install rack thin zoom test-spec camping memcache-client mongrel

We’re all done with installing – now it’s time to write the web server. And here it is!

require 'rubygems'
require 'thin'
require 'zoom'

class ZOOMClient
  def call(env)
    req = Rack::Request.new(env)
    headers = { 'Content-Type' => 'text/plain' }
    zurl = req['zurl'] or return [ 400, headers, [ "no zurl specified" ] ]
    query = req['query'] or return [ 400, headers, [ "no query specified" ] ]
    syntax = req['syntax'] or return [ 400, headers, [ "no syntax specified" ] ]
    maxrecs = req['maxrecs'] ? Integer(req['maxrecs']) : 10
    res = []
    res << "SEARCH PARAMETERS"
    res << "zurl: #{zurl}"
    res << "query: #{query}"
    res << "syntax: #{syntax}"
    res << "maxrecs: #{maxrecs}"
    res << ""

    ZOOM::Connection.open(zurl) do |conn|
      conn.preferred_record_syntax = syntax
      rset = conn.search(query)
      res << "Showing #{maxrecs} of #{rset.size}" << ""
      [ maxrecs, rset.size ].min.times { |i| res << String(rset[i]) << "" }
    end

    [ 200, headers, res.map { |line| line + "\n" } ]
  end
end

app = Rack::URLMap.new('/zgate' => ZOOMClient.new)
Thin::Server.new(nil, 12368, app).start!

This doesn’t need wiring into an existing web-server installation: it is a web server, listening on port 12368 and ready to accept requests in the /zgate area. Despite its name, thin is not a trivial piece of software: it’s highly secure, stable, fast and extensible. It’s also a lot easier to wire into than some other servers (cough Tomcat cough).

The sweet thing about the Rack API is just how simple it is. You provide a class for each web-application, which need have only one method, call: and that method accepts a request object and returns a triple of HTTP status-code, hash of headers, and content. Of course, much, much more sophisticated behaviour is possible, but it’s nice that doing the simple thing is, well, simple.
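As a minimal, self-contained illustration of that convention (this EchoApp class is a made-up example, separate from the gateway above):

```ruby
# A Rack application is any object with a one-argument call method
# returning a [status, headers-hash, body] triple; the body must
# respond to each, so an array of strings is the simplest choice.
class EchoApp
  def call(env)
    [ 200,
      { 'Content-Type' => 'text/plain' },
      [ "You asked for #{env['PATH_INFO']}\n" ] ]
  end
end

status, headers, body = EchoApp.new.call('PATH_INFO' => '/zgate')
# status is 200; body.first is "You asked for /zgate\n"
```

Nothing here requires the rack gem at all – which is exactly the point.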

So, anyway – to run our server just use:

ruby web-to-z.rb

And now we can fetch URLs like http://localhost:12368/zgate/?zurl=z3950.loc.gov:7090/voyager&query=@att…

And we’re done!

Building a simple HTTP-to-Z39.50 gateway using Yaz4j and Tomcat

Yaz4j is a wrapper library over the client-specific parts of YAZ, a C-based Z39.50 toolkit, and allows you to use the ZOOM API directly from Java. The initial version of yaz4j was written by Rob Styles from Talis, and the project is now developed and maintained at Index Data. ZOOM is a relatively straightforward API, and with a few lines of code you can write a basic application that establishes a connection to a Z39.50 server. Here we will try to build a very simple HTTP-to-Z39.50 gateway using yaz4j and the Java Servlet technology.


Yaz4j is still an experimental piece of software and as such is not distributed via Index Data’s public Debian apt repository; there is no Windows build (yet) either. While it is possible to use the pre-built Linux binaries, users of other OSes will have to compile yaz4j from source. Not to worry – the process of compiling yaz4j is quite simple and we will be up and running in no time :).

As a prerequisite, to complete the build process you will need the JDK, Maven, Swig and YAZ (development package) installed on your machine. On Debian/Ubuntu you can get those easily via apt:

apt-get install sun-java6-jdk maven2 libyaz4-dev swig

Yaz4j’s source code can be checked out from our Git repository; assuming you have Git installed on your machine, you can do that with:

git clone git://git.indexdata.com/yaz4j

The compilation of both the native and the Java source code is controlled by Maven2. To build the library, invoke the following commands:

cd yaz4j
mvn install

That’s it. If the build has completed successfully, you end up with two files: an OS-independent jar archive with the Java ZOOM API classes (yaz4j/any/target/yaz4j-any-VERSION.jar) and an OS-dependent shared library (yaz4j/linux/target/libyaz4j.so or yaz4j/win32/target/yaz4j.dll) that contains all the JNI “glue” necessary to make the native calls possible from Java. If we were writing a command-line Java application, yaz4j-any-VERSION.jar would, like any other external Java library, have to be placed on the application classpath, and the directory holding the native shared library would have to be added to the system shared-library path (LD_LIBRARY_PATH on Linux, PATH on Windows) or specified as a Java system property (namely java.library.path) just before the application is executed:

java -cp /path/to/yaz4j-any-VERSION.jar -Djava.library.path=/path/to/native/libs MyApp


Setting up a development/runtime environment for a web (servlet) application is a bit more complicated. First, you are not invoking the JVM directly; the servlet container’s (e.g. Tomcat’s) run-script is doing that for you. At this point the shared library (.so or .dll) has to be placed on the servlet container’s shared-library load path. Unless the library is deployed to the standard system location for shared libraries (/usr/lib on Linux) or its location is already added to the path, the easiest way to do this in Tomcat is to edit the CATALINA_HOME/bin/setenv.sh script (setenv.bat on Windows; create it if it does not exist) and point it at the directory containing libyaz4j.so (or yaz4j.dll):

on Linux:

 LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/path/to/native/libs
 export LD_LIBRARY_PATH

on Windows (though no Windows build is yet provided):

 set PATH=%PATH%;X:\path\to\native\libs

That’s one way of doing it; another would be to alter the standard set of arguments passed to the JVM before Tomcat starts and add -Djava.library.path=/path/to/lib there. Depending on the situation this might be preferable/easier (on Debian/Ubuntu you can specify JVM arguments in the /etc/default/tomcat6 file).

With the shared library installed, we need to install the pure-Java yaz4j-any*.jar with the ZOOM API classes by placing it in Tomcat’s lib directory (CATALINA_HOME/lib). As this library makes the Java system call to load the native library into the JVM, you cannot simply package it along with your web application (inside the .war file) – it would try to load the library each time you deploy the webapp, and all subsequent deployments would fail.


With your servlet environment set up, all that is left is to write the actual application (peanuts :)). At Index Data we use Maven for managing builds of our Java software components, but Maven is also a great tool for quickly starting up a project. To generate a skeleton for our webapp, use the Maven archetype plugin:

mvn -DarchetypeVersion=1.0.1 -Darchetype.interactive=false \
-DarchetypeArtifactId=webapp-jee5 -DarchetypeGroupId=org.codehaus.mojo.archetypes \ 
-Dpackage=com.indexdata.zgate -DgroupId=com.indexdata -DartifactId=zgate \ 
archetype:generate --batch-mode

This will generate a basic webapp project structure:

`-- src
    |-- main
    |   |-- java
    |   |   `-- com
    |   |       `-- indexdata
    |   |           `-- zgate
    |   `-- webapp
    |       |-- WEB-INF
    |       |   `-- web.xml
    |       `-- index.jsp
    `-- test
        `-- java
            `-- com
                `-- indexdata
                    `-- zgate

Maven has already added the basic JEE APIs for web development as project dependencies; we need to do the same for yaz4j, so edit the pom.xml and add the following lines to the dependencies section:

<dependency>
  <groupId>GROUPID</groupId>
  <artifactId>yaz4j-any</artifactId>
  <version>VERSION</version>
  <scope>provided</scope>
</dependency>

It’s crucial that the scope of this dependency is set to provided; otherwise the library would end up packaged in the .war archive, and we don’t want that.

The implementation of our simple gateway will be contained in a single servlet – ZGateServlet – which we need to place under src/main/java/com/indexdata/zgate. The gateway will work by answering HTTP GET requests and will be controlled solely by HTTP parameters; the servlet’s doGet method is shown below:

protected void doGet(HttpServletRequest request, HttpServletResponse response)
    throws ServletException, IOException {
  String zurl = request.getParameter("zurl");
  if (zurl == null || zurl.isEmpty()) {
    response.sendError(400, "Missing parameter 'zurl'");
    return;
  }
  String query = request.getParameter("query");
  if (query == null || query.isEmpty()) {
    response.sendError(400, "Missing parameter 'query'");
    return;
  }
  String syntax = request.getParameter("syntax");
  if (syntax == null || syntax.isEmpty()) {
    response.sendError(400, "Missing parameter 'syntax'");
    return;
  }
  int maxrecs = 10;
  if (request.getParameter("maxrecs") != null
      && !request.getParameter("maxrecs").isEmpty()) {
    try {
      maxrecs = Integer.parseInt(request.getParameter("maxrecs"));
    } catch (NumberFormatException nfe) {
      response.sendError(400, "Malformed parameter 'maxrecs'");
      return;
    }
  }
  response.getWriter().println("SEARCH PARAMETERS");
  response.getWriter().println("zurl: " + zurl);
  response.getWriter().println("query: " + query);
  response.getWriter().println("syntax: " + syntax);
  response.getWriter().println("maxrecs: " + maxrecs);
  response.getWriter().println();
  Connection con = new Connection(zurl, 0);
  try {
    con.setSyntax(syntax);
    ResultSet set = con.search(query, Connection.QueryType.PrefixQuery);
    response.getWriter().println("Showing " + maxrecs + " of " + set.getSize());
    for (int i = 0; i < set.getSize() && i < maxrecs; i++) {
      Record rec = set.getRecord(i);
      response.getWriter().println(rec.render());
    }
  } catch (ZoomException ze) {
    throw new ServletException(ze);
  } finally {
    con.close();
  }
}
With the code in-place we can try to compile the project:

mvn compile

If all is OK, the next step is to register our servlet and map it to a URL in src/main/webapp/WEB-INF/web.xml:
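The exact entries depend on the names you chose; assuming the servlet class com.indexdata.zgate.ZGateServlet from above, the registration and mapping could look like this:

```xml
<servlet>
  <servlet-name>zgate</servlet-name>
  <servlet-class>com.indexdata.zgate.ZGateServlet</servlet-class>
</servlet>
<servlet-mapping>
  <servlet-name>zgate</servlet-name>
  <url-pattern>/zgate</url-pattern>
</servlet-mapping>
```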


On top of that, we will also make sure that our servlet is automatically triggered when accessing the root path of our application:
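One way to do that is a welcome-file entry pointing at the servlet mapping (the name zgate here is illustrative; since Servlet 2.4 a welcome file may resolve to a mapped servlet):

```xml
<welcome-file-list>
  <welcome-file>zgate</welcome-file>
</welcome-file-list>
```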


Now we are ready to build our webapp:

mvn package

The resulting .war archive is located under target/zgate.war. We can deploy it on Tomcat (e.g. by using the /admin Tomcat admin console) and test it with a browser or curl. Assuming Tomcat is running on localhost:8080, a request along the lines of the following should print the search parameters followed by the first records:

http://localhost:8080/zgate/?zurl=z3950.loc.gov:7090/voyager&query=water&syntax=usmarc
That’s it! You have just built yourself an HTTP-to-Z39.50 gateway! Just be careful about exposing it to the outside world – it’s not very secure and could easily be exploited. The source code and the gateway’s Maven project are available in the yaz4j Git repository under examples/zgate. In the meantime, Index Data is working on a Debian/Ubuntu package to greatly simplify the installation of yaz4j and the Tomcat configuration – so stay tuned! If you are interested in Windows support – e.g. a Visual Studio-based build or an installer – please let us know.

Competition and the Marketplace of Ideas

Recently, my son asked me a series of questions about the Cold War, and the political/military paradigm of mutually assured destruction (MAD for short). It’s always seemed like an odd premise to me, and somehow, discussing it with a 13-year-old doesn’t make it look any more sensible. However, we came to agree that landing on the moon was a pretty cool thing. Would the lunar landing have happened, realistically, without the Cold War? There’s probably not a stronger force in life than competition – indeed, life as we know it would not be possible without it: a race for resources, for growth (or procreation), for power. Is there a direct line between two single-celled organisms clustering together for warmth, and a whole nation uniting behind the wildest, craziest engineering feat just to prove a point to another nation?

These past few years, Open Source Software has emerged as a highly visible concept in libraries. It means different things to different people. Personally, when I released my first piece of open source code nearly 20 years ago (it was an adventure game, not library software), it was a way to share my code and ideas with others without possessing the infrastructure of a software company. In retrospect, it was also a way to participate in what I imagined to be the academic model of openly sharing ideas and knowledge, without the supporting scaffolding of publishing companies and research grants of which I was blissfully unaware at the time.

Index Data began releasing code under Open Source licenses shortly after we founded the company. We didn’t give a whole lot of thought to it at first; it just seemed natural. There was no established model for running a business on this premise at the time, but we knew we had no skills in selling software, and we figured this way at least people might get to see the code, maybe run it, and give us feedback (sometimes, it seemed like we craved the praise more than money). That became the beginning of an amazing 15-year journey; of searching for a business model that works for a group of geeks who enjoy writing industrial-strength code but who have little or no skills in marketing. That story probably would make a fun post in its own right.

But, recently, Open Source has gone from being something very obscure in the broader library business (when we first exhibited at ALA, in 2004, people would come up to us and ask what kind of a company name was ‘Open Source’) to the hot new thing in town, and as such, it has come to mean many different things to different people.

For libraries, it’s sometimes seen as a way to save money, or a possibility of getting new features by tweaking code (or having other people do it for them), and maybe a kick in the behind for some of the more staid, established vendors. For academic and library geeks and coders, it’s a way to get a seat at the big table; to write exciting code and influence the direction of technology at the deepest level. For some new service/support companies, it’s an opportunity to enter into the library system market without the overheads and upfront investment of creating a whole software platform from scratch. For some existing vendors, it’s perhaps seen as new competition, or a dilution of a marketplace that was already crowded. Some see a paradigm shift, an unstoppable wave towards a new way of doing business; others see a distraction, a wasted effort.

I work every day with libraries, with librarians, with library geeks and coders, with academics, and with different kinds of library service and software providers, and I have come to form a different perspective.

To my mind, the central aspect of open source software is that it allows for a different, more direct dialog between software developers and users of that software. It can break down institutional boundaries. Sometimes things get messy. Sometimes a lot of effort is wasted by groups of libraries in well-meaning efforts to build a better mousetrap. But sometimes, new ideas can be brought from the whiteboard into production literally in days, as an inspiration for others to follow. As someone who enjoys designing software tools for others to use, I get to have relationships with individual coders and geeks, as well as the CIOs of large businesses or organizations (and, quite a few times, I have been able to watch the former evolve into the latter), and our code is stronger and better for the range of challenges it is faced with, as are we as programmers.

Above the daily effort of coders coding, geeks trying out new mashups, and companies competing is a larger dialog. A kind of marketplace of ideas. Much like a real marketplace, or any economy, it is less governed by rules and regulations than by our intrinsic human desire to work together, to share, to compete, and to win. It is a messy and chaotic process, like life itself, and a process that probably spends as much time moving sideways as it does forwards or upwards.

The players in this marketplace are different kinds of organizations, groups, and people: Established software vendors; library interest groups; national and regional standards bodies; library consortia; formally collaborating groups of libraries and informally collaborating individuals. All of them breaking their backs and minds to come up with the best answers to the hardest questions, each from their own perspective and with their own experiences to guide them.

I would like to stipulate that no single organization in this marketplace of ideas holds the one true answer: The key to the future; the secret to how libraries can behave and work to carve a place for themselves in the Internet age. But this is okay, because the marketplace itself, not the individual players, is our best tool for finding the answer. It is in the open competition of ideas, thoughts, experience, and passion that we move forward as a community, and it is the challenge posed by the marketplace that drives each of us to do our best, as individuals and organizations.

It is my hope that libraries and librarianship will continue to be able to support a rich flora of ideas and approaches, to attract people willing to pour their hearts into making things better, even if they don’t always know how.