A hard-to-find bug, and equivalent APIs in low- and high-level languages

We recently ran into a mysterious problem in one of our glue-layer programs. As soon as we had the “Aha!” moment and realised what was going wrong, we concluded that the mistake was more cultural than technical, and that it was potentially of wide interest – hence this blog post.

The bug came up in the IRSpy Toroid, a gateway daemon which searches in the IRSpy database (a catalogue of Z39.50 and SRU servers), and presents the results in a format that can be used to drive the configuration of our Masterkey metasearcher. The toroid is written in Ruby, a language that several of us at Index Data really like, but which we’ve not yet used much for production code.

The Toroid code is conceptually pretty simple. It acts as an HTTP server, accepting requests of the form /torus/REALM?query=QUERY, submits the specific query to the IRSpy Z39.50 server, transforms the returned records into the appropriate format by means of XSLT transformations, and returns the resulting record list. This worked fine for our initial set of test-cases, but when we started to run it against larger data-sets – a few hundred records or so – we saw the Toroid mysteriously go wrong. Some way into processing the result list, it would start to find that it couldn’t retrieve records: the fetch-record operation was yielding an undefined value, and no error indication was available.

In ZOOM-C terms, the ZOOM_resultset_record() function was returning a null pointer. Since the Toroid is written in Ruby, we were using the ZOOM-Ruby gem, which wraps this call up, idiomatically for Ruby, as an array indexing operation. So rs[i] in Ruby is equivalent to ZOOM_resultset_record(rs, i) in C. Looking at the function in the ZOOM-Ruby extension packages that implements this indexing operation, we couldn’t see an obvious mistake. Looking at the server logs, we found that records were not even being requested past the not-working-any-more point; and repeated runs showed that this point would randomly fall at different points in the result-set traversal – indeed, sometimes, unpredictably, the operation would work perfectly.

(Now that I see this description written down in black and white, it seems embarrassingly obvious what the nature of the problem was. But as is so often the case, the trees obscured the wood.)

It turned out that the problem was in the function that the Toroid used to search in the IRSpy database, which was something like the following (simplified for clarity):

def search_in_zoom_db(query)
  conn = ZOOM::Connection.open(Host, Port)
  return conn.search(query)

And the calling code was something like:

rs = search_in_zoom_db(query)
for i in (first..last)
  record = rs[i]
  result << transform(record)

Our mistake was in assuming that the Result-Set object returned from search_in_zoom_db() was valid; or rather, that it would remain valid. We assumed that it would keep a reference to the Connection that created it. But we were wrong: the Ruby ZOOM binding follows the semantics of the underlying ZOOM-C library, in which Result-Sets may outlive their connections – something that is potentially useful if, say, the Result-Set has had its cache populated with records, but the Connection is to be closed. Accordingly, ZOOM-C requires that you keep the Connection object around, destroying it only after the Result-Set has finished fetching from the server – which is often, but not always, when you’ve finished with the Result-Set all together.

So the IRSpy Toroid had the bug that the Connection object went out of scope as soon as the Result-Set was returned. This of course meant that the Connection could be garbage-collected and destructed at any time thereafter, but we didn’t see this bug at first because we were mostly testing on small result-sets and/or not doing much processing that would force (or allow time for) a GC before finishing the fetch operations.

When we started running on larger result-sets, and also when we added new functionality within the loop (guessing the vendor of the server described by each record, by running a bunch of regular expressions over its implementationId), that changed; and the bug showed up. The Connection started getting GC‘d out from under our feet so that the record requests in the loop would start yielding null results.

(All of this was made more obscure by what turned out to be a quite unrelated bug in the XSLT library that we were initially using, which would corrupt the parameters passed into its serve() method. I worked around this by switching to libxml-libxslt, which has had the fringe-benefit of making the Toroid a lot faster.)

The short-term fix for our garbage-collection problem is obvious: I’ve just made the Connection object an instance variable of the IRSpyToroid object that is used to serve the HTTP requests, so that it outlives the function call in which it’s created. (I’ll revisit this soon and think in more detail about the Connection’s lifetime.)

But we also spent some time thinking about what the correct semantics should be. YAZ’s Java binding (YAZ4J, though it should really be called ZOOM4J) does indeed do what we all assumed the Ruby binding does: it has its Result-Set class keep a reference to the Connection. The reference-count is decremented when the Result-Set is destructed, allowing the Connection to follow it at its leisure. For a high-level language, this seems like The Right Thing, and I plan to fix Ruby-ZOOM accordingly. I have some other fixes to make, most notably exposing the ZOOM_Connection_error() method, so I’ll be submitting patches soon.

But we’ve concluded that the semantics of the underlying calls are completely appropriate for a low-level language such as C, where garbage-collection is not used and all objects’ lifetimes are explicitly managed. C programmers expect to have to do this kind of thing; programmers in Java and Ruby do not, and we think that the bindings in those languages should reflect those expectations.

We’ve also concluded that the ZOOM-C documentation should be updated to be very, very clear about how it handles Connection and Result-Set management. (The somewhat analogous issue with Result-Set ownership of Record objects never seems to cause anyone any problems, probably because the existence of ZOOM_Record_clone() makes it clear what’s going on.)

So the moral is: when translating an API from low-level language to a high-level one – most likely by wrapping the former in the latter – we need to be careful to tweak the detailed semantics to match what’s idiomatic in the high-level language. That may be different from what’s idiomatic in the low-level language, and indeed may vary from one high-level language to another.

A final thought: it’s only now that I reach the end of this article that I realise I’ve been quite unconcernedly calling C a “low-level” language. I guess most programmers these days would concur; we’ve come a long way since I started writing C programs on Unix systems, back in 1986!


Similar issues, similar situation

I had a similar situation when using a ruby library to access an LDAP service a couple of years ago.