Developer's introduction to Index Data

We offer a set of tools that combine into resource discovery systems, including local indices and harvesting capability. The pieces of this puzzle are web services exchanging XML or JSON via a few sensible APIs and industry standards. Our aim is that you can easily integrate any or all of them into your system; build the new and interesting using our humble bricks. Here's a bird's-eye view:

Search targets

In order to search many things at once, we need to be able to search at all. Our metasearch middleware knows how to query several types of targets, described here along with some of our software that ties in:

Z39.50 & SRU/SRW

Our company has its roots in federated search of targets that speak these standard protocols for searching bibliographic data. We created and have long maintained the Yaz toolkit, which has spared countless library programmers from having to implement the binary Z39.50 protocol, and includes a variety of helpful utilities. Yaz is by far the most widely used implementation of these protocols worldwide and, in the spirit of open source, our language bindings have been augmented by the contributions of many others. As a result, it’s likely available in whatever language you happen to be developing in. For more information, and to get straight into interfacing with library catalogues, take a look at our tutorial.
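For a taste of the simpler of the two protocols, here is a minimal sketch of an SRU searchRetrieve request using nothing but HTTP and a CQL query. The endpoint URL and record schema are placeholders; any SRU-capable catalogue will do.

```python
# A minimal SRU searchRetrieve request -- plain HTTP plus a CQL query.
# The endpoint and record schema are placeholders for any SRU server.
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

SRU_ENDPOINT = "http://example.org/sru"   # hypothetical catalogue
SRW_NS = "{http://www.loc.gov/zing/srw/}"

params = urllib.parse.urlencode({
    "version": "1.1",
    "operation": "searchRetrieve",
    "query": 'dc.title = "hamlet"',       # CQL
    "maximumRecords": "5",
    "recordSchema": "dc",
})

with urllib.request.urlopen(f"{SRU_ENDPOINT}?{params}") as resp:
    tree = ET.parse(resp)

# Total hit count, then the payload of each returned record.
print("hits:", tree.findtext(f".//{SRW_NS}numberOfRecords"))
for data in tree.iter(f"{SRW_NS}recordData"):
    print(ET.tostring(data, encoding="unicode")[:200])
```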

Connectors

The Connector Platform, our web automation framework, can be used to build a screen-scraper that provides results from any HTML search interface or XML web service on the web. Or you can use one of the thousands already built and available through MasterKey Connect. See it in action at emusik.dk, where our connectors power a music search that spans multiple streaming services.

Solr & Zebra

You can map the schema of existing Solr collections. Or use our Harvester to build a new local Solr index from remote collections, via OAI-PMH or through connectors that traverse and harvest web sites. Before Solr was available, we had a search engine of our own: Zebra. If you want performance, don't already have Solr expertise in house, and are primarily interested in bibliographic metadata, it may still be the best solution.
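As a rough illustration of what the Harvester does under the hood, here is a sketch of the core OAI-PMH harvesting loop: page through ListRecords, following resumption tokens until the repository is exhausted. The endpoint is a placeholder, and the real Harvester adds scheduling, normalisation, and Solr indexing on top of this.

```python
# The core OAI-PMH harvesting loop: page through ListRecords, following
# resumption tokens until the repository is exhausted. The endpoint is
# a placeholder; the real Harvester adds scheduling, normalisation and
# Solr indexing on top of this.
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

OAI_ENDPOINT = "http://example.org/oai"   # hypothetical repository
OAI = "{http://www.openarchives.org/OAI/2.0/}"

params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
while True:
    url = f"{OAI_ENDPOINT}?{urllib.parse.urlencode(params)}"
    with urllib.request.urlopen(url) as resp:
        tree = ET.parse(resp)
    for rec in tree.iter(f"{OAI}record"):
        # Hand each record to your indexing pipeline here.
        print(rec.findtext(f"{OAI}header/{OAI}identifier"))
    # A non-empty resumptionToken means another page is waiting.
    token = tree.findtext(f".//{OAI}resumptionToken")
    if not token:
        break
    params = {"verb": "ListRecords", "resumptionToken": token}
```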

Metasearch

Several of our flagship projects have been federated search systems, i.e. tools for searching multiple resources and combining the results. Such systems have two components at their core: searching happens in Pazpar2, and the Service Proxy adds tools suitable for managing thousands of targets for thousands of member institutions.

Pazpar2

Pazpar2 takes a search query in CCL (Common Command Language) and performs the search against any supported targets. Targets are configured via an XML format where you define a set of defaults and override them on a per-target basis. Target definitions reference an XSLT mapping between an XML representation of the target's output and a common set of metadata that you can code against. We've bundled mappings for popular formats like Dublin Core and MARC. You're not limited to searching the indexes and protocols described in the previous section: there is a plugin API for defining new ones. For example, we've implemented one for Primo.
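To make the web service concrete, here is a minimal sketch of starting a session and issuing a CCL query over Pazpar2's HTTP protocol. The host, port, and path are assumptions; adjust them for your deployment.

```python
# Starting a Pazpar2 session and issuing a CCL query over its HTTP
# protocol. The host, port and path are deployment-specific guesses.
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

PZ2 = "http://localhost:9004/search.pz2"  # adjust for your deployment

def pz2(command, **args):
    """Issue one protocol command and parse the XML response."""
    qs = urllib.parse.urlencode(dict(command=command, **args))
    with urllib.request.urlopen(f"{PZ2}?{qs}") as resp:
        return ET.parse(resp)

session = pz2("init").findtext("session")       # one session per user
pz2("search", session=session, query="water")   # a CCL query string
```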

The output of a search is a set of documents normalised via the mapping into the fields you've specified in the configuration and merged into records. Merging, ranking, and sorting are all tuneable. Metadata fields can optionally be made into termlists, which can be used to implement facets in a UI. Result sets are not static: in metasearching, not all targets return at once. Once a search is established, it can be polled on an ongoing basis for the latest information, including how many targets have yet to report.
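Continuing the sketch above, polling is just more of the same protocol: show returns the merged, ranked records so far, termlist returns facet counts, and stat reports how many targets are still working. The field and termlist names shown depend on your metadata configuration.

```python
# Continuing the sketch above: poll the session until every target has
# reported. Field names like md-title come from your metadata mapping;
# the "author" termlist assumes one is configured.
import time

while True:
    stat = pz2("stat", session=session)
    show = pz2("show", session=session, start=0, num=5, sort="relevance")
    print("merged records so far:", show.findtext("merged"))
    for hit in show.iter("hit"):
        print(" -", hit.findtext("md-title"))
    if stat.findtext("activeclients") == "0":    # all targets done
        break
    time.sleep(1)   # in a browser, pz2.js does this polling for you

facets = pz2("termlist", session=session, name="author")
```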

Service Proxy

Our proprietary Service Proxy builds on Pazpar2 to add authentication and target management through use of the Torus data structure. This enables target sources (the harvester, the connector repository, IRSpy, etc.) to expose their entire catalogue of targets for discovery. Accounts can be configured with access to a subset of these, and target selections are tailored for library consortia to provide a relevant federated search to their members. This is done using MKAdmin, an administrative interface where you can browse targets and check them off.

The Service Proxy is addressable via an extended version of the same protocol as Pazpar2. Since it’s Pazpar2 underneath, it uses the same XML to map metadata and configure targets. You can build your own target source (a toroid) by implementing a webservice that serves target definitions.
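As a sketch of the idea, the toy webservice below serves a static list of target definitions. The document shape and field names are illustrative guesses, not the actual torus schema; consult the Service Proxy documentation for the real record format.

```python
# A toy target source: a webservice serving target definitions for the
# Service Proxy to consume. The record fields and document shape here
# are illustrative only, not the real torus schema.
from http.server import BaseHTTPRequestHandler, HTTPServer

TARGETS = """<?xml version="1.0"?>
<records>
  <record>
    <displayName>Example Library Catalogue</displayName>
    <zurl>z3950.example.org:210/catalogue</zurl>
  </record>
</records>
"""

class ToroidHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = TARGETS.encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/xml")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("localhost", 8080), ToroidHandler).serve_forever()
```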

User interfaces

Given that our searching middleware is addressed with a simple web-service protocol, one can build a variety of UIs using any platform capable of HTTP. In particular, UIs can run inside web browsers, using AJAX to present a dynamic UI. This, our preferred approach, allows results to be rendered and rerendered fully on the client side and reduces bandwidth, since only compressed record XML crosses the wire rather than full markup. To facilitate building such UIs and integrating them into your own platform, we provide pz2.js, a convenient JavaScript library that handles AJAX polling of the search service and passes the data to your UI via callbacks.

We have several implementations already available for you to use: MK2, a standalone web site (demo); MkDru (demo), a themeable Drupal module that works out of the box; and MkJSF, a UI built using the JavaServer Faces framework. Also available is a third party module for the Typo3 CMS.

Widgets

For even more convenient embedding of search interfaces on the web, we're actively developing MKWS, the MasterKey Widget Set. It's a script that replaces appropriately labelled elements on your page with working search interface components, making embedding a search as convenient as embedding a YouTube video. Check back as we're building new widgets and demos, like the quick prototype that looks up any word you select by metasearching both Webster's and a German-English dictionary.

Web automation

The Connector Platform lets you conveniently build an XML API for any arbitrary actions you'd like to perform on similar web sites. We call these actions "tasks". Composed of "steps" (JavaScript modules with a config UI), tasks are passed a JSON object as input and return one when complete. For our search connectors we use init, search, parse, and next page tasks; but you can create connectors that define any arbitrary verbs. Key to this model is that no attempt is made within the connectors to enforce constraints such as parse needing to be run after search: the point is for the connector to abstract away the web site into a uniform API. Most anything common to all connectors is best left to the external logic wrapping them, which provides both flexibility and simplicity.
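To illustrate that division of labour, here is a sketch of external logic driving a search connector's tasks. The JSON-over-HTTP interface and the hasNextPage field are hypothetical, invented for illustration; the point is that sequencing and paging live in the wrapper, not in the connector.

```python
# External logic driving a connector: the connector just exposes tasks;
# the sequencing (init before search before parse) lives out here. The
# HTTP interface and field names are hypothetical -- a sketch of the
# model, not the Connector Platform's actual wire API.
import json
import urllib.request

ENGINE = "http://localhost:8000"  # hypothetical connector engine

def run_task(connector, task, payload):
    """POST a JSON input to a task; every task returns a JSON object."""
    req = urllib.request.Request(
        f"{ENGINE}/connector/{connector}/task/{task}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# The wrapper, not the connector, decides task order and paging.
run_task("somesite", "init", {})
run_task("somesite", "search", {"query": "jazz"})
results = run_task("somesite", "parse", {})
while results.get("hasNextPage"):   # hypothetical paging flag
    run_task("somesite", "next_page", {})
    results = run_task("somesite", "parse", {})
```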

This leaves the software open to be used for pretty much anything on the web. Create a "connector template" (XML defining which tasks and parameters exist) with tasks for things like send, list, new_folder, etc., and you're well on your way to building an SMTP/IMAP daemon that wraps a bunch of closed-off webmail services. Or you could wrap an ILS to get an API for circulation. We did: let's talk if you're curious.

Scaling and performance

One would expect that web scraping with a browser instance would be slow. Surprisingly enough, it’s not – at least, not for many useful applications. We routinely do 20-40 transactions a second in production and things hold up just fine on commodity hardware. This is thanks in part to Metaproxy, our search-routing maestro that does intelligent filtering, load-balancing, caching and more for your SRU or Z39.50 requests.