Using Smart Widgets to Integrate Information Access

Next post: Adding Discovery and more to Koha with Smart Widgets.

This is the first of a series of blog posts in which we will talk about a concept that we have been developing over the past few years. We call it ‘smart widgets’ to distinguish our approach to widgets from the almost ubiquitous notion of ‘widgets’ meaning little search boxes that you insert into your page, but which ultimately send your users to some remote site.

How do they work? Well, for a really simple example, consider this search box:

You can put in a search and you will search across a collection of different resources, in real time. You might like to look at the HTML source code of this blog post to see how it was implemented, but to save you the trouble, the bit that does the searching looks like this:

<link rel="stylesheet" type="text/css" href="//" />
<script type="text/javascript" src="//"></script>
<div class="mkwsSearch"></div>
<div class="mkwsResults"></div>

That’s all. You can take that bit of HTML and put it in your home page, and it should do the same thing. The code uses Ajax to communicate with our SaaS MasterKey back-end, and it will search virtually any combination of resources that you can imagine if you have an account.

In a nutshell, our Smart Widgets are intended to make access to information a fluid thing — something that can be easily manipulated and surfaced just about anywhere you can imagine, from a blog post to a library home page. In that sense, our widgets are two things:

  • A technology platform that uses dynamic HTML together with our SaaS back-end to make it incredibly easy to embed access to almost any combination of resources into almost any page.
  • A whole new way of thinking about how information sources are used in the service of library patrons, and how the library can project its services into the surrounding community (whether that is a town, a school, or a business).

In a way, that second point arose from the realization that “searching” (i.e. providing mechanisms by which patrons could access pre-indexed collections of materials by entering search terms) has become a commodity. Librarians have spent decades (if not centuries) thinking of mechanisms to make things findable. Today, the Internet and Google in particular have made that function utterly mainstream: it is part of the fabric of the Internet, and so much a part of people’s everyday experience that it has become very hard for the library community to convince anyone that we have a better solution. This is pleasing in a way; it has been cool to watch something as esoteric as searching become just an everyday part of our culture. But it also presents new challenges. Ironically, while easier access to massive piles of information has led some decision makers to question the continued value of libraries and librarianship, people are at the same time struggling with information overload: how to filter and select the best sources to answer a given question. Information is not the same as knowledge, and access to too much information may in fact impede the acquisition of knowledge.

We in the library community are partly to blame for this. We have pursued the vision of the ‘universal search box’ for so long and with such ardor that it is only now, as we’re finally reaching that goal, that some people are asking if this really was such a good idea after all. I think certainly for some tasks, single search boxes are a great solution (the success of Google makes this clear), but I don’t think it’s the right answer for every problem. We believe that libraries have more important roles to play than merely managing search boxes, and we would like to use our widget platform to support those roles, by organizing and enabling access to information, and ultimately by facilitating the creation of knowledge from that information.

Below is a widget that surfaces the current results from the Digital Public Library of America for ‘american political history’. There is no search box: The widget retrieves the most current information based on a search that has been prepared by the page author (me, in this case).

DPLA results will appear here

The HTML source code for this widget looks like this:

<div class="mkwsRecords mkwsTeam_dpla"
     autosearch="american political history"></div>

Why is such a widget useful? Well, the widget can be used to surface results from virtually any combination of resources (over 100 databases per widget), ordered in any way desirable, for any given search. The sources for the widget can include subscription databases and open access sources. Different widgets can be combined on a page to illuminate a current event, a certain genre of literature, or a subject of research. The widgets can be a powerful tool to build information sources of all kinds for the users of a library.

Over the coming days and weeks, we will be discussing various applications and uses of the widgets. We hope you’ll agree they present some pretty exciting possibilities.

As always, feel free to contact us if you have any questions or if you are interested in using the widgets in your own applications or site. You can also find more information at this site.


Add Metasearching to Your Application with Three Lines of HTML

One of the problems we’ve had over and over at Index Data is that we build all these cool back-end tools — things like the metasearching middleware Pazpar2 — but then don’t have a good way to show them off. We’ve never really focussed much on building UIs, so we have to do demos that go like this:

… And then you just type this 200-character URL into your web browser, and you can see this XML response that comes back, and then you just take this identifier from this bit of the XML structure and use it to build this other 200-character URL, which …

Yesterday, we launched a new toolkit that changes that — the MasterKey Widget Set, or MKWS for short. The idea is that you can add widgets to your existing web-site — ILS, content management system, blog, or whatever. The widgets provide broadcast searching quickly and painlessly, customised to fit the way you do things. One widget for a search box, one for result records, one for facets, one for switching between UI languages, and so on. Mix ’em and match ’em. The individual widgets are HTML <div>s with well-known ids beginning with mkws: for example, <div id="mkwsSearch"> provides the search box and button.

So for example, the following three lines of HTML constitute a complete, functional (though ugly) metasearcher:

<script type="text/javascript" src=""></script>
<div id="mkwsSearch"></div>
<div id="mkwsResults"></div>

The search-related content (search boxes, results, facets, paging controls, sorting controls) is all styled with CSS using MKWS-specific classes. You can easily override those classes with your own CSS, to match the widgets to your own web-site’s look and feel.

Once you move past this very simplest kind of MKWS application, you can have a lot of control over behaviour. We’ll look at some of the other options in subsequent posts, but if you want a sneak preview, take a look at the MKWS manual, Embedded metasearching with the MasterKey Widget Set. There are plenty of examples linked from the home page, too.

We’re really excited about MKWS because it opens up metasearching application design to people who otherwise wouldn’t be able to go near it. You don’t need to be a JavaScript wizard, or know about XML or JSON. We hope we’ll see designers using MKWS to make things we’ve not even imagined yet.

Switchboard Leverage

We’re in the business of making access to information easier for people and, most of all, for SOFTWARE that in turn makes that information available to people. A lot of our software is based around a kind of switchboard or functional ‘hub’ model which means that when we extend a capability in ONE area, new possibilities open up in other areas that we don’t necessarily even think about ourselves. Ironically, it means that we don’t always KNOW — and certainly we don’t always ADVERTISE — what we are capable of. In this post, I want to think about some of the ways in which we switch between functions, protocols, and data.

Our YAZ toolkit, launched in ’95, is fast approaching the end of its teenage years. I had more hair over a larger area of my head, then. It started its life as one very useful kind of switchboard: it allowed clients and servers to implement both Z39.50 and the Open Systems Interconnection (OSI) based flavors of information retrieval protocols, which were pursued in the US and the rest of the world, respectively. This created huge value at the time, and allowed both groups of implementors to focus on functionality and content, without being limited in whom they could interoperate with. ISO eventually came to its senses and adopted Z39.50. We helped create a client-side API (ZOOM), which enabled developers in a huge number of different programming languages to develop search clients using YAZ and other toolkits. When SRU, SRW, and later SRU 2 came along, we used the abstract nature of the ZOOM API to hide the differences between all these different protocols, and YAZ again became a kind of switchboard for competing ways of doing the same thing. A while back, we also added support for Solr’s webservice API, which allows you to break down the distinction between what is indexed by you and what is remotely accessed from elsewhere.

Developers who use YAZ, then, find it easy to create polyglot applications — systems which will deal effortlessly with numerous data sources, and which may EXPOSE data through many different mechanisms. Every time we add a new mechanism — because the world seems to feel that anything worth doing is worth doing in many ways — many new possibilities are opened.
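The switchboard idea can be sketched as one search interface with interchangeable protocol back-ends. This is only an illustration in Python, not the actual YAZ or ZOOM API: the names `SearchBackend`, `Switchboard`, and the two stub back-ends are hypothetical, and the stubs return canned records instead of talking to a server.

```python
from __future__ import annotations

from abc import ABC, abstractmethod


class SearchBackend(ABC):
    """Hypothetical common interface; each protocol hides behind it."""

    @abstractmethod
    def search(self, query: str) -> list[dict]:
        ...


class StubZ3950Backend(SearchBackend):
    # Stand-in for a Z39.50 client; returns canned data, no network I/O.
    def search(self, query: str) -> list[dict]:
        return [{"title": f"Z39.50 hit for {query}", "source": "z3950"}]


class StubSruBackend(SearchBackend):
    # Stand-in for an SRU client; returns canned data, no network I/O.
    def search(self, query: str) -> list[dict]:
        return [{"title": f"SRU hit for {query}", "source": "sru"}]


class Switchboard:
    """Sends one query to every registered back-end and pools the records."""

    def __init__(self) -> None:
        self._backends: dict[str, SearchBackend] = {}

    def register(self, name: str, backend: SearchBackend) -> None:
        self._backends[name] = backend

    def search(self, query: str) -> list[dict]:
        records: list[dict] = []
        for backend in self._backends.values():
            records.extend(backend.search(query))
        return records


board = Switchboard()
board.register("z3950", StubZ3950Backend())
board.register("sru", StubSruBackend())
results = board.search("dinosaurs")
```

The calling code never learns which protocol produced a record, so supporting a new mechanism means registering one more back-end rather than touching every application.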

Our Metaproxy takes this a step further by essentially acting as a switchboard between anything that YAZ supports (and then some!). Some folks use the SRU server function of Metaproxy to access Z39.50 resources without having to use an API like YAZ’s own. In fact, though, Metaproxy can talk to virtually ANYTHING you can imagine: SRU, Z39.50, Solr-based indexes. It can even use our Connector technology to access proprietary APIs and screen-scraped resources. Our list of supported search targets is climbing quickly towards four thousand, and lately we have even made it a SaaS platform, so people can access just about ANYTHING without having to install a bunch of software locally. But people are also using Metaproxy as a convenient way to implement open standards like Z39.50 and SRU on top of their OWN content — it can talk to a local index in Solr and expose the contents to the world in a well-defined way. Just lately, thanks to SRU 2.0, you can even share facet functionality in this way.

Our MasterKey platform takes the capabilities of this ‘access layer’ and adds some crucial elements: One is a cross-database search function that allows for extremely efficient metasearching across Solr-based indexes and remote sources through almost any reasonable mechanism, with relevance ranking, merging, and facets generated on the fly. Another element is a model for administering subscriptions and search targets: Once you have thousands of sources to keep track of, making it easy for people to sift through them to choose just the right sources for their end-users becomes a big challenge.

On top of this we have yet another switchboard: We have a simple webservice API which exposes this unified view of a huge information landscape through a faceted search metaphor, but layered on top of that we have a substantial array of different tools to enable people to leverage all this functionality: Today, we have plugins for the Drupal and Typo3 CMSes, for JavaServer Faces, for JavaScript programmers, and, soon to come, a widget set which will enable non-programmers to drop very advanced search functionality into any website just by adding a couple of HTML nodes to their page layout.

Each time we add a function to one of the corners of our platform, new capabilities emerge in the strangest, unthought-of corners of the larger framework. Maybe it is not so strange that out of all the information-related tasks we perform every day, perhaps the one we struggle with the hardest is the answer to the simple question: “So, what do you guys DO, exactly?”

Build a search application in JavaServer Faces with mkjsf

If you are a Java shop looking to build web sites on top of Pazpar2 or our MasterKey platform, then MasterKey JSF (mkjsf) could be just the tool to kick-start your project.

You know your J2EE and Ajax, and you might be considering JavaServer Faces for your UI development. Or maybe you have already developed a Faces application and would like to integrate metasearch into it.

You may also know about Pazpar2, but exactly how do you get your JSF application wired up with Pazpar2’s protocol – the commands, the parameter settings, the session handling, and the polling? MasterKey JSF addresses those questions by putting the power of Pazpar2 at the UI developer’s fingertips.

We have just released MasterKey JSF as open source in its first version, and we’re hoping you will take it for a spin and let us know what you think.

Here is how it works:

MasterKey JSF is a library (mkjsf.jar) that you include in your JSF project, and after a bare minimum of configuration, you should be good to go.

The mkjsf.jar exposes the Pazpar2 protocol to the page author in the form of two named objects: pzreq (requests) and pzresp (responses). Using the pzreq object, the page author selects commands and sets command parameters for execution. The Pazpar2 responses can then be retrieved from pzresp for display in the page.

Let’s take a code example: in your Faces page, you create the HTML form that will communicate with Pazpar2. Make a JSF input field that sets the query parameter of your search command. The reference to that parameter would be

Next, to display the results, render data from the ‘show’ results to the page by iterating over the hits and writing out data elements like hit.title. Here is the entire form:

#{hit.title} #{} #{}


That’s _almost_ it. Once the user clicks “Execute My Search”, the components should send the query to Pazpar2, start polling for results and continue until Pazpar2 says it’s done.

The mechanism that ensures that the search is executed, that the polling actually takes place, and that the element myshow will in fact be rendered on each poll, is invoked by the instruction

To enable this mechanism in the page, you need to add one more tag than shown above. This tag must be present too:

With the parameter renderWhileActiveclients="myshow", you are instructing the component to render the UI element myshow on each poll to Pazpar2. The tag is a wrapper around a JavaScript function that repeatedly invokes a Java method, which in turn makes requests to Pazpar2 to retrieve show, termlist, stat, and bytarget data.
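To make the polling cycle concrete, here is a rough sketch of the same loop in Python. It is not the mkjsf implementation: `fetch` and `render` are hypothetical callables standing in for the Pazpar2 requests and the page re-rendering, and the loop assumes the parsed show response carries an `activeclients` count that drops to zero when all targets have answered.

```python
import time
from typing import Callable

# The four Pazpar2 commands the tag is described as requesting on each poll.
POLL_COMMANDS = ("show", "termlist", "stat", "bytarget")


def poll_until_done(
    fetch: Callable[[str], dict],
    render: Callable[[dict], None],
    interval: float = 0.5,
    max_polls: int = 100,
) -> int:
    """Poll until no clients are active; return the number of polls made.

    `fetch(command)` is a hypothetical helper returning the parsed response
    for one command; `render(data)` redraws the watched UI elements.
    """
    for polls in range(1, max_polls + 1):
        data = {command: fetch(command) for command in POLL_COMMANDS}
        render(data)  # re-render on every poll, as renderWhileActiveclients does
        if int(data["show"].get("activeclients", 0)) == 0:
            return polls
        time.sleep(interval)
    return max_polls


# Simulated back-end: two targets still working, then one, then none.
_active = iter([2, 1, 0])


def fake_fetch(command: str) -> dict:
    if command == "show":
        return {"activeclients": next(_active), "hits": []}
    return {}


rendered = []
polls_made = poll_until_done(fake_fetch, rendered.append, interval=0.0)
```

The `max_polls` cap is a design choice of this sketch, so that a target that never finishes cannot keep the page polling forever.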

So far the page in this example only uses show data, but you will probably want to add other stuff, to make a rich user experience. As you add more elements to your page – for instance facets or pagination – you would also want to add more of your UI elements to the pz2watch tag – as in renderWhileActiveclients="myshow myfacets".

We have written a tutorial describing this example and more. The tutorial comes with a web application (a war file) with a set of pages demonstrating step by step how to add functionality to your Pazpar2 application. The tutorial also describes the initial configuration to be done. For the examples above, you would also need to add the right namespaces on top of your XHTML – like pz2utils. That’s in the tutorial too.

You can use mkjsf against our cloud-hosted back-end. You can use the guest account to try it out against a selection of open access resources without having to install any other software than mkjsf itself, or you can contact us for a hosted MasterKey account to get access to virtually any imaginable set of searchable resources. However, you can also install Pazpar2 on your own and use it to search any combination of Solr indexes, Z39.50 servers, and SRU servers, possibly supplementing these with MasterKey Connect to get access to more difficult resources.

We built the mkjsf jar for Tomcat 7, so it contains all the J2EE support — for JSF 2, EL expressions and dependency injection — the stuff that doesn’t come with the Tomcat Servlet container itself. The jar will thus contain too much for deployment to application servers like Glassfish or JBoss. We’ve made separate builds and tested them on those two as well, so it should work. Let us know if Tomcat 7 is not your thing, and we might assist with J2EE application server deployment.

Here is the PDF tutorial: mkjsf-tutorial.pdf

On our ftp server you will also find the sample application used throughout the tutorial, for Tomcat 7 and for GlassFish/JBoss respectively. It should be ready to drop into the auto-deploy directory of your server. That was the intention anyway, so let us know if that gives you any issues.

You can also find more documentation in the Javadoc.

Finally, for a demonstration of the capabilities of the library, take a look at our demo at

Creative Use of Pazpar2

It’s always fun to see someone do something really neat with your software. This elegantly designed search interface for Asia studies makes excellent use of Pazpar2. I particularly like the clever use of a bar chart for the date facet. Nice work!

What if it’s about the People, Stupid?

Most of the science of Information Retrieval centers around being able to find and rank the right set of documents in response to a given query. We spend much time arguing about technical details like ranking algorithms and the benefits of indexing versus broadcast searching. Every Information Professional I know both deifies and fears Google because they get it right most of the time — enough so that many people tend to assume that whatever pops to the top of a Google search MUST be right, because it’s right there, in the result screen. Producing that seemingly simple list of documents from a query is ultimately what our work and art boils down to.

But what if we’ve got it all wrong?

Web2.0, by many definitions, is about collaborative content creation. Tim Berners-Lee says that Web3.0 is going to be about Linked Data.

What if the most important function of the web is really to bring people together around shared goals, dreams, and interests, and those precious documents are really just so much metadata about people’s skills, knowledge, and potential?

Watch somebody study a field long enough and deep enough, and eventually it seems like it becomes more about the Big Thinkers who have characterized the field than about any individual theory or document. You want to get into their heads; understand what inspires them. If you can, you want to ask them questions and learn from them. Then you build upon it.

It’s easy to find the Einsteins, the Darwins, the Freuds. Dig a little deeper, and you find the people, themselves giants, who stood on their shoulders. But what if you want to find the person or team who knows the MOST about a very specific type of protein, about building Firefox Plugins, or about adjusting the wheel bearings on a certain kind of vintage motorcycle? This seems to happen to me all the time when I am digging into something for work or play. If you get deep enough, looking for documents doesn’t sate your appetite, and you start to look for the masters of the domain — to see what they’re up to; to get into their heads, to ask them questions, or to hire them.

We spend an awful lot of time talking about workset merging, deduplication, persistent URLs for works. But maybe we need to spend more time talking about the authors. Maybe at a certain level of research, that humble Author facet becomes what it’s all about, and we should all be spending more time worrying about the dismal state of author name normal forms and authorities in the sources we work with. Are there things we should learn about relationships among authors in published works (and mentions of authors in casual websites) that might help guide our users more quickly to the experts, the super-groups, and that one forum where most people are friendly and knowledgeable?

Index Data’s Integrated Discovery Model

We are often asked where we stand in the discussion of central indexing versus broadcast metasearching. Our standard answer, “You probably need some of both”, always calls for further explanation. Some time ago, I wrote this up for a potential business partner. If it sounds a little like a marketing spiel… guilty as charged. I hope the content will still seem interesting to some folks thinking about these issues. While our specific approach and technology may be ours alone, the technical issues described here are pretty universal.

It is our thesis that a discovery platform which bases itself wholly on either broadcast metasearching or a centralized harvest-and-index model is inherently limited, as compared to one which seamlessly integrates both. It is our goal to provide a versatile platform, capable of handling very large indexes as well as dynamically searching remote content. Our MasterKey platform is an expression of this idea.

At the core of our platform is Pazpar2, a highly optimized multi-protocol search engine, capable of searching large numbers of heterogeneous resources in parallel. It features a web services API, a data model-agnostic approach to incoming search results, and a cloud-friendly, dynamic configuration mechanism, designed to fit into a variety of application environments. Pazpar2 will search a number of resources concurrently, through multiple protocols, while normalizing and integrating results.

Through its SRU/Z39.50 capability, Pazpar2 can access standards-compliant library catalogs, commercial database providers, etc. Through gateway technologies, it can access a range of other resources: Our SimpleServer provides a simple platform on which to implement gateways to proprietary APIs; our Connector Platform enables access to searchable resources through web-based user interfaces. Facets are constructed on the fly through analysis of incoming results.
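Building facets on the fly amounts to counting field values across whatever records have arrived so far. The sketch below shows the idea in Python; it is not Pazpar2's implementation, and the record layout and field names (`author`, `subject`, `date`) are assumptions made for the example.

```python
from collections import Counter, defaultdict


def build_facets(records, fields=("author", "subject", "date")):
    """Count field values across the records received so far."""
    facets = defaultdict(Counter)
    for record in records:
        for field in fields:
            value = record.get(field)
            if value is None:
                continue
            # A field may hold one value or a list of values.
            values = value if isinstance(value, list) else [value]
            facets[field].update(values)
    return facets


incoming = [
    {"title": "A", "author": "Salton, G.", "subject": ["retrieval", "indexing"]},
    {"title": "B", "author": "Salton, G.", "subject": ["retrieval"], "date": "1975"},
    {"title": "C", "author": "Darwin, C.", "subject": ["biology"], "date": "1859"},
]
facets = build_facets(incoming)
top_subjects = facets["subject"].most_common(1)
```

Because the counts come only from records already received, the facet list keeps changing while slower targets are still responding.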

Through its integrated Solr client, Pazpar2 can access locally and remotely maintained indexes, taking direct advantage of the capabilities of the Solr search engine to produce high-quality ranked and faceted results with lightning-fast response times.

While accessing a locally operated Solr instance, configured to support an application-specific data model in tandem with Pazpar2, the data normalization step is eliminated, conserving CPU cycles. Facets can be derived directly from the Solr system, eliminating the need to analyze individual records. All local results can be integrated and merged before presentation to the user, providing a highly responsive experience. Data can be freely distributed across Solr instances, allowing for scalability management options in addition to those offered by Solr itself.

Through Pazpar2’s data normalization mechanism, support for remotely maintained Solr instances with arbitrary record models is available, provided that a normalization stylesheet for the data model is provided. This allows for a more sophisticated level of integration with the increasing number of Solr-based applications in the field, and increases the versatility of the platform.

Combining these elements, the user of a hybrid local/remote discovery platform will experience excellent response times on local content, and will see remote results added to the display as soon as they become available (how this is accomplished is a user interface decision).
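The hybrid behaviour can be sketched as a set of searches run in parallel, with the display updated each time one of them finishes. This is a toy model in Python rather than MasterKey code: both search functions are stand-ins, the remote delay is simulated with a sleep, and `on_update` is a hypothetical callback representing the user interface.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def search_local_index(query):
    # Stand-in for a local Solr index: answers immediately.
    return [{"title": f"local: {query}", "source": "local"}]


def search_remote_target(query):
    # Stand-in for a remote target: simulated network latency.
    time.sleep(0.3)
    return [{"title": f"remote: {query}", "source": "remote"}]


def hybrid_search(query, sources, on_update):
    """Run all sources in parallel; report the growing result set as each finishes."""
    merged = []
    with ThreadPoolExecutor(max_workers=len(sources)) as pool:
        futures = [pool.submit(source, query) for source in sources]
        for future in as_completed(futures):
            merged.extend(future.result())
            on_update(list(merged))  # pass a snapshot to the display layer
    return merged


snapshots = []
final = hybrid_search(
    "koha", [search_local_index, search_remote_target], snapshots.append
)
```

Here the first snapshot holds only the local hit and the second adds the remote one, which is the "local first, remote as it arrives" experience described above.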

In addition to the elements described above, the MasterKey platform includes a harvester to schedule and manage harvesting jobs for local indexing purposes (HTTP bulk file download and OAI-PMH supported at present) as well as a service-oriented architecture for managing searchable resources, users, and subscriptions in a hierarchical way designed to address consortial requirements. All components are modular, sharing a service-oriented architecture, and designed to fit into as many architectural and operational contexts as possible.

On preferring open-source software

I spent most of last week up in Edinburgh, for the Open Edge conference on open-source software in libraries, attended mostly by academic librarians and their technical people. It was an interesting time, and I met a lot of interesting people. At the risk of overusing the word “interesting”, it was also of interest to see how widespread the deployment of “next-generation OPACs” like VuFind and Blacklight has become. Since both of these use Solr as their back-end database, their progressive adoption in libraries has helped to drive that of Solr, which is part of why we recently added support for Solr to our own protocol toolkit.

One reason I went to Open Edge was to give my own talk on Metasearching and local indexes: making it Just Work when two worlds collide. I got to present how our thinking about resource discovery is evolving, and our growing conviction that it’s unnecessary to insist on either metasearching or harvesting data into a big central index and searching that. We can and should (and indeed we do, for some of our customers) do both, and integrate them seamlessly.

Open Edge also gave me a strong sense that the world is changing. As recently as two or three years ago, conferences about open-source software had the tenor of “Oh, please recognise that open-source is a valid model, don’t write us off as a bunch of hobbyists”. Now that war is largely won, and it’s universally recognised, at least by the kinds of people who come to Open Edge, that open-source software is a mainstream option rather than some kind of communist weirdo alternative.

Back in the Bad Old Days (by which I guess I mean 2008), meetings like Open Edge felt very slightly clandestine … as though we were meeting in secret, or at least off on our own somewhere, because the Big Boys – the proprietary vendors – wouldn’t let us join in any reindeer games. Open-sourcers were tolerated at big events like ALA, where we could get lost in the crowd, but if we wanted to give presentations and suchlike then matters might be different. Three years on, everything is very, very different.

So …

It was against that backdrop that I listened to Ross Gardler’s talk, the last one on the final day of the conference: an introduction to discussion on the subject “Steps to building capacity in the library community”. Ross is Service Manager for JISC’s OSS Watch, Vice-President of Community Development at the Apache Software Foundation, and the organiser of the TransferSummit conference – all in all, someone with fingers on a lot of pulses, and with as much idea as anyone of which way the wind is blowing. Every two years, OSS Watch takes a survey of attitudes to open-source software in UK Higher Education and Further Education. Ross presented some findings from the most recent, as yet unpublished, 2010 survey, and compared them with those in the 2008 survey.

(In the UK, Higher Education, or HE, means universities; and Further Education, or FE, means more vocational training past the usual school-leaving age.)

The most striking part of the survey for me was the bit about policies towards software procurement: the breakdown by percentage of how many HE and FE institutions have the following policies with respect to open-source software (OSS):

  • OSS is not mentioned
  • Policy of not using OSS
  • Explicitly considers OSS
  • OSS is the preferred option

There were two “Wait, what?” moments here for me. The first one was that there even is a “Policy of not using OSS”. But it’s there, and its percentages, though low, are non-zero. In 2006, 4% of HE and 2% of FE institutions surveyed said that they had a formal policy in place not to use open-source software.

Just think about that for a moment. Someone, somewhere, came up with the idea that if your vendor gives you free access to the source code and does not charge you a licence fee, then that makes it a worse deal. In fact, not just a worse deal, but so bad a deal that it won’t even be considered. You can just imagine open-source vendors/integrators talking to the administrators:

Vendor: … And we don’t charge a licence fee, and we’ll give you the source code in case you want to make local customisations.

Administrator: Certainly not! We reject your terrible offer.

Vendor: All right, then. We’ll charge you £50,000 per year, and we won’t give you the source code.

Administrator: You’ve got yourself a deal!

It hurts, doesn’t it?

So, anyway. That was the first of my two “Wait, what?” moments. The second was a comment that Ross himself made as he was discussing these figures. The flip side of the stats that I found so incredible is that in 2006, 7% of FE institutions said that “OSS is the preferred option”, and Ross’s comment was something along the lines that he thought that was just as wrong as the converse.

Well, usually when I hear something like this, I speak up immediately, especially in a session like this Open Edge one where we’d been explicitly invited to chip in. But instead I sat there, sort of paralysed, gazing into the middle distance. I struggled to get my brain back onto the tracks, having had it knocked sideways by such a tremendous whack of (let me be frank) irrationality.

The reality is, OSS is simply not all that controversial anymore. The Open Source Initiative has been up and running for 13 years (and the Free Software Foundation for more than a quarter of a century). OSS technologies are everywhere around us. The pros and cons of the different models are well understood. Mostly, for us, we have found that it allows us to communicate freely with a large community of developers while also working with an incredible range of commercial and public organizations. It would be nice to see people move past both the hype and the irrational fear, and recognize that this is an approach to collaboration that is here to stay, even if it isn’t right for everyone.

As my friend Matt Wedel likes to say, we’re all living in the future now. Let’s not pretend we’re still shackled by past practices. Onward!


Clustering Snippets With Carrot2

We’ve been investigating ways we might add result clustering to our metasearch tools. Here’s a short introduction to the topic and to an open source platform for experimenting in this area.


Using a search interface that just takes some keywords often leads to miscommunication. The computer has no sense of context and users may not realise their query is ambiguous. When I wanted to represent a matrix in LaTeX typesetting markup, I searched for the keywords ‘latex’ and ‘matrix’. At that moment I wasn’t thinking along the lines of “latex cat-suits, like in the Matrix movies”. But, since one of those movies was just recently out, the search engine included some results of that sort. The information it might use to determine relevance (word frequency, number and recency of links, what people ultimately click on when performing similar searches, etc.) would not be enough to choose which interpretation of my query terms I was interested in.

To aid the user in narrowing results to just those applicable to the context they’re thinking about, a good deal of work has been done in the area of “clustering” searches. The idea is to take the usual flat, one dimensional result list and instead split it into several groups of results corresponding to different categories. Mining data to find patterns has been studied for some time and current work on automatically grouping documents has roots in this research.

One common way to represent a document, both for searching and data mining, is the vector space model. Introduced by Gerard Salton, underappreciated pioneer of text search, the idea is to distil documents down to statistics about the words they contain. Particularly, it involves considering the terms as dimensions so we can apply some handy matrix math to them. Then a document becomes a vector in term-space with a magnitude corresponding to how relevant the document is considered to be to those terms. In other words, a sushi menu would be located further along in the ‘fish’ and ‘rice’ directions than in the ‘dinosaur’ direction.
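To make the sushi example concrete, here is a minimal Python sketch of the vector space idea. It uses raw word counts as the vector components and cosine similarity as the measure of closeness; both are common choices but are assumptions here, as are the toy texts and the function names.

```python
import math
from collections import Counter

def term_vector(text):
    """Count how often each lowercase word occurs in the text."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Toy documents, invented for illustration.
sushi_menu = term_vector("fish rice fish seaweed rice fish")
fish_query = term_vector("fish rice")
dino_query = term_vector("dinosaur")

print(cosine(sushi_menu, fish_query))  # close to 1: same direction
print(cosine(sushi_menu, dino_query))  # 0.0: no shared terms
```

Real systems usually weight the counts (for example by how rare a term is across the collection) rather than using them raw.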

This kind of bag-of-words model is very useful for separating documents into groups. Problem is, it’s not very clear what to call those groups. A cluster that was formed of documents with prominent features of ‘voltage’, ‘circuit’, ‘capacitor’ and ‘resistor’ is certainly forming a group around a specific topic, but none of the terms used to determine that make quite so good a label as the more general word ‘electronics’. There has consequently been a shift in document clustering research away from “data-centric” models that aim to make the clearest groupings to “description-centric” models that aim to create the best groupings that can be clearly labelled.

Another differentiator among clustering algorithms is when the clustering happens, before or after search. When processing search results the focus is on a postretrieval approach which lets the clustering algorithm work with the search algorithm rather than against it. Search algorithms determine which documents are relevant to the query. Since the goal of document clustering in this context is to help select among only relevant documents, it is best to find clusters within that set. If we instead consider all documents it may be that most of the relevant documents are in the same cluster and many clusters will not be relevant.

Similarly, we can leverage another part of the search system: snippet generation. Rather than just a list of documents, many search tools will include contextual excerpts containing the query terms. Here too we have a system that is winnowing the data set. Much as the retrieval algorithm limits clustering to only relevant documents, the snippets narrow each document down to its most relevant few phrases or sentences.


A postretrieval clustering system designed to work with short excerpts is precisely the sort that could be put to use in federated search, at least for those targets that return snippets. It just so happens that there is an open source one to play with! It’s called Carrot2 and is a Java framework for document clustering devised by Dawid Weiss and Stanislaw Osinski as an experimental platform for their clustering research and for promoting their proprietary algorithm. They provide implementations of two algorithms, which I’ll attempt to summarize below but which are best described in their paper A Survey of Web Clustering Engines and in more detail in some of their other research.


Suffix Tree Clustering (STC) is one of the first feasible snippet-based document clustering algorithms, proposed in 1998 by Zamir and Etzioni. It uses an interesting data structure, the generalised suffix tree, to efficiently build a list of the most frequently used phrases in the snippets from the search results. This tree is constructed from the snippets and their suffixes, that is, all the snippets, all the snippets without their first words, and every sequence that ends at the end of the snippet including the last words by themselves. From this list the documents are divided into groups based on which words they begin with, with no two groups starting with the same word and no groups with only one document. In turn, each of those groups is divided by the next words in sequence. Here’s a compact example:
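As a rough illustration in Python, the sketch below finds the phrases shared by several snippets. It does not build a real generalised suffix tree: it swaps in a naive enumeration of every word sequence inside each word-level suffix, which yields the same shared phrases on small inputs but without the tree's efficiency. The snippets, function names and the `min_docs` threshold are all invented for the example.

```python
from collections import defaultdict

def word_suffixes(snippet):
    """All word-level suffixes of a snippet, longest first."""
    words = snippet.lower().split()
    return [words[i:] for i in range(len(words))]

def shared_phrases(snippets, min_docs=2):
    """Map each phrase to the set of snippet indexes that contain it,
    keeping only phrases found in at least min_docs snippets."""
    docs_by_phrase = defaultdict(set)
    for doc_id, snippet in enumerate(snippets):
        for suffix in word_suffixes(snippet):
            # Every prefix of a suffix is a phrase occurring in the snippet.
            for end in range(1, len(suffix) + 1):
                docs_by_phrase[" ".join(suffix[:end])].add(doc_id)
    return {p: d for p, d in docs_by_phrase.items() if len(d) >= min_docs}

# Toy snippets, invented for illustration.
snippets = [
    "cat ate cheese",
    "mouse ate cheese too",
    "cat ate mouse too",
]
for phrase, docs in sorted(shared_phrases(snippets).items()):
    print(phrase, sorted(docs))
```

Each surviving phrase, together with its set of snippets, is a candidate for the base clusters described next.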

The groups generated this way form a list of phrases and associated number of snippets. Some of these phrases are selected to decide which will be the base clusters. They are ranked according to a set of heuristic rules. These rules take into account whether the phrases are present in too many or too few documents and how long the phrases are after discarding uninteresting “stop” words. Heuristics are also employed to merge clusters deemed too similar. The top few clusters are then presented to the users.

This approach is very efficient and the use of frequent phrases provides better quality labels than a purely term based technique.


The second algorithm, Lingo, was developed as the PhD thesis of Dawid Weiss, one of the partners in the Carrot project. Lingo constructs a “term-document matrix” where each snippet gets a column, each word a row and the values are the frequency of that word in that snippet. It then applies a matrix factorization called singular value decomposition or SVD. Part of the result of this is used to form collections of related terms. Specifically, they possess a latent semantic relationship. If you’re interested in how this works, the most approachable tutorial I’ve found on the topic presents it using golf scores. A more thorough treatment on how SVD is applied to text can be found here, along with some discussion on popular myths and misunderstandings of the approach.
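For readers who want to see the shape of that matrix, here is a small Python sketch of the first step only: building a term-document matrix from a few snippets. It stops short of the SVD, which in practice would be handed to a linear algebra library (NumPy's `numpy.linalg.svd` is one option). The snippets and the function name are made up for illustration and are not taken from Carrot2.

```python
def term_document_matrix(snippets):
    """Return (terms, matrix) where matrix[i][j] is how often
    terms[i] occurs in snippets[j]."""
    tokenized = [s.lower().split() for s in snippets]
    terms = sorted({word for words in tokenized for word in words})
    matrix = [[words.count(term) for words in tokenized] for term in terms]
    return terms, matrix

# Toy snippets, invented for illustration.
snippets = ["fish rice fish", "rice roll", "dinosaur bones"]
terms, matrix = term_document_matrix(snippets)

# One row per word, one column per snippet.
for term, row in zip(terms, matrix):
    print(f"{term:10} {row}")
```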

Grouping the words in the snippets by topic helps form clusters but does not suggest a label. To this end, Lingo uses a suffix-based approach like STC to find a list of frequently occurring phrases for use as candidate labels. Since both the collections of words and the labels are in the same set of snippets, they exist in the same term-document space. As in our earlier example of ‘fish’ being closer to ‘sushi’ than to ‘dinosaur’, there is a concept of distance here. The label closest to a collection is used. Collections without a label close enough and labels without collections are discarded. The documents with snippets matching these remaining labels then form the clusters.

Lingo improves over STC by selecting not only frequent phrases but ones that apply to groupings of documents known to be related in a way that takes into account more content than just that one phrase. STC avoids choosing phrases that are too general by discarding those that occur in too many documents, but it’s still quite possible to get very non-descriptive labels. The added precision Lingo brings with the SVD comes at a steep computational cost as matrix operations are much slower. The authors sell a proprietary algorithm, Lingo3G, that overcomes the performance limitations.

Other approaches

Cluster analysis can be applied to a wide range of fields and all sorts of non-textual data: pixels, genes, chemicals, criminals, anything with some qualities to group by. General techniques can usually only provide a data-driven approach to clustering text and, as discussed above, labeling is very important.

Often the best label for a cluster isn’t even in the text being clustered. Some research investigates pulling in other data to very good effect. One of the best examples of this is the idea of using Wikipedia titles. This works well because the titles of Wikipedia pages are not only relevant to the content contained therein but are in fact explicitly chosen as a label to represent that content. Search query logs are another potential source of cluster labels.

Clustering is still very much an open problem and an active research area. Even the best description-centric approaches are still no match for manually assigned subject categories.

SOLR support in ZOOM, Pazpar2, and MasterKey

We have always held that the schism between broadcast metasearching and local indexing is rather goofy: in practice, you do whatever it takes to get the results in front of your users when and where they need them, and the best solutions will allow for whatever approach is needed in the moment. Inspired by the increasing popularity of the SOLR/Lucene indexing server in the library world, we have just completed a project to add support for SOLR targets in the ZOOM API implementation in the YAZ library. So YAZ now supports Z39.50, SRU/SRW 1.x and the SOLR API.

Since SOLR uses HTTP as its transport protocol, just as SRU does, a SOLR search is triggered by setting the “sru” option to the value “solr”.

Using zoomsh, a SOLR target can be searched as follows:

ZOOM> set sru solr
ZOOM> connect http://localhost:8984/solr/select
ZOOM> find title:water
http://localhost:8984/solr/select: 498 hits
ZOOM> show 1 1
1 database=unknown syntax=XML schema=unknown
  <str name="author">Bamforth, Charles W.,</str>
  <str name="author-date">1952-</str>
ZOOM>
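For readers unfamiliar with SOLR's HTTP interface, the Python sketch below shows roughly what a request for such a search looks like on the wire. `q`, `start` and `rows` are standard SOLR query parameters, but the exact URL that YAZ generates is an assumption here, not something taken from its source, and `solr_select_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

def solr_select_url(base_url, query, start=0, rows=1):
    """Build a SOLR select URL for a query and a window of results.

    Hypothetical helper: it illustrates the kind of request a SOLR
    client sends, not the request YAZ actually constructs.
    """
    params = {"q": query, "start": start, "rows": rows}
    return base_url + "?" + urlencode(params)

url = solr_select_url("http://localhost:8984/solr/select", "title:water")
print(url)
```

The colon in the query is percent-encoded by `urlencode`, so the field-qualified search arrives intact at the server.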

This functionality has also been integrated in our open-source metasearch engine Pazpar2, and by extension into our commercial MasterKey suite. We have experimented with using a database schema in SOLR that directly matches Pazpar2’s internal format, so that practically no conversion is required when searching or retrieving. Performance is blazing.

Even without doing any tuning, SOLR is very fast as long as it is given enough memory. The 7 million records in the Library of Congress catalog run on a memory footprint of 2GB. It does seem to need a warm-up setting, though; otherwise the first search is very slow.


The ZOOM API (and implementation) has been extended to handle facet results generated by SOLR targets and by Z39.50 targets (via an extension of the Z39.50 protocol). The facet definition is a comma-separated list of attributes in PQF syntax:

ZOOM> set facets "@attr 1=title @attr 3=5, @attr 1=subject @attr 3=2"
ZOOM> connect http://localhost:8987/solr/select
ZOOM> find title:beer
http://localhost:8987/solr/select: 586 hits
ZOOM> facets
    De Beers(3)
    One Fierce Beer Coaster(3)
    Wilhelm Beer(3)
    August Beer(2)
    Beer pong(2)
    Beer – United States (1)
    Food adulteration and inspection(1)
ZOOM>

which specifies that we want to get up to 5 title facets and 2 subject facets on a SOLR index of wikipedia-abstracts. Pazpar2 still does its own facet count for targets that don’t return facets, but having the target return facets when possible makes them available immediately.
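To show how such a facet definition breaks down, here is a short Python sketch that parses it into field/limit pairs and expresses them as SOLR request parameters. Reading attribute type 1 as the field and type 3 as the limit follows the example above. `facet`, `facet.field` and `f.<field>.facet.limit` are standard SOLR parameters, but whether YAZ maps the definition in exactly this way is an assumption, and both function names are hypothetical.

```python
import re

def parse_facet_spec(spec):
    """Split a comma-separated PQF attribute list into (field, limit) pairs.

    Attribute type 1 is read as the facet field and type 3 as the
    maximum number of facet values, following the example above.
    """
    facets = []
    for part in spec.split(","):
        attrs = dict(re.findall(r"@attr\s+(\d+)=(\S+)", part))
        facets.append((attrs["1"], int(attrs["3"])))
    return facets

def solr_facet_params(facets):
    """Translate (field, limit) pairs into SOLR facet parameters.

    Assumed mapping for illustration; not taken from the YAZ source.
    """
    params = [("facet", "true")]
    for field, limit in facets:
        params.append(("facet.field", field))
        params.append((f"f.{field}.facet.limit", str(limit)))
    return params

spec = "@attr 1=title @attr 3=5, @attr 1=subject @attr 3=2"
facets = parse_facet_spec(spec)
print(facets)
print(solr_facet_params(facets))
```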

This means that SOLR-based databases, when accessed from Pazpar2 or a MasterKey application, feel just as fast as they do in applications based directly on SOLR. In other words, we can freely mix and match locally indexed resources and remote databases, without compromising on either type – a huge leap forward.

Of course, the extension to our ZOOM API means that anyone who uses our YAZ-based ZOOM implementation can cross-search SOLR, Z39.50, and SRU targets. Magically, this extends to ZOOM implementations in other languages that base themselves on our code (including APIs in Perl, PHP, C++, Ruby, Java, Visual Basic, Tcl, and, astonishingly, Squeal). Depending on how those APIs represent option values, changes may or may not be necessary to their implementation, but a small extension is necessary to express facet values. Time will tell whether the ZOOM maintainers decide to canonize our particular approach to handling facets.

A release of YAZ and Pazpar2 with these features is imminent.