pazpar2: New Features and Challenges

It has been quiet on the pazpar2 front lately, but I will make amends with this blog post.

Lately I have been working on a Harvester that would harvest into a Local Unified Index (LUI). The LUI has been implemented with Solr.

This means we can implement Integrated Search, which is our name for searching both remote targets (meta-searching) and a Local Unified Index (LUI), aka Central Index.

New Features

With this we saw a need to move the faceting logic from our UI into the backend. The old UI would combine the user-entered search with the facet value by ANDing them: a search for “beer” faceted with the subject “Brewing” would become “beer and su=Brewing”. There are two problems with this. First, the CCL mapping for su is probably set up to match an incomplete subfield (6=1), but when faceting we are interested in an exact match on the complete field (6=3). This can be fixed by introducing a new CCL term, su_exact, that defines this, and using it for faceting. However, this does not solve the second problem: since Solr supports native faceting through a special API, we need to be able to distinguish between user searches and facet searches.

This made us introduce a new parameter to the search command in 1.6.0: limit. This carries the facet search parameters, so we can support both intersection of multiple facets and union for multiple values of the same facet, though our UI does not support the latter.
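As a rough sketch of how a client might assemble such a request, the snippet below builds a limit value from the selected facets. It assumes the syntax is a comma-separated list of name=value pairs (intersection), with | separating alternative values of the same field (union) and backslash as the escape character; check the protocol documentation for your pazpar2 version before relying on that. The helper names are made up for illustration, and the session parameter a real request needs is left out.

```python
from urllib.parse import urlencode


def escape_limit_value(value):
    # Assumed escaping rule: backslash, comma and vertical bar
    # are each prefixed with a backslash. Backslash goes first
    # so that we do not escape our own escapes.
    for ch in ("\\", ",", "|"):
        value = value.replace(ch, "\\" + ch)
    return value


def build_limit(facets):
    """Build a limit value from {metadata field: [selected values]}."""
    parts = []
    for field, values in facets.items():
        # Several values for one field are ORed (union).
        joined = "|".join(escape_limit_value(v) for v in values)
        parts.append(f"{field}={joined}")
    # Several fields are ANDed (intersection).
    return ",".join(parts)


def search_params(query, facets):
    """Keep the user query and the facet selection in separate parameters."""
    params = {"command": "search", "query": query}
    if facets:
        params["limit"] = build_limit(facets)
    return urlencode(params)


print(search_params("beer", {"subject": ["Brewing"]}))
```

The point of the design is visible in the output: the user query stays “beer”, and the facet selection travels separately instead of being ANDed into the query string.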

However, the limit parameter requires that limitmaps have been configured. Much like the cclmap definitions (au, su), these tell pazpar2 how to do a faceted search. Unlike cclmaps, however, they use the pazpar2 metadata field names (author, subject), not CCL names.

Limitmaps support three formats:

  • rpn:
  • ccl:
  • local:

rpn: uses RPN syntax to define the search attributes. limitmap_author could be “rpn: @attr 1=1003 @attr 6=3” (use field 1003, complete field).

ccl: lets you point to an existing cclmap definition. If you have already defined a cclmap_auexact, you can point to this definition by using “ccl: auexact”.

local: means that we are not capable of doing any search for this facet, so pazpar2 “just” does a record filter on the facet.
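To make the three formats concrete, here is a small sketch that splits a limitmap value into its kind and its argument. It is not how pazpar2 parses the setting internally; parse_limitmap is a made-up helper that only mirrors the description above.

```python
def parse_limitmap(value):
    """Split a limitmap value into (kind, argument).

    kind is one of "rpn", "ccl" or "local"; argument is whatever
    follows the colon (empty for a bare "local:").
    """
    kind, sep, rest = value.partition(":")
    kind = kind.strip().lower()
    if not sep or kind not in ("rpn", "ccl", "local"):
        raise ValueError(f"unsupported limitmap value: {value!r}")
    return kind, rest.strip()


for value in ("rpn: @attr 1=1003 @attr 6=3", "ccl: auexact", "local:"):
    kind, arg = parse_limitmap(value)
    # rpn and ccl describe a search sent to the target;
    # local means filtering the records already retrieved.
    action = "filter records locally" if kind == "local" else "search the target"
    print(f"{kind:5} {arg!r:30} -> {action}")
```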

Together with a rewrite of the search logic, there is now a way to do facet filtering without redoing the search. Pazpar2 will now compare the previous search with the new search, and if they match, pazpar2 will not do a new search when adding a limit on a field that has a local: definition. This has been requested on the yaz list, and it is now possible! Previously you would have had to re-configure the target with a record filter, and pazpar2 would still re-search.
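The decision can be sketched roughly as follows. This is only an illustration of the rule described above for a single target, not pazpar2's actual code: the function names are invented, a field without a limitmap is assumed to force a new search, and removing a local limit is assumed to behave like adding one.

```python
def is_local(limitmap_value):
    """True if the limitmap says 'filter records locally'."""
    return limitmap_value.strip().lower().startswith("local:")


def needs_new_search(prev_query, new_query, prev_limits, new_limits, limitmaps):
    """Decide whether a target must be searched again.

    prev_limits / new_limits map metadata field -> list of values.
    limitmaps maps metadata field -> limitmap value for this target.
    """
    if prev_query != new_query:
        return True  # the user search itself changed
    fields = set(prev_limits) | set(new_limits)
    changed = [f for f in fields if prev_limits.get(f) != new_limits.get(f)]
    # Only limits that the target has to evaluate trigger a new search.
    # Assumption: a missing limitmap is treated as "not local".
    return any(not is_local(limitmaps.get(f, "")) for f in changed)


limitmaps = {"author": "rpn: @attr 1=1003 @attr 6=3", "subject": "local:"}
print(needs_new_search("beer", "beer", {}, {"subject": ["Brewing"]}, limitmaps))
print(needs_new_search("beer", "beer", {}, {"author": ["Smith"]}, limitmaps))
```

The first call prints False (a local filter is enough), the second prints True (the author limit has to be sent to the target).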

But we do want to re-search targets with native facet support. In order to actually request and get native facets, we need to define another mapping, facetmap_metadata-field. So in order to request and get Solr facets for author, we would configure a facetmap_author with the value “author_exact”. Solr requires different fields for searching (tokenized) and faceting (non-tokenized, exact), so author_exact should be configured as a string or keyword type, with the value from author copied into it.

The Challenges

The challenges arise when we want to use the native facets that Solr provides, combined with targets that don’t provide any. The implementation of the Z39.50 protocol in yaz has been extended to have both a facet request and a facet response, but I am not aware of any server implementations that use it.

The facet counts generated by pazpar2 are based on a small sample of records, typically 100, whereas the native facet counts from Solr cover the full result set. This means that native facets will typically dominate the facet lists. So pazpar2 has a feature to approximate the facet count by scaling it up based on the total hit count. If the subject “Brewing” is found 8 times in the 100 sampled records and the total hit count is 1200, pazpar2 can be configured to approximate the facet count as 8/100 * 1200 = 96.
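The arithmetic is simple enough to write down. The sketch below reproduces the example above; the function name, the rounding, and the handling of small result sets are assumptions and not necessarily what pazpar2 does.

```python
def approximate_facet_count(sample_count, sample_size, total_hits):
    """Scale a facet count from a sample up to the full result set."""
    if sample_size <= 0:
        return 0
    if total_hits <= sample_size:
        # The sample already covers every hit, so the count is exact.
        return sample_count
    return round(sample_count * total_hits / sample_size)


# 8 "Brewing" records among 100 sampled, 1200 hits in total
print(approximate_facet_count(8, 100, 1200))  # 96
```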

Related to meta-searching without facet support, we have a similar challenge. Targets will be filtered for hits, but they will still show up with their full hit counts. The extreme case is of course a target where no records match the filter: it will still show up, and faceting on that target ends up with zero results. Ouch. We need to implement an approximation here as well.

All in all, this may lead us to extend the returned results with approximate values for such things. Then it would be up to clients to decide what to use.

I think what would be valuable to implement in pazpar2 is a combined record filter and limitmap search. We may not be able to do an exact facet filtering on a target, but we might be able to do a search that filters some results out while still being a bit too wide. Applying the record filter to that result would then remove the unwanted records. This is still possible when using a record filter setting, but it is currently not possible with the limit parameter.
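A sketch of the idea, under stated assumptions: the target is searched with a condition that is too wide (for example, an incomplete-subfield match on the subject), and the records that come back are then filtered on the exact facet value. The record layout (a dict of metadata field to list of values) and the case-insensitive comparison are invented for the example.

```python
def filter_records(records, field, wanted):
    """Keep only records whose metadata field contains the exact value."""
    wanted_norm = wanted.strip().lower()
    return [
        rec for rec in records
        if any(v.strip().lower() == wanted_norm for v in rec.get(field, []))
    ]


# Pretend these came back from a too-wide search on the subject "Brewing".
too_wide = [
    {"title": "Home brewing basics", "subject": ["Brewing"]},
    {"title": "The brewing industry in Denmark", "subject": ["Brewing industry"]},
    {"title": "Yeast handbook", "subject": ["Yeast", "Brewing"]},
]

for rec in filter_records(too_wide, "subject", "Brewing"):
    print(rec["title"])
```

The search narrows what the target has to return, and the filter removes what the search could not exclude.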

Performance Challenges

LUI (using Solr) is quite fast, but unifying a huge number of databases into one also means it will get hammered by requests from pazpar2. An easy optimization is to fetch 100 records in one request instead of 20 records five times; at the moment, pazpar2 does not support a presentChunk size larger than 20. A bigger optimization would be to implement a LUI protocol in pazpar2, which would mean we could cover all targets in one request, knowing that it spans multiple targets, and separate the results per target before ingesting them into the session.

Actually, pazpar2 / yaz still don’t do the special treatment of facet searches for Solr; since we are seeing quite good performance with faceted searches, we haven’t seen the need to do so yet. It will also require a ZOOM API extension.

Another performance boost could come from finishing up our threading of pazpar2.

If you want to test out the LUI combined with some meta-searched targets, go to http://mk2.indexdata.com and log in as blogguest using the password pazpar2.

I would love to get feedback on the outstanding challenges and what you think is most valuable to implement in pazpar2.

2 Comments

Very good and interesting

Very good and interesting news!

So far I found myself doing faceting client-side, so I could have multiple facets and all. This sounds like it may open the way to doing things server-side again. (Now I'll just have to find time to test/figure that out).

I really like the idea of extending the searches on the server when facets are used. However, for some fields which require heavy normalisation (e.g. language, media type), it is unclear to me how to achieve that as pazpar2 does not »know« the original term (e.g. language code) to search for after fields have been normalised.

You touch the UI issues that arise with this approach. Have you seen how people react to the greater unclarity of the less reliable hit counts?

(you should really get rid of that captcha, it's causing considerable grief on my end)

Hit counts

The latest Pazpar2 will try to do some basic scaling of facets based on the 'sample' received. You're right it's a bit of a usability issue. Many don't care, but some get very upset when the numbers don't match exactly. One solution we've played with (you can see it at http://mk2.indexdata.com/) is to simply not show numbers, but to use tag-cloud-esque font scaling to indicate relative weight. Simply listing the facets without associated hitcounts can work too.

Ultimately, though, the idea is to leave these choices up to the UI designer as much as possible.