9. Relevance Ranking and Sorting of Result Sets

9.1. Overview

The default ordering of a result set is left up to the server, which inside Zebra means sorting in ascending document ID order. This is not always the order humans want to browse the sometimes quite large hit sets. Ranking and sorting comes to the rescue.

In cases where a good presentation ordering can be computed at indexing time, we can use a fixed static ranking scheme, which is provided for the alvis indexing filter. This defines a fixed ordering of hit lists, independently of the query issued.

There are cases, however, where relevance of hit set documents is highly dependent on the query processed. Simply put, dynamic relevance ranking sorts a set of retrieved records such that those most likely to be relevant to your request are retrieved first. Internally, Zebra retrieves all documents that satisfy your query, and re-orders the hit list to arrange them based on a measurement of similarity between your query and the content of each record.

Finally, there are situations where hit sets of documents should be sorted during query time according to the lexicographical ordering of certain sort indexes created at indexing time.

9.2. Static Ranking

Zebra uses internally inverted indexes to look up term frequencies in documents. Multiple queries from different indexes can be combined by the binary boolean operations AND, OR and/or NOT (which is in fact a binary AND NOT operation). To ensure fast query execution speed, all indexes have to be sorted in the same order.

The indexes are normally sorted according to document ID in ascending order, and any query which does not invoke a special re-ranking function will therefore retrieve the result set in document ID order.

If one defines the

      staticrank: 1
     

directive in the main core Zebra configuration file, the internal document keys used for ordering are augmented by a preceding integer, which contains the static rank of a given document, and the index lists are ordered first by ascending static rank, then by ascending document ID. Zero is the ``best'' rank, as it occurs at the beginning of the list; higher numbers represent worse scores.

The experimental alvis filter provides a directive to fetch static rank information out of the indexed XML records, thus making all hit sets ordered after ascending static rank, and for those doc's which have the same static rank, ordered after ascending doc ID. See Chapter 8, ALVIS XML Record Model and Filter Module for the gory details.

9.3. Dynamic Ranking

In order to fiddle with the static rank order, it is necessary to invoke additional re-ranking/re-ordering using dynamic ranking or score functions. These functions return positive integer scores, where highest score is ``best''; hit sets are sorted according to descending scores (in contrary to the index lists which are sorted according to ascending rank number and document ID).

Dynamic ranking is enabled by a directive like one of the following in the zebra configuration file (use only one of these a time!):

      rank: rank-1        # default TDF-IDF like
      rank: rank-static   # dummy do-nothing
     

Dynamic ranking is done at query time rather than indexing time (this is why we call it ``dynamic ranking'' in the first place ...) It is invoked by adding the BIB-1 relation attribute with value ``relevance'' to the PQF query (that is, @attr 2=102, see also The BIB-1 Attribute Set Semantics, also in HTML). To find all articles with the word Eoraptor in the title, and present them relevance ranked, issue the PQF query:

      @attr 2=102 @attr 1=4 Eoraptor
     

9.3.1. Dynamically ranking using PQF queries with the 'rank-1' algorithm

The default rank-1 ranking module implements a TF/IDF (Term Frequecy over Inverse Document Frequency) like algorithm. In contrast to the usual definition of TF/IDF algorithms, which only considers searching in one full-text index, this one works on multiple indexes at the same time. More precisely, Zebra does boolean queries and searches in specific addressed indexes (there are inverted indexes pointing from terms in the dictionary to documents and term positions inside documents). It works like this:

Query Components

First, the boolean query is dismantled into its principal components, i.e. atomic queries where one term is looked up in one index. For example, the query

	   @attr 2=102 @and @attr 1=1010 Utah @attr 1=1018 Springer
	  

is a boolean AND between the atomic parts

	   @attr 2=102 @attr 1=1010 Utah
	  

and

	   @attr 2=102 @attr 1=1018 Springer
	  

which gets processed each for itself.

Atomic hit lists

Second, for each atomic query, the hit list of documents is computed.

In this example, two hit lists for each index @attr 1=1010 and @attr 1=1018 are computed.

Atomic scores

Third, each document in the hit list is assigned a score (_if_ ranking is enabled and requested in the query) using a TF/IDF scheme.

In this example, both atomic parts of the query assign the magic @attr 2=102 relevance attribute, and are to be used in the relevance ranking functions.

It is possible to apply dynamic ranking on only parts of the PQF query:

	   @and @attr 2=102 @attr 1=1010 Utah @attr 1=1018 Springer
	  

searches for all documents which have the term 'Utah' on the body of text, and which have the term 'Springer' in the publisher field, and sort them in the order of the relevance ranking made on the body-of-text index only.

Hit list merging

Fourth, the atomic hit lists are merged according to the boolean conditions to a final hit list of documents to be returned.

This step is always performed, independently of the fact that dynamic ranking is enabled or not.

Document score computation

Fifth, the total score of a document is computed as a linear combination of the atomic scores of the atomic hit lists

Ranking weights may be used to pass a value to a ranking algorithm, using the non-standard BIB-1 attribute type 9. This allows one branch of a query to use one value while another branch uses a different one. For example, we can search for utah in the @attr 1=4 index with weight 30, as well as in the @attr 1=1010 index with weight 20:

	   @attr 2=102 @or @attr 9=30 @attr 1=4 utah @attr 9=20 @attr 1=1010 city
	  

The default weight is sqrt(1000) ~ 34 , as the Z39.50 standard prescribes that the top score is 1000 and the bottom score is 0, encoded in integers.

Warning

The ranking-weight feature is experimental. It may change in future releases of zebra.

Re-sorting of hit list

Finally, the final hit list is re-ordered according to scores.

The rank-1 algorithm does not use the static rank information in the list keys, and will produce the same ordering with or without static ranking enabled.

Warning

Dynamic ranking is not compatible with estimated hit sizes, as all documents in a hit set must be accessed to compute the correct placing in a ranking sorted list. Therefore the use attribute setting @attr 2=102 clashes with @attr 9=integer.

9.3.2. Dynamically ranking CQL queries

Dynamic ranking can be enabled during sever side CQL query expansion by adding @attr 2=102 chunks to the CQL config file. For example

       relationModifier.relevant		= 2=102
      

invokes dynamic ranking each time a CQL query of the form

       Z> querytype cql
       Z> f alvis.text =/relevant house
      

is issued. Dynamic ranking can also be automatically used on specific CQL indexes by (for example) setting

       index.alvis.text                        = 1=text 2=102
      

which then invokes dynamic ranking each time a CQL query of the form

       Z> querytype cql
       Z> f alvis.text = house
      

is issued.

9.4. Sorting

Zebra sorts efficiently using special sorting indexes (type=s; so each sortable index must be known at indexing time, specified in the configuration of record indexing. For example, to enable sorting according to the BIB-1 Date/time-added-to-db field, one could add the line

      xelm /*/@created               Date/time-added-to-db:s
     

to any .abs record-indexing configuration file. Similarly, one could add an indexing element of the form

      <z:index name="date-modified" type="s">
      <xsl:value-of select="some/xpath"/>
     </z:index>
      

to any alvis-filter indexing stylesheet.

Indexing can be specified at searching time using a query term carrying the non-standard BIB-1 attribute-type 7. This removes the need to send a Z39.50 Sort Request separately, and can dramatically improve latency when the client and server are on separate networks. The sorting part of the query is separate from the rest of the query - the actual search specification - and must be combined with it using OR.

A sorting subquery needs two attributes: an index (such as a BIB-1 type-1 attribute) specifying which index to sort on, and a type-7 attribute whose value is be 1 for ascending sorting, or 2 for descending. The term associated with the sorting attribute is the priority of the sort key, where 0 specifies the primary sort key, 1 the secondary sort key, and so on.

For example, a search for water, sort by title (ascending), is expressed by the PQF query

      @or @attr 1=1016 water @attr 7=1 @attr 1=4 0
     

whereas a search for water, sort by title ascending, then date descending would be

      @or @or @attr 1=1016 water @attr 7=1 @attr 1=4 0 @attr 7=2 @attr 1=30 1
     

Notice the fundamental differences between dynamic ranking and sorting: there can be only one ranking function defined and configured; but multiple sorting indexes can be specified dynamically at search time. Ranking does not need to use specific indexes, so dynamic ranking can be enabled and disabled without re-indexing; whereas, sorting indexes need to be defined before indexing.