7. Relevance ranking

Pazpar2 uses a variant of the term frequency–inverse document frequency (Tf-idf) ranking algorithm.

The Tf-part is straightforward to calculate and is based on the documents that Pazpar2 fetches. The idf-part, however, is trickier, since the corpus at hand contains ONLY the relevant documents and none of the irrelevant ones. Pazpar2 does not have the full corpus, only the documents that match a particular search.

Computation of the Tf-part is based on the normalized documents. The length, the position, and the terms are thus normalized at this point. The computation is also performed for each document received from the target, before merging takes place. The result of a Tf-computation is added to the Tf-total of a cluster. Thus, if a document occurs twice, the Tf-part is doubled. That, however, can be adjusted, because the Tf-part may be divided by the number of documents in a cluster.
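As a small illustration of the cluster arithmetic just described, the following Python sketch sums per-record Tf values for a single term and optionally divides the total by the cluster size. The function name `cluster_tf` and the sample values are hypothetical and chosen only for this example; this is not Pazpar2 code.

```python
def cluster_tf(per_record_tf, divide_by_cluster_size):
    """Sum the Tf values of one term over all records in a cluster.

    per_record_tf: one Tf value per record merged into the cluster.
    divide_by_cluster_size: if True, divide the total by the record count.
    """
    total = sum(per_record_tf)
    if divide_by_cluster_size:
        total /= len(per_record_tf)
    return total


# The same record seen once, seen twice, and seen twice with division.
single = cluster_tf([0.25], divide_by_cluster_size=False)
duplicate = cluster_tf([0.25, 0.25], divide_by_cluster_size=False)
averaged = cluster_tf([0.25, 0.25], divide_by_cluster_size=True)
```

Without the division, the duplicated record scores twice as high as the single one; with it, both end up with the same Tf-part.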

The algorithm used by Pazpar2 has two phases. In phase one, Pazpar2 computes a tf-array. This is done as records are fetched from the database. This phase uses the rank weight w and the rank tweaks lead, follow and length.

    tf[1,2,..N] = 0;
    foreach document in a cluster
       foreach field
          w[1,2,..N] = 0;
          for i = 1, .. N:  (each term)
             foreach pos (where term i occurs in field)
                // w is configured weight for field
                // pos is position of term in field
                w[i] += w / (1 + log2(1+lead*pos))
                if (d > 0)
                    w[i] += w[i] * follow / (1+log2(d))
          // length: length of field (number of terms that is)
          for i = 1, .. N:  (each term)
             if (length strategy is "linear")
                tf[i] += w[i] / length;
             else if (length strategy is "log")
                tf[i] += w[i] / log2(length);
             else if (length strategy is "none")
                tf[i] += w[i];
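The phase-one pseudocode can be sketched in Python roughly as follows. This is an illustration under stated assumptions, not the Pazpar2 implementation: the function names are hypothetical, positions are assumed to be 1-based, and `d` (which the pseudocode leaves undefined) is assumed to be the distance from the most recent occurrence of the preceding query term. The guard against a zero divisor in the "log" strategy is also an addition of this sketch.

```python
import math


def field_term_weights(tokens, query_terms, weight, lead=0.0, follow=0.0):
    """Return w[i] for each query term over one normalized field."""
    index = {term: i for i, term in enumerate(query_terms)}
    w = [0.0] * len(query_terms)
    last_pos = {}  # query term index -> most recent position in the field
    for pos, token in enumerate(tokens, start=1):  # assumption: 1-based
        if token not in index:
            continue
        i = index[token]
        w[i] += weight / (1 + math.log2(1 + lead * pos))
        # Assumption: d is the distance to the preceding query term.
        if i > 0 and (i - 1) in last_pos:
            d = pos - last_pos[i - 1]
            if d > 0:
                w[i] += w[i] * follow / (1 + math.log2(d))
        last_pos[i] = pos
    return w


def add_field_to_tf(tf, w, length, strategy="linear"):
    """Add the length-normalized field weights to the tf-array in place."""
    for i, w_i in enumerate(w):
        if strategy == "linear":
            tf[i] += w_i / length
        elif strategy == "log":
            # Guard added by this sketch: log2(1) is 0.
            divisor = math.log2(length) if length > 1 else 1.0
            tf[i] += w_i / divisor
        elif strategy == "none":
            tf[i] += w_i


# One field of five normalized terms, queried for "water quality".
tokens = ["water", "supply", "and", "water", "quality"]
terms = ["water", "quality"]
tf = [0.0] * len(terms)
w = field_term_weights(tokens, terms, weight=2.0)
add_field_to_tf(tf, w, length=len(tokens), strategy="linear")
```

With lead and follow both left at 0, every occurrence simply contributes the field weight, so only the length strategy affects the result.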
	  

In phase two, the idf-array and the final score are computed. This is done for each cluster, as part of each show command. The rank tweak cluster is used here.

    // dococcur[i]: number of records where term occurs
    // doctotal: number of records
    for i = 1, .., N (each term)
      if (dococcur[i] > 0)
         idf[i] = log(1 + doctotal / dococcur[i])
      else
         idf[i] = 0;

    relevance = 0;
    for i = 1, .., N: (each term)
       if (cluster is "yes")
          tf[i] = tf[i] / cluster_size;
       relevance += 100000 * tf[i] * idf[i];
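Phase two can be sketched in Python in the same spirit. The sketch multiplies each term's tf by its idf, as Tf-idf scoring does, so a term that occurs in no record (idf of 0) contributes nothing. The function names are hypothetical, and `log` is assumed to mean the natural logarithm.

```python
import math


def idf_weights(doc_occur, doc_total):
    """idf per term; doc_occur[i] is the number of records containing term i."""
    return [
        math.log(1 + doc_total / occur) if occur > 0 else 0.0
        for occur in doc_occur
    ]


def relevance(tf, idf, cluster_size, cluster=True):
    """Combine the tf- and idf-arrays of one cluster into a single score."""
    score = 0.0
    for tf_i, idf_i in zip(tf, idf):
        if cluster:
            # Divide by cluster size so duplicates do not inflate the score.
            tf_i = tf_i / cluster_size
        score += 100000 * tf_i * idf_i
    return score


# Ten records in the result set; the third term matches none of them.
idf = idf_weights([10, 2, 0], doc_total=10)
averaged = relevance([0.8, 0.4, 0.0], idf, cluster_size=2, cluster=True)
summed = relevance([0.8, 0.4, 0.0], idf, cluster_size=2, cluster=False)
```

Because idf is computed over the fetched result set only, a term that appears in every record still gets a non-zero idf of log(2), while rarer terms get larger values.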
       

For controlling the ranking parameters, refer to the rank element of the service definition. Refer to the rank attribute of the metadata element for how to control ranking for individual metadata fields.