Chapter 10. Field Structure and Character Sets

Table of Contents

1. The default.idx file
2. Charmap Files
3. ICU Chain Files

In order to provide a flexible approach to national character set handling, Zebra allows the administrator to configure the system to handle any 8-bit character set, including sets that require multi-octet diacritics or other multi-octet characters. The definition of a character set includes a specification of the permissible values, their sort order (this affects the display in the SCAN function), and the relationships between upper- and lowercase characters. Finally, the definition includes the specification of space characters for the set.

The operator can define different character sets for different fields, typical examples being standard text fields, numerical fields, and special-purpose fields such as WWW-style linkages (URx).

Zebra 1.3 and Zebra versions 2.0.18 and earlier required the field type to be a single character, e.g. w (for word) or p (for phrase). Zebra 2.0.20 and later allow field types to be any string. This allows for greater flexibility; in particular, per-locale (language) fields can be defined.

Version 2.0.20 of Zebra can also be configured, per field, to use the ICU library to perform tokenization and normalization of strings. This is an alternative to the "charmap" files, which have been part of Zebra since its first release.

1. The default.idx file

The field types, and hence character sets, are associated with data elements by the indexing rules (for example, title:w) in the various filters. Fields are defined in a field definition file which, by default, is called default.idx. This file provides the association between field type codes and the character map files (with the .chr suffix). The format of the .idx file is as follows:

index field type code

This directive introduces a new search index code. The argument is the field type code used in the .abs files to select this particular index type: a single character in Zebra 2.0.18 and earlier, or any string in Zebra 2.0.20 and later. An index corresponds, roughly, to a particular structure attribute during search. Refer to the section called “Z39.50 Search”.

sort field type code

This directive introduces a sort index. The argument is a one-character code to be used in the .abs file to select this particular index type. The corresponding use attribute must be used in the sort request to refer to this particular sort index. The corresponding character map (see below) is used in the sort process.

completeness boolean

This directive enables or disables complete field indexing. The value of the boolean should be 0 (disable) or 1. If completeness is enabled, the index entry will contain the complete contents of the field (up to a limit), with words (non-space characters) separated by single space characters (normalized to " " on display). When completeness is disabled, each word is indexed as a separate entry. Complete subfield indexing is most useful for fields which are typically browsed (e.g., titles, authors, or subjects), or instances where a match on a complete subfield is essential (e.g., exact title searching). For fields where completeness is disabled, the search engine will interpret a search containing space characters as a word proximity search.
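The difference between the two completeness settings can be illustrated with a small Python sketch. This is a simplified model and not Zebra's implementation: the function name index_entries is hypothetical, words are split on plain whitespace rather than on the space characters defined by a character map, and both the length limit and any character normalization are ignored.

```python
def index_entries(field_text: str, completeness: bool) -> list[str]:
    """Return the index entries a field would produce in this simplified model.

    Hypothetical helper for illustration only; it ignores character maps
    and the length limit on complete-field entries.
    """
    words = field_text.split()  # split on runs of whitespace
    if not words:
        return []
    if completeness:
        # completeness 1: one entry holding the whole field,
        # with words separated by single space characters
        return [" ".join(words)]
    # completeness 0: each word becomes a separate entry
    return words


title = "The  Art of   Computer Programming"
print(index_entries(title, completeness=True))
print(index_entries(title, completeness=False))
```

With completeness enabled the sketch yields a single entry for the whole title; with it disabled, one entry per word.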

firstinfield boolean

This directive enables or disables first-in-field indexing. The value of the boolean should be 0 (disable) or 1.

alwaysmatches boolean

This directive enables or disables alwaysmatches indexing. The value of the boolean should be 0 (disable) or 1.

charmap filename

This is the filename of the character map to be used for this field type. See Section 2, “Charmap Files” for details.

icuchain filename

Specifies the filename with ICU tokenization and normalization rules. See Section 3, “ICU Chain Files” for details. Using icuchain for a field type is an alternative to charmap. It does not make sense to define both icuchain and charmap for the same field type.

Example 10.1. Field types

Following are three excerpts of the standard tab/default.idx configuration file. Notice that index and sort are grouping directives: every directive that follows one of them is bound to it, up to the next index or sort directive.

     # Traditional word index
     # Used if completeness is 'incomplete field' (@attr 6=1) and
     # structure is word/phrase/word-list/free-form-text/document-text
     index w
     completeness 0
     position 1
     alwaysmatches 1
     firstinfield 1
     charmap string.chr

     ...

     # Null map index (no mapping at all)
     # Used if structure=key (@attr 4=3)
     index 0
     completeness 0
     position 1
     charmap @

     ...

     # Sort register
     sort s
     completeness 1
     charmap string.chr
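The grouping behaviour of index and sort can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not Zebra's own parser: the function parse_idx and the SAMPLE text are hypothetical, lines starting with # are treated as comments (as in the excerpts above), and every other directive is stored as a plain string under the most recent index or sort line.

```python
def parse_idx(text: str) -> dict[tuple[str, str], dict[str, str]]:
    """Group .idx directives under the preceding 'index' or 'sort' line.

    Hypothetical helper for illustration; keys are (kind, code) pairs
    such as ("index", "w") or ("sort", "s").
    """
    groups: dict[tuple[str, str], dict[str, str]] = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        parts = line.split(None, 1)
        directive = parts[0]
        value = parts[1] if len(parts) > 1 else ""
        if directive in ("index", "sort"):
            # a grouping directive starts a new group
            current = {}
            groups[(directive, value)] = current
        elif current is not None:
            # any other directive binds to the most recent group
            current[directive] = value
    return groups


SAMPLE = """\
# Traditional word index
index w
completeness 0
position 1
charmap string.chr

# Sort register
sort s
completeness 1
charmap string.chr
"""

groups = parse_idx(SAMPLE)
print(groups[("index", "w")])
print(groups[("sort", "s")])
```

Keying the groups by both kind and code keeps a search index and a sort index apart even if they were to share the same code.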