Parse by XPattern

XPattern, a hybrid between XPath and regular expressions, is a language created by Index Data specifically for identifying and addressing parts of an HTML document.

The “Parse by XPattern” step breaks the HTML of a result set into recognizable descriptive and bibliographic fields for display, manipulation and mapping purposes, and is most frequently the first step in the “Parse” task of a connector.

The Builder enables automated creation of XPatterns via a built-in pattern designer. To begin, click “Design XPattern”, then click in turn on the relevant sections of the result set returned by the connector’s “Search” task. For each section selected, pick the appropriate data type (author, title, etc.) from the corresponding drop-down, and indicate whether the field is required or optional. The designer updates the pattern and highlights hits on the page as you work.

Selected sections of a webpage may contain important related attributes. For example, many “title” fields have an href attribute that can be mapped to URL, so that programs interpreting the connector can provide links to the resources.
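As a rough illustration of what this kind of parsing produces, the sketch below pulls a title, its href attribute (mapped to a url field) and an optional author out of a small result list. It is plain Python using the standard html.parser module, not XPattern syntax, and it does not reflect how the Builder works internally. The sample markup, the class names (hit, title, author), and the names HitParser and parse_hits are all assumptions made up for this example.

```python
from html.parser import HTMLParser

# Hypothetical result-list markup; real result pages will look different.
SAMPLE_HTML = """
<ul>
  <li class="hit"><a class="title" href="/record/1">Moby Dick</a>
      <span class="author">Herman Melville</span></li>
  <li class="hit"><a class="title" href="/record/2">Walden</a></li>
</ul>
"""


class HitParser(HTMLParser):
    """Collects one dict per hit with 'title', 'url' and optional 'author'."""

    def __init__(self):
        super().__init__()
        self.records = []
        self._current = None  # the hit being built, or None outside a hit
        self._field = None    # the field whose text is currently being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        cls = attrs.get("class")
        if tag == "li" and cls == "hit":
            self._current = {}
        elif self._current is not None and tag == "a" and cls == "title":
            self._field = "title"
            # The href attribute of the title link becomes the url field.
            self._current["url"] = attrs.get("href")
        elif self._current is not None and tag == "span" and cls == "author":
            self._field = "author"

    def handle_data(self, data):
        if self._current is not None and self._field:
            previous = self._current.get(self._field, "")
            self._current[self._field] = previous + data.strip()

    def handle_endtag(self, tag):
        if tag in ("a", "span"):
            self._field = None
        elif tag == "li" and self._current is not None:
            # Title is treated as required, author as optional.
            if "title" in self._current:
                self.records.append(self._current)
            self._current = None


def parse_hits(html):
    parser = HitParser()
    parser.feed(html)
    return parser.records


if __name__ == "__main__":
    for record in parse_hits(SAMPLE_HTML):
        print(record)
```

In this sketch a hit without a title is dropped, while a hit without an author is kept, mirroring the required/optional choice made in the designer.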

The configuration of this step is divided into five tabs:

  • Design: interactively design your XPattern
  • Edit: edit the XPattern manually
  • History: try out previous versions of the XPattern
  • Options: settings for the step
  • Hitnumber check: a simple tool for checking that the XPattern will not miss hits on the page

For more detailed information on XPattern see: