The model for connectors was thrashed out in a series of informal meetings in Montreal, and so has come to be referred to as the Montreal Model. It’s a purely conceptual model, quite independent of how connectors happen to be stored and transmitted: it’s about what kind of thing they are, not how they’re represented. (But don’t worry, there is also a concrete representation, which we’ll show you at the end of this page.)
The abstract model
The model makes a clear distinction between three concepts: Connectors, Tasks and Steps.
A connector is a complete, self-contained specification for how to use a web-site for searching and retrieval. There is nothing about a connector that ties it particularly to the Z39.50 or SRU protocols: it describes only the interaction with the back end, leaving it to other software to control how it is invoked.
A connector contains three things:
Metadata: a set of key=value pairs which contain information about the connector: for example, its title, author and date. These are not needed when running the connector, but are important for administering connectors, especially within the Repository.
Properties: a separate set of key=value pairs that are distinguished from the metadata in two ways. First, they influence the running connector; and second, their values may be arbitrarily complex structures, whereas those of metadata are always simple strings. For example, while the block_css property is a boolean, the whitelist property has as its value a list of Internet domains.
Tasks. The meat of a connector is the set of tasks that it provides, which together provide its functionality. They are discussed in more detail in the next section.
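As a rough illustration of the three parts just listed, here is a minimal Python sketch. It is not the Platform's own representation: the class names, field names and sample values are all assumptions made for this example, and only the block_css and whitelist property names come from the text above.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Task:
    # One of the four task types: "init", "search", "parse" or "next".
    type: str
    # Ordered steps; each step is modelled here as a plain dict (an assumption).
    steps: list[dict[str, Any]] = field(default_factory=list)


@dataclass
class Connector:
    # Simple string values, used only for administration.
    metadata: dict[str, str] = field(default_factory=dict)
    # Values may be arbitrarily complex and influence the running connector.
    properties: dict[str, Any] = field(default_factory=dict)
    # A connector may hold several tasks of the same type.
    tasks: list[Task] = field(default_factory=list)


# Hypothetical sample connector; none of these values come from a real one.
connector = Connector(
    metadata={"title": "Example site", "author": "A. N. Other", "date": "2011-01-01"},
    properties={"block_css": True, "whitelist": ["example.org", "example.com"]},
    tasks=[Task(type="search"), Task(type="parse"), Task(type="next")],
)
```

Note how the metadata values are all strings, while the property values are a boolean and a list.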
When a user searches on a web-site, or an application such as a metasearch engine searches on a user’s behalf, the whole search-and-retrieve process typically consists of several separate tasks.
Tasks come in four types, described in more detail below (init, search, parse and next), but a connector need not have exactly one of each. Many connectors – such as those for publicly available sites like Google and Wikipedia – will not need an init task at all; and some connectors will provide multiple instances of the search task. For example, if a simple keyword search is submitted, then a search task that expects to use only a single keyword argument will be invoked; but if author and title are specified as separate parts of a query, then a search task that uses the author and title arguments will be invoked instead, if one exists. Such a task would probably use the back-end web site’s Advanced Search page.
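The choice between several search tasks can be pictured with a small Python sketch. This is only one plausible selection rule (pick the task whose expected arguments exactly match those supplied), not a description of how the Engine actually decides; the "name" and "args" keys and the argument name "keyword" are hypothetical.

```python
def choose_search_task(tasks, query_args):
    """Return the search task whose expected arguments match those supplied.

    Assumed rule for illustration only: an exact match on argument names.
    Returns None when no search task fits.
    """
    supplied = set(query_args)
    for task in tasks:
        if task["type"] == "search" and set(task["args"]) == supplied:
            return task
    return None


# Two hypothetical search tasks: one for Basic Search, one for Advanced Search.
tasks = [
    {"type": "search", "name": "basic", "args": ["keyword"]},
    {"type": "search", "name": "advanced", "args": ["author", "title"]},
]

simple = choose_search_task(tasks, {"keyword": "water"})
fielded = choose_search_task(tasks, {"author": "Eco", "title": "Baudolino"})
missing = choose_search_task(tasks, {"isbn": "0000000000"})
```

Here `simple` is the basic task, `fielded` is the advanced one, and `missing` is `None` because no task expects that argument.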
Some web-sites require a user to authenticate before being allowed to search. For these, a separate initialization step is required before the searches are submitted. Higher-level logic, such as the Z39.50/SRU-to-CFEngine gateway, must determine whether or not this task needs to be invoked at the start of each session, using application-level criteria: for example, a Z39.50 session might cause a connector’s init task to be run if authentication information is included in the Z39.50 InitRequest.
init tasks typically navigate to a login page, copy task parameters such as password into appropriate fields on that page, and submit the form. This generally results in the server issuing a cookie which allows access to the main site.
Every connector needs one or more search tasks. Their job is to load the web-site’s appropriate search page (which may be either a Basic or Advanced Search page depending on which arguments are provided), fill in the form, submit it, and extract a hit-count from the result page.
The purpose of the parse task is to recognize the parts of a result page that contain records, and extract fields from them: author, title, date, etc. – whatever is available. In general, this is the most complex part of a connector.
The two main approaches that the Connector Platform supports are parsing by regular expression matching on the HTML, or using XPattern, an XPath-like language that matches sequences of elements in the DOM tree. Helpers exist for constructing XPatterns automatically, and this is the preferred approach for sites whose HTML is sufficiently well structured to support it.
In general, the data extracted by the parsers requires post-processing to make it usable: for example, extraneous text often needs to be stripped out of fields, whitespace needs trimming, dates need normalizing, URLs need to be made to work from outside the originating site, etc. The Platform provides facilities for these kinds of transformations, as well as a general-purpose regular-expression transformer. These are described in detail in the reference portion of this manual.
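To make the kinds of clean-up concrete, here is a small Python sketch of the four transformations mentioned above. It uses only the Python standard library and is not the Platform's own facility; the field names, the "By:" prefix, the date format and the URLs are all assumptions chosen for the example.

```python
import re
from datetime import datetime
from urllib.parse import urljoin


def clean_record(record, page_url):
    """Return a cleaned copy of one parsed record (illustrative only)."""
    cleaned = dict(record)
    # Strip extraneous leading text, then collapse and trim whitespace.
    author = re.sub(r"^\s*By:\s*", "", cleaned["author"])
    cleaned["author"] = re.sub(r"\s+", " ", author).strip()
    # Normalize a date such as "3 March 2011" to ISO 8601 (YYYY-MM-DD).
    parsed = datetime.strptime(cleaned["date"].strip(), "%d %B %Y")
    cleaned["date"] = parsed.date().isoformat()
    # Resolve a site-relative link so that it works from outside the site.
    cleaned["url"] = urljoin(page_url, cleaned["url"])
    return cleaned


raw = {"author": "  By:  Umberto   Eco ", "date": " 3 March 2011", "url": "/record/42"}
record = clean_record(raw, "http://example.org/search?q=eco")
```

The month name is matched using the current locale, so this sketch assumes English month names.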
Most web-sites do not present all the results of a search on a single page. This would obviously be prohibitive in cases where many results are found – for example, at the time of writing Google has 575,000,000 hits for the search “water”. Accordingly, connectors must provide a next task which can navigate from one page of results to the next: the higher-level code that runs the connector will invoke this task, alternating with the parse task, as many times as necessary to fetch the number of records required by the application.
In general, then, the sequence of task invocations in a session is as follows:
- search 1
  - parse 1
  - next 1, parse 2
  - next 2, parse 3
  - [more next/parse pairs as needed]
- search 2
  - parse 1
  - next 1, parse 2
  - next 2, parse 3
- [more searches as needed]
  - [parse and next/parse pairs as needed after each search]
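The sequence above can be sketched as a small driver loop in Python. This is an assumption about how higher-level code might behave, not the Engine's actual logic: the `run_task` callable, the stopping rule and the `FakeSite` stand-in are all invented for the example.

```python
def fetch_records(run_task, query, wanted):
    """Run search, then alternate parse and next until enough records are held."""
    log = ["search"]
    hits = run_task("search", query)       # search yields the hit-count
    records = list(run_task("parse", None))
    log.append("parse")
    target = min(wanted, hits)             # cannot fetch more than exist
    while len(records) < target:
        run_task("next", None)
        page = run_task("parse", None)
        log += ["next", "parse"]
        if not page:                       # site ran out of results early
            break
        records.extend(page)
    return records[:wanted], log


class FakeSite:
    """Stand-in for a connector talking to a paged web-site."""

    def __init__(self, total, page_size):
        self.total, self.page_size, self.offset = total, page_size, 0

    def run_task(self, name, arg):
        if name == "search":
            self.offset = 0
            return self.total
        if name == "next":
            self.offset += self.page_size
            return None
        if name == "parse":
            end = min(self.offset + self.page_size, self.total)
            return [f"record {i + 1}" for i in range(self.offset, end)]
        raise ValueError(name)


site = FakeSite(total=25, page_size=10)
records, log = fetch_records(site.run_task, "water", wanted=15)
```

With 25 hits, pages of ten and fifteen records wanted, the log comes out as search, parse, next, parse.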
Each task consists of three things: a sequence of steps, a set of named test arguments, and optionally a group of tests. These are now described in more detail.
Tasks contain steps
Most fundamentally, a task consists of a sequence of steps, which are run in order, and which together fulfill the task. In general the steps are run in order from first to last, but there are a few ways in which that order can be tweaked:
Each step in a task can be marked as an Alt step; this means that it gets run if and only if the step immediately before it failed - for example, if a step that tries to extract the value of a particular part of a results page can’t do so because its XPath doesn’t match anything in the document. Usually, such failures cause the whole task to fail, but alt steps provide a recovery mechanism for such situations. A common use is setting the hit-count to zero in a search task when the part of the document that’s supposed to say “Showing hits 1-10 of 3,456” is not present.
Because connectors are working with web-sites, and because web-sites are complex things sitting on the other side of a global network, some operations don’t always work the way they should. Occasionally a step will fail for transitory reasons – for example, a result page appears to have loaded when in fact it is not complete, but will be in a few more seconds. To cope with situations like this, the Platform provides a Retry step, which can be placed after a step that might fail. It specifies how many times to retry the failed operation before giving up, and how long to wait between tries.
While alt steps and the Retry step are both ways to recover from errors, the Next If step provides rudimentary control flow based on the values of arguments. Unlike Retry, it appears before the step it controls, and says to run that step only if a condition is satisfied: that a particular argument matches a specified value.
See the reference manual for more on the Retry and Next If steps.
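The three mechanisms can be pictured with a toy step runner in Python. This is a simplified sketch, not the Engine: the dictionary keys ("alt", "arg", "value", "times", "wait", "name"), the `actions` table and the way Retry and alt steps combine are all assumptions, and the reference manual remains the authority on real behaviour.

```python
import time


class StepFailure(Exception):
    """Raised by a step action that cannot complete."""


def attempt(step, args, actions):
    """Run one step, reporting success or failure instead of raising."""
    try:
        actions[step["type"]](step, args)
        return True
    except StepFailure:
        return False


def run_task(steps, args, actions):
    i = 0
    while i < len(steps):
        step = steps[i]
        if step["type"] == "Next if":
            # Run the following step only if the named argument matches.
            matches = args.get(step["arg"]) == step["value"]
            i += 1 if matches else 2
            continue
        if step["type"] == "Retry" or step.get("alt"):
            # Reached only when the preceding step did not fail: skip.
            i += 1
            continue
        ok = attempt(step, args, actions)
        j = i + 1
        if not ok and j < len(steps) and steps[j]["type"] == "Retry":
            retry = steps[j]
            for _ in range(retry["times"]):
                time.sleep(retry["wait"])
                if attempt(step, args, actions):
                    ok = True
                    break
            j += 1
        if ok:
            i += 1
        elif j < len(steps) and steps[j].get("alt"):
            # Recovery: run the alt step; if it fails too, the task fails.
            actions[steps[j]["type"]](steps[j], args)
            i = j + 1
        else:
            raise StepFailure(f"step {i} ({step['type']}) failed")


results, opened, tries = {}, [], []


def extract_regex(step, args):
    tries.append(1)
    raise StepFailure("no hit-count on page")  # simulate a missing hit-count


actions = {
    "Open URL": lambda step, args: opened.append(step["url"]),
    "Extract regex": extract_regex,
    "Constant result": lambda step, args: results.update({step["name"]: step["value"]}),
}
steps = [
    {"type": "Next if", "arg": "database", "value": "books"},
    {"type": "Open URL", "url": "http://example.org/books"},
    {"type": "Extract regex"},
    {"type": "Retry", "times": 2, "wait": 0},
    {"type": "Constant result", "alt": True, "name": "hits", "value": 0},
]
run_task(steps, {"database": "journals"}, actions)
```

In this run the Open URL step is skipped because the argument does not match, the extraction is tried three times in all, and the alt step then sets the hit-count to zero.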
Tasks contain test arguments
While developing a connector, it’s often useful to run either a whole task, or one of the task’s steps in isolation. The behavior of the tasks and steps in a running connector depends on the arguments that are submitted with the service request (e.g. the database name and the various parts of the query, for a search request). To emulate this when testing tasks and steps within the builder, some values are needed for these arguments. For this reason, test arguments can be specified, and are saved as part of the task.
Note that the values of these arguments do not affect the behavior of the connector when providing a service by running under the Engine. At that time, the test arguments that were provided for the benefit of the builder are ignored, and the values sent as part of the service request are used instead.
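The distinction can be summed up in a few lines of Python. The function, its parameters and the "test_args" key are hypothetical names used only to illustrate the rule described above.

```python
def arguments_for_run(task, in_builder, request_args=None):
    """Pick the argument values a task runs with (illustrative rule only)."""
    if in_builder:
        # Builder runs use the test arguments saved with the task.
        return dict(task["test_args"])
    # Under the Engine, saved test arguments are ignored entirely.
    return dict(request_args or {})


task = {"type": "search", "test_args": {"keyword": "water"}}
builder_args = arguments_for_run(task, in_builder=True)
engine_args = arguments_for_run(task, in_builder=False, request_args={"keyword": "fire"})
```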
Tasks contain tests
Finally, tasks may contain tests. Each test consists of running the task with the supplied arguments, then making assertions regarding the results generated – for example, that the hit-count is a non-negative integer. Since the testing facilities are under active development at the time of writing, and liable to change, they will not be described in detail.
The last concept in the model is that of a step. Although tasks also carry test arguments and tests, these are accessories: the core of a task is the sequence of steps that it contains, and it is only these that affect the actual functioning of the task when it is run in the Engine.
A step has a type, chosen from a finite set of (at the time of writing) seventeen. The type can be thought of as the operation code of the instruction that runs on the domain-specific virtual machine that is the Engine. We frequently refer informally to step types simply as steps, as in “Oh, use a Parse by Xpattern step for that”.
Some steps are appropriate for search tasks, some for next, and some are applicable everywhere. Analyzing a pool of extant connectors shows the following patterns of usage:
init tasks use steps: Click, Open URL, Extract regex, Retry and Set form value.
search tasks use steps: Click, Open URL, Normalize URL, Parameterized result, Extract regex, Retry, Set preference, Constant result, Set form value, Next if, Submit form, and Transform result.
parse tasks use steps: Click, Join result, Open URL, Normalize date, Normalize URL, Parse by regex, Parse by Xpattern, Retry, Split result, and Transform result.
next tasks use steps: Click, Extract regex, Retry, Set preference and Constant result.
It’s apparent that the very general Click and Retry steps are used in all four tasks, and Extract regex in all but the parse task, while more specific steps such as Parse by Xpattern and Normalize date are used only in the parse task.
Each step carries its own configuration; and the type of the configuration varies by step type. For example, an Open URL step’s configuration consists either of the URL to go to, or the name of an argument that contains the URL to go to. At the other end of the complexity scale, the Transform step is configured by the name of a result-list to act on, a result within that list to use as input, a regular expression to apply to the value of that result, a string to substitute for the portion that matches the regular expression, and the name of a result within the list to write the result of the substitution back to. (This sounds more complicated than it is – stay with it, it will all make sense when the time comes.)
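The five configuration parameters of the Transform step just described can be pictured with a short Python sketch. It mimics the idea with the standard `re` module and is not the step's real implementation; the configuration keys, the result-list name and the sample data are assumptions.

```python
import re


def transform(results, config):
    """Apply a regular-expression substitution to one field of every record."""
    for record in results[config["list"]]:
        value = record.get(config["input"])
        if value is None:
            continue  # nothing to transform in this record
        # Replace each matching portion, then write to the output field.
        record[config["output"]] = re.sub(config["regex"], config["replacement"], value)


# Hypothetical result-list and configuration.
results = {"results": [{"title": "Water [electronic resource]"}, {"title": "Fire"}]}
config = {
    "list": "results",            # result-list to act on
    "input": "title",             # result to use as input
    "regex": r"\s*\[electronic resource\]",
    "replacement": "",            # string substituted for the match
    "output": "title",            # result to write back to
}
transform(results, config)
```

After the call, the first title has lost its bracketed suffix and the second is unchanged.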
The different configuration parameters of each step type are surveyed in the reference section of this manual.
The structure of connectors, as described here, lends itself nicely to a simple expression in XML. The XML format used is described in a separate page. The Relax-NG specification for that XML format can be thought of as a formalization of the description on this page.