My first connector

Let’s build a connector!

Let’s say that we want to make a connector for the online digital library of the American Museum of Natural History. The AMNH has digitized more than a century of publications and makes them freely available, so it’s a useful resource to make available to metasearch systems.

Reconnaissance: searching the website by hand

Before we try to build the connector, we’ll search the site using an ordinary web-browser to check that it supports the necessary operations.

Start by navigating to the digital library home page at http://digitallibrary.amnh.org/dspace/

We notice that it’s possible to do a simple keyword search from the front page, and that there is also a link to an advanced search page. To keep things simple, we’ll start with the keyword search, and return later to the advanced search.

In the box below the caption Search by words or numbers, enter the query dinosaur. (In the radio-button group below, leave All Publications checked: later on we can treat the publications separately if we want to, but for now we will accept all matching documents from any of the listed publications.) Hit the Search button.

The search-results page shows the first 10 of 85 hits. These are helpfully laid out in a table, each row giving the publication’s date, its title including some citation information (with a link to a detail record), and the authors (including their birth and death dates in most but not all cases).

Click through the first hit, Relationships of the saurischian dinosaurs. American Museum novitates ; no. 2181. The detail record includes more information about the resource and a link to a PDF of the full text. Return to the results list.

Now click on the next link at the bottom of the table containing the ten hits. This leads to a page that shows results 11-20. Click on this page’s next link, and the destination page shows records 21-30.

All is as it should be. Now we can start to build the connector, which will perform all the same actions on our behalf that we just performed by hand.

Building the connector

Open the Builder sidebar, either by selecting View → Sidebar → Connector Framework: Builder from the main menu, or by using the shortcut Shift-Ctrl-C.

If this is the first time you’ve used the Builder in this browser session, then it will be set up ready to make a new connector. If not, you may have another connector already loaded: in this case, start a new connector by choosing Tools → Connector Framework → New from the main menu or by clicking on the New Connector icon that is on the left of the sidebar’s top toolbar.

In the main window, go back to the AMNH Digital Library home page. Now we’re ready to begin.

No init task

Using the AMNH site does not require any authentication, so there is no need for an init task in this connector. We can get straight to work on telling the connector how to search.

The Search task

Make sure that the Current Task: dropdown near the top of the Builder sidebar is set to search – if it’s not, then change it.

The first step: going to the search page

The first step of submitting a search is getting onto the page where it’s to be submitted from, so make a new Go to URL step. Click on the Add Step button (the big plus sign that is the first button in the Steps toolbar, about halfway down the sidebar) and the Step Browser pops up, offering you a choice from among all the different types of step. Double-click on Go to URL (or click on it once, then hit the Add button at the bottom of the Step Browser).

Three things happen when you do this:

  • The Step Browser closes itself.

  • A new and empty Go to URL step is added to the task, and can be seen in the step list just below the Steps toolbar.

  • The step configuration pane leaps into existence, to the right of the sidebar and below the main window. This pane contains different controls depending on the type of the currently selected step.

For a Go to URL step, the important element of the configuration is the location to go to. This is presented as a textbox that a URL can be typed or pasted into, but since we are already on the right page, we can take the shortcut of clicking the Use current page button to the right of the textbox.

Clicking the button makes two things happen:

  • The textbox in the step configuration pane is filled in with the URL of the current page (i.e. the AMNH Digital Library home page).

  • That change in the state of the step is reflected in the step list over in the sidebar: the step’s summary now contains the URL as well as the type.

The URL in the step list also changes if you edit the URL in the step configuration pane by hand. The step list always contains a summary of all the steps that make up the current task. Each step’s summary consists of the step type, in bold, followed by an informative snippet of the configuration, which again is different depending on the type of the step.

You can always test a step by clicking on the Play button in the Steps toolbar: it’s the right-pointing triangle that is fifth from the left. Since you’re already at the AMNH page, navigate away to any page you like, and then hit the Play button. The builder will run the new Go to URL step, which will take you back to the AMNH Digital Library front page.

Setting the form value

Having loaded the search page, the next step is to type in the value to be searched for. To see that in action, we’re going to need a value to test with, so enter the search term dinosaur as the keyword argument in the list of Test Arguments near the top of the sidebar. These test arguments are saved as part of the connector, although they are not used when running the connector in production: they are purely for the benefit of the Builder and the developer using it.

Now we can add a Set form value step. Open the Step Browser and double-click on the step name. The new, empty step appears below the existing Go to URL step in the step list, and the step configuration pane changes to show the configuration of the new step.

Click on the Go to URL step in the step list and watch the configuration pane change to show the URL that was specified in the configuration for that step; click on the Set form value step in the list and the configuration pane changes back again.

In the configuration pane, the textbox labelled Form field to populate should contain an XPath specifying which element on the page is to be set to the specified argument. Rather than typing in an XPath, it’s usually easier to use the node selector, and that’s what we’ll do now. Click on the Select node button. Now as you move your mouse around the contained web page, a dotted purple outline follows the mouse, showing which of the page’s elements is being pointed to. Point at the search box and click: the textbox in the configuration pane is filled with a complicated XPath that designates the selected textbox.

Now we can test the new step. Hit the Play button and the word dinosaur appears in the search box.

When building a connector for a more complex search form, multiple Set form value steps will be used to set multiple values – for example, separate author, title and date values.

Submitting the search

Having filled in the form, we need to submit it to the server. There are two ways to do this: with the Submit form step or with Click. We’ll use Click.

As usual, open the Step Browser and double-click on the Click step. As usual, the new step is added to the list in the sidebar, and its empty configuration appears in the step configuration pane.

As with the Set form value step, this is configured by an XPath that indicates which element of the page to click; and as before we can use the node selector to do this. Hit the Select node button, and click on the contained page’s Search button to fill in the XPath.

Now you can click on the Play button to check that the form submission step works as intended. When you do so, the contained page will change to show the first page of results.

Extracting and cleaning the hit-count

When a Z39.50 or SRU client sends a search request, the response has to specify how many records were found, so our connector has to extract this information from the results page.

To do this, we will need two steps: Extract value and Transform result.

Add an Extract value step, and in the new step’s configuration pane use the node selector to choose the text area that contains the text “Results 1-10 of 85.” Run the step to check that it works correctly. Now for the first time we are using the Results area near the bottom of the Builder sidebar: the single result, called hits, has the value “Results 1-10 of 85.” (i.e. the content of the nominated area of the contained page).

Now add a Transform result step. This is one of the most complex and powerful of all the steps, but for now we can ignore the five fields at the top of the configuration pane and use one of the pre-canned recipes at the bottom. Click on the Last number button, and note that the configuration fields above are filled in. Hit the Play button to check that the transformation has worked correctly.

In this case there is an interesting wrinkle: the Last number recipe pulls out the last sequence of digits, commas and periods from the value it’s working on, so that it can work on decimals as well as whole numbers. This has the undesirable side-effect that the terminating period of the sentence “Results 1-10 of 85.” survives the transformation. As it happens, this is good enough: the numeric value of the string “85.” is 85, so we don’t need to do anything about this.
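To make the behaviour concrete, here is a minimal Python sketch of a "last number" extraction as described above. The regular expression and the function name last_number are assumptions written to match that description; they are not taken from the Builder's own configuration.

```python
import re

def last_number(text):
    """Return the last run of digits, commas and periods that starts
    with a digit, or None if the text contains no digits at all."""
    runs = re.findall(r"\d[\d.,]*", text)
    return runs[-1] if runs else None

raw_hits = last_number("Results 1-10 of 85.")
# The trailing period survives the extraction, but a numeric
# conversion still yields the expected value.
hit_count = int(float(raw_hits))
```

Here raw_hits keeps the sentence's final period, while hit_count is the plain integer that a search client would need.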

Extra credit: getting rid of that terminating period

If you are the kind of person who likes things to be neat and tidy, you can remove the period by adding another Transform result step, setting the Regular expression to \.$ and leaving the replacement text empty.
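The effect of that regular expression can be checked outside the Builder. The following Python sketch applies the same pattern with an empty replacement; the function name strip_trailing_period is purely illustrative.

```python
import re

def strip_trailing_period(value):
    """Remove a single period at the very end of the value, if present."""
    return re.sub(r"\.$", "", value)

cleaned = strip_trailing_period("85.")
unchanged = strip_trailing_period("3.14")
```

Because the pattern is anchored to the end of the string, a period in the middle of a value (as in a decimal) is left alone.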

Testing the whole search task

The connector’s search task is now complete. To check that it does what it should, you can use the Play All button on the Steps toolbar: it’s the one on the right, and it looks like the Play button with an additional vertical bar to the left of the triangle.

When you press this button, each step in the task is run in turn. You will see the contained site in the main window switch back to the AMNH Digital Library home page when the Go to URL step is run, then the word dinosaur appear in the search box when Set form value is run, then the form being submitted when Click is run, at which point the page changes to the result list. Then, too quickly to see the individual steps happening, the hit-count will be extracted and transformed.

While this is happening, the step configuration pane is replaced by a log explaining what the connector is doing. This can be useful when debugging complex connectors. As soon as you click on a step in the list, or add a new step, this log is replaced once more by the configuration pane.

Congratulations, you have completed your first task!

The Parse task

Make sure that the Current Task: dropdown near the top of the Builder sidebar is set to parse – if it’s not, then change it.

Once a connector has obtained a page of search results, it needs to pick that page apart into separate records, and the records into separate fields, in order to have useful information to report back. The Connector Framework supports two separate approaches to parsing result pages: parsing by regular expression or by XPattern.

The regular expression parser works directly on the HTML of the results page. As a result, it is very powerful and general, but tends to be an absolute pig to work with. It remains an important tool to use as a fallback when other approaches fail, but it has for most purposes been superseded by the XPattern parser, which works at a higher level, dealing with nodes of the parsed page rather than with raw text. Because it works at this higher level, the XPattern parser is able to offer tools that help you to build a pattern, often very easily.

Once data has been extracted from the page, it can be cleaned up using transformations like the one we used in the search task to tidy up the hit-count.

Initial parsing

Use the Step Browser to add a Parse by XPattern step – the first step of the new parse task. As usual, the step configuration pane appears. Rather than typing in an XPattern, we will use the Builder to help us create one: this is done by clicking on various elements of a sample record and specifying which field of the result record the values should go into.

In the step configuration pane, click on the Start creating a pattern button. A new message appears in the configuration pane, “Please click on some part of a good hit”, along with some buttons that we can ignore for now.

Do as the instruction says: click on the first interesting part of the first hit, the date “1964”. Immediately this is highlighted, and an entry describing this field is added to the configuration pane, highlighted in the same colour. As well as the value, this entry contains a dropdown for specifying which field of the output record should be set from this part of the page. From this dropdown, choose date. We can ignore the other parts of the entry for now.

Click on the Add another node button below the newly created entry, and then on the next interesting part of the record, the title “Relationships of the saurischian dinosaurs. American Museum novitates ; no. 2181”. This is highlighted in a different colour, and a new entry is appended. From this entry’s first dropdown, choose the field title.

This field in the results page has an important difference from the others: it is a link to the full-record page (which in turn has the link to the full-text PDF). We need to get at the full-record link so that it can be returned to the search client, so click on the title entry’s Attributes button. Another line is added to the title entry, allowing an attribute from the title to be captured. In this case, we want the href attribute, which is selected by default, and we want to copy it into the url field: choose this fieldname from the dropdown after the going into caption.

Finally, we need to capture the author. Click once more on the Add another node button, then on the first record’s author, “Colbert, Edwin Harris, 1905-”, and set the fieldname in the dropdown to author.

Now we are ready to generate the XPattern. Hit the generate a pattern button at the bottom of the configuration pane, and the XPattern textbox will be filled in with a pattern describing the set of result-page fields and output-record fields we’ve nominated:

TD $date : TD  { A $title [ @href $url  ] } : TD $author
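Read informally, the pattern says: take three adjacent table cells; put the first cell's text into date, the text and href of the link inside the second cell into title and url, and the third cell's text into author. The Python sketch below imitates that reading on a hand-written table row. The markup, the href value and the function name parse_row are all hypothetical, and this is not how the XPattern parser itself is implemented.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup for one row of the results table.
SAMPLE_ROW = """
<tr>
  <td>1964</td>
  <td><a href="/dspace/handle/EXAMPLE">Relationships of the saurischian dinosaurs</a></td>
  <td>Colbert, Edwin Harris, 1905-</td>
</tr>
"""

def parse_row(row):
    """Map the three cells of a row onto date, title, url and author."""
    date_cell, title_cell, author_cell = row.findall("td")
    link = title_cell.find("a")
    return {
        "date": (date_cell.text or "").strip(),
        "title": (link.text or "").strip(),
        "url": link.get("href"),
        "author": (author_cell.text or "").strip(),
    }

record = parse_row(ET.fromstring(SAMPLE_ROW.strip()))
```

Note that the url captured this way is exactly what the href attribute contains, which may be a relative path.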

And now the magic happens. Click the Play button above the step list and gasp in awe at the parsed data that fills the Results area at the bottom of the sidebar! (You will need to resize the Results area in order to see more than one or two of the result records.)

Cleaning the parsed data

Looking at the records in the Results area, we can see that:

  • All ten records (numbered 0-9) have been correctly parsed out of the page.

  • The dates and titles are good (although we might later want to refine the titles by moving the citation information into another field).

  • The URLs are relative to the root of the website that hosts them rather than absolute.

  • There may be one or more authors, separated by semi-colons, and each author may have birth and death dates appended. Some authors are terminated with a period, some are not.

Cleaning URLs

Fixing the URLs is easy: add a Normalize URL step. There is no need to configure it: It Just Works. Click on the Play button, and all the URLs in the Results area will be transformed into absolute URLs that include the site name as well as the local path.
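The idea behind this kind of normalization is ordinary relative-URL resolution against the address of the page the link came from. As a sketch only (the internals of the Normalize URL step are not described here, and the record path below is made up), the same effect can be had in Python with the standard library:

```python
from urllib.parse import urljoin

# Address of the page on which the relative link was found.
PAGE_URL = "http://digitallibrary.amnh.org/dspace/"

def normalize_url(link, page_url=PAGE_URL):
    """Resolve a possibly relative link against the page it came from."""
    return urljoin(page_url, link)

absolute = normalize_url("/dspace/handle/EXAMPLE")
already_absolute = normalize_url("http://example.org/paper.pdf")
```

Links that are already absolute pass through unchanged, so the operation is safe to apply to every record.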

Cleaning authors

The authors are a little more complex to deal with. Since the compound and irregular author strings we have are already useful, we will defer further work on these for now, and come back to them in the next lesson.

So we now have a complete and functional parse task, albeit one that we can refine later.

The Next task

Make sure that the Current Task: dropdown near the top of the Builder sidebar is set to next – if it’s not, then change it.

In general, a searching client will want more records than are displayed on a site’s first results page. To support this, after having returned the first batch of result records, the Engine will repeatedly invoke the next and parse tasks to obtain records from subsequent pages.
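The paging behaviour just described can be pictured as a simple loop. The Python sketch below is an assumption about the general shape of that loop, not the Engine's actual code: collect_records, FakeSite and their methods are hypothetical names, and the "site" is just three in-memory pages of ten record numbers each.

```python
def collect_records(parse, next_page, wanted):
    """Parse the current page, then alternate next/parse until enough
    records have been gathered or there are no more pages."""
    records = list(parse())
    while len(records) < wanted:
        if not next_page():      # no further page to move to
            break
        batch = list(parse())
        if not batch:            # next page yielded nothing
            break
        records.extend(batch)
    return records[:wanted]

class FakeSite:
    """Stand-in for a result list split across several pages."""
    def __init__(self, pages):
        self.pages = pages
        self.index = 0

    def parse(self):
        return self.pages[self.index]

    def next_page(self):
        if self.index + 1 >= len(self.pages):
            return False
        self.index += 1
        return True

site = FakeSite([list(range(1, 11)), list(range(11, 21)), list(range(21, 31))])
first_25 = collect_records(site.parse, site.next_page, 25)
```

The point of the sketch is that parse never needs to know which page it is on: it always works on whatever page the previous next invocation left behind.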

For most sites, moving on to the next page of results is as simple as clicking a link, and that’s the case here. Add a Click step to the new next task; click on the Select node button in that step’s configuration pane, and then select the next link at the bottom of the contained page. Now hit the Play button on the Steps toolbar to verify that this step does indeed move on to the next batch of records.

Now that we have records 11-20 on screen, we can parse these using our existing task. Using the dropdown at the top of the Builder sidebar, select the parse task, and hit the Play All button on the Steps toolbar. Both the Parse by XPattern and Normalize URL steps will run, and the result will be that the records in the Results area are replaced by broken-apart versions of those on this second page. (As a side-effect, the parsed-out regions of the contained page are highlighted: this can be useful when trying to work out why an XPattern is not doing what was expected.)

Refining the link

There is a problem here, though. Go back to the next task and run it again, and you will see that rather than stepping on to the next page of results, the site leaps straight to the last page.

That’s because the XPath that the node selector generated says “use the link number 9 in the table cell that contains those links”, which works fine on the first page of results but not on the second, because different parts of the pages list are linked depending on where you are.

The solution is to change the XPath so that it always picks the next link. Go back to the first page of results, where you originally defined the Click step, and:

  • Click on the Refine xpath button in the step configuration pane.

  • In the popup, click on the component a[9], which is the part of the path that specifies link number 9. (The previous components of the path explain where the relevant table cell is.)

  • Of the three attributes of that next link that are displayed, choose Text Content, since that is the part of the link that identifies it as the right one. Check the box next to that caption.

  • Hit the Save button.

The XPath in the textbox is rewritten according to your modification, and now it is possible to step all the way through the result list by repeatedly invoking the next task.

Try it!
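The difference between selecting a link by position and selecting it by its text can be demonstrated with a small Python sketch. The two pager cells below are invented to mirror the behaviour described above (the ninth link is next on the first page but the last page number on the second), and the path expressions are ElementTree's limited XPath subset rather than the exact XPath that the Builder writes.

```python
import xml.etree.ElementTree as ET

def pager(labels):
    """Build a hypothetical table cell holding one link per label."""
    cell = ET.Element("td")
    for label in labels:
        link = ET.SubElement(cell, "a")
        link.text = label
    return cell

# Page 1: pages 2-9 are linked, followed by "next".
page1 = pager(["2", "3", "4", "5", "6", "7", "8", "9", "next"])
# Page 2: "previous" and page 1 are now linked too, shifting everything.
page2 = pager(["previous", "1", "3", "4", "5", "6", "7", "8", "9", "next"])

by_position = [page.find("a[9]").text for page in (page1, page2)]
by_text = [page.find("a[.='next']").text for page in (page1, page2)]
```

Selecting by position picks the last page number on the second page, whereas selecting by text content finds the next link on both.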

The connector for the AMNH Digital Library is now complete, at least in a primitive form. It could be used to provide searching for metasearch tools such as Masterkey. Save the connector by choosing Tools → Connector Framework → Save… from the menu, or by using the Save button that is third from the left in the top toolbar. It’s conventional to use a filename that ends with .cf – for example, amnh-diglib.cf

In the next session, we will refine this connector.