Refining the AMNH connector
We now want to refine the connector that we made for the AMNH digital library. There are two improvements that we might wish to make: supporting advanced search (i.e. searching specifically for an author or title); and cleaning the parsed data more fully. We will consider each of these in turn.
Advanced search
We noticed earlier that the AMNH site has an Advanced Search page. We can use this to make field-specific searching available to clients – title, author and subject. (The site also supports searching for series, but the Builder does not have a search parameter for that, so we can’t make use of it.)
To take advantage of the Advanced Search page, we’ll make a second
search task – one that uses the title, author and subject arguments
rather than the keyword argument. Once both search tasks are in
place, we’ll be able to tell which is which, when we pull down the
Current Task dropdown, because one will be called search (keyword)
and the other will be called search (title, author, subject) – the
Builder knows which parameters each task uses, and names the task accordingly.
To create the new task, click on the Add Task button – the large plus
sign in the toolbar at the top of the Builder sidebar. From the Add
Task browser, double-click on search (or click on it once, then
hit the Add button at the bottom of the Task browser). We need to
add some sample parameters for testing, so in the Test Arguments
area of the new task, let’s set title to relationships, author to
colbert and subject to dinosaurs. Now we’re ready to start adding steps.
As before, we’ll start by having the Builder navigate to the search page. Add an Open URL step to the new task, and set the Constant
Location to http://digitallibrary.amnh.org/dspace/advanced-search.
The next part of the process is to populate the search form with the title, author and subject parts of the query. This can be awkward to do in search pages like this one, where the choice of which fields to search is not statically determined but must be made by selecting from dropdowns on the search page. But in this case it’s not so bad, because there are three search fields available: we can set the three dropdowns to constant values and use one search field for each of the three query parameters. If there were fewer search fields than query parameters, we would need to use conditional logic to determine at run time what selections to make from the dropdowns that control them.
Let’s start by fixing the first dropdown to specify a title search.
Add a Set form value step, hit the Select node button in the
step configuration pane, and then click on the first of the three
field-name dropdowns in the search page. When you do this, the Form
field to populate field in the step configuration pane is filled in,
and the dropdown itself opens up to offer the list of options. From
this list, click on Title, and the step configuration pane’s
Populate with constant field is also filled in. Now test that the
step works by manually changing that dropdown to one of its other
values, then hitting the Play button to revert it to Title.
Next we need to set the value of the title argument into the form.
Add another Set form value step, hit the Select node button in
the step configuration pane, and then click on the first of the three
entry boxes on the AMNH search page, the one next to the dropdown that
we’ve set to Title. Now, in the step configuration pane, go to the
Populate with task argument dropdown and choose title. Click on
the Play button to verify that this step does indeed set the
appropriate value.
What you have done for the title parameter in the Advanced Search
page’s first search field, you can now do for the author and subject
parameters in the other two fields. Go ahead: add two more pairs of
Set form value steps, and test them. (Somewhat confusingly, the
value that the AMNH Digital Library’s Advanced Search page uses for
the field-selection dropdown when it’s set to “Subject” is keyword.
Don’t worry, you didn’t make a mistake.)
Finally, we can submit the search form. This is done in exactly the
same way as for the simple search: add a Click action, hit Select
node and then click on the search form’s Search button.
Test the new task using the Play button, and a single record should be
found. This can be parsed using the existing parse step.
Tidying up the authors
The author strings that we extracted with the XPattern parser are rather ugly, and we can usefully do more work on them.
To get set up for this work, return to the original search (keyword)
task, and hit the Play All button to re-do the search that finds 85
hits. Return to the parse task, and re-run it to get a set of
results into the Builder.
Now we can get to work.
Splitting and cleaning author names
The first thing to do is split the author strings apart into separate authors: this is useful because when multiple values are returned separately rather than as part of a glued-together string, they can be used in facet lists.
Add a Split result step, and configure it as follows:
In the Result list dropdown, select results.
Set Result to transform to author.
Set the Regular expression to ;\s* (semicolon, backslash, lowercase letter ‘s’, asterisk). This matches a literal semicolon followed by any number of spaces, tabs, etc.
Set Result to set to author, so that the split values will replace the existing author string. (Alternatively, they could be copied into a separate field.)
Running this step does not change the single authors at all, but splits compound author strings like “Brown, Barnum.; Schlaikjer, Erich Maren, 1905-” into multiple authors.
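Outside the Builder, the effect of this step can be approximated in a few lines of Python. This is only a sketch: the function name split_authors is made up for illustration, and Python’s re module stands in for whatever regular-expression engine the Builder uses, which may differ in details.

```python
import re

def split_authors(author_field):
    """Split a compound author string on a semicolon plus any following whitespace.

    Hypothetical helper: mimics the Split result step, not the Builder's own code.
    """
    return re.split(r";\s*", author_field)

# A compound string is broken into separate authors ...
print(split_authors("Brown, Barnum.; Schlaikjer, Erich Maren, 1905-"))
# ... while a single author comes back as a one-element list.
print(split_authors("Brown, Barnum"))
```

Note that the commas inside each name are left alone: only the semicolon acts as a separator.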
Removing trailing periods
We see that some author names have terminating periods, while others
do not – for example, in the field that was split in the previous
step, “Brown, Barnum.” ends with a period, while “Schlaikjer, Erich
Maren, 1905-” does not. We can tidy this up with a Transform
result step: add the step, set the Result list, Result to
transform and Result to set as with the previous step
(results, author, author), and set the Regular expression to
\.$ (backslash, period, dollar). This matches a period at the end
of a value only. Leave the Replace with field blank, since we
want to replace the trailing period with nothing. Press the Play
button to check that this step does indeed remove the period from the
end of “Brown, Barnum.”.
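The same substitution can be tried outside the Builder. The following is a minimal sketch in Python, assuming that the Builder’s regular expressions behave like Python’s for this simple pattern; strip_trailing_period is a hypothetical name used only for this illustration.

```python
import re

def strip_trailing_period(author):
    """Remove a single period at the very end of the value, if there is one.

    Hypothetical helper: mimics a Transform result step with an empty replacement.
    """
    return re.sub(r"\.$", "", author)

print(strip_trailing_period("Brown, Barnum."))
print(strip_trailing_period("Schlaikjer, Erich Maren, 1905-"))
```

Because the pattern is anchored with $, a period in the middle of a value is not affected.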
Removing birth and death dates
We now see that some of the author names have birth dates, or both birth and death dates, after them, whereas others do not. For example, the fourth record has three authors: “Osborn, Henry Fairfield, 1857-1935”, “Brown, Barnum” and “Lull, Richard Swann, 1867-”. To remove these dates, we use a regular expression that matches a comma, followed by zero or more whitespace characters, then one or more digits and minus signs at the end of the string.
Add another Transform result step, as before with Result list
set to results, Result to transform set to author and Result
to set set to author. Set the Regular expression to
,\s*[0-9-]+$ and leave the Replace with string empty.
Press the Play button, and watch all the dates disappear from the ends of the author names.
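To see what this pattern does and does not touch, here is a small Python sketch applied to the three authors of the fourth record. The helper name strip_dates is invented for this example, and Python’s re module is only an approximation of the Builder’s regular-expression handling.

```python
import re

def strip_dates(author):
    """Remove a trailing ", 1857-1935" or ", 1867-" style date from an author name.

    Hypothetical helper: mimics a Transform result step with an empty replacement.
    """
    return re.sub(r",\s*[0-9-]+$", "", author)

for name in ["Osborn, Henry Fairfield, 1857-1935",
             "Brown, Barnum",
             "Lull, Richard Swann, 1867-"]:
    print(strip_dates(name))
```

The comma between surname and forenames survives, because what follows it is not made up solely of digits and minus signs running to the end of the string.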
Reversing the “Last name, First name” format
Now we have author names like “Osborn, Henry Fairfield”, “Brown, Barnum” and “Lull, Richard Swann”. These are all in a consistent format within the AMNH Digital Library database, but normalised names in this “last name, first name” format are generally unusual, and so client software that uses these names to generate facet lists will not recognise that “Osborn, Henry Fairfield” in this database is the same author as “Henry Fairfield Osborn” in another. So in order to make our connector a better citizen in the metasearching world, we’ll finish up by switching the names into the more common form.
For this, we will use yet another Transform result step on the
author field (so set up Result list, Result to transform and
Result to set as before). This time, we need a more sophisticated
regular expression that captures both the surname and the forenames
separately. Set Regular expression to (.*),\s*(.*). This
matches and captures any sequence of characters, followed by a comma
and zero or more spaces, followed by another sequence of any
characters, which is also captured. Set Replace with to $2 $1,
which simply emits the two captured substrings in reverse order.
Hit the Play button: all the author names are converted into conventional form.
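The capture-and-swap idea can be sketched in Python as follows. This is an illustration under assumptions rather than the Builder’s implementation: reverse_name is a hypothetical helper, and Python writes the back-references as \2 and \1 where the Builder’s Replace with field uses $2 and $1.

```python
import re

def reverse_name(author):
    """Turn "Surname, Forenames" into "Forenames Surname".

    Hypothetical helper: mimics a Transform result step with captured groups.
    Values that contain no comma do not match and are returned unchanged.
    """
    return re.sub(r"(.*),\s*(.*)", r"\2 \1", author)

for name in ["Osborn, Henry Fairfield", "Brown, Barnum", "Lull, Richard Swann"]:
    print(reverse_name(name))
```

Since the first group is greedy, a value that still contained more than one comma would be split at the last one, which is one reason to remove the dates before running this step.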
Putting it together
Now that the parse task is complete, you can test it as a whole.
Use the next task to step on to hits 11-20, then go back to the
parse task, and hit the Play All button. All of the URLs and
authors should appear correctly in the Results area.
