OAI harvesting from within Keystone
In addition, Keystone allows the system administrator
to define OAI harvesting tasks. To do so, one must install the
libtkl-perl and the
tkl-oai-harvester Debian packages.
The OAI harvester daemon, tkl-oai-harvester, is
started, stopped, and restarted with the commands
/etc/init.d/tkl-oai-harvester start
/etc/init.d/tkl-oai-harvester stop
/etc/init.d/tkl-oai-harvester restart
When installed with apt-get as .deb
packages on a Debian system,
these start/stop scripts are set up so that the harvesting
service is started automatically at boot time. The start/stop scripts are
rather simple, and it should not be difficult to adapt them to
operating systems other than Debian GNU/Linux.
Harvesting tasks are created in the admin interface.
The bibliotheca example portal contains
a task directory called
bibliotheca/tasks, which holds two
subdirectories, oaibizigate and
oaitklite, and two Keystone files,
directory.tkl and
index.tkl, which are set up to display the
resulting oai*.tkl files containing the
harvested OAI metadata records.
Navigate within the admin interface to the
bibliotheca/tasks directory, and add a new
oai task file. Fill in the starting URL
(remember the trailing slash when addressing a Keystone OAI
server!). Enter the target directory
relative to the portal root, for example "/tasks/oaitklite/", and
set the status to pending.
After saving, the resulting task file should look like this:
<?xml version="1.0" encoding="ISO-8859-1" standalone="yes"?>
<task creator="admin" created="2003-07-10, 13:59:50" modifier="admin" modified="2003-07-10, 13:59:50">
<tasktype>oai</tasktype>
<url>http://tkl-cvs.indexdata.dk/bibliotheca/</url>
<target>/tasks/oaitklite/</target>
<description>OAI harvesting job at our Keystone server</description>
<status>pending</status>
<xslt>oai2link.xsl</xslt>
<handler></handler>
</task>
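As a minimal sketch of how such a task file can be inspected outside the admin interface, the following Python snippet parses the example above with the standard library. The function name read_task is hypothetical and not part of Keystone, and the harvester itself may read the file differently.

```python
import xml.etree.ElementTree as ET

# The example task file from this section, kept as bytes so that the
# ISO-8859-1 encoding declaration is honoured by the parser.
TASK_XML = b"""<?xml version="1.0" encoding="ISO-8859-1" standalone="yes"?>
<task creator="admin" created="2003-07-10, 13:59:50" modifier="admin" modified="2003-07-10, 13:59:50">
  <tasktype>oai</tasktype>
  <url>http://tkl-cvs.indexdata.dk/bibliotheca/</url>
  <target>/tasks/oaitklite/</target>
  <description>OAI harvesting job at our Keystone server</description>
  <status>pending</status>
  <xslt>oai2link.xsl</xslt>
  <handler></handler>
</task>
"""


def read_task(xml_bytes):
    """Return the child elements of <task> as a dict of stripped strings.

    Hypothetical helper for illustration; empty tags map to "".
    """
    root = ET.fromstring(xml_bytes)
    if root.tag != "task":
        raise ValueError("not a task file: root element is <%s>" % root.tag)
    return {child.tag: (child.text or "").strip() for child in root}


task = read_task(TASK_XML)
print(task["tasktype"], task["url"], task["status"])
```

An empty <handler> tag comes back as an empty string, which makes it easy to check whether a handler script has been associated with the task.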
The content of the <handler> tag is interpreted
as an optional script that the harvester calls when the job
has finished. There are no restrictions on the script, except that it
is run as the web server user, typically www-data, and is therefore
limited to that user's permissions. Task handler scripts should be
placed in the reserved directory tasks/handlers.
Each task handler script is called with a single argument: the path to
the directory where the harvested records are placed.
Please note that when you associate a task handler script other than
the trivial one, do_nothing.handler, with your OAI
harvesting task, that handler becomes responsible for indexing the
harvested records. Otherwise, the OAI harvester daemon tkl-oai-harvester
performs the indexing.
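The calling convention described above (one argument, the directory holding the harvested records) can be sketched as follows. This is an illustration only: the script name count_records.handler and the functions summarize and main are hypothetical, and the sketch does not index anything, so a real handler based on it would still have to take care of indexing.

```python
import os
import sys
import tempfile


def summarize(record_dir):
    """Return the sorted names of the .tkl files found in record_dir."""
    return sorted(name for name in os.listdir(record_dir)
                  if name.endswith(".tkl"))


def main(argv):
    """Entry point; argv[1] is the directory with the harvested records."""
    if len(argv) != 2:
        print("usage: %s RECORD_DIR" % argv[0], file=sys.stderr)
        return 2
    record_dir = argv[1]
    if not os.path.isdir(record_dir):
        print("not a directory: %s" % record_dir, file=sys.stderr)
        return 1
    names = summarize(record_dir)
    print("%d harvested record file(s) in %s" % (len(names), record_dir))
    return 0


# Demonstration with a temporary directory and made-up file names.
# An installed handler would end with: sys.exit(main(sys.argv))
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ("oai-1.tkl", "oai-2.tkl", "notes.txt"):
        open(os.path.join(demo_dir, name), "w").close()
    demo_names = summarize(demo_dir)
    exit_code = main(["count_records.handler", demo_dir])
```

Since the handler runs as the web server user, it should only touch files and directories that this user is allowed to read or write.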
The <xslt> tag contains the name of an XSLT
stylesheet that is applied to each harvested OAI
record before it is stored. These
XSLT stylesheets should be placed in the reserved
directory /authorities/oai.
If the optional <prefix> tag is specified,
its value is used as the filename prefix when the harvested OAI records
are stored as files on disk. If nothing is specified, the default
prefix link- is used.
The OAI protocol does not require a repository to support the set
record filter[3]. If you want to
harvest a single set from a set-enabled OAI repository, you can name it
in the optional <set> tag. An empty set value
is interpreted as no set value!
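The treatment of the two optional tags can be sketched as follows: an empty <prefix> falls back to link-, and an empty <set> means that no set argument is sent at all. The helper names are hypothetical. The request layout follows the OAI-PMH ListRecords verb, but it is an assumption that the harvester appends the query string directly to the task URL and asks for the oai_dc metadata format.

```python
from urllib.parse import urlencode

DEFAULT_PREFIX = "link-"  # default filename prefix named in this section


def harvest_options(task):
    """Resolve the optional <prefix> and <set> values of a task dict.

    Missing or empty values yield the default prefix and no set (None).
    """
    prefix = (task.get("prefix") or "").strip() or DEFAULT_PREFIX
    set_spec = (task.get("set") or "").strip() or None
    return prefix, set_spec


def list_records_url(base_url, set_spec=None, metadata_prefix="oai_dc"):
    """Build an OAI-PMH ListRecords request URL (assumed request layout)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        # The set argument is only sent when a non-empty set is configured.
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


task = {"url": "http://tkl-cvs.indexdata.dk/bibliotheca/",
        "prefix": "", "set": ""}
prefix, set_spec = harvest_options(task)
request_url = list_records_url(task["url"], set_spec)
print(prefix)
print(request_url)
```

Leaving the set argument out entirely, rather than sending an empty value, keeps the request valid for repositories that do not support sets.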
When an oai task file is saved, a spool file is automatically
placed in the /var/spool/tkl directory,
and the OAI harvester picks up and performs the job within a
couple of minutes. While the job executes, the status tag
changes from "pending" via "running" to "finished", and
once the job has finished, the spool file is removed.
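A script that needs to wait for a job can watch the status tag go through the states described above. The following is a minimal sketch, assuming the task file is readable on disk at a path you know; the functions read_status and wait_until_finished are hypothetical, and the sketch does not guard against reading the file while it is being rewritten.

```python
import time
import xml.etree.ElementTree as ET


def read_status(task_path):
    """Return the text of the <status> tag of a task file ("" if absent)."""
    root = ET.parse(task_path).getroot()
    return (root.findtext("status") or "").strip()


def wait_until_finished(task_path, timeout=600.0, interval=10.0,
                        sleep=time.sleep, clock=time.monotonic):
    """Poll the task file until its status is "finished".

    Returns True when the job has finished and False when the timeout
    (in seconds) expires first. sleep and clock are injectable for tests.
    """
    deadline = clock() + timeout
    while True:
        if read_status(task_path) == "finished":
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


# Example call (the path is a placeholder):
# wait_until_finished("/path/to/portal/tasks/my_oai_task.tkl")
```

Because the harvester picks up jobs within a couple of minutes, a polling interval of a few seconds is more than sufficient.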
The harvested OAI metadata records can be inspected by
pointing the usual user web interface at
bibliotheca/tasks/oaitklite, where all
records are displayed in the order in which they were fetched. Clicking on the
first link of a record displays more details about it.
Although the OAI records are indexed on system boot, or when
running
/etc/init.d/tkl index
they are initially marked "hidden" and are not
displayed in search result sets.