Connector file format
Description
The XML format for connectors is kept as simple as possible: no namespaces are used, for example. The format is as follows:
The top-level element is connector.
The connector contains an optional metaData block, zero or more property elements and one or more tasks.
The metaData block, if present, contains information about the connector. It carries no attributes and contains zero or more meta elements.
Each meta element carries name and content attributes, representing a name/value pair, and is otherwise empty.
Each property carries type and name attributes, and contains its value: at present, this is always text, but in principle it could be any XML.
Each task carries a name attribute, which is set to init, search, parse or next, and contains zero or more steps and zero or more tests.
Each step carries a name attribute (which should be called type, as it specifies the type of the step), a version attribute (which states which version of the step code was in use when the connector was saved, and therefore the format of the step-specific configuration), and an optional alt attribute, which if present must be set to yes (and not, for some reason, true).
The step element may contain any XML as necessary to contain the step-specific configuration: its interpretation is guided by the name attribute, which is like the type member of a discriminated union.
Each test carries a name attribute, and contains zero or more arg elements and zero or more assert elements.
Each arg carries name and value attributes, representing a name/value pair, and is otherwise empty. (It is suspiciously similar to the meta element, in fact.)
Each assert carries path and value attributes, representing the assertion that after running the containing task with the specified arguments, the part of the result structure specified by the path matches the regular expression that is the value.
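As a minimal sketch of how these elements nest, the fragment below puts a property, an alt step and a test with an assert into one small connector. The title, the URL, the property name exampleFlag and the assert's path and value are hypothetical, chosen only to illustrate the shape. In particular, the syntax of path is not specified here, so path="hits" is an assumption.

```xml
<connector>
  <metaData>
    <meta name="title" content="Example connector"/>
  </metaData>
  <!-- Hypothetical property: the name and value are invented for illustration -->
  <property type="bool" name="exampleFlag">true</property>
  <task name="search">
    <step name="nav_to" version="0.3">
      <stepConf type="object">
        <url type="string">http://example.com/</url>
      </stepConf>
    </step>
    <!-- alt, when present, must be "yes" -->
    <step alt="yes" name="set_result" version="0.1">
      <stepConf type="object">
        <constant type="string">0</constant>
        <result type="string">hits</result>
      </stepConf>
    </step>
    <test name="Default">
      <arg name="keyword" value="water"/>
      <!-- Assumed path syntax: the "hits" part of the result should match this regular expression -->
      <assert path="hits" value="^[0-9]+$"/>
    </test>
  </task>
</connector>
```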
Example
The following XML represents the most recent version of the connector for the Library of Congress’s online bibliographic catalogue.
<connector>
  <metaData>
    <meta name="title" content="Library of Congress"/>
    <meta name="author" content="Index Data ApS"/>
    <meta name="date" content=""/>
    <meta name="note" content=""/>
    <meta name="url" content=""/>
  </metaData>
  <task name="search">
    <step name="nav_to" version="0.3">
      <stepConf type="object">
        <url type="string">http://catalog.loc.gov/cgi-bin/Pwebrecon.cgi?DB=local&amp;PAGE=First</url>
      </stepConf>
    </step>
    <step name="set_value" version="0.4">
      <stepConf type="object">
        <dest type="object">
          <xpath type="string">//input[@name="Search_Arg"]</xpath>
          <frames type="array"/>
        </dest>
        <param type="string">keyword</param>
      </stepConf>
    </step>
    <step name="click" version="0.2">
      <stepConf type="object">
        <target type="string">//td[2]/div/input[2]</target>
        <wait type="bool">true</wait>
      </stepConf>
    </step>
    <step name="regex_extract" version="0.1">
      <stepConf type="object">
        <regex type="string">.* of ([0-9]+)</regex>
        <matchNum type="string">1</matchNum>
        <node type="object">
          <xpath type="string">/html/body/div[@class="you-searched"]/table/tbody/tr[3]/td</xpath>
          <frames type="array"/>
        </node>
        <sourceAttribute type="string">textContent</sourceAttribute>
        <attributes type="array">
          <item type="string">align</item>
          <item type="string">style</item>
        </attributes>
        <result type="string">hits</result>
        <match_num type="number">0</match_num>
        <attr type="string">textContent</attr>
      </stepConf>
    </step>
    <step alt="yes" name="set_result" version="0.1">
      <stepConf type="object">
        <constant type="string">0</constant>
        <result type="string">hits</result>
      </stepConf>
    </step>
    <test name="Default">
      <arg name="keyword" value="water"/>
    </test>
  </task>
  <task name="init">
    <step name="nav_to" version="0.3">
      <stepConf type="object">
        <url type="string">http://catalog.loc.gov/</url>
      </stepConf>
    </step>
    <test name="Default"/>
  </task>
  <task name="next">
    <step name="click" version="0.2">
      <stepConf type="object">
        <target type="object">
          <xpath type="string">//form/div/table/tbody/tr/td/a/img[@alt="Next Screen or Record"]</xpath>
        </target>
        <wait type="bool">true</wait>
      </stepConf>
    </step>
    <test name="Default"/>
  </task>
  <task name="parse">
    <step name="parse_xpattern" version="0.1">
      <stepConf type="object">
        <xpattern type="string">TD { INPUT : A [@href $url] } : TD { IMG } : TD $author :
TD { A $title} : TD $date</xpattern>
        <hitarea type="object">
          <xpath type="string">/html/body/form/table[2]</xpath>
          <frames type="array"/>
        </hitarea>
        <xpatternhistory type="array">
          <item type="string">TD { INPUT : A [@href $url] } : TD { IMG } : TD $author :
TD { A $title} : TD $date</item>
        </xpatternhistory>
      </stepConf>
    </step>
    <test name="Default"/>
  </task>
</connector>
Relax-NG Compact schema
The format of connector XML is formally constrained by the following schema, expressed in the efficient and readable Relax-NG Compact format. This schema is also available in Relax-NG XML format and in the horrible, bloated, impenetrable W3C XML Schema language if you insist.
Careful readers will note that the textual description of the XML format at the top of this page is pretty much identical with the Relax-NG schema down here. We should make a tool that automatically generates prose from the schema. But I didn’t.
start = element connector {
  element metaData { meta* }?,
  property*,
  task+
}
meta = element meta {
  attribute name { text },
  attribute content { text }
}
property = element property {
  attribute type { "bool" },
  attribute name { text },
  text
}
task = element task {
  attribute name { text },
  step*,
  test*
}
step = element step {
  attribute name { text },
  attribute version { text },
  attribute alt { "yes" }?,
  element stepConf {
    attribute type { "object" },
    ANY
  }
}
test = element test {
  attribute name { text },
  element arg {
    attribute name { text },
    attribute value { text }
  }*,
  element assert {
    attribute path { text },
    attribute value { text }
  }*
}
# This macro is stolen from trang's output when fed a DTD with ANY.
# It's not ideal because it's unlikely that trang can recognise the
# idiom and give the appropriate translation back into DTD or XML
# Schema, but it works
ANY = (element * { attribute * { text }*, ANY } | text)*
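
Read literally, the schema requires only a connector element containing at least one task, and a task may have no steps and no tests. The smallest document it accepts should therefore look like the sketch below. Whether such a connector is useful, or accepted by the tools that consume connectors, is not something the schema says; the task name init is simply taken from the list in the description.

```xml
<connector>
  <!-- metaData and property are optional; one task is the minimum -->
  <task name="init"/>
</connector>
```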
