Connector file format

Description

The XML format for connectors is kept as simple as possible: no namespaces are used, for example. The format is as follows:

  • The top-level element is connector.

  • The connector contains an optional metaData block, zero or more property elements and one or more tasks.

  • The metaData block, if present, contains information about the connector. It carries no attributes and contains zero or more meta elements.

  • Each meta element carries name and content attributes, representing a name/value pair, and is otherwise empty.

  • Each property carries type and name attributes, and contains its value: at present, this is always text, but in principle it could be any XML.

  • Each task carries a name attribute, which is set to init, search, parse or next, and contains zero or more steps and zero or more tests.

  • Each step carries a name attribute (which should really be called type, as it specifies the type of the step), a version attribute (which states which version of the step code was in use when the connector was saved, and therefore the format of the step-specific configuration), and an optional alt attribute, which if present must be set to yes (and not, for some reason, true).

  • The step element contains a single stepConf element, which may itself contain any XML needed to hold the step-specific configuration: its interpretation is guided by the step's name attribute, which acts like the type member of a discriminated union.

  • Each test carries a name attribute, and contains zero or more arg elements and zero or more assert elements.

  • Each arg carries name and value attributes, representing a name/value pair, and is otherwise empty. (It is suspiciously similar to the meta element, in fact.)

  • Each assert carries path and value attributes, representing the assertion that, after running the containing task with the specified arguments, the part of the result structure specified by the path matches the regular expression that is the value.
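
The full example in the next section happens to contain no property or assert elements, so here is a minimal sketch of how those parts fit together under the rules above. The property name (debug), the assert path (hits) and the regular expression are hypothetical values chosen for illustration: in particular, the syntax of path is not specified on this page, and a bare result name is only an assumption.

```xml
<connector>
  <!-- "debug" is a hypothetical property name; the schema only permits type="bool" -->
  <property type="bool" name="debug">true</property>
  <task name="search">
    <test name="Default">
      <!-- run the search task with keyword=water ... -->
      <arg name="keyword" value="water"/>
      <!-- ... and assert that the "hits" part of the result is all digits.
           The path syntax here is an assumption, not taken from the format. -->
      <assert path="hits" value="^[0-9]+$"/>
    </test>
  </task>
</connector>
```

Note the ordering: properties come before tasks, and within a test all arg elements come before any assert elements.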

Example

The following XML represents the most recent version of the connector for the Library of Congress’s online bibliographic catalogue.

<connector>
  <metaData>
    <meta name="title" content="Library of Congress"/>
    <meta name="author" content="Index Data ApS"/>
    <meta name="date" content=""/>
    <meta name="note" content=""/>
    <meta name="url" content=""/>
  </metaData>
  <task name="search">
    <step name="nav_to" version="0.3">
      <stepConf type="object">
        <url type="string">http://catalog.loc.gov/cgi-bin/Pwebrecon.cgi?DB=local&amp;PAGE=First</url>
      </stepConf>
    </step>
    <step name="set_value" version="0.4">
      <stepConf type="object">
        <dest type="object">
          <xpath type="string">//input[@name="Search_Arg"]</xpath>
          <frames type="array"/>
        </dest>
        <param type="string">keyword</param>
      </stepConf>
    </step>
    <step name="click" version="0.2">
      <stepConf type="object">
        <target type="string">//td[2]/div/input[2]</target>
        <wait type="bool">true</wait>
      </stepConf>
    </step>
    <step name="regex_extract" version="0.1">
      <stepConf type="object">
        <regex type="string">.* of ([0-9]+)</regex>
        <matchNum type="string">1</matchNum>
        <node type="object">
          <xpath type="string">/html/body/div[@class="you-searched"]/table/tbody/tr[3]/td</xpath>
          <frames type="array"/>
        </node>
        <sourceAttribute type="string">textContent</sourceAttribute>
        <attributes type="array">
          <item type="string">align</item>
          <item type="string">style</item>
        </attributes>
        <result type="string">hits</result>
        <match_num type="number">0</match_num>
        <attr type="string">textContent</attr>
      </stepConf>
    </step>
    <step alt="yes" name="set_result" version="0.1">
      <stepConf type="object">
        <constant type="string">0</constant>
        <result type="string">hits</result>
      </stepConf>
    </step>
    <test name="Default">
      <arg name="keyword" value="water"/>
    </test>
  </task>
  <task name="init">
    <step name="nav_to" version="0.3">
      <stepConf type="object">
        <url type="string">http://catalog.loc.gov/</url>
      </stepConf>
    </step>
    <test name="Default"/>
  </task>
  <task name="next">
    <step name="click" version="0.2">
      <stepConf type="object">
        <target type="object">
          <xpath type="string">//form/div/table/tbody/tr/td/a/img[@alt="Next Screen or Record"]</xpath>
        </target>
        <wait type="bool">true</wait>
      </stepConf>
    </step>
    <test name="Default"/>
  </task>
  <task name="parse">
    <step name="parse_xpattern" version="0.1">
      <stepConf type="object">
        <xpattern type="string">TD { INPUT : A [@href $url] } : TD { IMG } : TD  $author :
                                    TD  { A  $title}  : TD  $date</xpattern>
        <hitarea type="object">
          <xpath type="string">/html/body/form/table[2]</xpath>
          <frames type="array"/>
        </hitarea>
        <xpatternhistory type="array">
          <item type="string">TD { INPUT : A [@href $url] } : TD { IMG } : TD  $author :
                                  TD  { A  $title}  : TD  $date</item>
        </xpatternhistory>
      </stepConf>
    </step>
    <test name="Default"/>
  </task>
</connector>

Relax-NG Compact schema

The format of connector XML is formally constrained by the following schema, expressed in the efficient and readable Relax-NG Compact format. This schema is also available in Relax-NG XML format and in the horrible, bloated, impenetrable W3C XML Schema language if you insist.

Careful readers will note that the textual description of the XML format at the top of this page is pretty much identical to the Relax-NG schema down here. We should make a tool that automatically generates prose from the schema. But I didn’t.

start = element connector {
    element metaData { meta* }?,
    property*,
    task+
}

meta = element meta { 
    attribute name { text },
    attribute content { text }
}

property = element property {
    attribute type { "bool" },
    attribute name { text },
    text
}

task = element task {
    attribute name { text },
    step*,
    test*
}

step = element step {
    attribute name { text },
    attribute version { text },
    attribute alt { "yes" }?,
    element stepConf {
        attribute type { "object" },
        ANY
    }
}

test = element test {
    attribute name { text },
    element arg {
        attribute name { text },
        attribute value { text }
    }*,
    element assert {
        attribute path { text },
        attribute value { text }
    }*
}


# This macro is stolen from trang's output when fed a DTD with ANY.
# It's not ideal because it's unlikely that trang can recognise the
# idiom and give the appropriate translation back into DTD or XML
# Schema, but it works
ANY = (element * { attribute * { text }*, ANY } | text)*
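
One consequence of the schema worth spelling out: metaData, property, step and test are all optional, and only task is required. Reading the grammar above, the smallest document it accepts should therefore be a connector holding a single empty task. This is a sketch of what the schema permits, not a connector that does anything useful:

```xml
<connector>
  <!-- task+ requires at least one task; step* and test* allow it to be empty -->
  <task name="init"/>
</connector>
```

Since the schema declares the task name as plain text, it would also accept names other than init, search, parse and next; that restriction exists only in the prose description.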