About

The StatisticsPlugin logs selected Service Proxy requests in the Common Log Format (CLF).

It tracks polling status so that actual end-user requests can be counted.

To keep the log file as small as possible, the plug-in attempts to log only one request per polling sequence. For instance, a full view request appears as one request even though a series of polls is executed against the Service Proxy.

By default, the plug-in currently also omits insignificant requests from the log: 'termlist', 'show', 'bytarget', 'stat', and 'ping'.
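The default filtering described above can be pictured with a minimal sketch. This is not the plug-in's code; the function name should_log and its also_log parameter are hypothetical stand-ins for switches such as LOG_SHOW, LOG_TERMLIST and LOG_BYTARGET.

```python
# Commands the text above lists as omitted by default.
DEFAULT_OMITTED = {"termlist", "show", "bytarget", "stat", "ping"}


def should_log(command, also_log=frozenset()):
    """Return True if a request with this command would be written to the log.

    also_log is a hypothetical set of commands that were switched back on.
    """
    return command not in (DEFAULT_OMITTED - set(also_log))


print(should_log("search"))                   # search is not in the omitted set
print(should_log("show"))                     # omitted by default
print(should_log("show", also_log={"show"}))  # switched back on
```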

The targets searched are logged (for per-target statistics) by capturing the bytarget responses. The plug-in can also issue its own bytarget requests to generate this response, in case the client doesn't use bytarget requests (see the DO_BYTARGET_AFTER configuration parameter).
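As an illustration of how per-target statistics could be derived from a captured bytarget response, here is a minimal sketch using Python's standard XML parser. The function name target_hits is hypothetical, and the sample is a two-target excerpt in the shape of the bytarget response shown in the log example in this document.

```python
import xml.etree.ElementTree as ET


def target_hits(bytarget_xml):
    """Map each target id to its hit count, given a bytarget response string."""
    # Parse from bytes so the XML declaration's encoding attribute is honoured.
    root = ET.fromstring(bytarget_xml.encode("utf-8"))
    return {
        target.findtext("id"): int(target.findtext("hits", default="0"))
        for target in root.findall("target")
    }


sample = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    "<bytarget><status>OK</status>"
    "<target><id>z3950.loc.gov:7090/voyager</id><name>Library of Congress</name>"
    "<hits>640</hits><diagnostic>0</diagnostic><records>100</records>"
    "<state>Client_Idle</state></target>"
    "<target><id>connect.indexdata.com:9000/ted_talks</id><name>TED Talks</name>"
    "<hits>1</hits><diagnostic>0</diagnostic><records>1</records>"
    "<state>Client_Idle</state></target>"
    "</bytarget>"
)

for target_id, hits in target_hits(sample).items():
    print(target_id, hits)
```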

Clicks on external links out of MK2 can be logged if they are proxied through the ClickOutPlugin. The chain would have to be set up as "chains.clickout = statistics,clickout". The statistics plug-in must appear first in the chain because the ClickOut plug-in performs a redirect, which breaks the chain.

Record requests and target sets pertain to a given search in Pazpar2. A given record request entry in the log can be linked to its search by its Pazpar2 session ID and the search sequence number. The same goes for target set entries.
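A minimal sketch of that linking, assuming the log lines have already been parsed into dictionaries. The key names pz2_session, search_seq and request are hypothetical, and the sample values are illustrative only.

```python
from collections import defaultdict


def group_by_search(entries):
    """Group request strings under (Pazpar2 session ID, search sequence number)."""
    groups = defaultdict(list)
    for entry in entries:
        key = (entry["pz2_session"], entry["search_seq"])
        groups[key].append(entry["request"])
    return dict(groups)


# Hypothetical, already-parsed entries (values are illustrative only).
entries = [
    {"pz2_session": "1001", "search_seq": "1", "request": "command=search&query=a"},
    {"pz2_session": "1001", "search_seq": "1", "request": "command=record&id=x"},
    {"pz2_session": "1001", "search_seq": "2", "request": "command=search&query=b"},
]

for key, requests in group_by_search(entries).items():
    print(key, requests)
```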

Log entries are not guaranteed to appear in the log in the order in which the requests were executed. When the logger registers a record response (for example) that is not yet complete, it waits to see whether a newer record response comes in and, if so, discards the first. If no newer record response comes in, perhaps because the user abandoned the request for a new search, the logger logs the last incomplete response after a (configurable) delay.
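The effect on ordering can be seen in a small simulation. This is a sketch of the behaviour described above, not the plug-in's implementation; the class DelayedLogger, its methods and the explicit millisecond clock are all assumptions made for illustration.

```python
class DelayedLogger:
    """Hold incomplete entries briefly so that newer ones can replace them."""

    def __init__(self, delay_msecs):
        self.delay_msecs = delay_msecs
        self.pending = {}  # key -> (registered_at_msecs, entry)
        self.logged = []   # entries in the order they were written

    def register(self, key, entry, complete, now_msecs):
        if complete:
            # A completed entry is written at once; any older incomplete
            # entry under the same key is discarded.
            self.pending.pop(key, None)
            self.logged.append(entry)
        else:
            # A newer incomplete entry replaces the older one.
            self.pending[key] = (now_msecs, entry)

    def flush(self, now_msecs):
        """Write incomplete entries that have waited for the full delay."""
        for key, (registered_at, entry) in list(self.pending.items()):
            if now_msecs - registered_at >= self.delay_msecs:
                self.logged.append(entry)
                del self.pending[key]


logger = DelayedLogger(delay_msecs=10000)
logger.register("record", "record #1 (open)", complete=False, now_msecs=0)
logger.register("record", "record #2 (open)", complete=False, now_msecs=2000)
logger.register("search", "search B", complete=True, now_msecs=3000)
logger.flush(now_msecs=5000)    # too early, nothing is written
logger.flush(now_msecs=12000)   # record #2 has now waited 10000 ms
print(logger.logged)  # ['search B', 'record #2 (open)']
```

The record request was issued before the new search, yet it is written after it, which is why the log order can differ from the execution order.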

Protocol

There is no particular request syntax associated with the statistics plug-in. It can be included in any chain.

It does not affect the response either.

The output consists of log statements in log files, one file per virtual host on the Service Proxy installation.

The format of the log lines will be:

  • CLF elements
    • remote host (ip address of the client)
    • remote logname (always '-', not available)
    • user name and realm
    • [time stamp]
    • "user's request" OR by-target results
    • HTTP status code (always '-', not available)
    • Content length (always '-', not available)
  • Additional elements
    • HTTP session ID (intended for session count)
    • Pazpar2 session ID
    • Search sequence number - the sequence number of current search of the HTTP session
    • Async status of the request ('open','done', or '-' depending on status)

    Example

    127.0.0.1 - demo:mk_demo [2011-07-13 11:47:16] "GET /service-proxy/command=search&query=wild%20horses&windowid=mkid_1310561512790&filter=&torusquery=&recordfilter= HTTP/1.1" - - 975396221B98163263395ED0D1E159EA 499020197 3 -
    127.0.0.1 - demo:mk_demo [2011-07-13 11:47:38] "TARGETS <?xml version="1.0" encoding="UTF-8"?><bytarget><status>OK</status><target><id>connect.indexdata.com:9000/hathitrust</id><name>Hathi Trust Digital Library</name><hits>70</hits><diagnostic>0</diagnostic><records>70</records><state>Client_Idle</state></target><target><id>connect.indexdata.com:9000/scout_report_archives</id><name>Scout Report Archives</name><hits>1</hits><diagnostic>0</diagnostic><records>1</records><state>Client_Idle</state></target><target><id>connect.indexdata.com:9000/mit_opencourseware</id><name>MIT OpenCourseWare</name><hits>103</hits><diagnostic>0</diagnostic><records>100</records><state>Client_Idle</state></target><target><id>z3950.loc.gov:7090/voyager</id><name>Library of Congress</name><hits>640</hits><diagnostic>0</diagnostic><records>100</records><state>Client_Idle</state></target><target><id>connect.indexdata.com:9000/ted_talks</id><name>TED Talks</name><hits>1</hits><diagnostic>0</diagnostic><records>1</records><state>Client_Idle</state></target><target><id>library.metmuseum.org:210/INNOPAC</id><name>Metropolitan Museum of Art - WATSONLINE</name><hits>14</hits><diagnostic>0</diagnostic><records>14</records><state>Client_Idle</state></target><target><id>connect.indexdata.com:9000/ia_texts</id><name>Internet Archive Text Collection</name><hits>31</hits><diagnostic>0</diagnostic><records>31</records><state>Client_Idle</state></target><target><id>catalog-lib.dartmouth.edu:210/innopac</id><name>Dartmouth College Library</name><hits>12</hits><diagnostic>0</diagnostic><records>12</records><state>Client_Idle</state></target><target><id>z3950.franklin.library.upenn.edu:7090/voyager</id><name>University of Pennsylvania Libraries</name><hits>76</hits><diagnostic>0</diagnostic><records>76</records><state>Client_Idle</state></target><target><id>catalog.library.cornell.edu:7090/voyager</id><name>Cornell University Library</name><hits>165</hits><diagnostic>0</diagnostic><records>100</records><state>Client_Idle</state></target><target><id>josiah.brown.edu:210/innopac</id><name>Brown University Library</name><hits>22</hits><diagnostic>0</diagnostic><records>22</records><state>Client_Idle</state></target><target><id>connect.indexdata.com:9000/harpers_magazine</id><name>Harper's Magazine</name><hits>200</hits><diagnostic>0</diagnostic><records>100</records><state>Client_Idle</state></target><target><id>connect.indexdata.com:9000/nih_clinical_trials</id><name>NIH Clinical Trials</name><hits>0</hits><diagnostic>0</diagnostic><records>0</records><state>Client_Idle</state></target><target><id>connect.indexdata.com:9000/open_library_ebooks</id><name>Open Library eBooks</name><hits>78</hits><diagnostic>0</diagnostic><records>78</records><state>Client_Idle</state></target><target><id>clio-db.cc.columbia.edu:7090/voyager</id><name>Columbia University Library</name><hits>241</hits><diagnostic>0</diagnostic><records>100</records><state>Client_Idle</state></target></bytarget> HTTP/1.1" - - 975396221B98163263395ED0D1E159EA 499020197 3 done
    127.0.0.1 - demo:mk_demo [2011-07-13 11:52:28] "GET /service-proxy/command=record&id=author%20francis%20dick%20title-complete%20wild%20horses&recordquery=ti%3D%22Wild%20horses%22%20and%20au%3D%22Francis%2C%20Dick%22&windowid=mkid_1310561512790 HTTP/1.1" - - 975396221B98163263395ED0D1E159EA 499096486 3 done
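
For post-processing, a log line in the format above could be split into its fields with a regular expression. The following is a minimal sketch: the field names and the function parse_line are hypothetical, the layout is inferred from the field list and examples above, and the pattern assumes the user name and realm contain no spaces. The sample request is abbreviated from the first example line.

```python
import re
from datetime import datetime

# Assumed layout: seven CLF elements followed by the four additional elements.
# The request is matched greedily because TARGETS entries contain inner quotes.
LINE_RE = re.compile(
    r'^(?P<host>\S+) (?P<logname>\S+) (?P<user>\S+) '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<request>.*)" '
    r'(?P<status>\S+) (?P<length>\S+) '
    r'(?P<http_session>\S+) (?P<pz2_session>\S+) '
    r'(?P<search_seq>\S+) (?P<async_status>\S+)$'
)


def parse_line(line):
    """Return the fields of one statistics log line as a dict, or None."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    fields["timestamp"] = datetime.strptime(
        fields["timestamp"], "%Y-%m-%d %H:%M:%S"
    )
    return fields


line = (
    '127.0.0.1 - demo:mk_demo [2011-07-13 11:47:16] '
    '"GET /service-proxy/command=search&query=wild%20horses HTTP/1.1" '
    '- - 975396221B98163263395ED0D1E159EA 499020197 3 -'
)
entry = parse_line(line)
print(entry["user"], entry["pz2_session"], entry["search_seq"])
```

Counting distinct http_session values then gives a session count, and the pair (pz2_session, search_seq) identifies the search an entry belongs to.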

Configuration

This plug-in is configured in service-proxy.properties or equivalent as follows:

Registration

plugins.statistics = com.indexdata.serviceproxy.plugins.StatisticsPlugin

Usage in chain

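The statistics plug-in can be inserted in any chain, usually at the end of the chain. In the clickout chain it goes first, because the ClickOut plug-in performs a redirect that breaks the chain.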

For example:

  chains.*      = relay,statistics
  chains.auth   = authn
  chains.categories = categories
  chains.record = relay,ace,statistics
  chains.clickout = statistics,clickout  

Mandatory properties

# None   

Optional properties

statistics.LOG_DIRECTORY = /var/log/masterkey/service-proxy  # Optional: Default:
                                                             #  '/var/log/masterkey/service-proxy'
statistics.MAX_BACKUP_INDEX = 5                              # Optional: Default 5
statistics.MAX_FILE_SIZE = 10000KB                           # Optional: Default 10000KB
statistics.TARGET_LOGGING_DELAY_MSECS = 10000                # Optional: How long to wait before logging incomplete bytarget process
statistics.RECORD_LOGGING_DELAY_MSECS = 10000                # Optional: How long to wait before logging incomplete record process
statistics.LOG_SHOW = FALSE                                  # Optional: Log 'show' requests? Default FALSE
statistics.LOG_TERMLIST = FALSE                              # Optional: Log 'termlist' requests? Default FALSE
statistics.LOG_BYTARGET = FALSE                              # Optional: Log 'bytarget' requests? Default FALSE 
                                                             #           (will not affect the special logging of bytarget responses) 
statistics.DO_BYTARGET_AFTER = show                          # Optional: Default never
                                                             #           This option can be used to have the plug-in issue a bytarget
                                                             #           request in cases where the customer needs per-target statistics
                                                             #           but the client does not make regular bytarget requests after
                                                             #           searches. If it's known, for instance, that there is always a
                                                             #           show after a search, the internal bytarget request can be bound
                                                             #           to follow that with this option. It's also possible to set it to
                                                             #           'search,show' to ensure that any search is followed by a
                                                             #           bytarget request even when no show request is ever made.