Code4lib 2011 Report -- Part 2

Code4lib 2011 in Bloomington, IN – Part 2

Good things come to those who wait! Here’s the Code4Lib 2011 Report, Part 2. I toyed with the idea of postponing it indefinitely and having you impatiently checking the Index Data blog’s RSS feed, but Higher Powers persuaded me otherwise :). Anyway, since there’s still some time until the next edition of our favorite conference, you can use this report to refresh your memory or to get a taste of things to come…

Conference Day 2

Day 2 was the toughest day of the conference (albeit for reasons totally unrelated to the talks, and rather having to do with the reception and beer tasting the day before). It started off with “A Community-Based Approach to Developing a Digital Exhibit at Notre Dame Using the Hydra Framework” by Rick Johnson and Dan Brubaker Horst from Notre Dame. They talked about why they adopted the Hydra framework instead of building a home-grown solution. Hydra is a Ruby on Rails based framework that integrates well-known technologies (the Fedora digital repository, Solr, Blacklight) into a DAM (digital asset management) platform by providing RoR plugins for all the components (Hydra Heads). The Notre Dame folks use it to create an “exhibit” website for each of their collections; the collections share a lot of similarities but still require a certain level of customization. Slides can be accessed here.

Next on the stand were Margaret Heller and Nell Taylor, talking about the Chicago Underground Library’s community-based cataloging system. CUL is a very interesting model for “community collections”, where volunteer catalogers create entries for items that usually come with sparse metadata or none at all – items that may be semi-official publications known or accessible only locally in one of Chicago’s neighborhoods. For a collection like that to make sense, catalogers must provide enough context for each item, which usually requires much more data than is available in the publication itself. This context may include stories behind the publication, its exact location (Chicago is a city of neighborhoods), its timeframe, or people involved with the publication at some point (typesetters, photographers, etc.). CUL also welcomes comments and content from users, although the site does not operate on a typical crowd-sourcing model – comments and user-supplied data are curated by the volunteer catalogers. To support their mission they have developed a unique cataloging and discovery system using Drupal, which they hope to eventually release as a standalone module that any organization can use as both a technical and theoretical template to start an Underground Library in its own city.

As stated in the talk, “CUL is a replicable model for community collections [...] It uses the lens of an archive to examine the creative, political, and intellectual interdependencies of a city, tracing how people have worked together, who influenced whom, where ideas first developed, and how they spread from one publication to another through individuals.”

Slides from the CUL presentation are available here, and the cataloging handout (which I guess is what the volunteers use) here.

Gabriel Farrell, who presented on Node.js the previous year, talked a bit about CouchDB. If you don’t know what that is – it’s a DB of the NoSQL breed, written in Erlang (a language invented at Ericsson in Sweden to program telecommunication switches) for high availability and fault tolerance. The cool features of CouchDB include an HTTP API, JavaScript support for creating views, joins and whatnot (key/value JSON objects are first-class Couch citizens) using CommonJS (the lingua franca of server-side JS applications), built-in versioning, flexible validation – and the list goes on. Gabriel talked about programming CouchApps – applications that are stored within CouchDB itself (they are just a mixture of JavaScript and HTML5), which opens up some cool possibilities: webapps that are served directly from the DB, execute on the client, may share code with the server-side logic (so-called Couch views), and get all the goodies of any other data stored in Couch. You can read more about CouchApps here.
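To make the “everything over HTTP” point concrete, here is a minimal sketch of talking to CouchDB from Python – creating a database, storing a document, and defining and querying a JavaScript view. It assumes a local CouchDB on the default port with no authentication; the database, document and view names are made up for illustration.

```python
# A minimal sketch of CouchDB's HTTP API; database, document and view
# names are hypothetical, and a stock local CouchDB is assumed.
import requests

BASE = "http://localhost:5984"

# Databases are created with a simple PUT.
requests.put(f"{BASE}/books")

# Store a JSON document; plain key/value JSON objects are first-class citizens.
requests.put(f"{BASE}/books/moby-dick",
             json={"title": "Moby Dick", "type": "novel"})

# A view's map function is plain JavaScript, stored inside a design document.
design = {
    "views": {
        "by_type": {
            "map": "function(doc) { if (doc.type) emit(doc.type, doc.title); }"
        }
    }
}
requests.put(f"{BASE}/books/_design/app", json=design)

# Query the view over HTTP like any other resource (keys are JSON-encoded).
resp = requests.get(f"{BASE}/books/_design/app/_view/by_type",
                    params={"key": '"novel"'})
print(resp.json())
```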

After a short break we listened to Matt Zumwalt talking about the Opinionated Metadata (OM) project – a Ruby gem that wraps around your XML data model (using the Nokogiri Ruby XML library) and allows you to call things the way you want to call them, independently of the actual underlying metadata format (MODS, DC, whatever). It helps to standardize the terminology used throughout your application, settle on terms applicable to the given domain and user stories, and shield your application when the technological requirements change.
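OM itself is Ruby, but the underlying idea – a single table mapping friendly domain terms to schema-specific XPaths, so application code never mentions MODS or DC directly – translates easily. Here is a rough Python sketch of that idea (not OM’s actual API); the term names and XPaths are hypothetical.

```python
# A rough sketch of the terminology-mapping idea behind OM (the gem itself
# is Ruby); term names and XPaths here are hypothetical.
import xml.etree.ElementTree as ET

MODS_NS = {"mods": "http://www.loc.gov/mods/v3"}

# One table maps domain terms to schema-specific paths; swap this table
# out if the underlying metadata format changes.
TERMINOLOGY = {
    "title":   "mods:titleInfo/mods:title",
    "creator": "mods:name/mods:namePart",
}

class Record:
    def __init__(self, xml_string):
        self.root = ET.fromstring(xml_string)

    def term(self, name):
        """Resolve a friendly term to its values, hiding the actual XPath."""
        path = TERMINOLOGY[name]
        return [el.text for el in self.root.findall(path, MODS_NS)]

mods = """<mods xmlns="http://www.loc.gov/mods/v3">
  <titleInfo><title>Moby Dick</title></titleInfo>
  <name><namePart>Melville, Herman</namePart></name>
</mods>"""

rec = Record(mods)
print(rec.term("title"))    # ['Moby Dick']
print(rec.term("creator"))  # ['Melville, Herman']
```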

Ben Anderson from the eXtensible Catalog (XC) project talked about the core part of XC, the Metadata Services Toolkit (MST). The MST is a Java web application that contains a harvest scheduler capable of harvesting records from OAI-PMH endpoints, persisting them in a local MySQL instance, indexing them in Solr, and running the transformations necessary to clean and FRBRize them. Ben talked about his experiences with tuning the performance of the MST – they tried to get from the initial capability of processing 120 thousand records per hour to one million records per hour. Most performance improvements were achieved by moving the core record store from Solr to MySQL (Solr is still way slower than a relational DB when it comes to update speed) and keeping Solr only for text indexing. During the rest of his talk he covered various tricks to improve performance within MySQL (splitting tables into multiple DBs, dropping indices before inserts, using prepared/batch statements, etc.), but surprisingly the final, most performant solution used CSV files dumped from Java and loaded directly into MySQL (sic!). Slides.
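For a flavor of the two loading strategies, here is a hedged sketch in Python (the MST’s actual code is Java): batch inserts via prepared statements, versus dumping to CSV and letting MySQL bulk-load it with LOAD DATA INFILE. Table names, column names and credentials are all hypothetical.

```python
# A sketch of the two loading strategies mentioned above; table and column
# names are hypothetical, and the talk's actual implementation was Java.
import csv
import mysql.connector  # pip install mysql-connector-python

conn = mysql.connector.connect(user="xc", password="xc",
                               database="mst", allow_local_infile=True)
cur = conn.cursor()

records = [("oai:example.org:1", "<record>...</record>"),
           ("oai:example.org:2", "<record>...</record>")]

# Strategy 1: prepared/batch statements -- much faster than row-at-a-time.
cur.executemany("INSERT INTO records (oai_id, xml) VALUES (%s, %s)", records)
conn.commit()

# Strategy 2: dump to CSV and let MySQL bulk-load it -- the approach that
# ended up fastest in the talk.
with open("records.csv", "w", newline="") as f:
    csv.writer(f).writerows(records)

cur.execute("""
    LOAD DATA LOCAL INFILE 'records.csv'
    INTO TABLE records
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
    (oai_id, xml)
""")
conn.commit()
```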

Before heading out for lunch we had Dan Chudnov facilitating “Ask Anything!”, AKA the “Human Search Engine” – an open discussion where Code4Libbers could ask anything that was on their minds: “questions seeking answers (short or long), requests for things (hardware, software, skills, or help), or offers of things”. This was the second time it has been done at Code4Lib.

Nearly all of the after-lunch talks used the word “micro-services” in one way or another. If there was any commonality in the use of this term, I guess most speakers meant “small, focused programs or services meant for a particular task that can be combined into larger software kits or suites”. Think UNIX bin utils and pipes.
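A tiny illustration of that UNIX analogy, composing two small single-purpose programs into a bigger one from Python; the input file name is made up.

```python
# Each small program does one job well; the pipe glues them together.
# "words.txt" is a hypothetical input file of one word per line.
import subprocess

sort = subprocess.Popen(["sort", "words.txt"], stdout=subprocess.PIPE)
uniq = subprocess.Popen(["uniq", "-c"], stdin=sort.stdout,
                        stdout=subprocess.PIPE)
sort.stdout.close()  # let sort receive SIGPIPE if uniq exits first
print(uniq.communicate()[0].decode())  # word counts, courtesy of two tools
```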

Tim Shearer of the University of North Carolina at Chapel Hill gave a talk (for Mike Graves) about “GIS on the cheap”. In essence, they are building a GIS (or what they call GeoBrowse) framework based on standard open-source tools like Solr, Postgres (and especially its ability to handle complex polygon queries) and OpenLayers (a JS library to load, display and render maps from multiple sources on web pages). They used this framework to create a dynamic interface for browsing and searching their digital collection geographically. Slides are here.
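To show what a polygon query buys you, here is a minimal sketch of a point-in-polygon lookup using Postgres’s built-in geometric types (I’m assuming plain Postgres geometry rather than their actual schema; the table and column names are hypothetical).

```python
# A minimal point-in-polygon lookup with Postgres's native geometric types;
# database, table and column names are hypothetical.
import psycopg2

conn = psycopg2.connect(dbname="geobrowse")
cur = conn.cursor()

# Which collection regions contain this point? The native polygon type
# supports the @> "contains" operator.
lon, lat = -86.5264, 39.1653  # Bloomington, IN
cur.execute(
    "SELECT name FROM regions WHERE boundary @> point(%s, %s)",
    (lon, lat),
)
for (name,) in cur.fetchall():
    print(name)
```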

Next we heard from Sean Hannan, talking about a “micro-services approach to the library website”. Instead of a big, fat CMS he uses cute little tools (which the author of this blog also highly recommends): jQuery BBQ (a back-button library that allows you to assign stateful URLs to various things on your website), Bonzai (a “compiler” for the whole website, kind of like make or ant), YUI Compressor (to minify JS), Lemonade (CSS sprites and positioning), PHP-Typography (prettifies your text), Compass (a CSS compiler that lets you do cool things like define variables, e.g. for colors, or share definitions) and Mustache (cool, logic-less templates). Check out the slides here. If you’re into those tiny tools I would also recommend checking out CoffeeScript, which gives you an awesome Ruby-like grammar and compiles into standard JavaScript.

Mark Matienzo of Manuscripts and Archives, Yale University Library, talked about various open-source tools for digital forensics built on the micro-services philosophy (Sleuth Kit, fiwalk). DF tools are important for digital archivists and digital curation practitioners. He also talked about Gumshoe, a Blacklight application he wrote that can take data from fiwalk and import it into Solr, so you can do basic searching and faceting on various file-level metadata. Slides.
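The fiwalk-to-Solr flow might look roughly like this sketch (this is not Gumshoe’s actual code): read file-level entries from fiwalk’s XML output and post them to Solr’s JSON update handler. The element names, Solr field names and core name are all assumptions for illustration.

```python
# A hedged sketch of a fiwalk -> Solr import (not Gumshoe's actual code);
# XML element names, Solr fields and the core name are assumptions.
import xml.etree.ElementTree as ET
import requests

docs = []
for i, fo in enumerate(ET.parse("fiwalk-output.xml").iter("fileobject")):
    docs.append({
        "id": str(i),
        "filename": fo.findtext("filename"),
        "filesize": fo.findtext("filesize"),
    })

# Solr accepts a JSON array of documents on its update handler.
requests.post("http://localhost:8983/solr/gumshoe/update",
              params={"commit": "true"},
              json=docs)
```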

The last talk of the day was by David Lacy from Villanova University. He talked about their home-grown digital library system (yes, he had a “reinvented wheel” picture on the first slide :)), based heavily on various XML technologies and tools – a METS metadata editor, eXist-db (an XML database), Orbeon Forms (an XML and XForms processor), XPL (an XML processing pipeline engine), and an OAI-PMH server (written in XQuery) – and using VuFind for the search user interface. Since their collection items are mostly scanned images, the system contains a series of services for image manipulation and OCR. David presented how their home-built solution fits their workflow perfectly, from scanning to online publishing. Slides.
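Their OAI-PMH server is XQuery, but the protocol itself is just HTTP plus XML, so any client can harvest from it. A minimal ListRecords request looks like this sketch; the endpoint URL is hypothetical.

```python
# A minimal OAI-PMH ListRecords harvest; the endpoint URL is hypothetical,
# the verb/metadataPrefix parameters are standard OAI-PMH.
import requests
import xml.etree.ElementTree as ET

NS = {"oai": "http://www.openarchives.org/OAI/2.0/",
      "dc": "http://purl.org/dc/elements/1.1/"}

resp = requests.get("http://example.org/oai",
                    params={"verb": "ListRecords",
                            "metadataPrefix": "oai_dc"})
root = ET.fromstring(resp.content)

for rec in root.findall(".//oai:record", NS):
    print(rec.findtext(".//dc:title", namespaces=NS))
```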

As usual, the conference day closed with Lightning Talks and breakout sessions.

Conference Day 3

I missed the closing day’s morning-only sessions of Code4Lib since I had to rush to the airport to catch my plane back to Denmark. I wish I hadn’t – the flight was canceled and I spent the whole day at the Indianapolis airport (pretty pleasant, btw), only managing to catch a late afternoon flight to Chicago. It took only three more (sic!) to get back home! But hey, don’t feel bad for me, or sad because you don’t get to read an awesome Third Day report – the video archives are still out there, available for your convenience. Thanks for checking in with Index Data’s blog and I hope to see you again on the Code4Lib circuit!