Infrastructure for a Bibliographic Network

In the quarter century that Index Data has been providing software solutions to libraries, one of the most common tasks we are asked to do is to aggregate the holdings of groups of libraries. Among our bag of tricks is software that builds unified catalogs (Zebra) and federated search catalogs (MasterKey). As always, the progress of technology means that what was new and innovative yesterday is necessary and presumed today, and based on recent inquiries to Index Data, we think union catalogs of arbitrary groups of libraries are reaching that latter stage. In this post, we describe a new way of thinking about aggregations of libraries’ records, which we are calling the Bibliographic Network.

The use cases for a Bibliographic Network are varied. Some libraries are looking for a mechanism to pool the records of consortium members so those records can be inserted into a member library’s self-hosted discovery layer as options presented to a patron for requesting. Others want to replace a consortium’s staff-mediated ILL system with a patron-initiated request and delivery system. Another group of libraries wants quick access to the collection retention decisions of peer libraries. Each of these cases presumes the existence of a shared index, and maintaining a shared index means recognizing that not everyone will arrive there at the same time, and that multiple parallel models, data formats, and systems will continue to exist, perhaps indefinitely.

In thinking about this at Index Data, we have designed a software system that can be deployed by any library, organization, or company to build an aggregation of metadata spanning an arbitrary group of member libraries. It can also be used to further share this metadata and create aggregations across consortia and group memberships, thus advancing experimentation already underway in the library community. Called the “Bibliographic Network”, the software system provides the means to build indexes as needed as well as to publish, in flat files or through linked data mechanisms, sets of bibliographic and authority data at any point in the network.

Conceptual Model

For the purpose of this discussion, the space we are modeling is the curation and sharing of bibliographic metadata, expressed in terms of entities such as titles and authors. The BIBFRAME model provides a convenient set of entities that map well to traditional bibliographic description: Works, Instances (i.e. editions, different formats), Items (physical, identifiable objects), Agents, Subjects, and Events. In practice, metadata about these entities can be exchanged using different mechanisms such as MARC file batch load, publishing BIBFRAME entities on the web, or through schema.org statements. Conceptually, we are looking to create software to help with the curation (and indexing!) of these entities. Crosswalks and mappings to and from different representations will be required to support different use cases, but for most of this document we make no assumptions about how the data might be stored and managed.

When thinking about the bibliographic ecosystem – and in particular a move towards a linked data ecosystem where maximum re-use of once-published entities is desirable (as opposed to each library copying and re-publishing all of the same bibliographic statements) – it is tempting to seek an ordered universe of actors with discretely scoped responsibilities:

In such a model, national libraries might be responsible for cataloging their national bibliography (all original publications in that country), while libraries might fill in the gaps through original cataloging and consortia might serve to aggregate and consolidate metadata across member libraries. In a perfectly organized universe, no title would be redundantly (copy) cataloged, and everyone would share the same universe of bibliographic metadata, with clear responsibilities and chains of provenance for individual bibliographic statements. Holdings information might be aggregated at any level, i.e. consortia might wish to aggregate detailed holdings information about member libraries. National or international utilities and agencies might choose to focus on metadata and not address holdings information.

In the real world, though, such a state is practically unattainable, and the path towards even a partial realization is unclear:

  • Different national libraries have vastly different mandates and resources, and organizations are rarely limited to a single role or a single set of clearly defined relationships.
  • National and state libraries are working libraries, as well as serving the role of utilities for other libraries.
  • Libraries frequently belong to more than one consortium.

The potential for confusion and inefficiency is high, especially since today these roles are poorly defined and existing software systems do not support them well.

However, we believe that it is possible to create a set of software and services that can support the concrete needs of today, while also carving an easier path towards a more efficient organization of the work of curating our bibliographic universe.

Bibliographic Ecosystem as a Network

As suggested above, a simple hierarchical, tree-structured view of the library universe breaks down because libraries often play more than one role. We propose to model the ecosystem instead as a set of nodes in a graph or network, with each node corresponding to a library or organization, and the edges corresponding to relationships between libraries determined by functional roles. We propose the following as a starting point for a set of roles germane to the collective maintenance of bibliographic metadata:

  • Contribute: A library (or other organization) engages in original cataloging, creating new entities when suitable ones cannot be identified in the network.
  • Aggregate: An organization gathers together metadata from other organizations. Aggregation is the complementary operation to syndication (described below). The aggregation function includes disambiguation, i.e. an attempt to resolve matching/overlapping entities.
  • Index: An organization creates an index of metadata that it has contributed and/or aggregated. Optionally, an index may be made available to others for re-use.
  • Publish: An organization publishes bibliographic entities that it has contributed or aggregated. For example, a consortium may publish the combined holdings of its member libraries. Publishing implies potentially minting new permanent identifiers (URIs) and should be done with care and in concert with local and national partner libraries.
  • Syndicate: An organization makes its set of metadata (contributed and/or aggregated) available to other nodes, e.g. to support a shared index or union catalog. This implies supporting a harvesting or ‘push’ mechanism to share updates.
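
To make the model concrete, the sketch below (TypeScript; every name is illustrative rather than part of any actual implementation) shows one way a node, its roles, and its syndication relationships to other nodes could be represented.

    // Hypothetical sketch of the node-and-roles model; names are illustrative only.
    type Role = "contribute" | "aggregate" | "index" | "publish" | "syndicate";

    interface NetworkNode {
      id: string;              // e.g. a URI identifying the library or consortium
      name: string;
      roles: Role[];           // the functions this node has chosen to take on
    }

    // An edge expresses a directed relationship: `from` syndicates metadata that `to` aggregates.
    interface SyndicationEdge {
      from: string;            // node id of the syndicating library
      to: string;              // node id of the aggregating consortium or utility
      scope?: string;          // optional filter, e.g. "print monographs only"
    }

    // A small example network: two member libraries syndicating to one consortium.
    const nodes: NetworkNode[] = [
      { id: "lib-a", name: "Library A", roles: ["contribute", "syndicate"] },
      { id: "lib-b", name: "Library B", roles: ["contribute", "syndicate"] },
      { id: "consortium", name: "Consortium", roles: ["aggregate", "index", "publish"] },
    ];

    const edges: SyndicationEdge[] = [
      { from: "lib-a", to: "consortium" },
      { from: "lib-b", to: "consortium" },
    ];

    // Which nodes feed a given aggregator?
    const contributorsTo = (nodeId: string) =>
      edges.filter((e) => e.to === nodeId).map((e) => e.from);

    console.log(contributorsTo("consortium")); // ["lib-a", "lib-b"]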

In this model, the node (library, consortium, utility, etc) and its relationships to other nodes become central to how the system functions. Each library might aggregate from multiple other nodes and share metadata with multiple other nodes:

Consider some use cases or patterns that might emerge from the basic building blocks outlined above:

  • A small library might use an index made available by a consortium or larger library, either through an API, through download into a discovery layer, or directly through a discovery layer made available by the larger library.
  • The same library might Contribute original metadata and Syndicate it for ingestion into a larger aggregation in a consortium or state library.
  • A consortium might Aggregate metadata from member libraries and other consortia, national libraries, etc. It might Index all of this metadata, and Syndicate or Publish metadata to a national library or partner consortia.
  • A national library might play all of these roles concurrently within its national ecosystem as well as with partner national libraries.

What emerges is a network of organizations collaborating more or less closely to curate a combined set of bibliographic metadata (including authorities, etc). The resulting network(s) might be highly organized, or they might form locally organized regions (i.e. within consortia) surrounded by looser structures. Representing Publish as a separate function allows a sub-community to be selective about when to mint new entities with discrete URIs. But by defining possible roles and relationships, and, crucially, by providing software that makes it easy to adopt these roles, a development in this area can address immediate short-term needs by solving concrete problems while also carving a path towards a higher level of organization and re-use.

Implementation

The five roles/relationships described above should ideally be established entirely in terms of open interfaces and standards. In practice, the roles described could all be supported by different combinations of software or modules that form parts of different systems. We will, however, propose a modular open source solution that makes it easy for an organization to establish a node in a bibliographic data network such as the one described here, and support the different roles as well as relationships with other entities. This software can be seen as an enabling factor, so that the network model can be implemented in specific communities (such as consortia), but also as a platform for exploring and experimenting with new use cases in a controlled environment.

We are working on an implementation of the Bibliographic Network based on the Okapi-Stripes Platform from the FOLIO Project. The Okapi-Stripes Platform was originally developed to be the infrastructure for the FOLIO Project. It is seen now as a generally useful infrastructure component for other projects, and the Open Library Foundation (OLF) is considering the separation of the Platform components into a distinct OLF community project. The “FOLIO LSP” suite of apps on the Okapi-Stripes Platform, intended to replace the functionality of an integrated library system, is the most well-known part of the project. Because FOLIO is a platform, it is possible to create other apps with no connection to the LSP suite and to borrow individual apps without depending on others. Our implementation uses the Okapi-Stripes Platform and borrows/adapts selected FOLIO LSP apps without creating a strong dependency between the Bibliographic Network and the LSP project; the projects are separate as it is assumed that many libraries will not use FOLIO LSP as their integrated library system.

A description of the Okapi-Stripes Platform’s relationship to the rest of FOLIO can be found in the FOLIO wiki. Documentation of Okapi (server-side modules) and Stripes (browser-based JavaScript single-page-application) can be found on the FOLIO Developer’s website. However, it should not be necessary to possess deep knowledge of the Okapi-Stripes Platform to understand the high-level architectural concepts presented here.

The Okapi-Stripes Platform is a true multi-tenant application environment for web-based applications, or “apps.” The concept is similar to Google’s office suite and other web-hosted, extensible environments. A tenant is like a workspace in the Platform, which is typically specific to a library or organization. It has its own set of apps, its own users, its own data. Multiple tenants can be hosted in a single installed instance of the Platform.

Each library’s node is instantiated as a tenant in a shared cloud-hosted Okapi-Stripes deployment. Thus, although in principle the ecosystem model is entirely decentralized, it would be simple to co-locate the nodes representing the libraries in a consortium on one single software instance managed by a library or by a consortium. Individual libraries might also choose to run their own nodes, or system vendors might incorporate the functionality and the recommended standard interfaces into library management products.

One of the key design decisions to make when building a solution based on the Platform is the division of functionality into modules or “apps”. These apps can be deployed independently for each tenant, so libraries can adapt an Okapi-Stripes-based solution to their needs by selecting which apps to make use of.

One straightforward architectural interpretation of the model above would be built around a metadata warehouse or inventory representing the universe of metadata known to a given node/library, and would then use the capabilities of the apps to move data to other nodes in the Bibliographic Network. A library’s node on the Bibliographic Network contains apps corresponding to the five main functions or roles discussed above:

Figure: Overview of a library node:

In the illustration above, the green boxes correspond conceptually to Okapi/Stripes web-apps, with admin-facing user interfaces based on the Stripes React toolkit and Okapi-mediated microservices at the back-end. The Inventory app and associated Summary metadata model and detailed record stores are extant software, made available by the Open Library Foundation under the Apache2 license.

The high-level functionality of the apps (building on the high-level roles described above) is as follows:

  • ILS Extract: Populate the Inventory with bibliographic data by way of extraction from the library’s ILS through OAI-PMH, scheduled FTP downloads, etc. Depending on the ILS, some periodic manual processing may be unavoidable. Records are pushed into the Inventory by way of an identity resolution function, which looks for duplicate entities as well as entities that already exist in the Inventory (updates).  The ILS Extract app performs one half of the Contribute role in the model described above.
  • Inventory: An aggregation or warehouse of bibliographic metadata (instances as well as potentially works, agents, places, items), normalized into a common data model for purposes of deduplication and filtering. Original MARC records are stored (in SRS, or Source Record Storage) or externally maintained BIBFRAME records can be linked to. The Inventory app performs the other half of the Contribute role in the model described above.
  • Syndicate: Make all or a subset of the Inventory available to other Bibliographic Network nodes (other libraries or a consortial node). This could theoretically be done through any number of mechanisms, but ideally, one which supports incremental updates and a ‘push’ function, so that local changes can be exposed as quickly as possible to partner nodes.  ResourceSync is a candidate solution.
  • Aggregate: The receiving end of the syndication function. Includes the same identity resolution logic as the ILS Extract (a minimal sketch of this matching step follows the list). This function allows the library to construct a ‘super-inventory’ based on the contents of other library nodes.
  • Publish: This app publishes (a part of) the inventory as linked data entities with permanent identifiers, suitable for cross-linking or indexing by web search engines.
  • Index: Construct a searchable index of the contents of the inventory using a high-performance indexing engine. This includes a configurable mapping step, and may make use both of the summary record contents and original MARC contents if desired.
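
The identity resolution mentioned under ILS Extract and Aggregate is, at its core, a matching problem. The sketch below (TypeScript, purely illustrative and not the actual Inventory matching logic) shows one common approach: derive a normalized match key from identifiers and descriptive fields, and group incoming records that share a key as candidate duplicates.

    // Illustrative match-key deduplication; not the actual Inventory matching algorithm.
    interface InstanceSummary {
      isbn?: string;
      title: string;
      publicationYear?: string;
    }

    // Normalize a string: lower-case, strip punctuation, collapse whitespace.
    const normalize = (s: string) =>
      s.toLowerCase().replace(/[^a-z0-9 ]/g, "").replace(/\s+/g, " ").trim();

    // Prefer a standard identifier; fall back to a composite descriptive key.
    function matchKey(rec: InstanceSummary): string {
      if (rec.isbn) return "isbn:" + rec.isbn.replace(/[^0-9Xx]/g, "");
      return "desc:" + normalize(rec.title) + "|" + (rec.publicationYear ?? "");
    }

    // Group incoming records so that likely duplicates share a bucket.
    function resolve(records: InstanceSummary[]): Map<string, InstanceSummary[]> {
      const buckets = new Map<string, InstanceSummary[]>();
      for (const rec of records) {
        const key = matchKey(rec);
        const bucket = buckets.get(key) ?? [];
        bucket.push(rec);
        buckets.set(key, bucket);
      }
      return buckets;
    }

A production matcher would weigh more fields (other identifiers, contributors, format) and would distinguish an update to a known entity from a genuinely new one, but the basic shape is the same.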

An Example Deployment for a Consortium Union Catalog

While the software described here could be deployed by many types of organizations to build different ‘topologies’ of metadata exchanges and aggregations, we offer this proposal for the specific near-term use case of a consortium’s union catalog.  Each consortium member will have access to a library node, wherein they can manage the synchronization of their metadata with their ILS, apply filters, and syndicate metadata to a shared consortial node, operated by the consortium or by one library on behalf of the consortium.

Our proposal leverages the multi-tenant capabilities of the Okapi-Stripes Platform and the existing Inventory app of the FOLIO LSP. Each consortium member would be its own tenant on a shared Platform installation (“Member Tenant”). The Member Tenant has:

  • the ILS Extract app, adequately configured and provisioned for its particular ILS, to extract and normalize data from the ILS;
  • the Inventory app for staging and managing the library’s contribution to the shared dataset; and
  • the Syndicate app to make metadata available to the consortial shared aggregation (or possibly to multiple aggregations, if it participates in more than one consortium).

Records from the member’s ILS and other sources would be synchronized into the Inventory module by the ILS Extract app. Records could be synchronized through a handful of mechanisms: regular OAI-PMH or ResourceSync harvests, real-time webhook messages, or comparison against a periodic full database dump. As a last resort, Index Data can use its Connector Framework web-scraping technology to periodically scan the member’s OPAC. The synchronization mechanism merges bibliographic and holdings records should they arrive as separate feeds. The ILS Extract app provides a consistent interface between the member’s bibliographic data and the Bibliographic Network, whether such data is loaded in bulk or added, updated, and deleted incrementally from the member’s ILS.
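
As an illustration of the first of those mechanisms, an incremental OAI-PMH harvest amounts to repeated ListRecords requests with a from date and a resumption-token loop. The sketch below (TypeScript) is only meant to show the shape of that loop; the endpoint is hypothetical and the XML handling is reduced to a crude stand-in.

    // Sketch of an incremental OAI-PMH harvest loop; the endpoint URL below is hypothetical.
    interface OaiPage {
      records: string[];            // raw <record> elements, to be mapped into Inventory records
      resumptionToken?: string;     // present when the repository has more pages to deliver
    }

    // Crude stand-in for real XML parsing (a real harvester would use an XML library).
    function parseOaiPage(xml: string): OaiPage {
      const token = xml.match(/<resumptionToken[^>]*>([^<]+)<\/resumptionToken>/);
      return {
        records: xml.match(/<record[\s>][\s\S]*?<\/record>/g) ?? [],
        resumptionToken: token?.[1],
      };
    }

    async function harvestSince(baseUrl: string, from: string): Promise<string[]> {
      const harvested: string[] = [];
      let params = new URLSearchParams({ verb: "ListRecords", metadataPrefix: "marc21", from });
      for (;;) {
        const response = await fetch(`${baseUrl}?${params.toString()}`);
        const page = parseOaiPage(await response.text());
        harvested.push(...page.records);
        if (!page.resumptionToken) break;
        // After the first response, OAI-PMH requires resuming with only the token.
        params = new URLSearchParams({ verb: "ListRecords", resumptionToken: page.resumptionToken });
      }
      return harvested;
    }

    // e.g. harvestSince("https://opac.example.org/oai", "2018-01-01")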

Details about the eligibility and availability of a resource for lending can also be stored in the item’s Inventory module record. The ILS Extract app would have a configuration page for mapping the item locations, item types, item formats, and publication years coming from the member’s ILS to a requesting-eligibility statement. If item status (checked out, lost, etc.) is available in the real-time feed of records from the member ILS, the Member Tenant’s Inventory module can also store real-time availability of items. Between the Member Tenant and the member’s ILS, a variety of mechanisms can determine availability, from Z39.50 connections to Index Data’s web-scraping Connector Platform technology. To other applications consuming a real-time-availability API response, the data provided will be consistent no matter which mechanism is used to determine availability from the member’s ILS.
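
A hedged sketch of what such an eligibility mapping could look like (TypeScript; the rule fields are invented for illustration and are not the actual ILS Extract configuration schema):

    // Hypothetical requesting-eligibility mapping; not the real ILS Extract configuration format.
    interface ItemFacts {
      location: string;        // as reported by the member ILS
      itemType: string;
      format: string;
      publicationYear?: number;
    }

    interface EligibilityRules {
      excludedLocations: string[];   // e.g. special collections, course reserves
      excludedItemTypes: string[];   // e.g. equipment, reference
      lendableFormats: string[];
      minPublicationYear?: number;   // e.g. exclude very old, fragile material
    }

    function isRequestable(item: ItemFacts, rules: EligibilityRules): boolean {
      if (rules.excludedLocations.includes(item.location)) return false;
      if (rules.excludedItemTypes.includes(item.itemType)) return false;
      if (!rules.lendableFormats.includes(item.format)) return false;
      if (rules.minPublicationYear && item.publicationYear !== undefined &&
          item.publicationYear < rules.minPublicationYear) return false;
      return true;
    }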

Changes recorded in each Member Tenant would also be available in that tenant’s Syndicate app to be contributed to the consortial Inventory app running in a separate Okapi-Stripes Platform tenant (the “Consortial Tenant”). The consortial Inventory app performs a deduplication/merging algorithm and stores the resulting derivative record in the Consortial Tenant Inventory. A holdings record representing the member’s record is also stored in the Consortial Tenant Inventory module.

The Consortial Tenant aggregates metadata from each member library (and potentially from other sources, e.g. HathiTrust, Internet Archive, Library of Congress, etc). Data would be stored in a shared Inventory, and from there it would be picked up by the Index app to build a searchable index. The consortium might also choose to further syndicate its aggregated metadata set if it works in partnership with other consortia.

Either the individual libraries or the consortium might also choose to use the optional Publish app to publish a linked-data representation of the combined metadata set. The syndication/aggregation chain and identity resolution logic should ensure that cross-links between entities are established when needed (either between companion entities like works and instances or between equivalent instances published by different network members).

The following diagram illustrates the flow of data upwards from the individual library nodes with their associated ILSes into the shared warehouse where identity-matched bibliographic instances are maintained. From the shared warehouse, bibliographic data can be indexed, published (as BIBFRAME and/or schema.org) or syndicated to other consortia.

Making the Bibliographic Network

To this point, Index Data has privately shared the conceptual model and implementation details for the Bibliographic Network with a few clients and potential clients, and incorporated their feedback into this posting. The concepts were well enough received that we are now seeking broader feedback. If you are reading this posting after following a link in an Index Data response to your request for information or request for proposal, please know that this is a starting point for our discussions on how Index Data technology can be adapted to your needs. If this post gets your gears turning and you think this technology can be applied to your use cases, start a conversation with us to learn the state of development for the Bibliographic Network.

Controlled Digital Lending

Last week, I attended the 2018 Open Libraries Forum at the Internet Archive’s remarkable headquarters in San Francisco. The focus of the Forum was twofold: to learn firsthand from some of the authors of the newly-minted position statement on controlled digital lending (CDL), and to help provide input for the Archive on their own Digital Lending platform, which precedes the Position Statement, but has the potential to be an important part of a global CDL infrastructure (with some caveats that I will return to). Two of the authors of the CDL position statement have created a companion white paper which lays out the legal argument for CDL in greater detail. Anyone considering engaging in CDL should consult these resources as well as their own legal counsel, but having said that, the documents are admirably clear and readable. It is important to stress again that the position statement on CDL and the associated white paper are not about the Internet Archive’s platform: they describe a practice which could be adopted by any library.

I think that CDL has the potential to be a big deal for libraries. So much of the transformation from print to digital has been accompanied by an erosion of the rights of content consumers. The balance of rights and responsibilities historically associated with copyright law has been replaced by digital licenses and DRM systems which reserve virtually all privileges and affordances for the licensor. CDL develops an argument, grounded in copyright law, for libraries to claim some territory of their own in the digital realm. This offers new degrees of freedom, and new opportunities for innovative library services or simply for reducing material handling costs.

Only time will tell how this will shake out, how libraries will come to apply these new possibilities. Will they be the domain of commercial vendors or large organizations, or might they become a mainstream part of library practice? I would like to consider what it would look like for libraries to extend current models of circulation and resource sharing to encompass CDL, what tools and infrastructure exist and what new ones might help libraries realize the greatest benefits for their patrons.

What is CDL?

CDL is a blueprint for an emerging approach for libraries to share copyrighted material. In a nutshell, CDL allows a library to digitally lend one “copy” of a title for every physical copy that they have acquired, provided that the total number of concurrent physical and electronic lends does not exceed the number of copies owned, and provided that reasonable steps are taken to prevent borrowers from creating their own copies. CDL extends the rights of a library to circulate physical books into the digital realm, provided that the book is physically owned by the library (and not licensed). The CDL position statement was crafted by a group of legal experts, and it has been endorsed by a rapidly growing list of organizations and individual experts.

CDL defines digital lending in terms of six specific controls that a library must put in place to protect the rights of the copyright owner, and it lays out the arguments, with references to laws and legal precedents, for why such lending is within the fair use rights of the library. Because copyright law is subject to interpretation and legal challenge, no ironclad guarantees against challenge or litigation are provided. However, some additional steps are offered for libraries that wish to reduce their liability even further, and it is suggested that ultimately, libraries are granted a level of legal shelter which makes the risk of litigation and any possible penalties quite low. But again, as with any activity, it is advisable for a library to engage legal counsel before making an investment in new services.

These are the six controls a library must put in place to provide controlled digital lending (from the white paper):

  1. ensure that original works are acquired lawfully;
  2. apply CDL only to works that are owned and not licensed;
  3. limit the total number of copies in any format in circulation at any time to the number of physical copies the library lawfully owns (maintain an “owned to loaned” ratio);
  4. lend each digital version only to a single user at a time just as a physical copy would be loaned;
  5. limit the time period for each lend to one that is analogous to physical lending; and
  6. use digital rights management to prevent wholesale copying and redistribution.

Note that the first two points deal with the selection of material, and only the last four points deal with “technical” measures which are intended to mimic the constraints of a physical book to a reasonable degree.
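
The heart of controls 3 and 4 is a simple invariant: at any moment, the number of active physical and digital loans of a title must not exceed the number of copies the library owns, and each digital copy goes to one user at a time. A minimal sketch of that check, in TypeScript and purely for illustration (a real system would also enforce loan periods and DRM per controls 5 and 6), might look like this:

    // Illustrative owned-to-loaned check; not a complete CDL implementation.
    interface TitleCirculation {
      ownedPhysicalCopies: number;   // lawfully acquired, owned (not licensed) copies
      activePhysicalLoans: number;
      activeDigitalLoans: number;
    }

    // May the library start one more digital lend of this title right now?
    function canLendDigitally(t: TitleCirculation): boolean {
      const totalActiveLoans = t.activePhysicalLoans + t.activeDigitalLoans;
      return totalActiveLoans < t.ownedPhysicalCopies;
    }

    // Example: 2 owned copies, 1 physical checkout, 1 digital checkout => no further lends.
    console.log(canLendDigitally({ ownedPhysicalCopies: 2, activePhysicalLoans: 1, activeDigitalLoans: 1 })); // false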

If a library implements these measures, then it is doing CDL in accordance with the position statement, and it is free to lend and circulate parts of its collection electronically. If a library feels queasy and wishes to further reduce its risks, the authors suggest additional steps that might be considered, including: limiting CDL to older or even out-of-copyright materials; focusing on non-fiction and steering away from new/current materials; and implementing practices that further emulate the “transactional friction” of printed books, like artificial delays between lends or a maximum number of loans per copy to emulate a book wearing out over time. Again, these “refinements” are optional with respect to the CDL model, but they might help allay the concerns of first-time CDL practitioners.

Possible applications of CDL

CDL adds new degrees of freedom, new tools to help libraries do their work. I am no librarian, and only time will tell what creative uses people make of these tools; but it may be helpful to try to imagine some of them, to guide an exploration into possible technical approaches.

  • Libraries can bring new life to marginal parts of their collection; items available for easy online use or download to a tablet may lend themselves far better to serendipitous discovery than those buried in closed stacks.
  • Items that are very rare or costly, which might not be available for physical lending, can be made available for digital lending, either directly or through interlibrary loan.
  • For libraries with far-flung branches or for interlibrary loan in time-critical situations, digital lending may sometimes be preferable to shipping the item out and back.
  • Libraries can create thematic online exhibits or collections which cross the boundaries of institutions. If items have already been digitized and are available for digital lending (direct or through interlibrary loan), new possibilities are opened for engaging presentations.
  • CDL can be part of a cross-institutional collection management strategy. Libraries can collaborate to establish electronic as well as physical coverage in a virtualized collection.
  • CDL can be an alternative to permanently weeding a collection. Physical items of marginal value can be put into permanent storage or destroyed, yet remain available to patrons in their digital form.

What’s next?

I should state clearly that hardly any of the thoughts in this post belong to me. I have discussed CDL with a number of people over the past several months, brainstorming ideas, thinking about concerns, technical implications, etc. I have done my best to synthesize what I have learned and what makes me excited about this development, in the hope of exciting and inspiring others.

In the next post, I will explore some practical approaches to deploying CDL in a library, and think about implications for technology choices as well as the role of services like the Internet Archive.

Sebastian Hammer is co-founder and president of Index Data.

Reflections on the European BIBFRAME Workshop

Attending the European BIBFRAME Workshop in Florence, Italy, was a great way to wrap up my first month with Index Data — and it wasn’t just about the hills, art, and food. The workshop proved to be an excellent introduction to the BIBFRAME community and the variety of exciting initiatives taking place around the globe.

The first thing that struck me was the true international nature of BIBFRAME. Of course the fact that there is a European BIBFRAME Workshop at all goes to show that BIBFRAME has grown well beyond its origins at the Library of Congress. This year’s workshop included more than 80 participants from countries across Europe, as well as a handful from North America and Asia. In his introduction to the meeting, Leif Andresen of the Royal Danish Library echoed this observation, saying he believes BIBFRAME has the potential to become more international and more collaborative than MARC.

If the workshop was any indication, that’s already well on its way to being true. While BIBFRAME originated in the States, it’s the European national libraries — with their centralized models and willingness to take risks — who have really taken the BIBFRAME baton and run with it. For me their work was a great illustration of the way that many nations working together, while still playing to their own strengths, can make real progress in moving the library profession forward.

This success was especially obvious in the discussions surrounding the National Library of Sweden’s recent move to a full linked data environment within its union catalog. Attendees were eager to learn more from Sweden’s implementation, but they were equally inspired by the fact that it had been done at all. Niklas Lindstrom, who presented on the project, really summed up the mood of the room when he described Sweden’s efforts as “Not done, just real.”

A similar emphasis on the value of learning through implementation flowed through many presentations. Philip E. Schreur from Stanford University said that Phase 2 of the LD4P project would be strongly focused on implementation, with 20 new partner institutions exploring everything from record creation to discovery. And Richard Wallis presented a range of actionable possibilities for libraries interested in exploring linked data, from starting points like adding URIs to MARC to as-yet untackled challenges like converting BIBFRAME to Schema.org.

Wallis also echoed one of the other major themes of the conference, the importance of developing a true community approach to BIBFRAME. With individual libraries implementing projects on their own, BIBFRAME approaches are often too different to allow for real collaboration. The community needs to work on consistency, cut down on duplication, and focus on creating connections between the most well-established projects.

Many other presenters addressed this same issue, offering up potential ideas and solutions. Sally McCallum of the Library of Congress described plans to harness external authority data, including standards like ISNI. Schreur talked about the role that a Wikimedian-in-Residence will play at LD4P, working to figure out how Wikidata can be used to support library linked data. And Miklós Lendvay of the National Széchényi Library of Hungary shared his library’s decision to implement the FOLIO library services platform in the hopes that it will extend the sharing mindset and allow for possible interactions between the BIBFRAME and FOLIO communities.

Having worked extensively with BIBFRAME and FOLIO, Index Data is especially excited about this last possibility, and I had the opportunity to present a lightning talk outlining some of the ways this might be achieved. So far it’s just a starting point, but I’m confident that the experiences and ideas I heard at the workshop will help shape and inspire Index Data’s future work with the BIBFRAME community.

Kristen Wilson is a project manager / business analyst working on efforts related to BIBFRAME, FOLIO, and resource sharing. She joined Index Data in August 2018 after more than a decade of academic library experience.

Machine learning in libraries: profiling research projects rather than people

Machine learning in libraries, as in many other contexts, will often rely on data about people and their activities. Data in a library system can be made available for use with machine learning algorithms to develop predictive models, which have the potential to help patrons in their research. Of course, the data might also be used for the benefit of others without the patrons’ permission, either in the present or at some future time. In particular, the data can serve as a basis for creating profiles of individuals, which may be used for undue advantage. If “knowledge is power,” then it is worth considering whether that power is authorized and what its limits are.

If libraries were to avoid keeping a record of patron activities, then it would be much more difficult to build profiles of patrons. However, libraries need to track some basic information, such as circulation transactions, and can offer better services to patrons if they track and analyze how resources are being used. In addition, they may be obligated to record information about access to electronic resources. There are also the benefits of machine learning, and while “opt-in” or “opt-out” models can limit profiling, many machine learning algorithms will only work well if most people participate.

Suppose, however, that libraries were to request that every patron specify one or more “research projects” corresponding to the patron’s interests, broadly defined, and to select one of these projects when logging in to the library system. Within the library’s database, most patron activities could then be associated with a project rather than with a patron.

For example, Jenny might create a project called “Information theory” for her graduate study, a second project called “Cooking” to pursue her passion for learning new cuisines, and a third called “Reading” to represent reading for pleasure. Now somewhere in the library system there is stored an association between Jenny and her three projects. This association would not be authorized to be shared outside of the library or with machine learning algorithms, except with specific consent in cases where absolutely necessary. Elsewhere in the database, Jenny’s data would be associated not directly with her but with the project she has selected to work on.
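
A minimal sketch of what that separation might look like in a data model (TypeScript; the shapes and names are hypothetical, not drawn from any particular library system):

    // Hypothetical schema: activity is keyed by project, not by patron.

    // Stored in a restricted area of the database; never exported to analytics or recommender pipelines.
    interface PatronProjectLink {
      patronId: string;
      projectId: string;
      projectName: string;   // e.g. "Information theory", "Cooking", "Reading"
    }

    // Stored in the general, analyzable part of the database; carries no patron identifier.
    interface ActivityRecord {
      projectId: string;
      action: "checkout" | "search" | "view";
      resourceId: string;
      timestamp: string;
    }

    // When Jenny selects "Cooking" and checks out a book, only the project id is recorded:
    const activity: ActivityRecord = {
      projectId: "proj-cooking-042",
      action: "checkout",
      resourceId: "bib-123456",
      timestamp: "2018-11-02T14:05:00Z",
    };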

When Jenny is logged into the library system, she might see a dashboard for the project that she is currently working on, and can easily switch to another project. This may make a certain sense to Jenny, as it would allow her to view and manage related information together. She probably would not care to see articles about information theory suggested by a recommender system while browsing books of recipes.

With some exceptions, there is no need for machine learning to profile people in order to help with their research interests. Jenny’s selection of a project could even help the algorithms be more accurate, by indicating what she is working on. In any case, the decision of how to organize her projects and where to set boundaries for how data are used would be up to Jenny, based on which of her interests she thinks would benefit. She could begin to use machine learning selectively as a tool, rather than being pressured into an all-or-nothing choice.

Another advantage of this approach could be seen in cases where data are anonymized and exported from the library system to be analyzed by someone outside the library. Anonymized data can sometimes be “re-identified” because the data may reveal a combination of specific activities or interests that can be linked to a person. If the library were to track projects rather than patrons, and assuming the patron-project groupings were not disclosed with the anonymized data, then patron data would be fragmented by project, potentially making re-identification more difficult.

Nassib Nassar joined Index Data in 2015 as senior product manager and software engineer.


The State of FOLIO: Numbers and Muses


As the calendar turned to a new year, we took the opportunity to reflect on the state of FOLIO today.

The Index Data team members are excited to have jumpstarted this open source effort, and the numbers tell a story of a growing and highly engaged community.

FOLIO Community


FOLIO community adoption


To get beyond the numbers, we asked some of the project leaders to share their thoughts on where FOLIO is today and where it’s going.

The 5-minute version

You can watch the entire interview here.

2017 was a year of remarkable progress for FOLIO, and we have good reasons to believe that 2018 will be even more exciting!  Index Data is eager to engage with the community and be at the forefront of welcoming new participants to the FOLIO project.

Bibliotech Education Offers Node-based ZOOM Client Based on Index Data’s YAZ Toolkit

Daniel Engelke, chief technology officer and co-founder of Bibliotech Education Ltd, notified us about their release of an open source Z39.50 toolkit for Node.js that uses Index Data’s YAZ toolkit. The source code is available on GitHub. Daniel said, “Having the YAZ toolkit available, specifically the libyaz5-dev package made developing a zoom client in Node.js extremely easy for us!”

npm package

Bibliotech is an online platform that provides students with access to their textbooks and libraries with affordable textbook packages.

RA21 project aims to ease remote access to licensed content

In the two decades since electronic journals started replacing print journals as the primary access to article content, the quandary of how to ensure proper access to electronic articles that are licensed and paid for by the library has been with us. (Note 1) To address what has been termed the “off campus problem”, libraries have employed numerous techniques and technologies to enable access for authorized users when they are not at their institutions. Access from on campus is easy — the publisher’s system recognizes the network address of the computer requesting access and allows the access to happen. Requests from network addresses that are not recognized are met with “access denied” messages and/or requirements to pay for one-off access to articles. To get around this problem, libraries have deployed web proxy servers, virtual private network (VPN) gateways, and federated access control mechanisms (like Shibboleth and Athens) to enable users “off campus” to access content. These techniques and technologies are not perfect, though (what happens when you get to a journal article from a search engine, for instance), and this is all well known.

Stepping into this space is the STM Association — a trade association for academic and professional publishers — with a project they are calling RA21: Resource Access in the 21st Century. The website describes the effort as:

Resource Access for the 21st Century (RA21) is an STM initiative aimed at optimizing protocols across key stakeholder groups, with a goal of facilitating a seamless user experience for consumers of scientific communication. In addition, this comprehensive initiative is working to solve long standing, complex, and broadly distributed challenges in the areas of network security and user privacy. Community conversations and consensus building to engage all stakeholders is currently underway in order to explore potential alternatives to IP-authentication, and to build momentum toward testing alternatives among researcher, customer, vendor, and publisher partners.

Last week and earlier this week there were two in-person meetings where representatives from publishers, libraries, and service providers came together to discuss the initiative. Two points were put forward as the grounding principles of the effort:

  1. In part, it is the ease of resource access within IP ranges that makes off-campus access so difficult
  2. In part, it is the difficulty of resource access outside IP ranges that encourages legitimate users to resort to illegitimate means of resource access

What struck me was the importance of the first one, and its corollary: to make off-campus access much easier we might have to make on-campus access a little harder. That is, if we ask all users to authenticate themselves with their institution’s accounts no matter where they are, then the mode of access becomes seamless whether you are “on-campus” or “off-campus”.

The key, of course, is to lower that common barrier of personal authentication so far that no one thinks of it as a burden. And that is the focus of the RA21 effort. Take a look at the slides [PowerPoint] from the outreach meeting for the full story. The parts that I’m most excited about are:

  • Research into addressing the “Where Are You From” (WAYF) problem — how to make the leap from the publisher’s site to the institution’s sign-on portal as seamless as possible. If the user is from a recognized campus network address range, the publisher can link directly to the portal. Can clues such as geo-location also be used to reduce the number of institutions the user has to pick from? Can the user’s affiliated institution(s) be saved in the browser, so the publisher knows where to send the user without prompting them? (A small sketch of that last idea follows this list.)
  • User experience design and usability testing for authentication screens. Can publishers agree on common page layout, wording, graphics to provide the necessary clues to the user to access the content?
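
To picture that last question: the browser could simply remember the user’s home institution after a first successful sign-on, so that later visits to a participating publisher can skip or pre-fill the institution picker. The snippet below is a minimal conceptual sketch (TypeScript, using browser localStorage); it is not the mechanism the RA21 pilots have settled on, and the key name and data shape are made up for illustration.

    // Conceptual sketch of remembering a user's home institution in the browser.
    // Not an actual RA21 implementation; key name and shape are invented for illustration.
    const STORAGE_KEY = "preferred-institution";

    interface InstitutionHint {
      entityId: string;     // SAML entityID of the institution's identity provider
      displayName: string;
    }

    function rememberInstitution(hint: InstitutionHint): void {
      window.localStorage.setItem(STORAGE_KEY, JSON.stringify(hint));
    }

    function recallInstitution(): InstitutionHint | null {
      const raw = window.localStorage.getItem(STORAGE_KEY);
      return raw ? (JSON.parse(raw) as InstitutionHint) : null;
    }

    // A publisher page can then pre-select or skip the "Where Are You From" picker:
    const hint = recallInstitution();
    if (hint) {
      console.log(`Redirecting to sign-on for ${hint.displayName} (${hint.entityId})`);
    }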

The RA21 group is leveraging two technologies, SAML and Shibboleth (Note 2), to accomplish the project’s goals. There are some nice side effects to this choice, notably:

  • privacy aware: the publisher trusts the institution’s identity system to properly authorize users while providing hooks for the publisher to offer personalized service if the user elects to do so.
  • enhanced reporting: the institution can send general tags (user type, department/project affiliation, etc.) to the publisher that can be turned into reporting categories in reports back to the institution.

Beginning next year organizations will work on pilot projects towards the RA21 goals. One pilot that is known now is a group of pharmaceutical companies working with a subset of publishers on the WAYF experience issue. The group is looking for others as well, and they have teamed up with NISO to help facilitate the conversations and dissemination of the findings. If you are interested, check out the how to participate page for more details.

Within Index Data, we’re looking at RA21’s impact on the FOLIO project. FOLIO is starting up a special interest group that is charged with exploring these areas of authentication and privacy. I talk more about the intersection of RA21 and FOLIO on the FOLIO Discuss site.

Note 1: I am going to set aside, for the sake of this discussion, the argument that open access publishing is a better model in the digital age. That is probably true, and any resources expended towards a goal of appropriately limiting access to subscribed users would be better spent towards turning the information dissemination process into fully open access. The resource access project described here does exist, though, and is worthy of further discussion and exploration.

Note 2: SAML (Security Assertion Markup Language) is a standard for exchanging authentication and authorization information while Shibboleth is an implementation of SAML popular in higher education.

Index Data Staff Offer a Primer on the FOLIO Code

Coinciding with the public release of the FOLIO code repositories, Index Data staff offered a primer to developers on how the FOLIO platform provides integration points for module development through the Okapi layer and what that means for developing modules in FOLIO. In the 90-minute presentation (embedded below), Peter Murray (open source community advocate) and Jakub Skoczen (architect and developer) walked through how the pieces come together at a high level, and Kurt Nordstrom (software engineer) demonstrated how to build a back-end module for Okapi using NodeJS.


The video is available for download from the Open Library Environment website.

Index Data Turns 20

Today marks 20 years since Adam Dickmeiss and I founded Index Data together in Copenhagen. There was a bottle of champagne, and our parents shared the moment with us along with our wives, because, honestly, we were little more than big kids at the time. We were a little scared, but we were also in the fortunate position of being young, still without kids or debt. Oh, and our wives had steady jobs. Let the adventure begin!

We met just a few years earlier, as interns at the State Library Service (Statens Bibliotekstjeneste), looking to make some extra money for college. We were hired during a tumultuous period, both organizationally and technologically. In Denmark, libraries benefit from substantial support from national and municipal authorities. At that time, there was an effort afoot to consolidate the services offered to the public and research libraries, respectively, into a single organization, the Danish Library Center (DBC) with a single software platform, including a centralized, national union catalog and interlibrary loan platform. Eventually, Adam and I along with a team of young programmers were given the task of creating an indexing and search engine for the new system. The task took about a year, and I still look back on it as one of the most exciting projects I have worked on. Somewhere in there, Adam managed to graduate and I managed to forget all about my studies, but we both felt like we knew everything there was to know about library technology (remember, we were just big kids!).

The new system went into production on schedule and was a tremendous success. Today, it forms the basis for a unique, patron-centered ‘national OPAC’ that gives any citizen access to the collection of every library in the nation. But Adam and I had developed a taste for big, ambitious projects. In a sense, our shared experience had made us into entrepreneurs, and we felt hungry for more.

For our first day of work at Index Data, we each brought a chair, our PC from home, and a thousand dollars which formed the entirety of the operating capital of the company. The goal of the business, we decided, would first and foremost be to provide a good place of work for us and any colleagues who might one day join us. The purpose was to have fun doing work that we loved. The business model was to create cutting-edge, client/server-oriented software (buzzword of the era) for libraries, and to finance the development by offering our services as consultants in whatever areas we could. We felt that the best place to start would be to build a complete Integrated Library System (I did mention we were just big kids).

Our workplace was a small room in a disused factory building which had been turned into rental offices. But not in the fashionable, expensive way it’s being done today. This place was rough. Our neighbors were bohemian artists and tiny film production companies hoping to make it big. Break-ins were a continuing concern, so one of our first purchases was a large steel grid that we padlocked to our door at night. At one point during a rainstorm, water started coming down through the ceiling, so we draped plastic sheets to keep our computers dry. After that, strange mushrooms would sometimes grow out of our walls.

The original artwork for our first Christmas card, by Adam’s brother Otto Dickmeiss

In between consulting gigs, we worked steadily on our own software, building components that we thought we’d need for our big library system. At one point, we started releasing our software under Open Source licenses. We reasoned that someone might see the software and decide to ask us to help them work with it. Fresh out of an academic environment where Open Source projects were enormously influential (Linux was still new, then, but getting lots of attention), it felt natural to us but it was still a relatively unknown phenomenon in the larger industry and we suffered a good deal of friendly ribbing from our friends. We also endured some more pointed questions from our wives that were still carrying the brunt of the household expenses.

But something cool happened; people did find our software, and the consulting work increasingly involved integration and enhancements to our growing family of software components. Along the way, the building blocks we’d been creating took on a life of their own and became a focal point of our work; we never did build that library system, but our software components have been integrated into the vast majority of library systems out there in various roles, and we have enjoyed two absolutely remarkable decades of working relationships with exciting organizations and brilliant people all over the globe. We moved out of the mushroom-infested office and were joined by coworkers. We had kids.

The Europagate project team in 1995. Adam and Sebastian in the back row

Ten years ago this summer, my family and I moved to the US. Our business gradually shifted away from Denmark and Europe, but we struggled to maintain our old, informal and very personal company culture with me way over in New England and the rest of the team in Copenhagen. In 2007, Adam and I made a decision that in some ways was as dramatic as quitting our jobs and founding the company. We hired Lynn Bailey to be our CEO and re-configured the company mentally and structurally to be a US-based company which just happened to have its core development team in Copenhagen. Soon, they were joined by colleagues in many locations as we made a policy of hiring the most talented people with a strong interest in search and library technology, no matter where they lived. Today, we are a virtual company with colleagues in six different countries (Denmark, Sweden, Germany, the UK, Canada, and the US, where we are spread across four different states). After writing the book on operating a commercial business around Open Source Software, we had to learn how to be a tiny multinational company, and how to work well together as a team while scattered across the globe.

The company that existed ten years ago, with a jolly group of Danes hanging out in the middle of downtown Copenhagen, has been transformed almost beyond recognition. But what has arisen in its stead is in many ways more vital and exciting. Our team is passionate about their work: We swim in a ridiculously specialized area of the sea of information technology, but we do so with tremendous pride and passion.

Index Data, at a recent team meeting in New England

It has been an amazing 20-year journey. We were successful in creating a fun and supportive work environment for ourselves and our colleagues. I couldn’t be more proud and grateful, both for my great coworkers and for the remarkable people I have had the good fortune to do business with.

Let the adventure continue!

Adding Discovery and More to Koha with Smart Widgets

This is the second in a series of posts about our Smart Widget platform. You can also read the first post or find background material about the technology.

In our introduction to Smart Widgets, I said that part of our purpose in developing the technology was to move away from the search box as the primary paradigm for accessing information: to give librarians more tools to organize and present information for their consumers/patrons. But the widgets can also be used to IMPROVE the capabilities of the search boxes that we already have — to offer new functions beyond what your existing software is capable of. In this post, we will show a couple of different examples of how Smart Widgets can be used to add functionality to Koha, but the same principles apply to any system that allows you to customize the HTML structure of a search results page.

Previously, I showed an example of a search result widget which simply executed what you might call a ‘canned’ search, and displayed the results of that search whenever the page was loaded. The HTML code for such a widget might look something like this:

<div class='mkwsRecords' autosearch='american political history'>

This widget will show a list of matching records for the query ‘american political history’ using the set of databases that has been configured into the MasterKey back-end for the library (literally anything searchable on the web can be accessed in this way). But what if we were to put such a widget on, say, the search results page of an OPAC, and have it search for whatever query the user has input? The Smart Widgets allow us to try exactly that; the syntax would look like this:

<div class='mkwsRecords' autosearch='param!query!'>

Where ‘query’ is whatever HTTP parameter name the particular search interface uses to carry the search term.
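
Conceptually, the param!query! form tells the widget to take its search term from the page URL at load time. The effect is roughly equivalent to the following sketch (TypeScript in the browser; shown only to illustrate the idea, not the widget’s internal code):

    // Rough equivalent of autosearch='param!query!': read the OPAC's own query parameter
    // and hand it to the widget element. Illustrative only; not the MKWS internals.
    const params = new URLSearchParams(window.location.search);
    const userQuery = params.get("query");   // 'query' = whatever parameter the OPAC uses

    const widget = document.querySelector(".mkwsRecords");
    if (widget && userQuery) {
      widget.setAttribute("autosearch", userQuery);  // the widget then runs the same search
    }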

In Koha, there is a function on the staff page that allows the administrator to add extra markup to the end of the facet column on the left-hand side of the display. That is an ideal place for us to slip in a little extra functionality. The screen looks like this:

There’s a little more to this than the simple examples I showed in the last post. That is because the initial integration was so easy that we decided to see if we could add an entire Discovery function to Koha (spoiler: We could!). Embedding this markup in the facet bar gives us the following display:

If you look at the facet column to the left, you will see, below the normal facets, a list of different databases and their hit-count for the given query. The list is updated as results come in so the normal Koha response time is not affected in any way whatsoever.

We also added a separate tab to the Koha OPAC with the Discovery function, so that if you click on the widget above, you will get to this page.

Pretty cool, right?

We’ll go through the steps needed to add this functionality in a later, more technical blog post, but before we get to that, I want to show you another application of the Smart Widgets which isn’t about Discovery/metasearching.

We have been thinking that there might be useful functions that an OPAC could perform beyond merely providing a peek into the physical holdings of a library (or physical/electronic in the case of Discovery platforms). What if the OPAC could evolve into a kind of information center, a front door to the library as a facilitator of learning or research?

We thought that one way to explore this idea further was to surface reference content right into the OPAC itself, to supplement the usual bibliographic results. Wikipedia is a good subject for this experiment, since it is free and quite often provides relevant information to a query. So we went back to the Koha administration console and replaced the Discovery widget with a special Wikipedia widget, and this is what we got for a search for ‘nuclear power’.

Is this useful? You’ll have to be the judge, but I think it often could be: As a way to provide another angle on the user’s query (another ‘facet’), and possibly inspiration/guidance for further research. Now obviously, Wikipedia is far from being the only possible source: Commercial reference sources or even locally maintained knowledge bases might be more obvious candidates in some settings. The widget approach would work with just about any source or combination of sources imaginable.

So, in this post, we have shown how you can add significant functionality to Koha without having to install local software and without complex programming. We happen to think that these Smart Widgets are a natural outgrowth of the move towards cloud-based services and I believe you’ll be seeing a lot of them show up in the years to come from all kinds of data and service providers. But if you’re interested in learning more about our take on them, or possibly trying them out in your OPAC, please get in touch.