DPLA Beta Sprint Index Data
This submission consists of two parts: a prototype of a discovery system based on some of our technologies, and this document, which describes our vision for the DPLA.
The Digital Public Library of America (DPLA) is an exciting concept, arriving at a critical point in the rapidly changing landscape of American and global information use. Traditional public libraries are beset with conflicting demands, unprecedented challenges and extreme budget pressures that curtail their ability to meet those challenges. So it is a good time to take a step back from the fray and consider: What would an ideal digital public library look like, and how would it complement and enhance the services of the existing Public Library system?
Index Data has a 17-year history of working with academic and commercial partners all over the world, to realize solutions based on search in numerous forms. In particular, we have been actively involved in the standardization of networked information retrieval from our very beginnings, and we have built and released a substantial portfolio of open source tools in support of searching. We offer a perspective on the DPLA which is very much informed by our interests and our technologies – realizing that it is but one of many. We believe that it is a necessary perspective, and we hope that it will prove an interesting one.
In order to visualize a future DPLA, we will start by considering who its users might be, and what kinds of content might be available to meet their information needs. Then we will discuss the tools and technologies that will enable the DPLA to provide those users with a rich information discovery environment.
Some Important Users
The DPLA will ultimately be everyone's resource, but we think it is particularly important to look at two groups of patrons that are under-served today.
K-12 education practices are at an interesting crossroads at this time, facing the combined challenges of an international labor market and the rising costs of traditional education, while the economic downturn and other stressors are reducing the willingness or ability of taxpayers to shoulder the costs. Widespread availability of online access to information and educational materials, and an emerging body of made-for-Internet courseware all drive toward new solutions to the age-old challenge of educating the young so that they become productive members of society. The combination of online information content and online access to educational software or courseware can enhance educational opportunities and outcomes in a wide range of settings. Home schoolers are particularly reliant on the resources of their local public library and are positioned to immediately benefit from the additional resources that the DPLA could make available. For all of our schools and students to benefit from these new resources, they need not only to be available to interested students and classes; they need to be easy to find.
Lifelong Learners and Continuing Education
Lifelong learners make up a large part of the public library's user community, and make significant use of both fiction and non-fiction collections. This group of users ranges from interested readers exploring new subjects to students in distance-education courses who lack access to the resources of a local university library; from local entrepreneurs doing business-related research to mid-career workers investigating new career options or training for new opportunities. Supporting, encouraging and enabling lifelong education is a critical role of the Public Library. The DPLA has the potential to improve career options and quality of life through better education, and to empower entrepreneurial business opportunity through widespread access to research, trend analysis and scholarly information.
For traditionally produced content, there are publishers and database vendors with a commercial interest in making those materials broadly available, although license costs are often a challenge. The content of the evolving Open Access movement, however, remains under-utilized by many libraries, which struggle to select and to organize these resources, and to provide easy access for their patrons. There is a tremendous opportunity to put this content to work in the service of libraries and their patrons. Once more, the key is not merely the availability of materials but making them easy to find.
There exists a vast trove of content in the millions of out-of-copyright and orphan works on our shelves, and there are many diverse efforts to make these works available. And yet, there is no single place where an interested user can search across all of the freely available, out-of-copyright books. By combining and interleaving searches against multiple collections we begin to approach a more comprehensive solution to this problem.
The e-journal information environment is even more chaotic than the world of e-books, with numerous small collections and individual journals scattered across the web. Yet much of this content reflects the very best academic research and a rapidly growing share of the total research output. For our prototype, we have connected to a relatively small sampling of resources, but a truly comprehensive collection would of course involve many more sources, and it would be adaptable to the requirements of each library, in terms of subject areas and local interests.
One of the most exciting trends in Open Access today is the growing availability of open courseware, interactive applications, online curricula and similar materials that allow users to approach a body of knowledge without the benefit of a traditional classroom setting. The Khan Academy materials at http://khan-academy.appspot.com are an ideal example of a collection that has been created and organized to enable a logical progression of learning. NASA's Virtual Courseware for Earth and Environmental Sciences at http://gcmd.nasa.gov/records/Virtual_Geology-00.html is a remarkable case of an isolated silo of excellent material which is currently neither well described by metadata nor indexed for searching. New materials become available all the time, but useful discovery tools are not keeping pace.
Some of the best and most current open access resources are published by governmental agencies. These materials can cover topics from health and nutrition to science, technology, government policies and current legislation.
Museums are in a fundamentally different position from libraries today. Libraries have always been about information, which has – historically – been accessed through printed volumes. As libraries go digital, and printed materials begin to disappear from shelves, their entire intellectual content remains accessible via the surrogate electronic versions. Museums, however, are fundamentally about objects, the digitization of which is often not an adequate substitute.
Digital surrogates of fine art materials can be useful for many research purposes, even though examination of originals is frequently required for in-depth scholarly analysis. Still, reliable access to art image repositories and digital art museum collections serves a myriad of educational and research needs. Similarly, natural history museum objects, such as fossils, carry far more information than photos, surface scanning or X-rays can capture. Such objects or artifacts cannot be meaningfully stored and transported digitally with any technology currently available. Yet information about these artifacts, including images, 3-D scans, analytical data and detailed descriptions, is still valuable. Many of these collections are currently available online in separate Web sites or silos – most of which are not easily accessible to mainstream search engines. Since museum content is of high value and interest to targeted user communities, including them in a reliable, sustainable, interoperable discovery environment is an important activity for the DPLA.
Many news sources are freely available, albeit often funded by advertisements. A one-stop search of diverse news sources would be of great benefit to DPLA users.
Conventional wisdom might conclude that building a comprehensive digital library would relegate important physical collections to a secondary or even a forgotten role. Yet, depending upon the current location of the user, the local library may still be the best place to access a particular resource. It makes sense, then, to explore the ways in which an awareness of what is physically available to the user can be combined with purely digital resources.
Bringing it All Together: Our Submission
Technologies that enable discovery are our speciality. We have extensive experience in developing and deploying both information services and discovery platforms that integrate such services into rich, one-stop information discovery environments, ultimately bringing those services into the hands of the users, wherever they are. We believe that we can be a valuable partner to the DPLA in this area.
In order to provide a taste of how our technologies can be used, we have put together a simple interface which allows a user to search across several examples of the kinds of information sources that we discuss above. It can be accessed at http://dpla.indexdata.com. Accessing these resources and integrating them in a way that hides the underlying differences from the user is no trivial task. Our interface employs a variety of methods to bring these resources together, including harvesting and indexing as well as broadcast searching using standard protocols, and public or proprietary APIs. We believe that the DPLA will be a very heterogeneous environment in terms of information sources and practices – it is impractical and unnecessarily confining to mandate a single technical platform. The tools, therefore, must above all be versatile.
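As a rough illustration of broadcast searching, the sketch below sends one query to several sources in parallel and interleaves the hits round-robin into a single list. It is not the code behind the prototype: the function names broadcast_search and interleave are hypothetical, and each source is assumed to be a plain Python callable that takes a query string and returns a list of hits. A real connector would wrap a protocol client or an API call.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import zip_longest


def broadcast_search(query, sources, timeout=10):
    """Send the query to every source in parallel.

    `sources` maps a source name to a callable (hypothetical connector)
    that returns a list of hits. A source that fails or times out
    contributes an empty list instead of aborting the whole search.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=len(sources) or 1) as pool:
        futures = {name: pool.submit(fn, query) for name, fn in sources.items()}
        for name, future in futures.items():
            try:
                results[name] = list(future.result(timeout=timeout))
            except Exception:
                results[name] = []
    return results


def interleave(results):
    """Merge per-source hit lists round-robin, so that no single
    source dominates the top of the combined list."""
    missing = object()  # sentinel for sources that have run out of hits
    merged = []
    for row in zip_longest(*results.values(), fillvalue=missing):
        merged.extend(hit for hit in row if hit is not missing)
    return merged
```

Round-robin interleaving is only the simplest possible merge strategy; relevance-based ranking across sources would need scores that are comparable between them, which this sketch does not attempt.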
It's important to emphasize that this demonstrator is not necessarily meant to show specifically how we imagine a DPLA search interface will look, or to what kinds of things it should provide access. It is essential that information technology be malleable – that it be able to evolve over time in response to changing requirements and user expectations. We pursue this goal in two ways: first, by always advocating openness and standardization in the way that information is exposed, to make it as easy as possible to reuse and re-purpose data in different ways over time and for different audiences; second, by basing all of our solutions on components with open interfaces, which allow them to be combined in different ways depending on requirements.
Finally, openness also means being willing to deliver your services where your users are. These days, increasingly, that means social networks and online, collaborative environments. What would it look like if we imagine the DPLA as a living, breathing part of a social networking environment, with the means to easily share results, build reading lists together, to work and to study together? We believe that the Internet's role in connecting people and in facilitating collaboration is still in its infancy. We don't claim to know what the next step will look like, but the best way to prepare for it is by designing for openness and extensibility at every level.
Reflections on a Linked (Data) Future
Linked Data promises huge benefits to the entire library community, including better access to reusable data and a more expressive model for descriptive information of all types. Perhaps most exciting, Linked Data suggests a model in which bibliographic data doesn't languish in separate silos (often directly modeled upon – if not copied from – the card catalogs of yesteryear), but exists as an integral component of a larger web of knowledge.
In a perfect realization of this idea, biographical information about authors, their different creative works, collaborators, areas of interest and more are all linked and interconnected with different options for obtaining representations of those works – either online or printed on demand, as audiobooks, movies, etc. The data can be crawled and analyzed for scholarly or research purposes, or it can be used to produce compelling learning and discovery experiences for students and others.
The challenges, however, are perhaps as daunting as the potential. The totality of bibliographic metadata in existence today includes tens of millions of unique records for books alone – many more for journal articles, movies, etc. Billions of redundant copies and variations of these records exist in tens of thousands of libraries around the world. The data has been created, over decades, sometimes based on mass digitization of even older physical records. Determining which of these records – often incomplete or erroneous – describe the same thing is a daunting task, to say nothing of which of the countless editions, translations, and publication forms correspond to the same fundamental creative work.
None of these problems have been “solved” today, although many organizations have devised workarounds and partial solutions to reduce their impact within individual, closed systems. Simply adding a Linked Data layer over this mess does not eliminate the problems, although it does offer a stronger framework for addressing them.
Some of these issues can be resolved by tools, which might use heuristics, crosswalks, or other types of mappings to create order out of chaos. Approaches based on crowdsourcing, such as the Open Library, may be our best hope for cleaning up the data over time. But realistically, we will be living in a structural and semantic “Wild West” for years or decades to come, and software will have to be relatively flexible and adept to function out there.
In addition, the Linked Data approach itself, for all of its almost unlimited degrees of freedom, presents its own set of challenges:
- Data models – how does one name and express relationships between different entities? Although Linked Data allows different data models to exist in the same universe and information system, making any kind of intelligent use of the data is much easier if there is some uniformity in the way that things are described. Today, such a consensus does not yet exist in the library market.
- Identity – how does one know that different facts about a person do indeed describe the same person? How does one know that multiple citations point to the same article? How does one know that two foreign language translations in fact stem from the same original? Linked Data allows for everything to be identified by URLs, but how do we get there from our masses of legacy data, much of which predates the notion of universal identification?
- Authority and provenance – How do we know what information can be trusted when it is scattered across the web?
- Rights – Developers who seek to integrate multiple information sources into their systems already face usage guidelines and licenses that can be frustratingly vague about what can be done with a given service or datum. Linked Data, where information can be interlinked from many different sources, only aggravates this. Indeed, the notion of licensing and control over the use of raw data is still poorly understood – a situation fraught with risk and complications for small organizations wishing to leverage the new possibilities.
The DPLA has a significant role to play as a stakeholder in the library community. Disorganization and lack of best practices and standards are an impediment to innovation and collective development efforts. The future viability of an “open” approach to the DPLA depends on standardization in the area of metadata expression, exchange and access.
A discovery solution, such as the one described in this proposal, has a number of potential interface points to the Linked Data world – in the long term, we envision discovery solutions becoming fully-fledged members of a linked information infrastructure, but in the meantime, we believe they can also play a role as facilitators of migration and integration.
On the consumer side, tools can be built to facilitate easier searching in linked datasets. Crawling/indexing or live searching may be the preferred approach in different situations. The logic that already exists in discovery systems to merge/deduplicate bibliographic data can be turned towards the challenges of establishing – at least provisionally – identity, even when the data is ambiguous.
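To make the idea of provisional identity concrete, here is a minimal sketch of key-based merging. It is an illustration under stated assumptions, not the matching logic of any particular discovery system: records are assumed to be plain dictionaries with optional "title", "author" and "source" fields, and the helper names normalize, match_key and merge_records are hypothetical. Two records are treated as the same item when their normalized title and author agree.

```python
import re
import unicodedata
from collections import defaultdict


def normalize(text):
    """Reduce a string to lowercase ASCII letters, digits and single spaces."""
    text = unicodedata.normalize("NFKD", text or "")
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    return " ".join(text.split())


def match_key(record):
    """Provisional identity: normalized title plus normalized author."""
    return (normalize(record.get("title")), normalize(record.get("author")))


def merge_records(records):
    """Cluster records that share a match key and merge each cluster.

    For every field the first non-empty value wins; the contributing
    sources are kept in a "sources" list so provenance is not lost.
    """
    clusters = defaultdict(list)
    for record in records:
        clusters[match_key(record)].append(record)

    merged = []
    for group in clusters.values():
        combined = {}
        for record in group:
            for field, value in record.items():
                if field != "source" and value and not combined.get(field):
                    combined[field] = value
        combined["sources"] = [r["source"] for r in group if r.get("source")]
        merged.append(combined)
    return merged
```

An exact match on a normalized key is deliberately conservative. It will miss variant spellings and cannot tell editions or translations apart, which is where the heuristics discussed above would have to come in.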
On the service provider side, discovery systems can potentially expose their results in a form suitable for use by Linked Data applications. These results can be used by such applications to discover data that already exists in a linked form on the network, but they may also enable access to legacy data silos – to information resources that have not yet made the migration to a linked representation. As such, Linked-Data-enabled discovery systems have the potential not only to support a community migration towards a better way of working with information, but also to be first-class citizens in that future.
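One way such exposure could look is sketched below: a search result is re-expressed as a JSON-LD document using Dublin Core terms. This is an assumption-laden example rather than a description of an existing interface. The record layout, the to_jsonld function, the field mapping and the http://example.org/record/ base URI are all hypothetical; only the Dublin Core namespace and the JSON-LD keywords "@context" and "@id" are taken from the respective standards.

```python
import json

# Dublin Core terms namespace
DCTERMS = "http://purl.org/dc/terms/"

# Hypothetical mapping from internal result fields to Dublin Core properties
FIELD_MAP = {
    "title": "dcterms:title",
    "author": "dcterms:creator",
    "year": "dcterms:date",
}


def to_jsonld(record, base_uri="http://example.org/record/"):
    """Express one result record as a JSON-LD document.

    The record must carry an "id"; `base_uri` is a placeholder for
    wherever the discovery system would publish stable record URLs.
    Empty or missing fields are simply left out.
    """
    doc = {
        "@context": {"dcterms": DCTERMS},
        "@id": base_uri + str(record["id"]),
    }
    for field, prop in FIELD_MAP.items():
        if record.get(field):
            doc[prop] = record[field]
    return doc


if __name__ == "__main__":
    example = {"id": 42, "title": "Walden", "author": "Thoreau, Henry David"}
    print(json.dumps(to_jsonld(example), indent=2))
```

Giving every result a stable URL is the part that matters most here: it is what lets a legacy record, still held in a silo, be referenced from the linked world before the silo itself has migrated.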
With our Beta Sprint contribution, we have offered up our thoughts and ideas about some of the challenges and opportunities facing the library community today, and how discovery technology, as a part of a larger DPLA infrastructure, can play a role in empowering libraries and citizens.
Index Data is a group of passionate and talented individuals with a long history of innovation and collaboration within the field of search and information retrieval in libraries, and we are widely recognized for our heritage in open source software. We believe that we would make a strong partner to the DPLA as it maps out the path forward.