Machine learning in libraries: profiling research projects rather than people

Machine learning in libraries, as in many other contexts, will often rely on data about people and their activities. Data in a library system can be made available to machine learning algorithms to develop predictive models, which have the potential to help patrons in their research. Of course, the data might also be used for the benefit of others without the patrons’ permission, either now or at some future time. In particular, the data can serve as a basis for creating profiles of individuals, which others may exploit for undue advantage. If “knowledge is power,” then it is worth considering whether that power is authorized and what its limits are.

If libraries were to avoid keeping a record of patron activities, then it would be much more difficult to build profiles of patrons. However, libraries need to track some basic information, such as circulation transactions, and can offer better services to patrons if they track and analyze how resources are being used. In addition, they may be obligated to record information about access to electronic resources. Machine learning also offers benefits of its own, and while “opt-in” or “opt-out” models can limit profiling, many machine learning algorithms work well only if most people participate.

Suppose, however, that libraries were to request that every patron specify one or more “research projects” corresponding to the patron’s interests, broadly defined, and to select one of these projects when logging in to the library system. Within the library’s database, most patron activities could then be associated with a project rather than with a patron.

For example, Jenny might create a project called “Information theory” for her graduate study, a second project called “Cooking” to pursue her passion for learning new cuisines, and a third called “Reading” to represent reading for pleasure. Somewhere in the library system, an association is now stored between Jenny and her three projects. The library would not be authorized to share this association outside of the library or with machine learning algorithms, except with specific consent in cases where it is absolutely necessary. Elsewhere in the database, Jenny’s data would be associated not directly with her but with the project she has selected to work on.
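The separation described here can be illustrated with a minimal sketch. It is not drawn from any existing library system: the class name `LibraryStore`, its fields, the patron and item identifiers, and the use of random hexadecimal project IDs are all assumptions made for illustration. The point is only that the patron-to-project association lives in one restricted place, while activity records carry a project ID and nothing else that names the patron.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class LibraryStore:
    """Hypothetical in-memory stand-in for a library database."""

    # Restricted table: patron_id -> {project_id: project name}.
    # In this sketch it is never handed to analytics or exports.
    patron_projects: dict = field(default_factory=dict)
    # General table: each activity row refers to a project, not a patron.
    activity: list = field(default_factory=list)

    def create_project(self, patron_id: str, name: str) -> str:
        # An opaque random ID, so the ID itself says nothing about the patron.
        project_id = uuid.uuid4().hex
        self.patron_projects.setdefault(patron_id, {})[project_id] = name
        return project_id

    def record(self, project_id: str, action: str, item: str) -> None:
        # Note the absence of any patron identifier in the stored row.
        self.activity.append(
            {"project_id": project_id, "action": action, "item": item}
        )


store = LibraryStore()
info_theory = store.create_project("jenny", "Information theory")
cooking = store.create_project("jenny", "Cooking")
reading = store.create_project("jenny", "Reading")

# Jenny selects a project at login; her activity is filed under it.
store.record(info_theory, "checkout", "item-101")
store.record(cooking, "view", "item-202")
```

In a real system the restricted association would presumably need stronger protection than a separate field, such as separate storage and access controls; the sketch shows only the shape of the data.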

When Jenny is logged into the library system, she might see a dashboard for the project that she is currently working on, and could easily switch to another project. This arrangement may make sense to Jenny, as it would allow her to view and manage related information together. She probably would not care to see articles about information theory suggested by a recommender system while browsing books of recipes.

With some exceptions, there is no need for machine learning to profile people in order to help with their research interests. Jenny’s selection of a project could even help the algorithms to be more accurate, by indicating what she is working on. At any rate, it would be up to Jenny to decide how to organize her projects and where to set boundaries for how data are used, based on which of her interests she thinks would benefit. She could begin to use machine learning selectively as a tool, rather than being pressured into an all-or-nothing choice.

Another advantage of this approach could be seen in cases where data are anonymized and exported from the library system to be analyzed by someone outside the library. Anonymized data can sometimes be “re-identified” because the data may reveal a combination of specific activities or interests that can be linked to a person. If the library were to track projects rather than patrons, and assuming the patron-project groupings were not disclosed with the anonymized data, then patron data would be fragmented by project, potentially making re-identification more difficult.
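As a rough sketch of the fragmentation described above, the following example exports activity rows keyed only by project and withholds the patron-project grouping. All names and data here are hypothetical, and replacing internal project IDs with fresh random labels at export time is an added assumption rather than something the approach requires. The sketch does not show that re-identification becomes impossible, only that an outside analyst sees three small, unlinked fragments instead of one combined history.

```python
import secrets
from collections import defaultdict

# Hypothetical internal data. The first table is restricted and is
# deliberately not passed to the export function below.
patron_projects = {"jenny": ["p1", "p2", "p3"]}
activity = [
    {"project_id": "p1", "item": "item-101"},
    {"project_id": "p2", "item": "item-202"},
    {"project_id": "p1", "item": "item-103"},
    {"project_id": "p3", "item": "item-304"},
]


def export_activity(rows):
    """Return copies of the rows, re-keyed with a random label per project."""
    labels = {}
    exported = []
    for row in rows:
        pid = row["project_id"]
        if pid not in labels:
            # Fresh label per export, so it cannot be joined to internal IDs.
            labels[pid] = secrets.token_hex(8)
        exported.append({"project": labels[pid], "item": row["item"]})
    return exported


exported = export_activity(activity)

# What an outside analyst could reconstruct: items grouped by project label.
fragments = defaultdict(set)
for row in exported:
    fragments[row["project"]].add(row["item"])

for label, items in fragments.items():
    print(label, sorted(items))
```

Nothing in the exported rows indicates that the three labels belong to the same person, which is the property the argument depends on.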

Nassib Nassar joined Index Data in 2015 as the product manager for FOLIO and currently leads a research project on data sharing for open science.