SPLURGE: Scholars Portal Library Usage-based Recommendation Engine

Amazon.ca has a "customers who bought this item also bought" feature that recommends things to you that you might be interested in. LibraryThing has it too: the recommendations for What's Bred in the Bone by Robertson Davies include books by Margaret Laurence, Carol Shields, Michael Ondaatje, Peter Ackroyd, John Fowles, and David Lodge, as well as other Davies works.

Library catalogues don't have any such feature, but they should. And libraries are sitting on the circulation and usage data that makes it possible. (BiblioCommons does have a Similar Titles feature, but it's a closed commercial product aimed at public libraries, and anyway the titles are added by hand.)

SPLURGE will collect usage data from OCUL members and build a recommendation engine that can be integrated into any member's catalogue. The code will be made available under the GNU General Public License and the data will be made available under an open data license.

Hackfest on Friday 17 February 2012 at Ryerson University

There will be a one-day SPLURGE hackfest on Friday 17 February 2012 on the seventh floor of Heaslip House at Ryerson University in Toronto. It is open to anyone in an OCUL library. No breakfast or lunch will be provided, but there are many coffee shops, restaurants and pubs nearby. The night before the hackfest there will be a dinner in Toronto at a pub. (Email for details.)

Everyone coming to the hackfest should bring at least six months of usage data from their library. See Data Collection below for more details.

Guest wifi accounts will be available. If your university is part of the eduroam network, you can use eduroam instead. Please make any necessary arrangements in advance.

For details, please contact the organizers, William Denton (wdenton@yorku.ca and @wdenton) (Web Librarian, York University) and Cameron Metcalf (cmetcalf@uottawa.ca and @podeas) (Head, Library Systems Division, University of Ottawa).

Background

We plan to implement in Ontario something close to the JISC project called MOSAIC (Making Our Shared Activity Information Count). The documents there describe what they did, and our plan is based on that.

MOSAIC Data Collection: A Guide
MOSAIC Final Report (and Appendices)
Also MOSAIC Demonstration Links, from a software contest they ran to find new, interesting uses for their data. (The examples here go beyond the Recommendation Engine idea, but are worth looking at to see other possible future directions.)

The JISC MOSAIC wiki has code and data examples.

The JISC project grew out of work done by Dave Pattern (Library Systems Manager) and others at the University of Huddersfield. They made usage data available under an Open Data Commons License.

Data
README
Pattern explains things in Free book usage data from the University of Huddersfield
Pattern summarized it all in March 2011 in Sliding Down the Long Tail.

Updated 13 Feb: The SALT Recommender API does what we want to do, and JISC's planned SALT 2 project takes a consortial approach, as OCUL would:

SALT Recommender API at Manchester
Copac Activity Data Project aka SALT 2
JISC's Activity Data

Data gathering

Scholars Portal will aggregate the data from the different libraries, and make the (anonymous) results openly available.

Data levels

MOSAIC set out three levels of usage data in the Final Report (p 40):

We refer to library circulation (loan & renewal) information as use data. Use data contains one use record per item borrowed. Sets of use records may have different amounts of information in each record, according to the data level that applies to all the records in the set.

Level	Description	Use
Level 0	Level 0 use records contain where and when the loan was made and the item borrowed.	Level 0 use data can be used to indicate popular loan items in the participating library.
Level 1	Level 1 records are as for level 0, but also with borrower context information, indicating borrower type (staff or student), and course and progression level (for students).	Level 1 use data can be used to see, via facets, for a given search, what was borrowed in one or more of: a particular institution, a particular course, a particular progression level (or by staff), and in a particular academic year.
Level 2	Level 2 records are as for level 0, but also with an anonymised user ID	Level 2 use data enables recommendations like "borrowers of this item also borrowed," and "borrowers of this item previously borrowed/went on to borrow."

We would collect use data at Level 0.

Data collection

To build SPLURGE, and to work on it at the hackfest, we need some good sample data. Everyone coming should bring at least six months of usage data from their library. We can store the data on Scholars Portal servers (location to be determined).

We need to make it as easy as possible for people to pull the data from their systems. Because there are several different ILSes used across the province, the necessary database or report commands will vary, but once done for one ILS they can be shared with other users of the same system. MOSAIC's existing code for SirsiDynix Horizon may be useful. Any code written can be added to this repository.

We will use two of the data files described in the MOSAIC data file formats:

items.txt
transactions.YYYY.txt

Because we are working at Level 0 and not connecting users and courses, we don't need users.YYYY.txt or courses.txt.

items.txt

Tab-delimited file. Fields:

item ID    (mandatory)
ISBN(s)    (mandatory)
title      (mandatory)
author(s)
publisher
publication year
persistent URL

Sample:

123 → 0415972531 → Music & copyright → L. Marshall → Wiley   → 2004 → http://libcat.hud.ac.uk/123
234 → 0415969298 → Songwriting tips  → N. Skilbeck → Phaidon → 1997 → http://libcat.hud.ac.uk/234

The item ID is whatever ID you want to use to identify a library book. It must match the item ID contained in the transactions file.
The ISBN(s) are one (or more) ISBNs, separated by a | pipe character where more than one ISBN is linked to the item (e.g. 0415966744|0415966752).
The title is the title of the book.
The author(s) are one (or more) names, separated by a | pipe character where more than one name is present (e.g. John Smith|Julie Johnson).
The publisher and publication year are the name of the publishing company and the year of publication.
The persistent URL is the web address the item can be found at (e.g. on your library catalogue).
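The field rules above can be sketched as a small parser. This is only an illustration, not MOSAIC's or SPLURGE's code: the `Item` class and `parse_item_line` function are hypothetical names, and it assumes the optional trailing fields may simply be left off a line.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Item:
    # Hypothetical container for one items.txt record.
    item_id: str
    isbns: List[str]
    title: str
    authors: List[str]
    publisher: Optional[str] = None
    year: Optional[str] = None
    url: Optional[str] = None


def parse_item_line(line: str) -> Item:
    """Parse one tab-delimited items.txt line into an Item."""
    fields = line.rstrip("\n").split("\t")
    # Assumption: optional trailing fields may be missing, so pad to 7.
    fields += [""] * (7 - len(fields))
    item_id, isbns, title, authors, publisher, year, url = (
        f.strip() for f in fields[:7]
    )
    if not (item_id and isbns and title):
        raise ValueError("item ID, ISBN(s) and title are mandatory")
    return Item(
        item_id=item_id,
        # Multiple ISBNs and authors are separated by a | pipe character.
        isbns=[i for i in isbns.split("|") if i],
        title=title,
        authors=[a for a in authors.split("|") if a],
        publisher=publisher or None,
        year=year or None,
        url=url or None,
    )


# First sample line from above, with real tab characters.
sample = "123\t0415972531\tMusic & copyright\tL. Marshall\tWiley\t2004\thttp://libcat.hud.ac.uk/123"
print(parse_item_line(sample))
```

Keeping the year as a string avoids guessing how each ILS exports it; a stricter parser could validate it once real exports are available.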

transactions.YYYY.txt

Tab-delimited file. Fields:

timestamp (mandatory)
item ID   (mandatory)
user ID   (mandatory)

Sample:

1222646400  →  114784  →  67890
1225756800  →  103828  →  67890
1225756800  →  62580   →  76543

The timestamp is in Unix time format (i.e. the number of seconds since 1st Jan 1970 UTC). It is used to calculate the day the transaction occurred on.
The user ID is whatever ID you want to use to identify an individual library user. It will be converted to an MD5 hash value before the data is submitted. If you also produce a user file (SPLURGE does not need one), it must match the user ID contained there.
The item ID is whatever ID you want to use to identify a library book. It must match the item ID contained in the item file.
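As a rough sketch of how one line of this file could be processed, the snippet below converts the timestamp to a UTC day and hashes the user ID with MD5. The function name `parse_transaction_line` is hypothetical, and hashing the ID as a UTF-8 string is an assumption; the source does not say exactly how the hash is computed.

```python
import hashlib
from datetime import datetime, timezone


def parse_transaction_line(line):
    """Return (day, item_id, hashed_user_id) for one transactions line."""
    timestamp, item_id, user_id = (
        f.strip() for f in line.rstrip("\n").split("\t")
    )
    # Unix time is seconds since 1 Jan 1970 UTC; keep only the day.
    day = datetime.fromtimestamp(int(timestamp), tz=timezone.utc).date()
    # Assumption: the raw user ID is hashed as a UTF-8 string.
    hashed = hashlib.md5(user_id.encode("utf-8")).hexdigest()
    return day, item_id, hashed


# First sample line from above, with real tab characters.
day, item_id, hashed = parse_transaction_line("1222646400\t114784\t67890")
print(day, item_id, hashed)
```

The first sample timestamp, 1222646400, falls on 2008-09-29 in UTC.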

Data storage

The data will be stored as XML using the same format as Huddersfield used in their data release (see the usage data README):

circulation_data.xml contains aggregate usage information for individual titles
suggestion_data.xml contains people who borrowed X also borrowed Y relations

data2xml.pl (taken from MOSAIC's code) will convert the library-generated data into the richer XML that we will use for the work, as described in the usage data README from their script repository.

Building SPLURGE

The purpose of the hackfest is to build the code to make the Recommendation Engine work. When the Recommendation Engine is given an ISBN or other ID number it will suggest a list of related items.

Pattern's Sliding Down the Long Tail describes the logic we'll need to follow.

Tim Spalding implemented a similar feature at LibraryThing. When asked on Twitter how it worked, he said "The best code is just statistics" and "Given random distribution how many of book X would you expect? How many did you find?"

In conversation, both Pattern and Spalding mentioned the Harry Potter effect: some books are so popular with everyone that they need to be damped down. Everyone reading Freud or Ferlinghetti, Feynman or Foucault, is probably also reading J.K. Rowling, but that doesn't mean Harry Potter and the Goblet of Fire should be recommended to people looking at Totem and Taboo or Madness and Civilization.
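One possible reading of the "how many would you expect, how many did you find" idea is to score each co-borrowed item by the ratio of observed to expected shared borrowers. The sketch below is an assumption about how that could look, not Pattern's or Spalding's actual method; the `recommend` function, its `min_shared` threshold and the toy loan data are all made up for illustration.

```python
from collections import defaultdict


def recommend(loans, target, top_n=5, min_shared=2):
    """Rank items co-borrowed with `target` by observed/expected borrowers.

    loans: iterable of (user_id, item_id) pairs.
    """
    items_by_user = defaultdict(set)
    for user, item in loans:
        items_by_user[user].add(item)

    borrowers = defaultdict(int)  # item -> number of distinct borrowers
    shared = defaultdict(int)     # item -> borrowers shared with target
    for items in items_by_user.values():
        for item in items:
            borrowers[item] += 1
        if target in items:
            for item in items - {target}:
                shared[item] += 1

    total_users = len(items_by_user)
    scored = []
    for item, observed in shared.items():
        if observed < min_shared:
            continue  # too little evidence to say anything
        # Shared borrowers expected if borrowing were independent.
        expected = borrowers[target] * borrowers[item] / total_users
        scored.append((observed / expected, item))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(item, round(score, 2)) for score, item in scored[:top_n]]


# Toy data: everyone borrows "potter"; "totem" and "madness" overlap.
loans = [
    ("u1", "totem"), ("u1", "madness"), ("u1", "potter"),
    ("u2", "totem"), ("u2", "madness"), ("u2", "potter"),
    ("u3", "totem"), ("u3", "potter"),
    ("u4", "potter"),
    ("u5", "potter"),
    ("u6", "potter"), ("u6", "madness"),
]
print(recommend(loans, "totem"))
# [('madness', 1.33), ('potter', 1.0)]
```

By raw counts "potter" shares three borrowers with "totem" and "madness" only two, but because everyone borrows "potter" its ratio is exactly 1.0 (no more than chance), so it ranks below "madness". A real engine would likely also want a cut-off on the ratio itself.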

Implementation as a web service

The Recommendation Engine will have a web-based API available at Scholars Portal. Ideally a library will be able to insert one line of JavaScript into its HTML template to make the recommendations appear.
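To make the idea concrete, here is a minimal sketch of what such an API endpoint could look like as a WSGI application. Everything in it is hypothetical: the `isbn` query parameter, the JSON shape, and the `RECOMMENDATIONS` table (which pairs the two sample ISBNs purely as placeholder data) are assumptions, not a design that has been decided.

```python
import json
from urllib.parse import parse_qs

# Placeholder lookup table: ISBN -> list of recommended ISBNs.
# A real service would read from the aggregated usage data.
RECOMMENDATIONS = {
    "0415972531": ["0415969298"],
}


def app(environ, start_response):
    """WSGI app answering ?isbn=... with a JSON list of related ISBNs."""
    query = parse_qs(environ.get("QUERY_STRING", ""))
    isbn = query.get("isbn", [""])[0]
    body = json.dumps(
        {"isbn": isbn, "recommendations": RECOMMENDATIONS.get(isbn, [])}
    ).encode("utf-8")
    start_response("200 OK", [
        ("Content-Type", "application/json"),
        ("Content-Length", str(len(body))),
        # Lets a script on a library catalogue page call the service.
        ("Access-Control-Allow-Origin", "*"),
    ])
    return [body]


# To try it locally with the standard library:
#   from wsgiref.simple_server import make_server
#   make_server("localhost", 8000, app).serve_forever()
```

An unknown ISBN returns an empty list rather than an error, so a catalogue page can simply show nothing when there are no recommendations.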

Systems and tools that might be useful

Apache Mahout
Melvyl Recommender Project at California Digital Library

Possible future directions

Moving data collection to Level 1 and involving course information.

Open data and anonymity

All of the anonymized data will be made freely available under an open license.

No identifying information will be connected to the usage data. It will be completely anonymous.
