Indexing EAD in ArcLight

Jack Reed edited this page Aug 21, 2019 · 26 revisions

Now that you have your ArcLight application up and running, we need to index data into it.

EAD requirements

Currently, ArcLight's indexer expects the following:

  • Valid and well-formed EAD 2002 according to its XSD schema. If we can't parse the finding aid, we can't index it. (Indexing DTD-compliant EAD 2002 might work, but we can't guarantee it.)
  • All components (e.g. <c/>, <c01/>, etc.) have an id attribute. These id attributes are required to generate the "slug" used for URLs for the individual components. (See issue #446 for more information.)
  • All components have at least a <unittitle/> or <unitdate/>. Without either, we won't be able to display anything!
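
The last two requirements can be checked before indexing. The following is a minimal sketch, not part of ArcLight: the function names `local_name` and `check_components` are hypothetical, and it assumes that components are the elements named `c` or `c01` through `c12`, and that a component's title or date lives inside its own `<did>`. It does not validate the file against the EAD 2002 XSD.

```python
import xml.etree.ElementTree as ET

# Component element names in EAD 2002: unnumbered <c> plus <c01> .. <c12>.
COMPONENT_TAGS = {"c"} | {f"c{n:02d}" for n in range(1, 13)}


def local_name(tag):
    """Strip an XML namespace, e.g. '{urn:isbn:1-931666-22-9}c01' -> 'c01'."""
    return tag.rsplit("}", 1)[-1]


def check_components(root):
    """Return a list of problems found in the components below `root`.

    Hypothetical helper: it reports components without an id attribute and
    components whose own <did> has neither a <unittitle> nor a <unitdate>.
    """
    problems = []
    for element in root.iter():
        name = local_name(element.tag)
        if name not in COMPONENT_TAGS:
            continue
        component_id = element.get("id")
        if not component_id:
            problems.append(f"<{name}> component without an id attribute")
        # Only look at the component's own <did>, not those of its children.
        did = next((child for child in element if local_name(child.tag) == "did"), None)
        found = {local_name(d.tag) for d in did.iter()} if did is not None else set()
        if not found & {"unittitle", "unitdate"}:
            label = component_id or f"<{name}> without id"
            problems.append(f"{label}: no <unittitle> or <unitdate>")
    return problems
```

To check a file on disk, parse it first, for example `check_components(ET.parse("eads/alphaomegaalpha.xml").getroot())`. Parsing raises `xml.etree.ElementTree.ParseError` if the file is not well-formed, and an empty list means no problems were found.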

Download sample EAD

First we need to download or otherwise obtain our EADs. Let's create a directory within our application where we can store them.

mkdir eads

Now let's add some data there.

# This command will save one of our test datasets to the directory you just created
wget -P eads/ https://raw.githubusercontent.com/sul-dlss/arclight/master/spec/fixtures/ead/nlm/alphaomegaalpha.xml

Repository configuration

Next we need to tell the indexing task which "Repository" the EAD file belongs to. By default, your ArcLight application should have a generated file, config/repositories.yml, that describes the repositories for your instance. For example, we want to link the EAD alphaomegaalpha.xml to the first repository in that file, nlm:

nlm:
  name: 'National Library of Medicine. History of Medicine Division'
  description: 'NLM’s History of Medicine Division collects, preserves, makes available, and interprets for diverse audiences one of the world’s richest collections of historical material related to human health and disease.'
  building: 'Building 38, Room 1E-21'
  address1: '8600 Rockville Pike'
  address2: ''
  city: 'Bethesda'
  state: 'MD'
  zip: '20894'
  country: 'USA'
  phone: ''
  contact_info: 'hmdref@nlm.nih.gov'
  thumbnail_url: "https://collections.nlm.nih.gov/pageturnerserver/ajaxp?theurl=http://localhost:8080/fedora/get/nlm:nlmuid-101421040-img/THUMB"
  google_request_url: 'https://docs.google.com/a/stanford.edu/forms/d/e/1FAIpQLSeOamhY_IcFw4sPnz0ddwWWkrPaHbM5wp7JVbOLOL_mIusEyw/viewform'
  google_request_mappings: "document_url=entry.1980510262&collection_name=entry.619150170&collection_creator=entry.14428541&eadid=entry.996397105&containers=entry.1125277048&title=entry.862815208"

We recommend that your config/repositories.yml contain only the repositories for which you have EADs to index.

Configuring a repository for Google Form Requests

ArcLight repositories can be configured so that items are requestable through Google Forms. To enable this functionality, provide the following keys for your repository in config/repositories.yml:

  • google_request_url - the URL of the user-facing version of your request form
  • google_request_mappings - an encoded string that maps your custom form fields to ArcLight fields. The configurable ArcLight fields are:
    • collection_name
    • collection_creator
    • eadid
    • containers

To find the Google Form field identifiers (the entry.NNN values), use the "pre-filled" form option to generate a URL; its query string has a format similar to the google_request_mappings format. See Google Forms support for more information.

An example of a correctly configured form looks like this:

  google_request_url: 'https://docs.google.com/a/stanford.edu/forms/d/e/1FAIpQLSeOamhY_IcFw4sPnz0ddwWWkrPaHbM5wp7JVbOLOL_mIusEyw/viewform'
  google_request_mappings: "document_url=entry.1980510262&collection_name=entry.619150170&collection_creator=entry.14428541&eadid=entry.996397105&containers=entry.1125277048&title=entry.862815208"
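
The mapping string has the same shape as a URL query string, so it can be inspected with standard tools. The following is a minimal sketch, not part of ArcLight: both function names are hypothetical, and `mappings_from_prefilled_url` assumes that you typed the ArcLight field name (for example `eadid`) as the pre-filled answer for each form question before generating the pre-filled URL.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit


def parse_request_mappings(mappings):
    """Turn 'field=entry.123&...' into {'field': 'entry.123', ...}."""
    # strict_parsing makes malformed pairs raise ValueError instead of being skipped.
    return dict(parse_qsl(mappings, strict_parsing=True))


def mappings_from_prefilled_url(prefilled_url):
    """Build a google_request_mappings string from a pre-filled form URL.

    Assumption: each pre-filled answer is the ArcLight field name, so the URL
    contains pairs such as 'entry.996397105=eadid', which are inverted here.
    """
    query = urlsplit(prefilled_url).query
    pairs = [
        (answer, entry)
        for entry, answer in parse_qsl(query)
        if entry.startswith("entry.")  # skip other parameters such as 'usp'
    ]
    return urlencode(pairs)
```

Parsing the configured string this way makes it easy to see which ArcLight fields are mapped and to spot a missing or misspelled key before the form is used.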

Indexing a single file

We can now use the arclight:index task in ArcLight to index our EAD.

FILE=./eads/alphaomegaalpha.xml REPOSITORY_ID=nlm bundle exec rake arclight:index
Loading ./eads/alphaomegaalpha.xml into index...
Indexed ./eads/alphaomegaalpha.xml (in 0.837 secs).

Adding more finding aids and repositories

You can add new repositories to the config/repositories.yml file. The key that begins a repository is the same value you will use as the REPOSITORY_ID in the indexing rake task.

We recommend that you organize EADs by repository, putting each repository's files in a directory named after the repository's key. Then run the arclight:index_dir rake task with the DIR and REPOSITORY_ID environment variables to index all of the files in a directory into the same repository:

# this assumes there's a directory with EAD files called /tmp/sul-spec, and a repository configured with the ID "sul-spec"
DIR=/tmp/sul-spec REPOSITORY_ID=sul-spec bundle exec rake arclight:index_dir
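
If you follow the one-directory-per-repository layout, the task can be run once per directory. The following is a minimal sketch under that assumption, not part of ArcLight: the function names and the `dry_run` parameter are hypothetical, and each subdirectory name is assumed to match a key in config/repositories.yml.

```python
import os
import subprocess
from pathlib import Path

RAKE_COMMAND = ["bundle", "exec", "rake", "arclight:index_dir"]


def index_dir_jobs(ead_root):
    """Yield (repository_id, env_overrides) for each subdirectory of ead_root."""
    for directory in sorted(Path(ead_root).iterdir()):
        if directory.is_dir():  # plain files next to the directories are ignored
            yield directory.name, {
                "DIR": str(directory),
                "REPOSITORY_ID": directory.name,
            }


def run_index_dir(ead_root, dry_run=True):
    """Run (or, with dry_run, only print) one index_dir task per repository."""
    for repository_id, overrides in index_dir_jobs(ead_root):
        if dry_run:
            settings = " ".join(f"{key}={value}" for key, value in overrides.items())
            print(f"{settings} {' '.join(RAKE_COMMAND)}")
        else:
            # check=True stops at the first repository whose indexing fails.
            subprocess.run(RAKE_COMMAND, env={**os.environ, **overrides}, check=True)
```

Calling `run_index_dir("eads")` from the application directory prints the commands without running them; pass `dry_run=False` to execute them.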

Configuring Downloads for Collections

The config/downloads.yml file configures how the PDF and EAD download links are shown. For example, if you have a collection with the <unitid> of "MS C 271", you would provide links to the downloads and their sizes like so:

MS C 271:
  pdf:
    href: 'http://example.com/MS+C+271.pdf'
    size: '1.23MB'
  ead:
    href: 'http://example.com/MS+C+271.xml'
    size: 123456

Advanced: Using another Solr instance

If you are using a Solr instance that is not at the default location on localhost, you can set the SOLR_URL environment variable to index into that service:

SOLR_URL=http://solr.example.com/solr FILE=myead.xml REPOSITORY_ID=myid bundle exec rake arclight:index

Advanced: Purging your Solr instance

Normal indexing overwrites documents that are already in the index. However, if your content has changed, you may want to remove all of your Solr documents and then re-index your current content.

bundle exec rake arclight:destroy_index_docs
bundle exec rake arclight:index ...

Using Traject

Traject has been adopted as the way forward for indexing content into ArcLight. For example:

bundle exec traject -u http://127.0.0.1:8983/solr/blacklight-core -i xml -c lib/arclight/traject/ead2_config.rb spec/fixtures/ead/sample/large-components-list.xml