Indexing EAD in ArcLight
Now that you have your ArcLight application up and running, we need to index data into it.
Currently, ArcLight's indexer expects the following:
- Valid and well-formed EAD 2002 according to its XSD schema. If we can't parse the finding aid, we can't index it. (Indexing DTD-compliant EAD 2002 might work, but we can't guarantee it.)
- All components have at least a <unittitle/> or <unitdate/>. Without either, we won't be able to display anything!
- Components should all have unique IDs applied to them. These IDs are used as "slugs" for the identifiers of the documents in ArcLight. Maintaining these identifiers allows an EAD to be updated and re-indexed while keeping the URL at which the component resides (retaining any user bookmarks, etc.). We will mint IDs for components that do not have them, but this is done using the location of the component within the hierarchy of the EAD. This means that if components are moved around, the metadata that resides at a given URL may change in unexpected ways. See "Customizing behavior of indexing components w/o IDs" below for more info.
First we need to download or access our EADs. Let's create a directory within our application where we can store them.
mkdir eads
Now let's add some data there.
# This command will save one of our test datasets to the directory you just created
wget -P eads/ https://raw.githubusercontent.com/sul-dlss/arclight/master/spec/fixtures/ead/nlm/alphaomegaalpha.xml
Next we need to run our indexing task and tell the task which "Repository" the EAD file is linked to. By default, your ArcLight application should have a generated config/repositories.yml file. This file contains information about the repositories for your instance. For example, we want to link the EAD alphaomegaalpha.xml to the first repository in that file, nlm:
nlm:
  name: 'National Library of Medicine. History of Medicine Division'
  description: 'NLM’s History of Medicine Division collects, preserves, makes available, and interprets for diverse audiences one of the world’s richest collections of historical material related to human health and disease.'
  building: 'Building 38, Room 1E-21'
  address1: '8600 Rockville Pike'
  address2: ''
  city: 'Bethesda'
  state: 'MD'
  zip: '20894'
  country: 'USA'
  phone: ''
  contact_info: 'hmdref@nlm.nih.gov'
  thumbnail_url: "https://collections.nlm.nih.gov/pageturnerserver/ajaxp?theurl=http://localhost:8080/fedora/get/nlm:nlmuid-101421040-img/THUMB"
  google_request_url: 'https://docs.google.com/a/stanford.edu/forms/d/e/1FAIpQLSeOamhY_IcFw4sPnz0ddwWWkrPaHbM5wp7JVbOLOL_mIusEyw/viewform'
  google_request_mappings: "document_url=entry.1980510262&collection_name=entry.619150170&collection_creator=entry.14428541&eadid=entry.996397105&containers=entry.1125277048&title=entry.862815208"
We recommend that your config/repositories.yml contain only the repositories for which you have EADs to index.
ArcLight repositories can be configured so that items are requestable through Google Forms. To enable this functionality, provide the following keys for your repository in config/repositories.yml:
- google_request_url: the URL of the user-facing version of your request form
- google_request_mappings: a string that represents an encoded mapping between your custom form fields and ArcLight. The configurable ArcLight fields are:
  - collection_name
  - collection_creator
  - eadid
  - containers
To get the Google Form field identifiers, use the "pre-filled" form to get a crafted URL in a format similar to google_request_mappings. See Google Forms support for more information.
An example of a correctly configured form looks like this:
google_request_url: 'https://docs.google.com/a/stanford.edu/forms/d/e/1FAIpQLSeOamhY_IcFw4sPnz0ddwWWkrPaHbM5wp7JVbOLOL_mIusEyw/viewform'
google_request_mappings: "document_url=entry.1980510262&collection_name=entry.619150170&collection_creator=entry.14428541&eadid=entry.996397105&containers=entry.1125277048&title=entry.862815208"
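To see what the mapping string encodes, it can be decoded as a URL query string. The sketch below uses only Ruby's standard library; the constant and the helper method are illustrative names for this example and are not part of ArcLight.

```ruby
require "uri"

# Mapping string copied from the example above.
GOOGLE_REQUEST_MAPPINGS =
  "document_url=entry.1980510262&collection_name=entry.619150170&" \
  "collection_creator=entry.14428541&eadid=entry.996397105&" \
  "containers=entry.1125277048&title=entry.862815208"

# Hypothetical helper (not an ArcLight method): turn the mapping string into
# a Hash of ArcLight field name => Google Form field identifier.
def parse_request_mappings(mappings)
  URI.decode_www_form(mappings).to_h
end

parse_request_mappings(GOOGLE_REQUEST_MAPPINGS).each do |arclight_field, form_field|
  puts "#{arclight_field} -> #{form_field}"
end
```

Each key on the left is an ArcLight field and each value on the right is the form field identifier taken from the pre-filled Google Form URL.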
We can now use the arclight:index task in ArcLight to index our EAD.
FILE=./eads/alphaomegaalpha.xml REPOSITORY_ID=nlm bundle exec rake arclight:index
Loading ./eads/alphaomegaalpha.xml into index...
Indexed ./eads/alphaomegaalpha.xml (in 0.837 secs).
You can add new repositories to the config/repositories.yml file. The key that begins a repository is the same value you will use as the REPOSITORY_ID in the indexing rake task.
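As a quick check of which REPOSITORY_ID values are available, the top-level keys of the file can be listed. This is a minimal sketch using an abbreviated inline stand-in for the file; the "sul-spec" entry, its field values, and the helper name are assumptions made for illustration.

```ruby
require "yaml"

# Abbreviated stand-in for config/repositories.yml; the "sul-spec" entry and
# its values are illustrative assumptions.
REPOSITORIES_YAML = <<~YAML
  nlm:
    name: 'National Library of Medicine. History of Medicine Division'
    city: 'Bethesda'
  sul-spec:
    name: 'Example Special Collections'
    city: 'Stanford'
YAML

# Hypothetical helper: the top-level keys are the values to pass as
# REPOSITORY_ID to the indexing rake tasks.
def repository_ids(yaml_text)
  YAML.safe_load(yaml_text).keys
end

puts repository_ids(REPOSITORIES_YAML).inspect
```

In an application, the same helper could be given File.read("config/repositories.yml") instead of the inline string.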
We recommend that you organize EADs by repository and put each repository's files in a directory named after the repository's key. Then run the arclight:index_dir rake task with the DIR and REPOSITORY_ID environment variables to index all of the files into the same repository:
# this assumes there's a directory with EAD files called /tmp/sul-spec, and a repository configured with the ID "sul-spec"
DIR=/tmp/sul-spec REPOSITORY_ID=sul-spec bundle exec rake arclight:index_dir
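When several repositories follow this one-directory-per-key layout, the invocations can be generated rather than typed by hand. The sketch below only builds the environment and command for each run and prints them; the parent directory /tmp/eads, the repository keys, and the helper name are assumptions for the example.

```ruby
# Hypothetical helper: given a parent directory whose subdirectories are named
# after repository keys, build the environment and command for each
# arclight:index_dir run. Nothing is executed here.
def index_dir_invocations(parent_dir, repository_ids)
  repository_ids.map do |id|
    env = { "DIR" => File.join(parent_dir, id), "REPOSITORY_ID" => id }
    [env, %w[bundle exec rake arclight:index_dir]]
  end
end

index_dir_invocations("/tmp/eads", %w[nlm sul-spec]).each do |env, command|
  puts "#{env.map { |k, v| "#{k}=#{v}" }.join(' ')} #{command.join(' ')}"
  # To actually run it from the application root: system(env, *command)
end
```

Passing the environment as a Hash to Kernel#system keeps the variables scoped to that one rake run.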
We use the config/downloads.yml file to configure how we show PDF and EAD download links. For example, if you have a collection with the <unitid> of "MS C 271", you would provide links to the downloads and their sizes like so:
MS C 271:
  pdf:
    href: 'http://example.com/MS+C+271.pdf'
    size: '1.23MB'
  ead:
    href: 'http://example.com/MS+C+271.xml'
    size: 123456
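To illustrate how this structure is keyed, the sketch below parses the same YAML and looks up a download by unitid and format. The helper is hypothetical and is not how ArcLight itself reads the file.

```ruby
require "yaml"

# Same content as the example above, inlined so the sketch is self-contained.
DOWNLOADS_YAML = <<~YAML
  MS C 271:
    pdf:
      href: 'http://example.com/MS+C+271.pdf'
      size: '1.23MB'
    ead:
      href: 'http://example.com/MS+C+271.xml'
      size: 123456
YAML

# Hypothetical helper (not ArcLight's own lookup): find the download entry
# for a collection's unitid and a format ("pdf" or "ead").
# Returns nil when the unitid or format is not configured.
def download_for(downloads, unitid, format)
  downloads.dig(unitid, format)
end

downloads = YAML.safe_load(DOWNLOADS_YAML)
pdf = download_for(downloads, "MS C 271", "pdf")
puts "#{pdf['href']} (#{pdf['size']})"
```

Note that YAML parses the quoted size as a String and the unquoted one as an Integer, so the two entries in the example carry different types.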
If you are using a Solr instance that is not at the default location on localhost, you can provide the SOLR_URL environment variable to index into that service:
SOLR_URL=http://solr.example.com/solr FILE=myead.xml REPOSITORY_ID=myid bundle exec rake arclight:index
Normal indexing with ArcLight will overwrite your existing content. You may, however, want to remove all of your Solr documents if your content has changed, and then re-index your current content:
bundle exec rake arclight:destroy_index_docs
bundle exec rake arclight:index ...
While it is highly recommended that you index EAD that has consistent IDs for all components, we do mint an ID for you if we encounter a component without an ID. This can be customized in a few ways.
By default, the indexer uses something similar to an XPath to the component (but including indexes, to make sure there is always a unique value for each component) and uses SHA1 to create a hexdigest. This is then added to the ID of the collection to generate the document ID (similar to other documents that have IDs).
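The idea can be sketched in a few lines of Ruby. This is not ArcLight's implementation: the path format, the way the digest is joined to the collection ID, the collection ID "aoa271", and the method name are all assumptions chosen to show why a position-based ID changes when a component moves.

```ruby
require "digest/sha1"

# Illustrative only: ArcLight's real path format and ID layout may differ.
# `path` stands in for an XPath-like location that includes sibling indexes.
def mint_component_id(collection_id, path, algorithm: Digest::SHA1)
  "#{collection_id}#{algorithm.hexdigest(path)}"
end

first = mint_component_id("aoa271", "/ead/archdesc/dsc/c[1]/c[2]")
moved = mint_component_id("aoa271", "/ead/archdesc/dsc/c[1]/c[3]")

puts first
puts first == moved # false: moving a component changes its minted ID
```

Because the digest depends only on the component's position, reordering siblings gives the same content a different ID, which is why explicit IDs in the EAD are preferred.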
It's possible to use another algorithm by updating Arclight::HashAbsoluteXpath.hash_algorithm:
Arclight::HashAbsoluteXpath.hash_algorithm = Digest::SHA256
This can be any object that responds to #hexdigest, taking the value to be hashed as its parameter and returning the hashed value.
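Since only the #hexdigest interface matters, a small custom object can be used instead of a Digest class. The module below is a hypothetical example that shortens a SHA-256 digest; the name ShortSha256 and the 16-character length are assumptions, and the assignment to ArcLight is shown only as a comment so the sketch runs on its own.

```ruby
require "digest/sha2"

# Hypothetical custom algorithm: any object with #hexdigest(value) qualifies.
# This one shortens a SHA-256 digest to 16 hex characters.
module ShortSha256
  def self.hexdigest(value)
    Digest::SHA256.hexdigest(value)[0, 16]
  end
end

puts ShortSha256.hexdigest("/ead/archdesc/dsc/c[1]")

# In an ArcLight application this could be set in an initializer, e.g.:
#   Arclight::HashAbsoluteXpath.hash_algorithm = ShortSha256
```

Shorter digests give shorter URLs but raise the chance of collisions, so the length is a trade-off to weigh against the size of your collections.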
An entirely different strategy can also be used by updating Arclight::MissingIdStrategy.selected:
Arclight::MissingIdStrategy.selected = MyMissingIdStrategy
The class used as a strategy can take the XML node as a parameter to its initializer and must return the minted ID (minus the collection ID, which will be added automatically) in response to the #to_hexdigest method.
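A strategy class matching that description might look like the sketch below. It hashes the component's title instead of its position. FakeNode and its #title accessor are stand-ins invented for this example; a real strategy would read from the XML node that ArcLight passes in, and the choice of title as the input is an assumption, not a recommendation.

```ruby
require "digest/sha1"

# Stand-in for the XML node ArcLight would pass to the strategy; the #title
# accessor is an assumption made for this sketch.
FakeNode = Struct.new(:title)

# Hypothetical strategy: mint the ID from the component's title text rather
# than from its position in the hierarchy.
class MyMissingIdStrategy
  def initialize(node)
    @node = node
  end

  # Returns the minted ID without the collection ID prefix.
  def to_hexdigest
    Digest::SHA1.hexdigest(@node.title.to_s.strip)
  end
end

puts MyMissingIdStrategy.new(FakeNode.new("Correspondence, 1902-1910")).to_hexdigest
```

Titles are not guaranteed to be unique within a collection, so a production strategy would need an input that is unique for every component.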
Going forward, Traject is the adopted approach for indexing content into ArcLight:
bundle exec traject -u http://127.0.0.1:8983/solr/blacklight-core -i xml -c lib/arclight/traject/ead2_config.rb spec/fixtures/ead/sample/large-components-list.xml