Indexing EAD in ArcLight
Now that you have your ArcLight application up and running, we need to index data into it.
Currently, ArcLight's indexer expects the following:
- Valid and well-formed EAD 2002 according to its XSD schema. If we can't parse the finding aid, we can't index it. (Indexing DTD-compliant EAD 2002 might work, but we can't guarantee it.)
- All components have at least a <unittitle/> or <unitdate/>. Without either, we won't be able to display anything!
- Components should all have unique IDs. These IDs are used as "slugs" for the identifiers of the documents in ArcLight. Maintaining these identifiers allows an EAD to be updated and re-indexed while keeping the URL that each component resides at (retaining any user bookmarks, etc.). We will mint IDs for components that do not have them, but this is done using the location of the component within the hierarchy of the EAD. This means that if components are moved around, the metadata that resides at a given URL may change in unexpected ways. See "Customizing behavior of indexing components w/o IDs" below for more info.
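Before indexing, it can help to check a finding aid against the last two expectations. The following is a minimal sketch, not part of ArcLight: the helper names (component_problems, child_elements, child_named) are ours, it uses REXML (a bundled gem in Ruby 3.0 and later), and it assumes that <unittitle/> and <unitdate/> appear as direct children of each component's <did/>.

```ruby
require 'rexml/document'

# EAD 2002 component elements: <c> and <c01> through <c12>.
COMPONENT_NAME = /\Ac(0[1-9]|1[0-2])?\z/

# Direct child elements of an element (text nodes and comments are skipped).
def child_elements(element)
  element.children.select { |child| child.is_a?(REXML::Element) }
end

# First direct child element with the given local name, or nil.
def child_named(element, name)
  child_elements(element).find { |child| child.name == name }
end

# Returns a list of human-readable problems found in the EAD XML string.
def component_problems(xml)
  problems = []
  doc = REXML::Document.new(xml)
  doc.root.each_recursive do |el|
    next unless el.name.match?(COMPONENT_NAME)

    did = child_named(el, 'did')
    has_label = did && (child_named(did, 'unittitle') || child_named(did, 'unitdate'))
    id = el.attributes['id']

    problems << "#{id || '(no id)'}: no unittitle or unitdate" unless has_label
    problems << "#{el.name}: no id attribute" unless id
  end
  problems
end
```

In an application you might call component_problems(File.read('eads/alphaomegaalpha.xml')) and review the returned list before running the indexer.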
First we need to download or access our EADs. Let's create a directory where we can store these within our application.
mkdir eads
Now let's add some data there.
# This command will save one of our test datasets to the directory you just created
wget -P eads/ https://raw.githubusercontent.com/sul-dlss/arclight/master/spec/fixtures/ead/nlm/alphaomegaalpha.xml
Next we need to run our indexing task and tell the task which "Repository" the EAD file is linked to. By default, your ArcLight application should have a generated config/repositories.yml file. This file contains information about the repositories for your instance. For example, we want to link the EAD alphaomegaalpha.xml to the first repository in that file, nlm:
nlm:
  name: 'National Library of Medicine. History of Medicine Division'
  description: 'NLM’s History of Medicine Division collects, preserves, makes available, and interprets for diverse audiences one of the world’s richest collections of historical material related to human health and disease.'
  building: 'Building 38, Room 1E-21'
  address1: '8600 Rockville Pike'
  address2: ''
  city: 'Bethesda'
  state: 'MD'
  zip: '20894'
  country: 'USA'
  phone: ''
  contact_info: 'hmdref@nlm.nih.gov'
  thumbnail_url: "https://collections.nlm.nih.gov/pageturnerserver/ajaxp?theurl=http://localhost:8080/fedora/get/nlm:nlmuid-101421040-img/THUMB"
  google_request_url: 'https://docs.google.com/a/stanford.edu/forms/d/e/1FAIpQLSeOamhY_IcFw4sPnz0ddwWWkrPaHbM5wp7JVbOLOL_mIusEyw/viewform'
  google_request_mappings: "document_url=entry.1980510262&collection_name=entry.619150170&collection_creator=entry.14428541&eadid=entry.996397105&containers=entry.1125277048&title=entry.862815208"
We recommend that your config/repositories.yml contain only the repositories for which you have EADs to index.
ArcLight repositories can be configured to make items requestable through Google Forms. To enable this functionality, provide the following keys for your configured repository in config/repositories.yml under the request_types key:
- request_url: the URL of the user-facing version of your request form
- request_mappings: a string representing an encoded mapping between your custom form fields and ArcLight fields. The configurable ArcLight fields are:
  - collection_name
  - collection_creator
  - eadid
  - containers
To get the Google Form field identifiers, use the "pre-filled" form to get a crafted URL whose query string has a format similar to request_mappings. See Google Forms support for more information.
An example of a correctly configured form looks like this:
request_types:
  google_form:
    request_url: 'https://docs.google.com/a/stanford.edu/forms/d/e/1FAIpQLSeOamhY_IcFw4sPnz0ddwWWkrPaHbM5wp7JVbOLOL_mIusEyw/viewform'
    request_mappings: "document_url=entry.1980510262&collection_name=entry.619150170&collection_creator=entry.14428541&eadid=entry.996397105&containers=entry.1125277048&title=entry.862815208"
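To make the request_mappings format more concrete, here is a rough sketch of how such a string could be combined with a document's values to build a pre-filled form link. This is an illustration only and not ArcLight's own implementation; the prefilled_form_url helper, the form URL, and the entry numbers in the usage comment are hypothetical.

```ruby
require 'uri'

# request_mappings pairs look like "arclight_field=entry.NNN".
# For each pair, look up the ArcLight value and send it under the form's
# own field identifier. Fields with no value are sent as empty strings.
def prefilled_form_url(request_url, request_mappings, values)
  params = URI.decode_www_form(request_mappings).map do |arclight_field, form_field|
    [form_field, values.fetch(arclight_field.to_sym, '')]
  end
  "#{request_url}?#{URI.encode_www_form(params)}"
end

# Example call (all values are made up for illustration):
#   prefilled_form_url(
#     'https://docs.example.com/forms/d/e/FORM_ID/viewform',
#     'collection_name=entry.1&eadid=entry.2',
#     { collection_name: 'Alpha Omega Alpha Archives', eadid: 'aoa271' }
#   )
```

The left-hand side of each pair names the ArcLight field and the right-hand side names the Google Form field, which is why the pre-filled URL from Google is a convenient place to read the identifiers from.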
ArcLight repositories can also be configured to make items requestable through Aeon Web EAD requests. To enable this functionality, provide the following keys for your configured repository in config/repositories.yml under the request_types key:
- request_url: the URL of the Aeon instance that will handle the request
- request_mappings: a string representing an encoded query-parameter mapping between your request and ArcLight. It can contain a method name, which is used to obtain the EAD URL.
An example of a correctly configured form looks like this:
request_types:
  aeon_web_ead:
    request_url: 'https://sample.request.com'
    request_mappings: "Action=10&Form=31&Value=ead_url"
We can now use the arclight:index task in ArcLight to index our EAD.
FILE=./eads/alphaomegaalpha.xml REPOSITORY_ID=nlm bundle exec rake arclight:index
Loading ./eads/alphaomegaalpha.xml into index...
Indexed ./eads/alphaomegaalpha.xml (in 0.837 secs).
You can add new repositories to the config/repositories.yml file. The key that begins a repository is the same value you will use as the REPOSITORY_ID in the indexing rake task.
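If you are unsure which values are valid for REPOSITORY_ID, you can list the top-level keys of the file. This is a small sketch using Ruby's standard YAML library; the repository_ids helper and the second repository in the sample are made up for illustration.

```ruby
require 'yaml'

# Returns the top-level keys of a repositories.yml document. Each key is a
# value that could be passed as REPOSITORY_ID to the indexing task.
def repository_ids(yaml_text)
  config = YAML.safe_load(yaml_text) || {}
  config.keys
end

# Trimmed-down sample; a real file would carry the full set of fields.
sample = <<~YAML
  nlm:
    name: 'National Library of Medicine. History of Medicine Division'
  sul-spec:
    name: 'Example second repository'
YAML

puts repository_ids(sample).inspect
# prints ["nlm", "sul-spec"]
```

Against a real application you would pass File.read('config/repositories.yml') instead of the sample string.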
We recommend that you organize EADs by repository and put them all in a directory named after the repository's key. Then, run the arclight:index_dir rake task with the DIR and REPOSITORY_ID environment variables to index all of the files into the same repository:
# this assumes there's a directory with EAD files called /tmp/sul-spec, and a repository configured with the ID "sul-spec"
DIR=/tmp/sul-spec REPOSITORY_ID=sul-spec bundle exec rake arclight:index_dir
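With one directory per repository key, the index_dir commands can be generated rather than typed by hand. The sketch below assumes a layout such as eads/nlm/ and eads/sul-spec/, where each subdirectory name matches a key in config/repositories.yml; the index_dir_commands helper is ours and only builds the command strings, it does not run them.

```ruby
# Builds one arclight:index_dir command per subdirectory of base_dir,
# using the subdirectory name as the REPOSITORY_ID. Plain files in
# base_dir are ignored.
def index_dir_commands(base_dir)
  Dir.children(base_dir).sort.filter_map do |entry|
    path = File.join(base_dir, entry)
    next unless File.directory?(path)

    "DIR=#{path} REPOSITORY_ID=#{entry} bundle exec rake arclight:index_dir"
  end
end

# Example: print the commands for review before running them.
#   puts index_dir_commands('eads')
```

Printing the commands first keeps the step reviewable; they can then be pasted into a shell or run with Kernel#system.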
We use the config/downloads.yml file to configure how we provide download links to resources that can be generated from metadata indexed into the collection (e.g. PDF and EAD links). Accessors from the SolrDocument class can be interpolated using the Ruby string formatting syntax %{method_name} when using the template key (instead of the href key). This allows an ArcLight implementer to use existing accessors to interpolate values, or to create their own for any sort of custom URL generation they would like (note that non-URL values will be URL escaped).
There is a default configuration that you can use to configure behavior for all collections.
default:
  pdf:
    template: http://example.com/%{unitid}.pdf
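The template key relies on Ruby's named format references. As a rough sketch of the idea (not ArcLight's actual code), the hypothetical download_url helper below escapes each value and then interpolates it into the template with Kernel#format. It assumes every value is a plain, non-URL string.

```ruby
require 'cgi'

# Escapes each value for use in a URL, then fills the %{name} references
# in the template. Keys must be symbols matching the reference names.
def download_url(template, values)
  escaped = values.transform_values { |value| CGI.escape(value.to_s) }
  format(template, escaped)
end

puts download_url('http://example.com/%{unitid}.pdf', { unitid: 'MS C 271' })
# prints http://example.com/MS+C+271.pdf
```

A template that references a name missing from the values raises a KeyError, which is a useful early signal that an accessor is not defined.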
Collection-specific behavior can be configured using the <unitid>. For example, if you have a collection with the <unitid> of "MS C 271", you would provide links to the downloads and their sizes like so (note that this does not use interpolation, so a plain href key can be provided):
MS C 271:
  pdf:
    href: 'http://example.com/MS+C+271.pdf'
    size: '1.23MB'
  ead:
    href: 'http://example.com/MS+C+271.xml'
    size: 123456
If you need to remove links for a specific collection (or disable links by default and enable them for specific collections), you can set the disabled key to true. Note: the generated downloads.yml disables links by default.
MS C 271:
  disabled: true
The size of the download can be hardcoded as the size key (as above), or an accessor on the Solr document can be provided (as a string). For instance, if you have a #finding_aid_size method on your SolrDocument class that returns the size of a file, you can reference it and its value will be used as the size in the download link text (it is okay to not provide a size at all).
MS C 271:
  pdf:
    template: http://example.com/%{pdf_id}.pdf
    size: finding_aid_size
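What a #finding_aid_size accessor looks like is up to you. The sketch below is one possibility, written as a standalone stand-in class so it can run outside Rails; in a real application the method would be added to the existing SolrDocument model. The finding_aid_bytes field is an assumption of ours, not something ArcLight indexes by default.

```ruby
# Stand-in for the application's SolrDocument class.
class SolrDocument
  UNITS = %w[B KB MB GB].freeze

  def initialize(fields = {})
    @fields = fields
  end

  # Hypothetical accessor: reads a byte count from an assumed
  # 'finding_aid_bytes' field and formats it for the download link text.
  # Returns nil when the field is absent, so no size is shown.
  def finding_aid_size
    bytes = @fields['finding_aid_bytes']
    return nil if bytes.nil?

    size = bytes.to_f
    unit = UNITS.first
    UNITS.drop(1).each do |next_unit|
      break if size < 1024

      size /= 1024
      unit = next_unit
    end
    unit == 'B' ? "#{size.to_i} B" : format('%.2f %s', size, unit)
  end
end

puts SolrDocument.new('finding_aid_bytes' => 1_289_748).finding_aid_size
# prints 1.23 MB
```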
There are custom values that can be interpolated into the URL as well. Currently this includes repository_id, which is the key used in the repositories.yml configuration for that document's repository.
Since this uses string interpolation, the accessor can return the entire URL to be provided (in this case the URL will not be escaped, as it would be with other values).
MS C 271:
  pdf:
    template: '%{finding_aid_url}'
If you are using another Solr instance that is not at the default location on localhost, you can provide the SOLR_URL environment variable to index into that service:
SOLR_URL=http://solr.example.com/solr FILE=myead.xml REPOSITORY_ID=myid bundle exec rake arclight:index
Re-indexing an EAD with the ArcLight indexer overwrites the documents already indexed for it. If your content has changed, however, you may want to remove all of your Solr documents first and then re-index your current content.
bundle exec rake arclight:destroy_index_docs
bundle exec rake arclight:index ...
While we highly recommend indexing EAD that has consistent IDs for all components, we mint an ID for you if we encounter a component without one. This can be customized in a few ways.
By default, the indexer uses something similar to an XPath to the component (but including indexes, to make sure we always have a unique value for each component) and uses SHA1 to create a hexdigest of it. This is then added to the ID of the collection to generate the document ID (similar to other documents that have IDs).
It's possible to use another algorithm by updating Arclight::HashAbsoluteXpath.hash_algorithm:
Arclight::HashAbsoluteXpath.hash_algorithm = Digest::SHA256
This can be any object that responds to #hexdigest, taking the value to be hashed as its parameter and returning the hashed value.
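Because only a #hexdigest method is required, the algorithm does not have to be one of Ruby's Digest classes. As a hedged example, the ShortSha256 class below (a name we made up) returns the first 16 hex characters of a SHA256 digest to keep generated IDs shorter. Note that truncating a digest raises the chance of collisions.

```ruby
require 'digest'

# Custom hash algorithm object: responds to .hexdigest(value) and returns
# a shortened SHA256 hex string.
class ShortSha256
  LENGTH = 16

  def self.hexdigest(value)
    Digest::SHA256.hexdigest(value)[0, LENGTH]
  end
end

# In an ArcLight application this would be assigned in an initializer:
#   Arclight::HashAbsoluteXpath.hash_algorithm = ShortSha256
```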
An entirely different strategy can also be used by updating Arclight::MissingIdStrategy.selected:
Arclight::MissingIdStrategy.selected = MyMissingIdStrategy
The class used as a strategy receives the XML node as a parameter to its initializer and must return the minted ID (minus the collection ID, which is added automatically) in response to the #to_hexdigest method.
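A minimal sketch of such a strategy class follows. It hashes the serialized XML of the component itself instead of its position, which is our own illustrative choice: the ID then survives moving the component around, but changes whenever the component's content is edited. It assumes only that the node responds to #to_s with its XML.

```ruby
require 'digest'

# Example strategy: mint the ID from the component's own XML content.
class MyMissingIdStrategy
  # node is the XML node of the component that has no ID.
  def initialize(node)
    @node = node
  end

  # Returns the minted ID without the collection ID prefix.
  def to_hexdigest
    Digest::SHA1.hexdigest(@node.to_s)
  end
end

# In an ArcLight application this would be assigned in an initializer:
#   Arclight::MissingIdStrategy.selected = MyMissingIdStrategy
```

Two components with identical XML would receive the same ID under this approach, so it only suits finding aids where component content is distinct.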
Traject is the adopted way forward for indexing content into ArcLight. The indexing configuration can also be run directly with the traject command:
bundle exec traject -u http://127.0.0.1:8983/solr/blacklight-core -i xml -c lib/arclight/traject/ead2_config.rb spec/fixtures/ead/sample/large-components-list.xml