The Mapper Component
This documentation assumes a functional knowledge of Python and boto3, as well as a basic understanding of the AWS Lambda service.
Mappers are responsible for converting object metadata from the source metadata model to the UCLDC metadata model. A mapper takes as input a page of records described in a given institution's vernacular metadata model (as stored by the fetcher), parses individual records out of that response, maps those records to the UCLDC metadata model, and writes a JSON file storing the mapped records. NOTE: the code currently creates a JSON file for each page. Do we want to create a jsonl file instead?
We would love to make a domain specific language (DSL) at some point with a front-end interface for librarians to tweak their own mappings. In the meantime, the more clarity we can offer on a "data stack trace", the better we can understand why a particular piece of data ended up where it did, and the more quickly we can offer adjustments to the data mapping for contributors.
When a mapping goes awry, whether due to data errors or an error in the mapping component, we should have a clear understanding of what went wrong. If there is a data error, the error message should point us to the problematic vernacular API response stashed in S3, tell us which record in that file was a problem, and which field caused the issue (both the source field and the target field). Furthermore, the mapper component should greedily map as many records as possible despite any issues - a data problem in one record should not hold up the entire collection.
We understand mapping a collection to be an iterative process and ongoing dialogue with the source institution. The mapper run time should not get in the way of productive dialogue.
This architecture was drafted based on programmatic analysis of the existing Mapper classes and experience gathered from working with the existing system. No one person has experience touching every mapper in this system, so this architecture may not fully encompass everything the existing mappers do and may need to be reconfigured. We do believe it accounts for most of what the existing mappers do, but I've listed a few 'growth areas' below that might get explored as we build out the mappers:
In the systems infrastructure architecture, we currently use a pure AWS Lambda implementation. We may move towards a queue (like Amazon Simple Queue Service) + Lambda, or even a queue + AWS Fargate. I believe this shift will be relatively straightforward.
In designing the software architecture, I have tried to define objects in such a way as to make it obvious where a particular piece of logic should live. I've tried to name classes and methods so that they read as sentences, and to write semantically meaningful statements with those classes and methods.
Our mappers run on AWS Lambda. This serverless solution allows us to run mappers on demand with no provisioning or maintenance overhead. Our AWS Lambda is currently configured to use Amazon Linux 2, Kernel 4.14, x86_64 architecture, Python 3.9, in the US West 2 region.
Unlike the fetchers, mappers can be run in parallel - we don't have to wait on a resumption token from an external API. The `lambda_shepherd` is an AWS Lambda function that gets a list of all pages of vernacular API responses (typically a response containing a set of 100 records or fewer) for a given collection and starts a `lambda_function` for each page. The `lambda_function` maps a single page of records.
AWS Lambda functions require an entry point that takes a payload and a context. The entry point to the `lambda_shepherd` is `map_collection(payload, context)`. This entry point takes a JSON payload that generally looks like:

```json
{
    "collection_id": 3433,
    "mapper_type": "oai.cca_vault"
}
```
The entry point to the `lambda_function` is `map_page(payload, context)`. This entry point takes a JSON payload that generally looks like:

```json
{
    "collection_id": 3433,
    "mapper_type": "oai.cca_vault",
    "page_filename": "1"
}
```
A Rikolti mapper is a conceptual grouping of objects that achieve the objective outlined at the start of this document for a particular collection. The base mapper set is composed of three classes: a `Vernacular` class, a `Record` class, and a `UCLDCWriter` class.
A given mapper subclass set is typically composed of just two classes: a `Vernacular` class and a `Record` class.
The `lambda_function` takes as input the collection id, the mapper type, and a page filename, and enforces the following order of operations (see the sketch after this list):

1. Select an appropriate source `Vernacular` based on the `<mapper_type>`.
2. Get the vernacular API response for the page file stored at the `vernacular_metadata/<collection_id>/<page_filename>` path.
3. Parse records out of the API response.
4. Run each record's `.to_UCLDC()` method to map the records to the UCLDC data model.
5. Write the records to a page at `mapped_metadata/<collection_id>/<page_filename>`.
A Rikolti mapper is represented as a Python module. The module name should be all lowercase with underscores and include the suffix `_mapper`. The `xxx_mapper` module should include a `Vernacular` subclass that is camel case and includes the suffix `Vernacular`. The lambda payload `mapper_type` should specify `xxx`. The `import_vernacular_reader` function in `metadata_mapper/lambda_function.py` requires this naming convention (see the sketch after the table below). For example:
mapper_type | parent module | sub module | mapper module | class name
---|---|---|---|---
nuxeo.nuxeo | nuxeo | | nuxeo_mapper | NuxeoVernacular
oai.cca_vault | oai | | cca_vault_mapper | CcaVaultVernacular
oai.content_dm.califa | oai | content_dm | califa_mapper | CalifaVernacular
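One way `import_vernacular_reader` could exploit this convention is to derive the module path and class name directly from the `mapper_type` string. The name-mangling below is a guess based on the table above, not the actual implementation:

```python
import importlib


def import_vernacular_reader(mapper_type):
    """Resolve a mapper_type like 'oai.content_dm.califa' to its Vernacular
    subclass. A sketch of the naming convention, not the real implementation."""
    *parents, leaf = mapper_type.split(".")
    # e.g. 'oai.cca_vault' -> module path 'oai.cca_vault_mapper'
    module_path = ".".join(parents + [f"{leaf}_mapper"])
    module = importlib.import_module(module_path)
    # e.g. 'cca_vault' -> class name 'CcaVaultVernacular'
    class_name = "".join(part.title() for part in leaf.split("_")) + "Vernacular"
    return getattr(module, class_name)
```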
The `Record` class should similarly be camel case and include the suffix `Record`: e.g., `ContentDmRecord`.
Mapper modules are grouped in parent modules according to the existing mapper class hierarchy. This module hierarchy should never exceed 3 layers of depth. (A sub module is optional.)
The `Vernacular` class defines the methods `get_local_api_response()`, `get_s3_api_response()`, and `local_path()`. The Vernacular Reader generally knows how to read a vernacular API response - either from the local filesystem (if `MAPPER_DATA_SRC` is set to `'local'`) or from S3.
`Vernacular` subclasses must, at a minimum, specify a `parse()` method. The `parse()` method should take a vernacular API response and understand how to parse it into individual metadata records; it should return a list of `Record` objects.
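For example, a subclass reading a hypothetical JSON API response might implement `parse()` like this (the `ExampleVernacular`/`ExampleRecord` names and the response shape are invented for illustration):

```python
import json


class ExampleVernacular(Vernacular):
    def parse(self, api_response):
        """Split one page of API response into individual Records (sketch)."""
        page = json.loads(api_response)
        # assumed response shape: {"records": [{...}, {...}, ...]}
        return [
            ExampleRecord(self.collection_id, record)
            for record in page.get("records", [])
        ]
```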
The `Record` base class stores a single metadata record in the `self.source_metadata` member variable and defines helper and utility methods - methods which might at a future date constitute the basis for a DSL. Currently, that set is quite small and limited to just `collate_subfield`, but the intention is for it to grow.
`Record.__init__(col_id, record)` takes the collection id and a metadata record as it is parsed from the API response and stashes them in `self.collection_id` and `self.source_metadata` for later use.
`Record` subclasses must, at a minimum, specify a `to_UCLDC()` method. The `to_UCLDC()` method should understand how to map the source record to the UCLDC data model and should return a dictionary in the UCLDC data model.
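A minimal subclass sketch; the source and UCLDC field names are placeholders, not a real collection's mapping:

```python
class ExampleRecord(Record):
    def to_UCLDC(self):
        """Map this source record to a UCLDC-shaped dictionary (sketch)."""
        return {
            # simple direct mappings (placeholder field names)
            "title": self.source_metadata.get("title"),
            "description": self.source_metadata.get("description"),
            # a mapping built on a base-class helper
            "subject": self.collate_subfield("subjects", "name"),
        }
```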
TODO: for ease of migration path, should the Record subclass define a `Record.to_harvester1_fetched_state()` where the "mapping" code that currently exists in the harvester gets moved to? Or is this just too confusing?
The `UCLDCWriter` class defines the methods `write_local_mapped_metadata()`, `write_s3_mapped_metadata()`, and `local_path()`. The UCLDC Writer generally knows how to write a page of mapped metadata - either to the local filesystem (if `MAPPER_DATA_DEST` is set to `'local'`) or to S3.
You can run a mapper in a variety of ways:
- Run the mapper in a local environment from the command line with `MAPPER_LOCAL_RUN=True`, `MAPPER_DATA_SRC='local|s3'`, and `MAPPER_DATA_DEST='local|s3'` for one collection:

  ```sh
  python lambda_shepherd.py '{"collection_id": 26695, "mapper_type": "nuxeo.nuxeo"}'
  ```
- Run the mapper in a local environment from the command line with `MAPPER_LOCAL_RUN=True`, `MAPPER_DATA_SRC='local|s3'`, and `MAPPER_DATA_DEST='local|s3'` for multiple collections:
  - Add your test collections to a file in the sample_data folder
  - Modify `metadata_mapper/tests.py` to import that test data, then run `python tests.py`
- Run the mapper from your local environment on Lambda:
  - Get AWS permissions locally however you do that (profile, env variables, sts, etc.)
  - `aws lambda` TODO: add full command here
- Run the mapper from the AWS console:
  - Log in to the AWS console, navigate to Lambda, and select the metadata mapper function
  - Create new "test data" and add your Lambda input
  - Click "test"
1. Identify a few sample collections.
2. Identify parsing and mapping logic previously accomplished by the Fetcher.
3. Migrate the parsing work to the `Vernacular` class.
4. Find the existing mapper in the dpla-ingestion codebase.
5. Analyze the existing mapper - I do this via refactoring the existing mapper in place.
6. Migrate the fetcher's mapping work and the mapper's mapping work to the `to_UCLDC()` method.
7. Test against the sample collections before running against all collections.
Known Limitations: One issue I see with this process is that we could get the mapper just right for, say, collection 92, but then turn our attention to collection 184 and realize that adjustments are needed to get collection 184 just right. Those adjustments could impact the mapper for collection 92, but because the mapping process has already been run for collection 92, we won't notice until we go to re-harvest collection 92.
On the other hand, enforcing a rule that all changes to a mapper are run across all collections using a given mapping would involve running mappers over potentially hundreds of collections every single time an update is made, and then running tests against the existing data to ensure that the positive effects of a mapper change have taken while no new problems have been introduced anywhere else.
Like with the fetcher, it is easiest to develop against smaller collections. Try to find one with fewer than 100 objects, one with 200-300 objects, and one with a few thousand objects. You can use the same collections you built the fetcher against, assuming those collections all use the same mapper, but this is not always true: there are only 16 fetchers, while there are 55 "leaf" mappers (and even more mappers if you include shared base classes never used directly by a collection but used indirectly by many different subclasses).
To identify the collections which use a certain mapper, filter collections by mapper type at https://registry.cdlib.org. The filtering list in the sidebar will also show you all fetchers used to fetch collections of this mapper type.
Find the preliminary parsing and mapping functions identified in the Migrating a Fetcher process. Identify which of this logic is doing parsing work (typically work that happens at the page level to produce records) and which is doing mapping work (typically work that happens at the record level to produce a mapped record). The parsing work should return an array of records, generally represented as Python dictionaries.
The existing mapper code lives here: https://github.com/calisphere-legacy-harvester/dpla-ingestion/tree/ucldc/lib/mappers.
Mapper name to class relationship is established here: https://github.com/calisphere-legacy-harvester/dpla-ingestion/blob/ucldc/lib/create_mapper.py
I do this by merging the parent mapper down into the child mapper until the mapper we are targeting for migration inherits directly from `object`. I then refactor the mapper we're targeting for migration to be more declarative. Rather than `map_title`, `map_alternative_title`, `map_description`, etc. (many of these methods have identical implementations), factor the identical implementations out into methods like `direct_mapping(source_property_name, destination_property_name)` or `array_mapping(array_of_source_property_names, destination_property_name)`, etc.
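As a sketch of where that refactor ends up, the per-field methods collapse into shared helpers driven by a small table of field names (all placeholder names, not a real legacy mapper):

```python
class DeclarativeMapper:
    """Sketch of the declarative style: shared helpers instead of one
    near-identical map_* method per field."""

    def __init__(self, source_metadata):
        self.source_metadata = source_metadata

    def direct_mapping(self, source_property_name, destination_property_name):
        # one source field copied straight to one destination field
        return {destination_property_name:
                self.source_metadata.get(source_property_name)}

    def array_mapping(self, array_of_source_property_names,
                      destination_property_name):
        # several source fields gathered into one destination list
        values = [self.source_metadata.get(name)
                  for name in array_of_source_property_names]
        return {destination_property_name: [v for v in values if v]}

    def map(self):
        mapped = {}
        # placeholder field names; real tables come from the legacy mappers
        mapped.update(self.direct_mapping("title", "title"))
        mapped.update(self.direct_mapping("description", "description"))
        mapped.update(self.array_mapping(["creator", "contributor"], "creator"))
        return mapped
```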
6. Migrate the fetcher's mapping work and the mapper's mapping work into the `to_UCLDC()` method defined on the Record. With the analysis and refactoring work done, many field mappings are simple `{<dest_prop>: source_metadata.get('<source_prop>')}` and there are only a handful of complex mappings that are more challenging.
At this point, you should be able to start running the mapper against your sample collections.
Assumes you have access to our AWS account: `./deploy-version.sh`
See here: https://github.com/ucldc/rikolti/blob/main/metadata_mapper/notes-on-compiling-lxml.sh
Currently, testing consists of running the mapper with `MAPPER_LOCAL_RUN=True`, `MAPPER_DATA_SRC='local|s3'`, and `MAPPER_DATA_DEST='local|s3'` set in the environment against a set of sample collections and spot-checking the results saved locally against Calisphere. Sample collections can be found at https://registry.cdlib.org/api/v1/rikoltimapper/?format=json
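For example, a fully local test run against the sample collection used earlier in this document might look like:

```sh
MAPPER_LOCAL_RUN=True MAPPER_DATA_SRC=local MAPPER_DATA_DEST=local \
python lambda_shepherd.py '{"collection_id": 26695, "mapper_type": "nuxeo.nuxeo"}'
```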