The Mapper Component
This documentation assumes a functional knowledge of Python and boto3, as well as a basic understanding of the AWS Lambda service.
Mappers are responsible for converting object metadata from the source metadata model to the UCLDC metadata model. A mapper takes as input a page of records described in a given institution's vernacular metadata model (as stored by the fetcher), parses individual records out of that response, maps those records to the UCLDC metadata model, and writes a JSON file storing the mapped records. NOTE: the code currently creates a JSON file for each page. Do we want to create a jsonl file instead?
We would love to make a domain specific language (DSL) at some point with a front-end interface for librarians to tweak their own mappings. In the meantime, the more clarity we can offer on a "data stack trace", the better we can understand why a particular piece of data ended up where it did, and the more quickly we can offer adjustments to the data mapping for contributors.
When a mapping goes awry, whether due to data errors or an error in the mapping component, we should have a clear understanding of what went wrong. If there is a data error, the error message should point us to the problematic vernacular API response stashed in S3, tell us which record in that file was a problem, and which field caused the issue (both the source field and the target field). Furthermore, the mapper component should greedily map as many records as possible despite any issues - a data problem in one record should not hold up the entire collection.
We understand mapping a collection to be an iterative process and ongoing dialogue with the source institution. The mapper run time should not get in the way of productive dialogue.
This architecture was drafted based on programmatic analysis of the existing Mapper classes and experience gathered from working with the existing system. No one person has experience touching every mapper in this system, so this architecture may not fully encompass everything the existing mappers do and may need to be reconfigured. We do believe it accounts for most of what the existing mappers do, but I've listed a few 'growth areas' below that might get explored as we build out the mappers:
In the systems infrastructure architecture, we currently use a pure AWS Lambda implementation. We may move towards a queue (like Amazon Simple Queue Service) + Lambda, or even a queue + AWS Fargate. I believe this shift will be relatively straightforward.
In designing the software architecture, I have tried to define objects in such a way as to make it obvious where a particular piece of logic should live. I've tried to name classes and methods so that they read as sentences, and to write semantically meaningful statements with those classes and methods.
Our mappers run on AWS Lambda. This serverless solution allows us to run mappers on demand with no provisioning or maintenance overhead. Our AWS Lambda is currently configured to use Amazon Linux 2, Kernel 4.14, x86_64 architecture, Python 3.9, in the US West 2 region.
Unlike the fetchers, mappers can be run in parallel - we don't have to wait on a resumption token from an external API. The `lambda_shepherd` is an AWS Lambda function that gets a list of all pages of vernacular API responses (typically a response containing a set of 100 records or fewer) for a given collection and starts a `lambda_function` for each page. The `lambda_function` maps a single page of records.
AWS Lambda functions require an entry point that takes a payload and a context. The entry point to the `lambda_shepherd` is `map_collection(payload, context)`. This entry point takes a JSON payload that generally looks like:

```json
{
    "collection_id": 3433,
    "mapper_type": "oai.cca_vault"
}
```
The entry point to the `lambda_function` is `map_page(payload, context)`. This entry point takes a JSON payload that generally looks like:

```json
{
    "collection_id": 3433,
    "mapper_type": "oai.cca_vault",
    "page_filename": "1"
}
```
A Rikolti mapper is a conceptual grouping of objects that achieve the objective outlined at the start of this document for a particular collection. The base mapper set is composed of three classes: a `Vernacular` class, a `Record` class, and a `UCLDCWriter` class.
A given mapper subclass set is typically composed of just two classes: a `Vernacular` class and a `Record` class.
The `lambda_function` takes as input the collection id, the mapper type, and a page filename, and enforces the following order of operations (see the sketch after this list):

1. Select an appropriate source `Vernacular` based on the `<mapper_type>`.
2. Get the vernacular API response for the page file stored at the `vernacular_metadata/<collection_id>/<page_filename>` path.
3. Parse records out of the API response.
4. Run each record's `.to_UCLDC()` method to map the records to the UCLDC data model.
5. Write the records to a page at `mapped_metadata/<collection_id>/<page_filename>`.
A Rikolti mapper is represented as a Python module. The module name should be all lowercase with underscores and include the suffix `_mapper`. The `xxx_mapper` module should include a `Vernacular` subclass that is camel case and includes the suffix `Vernacular`. The lambda payload `mapper_type` should specify `xxx`. The `import_vernacular_reader` function in `metadata_mapper/lambda_function.py` requires this naming convention (see the sketch after the table below). For example:
mapper_type | parent module | sub module | mapper module | class name
---|---|---|---|---
nuxeo.nuxeo | nuxeo | | nuxeo_mapper | NuxeoVernacular
oai.cca_vault | oai | | cca_vault_mapper | CcaVaultVernacular
oai.content_dm.califa | oai | content_dm | califa_mapper | CalifaVernacular
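One way `import_vernacular_reader` could exploit this convention is to derive the module path and class name directly from the `mapper_type` string. The name-mangling below is a guess based on the table above, not the actual implementation:

```python
import importlib


def import_vernacular_reader(mapper_type):
    """Resolve a mapper_type like 'oai.content_dm.califa' to its Vernacular
    subclass. A sketch of the naming convention, not the real implementation."""
    *parents, leaf = mapper_type.split(".")
    # e.g. 'oai.cca_vault' -> module path 'oai.cca_vault_mapper'
    module_path = ".".join(parents + [f"{leaf}_mapper"])
    module = importlib.import_module(module_path)
    # e.g. 'cca_vault' -> class name 'CcaVaultVernacular'
    class_name = "".join(part.title() for part in leaf.split("_")) + "Vernacular"
    return getattr(module, class_name)
```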
The `Record` class should similarly be camel case and include the suffix `Record`: e.g., `ContentDmRecord`.
Mapper modules are grouped in parent modules according to the existing mapper class hierarchy. This module hierarchy should never exceed 3 layers of depth. (A sub module is optional.)
The `Vernacular` class defines the methods `get_local_api_response()`, `get_s3_api_response()`, and `local_path()`. The Vernacular Reader generally knows how to read a vernacular API response - either from the local filesystem (if `MAPPER_DATA_SRC` is set to `'local'`) or from S3.
`Vernacular` subclasses must, at a minimum, specify a `parse()` method. The `parse()` method should take a vernacular API response and understand how to parse it into individual metadata records; it should return a list of `Record` objects.
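For example, a subclass reading a hypothetical JSON API response might implement `parse()` like this (the `ExampleVernacular`/`ExampleRecord` names and the response shape are invented for illustration):

```python
import json


class ExampleVernacular(Vernacular):
    def parse(self, api_response):
        """Split one page of API response into individual Records (sketch)."""
        page = json.loads(api_response)
        # assumed response shape: {"records": [{...}, {...}, ...]}
        return [
            ExampleRecord(self.collection_id, record)
            for record in page.get("records", [])
        ]
```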
The `Record` base class stores a single metadata record in the `self.source_metadata` member variable and defines helper and utility methods - methods which might at a future date constitute the basis for a DSL. Currently, that set is quite small and limited to just `collate_subfield`, but the intention is for it to grow.
`Record.__init__(col_id, record)` takes the collection id and a metadata record as it is parsed from the API response and stashes them in `self.collection_id` and `self.source_metadata` for later use.
`Record` subclasses must, at a minimum, specify a `to_UCLDC()` method. The `to_UCLDC()` method should understand how to map the source record to the UCLDC data model and should return a dictionary in the UCLDC data model.
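A minimal subclass sketch; the source and UCLDC field names are placeholders, not a real collection's mapping:

```python
class ExampleRecord(Record):
    def to_UCLDC(self):
        """Map this source record to a UCLDC-shaped dictionary (sketch)."""
        return {
            # simple direct mappings (placeholder field names)
            "title": self.source_metadata.get("title"),
            "description": self.source_metadata.get("description"),
            # a mapping built on a base-class helper
            "subject": self.collate_subfield("subjects", "name"),
        }
```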
TODO: for ease of migration path, should the Record subclass define a `Record.to_harvester1_fetched_state()` where the "mapping" code that currently exists in the harvester gets moved to? Or is this just too confusing?
The `UCLDCWriter` class defines the methods `write_local_mapped_metadata()`, `write_s3_mapped_metadata()`, and `local_path()`. The UCLDC Writer generally knows how to write a page of mapped metadata - either to the local filesystem (if `MAPPER_DATA_DEST` is set to `'local'`) or to S3.
You can run a mapper in a variety of ways:
- Run the mapper in a local environment from the command line with `MAPPER_LOCAL_RUN=True`, `MAPPER_DATA_SRC='local|s3'`, and `MAPPER_DATA_DEST='local|s3'` for one collection:

  ```sh
  python lambda_shepherd.py '{"collection_id": 26695, "mapper_type": "nuxeo.nuxeo"}'
  ```
- Run the mapper in a local environment from the command line with `MAPPER_LOCAL_RUN=True`, `MAPPER_DATA_SRC='local|s3'`, and `MAPPER_DATA_DEST='local|s3'` for multiple collections:
  - Add your test collections to a file in the sample_data folder
  - Modify `metadata_mapper/tests.py` to import that test data, then run `python tests.py`
- Run the mapper from your local environment on Lambda:
  - Get AWS permissions locally however you do that (profile, env variables, sts, etc.)
  - `aws lambda` TODO: add full command here
- Run the mapper from the AWS console:
  - Log in to the AWS console, navigate to Lambda, and select the metadata mapper function
  - Create new "test data" and add your Lambda input
  - Click "test"
1. Identify a few sample collections.
2. Identify parsing and mapping logic previously accomplished by the Fetcher.
3. Migrate the parsing work to the `Vernacular` class.
4. Find the existing mapper in the dpla-ingestion codebase.
5. Analyze the existing mapper - I do this via refactoring the existing mapper in place.
6. Migrate the fetcher's mapping work and the mapper's mapping work to the `to_UCLDC()` method.
7. Test against the sample collections before running against all collections.
Known Limitations: One issue I see with this process is that we could get the mapper just right for, say, collection 92, but then turn our attention to collection 184 and realize that adjustments are needed to get collection 184 just right. Those adjustments could impact the mapper for collection 92, but because the mapping process has already been run for collection 92, we won't notice until we go to re-harvest collection 92.
On the other hand, enforcing a rule that all changes to a mapper are run across all collections using a given mapping would involve running mappers over potentially hundreds of collections every single time an update is made, and then running tests against the existing data to ensure that the positive effects of a mapper change have taken while no new problems have been introduced anywhere else.
Like with the fetcher, it is easiest to develop against smaller collections. Try to find one with fewer than 100 objects, one with 200-300 objects, and one with a few thousand objects. You can use the same collections you built the fetcher against, assuming those collections all use the same mapper, but this is not always true: there are only 16 fetchers, while there are 55 "leaf" mappers (and even more mappers if you include shared base classes never used directly by a collection but used indirectly by many different subclasses).
To identify the collections which use a certain mapper, filter collections by mapper type at https://registry.cdlib.org. The filtering list in the sidebar will also show you all fetchers used to fetch collections of this mapper type.
Find the preliminary parsing and mapping functions identified in the Migrating a Fetcher process. Identify which of this logic is doing parsing work (typically work that happens at the page level to produce records) and which is doing mapping work (typically work that happens at the record level to produce a mapped record). The parsing work should return an array of records, generally represented as Python dictionaries.
The existing mapper code lives here: https://github.com/calisphere-legacy-harvester/dpla-ingestion/tree/ucldc/lib/mappers.
Mapper name to class relationship is established here: https://github.com/calisphere-legacy-harvester/dpla-ingestion/blob/ucldc/lib/create_mapper.py
I do this by merging the parent mapper down into the child mapper until the mapper we are targeting for migration inherits directly from `object`. I then refactor the mapper we're targeting for migration to be more declarative. Rather than `map_title`, `map_alternative_title`, `map_description`, etc. (many of these methods have identical implementations), factor the identical implementations out into methods like `direct_mapping(source_property_name, destination_property_name)` or `array_mapping(array_of_source_property_names, destination_property_name)`, etc.
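As a sketch of where that refactor ends up, the per-field methods collapse into shared helpers driven by a small table of field names (all placeholder names, not a real legacy mapper):

```python
class DeclarativeMapper:
    """Sketch of the declarative style: shared helpers instead of one
    near-identical map_* method per field."""

    def __init__(self, source_metadata):
        self.source_metadata = source_metadata

    def direct_mapping(self, source_property_name, destination_property_name):
        # one source field copied straight to one destination field
        return {destination_property_name:
                self.source_metadata.get(source_property_name)}

    def array_mapping(self, array_of_source_property_names,
                      destination_property_name):
        # several source fields gathered into one destination list
        values = [self.source_metadata.get(name)
                  for name in array_of_source_property_names]
        return {destination_property_name: [v for v in values if v]}

    def map(self):
        mapped = {}
        # placeholder field names; real tables come from the legacy mappers
        mapped.update(self.direct_mapping("title", "title"))
        mapped.update(self.direct_mapping("description", "description"))
        mapped.update(self.array_mapping(["creator", "contributor"], "creator"))
        return mapped
```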
6. Migrate the fetcher's mapping work and the mapper's mapping work into the `to_UCLDC()` method defined on the Record. With the analysis and refactoring work done, many field mappings are simple `{<dest_prop>: source_metadata.get('<source_prop>')}` and there are only a handful of complex mappings that are more challenging.
At this point, you should be able to start running the mapper against your sample collections.
Assumes you have access to our AWS account: `./deploy-version.sh`
See here: https://github.com/ucldc/rikolti/blob/main/metadata_mapper/notes-on-compiling-lxml.sh
Currently, testing consists of running the mapper with `MAPPER_LOCAL_RUN=True`, `MAPPER_DATA_SRC='local|s3'`, and `MAPPER_DATA_DEST='local|s3'` set in the environment against a set of sample collections and spot-checking the results saved locally against Calisphere. Sample collections can be found at https://registry.cdlib.org/api/v1/rikoltimapper/?format=json
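For example, a fully local test run against the sample collection used earlier in this document might look like:

```sh
MAPPER_LOCAL_RUN=True MAPPER_DATA_SRC=local MAPPER_DATA_DEST=local \
python lambda_shepherd.py '{"collection_id": 26695, "mapper_type": "nuxeo.nuxeo"}'
```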