
Enrich date #1090

Merged: 14 commits merged into main on Sep 10, 2024
Conversation

@amywieliczka (Collaborator) commented Jul 23, 2024

Adds date parsing logic for:

  • enrich_date
  • enrich_earliest_date
  • add_facet_decade
  • make_sort_dates
  • unpack_display_date

All fetched vernacular data is run through a mapper. I did not do an analysis of what the different mappers do with dates. Once the record is mapped, enrich_date and enrich_earliest_date are optionally run, depending on their configuration in the enrichment chain for that particular collection.

enrich_earliest_date: retrieves the mapped record's date value, manipulates it, and then replaces the mapped record's date value with the result of that manipulation. Internally, convert_date converts a string or list of strings into a list of dictionaries with the keys begin, end, and displayDate. Strings are parsed using DPLA's zen library: https://github.com/dpla-attic/zen (used in DPLA's ingestion 1 and copied here with some minor python2-to-python3 updates to duplicate existing functionality). If the mapped record's date value is already a dictionary, or if it is a list containing any dictionary, convert_date does no date manipulation and simply returns the date value.
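
For reference, a minimal sketch of that convert_date behavior, assuming the shapes described above. The real implementation delegates string parsing to the bundled zen library; parse_date_string below is a hypothetical stand-in for that parser, not a reproduction of it.

```python
from typing import Union

def parse_date_string(date_str: str) -> dict:
    # Hypothetical stand-in for zen's date parser; just echoes the string back.
    return {"begin": None, "end": None, "displayDate": date_str}

def convert_date(date_value: Union[str, list, dict]) -> Union[list, dict]:
    # Already-structured values pass through untouched.
    if isinstance(date_value, dict):
        return date_value
    if isinstance(date_value, list) and any(isinstance(v, dict) for v in date_value):
        return date_value

    # Normalize to a list of strings, then parse each one into a
    # {begin, end, displayDate} dictionary.
    if isinstance(date_value, str):
        date_value = [date_value]
    return [parse_date_string(v) for v in date_value]
```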

Despite its name, I didn't see any indication that enrich_earliest_date has any kind of logic determining what dates are "earliest".

enrich_date: does the exact same thing as enrich_earliest_date, except it takes a field name as an argument, operates on the value of the specified field, and replaces that field's value in the mapped record. The default field name is temporal. In my analysis, enrich_date exists in 2,749 enrichment chains with no field name specified, and in 1,003 enrichment chains with the field name date specified. Instances of enrich_date&prop=sourceResource/date are functional duplicates of enrich_earliest_date, and since convert_date does not perform any manipulation on lists of dictionaries, this duplication has no effect on the record. There are 3 collections that have enrich_date&prop=sourceResource/date but do not have enrich_earliest_date in their enrichment chains.
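
A hedged sketch of how the two enrichments relate, assuming the convert_date helper sketched above; the record shape and function bodies are illustrative, only the default field name comes from the description.

```python
def enrich_date(record: dict, field: str = "temporal") -> dict:
    # Replace the named field's value with convert_date's output, if present.
    if field in record:
        record[field] = convert_date(record[field])
    return record

def enrich_earliest_date(record: dict) -> dict:
    # Despite the name, this is the same operation fixed to the date field;
    # no "earliest" selection happens here.
    return enrich_date(record, field="date")
```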

Unlike the two enrichments listed previously, the following 3 functions run on every record (except Calisphere Solr extracted records), regardless of enrichment configuration:

add_facet_decade is part of the solr_updater function that was migrated from our legacy harvester. It operates on the date value of the mapped and enriched record to write the facet_decade field in the solr_doc (the final output of the metadata_mapper module). It is difficult to say what this date value might look like at this point, since some collections never call enrich_earliest_date and also never call enrich_date&prop=sourceResource/date. In those cases, the value of the date field is whatever the mapper left it as. Some mappers do seem to produce lists of dictionaries with the keys begin, end, displayDate (providing their own special, mapper-specific date parsing logic), but it seems that some mappers also simply produce strings, or lists of strings. Nevertheless, add_facet_decade makes the date value into a list if it is not already, and then tries get_facet_decade for each value in that list, returning the first successful response from get_facet_decade. For its part, get_facet_decade does its own special parsing of the displayDate using Brian's decade facet logic.
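
A sketch of that add_facet_decade flow as described; get_facet_decade here is only a placeholder for Brian's decade facet logic, not a reproduction of it.

```python
import re

def get_facet_decade(display_date: str):
    # Placeholder: pull the first 4-digit year and bucket it into a decade.
    match = re.search(r"\b(\d{4})\b", display_date)
    if not match:
        return None
    year = int(match.group(1))
    return f"{year - (year % 10)}s"

def add_facet_decade(solr_doc: dict, date_value) -> dict:
    dates = date_value if isinstance(date_value, list) else [date_value]
    for value in dates:
        display = value.get("displayDate", "") if isinstance(value, dict) else str(value)
        decade = get_facet_decade(display)
        if decade:
            # First successful parse wins.
            solr_doc["facet_decade"] = decade
            break
    return solr_doc
```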

make_sort_dates is also part of the solr_updater function and operates on the date value of the mapped and enriched record to write to the sort_date_start and sort_date_end fields of the solr_doc. It depends on the date value containing at least one dictionary with the begin and end keys. If the date value contains multiple dictionaries with the begin and end keys, it sorts all the begin dates and returns the first one as the start date, and then sorts all the end dates and returns the first one as the end date.
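
A minimal sketch of make_sort_dates as described: it only considers dictionaries carrying begin and end keys, sorts each set, and keeps the first (earliest) of each.

```python
def make_sort_dates(solr_doc: dict, date_value: list) -> dict:
    begins = sorted(
        d["begin"] for d in date_value
        if isinstance(d, dict) and d.get("begin")
    )
    ends = sorted(
        d["end"] for d in date_value
        if isinstance(d, dict) and d.get("end")
    )
    if begins:
        solr_doc["sort_date_start"] = begins[0]
    if ends:
        solr_doc["sort_date_end"] = ends[0]
    return solr_doc
```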

unpack_display_date is also part of the solr_updater function and operates on the date value and the temporal value to write the date and temporal fields of the solr_doc. It turns the value into a list if it is not already one, then, for each element in the list, appends the displayDate value if the element is a dictionary, or the string itself if the element is a string, and finally returns this new list of display dates and strings.
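
A sketch of that flattening step, under the same assumptions about record shape as the earlier sketches.

```python
def unpack_display_date(date_value) -> list:
    # Flatten a mixed list of dicts and strings into a plain list of display strings.
    values = date_value if isinstance(date_value, list) else [date_value]
    display_dates = []
    for value in values:
        if isinstance(value, dict):
            display_dates.append(value.get("displayDate"))
        elif isinstance(value, str):
            display_dates.append(value)
    return display_dates
```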


There are also some updates to the validator. We shut down the CouchDB instance, so validations against couch no longer work. As such, I've commented out the is_shown_by and is_shown_at validation functions.


There is also a new test module, metadata_mapper/test/test_date_processing.py, along with 5 csvs defining inputs and expected outputs for the 5 functions mentioned above. These csvs were created using the 27 test collections specified here: https://docs.google.com/spreadsheets/d/1JNw7ynxCJS8d4f5W4Glwwqx0qMm8ss-aB_GpjVxN1dI/edit?gid=0#gid=0 (filter on Calisphere Solr Mapper Type = No) and their validated output. You can run these tests using pytest metadata_mapper.
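
The csv fixtures presumably pair an input value with an expected output; a hypothetical shape for such a csv-driven test is below. The csv filename, column names, and import path are invented for illustration and are not the ones in this PR.

```python
import csv
import pytest

from metadata_mapper import enrich_date  # assumed import path, for illustration only

def load_cases(csv_path):
    # Each row pairs a raw input value with the expected serialized output.
    with open(csv_path, newline="") as f:
        return [(row["input"], row["expected"]) for row in csv.DictReader(f)]

@pytest.mark.parametrize("raw, expected", load_cases("enrich_date_cases.csv"))
def test_enrich_date(raw, expected):
    record = enrich_date({"temporal": raw})
    assert str(record["temporal"]) == expected
```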


This PR requires a new pip install in your environment: I've added a new dependency to the project, timelib.

I'm also not 100% sure the conditional validation works (commit: 1b78253). I'm not sure which OS the validation_collection_task runs on, or whether UCLDC_SOLR_URL and UCLDC_COUCH_URL are available within that environment, but this is the general idea regarding conditional validation.
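
One possible reading of that conditional validation, with placeholder validator names and checks: only register the couch-backed validations when UCLDC_COUCH_URL is set in the environment.

```python
import os

def validate_is_shown_at(record: dict) -> bool:
    # Placeholder check; the real validators compared against couch data.
    return bool(record.get("isShownAt"))

def validate_is_shown_by(record: dict) -> bool:
    return bool(record.get("isShownBy"))

def get_validators():
    validators = []
    if os.environ.get("UCLDC_COUCH_URL"):
        # Couch-dependent validations only run when a couch URL is configured.
        validators.extend([validate_is_shown_at, validate_is_shown_by])
    return validators
```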

@bibliotechy (Contributor) previously approved these changes on Sep 3, 2024 and left a comment:

Looks good. One small comment inline.

One question I do have: should we be running the date parser tests as part of our CI suite?

> self.mapped_data is always a dictionary and should always be a dictionary, so directionally, I'd rather resolve this by figuring out why pylance thinks it could be otherwise and adding clarity there.

Do you need to add the type declaration when self.mapped_data is initialized above?

self.mapped_data: dict = {}

@bibliotechy (Contributor) commented:

Also, another parsing library to consider in the future is https://github.com/bear/parsedatetime

@amywieliczka amywieliczka merged commit 621f573 into main Sep 10, 2024
4 checks passed
Linked issue: Migrate enrich_date and enrich_earliest_date enrichments