
Enrich date #1090

Merged: 14 commits merged into main on Sep 10, 2024
Conversation

@amywieliczka (Collaborator) commented Jul 23, 2024

Adds date parsing logic for:

  • enrich_date
  • enrich_earliest_date
  • add_facet_decade
  • make_sort_dates
  • unpack_display_date

All fetched vernacular data is run through a mapper. I did not do an analysis of what the different mappers do with dates. Once the record is mapped, enrich_date and enrich_earliest_date are optionally run, depending on their configuration in the enrichment chain for that particular collection.

enrich_earliest_date: retrieves the mapped record's date value, manipulates it, and then replaces the mapped record's date value with the result of that manipulation. Internally, convert_date converts a string or list of strings into a list of dictionaries with the keys begin, end, and displayDate. Strings are parsed using DPLA's zen library: https://github.com/dpla-attic/zen (used in DPLA's ingestion 1 and copied here with some minor python2-to-python3 updates to duplicate existing functionality). If the mapped record's date value is already a dictionary, or if it is a list containing any dictionary, convert_date does no date manipulation and simply returns the date value.
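
For reference, a minimal sketch of that convert_date behavior, assuming the shapes described above. The real implementation delegates string parsing to the bundled zen library; parse_date_string below is a hypothetical stand-in for that parser, not a reproduction of it.

```python
from typing import Union

def parse_date_string(date_str: str) -> dict:
    # Hypothetical stand-in for zen's date parser; just echoes the string back.
    return {"begin": None, "end": None, "displayDate": date_str}

def convert_date(date_value: Union[str, list, dict]) -> Union[list, dict]:
    # Already-structured values pass through untouched.
    if isinstance(date_value, dict):
        return date_value
    if isinstance(date_value, list) and any(isinstance(v, dict) for v in date_value):
        return date_value

    # Normalize to a list of strings, then parse each one into a
    # {begin, end, displayDate} dictionary.
    if isinstance(date_value, str):
        date_value = [date_value]
    return [parse_date_string(v) for v in date_value]
```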

Despite its name, I didn't see any indication that enrich_earliest_date has any kind of logic determining what dates are "earliest".

enrich_date: does the exact same thing as enrich_earliest_date, except it takes a field name as an argument, operates on the value of the specified field, and replaces that field's value in the mapped record. The default field name is temporal. In my analysis, enrich_date exists in 2,749 enrichment chains with no field name specified, and in 1,003 enrichment chains with the field name date specified. Instances of enrich_date&prop=sourceResource/date are functional duplicates of enrich_earliest_date, and since convert_date does not perform any manipulation on lists of dictionaries, this duplication has no effect on the record. There are 3 collections that have enrich_date&prop=sourceResource/date but do not have enrich_earliest_date in their enrichment chains.
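
A hedged sketch of how the two enrichments relate, assuming the convert_date helper sketched above; the record shape and function bodies are illustrative, only the default field name comes from the description.

```python
def enrich_date(record: dict, field: str = "temporal") -> dict:
    # Replace the named field's value with convert_date's output, if present.
    if field in record:
        record[field] = convert_date(record[field])
    return record

def enrich_earliest_date(record: dict) -> dict:
    # Despite the name, this is the same operation fixed to the date field;
    # no "earliest" selection happens here.
    return enrich_date(record, field="date")
```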

Unlike the two enrichments listed previously, the following 3 functions run on every record (except Calisphere Solr extracted records), regardless of enrichment configuration:

add_facet_decade is part of the solr_updater function that was migrated from our legacy harvester. It operates on the date value of the mapped and enriched record to write the facet_decade field in the solr_doc (the final output of the metadata_mapper module). It is difficult to say what this date value might look like at this point, since some collections never call enrich_earliest_date and also never call enrich_date&prop=sourceResource/date. In those cases, the value of the date field is whatever the mapper left it as. Some mappers do seem to produce lists of dictionaries with the keys begin, end, displayDate (providing their own special, mapper-specific date parsing logic), but it seems that some mappers also simply produce strings, or lists of strings. Nevertheless, add_facet_decade makes the date value into a list if it is not already, and then tries get_facet_decade for each value in that list, returning the first successful response from get_facet_decade. For its part, get_facet_decade does its own special parsing of the displayDate using Brian's decade facet logic.
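
A sketch of that add_facet_decade flow as described; get_facet_decade here is only a placeholder for Brian's decade facet logic, not a reproduction of it.

```python
import re

def get_facet_decade(display_date: str):
    # Placeholder: pull the first 4-digit year and bucket it into a decade.
    match = re.search(r"\b(\d{4})\b", display_date)
    if not match:
        return None
    year = int(match.group(1))
    return f"{year - (year % 10)}s"

def add_facet_decade(solr_doc: dict, date_value) -> dict:
    dates = date_value if isinstance(date_value, list) else [date_value]
    for value in dates:
        display = value.get("displayDate", "") if isinstance(value, dict) else str(value)
        decade = get_facet_decade(display)
        if decade:
            # First successful parse wins.
            solr_doc["facet_decade"] = decade
            break
    return solr_doc
```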

make_sort_dates is also part of the solr_updater function and operates on the date value of the mapped and enriched record to write to the sort_date_start and sort_date_end fields of the solr_doc. It depends on the date value containing at least one dictionary with the begin and end keys. If the date value contains multiple dictionaries with the begin and end keys, it sorts all the begin dates and returns the first one as the start date, and then sorts all the end dates and returns the first one as the end date.
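
A minimal sketch of make_sort_dates as described: it only considers dictionaries carrying begin and end keys, sorts each set, and keeps the first (earliest) of each.

```python
def make_sort_dates(solr_doc: dict, date_value: list) -> dict:
    begins = sorted(
        d["begin"] for d in date_value
        if isinstance(d, dict) and d.get("begin")
    )
    ends = sorted(
        d["end"] for d in date_value
        if isinstance(d, dict) and d.get("end")
    )
    if begins:
        solr_doc["sort_date_start"] = begins[0]
    if ends:
        solr_doc["sort_date_end"] = ends[0]
    return solr_doc
```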

unpack_display_date is also part of the solr_updater function and operates on the date value and the temporal value to write the date and temporal fields of the solr_doc. It turns the value into a list if it is not already one, then, for each element in the list, appends the displayDate value if the element is a dictionary, or the string itself if the element is a string, and finally returns this new list of display dates and strings.
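
A sketch of that flattening step, under the same assumptions about record shape as the earlier sketches.

```python
def unpack_display_date(date_value) -> list:
    # Flatten a mixed list of dicts and strings into a plain list of display strings.
    values = date_value if isinstance(date_value, list) else [date_value]
    display_dates = []
    for value in values:
        if isinstance(value, dict):
            display_dates.append(value.get("displayDate"))
        elif isinstance(value, str):
            display_dates.append(value)
    return display_dates
```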


There are also some updates to the validator. We shut down the CouchDB instance, so validations against couch no longer work. As such, I've commented out the is_shown_by and is_shown_at validation functions.


There is also a new test module, metadata_mapper/test/test_date_processing.py, along with 5 csvs defining inputs and expected outputs for the 5 functions mentioned above. These csvs were created using the 27 test collections specified here: https://docs.google.com/spreadsheets/d/1JNw7ynxCJS8d4f5W4Glwwqx0qMm8ss-aB_GpjVxN1dI/edit?gid=0#gid=0 (filter on Calisphere Solr Mapper Type = No) and their validated output. You can run these tests using pytest metadata_mapper.
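
The csv fixtures presumably pair an input value with an expected output; a hypothetical shape for such a csv-driven test is below. The csv filename, column names, and import path are invented for illustration and are not the ones in this PR.

```python
import csv
import pytest

from metadata_mapper import enrich_date  # assumed import path, for illustration only

def load_cases(csv_path):
    # Each row pairs a raw input value with the expected serialized output.
    with open(csv_path, newline="") as f:
        return [(row["input"], row["expected"]) for row in csv.DictReader(f)]

@pytest.mark.parametrize("raw, expected", load_cases("enrich_date_cases.csv"))
def test_enrich_date(raw, expected):
    record = enrich_date({"temporal": raw})
    assert str(record["temporal"]) == expected
```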


This PR requires a new pip install in your environment: I've added a new dependency to the project, timelib.

I'm also not 100% sure the conditional validation works (commit: 1b78253). I'm not sure which OS the validation_collection_task runs on, or whether UCLDC_SOLR_URL and UCLDC_COUCH_URL are available within that environment, but this is the general idea regarding conditional validation.
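
One possible reading of that conditional validation, with placeholder validator names and checks: only register the couch-backed validations when UCLDC_COUCH_URL is set in the environment.

```python
import os

def validate_is_shown_at(record: dict) -> bool:
    # Placeholder check; the real validators compared against couch data.
    return bool(record.get("isShownAt"))

def validate_is_shown_by(record: dict) -> bool:
    return bool(record.get("isShownBy"))

def get_validators():
    validators = []
    if os.environ.get("UCLDC_COUCH_URL"):
        # Couch-dependent validations only run when a couch URL is configured.
        validators.extend([validate_is_shown_at, validate_is_shown_by])
    return validators
```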

@bibliotechy (Contributor) previously approved these changes on Sep 3, 2024 and left a comment:

Looks good. One small comment inline.

One question I do have: should we be running the date parser tests as part of our CI suite?

> self.mapped_data is always a dictionary and should always be a dictionary, so directionally, I'd rather resolve this by figuring out why pylance thinks it could be otherwise and adding clarity there.

Do you need to add the type declaration when self.mapped_data is initialized above?

self.mapped_data: dict = {}

@bibliotechy (Contributor) commented:

Also, another parsing library to consider in the future is https://github.com/bear/parsedatetime

@amywieliczka amywieliczka merged commit 621f573 into main Sep 10, 2024
4 checks passed
Linked issue: Migrate enrich_date and enrich_earliest_date enrichments