Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Chapman Mapper Refinements #483

Merged
merged 9 commits into from
Dec 20, 2023
Merged

Chapman Mapper Refinements #483

merged 9 commits into from
Dec 20, 2023

Conversation

amywieliczka
Copy link
Collaborator

@lthurston I ran the Chapman mapper through the validator and encountered several validation errors. This branch (and PR) is a start at resolving some of the obvious ones:

  • Changed identifier mapping: identifier should be a list, not a string
  • Added a custom validator to remove errors on http vs. https in identifier and is_shown_at

@christinklez as you continue to evaluate the Chapman mapper, could you reference this PR in the issues you create? @lthurston could you work on this branch to refine the Chapman mapper?

I'm proposing that this PR becomes a place to centralize ongoing Chapman mapper conversation till such a time that we've approved the Chapman mapper, while individual issues might have a shorter life span and would spec out the details of specific issues - what do you both think?

@christinklez christinklez added this to the #4 CIC work milestone Aug 11, 2023
@amywieliczka amywieliczka marked this pull request as draft August 11, 2023 21:21
@christinklez
Copy link
Collaborator

See: #26387 (08-11); #26386 (08-11); #26286 (08-11); #26339 (08-11); #26290 (08-11)

  • is_shown_by: Missing data. Map is_shown_by. (all collections)
  • source: Missing data. Map source. (all collections)
  • title: Extra blank space within the title (not in the source record). Do not introduce blank (double) spaces within the title. (See #26387 rows 32 and 178)
  • description: Open and/or End quotes introduced (not in the source record). Do not introduce new characters. (See #26387 rows 105 and 131)
  • description: Slashes introduced in both legacy and rikolti data, but is not in the source data. Investigate this issue? (see #26387 row 105; #26386 row 394)
  • description: Japanese and Chinese characters are not getting fetched & mapped correctly. (see #26286 example in row 4, including vernacular and mapped metadata)
  • coverage / spatial: should display dates as distinct entries in a list. (see #26339 examples in rows 35 and 38.)
  • language: Missing data that's in the source (and vernacular) record. Map language. (See #26290 rows 4, 9, and 14)

@lthurston
Copy link
Contributor

  • is_shown_by: Missing data. Map is_shown_by. (all collections)

The issue is that the type is text in many of these, despite being images. Look at 26386 for example. Here's a snippet from one record:

<dc:type>text</dc:type>
<dc:format>image/jpeg</dc:format>

Maybe instead of looking at type to determine if it's an image, I should look at the format to see if contains the string image. Thoughts?

@lthurston
Copy link
Contributor

  • source: Missing data. Map source. (all collections)

Done.

  • title: Extra blank space within the title (not in the source record). Do not introduce blank (double) spaces within the title. (See #26387 rows 32 and 178)

The extra spaces are in the source record:

<dc:title>Vietnamese Boy Sits Next to M16</dc:title>

and

<dc:title>US General Accused of Murdering Civilians in Vietnam</dc:title>

No action taken on this one.

  • description: Open and/or End quotes introduced (not in the source record). Do not introduce new characters. (See #26387 rows 105 and 131)

Terminal apostrophe does appear in the source record, encoded as &#39.

<dc:description>Tapao Air Base, Thailand: Honor guard salutes as flag-draped coffin containing remains of unidentified American serviceman who died in North Vietnamese captivity is carried to waiting ambulance upon arrival here 3/6 following release by North Vietnam. Twe March 6, 1971 1971-03-06 Vietnam Vietnam War, 1961-1975; Prisoners of war--Vietnam Image BW photograph, 11 x 8.5&#39;</dc:description>

  • description: Slashes introduced in both legacy and rikolti data, but is not in the source data. Investigate this issue? (see #26387 row 105; #26386 row 394)

The slashes reported in the legacy and Rikolti description are escaped tab characters, which occur in the source data as \t.

  • description: Japanese and Chinese characters are not getting fetched & mapped correctly. (see #26286 example in row 4, including vernacular and mapped metadata)

Looks like an encoding issue. I'll save this one for last.

  • coverage / spatial: should display dates as distinct entries in a list. (see #26339 examples in rows 35 and 38.)

Done. Spatial is getting shreded, so you'll see the dates in a string joined by -- until the enrichments are adjusted.

  • language: Missing data that's in the source (and vernacular) record. Map language. (See #26290 rows 4, 9, and 14)

Done.

@lthurston
Copy link
Contributor

lthurston commented Oct 5, 2023

description: Japanese and Chinese characters are not getting fetched & mapped correctly. (see #26286 example in row 4, including vernacular and mapped metadata)

This required a change to the Fetcher class. I can move this change to another branch / PR if desired.

UPDATE: the fetcher is fixed, but the mapper is still writing files with escaped utf-8 characters. I'm looking into this.

UPDATE again: the mapper is now fixed, too, in the base mapper class. This is worth a good sanity check by @amywieliczka / @barbarahui. Again, I can shift these changes to another PR since this isn't specific to chapman, if you'd like.

@lthurston lthurston marked this pull request as ready for review October 5, 2023 21:36
@christinklez christinklez added the data validation Rikolti mapper data validation review needed label Oct 6, 2023
@christinklez christinklez linked an issue Oct 9, 2023 that may be closed by this pull request
14 tasks
@christinklez christinklez removed the data validation Rikolti mapper data validation review needed label Oct 9, 2023
@amywieliczka
Copy link
Collaborator Author

amywieliczka commented Oct 25, 2023

Hmm, same question here as in #514 re: additions to this child mapper that don't happen in the legacy mapper and maybe should happen in the parent OAI mapper? I recall seeing source and language in the Samvera mapper PR as well.

@amywieliczka
Copy link
Collaborator Author

I guess we could just play whack-a-mole now and then do some analysis of what we can pull into the base mapper class later, re-run the validation reports after factoring out commonly implemented functions, and diff them against the approved validation reports?

@christinklez
Copy link
Collaborator

Just did some spot checking, and thinking out loud here.

  • chapman (BePress): Had missing Language and Source mapping issues.

  • ucsc (Samvera): It did indeed also have missing Language and Source mapping issues.

  • up_oai_dc #26224 (BePress): at a quick glance, it is expecting Language: "English" and Source: "Spooner California Stereograph Collection" (collection name) but language and source are not mapping.

  • chico_oai_dc #26762 (ContentDM): at a quick glance, it is expecting Source: "sc10896.tif" (filename) but source is not mapping.

(I was going to look up another example, but am unable to generate reports right now. Curious if the examples above, which are all using the base OAI mapper, suggest enough variance that we should take the whack-a-mole approach for now?)

@amywieliczka
Copy link
Collaborator Author

Hmm, at least for language, it does look like this is getting mapped in the base class in the legacy harvester (https://github.com/calisphere-legacy-harvester/dpla-ingestion/blob/cfe3dcb06008c0c6cb9d8207fe28bfaa1a855e4f/lib/mappers/dublin_core_mapper.py#L97). I would expect to see language come up in the validation reports for all OAI feeds that have language data. So I think there's actually a good argument for not playing whack-a-mole and instead addressing language at the root level by adding the mapping to the OAI mapper.

I don't actually see a mapping for source happening anywhere in the legacy mapper, so I suspect that's coming from an enrichment chain? I think we should probably figure that one out sooner rather than later as well.

@amywieliczka
Copy link
Collaborator Author

amywieliczka commented Oct 25, 2023

Ahh, found the "source" mapping here: https://github.com/calisphere-legacy-harvester/dpla-ingestion/blob/cfe3dcb06008c0c6cb9d8207fe28bfaa1a855e4f/lib/mappers/mapper.py#L171

I don't think we have a "dataProvider" equivalent in Rikolti just yet - perhaps this is a case where whack-a-mole makes the most sense.

@christinklez christinklez added the data validation Rikolti mapper data validation review needed label Oct 30, 2023
@lthurston
Copy link
Contributor

Hi all, I'm a little late to the party, apologies. I was playing whack-a-mole, but the thought did occur to me that if several mappers ended up with the same, or similar mapping configuration that it would go in the parent class. What I'll do, if you'd like, is to go through the mappers that Christine references above (chapman, samvera, up, chico) and run validations on a few collections of each. Using validations to compare has been a much better way to verify the results of these mappers than how I was doing it previously, and by doing them all at the same time with an eye toward abstracting the language and source mappings to the parent will make that more likely.

Amy said:

I don't think we have a "dataProvider" equivalent in Rikolti just yet - perhaps this is a case where whack-a-mole makes the most sense.

Would you mind explaining why this is a case where we do want to whack-a-mole? Won't someone think of the poor moles?!

@barbarahui
Copy link
Collaborator

barbarahui commented Nov 13, 2023

So I just went down a very long rabbit hole trying to figure out where the heck source is getting mapped in the legacy code. The code that @amywieliczka identified above just seems to map source to dataProvider, with which nothing is ever done (at least as far as I can see).

It seems that by default, solr_updater.py is actually doing the mapping from CouchDB.originalRecord.source --> source.

See:

https://github.com/ucldc/harvester/blob/master/harvester/solr_updater.py#L123

and

https://github.com/ucldc/harvester/blob/master/harvester/solr_updater.py#L718-L740

If the original record (aka vernacular metadata) has the field source, then source gets mapped not anywhere in the mapper code, but rather in solr_updater.py. Not exactly what I expected...

Anyway, I think we should move this mapping into the Rikolti OAI mapper (and other base mappers as necessary)? Spare the moles!

@barbarahui
Copy link
Collaborator

@lthurston Confirming that map_source() should be implemented in the OAI mapper class. Also, for future mappers where we encounter this issue, please implement map_source() in the base mapper class for that type.

@lthurston lthurston force-pushed the chapman_mapper branch 4 times, most recently from fdde511 to d5793bd Compare November 14, 2023 23:37
@lthurston lthurston marked this pull request as draft November 20, 2023 15:35
@lthurston lthurston marked this pull request as ready for review December 2, 2023 15:29
@amywieliczka
Copy link
Collaborator Author

Commented out some of the type handling here and will address type more holistically. Otherwise, this all looks good, so I'm merging it in!

@amywieliczka amywieliczka merged commit e31a64a into main Dec 20, 2023
2 checks passed
@amywieliczka amywieliczka deleted the chapman_mapper branch February 28, 2024 02:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
data validation Rikolti mapper data validation review needed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Generate and review validation reports: chapman_oai_dc
4 participants