Change tag upsert strategy to drop old provider tags #4732
Labels
🗄️ aspect: data
Concerns the data in our catalog and/or databases
✨ goal: improvement
Improvement to an existing user-facing feature
🟧 priority: high
Stalls work on the project or its dependents
🧱 stack: catalog
Related to the catalog and Airflow DAGs
Problem
As part of the discussion in #4452, we've decided that our historical strategy of merging old and new provider tags when reingesting a work is problematic. To quote @stacimc in that discussion:
To expand on what the ethical and privacy concerns are: Openverse has an ethical responsibility to represent works as they are represented by the upstream provider, and to make clear when we are intentionally augmenting the description of a work. To that end, if the set of tags change in a provider, removed or modified tags must be reflected in Openverse's dataset. For example, if a museum decides on a new way of describing a work, and previous tags were culturally insensitive, it is important for Openverse not to reproduce that insensitivity, especially as we are representing those tags as specifically from the provider. Similarly, considering a privacy perspective, if tags at the upstream source originally include privacy invading information (e.g., the name of a pictured individual, sensitive location information, etc), and the upstream source removes them, it's critical that Openverse also no longer retains those tags after reingestion.
Keep in mind that Openverse does not currently reingest most of its data (namely for dated provider DAGs like Flickr, Wikimedia), so for the vast majority of works, we will still have the problem of potentially retaining stale/incorrect tag information from the provider. Future work may be planned to selectively reingest works on a periodic basis (e.g., works returned in search queries, works for which the metadata may actually be seen by individuals).
Furthermore, as Staci pointed out in the priorities meeting yesterday, every other piece of metadata follows a replace approach rather than a merge approach. Provider tags have been unique in this way, and there's no reason to maintain this, and ample reason to change it.
Description
Change the jsonb_array column strategy to drop all existing provider tags. Non-provider tags (e.g., machine generated tags) must be retained. Existing provider tags should be dropped entirely in favour of incoming provider tags.
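To make the intended behaviour concrete, here is a minimal pure-Python sketch of the semantics described above. It is illustrative only and not catalog code: the function name `replace_provider_tags` is hypothetical, the tag values are made up, and it assumes that each tag is a dict with a `provider` key and that a "provider tag" is one whose `provider` equals the work's own provider.

```python
def replace_provider_tags(old_tags, new_tags, provider):
    """Return the tags a work should have after reingestion.

    Assumption: a tag counts as a provider tag when its "provider" value
    equals the work's provider. Everything else (for example machine
    generated tags) is treated as non-provider and retained.
    """
    # Keep only the existing tags that did not come from the provider.
    kept = [tag for tag in (old_tags or []) if tag.get("provider") != provider]

    # Append every incoming tag, skipping exact duplicates.
    merged = []
    for tag in kept + list(new_tags or []):
        if tag not in merged:
            merged.append(tag)
    return merged


old = [
    {"name": "cat", "provider": "flickr"},
    {"name": "feline", "provider": "rekognition", "accuracy": 0.97},
]
new = [{"name": "kitten", "provider": "flickr"}]

# The stale "cat" provider tag is dropped, the machine generated tag is kept,
# and the incoming "kitten" tag is added.
result = replace_provider_tags(old, new, provider="flickr")
```

Note that an empty incoming list still removes every old provider tag, which is the case that matters when an upstream source deletes its tags.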
The current strategy is defined in openverse/catalog/dags/common/storage/columns.py (lines 70 to 78 in de0079d).
The current strategy is written in a generic form, with column parenthesised, but I think we need to replace it with a tags-specific merge strategy, because the `provider` field on the tags is tags-specific. Other jsonb array columns should use their own approach.
Something like the following might be a good starting point:
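A minimal sketch of what a tags-specific strategy could look like, written as a Python helper that renders the SQL `SET` clause. This has not been run against the catalog database. The helper name `tags_upsert_strategy` is hypothetical, and the sketch assumes that the upsert query aliases the existing row as `old`, that each tag is a JSON object with a `provider` key, and that a provider tag is one whose `provider` equals the work's `provider` column. If sub-providers or machine generated tag sources are modelled differently, the `WHERE` filter would need to change.

```python
from textwrap import dedent


def tags_upsert_strategy(column: str = "tags") -> str:
    """Render a SET clause that replaces provider tags and keeps all others.

    Hypothetical helper: assumes the existing row is aliased as `old` in the
    surrounding INSERT ... ON CONFLICT DO UPDATE statement.
    """
    return dedent(
        f"""
        {column} = (
            SELECT jsonb_agg(tag)
            FROM (
                -- Existing tags that did not come from the work's provider.
                SELECT tag
                FROM jsonb_array_elements(
                    COALESCE(old.{column}, '[]'::jsonb)
                ) AS kept(tag)
                WHERE tag->>'provider' IS DISTINCT FROM old.provider
                UNION
                -- In ON CONFLICT DO UPDATE, EXCLUDED is the row proposed
                -- for insertion, so this is the incoming set of tags.
                SELECT tag
                FROM jsonb_array_elements(
                    COALESCE(EXCLUDED.{column}, '[]'::jsonb)
                ) AS incoming(tag)
            ) AS merged
        )
        """
    ).strip()


sql = tags_upsert_strategy()
```

`UNION` removes exact duplicates between the two sets. When neither side contributes a tag, `jsonb_agg` yields `NULL`, so the column would end up `NULL` rather than an empty array; wrapping the subquery in another `COALESCE` would change that if an empty array is preferred.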
Though I'm still confused about whether `EXCLUDED` is definitely the new tags or something else (the name is confounding).
Additional context
Blocks #4452.