Update ingestion server removal IP to include plan for filtering tags #4456
Labels
📄 aspect: text
Concerns the textual material in the repository
✨ goal: improvement
Improvement to an existing user-facing feature
🟨 priority: medium
Not blocking but should be addressed soon
🧭 project: implementation plan
An implementation plan for a project
🧱 stack: catalog
Related to the catalog and Airflow DAGs
🧱 stack: ingestion server
Related to the ingestion/data refresh server
Description
Recent discussions in data-related projects have uncovered a need to separate the tag filtering step from the other cleanup operations occurring during the data refresh:
openverse/ingestion_server/ingestion_server/cleanup.py, lines 119 to 150 at 3747f9a
Per discussion in #4417 (comment), we want to keep the tags that do not meet the criteria they're currently being filtered on (denylist and tag accuracy) rather than remove them from the catalog database. Since the intent of the data normalization project is to remove the cleanup steps from the ingestion server, it's imperative that we move this filtering operation into the new data refresh process. As such, we'll need to modify the implementation plan for that project to specify where this filtering will take place.
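To make the intended behavior concrete, here is a minimal sketch of the filtering criteria, assuming an illustrative denylist and accuracy threshold; the names, denylist entries, and threshold value are placeholders, not the actual ingestion server configuration.

```python
# Hypothetical sketch of the tag filtering described above; denylist entries
# and the confidence threshold are illustrative only.
TAG_DENYLIST = {"no person", "cc0", "watermark"}  # assumed example entries
MIN_TAG_ACCURACY = 0.9  # assumed example threshold


def filter_tags_for_search(tags: list[dict] | None) -> list[dict]:
    """Return only the tags that should be searchable.

    The full, unfiltered list stays in the catalog; this subset is what
    would be indexed (or copied forward) during the data refresh.
    """
    if not tags:
        return []
    filtered = []
    for tag in tags:
        name = (tag.get("name") or "").lower().strip()
        accuracy = tag.get("accuracy")
        if not name or name in TAG_DENYLIST:
            continue
        if accuracy is not None and accuracy < MIN_TAG_ACCURACY:
            continue
        filtered.append(tag)
    return filtered
```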
It may be possible to add this as another step in the DAG, but this can be an intensive (and long-running) filtering operation. We also need to be clear about which step the filtering happens at: are we filtering the data within the temporary table prior to indexing and promoting it, or are we filtering the values as they go into Elasticsearch and leaving them in the API database? The former is the current approach, but the latter provides more flexibility for rebuilding the index without having to copy the upstream table again. If we filter at the ES level, we could potentially leave the full list of tags (including denylisted and low-confidence tags) in the API's response body for each record, while only performing searches against the filtered tags.
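As a rough illustration of the second option (filtering at the ES level), the sketch below builds an Elasticsearch document that carries only the searchable tags while the API database row keeps the full list. All field names, denylist entries, and the threshold are assumptions, not the actual mapping.

```python
# Illustrative only: filter tags while building the ES document, leaving the
# full tag list untouched in the API database.
TAG_DENYLIST = {"no person", "watermark"}  # assumed example entries
MIN_TAG_ACCURACY = 0.9  # assumed example threshold


def build_es_document(record: dict) -> dict:
    """Build the Elasticsearch document for a single record.

    The API database row keeps every tag; only the searchable subset is
    copied into the indexed document.
    """
    all_tags = record.get("tags") or []
    searchable = [
        tag["name"]
        for tag in all_tags
        if tag.get("name")
        and tag["name"].lower() not in TAG_DENYLIST
        and (tag.get("accuracy") is None or tag["accuracy"] >= MIN_TAG_ACCURACY)
    ]
    return {
        "identifier": record["identifier"],
        "title": record.get("title"),
        # Searches run against this filtered list even though the API
        # response can still expose the full `tags` column.
        "tags": searchable,
    }
```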
Another consideration is where in the new data refresh DAG this would run. It could be run as a set of mapped tasks directly in the DAG, or (since we already have the indexer workers available for this kind of chunked operation) we could add it as a second task that's handled by the indexer workers within the DAG. The modifications to the IP will need to make this distinction explicit.
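For the mapped-task option, something along these lines could work, sketched with Airflow's dynamic task mapping. The DAG name, batching helper, and batch size are hypothetical, and the same batches could instead be dispatched to the indexer workers.

```python
from datetime import datetime

from airflow.decorators import dag, task


@dag(start_date=datetime(2024, 1, 1), schedule=None, catchup=False)
def tag_filtering_sketch():
    @task
    def get_id_batches(batch_size: int = 100_000) -> list[list[int]]:
        # Placeholder: in practice this would read the min/max IDs from the
        # temporary table and split them into (start, end) ranges.
        return [[0, batch_size]]

    @task
    def filter_tags_in_batch(id_range: list[int]) -> None:
        # Placeholder: apply the denylist/accuracy filtering to rows whose
        # IDs fall in `id_range`, either via SQL here or by handing the
        # batch off to an indexer worker.
        start, end = id_range
        ...

    filter_tags_in_batch.expand(id_range=get_id_batches())


tag_filtering_sketch()
```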
Defining this step is necessary going forward, since we may want to filter out entire tag providers (e.g. Clarifai) or adjust the confidence threshold down the line, and we want to preserve this data in the catalog.
Additional context
See: #430, #431, and #3925