Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update ingestion server removal IP to include plan for filtering tags #4456

Closed
AetherUnbound opened this issue Jun 6, 2024 · 0 comments · Fixed by #4524
Closed

Update ingestion server removal IP to include plan for filtering tags #4456

AetherUnbound opened this issue Jun 6, 2024 · 0 comments · Fixed by #4524
Assignees
Labels
📄 aspect: text Concerns the textual material in the repository ✨ goal: improvement Improvement to an existing user-facing feature 🟨 priority: medium Not blocking but should be addressed soon 🧭 project: implementation plan An implementation plan for a project 🧱 stack: catalog Related to the catalog and Airflow DAGs 🧱 stack: ingestion server Related to the ingestion/data refresh server

Comments

@AetherUnbound
Copy link
Collaborator

Description

Recent discussions in data related projects have uncovered a need to separate the tag filtering step from the other cleanup operations occurring during the data refresh:

@staticmethod
def cleanup_tags(tags):
"""
Delete denylisted and low-accuracy tags.
:return: an SQL fragment if an update is needed, ``None`` otherwise
"""
update_required = False
tag_output = []
if not tags:
return None
for tag in tags:
below_threshold = False
if "accuracy" in tag and float(tag["accuracy"]) < TAG_MIN_CONFIDENCE:
below_threshold = True
if "name" in tag and isinstance(tag["name"], str):
lower_tag = tag["name"].lower()
should_filter = _tag_denylisted(lower_tag) or below_threshold
else:
log.warning(f'Filtering malformed tag "{tag}" in "{tags}"')
should_filter = True
if should_filter:
update_required = True
else:
tag_output.append(tag)
if update_required:
fragment = Json(tag_output)
return fragment
else:
return None

Per discussion in #4417 (comment), we want to keep the tags that do not meet the criteria they're currently being filtered on (denylist and tag accuracy) rather than remove them from the catalog database. Since the intent of the data normalization project is to remove the cleanup steps from the ingestion server, it's imperative we move this filtering option into the new data refresh process. As such, we'll need to modify the implementation plan for that project to include where this filtering will take place.

It may be possible to add this as another step in the DAG, but this can be an intensive (and long-running) filtering operation. We also need to be clear at which step the filtering is happening: are we filtering the data within the temporary table prior to indexing and promoting it, or are we filtering the values as they go into Elasticsearch and leaving them in the API database? The former approach is the current one, but the latter provides more flexibility for rebuilding the index without having to copy the upstream table again. If we filter at the ES level, we could potentially leave the full list of tags in the API's response body for each record. This would mean we'd be returning the full list of tags (including denylisted and non-confident tags), but we would only be performing searches against the filtered tags.

Another consideration would need to be made for where in the new data refresh DAG this would run. It could be run as a set of mapped tasks directly in the DAG, or (since we already have the indexer workers available for this kind of chunked operation) we could add it as a second task that's handled by the indexer workers within the DAG. The modifications to the IP will need to make this distinction explicit.

This step is necessary to define going forward, since we may want to filter out entire tag providers (e.g. Clarifai) or adjust the confidence interval down the line, and we want to preserve this data in the catalog.

Additional context

See: #430, #431, and #3925

@AetherUnbound AetherUnbound added ✨ goal: improvement Improvement to an existing user-facing feature 📄 aspect: text Concerns the textual material in the repository 🟨 priority: medium Not blocking but should be addressed soon 🧭 project: implementation plan An implementation plan for a project 🧱 stack: catalog Related to the catalog and Airflow DAGs 🧱 stack: ingestion server Related to the ingestion/data refresh server labels Jun 6, 2024
@AetherUnbound AetherUnbound self-assigned this Jun 6, 2024
@openverse-bot openverse-bot moved this to 📋 Backlog in Openverse Backlog Jun 6, 2024
@AetherUnbound AetherUnbound moved this from 📋 Backlog to 📅 To Do in Openverse Backlog Jun 6, 2024
@openverse-bot openverse-bot moved this from 📅 To Do to 🏗 In Progress in Openverse Backlog Jun 19, 2024
@openverse-bot openverse-bot moved this from 🏗 In Progress to ✅ Done in Openverse Backlog Jun 21, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
📄 aspect: text Concerns the textual material in the repository ✨ goal: improvement Improvement to an existing user-facing feature 🟨 priority: medium Not blocking but should be addressed soon 🧭 project: implementation plan An implementation plan for a project 🧱 stack: catalog Related to the catalog and Airflow DAGs 🧱 stack: ingestion server Related to the ingestion/data refresh server
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

1 participant