Update ingestion server removal IP to include plan for filtering tags #4456
Labels
📄 aspect: text
Concerns the textual material in the repository
✨ goal: improvement
Improvement to an existing user-facing feature
🟨 priority: medium
Not blocking but should be addressed soon
🧭 project: implementation plan
An implementation plan for a project
🧱 stack: catalog
Related to the catalog and Airflow DAGs
🧱 stack: ingestion server
Related to the ingestion/data refresh server
Description
Recent discussions in data-related projects have uncovered a need to separate the tag filtering step from the other cleanup operations occurring during the data refresh:
openverse/ingestion_server/ingestion_server/cleanup.py, lines 119 to 150 at 3747f9a
Per discussion in #4417 (comment), we want to keep the tags that do not meet the criteria they're currently being filtered on (denylist and tag accuracy) rather than remove them from the catalog database. Since the intent of the data normalization project is to remove the cleanup steps from the ingestion server, it's imperative that we move this filtering operation into the new data refresh process. As such, we'll need to modify the implementation plan for that project to specify where this filtering will take place.
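To make the intended behavior concrete, here is a minimal sketch of the filtering criteria, assuming an illustrative denylist and accuracy threshold; the names, denylist entries, and threshold value are placeholders, not the actual ingestion server configuration.

```python
# Hypothetical sketch of the tag filtering described above; denylist entries
# and the confidence threshold are illustrative only.
TAG_DENYLIST = {"no person", "cc0", "watermark"}  # assumed example entries
MIN_TAG_ACCURACY = 0.9  # assumed example threshold


def filter_tags_for_search(tags: list[dict] | None) -> list[dict]:
    """Return only the tags that should be searchable.

    The full, unfiltered list stays in the catalog; this subset is what
    would be indexed (or copied forward) during the data refresh.
    """
    if not tags:
        return []
    filtered = []
    for tag in tags:
        name = (tag.get("name") or "").lower().strip()
        accuracy = tag.get("accuracy")
        if not name or name in TAG_DENYLIST:
            continue
        if accuracy is not None and accuracy < MIN_TAG_ACCURACY:
            continue
        filtered.append(tag)
    return filtered
```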
It may be possible to add this as another step in the DAG, but this can be an intensive (and long-running) filtering operation. We also need to be clear about which step the filtering happens at: are we filtering the data within the temporary table prior to indexing and promoting it, or are we filtering the values as they go into Elasticsearch and leaving them in the API database? The former is the current approach, but the latter provides more flexibility for rebuilding the index without having to copy the upstream table again. If we filter at the ES level, we could potentially leave the full list of tags (including denylisted and low-confidence tags) in the API's response body for each record, while only performing searches against the filtered tags.
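As a rough illustration of the second option (filtering at the ES level), the sketch below builds an Elasticsearch document that carries only the searchable tags while the API database row keeps the full list. All field names, denylist entries, and the threshold are assumptions, not the actual mapping.

```python
# Illustrative only: filter tags while building the ES document, leaving the
# full tag list untouched in the API database.
TAG_DENYLIST = {"no person", "watermark"}  # assumed example entries
MIN_TAG_ACCURACY = 0.9  # assumed example threshold


def build_es_document(record: dict) -> dict:
    """Build the Elasticsearch document for a single record.

    The API database row keeps every tag; only the searchable subset is
    copied into the indexed document.
    """
    all_tags = record.get("tags") or []
    searchable = [
        tag["name"]
        for tag in all_tags
        if tag.get("name")
        and tag["name"].lower() not in TAG_DENYLIST
        and (tag.get("accuracy") is None or tag["accuracy"] >= MIN_TAG_ACCURACY)
    ]
    return {
        "identifier": record["identifier"],
        "title": record.get("title"),
        # Searches run against this filtered list even though the API
        # response can still expose the full `tags` column.
        "tags": searchable,
    }
```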
Another consideration is where in the new data refresh DAG this would run. It could be run as a set of mapped tasks directly in the DAG, or (since we already have the indexer workers available for this kind of chunked operation) we could add it as a second task that's handled by the indexer workers within the DAG. The modifications to the IP will need to make this distinction explicit.
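For the mapped-task option, something along these lines could work, sketched with Airflow's dynamic task mapping. The DAG name, batching helper, and batch size are hypothetical, and the same batches could instead be dispatched to the indexer workers.

```python
from datetime import datetime

from airflow.decorators import dag, task


@dag(start_date=datetime(2024, 1, 1), schedule=None, catchup=False)
def tag_filtering_sketch():
    @task
    def get_id_batches(batch_size: int = 100_000) -> list[list[int]]:
        # Placeholder: in practice this would read the min/max IDs from the
        # temporary table and split them into (start, end) ranges.
        return [[0, batch_size]]

    @task
    def filter_tags_in_batch(id_range: list[int]) -> None:
        # Placeholder: apply the denylist/accuracy filtering to rows whose
        # IDs fall in `id_range`, either via SQL here or by handing the
        # batch off to an indexer worker.
        start, end = id_range
        ...

    filter_tags_in_batch.expand(id_range=get_id_batches())


tag_filtering_sketch()
```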
Defining this step is necessary going forward, since we may want to filter out entire tag providers (e.g. Clarifai) or adjust the confidence threshold down the line, and we want to preserve this data in the catalog.
Additional context
See: #430, #431, and #3925