Related query does not use filtered index #3348

sarayourfriend · 2023-11-14T01:46:20Z

Description

@stacimc and I were looking at index memory usage after the production ASGI changes (which included the related query improvements from #3151), and noticed a difference in memory usage of the filtered and unfiltered image indexes. The graphs showed the filtered index query cache memory increasing after the deployment and the unfiltered index query cache memory decreasing. This made us wonder if the changes to the related query unintentionally included a change to the index queried for the related changes. The PR does not change which index we use for the related query. But we did notice that we're using the unfiltered index for related, rather than the filtered index.

The index used for related media is passed to the related media function by the view. MediaViewSet::related passes the default index set on the media view set implementation:

openverse/api/api/views/media_views.py

Lines 162 to 166 in 6f6aac3

    
           results = related_media( 
        
               uuid=identifier, 
        
               index=self.default_index, 
        
               filter_dead=True, 
        
           )

For images and audio, this is both the image and audio index names, which are the aliases used for the unfiltered indices.

The related_media function passes this directly to Search, meaning it queries the unfiltered index. To make things a bit clearer, we should change the related_media function to accept the media type as a parameter, and then always query the filtered index for the media type. Pass the media type to make it clear that the index is chosen by the related_media function and is always the filtered index for the media type. Then use the filtered index when constructing Search, for example, search = Search(index=f"{media_type}-filtered").

The text was updated successfully, but these errors were encountered:

obulat · 2023-11-14T03:07:17Z

In the related endpoint, we use 2 ES requests: the first one to get the item with the requested identifier to get its title, tags and creator, and the second one to get related items based on that data. The second request does use the filtered index.
However, for the first request, we have to use the original index because the main item could be a sensitive one not available in the filtered index. In fact, when we pointed both indexes to the filtered index, we got several 500 errors because the first request could not find the item that was in the non-filtered index (the issue for turning these errors into 404 is forthcoming).
If we used the database to get the item instead of ES, would it be more performant and use less memory, @sarayourfriend ?

sarayourfriend · 2023-11-14T04:47:05Z

Oh! I totally misread the code. Never mind, this ticket is invalid as is. I might open a PR to add some comments to help easy skimming of the related media function, but there's nothing to do here. Thanks for clarifying how the function works.

sarayourfriend · 2023-11-14T04:49:09Z

If we used the database to get the item instead of ES, would it be more performant and use less memory, @sarayourfriend ?

For posterity, I don't think there's an easy way to measure this, and I'm sceptical a request to one or the other would matter too much either way. If we want to overall reduce the number of queries to ES, then that's a potentially meaningful way of thinking about the optimisation, because theoretically those requests for individual items are tying up threads that could serve a search request, but our actual search throughput doesn't appear to be an issue as long as our queries are working well. I'd say we shouldn't make a change in that regard unless we have a confident way to measure the effect of it.

sarayourfriend added help wanted Open to participation from the community 🟨 priority: medium Not blocking but should be addressed soon 🛠 goal: fix Bug fix 🕹 aspect: interface Concerns end-users' experience with the software 🧱 stack: api Related to the Django API labels Nov 14, 2023

openverse-bot added this to Openverse Backlog Nov 14, 2023

github-project-automation bot moved this to 📋 Backlog in Openverse Backlog Nov 14, 2023

obulat removed the help wanted Open to participation from the community label Nov 14, 2023

sarayourfriend closed this as not planned Won't fix, can't repro, duplicate, stale Nov 14, 2023

github-project-automation bot moved this from 📋 Backlog to ✅ Done in Openverse Backlog Nov 14, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Related query does not use filtered index #3348

Related query does not use filtered index #3348

sarayourfriend commented Nov 14, 2023

obulat commented Nov 14, 2023

sarayourfriend commented Nov 14, 2023

sarayourfriend commented Nov 14, 2023

Related query does not use filtered index #3348

Related query does not use filtered index #3348

Comments

sarayourfriend commented Nov 14, 2023

Description

obulat commented Nov 14, 2023

sarayourfriend commented Nov 14, 2023

sarayourfriend commented Nov 14, 2023