Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Related query does not use filtered index #3348

Closed
sarayourfriend opened this issue Nov 14, 2023 · 3 comments
Closed

Related query does not use filtered index #3348

sarayourfriend opened this issue Nov 14, 2023 · 3 comments
Labels
🕹 aspect: interface Concerns end-users' experience with the software 🛠 goal: fix Bug fix 🟨 priority: medium Not blocking but should be addressed soon 🧱 stack: api Related to the Django API

Comments

@sarayourfriend
Copy link
Collaborator

Description

@stacimc and I were looking at index memory usage after the production ASGI changes (which included the related query improvements from #3151), and noticed a difference in memory usage of the filtered and unfiltered image indexes. The graphs showed the filtered index query cache memory increasing after the deployment and the unfiltered index query cache memory decreasing. This made us wonder if the changes to the related query unintentionally included a change to the index queried for the related changes. The PR does not change which index we use for the related query. But we did notice that we're using the unfiltered index for related, rather than the filtered index.

The index used for related media is passed to the related media function by the view. MediaViewSet::related passes the default index set on the media view set implementation:

results = related_media(
uuid=identifier,
index=self.default_index,
filter_dead=True,
)

For images and audio, this is both the image and audio index names, which are the aliases used for the unfiltered indices.

The related_media function passes this directly to Search, meaning it queries the unfiltered index. To make things a bit clearer, we should change the related_media function to accept the media type as a parameter, and then always query the filtered index for the media type. Pass the media type to make it clear that the index is chosen by the related_media function and is always the filtered index for the media type. Then use the filtered index when constructing Search, for example, search = Search(index=f"{media_type}-filtered").

@sarayourfriend sarayourfriend added help wanted Open to participation from the community 🟨 priority: medium Not blocking but should be addressed soon 🛠 goal: fix Bug fix 🕹 aspect: interface Concerns end-users' experience with the software 🧱 stack: api Related to the Django API labels Nov 14, 2023
@github-project-automation github-project-automation bot moved this to 📋 Backlog in Openverse Backlog Nov 14, 2023
@obulat obulat removed the help wanted Open to participation from the community label Nov 14, 2023
@obulat
Copy link
Contributor

obulat commented Nov 14, 2023

In the related endpoint, we use 2 ES requests: the first one to get the item with the requested identifier to get its title, tags and creator, and the second one to get related items based on that data. The second request does use the filtered index.
However, for the first request, we have to use the original index because the main item could be a sensitive one not available in the filtered index. In fact, when we pointed both indexes to the filtered index, we got several 500 errors because the first request could not find the item that was in the non-filtered index (the issue for turning these errors into 404 is forthcoming).
If we used the database to get the item instead of ES, would it be more performant and use less memory, @sarayourfriend ?

@sarayourfriend
Copy link
Collaborator Author

Oh! I totally misread the code. Never mind, this ticket is invalid as is. I might open a PR to add some comments to help easy skimming of the related media function, but there's nothing to do here. Thanks for clarifying how the function works.

@sarayourfriend sarayourfriend closed this as not planned Won't fix, can't repro, duplicate, stale Nov 14, 2023
@github-project-automation github-project-automation bot moved this from 📋 Backlog to ✅ Done in Openverse Backlog Nov 14, 2023
@sarayourfriend
Copy link
Collaborator Author

If we used the database to get the item instead of ES, would it be more performant and use less memory, @sarayourfriend ?

For posterity, I don't think there's an easy way to measure this, and I'm sceptical a request to one or the other would matter too much either way. If we want to overall reduce the number of queries to ES, then that's a potentially meaningful way of thinking about the optimisation, because theoretically those requests for individual items are tying up threads that could serve a search request, but our actual search throughput doesn't appear to be an issue as long as our queries are working well. I'd say we shouldn't make a change in that regard unless we have a confident way to measure the effect of it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🕹 aspect: interface Concerns end-users' experience with the software 🛠 goal: fix Bug fix 🟨 priority: medium Not blocking but should be addressed soon 🧱 stack: api Related to the Django API
Projects
Archived in project
Development

No branches or pull requests

2 participants