Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simplify the related search #3151

Merged
merged 4 commits into from
Oct 10, 2023
Merged

Simplify the related search #3151

merged 4 commits into from
Oct 10, 2023

Conversation

obulat
Copy link
Contributor

@obulat obulat commented Oct 5, 2023

Fixes

Fixes #3149 by @obulat

Description

This PR:

  • uses the filtered index for related search
  • removes the search context building. Search context allows us to return sensitivities for the media items, but since we only use the filtered index, all sensitivities are empty anyways.
    removes the mature: False filter since we use the filtered index (removing this filter would show the user-reported mature items, so I put it back)
  • removes creator from the fields that are used in the MoreLikeThis query

Testing Instructions

Check that the related endpoints (the related in image and audio results) work as expected.
To see this work visually, you can run the frontend Nuxt app using the local API:

  1. Set the API_URL variable in the frontend/.env: API_URL=http://0.0.0.0:50280/
  2. Run the API using just up.
  3. Run Nuxt using just frontend/run dev.
    Now, you can search for something, open the single result page and see that it has the related items.

Checklist

  • My pull request has a descriptive title (not a vague title likeUpdate index.md).
  • My pull request targets the default branch of the repository (main) or a parent feature branch.
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added or updated tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no visible errors.
  • I ran the DAG documentation generator (if applicable).

Developer Certificate of Origin

Developer Certificate of Origin
Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@obulat obulat requested a review from a team as a code owner October 5, 2023 18:39
@obulat obulat requested review from dhruvkb and stacimc October 5, 2023 18:39
@github-actions github-actions bot added the 🧱 stack: api Related to the Django API label Oct 5, 2023
@openverse-bot openverse-bot added the 🚦 status: awaiting triage Has not been triaged & therefore, not ready for work label Oct 5, 2023
@obulat obulat added 🟧 priority: high Stalls work on the project or its dependents ✨ goal: improvement Improvement to an existing user-facing feature 💻 aspect: code Concerns the software code in the repository and removed 🚦 status: awaiting triage Has not been triaged & therefore, not ready for work labels Oct 5, 2023
@sarayourfriend
Copy link
Collaborator

removes the mature: False filter since we use the filtered index

Quick note that the filtered index only filters sensitive text, so user reported sensitive media will still be in there (just with mature=true). For full mature filtering (sensitive text and user reported) we still need to filter mature=false. (These names are confusing!)

@obulat obulat self-assigned this Oct 6, 2023
Copy link
Member

@dhruvkb dhruvkb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! This is a solid improvement to the related/ endpoint.

Copy link
Collaborator

@AetherUnbound AetherUnbound left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, I see that #3156 builds off of this but in a way that makes some of these changes irrelevant, is that correct?

@obulat
Copy link
Contributor Author

obulat commented Oct 7, 2023

LGTM, I see that #3156 builds off of this but in a way that makes some of these changes irrelevant, is that correct?

Yes. I wanted to implement the simple "quick win" approach on Friday, thinking that the "more correct" approach would take a long time and be more difficult. Then, next day I realized that #3156 would actually be doable, I opened it, and based it on this PR (because here we remove the creator from the query and use the filtered index).

search_client = Search(index=index)
# Search the default index for the item itself as it might be sensitive.
item_search = Search(index=index)
item_hit = item_search.query("match", identifier=uuid).execute().hits[0]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should use identifier__keyword and term to avoid stemming on the text field that identifier is currently. This issue came up during the development of the SearchContext::build method where I noticed I wasn't getting the exact list of identifiers back 😱

Comment on lines 514 to 516
tags = ",".join([tag.name for tag in tags[:10]])
tags_query = SimpleQueryString(fields=["tags.name"], query=tags)
related_query |= tags_query
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if a multi_match wouldn't be more straightforward than constructing a simple query string query? If we do use simple query string, though, , isn't an operator available, as far as I can tell: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-simple-query-string-query.html#simple-query-string-syntax

A space should be sufficient to use the default OR operator. In this case even query_string could work fine.

_id = item.execute().hits[0].id
# Match related using title.
title = item_hit.title
title_query = SimpleQueryString(query=title, fields=["title"])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator

@sarayourfriend sarayourfriend left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just wanted to share some initial comments but this is very cool, I'm optimistic it could lead to a genuine improvement in the related results. Do we have an event on related result clicking? I'd be keen to compare before/after if this change goes through to see whether it improves things, especially if we deploy the creator change first and then this afterwards. If there are progressively positive changes, we'd at least have a gut check about what users are looking for in the related query (that being a point besides potentially improving the performance of this route).

@obulat
Copy link
Contributor Author

obulat commented Oct 10, 2023

. Do we have an event on related result clicking? I'd be keen to compare before/after if this change goes through to see whether it improves things, especially if we deploy the creator change first and then this afterwards. If there are progressively positive changes, we'd at least have a gut check about what users are looking for in the related query (that being a point besides potentially improving the performance of this route).

I couldn't find a way to compare the amount of related clicks to the search result clicks in Plausible UI, because we can't select clicks on all "related" items, only clicks on an item that is related to specific item id.
I opened a PR to fix this: #3173 (but it won't help us compare the counts before/after this PR).

obulat and others added 4 commits October 10, 2023 15:33
* Match related by title and tags

* Use default index for item search

* Handle ES hits without tags

* Use the first 10 tags for the query

---------

Co-authored-by: Dhruv Bhanushali <dhruv_b@live.com>
@obulat obulat force-pushed the simplify_related_search branch from 661acf2 to 07d6b65 Compare October 10, 2023 12:33
@obulat obulat merged commit 079ad31 into main Oct 10, 2023
44 checks passed
@obulat obulat deleted the simplify_related_search branch October 10, 2023 14:48
@sentry-io
Copy link

sentry-io bot commented Oct 10, 2023

Suspect Issues

This pull request was deployed and Sentry observed the following issues:

  • ‼️ AttributeError: 'Hit' object has no attribute 'creator' /v1/images/{identifier}/related/ View Issue
  • ‼️ AttributeError: 'Hit' object has no attribute 'title' /v1/images/{identifier}/related/ View Issue

Did you find this useful? React with a 👍 or 👎

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
💻 aspect: code Concerns the software code in the repository ✨ goal: improvement Improvement to an existing user-facing feature 🟧 priority: high Stalls work on the project or its dependents 🧱 stack: api Related to the Django API
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

Improve related endpoint performance
5 participants