Use DB to augment ES hits for related media with required info #3408

Merged
dhruvkb merged 2 commits into main from related_info
Dec 1, 2023

Conversation

dhruvkb
Member

@dhruvkb dhruvkb commented Nov 28, 2023

Fixes

Fixes #3403 by @obulat

Description

This PR uses the DB to augment ES hits for related media with info that's not in ES. It uses the exact same pattern (and for that matter, code) as get_media_results.

Also updated the related media tests to check for the presence of these fields.
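
As a rough illustration of the pattern described above (not the actual Openverse implementation), the sketch below merges database-only fields into each ES hit while preserving the ES ranking order. `augment_hits_with_db`, `fake_fetch`, `FAKE_DB`, and the sample fields are hypothetical names; only the `identifier` field comes from this PR's discussion.

```python
from __future__ import annotations

from typing import Callable, Iterable


def augment_hits_with_db(
    hits: list[dict],
    fetch_by_identifiers: Callable[[list[str]], Iterable[dict]],
) -> list[dict]:
    """Merge DB-only fields into ES hits, keeping the ES ranking order."""
    identifiers = [hit["identifier"] for hit in hits]
    # One lookup for the whole page of hits, not one lookup per hit.
    rows = {row["identifier"]: row for row in fetch_by_identifiers(identifiers)}
    augmented = []
    for hit in hits:
        row = rows.get(hit["identifier"])
        if row is None:
            # The hit exists in ES but not in the DB; this sketch skips it.
            continue
        augmented.append({**hit, **row})
    return augmented


# Stand-in for a database table (hypothetical data).
FAKE_DB = {
    "a1": {"identifier": "a1", "attribution": "Photo by A"},
    "b2": {"identifier": "b2", "attribution": "Photo by B"},
}


def fake_fetch(identifiers: list[str]) -> list[dict]:
    return [FAKE_DB[i] for i in identifiers if i in FAKE_DB]


es_hits = [
    {"identifier": "b2", "title": "Second"},
    {"identifier": "a1", "title": "First"},
]
augmented_hits = augment_hits_with_db(es_hits, fake_fetch)
```

Passing the lookup in as a callable keeps the merge logic independent of the storage layer, which is one way the same helper could serve both the search and related views.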

Testing Instructions

Try, and fail, to repro the issue described in #3403 on this branch.

Checklist

  • My pull request has a descriptive title (not a vague title like Update index.md).
  • My pull request targets the default branch of the repository (main) or a parent feature branch.
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added or updated tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no visible errors.
  • I ran the DAG documentation generator (if applicable).

Developer Certificate of Origin

Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@dhruvkb dhruvkb requested a review from a team as a code owner November 28, 2023 19:36
@openverse-bot openverse-bot added the 🟧 priority: high, 🛠 goal: fix, 💻 aspect: code, and 🧱 stack: api labels Nov 28, 2023
Contributor

@obulat obulat left a comment

Thank you for such a quick fix, @dhruvkb! It works well and matches what we do on another endpoint.
I added non-blocking questions inline.

api/test/media_integration.py
@@ -267,6 +267,10 @@ def related(self, request, identifier=None, *_, **__):

        serializer_context = self.get_serializer_context()

        serializer_class = self.get_serializer()
        if serializer_class.needs_db:
            results = self.get_db_results(results)

Contributor

Do we know how much fetching the DB results affects search performance?
I wonder how we could measure whether it's better to get this data from the database or to keep it in the ES indexes and read it from there.

Member Author

It is a single query to the DB for all results, using the identifier field, which is both unique=True and db_index=True, so I don't think it is inefficient. Personally, I prefer it the way it is right now because we have to query the DB anyway for relational info that can't be stored in ES. Also, the amount of relational info will only increase with the new projects slated for 2024.
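
To illustrate the single-query point, here is a minimal sketch using the standard-library `sqlite3` module rather than the Django ORM that the API actually uses. The `image` table, its columns, and the `fetch_rows` helper are assumptions made up for this example.

```python
from __future__ import annotations

import sqlite3

conn = sqlite3.connect(":memory:")
# UNIQUE creates an index on identifier, so lookups by it avoid a table scan.
conn.execute("CREATE TABLE image (identifier TEXT UNIQUE, license TEXT)")
conn.executemany(
    "INSERT INTO image VALUES (?, ?)",
    [("a1", "by"), ("b2", "cc0"), ("c3", "by-sa")],
)


def fetch_rows(identifiers: list[str]) -> list[tuple]:
    """Fetch all requested rows in a single query."""
    if not identifiers:
        return []
    placeholders = ", ".join("?" for _ in identifiers)
    query = (
        "SELECT identifier, license FROM image "
        f"WHERE identifier IN ({placeholders})"
    )
    rows = conn.execute(query, identifiers).fetchall()
    # IN does not guarantee row order, so restore the caller's ordering.
    position = {ident: i for i, ident in enumerate(identifiers)}
    return sorted(rows, key=lambda row: position[row[0]])


related_rows = fetch_rows(["c3", "a1"])
```

The number of round trips stays at one regardless of how many hits are on the page; only the length of the `IN` list grows.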

Contributor

Thank you for the explanation, @dhruvkb, makes sense!

Collaborator

@dhruvkb 100% in agreement here as well. That would be a good thing to document somewhere as the "rationale" of the serialiser approach, preferably in code or, if not there, then in the API docs, in something of a "design decisions" document. It's similar to the ES document _id not being the same as identifier, for example, in that it's easy to forget why this is the "right" way to do it and then re-ask this same question (or for new folks to ask this question). A "design decisions" doc I think would be a great thing to have in general (to summarise my point 😀).

Collaborator

Actually, rather than a new documentation page, pulling this behaviour into a function that the two view methods can call and adding the rationale in the doc string there would be more resilient to future changes.

Member Author

@dhruvkb dhruvkb Dec 1, 2023

Both views ultimately use the get_db_results function, which I've documented in 324fe92. I think the needs_db mechanism is largely moot because almost all operations ultimately need the DB, and once that is simplified (#3436), get_db_results will be the common functionality entirely.

@openverse-bot
Collaborator

Based on the high urgency of this PR, the following reviewers are being gently reminded to review this PR:

@krysal
@sarayourfriend
This reminder is being automatically generated due to the urgency configuration.

Excluding weekend[1] days, this PR was ready for review 2 day(s) ago. PRs labelled with high urgency are expected to be reviewed within 2 weekday(s)[2].

@dhruvkb, if this PR is not ready for a review, please draft it to prevent reviewers from getting further unnecessary pings.

Footnotes

  1. Specifically, Saturday and Sunday.

  2. For the purpose of these reminders we treat Monday - Friday as weekdays. Please note that the operation that generates these reminders runs at midnight UTC on Monday - Friday. This means that depending on your timezone, you may be pinged outside of the expected range.

Member

@krysal krysal left a comment

LGTM! Tested with the first general image result: on main the mentioned fields are null, and on this branch they are correctly filled ✅

http://localhost:50280/v1/images/3d0af15e-5f3a-42f9-8833-1a3d6baf90c3/related/
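
A manual check like this one could also be scripted. The sketch below assumes the endpoint returns JSON with a `results` list; `null_fields` is a hypothetical helper, and the sample payloads and field names are made up for illustration rather than taken from the real response.

```python
def null_fields(response: dict) -> set:
    """Return the names of fields that are null in any result."""
    missing = set()
    for result in response.get("results", []):
        missing.update(key for key, value in result.items() if value is None)
    return missing


# Hypothetical payloads standing in for the responses on main and on this branch.
before_fix = {"results": [{"id": "a1", "title": "First", "attribution": None}]}
after_fix = {"results": [{"id": "a1", "title": "First", "attribution": "Photo by A"}]}

print(sorted(null_fields(before_fix)))
print(sorted(null_fields(after_fix)))
```

In practice the dictionaries would come from parsing the body of a request to the related endpoint shown above.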

@dhruvkb
Member Author

dhruvkb commented Dec 1, 2023

Given the high priority, I'm going to merge this and we'll solve for #3436 separately (and handle the extraction of common code as a part of that simplification process).

@dhruvkb dhruvkb merged commit adb56f1 into main Dec 1, 2023
45 checks passed
@dhruvkb dhruvkb deleted the related_info branch December 1, 2023 16:45
Labels
  • 💻 aspect: code (Concerns the software code in the repository)
  • 🛠 goal: fix (Bug fix)
  • 🟧 priority: high (Stalls work on the project or its dependents)
  • 🧱 stack: api (Related to the Django API)
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

Related endpoint does not return all of the necessary properties
5 participants