Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use DB to augment ES hits for related media with required info #3408

Merged
merged 2 commits into from
Dec 1, 2023
Merged
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 4 additions & 0 deletions api/api/views/media_views.py
Original file line number Diff line number Diff line change
Expand Up @@ -267,6 +267,10 @@ def related(self, request, identifier=None, *_, **__):

serializer_context = self.get_serializer_context()

serializer_class = self.get_serializer()
if serializer_class.needs_db:
results = self.get_db_results(results)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have any idea of how much getting the db results affects the performance of the searches?
I wonder how we can measure if it's better to get this data from the database, or to keep it in ES indexes and get it from there?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a single query to the DB for all results using the field identifier that's both unique=True and index=True. I don't think it is inefficient. Personally I prefer one the way it is rn because we have to query the DB anyway for relational info that can't be stored in ES. Also the amount of relational info will only increase with new projects slated for 2024.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the explanation, @dhruvkb , makes sense!

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dhruvkb 100% in agreement here as well, that would be a good thing to document somewhere as the "rationale" of the serialiser approach, prefereably in code or, if not there, then in the API docs, in something of a "design decisions" document. It's similar to the the ES document _id not being the same as identifier, for example, in that it's easy to forget why this is the "right" way to do it and then re-ask this same question (or for new folks to ask this question). A "design decisions" doc I think would be a great thing to have in general (to summarise my point 😀).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, rather than a new documentation page, pulling this behaviour into a function that the two view methods can call and adding the rationale in the doc string there would be more resilient to future changes.

Copy link
Member Author

@dhruvkb dhruvkb Dec 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both views ultimately use the get_db_results function, I've documented that function in 324fe92. I think the needs_db mechanism is largely moot because almost all operations ultimately need the DB and once that is simplified (#3436), get_db_results will be the common functionality entirely.


serializer = self.get_serializer(results, many=True, context=serializer_context)
return self.get_paginated_response(serializer.data)

Expand Down
4 changes: 4 additions & 0 deletions api/test/media_integration.py
Original file line number Diff line number Diff line change
Expand Up @@ -227,6 +227,10 @@ def get_terms_set(res):
or result["creator"] == item["creator"]
), f"{terms_set} {get_terms_set(result)}/{result['creator']}-{item['creator']}"

assert result["license_version"] is not None
assert result["attribution"] is not None
assert result["creator_url"] is not None

dhruvkb marked this conversation as resolved.
Show resolved Hide resolved

def sensitive_search_and_detail(media_type):
search_res = requests.get(
Expand Down