
Reduce ES shard count and simplify index properties #3143

Merged: dhruvkb merged 16 commits into main from smaller_indices on Oct 19, 2023
Conversation

@dhruvkb (Member) commented Oct 5, 2023:

Series

This PR is part 1/3.

  1. This PR
  2. Use top-level keyword fields instead of subfields #3161
  3. Remove unreferenced subfields in indices #3162

Fixes

Fixes #2154 by @sarayourfriend.

Builds upon past work from WordPress/openverse-api#684 and WordPress/openverse-api#850.

Description

This PR

  • removes ES fields not used for filtering or searching
  • converts text fields to keywords where partial matching is not needed
  • updates usages of these changed fields in the API
  • refactors ES models to DRY some common code
  • reduces the shards for the audio index as a way to improve performance
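The text-to-keyword conversion and shard reduction can be sketched as a before/after mapping fragment. This is a hypothetical illustration, not the actual es_mapping.py diff; the field choice and shard counts are assumptions.

```python
# Hypothetical before/after of the kind of mapping change this PR makes.
# Field names and shard counts are illustrative, not the real es_mapping.py.

BEFORE = {
    "settings": {"index": {"number_of_shards": 18}},
    "mappings": {
        "properties": {
            # Analyzed text plus a keyword subfield, even though the
            # field is only ever matched exactly.
            "identifier": {
                "type": "text",
                "fields": {"keyword": {"type": "keyword"}},
            },
        }
    },
}

AFTER = {
    # Fewer shards for the smaller audio index.
    "settings": {"index": {"number_of_shards": 1}},
    "mappings": {
        "properties": {
            # Exact matching only, so a plain keyword field suffices.
            "identifier": {"type": "keyword"},
        }
    },
}
```

A keyword field skips analysis entirely, which is where most of the per-field index overhead comes from.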

Testing Instructions

CI should pass. Compare index sizes (using something like Elasticvue) before and after the PR to see the difference.

  • just down -v
  • just dc build web ingestion_server
  • just api/init

Deployment

This PR needs a multi-step deployment strategy that is well-documented in #2154 and described in a Google Doc.

Checklist

  • My pull request has a descriptive title (not a vague title like Update index.md).
  • My pull request targets the default branch of the repository (main) or a parent feature branch.
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added or updated tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no visible errors.
  • I ran the DAG documentation generator (if applicable).

Developer Certificate of Origin

Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@dhruvkb dhruvkb added 🟧 priority: high Stalls work on the project or its dependents ✨ goal: improvement Improvement to an existing user-facing feature 💻 aspect: code Concerns the software code in the repository 🧱 stack: api Related to the Django API 🧱 stack: ingestion server Related to the ingestion/data refresh server labels Oct 5, 2023
@dhruvkb dhruvkb marked this pull request as ready for review October 5, 2023 13:26
@dhruvkb dhruvkb requested a review from a team as a code owner October 5, 2023 13:26
@obulat (Contributor) left a comment:
Love the changes removing all of the unnecessary mappings! And thank you for opening this PR so fast, @dhruvkb!

I wonder why the sample index sizes are so different locally for images and audio. Even though we have 5000 items for both media types, audio takes 2.47 MB while image takes 5.54 MB, almost twice as big.

Review thread on api/api/controllers/search_controller.py (outdated, resolved):
    "tags": {
        "properties": {
            "accuracy": {"type": "float"},
            # Text-based fields
Contributor commented:

I think setting it to keyword-only would greatly reduce the size of the index. It would change search-by-tags slightly: for example, if you search for draw, you would probably no longer get media that has the drawing tag. But do we really want it to match like that?

dhruvkb (Member, author) replied:

This is an important point; we could convert it into keywords (and potentially use stems so that 'drawing' and 'draw' evaluate to the same keyword). I'll let others chime in on this discussion as well.
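The keyword-plus-stemming idea floated here could look something like the toy sketch below: normalise each tag with a stemmer before indexing it as a keyword, so draw and drawing collapse to one term. A real implementation would use a proper stemmer (e.g. Snowball, via an ES normalizer or at ingestion time); the suffix-stripper here is purely illustrative.

```python
# Toy sketch of stemming tags before indexing them as keywords, so that
# "draw" and "drawing" normalise to the same term. A real implementation
# would use a proper stemmer (e.g. Snowball); this is only illustrative.

def stem_tag(tag: str) -> str:
    tag = tag.lower().strip()
    for suffix in ("ings", "ing", "s"):
        stem = tag[: -len(suffix)]
        if tag.endswith(suffix) and len(stem) >= 3:
            return stem
    return tag

# stem_tag("drawing") and stem_tag("draw") both yield "draw";
# stem_tag("dogs") yields "dog".
```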

Collaborator replied:

Could we maybe see how this set of changes goes, then consider moving tags to keywords as a follow-up? I agree that keywords + stemming could be useful (if that's a distinction we can make), but it would be a significant change to the way our search currently operates which leaves me a little hesitant.

dhruvkb (Member, author) replied:

True, we can keep this possibility open as a future optimisation.

Collaborator replied:

When discussing the collection views we explicitly said that tag stemming was desirable; the example we used was that the tag dogs should match dog. Any other approach doesn't make much sense and would confound users (and ourselves, I have to think!).

@stacimc (Collaborator) left a comment:

This looks good to me, although I'm a little confused about the deployment strategy. Does this need to be broken into multiple PRs for the different stages?

@dhruvkb (Member, author) commented Oct 6, 2023:

@stacimc I'm not sure we would need to break up the PR, as we can deploy changes to the ingestion server and API separately. But it would definitely need to be a multi-stage deployment with a data refresh between the ingestion server and API deployments.

@dhruvkb (Member, author) commented Oct 6, 2023:

@obulat the size difference depends mostly on the contents of the title, description and tags fields.

| Media type | None    | With title only | With desc only | With tags only | Complete |
|------------|---------|-----------------|----------------|----------------|----------|
| audio      | 1.18 MB | 1.59 MB         | 1.59 MB        | 1.43 MB        | 2.25 MB  |
| image      | 1.78 MB | 2.02 MB         | 2.80 MB        | 3.80 MB        | 5.78 MB  |

In almost all of the fields (especially tags), image documents seem to contain more data per document than their audio counterparts which leads to the vast size difference.

@stacimc (Collaborator) commented Oct 6, 2023:

> I'm not sure we would need to break the PR as we can deploy changes the ingestion server and API separately. But it would definitely need to be a multi-stage deployment with a data refresh between the ingestion server and API deployments.

Definitely, I'm just hesitant to merge as-is if it will put the API in a state where we can't deploy until the ingestion server has been deployed and a data refresh run. The data refresh is currently paused because of the related API instability and the API will need to be deployed to fix a vulnerability. If we pull the API changes into a separate PR it would just ensure that the API doesn't accidentally get deployed in a bad state.

@obulat (Contributor) commented Oct 6, 2023:

> @obulat the size difference depends majorly on the contents of the title, description and tags fields.

Interesting! Thank you so much for investigating and sharing, @dhruvkb !

@obulat (Contributor) commented Oct 6, 2023:

> Definitely, I'm just hesitant to merge as-is if it will put the API in a state where we can't deploy until the ingestion server has been deployed and a data refresh run. The data refresh is currently paused because of the related API instability and the API will need to be deployed to fix a vulnerability. If we pull the API changes into a separate PR it would just ensure that the API doesn't accidentally get deployed in a bad state.

We might need to wait to merge this PR until we are ready to do all of the following at the same time:

  • deploy the API
  • deploy the ingestion server
  • start the data refresh

@stacimc (Collaborator) commented Oct 6, 2023:

@obulat The problem is that once this is merged, we can't deploy the API until the other two steps (ingestion server deploy and data refresh) are both complete. For that 3-day period while the data refresh is running, we wouldn't be able to deploy the API (without backing out the API changes). If something like a security vulnerability or critical bug comes up, we'd have to remember to back out the changes first.

@dhruvkb (Member, author) commented Oct 6, 2023:

It's easier if I just separate the API changes into a separate PR. But both PRs will fail CI, and CI will remain broken until both pieces are reunited. That would also surely prevent us from deploying anything.

@AetherUnbound (Collaborator) commented:

If CI is failing, does that mean the API will start erroring out once the data refresh is complete?

@dhruvkb (Member, author) commented Oct 7, 2023:

> What would cause CI to fail if these were split?

The API's search_controller contains references to fields and subfields defined in es_mapping.py, which need to be updated when the mapping changes. For example, identifier__keyword needs to be replaced with identifier (as there is no longer a keyword subfield).

Prod will not break if we split this PR because the API can continue to use the old index while the data refresh step runs. But in CI, the "data refresh" step happens on every run and the API always uses the latest indices (it doesn't have the old ones), so unless updated, it will reference nonexistent fields in ES.
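Concretely, the mismatch looks like this. The query bodies below are illustrative (field and value are made up), but they show why the API and mapping must move in lockstep: the old API code filters on a keyword subfield that the new mapping no longer defines.

```python
# Illustrative term-query bodies showing the coupling between the API and
# the index mapping. Field names and the value are made up for illustration.

old_query = {"term": {"identifier.keyword": "abcd-1234"}}  # pre-PR mapping
new_query = {"term": {"identifier": "abcd-1234"}}          # post-PR mapping
```

Running the old query against the new mapping simply matches nothing, since identifier.keyword no longer exists.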

@sarayourfriend (Collaborator) commented Oct 7, 2023:

This change, as is, is not a zero-downtime change. It would need to be merged essentially in unison with the data refresh; otherwise the index mapping will change out from underneath the API while it is still referencing the old format.

To make it non-breaking, we need to think of it like a database migration to change column types. This issue gives clear steps for how to accomplish this for the identifier field. The same method applies to all the others. These mapping changes should be at minimum three PRs to prevent breaking changes in live environments or in CI.

> removes ES fields not used for filtering or searching

This also must not be done. Search is at least capable of serializing from the ES hits (more below about whether this actually happens), not always the database, and therefore fields required to return the search results must remain (url and foreign landing url). If they aren't searched, we should just disable indexing on them, not remove them entirely. They're useful for debugging and testing anyway, and including them in the source document introduces no overhead aside from the trivial serialization and storage overhead, which we've already been paying without issue.

https://github.com/WordPress/openverse/blob/bc7b2a7f6ee32fc3ca367baae7681abc41ef7f79/api/api/views/media_views.py#L137-L140

I'm confused by the existing code on the serializers. It seems that in practice we essentially always use the database results, because AudioSerializer and ImageSerializer both have needs_db = True. The right-hand side of the conditional in the code I've linked above (serializer_class.needs_db) is always true. We can probably clarify the way this works to avoid confusion and just always use the database to serialize results. It would simplify the serializer code as well.

In any case, removing those fields outright isn't a meaningful optimisation, it hurts debug-ability (if querying ES directly we'd have to now do a separate search against the database to retrieve the URL and foreign landing URL), and risks future media types breaking outright (unless we make the change to always use the database results). Even though the last one can be mitigated (and we should revisit this code, no doubt, at least to confirm the conditional still makes sense), reducing the amount of easily accessible information when debugging for no additional benefit, when the underlying benefit of reducing index size can be had by disabling indexing on those fields, doesn't make sense to me.
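The alternative proposed here, keeping the fields in _source but disabling indexing, is a one-line mapping option per field. A sketch, where the exact field definitions are assumptions:

```python
# Sketch of keeping fields retrievable from _source while disabling
# indexing on them. "index": False is a standard Elasticsearch mapping
# option; the exact field definitions here are assumptions.

unindexed_fields = {
    "url": {"type": "keyword", "index": False},
    "foreign_landing_url": {"type": "keyword", "index": False},
}
```

Such fields cannot be queried, but they still come back with every hit, which is what makes them useful for debugging.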

@dhruvkb (Member, author) commented Oct 7, 2023:

> This issue gives clear steps for how to accomplish this

I misunderstood the issue description on my first pass. Let me work on this a little more.

Update: this PR has been split into 3 (the other two being #3161 and #3162).

> Search is at least capable of serializing from the ES hits

Not all fields we need in the search results are available in ES so we have to use the DB to populate those fields (hence the needs_db in the serializers). If we keep them in ES we are storing them twice and not reading from them at all.

> we should just disable indexing on them, not remove them entirely

This would not reduce the storage space required as they would be contained in the documents _source field and would consume space. Getting rid of them not only reduces the index size but actually frees up more storage space.
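For completeness, Elasticsearch also supports the inverse trade-off: excluding fields from the stored _source while still indexing them. A purely illustrative fragment (the PR instead drops the fields entirely, and the field names here are assumptions):

```python
# Mapping fragment excluding fields from the stored _source. This saves
# document storage while keeping the fields searchable; illustrative only.

mapping_fragment = {
    "mappings": {
        "_source": {"excludes": ["url", "foreign_landing_url"]},
    }
}
```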

> They're useful for debugging and testing anyway

I have not done any debugging where I would look up ES results so I am not qualified to comment on this, but if they're useful we can add them back. On the other hand, it's an O(1) operation to get more info about an item from Postgres using the UUID. Postgres can provide all the info ES has (and more) so I don't feel like we're losing any valuable info here. Also, we retain the 'URL' field because it's required to be present on Hit in some places.

> we've already been paying without issue

If there is a way to save some storage (and potentially money) why not take it?

> We can probably clarify the way this works to avoid confusion on it

Yes. When the code was written, it allowed skipping the DB if the needed info could be provided by the ES Hit. We know that this is not the case, and one DB call is okay, so we can always use the DB. This refactoring is definitely a good idea.
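The refactor agreed on above could reduce the serialization path to something like the following sketch, where the DB is always consulted. All names are assumptions, not the actual media_views.py code; fetch_rows stands in for the Django ORM lookup the real API would perform.

```python
# Hypothetical always-use-the-DB serialization path. `fetch_rows` stands in
# for the Django ORM lookup the real API would perform; all names here are
# illustrative, not the actual media_views.py code.

def serialize_results(hits, fetch_rows):
    """Hydrate ES hits from DB rows, preserving ES ranking order."""
    rows = fetch_rows([hit["identifier"] for hit in hits])
    # Drop hits whose DB row is missing (e.g. deleted since the last reindex).
    return [rows[h["identifier"]] for h in hits if h["identifier"] in rows]
```

This removes the needs_db branch entirely: the ES hit contributes only the identifier and the ranking, and the DB row supplies every serialized field.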

@dhruvkb dhruvkb removed the 🧱 stack: ingestion server Related to the ingestion/data refresh server label Oct 7, 2023
@sarayourfriend (Collaborator) commented:

> If there is a way to save some storage (and potentially money) why not take it?

We won't save money, because the disk size would not decrease. The transfer rate of these strings is trivial compared to the overhead of indexing the fields. Removing them is a significant change to the API of the documents that seems unrelated. We're not worried about storage efficiency in our index, as far as I'm aware. I'm sure over several hundreds of millions of records it will make a difference in the overall size of the stored documents, but it wouldn't cause us to change the amount of storage on our cluster (which we haven't touched in a long time and would require a lot more work, i.e., reprovisioning the cluster with the new instance sizes).

I've definitely used at least url in debugging outages and issues with our photon integration, for example (where ES was more convenient to query than Postgres).

I guess at the end of the day it's fine to remove it if there is a compelling reason and it would actually affect our storage provisioning, keeping in mind that the actual amount of storage we use has no effect on the cost of the storage we allocate, which is static on the allocation, not the usage. But it seems unrelated to improvements in index overhead, in which storage, as far as I know, isn't really a significant factor. If I'm wrong about that though, please let me know.

Not a blocker, of course, it's easy to put back if we want, but I do feel that it could be a separate PR and considered as its own thing, rather than lumped in with an already significant set of changes.

@sarayourfriend (Collaborator) commented:

@dhruvkb Now that we have the three split-out PRs, can you share the deployment plan, including reindexes we need to do after specific deployments? I realised last night we don't need to run a full data refresh, just a reindex, and that we'll also need to run it on staging. Just want to make sure we have a clear plan so that especially production gets the changes deployed at the right time relative to the reindex operation.

@sarayourfriend (Collaborator) commented:

I also took a look at the storage situation, just to make sure there isn't actually any storage pressure, and we really are sitting just fine on storage, at least outside of reindexing windows:

[screenshot: Elasticsearch cluster storage metrics]

Each data node has a 1900 GB SSD (they are i3.2xlarge) and the data nodes have a 10 GB EBS attached (c5 instances are EBS only).

Our image indexes are 1.2 TB, and are the largest consumers of storage, so we are probably getting close to 85% storage consumption for each node during image data refreshes. There may be some precedent for evaluating storage efficiency.

However, I'd really like to see how the field type changes alone affect storage. Theoretically, by removing most of the unnecessary subfields and switching many from text to keyword, we should save a good deal of space (especially with the removals). Knowing the effect of the field changes on their own could help us understand future opportunities to make efficiency gains in index size.

Would you be okay if we split the change to remove those two fields into a separate PR? At least until we've completely removed the cases in the code where search results can be serialised using the ES hits. Even if it does not actually happen today, I'd feel more comfortable if we removed the possibility of it altogether, rather than relying on the fact that our configuration currently happens to force the other case. We're currently OK on storage, and the field changes are much more important from a performance overhead perspective, so there isn't pressure to make this specific change quickly, from my perspective.

@github-actions github-actions bot added the 🧱 stack: ingestion server Related to the ingestion/data refresh server label Oct 9, 2023
@dhruvkb (Member, author) commented Oct 9, 2023:

Cool, I've re-added the fields that were removed in c67f3ee. There were a total of 9 such fields. We can reconsider removing them later.

@dhruvkb dhruvkb removed the 🧱 stack: api Related to the Django API label Oct 9, 2023
@dhruvkb dhruvkb mentioned this pull request Oct 9, 2023
@obulat (Contributor) left a comment:
@dhruvkb, can we also update tags so that they are never None? Can we return [] when there are no tags?

    def parse_detailed_tags(json_tags):
        if not json_tags:
            return None
        parsed_tags = []
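The requested change could be as small as the following sketch; the rest of the function body, elided here, stays unchanged.

```python
# Minimal sketch of the requested change: return an empty list instead of
# None for empty input. The tag-parsing body is elided (unchanged).

def parse_detailed_tags(json_tags):
    if not json_tags:
        return []  # previously: return None
    parsed_tags = []
    ...  # tag-parsing logic unchanged
    return parsed_tags
```

Callers then always receive a list and never need a None check.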

@dhruvkb dhruvkb requested a review from obulat October 9, 2023 18:10
@AetherUnbound (Collaborator) commented:

@dhruvkb Thanks for splitting this out! Do you mind updating the description of this PR with a detailed plan of action for deploying these changes? (e.g. what needs to get merged, then run, then merged, etc.)

@dhruvkb (Member, author) commented Oct 10, 2023:

@AetherUnbound I've linked to your planning doc in the PR description.

@AetherUnbound (Collaborator) left a comment:

Changes LGTM and local testing works!

@obulat (Contributor) commented Oct 13, 2023:

@dhruvkb, do you think we should revert the change I requested for tags (replacing None with an empty list []) in light of this discussion of the Elasticsearch Hit's missing properties: #3188?

@dhruvkb (Member, author) commented Oct 13, 2023:

@obulat I think it would be best to revert the change and keep data modifications to a minimum in the ingestion server layer.

We should concentrate any data wrangling in as few layers of the stack as possible, to make these changes easier to find and modify. Considering that we are slowly planning to switch away from the ingestion server, it's best to handle these differences in either the catalog or the API.

Even if we decide to use [] here, we can do so in a separate PR like we are doing for the 9 to-be-removed fields #3143 (comment).

@dhruvkb dhruvkb merged commit be0256c into main Oct 19, 2023
43 checks passed
@dhruvkb dhruvkb deleted the smaller_indices branch October 19, 2023 05:50

Successfully merging this pull request may close these issues.

Remove text mapping on identifier
6 participants