Reduce ES shard count and simplify index properties #3143
Conversation
Love the changes removing all of the unnecessary mappings! And thank you for opening this PR so fast, @dhruvkb!
I wonder why the sample index sizes locally are so different for images and audio. Even though we have 5000 items for both media types, audio takes 2.47 MB while image takes 5.54 MB, almost twice as much.
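For reference, a minimal sketch of how to compare local index sizes without a GUI, assuming the local cluster listens on the default localhost:9200:

```python
# Compare local index sizes via the _cat/indices API
# (assumes Elasticsearch is reachable at localhost:9200).
import requests

indices = requests.get(
    "http://localhost:9200/_cat/indices",
    params={"format": "json", "h": "index,docs.count,store.size"},
).json()
for row in indices:
    print(row["index"], row["docs.count"], row["store.size"])
```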
"tags": { | ||
"properties": { | ||
"accuracy": {"type": "float"}, | ||
# Text-based fields |
I think setting it to keyword-only would greatly reduce the size of the index. It would change search by tags a little bit: e.g., if you search for `draw`, you won't get media that has the `drawing` tag. But do we really want it to match like that?
This is an important point, we could convert it into keywords (and potentially use stems so that 'drawing' and 'draw' evaluate to the same keyword). I'll let others chime in on this discussion as well.
Could we maybe see how this set of changes goes, then consider moving tags to keywords as a follow-up? I agree that keywords + stemming could be useful (if that's a distinction we can make), but it would be a significant change to the way our search currently operates, which leaves me a little hesitant.
True, we can keep this possibility open as a future optimisation.
When discussing the collection views we explicitly said that tag stemming was desirable; the example we used was that the tag `dogs` should match `dog`. Any other approach doesn't make much sense and would confound users (and ourselves, I have to think!).
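For context, the standard Elasticsearch route to make `dogs` match `dog` is a `text` field with a custom analyzer that applies the built-in `stemmer` token filter. A hedged sketch (the analyzer and field names here are illustrative, not from this PR):

```python
# Illustrative index settings: a custom analyzer that lowercases
# tokens and then stems them, so "dogs" and "dog" index to the
# same term.
settings = {
    "analysis": {
        "analyzer": {
            "stemmed": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": ["lowercase", "stemmer"],
            }
        }
    }
}

# The tag name field would then opt into that analyzer.
mappings = {
    "properties": {
        "tags": {
            "properties": {
                "name": {"type": "text", "analyzer": "stemmed"},
            }
        }
    }
}
```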
This looks good to me although I'm a little confused about the deploy strategy. Does this need to be broken into multiple PRs for the different stages?
@stacimc I'm not sure we would need to break up the PR, as we can deploy changes to the ingestion server and API separately. But it would definitely need to be a multi-stage deployment with a data refresh between the ingestion server and API deployments.
@obulat the size difference depends mostly on the contents of the documents themselves. In almost all of the fields (especially tags), image documents seem to contain more data per document than their audio counterparts, which leads to the vast size difference.
Definitely, I'm just hesitant to merge as-is if it will put the API in a state where we can't deploy until the ingestion server has been deployed and a data refresh run. The data refresh is currently paused because of the related API instability, and the API will need to be deployed to fix a vulnerability. If we pull the API changes into a separate PR, it would just ensure that the API doesn't accidentally get deployed in a bad state.
We might need to wait to merge this PR until we are ready to:
1. deploy the ingestion server,
2. run a data refresh, and
3. deploy the API.
@obulat The problem is that once this is merged, we can't deploy the API until the other two steps (ingestion server deploy and data refresh) are both complete. For that 3-day period while the data refresh is running, we wouldn't be able to deploy the API (without backing out the API changes). If something like a security vulnerability or critical bug comes up, we'd have to remember to back out the changes first.
It's easier if I just separate the API changes into a separate PR. But both PRs will fail CI, and CI will remain broken till both pieces are reunited. That would also surely prevent us from deploying anything.
If CI is failing, does that mean the API will start erroring out once the data refresh is complete?
The API's production will not break if we split this PR, because the API can continue to use the old index while the data refresh step runs. But in CI, the "data refresh" step happens on every run and the API always uses the latest indices (it doesn't have the old ones), so unless updated, it will reference nonexistent fields in ES.
This change, as-is, is not a zero-downtime change. It would need to be merged essentially in unison with the data refresh; otherwise the index mapping will change out from underneath the API while it is still referencing the old format. To make it non-breaking, we need to think of it like a database migration to change column types. This issue gives clear steps for how to accomplish this for the identifier field, and the same method applies to all the others. These mapping changes should be at minimum three PRs to prevent breaking changes in live environments or in CI.
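For readers following along, the staging being described works roughly like a multi-step column-type migration. A sketch based on this thread (not the exact steps from the linked issue):

1. Merge and deploy the ingestion server changes, so newly built indices use the new mappings.
2. Run a data refresh (or reindex) to rebuild the live indices, while the API keeps reading the old ones.
3. Merge and deploy the API changes that rely on the new mappings, then retire the old indices.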
This also must not be done. Search is at least capable of serializing from the ES hits (more below about whether this actually happens), not always the database, and therefore fields required to return the search results must remain (url and foreign landing url). If they aren't searched, we should just disable indexing on them, not remove them entirely. They're useful for debugging and testing anyway, and including them in the source document introduces no overhead aside from the trivial serialization and storage overhead, which we've already been paying without issue.

I'm confused by the existing code on the serializers. It seems that in practice we essentially always use the database results, because of how the serializer's conditional is currently configured.

In any case, removing those fields outright isn't a meaningful optimisation, it hurts debug-ability (if querying ES directly we'd have to now do a separate search against the database to retrieve the URL and foreign landing URL), and risks future media types breaking outright (unless we make the change to always use the database results). Even though the last one can be mitigated (and we should revisit this code, no doubt, at least to confirm the conditional still makes sense), reducing the amount of easily accessible information when debugging for no additional benefit, when the underlying benefit of reducing index size can be had by disabling indexing on those fields, doesn't make sense to me.
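To make the "disable indexing" alternative concrete, a minimal sketch of what it could look like in the mapping (field names are assumed from the discussion; `"index": False` keeps the values retrievable in `_source` but builds no inverted index for them):

```python
# Keep the fields retrievable for serialization and debugging,
# but skip indexing them since they are never searched.
"url": {"type": "keyword", "index": False},
"foreign_landing_url": {"type": "keyword", "index": False},
```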
I misunderstood the issue description on my first pass. Let me work on this a little more.

Update: This PR has been split into 3 (the other two being #3161 and #3162).
Not all fields we need in the search results are available in ES, so we have to use the DB to populate those fields (hence the
This would not reduce the storage space required, as they would still be contained in the documents.
I have not done any debugging where I would look up ES results, so I am not qualified to comment on this, but if they're useful we can add them back. On the other hand, it's an O(1) operation to get more info about an item from Postgres using the UUID. Postgres can provide all the info ES has (and more), so I don't feel like we're losing any valuable info here. Also, we retain the 'URL' field because it's required to be present on
If there is a way to save some storage (and potentially money), why not take it?
Yes. When the code was written, it allowed skipping the DB if the info needed could be provided by ES.
We won't save money, because the disk size would not decrease. The transfer rate of these strings is trivial compared to the overhead of indexing the fields. Removing them is a significant change to the API of the documents that seems unrelated.

We're not worried about storage efficiency in our index, as far as I'm aware. I'm sure over several hundreds of millions of records it will make a difference in the overall size of the stored documents, but it wouldn't cause us to change the amount of storage on our cluster (which we haven't touched in a long time, and which would require a lot more work to change, i.e., reprovisioning the cluster with new instance sizes). I've definitely used at least those fields when debugging against ES directly.

I guess at the end of the day it's fine to remove it if there is a compelling reason and it would actually affect our storage provisioning, keeping in mind that the actual amount of storage we use has no effect on the cost of the storage we allocate, which is static: it depends on the allocation, not the usage. But it seems unrelated to improvements in index overhead, in which storage, as far as I know, isn't really a significant factor. If I'm wrong about that, though, please let me know.

Not a blocker, of course, it's easy to put back if we want, but I do feel that it could be a separate PR and considered as its own thing, rather than lumped in with an already significant set of changes.
@dhruvkb Now that we have the three split-out PRs, can you share the deployment plan, including reindexes we need to do after specific deployments? I realised last night we don't need to run a full data refresh, just a reindex, and that we'll also need to run it on staging. Just want to make sure we have a clear plan so that especially production gets the changes deployed at the right time relative to the reindex operation.
I also took a look at the storage situation, just to make sure that there isn't actually storage pressure, and we are really sitting just fine on storage, at least outside of reindexing windows: each data node has a 1900 GB SSD (they are i3.2xlarge), and the master nodes have a 10 GB EBS volume attached (c5 instances are EBS-only). Our image indexes are 1.2 TB and are the largest consumers of storage, so we are probably getting close to 85% storage consumption for each node during image data refreshes. There may be some precedent for evaluating storage efficiency.

However, I'd really like to see how the field type changes alone affect storage. Theoretically, by removing most of the unnecessary subfields and switching many from text to keyword, we should save a good deal of space (especially with the removals). Knowing the effect of the field changes on their own could help us understand future opportunities to make efficiency gains in index size.

Would you be okay if we split the change to remove those two fields into a separate PR? At least until we've completely removed the cases in the code where search results can be serialised using the ES hits. Even if it does not actually happen today, I'd feel more comfortable if we removed the possibility altogether, rather than relying on the fact that our configuration currently happens to force the other case. We're currently OK on storage, and the field changes are much more important from a performance-overhead perspective, so there isn't pressure to make this specific change quickly, from my perspective.
Cool, I've re-added the fields that were removed in c67f3ee. There were a total of 9 such fields. We can reconsider removing them later. |
@dhruvkb, can we also update the `tags` field so that it is never `None`? Can we return `[]` when there are no tags?
openverse/ingestion_server/ingestion_server/elasticsearch_models.py
Lines 188 to 191 in 3bb5bff
```python
def parse_detailed_tags(json_tags):
    if not json_tags:
        return None
    parsed_tags = []
```
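A minimal sketch of the suggested change, assuming the rest of the function stays the same:

```python
def parse_detailed_tags(json_tags):
    if not json_tags:
        return []  # empty list instead of None, so `tags` is never null
    parsed_tags = []
    ...  # rest of the parsing unchanged
```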
@dhruvkb Thanks for splitting this out! Do you mind updating the description of this PR with a detailed plan of action for deploying these changes? (e.g. what needs to get merged, then run, then merged, etc.)
@AetherUnbound I've linked to your planning doc in the PR description. |
Changes LGTM and local testing works!
@obulat I think it would be best to revert the change and keep data modifications to a minimum in the ingestion server layer. We should concentrate any data wrangling in as few layers of the stack as possible, to make these changes easier to find and modify. Considering that we are slowly planning to move away from the ingestion server, it's best to handle these differences in either the catalog or the API. Even if we decide to use
Series
This PR is part 1/3.
- Use `keyword` fields instead of subfields (#3161)

Fixes
Fixes #2154 by @sarayourfriend.
Builds upon past work from WordPress/openverse-api#684 and WordPress/openverse-api#850.
Description
This PR reduces the ES shard count and simplifies the index properties by removing unnecessary mappings.
Testing Instructions
CI should pass. Compare index sizes (using something like Elasticvue) before and after the PR to see the difference.
```sh
just down -v
just dc build web ingestion_server
just api/init
```
Deployment
This PR needs a multi-step deployment strategy that is well-documented in #2154 and described in a Google Doc.
Checklist
- My pull request has a descriptive title (not a vague one like "Update index.md").
- My pull request targets the default branch of the repository (main) or a parent feature branch.

Developer Certificate of Origin