
Reduce ES shard count and simplify index properties #3143

Merged: dhruvkb merged 16 commits into main from smaller_indices on Oct 19, 2023
Conversation

@dhruvkb (Member) commented Oct 5, 2023:

Series

This PR is part 1/3.

  1. This PR
  2. Use top-level keyword fields instead of subfields #3161
  3. Remove unreferenced subfields in indices #3162

Fixes

Fixes #2154 by @sarayourfriend.

Builds upon past work from WordPress/openverse-api#684 and WordPress/openverse-api#850.

Description

This PR

  • removes ES fields not used for filtering or searching
  • converts text fields to keywords where partial matching is not needed
  • updates usages of these changed fields in the API
  • refactors ES models to DRY some common code
  • reduces the shards for the audio index as a way to improve performance
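The text-to-keyword conversion and shard reduction can be sketched as a before/after mapping fragment. This is a hypothetical illustration, not the actual es_mapping.py diff; the field choice and shard counts are assumptions.

```python
# Hypothetical before/after of the kind of mapping change this PR makes.
# Field names and shard counts are illustrative, not the real es_mapping.py.

BEFORE = {
    "settings": {"index": {"number_of_shards": 18}},
    "mappings": {
        "properties": {
            # Analyzed text plus a keyword subfield, even though the
            # field is only ever matched exactly.
            "identifier": {
                "type": "text",
                "fields": {"keyword": {"type": "keyword"}},
            },
        }
    },
}

AFTER = {
    # Fewer shards for the smaller audio index.
    "settings": {"index": {"number_of_shards": 1}},
    "mappings": {
        "properties": {
            # Exact matching only, so a plain keyword field suffices.
            "identifier": {"type": "keyword"},
        }
    },
}
```

A keyword field skips analysis entirely, which is where most of the per-field index overhead comes from.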

Testing Instructions

CI should pass. Compare index sizes (using something like Elasticvue) before and after the PR to see the difference.

  • just down -v
  • just dc build web ingestion_server
  • just api/init

Deployment

This PR needs a multi-step deployment strategy that is well-documented in #2154 and described in a Google Doc.

Checklist

  • My pull request has a descriptive title (not a vague title like Update index.md).
  • My pull request targets the default branch of the repository (main) or a parent feature branch.
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added or updated tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no visible errors.
  • I ran the DAG documentation generator (if applicable).

Developer Certificate of Origin

Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@dhruvkb dhruvkb added 🟧 priority: high Stalls work on the project or its dependents ✨ goal: improvement Improvement to an existing user-facing feature 💻 aspect: code Concerns the software code in the repository 🧱 stack: api Related to the Django API 🧱 stack: ingestion server Related to the ingestion/data refresh server labels Oct 5, 2023
@dhruvkb dhruvkb marked this pull request as ready for review October 5, 2023 13:26
@dhruvkb dhruvkb requested a review from a team as a code owner October 5, 2023 13:26
@obulat (Contributor) left a comment:
Love the changes removing all of the unnecessary mappings! And thank you for opening this PR so fast, @dhruvkb!

I wonder why the sample index sizes are so different locally for images and audio. Even though we have 5000 items for both media types, audio takes 2.47 MB while image takes 5.54 MB, almost twice as big.

Review thread on api/api/controllers/search_controller.py (outdated, resolved):
    "tags": {
        "properties": {
            "accuracy": {"type": "float"},
            # Text-based fields
Contributor commented:

I think setting it to keyword-only would greatly reduce the size of the index. It would change search-by-tags slightly: for example, if you search for draw, you would probably no longer get media that has the drawing tag. But do we really want it to match like that?

dhruvkb (Member, author) replied:

This is an important point; we could convert it into keywords (and potentially use stems so that 'drawing' and 'draw' evaluate to the same keyword). I'll let others chime in on this discussion as well.
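The keyword-plus-stemming idea floated here could look something like the toy sketch below: normalise each tag with a stemmer before indexing it as a keyword, so draw and drawing collapse to one term. A real implementation would use a proper stemmer (e.g. Snowball, via an ES normalizer or at ingestion time); the suffix-stripper here is purely illustrative.

```python
# Toy sketch of stemming tags before indexing them as keywords, so that
# "draw" and "drawing" normalise to the same term. A real implementation
# would use a proper stemmer (e.g. Snowball); this is only illustrative.

def stem_tag(tag: str) -> str:
    tag = tag.lower().strip()
    for suffix in ("ings", "ing", "s"):
        stem = tag[: -len(suffix)]
        if tag.endswith(suffix) and len(stem) >= 3:
            return stem
    return tag

# stem_tag("drawing") and stem_tag("draw") both yield "draw";
# stem_tag("dogs") yields "dog".
```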

Collaborator replied:

Could we maybe see how this set of changes goes, then consider moving tags to keywords as a follow-up? I agree that keywords + stemming could be useful (if that's a distinction we can make), but it would be a significant change to the way our search currently operates which leaves me a little hesitant.

dhruvkb (Member, author) replied:

True, we can keep this possibility open as a future optimisation.

Collaborator replied:

When discussing the collection views we explicitly said that tag stemming was desirable; the example we used was that the tag dogs should match dog. Any other approach doesn't make much sense and would confound users (and ourselves, I have to think!).

@stacimc (Collaborator) left a comment:

This looks good to me, although I'm a little confused about the deployment strategy. Does this need to be broken into multiple PRs for the different stages?

@dhruvkb (Member, author) commented Oct 6, 2023:

@stacimc I'm not sure we would need to break up the PR, as we can deploy changes to the ingestion server and API separately. But it would definitely need to be a multi-stage deployment with a data refresh between the ingestion server and API deployments.

@dhruvkb (Member, author) commented Oct 6, 2023:

@obulat the size difference depends mostly on the contents of the title, description and tags fields.

| Media type | None    | With title only | With desc only | With tags only | Complete |
|------------|---------|-----------------|----------------|----------------|----------|
| audio      | 1.18 MB | 1.59 MB         | 1.59 MB        | 1.43 MB        | 2.25 MB  |
| image      | 1.78 MB | 2.02 MB         | 2.80 MB        | 3.80 MB        | 5.78 MB  |

In almost all of the fields (especially tags), image documents seem to contain more data per document than their audio counterparts which leads to the vast size difference.

@stacimc (Collaborator) commented Oct 6, 2023:

> I'm not sure we would need to break the PR as we can deploy changes the ingestion server and API separately. But it would definitely need to be a multi-stage deployment with a data refresh between the ingestion server and API deployments.

Definitely, I'm just hesitant to merge as-is if it will put the API in a state where we can't deploy until the ingestion server has been deployed and a data refresh run. The data refresh is currently paused because of the related API instability and the API will need to be deployed to fix a vulnerability. If we pull the API changes into a separate PR it would just ensure that the API doesn't accidentally get deployed in a bad state.

@obulat (Contributor) commented Oct 6, 2023:

> @obulat the size difference depends majorly on the contents of the title, description and tags fields.

Interesting! Thank you so much for investigating and sharing, @dhruvkb !

@obulat (Contributor) commented Oct 6, 2023:

> Definitely, I'm just hesitant to merge as-is if it will put the API in a state where we can't deploy until the ingestion server has been deployed and a data refresh run. The data refresh is currently paused because of the related API instability and the API will need to be deployed to fix a vulnerability. If we pull the API changes into a separate PR it would just ensure that the API doesn't accidentally get deployed in a bad state.

We might need to wait to merge this PR until we are ready to do all of the following at the same time:

  • deploy the API
  • deploy the ingestion server
  • start the data refresh

@stacimc (Collaborator) commented Oct 6, 2023:

@obulat The problem is that once this is merged, we can't deploy the API until the other two steps (ingestion server deploy and data refresh) are both complete. For that 3-day period while the data refresh is running, we wouldn't be able to deploy the API (without backing out the API changes). If something like a security vulnerability or critical bug comes up, we'd have to remember to back out the changes first.

@dhruvkb (Member, author) commented Oct 6, 2023:

It's easier if I just separate the API changes into a separate PR. But both PRs will fail CI, and CI will remain broken until both pieces are reunited. That would also surely prevent us from deploying anything.

@AetherUnbound (Collaborator) commented:

If CI is failing, does that mean the API will start erroring out once the data refresh is complete?

@dhruvkb (Member, author) commented Oct 7, 2023:

> What would cause CI to fail if these were split?

The API's search_controller contains references to fields and subfields defined in es_mapping.py, which need to be updated when the mapping changes. For example, identifier__keyword needs to be replaced with identifier (as there is no longer a keyword subfield).

Prod will not break if we split this PR because the API can continue to use the old index while the data refresh step runs. But in CI, the "data refresh" step happens on every run and the API always uses the latest indices (it doesn't have the old ones), so unless updated, it will reference nonexistent fields in ES.
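Concretely, the mismatch looks like this. The query bodies below are illustrative (field and value are made up), but they show why the API and mapping must move in lockstep: the old API code filters on a keyword subfield that the new mapping no longer defines.

```python
# Illustrative term-query bodies showing the coupling between the API and
# the index mapping. Field names and the value are made up for illustration.

old_query = {"term": {"identifier.keyword": "abcd-1234"}}  # pre-PR mapping
new_query = {"term": {"identifier": "abcd-1234"}}          # post-PR mapping
```

Running the old query against the new mapping simply matches nothing, since identifier.keyword no longer exists.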

@sarayourfriend (Collaborator) commented Oct 7, 2023:

This change, as is, is not a zero-downtime change. It would need to be merged essentially in unison with the data refresh; otherwise the index mapping will change out from underneath the API while it is still referencing the old format.

To make it non-breaking, we need to think of it like a database migration to change column types. This issue gives clear steps for how to accomplish this for the identifier field. The same method applies to all the others. These mapping changes should be at minimum three PRs to prevent breaking changes in live environments or in CI.

> removes ES fields not used for filtering or searching

This also must not be done. Search is at least capable of serializing from the ES hits (more below about whether this actually happens), not always the database, and therefore fields required to return the search results must remain (url and foreign landing url). If they aren't searched, we should just disable indexing on them, not remove them entirely. They're useful for debugging and testing anyway, and including them in the source document introduces no overhead aside from the trivial serialization and storage overhead, which we've already been paying without issue.

https://github.com/WordPress/openverse/blob/bc7b2a7f6ee32fc3ca367baae7681abc41ef7f79/api/api/views/media_views.py#L137-L140

I'm confused by the existing code on the serializers. It seems that in practice we essentially always use the database results, because AudioSerializer and ImageSerializer both have needs_db = True. The right-hand side of the conditional in the code I've linked above (serializer_class.needs_db) is always true. We can probably clarify the way this works to avoid confusion and just always use the database to serialize results. It would simplify the serializer code as well.

In any case, removing those fields outright isn't a meaningful optimisation, it hurts debug-ability (if querying ES directly we'd have to now do a separate search against the database to retrieve the URL and foreign landing URL), and risks future media types breaking outright (unless we make the change to always use the database results). Even though the last one can be mitigated (and we should revisit this code, no doubt, at least to confirm the conditional still makes sense), reducing the amount of easily accessible information when debugging for no additional benefit, when the underlying benefit of reducing index size can be had by disabling indexing on those fields, doesn't make sense to me.
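The alternative proposed here, keeping the fields in _source but disabling indexing, is a one-line mapping option per field. A sketch, where the exact field definitions are assumptions:

```python
# Sketch of keeping fields retrievable from _source while disabling
# indexing on them. "index": False is a standard Elasticsearch mapping
# option; the exact field definitions here are assumptions.

unindexed_fields = {
    "url": {"type": "keyword", "index": False},
    "foreign_landing_url": {"type": "keyword", "index": False},
}
```

Such fields cannot be queried, but they still come back with every hit, which is what makes them useful for debugging.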

@dhruvkb (Member, author) commented Oct 7, 2023:

> This issue gives clear steps for how to accomplish this

I misunderstood the issue description on my first pass. Let me work on this a little more.

Update: this PR has been split into 3 (the other two being #3161 and #3162).

> Search is at least capable of serializing from the ES hits

Not all fields we need in the search results are available in ES so we have to use the DB to populate those fields (hence the needs_db in the serializers). If we keep them in ES we are storing them twice and not reading from them at all.

> we should just disable indexing on them, not remove them entirely

This would not reduce the storage space required as they would be contained in the documents _source field and would consume space. Getting rid of them not only reduces the index size but actually frees up more storage space.
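For completeness, Elasticsearch also supports the inverse trade-off: excluding fields from the stored _source while still indexing them. A purely illustrative fragment (the PR instead drops the fields entirely, and the field names here are assumptions):

```python
# Mapping fragment excluding fields from the stored _source. This saves
# document storage while keeping the fields searchable; illustrative only.

mapping_fragment = {
    "mappings": {
        "_source": {"excludes": ["url", "foreign_landing_url"]},
    }
}
```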

> They're useful for debugging and testing anyway

I have not done any debugging where I would look up ES results so I am not qualified to comment on this, but if they're useful we can add them back. On the other hand, it's an O(1) operation to get more info about an item from Postgres using the UUID. Postgres can provide all the info ES has (and more) so I don't feel like we're losing any valuable info here. Also, we retain the 'URL' field because it's required to be present on Hit in some places.

> we've already been paying without issue

If there is a way to save some storage (and potentially money) why not take it?

> We can probably clarify the way this works to avoid confusion on it

Yes. When the code was written, it allowed skipping the DB if the needed info could be provided by the ES Hit. We know that this is not the case, and one DB call is okay, so we can always use the DB. This refactoring is definitely a good idea.
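The refactor agreed on above could reduce the serialization path to something like the following sketch, where the DB is always consulted. All names are assumptions, not the actual media_views.py code; fetch_rows stands in for the Django ORM lookup the real API would perform.

```python
# Hypothetical always-use-the-DB serialization path. `fetch_rows` stands in
# for the Django ORM lookup the real API would perform; all names here are
# illustrative, not the actual media_views.py code.

def serialize_results(hits, fetch_rows):
    """Hydrate ES hits from DB rows, preserving ES ranking order."""
    rows = fetch_rows([hit["identifier"] for hit in hits])
    # Drop hits whose DB row is missing (e.g. deleted since the last reindex).
    return [rows[h["identifier"]] for h in hits if h["identifier"] in rows]
```

This removes the needs_db branch entirely: the ES hit contributes only the identifier and the ranking, and the DB row supplies every serialized field.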

@dhruvkb dhruvkb removed the 🧱 stack: ingestion server Related to the ingestion/data refresh server label Oct 7, 2023
@sarayourfriend (Collaborator) commented:

> If there is a way to save some storage (and potentially money) why not take it?

We won't save money, because the disk size would not decrease. The transfer rate of these strings is trivial compared to the overhead of indexing the fields. Removing them is a significant change to the API of the documents that seems unrelated. We're not worried about storage efficiency in our index, as far as I'm aware. I'm sure over several hundreds of millions of records it will make a difference in the overall size of the stored documents, but it wouldn't cause us to change the amount of storage on our cluster (which we haven't touched in a long time and would require a lot more work, i.e., reprovisioning the cluster with the new instance sizes).

I've definitely used at least url in debugging outages and issues with our photon integration, for example (where ES was more convenient to query than Postgres).

I guess at the end of the day it's fine to remove it if there is a compelling reason and it would actually affect our storage provisioning, keeping in mind that the actual amount of storage we use has no effect on the cost of the storage we allocate, which is static on the allocation, not the usage. But it seems unrelated to improvements in index overhead, in which storage, as far as I know, isn't really a significant factor. If I'm wrong about that though, please let me know.

Not a blocker, of course, it's easy to put back if we want, but I do feel that it could be a separate PR and considered as its own thing, rather than lumped in with an already significant set of changes.

@sarayourfriend (Collaborator) commented:

@dhruvkb Now that we have the three split-out PRs, can you share the deployment plan, including reindexes we need to do after specific deployments? I realised last night we don't need to run a full data refresh, just a reindex, and that we'll also need to run it on staging. Just want to make sure we have a clear plan so that especially production gets the changes deployed at the right time relative to the reindex operation.

@sarayourfriend (Collaborator) commented:

I also took a look at the storage situation, just to make sure there isn't actually any storage pressure, and we really are sitting just fine on storage, at least outside of reindexing windows:

[screenshot: Elasticsearch cluster storage metrics]

Each data node has a 1900 GB SSD (they are i3.2xlarge) and the data nodes have a 10 GB EBS attached (c5 instances are EBS only).

Our image indexes are 1.2 TB, and are the largest consumers of storage, so we are probably getting close to 85% storage consumption for each node during image data refreshes. There may be some precedent for evaluating storage efficiency.

However, I'd really like to see how the field type changes alone affect storage. Theoretically, by removing most of the unnecessary subfields and switching many from text to keyword, we should save a good deal of space (especially with the removals). Knowing the effect of the field changes on their own could help us understand future opportunities to make efficiency gains in index size.

Would you be okay if we split the change to remove those two fields into a separate PR? At least until we've completely removed the cases in the code where search results can be serialised using the ES hits. Even if it does not actually happen today, I'd feel more comfortable if we removed the possibility of it altogether, rather than relying on the fact that our configuration currently happens to force the other case. We're currently OK on storage, and the field changes are much more important from a performance overhead perspective, so there isn't pressure to make this specific change quickly, from my perspective.

@github-actions github-actions bot added the 🧱 stack: ingestion server Related to the ingestion/data refresh server label Oct 9, 2023
@dhruvkb (Member, author) commented Oct 9, 2023:

Cool, I've re-added the fields that were removed in c67f3ee. There were a total of 9 such fields. We can reconsider removing them later.

@dhruvkb dhruvkb removed the 🧱 stack: api Related to the Django API label Oct 9, 2023
@dhruvkb dhruvkb mentioned this pull request Oct 9, 2023
@obulat (Contributor) left a comment:
@dhruvkb, can we also update tags so that they are never None? Can we return [] when there are no tags?

    def parse_detailed_tags(json_tags):
        if not json_tags:
            return None
        parsed_tags = []
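The requested change could be as small as the following sketch; the rest of the function body, elided here, stays unchanged.

```python
# Minimal sketch of the requested change: return an empty list instead of
# None for empty input. The tag-parsing body is elided (unchanged).

def parse_detailed_tags(json_tags):
    if not json_tags:
        return []  # previously: return None
    parsed_tags = []
    ...  # tag-parsing logic unchanged
    return parsed_tags
```

Callers then always receive a list and never need a None check.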

@dhruvkb dhruvkb requested a review from obulat October 9, 2023 18:10
@AetherUnbound (Collaborator) commented:

@dhruvkb Thanks for splitting this out! Do you mind updating the description of this PR with a detailed plan of action for deploying these changes? (e.g. what needs to get merged, then run, then merged, etc.)

@dhruvkb (Member, author) commented Oct 10, 2023:

@AetherUnbound I've linked to your planning doc in the PR description.

@AetherUnbound (Collaborator) left a comment:

Changes LGTM and local testing works!

@obulat (Contributor) commented Oct 13, 2023:

@dhruvkb, do you think we should revert the change I requested for tags (replacing None with an empty list []) in light of this discussion of the Elasticsearch Hit's missing properties: #3188?

@dhruvkb (Member, author) commented Oct 13, 2023:

@obulat I think it would be best to revert the change and keep data modifications to a minimum in the ingestion server layer.

We should concentrate any data wrangling in as few layers of the stack as possible, to make these changes easier to find and modify. Considering that we are slowly planning to switch away from the ingestion server, it's best to handle these differences in either the catalog or the API.

Even if we decide to use [] here, we can do so in a separate PR like we are doing for the 9 to-be-removed fields #3143 (comment).

@dhruvkb dhruvkb merged commit be0256c into main Oct 19, 2023
43 checks passed
@dhruvkb dhruvkb deleted the smaller_indices branch October 19, 2023 05:50

Successfully merging this pull request may close these issues.

Remove text mapping on identifier
6 participants