Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

batch delete from gc #5404

Merged
merged 5 commits into from
Sep 10, 2024
Merged

batch delete from gc #5404

merged 5 commits into from
Sep 10, 2024

Conversation

trinity-1686a
Copy link
Contributor

Description

follow up to #5380
if a cluster generate less than 1k split per 10 minutes, everything was fine, but in case it would generate more than that, we have a slow down for 2 reasons:

  • we no longer run 10 tasks in parallel
  • we may get a batch of 1k splits from many different indexes, and get a 2nd batch from the same set of indexes, meaning more than one delete per index, despite having few splits to delete for each indexes. This can in theory cause us to do up to n_index time more delete calls to the metastore than necessary (caped to 1k)

Changes made:

  • we run up to 10 deletion from storage + metastore concurrently
  • we retrieve splits sorted by index_uid from metastore, so we don't do many delete per index
  • we query for 10k splits instead of 1k: before we would query for 10 * 1k, but as this is now batched, we can increase this number to keep a similar effect to before

How was this PR tested?

tested on a small cluster to not be utterly broken, for perf improvement, we'll need a way bigger cluster

Comment on lines +204 to +209
sql.join(
JoinType::Join,
Indexes::Table,
Expr::col((Splits::Table, Splits::IndexUid))
.equals((Indexes::Table, Indexes::IndexUid)),
)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note: i think this is a bad idea to do this long term, but i'd rather this fix be made without a migration for now, so we can easily revert clusters affected to previous versions if we find things are somehow still not good. If you disagree, we can either create the required index manually, or make a migration

Copy link

github-actions bot commented Sep 9, 2024

On SSD:

Average search latency is 1.01x that of the reference (lower is better).
Ref run id: 3358, ref commit: ec951aa
Link

On GCS:

Average search latency is 1.2x that of the reference (lower is better).
Ref run id: 3359, ref commit: ec951aa
Link

@fulmicoton fulmicoton merged commit 20b4956 into main Sep 10, 2024
5 checks passed
@fulmicoton fulmicoton deleted the trinity/batch-delete branch September 10, 2024 23:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants