QQ: checkpointing frequency improvements #11964

Merged: 2 commits, Aug 16, 2024

Conversation

kjnilsson
Contributor

@kjnilsson kjnilsson commented Aug 9, 2024

The current approach takes too many checkpoints, which hurts performance, especially with large backlogs.

This PR takes an approach more similar to what was done for release cursors in 3.13.x.

Also add a `force_checkpoint` aux command that the purge operation emits. The command can also be used to try to force a checkpoint.

The checkpointing config can be changed by setting the `quorum_queue_checkpoint_config` persistent term:

persistent_term:put(quorum_queue_checkpoint_config, {MinIntervalMs, MinIndexes, MaxIndexes}).

The current values are `{1000, 4096, 666667}`, which means a checkpoint is taken at most once every 1s, and only if at least 4096 indexes have been applied. The minimum number of indexes between checkpoints grows in line with the message backlog, up to at most 666667.
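A minimal sketch of changing and reading back this setting, assuming the standard `persistent_term:put/2` and `persistent_term:get/2` calls. The module name, the `set_config/3` and `get_config/0` helpers, and the validation rules are hypothetical and not part of RabbitMQ; the default tuple is the one quoted above.

```erlang
-module(qq_checkpoint_config_sketch).
-export([set_config/3, get_config/0]).

%% Defaults quoted in this PR description.
-define(DEFAULT_CONFIG, {1000, 4096, 666667}).

%% Validate and store {MinIntervalMs, MinIndexes, MaxIndexes}.
%% ASSUMPTION: these validation rules are illustrative only; calls with
%% values that fail the guards crash with function_clause.
set_config(MinIntervalMs, MinIndexes, MaxIndexes)
  when is_integer(MinIntervalMs), MinIntervalMs >= 0,
       is_integer(MinIndexes), MinIndexes > 0,
       is_integer(MaxIndexes), MaxIndexes >= MinIndexes ->
    persistent_term:put(quorum_queue_checkpoint_config,
                        {MinIntervalMs, MinIndexes, MaxIndexes}).

%% Read the current setting, falling back to the quoted defaults
%% when the term has never been set.
get_config() ->
    persistent_term:get(quorum_queue_checkpoint_config, ?DEFAULT_CONFIG).
```

Persistent terms live in VM memory, so a value set this way would need to be set again after the node restarts.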

@kjnilsson kjnilsson added this to the 4.0.0 milestone Aug 9, 2024
@michaelklishin michaelklishin changed the title from "Qq: adjust checkpointing algo to something more like" to "QQ: adjust checkpointing algo to something more like it was in 3.13.x" Aug 9, 2024
@kjnilsson kjnilsson changed the title from "QQ: adjust checkpointing algo to something more like it was in 3.13.x" to "QQ: checkpointing frequency improvements" Aug 9, 2024
@kjnilsson kjnilsson force-pushed the qq-checkpointing-tweaks branch 2 times, most recently from d513239 to 776d8cb Compare August 14, 2024 16:07
@mergify mergify bot added the bazel label Aug 14, 2024
@michaelklishin
Member

The forced push was a rebase.

@michaelklishin
Member

My PerfTest tests do not observe any anomalies. With an 88M (22M per queue) message backlog, there are 22-23 checkpoints per queue. When the queues are drained, the checkpoints go away roughly at the rate of consumption of 1M messages.

I assume that a checkpoint taken every ≈ 1M messages is a reasonable rate.
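The ≈ 1M figure follows from the numbers reported above. As a quick check in an Erlang shell, using only the quoted 22M-per-queue backlog and the observed 22 to 23 checkpoints per queue (the variable names are illustrative):

```erlang
%% Figures taken from the comment above.
BacklogPerQueue = 22000000,
%% Average number of messages between checkpoints for 22 and 23 checkpoints.
Spacings = [BacklogPerQueue div N || N <- [22, 23]].
%% Spacings is bound to [1000000,956521]
```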

With a 50M message backlog across 4 queues, the node takes 18s to start on a mostly idle 10-core machine with a reasonably fast 3-year-old SSD.

With a workload that simulates peak throughput (4 queues, 3 publishers and 3 consumers), and with queue directories monitored using `watch -n 1` (once a second), a checkpoint appears and disappears roughly every 1M messages published (≈ 5s).

@kjnilsson
Contributor Author

> I assume that a checkpoint taken every ≈ 1M messages is a reasonable rate.

Depending on how many messages there are in the backlog, the number of indexes between checkpoints grows from 4096 to ~1M (max), so yes, that tallies. Cheers.
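A rough sketch of how such a backlog-dependent threshold could work, using the `{MinIntervalMs, MinIndexes, MaxIndexes}` tuple from the PR description. The module and function names are hypothetical, and the scaling rule (the threshold tracks the backlog size, clamped between `MinIndexes` and `MaxIndexes`) is an assumption for illustration, not the formula used in the merged code.

```erlang
-module(qq_checkpoint_sketch).
-export([min_indexes_for/2, should_checkpoint/4]).

%% ASSUMPTION: the threshold follows the backlog size, clamped to
%% [MinIndexes, MaxIndexes]. The actual scaling rule may differ.
min_indexes_for(Backlog, {_MinIntervalMs, MinIndexes, MaxIndexes}) ->
    max(MinIndexes, min(MaxIndexes, Backlog)).

%% True when enough time has passed AND enough indexes have been
%% applied since the last checkpoint.
should_checkpoint(MsSinceLast, IndexesSinceLast, Backlog,
                  {MinIntervalMs, _, _} = Config) ->
    MsSinceLast >= MinIntervalMs andalso
        IndexesSinceLast >= min_indexes_for(Backlog, Config).
```

With the default tuple `{1000, 4096, 666667}`, an empty queue under this rule would checkpoint after 4096 applied indexes, while a queue with a 22M message backlog would wait for 666667.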

@kjnilsson kjnilsson marked this pull request as ready for review August 15, 2024 10:59
@kjnilsson kjnilsson requested review from the-mikedavis and removed request for mkuratczyk August 15, 2024 10:59
Also remove a resolved TODO about conversion for the `last_checkpoint`
field.
@michaelklishin michaelklishin merged commit 178f9a9 into main Aug 16, 2024
238 checks passed
@michaelklishin michaelklishin deleted the qq-checkpointing-tweaks branch August 16, 2024 00:49
michaelklishin added a commit that referenced this pull request Aug 16, 2024
QQ: checkpointing frequency improvements (backport #11964)