perf: S3 concurrency semaphore and lock-free flush uploads (#111)
Merged
novatechflow merged 3 commits into KafScale:main on Feb 26, 2026
Conversation
Introduce a broker-wide semaphore (`KAFSCALE_S3_CONCURRENCY`, default 64) that caps concurrent S3 operations across all partitions, and align the HTTP transport connection pool with the same limit. Split `flushLocked` into `prepareFlush` (under lock) and `uploadFlush` (lock-free) so that `AppendBatch` and `Read` callers are no longer blocked behind S3 I/O. Concurrent flushes on the same partition are serialized via a `flushing` flag and `sync.Cond`. Prefetch uses `TryAcquire` to avoid blocking critical-path I/O.
Collaborator
Thank you @klaudworks - please add the new switch to the /docs? We use a dedicated branch, gh-pages, for our docs rendering; operations (https://kafscale.io/operations/) would be a good candidate.
Contributor
Author
@novatechflow Sure, I'll look into it later today and find the most fitting place, e.g. operations.
Contributor
Author
@novatechflow added the docs
novatechflow
approved these changes
Feb 26, 2026
Collaborator
novatechflow
left a comment
Thank you @klaudworks !
Status quo
Every `PartitionLog` issues S3 calls independently with no shared concurrency limit. The only bound is the SDK's default `MaxConnsPerHost: 2048`. The total number of concurrent S3 calls is effectively `active_connections * partitions_per_request * 2` for writes, plus `active_connections * ReadAheadSegments` for prefetch reads. `flushLocked` holds `l.mu` for the entire flush cycle: buffer drain, segment build, S3 upload, and metadata commit. Every `AppendBatch` and `Read` on the same partition blocks during the S3 upload.

Shortcomings and fixes
flushLocked blocks reads and writes. `flushLocked` holds `l.mu` during S3 uploads, blocking `AppendBatch` and `Read` on the same partition until the upload completes. Fixed by splitting it into `prepareFlush` (buffer drain + segment build, under `l.mu`) and `uploadFlush` (S3 I/O, no lock held). A `flushing` flag plus a `sync.Cond` serializes concurrent flushes on the same partition.
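The split can be sketched roughly as below. This is a minimal stand-in, not the actual KafScale code: the method names `prepareFlush`/`uploadFlush`, the `flushing` flag, and the `sync.Cond` come from the PR description, while the struct fields and placeholder record type are assumptions.

```go
package main

import (
	"fmt"
	"sync"
)

// partitionLog is a simplified stand-in for a PartitionLog.
type partitionLog struct {
	mu       sync.Mutex
	cond     *sync.Cond
	flushing bool     // true while an upload is in flight
	buffer   []string // pending records (placeholder type)
	segments []string // committed segment names
}

func newPartitionLog() *partitionLog {
	l := &partitionLog{}
	l.cond = sync.NewCond(&l.mu)
	return l
}

// Flush replaces the old single-phase flushLocked: prepareFlush runs
// under l.mu, uploadFlush performs the S3 I/O with no lock held.
func (l *partitionLog) Flush() {
	seg := l.prepareFlush()
	if seg == "" {
		return // nothing to flush
	}
	l.uploadFlush(seg)
}

func (l *partitionLog) prepareFlush() string {
	l.mu.Lock()
	defer l.mu.Unlock()
	// Serialize concurrent flushes on the same partition.
	for l.flushing {
		l.cond.Wait()
	}
	if len(l.buffer) == 0 {
		return ""
	}
	l.flushing = true
	// Drain the buffer and build the segment while holding the lock.
	seg := fmt.Sprintf("segment(%d records)", len(l.buffer))
	l.buffer = l.buffer[:0]
	return seg
}

func (l *partitionLog) uploadFlush(seg string) {
	// The S3 PutObject would happen here; no lock is held, so
	// AppendBatch and Read on this partition proceed concurrently.
	l.mu.Lock()
	l.segments = append(l.segments, seg) // metadata commit
	l.flushing = false
	l.cond.Broadcast() // wake any flush waiting on the flushing flag
	l.mu.Unlock()
}

func (l *partitionLog) AppendBatch(rec string) {
	l.mu.Lock()
	l.buffer = append(l.buffer, rec)
	l.mu.Unlock()
}

func main() {
	l := newPartitionLog()
	l.AppendBatch("a")
	l.AppendBatch("b")
	l.Flush()
	fmt.Println(len(l.segments)) // 1
}
```

The key property: the only code under `l.mu` is cheap in-memory work, so appends and reads are never queued behind network I/O.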
Unbounded concurrent S3 calls. Each partition issues S3 calls independently with no shared limit. Under load with many partitions, the broker can have hundreds of concurrent S3 requests with no backpressure. Fixed by adding a broker-wide semaphore (KAFSCALE_S3_CONCURRENCY, default 64) that caps concurrent S3 operations. Set to 0 to disable. For slower S3-compatible storages (Hetzner, IONOS, self-hosted MinIO), operators can lower this to match their backend's capacity.
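A broker-wide cap like this can be sketched with a buffered channel; the real implementation may use a different semaphore primitive, and everything here except the `KAFSCALE_S3_CONCURRENCY` variable, the default of 64, the 0-disables behavior, and the `TryAcquire` name is an assumption.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// s3Gate caps concurrent S3 operations broker-wide, sketched as a
// buffered channel: each in-flight operation holds one token.
type s3Gate chan struct{}

// newS3Gate reads KAFSCALE_S3_CONCURRENCY (default 64; 0 disables the cap).
func newS3Gate() s3Gate {
	n := 64
	if v := os.Getenv("KAFSCALE_S3_CONCURRENCY"); v != "" {
		if parsed, err := strconv.Atoi(v); err == nil {
			n = parsed
		}
	}
	if n <= 0 {
		return nil // nil gate: all operations are no-ops (cap disabled)
	}
	return make(s3Gate, n)
}

// Acquire blocks until a token is free; produce/consume paths use this,
// so S3 pressure turns into backpressure rather than unbounded fan-out.
func (g s3Gate) Acquire() {
	if g != nil {
		g <- struct{}{}
	}
}

func (g s3Gate) Release() {
	if g != nil {
		<-g
	}
}

// TryAcquire is non-blocking: prefetch uses it and simply skips the
// read-ahead when all tokens are taken.
func (g s3Gate) TryAcquire() bool {
	if g == nil {
		return true
	}
	select {
	case g <- struct{}{}:
		return true
	default:
		return false
	}
}

func main() {
	g := newS3Gate()
	g.Acquire()
	defer g.Release()
	fmt.Println("holding 1 of", cap(g), "tokens")
}
```

Operators on slower backends would lower the limit via the environment variable rather than touching code.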
Back-of-envelope for the default of 64: each 4 MB segment on a 10 Gbps link takes 3.2 ms to transfer plus ~15 ms of S3 latency, 18.2 ms total. One connection therefore achieves 4 MB / 18.2 ms ≈ 1.76 Gbps effective throughput, so filling 10 Gbps takes 10 / 1.76 ≈ 6 concurrent requests. On a 50 Gbps link the transfer shrinks to 0.64 ms (15.64 ms total, ~2.05 Gbps per connection), so ~24 requests suffice. A default of 64 covers even high-network instances with margin, and also leaves headroom for lower S3 latency, which would increase the concurrency required to saturate the network. The goal was simply a default that never throttles throughput while still providing backpressure in edge cases.
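The estimate above can be reproduced in a few lines. The 4 MB segment size and ~15 ms latency come from the text; the exact latency on any given backend is of course an assumption.

```go
package main

import (
	"fmt"
	"math"
)

// requiredConns estimates how many concurrent uploads are needed to
// saturate a link, given 4 MB segments and ~15 ms per-request S3 latency.
func requiredConns(linkGbps float64) int {
	const segmentBits = 4 * 1e6 * 8 // 4 MB segment = 32 Mbit
	const s3LatencySec = 0.015      // assumed per-request S3 latency

	transferSec := segmentBits / (linkGbps * 1e9)
	// Effective per-connection throughput in Gbps.
	perConnGbps := segmentBits / 1e9 / (transferSec + s3LatencySec)
	return int(math.Ceil(linkGbps / perConnGbps))
}

func main() {
	fmt.Println(requiredConns(10)) // 6
	fmt.Println(requiredConns(50)) // 25 (ceiling of 24.4; the text rounds to 24)
}
```

Both results sit far below the default of 64, which is the point: the cap provides backpressure without ever being the throughput bottleneck.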
Connection churn. The AWS SDK default keeps only 10 idle connections per host (`MaxIdleConnsPerHost: 10`). Under burst load, most connections are created fresh with a full TCP+TLS handshake. Fixed by setting `MaxConnsPerHost` and `MaxIdleConnsPerHost` to match the semaphore limit, keeping connections warm.

Prefetch competes with the critical path. Prefetch goroutines competed equally with produce/consume for S3 capacity. Fixed by using the non-blocking `TryAcquire`: prefetch is skipped when all tokens are taken. Also narrowed `l.mu` to just the segment-list read, so prefetch I/O doesn't block appends.