
perf: S3 concurrency semaphore and lock-free flush uploads#111

Merged
novatechflow merged 3 commits into KafScale:main from klaudworks:feature/s3-backpressure-semaphore
Feb 26, 2026

Conversation

@klaudworks
Contributor

@klaudworks klaudworks commented Feb 25, 2026

Status quo

  1. Every PartitionLog issues S3 calls independently with no shared concurrency limit. The only bound is the SDK's default MaxConnsPerHost: 2048. The effective total of concurrent S3 calls is active_connections * partitions_per_request * 2 for writes, plus active_connections * ReadAheadSegments for prefetch reads.

  2. flushLocked holds l.mu for the entire flush cycle — buffer drain, segment build, S3 upload, and metadata commit. Every AppendBatch and Read on the same partition blocks during S3 upload.

Shortcomings and fixes

flushLocked blocks reads and writes. flushLocked holds l.mu during S3 uploads, blocking AppendBatch and Read on the same partition until the upload completes. Fixed by splitting into prepareFlush (buffer drain + segment build, under l.mu) and uploadFlush (S3 I/O, no lock held). A flushing flag + sync.Cond serializes concurrent flushes on the same partition.

Unbounded concurrent S3 calls. Each partition issues S3 calls independently with no shared limit. Under load with many partitions, the broker can have hundreds of concurrent S3 requests with no backpressure. Fixed by adding a broker-wide semaphore (KAFSCALE_S3_CONCURRENCY, default 64) that caps concurrent S3 operations. Set to 0 to disable. For slower S3-compatible storages (Hetzner, IONOS, self-hosted MinIO), operators can lower this to match their backend's capacity.

Back-of-envelope for the default of 64: each 4 MB segment on a 10 Gbps link takes 3.2 ms to transfer + ~15 ms S3 latency = 18.2 ms total. One connection achieves 4 MB / 18.2 ms ≈ 1.76 Gbps effective throughput. To fill 10 Gbps: 10 / 1.76 ≈ 6 concurrent requests; for 50 Gbps, ~29. So 64 covers even high-network instances with margin, and also leaves headroom for lower S3 latency, which would increase the concurrency required to saturate the network. The goal was to pick a default that never throttles throughput while still providing backpressure in edge cases.

Connection churn. The AWS SDK default keeps only 10 idle connections per host (MaxIdleConnsPerHost: 10). Under burst load, most connections are created fresh with a full TCP+TLS handshake. Fixed by setting MaxConnsPerHost and MaxIdleConnsPerHost to match the semaphore limit, keeping connections warm.

Prefetch goroutines competed equally with produce/consume for S3 capacity. Fixed by using non-blocking TryAcquire — prefetch is skipped when all tokens are taken. Also narrowed l.mu to just the segment list read so prefetch I/O doesn't block appends.

perf: S3 concurrency semaphore and lock-free flush uploads

Introduce a broker-wide semaphore (KAFSCALE_S3_CONCURRENCY, default 64)
that caps concurrent S3 operations across all partitions. Align the HTTP
transport connection pool with the same limit.

Split flushLocked into prepareFlush (under lock) and uploadFlush (lock-free)
so that AppendBatch and Read callers are no longer blocked behind S3 I/O.
Serialize concurrent flushes on the same partition via a flushing flag and
sync.Cond. Prefetch uses TryAcquire to avoid blocking critical-path I/O.
@klaudworks klaudworks marked this pull request as ready for review February 25, 2026 20:05
@novatechflow
Collaborator

Thank you @klaudworks! Could you please add the new switch to the /docs? We use a dedicated branch, gh-pages, for our docs rendering; operations (https://kafscale.io/operations/) would be a good candidate.

@klaudworks
Contributor Author

@novatechflow Sure, I'll look into it later today and find the most fitting place, e.g. operations.

@klaudworks
Contributor Author

@novatechflow added the docs

Collaborator

@novatechflow novatechflow left a comment


Thank you @klaudworks !

@novatechflow novatechflow merged commit b1fc0da into KafScale:main Feb 26, 2026
4 checks passed
