Skip to content

[BUG] 503 SlowDown when recovering WAL from S3 crashes broker #3177

@snapydziuba

Description

@snapydziuba

AutoMQ Version

1.6.3

Operating System

Strimzi Operator in EKS

Installation Method

other

Hardware Configuration

No response

Other Relevant Software

Environment

  • AutoMQ Version: 1.6.3-strimzi
  • Cluster: 3 broker nodes spread across availability zones

What Went Wrong?

When an AutoMQ broker is in startup the broker fails to start due to S3 rate limiting (HTTP 503 "SlowDown") during the recovery phase. The broker treats this transient error as fatal and exits, rather than retrying with exponential backoff.

When a broker pod is recreated, it must recover its state from S3. This recovery process generates a burst of S3 API requests. If S3 returns a 503 "SlowDown" response (rate limiting), the broker fails to start and exits with a fatal exception after 3 failed attempts, causing Kubernetes to restart the pod.

What Should Have Happened Instead?

The broker should:

  1. Retry S3 operations during recovery with longer/configurable backoff intervals
  2. Not treat transient S3 503 errors as fatal during startup
  3. Eventually succeed once S3 rate limiting subsides

Steps to Reproduce

Trigger a lot of requests to S3 while the Broker is in recovery

Additional Information

kafka 2026-01-27 08:42:10,405 INFO [BrokerLifecycleManager id=1002] closed event queue. (org.apache.kafka.queue.KafkaEventQueue) [main]
kafka 2026-01-27 08:42:10,406 INFO [client-metrics-reaper[]: Shutting down (org.apache.kafka.server.util.timer.SystemTimerReaper$Reaper) [main]
kafka 2026-01-27 08:42:10,406 INFO [client-metrics-reaper[]: Stopped (org.apache.kafka.server.util.timer.SystemTimerReaper$Reaper) [client-metrics-reaper]
kafka 2026-01-27 08:42:10,406 INFO [client-metrics-reaper[]: Shutdown completed (org.apache.kafka.server.util.timer.SystemTimerReaper$Reaper) [main]
kafka 2026-01-27 08:42:10,407 INFO [SharedServer id=1002] Stopping SharedServer (kafka.server.SharedServer) [main]
kafka 2026-01-27 08:42:10,407 INFO [MetadataLoader id=1002] beginShutdown: shutting down event queue. (org.apache.kafka.queue.KafkaEventQueue) [main]
kafka 2026-01-27 08:42:10,407 INFO [SnapshotGenerator id=1002] close: shutting down event queue. (org.apache.kafka.queue.KafkaEventQueue) [kafka-1002-metadata-loader-event-handler]
kafka 2026-01-27 08:42:10,408 INFO [SnapshotGenerator id=1002] closed event queue. (org.apache.kafka.queue.KafkaEventQueue) [kafka-1002-metadata-loader-event-handler]
kafka 2026-01-27 08:42:10,408 INFO [MetadataLoader id=1002] closed event queue. (org.apache.kafka.queue.KafkaEventQueue) [main]
kafka 2026-01-27 08:42:10,409 INFO [SnapshotGenerator id=1002] closed event queue. (org.apache.kafka.queue.KafkaEventQueue) [main]
kafka 2026-01-27 08:42:10,409 INFO [raft-expiration-reaper[]: Shutting down (kafka.raft.TimingWheelExpirationService$ExpiredOperationReaper) [main]
kafka 2026-01-27 08:42:10,593 INFO [raft-expiration-reaper[]: Stopped (kafka.raft.TimingWheelExpirationService$ExpiredOperationReaper) [raft-expiration-reaper]
kafka 2026-01-27 08:42:10,594 INFO [raft-expiration-reaper[]: Shutdown completed (kafka.raft.TimingWheelExpirationService$ExpiredOperationReaper) [main]
kafka 2026-01-27 08:42:10,594 INFO [kafka-1002-raft-io-thread[]: Shutting down (org.apache.kafka.raft.KafkaRaftClientDriver) [main]
kafka 2026-01-27 08:42:10,594 INFO [RaftManager id=1002] Beginning graceful shutdown (org.apache.kafka.raft.KafkaRaftClient) [main]
kafka 2026-01-27 08:42:10,595 INFO [RaftManager id=1002] Graceful shutdown completed (org.apache.kafka.raft.KafkaRaftClient) [kafka-1002-raft-io-thread]
kafka 2026-01-27 08:42:10,595 INFO [kafka-1002-raft-io-thread[]: Stopped (org.apache.kafka.raft.KafkaRaftClientDriver) [kafka-1002-raft-io-thread]
kafka 2026-01-27 08:42:10,595 INFO [RaftManager id=1002] Completed graceful shutdown of RaftClient (org.apache.kafka.raft.KafkaRaftClientDriver) [main]
kafka 2026-01-27 08:42:10,595 INFO [kafka-1002-raft-io-thread[]: Shutdown completed (org.apache.kafka.raft.KafkaRaftClientDriver) [main]
kafka 2026-01-27 08:42:10,603 INFO [kafka-1002-raft-outbound-request-thread[]: Shutting down (org.apache.kafka.raft.KafkaNetworkChannel$SendThread) [main]
kafka 2026-01-27 08:42:10,603 INFO [kafka-1002-raft-outbound-request-thread[]: Stopped (org.apache.kafka.raft.KafkaNetworkChannel$SendThread) [kafka-1002-raft-outbound-request-thread]
kafka 2026-01-27 08:42:10,603 INFO [kafka-1002-raft-outbound-request-thread[]: Shutdown completed (org.apache.kafka.raft.KafkaNetworkChannel$SendThread) [main]
kafka 2026-01-27 08:42:10,616 INFO Metrics scheduler closed (org.apache.kafka.common.metrics.Metrics) [main]
kafka 2026-01-27 08:42:10,616 INFO Closing reporter org.apache.kafka.common.metrics.JmxReporter (org.apache.kafka.common.metrics.Metrics) [main]
kafka 2026-01-27 08:42:10,616 INFO Metrics reporters closed (org.apache.kafka.common.metrics.Metrics) [main]
kafka 2026-01-27 08:42:10,698 INFO App info kafka.server for 1002 unregistered (org.apache.kafka.common.utils.AppInfoParser) [main]
kafka 2026-01-27 08:42:10,698 INFO [BrokerServer id=1002] shut down completed (kafka.server.BrokerServer) [main]
kafka 2026-01-27 08:42:10,699 INFO [BrokerServer id=1002] Transition from SHUTTING_DOWN to SHUTDOWN (kafka.server.BrokerServer) [main]
kafka 2026-01-27 08:42:10,699 ERROR Exiting Kafka due to fatal exception during startup. (kafka.Kafka$) [main]
kafka java.lang.RuntimeException: java.util.concurrent.ExecutionException: com.automq.stream.s3.exceptions.AutoMQException: java.util.concurrent.ExecutionException: software.amazon.awssdk.services.s3.model.S3Exception: Please reduce your request rate. (Service: S3, Status Code: 503, Request ID: 787QJPMB56E35YRV, Extended Request ID: JHYGgsPnY/BJvbB+5xputPsenkaRfKWmObhpqi/mx45DbAE8Unl7oiuIJ9Lp8d5C1qDmR2R9SJPk33+f+Yxb7N1vXJ65Bqb6eUHVhTp/dr8=)
kafka     at com.automq.stream.s3.S3Storage.startup(S3Storage.java:450)
kafka     at kafka.log.stream.s3.DefaultS3Client.start(DefaultS3Client.java:183)
kafka     at kafka.log.streamaspect.ClientWrapper.start(ClientWrapper.java:115)
kafka     at kafka.log.streamaspect.ElasticLogManager.startup(ElasticLogManager.scala:116)
kafka     at kafka.log.streamaspect.ElasticLogManager$.$anonfun$init$1(ElasticLogManager.scala:149)
kafka     at kafka.log.streamaspect.ElasticLogManager$.$anonfun$init$1$adapted(ElasticLogManager.scala:149)
kafka     at scala.Option.foreach(Option.scala:437)
kafka     at kafka.log.streamaspect.ElasticLogManager$.init(ElasticLogManager.scala:149)
kafka     at kafka.server.BrokerServer.startup(BrokerServer.scala:574)
kafka     at kafka.server.KafkaRaftServer.$anonfun$startup$2(KafkaRaftServer.scala:88)
kafka     at kafka.server.KafkaRaftServer.$anonfun$startup$2$adapted(KafkaRaftServer.scala:88)
kafka     at scala.Option.foreach(Option.scala:437)
kafka     at kafka.server.KafkaRaftServer.startup(KafkaRaftServer.scala:88)
kafka     at kafka.Kafka$.main(Kafka.scala:160)
kafka     at kafka.Kafka.main(Kafka.scala)
kafka Caused by: java.util.concurrent.ExecutionException: com.automq.stream.s3.exceptions.AutoMQException: java.util.concurrent.ExecutionException: software.amazon.awssdk.services.s3.model.S3Exception: Please reduce your request rate. (Service: S3, Status Code: 503, Request ID: 787QJPMB56E35YRV, Extended Request ID: JHYGgsPnY/BJvbB+5xputPsenkaRfKWmObhpqi/mx45DbAE8Unl7oiuIJ9Lp8d5C1qDmR2R9SJPk33+f+Yxb7N1vXJ65Bqb6eUHVhTp/dr8=)
kafka     at java.base/java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:396)
kafka     at java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2073)
kafka     at com.automq.stream.s3.S3Storage.recover0(S3Storage.java:492)
kafka     at com.automq.stream.s3.S3Storage.recover(S3Storage.java:459)
kafka     at com.automq.stream.s3.S3Storage.startup(S3Storage.java:446)
kafka     ... 14 more
kafka Caused by: com.automq.stream.s3.exceptions.AutoMQException: java.util.concurrent.ExecutionException: software.amazon.awssdk.services.s3.model.S3Exception: Please reduce your request rate. (Service: S3, Status Code: 503, Request ID: 787QJPMB56E35YRV, Extended Request ID: JHYGgsPnY/BJvbB+5xputPsenkaRfKWmObhpqi/mx45DbAE8Unl7oiuIJ9Lp8d5C1qDmR2R9SJPk33+f+Yxb7N1vXJ65Bqb6eUHVhTp/dr8=)
kafka     at kafka.log.stream.s3.wal.ConfirmWal.lambda$reset$0(ConfirmWal.java:179)
kafka     at java.base/java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1804)
kafka     at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
kafka     at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
kafka     at java.base/java.lang.Thread.run(Thread.java:840)
kafka Caused by: java.util.concurrent.ExecutionException: software.amazon.awssdk.services.s3.model.S3Exception: Please reduce your request rate. (Service: S3, Status Code: 503, Request ID: 787QJPMB56E35YRV, Extended Request ID: JHYGgsPnY/BJvbB+5xputPsenkaRfKWmObhpqi/mx45DbAE8Unl7oiuIJ9Lp8d5C1qDmR2R9SJPk33+f+Yxb7N1vXJ65Bqb6eUHVhTp/dr8=)
kafka     at java.base/java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:396)
kafka     at java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2073)
kafka     at kafka.log.stream.s3.wal.ConfirmWal.lambda$reset$0(ConfirmWal.java:161)
kafka     ... 4 more
kafka Caused by: software.amazon.awssdk.services.s3.model.S3Exception: Please reduce your request rate. (Service: S3, Status Code: 503, Request ID: 787QJPMB56E35YRV, Extended Request ID: JHYGgsPnY/BJvbB+5xputPsenkaRfKWmObhpqi/mx45DbAE8Unl7oiuIJ9Lp8d5C1qDmR2R9SJPk33+f+Yxb7N1vXJ65Bqb6eUHVhTp/dr8=)
kafka     at software.amazon.awssdk.protocols.xml.internal.unmarshall.AwsXmlPredicatedResponseHandler.handleErrorResponse(AwsXmlPredicatedResponseHandler.java:156)
kafka     at software.amazon.awssdk.protocols.xml.internal.unmarshall.AwsXmlPredicatedResponseHandler.handleResponse(AwsXmlPredicatedResponseHandler.java:108)
kafka     at software.amazon.awssdk.protocols.xml.internal.unmarshall.AwsXmlPredicatedResponseHandler.handle(AwsXmlPredicatedResponseHandler.java:85)
kafka     at software.amazon.awssdk.protocols.xml.internal.unmarshall.AwsXmlPredicatedResponseHandler.handle(AwsXmlPredicatedResponseHandler.java:43)
kafka     at software.amazon.awssdk.core.internal.handler.BaseClientHandler.lambda$successTransformationResponseHandler$7(BaseClientHandler.java:279)
kafka     at software.amazon.awssdk.core.internal.http.async.AsyncResponseHandler.lambda$prepare$0(AsyncResponseHandler.java:92)
kafka     at java.base/java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:1150)
kafka     at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510)
kafka     at java.base/java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2147)
kafka     at software.amazon.awssdk.core.internal.http.async.AsyncResponseHandler$BaosSubscriber.onComplete(AsyncResponseHandler.java:135)
kafka     at software.amazon.awssdk.core.internal.metrics.BytesReadTrackingPublisher$BytesReadTracker.onComplete(BytesReadTrackingPublisher.java:74)
kafka     at software.amazon.awssdk.http.nio.netty.internal.ResponseHandler$DataCountingPublisher$1.onComplete(ResponseHandler.java:519)
kafka     at software.amazon.awssdk.http.nio.netty.internal.ResponseHandler.runAndLogError(ResponseHandler.java:254)
kafka     at software.amazon.awssdk.http.nio.netty.internal.ResponseHandler.access$600(ResponseHandler.java:77)
kafka     at software.amazon.awssdk.http.nio.netty.internal.ResponseHandler$PublisherAdapter$1.onComplete(ResponseHandler.java:375)
kafka     at software.amazon.awssdk.http.nio.netty.internal.nrs.HandlerPublisher.publishMessage(HandlerPublisher.java:402)
kafka     at software.amazon.awssdk.http.nio.netty.internal.nrs.HandlerPublisher.flushBuffer(HandlerPublisher.java:338)
kafka     at software.amazon.awssdk.http.nio.netty.internal.nrs.HandlerPublisher.receivedDemand(HandlerPublisher.java:291)
kafka     at software.amazon.awssdk.http.nio.netty.internal.nrs.HandlerPublisher.access$200(HandlerPublisher.java:61)
kafka     at software.amazon.awssdk.http.nio.netty.internal.nrs.HandlerPublisher$ChannelSubscription$1.run(HandlerPublisher.java:495)
kafka     at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:173)
kafka     at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:166)
kafka     at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:469)
kafka     at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:566)
kafka     at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:994)
kafka     at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
kafka     ... 1 more
kafka     Suppressed: software.amazon.awssdk.core.exception.SdkClientException: Request attempt 1 failure: Please reduce your request rate. (Service: S3, Status Code: 503, Request ID: JH72WF2MEV91VHXS, Extended Request ID: 1sXiK951e9cWp0IbF1jzXwpqMjKSe53EyfuASEgQ2PXg7Ldzbj/vNd3UY9M+FjaRLSrKmT4t1DVoohNUGKjKHxwzGQXfrOv1dkMkOT09Phk=)
kafka     Suppressed: software.amazon.awssdk.core.exception.SdkClientException: Request attempt 2 failure: Please reduce your request rate. (Service: S3, Status Code: 503, Request ID: WZXDCHM4TGB3HKRE, Extended Request ID: LdXgfgIo89D4mf2+nsUgPyKjMW3jvFUXTWGSn7WohxIrPUio3Vp4aHIzBea9dGxJC4+EaWeSNwNT8vU321Ewb5zzhvBfNznohqyxXPY8ZCU=)
kafka     Suppressed: software.amazon.awssdk.core.exception.SdkClientException: Request attempt 3 failure: Please reduce your request rate. (Service: S3, Status Code: 503, Request ID: GY0EX4GXA14BJ8G6, Extended Request ID: 99xgEbYamujLTE3SXRXLz4lLZKbul5EZsf8ByGLQkruGwiHRbe+QY7D7PDrg/RO4Sa+yQGwvvq0Y1EX11Z3x0vndLv8kH5B43bBAwi2rVWc=)
kafka 2026-01-27 08:42:10,704 INFO App info kafka.server for 1002 unregistered (org.apache.kafka.common.utils.AppInfoParser) [kafka-shutdown-hook]
kafka 2026-01-27 08:42:10,708 INFO Stopped ServerConnector@62727399{SSL, (ssl, http/1.1)}{0.0.0.0:8443} (org.eclipse.jetty.server.AbstractConnector) [JettyShutdownThread]
kafka 2026-01-27 08:42:10,709 INFO Stopped ServerConnector@5d9b7a8a{HTTP/1.1, (http/1.1)}{localhost:8080} (org.eclipse.jetty.server.AbstractConnector) [JettyShutdownThread]
kafka 2026-01-27 08:42:10,710 INFO Stopped o.e.j.s.h.ContextHandler@4372b9b6{/v1/ready,null,STOPPED} (org.eclipse.jetty.server.handler.ContextHandler) [JettyShutdownThread]
kafka 2026-01-27 08:42:10,710 INFO Stopped o.e.j.s.h.ContextHandler@232a7d73{/v1/broker-state,null,STOPPED} (org.eclipse.jetty.server.handler.ContextHandler) [JettyShutdownThread]
kafka 2026-01-27 08:42:11,105 INFO S3RollingFileAppender log uploader closed successfully. (com.automq.log.S3RollingFileAppender) [kafka-shutdown-hook]
stream closed: EOF for monitoring-metrics/mimir-automq-broker-1002 (kafka)
'''

Metadata

Metadata

Assignees

Labels

bugSomething isn't working

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions