Title
Handle NoSuchUploadException in UploadPart operation to prevent broker crash
Description
When using AutoMQ with Tencent Cloud COS (or other S3-compatible storage), the broker crashes with Runtime.getRuntime().halt(1) when encountering NoSuchUploadException during UploadPart operations.
Root Cause
The current implementation in AwsObjectStorage.toRetryStrategyAndCause() only handles NoSuchUploadException specially for COMPLETE_MULTI_PART_UPLOAD operation:
    if (COMPLETE_MULTI_PART_UPLOAD == operation) {
        if (cause instanceof NoSuchUploadException) {
            strategy = RetryStrategy.VISIBILITY_CHECK;
        }
    }
However, for UPLOAD_PART operation, NoSuchUploadException (HTTP 404) results in RetryStrategy.ABORT, which eventually propagates to S3Storage.commitDeltaWALUpload() and triggers:
    Runtime.getRuntime().halt(1);
Scenario
This issue occurs when:
- A multipart upload is initiated
- The upload takes longer than expected (due to network issues, throttling, or large data)
- The cloud storage's lifecycle rule automatically aborts incomplete multipart uploads (e.g., after 1-7 days)
- Subsequent UploadPart calls fail with NoSuchUploadException
- Broker crashes
Error Log
[ERROR] UploadPart for object 2b180480/_kafka_ops_sh/138445234-2 fail (com.automq.stream.s3.operator.AbstractObjectStorage)
software.amazon.awssdk.services.s3.model.NoSuchUploadException: The specified multipart upload does not exist.
The upload ID might be invalid, or the multipart upload might have been aborted or completed.
[ERROR] Unexpected exception when commit stream set object (com.automq.stream.s3.S3Storage)
java.util.concurrent.CompletionException: software.amazon.awssdk.services.s3.model.NoSuchUploadException: ...
Proposed Solution
- When NoSuchUploadException occurs during UPLOAD_PART, instead of aborting, the system should:
  - Invalidate the current uploadId
  - Re-initiate a new multipart upload
  - Retry the upload from the beginning
- Add uploadId validity tracking to detect stale upload sessions early
- Consider adding a configurable timeout for multipart uploads to proactively restart uploads that are taking too long
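The first proposal (invalidate, re-initiate, retry from the beginning) could look roughly like the sketch below. It is only an illustration under stated assumptions: `MultipartStore`, `uploadWithRestart`, and `maxRestarts` are hypothetical names, not AutoMQ's actual API, and the nested `NoSuchUploadException` is a stand-in for the SDK class so the sketch compiles without the AWS SDK. The real code path is asynchronous, so the same logic would presumably be expressed with futures rather than a blocking loop.

```java
import java.util.ArrayList;
import java.util.List;

final class MultipartRestartSketch {

    /** Stand-in for the AWS SDK's NoSuchUploadException (assumption: lets the sketch compile without the SDK). */
    static class NoSuchUploadException extends RuntimeException {
        NoSuchUploadException(String message) {
            super(message);
        }
    }

    /** Hypothetical minimal storage interface; NOT AutoMQ's actual ObjectStorage API. */
    interface MultipartStore {
        String createMultipartUpload(String key);

        /** Returns the part's ETag; may throw NoSuchUploadException if the session was aborted. */
        String uploadPart(String key, String uploadId, int partNumber, byte[] data);

        void completeMultipartUpload(String key, String uploadId, List<String> etags);
    }

    /**
     * Uploads all parts. If the upload session disappears mid-way, the stale
     * uploadId is discarded, a new session is created, and every part is sent
     * again starting from part 1. Rethrows once maxRestarts restarts are used up.
     */
    static String uploadWithRestart(MultipartStore store, String key,
                                    List<byte[]> parts, int maxRestarts) {
        int restarts = 0;
        while (true) {
            // A fresh uploadId per attempt: the previous one is never reused.
            String uploadId = store.createMultipartUpload(key);
            List<String> etags = new ArrayList<>();
            try {
                for (int i = 0; i < parts.size(); i++) {
                    // Part numbers are 1-based.
                    etags.add(store.uploadPart(key, uploadId, i + 1, parts.get(i)));
                }
            } catch (NoSuchUploadException e) {
                if (restarts >= maxRestarts) {
                    throw e; // bounded: surface the failure instead of looping forever
                }
                restarts++;
                continue; // drop the stale uploadId and start over
            }
            store.completeMultipartUpload(key, uploadId, etags);
            return uploadId;
        }
    }
}
```

Note that restarting from part 1 requires the part data to still be available to the caller, because parts uploaded under the aborted uploadId are gone. Bounding the number of restarts keeps a misconfigured lifecycle rule from turning into an endless retry loop.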
Environment
- Cloud Provider: Tencent Cloud COS (S3-compatible)
- The bucket has lifecycle rules configured to abort incomplete multipart uploads
Impact
- Severity: High
- Broker crashes and requires manual restart
- Data durability may be affected if WAL upload fails