Handle NoSuchUploadException in UploadPart operation to prevent broker crash #3206

@daniel-y

Description

When AutoMQ runs on Tencent Cloud COS (or other S3-compatible storage), the broker crashes via Runtime.getRuntime().halt(1) if an UploadPart operation fails with NoSuchUploadException.

Root Cause

The current implementation in AwsObjectStorage.toRetryStrategyAndCause() special-cases NoSuchUploadException only for the COMPLETE_MULTI_PART_UPLOAD operation:

if (COMPLETE_MULTI_PART_UPLOAD == operation) {
    if (cause instanceof NoSuchUploadException) {
        strategy = RetryStrategy.VISIBILITY_CHECK;
    }
}

However, for the UPLOAD_PART operation, NoSuchUploadException (HTTP 404) maps to RetryStrategy.ABORT. The failure then propagates to S3Storage.commitDeltaWALUpload(), which calls:

Runtime.getRuntime().halt(1);
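
To make the gap concrete, here is a minimal, self-contained sketch of how the classification could be extended. It is not AutoMQ's actual code: the NoSuchUploadException class is a local stand-in for the AWS SDK type so the example compiles on its own, the Operation enum is reduced to the values needed here, and RESTART_UPLOAD is a hypothetical strategy that does not exist in the codebase as far as this issue shows. Only VISIBILITY_CHECK and ABORT are taken from the snippet above.

```java
final class RetryClassificationSketch {
    // Local stand-in for software.amazon.awssdk.services.s3.model.NoSuchUploadException,
    // declared here only so the sketch compiles without the AWS SDK.
    static class NoSuchUploadException extends RuntimeException {
        NoSuchUploadException(String message) {
            super(message);
        }
    }

    // Reduced, illustrative set of operations.
    enum Operation { PUT_OBJECT, UPLOAD_PART, COMPLETE_MULTI_PART_UPLOAD }

    // VISIBILITY_CHECK and ABORT appear in the issue; RESTART_UPLOAD is hypothetical.
    enum RetryStrategy { VISIBILITY_CHECK, ABORT, RESTART_UPLOAD }

    // Classifies only NoSuchUploadException; every other error falls through to
    // ABORT because other error types are out of scope for this sketch.
    static RetryStrategy classify(Operation operation, Throwable cause) {
        if (cause instanceof NoSuchUploadException) {
            switch (operation) {
                case COMPLETE_MULTI_PART_UPLOAD:
                    // Existing behavior: the object may already be complete and visible.
                    return RetryStrategy.VISIBILITY_CHECK;
                case UPLOAD_PART:
                    // Assumed new behavior: the uploadId is gone, so start a new upload.
                    return RetryStrategy.RESTART_UPLOAD;
                default:
                    return RetryStrategy.ABORT;
            }
        }
        return RetryStrategy.ABORT;
    }

    public static void main(String[] args) {
        Throwable cause = new NoSuchUploadException("The specified multipart upload does not exist.");
        System.out.println(classify(Operation.UPLOAD_PART, cause));
        System.out.println(classify(Operation.COMPLETE_MULTI_PART_UPLOAD, cause));
    }
}
```

The point of the sketch is that UPLOAD_PART gets its own branch, so a missing uploadId no longer reaches the ABORT path that ends in halt(1).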

Scenario

This issue occurs when:

  1. A multipart upload is initiated
  2. The upload takes longer than expected (due to network issues, throttling, or large data)
  3. A lifecycle rule on the bucket automatically aborts the incomplete multipart upload (e.g., after 1-7 days)
  4. Subsequent UploadPart calls fail with NoSuchUploadException
  5. The broker crashes

Error Log

[ERROR] UploadPart for object 2b180480/_kafka_ops_sh/138445234-2 fail (com.automq.stream.s3.operator.AbstractObjectStorage)
software.amazon.awssdk.services.s3.model.NoSuchUploadException: The specified multipart upload does not exist.
The upload ID might be invalid, or the multipart upload might have been aborted or completed.

[ERROR] Unexpected exception when commit stream set object (com.automq.stream.s3.S3Storage)
java.util.concurrent.CompletionException: software.amazon.awssdk.services.s3.model.NoSuchUploadException: ...
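
The second log entry shows the NoSuchUploadException arriving wrapped in a CompletionException, so any handler that checks the exception type has to unwrap it first. A minimal sketch of that unwrapping follows; the rootCause helper is a hypothetical name, and NoSuchUploadException is again a local stand-in for the AWS SDK class.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.ExecutionException;

final class CauseUnwrapSketch {
    // Local stand-in for the AWS SDK exception so the sketch compiles without the SDK.
    static class NoSuchUploadException extends RuntimeException {
        NoSuchUploadException(String message) {
            super(message);
        }
    }

    // Hypothetical helper: strips CompletionException / ExecutionException wrappers
    // added by CompletableFuture and returns the first non-wrapper cause.
    static Throwable rootCause(Throwable t) {
        Throwable current = t;
        while ((current instanceof CompletionException || current instanceof ExecutionException)
                && current.getCause() != null) {
            current = current.getCause();
        }
        return current;
    }

    public static void main(String[] args) {
        CompletableFuture<Void> failed = new CompletableFuture<>();
        failed.completeExceptionally(new NoSuchUploadException("The specified multipart upload does not exist."));
        try {
            failed.join(); // join() rethrows the failure wrapped in CompletionException
        } catch (CompletionException e) {
            Throwable cause = rootCause(e);
            System.out.println(cause instanceof NoSuchUploadException);
        }
    }
}
```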

Proposed Solution

  1. When NoSuchUploadException occurs during UPLOAD_PART, the system should recover instead of aborting:

    • Invalidate the current uploadId
    • Re-initiate a new multipart upload
    • Retry the upload from the beginning

  2. Add uploadId validity tracking to detect stale upload sessions early

  3. Consider adding a configurable timeout for multipart uploads to proactively restart uploads that are taking too long
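
The recovery flow in items 1 and 3 could look roughly like the sketch below. Everything in it is illustrative: MultipartStore is a hypothetical minimal interface (not AutoMQ's or the AWS SDK's API), NoSuchUploadException is a local stand-in for the SDK class, and maxRestarts and isStale are assumed names. It also assumes the parts are still available in memory so the upload can be replayed from the first part.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

final class RestartableMultipartUploadSketch {
    // Local stand-in for the AWS SDK exception so the sketch compiles without the SDK.
    static class NoSuchUploadException extends RuntimeException {
        NoSuchUploadException(String message) {
            super(message);
        }
    }

    // Hypothetical storage interface used only for this illustration.
    interface MultipartStore {
        String createMultipartUpload(String key);

        String uploadPart(String key, String uploadId, int partNumber, byte[] data); // returns an ETag

        void completeMultipartUpload(String key, String uploadId, List<String> etags);
    }

    // Item 3: an upload older than maxAge is treated as stale and can be restarted proactively.
    static boolean isStale(long startedAtMillis, long nowMillis, Duration maxAge) {
        return nowMillis - startedAtMillis >= maxAge.toMillis();
    }

    // Item 1: on NoSuchUploadException during a part upload, drop the uploadId,
    // create a new multipart upload, and replay all parts. Returns the restart count.
    static int upload(MultipartStore store, String key, List<byte[]> parts, int maxRestarts) {
        int restarts = 0;
        while (true) {
            String uploadId = store.createMultipartUpload(key);
            List<String> etags = new ArrayList<>();
            boolean restart = false;
            for (int i = 0; i < parts.size(); i++) {
                try {
                    etags.add(store.uploadPart(key, uploadId, i + 1, parts.get(i)));
                } catch (NoSuchUploadException e) {
                    if (restarts >= maxRestarts) {
                        throw e; // give up and let the caller decide, instead of looping forever
                    }
                    restarts++;
                    restart = true; // the uploadId is invalid; discard it and start over
                    break;
                }
            }
            if (restart) {
                continue;
            }
            // Completion is deliberately outside the restart handling: the issue notes
            // that this operation already has its own VISIBILITY_CHECK path.
            store.completeMultipartUpload(key, uploadId, etags);
            return restarts;
        }
    }

    public static void main(String[] args) {
        final int[] created = {0};
        MultipartStore fake = new MultipartStore() {
            public String createMultipartUpload(String key) {
                return "upload-" + (++created[0]);
            }

            public String uploadPart(String key, String uploadId, int partNumber, byte[] data) {
                if (uploadId.equals("upload-1") && partNumber == 2) {
                    throw new NoSuchUploadException("simulated lifecycle abort");
                }
                return "etag-" + partNumber;
            }

            public void completeMultipartUpload(String key, String uploadId, List<String> etags) {
                System.out.println("completed " + uploadId + " with " + etags.size() + " parts");
            }
        };
        int restarts = upload(fake, "example-key", List.of(new byte[1], new byte[1]), 3);
        System.out.println("restarts: " + restarts);
    }
}
```

Bounding the number of restarts keeps a persistently failing bucket from turning the crash into an endless retry loop. Whether replaying from the first part is affordable depends on how long the source data stays buffered, which this sketch does not address.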

Environment

  • Cloud Provider: Tencent Cloud COS (S3-compatible)
  • The bucket has lifecycle rules configured to abort incomplete multipart uploads

Impact

  • Severity: High
  • Broker crashes and requires manual restart
  • Data durability may be affected if WAL upload fails
