Handle orphaned and corrupt segments gracefully in RestoreFromS3 #112

@klaudworks

Description

Problem

RestoreFromS3() rebuilds the in-memory segment list from S3 on broker restart. If any .index file fails to download or parse, it returns a hard error, which blocks the entire partition from initializing.

Orphaned .kfs files (those without a matching .index) are created when uploadFlush uploads the .kfs successfully but the .index upload fails. Because Drain() clears the buffer before uploads begin and onFlush is never called on failure, an orphaned .kfs holds data that was never acknowledged to the producer, so skipping it loses no committed data.

Currently, the only recovery from an orphaned .kfs is manual deletion from S3.
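
To make the failure mode concrete, here is a minimal sketch of a restore loop that skips segments whose index cannot be loaded instead of aborting. It is not the existing RestoreFromS3 code: `segmentEntry`, `segment`, `indexLoader`, `errIndexMissing`, and `restoreSegments` are hypothetical names introduced for illustration, and the S3 download is abstracted behind a function parameter.

```go
package main

import (
	"errors"
	"log"
)

// segmentEntry is a hypothetical listing entry for one .kfs object found in S3.
type segmentEntry struct {
	BaseOffset int64
	Key        string
}

// segment is a hypothetical loaded segment: an entry plus the range its index covers.
type segment struct {
	BaseOffset int64
	LastOffset int64
	Key        string
}

// errIndexMissing stands in for whatever error a missing .index object produces.
var errIndexMissing = errors.New("index object not found")

// indexLoader is an assumed abstraction: it downloads and parses the .index
// for an entry and returns the last offset that index covers.
type indexLoader func(e segmentEntry) (lastOffset int64, err error)

// restoreSegments loads every entry whose index can be downloaded and parsed.
// Entries that fail are logged and skipped rather than failing the whole restore.
func restoreSegments(entries []segmentEntry, load indexLoader) (loaded []segment, skipped []string) {
	for _, e := range entries {
		last, err := load(e)
		if err != nil {
			log.Printf("restore: skipping segment %s: %v", e.Key, err)
			skipped = append(skipped, e.Key)
			continue
		}
		loaded = append(loaded, segment{BaseOffset: e.BaseOffset, LastOffset: last, Key: e.Key})
	}
	return loaded, skipped
}
```

This sketch treats every load error the same way. Whether a transient download failure should be skipped like a missing or corrupt index, or retried instead, is a separate decision that the sketch does not settle.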

Secondary issues:

  • The last offset is derived from the raw entries list rather than from the successfully loaded segments. If the index-download error is changed to a continue, this becomes a real bug, since the last entry may be an orphan.
  • Read() returns ErrOffsetOutOfRange when a requested offset falls in a gap between segments. Kafka returns records from the next available segment in this case.
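
Both secondary issues can be illustrated with a short sketch. It assumes the loaded segments are sorted by base offset and do not overlap. The `segment` type and the helpers `lastLoadedOffset` and `segmentForOffset` are hypothetical and do not reflect the actual Read() implementation; `ErrOffsetOutOfRange` is redeclared locally so the example is self-contained.

```go
package main

import (
	"errors"
	"sort"
)

// ErrOffsetOutOfRange is redeclared here only to keep the sketch self-contained.
var ErrOffsetOutOfRange = errors.New("offset out of range")

// segment is a hypothetical loaded segment covering [BaseOffset, LastOffset].
type segment struct {
	BaseOffset int64
	LastOffset int64
}

// lastLoadedOffset derives the last offset from the segments that actually
// loaded, not from the raw S3 listing. It returns -1 when nothing loaded.
func lastLoadedOffset(segs []segment) int64 {
	if len(segs) == 0 {
		return -1
	}
	return segs[len(segs)-1].LastOffset
}

// segmentForOffset returns the index of the segment a read at offset should
// start from. An offset that falls in a gap resolves to the next segment.
// Offsets before the first segment or past the last one are out of range.
func segmentForOffset(segs []segment, offset int64) (int, error) {
	if len(segs) == 0 || offset < segs[0].BaseOffset {
		return 0, ErrOffsetOutOfRange
	}
	// Find the first segment whose range ends at or after the requested offset.
	i := sort.Search(len(segs), func(i int) bool {
		return segs[i].LastOffset >= offset
	})
	if i == len(segs) {
		return 0, ErrOffsetOutOfRange
	}
	return i, nil
}
```

With segments covering 0-99 and 200-299, a request for offset 150 resolves to the second segment, while a request for offset 300 still returns ErrOffsetOutOfRange.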

Proposal

TBD
