breaking-change/simple-oid-bucket-key #173

@bwalsh

ADR: Breaking Change — Migrate Bucket Object Naming to SHA-256 Content Addressing for LFS De-duplication

Status: Proposed
Date: 2026-01-21
Deciders: Platform Architecture / Storage / Data Integrity
Context: Git-LFS, object storage, de-duplication, backward compatibility


Context

The platform stores large file objects in backing object storage (e.g., S3-compatible buckets).
Today, object keys are derived from project-, path-, or upload-scoped naming conventions, not solely from file content.

As a result:

  • Identical files are stored as distinct physical objects when uploaded:

    • Across different projects
    • From different clients
    • At different times

  • Storage-level de-duplication is impossible or incomplete

  • Storage identity diverges from Git-LFS’s content identity model

At the same time, Git-LFS already defines file identity as the SHA-256 digest of file contents, computed deterministically by clients.

This mismatch prevents correct, global de-duplication.
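
As an illustration of the content identity described above, the following minimal sketch computes a SHA-256 digest by streaming a file-like object. The function name `sha256_oid` and the chunk size are assumptions made for this example and are not taken from any existing client.

```python
import hashlib
import io
from typing import BinaryIO

CHUNK_SIZE = 1024 * 1024  # 1 MiB per read; an arbitrary choice for this sketch


def sha256_oid(stream: BinaryIO) -> str:
    """Return the lowercase hex SHA-256 digest of everything readable from stream."""
    digest = hashlib.sha256()
    # Read in chunks so large files never need to fit in memory.
    for chunk in iter(lambda: stream.read(CHUNK_SIZE), b""):
        digest.update(chunk)
    return digest.hexdigest()


# Identical bytes yield the same identifier regardless of who uploads them or when.
upload_a = io.BytesIO(b"example payload")
upload_b = io.BytesIO(b"example payload")
same_identity = sha256_oid(upload_a) == sha256_oid(upload_b)
```

Because the digest depends only on the bytes, two uploads of the same file from different projects or clients produce the same identifier.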


Problem Statement

The current bucket object naming scheme:

  • Is not content-addressed
  • Prevents multiple logical references from resolving to a single physical object
  • Increases storage cost and operational complexity
  • Conflicts with Git-LFS pointer semantics (oid sha256:<digest>)

Without changing the naming convention, true LFS de-duplication cannot be reliably implemented.


Decision

We will change the bucket object naming convention to be strictly content-addressed, using the file’s SHA-256 digest as the canonical object key.

This is a breaking change.

Canonical Object Naming Form

<bucket>/<full-sha256>

Where:

  • <full-sha256> is the lowercase hex-encoded SHA-256 digest of the file’s bytes

  • Object identity is independent of:

    • Project
    • Repository
    • Filename
    • Upload time

No project- or path-derived components are included in the object key.
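
The naming rule above could be enforced with a small helper such as the one sketched here. The names `canonical_object_key` and `canonical_object_location`, and the bucket name in the usage line, are hypothetical; only the `<bucket>/<full-sha256>` form and the lowercase hex requirement come from this ADR.

```python
import re

# A full SHA-256 digest is exactly 64 lowercase hexadecimal characters.
_SHA256_HEX = re.compile(r"[0-9a-f]{64}")


def canonical_object_key(digest: str) -> str:
    """Validate a digest and return it unchanged as the object key."""
    if not _SHA256_HEX.fullmatch(digest):
        raise ValueError("expected 64 lowercase hex characters, got %r" % digest)
    return digest


def canonical_object_location(bucket: str, digest: str) -> str:
    """Render the <bucket>/<full-sha256> form used in this document."""
    return f"{bucket}/{canonical_object_key(digest)}"


# "example-lfs-bucket" is a placeholder bucket name for illustration only.
location = canonical_object_location("example-lfs-bucket", "a" * 64)
```

Rejecting uppercase or truncated digests at this boundary keeps a single spelling of each key, so the same content cannot end up stored under two names.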


Rationale

1. Align storage identity with Git-LFS identity

Git-LFS already defines content identity as:

oid sha256:<digest>

Using the same identifier at the storage layer:

  • Eliminates impedance mismatch
  • Simplifies reasoning about correctness
  • Makes de-duplication explicit rather than emergent
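
To show how a client might obtain the storage key directly from a pointer file, here is a rough sketch of a parser for the Git-LFS v1 pointer format (`version`, `oid sha256:<digest>`, `size`). The function name `parse_lfs_pointer` is an assumption for this example, and the parser is deliberately lenient compared with the full pointer specification.

```python
from typing import Tuple

LFS_SPEC_VERSION = "https://git-lfs.github.com/spec/v1"


def parse_lfs_pointer(text: str) -> Tuple[str, int]:
    """Return (sha256 digest, size in bytes) from Git-LFS pointer text."""
    fields = {}
    for line in text.splitlines():
        # Each pointer line has the form "<key> <value>".
        key, sep, value = line.partition(" ")
        if sep:
            fields[key] = value
    if fields.get("version") != LFS_SPEC_VERSION:
        raise ValueError("not a Git-LFS v1 pointer")
    method, sep, digest = fields.get("oid", "").partition(":")
    if method != "sha256" or not digest:
        raise ValueError("pointer has no sha256 oid")
    if "size" not in fields:
        raise ValueError("pointer has no size")
    return digest, int(fields["size"])


# Pointer for a 3-byte file containing b"abc".
example_pointer = (
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad\n"
    "size 3\n"
)
digest, size = parse_lfs_pointer(example_pointer)
```

Under the proposed scheme, the digest returned here is also the object key, so no further mapping is needed between the pointer and the bucket.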

2. Enable true global de-duplication

With content-addressed object keys:

  • Identical files are stored once
  • Multiple logical references resolve to the same physical object
  • Storage cost scales with unique content, not number of uploads
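
The effect can be demonstrated with a toy in-memory store. This is only a sketch: `ContentAddressedStore`, its attributes, and the project and path names are invented for illustration, and a real implementation would sit on top of an S3-compatible bucket rather than a dictionary.

```python
import hashlib
from typing import Dict, Tuple


class ContentAddressedStore:
    """Toy store: physical objects keyed by digest, logical references kept apart."""

    def __init__(self) -> None:
        self.objects: Dict[str, bytes] = {}                # digest -> stored bytes
        self.references: Dict[Tuple[str, str], str] = {}   # (project, path) -> digest

    def upload(self, project: str, path: str, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        # Store the bytes only if this content has not been seen before.
        self.objects.setdefault(digest, data)
        self.references[(project, path)] = digest
        return digest


store = ContentAddressedStore()
payload = b"identical file contents"
key_a = store.upload("project-a", "data/model.bin", payload)
key_b = store.upload("project-b", "weights/copy.bin", payload)
```

After both uploads, `store.references` holds two logical entries while `store.objects` holds a single physical object, which is the behaviour described in the list above.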

3. Improve integrity and auditability

A content-addressed object key means that:

  • Stored bytes must hash to their key
  • Corruption is detectable by recomputing the digest and comparing it with the key
  • Objects are effectively immutable: any change to the content yields a different key
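
A read-side integrity check along these lines might look like the sketch below. `IntegrityError` and `verify_object` are hypothetical names introduced for this example.

```python
import hashlib


class IntegrityError(Exception):
    """Raised when stored bytes do not hash to the key they were stored under."""


def verify_object(key: str, data: bytes) -> bytes:
    """Return data unchanged if its SHA-256 digest equals key, else raise."""
    actual = hashlib.sha256(data).hexdigest()
    if actual != key:
        raise IntegrityError(f"object {key} hashes to {actual}")
    return data


good_key = hashlib.sha256(b"abc").hexdigest()
verified = verify_object(good_key, b"abc")

try:
    verify_object(good_key, b"abd")  # a single changed byte
    corruption_detected = False
except IntegrityError:
    corruption_detected = True
```

No separate checksum metadata is required for this check, because the key itself is the expected digest.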

Breaking Change Impact

Existing Clients

All existing clients must be updated.

Specifically:

  • Upload paths change
  • Download resolution logic changes
  • Any client logic assuming project- or path-based bucket keys will break

Clients must treat the SHA-256 digest as the authoritative storage key.


Existing Projects and Data

All existing projects must be re-imported and tested.

Reasons:

  • Legacy objects are stored under non-canonical names
  • Legacy keys are not derived solely from content, so duplicates cannot be identified safely without recomputing digests
  • Mixed naming schemes would introduce ambiguity and correctness bugs

Re-import ensures:

  • Objects are rewritten or re-registered under canonical keys
  • Metadata references are consistent
  • De-duplication behavior is correct and verifiable

Migration Strategy (High Level)

  1. Freeze legacy writes

    • Prevent new uploads using the old naming scheme

  2. Client upgrade

    • Release updated clients that read/write <bucket>/<full-sha256>

  3. Project re-import

    • Recompute SHA-256 for existing files
    • Register or rewrite objects under canonical keys
    • Verify integrity and references

  4. Validation

    • Confirm identical files collapse to a single physical object
    • Run project- and workflow-level tests

  5. Legacy cleanup (optional)

    • Garbage-collect unreachable legacy objects
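
The re-import and validation steps above can be sketched as a planning pass that groups legacy keys by the canonical key their contents hash to. The function `plan_reimport` and the legacy key names are hypothetical, since this ADR does not specify the exact legacy key format. The sketch also holds object bytes in memory, whereas a real migration would stream them from the bucket.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List, Mapping


def plan_reimport(legacy_objects: Mapping[str, bytes]) -> Dict[str, List[str]]:
    """Map each canonical SHA-256 key to the legacy keys whose bytes hash to it."""
    plan: Dict[str, List[str]] = defaultdict(list)
    for legacy_key in sorted(legacy_objects):
        canonical_key = hashlib.sha256(legacy_objects[legacy_key]).hexdigest()
        plan[canonical_key].append(legacy_key)
    return dict(plan)


# Hypothetical legacy listing: two projects hold the same bytes under different keys.
legacy = {
    "project-a/data/model.bin": b"shared weights",
    "project-b/uploads/model-copy.bin": b"shared weights",
    "project-b/uploads/notes.txt": b"unique notes",
}
plan = plan_reimport(legacy)

# Validation: canonical keys with more than one legacy source collapsed as intended.
collapsed = {key: sources for key, sources in plan.items() if len(sources) > 1}
```

The plan doubles as a record for the optional cleanup step, because every legacy key appears under exactly one canonical key.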

Consequences

Positive

  • Correct, global Git-LFS de-duplication
  • Reduced storage footprint
  • Stronger integrity guarantees
  • Clean alignment with Git-LFS semantics
  • Simpler long-term architecture

Negative

  • Breaking change for all existing clients
  • Mandatory re-import and testing for existing projects
  • One-time migration and operational overhead

Non-Goals

  • Backward compatibility with legacy bucket object paths
  • Long-term support for mixed naming conventions
  • Fully transparent migration without client or project involvement

Decision Summary

To enable correct and scalable Git-LFS de-duplication, the platform will adopt SHA-256–based content-addressed bucket object naming (<bucket>/<full-sha256>), accepting a breaking change that requires client upgrades and project re-imports.
