ADR: Breaking Change — Migrate Bucket Object Naming to SHA-256 Content Addressing for LFS De-duplication
Status: Proposed
Date: 2026-01-21
Deciders: Platform Architecture / Storage / Data Integrity
Context: Git-LFS, object storage, de-duplication, backward compatibility
Context
The platform stores large file objects in backing object storage (e.g., S3-compatible buckets).
Today, object keys are derived from project-, path-, or upload-scoped naming conventions, not solely from file content.
As a result:
- Identical files are stored as distinct physical objects when uploaded:
  - Across different projects
  - From different clients
  - At different times
- Storage-level de-duplication is impossible or incomplete
- Storage identity diverges from Git-LFS’s content identity model
At the same time, Git-LFS already defines file identity as the SHA-256 digest of file contents, computed deterministically by clients.
This mismatch prevents correct, global de-duplication.
Problem Statement
The current bucket object naming scheme:
- Is not content-addressed
- Prevents multiple logical references from resolving to a single physical object
- Increases storage cost and operational complexity
- Conflicts with Git-LFS pointer semantics (oid sha256:<digest>)
Without changing the naming convention, true LFS de-duplication cannot be reliably implemented.
Decision
We will change the bucket object naming convention to be strictly content-addressed, using the file’s SHA-256 digest as the canonical object key.
This is a breaking change.
Canonical Object Naming Form
<bucket>/<full-sha256>
Where:
- <full-sha256> is the lowercase hex-encoded SHA-256 digest of the file’s bytes
- Object identity is independent of:
  - Project
  - Repository
  - Filename
  - Upload time
No project- or path-derived components are included in the object key.
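As an illustration of the naming form above, the object key can be derived by streaming the file through SHA-256 and taking the lowercase hex digest. This is a minimal sketch, not the platform's implementation; the function name `canonical_object_key` and the chunk size are assumptions made for the example.

```python
import hashlib
import io
from typing import BinaryIO


def canonical_object_key(stream: BinaryIO, chunk_size: int = 1 << 20) -> str:
    """Return the lowercase hex SHA-256 digest of everything read from `stream`.

    Hypothetical helper: reads in chunks so large files are never loaded
    into memory at once.
    """
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    # hexdigest() is already lowercase hex, matching <full-sha256>.
    return digest.hexdigest()


# The key depends only on the bytes, never on project, path, or upload time.
key = canonical_object_key(io.BytesIO(b"abc"))
print(key)
```

Because the key is a pure function of the bytes, any two clients hashing the same file arrive at the same key without coordinating.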
Rationale
1. Align storage identity with Git-LFS identity
Git-LFS already defines content identity as:
oid sha256:<digest>
Using the same identifier at the storage layer:
- Eliminates impedance mismatch
- Simplifies reasoning about correctness
- Makes de-duplication explicit rather than emergent
2. Enable true global de-duplication
With content-addressed object keys:
- Identical files are stored once
- Multiple logical references resolve to the same physical object
- Storage cost scales with unique content, not number of uploads
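The effect can be sketched with a toy in-memory store in which logical references (project and path) point at physical objects keyed by digest. The class `ContentAddressedStore` and its methods are hypothetical stand-ins for the bucket and the metadata layer, not an existing API.

```python
import hashlib


class ContentAddressedStore:
    """Toy in-memory stand-in for a bucket keyed by SHA-256 digest."""

    def __init__(self) -> None:
        # Physical objects: one entry per unique content.
        self.objects: dict[str, bytes] = {}
        # Logical references: (project, path) -> object key.
        self.references: dict[tuple[str, str], str] = {}

    def upload(self, project: str, path: str, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        # If the key already exists, the bytes are identical, so nothing is rewritten.
        self.objects.setdefault(key, data)
        self.references[(project, path)] = key
        return key

    def download(self, project: str, path: str) -> bytes:
        return self.objects[self.references[(project, path)]]


store = ContentAddressedStore()
key_a = store.upload("project-a", "assets/model.bin", b"same bytes")
key_b = store.upload("project-b", "data/copy.bin", b"same bytes")
print(len(store.references), "references,", len(store.objects), "physical object(s)")
```

Two uploads from different projects produce two references but a single stored object, which is the sense in which storage cost follows unique content.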
3. Improve integrity and auditability
A content-addressed object key means:
- Stored bytes must hash to their identifier
- Corruption is detectable by recomputing the digest and comparing it with the key
- Objects are immutable by convention, since changed content yields a different key
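A verification pass along these lines could recompute the digest of the stored bytes and compare it with the key. The sketch below assumes the key is the bare 64-character lowercase hex digest; `is_canonical_key` and `verify_object` are hypothetical names.

```python
import hashlib
import re

# A canonical key is exactly 64 lowercase hex characters.
SHA256_HEX = re.compile(r"[0-9a-f]{64}")


def is_canonical_key(key: str) -> bool:
    """Check the key's shape without touching the object's bytes."""
    return SHA256_HEX.fullmatch(key) is not None


def verify_object(key: str, data: bytes) -> bool:
    """Return True only if `data` hashes to `key`."""
    if not is_canonical_key(key):
        return False
    return hashlib.sha256(data).hexdigest() == key


good_key = hashlib.sha256(b"payload").hexdigest()
print(verify_object(good_key, b"payload"))    # bytes match the key
print(verify_object(good_key, b"payl0ad"))    # a changed byte is detected
```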
Breaking Change Impact
Existing Clients
All existing clients must be updated.
Specifically:
- Upload paths change
- Download resolution logic changes
- Any client logic assuming project- or path-based bucket keys will break
Clients must treat the SHA-256 digest as the authoritative storage key.
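On the client side, the storage key can be read directly from the Git-LFS pointer file, whose `oid sha256:<digest>` line already carries the digest. The following is a rough sketch that handles only the `oid` line of a pointer; the function name `storage_key_from_pointer` is an assumption, and the digest in the sample pointer is simply the SHA-256 of the bytes `abc`.

```python
import re


def storage_key_from_pointer(pointer_text: str) -> str:
    """Extract the SHA-256 digest from a Git-LFS pointer and use it as the key."""
    fields = {}
    for line in pointer_text.splitlines():
        # Each pointer line has the form "<name> <value>".
        name, _, value = line.partition(" ")
        fields[name] = value
    algorithm, _, digest = fields.get("oid", "").partition(":")
    if algorithm != "sha256" or re.fullmatch(r"[0-9a-f]{64}", digest) is None:
        raise ValueError("pointer does not carry a valid sha256 oid")
    return digest


pointer = (
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad\n"
    "size 3\n"
)
print(storage_key_from_pointer(pointer))
```

No project or path lookup is involved: the pointer alone is enough to locate the object under `<bucket>/<full-sha256>`.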
Existing Projects and Data
All existing projects must be re-imported and tested.
Reasons:
- Legacy objects are stored under non-canonical names
- Safe de-duplication cannot be inferred retroactively, because legacy names do not encode file content
- Mixed naming schemes would introduce ambiguity and correctness bugs
Re-import ensures:
- Objects are rewritten or re-registered under canonical keys
- Metadata references are consistent
- De-duplication behavior is correct and verifiable
Migration Strategy (High Level)
1. Freeze legacy writes
   - Prevent new uploads using the old naming scheme
2. Client upgrade
   - Release updated clients that read/write <bucket>/<full-sha256>
3. Project re-import
   - Recompute SHA-256 for existing files
   - Register or rewrite objects under canonical keys
   - Verify integrity and references
4. Validation
   - Confirm identical files collapse to a single physical object
   - Run project- and workflow-level tests
5. Legacy cleanup (optional)
   - Garbage-collect unreachable legacy objects
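The re-import and validation steps can be sketched as a pure function over a snapshot of legacy objects. This is an assumption-laden illustration: `plan_reimport` is a hypothetical name, legacy objects are modeled as an in-memory mapping from legacy key to bytes, and a real migration would stream from the bucket rather than hold content in memory.

```python
import hashlib


def plan_reimport(
    legacy_objects: dict[str, bytes],
) -> tuple[dict[str, str], dict[str, bytes]]:
    """Map each legacy key to its canonical key and build the de-duplicated store."""
    key_map: dict[str, str] = {}      # legacy key -> canonical SHA-256 key
    canonical: dict[str, bytes] = {}  # canonical key -> bytes, one per unique content

    for legacy_key, data in legacy_objects.items():
        key = hashlib.sha256(data).hexdigest()  # recompute SHA-256
        canonical.setdefault(key, data)         # register once per unique content
        key_map[legacy_key] = key

    # Verify references: every legacy key must resolve to identical bytes.
    for legacy_key, key in key_map.items():
        if canonical[key] != legacy_objects[legacy_key]:
            raise RuntimeError(f"content mismatch for {legacy_key}")
    return key_map, canonical


legacy = {
    "project-a/assets/logo.png": b"logo bytes",
    "project-b/img/logo-copy.png": b"logo bytes",
    "project-a/data/table.csv": b"a,b\n1,2\n",
}
key_map, canonical = plan_reimport(legacy)
print(len(legacy), "legacy objects ->", len(canonical), "canonical objects")
```

After cutover, legacy keys that no longer appear in any reference are the candidates for the optional garbage-collection step.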
Consequences
Positive
- Correct, global Git-LFS de-duplication
- Reduced storage footprint
- Stronger integrity guarantees
- Clean alignment with Git-LFS semantics
- Simpler long-term architecture
Negative
- Breaking change for all existing clients
- Mandatory re-import and testing for existing projects
- One-time migration and operational overhead
Non-Goals
- Backward compatibility with legacy bucket object paths
- Long-term support for mixed naming conventions
- Fully transparent migration without client or project involvement
Decision Summary
To enable correct and scalable Git-LFS de-duplication, the platform will adopt SHA-256–based content-addressed bucket object naming (<bucket>/<full-sha256>), accepting a breaking change that requires client upgrades and project re-imports.