breaking-change/simple-oid-bucket-key #173


# ADR: Breaking Change — Migrate Bucket Object Naming to SHA-256 Content Addressing for LFS De-duplication

**Status**: Proposed
**Date**: 2026-01-21
**Deciders**: Platform Architecture / Storage / Data Integrity
**Context**: Git-LFS, object storage, de-duplication, backward compatibility

---

## Context

The platform stores large file objects in backing object storage (e.g., S3-compatible buckets).
Today, object keys are derived from **project-, path-, or upload-scoped naming conventions**, not solely from file content.

As a result:

* Identical files are stored as **distinct physical objects** when uploaded:

  * Across different projects
  * From different clients
  * At different times
* Storage-level de-duplication is impossible or incomplete
* Storage identity diverges from Git-LFS’s content identity model

At the same time, Git-LFS already defines file identity as the **SHA-256 digest of file contents**, computed deterministically by clients.

This mismatch prevents correct, global de-duplication.

---

## Problem Statement

The current bucket object naming scheme:

* Is **not content-addressed**
* Prevents multiple logical references from resolving to a single physical object
* Increases storage cost and operational complexity
* Conflicts with Git-LFS pointer semantics (`oid sha256:<digest>`)

Without changing the naming convention, true LFS de-duplication cannot be reliably implemented.

---

## Decision

**We will change the bucket object naming convention to be strictly content-addressed, using the file’s SHA-256 digest as the canonical object key.**

This is a **breaking change**.

### Canonical Object Naming Form

```
<bucket>/<full-sha256>
```

Where:

* `<full-sha256>` is the lowercase hex-encoded SHA-256 digest of the file’s bytes
* Object identity is independent of:

  * Project
  * Repository
  * Filename
  * Upload time

No project- or path-derived components are included in the object key.
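
As an illustration of the naming form above, the following minimal sketch derives the object key by streaming a file through SHA-256. The function names `sha256_object_key` and `canonical_object_path`, the chunk size, and the bucket name `lfs` are assumptions made for this example, not part of the platform.

```python
import hashlib
import io
from typing import BinaryIO


def sha256_object_key(stream: BinaryIO, chunk_size: int = 1024 * 1024) -> str:
    """Return the lowercase hex SHA-256 digest of everything read from the stream."""
    digest = hashlib.sha256()
    # Read in fixed-size chunks so large files never have to fit in memory.
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def canonical_object_path(bucket: str, stream: BinaryIO) -> str:
    """Build `<bucket>/<full-sha256>`; nothing but the content feeds the key."""
    return f"{bucket}/{sha256_object_key(stream)}"


if __name__ == "__main__":
    # The bucket name "lfs" is hypothetical.
    print(canonical_object_path("lfs", io.BytesIO(b"abc")))
```

Because the key depends only on the bytes read, the same content yields the same path regardless of project, repository, filename, or upload time.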

---

## Rationale

### 1. Align storage identity with Git-LFS identity

Git-LFS already defines content identity as:

```
oid sha256:<digest>
```

Using the same identifier at the storage layer:

* Eliminates impedance mismatch
* Simplifies reasoning about correctness
* Makes de-duplication explicit rather than emergent
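
To show how the shared identifier removes a translation step, here is a hedged sketch that reads the `oid sha256:<digest>` line from a Git-LFS pointer file and uses the digest directly as the storage key. The helper name `object_key_from_pointer` is an assumption for this example, and the strict lowercase-hex check mirrors the canonical form defined above.

```python
import re

# Matches a pointer line such as "oid sha256:<64 lowercase hex characters>".
_OID_LINE = re.compile(r"^oid sha256:([0-9a-f]{64})$")


def object_key_from_pointer(pointer_text: str) -> str:
    """Extract the SHA-256 digest from a Git-LFS pointer and return it as the key."""
    for line in pointer_text.splitlines():
        match = _OID_LINE.match(line)
        if match:
            return match.group(1)
    raise ValueError("no valid 'oid sha256:<digest>' line found in pointer")


if __name__ == "__main__":
    # Placeholder digest, not the hash of any real file.
    pointer = (
        "version https://git-lfs.github.com/spec/v1\n"
        "oid sha256:" + "ab" * 32 + "\n"
        "size 12345\n"
    )
    print(object_key_from_pointer(pointer))
```

With this scheme, no lookup table is needed between the pointer's `oid` and the bucket key: they are the same string.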

---

### 2. Enable true global de-duplication

With content-addressed object keys:

* Identical files are stored **once**
* Multiple logical references resolve to the same physical object
* Storage cost scales with **unique content**, not number of uploads
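
The effect can be sketched with a small in-memory model, assuming a hypothetical `ContentAddressedStore` class that keeps logical references (project and path) separate from physical objects (keyed by digest). It is a toy illustration of the behavior described above, not the platform's storage layer.

```python
import hashlib
from typing import Dict, Tuple


class ContentAddressedStore:
    """Toy store: many logical references may point at one physical object."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}               # digest -> bytes
        self._references: Dict[Tuple[str, str], str] = {}  # (project, path) -> digest

    def put(self, project: str, path: str, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        # Store the bytes only if this digest has not been seen before.
        self._objects.setdefault(key, data)
        self._references[(project, path)] = key
        return key

    def get(self, project: str, path: str) -> bytes:
        return self._objects[self._references[(project, path)]]

    @property
    def physical_object_count(self) -> int:
        return len(self._objects)

    @property
    def reference_count(self) -> int:
        return len(self._references)


if __name__ == "__main__":
    store = ContentAddressedStore()
    store.put("proj-a", "model.bin", b"same bytes")
    store.put("proj-b", "assets/model.bin", b"same bytes")
    print(store.reference_count, store.physical_object_count)
```

Two uploads of identical bytes from different projects produce two references but only one stored object.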

---

### 3. Improve integrity and auditability

A content-addressed object key ensures:

* Stored bytes must match their identifier
* Corruption is detectable by re-hashing the stored bytes and comparing the result with the key
* Objects are naturally immutable by convention
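
A minimal sketch of such an integrity check, assuming a hypothetical helper `verify_object` that receives the object key and the bytes read back from storage:

```python
import hashlib


def verify_object(key: str, stored_bytes: bytes) -> bool:
    """Return True when the stored bytes hash to the key they are stored under."""
    return hashlib.sha256(stored_bytes).hexdigest() == key


if __name__ == "__main__":
    data = b"example payload"
    key = hashlib.sha256(data).hexdigest()
    print(verify_object(key, data))                # intact object
    print(verify_object(key, data + b"\x00"))      # corrupted object
```

Any auditor that can read the object can run this check, since the expected digest is the key itself and no separate checksum record is needed.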

---

## Breaking Change Impact

### Existing Clients

**All existing clients must be updated.**

Specifically:

* Upload paths change
* Download resolution logic changes
* Any client logic assuming project- or path-based bucket keys will break

Clients must treat the SHA-256 digest as the **authoritative storage key**.

---

### Existing Projects and Data

**All existing projects must be re-imported and tested.**

Reasons:

* Legacy objects are stored under non-canonical names
* Legacy keys do not encode content, so duplicates cannot be safely identified without re-hashing the stored bytes
* Mixed naming schemes would introduce ambiguity and correctness bugs

Re-import ensures:

* Objects are rewritten or re-registered under canonical keys
* Metadata references are consistent
* De-duplication behavior is correct and verifiable
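
The re-import step can be sketched as a pure planning function, assuming legacy objects are available as a mapping from legacy key to bytes. The names `plan_reimport` and `group_duplicates` and the legacy key shapes are hypothetical; a real re-import would stream objects from the bucket rather than hold them in memory.

```python
import hashlib
from typing import Dict, List, Mapping


def plan_reimport(legacy_objects: Mapping[str, bytes]) -> Dict[str, str]:
    """Map every legacy key to the canonical key computed from its bytes."""
    return {
        legacy_key: hashlib.sha256(data).hexdigest()
        for legacy_key, data in legacy_objects.items()
    }


def group_duplicates(key_map: Mapping[str, str]) -> Dict[str, List[str]]:
    """Group legacy keys by canonical key to show which ones collapse together."""
    groups: Dict[str, List[str]] = {}
    for legacy_key, canonical_key in key_map.items():
        groups.setdefault(canonical_key, []).append(legacy_key)
    return groups


if __name__ == "__main__":
    # Hypothetical legacy keys; the real legacy layout may differ.
    legacy = {
        "projects/a/logo.png": b"PNG-bytes",
        "projects/b/assets/logo.png": b"PNG-bytes",
        "projects/b/readme.pdf": b"PDF-bytes",
    }
    for canonical, sources in group_duplicates(plan_reimport(legacy)).items():
        print(canonical[:12], sources)
```

The resulting map gives metadata a single place to rewrite references from legacy keys to canonical keys, and the grouping makes the expected de-duplication verifiable before any object is moved.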

---

## Migration Strategy (High Level)

1. **Freeze legacy writes**

   * Prevent new uploads using the old naming scheme

2. **Client upgrade**

   * Release updated clients that read/write `<bucket>/<full-sha256>`

3. **Project re-import**

   * Recompute SHA-256 for existing files
   * Register or rewrite objects under canonical keys
   * Verify integrity and references

4. **Validation**

   * Confirm identical files collapse to a single physical object
   * Run project- and workflow-level tests

5. **Legacy cleanup (optional)**

   * Garbage-collect unreachable legacy objects
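
Steps 4 and 5 can be sketched as a single check over metadata and a bucket listing. This assumes references are available as a mapping from a logical identifier to the recorded object key, and that any key that is not 64 lowercase hex characters is a legacy key. The function name `validate_and_find_garbage` and the identifier shapes are hypothetical.

```python
import re
from typing import Dict, List, Set, Tuple

_CANONICAL_KEY = re.compile(r"[0-9a-f]{64}")


def validate_and_find_garbage(
    references: Dict[str, str], bucket_keys: Set[str]
) -> Tuple[List[str], List[str]]:
    """Return (problems, gc_candidates) for a migrated bucket.

    problems: references that use a non-canonical key or point at a missing object.
    gc_candidates: legacy (non-canonical) bucket keys that nothing references.
    """
    problems: List[str] = []
    for ref, key in sorted(references.items()):
        if not _CANONICAL_KEY.fullmatch(key):
            problems.append(f"{ref}: non-canonical key {key!r}")
        elif key not in bucket_keys:
            problems.append(f"{ref}: missing object {key}")
    referenced = set(references.values())
    gc_candidates = sorted(
        key
        for key in bucket_keys
        if key not in referenced and not _CANONICAL_KEY.fullmatch(key)
    )
    return problems, gc_candidates


if __name__ == "__main__":
    digest = "a" * 64  # placeholder digest
    refs = {"proj-a:file1": digest, "proj-b:file2": digest}
    bucket = {digest, "projects/a/uploads/old.bin"}
    print(validate_and_find_garbage(refs, bucket))
```

An empty `problems` list indicates that every reference resolves to an existing canonical object; `gc_candidates` is only a list of candidates, and deletion would remain a separate, deliberate step.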

---

## Consequences

### Positive

* Correct, global Git-LFS de-duplication
* Reduced storage footprint
* Stronger integrity guarantees
* Clean alignment with Git-LFS semantics
* Simpler long-term architecture

### Negative

* **Breaking change for all existing clients**
* **Mandatory re-import and testing for existing projects**
* One-time migration and operational overhead

---

## Non-Goals

* Backward compatibility with legacy bucket object paths
* Long-term support for mixed naming conventions
* Fully transparent migration without client or project involvement

---

## Decision Summary

**To enable correct and scalable Git-LFS de-duplication, the platform will adopt SHA-256–based content-addressed bucket object naming (`<bucket>/<full-sha256>`), accepting a breaking change that requires client upgrades and project re-imports.**

