bug/simple-oid-bucket-key

# ADR: Use SHA-256 as the Canonical Content Identifier for Files

**Status**: Discussion
**Date**: 2026-01-21
**Deciders**: Platform Architecture / Data Integrity
**Context**: Git-LFS / content-addressed object storage

---

## Context

The system requires a **stable, content-addressed identifier** for large binary objects.
This identifier must:

* Uniquely identify file content (by bytes, not by name or location)
* Be deterministic and reproducible across clients
* Be safe at planetary scale (billions–trillions of objects)
* Be compatible with existing Git and Git-LFS ecosystems
* Be computationally feasible for clients and servers

In practice, the system uses **SHA-256 digests** as object identifiers (OIDs), consistent with **Git LFS** pointer files and modern **Git** object formats.

---

## Decision

**We will use SHA-256 as the canonical identifier for file content.**

The SHA-256 digest of a file’s bytes is treated as the file’s identity for:

* De-duplication
* Upload/download addressing
* Integrity verification
* Cross-system reference (e.g., object stores, registries, DRS IDs)
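As an illustration of the decision above, here is a minimal sketch of how a client might derive the identifier from a file's bytes. The names `sha256_oid`, `object_key`, and `CHUNK_SIZE` are hypothetical, and the fan-out key layout is an assumption made for the example, not something this ADR specifies.

```python
import hashlib
from pathlib import Path

CHUNK_SIZE = 1024 * 1024  # 1 MiB reads; an arbitrary choice for this sketch


def sha256_oid(path: Path) -> str:
    """Return the lowercase hex SHA-256 digest of the file's bytes."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Stream the file so memory use stays flat for large objects.
        while chunk := handle.read(CHUNK_SIZE):
            digest.update(chunk)
    return digest.hexdigest()


def object_key(oid: str) -> str:
    """Hypothetical storage key: fan out on the first two hex byte pairs."""
    return f"{oid[0:2]}/{oid[2:4]}/{oid}"
```

Because the digest depends only on the bytes, two files with identical content map to the same key regardless of name or location, which is what makes de-duplication a simple key lookup.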

---

## Rationale

### 1. What “unique” means in practice

No finite hash function can *mathematically guarantee* uniqueness for all possible files.
However, in engineering systems, “unique identifier” means:

> *The probability of two different files producing the same identifier is so small that it is irrelevant compared to other real-world failure modes.*

SHA-256 satisfies this requirement.

---

### 2. Collision probability is astronomically small

SHA-256 produces a 256-bit digest, yielding $2^{256}$ possible values.

Using the birthday bound, the probability of **any accidental collision** among $n$ files is approximately:

$$
p \approx \frac{n^2}{2^{257}}
$$

Even at extreme scale:

* $n = 10^{12}$ (one trillion files)
* $p \approx 4 \times 10^{-54}$

This probability is many orders of magnitude smaller than risks from:

* Disk corruption
* RAM faults
* Network errors
* Software bugs
* Operator error
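The birthday-bound estimate can be checked with exact integer arithmetic. This is a small sketch; the function name `birthday_bound` is hypothetical, and the approximation is only meaningful while the result is far below 1.

```python
from fractions import Fraction


def birthday_bound(n: int, digest_bits: int = 256) -> Fraction:
    """Approximate collision probability n^2 / 2^(digest_bits + 1)."""
    return Fraction(n * n, 2 ** (digest_bits + 1))


# Print the estimate for a few object counts, including one trillion.
for n in (10**9, 10**12, 10**15):
    print(f"n = {n:.0e}: p ~ {float(birthday_bound(n)):.1e}")
```

For one trillion objects this evaluates to roughly 4.3e-54, matching the figure quoted in this section.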

---

### 3. Intentional collisions are computationally infeasible

SHA-256 is designed to provide:

* **Collision resistance**: infeasible to find *any* two distinct inputs with the same hash
* **Second-preimage resistance**: infeasible to find a different file with the same hash as a known file

A generic collision attack requires approximately $2^{128}$ work.
No practical cryptanalytic attacks are known that materially reduce this cost.

---

### 4. Alignment with Git and Git-LFS design

* Git originally used SHA-1 and has since added SHA-256 as a supported object format, motivated by demonstrated SHA-1 collision attacks and long-term collision concerns.
* Git’s SHA-256 transition documentation explicitly states that **collision resistance and second-preimage resistance** are the required properties for object identity.
* Git-LFS pointer files encode object identity as:

  ```
  oid sha256:<hex-digest>
  ```

By adopting SHA-256, this system remains **fully compatible** with Git-LFS tooling, hosting platforms, and ecosystem expectations.
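To make the pointer format concrete, the following sketch builds and parses a pointer in the Git-LFS v1 layout (a `version` line, then `oid` and `size`). The helper names `build_pointer` and `parse_pointer` are hypothetical, and the parser is deliberately minimal; it is not a substitute for the Git-LFS client's own validation.

```python
import hashlib

LFS_SPEC_VERSION = "https://git-lfs.github.com/spec/v1"


def build_pointer(content: bytes) -> str:
    """Render pointer text for the given object bytes."""
    oid = hashlib.sha256(content).hexdigest()
    return (
        f"version {LFS_SPEC_VERSION}\n"
        f"oid sha256:{oid}\n"
        f"size {len(content)}\n"
    )


def parse_pointer(text: str) -> tuple[str, int]:
    """Return (hex digest, size in bytes) from pointer text."""
    # Each non-empty line is "<key> <value>".
    fields = dict(line.split(" ", 1) for line in text.splitlines() if line)
    algorithm, _, hex_digest = fields["oid"].partition(":")
    if algorithm != "sha256" or len(hex_digest) != 64:
        raise ValueError("unsupported or malformed oid")
    return hex_digest, int(fields["size"])
```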

---

### 5. Standards and cryptographic maturity

SHA-256 is standardized by **NIST** as part of the Secure Hash Standard (FIPS 180-4) and is widely deployed in:

* Version control systems
* Package registries
* Container image digests
* Software supply-chain security
* Integrity and authenticity systems

Its behavior and risk profile are well understood.

---

## Consequences

### Positive

* Negligible collision risk at any realistic scale
* Deterministic, content-addressed identity
* Native compatibility with Git-LFS and Git tooling
* Simple de-duplication and integrity verification
* Long-term cryptographic confidence

### Trade-offs / Non-Goals

* Absolute mathematical uniqueness is not claimed (and is impossible for any finite hash)
* Hash computation has a real cost, but it is typically small relative to I/O for large files
* Cryptographic agility (future hash migration) remains a separate concern

---

## Operational Notes

* SHA-256 is treated as the **primary identifier**, but systems may additionally store:

  * File size
  * Content type
  * Backend-specific checksums (e.g., ETag, MD5)
* Integrity checks during transfer and storage remain necessary; **hash identity does not replace transport correctness**.
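A receiver-side check along these lines is one way to apply the notes above. It is a sketch under stated assumptions: `verify_object` and `IntegrityError` are hypothetical names, and the object is held in memory for brevity, whereas a real implementation would likely hash while streaming.

```python
import hashlib


class IntegrityError(Exception):
    """Raised when received bytes do not match the expected identity."""


def verify_object(data: bytes, expected_oid: str, expected_size: int) -> None:
    """Check size first (cheap), then the SHA-256 digest."""
    if len(data) != expected_size:
        raise IntegrityError(
            f"size mismatch: got {len(data)}, expected {expected_size}"
        )
    actual_oid = hashlib.sha256(data).hexdigest()
    if actual_oid != expected_oid.lower():
        raise IntegrityError(
            f"digest mismatch: got {actual_oid}, expected {expected_oid}"
        )
```

Checking the stored size before hashing rejects truncated transfers without paying for a full digest computation.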

---

## Decision Summary (One-Line)

**SHA-256 provides a practically unique, collision-resistant, and industry-standard content identifier suitable for Git-LFS-scale file systems.**

