ADR: Use SHA-256 as the Canonical Content Identifier for Files
Status: Discussion
Date: 2026-01-21
Deciders: Platform Architecture / Data Integrity
Context: Git-LFS / content-addressed object storage
Context
The system requires a stable, content-addressed identifier for large binary objects.
This identifier must:
- Uniquely identify file content (by bytes, not by name or location)
- Be deterministic and reproducible across clients
- Be safe at planetary scale (billions–trillions of objects)
- Be compatible with existing Git and Git-LFS ecosystems
- Be computationally feasible for clients and servers
In practice, the system uses SHA-256 digests as object identifiers (OIDs), consistent with Git LFS pointer files and modern Git object formats.
Decision
We will use SHA-256 as the canonical identifier for file content.
The SHA-256 digest of a file’s bytes is treated as the file’s identity for:
- De-duplication
- Upload/download addressing
- Integrity verification
- Cross-system reference (e.g., object stores, registries, DRS IDs)
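As an illustration of how such an identifier can be derived, the following is a minimal sketch that streams a file through SHA-256 using Python's standard `hashlib`. The function name `sha256_oid` and the 1 MiB chunk size are assumptions made for this example, not part of the decision.

```python
import hashlib
from pathlib import Path

CHUNK_SIZE = 1024 * 1024  # 1 MiB per read; an arbitrary choice for this sketch


def sha256_oid(path: Path) -> str:
    """Return the lowercase hex SHA-256 digest of the file's bytes."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read fixed-size chunks so large files never need to fit in memory.
        for chunk in iter(lambda: handle.read(CHUNK_SIZE), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Because only the bytes are hashed, two files with identical content but different names or locations yield the same identifier, which is what makes de-duplication possible.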
Rationale
1. What “unique” means in practice
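No fixed-length hash function can mathematically guarantee uniqueness for all possible files: there are more possible files than possible digests, so some distinct inputs must share a digest.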
However, in engineering systems, “unique identifier” means:
The probability of two different files producing the same identifier is so small that it is irrelevant compared to other real-world failure modes.
SHA-256 satisfies this requirement.
2. Collision probability is astronomically small
SHA-256 produces a 256-bit digest, yielding $2^{256}$ possible values.
Using the birthday bound, the probability of any accidental collision among $n$ files is approximately:
$$
p \approx \frac{n^2}{2^{257}}
$$
Even at extreme scale:
- $n = 10^{12}$ (one trillion files)
- $p \approx 4 \times 10^{-54}$
This probability is many orders of magnitude smaller than risks from:
- Disk corruption
- RAM faults
- Network errors
- Software bugs
- Operator error
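The birthday-bound estimate above can be checked numerically. This is a small sketch; the function name `birthday_collision_bound` is hypothetical, and the approximation only holds while the result is much smaller than 1.

```python
def birthday_collision_bound(n: int, bits: int = 256) -> float:
    """Approximate P(any collision) among n uniformly random digests.

    Uses p ~= n^2 / 2^(bits + 1), which is accurate only when p << 1.
    """
    # Python integers are arbitrary precision, so the division is exact
    # up to the final rounding to a float.
    return (n * n) / (2 ** (bits + 1))


p = birthday_collision_bound(10**12)
print(f"{p:.2e}")  # prints 4.32e-54
```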
3. Intentional collisions are computationally infeasible
SHA-256 is designed to provide:
- Collision resistance: infeasible to find any two distinct inputs with the same hash
- Second-preimage resistance: infeasible to find a different file with the same hash as a known file
A generic collision attack requires approximately $2^{128}$ work.
No practical cryptanalytic attacks are known that materially reduce this cost.
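To put the generic collision cost in perspective, here is a rough worked estimate. It assumes a hypothetical attacker sustaining $10^{18}$ hash evaluations per second; that rate is an illustrative figure chosen for this example, not a measurement.

$$
t \approx \frac{2^{128}}{10^{18}\ \text{s}^{-1}} \approx 3.4 \times 10^{20}\ \text{s} \approx 1.1 \times 10^{13}\ \text{years}
$$

Under that assumption, the expected time is on the order of ten trillion years.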
4. Alignment with Git and Git-LFS design
- Git originally used SHA-1 and has since added SHA-256 as a supported object format, motivated by long-term collision concerns about SHA-1.
- Git’s SHA-256 transition documentation explicitly states that collision resistance and second-preimage resistance are the required properties for object identity.
- Git-LFS pointer files encode object identity as:
  oid sha256:<hex-digest>
By adopting SHA-256, this system remains fully compatible with Git-LFS tooling, hosting platforms, and ecosystem expectations.
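For context, a Git-LFS pointer file is a short text file with a `version` line, an `oid sha256:<hex-digest>` line, and a `size` line. The sketch below builds and parses that shape. The helper names `build_pointer` and `parse_pointer` are hypothetical, and the validation shown is deliberately minimal rather than a full implementation of the pointer specification.

```python
import re

LFS_SPEC_URL = "https://git-lfs.github.com/spec/v1"
OID_PATTERN = re.compile(r"[0-9a-f]{64}")  # 256 bits as lowercase hex


def build_pointer(oid_hex: str, size: int) -> str:
    """Render pointer-file text for an object with the given digest and size."""
    if not OID_PATTERN.fullmatch(oid_hex):
        raise ValueError("oid must be 64 lowercase hex characters")
    if size < 0:
        raise ValueError("size must be non-negative")
    return f"version {LFS_SPEC_URL}\noid sha256:{oid_hex}\nsize {size}\n"


def parse_pointer(text: str) -> tuple[str, int]:
    """Extract (oid_hex, size) from pointer-file text."""
    # Each non-empty line is "key value", split on the first space.
    fields = dict(line.split(" ", 1) for line in text.splitlines() if line)
    algorithm, _, oid_hex = fields["oid"].partition(":")
    if algorithm != "sha256" or not OID_PATTERN.fullmatch(oid_hex):
        raise ValueError("unsupported or malformed oid")
    return oid_hex, int(fields["size"])
```

Round-tripping `parse_pointer(build_pointer(oid, size))` returns the original `(oid, size)` pair.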
5. Standards and cryptographic maturity
SHA-256 is standardized by NIST as part of the Secure Hash Standard (FIPS 180-4) and is widely deployed in:
- Version control systems
- Package registries
- Container image digests
- Software supply-chain security
- Integrity and authenticity systems
Its behavior and risk profile are well understood.
Consequences
Positive
- Negligible collision risk at any realistic scale
- Deterministic, content-addressed identity
- Native compatibility with Git-LFS and Git tooling
- Simple de-duplication and integrity verification
- Long-term cryptographic confidence
Trade-offs / Non-Goals
- Absolute mathematical uniqueness is not claimed (and is impossible for any fixed-length hash)
- Hash computation has a cost, but it is typically small relative to I/O for large files
- Cryptographic agility (future hash migration) remains a separate concern
Operational Notes
- SHA-256 is treated as the primary identifier, but systems may additionally store:
  - File size
  - Content type
  - Backend-specific checksums (e.g., ETag, MD5)
- Integrity checks during transfer and storage remain necessary; hash identity does not replace transport correctness.
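One way these notes could fit together is sketched below: a metadata record keyed by the SHA-256 OID, plus a check that received bytes match it. The `ObjectRecord` type, its field names, and `verify_payload` are all hypothetical names introduced for illustration; they do not describe an existing schema.

```python
import hashlib
import hmac
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ObjectRecord:
    oid: str                             # hex SHA-256 digest; the primary identifier
    size: int                            # expected length in bytes
    content_type: Optional[str] = None   # auxiliary metadata only
    backend_etag: Optional[str] = None   # backend-specific; never used as identity


def verify_payload(record: ObjectRecord, payload: bytes) -> bool:
    """Return True only if the payload matches the record's size and digest."""
    # Cheap size check first, then the full digest comparison.
    if len(payload) != record.size:
        return False
    actual = hashlib.sha256(payload).hexdigest()
    return hmac.compare_digest(actual, record.oid)
```

A receiver would call `verify_payload` after each download and reject the object on a `False` result, regardless of what the transport layer reported.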
Decision Summary (One-Line)
SHA-256 provides a practically unique, collision-resistant, and industry-standard content identifier suitable for Git-LFS-scale file systems.