bug/simple-oid-bucket-key #171

@bwalsh

ADR: Use SHA-256 as the Canonical Content Identifier for Files

Status: Discussion
Date: 2026-01-21
Deciders: Platform Architecture / Data Integrity
Context: Git-LFS / content-addressed object storage


Context

The system requires a stable, content-addressed identifier for large binary objects.
This identifier must:

  • Uniquely identify file content (by bytes, not by name or location)
  • Be deterministic and reproducible across clients
  • Be safe at planetary scale (billions–trillions of objects)
  • Be compatible with existing Git and Git-LFS ecosystems
  • Be computationally feasible for clients and servers

In practice, the system uses SHA-256 digests as object identifiers (OIDs), consistent with Git-LFS pointer files and Git's SHA-256 object format.


Decision

We will use SHA-256 as the canonical identifier for file content.

The SHA-256 digest of a file’s bytes is treated as the file’s identity for:

  • De-duplication
  • Upload/download addressing
  • Integrity verification
  • Cross-system reference (e.g., object stores, registries, DRS IDs)
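
As an illustration of how such an identifier could be derived, here is a minimal sketch that streams a file through SHA-256 and returns the lowercase hex digest. The function name `sha256_oid` and the 1 MiB chunk size are assumptions made for this example, not part of the system described here.

```python
import hashlib
from pathlib import Path


def sha256_oid(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Return the lowercase hex SHA-256 digest of the file's bytes."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        # Read fixed-size chunks so large files never need to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Because the digest depends only on the bytes, two clients hashing the same content under different names or paths obtain the same OID.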

Rationale

1. What “unique” means in practice

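No hash function with a fixed-size output can mathematically guarantee uniqueness for all possible files: there are far more possible files than digests, so collisions must exist in principle.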
However, in engineering systems, “unique identifier” means:

The probability of two different files producing the same identifier is so small that it is irrelevant compared to other real-world failure modes.

SHA-256 satisfies this requirement.


2. Collision probability is astronomically small

SHA-256 produces a 256-bit digest, yielding \(2^{256}\) possible values.

Using the birthday bound, the probability of any accidental collision among \(n\) files is approximately the number of file pairs (about \(n^2/2\)) times the per-pair collision probability \(2^{-256}\):

\[
p \approx \frac{n^2}{2^{257}}
\]

Even at extreme scale:

  • \(n = 10^{12}\) (one trillion files)
  • \(p \approx 4 \times 10^{-54}\)

This probability is many orders of magnitude smaller than risks from:

  • Disk corruption
  • RAM faults
  • Network errors
  • Software bugs
  • Operator error
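
The birthday-bound estimate above can be checked with exact integer arithmetic. This is a small sketch. The helper name `birthday_bound` is chosen for illustration, and the formula is the same approximation used in this section (valid while the result is much smaller than 1).

```python
from fractions import Fraction


def birthday_bound(n: int, bits: int = 256) -> Fraction:
    """Approximate collision probability p ~ n^2 / 2^(bits + 1)."""
    return Fraction(n * n, 2 ** (bits + 1))


p = birthday_bound(10**12)   # one trillion files
print(f"{float(p):.1e}")     # prints 4.3e-54
```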

3. Intentional collisions are computationally infeasible

SHA-256 is designed to provide:

  • Collision resistance: infeasible to find any two distinct inputs with the same hash
  • Second-preimage resistance: infeasible to find a different file with the same hash as a known file

A generic (birthday) collision attack requires approximately \(2^{128}\) hash evaluations.
No practical cryptanalytic attacks are known that materially reduce this cost.
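
For a sense of scale, the arithmetic below converts 2^128 hash evaluations into wall-clock time. The attacker rate of 10^21 hashes per second is a purely hypothetical figure chosen for illustration, not a measurement, and the sketch ignores the memory a real birthday attack would also need.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600

work = 2 ** 128                # generic collision-search cost in hash evaluations
rate = 10 ** 21                # ASSUMPTION: hypothetical hashes per second
years = work / rate / SECONDS_PER_YEAR

print(f"{years:.1e} years")    # on the order of 10^10 years
```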


4. Alignment with Git and Git-LFS design

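  • Git originally used SHA-1 and has since added SHA-256 as a supported object format, motivated by long-term collision concerns about SHA-1. This is an ongoing transition rather than a completed migration.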

  • Git’s SHA-256 transition documentation explicitly states that collision resistance and second-preimage resistance are the required properties for object identity.

  • Git-LFS pointer files encode object identity as:

    oid sha256:<hex-digest>
    

By adopting SHA-256, this system remains fully compatible with Git-LFS tooling, hosting platforms, and ecosystem expectations.
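
To make the pointer format concrete, the sketch below builds and parses a Git-LFS pointer for a blob of bytes. The three-line layout (`version`, `oid`, `size`) follows the Git-LFS pointer specification. The helper names `make_pointer` and `parse_pointer` are hypothetical and exist only for this example.

```python
import hashlib

LFS_SPEC_VERSION = "https://git-lfs.github.com/spec/v1"


def make_pointer(content: bytes) -> str:
    """Build pointer text: version line first, then oid and size."""
    oid = hashlib.sha256(content).hexdigest()
    return f"version {LFS_SPEC_VERSION}\noid sha256:{oid}\nsize {len(content)}\n"


def parse_pointer(text: str) -> tuple[str, int]:
    """Return (hex digest, size) from pointer text, rejecting non-SHA-256 OIDs."""
    fields = dict(line.split(" ", 1) for line in text.splitlines() if line)
    algo, _, hex_digest = fields["oid"].partition(":")
    if algo != "sha256" or len(hex_digest) != 64:
        raise ValueError("unsupported or malformed oid")
    return hex_digest, int(fields["size"])
```

A round trip such as `parse_pointer(make_pointer(data))` returns the same digest that a direct SHA-256 of `data` would produce, which is what lets the OID act as a cross-system reference.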


5. Standards and cryptographic maturity

SHA-256 is standardized by NIST as part of the Secure Hash Standard (FIPS 180-4) and is widely deployed in:

  • Version control systems
  • Package registries
  • Container image digests
  • Software supply-chain security
  • Integrity and authenticity systems

Its behavior and risk profile are well understood.


Consequences

Positive

  • Negligible collision risk at any realistic scale
  • Deterministic, content-addressed identity
  • Native compatibility with Git-LFS and Git tooling
  • Simple de-duplication and integrity verification
  • Long-term cryptographic confidence

Trade-offs / Non-Goals

  • Absolute mathematical uniqueness is not claimed (and is impossible for any finite hash)
  • Hash computation cost exists but is typically small relative to network and storage I/O for large files
  • Cryptographic agility (future hash migration) remains a separate concern

Operational Notes

  • SHA-256 is treated as the primary identifier, but systems may additionally store:

    • File size
    • Content type
    • Backend-specific checksums (e.g., ETag, MD5)
  • Integrity checks during transfer and storage remain necessary; hash identity does not replace transport correctness.
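
One way such a check might look after a download is sketched below: the received bytes are compared against the recorded size and OID before being accepted. The function name `verify_object` and its parameters are assumptions for illustration, and a real client would more likely hash the stream incrementally than hold the whole object in memory.

```python
import hashlib


def verify_object(data: bytes, expected_oid: str, expected_size: int) -> bool:
    """Return True only if both the size and the SHA-256 digest match."""
    # The size comparison is cheap and catches truncated transfers early.
    if len(data) != expected_size:
        return False
    return hashlib.sha256(data).hexdigest() == expected_oid.lower()
```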


Decision Summary (One-Line)

SHA-256 provides a practically unique, collision-resistant, and industry-standard content identifier suitable for Git-LFS-scale file systems.
