
feature/pre-commit-changes #192

@bwalsh


ADR: LFS-Only Local Cache for OID↔Path→S3 URL Hints with Authoritative Resolution at Pre-Push

Status

Proposed


Context and Problem Statement

We need to associate Git LFS–tracked files with external storage locations (S3 URLs) while preserving correct behavior under:

  • content changes
  • file renames / moves
  • undo / restaging workflows
  • offline or latency-sensitive local development

Key constraints and preferences:

  • Git LFS defines the boundary of responsibility for external objects.
  • Non-LFS files are fully managed by Git and must not participate in this mechanism.
  • A server-side index (Indexd / DRS) is the authoritative source of truth for mapping content identity to storage locations.
  • We do not want to consult the authoritative index on every commit.
  • pre-commit must be fast, deterministic, offline-friendly, and index-based.
  • Correctness and enforcement should occur at pre-push, where refs, remotes, and commit ranges are known.

"How git works" 🚧

| Hook | stdin contents | Structured? | Stable format? | Purpose |
|---|---|---|---|---|
| pre-commit | Empty | ❌ No | N/A | Validate what is staged |
| pre-push | Ref update list | ✅ Yes | ✅ Yes | Validate what is about to be pushed |

The difference is intentional:

  • pre-commit is index-centric
  • pre-push is ref-centric

pre-commit reconciliation (practical)

In a pre-commit hook (or your custom git add wrapper), detect renames and move metadata accordingly.

How to detect renames reliably:

Use Git’s rename detection between HEAD and index:

  • git diff --cached -M --name-status

This emits lines like:

  • R100 old/path.txt new/path.txt

For each R* old new:

  • move /old/... → /new/...
  • or update the metadata file’s internal path field
  • add the moved metadata file to the index
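
The rename-detection step above can be sketched as follows. This is a minimal illustration, assuming the hook has already captured the output of `git diff --cached -M --name-status` as a string; the `rename` type and `parseRenames` function are hypothetical names. In this output format the status, old path, and new path are separated by tabs.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// rename is one R* record: the path in HEAD and the path in the index.
type rename struct {
	Old, New string
}

// parseRenames extracts rename records from name-status output.
// Lines that are not renames (M, A, D, ...) are ignored.
func parseRenames(output string) []rename {
	var out []rename
	sc := bufio.NewScanner(strings.NewReader(output))
	for sc.Scan() {
		f := strings.Split(sc.Text(), "\t")
		// A rename line has three tab-separated fields: R<score>, old, new.
		if len(f) == 3 && strings.HasPrefix(f[0], "R") {
			out = append(out, rename{Old: f[1], New: f[2]})
		}
	}
	return out
}

func main() {
	sample := "M\tREADME.md\nR100\told/path.txt\tnew/path.txt\n"
	for _, r := range parseRenames(sample) {
		fmt.Printf("%s -> %s\n", r.Old, r.New)
	}
}
```

One caveat for a real hook: Git may quote paths that contain unusual characters in this output, so the NUL-delimited form (`-z`) would need a different splitter than the one sketched here.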

Pros:

  • deterministic
  • works before commit
  • doesn’t require history scanning

Cons:

  • you must enforce the hook/wrapper usage

Decision

1. Scope: Git LFS–Only

This design applies exclusively to Git LFS–tracked files.

  • A file is in scope if and only if its staged content is a valid Git LFS pointer:
    version https://git-lfs.github.com/spec/v1
    oid sha256:<hex>
    size <bytes>
  • All non-LFS files are explicitly out of scope:
    • no cache entries
    • no validation
    • no warnings or errors
  • File size, extension, or .gitattributes patterns alone MUST NOT be used to infer scope.

2. Identity Model

  • The canonical content identity is the Git LFS OID (sha256:<hex>), extracted from the staged pointer file.
  • The system MUST NOT:
    • hash file contents
    • compute Git blob IDs
    • infer identities for non-LFS files

3. Metadata Model

The local system models three non-authoritative relationships, all maintained purely for developer workflow:

  1. Path → OID
    • Which LFS object is currently staged at a working-tree path.
  2. OID → Path(s)
    • Which paths have recently referenced a given OID.
    • Supports rename, undo, and multi-path reuse.
    • Paths are advisory and may be stale.
  3. OID → S3 URL (hint)
    • A locally cached hint for where the object may live.
    • Must be validated against the authoritative server index at pre-push.

No locally stored relationship is authoritative.


4. Crisp Rule (Normative)

Path is never authoritative; OID (sha256) is.
Paths are client-side, repo-local workflow context.
The server indexes content identity and provides access methods.


5. Local Cache Location (Non-Versioned)

All local metadata is stored under:

.git/drs/pre-commit/

This directory:

  • is never committed
  • is local to the working copy
  • may be freely deleted and reconstructed

Recommended layout

.git/drs/pre-commit/
  v1/
    paths/
      <encoded-path>.json
    oids/
      <oid>.json
    tombstones/
      <encoded-path>.json
    state.json

Cache File Schemas

What is the encoded path?

Algorithm (step-by-step):

  • Start with the repo-relative path of the file (as a UTF‑8 string).

  • Encode that path using Base64 URL‑safe encoding without padding (RawURLEncoding), producing the token.

  • Append .json to the encoded token.

  • Place the file under .git/drs/pre-commit/v1/paths/.

  • This is implemented in pathEntryFile → encodePath, which uses base64.RawURLEncoding.EncodeToString([]byte(path)), then appends .json to form the final filename.

Resulting pattern:

.git/drs/pre-commit/v1/paths/<base64url_no_padding(repo_relative_path)>.json

paths/<encoded-path>.json (Path → OID)

{
  "path": "data/foo.bam",
  "lfs_oid": "sha256:<hex>",
  "updated_at": "2026-02-01T12:34:56Z"
}

oids/<oid>.json (OID → Path(s), S3 URL hint)

{
  "lfs_oid": "sha256:<hex>",
  "paths": [
    "data/foo.bam",
    "data/archive/foo-copy.bam"
  ],
  "s3_url": "s3://bucket/key",
  "updated_at": "2026-02-01T12:34:56Z",
  "content_changed": false
}

Pre-Commit Responsibilities (LFS-Only, Local-Only)

The pre-commit hook operates only on the staged index and only on LFS-tracked files.


Pre-Push Responsibilities (Authoritative, Networked)

The pre-push hook is the sole enforcement point.


Mapping .git/drs/pre-commit to Server Semantics

| Local cache concept | Server-side analogue | Notes |
|---|---|---|
| path → lfs_oid | none | purely client-side workflow context |
| lfs_oid → paths[] | none | advisory, repo-local |
| lfs_oid (sha256) | Indexd hashes.sha256, DRS checksums | canonical content identity |
| lfs_oid → s3_url (hint) | Indexd urls[], DRS access_methods[] | server is authoritative |
| logical ID | Indexd object_id, DRS object_id | resolved at pre-push |

Summary

This ADR establishes a strict LFS-only contract:
pre-commit maintains a local, non-authoritative cache of path↔OID↔URL hints, while pre-push resolves and enforces truth using Indexd / DRS.
