feature/add-url-md5-sha256 #206

@bwalsh

ADR: Alternate Object Identifier Strategy for Git LFS Without Content SHA256

Status

Proposed


Use Case: Managing Remote-Only Large Files Without Content Hashing

Title

Enable Git LFS workflows for remote-only or very large files without requiring SHA256 content hashing during git add.


Primary Actor

Research data steward / data engineer using Git LFS integrated with DRS.


Scenario

A research team maintains large genomic files (e.g., BAM, CRAM, FASTQ) that are:

  • Already stored in an object store (S3 / Ceph / GCS)
  • Registered in DRS
  • Multi-GB or TB in size
  • Not always locally downloaded

The user wants to:

  • Reference these files in a Git repository
  • Track them via Git LFS
  • Maintain reproducibility and metadata linkage
  • Avoid computing SHA256 hashes locally, which is too slow for multi-GB or TB files and impossible when the file has not been downloaded

User Story

As a research data steward managing large, remote DRS-registered files,
I want to add files to a Git LFS repository without computing a full content SHA256 hash,
So that I can efficiently reference remote objects while maintaining compatibility with Git LFS and DRS workflows.


Functional Expectations

  • During git add, the clean filter:
    • Does not require downloading or hashing full file contents.
    • Generates a stable alternate object identifier.
  • During git lfs push:
    • Remote existence checks use DRS or metadata services.
  • During git checkout:
    • Files are resolved via DRS ID.
  • No additional metadata files are committed to Git.
  • Integrity and deduplication are delegated to DRS.
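One way the clean-filter expectation above could be met is sketched below. It is an illustration under stated assumptions, not the design this ADR commits to: the identifier is derived by hashing a small canonical string of metadata (URL, size, and an optional MD5 reported by the object store), so no file content is read. The function names `alternate_oid` and `lfs_pointer` and the choice of metadata fields are hypothetical. Only the three-line pointer layout (`version`, `oid sha256:`, `size`) comes from the Git LFS pointer specification.

```python
import hashlib

# Version line defined by the Git LFS pointer specification.
LFS_SPEC = "https://git-lfs.github.com/spec/v1"


def alternate_oid(url: str, size: int, md5: str = "") -> str:
    """Hypothetical helper: derive a stable 64-hex identifier from metadata only.

    Assumption: URL + size (+ optional MD5 from the object store) is enough to
    identify the remote object. No file content is read or hashed.
    """
    canonical = "\n".join([url.strip(), str(size), md5.strip().lower()])
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def lfs_pointer(oid: str, size: int) -> str:
    """Render a standard Git LFS pointer; here the oid is NOT a content hash."""
    return f"version {LFS_SPEC}\noid sha256:{oid}\nsize {size}\n"


if __name__ == "__main__":
    # Example values are made up for illustration.
    oid = alternate_oid("s3://example-bucket/sample.bam", 123456789)
    print(lfs_pointer(oid, 123456789), end="")
```

Because the identifier has the same shape as a SHA256 digest, the pointer stays parseable by standard Git LFS tooling, but any component that verifies downloaded content against the oid would need to be aware that the value is metadata-derived.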


Acceptance Criteria

  • A user can run git add on remote-managed large files without content hashing delays.
  • The repository remains fully compatible with Git LFS commands.
  • git lfs push does not attempt redundant uploads when DRS already contains the object.
  • git checkout correctly restores files via DRS resolution.
  • No extra files beyond standard LFS pointers are committed.
  • The workflow handles multi-GB files without significant local CPU or I/O overhead.
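The push and checkout criteria above can be sketched as two small functions. This is a minimal illustration, not the project's implementation: `objects_to_upload` and `pick_access_url` are hypothetical names, the existence check is injected as a callable so the sketch makes no claim about how DRS or Indexd is actually queried, and the preferred ordering of access types is an assumption. The `access_methods`, `type`, `access_url`, and `access_id` fields follow the GA4GH DRS object schema.

```python
from typing import Callable, Iterable, List, Optional, Sequence


def objects_to_upload(
    oids: Iterable[str], exists_remotely: Callable[[str], bool]
) -> List[str]:
    """Return only the oids that the remote index does not already hold.

    `exists_remotely` is a stand-in for a DRS or metadata-service lookup.
    """
    return [oid for oid in oids if not exists_remotely(oid)]


def pick_access_url(
    drs_object: dict, preferred: Sequence[str] = ("https", "s3", "gs")
) -> Optional[str]:
    """Choose a direct URL from a DRS object's access_methods, if one exists.

    Methods that carry only an access_id need a further call to the DRS
    access endpoint; that step is out of scope here, so None is returned.
    """
    methods = drs_object.get("access_methods") or []
    for wanted in preferred:
        for method in methods:
            url = (method.get("access_url") or {}).get("url")
            if method.get("type") == wanted and url:
                return url
    return None


if __name__ == "__main__":
    already_registered = {"aaa"}  # in-memory stand-in for a DRS index
    print(objects_to_upload(["aaa", "bbb"], lambda o: o in already_registered))
    example = {"access_methods": [{"type": "s3", "access_url": {"url": "s3://example-bucket/key"}}]}
    print(pick_access_url(example))
```

Injecting the lookup keeps the upload decision testable without a network and leaves open whether the real check is keyed by DRS ID, checksum, or URL.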

Business / Architectural Value

  • Eliminates expensive SHA256 operations on large files.
  • Enables a remote-first, metadata-addressable architecture.
  • Aligns Git workflows with DRS and Indexd object identity.
  • Supports scalable genomics and bioinformatics data management.
