ADR: Alternate Object Identifier Strategy for Git LFS Without Content SHA256
Status
Proposed
Use Case: Managing Remote-Only Large Files Without Content Hashing
Title
Enable Git LFS workflows for remote-only or very large files without requiring SHA256 content hashing during git add.
Primary Actor
Research data steward / data engineer using Git LFS integrated with DRS.
Scenario
A research team maintains large genomic files (e.g., BAM, CRAM, FASTQ) that are:
- Already stored in an object store (S3 / Ceph / GCS)
- Registered in DRS
- Multi-GB or TB in size
- Not always locally downloaded
The user wants to:
- Reference these files in a Git repository
- Track them via Git LFS
- Maintain reproducibility and metadata linkage
- Avoid computing SHA256 hashes locally (too slow or impossible)
User Story
As a research data steward managing large, remote DRS-registered files,
I want to add files to a Git LFS repository without computing a full content SHA256 hash,
So that I can efficiently reference remote objects while maintaining compatibility with Git LFS and DRS workflows.
Functional Expectations
- During git add, the clean filter:
  - Does not require downloading or hashing full file contents.
  - Generates a stable alternate object identifier.
- During git lfs push:
  - Remote existence checks use DRS or metadata services.
- During git checkout:
  - Files are resolved via DRS ID.
- No additional metadata files are committed to Git.
- Integrity and deduplication are delegated to DRS.
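One possible reading of "stable alternate object identifier" is sketched below. It is an illustration under assumptions, not a decided design: the identifier is derived by hashing a small canonical metadata record (DRS ID, size, and an object-store ETag) instead of the file contents, and it is written into a standard LFS pointer under the `sha256:` label so existing tooling still parses it. The function names `alternate_oid` and `lfs_pointer`, and the choice of metadata fields, are hypothetical.

```python
import hashlib
import json

# Version line used by standard Git LFS pointer files.
LFS_SPEC_VERSION = "https://git-lfs.github.com/spec/v1"


def alternate_oid(drs_id: str, size: int, etag: str) -> str:
    """Derive a 64-hex-character identifier from metadata, not file contents.

    Assumption: (drs_id, size, etag) is enough to identify the object.
    Only a few bytes of metadata are hashed, so the cost does not depend
    on the size of the remote file.
    """
    # Canonical JSON (sorted keys, fixed separators) keeps the result stable
    # across runs and machines for the same metadata.
    canonical = json.dumps(
        {"drs_id": drs_id, "etag": etag, "size": size},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def lfs_pointer(oid: str, size: int) -> str:
    """Render a pointer in the standard three-line Git LFS layout.

    Note: the oid keeps the "sha256:" label for compatibility, but under
    this sketch it is NOT a hash of the file contents.
    """
    return f"version {LFS_SPEC_VERSION}\noid sha256:{oid}\nsize {size}\n"
```

Because the identifier no longer verifies content, any integrity check would have to come from DRS, which matches the delegation stated in the last expectation above.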
Acceptance Criteria
- User can run git add on remote-managed large files without content hashing delays.
- Repository remains fully compatible with Git LFS commands.
- git lfs push does not attempt redundant uploads when DRS already contains the object.
- git checkout correctly restores files via DRS resolution.
- No extra files beyond standard LFS pointers are committed.
- Workflow works for multi-GB files without significant local CPU or I/O overhead.
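The push criterion above (no redundant uploads when DRS already contains the object) can be pictured as a simple filter over the pointers queued for push. This is a minimal sketch only: `Pointer`, `plan_uploads`, and the `exists_in_drs` callable are hypothetical names, and the callable stands in for whatever DRS or Indexd lookup the real integration would use.

```python
from typing import Callable, Iterable, List, NamedTuple


class Pointer(NamedTuple):
    """The two fields an LFS pointer carries for an object."""
    oid: str
    size: int


def plan_uploads(
    pointers: Iterable[Pointer],
    exists_in_drs: Callable[[str], bool],
) -> List[Pointer]:
    """Return only the pointers whose objects are not yet known to DRS.

    `exists_in_drs` is an injected lookup (hypothetical); no file contents
    are read or hashed to make the decision.
    """
    return [p for p in pointers if not exists_in_drs(p.oid)]


# Usage sketch: a set of already-registered identifiers plays the role of DRS.
registered = {"aaa"}
queued = [Pointer("aaa", 10), Pointer("bbb", 20)]
to_upload = plan_uploads(queued, lambda oid: oid in registered)
```

Injecting the lookup keeps the decision testable without a live DRS endpoint.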
Business / Architectural Value
- Eliminates expensive SHA256 operations on large files.
- Enables remote-first, metadata-addressable architecture.
- Aligns Git workflows with DRS and Indexd object identity.
- Supports scalable genomics and bioinformatics data management.