ADR: Git LFS Integration Strategy — Custom Transfer Agent vs Batch API Server
Status
Discussion
Background
We are integrating Git LFS with a DRS-backed storage system that supports:
- Client-managed buckets
- Authorization bindings (IAM role, workload identity, broker reference)
- Short-lived credential minting at access time
Git LFS offers two primary integration surfaces:
- Client-side: Custom Transfer Agent
- Server-side: LFS Batch API
These are fundamentally different architectural patterns and support different use cases.
Both can interact with DRS-backed storage, but they differ significantly in:
- Interoperability
- Federation support
- Security guarantees
- Governance capabilities
- SaaS compatibility
- Scientific reproducibility
Decision Drivers
- Interoperability with stock Git clients
- Federation across organizations
- Credential isolation and security
- Compatibility with hosted Git platforms
- Multi-user collaboration
- Operational simplicity
- Alignment with GA4GH / DRS design principles
This ADR documents the architectural differences, supported use cases, trade-offs,
and the “Add URL” (external object registration) workflow.
Architecture Overview
Custom Transfer Agent
flowchart LR
A[Git Client] --> B[Custom Transfer Agent]
B --> C[Object Store]
B --> D[DRS APIs]
Option 1: Git LFS Custom Transfer Agent (Client-Side)
Description
A custom transfer agent is configured in the Git client via:
git config lfs.customtransfer.<name>.path <binary>

Instead of contacting an LFS server, the Git LFS client:
- Invokes the custom agent binary
- Streams object metadata (OID, size)
- Delegates upload/download directly to the agent
The agent is responsible for:
- Authentication
- Storage interaction
- Credential resolution
- Progress reporting
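No LFS Batch API server is involved, provided the agent is also configured as the standalone transfer agent (lfs.standalonetransferagent). Without that setting, Git LFS still calls the Batch API before invoking the agent.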
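The exchange between Git LFS and the agent can be sketched as a loop over line-delimited JSON messages on stdin and stdout. The sketch below follows the message shapes of the Git LFS custom transfer protocol (init, upload, download, terminate) as we understand them. The in-memory `store` dict is a hypothetical stand-in for the real object store and DRS calls, and the error code is illustrative.

```python
import json


def handle_event(msg, store):
    """Return the list of response messages for one protocol event.

    `store` is a hypothetical in-memory dict mapping OID -> local path,
    standing in for the real object store / DRS interaction.
    """
    event = msg.get("event")
    if event == "init":
        # An empty JSON object acknowledges successful initialization.
        return [{}]
    if event == "upload":
        oid, size = msg["oid"], msg["size"]
        store[oid] = msg["path"]  # placeholder for the real upload
        return [
            {"event": "progress", "oid": oid,
             "bytesSoFar": size, "bytesSinceLast": size},
            {"event": "complete", "oid": oid},
        ]
    if event == "download":
        oid = msg["oid"]
        if oid not in store:
            return [{"event": "complete", "oid": oid,
                     "error": {"code": 404, "message": "object not found"}}]
        # A completed download reports where the bytes were written.
        return [{"event": "complete", "oid": oid, "path": store[oid]}]
    # "terminate" (and anything unrecognized) gets no response.
    return []


def serve(stdin, stdout, store):
    """Read one JSON message per line and write one JSON response per line."""
    for line in stdin:
        line = line.strip()
        if not line:
            continue
        msg = json.loads(line)
        for response in handle_event(msg, store):
            stdout.write(json.dumps(response) + "\n")
            stdout.flush()
        if msg.get("event") == "terminate":
            break
```

A real agent would call `serve(sys.stdin, sys.stdout, ...)` and replace the dict with authenticated storage calls, which is where the coupling between client and storage logic comes from.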
Architecture
Git Client
│
│ invokes
▼
Custom Transfer Agent
│
│ interacts with
▼
Object Store / DRS
Supported Use Cases
| Use Case | Supported |
|---|---|
| Single-user research workflows | ✅ |
| Direct upload to S3/GCS via DRS | ✅ |
| Air-gapped or private deployments | ✅ |
| Power users with controlled environments | ✅ |
| Experimental storage backends | ✅ |
| Bypassing Git hosting provider | ✅ |
Characteristics
- No server required
- Full control over upload semantics
- Easy to embed DRS-specific logic
- Zero dependency on Git hosting platform support
- Tight coupling between client and storage logic
Limitations
Not Supported or Problematic
| Use Case | Limitation |
|---|---|
| Multi-user collaboration | ❌ Every user must install an identical agent |
| Public Git hosting integration | ❌ GitHub/GitLab do not invoke custom agents |
| Transparent user experience | ❌ Requires client config |
| Centralized authorization policy | ❌ Logic is pushed to clients |
| Server-side audit of batch requests | ❌ Harder to enforce uniformly |
| Fine-grained repo policy enforcement | ❌ Distributed across clients |
| Web-based uploads | ❌ No web compatibility |
Option 2: LFS Batch API Server (Server-Side)
Description
A standard Git LFS server implements:
POST /info/lfs/objects/batch
The client sends:
{
  "operation": "upload",
  "objects": [
    { "oid": "...", "size": 123 }
  ]
}

The server responds with per-object actions:

{
  "objects": [
    {
      "oid": "...",
      "actions": {
        "upload": {
          "href": "https://signed-url",
          "header": { ... }
        }
      }
    }
  ]
}

The client then uploads directly to object storage using returned URLs.
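As a rough illustration of the server side of this exchange, the sketch below builds a batch response from a request body. It is a minimal sketch, not a full server: `sign_url` is a hypothetical callable standing in for whatever pre-signing mechanism the object store provides, and the 900-second default lifetime is an assumption.

```python
def build_batch_response(request, sign_url, expires_in=900):
    """Build a Batch API response body for an upload or download request.

    `sign_url(operation, oid)` is a hypothetical callable that returns a
    pre-signed object-store URL. `expires_in` is the URL lifetime in seconds.
    """
    operation = request["operation"]
    if operation not in ("upload", "download"):
        raise ValueError(f"unsupported operation: {operation}")

    objects = []
    for obj in request["objects"]:
        objects.append({
            "oid": obj["oid"],
            "size": obj["size"],
            "actions": {
                # The action key mirrors the requested operation.
                operation: {
                    "href": sign_url(operation, obj["oid"]),
                    "expires_in": expires_in,
                }
            },
        })
    return {"transfer": "basic", "objects": objects}
```

Because the URL is minted per request, the server can apply repository policy and record the request before any bytes move.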
Architecture
Git Client
│
│ HTTP
▼
LFS Batch API Server
│
│ resolves auth binding
▼
DRS Control Plane
│
▼
Object Store
Supported Use Cases
| Use Case | Supported |
|---|---|
| Multi-user collaboration | ✅ |
| Hosted Git platforms | ✅ |
| Enterprise SSO / OAuth | ✅ |
| Centralized policy enforcement | ✅ |
| Auditable batch requests | ✅ |
| Transparent client UX | ✅ |
| Repo-scoped storage policies | ✅ |
| SaaS deployment model | ✅ |
Characteristics
- Fully compatible with stock Git LFS clients
- No client customization required
- Centralized policy and credential resolution
- Aligns with DRS access-time credential minting
- Enables fine-grained repo-level isolation
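Centralized credential resolution with repo-level isolation could look roughly like the sketch below. Everything here is hypothetical: the `Binding` and `Credential` types, the repo-to-binding table, and the random placeholder token, which stands in for a call to the cloud provider or credential broker.

```python
import secrets
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class Binding:
    bucket: str
    kind: str       # "iam_role", "workload_identity", or "broker_reference"
    reference: str  # hypothetical identifier, e.g. a role name


@dataclass(frozen=True)
class Credential:
    bucket: str
    token: str
    expires_at: float  # seconds since the epoch


def mint_credential(repo, bindings, ttl_seconds=900, now=None):
    """Mint a short-lived credential scoped to the bucket bound to `repo`."""
    binding = bindings.get(repo)
    if binding is None:
        # A repository without a binding never receives storage access.
        raise PermissionError(f"no storage binding for repo {repo!r}")
    issued = time.time() if now is None else now
    # Placeholder token: a real server would exchange the binding for
    # provider-issued credentials at this point.
    token = secrets.token_urlsafe(16)
    return Credential(binding.bucket, token, issued + ttl_seconds)
```

Keeping this lookup on the server means a client never sees a binding, only a credential that expires shortly after the transfer.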
Comparison
| Dimension | Custom Transfer Agent | Batch API Server |
|---|---|---|
| Requires LFS server | ❌ | ✅ |
| Requires client modification | ✅ | ❌ |
| Works with GitHub/GitLab | ❌ | ✅ |
| Centralized authorization | ❌ | ✅ |
| Repo-scoped isolation | Weak | Strong |
| Multi-tenant SaaS | ❌ | ✅ |
| Air-gapped research | ✅ | Possible |
| Operational simplicity | Client-heavy | Server-heavy |
| Auditable batch control | Limited | Strong |
| GA4GH alignment | Medium | High |
Add URL / External Object Registration Semantics
Overview
The “Add URL” workflow allows registration of a pre-existing external object
without re-uploading it.
Supports:
- Large datasets already stored in cloud buckets
- Cross-project reuse
- Federated research workflows
Definitions
No User Upload
The client does not upload object bytes.
No Transfer
No object bytes are transferred at all (server already knows sha256).
Mode A — URL + sha256 + size
sequenceDiagram
participant U as User
participant C as Git Client
participant S as LFS/DRS Server
participant O as Object Store
U->>C: git lfs add-url <url> --sha256 --size
C->>O: HEAD <url>
C->>S: Verify sha256 exists
S-->>C: Exists
C->>C: Write pointer file
Transfer semantics:
- No user upload
- No transfer (if already indexed)
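The final "Write pointer file" step needs only the sha256 and the size. A minimal sketch of producing the pointer text in the standard Git LFS pointer format follows; the function name and the validation rules are our own.

```python
import re

POINTER_VERSION = "https://git-lfs.github.com/spec/v1"


def make_pointer(sha256_hex, size):
    """Return the contents of a Git LFS pointer file for a known object."""
    if not re.fullmatch(r"[0-9a-f]{64}", sha256_hex):
        raise ValueError("sha256 must be 64 lowercase hex characters")
    if size < 0:
        raise ValueError("size must be non-negative")
    # The version line comes first, followed by the remaining keys in
    # alphabetical order, each terminated by a newline.
    return (
        f"version {POINTER_VERSION}\n"
        f"oid sha256:{sha256_hex}\n"
        f"size {size}\n"
    )
```

Since the pointer is derived entirely from metadata, Mode A can complete without moving any object bytes.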
Mode B — URL Only
sequenceDiagram
participant U as User
participant C as Git Client
participant S as LFS/DRS Server
participant O as Object Store
U->>C: git lfs add-url <url>
C->>O: HEAD <url>
C->>S: Resolve sha256
alt Known
S-->>C: sha256 + size
else Unknown
S->>O: GET object (ingest)
S-->>C: sha256 computed
end
C->>C: Write pointer file
Transfer semantics:
- No user upload
- Server-side transfer may occur
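The known/unknown branch in Mode B can be sketched as follows. This is an illustration under assumptions: `index` is a hypothetical mapping of URLs the server has already resolved, and `open_stream` is a hypothetical callable that returns a binary file-like object for the remote bytes.

```python
import hashlib


def resolve_sha256(url, index, open_stream, chunk_size=1 << 20):
    """Return (sha256_hex, size, transferred) for `url`.

    `index` maps url -> (sha256_hex, size) for already-known objects.
    `open_stream(url)` returns a readable binary stream (hypothetical).
    """
    if url in index:
        sha, size = index[url]
        return sha, size, False  # known: no bytes are transferred

    # Unknown: the server ingests the object once, hashing it in chunks so
    # that memory use stays bounded regardless of object size.
    digest = hashlib.sha256()
    size = 0
    with open_stream(url) as stream:
        while chunk := stream.read(chunk_size):
            digest.update(chunk)
            size += len(chunk)
    sha = digest.hexdigest()
    index[url] = (sha, size)
    return sha, size, True
```

The third return value makes the transfer semantics explicit: the user never uploads, but the first registration of an unknown URL costs one server-side read.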
Error Cases
| Condition | Error |
|---|---|
| Size mismatch | SIZE_MISMATCH |
| Checksum mismatch | CHECKSUM_MISMATCH |
| Unstable URL | UNSTABLE_OBJECT_SOURCE |
| Object modified after registration | IMMUTABILITY_VIOLATION |
| Source not accessible | SOURCE_NOT_ACCESSIBLE |
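One way to map these conditions onto checks is sketched below. The interpretation is an assumption on our part: the source is observed twice, a changed version marker (here a hypothetical `etag` field) between observations is treated as an unstable source, and immutability is checked separately against the values recorded at registration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ObjectFacts:
    sha256: str
    size: int
    etag: str  # hypothetical version marker, e.g. from a HEAD response


def check_registration(declared_sha256, declared_size, first, second) -> Optional[str]:
    """Return an error code, or None if the registration is acceptable.

    `first` and `second` are two observations of the source, or None when
    the source could not be read.
    """
    if first is None or second is None:
        return "SOURCE_NOT_ACCESSIBLE"
    if first.etag != second.etag:
        return "UNSTABLE_OBJECT_SOURCE"
    if first.size != declared_size:
        return "SIZE_MISMATCH"
    if first.sha256 != declared_sha256:
        return "CHECKSUM_MISMATCH"
    return None


def check_immutability(registered, current) -> Optional[str]:
    """Compare the facts recorded at registration with a later observation."""
    if current is None:
        return "SOURCE_NOT_ACCESSIBLE"
    if (current.sha256, current.size) != (registered.sha256, registered.size):
        return "IMMUTABILITY_VIOLATION"
    return None
```

Checking size before checksum is a deliberate ordering: a size comparison is cheap and catches many mismatches before any hashing is needed.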
Architectural Comparison
| Dimension | Custom Agent | Batch API Server |
|---|---|---|
| Hosted Git Compatible | No | Yes |
| Centralized Policy | No | Yes |
| Multi-tenant SaaS | No | Yes |
| Auditability | Limited | Strong |
| Safe Federated Add-URL | No | Yes |
| Immutability Enforcement | Weak | Strong |
Critical Gap: What Is NOT Supported If Only Custom Transfer Agent Is Used
If we rely solely on a custom transfer agent:
- Hosted Git integration is impossible
  - GitHub and GitLab do not execute custom agents.
  - Users cannot push from standard environments.
- Multi-user standardization is fragile
  - Every collaborator must install and configure the agent.
  - Version drift causes inconsistencies.
- No centralized policy enforcement
  - Bucket and authorization logic lives on client machines.
  - Hard to enforce repo-specific storage controls.
- No transparent federation
  - External collaborators cannot push without installing software.
  - Violates "it just works with Git" principle.
- Reduced security posture
  - More credential resolution logic distributed to clients.
  - Harder to audit access centrally.
- No web or CI integration
  - CI/CD systems require custom agent install.
  - Web-based file operations cannot use it.
When Custom Transfer Agent Is Appropriate
- Research-only deployments
- Internal platform experiments
- Developer tooling
- Transitional architecture
- Air-gapped or highly controlled environments
When Batch API Server Is Required
- Production multi-tenant environments
- Public Git hosting compatibility
- Enterprise SSO integration
- GA4GH federated ecosystems
- DRS-backed storage federation at scale
Discussion
For a federated, multi-tenant, GA4GH-aligned system:
A Git LFS Batch API server is required.
The custom transfer agent may remain as a complementary tool but cannot be the sole integration surface.