discussion/Custom-Transfer-Agent-vs-Batch-API-Server

# ADR: Git LFS Integration Strategy — Custom Transfer Agent vs Batch API Server

## Status
Discussion

---

# Background

We are integrating **Git LFS** with a DRS-backed storage system that supports:

* Client-managed buckets
* Authorization bindings (IAM role, workload identity, broker reference)
* Short-lived credential minting at access time

Git LFS offers two primary integration surfaces:

1. **Client-side: Custom Transfer Agent**
2. **Server-side: LFS Batch API**

These are different architectural patterns that serve different use cases.

Both can interact with DRS-backed storage, but they differ significantly in:

- Interoperability
- Federation support
- Security guarantees
- Governance capabilities
- SaaS compatibility
- Scientific reproducibility

# Decision Drivers

- Interoperability with stock Git clients
- Federation across organizations
- Credential isolation and security
- Compatibility with hosted Git platforms
- Multi-user collaboration
- Operational simplicity
- Alignment with GA4GH / DRS design principles

This ADR documents the architectural differences, supported use cases, trade-offs,
and the “Add URL” (external object registration) workflow.

---

# Architecture Overview

## Custom Transfer Agent

```mermaid
flowchart LR
    A[Git Client] --> B[Custom Transfer Agent]
    B --> C[Object Store]
    B --> D[DRS APIs]
```

# Option 1 — Git LFS Custom Transfer Agent (Client-Side)

## Description

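A custom transfer agent is registered in the Git client and, to bypass the Batch API, selected as the standalone transfer agent:

```bash
git config lfs.customtransfer.<name>.path <binary>
git config lfs.standalonetransferagent <name>
```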

Instead of contacting an LFS server, the Git LFS client:

* Invokes the custom agent binary
* Streams object metadata (OID, size)
* Delegates upload/download directly to the agent

The agent is responsible for:

* Authentication
* Storage interaction
* Credential resolution
* Progress reporting

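No LFS Batch API server is involved, provided the agent is configured as a standalone transfer agent (`lfs.standalonetransferagent`). Without that setting, the Git LFS client still calls the Batch API and uses the custom agent only when the server selects it.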
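
The exchange between Git LFS and the agent is line-delimited JSON over stdin/stdout. The sketch below illustrates the event handling under simplifying assumptions; it is not the project's agent. `handle_event`, `main`, and the in-memory `store` dictionary (standing in for the object store / DRS) are hypothetical names, and a real agent would stream bytes and emit `progress` events rather than read whole files.

```python
import json
import sys
import tempfile


def handle_event(msg, store):
    """Map one Git LFS request event to a list of response events.

    `store` is a hypothetical dict {oid: bytes} standing in for the object store.
    """
    event = msg.get("event")
    if event == "init":
        # An empty JSON object signals successful initialization.
        return [{}]
    if event == "upload":
        with open(msg["path"], "rb") as f:
            store[msg["oid"]] = f.read()
        return [{"event": "complete", "oid": msg["oid"]}]
    if event == "download":
        oid = msg["oid"]
        if oid not in store:
            return [{"event": "complete", "oid": oid,
                     "error": {"code": 404, "message": "object not found"}}]
        # Git LFS expects the path of a file containing the downloaded bytes.
        with tempfile.NamedTemporaryFile(delete=False) as tmp:
            tmp.write(store[oid])
        return [{"event": "complete", "oid": oid, "path": tmp.name}]
    # "terminate" and unknown events need no response.
    return []


def main(stdin=None, stdout=None):
    """Read one JSON event per line and write one JSON response per line."""
    stdin = stdin or sys.stdin
    stdout = stdout or sys.stdout
    store = {}
    for line in stdin:
        if not line.strip():
            continue
        msg = json.loads(line)
        for response in handle_event(msg, store):
            stdout.write(json.dumps(response) + "\n")
            stdout.flush()
        if msg.get("event") == "terminate":
            break
```

A real agent binary would call `main()` as its entry point; credential resolution against DRS would replace the `store` lookups.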


### Architecture

```
Git Client
   │
   │ invokes
   ▼
Custom Transfer Agent
   │
   │ interacts with
   ▼
Object Store / DRS
```

---

### Supported Use Cases

| Use Case                                 | Supported |
| ---------------------------------------- | --------- |
| Single-user research workflows           | ✅         |
| Direct upload to S3/GCS via DRS          | ✅         |
| Air-gapped or private deployments        | ✅         |
| Power users with controlled environments | ✅         |
| Experimental storage backends            | ✅         |
| Bypassing Git hosting provider           | ✅         |

---

### Characteristics

* No server required
* Full control over upload semantics
* Easy to embed DRS-specific logic
* Zero dependency on Git hosting platform support
* Tight coupling between client and storage logic

---

### Limitations

#### Not Supported or Problematic

| Use Case                             | Limitation                                  |
| ------------------------------------ | ------------------------------------------- |
| Multi-user collaboration             | ❌ Every user must install the same agent    |
| Public Git hosting integration       | ❌ GitHub/GitLab do not invoke custom agents |
| Transparent user experience          | ❌ Requires client config                    |
| Centralized authorization policy     | ❌ Logic is pushed to clients                |
| Server-side audit of batch requests  | ❌ Harder to enforce uniformly               |
| Fine-grained repo policy enforcement | ❌ Distributed across clients                |
| Web-based uploads                    | ❌ No web compatibility                      |



---

# Option 2: LFS Batch API Server (Server-Side)


## Description

A standard Git LFS server implements:

```
POST /info/lfs/objects/batch
```

The client sends:

```json
{
  "operation": "upload",
  "objects": [
    { "oid": "...", "size": 123 }
  ]
}
```

The server responds with per-object `actions`:

```json
{
  "objects": [
    {
      "oid": "...",
      "actions": {
        "upload": {
          "href": "https://signed-url",
          "header": { ... }
        }
      }
    }
  ]
}
```

The client then uploads directly to object storage using the returned URLs.
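
On the server side, building that response can be sketched as below. This is an illustration under assumptions, not a definitive implementation: `build_batch_response` and the `sign_url` callable are hypothetical, with `sign_url` standing in for whatever the DRS control plane does to resolve an authorization binding and mint a short-lived signed URL. The `transfer`, `expires_in`, and per-object `error` fields follow the Git LFS Batch API.

```python
def build_batch_response(request, sign_url, expires_in=900):
    """Build a Batch API response body for an upload or download request.

    `sign_url(operation, oid)` is a hypothetical callable that returns a
    short-lived signed URL, or None when no URL can be issued for the object.
    """
    operation = request["operation"]
    objects = []
    for obj in request["objects"]:
        entry = {"oid": obj["oid"], "size": obj["size"]}
        href = sign_url(operation, obj["oid"])
        if href is None:
            # Per-object errors are reported inline, not as an HTTP failure.
            entry["error"] = {"code": 404, "message": "Object does not exist"}
        else:
            entry["actions"] = {
                operation: {"href": href, "header": {}, "expires_in": expires_in}
            }
        objects.append(entry)
    return {"transfer": "basic", "objects": objects}
```

Because the URL is minted per request, the server is the single place where repo-scoped policy and audit logging can be applied before any bytes move.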

---

### Architecture

```
Git Client
   │
   │ HTTP
   ▼
LFS Batch API Server
   │
   │ resolves auth binding
   ▼
DRS Control Plane
   │
   ▼
Object Store
```

### Supported Use Cases

| Use Case                       | Supported |
| ------------------------------ | --------- |
| Multi-user collaboration       | ✅         |
| Hosted Git platforms           | ✅         |
| Enterprise SSO / OAuth         | ✅         |
| Centralized policy enforcement | ✅         |
| Auditable batch requests       | ✅         |
| Transparent client UX          | ✅         |
| Repo-scoped storage policies   | ✅         |
| SaaS deployment model          | ✅         |

---

### Characteristics

* Fully compatible with stock Git LFS clients
* No client customization required
* Centralized policy and credential resolution
* Aligns with DRS access-time credential minting
* Enables fine-grained repo-level isolation

---

# Comparison

| Dimension                    | Custom Transfer Agent | Batch API Server |
| ---------------------------- | --------------------- | ---------------- |
| Requires LFS server          | ❌                     | ✅                |
| Requires client modification | ✅                     | ❌                |
| Works with GitHub/GitLab     | ❌                     | ✅                |
| Centralized authorization    | ❌                     | ✅                |
| Repo-scoped isolation        | Weak                  | Strong           |
| Multi-tenant SaaS            | ❌                     | ✅                |
| Air-gapped research          | ✅                     | Possible         |
| Operational simplicity       | Client-heavy          | Server-heavy     |
| Auditable batch control      | Limited               | Strong           |
| GA4GH alignment              | Medium                | High             |

---
 
# Add URL / External Object Registration Semantics

## Overview

The “Add URL” workflow allows registration of a pre-existing external object
without re-uploading it.

It supports:

- Large datasets already stored in cloud buckets
- Cross-project reuse
- Federated research workflows

---

## Definitions

### No User Upload

The client does not upload object bytes.

### No Transfer

No object bytes are transferred by either the client or the server, because the server already knows the object's sha256.

---

# Mode A — URL + sha256 + size

```mermaid
sequenceDiagram
    participant U as User
    participant C as Git Client
    participant S as LFS/DRS Server
    participant O as Object Store

    U->>C: git lfs add-url <url> --sha256 --size
    C->>O: HEAD <url>
    C->>S: Verify sha256 exists
    S-->>C: Exists
    C->>C: Write pointer file
```

Transfer semantics:
- No user upload
- No transfer (if already indexed)
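
A minimal sketch of the Mode A checks and the resulting pointer file, assuming the size reported by `HEAD <url>` is available as `head_size`. The function names and the `index` dictionary (standing in for the server's sha256 index) are hypothetical; the pointer layout follows the Git LFS pointer file format.

```python
import re

POINTER_VERSION = "https://git-lfs.github.com/spec/v1"


def format_pointer(sha256, size):
    """Render a Git LFS pointer file for an object identified by sha256."""
    if not re.fullmatch(r"[0-9a-f]{64}", sha256):
        raise ValueError("sha256 must be 64 lowercase hex characters")
    if size < 0:
        raise ValueError("size must be non-negative")
    return f"version {POINTER_VERSION}\noid sha256:{sha256}\nsize {size}\n"


def register_mode_a(sha256, size, head_size, index):
    """Validate a Mode A registration and return (pointer_text, indexed).

    `head_size` is the size reported by HEAD <url>.
    `index` is a hypothetical dict {sha256: size} for objects the server knows.
    """
    if head_size != size:
        raise ValueError("SIZE_MISMATCH")
    if sha256 in index and index[sha256] != size:
        raise ValueError("SIZE_MISMATCH")
    # indexed == True means no bytes need to move at all ("No transfer").
    return format_pointer(sha256, size), sha256 in index
```

When `indexed` is false, the user still uploads nothing, but the server has not yet confirmed the sha256 and would need to do so before the registration is trusted.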

---

# Mode B — URL Only

```mermaid
sequenceDiagram
    participant U as User
    participant C as Git Client
    participant S as LFS/DRS Server
    participant O as Object Store

    U->>C: git lfs add-url <url>
    C->>O: HEAD <url>
    C->>S: Resolve sha256
    alt Known
        S-->>C: sha256 + size
    else Unknown
        S->>O: GET object (ingest)
        S-->>C: sha256 computed
    end
    C->>C: Write pointer file
```

Transfer semantics:
- No user upload
- Server-side transfer may occur
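
The Known / Unknown branch can be sketched as follows. This is an illustration only: `resolve_sha256`, the `index` dictionary keyed by URL, and the `fetch_chunks` callable (standing in for the server's `GET` against the object store) are hypothetical names.

```python
import hashlib


def resolve_sha256(url, index, fetch_chunks):
    """Return (sha256, size, transferred) for a URL-only registration.

    `index` is a hypothetical dict {url: (sha256, size)} of known objects.
    `fetch_chunks(url)` is a hypothetical callable yielding the object's bytes
    in chunks; it is only invoked when the URL is unknown.
    """
    if url in index:
        sha256, size = index[url]
        return sha256, size, False  # Known: no bytes move
    digest = hashlib.sha256()
    size = 0
    for chunk in fetch_chunks(url):
        # Stream the object so large files are never held fully in memory.
        digest.update(chunk)
        size += len(chunk)
    sha256 = digest.hexdigest()
    index[url] = (sha256, size)
    return sha256, size, True  # Unknown: server-side transfer occurred
```

The third element of the result makes the transfer semantics explicit, so callers can record whether ingestion happened.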

---

# Error Cases

| Condition | Error |
|-----------|--------|
| Size mismatch | SIZE_MISMATCH |
| Checksum mismatch | CHECKSUM_MISMATCH |
| Unstable URL | UNSTABLE_OBJECT_SOURCE |
| Object modified after registration | IMMUTABILITY_VIOLATION |
| Source not accessible | SOURCE_NOT_ACCESSIBLE |
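
One way these conditions could be checked is sketched below. The detection strategy is an assumption, not something this ADR specifies: the source is observed twice (for example before and after ingest) and a change in size or ETag is treated as instability. `Observation`, `check_registration`, and `check_immutability` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Observation:
    """What was seen at the source URL at one point in time (hypothetical)."""
    size: int
    etag: str
    sha256: Optional[str] = None


def check_registration(expected_size, expected_sha256, first, second):
    """Return an error code from the table above, or None if all checks pass.

    `first` and `second` are Observations of the source taken at two points
    in time, or None when the source could not be reached.
    """
    if first is None or second is None:
        return "SOURCE_NOT_ACCESSIBLE"
    if (first.size, first.etag) != (second.size, second.etag):
        return "UNSTABLE_OBJECT_SOURCE"
    if expected_size is not None and second.size != expected_size:
        return "SIZE_MISMATCH"
    if expected_sha256 and second.sha256 and second.sha256 != expected_sha256:
        return "CHECKSUM_MISMATCH"
    return None


def check_immutability(registered, current):
    """Compare the registered Observation with a later one."""
    if current is None:
        return "SOURCE_NOT_ACCESSIBLE"
    if registered != current:
        return "IMMUTABILITY_VIOLATION"
    return None
```

Running `check_immutability` at access time, not only at registration, is what lets the server enforce immutability centrally.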

---

# Architectural Comparison

| Dimension | Custom Agent | Batch API Server |
|------------|-------------|------------------|
| Hosted Git Compatible | No | Yes |
| Centralized Policy | No | Yes |
| Multi-tenant SaaS | No | Yes |
| Auditability | No | Yes |
| Safe Federated Add-URL | No | Yes |
| Immutability Enforcement | Weak | Strong |

---


# Critical Gap: What Is NOT Supported If Only Custom Transfer Agent Is Used

If we rely solely on a custom transfer agent:

1. **Hosted Git integration is impossible**

   * GitHub and GitLab do not execute custom agents.
   * Users cannot push from standard environments.

2. **Multi-user standardization is fragile**

   * Every collaborator must install and configure the agent.
   * Version drift causes inconsistencies.

3. **No centralized policy enforcement**

   * Bucket and authorization logic lives on client machines.
   * Hard to enforce repo-specific storage controls.

4. **No transparent federation**

   * External collaborators cannot push without installing software.
   * Violates "it just works with Git" principle.

5. **Reduced security posture**

   * More credential resolution logic is distributed to clients.
   * Harder to audit access centrally.

6. **No web or CI integration**

   * CI/CD systems must install and configure the custom agent.
   * Web-based file operations cannot use it.

---

# When Custom Transfer Agent Is Appropriate

* Research-only deployments
* Internal platform experiments
* Developer tooling
* Transitional architecture
* Air-gapped or highly controlled environments

---

# When Batch API Server Is Required

* Production multi-tenant environments
* Public Git hosting compatibility
* Enterprise SSO integration
* GA4GH federated ecosystems
* DRS-backed storage federation at scale

---

# Discussion

For a federated, multi-tenant, GA4GH-aligned system:

> A Git LFS Batch API server is required.

The custom transfer agent may remain as a complementary tool but cannot be the sole integration surface.


