1 change: 1 addition & 0 deletions docs/calypr/.nav.yml
@@ -1,5 +1,6 @@
title: Calypr
nav:
- index.md
- Quick Start Guide: quick-start.md
- Data: data/
- Project Management: project-management/
1 change: 1 addition & 0 deletions docs/calypr/project-management/.nav.yml
@@ -1,4 +1,5 @@
nav:
- Project Customization: custom-views.md
- Publishing project: publishing-project.md
- Large Scale Project Management: large-projects.md
- Calypr Admin: calypr-admin/
6 changes: 3 additions & 3 deletions docs/calypr/project-management/create-project.md
@@ -2,7 +2,9 @@

# Create a Project (Gen3 + GitHub)

Status: *Manual and DevOps‑only at the moment*
!!! info "Private Beta"
    Project creation is currently an admin operation and is not available to users. You will need to
    request a new project to be created for you.

The standard way to start a new Calypr project is to create a Git repository that will hold your FHIR NDJSON files and a set of Git‑LFS tracked files.

@@ -12,6 +14,4 @@ For now you will need to ask the Calypr management team to create the project and
* Calypr project ID
* Initial git config settings (branch, remotes, etc.)

Future Work: Automate this step with a CLI wizard.

TODO – Write the DevOps‑only project creation guide.
257 changes: 257 additions & 0 deletions docs/calypr/project-management/large-projects.md
@@ -0,0 +1,257 @@

# Large scale project management

How to manage Git LFS repositories with thousands of files.

## 1. Context and Problem Statement

In large projects, it’s common for a Git repository to track **thousands to hundreds of thousands** of files via Git LFS. Typical use cases:

* A research study with many samples (VCFs, BAMs, images, etc.)
* A data lake-ish repo where each commit adds more LFS pointers
* Monorepos that aggregate multiple datasets or experiments

In these cases, standard Git LFS introspection commands become **painfully slow**. A concrete example:

```bash
git lfs ls-files --json
```

On a repo with thousands of LFS pointers, this can take **several minutes**. That’s a non-starter for:

* Interactive CLI tools
* Editor/IDE integrations
* CI/CD steps that run frequently

This note describes architectural patterns to **avoid global enumeration** and keep operations fast and predictable as your LFS population grows.

## 2. Why `git lfs ls-files` is Slow in Large Repos

Conceptually, `git lfs ls-files` must:

1. Walk the Git index / working tree to identify LFS-tracked files.
2. For each file, resolve and hydrate metadata (pointer, OID, size, etc.).
3. Optionally serialize to JSON.

Even if the LFS objects are local, this is **O(N)** over every matching file visible to the command. When N = 10,000+, you’re essentially asking Git + Git LFS to do a full scan and re-derive information that:

* Doesn’t change very often, and
* Could be cached or maintained elsewhere.

From an architecture perspective, the problem is:

> We’re using `git lfs ls-files` as a **query engine and index**, when it’s really just a **dumb enumerator** over the current state.


## 3. Design Goals

For a repository with many LFS objects, we want:

1. **Predictable latency**
Operations that touch “all LFS files” should be rare and explicit; routine commands should be sub-second, even as the repo grows.

2. **Incremental updates**
Avoid full scans of N files when only a handful are new or changed.

3. **Subset operations by default**
Most tasks only need a **subset** (by path, tag, type, or commit range), not the full universe.

4. **Separation of metadata from Git internals**
Use Git (and Git LFS) as the *transport and integrity layer*, not as a full-featured metadata store.

## 4. Core Architectural Pattern: External LFS Metadata Index

Instead of deriving everything on demand from `git lfs ls-files`, maintain a **separate index** of LFS metadata that is:

* **Versioned** alongside the repo (e.g., tracked TSV/JSON),
* **Derived incrementally** from Git/LFS events, and
* **Fast to query** (path lookup, OID lookup, tags, etc.).

### 4.1. Example: `META/lfs_index.tsv`

A simple pattern:

* Maintain a tracked file such as `META/lfs_index.tsv` with columns like:

```text
path oid_sha256 size tags logical_id
data/a.bam 1a2b3c... 12345 tumor sample:XYZ
data/b.bam 4d5e6f... 67890 normal sample:ABC
```

* This TSV becomes your **primary, fast, queryable index**, *not* `git lfs ls-files`.

Pros:

* Constant-time query by path via grep / awk / Python / SQL.
* Easy to join with other metadata tables (specimens, assays, etc.).
* Can be regenerated in a controlled, explicit operation (like `make rebuild-index`).
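
If `grep`/`awk` runs out of steam, the same TSV drops straight into SQLite. A minimal sketch, assuming the column layout above (the `data/StudyX/` filter is only an illustration):

```python
"""Sketch: load META/lfs_index.tsv into SQLite for ad-hoc queries and joins.
Assumes the columns shown above; the StudyX path filter is illustrative."""
import csv
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE lfs_index "
    "(path TEXT, oid_sha256 TEXT, size INTEGER, tags TEXT, logical_id TEXT)"
)

with open("META/lfs_index.tsv") as f:
    rows = [
        (r["path"], r["oid_sha256"], int(r["size"]), r["tags"], r["logical_id"])
        for r in csv.DictReader(f, delimiter="\t")
    ]
con.executemany("INSERT INTO lfs_index VALUES (?, ?, ?, ?, ?)", rows)

# Scoped, index-backed queries -- no repo scan involved.
for path, size in con.execute(
    "SELECT path, size FROM lfs_index "
    "WHERE path LIKE 'data/StudyX/%' ORDER BY size DESC"
):
    print(path, size)
```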

### 4.2. How to Keep It Up-to-Date

You don’t want manual edits. Use **automation on “add” paths**:

* Use a **pre-commit hook**: for newly staged LFS pointer files, update the index before the commit lands (a minimal sketch follows below).

This shifts expensive work into the **write path**, where it is amortized and expected, and keeps the **read path** (queries) fast.
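
As a concrete illustration, a minimal hook might look like the Python sketch below. It assumes the TSV layout from section 4.1, reads the *staged* blob (which Git LFS has already cleaned into a pointer), and leaves `tags`/`logical_id` blank for later curation; adapt the details to your own conventions.

```python
#!/usr/bin/env python3
"""Pre-commit hook sketch: append newly staged LFS pointers to the index.
Assumes the TSV columns from section 4.1; tags/logical_id are left blank."""
import subprocess

INDEX = "META/lfs_index.tsv"

def staged_paths():
    out = subprocess.check_output(
        ["git", "diff", "--cached", "--name-only", "--diff-filter=A"], text=True
    )
    return [p for p in out.splitlines() if p]

def parse_pointer(path):
    # Read the staged blob, not the working-tree file, so we see the pointer
    # that will actually be committed.
    blob = subprocess.check_output(
        ["git", "show", f":{path}"], text=True, errors="replace"
    )
    if not blob.startswith("version https://git-lfs.github.com/spec/v1"):
        return None  # not an LFS pointer file
    fields = dict(line.split(" ", 1) for line in blob.splitlines() if " " in line)
    return fields["oid"].removeprefix("sha256:"), fields["size"]

rows = []
for path in staged_paths():
    parsed = parse_pointer(path)
    if parsed:
        oid, size = parsed
        rows.append(f"{path}\t{oid}\t{size}\t\t\n")  # tags, logical_id empty

if rows:
    with open(INDEX, "a") as f:
        f.writelines(rows)
    subprocess.check_call(["git", "add", INDEX])  # stage the index update too
```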


## 5. Avoiding `git lfs ls-files` in Common Operations

### 5.1. Don’t use `ls-files` as your data plane

Refactor any tools that currently:

```bash
git lfs ls-files --json | jq ...
```

to instead read from your **external index** (TSV/JSON/SQLite). For example:

```bash
# Old, slow:
git lfs ls-files --json | jq '.files[] | select(.name | test("\\.vcf$"))'

# New, fast:
awk -F'\t' '$1 ~ /\.vcf$/ {print $0}' META/lfs_index.tsv
```

or in Python:

```python
import csv

with open("META/lfs_index.tsv") as f:
    for row in csv.DictReader(f, delimiter="\t"):
        if row["path"].endswith(".vcf.gz"):
            ...
```

### 5.2. Use `ls-files` only for rare “rebuild index” operations

When you first introduce the index, you may need a **one-time or occasional** rebuild:

```bash
git lfs ls-files --all --json > /tmp/lfs_files.json
# transform into META/lfs_index.tsv
```

This can take minutes in huge repos—and that’s fine, *as long as it is rare* and documented as a heavy operation (like `npm install`, `docker build`, etc.).
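
The transform itself is small. A sketch, assuming `git lfs ls-files --json` emits a top-level `files` array with `name`, `oid`, and `size` keys (verify against your Git LFS version) and the column layout from section 4.1:

```python
"""Sketch: turn `git lfs ls-files --json` output into META/lfs_index.tsv.
Assumes a top-level "files" array with "name", "oid", and "size" keys --
check the actual output of your git-lfs version before relying on this."""
import json

with open("/tmp/lfs_files.json") as f:
    listing = json.load(f)

with open("META/lfs_index.tsv", "w") as out:
    out.write("path\toid_sha256\tsize\ttags\tlogical_id\n")
    for entry in listing.get("files", []):
        # tags and logical_id start empty; curation fills them in later.
        out.write(f"{entry['name']}\t{entry['oid']}\t{entry['size']}\t\t\n")
```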


## 6. Subset-First Design: Operate on Paths, Tags, or Commits

If you must derive state from Git directly, design your commands to **start with a subset**, not the full repo.

### 6.1. Path-based subsets

For example, instead of:

```bash
# Scans entire repo
git lfs ls-files --json
```

use:

```bash
# Only data under a project or cohort
git lfs ls-files --include "data/StudyX/**" --json
```

and structure your tooling around the concept of **project subtrees** (`data/studyA/`, `data/studyB/`, etc.) so most operations are scoped.

### 6.2. Commit-range subsets

For incremental workflows (ETL, indexing, sync), use git to find changed files:

```bash
git diff --name-only <old-commit> <new-commit> \
  | git check-attr --stdin filter \
  | awk -F': ' '$3 == "lfs" {print $1}'   # keep only paths whose filter attribute is lfs

Then only examine LFS metadata for **changed files**, merging that into your external index.


## 7. Caching and Incremental Computation

If you really want a “`git lfs ls-files --json`-like view,” you can implement your own **cached snapshot**:

1. Keep a file like `.cache/lfs_snapshot.json` keyed by commit hash (`HEAD`).
2. On invocation:

* If `HEAD` has not changed, just read the cache.
* If `HEAD` changed, compute the diff from the last snapshot and patch the cached JSON.

This means you only pay full-scan costs **when the diff is large**, and usually pay a small, incremental cost.
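
A minimal sketch of that logic; here `apply_diff` is an illustrative stand-in that falls back to a full rebuild, where a real implementation would patch the cached entries from `git diff`:

```python
"""Sketch of a HEAD-keyed snapshot cache for an ls-files-like view."""
import json
import subprocess
from pathlib import Path

CACHE = Path(".cache/lfs_snapshot.json")

def head():
    return subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()

def full_rebuild():
    # The rare, explicit heavy operation (can take minutes on huge repos).
    out = subprocess.check_output(["git", "lfs", "ls-files", "--json"], text=True)
    return json.loads(out)

def apply_diff(cached, new_commit):
    # Illustrative stand-in: a real version would patch `cached` using
    # `git diff --name-only <cached commit> <new_commit>`.
    return full_rebuild()

def snapshot():
    current = head()
    if CACHE.exists():
        cached = json.loads(CACHE.read_text())
        if cached["commit"] == current:
            return cached["files"]           # fast path: HEAD unchanged
        files = apply_diff(cached, current)  # incremental path
    else:
        files = full_rebuild()               # first run: explicit full scan
    CACHE.parent.mkdir(parents=True, exist_ok=True)
    CACHE.write_text(json.dumps({"commit": current, "files": files}))
    return files
```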


## 8. CI/CD Considerations

In CI, naive patterns like:

```yaml
- run: git lfs ls-files --json | jq ...
```

will slow your builds significantly once the LFS population grows.

Better patterns:

* For **linting** or **validation**: operate on `META/*.tsv` and cross-check with a small sample of pointers.
* For **publishing** or **sync** steps: use `git diff` between the last deployed commit and the current one to identify only the LFS files that changed (a sketch follows below).
* For **health checks**: schedule a periodic “heavy” job (nightly or weekly) that runs `git lfs ls-files` to verify repo consistency, rather than doing it on every push.
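
As an illustration, a publish step might compute the changed LFS paths like this; `LAST_DEPLOYED` is a hypothetical ref or environment value your pipeline would record after each deploy:

```python
"""CI sketch: list LFS-tracked paths changed since the last deploy.
LAST_DEPLOYED is a hypothetical value your pipeline records after publishing."""
import os
import subprocess

last = os.environ.get("LAST_DEPLOYED", "origin/main")
changed = [
    p
    for p in subprocess.check_output(
        ["git", "diff", "--name-only", last, "HEAD"], text=True
    ).splitlines()
    if p
]

# `git check-attr --stdin filter` prints "<path>: filter: lfs" for LFS paths.
attrs = subprocess.run(
    ["git", "check-attr", "--stdin", "filter"],
    input="\n".join(changed),
    capture_output=True,
    text=True,
    check=True,
).stdout

lfs_changed = [
    line.rsplit(": filter: lfs", 1)[0]
    for line in attrs.splitlines()
    if line.endswith(": filter: lfs")
]
print("\n".join(lfs_changed))
```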


## 9. Git + LFS as Transport, Not Primary Index

The underlying architectural theme:

* **Git** is an excellent tool for content addressing, branching, merging, and history.
* **Git LFS** is an excellent tool for large object transport and storage.

Neither is optimized as a **high-level metadata query system** for tens of thousands of objects.

So:

* Let Git/LFS handle **integrity** and **distribution**.
* Let a simple, explicit index (TSV/JSON/SQLite, or an external service like Indexd) handle **queries**, **tags**, and **relationships**.

You can always **rebuild** your index from Git LFS if needed, but you shouldn’t be doing that implicitly on every command.


## 10. Practical Recommendations / Checklist

When you notice `git lfs ls-files --json` taking minutes:

1. **Audit your tools**
* Search for any use of `git lfs ls-files` in scripts, CI configs, and CLIs.
* Replace them with operations over an **external index**.

2. **Introduce a canonical LFS index**
* Add `META/lfs_index.tsv` (or similar) to the repo.
* Define columns: `path`, `oid_sha256`, `size`, `tags`, `logical_id`, etc.
* Commit it and treat it as the primary query surface.

3. **Automate index maintenance**
* Add a wrapper command or pre-commit hook that updates the index on `git add`.
* Provide a “heavy” `rebuild-lfs-index` command that users run explicitly when necessary.

4. **Scope operations by default**
* Design new commands to accept `--path`, `--tag`, `--study`, or `--since <commit>` flags.
* Document that global “scan everything” commands are expensive and should be infrequent.

5. **Use CI wisely**
* Only operate on changed LFS files between commits.
* Reserve full LFS integrity checks for scheduled jobs, not every PR.


10 changes: 10 additions & 0 deletions docs/tools/.nav.yml
@@ -1 +1,11 @@
title: Tools

nav:
- index.md
- git-drs
- funnel
- grip
- data-client
- forge
- sifter

2 changes: 1 addition & 1 deletion docs/tools/data-client/index.md
@@ -6,7 +6,7 @@ title: Data Client

The `data-client` is the modern CALYPR client library and CLI tool. It serves two primary purposes:
1. **Data Interaction**: A unified interface for uploading, downloading, and managing data in Gen3 Data Commons.
2. **Permissions Management**: It handles user access and project collaboration, replacing older tools like `calypr_admin`.
2. **Permissions Management**: It handles user access and project collaboration.

## Architecture

9 changes: 7 additions & 2 deletions docs/tools/git-drs/.nav.yml
@@ -1,5 +1,10 @@
title: Git-DRS
nav:
- Overview: index.md
- Installation: installation.md
- Quick Start: quickstart.md
- Troubleshooting: troubleshooting.md
- Developer Guide: developer-guide.md
- Getting Started: getting-started.md
- Commands: commands.md
# - Developer Guide: developer-guide.md
- Troubleshooting: troubleshooting.md
18 changes: 9 additions & 9 deletions docs/tools/index.md
@@ -1,4 +1,4 @@
# CALYPR Tools Ecosystem
# CALYPR Tool Ecosystem

The CALYPR platform provides a suite of powerful, open-source tools designed to handle every stage of the genomic data lifecycle—from ingestion and versioning to distributed analysis and graph-based discovery.

@@ -16,17 +16,17 @@ Funnel is a distributed task execution engine that implements the GA4GH Task Exe
**The Discovery Layer.**
GRIP (Graph Resource Integration Platform) is a high-performance graph database and query engine designed for complex biological data. It enables analysts to integrate heterogeneous datasets into a unified knowledge graph and perform sophisticated queries that reveal deep relational insights across multi-omic cohorts.

### [Forge](forge/index.md)
**Project formatting**
Forge scans a data repository to build an integrated FHIR-based graph of samples and all the files connected to the project. It is responsible for schema checking and database loading. You can use it client-side to verify and debug your project; on the server side, it is used to load databases.

---
### [Data Client](data-client/index.md)
A command-line interface for interacting with the Calypr system.

## Choosing the Right Tool
### [Sifter](sifter/index.md)
**Data Transformation**
Sifter is a tool for rapid data extraction and transformation.

| If you want to... | Use this tool |
| --- | --- |
| Version and share large genomic files | **Git-DRS** |
| Run batch analysis or Nextflow pipelines | **Funnel** |
| Query complex relationships between datasets | **GRIP** |
| Access Gen3 data from the command line | **Data Client** |

---

4 changes: 2 additions & 2 deletions docs/tools/sifter/docs/config.md
@@ -1,8 +1,8 @@
---
title: Paramaters
title: Parameters
---

## Paramaters Variables
## Parameter Variables

Playbooks can be parameterized; the parameters are defined in the `params` section of the playbook YAML file.

1 change: 1 addition & 0 deletions docs/tools/sifter/index.md
@@ -1,6 +1,7 @@
---
title: Sifter
render_macros: false
repo_url: https://github.com/bmeg/sifter
---

