From 5b2b50002538700ad38d1ebad2f8c2c5575c3a84 Mon Sep 17 00:00:00 2001
From: Kyle Ellrott
Date: Fri, 20 Feb 2026 10:50:37 -0800
Subject: [PATCH] Adding large project instructions. Cleaning up various navigation issues

---
 docs/calypr/.nav.yml                          |   1 +
 docs/calypr/project-management/.nav.yml       |   1 +
 .../project-management/create-project.md      |   6 +-
 .../project-management/large-projects.md      | 257 ++++++++++++++++++
 docs/tools/.nav.yml                           |  10 +
 docs/tools/data-client/index.md               |   2 +-
 docs/tools/git-drs/.nav.yml                   |   6 +-
 docs/tools/index.md                           |  18 +-
 docs/tools/sifter/docs/config.md              |   4 +-
 docs/tools/sifter/index.md                    |   1 +
 10 files changed, 289 insertions(+), 17 deletions(-)
 create mode 100644 docs/calypr/project-management/large-projects.md

diff --git a/docs/calypr/.nav.yml b/docs/calypr/.nav.yml
index b656803..3a1dc84 100644
--- a/docs/calypr/.nav.yml
+++ b/docs/calypr/.nav.yml
@@ -1,5 +1,6 @@
 title: Calypr
 nav:
+  - index.md
   - Quick Start Guide: quick-start.md
   - Website: website/
   - Data Management: data-management/
diff --git a/docs/calypr/project-management/.nav.yml b/docs/calypr/project-management/.nav.yml
index abc0374..0fe7754 100644
--- a/docs/calypr/project-management/.nav.yml
+++ b/docs/calypr/project-management/.nav.yml
@@ -2,3 +2,4 @@ nav:
   - Create a Project (gen3 + GitHub): create-project.md
   - Project Customization: custom-views.md
   - Publishing project: publishing-project.md
+  - Large Scale Project Management: large-projects.md
diff --git a/docs/calypr/project-management/create-project.md b/docs/calypr/project-management/create-project.md
index 474f144..d2f1a12 100644
--- a/docs/calypr/project-management/create-project.md
+++ b/docs/calypr/project-management/create-project.md
@@ -2,7 +2,9 @@
 
 # Create a Project (gen3 \+ GitHub)
 
-Status: *Manual and DevOps‑only at the moment*
+!!! info "Private Beta"
+    Project creation is currently an admin operation and not available to users. You will need to request
+    that a new project be created for you.
 
 The standard way to start a new Calypr project is to create a Git repository that will hold your FHIR NDJSON files and a set of Git‑LFS tracked files.
 
@@ -12,6 +14,4 @@ For now you will need to ask a Calypr management team to create the project and
 * Calypr project ID
 * Initial git config settings (branch, remotes, etc.)
 
-Future Work: Automate this step with a CLI wizard.
 
-TODO – Write the DevOps‑only project creation guide.
diff --git a/docs/calypr/project-management/large-projects.md b/docs/calypr/project-management/large-projects.md
new file mode 100644
index 0000000..2d4fa99
--- /dev/null
+++ b/docs/calypr/project-management/large-projects.md
@@ -0,0 +1,257 @@
+
+# Large scale project management
+
+How to manage Git LFS repositories with thousands of files.
+
+## 1. Context and Problem Statement
+
+In large projects, it’s common for a Git repository to track **thousands to hundreds of thousands** of files via Git LFS. Typical use cases:
+
+* A research study with many samples (VCFs, BAMs, images, etc.)
+* A data lake-ish repo where each commit adds more LFS pointers
+* Monorepos that aggregate multiple datasets or experiments
+
+In these cases, standard Git LFS introspection commands become **painfully slow**. A concrete example:
+
+```bash
+git lfs ls-files --json
+```
+
+On a repo with thousands of LFS pointers, this can take **several minutes**. That’s a non-starter for:
+
+* Interactive CLI tools
+* Editor/IDE integrations
+* CI/CD steps that run frequently
+
+This note describes architectural patterns to **avoid global enumeration** and keep operations fast and predictable as your LFS population grows.
+
+## 2. Why `git lfs ls-files` is Slow in Large Repos
+
+Conceptually, `git lfs ls-files` must:
+
+1. Walk the Git index / working tree to identify LFS-tracked files.
+2. For each file, resolve and hydrate metadata (pointer, OID, size, etc.).
+3. Optionally serialize to JSON.
+
+Even if the LFS objects are local, this is **O(N)** over every matching file visible to the command. When N = 10,000+, you’re essentially asking Git + Git LFS to do a full scan and re-derive information that:
+
+* Doesn’t change very often, and
+* Could be cached or maintained elsewhere.
+
+From an architecture perspective, the problem is:
+
+> We’re using `git lfs ls-files` as a **query engine and index**, when it’s really just a **dumb enumerator** over the current state.
+
+
+## 3. Design Goals
+
+For a repository with many LFS objects, we want:
+
+1. **Predictable latency**
+   Operations that touch “all LFS files” should be rare and explicit; routine commands should be sub-second, even as the repo grows.
+
+2. **Incremental updates**
+   Avoid full scans of N files when only a handful are new or changed.
+
+3. **Subset operations by default**
+   Most tasks only need a **subset** (by path, tag, type, or commit range), not the full universe.
+
+4. **Separation of metadata from Git internals**
+   Use Git (and Git LFS) as the *transport and integrity layer*, not as a full-featured metadata store.
+
+## 4. Core Architectural Pattern: External LFS Metadata Index
+
+Instead of deriving everything on demand from `git lfs ls-files`, maintain a **separate index** of LFS metadata that is:
+
+* **Versioned** alongside the repo (e.g., tracked TSV/JSON),
+* **Derived incrementally** from Git/LFS events, and
+* **Fast to query** (path lookup, OID lookup, tags, etc.).
+
+### 4.1. Example: `META/lfs_index.tsv`
+
+A simple pattern:
+
+* Maintain a tracked file such as `META/lfs_index.tsv` with columns like:
+
+  ```text
+  path        oid_sha256  size   tags    logical_id
+  data/a.bam  1a2b3c...   12345  tumor   sample:XYZ
+  data/b.bam  4d5e6f...   67890  normal  sample:ABC
+  ```
+
+* This TSV becomes your **primary, fast, queryable index**, *not* `git lfs ls-files`.
+
+Pros:
+
+* Fast queries by path via grep / awk / Python / SQL, without invoking Git.
+* Easy to join with other metadata tables (specimens, assays, etc.).
+* Can be regenerated in a controlled, explicit operation (like `make rebuild-index`).
+
+### 4.2. How to Keep It Up-to-Date
+
+You don’t want manual edits. Use **automation on “add” paths**:
+
+* Use a **pre-commit hook**:
+
+  * For newly staged LFS pointer files, update the index before commit.
+
+This shifts expensive work into the **write path** where it is amortized and expected, and keeps the **read path** (queries) fast.
+
+
+## 5. Avoiding `git lfs ls-files` in Common Operations
+
+### 5.1. Don’t use `ls-files` as your data plane
+
+Refactor any tools that currently run:
+
+```bash
+git lfs ls-files --json | jq ...
+```
+
+to instead read from your **external index** (TSV/JSON/SQLite). For example:
+
+```bash
+# Old, slow (the JSON output wraps the entries in a top-level "files" array):
+git lfs ls-files --json | jq '.files[] | select(.name | test("\\.vcf$"))'
+
+# New, fast:
+awk -F'\t' '$1 ~ /\.vcf$/ {print $0}' META/lfs_index.tsv
+```
+
+or in Python:
+
+```python
+import csv
+
+# Same filter as the awk example: rows whose path ends in .vcf
+with open("META/lfs_index.tsv") as f:
+    for row in csv.DictReader(f, delimiter="\t"):
+        if row["path"].endswith(".vcf"):
+            ...
+```
+
+### 5.2. Use `ls-files` only for rare “rebuild index” operations
+
+When you first introduce the index, you may need a **one-time or occasional** rebuild:
+
+```bash
+git lfs ls-files --all --json > /tmp/lfs_files.json
+# transform into META/lfs_index.tsv
+```
+
+This can take minutes in huge repos, and that’s fine, *as long as it is rare* and documented as a heavy operation (like `npm install`, `docker build`, etc.).
+
+
+## 6. Subset-First Design: Operate on Paths, Tags, or Commits
+
+If you must derive state from Git directly, design your commands to **start with a subset**, not the full repo.
+
+### 6.1. Path-based subsets
+
+For example, instead of:
+
+```bash
+# Scans entire repo
+git lfs ls-files --json
+```
+
+use:
+
+```bash
+# Only data under a project or cohort
+git lfs ls-files --include "data/StudyX/**" --json
+```
+
+and structure your tooling around the concept of **project subtrees** (`data/studyA/`, `data/studyB/`, etc.) so most operations are scoped.
+
+### 6.2. Commit-range subsets
+
+For incremental workflows (ETL, indexing, sync), use git to find changed files:
+
+```bash
+# BASE is a placeholder for the last commit you already indexed.
+# check-attr prints "<path>: filter: lfs" for LFS-tracked paths.
+git diff --name-only "$BASE" HEAD \
+  | git check-attr --stdin filter \
+  | awk -F': ' '$3 == "lfs" {print $1}'   # or similar
+```
+
+Then only examine LFS metadata for **changed files**, merging that into your external index.
+
+
+## 7. Caching and Incremental Computation
+
+If you really want a “`git lfs ls-files --json`-like view,” you can implement your own **cached snapshot**:
+
+1. Keep a file like `.cache/lfs_snapshot.json` keyed by commit hash (`HEAD`).
+2. On invocation:
+
+   * If `HEAD` has not changed, just read the cache.
+   * If `HEAD` changed, compute the diff from the last snapshot and patch the cached JSON.
+
+This means you only pay full-scan costs **when the diff is large**, and usually pay a small, incremental cost.
+
+
+## 8. CI/CD Considerations
+
+In CI, naive patterns like:
+
+```yaml
+- run: git lfs ls-files --json | jq ...
+```
+
+will slow your builds significantly once the LFS population grows.
+
+Better patterns:
+
+* For **linting** or **validation**:
+
+  * Operate on `META/*.tsv` and cross-check with a small sample of pointers.
+* For **publishing** or **sync** steps:
+
+  * Use `git diff` between the last deployed commit and current one to identify only the LFS files that changed.
+* For **health checks**:
+
+  * Schedule a periodic “heavy” job (nightly or weekly) that runs `git lfs ls-files` to verify repo consistency, rather than doing it on every push.
+
+
+## 9. Git + LFS as Transport, Not Primary Index
+
+The underlying architectural theme:
+
+* **Git** is an excellent tool for content addressing, branching, merging, and history.
+* **Git LFS** is an excellent tool for large object transport and storage.
+
+Neither is optimized as a **high-level metadata query system** for tens of thousands of objects.
+
+So:
+
+* Let Git/LFS handle **integrity** and **distribution**.
+* Let a simple, explicit index (TSV/JSON/SQLite, or an external service like Indexd) handle **queries**, **tags**, and **relationships**.
+
+You can always **rebuild** your index from Git LFS if needed, but you shouldn’t be doing that implicitly on every command.
+
+
+## 10. Practical Recommendations / Checklist
+
+When you notice `git lfs ls-files --json` taking minutes:
+
+1. **Audit your tools**
+   * Search for any use of `git lfs ls-files` in scripts, CI configs, and CLIs.
+   * Replace them with operations over an **external index**.
+
+2. **Introduce a canonical LFS index**
+   * Add `META/lfs_index.tsv` (or similar) to the repo.
+   * Define columns: `path`, `oid_sha256`, `size`, `tags`, `logical_id`, etc.
+   * Commit it and treat it as the primary query surface.
+
+3. **Automate index maintenance**
+   * Add a wrapper command or pre-commit hook that updates the index on `git add`.
+   * Provide a “heavy” `rebuild-lfs-index` command that users run explicitly when necessary.
+
+4. **Scope operations by default**
+   * Design new commands to accept `--path`, `--tag`, `--study`, or `--since ` flags.
+   * Document that global “scan everything” commands are expensive and should be infrequent.
+
+5. **Use CI wisely**
+   * Only operate on changed LFS files between commits.
+   * Reserve full LFS integrity checks for scheduled jobs, not every PR.
+
+
diff --git a/docs/tools/.nav.yml b/docs/tools/.nav.yml
index ddc8347..e390e2f 100644
--- a/docs/tools/.nav.yml
+++ b/docs/tools/.nav.yml
@@ -1 +1,11 @@
 title: Tools
+
+nav:
+  - index.md
+  - git-drs
+  - funnel
+  - grip
+  - data-client
+  - forge
+  - sifter
+  
\ No newline at end of file
diff --git a/docs/tools/data-client/index.md b/docs/tools/data-client/index.md
index 2a6000c..602f380 100644
--- a/docs/tools/data-client/index.md
+++ b/docs/tools/data-client/index.md
@@ -6,7 +6,7 @@ title: Data Client
 The `data-client` is the modern CALYPR client library and CLI tool. It serves two primary purposes:
 
 1. **Data Interaction**: A unified interface for uploading, downloading, and managing data in Gen3 Data Commons.
-2. **Permissions Management**: It handles user access and project collaboration, replacing older tools like `calypr_admin`.
+2. **Permissions Management**: It handles user access and project collaboration.
 
 ## Architecture
 
diff --git a/docs/tools/git-drs/.nav.yml b/docs/tools/git-drs/.nav.yml
index c1ce9e1..be7431c 100644
--- a/docs/tools/git-drs/.nav.yml
+++ b/docs/tools/git-drs/.nav.yml
@@ -1,8 +1,10 @@
 title: Git-DRS
 nav:
+  - index.md
   - Overview: index.md
   - Installation: installation.md
-  - Commands: commands.md
+  - Quick Start: quickstart.md
   - Getting Started: getting-started.md
-  - Developer Guide: developer-guide.md
+  - Commands: commands.md
+#  - Developer Guide: developer-guide.md
   - Trouble shooting: troubleshooting.md
diff --git a/docs/tools/index.md b/docs/tools/index.md
index 195f28d..40834a1 100644
--- a/docs/tools/index.md
+++ b/docs/tools/index.md
@@ -1,4 +1,4 @@
-# CALYPR Tools Ecosystem
+# CALYPR Tool Ecosystem
 
 The CALYPR platform provides a suite of powerful, open-source tools designed to handle every stage of the genomic data lifecycle—from ingestion and versioning to distributed analysis and graph-based discovery.
 
@@ -16,17 +16,17 @@ Funnel is a distributed task execution engine that implements the GA4GH Task Exe
 **The Discovery Layer.**
 GRIP (Graph Resource Integration Platform) is a high-performance graph database and query engine designed for complex biological data. It enables analysts to integrate heterogeneous datasets into a unified knowledge graph and perform sophisticated queries that reveal deep relational insights across multi-omic cohorts.
 
+### [Forge](forge/index.md)
+**Project formatting**
+Forge scans a data repository to build an integrated FHIR-based graph of samples and all the files connected to the project. It is responsible for schema checking and database loading. You can use it on the client side to verify and debug your project; on the server side, it is used to load databases.
 
----
+### [Data Client](data-client/index.md)
+A client command line interface for interfacing with the Calypr system.
 
-## Choosing the Right Tool
+### [Sifter](sifter/index.md)
+**Data Transformation**
+Sifter is a tool for rapid data extraction and transformation.
 
-| If you want to... | Use this tool |
-| --- | --- |
-| Version and share large genomic files | **Git-DRS** |
-| Run batch analysis or Nextflow pipelines | **Funnel** |
-| Query complex relationships between datasets | **GRIP** |
-| Access Gen3 data from the command line | **Data Client** |
 
 ---
 
diff --git a/docs/tools/sifter/docs/config.md b/docs/tools/sifter/docs/config.md
index 38ab63d..391e21c 100644
--- a/docs/tools/sifter/docs/config.md
+++ b/docs/tools/sifter/docs/config.md
@@ -1,8 +1,8 @@
 ---
-title: Paramaters
+title: Parameters
 ---
 
-## Paramaters Variables
+## Parameters Variables
 
 Playbooks can be parameterized. They are defined in the `params` section of the playbook YAML file.
 
diff --git a/docs/tools/sifter/index.md b/docs/tools/sifter/index.md
index 30ca242..1e3046c 100644
--- a/docs/tools/sifter/index.md
+++ b/docs/tools/sifter/index.md
@@ -1,6 +1,7 @@
 ---
 title: Sifter
 render_macros: false
+repo_url: https://github.com/bmeg/sifter
 ---
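
Note on `large-projects.md`, sections 4.1 and 4.2 (not part of the patch): the index update that the pre-commit hook is meant to perform could look roughly like the sketch below. It assumes the five TSV columns listed in section 4.1 and the standard Git LFS pointer layout (`version`, `oid sha256:...`, and `size` lines). The function names `parse_lfs_pointer`, `load_index`, `upsert`, and `dump_index` are hypothetical, and the hook wiring that would feed it the staged pointer text is deliberately left out.

```python
import csv
import io

# Column order taken from the META/lfs_index.tsv example in section 4.1.
INDEX_COLUMNS = ["path", "oid_sha256", "size", "tags", "logical_id"]


def parse_lfs_pointer(text):
    """Parse the key/value lines of a Git LFS pointer file into (oid, size)."""
    fields = dict(line.split(" ", 1) for line in text.splitlines() if " " in line)
    if not fields.get("version", "").startswith("https://git-lfs.github.com/spec/"):
        raise ValueError("not a Git LFS pointer")
    algo, _, digest = fields["oid"].partition(":")
    if algo != "sha256":
        raise ValueError(f"unsupported oid type: {algo}")
    return digest, int(fields["size"])


def load_index(tsv_text):
    """Read index TSV text into a dict keyed by path."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {row["path"]: dict(row) for row in reader}


def upsert(index, path, pointer_text, tags="", logical_id=""):
    """Insert or refresh one row; existing tags/logical_id survive unless overridden."""
    oid, size = parse_lfs_pointer(pointer_text)
    old = index.get(path, {})
    index[path] = {
        "path": path,
        "oid_sha256": oid,
        "size": str(size),
        "tags": tags or old.get("tags", ""),
        "logical_id": logical_id or old.get("logical_id", ""),
    }


def dump_index(index):
    """Serialize the index back to TSV text, sorted by path for stable diffs."""
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=INDEX_COLUMNS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    for path in sorted(index):
        writer.writerow(index[path])
    return out.getvalue()
```

A hook built on this would call `upsert` once per newly staged pointer file and write `dump_index(...)` back to `META/lfs_index.tsv`, so the cost stays proportional to the number of changed files rather than the size of the repo. Sorting rows by path is a design choice to keep diffs of the index small and reviewable.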