-
Notifications
You must be signed in to change notification settings - Fork 1
Adding large project instructions. Cleaning up various navigation issues #76
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: refactor/new-site
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,4 +1,5 @@ | ||
| nav: | ||
| - Project Customization: custom-views.md | ||
| - Publishing project: publishing-project.md | ||
| - Large Scale Project Management: large-projects.md | ||
| - Calypr Admin: calypr-admin/ |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,257 @@ | ||
|
|
||
| # Large scale project management | ||
|
|
||
| How to manage a Git LFS Repositories with Thousands of Files. | ||
|
|
||
| ## 1. Context and Problem Statement | ||
|
|
||
| In large projects, it’s common for a Git repository to track **thousands to hundreds of thousands** of files via Git LFS. Typical use cases: | ||
|
|
||
| * A research study with many samples (VCFs, BAMs, images, etc.) | ||
| * A data lake-ish repo where each commit adds more LFS pointers | ||
| * Monorepos that aggregate multiple datasets or experiments | ||
|
|
||
| In these cases, standard Git LFS introspection commands become **painfully slow**. A concrete example: | ||
|
|
||
| ```bash | ||
| git lfs ls-files --json | ||
| ``` | ||
|
|
||
| On a repo with thousands of LFS pointers, this can take **several minutes**. That’s a non-starter for: | ||
|
|
||
| * Interactive CLI tools | ||
| * Editor/IDE integrations | ||
| * CI/CD steps that run frequently | ||
|
|
||
| This note describes architectural patterns to **avoid global enumeration** and keep operations fast and predictable as your LFS population grows. | ||
|
|
||
| ## 2. Why `git lfs ls-files` is Slow in Large Repos | ||
|
|
||
| Conceptually, `git lfs ls-files` must: | ||
|
|
||
| 1. Walk the Git index / working tree to identify LFS-tracked files. | ||
| 2. For each file, resolve and hydrate metadata (pointer, OID, size, etc.). | ||
| 3. Optionally serialize to JSON. | ||
|
|
||
| Even if the LFS objects are local, this is **O(N)** over every matching file visible to the command. When N = 10,000+, you’re essentially asking Git + Git LFS to do a full scan and re-derive information that: | ||
|
|
||
| * Doesn’t change very often, and | ||
| * Could be cached or maintained elsewhere. | ||
|
|
||
| From an architecture perspective, the problem is: | ||
|
|
||
| > We’re using `git lfs ls-files` as a **query engine and index**, when it’s really just a **dumb enumerator** over the current state. | ||
|
|
||
|
|
||
| ## 3. Design Goals | ||
|
|
||
| For a repository with many LFS objects, we want: | ||
|
|
||
| 1. **Predictable latency** | ||
| Operations that touch “all LFS files” should be rare and explicit; routine commands should be sub-second, even as the repo grows. | ||
|
|
||
| 2. **Incremental updates** | ||
| Avoid full scans of N files when only a handful are new or changed. | ||
|
|
||
| 3. **Subset operations by default** | ||
| Most tasks only need a **subset** (by path, tag, type, or commit range), not the full universe. | ||
|
|
||
| 4. **Separation of metadata from Git internals** | ||
| Use Git (and Git LFS) as the *transport and integrity layer*, not as a full-featured metadata store. | ||
|
|
||
| ## 4. Core Architectural Pattern: External LFS Metadata Index | ||
|
|
||
| Instead of deriving everything on demand from `git lfs ls-files`, maintain a **separate index** of LFS metadata that is: | ||
|
|
||
| * **Versioned** alongside the repo (e.g., tracked TSV/JSON), | ||
| * **Derived incrementally** from Git/LFS events, and | ||
| * **Fast to query** (path lookup, OID lookup, tags, etc.). | ||
|
|
||
| ### 4.1. Example: `META/lfs_index.tsv` | ||
|
|
||
| A simple pattern: | ||
|
|
||
| * Maintain a tracked file such as `META/lfs_index.tsv` with columns like: | ||
|
|
||
| ```text | ||
| path oid_sha256 size tags logical_id | ||
| data/a.bam 1a2b3c... 12345 tumor sample:XYZ | ||
| data/b.bam 4d5e6f... 67890 normal sample:ABC | ||
| ``` | ||
|
|
||
| * This TSV becomes your **primary, fast, queryable index**, *not* `git lfs ls-files`. | ||
|
|
||
| Pros: | ||
|
|
||
| * Constant-time query by path via grep / awk / Python / SQL. | ||
| * Easy to join with other metadata tables (specimens, assays, etc.). | ||
| * Can be regenerated in a controlled, explicit operation (like `make rebuild-index`). | ||
|
|
||
| ### 4.2. How to Keep It Up-to-Date | ||
|
|
||
| You don’t want manual edits. Use **automation on “add” paths**: | ||
|
|
||
| * use a **pre-commit hook**: | ||
|
|
||
| * For newly staged LFS pointer files, update the index before commit. | ||
|
|
||
| This shifts expensive work into the **write path** where it is amortized and expected, and keeps the **read path** (queries) fast. | ||
|
|
||
|
|
||
| ## 5. Avoiding `git lfs ls-files` in Common Operations | ||
|
|
||
| ### 5.1. Don’t use `ls-files` as your data plane | ||
|
|
||
| Refactor any tools that currently: | ||
|
|
||
| ```bash | ||
| git lfs ls-files --json | jq ... | ||
| ``` | ||
|
|
||
| to instead read from your **external index** (TSV/JSON/SQLite). For example: | ||
|
|
||
| ```bash | ||
| # Old, slow: | ||
| git lfs ls-files --json | jq '.[] | select(.name|test("VCF$"))' | ||
|
|
||
| # New, fast: | ||
| awk -F'\t' '$1 ~ /\.vcf$/ {print $0}' META/lfs_index.tsv | ||
| ``` | ||
|
|
||
| or in Python: | ||
|
|
||
| ```python | ||
| import csv | ||
|
|
||
| with open("META/lfs_index.tsv") as f: | ||
| for row in csv.DictReader(f, delimiter="\t"): | ||
| if row["path"].endswith(".vcf.gz"): | ||
| ... | ||
| ``` | ||
|
|
||
| ### 5.2. Use `ls-files` only for rare “rebuild index” operations | ||
|
|
||
| When you first introduce the index, you may need a **one-time or occasional** rebuild: | ||
|
|
||
| ```bash | ||
| git lfs ls-files --all --json > /tmp/lfs_files.json | ||
| # transform into META/lfs_index.tsv | ||
| ``` | ||
|
|
||
| This can take minutes in huge repos—and that’s fine, *as long as it is rare* and documented as a heavy operation (like `npm install`, `docker build`, etc.). | ||
|
|
||
|
|
||
| ## 6. Subset-First Design: Operate on Paths, Tags, or Commits | ||
|
|
||
| If you must derive state from Git directly, design your commands to **start with a subset**, not the full repo. | ||
|
|
||
| ### 6.1. Path-based subsets | ||
|
|
||
| For example, instead of: | ||
|
|
||
| ```bash | ||
| # Scans entire repo | ||
| git lfs ls-files --json | ||
| ``` | ||
|
|
||
| use: | ||
|
|
||
| ```bash | ||
| # Only data under a project or cohort | ||
| git lfs ls-files --include "data/StudyX/**" --json | ||
| ``` | ||
|
|
||
| and structure your tooling around the concept of **project subtrees** (`data/studyA/`, `data/studyB/`, etc.) so most operations are scoped. | ||
|
|
||
| ### 6.2. Commit-range subsets | ||
|
|
||
| For incremental workflows (ETL, indexing, sync), use git to find changed files: | ||
|
|
||
| ```bash | ||
| git diff --name-only <old-commit> <new-commit> \ | ||
| | git check-attr --stdin filter \ | ||
| | awk '$2 == "lfs"' # or similar | ||
| ``` | ||
|
|
||
| Then only examine LFS metadata for **changed files**, merging that into your external index. | ||
|
|
||
|
|
||
| ## 7. Caching and Incremental Computation | ||
|
|
||
| If you really want a “`git lfs ls-files --json`-like view,” you can implement your own **cached snapshot**: | ||
|
|
||
| 1. Keep a file like `.cache/lfs_snapshot.json` keyed by commit hash (`HEAD`). | ||
| 2. On invocation: | ||
|
|
||
| * If `HEAD` has not changed, just read the cache. | ||
| * If `HEAD` changed, compute the diff from the last snapshot and patch the cached JSON. | ||
|
|
||
| This means you only pay full-scan costs **when the diff is large**, and usually pay a small, incremental cost. | ||
|
|
||
|
|
||
| ## 8. CI/CD Considerations | ||
|
|
||
| In CI, naive patterns like: | ||
|
|
||
| ```yaml | ||
| - run: git lfs ls-files --json | jq ... | ||
| ``` | ||
|
|
||
| will slow your builds significantly once the LFS population grows. | ||
|
|
||
| Better patterns: | ||
|
|
||
| * For **linting** or **validation**: | ||
|
|
||
| * Operate on `META/*.tsv` and cross-check with a small sample of pointers. | ||
| * For **publishing** or **sync** steps: | ||
|
|
||
| * Use `git diff` between the last deployed commit and current one to identify only the LFS files that changed. | ||
| * For **health checks**: | ||
|
|
||
| * Schedule a periodic “heavy” job (nightly or weekly) that runs `git lfs ls-files` to verify repo consistency, rather than doing it on every push. | ||
|
|
||
|
|
||
| ## 9. Git + LFS as Transport, Not Primary Index | ||
|
|
||
| The underlying architectural theme: | ||
|
|
||
| * **Git** is an excellent tool for content addressing, branching, merging, and history. | ||
| * **Git LFS** is an excellent tool for large object transport and storage. | ||
|
|
||
| Neither is optimized as a **high-level metadata query system** for tens of thousands of objects. | ||
|
|
||
| So: | ||
|
|
||
| * Let Git/LFS handle **integrity** and **distribution**. | ||
| * Let a simple, explicit index (TSV/JSON/SQLite, or an external service like Indexd) handle **queries**, **tags**, and **relationships**. | ||
|
|
||
| You can always **rebuild** your index from Git LFS if needed, but you shouldn’t be doing that implicitly on every command. | ||
|
|
||
|
|
||
| ## 10. Practical Recommendations / Checklist | ||
|
|
||
| When you notice `git lfs ls-files --json` taking minutes: | ||
|
|
||
| 1. **Audit your tools** | ||
| * Search for any use of `git lfs ls-files` in scripts, CI configs, and CLIs. | ||
| * Replace them with operations over an **external index**. | ||
|
|
||
| 2. **Introduce a canonical LFS index** | ||
| * Add `META/lfs_index.tsv` (or similar) to the repo. | ||
| * Define columns: `path`, `oid_sha256`, `size`, `tags`, `logical_id`, etc. | ||
| * Commit it and treat it as the primary query surface. | ||
|
|
||
| 3. **Automate index maintenance** | ||
| * Add a wrapper command or pre-commit hook that updates the index on `git add`. | ||
| * Provide a “heavy” `rebuild-lfs-index` command that users run explicitly when necessary. | ||
|
|
||
| 4. **Scope operations by default** | ||
| * Design new commands to accept `--path`, `--tag`, `--study`, or `--since <commit>` flags. | ||
| * Document that global “scan everything” commands are expensive and should be infrequent. | ||
|
|
||
| 5. **Use CI wisely** | ||
| * Only operate on changed LFS files between commits. | ||
| * Reserve full LFS integrity checks for scheduled jobs, not every PR. | ||
|
|
||
|
|
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1 +1,11 @@ | ||
| title: Tools | ||
|
|
||
| nav: | ||
| - index.md | ||
| - git-drs | ||
| - funnel | ||
| - grip | ||
| - data-client | ||
| - forge | ||
| - sifter | ||
|
|
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
| @@ -1,5 +1,10 @@ | ||||||
| title: Git-DRS | ||||||
| nav: | ||||||
| - index.md | ||||||
|
||||||
| - index.md |
Copilot
AI
Feb 20, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Inconsistent spelling: "Trouble shooting" should be "Troubleshooting" (one word) to match the previous usage in line 8 of the original file.
| - Trouble shooting: troubleshooting.md | |
| - Troubleshooting: troubleshooting.md |
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
| @@ -1,4 +1,4 @@ | ||||||
| # CALYPR Tools Ecosystem | ||||||
| # CALYPR Tool Ecosystem | ||||||
|
|
||||||
| The CALYPR platform provides a suite of powerful, open-source tools designed to handle every stage of the genomic data lifecycle—from ingestion and versioning to distributed analysis and graph-based discovery. | ||||||
|
|
||||||
|
|
@@ -16,17 +16,17 @@ Funnel is a distributed task execution engine that implements the GA4GH Task Exe | |||||
| **The Discovery Layer.** | ||||||
| GRIP (Graph Resource Integration Platform) is a high-performance graph database and query engine designed for complex biological data. It enables analysts to integrate heterogeneous datasets into a unified knowledge graph and perform sophisticated queries that reveal deep relational insights across multi-omic cohorts. | ||||||
|
|
||||||
| ### [Forge](forge/index.md) | ||||||
| **Project formatting** | ||||||
| Forge scans a data repository to build an integrated FHIR based graph of samples and all the files connected to the project. It is resposible for schema checking and database loading. You can use it client side to verify and debug your project and on the server side, it is used to load databases. | ||||||
|
||||||
| Forge scans a data repository to build an integrated FHIR based graph of samples and all the files connected to the project. It is resposible for schema checking and database loading. You can use it client side to verify and debug your project and on the server side, it is used to load databases. | |
| Forge scans a data repository to build an integrated FHIR based graph of samples and all the files connected to the project. It is responsible for schema checking and database loading. You can use it client side to verify and debug your project and on the server side, it is used to load databases. |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,6 +1,7 @@ | ||
| --- | ||
| title: Sifter | ||
| render_macros: false | ||
| repo_url: https://github.com/bmeg/sifter | ||
| --- | ||
|
|
||
|
|
||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Grammar error: "currently is a admin" should be "currently an admin" (remove the duplicate "is" and change "a" to "an").