Replies: 1 comment

See also: https://source.ohsu.edu/CBDS/EVOTypes/pull/5#issuecomment-23597
# 📘 Document Best Practices for Managing Thousands of Files with Git LFS
## 🧩 User Story
As a developer or data steward managing a repository with thousands of large files,
I want to follow best practices and optimization techniques for using Git LFS at scale,
So that cloning, pulling, tracking, and managing files remain performant and maintainable over time.
## ✅ Acceptance Criteria
### ✅ AC1: Add a Documentation Section on Selective Tracking

- A pattern-based tracking example (e.g., `git lfs track "*.bin"`) is provided with explanations (see the sketch below).
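A minimal sketch of what that example could look like; the patterns and the commit message are illustrative, not prescriptive:

```bash
# Track only the large binary formats; source code and configs stay in plain Git.
git lfs track "*.bin"
git lfs track "data/**/*.h5"

# `git lfs track` records these patterns in .gitattributes, which must be
# committed so every clone applies the same rules.
git add .gitattributes
git commit -m "Track large binaries with Git LFS"
```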
### ✅ AC2: Add Guidance on Repository Management

- Instructions for `git lfs prune` are included (see the example below). NOTE: pruning should trigger deletes from indexd as well.
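A sketch of the prune workflow the section could document. Note that `git lfs prune` only removes objects from the local cache; propagating deletes to indexd, as the note above requires, would need a separate integration that is not part of Git LFS itself.

```bash
# Preview what prune would delete before removing anything.
git lfs prune --dry-run

# Remove local LFS objects no longer referenced by recent commits.
git lfs prune

# Extra safety: confirm objects referenced by unpushed commits exist remotely.
git lfs prune --verify-remote
```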
### ✅ AC3: Add Workflow Optimization Techniques

- Shallow clones with `git clone --depth N`.
- Selective LFS pulls with `git lfs pull --include` and `--exclude` (a combined sketch follows this list).
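One way these techniques combine; the repository URL and paths below are placeholders:

```bash
# Shallow clone: fetch only the most recent commit's history.
# GIT_LFS_SKIP_SMUDGE=1 defers all LFS downloads until explicitly pulled.
GIT_LFS_SKIP_SMUDGE=1 git clone --depth 1 https://example.com/org/big-repo.git
cd big-repo

# Fetch LFS content only for the paths you actually need.
git lfs pull --include="data/imaging/*" --exclude="data/raw/*"
```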
## 📚 Use Case Examples

### 1. Research Project with Thousands of Imaging Files
A research team is tracking 25,000 microscopy images and CSV metadata files in a single repository using Git LFS. Over time, pull operations and checkout speeds degrade, and storage usage balloons. By applying these practices:
- Only `.tif` and `.h5` files are LFS-tracked (AC1).
- Team members clone with `--depth=1`, use sparse checkout to limit working directories, and filter LFS pulls by folder (AC3).

This restores a responsive, lightweight Git experience for team members (see the sketch below).
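A sketch of that workflow, with a hypothetical repository URL and folder layout:

```bash
# Shallow clone with LFS downloads deferred (hypothetical URL).
GIT_LFS_SKIP_SMUDGE=1 git clone --depth 1 https://example.com/lab/microscopy.git
cd microscopy

# Limit the working tree to the experiment being analyzed.
git sparse-checkout set experiments/exp-42

# Fetch LFS objects for that folder only.
git lfs pull --include="experiments/exp-42/*"
```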
### 2. Genomics Platform Managing Cloud-based Datasets
A genomics platform maintains Git repositories that reference over 50,000 large files stored in cloud object storage. Using a custom Git LFS transfer agent, each file is tracked by a DRS ID and resolved on-demand from S3, GCS, or Azure. Applying the practices outlined in this guide:
- `git lfs pull --include="assay-1234/*"` ensures minimal bandwidth usage.
- Pruning the `.git/lfs/objects` cache manages the local disk footprint.

These practices allow scalable collaboration without sacrificing performance (a configuration sketch follows).
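Git LFS supports custom transfer agents through the `lfs.customtransfer.<name>.*` configuration keys. A sketch of how such an agent might be wired up; the agent name `drs` and the binary `git-lfs-drs-agent` are hypothetical stand-ins for the platform's DRS resolver:

```bash
# Register a hypothetical DRS-aware transfer agent (binary name assumed).
git config lfs.customtransfer.drs.path "git-lfs-drs-agent"
git config lfs.customtransfer.drs.concurrent true

# Route transfers through it when the server exposes no native LFS endpoint.
git config lfs.standalonetransferagent drs

# Fetch one assay's files, then trim the local object cache.
git lfs pull --include="assay-1234/*"
git lfs prune
```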
### 3. Mixed File Types in a Polyglot Codebase (Kyle’s Use Case)
Kyle is building a broad data science repository that includes scripts, notebooks, configuration files, and large data artifacts. In this case, file type alone isn't sufficient to determine LFS tracking behavior. For example, a small JSON configuration file and a multi-gigabyte JSON data export share the same extension.

A naïve rule like:

```bash
git lfs track "*.json"
```

would wrongly capture both. Instead, more explicit path-based rules should be used, as in the sketch below.
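A sketch assuming a hypothetical layout where small configs live under `config/` and large artifacts under `data/`:

```bash
# Track only JSON under the data directory; config JSON stays in plain Git.
git lfs track "data/**/*.json"

# The resulting .gitattributes line looks like:
#   data/**/*.json filter=lfs diff=lfs merge=lfs -text
git add .gitattributes
```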
This case demonstrates the importance of scoping LFS rules to project layout and not just file extensions.
## 🔍 Additional Notes
This documentation should live under `docs/git-lfs-scaling.md` or an equivalent section such as “Managing Large Repositories” in the main README. We welcome community input and usage examples from other high-scale data environments.