Replies: 1 comment

See also: https://source.ohsu.edu/CBDS/EVOTypes/pull/5#issuecomment-23597
# 📘 Document Best Practices for Managing Thousands of Files with Git LFS
## 🧩 User Story
As a developer or data steward managing a repository with thousands of large files,
I want to follow best practices and optimization techniques for using Git LFS at scale,
So that cloning, pulling, tracking, and managing files remain performant and maintainable over time.
## ✅ Acceptance Criteria
### ✅ AC1: Add a Documentation Section on Selective Tracking

- A pattern-based tracking example (e.g., `git lfs track "*.bin"`) is provided with explanations (see the sketch below).
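A minimal sketch of what that example could look like; the patterns and the commit message are illustrative, not prescriptive:

```bash
# Track only the large binary formats; source code and configs stay in plain Git.
git lfs track "*.bin"
git lfs track "data/**/*.h5"

# `git lfs track` records these patterns in .gitattributes, which must be
# committed so every clone applies the same rules.
git add .gitattributes
git commit -m "Track large binaries with Git LFS"
```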
### ✅ AC2: Add Guidance on Repository Management

- Instructions for `git lfs prune` are included (see the example below). NOTE: pruning should trigger deletes from indexd as well.
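A sketch of the prune workflow the section could document. Note that `git lfs prune` only removes objects from the local cache; propagating deletes to indexd, as the note above requires, would need a separate integration that is not part of Git LFS itself.

```bash
# Preview what prune would delete before removing anything.
git lfs prune --dry-run

# Remove local LFS objects no longer referenced by recent commits.
git lfs prune

# Extra safety: confirm objects referenced by unpushed commits exist remotely.
git lfs prune --verify-remote
```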
### ✅ AC3: Add Workflow Optimization Techniques

- Shallow clones with `git clone --depth N`.
- Selective LFS pulls with `git lfs pull --include` and `--exclude` (a combined sketch follows this list).
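One way these techniques combine; the repository URL and paths below are placeholders:

```bash
# Shallow clone: fetch only the most recent commit's history.
# GIT_LFS_SKIP_SMUDGE=1 defers all LFS downloads until explicitly pulled.
GIT_LFS_SKIP_SMUDGE=1 git clone --depth 1 https://example.com/org/big-repo.git
cd big-repo

# Fetch LFS content only for the paths you actually need.
git lfs pull --include="data/imaging/*" --exclude="data/raw/*"
```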
## 📚 Use Case Examples

### 1. Research Project with Thousands of Imaging Files
A research team is tracking 25,000 microscopy images and CSV metadata files in a single repository using Git LFS. Over time, pull operations and checkout speeds degrade, and storage usage balloons. By applying these practices:
- Only `.tif` and `.h5` files are LFS-tracked (AC1).
- Team members clone with `--depth=1`, use sparse checkout to limit working directories, and filter LFS pulls by folder (AC3).

This restores a responsive, lightweight Git experience for team members (see the sketch below).
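A sketch of that workflow, with a hypothetical repository URL and folder layout:

```bash
# Shallow clone with LFS downloads deferred (hypothetical URL).
GIT_LFS_SKIP_SMUDGE=1 git clone --depth 1 https://example.com/lab/microscopy.git
cd microscopy

# Limit the working tree to the experiment being analyzed.
git sparse-checkout set experiments/exp-42

# Fetch LFS objects for that folder only.
git lfs pull --include="experiments/exp-42/*"
```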
### 2. Genomics Platform Managing Cloud-based Datasets
A genomics platform maintains Git repositories that reference over 50,000 large files stored in cloud object storage. Using a custom Git LFS transfer agent, each file is tracked by a DRS ID and resolved on-demand from S3, GCS, or Azure. Applying the practices outlined in this guide:
- `git lfs pull --include="assay-1234/*"` ensures minimal bandwidth usage.
- Pruning the `.git/lfs/objects` cache manages the local disk footprint.

These practices allow scalable collaboration without sacrificing performance (a configuration sketch follows).
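Git LFS supports custom transfer agents through the `lfs.customtransfer.<name>.*` configuration keys. A sketch of how such an agent might be wired up; the agent name `drs` and the binary `git-lfs-drs-agent` are hypothetical stand-ins for the platform's DRS resolver:

```bash
# Register a hypothetical DRS-aware transfer agent (binary name assumed).
git config lfs.customtransfer.drs.path "git-lfs-drs-agent"
git config lfs.customtransfer.drs.concurrent true

# Route transfers through it when the server exposes no native LFS endpoint.
git config lfs.standalonetransferagent drs

# Fetch one assay's files, then trim the local object cache.
git lfs pull --include="assay-1234/*"
git lfs prune
```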
### 3. Mixed File Types in a Polyglot Codebase (Kyle’s Use Case)
Kyle is building a broad data science repository that includes scripts, notebooks, configuration files, and large data artifacts. In this case, file type alone isn't sufficient to determine LFS tracking behavior. For example, a small JSON configuration file and a multi-gigabyte JSON data export share the same extension.

A naïve rule like:

```bash
git lfs track "*.json"
```

would wrongly capture both. Instead, more explicit path-based rules should be used, as in the sketch below.
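A sketch assuming a hypothetical layout where small configs live under `config/` and large artifacts under `data/`:

```bash
# Track only JSON under the data directory; config JSON stays in plain Git.
git lfs track "data/**/*.json"

# The resulting .gitattributes line looks like:
#   data/**/*.json filter=lfs diff=lfs merge=lfs -text
git add .gitattributes
```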
This case demonstrates the importance of scoping LFS rules to project layout and not just file extensions.
## 🔍 Additional Notes
This documentation should live under `docs/git-lfs-scaling.md` or an equivalent section such as “Managing Large Repositories” in the main README. We welcome community input and usage examples from other high-scale data environments.