# GIT_CLEANUP_GUIDE
ThemisDB maintains strict repository hygiene to ensure fast clones, efficient CI/CD, and clean commit history. This guide documents the cleanup practices and how to avoid common issues.
The following patterns are permanently excluded from the repository via .gitignore and .gitattributes:
- `testdata*/` - Development test datasets (often 5 GB+)
- `test_data_*.json` - Generated test data files
- `test_data_*.csv` - CSV test exports
- `test_data_*.db`, `*.sst` - RocksDB test instances
Why: Test data is ephemeral and should be generated at runtime or fetched from external sources (e.g., Hugging Face datasets).
- `benchmark_results/` - JSON/CSV benchmark outputs
- `benchmarks/benchmark_results/` - Benchmark execution results
- `benchmarks/*.json`, `benchmarks/*.csv` - Individual benchmark metrics
Why: Benchmark results vary by hardware and are better tracked in external dashboards (e.g., DataDog, Prometheus).
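One way to follow this advice is to push results to a metrics backend instead of committing them. The sketch below targets a Prometheus Pushgateway; the gateway URL, job name, and metric name are illustrative assumptions, not part of ThemisDB.

```shell
# Hypothetical: publish a benchmark result to a Prometheus Pushgateway
# instead of committing it to the repository.
PUSHGATEWAY_URL="http://localhost:9091/metrics/job/themis_benchmarks"

# Exposition-format payload (metric name is an illustrative assumption)
payload='# TYPE themis_query_p99_seconds gauge
themis_query_p99_seconds 0.042'

# Uncomment to actually push (requires a running Pushgateway):
#   printf '%s\n' "$payload" | curl --data-binary @- "$PUSHGATEWAY_URL"
```

The Pushgateway accepts the standard Prometheus text exposition format at `/metrics/job/<job>`, so benchmark runs stay queryable without ever touching Git history.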
- `build*/` - CMake build directories
- `*.lib`, `*.a`, `*.o`, `*.obj` - Object files
- `*.exe`, `*.dll`, `*.so` - Binaries
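Taken together, the three groups above can be expressed as a minimal `.gitignore` fragment. This is an illustrative sketch derived from the patterns listed in this guide, not necessarily the repository's actual file:

```gitignore
# Test data (generate at runtime instead)
testdata*/
test_data_*.json
test_data_*.csv
test_data_*.db
*.sst

# Benchmark outputs (track in external dashboards instead)
benchmark_results/
benchmarks/benchmark_results/
benchmarks/*.json
benchmarks/*.csv

# Build artifacts
build*/
*.lib
*.a
*.o
*.obj
*.exe
*.dll
*.so
```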
Repository size guidelines:
- Uncompressed repo: Should stay < 500 MB for fast clones
- Large files: Never commit files > 5 MB
- Total history: Aim for < 1 GB compressed
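The per-file guideline can be enforced with a small helper. This is a sketch: the 5 MB limit matches the guideline above, but the function name is illustrative and not part of the ThemisDB tooling.

```shell
#!/bin/sh
# check_file_size: warn when a single file exceeds the 5 MB guideline.
# Illustrative helper, not shipped with ThemisDB.
MAX_BYTES=$((5 * 1024 * 1024))

check_file_size() {
  file="$1"
  size=$(wc -c < "$file")
  if [ "$size" -gt "$MAX_BYTES" ]; then
    echo "TOO LARGE: $file ($size bytes)"
    return 1
  fi
  echo "OK: $file ($size bytes)"
}
```

Run it over candidate files before staging them, e.g. `check_file_size data/export.json`.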
Before pushing code, verify:
```powershell
# Check for large files being staged (PowerShell).
# Note: `git ls-files -s` does not report sizes; read the staged blob
# via `git cat-file -s :<path>` instead.
git diff --cached --name-only | ForEach-Object {
    $size = [int64](git cat-file -s ":$_")
    if ($size -gt 5MB) {
        Write-Warning "Large file detected: $_ ($([math]::Round($size / 1MB, 2)) MB)"
    }
}
```

```bash
# List staged files larger than 1 MB (POSIX shell)
git diff --cached --name-only -z | xargs -0 -I {} sh -c \
  'size=$(git cat-file -s ":{}"); [ "$size" -gt 1048576 ] && echo "{} - $size bytes"'
```

For local testing and benchmarking:
```bash
# 1. Generate test data locally (NOT in the repo)
./scripts/generate_test_data.sh --size 100K --output /tmp/test_data

# 2. Run benchmarks against the local data
./benchmarks/run_complete_benchmarks.py --data-path /tmp/test_data

# 3. Store results in a LOCAL directory (NOT committed)
mkdir -p ~/.themis_benchmark_results
cp benchmark_results/complete_benchmark_latest.json ~/.themis_benchmark_results/
```

For containerized testing:
```bash
# Mount test data as a volume (NOT baked into the image)
docker run -v /tmp/test_data:/data themis_benchmarks \
  --data-path /data \
  --output /results

# Copy results back
docker cp <container_id>:/results ./benchmark_results_tmp
```

If a large file has already slipped in, remove it:

```bash
# Option A: If the file is still in staging
git reset HEAD <large-file>
git rm --cached <large-file>

# Option B: If the file was committed (rewrites history - dangerous!)
git filter-branch --tree-filter 'rm -f <large-file>' HEAD
# Then: git push -f (coordinate with the team!)

# Option C: Using BFG Repo Cleaner (safer alternative)
bfg --delete-files <filename> .
git reflog expire --expire=now --all && git gc --prune=now --aggressive
```

To apply new .gitignore rules to already-tracked files:
```bash
git rm --cached -r .
git add .
git commit -m "Apply gitignore rules"

# Verify large files are no longer tracked
git ls-files | grep -E "(testdata|benchmark|test_data)" | wc -l
# Should output: 0
```

To check the current repository size:
```bash
git count-objects -v
# Output: count: 45678
#         size: 123456 (kilobytes)
#         in-pack: 34567
#         ...

# Optimize the repository
git gc --aggressive --prune=now
```

Large-file checks also run in CI on every push and pull request (GitHub Actions):

```yaml
name: Validate Repository
on: [push, pull_request]
jobs:
  check-large-files:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Check for large files
        run: |
          large=$(find . -type f -size +5M ! -path './.git/*' ! -path './build*' ! -path './vcpkg*')
          if [ -n "$large" ]; then
            echo "Error: Large files detected:"
            echo "$large"
            exit 1
          fi
```

The .dockerignore mirrors .gitignore:
```
benchmark_results/
testdata*/
test_data_*
build-*/
vcpkg_installed/
*.test.db
```

Install the pre-commit hook:
```bash
pip install pre-commit
pre-commit install
```

Create .pre-commit-config.yaml:

```yaml
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.4.0
    hooks:
      - id: check-added-large-files
        args: ['--maxkb=5000']
      - id: check-json
      - id: check-merge-conflict
```

Install Git LFS for files that legitimately must stay large:
```bash
brew install git-lfs     # macOS
apt-get install git-lfs  # Linux

# Track large files
git lfs track "*.db"
git add .gitattributes
git commit -m "Enable Git LFS for database files"
```

To find the largest files in the repository history:
```bash
# List the ten largest blobs ever committed (size in bytes, then path)
git rev-list --objects --all |
  git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
  awk '$1 == "blob" { print $3, $4 }' |
  sort -n -r | head -10
```

Current repository status:

| Metric | Value | Target |
|---|---|---|
| Uncompressed Size | ~400 MB | < 500 MB ✓ |
| Compressed Size (.git) | ~85 MB | < 150 MB ✓ |
| Number of Files | 8,532 | < 15,000 ✓ |
| Largest Single File | 2.1 MB (docs PDF) | < 5 MB ✓ |
| Benchmark Results | excluded | 0 MB ✓ |
| Test Data | excluded | 0 MB ✓ |
| Directory | Size | Status |
|---|---|---|
| build-msvc/ | 1.2 GB | ✓ Ignored |
| build-wsl/ | 890 MB | ✓ Ignored |
| vcpkg_installed/ | 3.4 GB | ✓ Ignored |
| vcpkg/packages/ | 2.1 GB | ✓ Ignored |
Symptom: Git says you're trying to push a file above the size threshold

Solution:

```bash
# Check what's in staging
git diff --cached --stat | head -20

# Unstage and remove
git reset HEAD <file>
rm <file>
echo "<file>" >> .gitignore
git add .gitignore
git commit -m "Add <file> to .gitignore"
```

Symptom: `fatal: The remote end hung up unexpectedly`
Solution:

```bash
# Check file sizes in the last commit
git show --stat HEAD

# If necessary, undo the commit, drop the large file, and re-commit
git reset --soft HEAD~1
git rm --cached <large-file>
git commit -m "Remove large test data"
```

Symptom: `git clone` takes > 5 minutes
Solution:

```bash
# Shallow clone for a faster initial download
git clone --depth 1 https://github.com/makr-code/ThemisDB.git

# Later, fetch the full history
git fetch --unshallow
```

Do:

- ✓ Keep .gitignore and .gitattributes updated
- ✓ Use external storage (S3, GCS) for test data
- ✓ Store benchmark results in time-series DB (InfluxDB, Prometheus)
- ✓ Use Git hooks to prevent large file commits
- ✓ Document ephemeral data generation in README
Don't:

- ✗ Commit test datasets (> 100 MB)
- ✗ Commit build artifacts (use CI/CD instead)
- ✗ Commit benchmark results (use dashboards instead)
- ✗ Use `git add .` without reviewing changes first
- ✗ Force push to `main` without coordination
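The "use Git hooks" rule above can be sketched as a pre-push hook. Everything here is illustrative: the 5 MB limit follows this guide, and `filter_large` is a hypothetical helper, not a hook shipped with ThemisDB.

```shell
#!/bin/sh
# Sketch of .git/hooks/pre-push (mark it executable with chmod +x).
# Goal: refuse to push while files over 5 MB are tracked in HEAD.
LIMIT=$((5 * 1024 * 1024))

# filter_large: read `git ls-tree -r -l` output on stdin and print
# "size path" for every blob larger than the limit given as $1
# (submodule entries report "-" for size and are skipped).
filter_large() {
  awk -v limit="$1" '$4 != "-" && $4 + 0 > limit { print $4, $5 }'
}

# In the actual hook you would wire it up like this (kept as a comment so
# the sketch stays inert outside a repository):
#   offenders=$(git ls-tree -r -l HEAD | filter_large "$LIMIT")
#   if [ -n "$offenders" ]; then
#     echo "push rejected: tracked files exceed $LIMIT bytes:" >&2
#     echo "$offenders" >&2
#     exit 1
#   fi
```

Note that `awk` splits on whitespace, so this sketch mishandles paths containing spaces; a production hook would use `git ls-tree -z` and a more careful parser.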
- GitHub: Removing sensitive data from repository
- Pro Git: Git Internals
- Git LFS Documentation
- BFG Repo Cleaner
Last Updated: 2025-12-04
Maintainer: ThemisDB Development Team
Full documentation: https://makr-code.github.io/ThemisDB/