fix(ci): resolve disk space issues and optimize Docker image sizes #46
- Remove conditional cleanup that skipped main branch builds
- Upgrade PyTorch CUDA image builds to aggressive cleanup (jlumbroso/free-disk-space)
- Add docker system prune to all Docker build jobs
- Add df -h for disk space visibility

PyTorch CUDA images require 25-30GB during build. Manual cleanup provides 17-19GB (insufficient), while aggressive cleanup provides 34-39GB (safe).

Affected jobs:
- docker-test-lb: fixed failing builds
- docker-main-gpu: upgraded from manual to aggressive cleanup
- docker-prod-gpu: upgraded from manual to aggressive cleanup
- All LB jobs: now use aggressive cleanup consistently

Fixes: https://github.com/runpod-workers/worker-tetra/actions/runs/20802819754
Add intelligent detection and automatic installation of build-essential when pip installation fails due to missing compilation tools.

How it works:
1. Attempt pip install
2. If failure detected with gcc/compiler errors -> auto-install build-essential
3. Retry pip install with build tools available
4. User sees transparent success (no manual intervention needed)

Detection patterns:
- gcc/g++/cc command not found
- unable to execute 'gcc'
- distutils compilation errors
- _distutils_hack failures

Benefits:
- Enables removal of build-essential from base images (400MB savings)
- Most users (pre-built wheels) get smaller images
- Users needing compilation get automatic fallback
- No user-facing changes or manual system_dependencies needed

Trade-offs:
- First compile-needing install ~30-60s slower (one-time cost)
- Nala ensures fast parallel installation of build-essential
- Error detection relies on known error patterns

Test coverage:
- Added 11 new unit tests covering detection and retry logic
- Tests verify error pattern matching (gcc, cc, g++, distutils)
- Tests verify auto-retry behavior with and without nala
- Tests verify no false positives on unrelated errors
- All 127 tests passing, coverage increased to 77.30%
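The four-step flow in this commit message can be sketched roughly as follows. This is a minimal illustration and not the code in `src/dependency_installer.py`: the names `COMPILER_ERROR_PATTERNS`, `run_command`, `needs_compilation`, and `install_with_retry` are hypothetical, the pattern list is a small assumed subset, and `apt-get` stands in for the nala-accelerated path the PR describes.

```python
import subprocess
from typing import Callable, Sequence

# Hypothetical subset of compiler-missing phrases; the PR's real list is not shown here.
COMPILER_ERROR_PATTERNS = (
    "gcc: command not found",
    "g++: command not found",
    "cc: command not found",
    "unable to execute 'gcc'",
)


def run_command(cmd: Sequence[str]) -> subprocess.CompletedProcess:
    """Run a command and capture its output; each call has its own 300 s timeout."""
    return subprocess.run(list(cmd), capture_output=True, text=True, timeout=300)


def needs_compilation(output: str) -> bool:
    """Return True if the output looks like a missing-compiler failure."""
    lowered = output.lower()
    return any(pattern in lowered for pattern in COMPILER_ERROR_PATTERNS)


def install_with_retry(
    packages: Sequence[str],
    runner: Callable[[Sequence[str]], subprocess.CompletedProcess] = run_command,
) -> bool:
    """Try pip once; on a compiler error, install build tools and retry exactly once."""
    pip_cmd = ["pip", "install", *packages]

    first = runner(pip_cmd)  # step 1: attempt pip install
    if first.returncode == 0:
        return True

    # step 2: only react to failures that look like a missing compiler
    if not needs_compilation(first.stdout + first.stderr):
        return False

    tools = runner(["apt-get", "install", "-y", "build-essential"])
    if tools.returncode != 0:
        return False

    # step 3: a single retry, so a second failure cannot loop
    return runner(pip_cmd).returncode == 0
```

Passing the runner in as a parameter keeps the flow testable without touching the system: a fake runner can return a canned gcc error on the first pip call and success afterwards.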
Remove build-essential from all Docker images (GPU and CPU variants), relying on auto-retry feature for on-demand installation when needed.

Rationale:
- Most Python packages have pre-built wheels (numpy, torch, etc.)
- Auto-retry feature automatically installs build-essential when needed
- Nala enables fast parallel installation (30-60s one-time cost)
- Optimizes for the common case (no compilation required)

Measured image size impact:
- CPU images: 844MB -> 588MB (256MB / 30% reduction)
- GPU images: Expected ~400MB reduction (~5% of total size)

User experience:
- Transparent: Users don't need to manually specify build-essential
- Auto-detection handles compilation failures automatically
- Nala acceleration ensures fast installation when needed
- Pre-built wheels work immediately (zero delay)

Dependencies preserved:
- curl: Required for uv installation
- ca-certificates: Required for HTTPS
- nala: Parallel downloads for system packages
- git: Common for pip installs from repos
Address all code review feedback to improve robustness and maintainability:

1. Pin external action version (SECURITY)
   - Changed from @main to @v1.3.1 for supply chain security
   - Prevents breaking changes from upstream updates
2. Refine error detection patterns (CRITICAL)
   - Made patterns more specific to avoid false positives
   - Changed "gcc" to "gcc: error", "gcc: command not found", etc.
   - Prevents triggering on package names like "gcc-helper"
3. Document cleanup strategy consistency
   - Added comment explaining why docker-test uses manual cleanup
   - CPU images only need ~3GB vs CUDA images needing ~25-30GB
4. Update Dockerfile comments for clarity
   - Clarified auto-install is automatic (no manual action needed)
   - Added "Advanced:" prefix for manual override option
5. Add missing test for infinite loop prevention
   - test_auto_retry_succeeds_with_warnings_no_infinite_loop
   - Ensures warnings mentioning gcc don't trigger another retry
   - Total tests: 128 passing (77.30% coverage)
6. Give retry fresh timeout budget
   - Added comment clarifying retry gets fresh 300s timeout
   - Compilation may take longer than initial install attempt
7. Add actionable error messages
   - Included troubleshooting steps when build-essential install fails
   - Suggests manual override and common failure causes
   - Mentions disk space requirements (~400MB)

All tests passing. Linting clean. Ready for merge.
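Item 2 above is easiest to see side by side. The sketch below contrasts a bare `"gcc"` substring check with the more specific phrases quoted in the commit message. The function names, the sample strings, and the exact contents of `SPECIFIC_PATTERNS` are illustrative assumptions, not the project's actual list.

```python
# Hypothetical phrases modelled on the examples quoted in the commit message.
SPECIFIC_PATTERNS = (
    "gcc: error",
    "gcc: command not found",
    "unable to execute 'gcc'",
)


def naive_match(output: str) -> bool:
    """Bare substring check: fires on any mention of gcc, including package names."""
    return "gcc" in output.lower()


def specific_match(output: str) -> bool:
    """Phrase-level check: fires only on messages that indicate a missing compiler."""
    lowered = output.lower()
    return any(pattern in lowered for pattern in SPECIFIC_PATTERNS)


# Assumed sample outputs for illustration only.
benign_output = "Successfully installed gcc-helper-1.0"
failing_output = "error: unable to execute 'gcc': No such file or directory"

print(naive_match(benign_output), specific_match(benign_output))
print(naive_match(failing_output), specific_match(failing_output))
```

The benign line trips the naive check but not the specific one, which is the false positive the review feedback was aimed at. The trade-off is that a compiler error phrased in a way not on the list would go undetected.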
Pull request overview
This PR resolves CI/CD disk space failures during Docker image builds and optimizes image sizes by removing pre-installed build-essential from base images while implementing automatic detection and installation when compilation is needed.
Key Changes:
- Upgraded CI/CD disk cleanup strategy from manual cleanup to `jlumbroso/free-disk-space` action for CUDA image builds
- Implemented intelligent build-essential auto-installation with retry logic when packages require compilation
- Removed build-essential from all four Dockerfiles to reduce image sizes by 256MB (30%) for CPU images and an expected ~400MB (~5%) for GPU images
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| .github/workflows/ci.yml | Enhanced disk space management for CUDA builds using jlumbroso/free-disk-space action and added docker system prune |
| src/dependency_installer.py | Added _needs_compilation() method and auto-retry logic to detect and handle missing compiler errors |
| tests/unit/test_dependency_installer.py | Added 11 comprehensive unit tests for compilation detection and auto-retry functionality |
| Dockerfile | Removed build-essential from GPU base image with explanatory comments |
| Dockerfile-cpu | Removed build-essential from CPU base image with explanatory comments |
| Dockerfile-lb | Removed build-essential from load balancer GPU base image with explanatory comments |
| Dockerfile-lb-cpu | Removed build-essential from load balancer CPU base image with explanatory comments |
Summary
This PR resolves CI/CD disk space failures and optimizes Docker image sizes through intelligent build-essential management.
Changes
1. Fix CI/CD Disk Space Exhaustion
Problem:
Problem: `docker-test-lb` job failing with "no space left on device" while building PyTorch CUDA images (https://github.com/runpod-workers/worker-tetra/actions/runs/20802819754)

Root Cause:

Solution:
- `jlumbroso/free-disk-space` action
- `docker system prune -af` to clear build cache
- `df -h` for disk space visibility

Result: 14GB → 34-39GB available (sufficient for all builds)
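The disk space reasoning comes down to comparing free space against what a CUDA image build needs. As a rough illustration only (the PR itself relies on `df -h` in the workflow, not on Python), a preflight check could look like the sketch below. `REQUIRED_GIB`, `has_room`, and `free_bytes` are hypothetical names, and the 30 GiB threshold is the upper end of the 25-30GB figure given in the commit message.

```python
import shutil

GIB = 1024 ** 3
REQUIRED_GIB = 30  # assumption: upper end of the 25-30GB build requirement


def has_room(free: int, required_gib: float = REQUIRED_GIB) -> bool:
    """Return True if the free byte count covers the required build space."""
    return free >= required_gib * GIB


def free_bytes(path: str = "/") -> int:
    """Free bytes on the filesystem that contains `path`."""
    return shutil.disk_usage(path).free


if __name__ == "__main__":
    available = free_bytes()
    print(f"{available / GIB:.1f} GiB free, enough for CUDA build: {has_room(available)}")
```

Against the figures reported in this PR, 14GB and 17-19GB fall below the threshold while 34-39GB clears it.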
2. Auto-Install build-essential When Needed
Problem: If build-essential removed from images, packages needing compilation would fail with cryptic gcc errors
Solution: Intelligent detection + automatic retry
Implementation:
- `_needs_compilation()`: Detects 14 error patterns (gcc not found, distutils errors, etc.)

Testing:
3. Remove build-essential from Base Images
Rationale:
Changes: Removed build-essential from all 4 Dockerfiles
- `Dockerfile` (GPU)
- `Dockerfile-cpu`
- `Dockerfile-lb` (Load Balancer GPU)
- `Dockerfile-lb-cpu`

Preserved Dependencies:
- `curl`: Required for uv installation
- `ca-certificates`: Required for HTTPS
- `nala`: Parallel downloads for system packages
- `git`: Common for pip installs from repos

Measured Impact:
User Experience
For 99% of users (pre-built wheels):
For 1% of users (packages needing compilation):
- No manual `system_dependencies` needed

Examples
Before (Manual intervention required):
After (Automatic):
Breaking Changes
None - behavior is backward compatible and more user-friendly.
Testing
Related Issues
Fixes https://github.com/runpod-workers/worker-tetra/actions/runs/20802819754