@deanq deanq commented Jan 8, 2026

Summary

This PR resolves CI/CD disk space failures and reduces Docker image sizes by removing build-essential from the base images and installing it on demand when a package needs compilation.

Changes

1. Fix CI/CD Disk Space Exhaustion

Problem: The docker-test-lb job was failing with "no space left on device" while building PyTorch CUDA images (https://github.com/runpod-workers/worker-tetra/actions/runs/20802819754)

Root Cause:

  • Cleanup only ran on PRs (skipped main branch pushes)
  • Manual cleanup provided only ~17-19GB free space
  • PyTorch CUDA builds need ~25-30GB

Solution:

  • Removed conditional cleanup that skipped main branch
  • Upgraded all CUDA image builds to jlumbroso/free-disk-space action
  • Added docker system prune -af to clear build cache
  • Added df -h for disk space visibility

Result: 14GB → 34-39GB available (sufficient for all builds)
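
For local debugging, the same headroom check can be approximated in Python. This is a minimal sketch, not part of the PR: the CI jobs rely on `df -h`, and the helper names `free_gb` and `has_room` are hypothetical. The 30GB threshold is the upper end of the ~25-30GB estimate above.

```python
import shutil

GIB = 1024 ** 3
REQUIRED_GB = 30  # assumption: upper end of the ~25-30GB build estimate


def free_gb(path: str = "/") -> float:
    """Return free space on the filesystem containing `path`, in GiB."""
    return shutil.disk_usage(path).free / GIB


def has_room(free: float, required: float = REQUIRED_GB) -> bool:
    """True when the free space covers the estimated build requirement."""
    return free >= required
```

With the figures reported here, `has_room(19)` is false (manual cleanup) and `has_room(34)` is true (aggressive cleanup).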

2. Auto-Install build-essential When Needed

Problem: If build-essential is removed from the images, packages that need compilation fail with cryptic gcc errors

Solution: Intelligent detection + automatic retry

  • Detect pip failures due to missing compilers (gcc, g++, cc not found)
  • Automatically install build-essential via system_dependencies
  • Retry pip install transparently
  • User sees seamless success

Implementation:

  • _needs_compilation(): Detects 14 error patterns (gcc not found, distutils errors, etc.)
  • Auto-retry with nala-accelerated build-essential installation
  • Comprehensive error handling and logging
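
The detection step can be pictured roughly as follows. This is an illustrative sketch only: the function name `needs_compilation` and the three regular expressions are assumptions based on the patterns named in this PR, not the actual contents of `_needs_compilation()`, which reportedly covers 14 patterns.

```python
import re

# Hypothetical subset of patterns; the real list is not reproduced in this PR text.
COMPILER_ERROR_PATTERNS = [
    r"\b(gcc|g\+\+|cc): command not found",
    r"unable to execute '(gcc|g\+\+|cc)'",
    r"\bgcc: error",
]


def needs_compilation(pip_output: str) -> bool:
    """Return True if pip output looks like a missing-compiler failure."""
    return any(
        re.search(pattern, pip_output, re.IGNORECASE)
        for pattern in COMPILER_ERROR_PATTERNS
    )
```

Matching on full error phrases rather than the bare word "gcc" keeps a line such as `Collecting gcc-helper` from triggering a retry.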

Testing:

  • 11 new unit tests covering detection and retry logic
  • All 127 tests passing (77.30% coverage)
  • Tests verify error pattern matching, auto-retry flow, nala acceleration
  • Tests verify no false positives on unrelated errors

3. Remove build-essential from Base Images

Rationale:

  • An estimated 99% of Python packages ship pre-built manylinux wheels
  • Users needing compilation get auto-install (30-60s one-time cost)
  • Optimizes for common case while handling edge cases transparently

Changes: Removed build-essential from all 4 Dockerfiles

  • Dockerfile (GPU)
  • Dockerfile-cpu
  • Dockerfile-lb (Load Balancer GPU)
  • Dockerfile-lb-cpu

Preserved Dependencies:

  • curl: Required for uv installation
  • ca-certificates: Required for HTTPS
  • nala: Parallel downloads for system packages
  • git: Common for pip installs from repos

Measured Impact:

  • CPU images: 844MB → 588MB (256MB / 30% reduction)
  • GPU images: ~8.24GB → ~7.84GB (expected ~400MB / 5% reduction)
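
The percentages above follow directly from the reported sizes. A quick arithmetic check in Python, using a hypothetical helper `reduction` (the sizes are the ones quoted in this PR; the GPU figures are expected values, not measurements):

```python
def reduction(before: float, after: float) -> tuple[float, float]:
    """Return (absolute saving, percentage saving) for a size change."""
    saved = before - after
    return saved, 100 * saved / before


cpu_saved_mb, cpu_pct = reduction(844, 588)    # sizes in MB: 256 MB, about 30.3%
gpu_saved_gb, gpu_pct = reduction(8.24, 7.84)  # sizes in GB: about 0.4 GB, about 4.9%
```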

User Experience

For 99% of users (pre-built wheels):

  • ✅ Smaller images, faster cold starts
  • ✅ Zero compilation delay
  • ✅ No changes needed

For 1% of users (packages needing compilation):

  • ✅ Transparent auto-install of build-essential
  • ✅ ~30-60s one-time installation (nala-accelerated)
  • ✅ No manual system_dependencies needed
  • ✅ Clear logging of what's happening

Examples

Before (Manual intervention required):

@remote(
    dependencies=["git+https://github.com/user/package.git"],
    system_dependencies=["build-essential"],  # User must know to add this
)

After (Automatic):

@remote(
    dependencies=["git+https://github.com/user/package.git"],
    # Auto-detects compilation need and installs build-essential
)

Breaking Changes

None - behavior is backward compatible and more user-friendly.

Testing

  • ✅ All 127 unit tests passing
  • ✅ 11 new tests for auto-retry feature
  • ✅ 77.30% code coverage
  • ✅ All linting and formatting checks pass
  • ✅ CI/CD pipeline should now complete successfully

Related Issues

Fixes https://github.com/runpod-workers/worker-tetra/actions/runs/20802819754

deanq added 4 commits January 7, 2026 19:14
- Remove conditional cleanup that skipped main branch builds
- Upgrade PyTorch CUDA image builds to aggressive cleanup (jlumbroso/free-disk-space)
- Add docker system prune to all Docker build jobs
- Add df -h for disk space visibility

PyTorch CUDA images require 25-30GB during build. Manual cleanup provides
17-19GB (insufficient), while aggressive cleanup provides 34-39GB (safe).

Affected jobs:
- docker-test-lb: fixed failing builds
- docker-main-gpu: upgraded from manual to aggressive cleanup
- docker-prod-gpu: upgraded from manual to aggressive cleanup
- All LB jobs: now use aggressive cleanup consistently

Fixes: https://github.com/runpod-workers/worker-tetra/actions/runs/20802819754

Add intelligent detection and automatic installation of build-essential
when pip installation fails due to missing compilation tools.

How it works:
1. Attempt pip install
2. If failure detected with gcc/compiler errors -> auto-install build-essential
3. Retry pip install with build tools available
4. User sees transparent success (no manual intervention needed)
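
The four steps above can be sketched as a single-retry wrapper. This is a simplified illustration under stated assumptions: `install_with_retry` and its three callable parameters are hypothetical names, and the real logic lives in src/dependency_installer.py, which is not shown here.

```python
from typing import Callable, Tuple

Result = Tuple[bool, str]  # (succeeded, combined pip output)


def install_with_retry(
    pip_install: Callable[[], Result],
    install_build_tools: Callable[[], bool],
    needs_compilation: Callable[[str], bool],
) -> Result:
    """Run pip once; on a compiler-related failure, install tools and retry once."""
    ok, output = pip_install()
    if ok or not needs_compilation(output):
        # Success, or a failure unrelated to missing compilers: do not retry.
        return ok, output
    if not install_build_tools():
        return False, output + "\nbuild-essential installation failed"
    # Retry exactly once; the second result is returned without re-checking.
    return pip_install()
```

Because the second result is returned as-is, output that still mentions gcc after the retry cannot trigger another round.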

Detection patterns:
- gcc/g++/cc command not found
- unable to execute 'gcc'
- distutils compilation errors
- _distutils_hack failures

Benefits:
- Enables removal of build-essential from base images (up to ~400MB savings)
- Most users (pre-built wheels) get smaller images
- Users needing compilation get automatic fallback
- No user-facing changes or manual system_dependencies needed

Trade-offs:
- First compile-needing install ~30-60s slower (one-time cost)
- Nala ensures fast parallel installation of build-essential
- Error detection relies on known error patterns

Test coverage:
- Added 11 new unit tests covering detection and retry logic
- Tests verify error pattern matching (gcc, cc, g++, distutils)
- Tests verify auto-retry behavior with and without nala
- Tests verify no false positives on unrelated errors
- All 127 tests passing, coverage increased to 77.30%

Remove build-essential from all Docker images (GPU and CPU variants),
relying on auto-retry feature for on-demand installation when needed.

Rationale:
- Most Python packages have pre-built wheels (numpy, torch, etc.)
- Auto-retry feature automatically installs build-essential when needed
- Nala enables fast parallel installation (30-60s one-time cost)
- Optimizes for the common case (no compilation required)

Measured image size impact:
- CPU images: 844MB -> 588MB (256MB / 30% reduction)
- GPU images: Expected ~400MB reduction (~5% of total size)

User experience:
- Transparent: Users don't need to manually specify build-essential
- Auto-detection handles compilation failures automatically
- Nala acceleration ensures fast installation when needed
- Pre-built wheels work immediately (zero delay)

Dependencies preserved:
- curl: Required for uv installation
- ca-certificates: Required for HTTPS
- nala: Parallel downloads for system packages
- git: Common for pip installs from repos

Address all code review feedback to improve robustness and maintainability:

1. Pin external action version (SECURITY)
   - Changed from @main to @v1.3.1 for supply chain security
   - Prevents breaking changes from upstream updates

2. Refine error detection patterns (CRITICAL)
   - Made patterns more specific to avoid false positives
   - Changed "gcc" to "gcc: error", "gcc: command not found", etc.
   - Prevents triggering on package names like "gcc-helper"

3. Document cleanup strategy consistency
   - Added comment explaining why docker-test uses manual cleanup
   - CPU images only need ~3GB vs CUDA images needing ~25-30GB

4. Update Dockerfile comments for clarity
   - Clarified auto-install is automatic (no manual action needed)
   - Added "Advanced:" prefix for manual override option

5. Add missing test for infinite loop prevention
   - test_auto_retry_succeeds_with_warnings_no_infinite_loop
   - Ensures warnings mentioning gcc don't trigger another retry
   - Total tests: 128 passing (77.30% coverage)

6. Give retry fresh timeout budget
   - Added comment clarifying retry gets fresh 300s timeout
   - Compilation may take longer than initial install attempt

7. Add actionable error messages
   - Included troubleshooting steps when build-essential install fails
   - Suggests manual override and common failure causes
   - Mentions disk space requirements (~400MB)
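
One way to read item 6: if each attempt runs as its own subprocess call with its own timeout, the retry is not charged for time spent on the first attempt. A minimal sketch assuming that design; `run_with_timeout` is a hypothetical helper, and only the 300s figure comes from this commit.

```python
import subprocess

INSTALL_TIMEOUT_S = 300  # per-attempt budget mentioned in the commit message


def run_with_timeout(cmd: list[str], timeout_s: float = INSTALL_TIMEOUT_S) -> tuple[bool, str]:
    """Run a command with its own timeout; return (succeeded, combined output)."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return False, f"timed out after {timeout_s}s"
    return proc.returncode == 0, proc.stdout + proc.stderr
```

Calling this helper once for the initial install and once for the retry gives each call a full 300 seconds.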

All tests passing. Linting clean. Ready for merge.
@deanq deanq requested a review from Copilot January 8, 2026 03:52
Copilot AI left a comment

Pull request overview

This PR resolves CI/CD disk space failures during Docker image builds and optimizes image sizes by removing pre-installed build-essential from base images while implementing automatic detection and installation when compilation is needed.

Key Changes:

  • Upgraded CI/CD disk cleanup strategy from manual cleanup to jlumbroso/free-disk-space action for CUDA image builds
  • Implemented intelligent build-essential auto-installation with retry logic when packages require compilation
  • Removed build-essential from all four Dockerfiles, reducing CPU images by 256MB (~30%) and GPU images by an expected ~400MB (~5%)

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| .github/workflows/ci.yml | Enhanced disk space management for CUDA builds using jlumbroso/free-disk-space action and added docker system prune |
| src/dependency_installer.py | Added _needs_compilation() method and auto-retry logic to detect and handle missing compiler errors |
| tests/unit/test_dependency_installer.py | Added 11 comprehensive unit tests for compilation detection and auto-retry functionality |
| Dockerfile | Removed build-essential from GPU base image with explanatory comments |
| Dockerfile-cpu | Removed build-essential from CPU base image with explanatory comments |
| Dockerfile-lb | Removed build-essential from load balancer GPU base image with explanatory comments |
| Dockerfile-lb-cpu | Removed build-essential from load balancer CPU base image with explanatory comments |

@deanq deanq merged commit 7261ccb into main Jan 8, 2026
18 checks passed
@deanq deanq deleted the deanq/ae-1102-fix-cicd-out-of-space branch January 8, 2026 18:10

3 participants