ci: Auto-build and upload uv cache on miss by arcticfly · Pull Request #608 · OpenPipe/ART

arcticfly · 2026-03-09T17:12:40Z

Summary

- Instead of failing CI when the prebuilt uv cache is missing, gracefully fall back to building from scratch and uploading the cache for future runs
- On cache miss, after uv sync completes, the new "Upload uv cache on miss" step archives UV_CACHE_DIR and uploads it via the existing build_and_push_uv_cache.sh --skip-build
- Re-checks for existing cache parts before uploading to avoid races when concurrent CI runs both miss the cache
- Uses continue-on-error: true so upload failures never break quality checks
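
The fallback described in the summary can be sketched as a small decision function. This is only an illustration of the control flow, not the workflow itself: the step names and the `parts_already_uploaded` callback are hypothetical stand-ins for the real workflow steps and the release-asset re-check.

```python
from typing import Callable, List


def run_cache_miss_path(
    cache_restored: bool,
    parts_already_uploaded: Callable[[], bool],
) -> List[str]:
    """Return the ordered steps a CI run would take (step names are illustrative)."""
    steps = ["restore-cache"]
    if cache_restored:
        # Fast path: dependencies come from the prebuilt cache.
        steps.append("uv-sync")
        return steps

    # Cache miss: warn instead of failing, then build from scratch.
    steps.append("warn-cache-miss")
    steps.append("uv-sync")

    # Re-check right before uploading: a concurrent run that also missed
    # the cache may have uploaded the parts in the meantime.
    if parts_already_uploaded():
        steps.append("skip-upload")
    else:
        # In the workflow this step is marked continue-on-error, so a
        # failed upload would not fail the quality checks.
        steps.append("archive-and-upload")
    return steps
```

The re-check is evaluated lazily through a callback so that it happens after the cold build, which is when a concurrent run is most likely to have finished its own upload.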

Test plan

- Verify CI passes on this PR (cache should be restored normally if it exists for the current fingerprint)
- To test the miss path, temporarily delete the cache release assets and trigger a CI run; it should warn, build from scratch, upload the cache, and pass
- Verify concurrent PRs with the same fingerprint don't conflict (the re-check logic should cause the second run to skip the upload)

🤖 Generated with Claude Code

Instead of failing CI when the prebuilt uv cache is missing (requiring a manual rebuild on a separate machine), gracefully fall back to building from scratch and uploading the cache for future runs.

- Change permissions to contents: write for release asset uploads
- Convert hard failures in cache restore to warnings with cache-hit output
- Add upload step that archives the uv cache after uv sync and uploads via the existing build_and_push_uv_cache.sh script (--skip-build)
- Re-check before upload to avoid races when concurrent CI runs both miss the cache
- Use continue-on-error so upload failures never break quality checks

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Move cache building to a separate `build-cache` job that runs on a larger runner (`art-cache-builder`) only when the cache is missing. This avoids OOM on the 16GB `art-large-runner` during cold builds.

- `cache-status`: lightweight check for existing cache (art-large-runner)
- `build-cache`: builds and uploads cache on miss (art-cache-builder, >=32GB)
- `quality-checks`: restores cache and runs checks (art-large-runner)

On cache hit, build-cache is skipped and quality-checks runs immediately. On cache miss, quality-checks waits for build-cache to finish first.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
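
As a rough illustration of the three-job layout in this commit, the gating can be modeled as a function of the cache status. The job and runner names come from the commit message; the function itself and its return shape are hypothetical.

```python
from typing import Dict, List

# Runner assignment as stated in the commit message.
RUNNERS: Dict[str, str] = {
    "cache-status": "art-large-runner",
    "build-cache": "art-cache-builder",
    "quality-checks": "art-large-runner",
}


def plan_jobs(cache_hit: bool) -> List[Dict[str, str]]:
    """Return the jobs that run, in order, for a given cache status."""
    order = ["cache-status"]
    if not cache_hit:
        # Only a miss pays for the larger runner.
        order.append("build-cache")
    # quality-checks always runs last; on a miss it waits for build-cache.
    order.append("quality-checks")
    return [{"job": name, "runner": RUNNERS[name]} for name in order]
```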

Restrict parallel downloads (4), installs (1), and native build jobs (2) to keep peak memory usage within the 64GB runner limit. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
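
A minimal sketch of how such limits might be expressed as environment variables. `UV_CONCURRENT_DOWNLOADS` and `UV_CONCURRENT_INSTALLS` are uv settings; the commit message does not say which variable caps native build jobs, so `MAX_JOBS` here is an assumption (it is a common convention for PyTorch extension builds), as is the helper name.

```python
from typing import Dict, Mapping


def limited_uv_env(base: Mapping[str, str]) -> Dict[str, str]:
    """Copy an environment and cap parallelism to reduce peak memory."""
    env = dict(base)
    env["UV_CONCURRENT_DOWNLOADS"] = "4"  # parallel downloads
    env["UV_CONCURRENT_INSTALLS"] = "1"   # parallel installs
    # Assumption: the native build job limit is applied through MAX_JOBS.
    env["MAX_JOBS"] = "2"
    return env
```

A build step could then call something like `subprocess.run(["uv", "sync"], env=limited_uv_env(os.environ), check=True)`.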

Docker Buildx manages memory via overlay layers and doesn't get OOM-killed like bare uv sync does. This matches the pre-#560 approach and works on the existing art-large-runner (16GB) without needing a larger runner.

- Add docker/ci-uv-cache.Dockerfile to build the uv cache in Docker
- build-cache job uses Buildx with GHA cache, then extracts the archive and uploads via the existing build_and_push_uv_cache.sh script
- Remove dependency on art-cache-builder runner

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The 4-core/16GB art-large-runner thrashes on the Docker build due to the large packages (torch, vllm, cudnn, etc.). Use a dedicated larger runner only for cache builds to finish faster and more reliably. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

transformer-engine-torch needs cudnn.h which is provided by the pip nvidia-cudnn package. Set CUDNN_PATH and related env vars pointing to the venv location so the native extension can find the headers during compilation. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
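
For illustration, the environment could be derived from the venv's site-packages directory roughly as follows. The commit only names `CUDNN_PATH`; the `nvidia/cudnn` layout reflects where the pip cuDNN wheel usually installs its files, and using `CPATH` and `LIBRARY_PATH` as the "related env vars" is an assumption, not what the Dockerfile necessarily sets.

```python
from pathlib import PurePosixPath
from typing import Dict


def cudnn_env(site_packages: PurePosixPath) -> Dict[str, str]:
    """Build env vars pointing a native build at the pip-installed cuDNN."""
    root = site_packages / "nvidia" / "cudnn"
    return {
        "CUDNN_PATH": str(root),
        # Assumption: expose the headers (cudnn.h) and libraries to the
        # compiler and linker through the standard GCC search-path variables.
        "CPATH": str(root / "include"),
        "LIBRARY_PATH": str(root / "lib"),
    }
```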

With 64GB RAM on art-cache-builder, we can run build_and_push_uv_cache.sh directly without Docker. Simpler, avoids Dockerfile env var complications (cuDNN paths, etc.), and reuses the existing script that already handles all the build details. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The cache-status job only needs python3 and curl (both on the runner natively) to compute a fingerprint and check the API. Removing the pytorch container avoids a slow image pull on every CI run. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
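
The check itself can stay dependency-free. The sketch below shows one way a fingerprint and an asset lookup could work using only the standard library; the hashed inputs, the 16-character truncation, and the `uv-cache-<fingerprint>` asset naming are all hypothetical, since the PR page does not show the actual scheme.

```python
import hashlib
from typing import Iterable


def compute_fingerprint(file_contents: Iterable[bytes]) -> str:
    """Hash the inputs that determine the cache (e.g. lockfile bytes)."""
    digest = hashlib.sha256()
    for content in file_contents:
        # Length-prefix each input so concatenation boundaries are unambiguous.
        digest.update(len(content).to_bytes(8, "big"))
        digest.update(content)
    return digest.hexdigest()[:16]


def cache_parts_present(asset_names: Iterable[str], fingerprint: str) -> bool:
    """Return True if any release asset name matches this fingerprint."""
    prefix = f"uv-cache-{fingerprint}"  # hypothetical naming scheme
    return any(name.startswith(prefix) for name in asset_names)
```

In the job, the asset names would come from the release API response fetched with curl; here they are passed in as plain strings to keep the sketch offline.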

arcticfly and others added 12 commits March 9, 2026 10:12

ci: Trigger CI run

ba45d75

ci: Limit uv concurrency in build-cache to avoid OOM

bc64004

docs: Update CONTRIBUTING.md for automatic cache rebuilds

9b56990

ci: Retrigger CI with clean cache state

9d34d0e

ci: Retrigger all checks

18f800a

arcticfly requested a review from FurtherAI March 9, 2026 21:07

arcticfly merged commit ec1c174 into main Mar 10, 2026
5 checks passed

arcticfly deleted the ci/self-healing-uv-cache branch March 10, 2026 07:05

arcticfly merged 12 commits into main from ci/self-healing-uv-cache
