
ci: Auto-build and upload uv cache on miss #608

Merged
arcticfly merged 12 commits into main from ci/self-healing-uv-cache on Mar 10, 2026

arcticfly (Collaborator) commented Mar 9, 2026

Summary

  • Instead of failing CI when the prebuilt uv cache is missing, gracefully fall back to building from scratch and uploading the cache for future runs
  • On cache miss, after uv sync completes, the new "Upload uv cache on miss" step archives UV_CACHE_DIR and uploads it via the existing build_and_push_uv_cache.sh --skip-build
  • Re-checks for existing cache parts before uploading to avoid races when concurrent CI runs both miss the cache
  • Uses continue-on-error: true so upload failures never break quality checks
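
The control flow in the summary above can be sketched roughly as follows. This is a minimal illustration, not the workflow itself: the four callables are hypothetical stand-ins for the restore step, `uv sync`, the re-check for existing cache parts, and the upload via `build_and_push_uv_cache.sh --skip-build`, and the `try`/`except` only approximates what `continue-on-error: true` does at the step level.

```python
from typing import Callable


def run_with_self_healing_cache(
    restore_cache: Callable[[], bool],
    sync_dependencies: Callable[[], None],
    cache_exists_remotely: Callable[[], bool],
    upload_cache: Callable[[], None],
) -> str:
    """Return a short label describing which path the run took.

    All four callables are hypothetical placeholders for workflow steps.
    """
    if restore_cache():
        sync_dependencies()  # fast: packages come from the restored cache
        return "hit"

    # Cache miss is a warning, not a failure.
    print("warning: prebuilt uv cache missing, building from scratch")
    sync_dependencies()  # slow: the cold build populates the local cache

    # Re-check: a concurrent run may have uploaded while this one was building.
    if cache_exists_remotely():
        return "miss-skipped-upload"

    try:
        upload_cache()
    except Exception as exc:  # an upload failure must not fail the checks
        print(f"warning: cache upload failed: {exc}")
        return "miss-upload-failed"
    return "miss-uploaded"
```

The re-check narrows the race window between concurrent runs but does not close it entirely; two runs that both pass the re-check at the same moment would still both attempt an upload.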

Test plan

  • Verify CI passes on this PR (cache should be restored normally if it exists for the current fingerprint)
  • To test the miss path: temporarily delete the cache release assets and trigger a CI run — it should warn, build from scratch, upload the cache, and pass
  • Verify concurrent PRs with the same fingerprint don't conflict (re-check logic should cause the second run to skip upload)

🤖 Generated with Claude Code

arcticfly and others added 12 commits March 9, 2026 10:12
Instead of failing CI when the prebuilt uv cache is missing (requiring
a manual rebuild on a separate machine), gracefully fall back to building
from scratch and uploading the cache for future runs.

- Change permissions to contents: write for release asset uploads
- Convert hard failures in cache restore to warnings with cache-hit output
- Add upload step that archives the uv cache after uv sync and uploads
  via the existing build_and_push_uv_cache.sh script (--skip-build)
- Re-check before upload to avoid races when concurrent CI runs both
  miss the cache
- Use continue-on-error so upload failures never break quality checks

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Move cache building to a separate `build-cache` job that runs on a
larger runner (`art-cache-builder`) only when the cache is missing.
This avoids OOM on the 16GB `art-large-runner` during cold builds.

- `cache-status`: lightweight check for existing cache (art-large-runner)
- `build-cache`: builds and uploads cache on miss (art-cache-builder, >=32GB)
- `quality-checks`: restores cache and runs checks (art-large-runner)

On cache hit, build-cache is skipped and quality-checks runs immediately.
On cache miss, quality-checks waits for build-cache to finish first.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
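
The job gating described in the commit above can be modelled in a few lines. This is only a sketch of the ordering, assuming the `cache-status` result is reduced to a single boolean; the function name `plan_jobs` is hypothetical and nothing here reflects the actual workflow YAML.

```python
from typing import List


def plan_jobs(cache_hit: bool) -> List[str]:
    """Return the jobs that run, in order, for a given cache-status result."""
    jobs = ["cache-status"]
    if not cache_hit:
        # Only on a miss; this is the job placed on the larger runner.
        jobs.append("build-cache")
    # Always last; on a miss it waits for build-cache to finish.
    jobs.append("quality-checks")
    return jobs
```

For example, `plan_jobs(True)` yields `["cache-status", "quality-checks"]`, while `plan_jobs(False)` inserts `build-cache` between the two.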
Restrict parallel downloads (4), installs (1), and native build jobs (2)
to keep peak memory usage within the 64GB runner limit.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
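
One way to express the limits from the commit above is through environment variables passed to the build. The commit does not name the variables it sets, so the mapping below is an assumption: `UV_CONCURRENT_DOWNLOADS` and `UV_CONCURRENT_INSTALLS` are uv settings, while `MAX_JOBS` (read by PyTorch's C++ extension builder) is a guess at what "native build jobs" refers to. The helper name `limited_build_env` is hypothetical.

```python
import os
from typing import Dict, Optional


def limited_build_env(base: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with conservative concurrency limits."""
    env = dict(os.environ if base is None else base)
    env.update(
        {
            "UV_CONCURRENT_DOWNLOADS": "4",  # parallel downloads
            "UV_CONCURRENT_INSTALLS": "1",   # parallel installs
            "MAX_JOBS": "2",  # assumption: variable used for native build jobs
        }
    )
    return env
```

The resulting dictionary could then be handed to the build, for instance with `subprocess.run(["uv", "sync"], env=limited_build_env(), check=True)`.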
Docker Buildx manages memory via overlay layers and doesn't get OOM-killed
like bare uv sync does. This matches the pre-#560 approach and works on
the existing art-large-runner (16GB) without needing a larger runner.

- Add docker/ci-uv-cache.Dockerfile to build the uv cache in Docker
- build-cache job uses Buildx with GHA cache, then extracts the archive
  and uploads via the existing build_and_push_uv_cache.sh script
- Remove dependency on art-cache-builder runner

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The 4-core/16GB art-large-runner thrashes on the Docker build due to
the large packages (torch, vllm, cudnn, etc.). Use a dedicated larger
runner only for cache builds to finish faster and more reliably.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
transformer-engine-torch needs cudnn.h which is provided by the
pip nvidia-cudnn package. Set CUDNN_PATH and related env vars
pointing to the venv location so the native extension can find
the headers during compilation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
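
The environment described in the commit above might be derived along these lines. Only `CUDNN_PATH` is named in the commit; the "related env vars" are not, so `CPATH` and `LIBRARY_PATH` (the standard GCC search-path variables) are illustrative assumptions, as is the `nvidia/cudnn` layout under `site-packages` and the helper name `cudnn_env`.

```python
from pathlib import Path
from typing import Dict


def cudnn_env(site_packages: Path) -> Dict[str, str]:
    """Build env vars pointing a native build at the pip-installed cuDNN.

    Assumes the wheel unpacks to <site-packages>/nvidia/cudnn/{include,lib}.
    """
    cudnn_root = site_packages / "nvidia" / "cudnn"
    return {
        "CUDNN_PATH": str(cudnn_root),
        "CPATH": str(cudnn_root / "include"),      # assumption: header search path
        "LIBRARY_PATH": str(cudnn_root / "lib"),   # assumption: link-time search path
    }
```

Deriving every path from a single `site_packages` argument keeps the variables consistent if the venv location changes.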
With 64GB RAM on art-cache-builder, we can run build_and_push_uv_cache.sh
directly without Docker. Simpler, avoids Dockerfile env var complications
(cuDNN paths, etc.), and reuses the existing script that already handles
all the build details.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The cache-status job only needs python3 and curl (both on the runner
natively) to compute a fingerprint and check the API. Removing the
pytorch container avoids a slow image pull on every CI run.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
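
A fingerprint-and-lookup check of the kind described in the commit above needs nothing beyond the Python standard library. The sketch below is an assumption-heavy illustration: which files feed the fingerprint, the 16-character truncation, and the `uv-cache-<fingerprint>` asset naming are all hypothetical, and the list of asset names is taken as an input rather than fetched from the API.

```python
import hashlib
from typing import Iterable, List, Mapping


def fingerprint(contents: Mapping[str, bytes]) -> str:
    """Hash file names and contents into a short, order-independent fingerprint.

    `contents` maps a file name (e.g. a lockfile) to its bytes; the choice of
    input files is an assumption.
    """
    digest = hashlib.sha256()
    for name in sorted(contents):
        digest.update(name.encode())
        digest.update(b"\0")  # separator so name/content boundaries are unambiguous
        digest.update(contents[name])
        digest.update(b"\0")
    return digest.hexdigest()[:16]


def cache_parts(asset_names: Iterable[str], fp: str) -> List[str]:
    """Return release asset names that belong to the cache for fingerprint `fp`.

    The `uv-cache-<fp>` prefix is a hypothetical naming scheme.
    """
    prefix = f"uv-cache-{fp}"
    return sorted(name for name in asset_names if name.startswith(prefix))
```

An empty result from `cache_parts` would correspond to a cache miss; a non-empty one lists the parts to download.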
arcticfly requested a review from FurtherAI on March 9, 2026 21:07
arcticfly merged commit ec1c174 into main on Mar 10, 2026
5 checks passed
arcticfly deleted the ci/self-healing-uv-cache branch on March 10, 2026 07:05
