Consolidate CI infrastructure and add NFS-resilient build cache#1285
Consolidate CI infrastructure and add NFS-resilient build cache#1285sbryngelson merged 10 commits intoMFlowCode:masterfrom
Conversation
There was a problem hiding this comment.
Pull request overview
This PR adds a dedicated case-optimization correctness CI job and refactors HPC CI scripts by extracting shared GPU detection/build-retry helpers, while extending benchmark cases with a --steps override and adding NaN/Inf validation.
Changes:
- Add a new
case-optimizationjob to the test workflow plus scripts to prebuild, run, and validate case-optimized benchmark runs. - Consolidate duplicated CI shell logic (GPU opts, GPU detection, build retries) into shared
.github/scripts/*helpers; replace Frontier AMD duplicates with symlinks. - Extend the packer/test tooling to detect both NaN and Inf; add
--stepsto benchmark case scripts.
Reviewed changes
Copilot reviewed 26 out of 30 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| toolchain/mfc/test/test.py | Switch test validation from NaN-only to NaN/Inf detection. |
| toolchain/mfc/packer/pack.py | Rename has_NaNs() to has_bad_values() and include Inf detection. |
| benchmarks/5eq_rk3_weno3_hllc/case.py | Add --steps override; change parallel_io setting. |
| benchmarks/viscous_weno5_sgb_acoustic/case.py | Add --steps override; change parallel_io setting. |
| benchmarks/hypo_hll/case.py | Add --steps override for timestep control. |
| benchmarks/ibm/case.py | Add --steps override; change parallel_io setting. |
| benchmarks/igr/case.py | Add --steps override; change parallel_io setting. |
| .github/workflows/test.yml | Use centralized test retry wrapper; add case-optimization CI job. |
| .github/workflows/phoenix/test.sh | Refactor to use shared GPU opts, GPU detection, and retry-build helper. |
| .github/workflows/phoenix/bench.sh | Refactor to use shared bench preamble + retry-build helper. |
| .github/workflows/frontier/test.sh | Refactor to shared GPU detection and GPU opts helper. |
| .github/workflows/frontier/build.sh | Refactor to shared GPU opts and retry-build helper. |
| .github/workflows/frontier/bench.sh | Refactor to use shared bench preamble. |
| .github/workflows/frontier_amd/test.sh | Replace with symlink to ../frontier/test.sh. |
| .github/workflows/frontier_amd/submit.sh | Replace with symlink to ../frontier/submit.sh. |
| .github/workflows/frontier_amd/build.sh | Replace with symlink to ../frontier/build.sh. |
| .github/workflows/frontier_amd/bench.sh | Replace with symlink to ../frontier/bench.sh. |
| .github/scripts/run_case_optimization.sh | New: runs 5 benchmark cases with --case-optimization and validates output. |
| .github/scripts/check_case_optimization_output.py | New: validates D/*.dat contain no NaN/Inf via packer. |
| .github/scripts/run-tests-with-retry.sh | New: centralizes “retry up to 5 sporadic failures” logic for test workflow. |
| .github/scripts/retry-build.sh | New: shared 3-attempt build retry helper with optional cleanup/validation hooks. |
| .github/scripts/prebuild-case-optimization.sh | New: prebuild benchmark cases with --case-optimization on login node. |
| .github/scripts/gpu-opts.sh | New: shared translation from job_device/job_interface → `--gpu {acc |
| .github/scripts/detect-gpus.sh | New: shared NVIDIA/AMD GPU detection setting ngpus and gpu_ids. |
| .github/scripts/bench-preamble.sh | New: shared benchmark script preamble setting ranks/build/device opts. |
| .github/file-filter.yml | Ensure .github/scripts/** changes trigger CI file-change detection. |
There was a problem hiding this comment.
Actionable comments posted: 9
🧹 Nitpick comments (1)
.github/scripts/run_case_optimization.sh (1)
23-29: Use a single source of truth for the case list across prebuild/run scripts.The hardcoded list here can drift from
.github/scripts/prebuild-case-optimization.shdiscovery behavior. Centralizing this list avoids mismatched “built vs validated” coverage.
ℹ️ Review info
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (30)
.github/file-filter.yml.github/scripts/bench-preamble.sh.github/scripts/check_case_optimization_output.py.github/scripts/detect-gpus.sh.github/scripts/gpu-opts.sh.github/scripts/prebuild-case-optimization.sh.github/scripts/retry-build.sh.github/scripts/run-tests-with-retry.sh.github/scripts/run_case_optimization.sh.github/workflows/frontier/bench.sh.github/workflows/frontier/build.sh.github/workflows/frontier/test.sh.github/workflows/frontier_amd/bench.sh.github/workflows/frontier_amd/bench.sh.github/workflows/frontier_amd/build.sh.github/workflows/frontier_amd/build.sh.github/workflows/frontier_amd/submit.sh.github/workflows/frontier_amd/submit.sh.github/workflows/frontier_amd/test.sh.github/workflows/frontier_amd/test.sh.github/workflows/phoenix/bench.sh.github/workflows/phoenix/test.sh.github/workflows/test.ymlbenchmarks/5eq_rk3_weno3_hllc/case.pybenchmarks/hypo_hll/case.pybenchmarks/ibm/case.pybenchmarks/igr/case.pybenchmarks/viscous_weno5_sgb_acoustic/case.pytoolchain/mfc/packer/pack.pytoolchain/mfc/test/test.py
…MFlowCode#1281) Replace 4 duplicated frontier_amd/ scripts with symlinks to frontier/ (cluster auto-detected from directory name via BASH_SOURCE). Extract 3 shared helpers into .github/scripts/: - gpu-opts.sh: sets $gpu_opts from $job_device/$job_interface - detect-gpus.sh: vendor-agnostic GPU detection (NVIDIA + AMD) - retry-build.sh: retry_build() with configurable cleanup Refactor 6 CI scripts to source the helpers, removing duplicated GPU opts blocks, GPU detection, and retry loops. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add a new 'case-optimization' job to test.yml that builds and runs all benchmark cases with --case-optimization on Phoenix (acc/omp), Frontier (acc/omp), and Frontier AMD (omp). Each case runs a small grid (1 GBPP) for 10 timesteps and validates output contains no NaN/Inf values. - Add check_case_optimization_output.py validator script - Add --steps CLI override to all 5 benchmark case files - Update file-filter.yml to trigger CI on .github/scripts/ changes Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add RETRY_VALIDATE_CMD hook to retry-build.sh for post-build validation - Replace 37-line inline retry loop in phoenix/test.sh with retry_build() - Derive module flag from cluster name in prebuild-case-optimization.sh, removing redundant flag field from case-optimization matrix - Extract GitHub job test retry logic to run-tests-with-retry.sh - Extract shared bench GPU/device preamble to bench-preamble.sh - Standardize source order: detect-gpus.sh before gpu-opts.sh Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…enchmarks - Set parallel_io to F in all benchmark cases so simulation writes D/*.dat text files readable by the packer (parallel_io=T writes binary to restart_data/ instead, which neither the packer nor the validation script could read) - Rewrite check_case_optimization_output.py to use pack.compile() + has_bad_values() instead of reimplementing the same parsing logic - Rename Pack.has_NaNs() to has_bad_values(), adding math.isinf() check - Call validation via build/venv/bin/python3 for toolchain dependencies Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…, hidden env dependency - run-tests-with-retry.sh: extract --test-all from "$@" instead of relying on $TEST_ALL env var for retry path - check_case_optimization_output.py: restore argument validation, add per-file NaN/Inf diagnostic reporting - run_case_optimization.sh: check venv exists before loop, fix misleading error message, normalize exit code - pack.py: fix typo in Pack class comment Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
83454c9 to
93650d5
Compare
…ing cleanup When NFS stale file handles occur on Phoenix, cached files become both unreadable and undeletable, causing all retry attempts to fail identically. Layer 1: Pre-flight health check in setup-build-cache.sh probes the cache (ls, stat, touch/rm) and nukes immediately if stale, before the build starts. Layer 2: Resilient cleanup in retry-build.sh escalates to cache nuke (mv-based rename) when rm -rf fails during retry, so the next attempt gets a fresh cache. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…enix path Frontier runners failed because the cache root /storage/coda1/... is Phoenix-specific. Select cache root via case statement on cluster name: Phoenix -> /storage/coda1/..., Frontier -> /lustre/orion/.... Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Disable build retries (max_attempts 1) across all CI jobs so failures surface immediately. Test --max-attempts remains at 3 for sporadic test failures. Case-optimized pre-builds reduced to -j 2: Phoenix login nodes have a 4GB per-user cgroup limit (confirmed via dmesg: CONSTRAINT_MEMCG). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Phoenix login nodes have a 4GB per-user cgroup memory limit that OOM-kills case-optimized GPU builds (confirmed via dmesg: CONSTRAINT_MEMCG). Route the pre-build through submit.sh on Phoenix so it runs on a compute node with full memory. Frontier continues to pre-build on the login node. Reverts retry/parallelism changes (max_attempts back to 3, -j back to 8) since the root cause was the cgroup, not parallelism. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #1285 +/- ##
=======================================
Coverage 44.95% 44.95%
=======================================
Files 70 70
Lines 20503 20503
Branches 1946 1946
=======================================
Hits 9217 9217
Misses 10164 10164
Partials 1122 1122 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Summary
case-optimizationCI job that builds and runs all 5 benchmark cases with--case-optimizationon Phoenix (acc/omp), Frontier (acc/omp), and Frontier AMD (omp), validating output contains no NaN/Infcheck_case_optimization_output.pyvalidator and--stepsCLI override to benchmark casesfrontier_amd/scripts with symlinks tofrontier/(cluster auto-detected from directory name viaBASH_SOURCE).github/scripts/:gpu-opts.sh,detect-gpus.sh,retry-build.shretry-build.shescalates to mv-based cache nuke whenrm -rffails during retry cleanupFixes #1275
Test plan
./mfc.sh formatpasses./mfc.sh precheckpasses (all 5 lint gates)dirnameresolves tofrontier_amd/preserving cluster detectionjob_device=gpu job_interface=acc source gpu-opts.sh→--gpu acc🤖 Generated with Claude Code