Consolidate CI infrastructure and add NFS-resilient build cache by sbryngelson · Pull Request #1285 · MFlowCode/MFC

sbryngelson · 2026-03-02T01:12:03Z

Summary

- Add a new case-optimization CI job that builds and runs all 5 benchmark cases with --case-optimization on Phoenix (acc/omp), Frontier (acc/omp), and Frontier AMD (omp), validating output contains no NaN/Inf
- Add check_case_optimization_output.py validator and --steps CLI override to benchmark cases
- Replace 4 duplicated frontier_amd/ scripts with symlinks to frontier/ (cluster auto-detected from directory name via BASH_SOURCE)
- Extract 3 shared helpers into .github/scripts/: gpu-opts.sh, detect-gpus.sh, retry-build.sh
- Refactor 6 CI scripts to source the helpers, removing duplicated GPU opts blocks, GPU detection, and retry loops
- Add NFS-resilient build cache for Phoenix self-hosted runners: pre-flight health check detects stale NFS handles before builds start, and retry-build.sh escalates to mv-based cache nuke when rm -rf fails during retry cleanup
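
The escalation described in the last item (fall back to a rename when deletion fails) can be sketched as follows. This is a minimal Python illustration of the idea, not the contents of retry-build.sh, which is a shell script. The function name `nuke_cache`, the `remove` parameter, and the `.stale-` suffix are hypothetical.

```python
import os
import shutil
import time


def nuke_cache(cache_dir, remove=shutil.rmtree):
    """Clear cache_dir and recreate it empty.

    Returns "missing", "removed", or "renamed" depending on what was needed.
    `remove` is injectable so the failure path can be exercised in a test.
    """
    if not os.path.lexists(cache_dir):
        os.makedirs(cache_dir)
        return "missing"
    try:
        remove(cache_dir)  # analogue of `rm -rf`
        outcome = "removed"
    except OSError:
        # Deletion failed (for example on a stale file handle). A rename only
        # touches the parent directory entry, so it can succeed even when the
        # contents cannot be read or deleted.
        aside = f"{cache_dir}.stale-{int(time.time())}-{os.getpid()}"
        os.rename(cache_dir, aside)  # analogue of `mv`
        outcome = "renamed"
    os.makedirs(cache_dir, exist_ok=True)  # next attempt gets a fresh cache
    return outcome
```

The renamed directory is left behind in this sketch; whether and when it is cleaned up later is a separate policy decision not covered here.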

Fixes #1275

Test plan

- ./mfc.sh format passes
- ./mfc.sh precheck passes (all 5 lint gates)
- Symlinks verified: dirname resolves to frontier_amd/ preserving cluster detection
- Helpers spot-checked: job_device=gpu job_interface=acc source gpu-opts.sh → --gpu acc
- CI jobs (test, bench, case-optimization) pass on Phoenix, Frontier, Frontier AMD
- Manual test on Phoenix: corrupt cache dir, verify health check detects and nuke recovers
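
The symlink check above relies on taking the directory name of the path as invoked, without resolving the link. A rough Python analogue of that behavior is sketched below; the actual scripts use BASH_SOURCE in shell, and `cluster_from_script_path` and `demo` are hypothetical names used only for illustration.

```python
import os
import tempfile


def cluster_from_script_path(script_path):
    """Directory name the script was invoked from (symlinks are NOT resolved)."""
    return os.path.basename(os.path.dirname(os.path.abspath(script_path)))


def demo():
    """Build frontier/test.sh plus a frontier_amd/test.sh symlink and compare."""
    with tempfile.TemporaryDirectory() as root:
        os.makedirs(os.path.join(root, "frontier"))
        os.makedirs(os.path.join(root, "frontier_amd"))
        open(os.path.join(root, "frontier", "test.sh"), "w").close()
        link = os.path.join(root, "frontier_amd", "test.sh")
        os.symlink("../frontier/test.sh", link)

        via_link = cluster_from_script_path(link)
        # Resolving the link first would lose the cluster name.
        resolved = cluster_from_script_path(os.path.realpath(link))
        return via_link, resolved
```

Calling `demo()` yields `frontier_amd` for the unresolved path and `frontier` for the resolved one, which is why the detection must not canonicalize the path first.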

🤖 Generated with Claude Code

Copilot

Pull request overview

This PR adds a dedicated case-optimization correctness CI job and refactors HPC CI scripts by extracting shared GPU detection/build-retry helpers, while extending benchmark cases with a --steps override and adding NaN/Inf validation.

Changes:

Add a new case-optimization job to the test workflow plus scripts to prebuild, run, and validate case-optimized benchmark runs.
Consolidate duplicated CI shell logic (GPU opts, GPU detection, build retries) into shared .github/scripts/* helpers; replace Frontier AMD duplicates with symlinks.
Extend the packer/test tooling to detect both NaN and Inf; add --steps to benchmark case scripts.

Reviewed changes

Copilot reviewed 26 out of 30 changed files in this pull request and generated 4 comments.

File	Description
toolchain/mfc/test/test.py	Switch test validation from NaN-only to NaN/Inf detection.
toolchain/mfc/packer/pack.py	Rename `has_NaNs()` to `has_bad_values()` and include Inf detection.
benchmarks/5eq_rk3_weno3_hllc/case.py	Add `--steps` override; change `parallel_io` setting.
benchmarks/viscous_weno5_sgb_acoustic/case.py	Add `--steps` override; change `parallel_io` setting.
benchmarks/hypo_hll/case.py	Add `--steps` override for timestep control.
benchmarks/ibm/case.py	Add `--steps` override; change `parallel_io` setting.
benchmarks/igr/case.py	Add `--steps` override; change `parallel_io` setting.
.github/workflows/test.yml	Use centralized test retry wrapper; add `case-optimization` CI job.
.github/workflows/phoenix/test.sh	Refactor to use shared GPU opts, GPU detection, and retry-build helper.
.github/workflows/phoenix/bench.sh	Refactor to use shared bench preamble + retry-build helper.
.github/workflows/frontier/test.sh	Refactor to shared GPU detection and GPU opts helper.
.github/workflows/frontier/build.sh	Refactor to shared GPU opts and retry-build helper.
.github/workflows/frontier/bench.sh	Refactor to use shared bench preamble.
.github/workflows/frontier_amd/test.sh	Replace with symlink to `../frontier/test.sh`.
.github/workflows/frontier_amd/submit.sh	Replace with symlink to `../frontier/submit.sh`.
.github/workflows/frontier_amd/build.sh	Replace with symlink to `../frontier/build.sh`.
.github/workflows/frontier_amd/bench.sh	Replace with symlink to `../frontier/bench.sh`.
.github/scripts/run_case_optimization.sh	New: runs 5 benchmark cases with `--case-optimization` and validates output.
.github/scripts/check_case_optimization_output.py	New: validates D/*.dat contain no NaN/Inf via packer.
.github/scripts/run-tests-with-retry.sh	New: centralizes “retry up to 5 sporadic failures” logic for test workflow.
.github/scripts/retry-build.sh	New: shared 3-attempt build retry helper with optional cleanup/validation hooks.
.github/scripts/prebuild-case-optimization.sh	New: prebuild benchmark cases with `--case-optimization` on login node.
.github/scripts/gpu-opts.sh	New: shared translation from `job_device/job_interface` → `--gpu {acc
.github/scripts/detect-gpus.sh	New: shared NVIDIA/AMD GPU detection setting `ngpus` and `gpu_ids`.
.github/scripts/bench-preamble.sh	New: shared benchmark script preamble setting ranks/build/device opts.
.github/file-filter.yml	Ensure `.github/scripts/**` changes trigger CI file-change detection.

coderabbitai

Actionable comments posted: 9

🧹 Nitpick comments (1)

.github/scripts/run_case_optimization.sh (1)

23-29: Use a single source of truth for the case list across prebuild/run scripts.

The hardcoded list here can drift from .github/scripts/prebuild-case-optimization.sh discovery behavior. Centralizing this list avoids mismatched “built vs validated” coverage.

📥 Commits

Reviewing files that changed from the base of the PR and between 4ee892c and 83454c9.

📒 Files selected for processing (30)

.github/file-filter.yml
.github/scripts/bench-preamble.sh
.github/scripts/check_case_optimization_output.py
.github/scripts/detect-gpus.sh
.github/scripts/gpu-opts.sh
.github/scripts/prebuild-case-optimization.sh
.github/scripts/retry-build.sh
.github/scripts/run-tests-with-retry.sh
.github/scripts/run_case_optimization.sh
.github/workflows/frontier/bench.sh
.github/workflows/frontier/build.sh
.github/workflows/frontier/test.sh
.github/workflows/frontier_amd/bench.sh
.github/workflows/frontier_amd/bench.sh
.github/workflows/frontier_amd/build.sh
.github/workflows/frontier_amd/build.sh
.github/workflows/frontier_amd/submit.sh
.github/workflows/frontier_amd/submit.sh
.github/workflows/frontier_amd/test.sh
.github/workflows/frontier_amd/test.sh
.github/workflows/phoenix/bench.sh
.github/workflows/phoenix/test.sh
.github/workflows/test.yml
benchmarks/5eq_rk3_weno3_hllc/case.py
benchmarks/hypo_hll/case.py
benchmarks/ibm/case.py
benchmarks/igr/case.py
benchmarks/viscous_weno5_sgb_acoustic/case.py
toolchain/mfc/packer/pack.py
toolchain/mfc/test/test.py

…MFlowCode#1281)

Replace 4 duplicated frontier_amd/ scripts with symlinks to frontier/ (cluster auto-detected from directory name via BASH_SOURCE).

Extract 3 shared helpers into .github/scripts/:
- gpu-opts.sh: sets $gpu_opts from $job_device/$job_interface
- detect-gpus.sh: vendor-agnostic GPU detection (NVIDIA + AMD)
- retry-build.sh: retry_build() with configurable cleanup

Refactor 6 CI scripts to source the helpers, removing duplicated GPU opts blocks, GPU detection, and retry loops.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Add a new 'case-optimization' job to test.yml that builds and runs all benchmark cases with --case-optimization on Phoenix (acc/omp), Frontier (acc/omp), and Frontier AMD (omp). Each case runs a small grid (1 GBPP) for 10 timesteps and validates output contains no NaN/Inf values.

- Add check_case_optimization_output.py validator script
- Add --steps CLI override to all 5 benchmark case files
- Update file-filter.yml to trigger CI on .github/scripts/ changes

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
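
A `--steps` override of the kind described in this commit could look roughly like the sketch below. It is an illustration only: the function name `build_case`, the default of 1000 steps, and the exact case keys shown are assumptions, not taken from the benchmark files.

```python
import argparse


def build_case(argv):
    """Return a (hypothetical) case dictionary, honoring an optional --steps."""
    parser = argparse.ArgumentParser(description="Hypothetical benchmark case")
    parser.add_argument("--steps", type=int, default=None,
                        help="override the number of timesteps")
    # Ignore arguments this sketch does not know about.
    args, _unknown = parser.parse_known_args(argv)

    default_steps = 1000  # assumed default, for illustration only
    steps = default_steps if args.steps is None else args.steps
    if steps < 1:
        parser.error("--steps must be a positive integer")

    # Key names are assumptions standing in for the case's time-step settings.
    return {"t_step_start": 0, "t_step_stop": steps, "t_step_save": steps}
```

With this shape, a CI job can pass `--steps 10` for a short validation run while ordinary benchmark invocations keep the default.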

- Add RETRY_VALIDATE_CMD hook to retry-build.sh for post-build validation
- Replace 37-line inline retry loop in phoenix/test.sh with retry_build()
- Derive module flag from cluster name in prebuild-case-optimization.sh, removing redundant flag field from case-optimization matrix
- Extract GitHub job test retry logic to run-tests-with-retry.sh
- Extract shared bench GPU/device preamble to bench-preamble.sh
- Standardize source order: detect-gpus.sh before gpu-opts.sh

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
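
The retry-with-hooks pattern named in this commit can be sketched in Python as below. The real helper is a shell function; the callable-based signature here, and the choice to run cleanup only between attempts, are assumptions made for illustration.

```python
def retry_build(build, max_attempts=3, cleanup=None, validate=None):
    """Run build() until it succeeds and (optionally) validates.

    build, cleanup and validate are zero-argument callables; build and
    validate return True on success. Returns the attempt number that
    succeeded, or raises RuntimeError once all attempts are exhausted.
    """
    for attempt in range(1, max_attempts + 1):
        ok = build()
        if ok and validate is not None:
            # Post-build validation: a build that "succeeds" but produces
            # unusable output is treated like a failed attempt.
            ok = validate()
        if ok:
            return attempt
        if attempt < max_attempts and cleanup is not None:
            cleanup()  # give the next attempt a clean starting point
    raise RuntimeError(f"build failed after {max_attempts} attempts")
```

Treating a failed validation the same as a failed build keeps the caller's control flow to a single success/failure outcome.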

…enchmarks

- Set parallel_io to F in all benchmark cases so simulation writes D/*.dat text files readable by the packer (parallel_io=T writes binary to restart_data/ instead, which neither the packer nor the validation script could read)
- Rewrite check_case_optimization_output.py to use pack.compile() + has_bad_values() instead of reimplementing the same parsing logic
- Rename Pack.has_NaNs() to has_bad_values(), adding math.isinf() check
- Call validation via build/venv/bin/python3 for toolchain dependencies

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
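
The combined NaN/Inf check mentioned above amounts to something like the following. This is a standalone sketch over a plain iterable of floats, not the actual `Pack` method, whose data layout is not shown in this PR page.

```python
import math


def has_bad_values(values):
    """True if any value is NaN or +/-Inf (i.e. not a finite float)."""
    return any(math.isnan(v) or math.isinf(v) for v in values)
```

For floats this is equivalent to testing `not math.isfinite(v)`; spelling out both checks mirrors the commit's description of adding `math.isinf()` alongside the existing NaN test.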

…, hidden env dependency

- run-tests-with-retry.sh: extract --test-all from "$@" instead of relying on $TEST_ALL env var for retry path
- check_case_optimization_output.py: restore argument validation, add per-file NaN/Inf diagnostic reporting
- run_case_optimization.sh: check venv exists before loop, fix misleading error message, normalize exit code
- pack.py: fix typo in Pack class comment

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…ing cleanup

When NFS stale file handles occur on Phoenix, cached files become both unreadable and undeletable, causing all retry attempts to fail identically.

Layer 1: Pre-flight health check in setup-build-cache.sh probes the cache (ls, stat, touch/rm) and nukes immediately if stale, before the build starts.

Layer 2: Resilient cleanup in retry-build.sh escalates to cache nuke (mv-based rename) when rm -rf fails during retry, so the next attempt gets a fresh cache.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
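
The Layer 1 probe (ls, stat, touch/rm) can be approximated in Python as shown below. This is a sketch of the idea rather than the logic in setup-build-cache.sh; the function name `cache_is_healthy` and the probe file name are hypothetical.

```python
import os
import uuid


def cache_is_healthy(cache_dir):
    """Probe a cache directory the way `ls`, `stat`, `touch` and `rm` would.

    Any OSError (including a stale file handle) marks the cache unhealthy.
    """
    probe = os.path.join(cache_dir, f".health-probe-{uuid.uuid4().hex}")
    try:
        os.listdir(cache_dir)    # ls: can the directory be read?
        os.stat(cache_dir)       # stat: is the handle still valid?
        with open(probe, "w"):   # touch: can a file be created?
            pass
        os.remove(probe)         # rm: can it be deleted again?
    except OSError:
        return False
    return True
```

A caller would run this before the build starts and discard the cache when it returns `False`, so that a stale cache never reaches the retry loop.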

…enix path

Frontier runners failed because the cache root /storage/coda1/... is Phoenix-specific. Select cache root via case statement on cluster name: Phoenix -> /storage/coda1/..., Frontier -> /lustre/orion/....

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Disable build retries (max_attempts 1) across all CI jobs so failures surface immediately. Test --max-attempts remains at 3 for sporadic test failures.

Case-optimized pre-builds reduced to -j 2: Phoenix login nodes have a 4GB per-user cgroup limit (confirmed via dmesg: CONSTRAINT_MEMCG).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Phoenix login nodes have a 4GB per-user cgroup memory limit that OOM-kills case-optimized GPU builds (confirmed via dmesg: CONSTRAINT_MEMCG). Route the pre-build through submit.sh on Phoenix so it runs on a compute node with full memory. Frontier continues to pre-build on the login node.

Reverts retry/parallelism changes (max_attempts back to 3, -j back to 8) since the root cause was the cgroup, not parallelism.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

codecov · 2026-03-03T14:33:51Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 44.95%. Comparing base (7c806be) to head (f6918fa).
⚠️ Report is 3 commits behind head on master.

Additional details and impacted files

@@           Coverage Diff           @@
##           master    #1285   +/-   ##
=======================================
  Coverage   44.95%   44.95%           
=======================================
  Files          70       70           
  Lines       20503    20503           
  Branches     1946     1946           
=======================================
  Hits         9217     9217           
  Misses      10164    10164           
  Partials     1122     1122

sbryngelson changed the title ~~Consolidate CI dispatch: symlink frontier_amd, extract shared helpers~~ Add case-optimization CI tests and consolidate CI dispatch infrastructure Mar 2, 2026

sbryngelson marked this pull request as ready for review March 2, 2026 03:53

Copilot AI review requested due to automatic review settings March 2, 2026 03:53

Copilot started reviewing on behalf of sbryngelson March 2, 2026 03:54

Copilot AI reviewed Mar 2, 2026

coderabbitai bot reviewed Mar 2, 2026

sbryngelson and others added 5 commits March 2, 2026 10:44

sbryngelson force-pushed the pause-coverage branch from 83454c9 to 93650d5 Compare March 2, 2026 15:45

sbryngelson changed the title ~~Add case-optimization CI tests and consolidate CI dispatch infrastructure~~ Consolidate CI infrastructure and add NFS-resilient build cache Mar 3, 2026

sbryngelson and others added 4 commits March 2, 2026 22:49

Rename _nfs_cache_* to _cache_*: Frontier uses Lustre, not NFS

d487b18

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

MFlowCode deleted a comment from qodo-code-review bot Mar 3, 2026

MFlowCode deleted a comment from Copilot AI Mar 3, 2026

MFlowCode deleted a comment from coderabbitai bot Mar 3, 2026

MFlowCode deleted a comment from github-actions bot Mar 3, 2026

MFlowCode deleted a comment from qodo-code-review bot Mar 3, 2026

MFlowCode deleted a comment from github-actions bot Mar 3, 2026

MFlowCode deleted a comment from coderabbitai bot Mar 3, 2026

MFlowCode deleted a comment from codecov bot Mar 3, 2026

MFlowCode deleted a comment from github-actions bot Mar 3, 2026

sbryngelson merged commit ce98373 into MFlowCode:master Mar 3, 2026
42 of 55 checks passed

sbryngelson merged 10 commits into MFlowCode:master from sbryngelson:pause-coverage

2 participants
