ci: sequential benchmark monitoring to avoid OOM kills #1309
sbryngelson merged 3 commits into MFlowCode:master
Conversation
When a multi-step CI job (like case-optimization) fails at an early step, the 'Print Logs' step (if: always) would cat output files from a previous successful run, making it appear the current run succeeded. Delete stale .out files after checkout so logs only show output from the current workflow run. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
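A minimal sketch of that cleanup, assuming the stale logs are SLURM-style `*.out` files in the workspace (the file name, glob, and directory here are illustrative, not the workflow's exact values):

```shell
# Simulate a workspace that still contains a log from a previous run,
# then delete stale .out files right after "checkout" so a later
# `cat *.out` can only show output from the current run.
workdir=$(mktemp -d)
touch "$workdir/case-opt.out"             # stale log from a prior run
find "$workdir" -name '*.out' -type f -delete
ls -A "$workdir"                          # prints nothing: directory is empty
```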
Phoenix login nodes have a 4 GB per-user cgroup memory limit shared across all runner processes. Running two benchmark monitors in parallel (each with tail -f, bash loops, and pipe subshells) on top of 7 concurrent runner processes exceeds this limit, triggering OOM kills. Submit both PR and master SLURM jobs up front so they run concurrently on compute nodes (preserving benchmark fairness), but monitor them one at a time on the login node. Also add SUBMIT_ONLY mode to submit-slurm-job.sh to support decoupling submission from monitoring. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
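The submit-both-then-monitor-sequentially pattern can be sketched as follows; `submit_job` and `monitor_job` are stand-ins for the real sbatch submission and log-tailing logic, not functions from the actual scripts:

```shell
# Stub submission: prints a fake SLURM job id (real code would call
# `sbatch --parsable` and capture its output).
submit_job() {
    echo "100$1"
}

# Stub monitor: real code would poll squeue / tail -f the job's output.
monitor_job() {
    echo "monitoring job $1"
}

# Phase 1: submit both jobs up front so they queue and run concurrently
# on compute nodes, preserving benchmark fairness.
pr_job=$(submit_job 1)
master_job=$(submit_job 2)

# Phase 2: monitor one job at a time, so only one tail/poll pipeline is
# ever alive on the memory-limited login node.
monitor_job "$pr_job"
monitor_job "$master_job"
```

The point of the split is that the expensive part (the monitors) is serialized while the cheap part (submission) stays concurrent.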
Claude Code Review (Files changed: 3)

Summary

Findings

1. Fragile `job_slug="bench-${device}-${interface}"`: this replicates slug-construction logic defined in another script.
2. NFS retry window may be tight:

   ```shell
   while [ ! -f "$yaml" ] && [ $attempts -lt 6 ]; do
       sleep 5
       attempts=$((attempts + 1))
   done
   ```

   Max 30 s per file. The PR description notes NFS can be slow under load; bumping the attempt count would give more headroom.
3. Minor logging regression in failure output paths.

No correctness, compilation, or physics issues. The CI-only nature of these changes means there is zero blast radius on the MFC solver itself.
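The retry window flagged above can be made tunable without changing the loop body; a sketch, with function and variable names that are illustrative rather than taken from the PR:

```shell
# Wait for a file to appear on NFS, with configurable attempt count and
# delay. Returns success (0) only if the file exists when the loop ends.
wait_for_file() {
    path=$1
    max_attempts=${2:-6}   # default: 6 attempts x 5 s = 30 s window
    delay=${3:-5}
    attempts=0
    while [ ! -f "$path" ] && [ "$attempts" -lt "$max_attempts" ]; do
        sleep "$delay"
        attempts=$((attempts + 1))
    done
    [ -f "$path" ]
}
```

Raising `max_attempts` (or `delay`) then widens the window in one place for every caller.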
Claude Code Review (Head SHA: 20ccd77, Files changed: 2)

Summary:

Findings:

Overall: The approach is sound and directly addresses the documented OOM issue. The concurrent-submit / sequential-monitor pattern is a clean solution. The findings above are minor robustness gaps rather than correctness blockers.
Claude Code Review (Head SHA: 20ccd77, Files changed: 2)

Summary:

Findings:

- Minor: potential race on the job ID file read between the `(cd pr && SUBMIT_ONLY=1 bash "...")` subshell and the subsequent `pr_job_id=` read.
- Minor: the NFS retry loop resets its attempt counter.
- Minor: `PR_BENCH_SCRIPT="$(cd "${SCRIPT_DIR}/../workflows/common" && pwd)/bench.sh"` is constructed but never passed to the monitoring calls.
- Observation: the partial-failure path continues to both YAML checks.

No blocking issues. The approach is sound: concurrent SLURM submission for fairness, sequential monitoring for memory safety.
CodeRabbit review failed: the pull request is closed.
Walkthrough

This change refactors the parallel benchmarking workflow in the GitHub Actions CI scripts from a background-job approach with concurrent monitoring to a structured two-phase architecture. Phase 1 submits both PR and master SLURM jobs using an updated `submit-slurm-job.sh` (new `SUBMIT_ONLY` mode); Phase 2 monitors the jobs one at a time on the login node.
Pull request overview
Updates the CI SLURM benchmark orchestration to avoid Phoenix login-node OOM kills by decoupling job submission from monitoring, while still running PR and master benchmarks concurrently on compute nodes for fair comparisons.
Changes:
- Add `SUBMIT_ONLY=1` mode to `submit-slurm-job.sh` to allow submission without starting the monitor.
- Rework `run_parallel_benchmarks.sh` to submit both jobs first, then monitor them sequentially.
- Add a short NFS-propagation retry loop when checking for benchmark YAML outputs.
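A plausible shape for the `SUBMIT_ONLY` gate, sketched with stubbed `sbatch` and monitoring so it is self-contained; the real script's names and control flow may differ:

```shell
# Stubs for illustration only; the real script calls the actual sbatch
# binary and runs a real monitoring loop.
sbatch() { echo "12345"; }                # `sbatch --parsable` prints the id
monitor() { echo "monitoring $1"; }

submit_slurm_job() {
    job_id=$(sbatch --parsable "$1")
    echo "$job_id" > job_id.txt           # persist id for a later monitor phase
    if [ "${SUBMIT_ONLY:-0}" = "1" ]; then
        return 0                          # submission done; caller monitors later
    fi
    monitor "$job_id"                     # default behavior: monitor inline
}
```

Called as `SUBMIT_ONLY=1 submit_slurm_job bench.sbatch`, it submits and returns immediately; without the flag it behaves as before.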
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| .github/scripts/submit-slurm-job.sh | Adds SUBMIT_ONLY gate to skip monitoring after submission. |
| .github/scripts/run_parallel_benchmarks.sh | Submits PR/master jobs up front, monitors sequentially, and retries YAML presence checks. |
Codecov Report

✅ All modified and coverable lines are covered by tests.

```
@@           Coverage Diff           @@
##           master    #1309   +/- ##
=======================================
  Coverage   45.34%   45.34%
=======================================
  Files          70       70
  Lines       20514    20514
  Branches     1954     1954
=======================================
  Hits         9303     9303
  Misses      10084    10084
  Partials     1127     1127
```

☔ View full report in Codecov by Sentry.
Summary

- Add `SUBMIT_ONLY` mode to `submit-slurm-job.sh` to decouple submission from monitoring

Context

Phoenix login nodes enforce a 4 GB per-user cgroup memory limit shared across all processes. With 7 GitHub Actions runners on one node (~1.5 GB baseline), running two benchmark monitors in parallel (each with `tail -f`, bash loops, and pipe subshells) triggers OOM kills via the kernel cgroup enforcer (`oom_score_adj=500` is set by the runner on child processes).

Test plan

- Existing callers of `submit-slurm-job.sh` (test suite, case-optimization) are unaffected

🤖 Generated with Claude Code