
ci: sequential benchmark monitoring to avoid OOM kills#1309

Merged
sbryngelson merged 3 commits into MFlowCode:master from sbryngelson:frontier-runner
Mar 14, 2026
Conversation

@sbryngelson
Member

Summary

  • Submit both PR and master SLURM benchmark jobs up front (concurrent on compute nodes for fair comparison)
  • Monitor them one at a time on the login node to stay within the 4 GB per-user cgroup memory limit
  • Add SUBMIT_ONLY mode to submit-slurm-job.sh to decouple submission from monitoring
  • Add NFS propagation retry for YAML output files (handles the case where job 2 finishes before monitoring starts)
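The submit-then-monitor flow in the bullets above can be sketched as follows. This is a hypothetical, simplified sketch: `sbatch` is stubbed out so it runs anywhere, the job IDs are placeholders, and only the `.slurm_job_id` file convention is taken from this PR.

```shell
set -eu

submit_job() {                        # stands in for: SUBMIT_ONLY=1 submit-slurm-job.sh
    mkdir -p "$1"
    echo "$2" > "$1/.slurm_job_id"    # the real script writes the sbatch job ID here
}

# Phase 1: submit both jobs up front so they run concurrently on compute nodes
submit_job pr 10001
submit_job master 10002
pr_job_id=$(cat pr/.slurm_job_id)
master_job_id=$(cat master/.slurm_job_id)

# Phase 2: monitor one job at a time to bound login-node memory use
for job_id in "$pr_job_id" "$master_job_id"; do
    echo "monitoring job $job_id"     # the real script tails logs until the job exits
done
```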

Context

Phoenix login nodes enforce a 4 GB per-user cgroup memory limit shared across all of a user's processes. With 7 GitHub Actions runners on one node (~1.5 GB baseline), running two benchmark monitors in parallel (each with tail -f, bash loops, and pipe subshells) exceeds the limit, and the kernel's cgroup OOM killer terminates the monitors (the runner sets oom_score_adj=500 on child processes, making them the preferred victims).
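For context, a per-user limit like this is typically visible under the cgroup v2 user slice. A hypothetical way to inspect it (the systemd slice path layout is an assumption; Phoenix's actual cgroup hierarchy may differ):

```shell
# Assumed cgroup v2 layout (systemd user slices); adjust for the actual cluster.
limit_file="/sys/fs/cgroup/user.slice/user-$(id -u).slice/memory.max"
if [ -r "$limit_file" ]; then
    limit=$(cat "$limit_file")    # "max" means unlimited; otherwise a byte count
else
    limit="unknown (no cgroup v2 user slice visible)"
fi
echo "per-user memory limit: $limit"
```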

Test plan

  • Phoenix GPU benchmark job completes without OOM kill
  • Both PR and master SLURM jobs run concurrently on compute nodes
  • YAML output files are found after sequential monitoring
  • Other callers of submit-slurm-job.sh (test suite, case-optimization) are unaffected

🤖 Generated with Claude Code

sbryngelson and others added 2 commits March 13, 2026 20:26
When a multi-step CI job (like case-optimization) fails at an early
step, the 'Print Logs' step (if: always) would cat output files from
a previous successful run, making it appear the current run succeeded.
Delete stale .out files after checkout so logs only show output from
the current workflow run.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Phoenix login nodes have a 4 GB per-user cgroup memory limit shared
across all runner processes. Running two benchmark monitors in parallel
(each with tail -f, bash loops, and pipe subshells) on top of 7
concurrent runner processes exceeds this limit, triggering OOM kills.

Submit both PR and master SLURM jobs up front so they run concurrently
on compute nodes (preserving benchmark fairness), but monitor them
one at a time on the login node. Also add SUBMIT_ONLY mode to
submit-slurm-job.sh to support decoupling submission from monitoring.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions

Claude Code Review

Head SHA: 3d37ae29f67981188e04957897084c62f6bc5ee8

Files changed: 3

  • .github/scripts/run_parallel_benchmarks.sh (+74/-54)
  • .github/scripts/submit-slurm-job.sh (+6/-2)
  • .github/workflows/test.yml (+6/-0)

Summary

  • Decouples SLURM job submission from monitoring: both PR and master jobs are submitted upfront (concurrent on compute), then monitored sequentially on the login node — correctly addresses the 4 GB cgroup OOM issue.
  • SUBMIT_ONLY=1 env-var mode in submit-slurm-job.sh is clean, backward-compatible, and doesn't break existing callers.
  • NFS propagation retry loop (6 × 5 s = 30 s max) is a reasonable safety net for YAML files appearing after sequential monitoring.
  • rm -f *.out in test.yml matches the pattern from "ci: clean stale .out files after checkout" (#1308) (no issues).
  • Overall logic is sound and well-commented.

Findings

1. Fragile job_slug coupling — run_parallel_benchmarks.sh line ~44

job_slug="bench-${device}-${interface}"

This replicates slug-construction logic from submit-slurm-job.sh. The comment acknowledges this must match, but if the slug formula ever changes in submit-slurm-job.sh, this breaks silently (the script would cat a non-existent file and fail with a confusing error). Low risk now, but consider exporting the slug from submit-slurm-job.sh (e.g., via a separate .slurm_slug file) so the computation lives in exactly one place.
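The reviewer's single-source suggestion could look like the following hypothetical sketch: submit-slurm-job.sh persists the slug it computed, and run_parallel_benchmarks.sh reads it back instead of re-deriving it. The `.slurm_slug` file name comes from the review text; the device/interface values are examples.

```shell
set -eu
device="gpu"; interface="acc"            # example values, not from the PR

# In submit-slurm-job.sh: compute the slug once and persist it
job_slug="bench-${device}-${interface}"
echo "$job_slug" > .slurm_slug

# In run_parallel_benchmarks.sh: read it back, never re-derive
job_slug=$(cat .slurm_slug)
echo "log file: ${job_slug}.out"
```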

2. NFS retry window may be tight — run_parallel_benchmarks.sh line ~106–113

while [ ! -f "$yaml" ] && [ $attempts -lt 6 ]; do
    sleep 5
    attempts=$((attempts + 1))
done

Max 30 s per file. The PR description notes NFS can be slow under load. Bumping to attempts -lt 12 (60 s) costs nothing if the file appears quickly and would reduce flakiness on a busy cluster.
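The widened loop would look like this sketch (the YAML file name is an example, and it is pre-created here so the sketch finishes instantly instead of sleeping):

```shell
yaml="bench-gpu-acc.yaml"
touch "$yaml"                             # simulate the file already being present
attempts=0
while [ ! -f "$yaml" ] && [ $attempts -lt 12 ]; do   # 12 x 5 s = 60 s max
    sleep 5
    attempts=$((attempts + 1))
done
echo "found $yaml after $attempts retries"
```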

3. Minor logging regression — failure output paths
The original failure blocks printed a label before the tail, e.g. echo "Last 50 lines of PR job log:". The new code omits those labels and goes straight to tail -n 50. Very minor, but useful when reading CI logs to know what section you're looking at. Easy one-liner fix if desired.
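The one-liner fix amounts to restoring a label before each tail. A sketch, with an example log path (not from the PR):

```shell
log="bench-gpu-acc.out"
printf 'example log line\n' > "$log"      # stand-in for the real SLURM log file
echo "Last 50 lines of PR job log:"       # the label the reviewer suggests restoring
tail -n 50 "$log"
```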


No correctness, compilation, or physics issues. The CI-only nature of these changes means there is zero blast radius on the MFC solver itself.

@sbryngelson sbryngelson marked this pull request as ready for review March 14, 2026 01:37
Copilot AI review requested due to automatic review settings March 14, 2026 01:37
@sbryngelson sbryngelson merged commit e331547 into MFlowCode:master Mar 14, 2026
24 of 27 checks passed
@github-actions

Claude Code Review

Head SHA: 20ccd77

Files changed: 2

  • .github/scripts/run_parallel_benchmarks.sh
  • .github/scripts/submit-slurm-job.sh

Summary:

  • Fixes OOM kills on Phoenix login nodes by decoupling SLURM job submission from monitoring
  • Both PR and master benchmark jobs are submitted concurrently (Phase 1), then monitored sequentially (Phase 2), preserving fairness on compute while reducing login-node memory pressure
  • Adds SUBMIT_ONLY=1 environment variable to submit-slurm-job.sh to skip monitoring; existing callers (test suite, case-optimization) are unaffected since they don't set this flag
  • Adds NFS propagation retry loop (up to 30s) for YAML output files after sequential monitoring
  • Minor cleanups: consistent 4-space indentation, job_slug variable to reduce repetition

Findings:

  1. run_parallel_benchmarks.sh lines 43–44 — missing error handling on cat for job ID files
    After SUBMIT_ONLY=1 submission, the job ID is read back from pr/.slurm_job_id. If submission fails (non-zero exit from the cd subshell), the script continues and the cat will fail with a cryptic message rather than a clear error. Consider checking the subshell exit code explicitly:

    (cd pr && SUBMIT_ONLY=1 bash ... ) || { echo 'ERROR: PR submission failed'; exit 1; }
    pr_job_id=$(cat pr/.slurm_job_id)

    Similarly for the master job on lines 48–49. (The outer set -euo pipefail won't catch errors inside a command substitution/subshell that is itself the right-hand side of an assignment.)

  2. run_parallel_benchmarks.sh lines 75–82 — NFS retry loop runs even after a job failure
    The NFS wait loop runs unconditionally for both YAML files, even if the corresponding SLURM job already exited with a non-zero code. This adds up to 30 s of spurious waiting on a failed run. A minor robustness improvement: skip the wait for a YAML if its job already failed.

  3. submit-slurm-job.sh line 205 — SUBMIT_ONLY is not documented in the usage/help comment
    The file likely has a usage comment at the top. Adding a note about SUBMIT_ONLY=1 would help future maintainers. (Low priority, cosmetic only.)
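Finding 3 could be addressed with a gate and usage note along these lines. This is a hypothetical sketch, not the real script: the monitor function and job ID are placeholders, and only the `SUBMIT_ONLY` / `.slurm_job_id` conventions come from this PR.

```shell
# Usage: submit-slurm-job.sh <args...>
#   Env: SUBMIT_ONLY=1  submit and write .slurm_job_id, then exit without monitoring
set -eu

monitor_job() { echo "monitoring $1"; }   # placeholder for the real monitor loop

job_id=12345                              # stand-in for the ID parsed from sbatch output
echo "$job_id" > .slurm_job_id

if [ "${SUBMIT_ONLY:-0}" = "1" ]; then
    exit 0                                # caller (run_parallel_benchmarks.sh) monitors later
fi
monitor_job "$job_id"
```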

Overall: The approach is sound and directly addresses the documented OOM issue. The concurrent-submit / sequential-monitor pattern is a clean solution. The findings above are minor robustness gaps rather than correctness blockers.

@github-actions

Claude Code Review

Head SHA: 20ccd77

Files changed: 2

  • .github/scripts/run_parallel_benchmarks.sh
  • .github/scripts/submit-slurm-job.sh

Summary:

  • Fixes OOM kills on Phoenix login nodes by submitting both SLURM jobs up front (concurrent on compute) then monitoring sequentially (one at a time on login)
  • Adds SUBMIT_ONLY=1 env var flag to submit-slurm-job.sh to decouple submission from monitoring
  • Adds NFS propagation retry loop (6×5s = 30s max) for YAML output files
  • Simplifies log path references using the job_slug variable, reducing duplication
  • Other callers of submit-slurm-job.sh (test suite, case-optimization) are unaffected since SUBMIT_ONLY defaults to 0

Findings:

Minor — potential race on job ID file read (run_parallel_benchmarks.sh, lines ~48–55):

(cd pr && SUBMIT_ONLY=1 bash "...")
pr_job_id=$(cat pr/.slurm_job_id)

The subshell cd pr && ... writes the job ID file. Since it runs in a subshell (not background), the cat happens after it exits — this is fine. No race. (Non-issue, just noting the sequence is correct.)

Minor — NFS retry loop resets attempts per file but loops over both in sequence (run_parallel_benchmarks.sh, lines ~90–98):
Each YAML file gets up to 30s of NFS wait independently. This is intentional and correct — the second file may need its own wait if the first was fast.

Minor — PR_BENCH_SCRIPT path assumption (run_parallel_benchmarks.sh, line ~42):

PR_BENCH_SCRIPT="$(cd "${SCRIPT_DIR}/../workflows/common" && pwd)/bench.sh"

This path is constructed but never passed to the monitoring calls — it's only used in the SUBMIT_ONLY submission phase. The comment says "the bench script must come from the PR tree", which is correct for the submission path. No issue.

Observation — partial-failure path continues to both YAML checks (run_parallel_benchmarks.sh, lines ~101–110):
If the PR job fails (pr_exit != 0), monitoring of the master job still proceeds and both YAML checks still run. This is likely intentional (to collect as much diagnostic info as possible), but if a failure is fatal, monitoring the second job wastes time. Low priority, current behavior seems reasonable.

No blocking issues. The approach is sound — concurrent SLURM submission for fairness, sequential monitoring for memory safety. SUBMIT_ONLY is a clean, backward-compatible extension.

@coderabbitai
Contributor

coderabbitai bot commented Mar 14, 2026

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 01c7df28-dbae-4643-abf3-7e7a855e4abe

📥 Commits

Reviewing files that changed from the base of the PR and between f528e23 and 20ccd77.

📒 Files selected for processing (2)
  • .github/scripts/run_parallel_benchmarks.sh
  • .github/scripts/submit-slurm-job.sh

📝 Walkthrough

This change refactors the parallel benchmarking workflow in the GitHub Actions CI scripts from a background job approach with concurrent monitoring to a structured two-phase architecture. Phase 1 submits both PR and master SLURM jobs using an updated submit-slurm-job.sh with a new SUBMIT_ONLY flag that skips monitoring. Phase 2 sequentially monitors each job using run_monitored_slurm_job.sh. The script now derives job paths using device and interface slugs, captures job IDs from dedicated files, consolidates post-run verification including NFS propagation handling, and reduces login-node memory pressure through sequential rather than parallel monitoring.



Copilot AI left a comment

Pull request overview

Updates the CI SLURM benchmark orchestration to avoid Phoenix login-node OOM kills by decoupling job submission from monitoring, while still running PR and master benchmarks concurrently on compute nodes for fair comparisons.

Changes:

  • Add SUBMIT_ONLY=1 mode to submit-slurm-job.sh to allow submission without starting the monitor.
  • Rework run_parallel_benchmarks.sh to submit both jobs first, then monitor them sequentially.
  • Add a short NFS-propagation retry loop when checking for benchmark YAML outputs.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

Reviewed files:
  • .github/scripts/submit-slurm-job.sh: Adds SUBMIT_ONLY gate to skip monitoring after submission.
  • .github/scripts/run_parallel_benchmarks.sh: Submits PR/master jobs up front, monitors sequentially, and retries YAML presence checks.

@codecov
Copy link

codecov bot commented Mar 14, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 45.34%. Comparing base (598f5a5) to head (20ccd77).
⚠️ Report is 5 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #1309   +/-   ##
=======================================
  Coverage   45.34%   45.34%           
=======================================
  Files          70       70           
  Lines       20514    20514           
  Branches     1954     1954           
=======================================
  Hits         9303     9303           
  Misses      10084    10084           
  Partials     1127     1127           

