
ci: sequential benchmark monitoring to avoid OOM kills#1309

Merged
sbryngelson merged 3 commits into MFlowCode:master from sbryngelson:frontier-runner
Mar 14, 2026
Conversation

@sbryngelson
Member

Summary

  • Submit both PR and master SLURM benchmark jobs up front (concurrent on compute nodes for fair comparison)
  • Monitor them one at a time on the login node to stay within the 4 GB per-user cgroup memory limit
  • Add SUBMIT_ONLY mode to submit-slurm-job.sh to decouple submission from monitoring
  • Add NFS propagation retry for YAML output files (handles the case where job 2 finishes before monitoring starts)
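The submit-then-monitor flow in the bullets above can be sketched as follows. This is a hypothetical, simplified sketch: `sbatch` is stubbed out so it runs anywhere, the job IDs are placeholders, and only the `.slurm_job_id` file convention is taken from this PR.

```shell
set -eu

submit_job() {                        # stands in for: SUBMIT_ONLY=1 submit-slurm-job.sh
    mkdir -p "$1"
    echo "$2" > "$1/.slurm_job_id"    # the real script writes the sbatch job ID here
}

# Phase 1: submit both jobs up front so they run concurrently on compute nodes
submit_job pr 10001
submit_job master 10002
pr_job_id=$(cat pr/.slurm_job_id)
master_job_id=$(cat master/.slurm_job_id)

# Phase 2: monitor one job at a time to bound login-node memory use
for job_id in "$pr_job_id" "$master_job_id"; do
    echo "monitoring job $job_id"     # the real script tails logs until the job exits
done
```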

Context

Phoenix login nodes enforce a 4 GB per-user cgroup memory limit shared across all of a user's processes. With 7 GitHub Actions runners on one node (~1.5 GB baseline), running two benchmark monitors in parallel (each with tail -f, bash loops, and pipe subshells) exceeds the limit, and the kernel's cgroup OOM killer terminates the monitors (the runner sets oom_score_adj=500 on child processes, making them the preferred victims).
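For context, a per-user limit like this is typically visible under the cgroup v2 user slice. A hypothetical way to inspect it (the systemd slice path layout is an assumption; Phoenix's actual cgroup hierarchy may differ):

```shell
# Assumed cgroup v2 layout (systemd user slices); adjust for the actual cluster.
limit_file="/sys/fs/cgroup/user.slice/user-$(id -u).slice/memory.max"
if [ -r "$limit_file" ]; then
    limit=$(cat "$limit_file")    # "max" means unlimited; otherwise a byte count
else
    limit="unknown (no cgroup v2 user slice visible)"
fi
echo "per-user memory limit: $limit"
```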

Test plan

  • Phoenix GPU benchmark job completes without OOM kill
  • Both PR and master SLURM jobs run concurrently on compute nodes
  • YAML output files are found after sequential monitoring
  • Other callers of submit-slurm-job.sh (test suite, case-optimization) are unaffected

🤖 Generated with Claude Code

sbryngelson and others added 2 commits March 13, 2026 20:26
When a multi-step CI job (like case-optimization) fails at an early
step, the 'Print Logs' step (if: always) would cat output files from
a previous successful run, making it appear the current run succeeded.
Delete stale .out files after checkout so logs only show output from
the current workflow run.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Phoenix login nodes have a 4 GB per-user cgroup memory limit shared
across all runner processes. Running two benchmark monitors in parallel
(each with tail -f, bash loops, and pipe subshells) on top of 7
concurrent runner processes exceeds this limit, triggering OOM kills.

Submit both PR and master SLURM jobs up front so they run concurrently
on compute nodes (preserving benchmark fairness), but monitor them
one at a time on the login node. Also add SUBMIT_ONLY mode to
submit-slurm-job.sh to support decoupling submission from monitoring.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions

Claude Code Review

Head SHA: 3d37ae29f67981188e04957897084c62f6bc5ee8

Files changed: 3

  • .github/scripts/run_parallel_benchmarks.sh (+74/-54)
  • .github/scripts/submit-slurm-job.sh (+6/-2)
  • .github/workflows/test.yml (+6/-0)

Summary

  • Decouples SLURM job submission from monitoring: both PR and master jobs are submitted upfront (concurrent on compute), then monitored sequentially on the login node — correctly addresses the 4 GB cgroup OOM issue.
  • SUBMIT_ONLY=1 env-var mode in submit-slurm-job.sh is clean, backward-compatible, and doesn't break existing callers.
  • NFS propagation retry loop (6 × 5 s = 30 s max) is a reasonable safety net for YAML files appearing after sequential monitoring.
  • rm -f *.out in test.yml matches the pattern from "ci: clean stale .out files after checkout" (#1308) (no issues).
  • Overall logic is sound and well-commented.

Findings

1. Fragile job_slug coupling — run_parallel_benchmarks.sh line ~44

job_slug="bench-${device}-${interface}"

This replicates slug-construction logic from submit-slurm-job.sh. The comment acknowledges this must match, but if the slug formula ever changes in submit-slurm-job.sh, this breaks silently (the script would cat a non-existent file and fail with a confusing error). Low risk now, but consider exporting the slug from submit-slurm-job.sh (e.g., via a separate .slurm_slug file) so the computation lives in exactly one place.
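The reviewer's single-source suggestion could look like the following hypothetical sketch: submit-slurm-job.sh persists the slug it computed, and run_parallel_benchmarks.sh reads it back instead of re-deriving it. The `.slurm_slug` file name comes from the review text; the device/interface values are examples.

```shell
set -eu
device="gpu"; interface="acc"            # example values, not from the PR

# In submit-slurm-job.sh: compute the slug once and persist it
job_slug="bench-${device}-${interface}"
echo "$job_slug" > .slurm_slug

# In run_parallel_benchmarks.sh: read it back, never re-derive
job_slug=$(cat .slurm_slug)
echo "log file: ${job_slug}.out"
```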

2. NFS retry window may be tight — run_parallel_benchmarks.sh line ~106–113

while [ ! -f "$yaml" ] && [ $attempts -lt 6 ]; do
    sleep 5
    attempts=$((attempts + 1))
done

Max 30 s per file. The PR description notes NFS can be slow under load. Bumping to attempts -lt 12 (60 s) costs nothing if the file appears quickly and would reduce flakiness on a busy cluster.
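The widened loop would look like this sketch (the YAML file name is an example, and it is pre-created here so the sketch finishes instantly instead of sleeping):

```shell
yaml="bench-gpu-acc.yaml"
touch "$yaml"                             # simulate the file already being present
attempts=0
while [ ! -f "$yaml" ] && [ $attempts -lt 12 ]; do   # 12 x 5 s = 60 s max
    sleep 5
    attempts=$((attempts + 1))
done
echo "found $yaml after $attempts retries"
```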

3. Minor logging regression — failure output paths
The original failure blocks printed a label before the tail, e.g. echo "Last 50 lines of PR job log:". The new code omits those labels and goes straight to tail -n 50. Very minor, but useful when reading CI logs to know what section you're looking at. Easy one-liner fix if desired.
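The one-liner fix amounts to restoring a label before each tail. A sketch, with an example log path (not from the PR):

```shell
log="bench-gpu-acc.out"
printf 'example log line\n' > "$log"      # stand-in for the real SLURM log file
echo "Last 50 lines of PR job log:"       # the label the reviewer suggests restoring
tail -n 50 "$log"
```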


No correctness, compilation, or physics issues. The CI-only nature of these changes means there is zero blast radius on the MFC solver itself.

@sbryngelson sbryngelson marked this pull request as ready for review March 14, 2026 01:37
Copilot AI review requested due to automatic review settings March 14, 2026 01:37
@sbryngelson sbryngelson merged commit e331547 into MFlowCode:master Mar 14, 2026
24 of 27 checks passed
@github-actions

Claude Code Review

Head SHA: 20ccd77

Files changed: 2

  • .github/scripts/run_parallel_benchmarks.sh
  • .github/scripts/submit-slurm-job.sh

Summary:

  • Fixes OOM kills on Phoenix login nodes by decoupling SLURM job submission from monitoring
  • Both PR and master benchmark jobs are submitted concurrently (Phase 1), then monitored sequentially (Phase 2), preserving fairness on compute while reducing login-node memory pressure
  • Adds SUBMIT_ONLY=1 environment variable to submit-slurm-job.sh to skip monitoring; existing callers (test suite, case-optimization) are unaffected since they don't set this flag
  • Adds NFS propagation retry loop (up to 30s) for YAML output files after sequential monitoring
  • Minor cleanups: consistent 4-space indentation, job_slug variable to reduce repetition

Findings:

  1. run_parallel_benchmarks.sh lines 43–44 — missing error handling on cat for job ID files
    After SUBMIT_ONLY=1 submission, the job ID is read back from pr/.slurm_job_id. If submission fails (non-zero exit from the cd subshell), the script continues and the cat will fail with a cryptic message rather than a clear error. Consider checking the subshell exit code explicitly:

    (cd pr && SUBMIT_ONLY=1 bash ... ) || { echo 'ERROR: PR submission failed'; exit 1; }
    pr_job_id=$(cat pr/.slurm_job_id)

    Similarly for the master job on lines 48–49. (The outer set -euo pipefail won't catch errors inside a command substitution/subshell that is itself the right-hand side of an assignment.)

  2. run_parallel_benchmarks.sh lines 75–82 — NFS retry loop runs even after a job failure
    The NFS wait loop runs unconditionally for both YAML files, even if the corresponding SLURM job already exited with a non-zero code. This adds up to 30 s of spurious waiting on a failed run. A minor robustness improvement: skip the wait for a YAML if its job already failed.

  3. submit-slurm-job.sh line 205 — SUBMIT_ONLY is not documented in the usage/help comment
    The file likely has a usage comment at the top. Adding a note about SUBMIT_ONLY=1 would help future maintainers. (Low priority, cosmetic only.)
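Finding 3 could be addressed with a gate and usage note along these lines. This is a hypothetical sketch, not the real script: the monitor function and job ID are placeholders, and only the `SUBMIT_ONLY` / `.slurm_job_id` conventions come from this PR.

```shell
# Usage: submit-slurm-job.sh <args...>
#   Env: SUBMIT_ONLY=1  submit and write .slurm_job_id, then exit without monitoring
set -eu

monitor_job() { echo "monitoring $1"; }   # placeholder for the real monitor loop

job_id=12345                              # stand-in for the ID parsed from sbatch output
echo "$job_id" > .slurm_job_id

if [ "${SUBMIT_ONLY:-0}" = "1" ]; then
    exit 0                                # caller (run_parallel_benchmarks.sh) monitors later
fi
monitor_job "$job_id"
```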

Overall: The approach is sound and directly addresses the documented OOM issue. The concurrent-submit / sequential-monitor pattern is a clean solution. The findings above are minor robustness gaps rather than correctness blockers.

@github-actions

Claude Code Review

Head SHA: 20ccd77

Files changed: 2

  • .github/scripts/run_parallel_benchmarks.sh
  • .github/scripts/submit-slurm-job.sh

Summary:

  • Fixes OOM kills on Phoenix login nodes by submitting both SLURM jobs up front (concurrent on compute) then monitoring sequentially (one at a time on login)
  • Adds SUBMIT_ONLY=1 env var flag to submit-slurm-job.sh to decouple submission from monitoring
  • Adds NFS propagation retry loop (6×5s = 30s max) for YAML output files
  • Simplifies log path references using the job_slug variable, reducing duplication
  • Other callers of submit-slurm-job.sh (test suite, case-optimization) are unaffected since SUBMIT_ONLY defaults to 0

Findings:

Minor — potential race on job ID file read (run_parallel_benchmarks.sh, lines ~48–55):

(cd pr && SUBMIT_ONLY=1 bash "...")
pr_job_id=$(cat pr/.slurm_job_id)

The subshell cd pr && ... writes the job ID file. Since it runs in a subshell (not background), the cat happens after it exits — this is fine. No race. (Non-issue, just noting the sequence is correct.)

Minor — NFS retry loop resets attempts per file but loops over both in sequence (run_parallel_benchmarks.sh, lines ~90–98):
Each YAML file gets up to 30s of NFS wait independently. This is intentional and correct — the second file may need its own wait if the first was fast.

Minor — PR_BENCH_SCRIPT path assumption (run_parallel_benchmarks.sh, line ~42):

PR_BENCH_SCRIPT="$(cd "${SCRIPT_DIR}/../workflows/common" && pwd)/bench.sh"

This path is constructed but never passed to the monitoring calls — it's only used in the SUBMIT_ONLY submission phase. The comment says "the bench script must come from the PR tree", which is correct for the submission path. No issue.

Observation — partial-failure path continues to both YAML checks (run_parallel_benchmarks.sh, lines ~101–110):
If the PR job fails (pr_exit != 0), monitoring of the master job still proceeds and both YAML checks still run. This is likely intentional (to collect as much diagnostic info as possible), but if a failure is fatal, monitoring the second job wastes time. Low priority, current behavior seems reasonable.

No blocking issues. The approach is sound — concurrent SLURM submission for fairness, sequential monitoring for memory safety. SUBMIT_ONLY is a clean, backward-compatible extension.

@coderabbitai
Contributor

coderabbitai bot commented Mar 14, 2026

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 01c7df28-dbae-4643-abf3-7e7a855e4abe

📥 Commits

Reviewing files that changed from the base of the PR and between f528e23 and 20ccd77.

📒 Files selected for processing (2)
  • .github/scripts/run_parallel_benchmarks.sh
  • .github/scripts/submit-slurm-job.sh

📝 Walkthrough

This change refactors the parallel benchmarking workflow in the GitHub Actions CI scripts from a background job approach with concurrent monitoring to a structured two-phase architecture. Phase 1 submits both PR and master SLURM jobs using an updated submit-slurm-job.sh with a new SUBMIT_ONLY flag that skips monitoring. Phase 2 sequentially monitors each job using run_monitored_slurm_job.sh. The script now derives job paths using device and interface slugs, captures job IDs from dedicated files, consolidates post-run verification including NFS propagation handling, and reduces login-node memory pressure through sequential rather than parallel monitoring.



Copilot AI left a comment

Pull request overview

Updates the CI SLURM benchmark orchestration to avoid Phoenix login-node OOM kills by decoupling job submission from monitoring, while still running PR and master benchmarks concurrently on compute nodes for fair comparisons.

Changes:

  • Add SUBMIT_ONLY=1 mode to submit-slurm-job.sh to allow submission without starting the monitor.
  • Rework run_parallel_benchmarks.sh to submit both jobs first, then monitor them sequentially.
  • Add a short NFS-propagation retry loop when checking for benchmark YAML outputs.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

Reviewed files:
  • .github/scripts/submit-slurm-job.sh: Adds SUBMIT_ONLY gate to skip monitoring after submission.
  • .github/scripts/run_parallel_benchmarks.sh: Submits PR/master jobs up front, monitors sequentially, and retries YAML presence checks.

@codecov
Copy link

codecov bot commented Mar 14, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 45.34%. Comparing base (598f5a5) to head (20ccd77).
⚠️ Report is 5 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #1309   +/-   ##
=======================================
  Coverage   45.34%   45.34%           
=======================================
  Files          70       70           
  Lines       20514    20514           
  Branches     1954     1954           
=======================================
  Hits         9303     9303           
  Misses      10084    10084           
  Partials     1127     1127           

