Skip to content

ci: fix monitor hang when SLURM job is preempted and requeued#1311

Merged
sbryngelson merged 1 commit intoMFlowCode:masterfrom
sbryngelson:frontier-runner
Mar 14, 2026
Merged

ci: fix monitor hang when SLURM job is preempted and requeued#1311
sbryngelson merged 1 commit intoMFlowCode:masterfrom
sbryngelson:frontier-runner

Conversation

@sbryngelson
Copy link
Member

When a job is preempted+requeued, sacct -X reports PREEMPTED for the original attempt even after the requeued run completes. The monitor excluded PREEMPTED from terminal states (correct for active requeues) but never detected the requeued completion via sacct, causing it to loop on state=PREEMPTED for hours until the GHA timeout killed it.

Fix: when sacct -X returns PREEMPTED, also query without -X to find the requeued run's terminal state (COMPLETED, FAILED, etc).

When a job is preempted+requeued, sacct -X reports PREEMPTED for the
original attempt even after the requeued run completes. The monitor
excluded PREEMPTED from terminal states (correct for active requeues)
but never detected the requeued completion via sacct, causing it to
loop on state=PREEMPTED for hours until the GHA timeout killed it.

Fix: when sacct -X returns PREEMPTED, also query without -X to find
the requeued run's terminal state (COMPLETED, FAILED, etc).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@sbryngelson sbryngelson marked this pull request as ready for review March 14, 2026 13:17
Copilot AI review requested due to automatic review settings March 14, 2026 13:17
@sbryngelson sbryngelson merged commit e59272a into MFlowCode:master Mar 14, 2026
23 of 30 checks passed
@github-actions
Copy link

Claude Code Review

Head SHA: 1eb855c

Files changed: 1

  • .github/scripts/monitor_slurm_job.sh

Summary:

  • Fixes a CI hang where a preempted+requeued SLURM job caused the monitor to loop indefinitely on state=PREEMPTED
  • When sacct -X returns PREEMPTED, the fix queries all records (without -X) to find any terminal state from the requeued run
  • Logic is additive (9 lines added, 0 deleted); no existing behavior is changed for non-preempted jobs

Findings:

  1. Potential wrong-state selection (.github/scripts/monitor_slurm_job.sh, new lines ~57-58):

    requeue_state=$(sacct -j "$jid" -n -P -o State 2>/dev/null | grep -v PREEMPTED | head -n1 | cut -d'|' -f1 || true)

    grep -v PREEMPTED also silently drops PREEMPTED by ... extended state strings and any line containing the word "PREEMPTED" (e.g. REQUEUED_PREEMPTED if SLURM ever emits that). More importantly, head -n1 picks the first non-PREEMPTED record, which may be the original job step rather than the requeued batch step. Consider filtering more precisely (e.g. grep -E 'COMPLETED|FAILED|CANCELLED|TIMEOUT|NODE_FAIL|OUT_OF_MEMORY') to ensure only recognized terminal states are accepted.

  2. Empty requeue_state not distinguished from "no requeue yet" (.github/scripts/monitor_slurm_job.sh, lines ~59-61):
    If the requeued job is still running, sacct without -X will return an empty/blank state or a running state (RUNNING, PENDING). The current code only overrides state when requeue_state is non-empty, so a still-running requeue will leave state=PREEMPTED, which the existing terminal-state check will treat as... not terminal (correct). This is actually fine — but it is worth confirming that the outer loop in the caller properly handles PREEMPTED as a non-terminal state to keep polling, which appears to be the intent. No action required if that is already the case.

  3. No concern with -X semantics: The fix correctly targets the documented behavior of -X (report only the last job step per job), and querying without -X is the right approach to surface the requeued step. The logic is sound.

Verdict: The fix is correct and well-scoped. Addressing finding #1 (use an explicit allowlist of terminal states instead of grep -v PREEMPTED) would make the selection more robust, but the current approach is a clear improvement over the hang scenario.

@github-actions
Copy link

Claude Code Review

Head SHA: 1eb855c211e5fdd2c66e12770839c61b47dd9265

Files changed: 1

  • .github/scripts/monitor_slurm_job.sh

Summary:

  • Fixes a CI monitor hang when a SLURM job is preempted and requeued
  • When sacct -X reports PREEMPTED, the fix queries all job step records (without -X) to find the requeued run's terminal state
  • Patch is minimal (+9 lines, 0 deletions), well-targeted, and addresses a real operational bug

Findings:

  1. Logic concern — PENDING/RUNNING requeued jobs (.github/scripts/monitor_slurm_job.sh, new lines ~57–60):
    The grep -v PREEMPTED | head -n1 pattern will match any non-PREEMPTED state, including PENDING or RUNNING (active states of the requeued run). If the requeue is still in progress, sacct -n -P may return both the original PREEMPTED record and a RUNNING record for the requeued attempt. This would cause state to become RUNNING, which is correctly handled as non-terminal by the outer monitor loop — so behavior is safe. However, it's worth explicitly documenting this intent (or filtering to only override with actual terminal states) to avoid future confusion.

  2. Minor — record ordering not guaranteed:
    sacct output order is unspecified. head -n1 after grep -v PREEMPTED picks the first non-PREEMPTED record, which may or may not be the most recent requeued attempt if there are multiple requeue cycles. For chains of preemptions (e.g., PREEMPTED → PREEMPTED → COMPLETED), the first non-PREEMPTED record might be an intermediate state rather than the final one. A more robust approach would be tail -n1 (last record) or filtering for known terminal states (COMPLETED|FAILED|CANCELLED|TIMEOUT|OUT_OF_MEMORY).

Suggested improvement (optional):

This is more defensive and handles repeated preemptions cleanly.

Overall: The fix correctly addresses the reported hang. The head -n1 ordering concern is low-risk in practice (the monitor loop will continue checking on non-terminal states), but filtering to terminal states explicitly would make the intent clearer and handle edge cases more robustly.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Improves SLURM job monitoring in CI by handling the case where sacct -X reports PREEMPTED for an earlier (preempted) attempt even though the job was requeued and later continued under the same job ID.

Changes:

  • Enhance get_job_state() to, when sacct -X returns PREEMPTED, query all sacct records (without -X) and prefer a non-PREEMPTED state if present.

Comment on lines +58 to +62
if [ "$state" = "PREEMPTED" ]; then
requeue_state=$(sacct -j "$jid" -n -P -o State 2>/dev/null | grep -v PREEMPTED | head -n1 | cut -d'|' -f1 || true)
if [ -n "$requeue_state" ]; then
state="$requeue_state"
fi
Comment on lines +56 to +57
# original attempt while the requeued run may have completed. Check all
# records (without -X) for a terminal state that supersedes PREEMPTED.
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Mar 14, 2026

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 7a9c6e81-755c-48ed-b24d-9f780b7887c4

📥 Commits

Reviewing files that changed from the base of the PR and between 25a074e and 1eb855c.

📒 Files selected for processing (1)
  • .github/scripts/monitor_slurm_job.sh

📝 Walkthrough

Walkthrough

The change modifies the get_job_state function in the SLURM monitoring script. When sacct -X returns PREEMPTED for a job that was preempted and requeued, the function now performs an additional query without the -X flag. This fallback query searches for subsequent requeue attempts to find a terminal state. If a non-PREEMPTED terminal state is found, it supersedes the PREEMPTED state as the final job status across multiple records.

📝 Coding Plan
  • Generate coding plan for human review comments

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Tip

CodeRabbit can suggest fixes for GitHub Check annotations.

Configure the reviews.tools.github-checks setting to adjust the time to wait for GitHub Checks to complete.

@codecov
Copy link

codecov bot commented Mar 14, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 45.36%. Comparing base (25a074e) to head (1eb855c).
⚠️ Report is 4 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #1311   +/-   ##
=======================================
  Coverage   45.36%   45.36%           
=======================================
  Files          70       70           
  Lines       20515    20515           
  Branches     1954     1954           
=======================================
  Hits         9306     9306           
  Misses      10082    10082           
  Partials     1127     1127           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

2 participants