ci: fix monitor hang when SLURM job is preempted and requeued #1311
sbryngelson merged 1 commit into MFlowCode:master
When a job is preempted and requeued, sacct -X reports PREEMPTED for the original attempt even after the requeued run completes. The monitor excluded PREEMPTED from terminal states (correct for active requeues) but never detected the requeued completion via sacct, causing it to loop on state=PREEMPTED for hours until the GHA timeout killed it.

Fix: when sacct -X returns PREEMPTED, also query without -X to find the requeued run's terminal state (COMPLETED, FAILED, etc).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Claude Code Review
Head SHA: 1eb855c
Files changed: 1
Summary:
Findings:
Verdict: The fix is correct and well-scoped. Addressing finding #1 (use an explicit allowlist of terminal states instead of |
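As an illustration of what an explicit allowlist of terminal states might look like, here is a minimal sketch. The helper name first_terminal_state is hypothetical and not part of the PR; the state names are standard Slurm job states, and which of them the monitor should treat as terminal is an assumption. PREEMPTED is deliberately left off the list.

```shell
# Hypothetical helper (not from the PR): reads one sacct State value per
# line on stdin and prints the first one that is on the allowlist.
first_terminal_state() {
    local line
    while IFS= read -r line; do
        # Keep only the first field and first word: drops extra -P columns
        # and the " by <uid>" suffix that sacct appends to CANCELLED.
        line=${line%%|*}
        line=${line%% *}
        case "$line" in
            COMPLETED|FAILED|CANCELLED|TIMEOUT|OUT_OF_MEMORY|NODE_FAIL|BOOT_FAIL|DEADLINE)
                echo "$line"
                return 0
                ;;
        esac
    done
    # No allowlisted state found; the caller keeps PREEMPTED.
    return 1
}
```

It could be fed from the same query the fix uses, for example: sacct -j "$jid" -n -P -o State 2>/dev/null | first_terminal_state || true. With an allowlist, a RUNNING or PENDING record is never reported as the state that supersedes PREEMPTED.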
Claude Code Review
Head SHA:
Files changed: 1
Summary:
Findings:
Suggested improvement (optional): This is more defensive and handles repeated preemptions cleanly. Overall: The fix correctly addresses the reported hang. The |
Pull request overview
Improves SLURM job monitoring in CI by handling the case where sacct -X reports PREEMPTED for an earlier (preempted) attempt even though the job was requeued and later continued under the same job ID.
Changes:
- Enhance get_job_state() to, when sacct -X returns PREEMPTED, query all sacct records (without -X) and prefer a non-PREEMPTED state if present.
    if [ "$state" = "PREEMPTED" ]; then
        requeue_state=$(sacct -j "$jid" -n -P -o State 2>/dev/null | grep -v PREEMPTED | head -n1 | cut -d'|' -f1 || true)
        if [ -n "$requeue_state" ]; then
            state="$requeue_state"
        fi

    # original attempt while the requeued run may have completed. Check all
    # records (without -X) for a terminal state that supersedes PREEMPTED.
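For context, the excerpts above could sit inside get_job_state() roughly as follows. This is a minimal sketch, not the merged code: the initial sacct -X query and the final echo are assumptions, and only the PREEMPTED branch is taken from the diff.

```shell
# Sketch only; the real get_job_state() may differ in its details.
get_job_state() {
    local jid="$1"
    local state requeue_state

    # Assumed first query. -X: allocation-level record only,
    # -n: no header, -P: pipe-delimited output.
    state=$(sacct -j "$jid" -X -n -P -o State 2>/dev/null | head -n1 | cut -d'|' -f1 || true)

    # From the diff: sacct -X can keep reporting PREEMPTED for the original
    # attempt, so look at all records (without -X) for a superseding state.
    if [ "$state" = "PREEMPTED" ]; then
        requeue_state=$(sacct -j "$jid" -n -P -o State 2>/dev/null | grep -v PREEMPTED | head -n1 | cut -d'|' -f1 || true)
        if [ -n "$requeue_state" ]; then
            state="$requeue_state"
        fi
    fi

    echo "$state"
}
```

If every record still says PREEMPTED, grep -v produces no output, requeue_state stays empty, and the function keeps returning PREEMPTED, so the monitor goes on polling as before.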
Codecov Report
All modified and coverable lines are covered by tests.

@@           Coverage Diff           @@
##           master    #1311   +/-   ##
=======================================
  Coverage   45.36%   45.36%
=======================================
  Files          70       70
  Lines       20515    20515
  Branches     1954     1954
=======================================
  Hits         9306     9306
  Misses      10082    10082
  Partials     1127     1127