Open
Description
Problem
The medic currently catches steps stuck in 'running' state (ghost claims from crashed agents) and zombie runs. But it doesn't detect steps stuck in 'pending' state where crons fire correctly but sub-agent spawning silently fails.
Observed pattern:
- Step transitions to 'pending'
- Agent cron fires (every 5 min), calls `step peek` → 'HAS_WORK'
- Agent cron calls `step claim` → gets step JSON
- Agent cron should spawn a sub-agent session
- Sub-agent spawn fails silently (no error logged, no session created)
- Step stays 'pending' indefinitely (cron fires again, same result)
The medic only watches for status = 'running' steps. A step that's stuck in status = 'pending' for 30+ minutes is invisible to the current health checks.
Suggested Fix
Add a new medic check: stale_pending_steps
- Find steps with `status = 'pending'` where `updated_at` is older than 2x the cron interval (>10 min)
- Log as a warning finding (don't reset; the step is safe as-is)
- Alert via main session so the operator can investigate or manually claim/complete
The check shouldn't auto-reset pending steps because there's no risk of data corruption; only ghost 'running' steps need resetting.
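A rough sketch of what such a check could look like, to make the proposal concrete. This is not the medic's actual code: the `Step` and `Finding` shapes, the `findStalePendingSteps` name, and the in-memory array standing in for the steps table are all assumptions for illustration. Only the 5-minute cron interval, the 2x threshold, and the warn-only behavior come from the description above.

```typescript
// Hypothetical step row; only `status` and `updated_at` are named in the issue.
interface Step {
  id: string;
  status: "pending" | "running" | "done" | "failed";
  updatedAt: Date;
}

// Hypothetical finding shape; the real medic may report findings differently.
interface Finding {
  check: "stale_pending_steps";
  severity: "warning";
  stepId: string;
  ageMinutes: number;
}

const CRON_INTERVAL_MS = 5 * 60 * 1000; // agent cron fires every 5 min
const STALE_AFTER_MS = 2 * CRON_INTERVAL_MS; // 2x the cron interval (>10 min)

// Returns one warning per pending step older than the threshold.
// Read-only by design: it never resets or mutates a step.
function findStalePendingSteps(steps: Step[], now: Date): Finding[] {
  const findings: Finding[] = [];
  for (const step of steps) {
    if (step.status !== "pending") continue;
    const ageMs = now.getTime() - step.updatedAt.getTime();
    if (ageMs > STALE_AFTER_MS) {
      findings.push({
        check: "stale_pending_steps",
        severity: "warning",
        stepId: step.id,
        ageMinutes: Math.floor(ageMs / 60000),
      });
    }
  }
  return findings;
}
```

Passing `now` in as a parameter rather than reading the clock inside the function keeps the check easy to test. The findings would then be forwarded to the main session as alerts, which is left out here.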
Environment
- Antfarm v0.5.1
- Observed during feature-dev, bug-fix, and security-audit workflow test run