Medic doesn't detect persistently 'pending' steps (cron fires but sub-agent spawn fails silently) #110

@robobobby

Problem

The medic currently catches zombie runs and steps stuck in the 'running' state (ghost claims from crashed agents). It does not detect steps stuck in the 'pending' state, where the crons fire correctly but sub-agent spawning fails silently.

Observed pattern:

  1. Step transitions to 'pending'
  2. Agent cron fires (every 5 min), calls step peek → 'HAS_WORK'
  3. Agent cron calls step claim → gets step JSON
  4. Agent cron should spawn a sub-agent session
  5. Sub-agent spawn fails silently (no error logged, no session created)
  6. Step stays 'pending' indefinitely (cron fires again, same result)

The medic only watches for status = 'running' steps. A step that's stuck in status = 'pending' for 30+ minutes is invisible to the current health checks.

Suggested Fix

Add a new medic check: stale_pending_steps

  • Find steps with status = 'pending' where updated_at is older than 2x the cron interval (>10 min)
  • Log as a warning finding (don't reset — the step is safe as-is)
  • Alert via main session so the operator can investigate or manually claim/complete

The check shouldn't auto-reset pending steps because there's no risk of data corruption — only ghost 'running' steps need resetting.
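
A minimal sketch of what such a check could look like, written in Python against SQLite purely for illustration. Everything here is an assumption, not Antfarm's actual code or schema: the `steps` table, its `id`, `status`, and `updated_at` columns (ISO-8601 UTC strings), the `find_stale_pending_steps` function name, and the shape of the returned findings are all hypothetical. Only the 5-minute cron interval and the 2x threshold come from this report.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

# From the report: crons fire every 5 minutes; flag anything older than 2x that.
CRON_INTERVAL = timedelta(minutes=5)
STALE_AFTER = 2 * CRON_INTERVAL


def find_stale_pending_steps(conn, now=None):
    """Return warning findings for steps that have been 'pending' too long.

    Hypothetical schema: steps(id TEXT, status TEXT, updated_at TEXT),
    with updated_at stored as an ISO-8601 timestamp that includes a UTC offset.
    The check is read-only: it never resets or modifies a step.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - STALE_AFTER

    rows = conn.execute(
        "SELECT id, updated_at FROM steps WHERE status = 'pending'"
    ).fetchall()

    findings = []
    for step_id, updated_at in rows:
        # Parse in Python rather than comparing strings in SQL, so the
        # result does not depend on how the timestamp was formatted.
        last_update = datetime.fromisoformat(updated_at)
        if last_update < cutoff:
            findings.append(
                {
                    "check": "stale_pending_steps",
                    "severity": "warning",
                    "step_id": step_id,
                    "pending_for_minutes": int(
                        (now - last_update).total_seconds() // 60
                    ),
                    "action": "none",  # report only, no auto-reset
                }
            )

    # Oldest first, so the longest-stuck step is reported at the top.
    findings.sort(key=lambda f: f["pending_for_minutes"], reverse=True)
    return findings
```

The function only reads and reports, which matches the reasoning above: a pending step is safe as-is, so the medic would surface the finding to the operator rather than change any state.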

Environment

  • Antfarm v0.5.1
  • Observed during feature-dev, bug-fix, and security-audit workflow test runs
