Vers Platform Issue: Sustained 503 on All API Endpoints
Component: Vers REST API (api.vers.ai)
Severity: Critical
Affects: All agents, all VM operations
Summary
The Vers API started returning 503 Service Unavailable on all endpoints (GET /vms, GET /vm/:id/ssh_key, etc.) and remained down for 10+ minutes with no recovery. This completely blocks all VM operations — SSH, listing, committing, branching — and strands any running lieutenants/swarm agents.
What I Was Trying To Do
SSH into a lieutenant VM (c8f9cadc-4c58-4eab-a847-df31723c7d16) to check the output of a completed task (image generation via Gemini API).
What Went Wrong
Every API call returns 503 Service Unavailable. Tried repeatedly over ~10 minutes with no recovery.
Expected Behavior
The API should return VM data or, if temporarily overloaded, recover within seconds and send appropriate `Retry-After` headers.
Actual Behavior
Sustained 503 on every endpoint. No retry-after header. No degraded mode. Complete blackout.
Minimal Reproduction
```shell
# All of these return 503:
curl -H "Authorization: Bearer $VERS_API_KEY" https://api.vers.ai/vms
# => 503 Service unavailable
curl -H "Authorization: Bearer $VERS_API_KEY" https://api.vers.ai/vm/<any-vm-id>/ssh_key
# => 503 Service unavailable
```
Evidence
Calls that failed (all within a ~10 minute window starting ~2026-02-24T03:20:00Z):
- vers_vm_use("c8f9cadc-4c58-4eab-a847-df31723c7d16") → Vers API GET /vm/.../ssh_key failed (503): 503 Service unavailable
- vers_vms() → Vers API GET /vms failed (503): 503 Service unavailable
- Retried 6+ times over 10 minutes — never recovered during the session
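The retry pattern above can be reproduced with a small polling probe. This is a hypothetical sketch (not part of the Vers SDK) using only the standard library: it hits one endpoint on an interval and logs the status plus any `Retry-After` header, which would have made the outage window and the missing-header problem easy to document.

```python
import time
import urllib.error
import urllib.request

API = "https://api.vers.ai/vms"  # endpoint from the repro above

def probe_once(url=API, token=""):
    """One probe: return (HTTP status, Retry-After header value or None)."""
    req = urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.status, resp.headers.get("Retry-After")
    except urllib.error.HTTPError as e:
        # 503s arrive here; the error object still carries response headers.
        return e.code, e.headers.get("Retry-After")

def watch(probe=probe_once, interval=30, rounds=20, log=print):
    """Poll until the API stops returning 5xx or we exhaust `rounds`."""
    status = None
    for i in range(rounds):
        status, retry_after = probe()
        log(f"attempt {i + 1}: status={status} Retry-After={retry_after}")
        if status is not None and status < 500:
            return status
        time.sleep(interval)
    return status
```

`probe` is injectable so the loop can be exercised without the network; in a real session, `watch(interval=60, rounds=15)` would cover the ~10 minute window described above.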
Impact Assessment
- Blocks work completely: no workaround exists
- Frequency: Hit this during normal fleet operations with 6 VMs running
- Blast radius: ALL fleet operations halted — cannot read from, write to, or manage any VM
- Agents affected: 5 lieutenants (chad-dev, chad-fix, infra-ui, marketing, investigator) were running but unreachable
- Data at risk: Lieutenant task output on VMs cannot be retrieved; work may be lost if VMs are reaped
Workaround
None. When the API is down, there is no alternative path to reach VMs.
Proposed Solutions
Immediate:
- Add health monitoring and auto-restart for the API service
- Return `Retry-After` headers on 503s so clients can back off intelligently
Long-term:
- API should degrade gracefully (e.g., read-only mode if write capacity is stressed)
- Expose VM SSH keys via a secondary/cached path so existing connections survive API outages
- Client SDK should have built-in retry with exponential backoff for transient 503s
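The client-side retry idea could look something like the sketch below. This is an assumption about how a Vers SDK helper might be shaped, not the actual SDK API: `call` is any hypothetical function returning `(status, retry_after, body)`, and the wrapper honors `Retry-After` when present, otherwise falling back to exponential backoff with full jitter.

```python
import random
import time

def retry_with_backoff(call, max_attempts=6, base_delay=0.5, max_delay=30.0,
                       sleep=time.sleep):
    """Retry `call` on 503s, honoring Retry-After when the server sends it.

    `call` returns (status, retry_after_seconds_or_None, body). Without a
    Retry-After hint, back off exponentially with full jitter to avoid
    thundering-herd retries against a recovering API.
    """
    for attempt in range(max_attempts):
        status, retry_after, body = call()
        if status != 503:
            return status, body
        if attempt == max_attempts - 1:
            break
        if retry_after is not None:
            delay = min(float(retry_after), max_delay)
        else:
            delay = random.uniform(0.0, min(max_delay, base_delay * 2 ** attempt))
        sleep(delay)
    raise RuntimeError(f"still 503 after {max_attempts} attempts")
```

Injecting `sleep` keeps the helper testable; callers get either a non-503 result or a clear failure after a bounded number of attempts instead of a raw 503 on the first try.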
Agent Experience Note
This hit mid-workflow while I was pulling completed work off a lieutenant VM. The lieutenant had finished generating marketing assets (images via Gemini) and gone idle successfully, but I could never retrieve its output. From the operator's perspective, the fleet just went dark with no warning and no ETA.
Metadata
- Discovered by: orchestrator session, fleet operations
- Date: 2026-02-24 ~03:20 UTC
- VMs running at time of outage: 12 (6 lieutenants + support VMs)
- Duration: 10+ minutes (did not recover before session ended)
- Environment: Production (api.vers.ai)