
Vers API returns sustained 503 Service Unavailable — blocks all VM operations #55

@wwaIII

Description


Vers Platform Issue: Sustained 503 on All API Endpoints

Component: Vers REST API (api.vers.ai)
Severity: Critical
Affects: All agents, all VM operations

Summary

The Vers API started returning 503 Service Unavailable on all endpoints (GET /vms, GET /vm/:id/ssh_key, etc.) and remained down for 10+ minutes with no recovery. This completely blocks all VM operations — SSH, listing, committing, branching — and strands any running lieutenants/swarm agents.

What I Was Trying To Do

SSH into a lieutenant VM (c8f9cadc-4c58-4eab-a847-df31723c7d16) to check the output of a completed task (image generation via Gemini API).

What Went Wrong

Every API call returns 503 Service Unavailable. Tried repeatedly over ~10 minutes with no recovery.

Expected Behavior

API should return VM data or, if temporarily overloaded, recover within seconds with an appropriate Retry-After header.

Actual Behavior

Sustained 503 on every endpoint. No Retry-After header. No degraded mode. Complete blackout.

Minimal Reproduction

```shell
# All of these return 503:
curl -H "Authorization: Bearer $VERS_API_KEY" https://api.vers.ai/vms
# => 503 Service unavailable

curl -H "Authorization: Bearer $VERS_API_KEY" https://api.vers.ai/vm/<any-vm-id>/ssh_key
# => 503 Service unavailable
```
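To put timestamps on the outage window, a probe loop along these lines can log the status of each attempt. This is an illustrative sketch, not an official tool: the endpoint list and log format are assumptions, and `VERS_API_KEY` is assumed to be exported as in the curl calls above.

```shell
# Illustrative outage probe. Logs one line per attempt:
# "<UTC timestamp> <HTTP status> <URL>".

log_line() {
  # Pure formatter, kept separate so the log shape stays consistent.
  printf '%s %s %s\n' "$1" "$2" "$3"
}

probe_once() {
  for url in "https://api.vers.ai/vms"; do
    code=$(curl -s -o /dev/null -w '%{http_code}' \
      -H "Authorization: Bearer $VERS_API_KEY" "$url")
    log_line "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$code" "$url"
  done
}

# e.g. run once a second for ten minutes:
#   i=0; while [ "$i" -lt 600 ]; do probe_once; sleep 1; i=$((i + 1)); done
```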

Evidence

Calls that failed (all within a ~10 minute window starting ~2026-02-24T03:20:00Z):

  1. vers_vm_use("c8f9cadc-4c58-4eab-a847-df31723c7d16") → Vers API GET /vm/.../ssh_key failed (503): 503 Service unavailable
  2. vers_vms() → Vers API GET /vms failed (503): 503 Service unavailable
  3. Retried 6+ times over 10 minutes — never recovered during the session

Impact Assessment

  • Blocks work completely: no workaround exists
  • Frequency: Hit this during normal fleet operations with 6 VMs running
  • Blast radius: ALL fleet operations halted — cannot read from, write to, or manage any VM
  • Agents affected: 5 lieutenants (chad-dev, chad-fix, infra-ui, marketing, investigator) were running but unreachable
  • Data at risk: Lieutenant task output on VMs cannot be retrieved; work may be lost if VMs are reaped

Workaround

None. When the API is down, there is no alternative path to reach VMs.

Proposed Solutions

Immediate:

  • Add health monitoring and auto-restart for the API service
  • Return Retry-After headers on 503s so clients can back off intelligently

Long-term:

  • API should degrade gracefully (e.g., read-only mode if write capacity is stressed)
  • Expose VM SSH keys via a secondary/cached path so existing connections survive API outages
  • Client SDK should have built-in retry with exponential backoff for transient 503s
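The client-side backoff proposal could look roughly like the sketch below. The base delay, cap, and attempt limit are illustrative, not actual SDK behavior, and a real client should also honor a Retry-After header when the server sends one.

```shell
# Sketch of exponential backoff for transient 503s. Numbers are illustrative.

backoff_delay() {
  # Delay in seconds for a 1-based attempt number: 1, 2, 4, 8, ... capped at 60.
  attempt="$1"; base=1; cap=60
  d=$(( base << (attempt - 1) ))
  [ "$d" -gt "$cap" ] && d="$cap"
  echo "$d"
}

with_retry() {
  # Retry a GET until it stops returning 503, up to six attempts.
  url="$1"; attempt=1
  while [ "$attempt" -le 6 ]; do
    code=$(curl -s -o /dev/null -w '%{http_code}' \
      -H "Authorization: Bearer $VERS_API_KEY" "$url")
    [ "$code" != "503" ] && { echo "$code"; return 0; }
    sleep "$(backoff_delay "$attempt")"
    attempt=$(( attempt + 1 ))
  done
  return 1
}
```

For the simple cases, curl's own `--retry N` flag already retries transient HTTP errors such as 503 with growing delays.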

Agent Experience Note

This hit mid-workflow while I was pulling completed work off a lieutenant VM. The lieutenant had finished generating marketing assets (images via Gemini) and gone idle successfully, but I could never retrieve the output. From the operator's perspective, the fleet just went dark with no warning and no ETA.

Metadata

  • Discovered by: orchestrator session, fleet operations
  • Date: 2026-02-24 ~03:20 UTC
  • VMs running at time of outage: 12 (6 lieutenants + support VMs)
  • Duration: 10+ minutes (did not recover before session ended)
  • Environment: Production (api.vers.ai)

