
Vers API returns sustained 503 Service Unavailable — blocks all VM operations #55

@wwaIII

Description


Vers Platform Issue: Sustained 503 on All API Endpoints

Component: Vers REST API (api.vers.ai)
Severity: Critical
Affects: All agents, all VM operations

Summary

The Vers API started returning 503 Service Unavailable on all endpoints (GET /vms, GET /vm/:id/ssh_key, etc.) and remained down for 10+ minutes with no recovery. This completely blocks all VM operations — SSH, listing, committing, branching — and strands any running lieutenants/swarm agents.

What I Was Trying To Do

SSH into a lieutenant VM (c8f9cadc-4c58-4eab-a847-df31723c7d16) to check the output of a completed task (image generation via Gemini API).

What Went Wrong

Every API call returns 503 Service Unavailable. Tried repeatedly over ~10 minutes with no recovery.

Expected Behavior

API should return VM data or, if temporarily overloaded, recover within seconds with an appropriate Retry-After header.

Actual Behavior

Sustained 503 on every endpoint. No Retry-After header. No degraded mode. Complete blackout.

Minimal Reproduction

```shell
# All of these return 503:
curl -H "Authorization: Bearer $VERS_API_KEY" https://api.vers.ai/vms
# => 503 Service unavailable

curl -H "Authorization: Bearer $VERS_API_KEY" https://api.vers.ai/vm/<any-vm-id>/ssh_key
# => 503 Service unavailable
```
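To put timestamps on the outage window, a probe loop along these lines can log the status of each attempt. This is an illustrative sketch, not an official tool: the endpoint list and log format are assumptions, and `VERS_API_KEY` is assumed to be exported as in the curl calls above.

```shell
# Illustrative outage probe. Logs one line per attempt:
# "<UTC timestamp> <HTTP status> <URL>".

log_line() {
  # Pure formatter, kept separate so the log shape stays consistent.
  printf '%s %s %s\n' "$1" "$2" "$3"
}

probe_once() {
  for url in "https://api.vers.ai/vms"; do
    code=$(curl -s -o /dev/null -w '%{http_code}' \
      -H "Authorization: Bearer $VERS_API_KEY" "$url")
    log_line "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$code" "$url"
  done
}

# e.g. run once a second for ten minutes:
#   i=0; while [ "$i" -lt 600 ]; do probe_once; sleep 1; i=$((i + 1)); done
```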

Evidence

Calls that failed (all within a ~10 minute window starting ~2026-02-24T03:20:00Z):

  1. vers_vm_use("c8f9cadc-4c58-4eab-a847-df31723c7d16") → Vers API GET /vm/.../ssh_key failed (503): 503 Service unavailable
  2. vers_vms() → Vers API GET /vms failed (503): 503 Service unavailable
  3. Retried 6+ times over 10 minutes — never recovered during the session

Impact Assessment

  • Blocks work completely: no workaround exists
  • Frequency: Hit this during normal fleet operations with 6 VMs running
  • Blast radius: ALL fleet operations halted — cannot read from, write to, or manage any VM
  • Agents affected: 5 lieutenants (chad-dev, chad-fix, infra-ui, marketing, investigator) were running but unreachable
  • Data at risk: Lieutenant task output on VMs cannot be retrieved; work may be lost if VMs are reaped

Workaround

None. When the API is down, there is no alternative path to reach VMs.

Proposed Solutions

Immediate:

  • Add health monitoring and auto-restart for the API service
  • Return Retry-After headers on 503s so clients can back off intelligently

Long-term:

  • API should degrade gracefully (e.g., read-only mode if write capacity is stressed)
  • Expose VM SSH keys via a secondary/cached path so existing connections survive API outages
  • Client SDK should have built-in retry with exponential backoff for transient 503s
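The client-side backoff proposal could look roughly like the sketch below. The base delay, cap, and attempt limit are illustrative, not actual SDK behavior, and a real client should also honor a Retry-After header when the server sends one.

```shell
# Sketch of exponential backoff for transient 503s. Numbers are illustrative.

backoff_delay() {
  # Delay in seconds for a 1-based attempt number: 1, 2, 4, 8, ... capped at 60.
  attempt="$1"; base=1; cap=60
  d=$(( base << (attempt - 1) ))
  [ "$d" -gt "$cap" ] && d="$cap"
  echo "$d"
}

with_retry() {
  # Retry a GET until it stops returning 503, up to six attempts.
  url="$1"; attempt=1
  while [ "$attempt" -le 6 ]; do
    code=$(curl -s -o /dev/null -w '%{http_code}' \
      -H "Authorization: Bearer $VERS_API_KEY" "$url")
    [ "$code" != "503" ] && { echo "$code"; return 0; }
    sleep "$(backoff_delay "$attempt")"
    attempt=$(( attempt + 1 ))
  done
  return 1
}
```

For the simple cases, curl's own `--retry N` flag already retries transient HTTP errors such as 503 with growing delays.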

Agent Experience Note

This hit mid-workflow while I was pulling completed work off a lieutenant VM. The lieutenant had finished generating marketing assets (images via Gemini) and gone idle successfully, but I could never retrieve the output. From the operator's perspective, the fleet just went dark with no warning and no ETA.

Metadata

  • Discovered by: orchestrator session, fleet operations
  • Date: 2026-02-24 ~03:20 UTC
  • VMs running at time of outage: 12 (6 lieutenants + support VMs)
  • Duration: 10+ minutes (did not recover before session ended)
  • Environment: Production (api.vers.ai)

