Add WebSocket health monitoring and reconnection to coordinator #84

@kwsantiago

Problem

frost_coordinator.c connects to Nostr relays for DKG and signing coordination, but lacks robust connection health monitoring and automatic recovery. In production environments, relay connections can drop due to network instability, relay restarts, or timeouts. Without proper handling, signing sessions could fail silently or hang indefinitely.

Current Behavior

  • Single connection attempt per relay
  • No heartbeat/ping monitoring
  • No automatic reconnection on disconnect
  • No multi-relay failover strategy

Proposed Solution

1. Connection Health Monitoring

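typedef struct {
    uint32_t last_pong_time;    /* when the last pong arrived (ms assumed, matching ping_interval_ms) */
    uint32_t ping_interval_ms;  /* time between pings, in ms */
    uint8_t missed_pongs;       /* pings left unanswered since the last pong */
    bool healthy;               /* cleared once too many pongs are missed */
} ws_health_t;

  • Send periodic WebSocket pings (every 30s by default, set via ping_interval_ms)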
  • Track pong responses
  • Mark connection unhealthy after a configurable number of consecutive missed pongs (2-3 suggested)
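
The monitoring steps above could be sketched roughly as follows. This is a minimal, platform-independent sketch, not the coordinator's actual code: ws_health_monitor_t, its last_ping_time and awaiting_pong fields, the three function names, and the threshold of 3 are all assumptions added for illustration. Timestamps are assumed to be millisecond ticks supplied by the caller.

```c
#include <stdbool.h>
#include <stdint.h>

#define WS_MAX_MISSED_PONGS 3 /* assumed threshold, upper end of the 2-3 range */

typedef struct {
    uint32_t last_pong_time;
    uint32_t ping_interval_ms;
    uint8_t missed_pongs;
    bool healthy;
} ws_health_t;

/* Hypothetical wrapper: the proposed struct plus ping bookkeeping. */
typedef struct {
    ws_health_t health;
    uint32_t last_ping_time; /* when the last ping was sent */
    bool awaiting_pong;      /* a ping is outstanding */
} ws_health_monitor_t;

static void ws_health_init(ws_health_monitor_t *m, uint32_t now_ms,
                           uint32_t interval_ms)
{
    m->health.last_pong_time = now_ms;
    m->health.ping_interval_ms = interval_ms;
    m->health.missed_pongs = 0;
    m->health.healthy = true;
    m->last_ping_time = now_ms;
    m->awaiting_pong = false;
}

/* Call periodically. Returns true when the caller should send a ping now. */
static bool ws_health_tick(ws_health_monitor_t *m, uint32_t now_ms)
{
    /* Unsigned subtraction stays correct across tick-counter wraparound. */
    if ((uint32_t)(now_ms - m->last_ping_time) < m->health.ping_interval_ms) {
        return false;
    }
    if (m->awaiting_pong) {
        /* The previous ping was never answered. */
        if (m->health.missed_pongs < UINT8_MAX) {
            m->health.missed_pongs++;
        }
        if (m->health.missed_pongs >= WS_MAX_MISSED_PONGS) {
            m->health.healthy = false;
        }
    }
    m->awaiting_pong = true;
    m->last_ping_time = now_ms;
    return true;
}

/* Call when a pong frame is received. */
static void ws_health_on_pong(ws_health_monitor_t *m, uint32_t now_ms)
{
    m->health.last_pong_time = now_ms;
    m->health.missed_pongs = 0;
    m->health.healthy = true;
    m->awaiting_pong = false;
}
```

Counting a miss only when the next ping comes due keeps the logic free of a separate pong timeout; a tighter per-ping deadline may be preferable if 30s detection latency is too slow for signing sessions.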

2. Automatic Reconnection

#define WS_RECONNECT_BASE_MS     1000
#define WS_RECONNECT_MAX_MS      30000
#define WS_RECONNECT_MAX_ATTEMPTS 5

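typedef struct {
    uint8_t attempt_count;                        /* reconnect attempts made so far */
    uint32_t next_retry_ms;                       /* when the next attempt is due */
    coordinator_state_t state_before_disconnect;  /* state to resume after reconnecting */
} ws_reconnect_t;

  • Exponential backoff: 1s → 2s → 4s → 8s → 16s, doubling from WS_RECONNECT_BASE_MS and capped at WS_RECONNECT_MAX_MS (30s). With WS_RECONNECT_MAX_ATTEMPTS at 5 the cap is never reached; it only applies if the attempt limit is raised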
  • Preserve session state during reconnection
  • Re-subscribe to relevant event filters on reconnect
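
A minimal sketch of the backoff schedule, assuming delays double per attempt from the base value and that attempts are counted from zero. The function names are hypothetical, and the coordinator_state_t enum here is only a stand-in so the example compiles on its own; the real type lives in the coordinator.

```c
#include <stdbool.h>
#include <stdint.h>

#define WS_RECONNECT_BASE_MS      1000
#define WS_RECONNECT_MAX_MS       30000
#define WS_RECONNECT_MAX_ATTEMPTS 5

/* Stand-in for the coordinator's real state type (assumption). */
typedef enum { COORD_IDLE, COORD_DKG, COORD_SIGNING } coordinator_state_t;

typedef struct {
    uint8_t attempt_count;
    uint32_t next_retry_ms;
    coordinator_state_t state_before_disconnect;
} ws_reconnect_t;

/* Delay before attempt number `attempt` (0-based): base * 2^attempt, capped. */
static uint32_t ws_backoff_delay_ms(uint8_t attempt)
{
    uint32_t delay = WS_RECONNECT_BASE_MS;
    /* Stop doubling once the cap is passed, so large counts cannot overflow. */
    while (attempt-- > 0 && delay < WS_RECONNECT_MAX_MS) {
        delay *= 2;
    }
    return delay > WS_RECONNECT_MAX_MS ? WS_RECONNECT_MAX_MS : delay;
}

/* On disconnect: remember the session state and schedule the first retry. */
static void ws_reconnect_begin(ws_reconnect_t *r, coordinator_state_t state,
                               uint32_t now_ms)
{
    r->attempt_count = 0;
    r->state_before_disconnect = state;
    r->next_retry_ms = now_ms + ws_backoff_delay_ms(0);
}

/* After a failed attempt: schedule the next one.
 * Returns false once WS_RECONNECT_MAX_ATTEMPTS have been used up. */
static bool ws_reconnect_schedule_next(ws_reconnect_t *r, uint32_t now_ms)
{
    r->attempt_count++;
    if (r->attempt_count >= WS_RECONNECT_MAX_ATTEMPTS) {
        return false;
    }
    r->next_retry_ms = now_ms + ws_backoff_delay_ms(r->attempt_count);
    return true;
}
```

On a successful reconnect the caller would restore state_before_disconnect and re-send its subscription filters before resuming the session.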

3. Multi-Relay Failover

  • If primary relay fails, attempt secondary
  • Track relay health scores (successful ops / total attempts)
  • Prefer healthier relays for new sessions
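
One possible shape for the health score and relay selection, as a sketch only. relay_health_t and its fields are hypothetical and would need to be folded into the existing relay_connection_t; the neutral score of 500 for an untried relay is an assumption, as is using integer permille rather than floating point.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-relay counters; not the existing relay_connection_t. */
typedef struct {
    const char *url;
    uint32_t ok_ops;    /* successful operations */
    uint32_t total_ops; /* all attempted operations */
    bool connected;
} relay_health_t;

/* Record the outcome of one operation (publish, subscribe, ...). */
static void relay_record(relay_health_t *r, bool success)
{
    r->total_ops++;
    if (success) {
        r->ok_ops++;
    }
}

/* Score = successful ops / total attempts, in permille (0..1000). */
static uint32_t relay_score_permille(const relay_health_t *r)
{
    if (r->total_ops == 0) {
        return 500; /* assumed neutral score for a relay with no history */
    }
    return (uint32_t)(((uint64_t)r->ok_ops * 1000u) / r->total_ops);
}

/* Index of the healthiest connected relay, or -1 if none is connected.
 * Ties go to the lower index, so the primary wins when scores are equal. */
static int relay_pick_best(const relay_health_t *relays, size_t count)
{
    int best = -1;
    uint32_t best_score = 0;
    for (size_t i = 0; i < count; i++) {
        if (!relays[i].connected) {
            continue;
        }
        uint32_t score = relay_score_permille(&relays[i]);
        if (best < 0 || score > best_score) {
            best = (int)i;
            best_score = score;
        }
    }
    return best;
}
```

A lifetime ratio reacts slowly after a long history; a sliding window or decayed counters may suit relays that degrade suddenly.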

4. Session Recovery

  • Buffer outbound events during brief disconnects
  • Replay buffered events on reconnection
  • Timeout and fail session if reconnection exceeds threshold (e.g., 60s)
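
The buffering and timeout behaviour could look something like this fixed-size FIFO. Everything here is a sketch under assumptions: the ws_outbox_* names, the slot count, and the per-event size limit are made up for illustration, and the 60s constant simply mirrors the example threshold above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define WS_OUTBOX_SLOTS                8     /* assumed capacity */
#define WS_OUTBOX_EVENT_MAX            256   /* assumed max event size, incl. NUL */
#define WS_SESSION_RECOVERY_TIMEOUT_MS 60000 /* the "e.g., 60s" threshold */

typedef struct {
    char events[WS_OUTBOX_SLOTS][WS_OUTBOX_EVENT_MAX];
    size_t head;                 /* index of the oldest buffered event */
    size_t count;                /* number of buffered events */
    uint32_t disconnected_at_ms; /* when the connection dropped */
} ws_outbox_t;

/* Call when the connection drops. */
static void ws_outbox_reset(ws_outbox_t *q, uint32_t now_ms)
{
    q->head = 0;
    q->count = 0;
    q->disconnected_at_ms = now_ms;
}

/* Buffer one outbound event. Returns false if it is too long or the
 * buffer is full, so the caller can fail the session explicitly. */
static bool ws_outbox_push(ws_outbox_t *q, const char *event_json)
{
    size_t len = strlen(event_json);
    if (q->count == WS_OUTBOX_SLOTS || len >= WS_OUTBOX_EVENT_MAX) {
        return false;
    }
    size_t slot = (q->head + q->count) % WS_OUTBOX_SLOTS;
    memcpy(q->events[slot], event_json, len + 1);
    q->count++;
    return true;
}

/* Oldest buffered event, or NULL when empty. */
static const char *ws_outbox_peek(const ws_outbox_t *q)
{
    return q->count > 0 ? q->events[q->head] : NULL;
}

/* Drop the oldest event after it has been sent successfully. */
static void ws_outbox_pop(ws_outbox_t *q)
{
    if (q->count > 0) {
        q->head = (q->head + 1) % WS_OUTBOX_SLOTS;
        q->count--;
    }
}

/* True once the disconnect has lasted longer than the recovery threshold. */
static bool ws_outbox_expired(const ws_outbox_t *q, uint32_t now_ms)
{
    return (uint32_t)(now_ms - q->disconnected_at_ms)
           >= WS_SESSION_RECOVERY_TIMEOUT_MS;
}
```

On reconnect, replay would peek, send, and pop in a loop, stopping at the first send failure so that events keep their original order on the next attempt.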

Implementation Notes

  • Use esp_websocket_client ping/pong callbacks
  • Integrate with existing relay_connection_t structure
  • Add connection state to coordinator status reporting
  • Log reconnection attempts for debugging
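
For the status reporting and logging notes, a rough sketch of what an exposed connection state might look like. The ws_conn_state_t enum, ws_conn_status_t struct, and the output format are all hypothetical; the actual coordinator status structure is not shown in this issue.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical connection states for status reporting. */
typedef enum {
    WS_CONN_DISCONNECTED,
    WS_CONN_CONNECTING,
    WS_CONN_CONNECTED,
    WS_CONN_RECONNECTING,
    WS_CONN_FAILED
} ws_conn_state_t;

/* Hypothetical snapshot of the metrics the coordinator would expose. */
typedef struct {
    ws_conn_state_t state;
    uint8_t missed_pongs;
    uint8_t reconnect_attempts;
    bool healthy;
} ws_conn_status_t;

static const char *ws_conn_state_name(ws_conn_state_t s)
{
    switch (s) {
    case WS_CONN_DISCONNECTED: return "disconnected";
    case WS_CONN_CONNECTING:   return "connecting";
    case WS_CONN_CONNECTED:    return "connected";
    case WS_CONN_RECONNECTING: return "reconnecting";
    case WS_CONN_FAILED:       return "failed";
    default:                   return "unknown";
    }
}

/* Format one status line into `out`. Returns the snprintf result:
 * the length that was (or would have been) written, excluding the NUL. */
static int ws_conn_status_format(const ws_conn_status_t *st,
                                 const char *relay_url,
                                 char *out, size_t out_len)
{
    return snprintf(out, out_len,
                    "relay=%s state=%s healthy=%d missed_pongs=%u "
                    "reconnect_attempts=%u",
                    relay_url, ws_conn_state_name(st->state),
                    st->healthy ? 1 : 0,
                    (unsigned)st->missed_pongs,
                    (unsigned)st->reconnect_attempts);
}
```

The same formatted line could be handed to the platform logger on each reconnection attempt, which would keep the debug log and the status report consistent.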

Acceptance Criteria

  • Ping/pong health monitoring with configurable interval
  • Automatic reconnection with exponential backoff
  • Session state preserved across brief disconnects
  • Clean session failure if reconnection exceeds timeout
  • Health metrics exposed via coordinator status
