

Add WebSocket health monitoring and reconnection#133

Open
wksantiago wants to merge 5 commits into main from WebSocket-health

Conversation

@wksantiago (Contributor) commented Jan 23, 2026

Summary

  • Add ping/pong health monitoring for WebSocket connections
  • Implement automatic reconnection with exponential backoff
  • Add event buffering during brief disconnections
  • Add mutex protection for thread safety
  • Add input validation for URLs and subscription IDs

Test plan

  • Verify ping/pong keeps connections alive
  • Test automatic reconnection on disconnect
  • Verify event replay after reconnection

Summary by CodeRabbit

  • New Features

    • Relay aggregated status endpoint showing overall state, connected count, session activity, reconnect attempts, and per-relay scores.
    • Buffered event storage with automatic replay after reconnection.
  • Improvements

    • Smarter reconnect/backoff scheduling and session recovery.
    • WebSocket health checks (ping/pong) and per-relay success/fail tracking.
    • Stronger validation and per-call timeouts for subscriptions and event publishing; publish buffering when relays are unavailable.


@coderabbitai coderabbitai bot commented Jan 23, 2026

Walkthrough

Adds WebSocket health monitoring (ping/pong), per-relay reconnect/backoff state, in-memory event buffering with replay on reconnection, subscription and publish validation, per-relay success/fail scoring, and a public status API exposing aggregated coordinator and relay state.

Changes

Cohort / File(s) — Summary

  • Public API & Types — main/frost_coordinator.h: Added WS config macros, COORDINATOR_STATE_RECONNECTING, ws_health_t, ws_reconnect_t, relay_health_score_t, coordinator_status_t, and declarations for frost_coordinator_get_status() and frost_coordinator_is_healthy().
  • Core Coordinator Logic — main/frost_coordinator.c: Introduced platform-aware locking, time utilities, backoff calculation, ping/pong health checks, reconnect scheduling, per-relay health/reconnect tracking, session recovery handling, and the new public frost_coordinator_get_status() implementation.
  • Event Buffering & Replay — main/frost_coordinator.c: Added a circular in-memory event buffer (buffered_event_t, event_buffer, buffer_head, buffer_count), helpers to buffer/replay/clear events, logic to buffer publishes while relays are reconnecting, and replay on reconnect.
  • Subscription & Publish Flows — main/frost_coordinator.c: Added subscription ID and WebSocket URL validation, stored current_subscription/has_subscription state, per-call timeouts for subscribe/unsubscribe/publish, JSON length limits for events, and CLOSE message handling on unsubscribe.
  • Relay Lifecycle & Scoring — main/frost_coordinator.c: Extended relay state transitions (CONNECTED ↔ RECONNECTING ↔ ERROR), per-relay success/fail counters and relay_health_score_t population, reconnect attempt tracking, and session recovery timeout/reset behavior.
  • Error Handling & Safeguards — main/frost_coordinator.c: Added limits/macros (WS_MAX_EVENT_JSON_LEN, WS_MAX_SUBSCRIPTION_ID, WS_SEND_TIMEOUT_MS, buffer size, timeouts), clearing/reset of buffers and health/reconnect state on disconnect, and conditional logging for publish outcomes.
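The event-buffering cohort above can be sketched as a small circular buffer. This is a minimal, self-contained illustration, not the PR's code: the names buffered_event_t, event_buffer, buffer_head, and buffer_count come from the summary, but the sizes, the in_use flag, and the send-callback replay interface are assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Sizes are illustrative; the PR's actual limits come from
 * WS_MAX_EVENT_JSON_LEN and its buffer-size macro. */
#define EVENT_BUFFER_SIZE 8
#define MAX_EVENT_JSON_LEN 256

typedef struct {
    char json[MAX_EVENT_JSON_LEN];
    bool in_use;
} buffered_event_t;

static buffered_event_t event_buffer[EVENT_BUFFER_SIZE];
static int buffer_head = 0;   /* next slot to write */
static int buffer_count = 0;  /* occupied slots, capped at EVENT_BUFFER_SIZE */

/* Store an event; when the buffer is full, the oldest entry is overwritten. */
static void buffer_event(const char *json) {
    buffered_event_t *slot = &event_buffer[buffer_head];
    strncpy(slot->json, json, MAX_EVENT_JSON_LEN - 1);
    slot->json[MAX_EVENT_JSON_LEN - 1] = '\0';
    slot->in_use = true;
    buffer_head = (buffer_head + 1) % EVENT_BUFFER_SIZE;
    if (buffer_count < EVENT_BUFFER_SIZE) buffer_count++;
}

/* Replay oldest-first through a caller-supplied send function, then clear. */
static int replay_buffered_events(int (*send_fn)(const char *json)) {
    int start = (buffer_head - buffer_count + EVENT_BUFFER_SIZE) % EVENT_BUFFER_SIZE;
    int sent = 0;
    for (int i = 0; i < buffer_count; i++) {
        buffered_event_t *slot = &event_buffer[(start + i) % EVENT_BUFFER_SIZE];
        if (slot->in_use && send_fn(slot->json) == 0) sent++;
        slot->in_use = false;
    }
    buffer_count = 0;
    return sent;
}
```

In the PR these helpers run under the coordinator mutex; the review threads below are largely about which call sites hold that lock.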

Sequence Diagram(s)

sequenceDiagram
    participant App as Client/App
    participant Coord as Coordinator
    participant Buffer as EventBuffer
    participant WS as Relay (WebSocket)

    rect rgba(200,150,255,0.5)
    Note over App,WS: Normal publish flow
    App->>Coord: publish_event(json)
    Coord->>WS: send event (per-relay, timeout)
    WS-->>Coord: ACK
    Coord-->>App: success
    end

    rect rgba(255,150,150,0.5)
    Note over WS,Coord: Connection loss & buffering
    WS--xCoord: disconnect / missed pong
    Coord->>Coord: mark RECONNECTING, schedule backoff
    App->>Coord: publish_event(json)
    Coord->>Buffer: buffer event (circular buffer)
    end

    rect rgba(150,200,255,0.5)
    Note over Coord,WS: Reconnect & health check
    Coord->>WS: reconnect attempt (backoff)
    WS-->>Coord: connected
    Coord->>WS: ping
    WS-->>Coord: pong
    Coord->>Coord: mark CONNECTED
    end

    rect rgba(150,255,200,0.5)
    Note over Buffer,WS: Replay buffered events
    Coord->>Buffer: replay_buffered_events()
    Buffer->>WS: send buffered event 1..N
    WS-->>Coord: ACKs
    Coord->>Buffer: clear buffer
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 I nudge the wires, soft and spry,

I buffer whispers when links go dry,
I ping, await the pong's bright tune,
Backoff, retry beneath the moon,
Then replay all — connected soon. 🎩✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
  • Docstring Coverage — ⚠️ Warning: docstring coverage is 0.00%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.
✅ Passed checks (2 passed)
  • Description Check — ✅ Passed: check skipped because CodeRabbit’s high-level summary is enabled.
  • Title Check — ✅ Passed: the title 'Add WebSocket health monitoring and reconnection' directly and accurately summarizes the main changes, which include ping/pong health monitoring and automatic reconnection with backoff logic.



@wksantiago wksantiago requested a review from kwsantiago January 23, 2026 13:56
@wksantiago wksantiago self-assigned this Jan 23, 2026
@wksantiago wksantiago linked an issue Jan 23, 2026 that may be closed by this pull request

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
main/frost_coordinator.c (1)

678-720: Guard event buffer mutations with the mutex.

buffer_event is called without COORDINATOR_LOCK, but replay_buffered_events/clear_event_buffer mutate the same buffer under lock. This can race and corrupt/free the buffer.

🔒 Suggested fix
     if (published == 0 && any_reconnecting) {
+        COORDINATOR_LOCK();
         buffer_event(msg);
+        COORDINATOR_UNLOCK();
         ESP_LOGI(TAG, "Buffered event during reconnection");
     }
🤖 Fix all issues with AI agents
In `@main/frost_coordinator.c`:
- Around line 494-513: The code currently truncates overlong relay URLs in
frost_coordinator_add_relay by using strncpy, which can silently connect to the
wrong endpoint; update frost_coordinator_add_relay to first check the length of
url against RELAY_URL_LEN (e.g. if (strlen(url) >= RELAY_URL_LEN) return an
error like -4), log an error via ESP_LOGE mentioning the URL is too long, and
only then copy the string safely into relay->url (or use a bounded copy after
the length check); reference validate_websocket_url,
frost_coordinator_add_relay, relay_connection_t, and RELAY_URL_LEN so the length
check and failure path are added before the current memset/strncpy block.
- Around line 601-613: The validate_subscription_id function allows IDs of
length WS_MAX_SUBSCRIPTION_ID due to an off-by-one check (if (len >
WS_MAX_SUBSCRIPTION_ID)), which then get truncated by copy_subscription_id;
change the length check to reject IDs with length >= WS_MAX_SUBSCRIPTION_ID so
valid IDs cannot exceed the buffer (update the condition in
validate_subscription_id to use >=), and keep the existing checks for control
chars and quotes; reference validate_subscription_id and copy_subscription_id
and the WS_MAX_SUBSCRIPTION_ID constant when making the change.
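A corrected validate_subscription_id along the lines of the second agent prompt might look like the sketch below. This is illustrative, not the PR's code: the WS_MAX_SUBSCRIPTION_ID value is assumed, and the character checks simply mirror the control-character and quote rules the review says already exist.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define WS_MAX_SUBSCRIPTION_ID 64  /* illustrative; must match the id buffer size */

/* Reject IDs that would not fit the destination buffer: >= (not >) leaves
 * room for the NUL terminator, so copy_subscription_id never truncates. */
static bool validate_subscription_id(const char *id) {
    if (id == NULL) return false;
    size_t len = strlen(id);
    if (len == 0 || len >= WS_MAX_SUBSCRIPTION_ID) return false;  /* was: > */
    for (size_t i = 0; i < len; i++) {
        unsigned char c = (unsigned char)id[i];
        if (iscntrl(c) || c == '"') return false;  /* existing checks, per the review */
    }
    return true;
}
```

With the >= check, the longest accepted ID is WS_MAX_SUBSCRIPTION_ID − 1 characters, so the copy always fits with its terminator.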
🧹 Nitpick comments (1)
main/frost_coordinator.c (1)

790-865: Avoid holding the mutex while performing WS I/O.

send_ping and replay_buffered_events perform esp_websocket_client_send_* while COORDINATOR_LOCK is held, which can block and starve other operations. Consider copying the needed state under lock, then sending outside the lock.
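One way to follow this suggestion is the snapshot pattern: copy what the send needs while holding the lock, release it, then do the blocking I/O. The sketch below shows the shape only, with pthreads standing in for the coordinator's platform-aware lock; send_ws is a hypothetical stand-in for esp_websocket_client_send_text, and pending_json for whatever state the real helpers read.

```c
#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>

static pthread_mutex_t coord_lock = PTHREAD_MUTEX_INITIALIZER;

/* Shared state guarded by coord_lock; the size is illustrative. */
static char pending_json[256] = "{\"kind\":1}";
static char last_sent[256];

/* Hypothetical stand-in for esp_websocket_client_send_text(); the real
 * call can block up to its per-call timeout. */
static int send_ws(const char *json) {
    strncpy(last_sent, json, sizeof last_sent - 1);
    last_sent[sizeof last_sent - 1] = '\0';
    return 0;
}

/* Snapshot the shared state under the lock, then do the slow I/O with
 * the lock released, so poll() and WS callbacks are not starved. */
static int send_pending(void) {
    char snapshot[sizeof pending_json];
    pthread_mutex_lock(&coord_lock);
    memcpy(snapshot, pending_json, sizeof snapshot);
    pthread_mutex_unlock(&coord_lock);
    return send_ws(snapshot);  /* blocking call runs without the lock */
}
```

The trade-off is that the snapshot can be stale by the time it is sent, which is usually acceptable for pings and replay but worth noting for state-carrying messages.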


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
main/frost_coordinator.c (2)

621-655: Thread safety: frost_coordinator_subscribe modifies shared state without lock.

This function modifies g_ctx.current_subscription and g_ctx.has_subscription (lines 629-631) without holding the mutex. These fields are read by handle_ws_connected under lock (lines 236-240), creating a potential data race.

🔒 Suggested fix: protect shared state modification
 int frost_coordinator_subscribe(const char *subscription_id) {
     if (!g_initialized || !g_ctx.has_group)
         return -1;
     if (!validate_subscription_id(subscription_id)) {
         ESP_LOGE(TAG, "Invalid subscription ID");
         return -2;
     }

+    COORDINATOR_LOCK();
     copy_subscription_id(g_ctx.current_subscription, sizeof(g_ctx.current_subscription),
                          subscription_id);
     g_ctx.has_subscription = true;
+    COORDINATOR_UNLOCK();

     char pubkey_hex[65];
     ...

657-682: Thread safety: frost_coordinator_unsubscribe modifies shared state without lock.

Similar to frost_coordinator_subscribe, this function modifies g_ctx.has_subscription and g_ctx.current_subscription without holding the mutex.

🤖 Fix all issues with AI agents
In `@main/frost_coordinator.c`:
- Around line 308-322: handle_ws_error may double-increment relay->fail_count
because it increments on the max-attempts path and also calls
save_reconnect_state (which itself increments fail_count) on the other path;
change the control so fail_count is incremented exactly once by either moving
the increment into save_reconnect_state only, or by adding a flag/parameter to
save_reconnect_state (e.g., save_reconnect_state(relay, is_final)) and call it
with is_final=false on transient path and true on final path, or skip calling
save_reconnect_state when attempt_count >= WS_RECONNECT_MAX_ATTEMPTS; update
handle_ws_error and save_reconnect_state accordingly and ensure
relay->reconnect.attempt_count/WS_RECONNECT_MAX_ATTEMPTS logic remains
consistent.
- Around line 97-115: buffer_event currently mutates shared state
(g_ctx.event_buffer, buffer_head, buffer_count) without synchronization; ensure
these mutations occur under COORDINATOR_LOCK by acquiring the same lock used in
frost_coordinator_poll before calling buffer_event from publish_event (or
alternatively, add lock/unlock inside buffer_event itself). Update publish_event
to lock COORDINATOR_LOCK around the call to buffer_event (and any other accesses
to g_ctx.event_buffer) so replay_buffered_events (which runs under the same
lock) cannot race, referencing the buffer_event, publish_event,
replay_buffered_events, COORDINATOR_LOCK, and frost_coordinator_poll symbols
when making the change.
- Around line 117-126: clear_event_buffer is being invoked from
frost_coordinator_disconnect without holding the context mutex, risking races
with concurrent event-handler callbacks; modify frost_coordinator_disconnect to
acquire the same mutex used to guard g_ctx.event_buffer (e.g., lock g_ctx.mutex
or g_ctx.event_mutex) before calling clear_event_buffer and release it after the
call, ensuring all buffer mutations are protected; alternatively, if
clear_event_buffer is called from multiple call sites where locking is not
guaranteed, add locking inside clear_event_buffer itself and document that it
performs its own locking to avoid double-locks.
🧹 Nitpick comments (1)
main/frost_coordinator.c (1)

92-95: Potential integer overflow before the cap is applied.

WS_RECONNECT_BASE_MS << attempt can overflow before the comparison with WS_RECONNECT_MAX_MS if attempt is large. For a signed 32-bit base of 1000, a left shift of 22 or more overflows, which is undefined behavior in C; a shift count of 32 or more is undefined for any 32-bit operand.

Consider capping early or using multiplication with overflow check:

♻️ Suggested fix
 static uint32_t calculate_backoff(uint8_t attempt) {
+    if (attempt > 20) {
+        return WS_RECONNECT_MAX_MS;
+    }
     uint32_t delay = WS_RECONNECT_BASE_MS << attempt;
     return delay > WS_RECONNECT_MAX_MS ? WS_RECONNECT_MAX_MS : delay;
 }


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
main/frost_coordinator.c (1)

627-697: Make subscription state updates thread-safe.

frost_coordinator_subscribe/unsubscribe update g_ctx.current_subscription, g_ctx.has_subscription, and g_ctx.state without COORDINATOR_LOCK. With reconnect paths invoking subscribe from the WS event thread, this becomes a data race. Lock these state mutations (and consider snapshotting relay targets if you need to avoid holding the lock during sends).

🔒 Proposed fix
-    copy_subscription_id(g_ctx.current_subscription, sizeof(g_ctx.current_subscription),
-                         subscription_id);
-    g_ctx.has_subscription = true;
+    COORDINATOR_LOCK();
+    copy_subscription_id(g_ctx.current_subscription, sizeof(g_ctx.current_subscription),
+                         subscription_id);
+    g_ctx.has_subscription = true;
+    COORDINATOR_UNLOCK();
@@
-    g_ctx.state = COORDINATOR_STATE_ACTIVE;
+    COORDINATOR_LOCK();
+    g_ctx.state = COORDINATOR_STATE_ACTIVE;
+    COORDINATOR_UNLOCK();
@@
-    g_ctx.has_subscription = false;
-    memset(g_ctx.current_subscription, 0, sizeof(g_ctx.current_subscription));
+    COORDINATOR_LOCK();
+    g_ctx.has_subscription = false;
+    memset(g_ctx.current_subscription, 0, sizeof(g_ctx.current_subscription));
+    COORDINATOR_UNLOCK();
♻️ Duplicate comments (1)
main/frost_coordinator.c (1)

603-611: Guard disconnect state resets with the mutex.

frost_coordinator_disconnect mutates shared coordinator state without COORDINATOR_LOCK, while other paths read it under lock. This can race with WS callbacks or poll().

🔒 Proposed fix
-    clear_event_buffer();
-    g_ctx.has_subscription = false;
-    g_ctx.disconnect_time = 0;
-    g_ctx.state = COORDINATOR_STATE_IDLE;
+    COORDINATOR_LOCK();
+    clear_event_buffer_unlocked();
+    g_ctx.has_subscription = false;
+    g_ctx.disconnect_time = 0;
+    g_ctx.state = COORDINATOR_STATE_IDLE;
+    COORDINATOR_UNLOCK();
🧹 Nitpick comments (2)
main/frost_coordinator.c (2)

708-745: Consider buffering when any relay is reconnecting.

Right now buffering happens only when published == 0, so reconnecting relays miss events if at least one relay is connected. If the intent is replay for any disconnected relay, consider buffering whenever any_reconnecting is true (and dedupe as needed).
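If the intended semantics are per-relay replay, the publish path could take roughly the shape below. This is a toy sketch of the suggestion only: relay_connected, send_to_relay, and the counters are hypothetical stand-ins for the PR's relay_connection_t state and send/buffer helpers, and it omits the locking and dedup concerns discussed elsewhere in this review.

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_RELAYS 3

/* Toy stand-ins for relay_connection_t state and the send/buffer paths. */
static bool relay_connected[NUM_RELAYS] = { true, false, true };
static int events_published = 0;
static int events_buffered = 0;

static void send_to_relay(int relay, const char *json) {
    (void)relay; (void)json;
    events_published++;
}

static void buffer_event(const char *json) {
    (void)json;
    events_buffered++;
}

/* Buffer whenever ANY relay is down, not only when every publish failed,
 * so a reconnecting relay still receives the event on replay. */
static void publish_event(const char *json) {
    bool any_reconnecting = false;
    for (int i = 0; i < NUM_RELAYS; i++) {
        if (relay_connected[i]) send_to_relay(i, json);
        else any_reconnecting = true;
    }
    if (any_reconnecting) buffer_event(json);
}
```

The cost of this variant is that replay will re-send events to relays that already received them, which is why the comment suggests deduplication if this path is taken.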


816-890: Reduce time under COORDINATOR_LOCK in the poll path.

frost_coordinator_poll holds the mutex while invoking ping/reconnect/replay helpers. Consider snapshotting work under the lock, then performing I/O outside it to reduce contention.



Development

Successfully merging this pull request may close these issues.

Add WebSocket health monitoring and reconnection to coordinator
