

Add WebSocket health monitoring and reconnection#133

Open
wksantiago wants to merge 5 commits into main from WebSocket-health

Conversation

@wksantiago (Contributor) commented Jan 23, 2026

Summary

  • Add ping/pong health monitoring for WebSocket connections
  • Implement automatic reconnection with exponential backoff
  • Add event buffering during brief disconnections
  • Add mutex protection for thread safety
  • Add input validation for URLs and subscription IDs

Test plan

  • Verify ping/pong keeps connections alive
  • Test automatic reconnection on disconnect
  • Verify event replay after reconnection

Summary by CodeRabbit

  • New Features

    • Relay aggregated status endpoint showing overall state, connected count, session activity, reconnect attempts, and per-relay scores.
    • Buffered event storage with automatic replay after reconnection.
  • Improvements

    • Smarter reconnect/backoff scheduling and session recovery.
    • WebSocket health checks (ping/pong) and per-relay success/fail tracking.
    • Stronger validation and per-call timeouts for subscriptions and event publishing; publish buffering when relays are unavailable.


@coderabbitai coderabbitai bot commented Jan 23, 2026

Walkthrough

Adds WebSocket health monitoring (ping/pong), per-relay reconnect/backoff state, in-memory event buffering with replay on reconnection, subscription and publish validation, per-relay success/fail scoring, and a public status API exposing aggregated coordinator and relay state.

Changes

Cohort / File(s) — Summary

  • Public API & Types — main/frost_coordinator.h: Added WS config macros, COORDINATOR_STATE_RECONNECTING, ws_health_t, ws_reconnect_t, relay_health_score_t, coordinator_status_t, and declarations for frost_coordinator_get_status() and frost_coordinator_is_healthy().
  • Core Coordinator Logic — main/frost_coordinator.c: Introduced platform-aware locking, time utilities, backoff calculation, ping/pong health checks, reconnect scheduling, per-relay health/reconnect tracking, session recovery handling, and the new public frost_coordinator_get_status() implementation.
  • Event Buffering & Replay — main/frost_coordinator.c: Added a circular in-memory event buffer (buffered_event_t, event_buffer, buffer_head, buffer_count), helpers to buffer/replay/clear events, logic to buffer publishes while relays are reconnecting, and replay on reconnect.
  • Subscription & Publish Flows — main/frost_coordinator.c: Added subscription ID and WebSocket URL validation, stored current_subscription/has_subscription state, per-call timeouts for subscribe/unsubscribe/publish, JSON length limits for events, and CLOSE message handling on unsubscribe.
  • Relay Lifecycle & Scoring — main/frost_coordinator.c: Extended relay state transitions (CONNECTED ↔ RECONNECTING ↔ ERROR), per-relay success/fail counters and relay_health_score_t population, reconnect attempt tracking, and session recovery timeout/reset behavior.
  • Error Handling & Safeguards — main/frost_coordinator.c: Added limits/macros (WS_MAX_EVENT_JSON_LEN, WS_MAX_SUBSCRIPTION_ID, WS_SEND_TIMEOUT_MS, buffer size, timeouts), clearing/reset of buffers and health/reconnect state on disconnect, and conditional logging for publish outcomes.
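The event-buffering cohort above can be sketched as a small circular buffer. This is a minimal, self-contained illustration, not the PR's code: the names buffered_event_t, event_buffer, buffer_head, and buffer_count come from the summary, but the sizes, the in_use flag, and the send-callback replay interface are assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Sizes are illustrative; the PR's actual limits come from
 * WS_MAX_EVENT_JSON_LEN and its buffer-size macro. */
#define EVENT_BUFFER_SIZE 8
#define MAX_EVENT_JSON_LEN 256

typedef struct {
    char json[MAX_EVENT_JSON_LEN];
    bool in_use;
} buffered_event_t;

static buffered_event_t event_buffer[EVENT_BUFFER_SIZE];
static int buffer_head = 0;   /* next slot to write */
static int buffer_count = 0;  /* occupied slots, capped at EVENT_BUFFER_SIZE */

/* Store an event; when the buffer is full, the oldest entry is overwritten. */
static void buffer_event(const char *json) {
    buffered_event_t *slot = &event_buffer[buffer_head];
    strncpy(slot->json, json, MAX_EVENT_JSON_LEN - 1);
    slot->json[MAX_EVENT_JSON_LEN - 1] = '\0';
    slot->in_use = true;
    buffer_head = (buffer_head + 1) % EVENT_BUFFER_SIZE;
    if (buffer_count < EVENT_BUFFER_SIZE) buffer_count++;
}

/* Replay oldest-first through a caller-supplied send function, then clear. */
static int replay_buffered_events(int (*send_fn)(const char *json)) {
    int start = (buffer_head - buffer_count + EVENT_BUFFER_SIZE) % EVENT_BUFFER_SIZE;
    int sent = 0;
    for (int i = 0; i < buffer_count; i++) {
        buffered_event_t *slot = &event_buffer[(start + i) % EVENT_BUFFER_SIZE];
        if (slot->in_use && send_fn(slot->json) == 0) sent++;
        slot->in_use = false;
    }
    buffer_count = 0;
    return sent;
}
```

In the PR these helpers run under the coordinator mutex; the review threads below are largely about which call sites hold that lock.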

Sequence Diagram(s)

sequenceDiagram
    participant App as Client/App
    participant Coord as Coordinator
    participant Buffer as EventBuffer
    participant WS as Relay (WebSocket)

    rect rgba(200,150,255,0.5)
    Note over App,WS: Normal publish flow
    App->>Coord: publish_event(json)
    Coord->>WS: send event (per-relay, timeout)
    WS-->>Coord: ACK
    Coord-->>App: success
    end

    rect rgba(255,150,150,0.5)
    Note over WS,Coord: Connection loss & buffering
    WS--xCoord: disconnect / missed pong
    Coord->>Coord: mark RECONNECTING, schedule backoff
    App->>Coord: publish_event(json)
    Coord->>Buffer: buffer event (circular buffer)
    end

    rect rgba(150,200,255,0.5)
    Note over Coord,WS: Reconnect & health check
    Coord->>WS: reconnect attempt (backoff)
    WS-->>Coord: connected
    Coord->>WS: ping
    WS-->>Coord: pong
    Coord->>Coord: mark CONNECTED
    end

    rect rgba(150,255,200,0.5)
    Note over Buffer,WS: Replay buffered events
    Coord->>Buffer: replay_buffered_events()
    Buffer->>WS: send buffered event 1..N
    WS-->>Coord: ACKs
    Coord->>Buffer: clear buffer
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 I nudge the wires, soft and spry,

I buffer whispers when links go dry,
I ping, await the pong's bright tune,
Backoff, retry beneath the moon,
Then replay all — connected soon. 🎩✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
  • Docstring Coverage — ⚠️ Warning: docstring coverage is 0.00%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.
✅ Passed checks (2 passed)
  • Description Check — ✅ Passed: check skipped because CodeRabbit’s high-level summary is enabled.
  • Title Check — ✅ Passed: the title 'Add WebSocket health monitoring and reconnection' directly and accurately summarizes the main changes, which include ping/pong health monitoring and automatic reconnection with backoff logic.



@wksantiago wksantiago requested a review from kwsantiago January 23, 2026 13:56
@wksantiago wksantiago self-assigned this Jan 23, 2026
@wksantiago wksantiago linked an issue Jan 23, 2026 that may be closed by this pull request

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
main/frost_coordinator.c (1)

678-720: Guard event buffer mutations with the mutex.

buffer_event is called without COORDINATOR_LOCK, but replay_buffered_events/clear_event_buffer mutate the same buffer under lock. This can race and corrupt/free the buffer.

🔒 Suggested fix
     if (published == 0 && any_reconnecting) {
+        COORDINATOR_LOCK();
         buffer_event(msg);
+        COORDINATOR_UNLOCK();
         ESP_LOGI(TAG, "Buffered event during reconnection");
     }
🤖 Fix all issues with AI agents
In `@main/frost_coordinator.c`:
- Around line 494-513: The code currently truncates overlong relay URLs in
frost_coordinator_add_relay by using strncpy, which can silently connect to the
wrong endpoint; update frost_coordinator_add_relay to first check the length of
url against RELAY_URL_LEN (e.g. if (strlen(url) >= RELAY_URL_LEN) return an
error like -4), log an error via ESP_LOGE mentioning the URL is too long, and
only then copy the string safely into relay->url (or use a bounded copy after
the length check); reference validate_websocket_url,
frost_coordinator_add_relay, relay_connection_t, and RELAY_URL_LEN so the length
check and failure path are added before the current memset/strncpy block.
- Around line 601-613: The validate_subscription_id function allows IDs of
length WS_MAX_SUBSCRIPTION_ID due to an off-by-one check (if (len >
WS_MAX_SUBSCRIPTION_ID)), which then get truncated by copy_subscription_id;
change the length check to reject IDs with length >= WS_MAX_SUBSCRIPTION_ID so
valid IDs cannot exceed the buffer (update the condition in
validate_subscription_id to use >=), and keep the existing checks for control
chars and quotes; reference validate_subscription_id and copy_subscription_id
and the WS_MAX_SUBSCRIPTION_ID constant when making the change.
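A corrected validate_subscription_id along the lines of the second agent prompt might look like the sketch below. This is illustrative, not the PR's code: the WS_MAX_SUBSCRIPTION_ID value is assumed, and the character checks simply mirror the control-character and quote rules the review says already exist.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define WS_MAX_SUBSCRIPTION_ID 64  /* illustrative; must match the id buffer size */

/* Reject IDs that would not fit the destination buffer: >= (not >) leaves
 * room for the NUL terminator, so copy_subscription_id never truncates. */
static bool validate_subscription_id(const char *id) {
    if (id == NULL) return false;
    size_t len = strlen(id);
    if (len == 0 || len >= WS_MAX_SUBSCRIPTION_ID) return false;  /* was: > */
    for (size_t i = 0; i < len; i++) {
        unsigned char c = (unsigned char)id[i];
        if (iscntrl(c) || c == '"') return false;  /* existing checks, per the review */
    }
    return true;
}
```

With the >= check, the longest accepted ID is WS_MAX_SUBSCRIPTION_ID − 1 characters, so the copy always fits with its terminator.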
🧹 Nitpick comments (1)
main/frost_coordinator.c (1)

790-865: Avoid holding the mutex while performing WS I/O.

send_ping and replay_buffered_events perform esp_websocket_client_send_* while COORDINATOR_LOCK is held, which can block and starve other operations. Consider copying the needed state under lock, then sending outside the lock.
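One way to follow this suggestion is the snapshot pattern: copy what the send needs while holding the lock, release it, then do the blocking I/O. The sketch below shows the shape only, with pthreads standing in for the coordinator's platform-aware lock; send_ws is a hypothetical stand-in for esp_websocket_client_send_text, and pending_json for whatever state the real helpers read.

```c
#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>

static pthread_mutex_t coord_lock = PTHREAD_MUTEX_INITIALIZER;

/* Shared state guarded by coord_lock; the size is illustrative. */
static char pending_json[256] = "{\"kind\":1}";
static char last_sent[256];

/* Hypothetical stand-in for esp_websocket_client_send_text(); the real
 * call can block up to its per-call timeout. */
static int send_ws(const char *json) {
    strncpy(last_sent, json, sizeof last_sent - 1);
    last_sent[sizeof last_sent - 1] = '\0';
    return 0;
}

/* Snapshot the shared state under the lock, then do the slow I/O with
 * the lock released, so poll() and WS callbacks are not starved. */
static int send_pending(void) {
    char snapshot[sizeof pending_json];
    pthread_mutex_lock(&coord_lock);
    memcpy(snapshot, pending_json, sizeof snapshot);
    pthread_mutex_unlock(&coord_lock);
    return send_ws(snapshot);  /* blocking call runs without the lock */
}
```

The trade-off is that the snapshot can be stale by the time it is sent, which is usually acceptable for pings and replay but worth noting for state-carrying messages.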


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
main/frost_coordinator.c (2)

621-655: Thread safety: frost_coordinator_subscribe modifies shared state without lock.

This function modifies g_ctx.current_subscription and g_ctx.has_subscription (lines 629-631) without holding the mutex. These fields are read by handle_ws_connected under lock (lines 236-240), creating a potential data race.

🔒 Suggested fix: protect shared state modification
 int frost_coordinator_subscribe(const char *subscription_id) {
     if (!g_initialized || !g_ctx.has_group)
         return -1;
     if (!validate_subscription_id(subscription_id)) {
         ESP_LOGE(TAG, "Invalid subscription ID");
         return -2;
     }

+    COORDINATOR_LOCK();
     copy_subscription_id(g_ctx.current_subscription, sizeof(g_ctx.current_subscription),
                          subscription_id);
     g_ctx.has_subscription = true;
+    COORDINATOR_UNLOCK();

     char pubkey_hex[65];
     ...

657-682: Thread safety: frost_coordinator_unsubscribe modifies shared state without lock.

Similar to frost_coordinator_subscribe, this function modifies g_ctx.has_subscription and g_ctx.current_subscription without holding the mutex.

🤖 Fix all issues with AI agents
In `@main/frost_coordinator.c`:
- Around line 308-322: handle_ws_error may double-increment relay->fail_count
because it increments on the max-attempts path and also calls
save_reconnect_state (which itself increments fail_count) on the other path;
change the control so fail_count is incremented exactly once by either moving
the increment into save_reconnect_state only, or by adding a flag/parameter to
save_reconnect_state (e.g., save_reconnect_state(relay, is_final)) and call it
with is_final=false on transient path and true on final path, or skip calling
save_reconnect_state when attempt_count >= WS_RECONNECT_MAX_ATTEMPTS; update
handle_ws_error and save_reconnect_state accordingly and ensure
relay->reconnect.attempt_count/WS_RECONNECT_MAX_ATTEMPTS logic remains
consistent.
- Around line 97-115: buffer_event currently mutates shared state
(g_ctx.event_buffer, buffer_head, buffer_count) without synchronization; ensure
these mutations occur under COORDINATOR_LOCK by acquiring the same lock used in
frost_coordinator_poll before calling buffer_event from publish_event (or
alternatively, add lock/unlock inside buffer_event itself). Update publish_event
to lock COORDINATOR_LOCK around the call to buffer_event (and any other accesses
to g_ctx.event_buffer) so replay_buffered_events (which runs under the same
lock) cannot race, referencing the buffer_event, publish_event,
replay_buffered_events, COORDINATOR_LOCK, and frost_coordinator_poll symbols
when making the change.
- Around line 117-126: clear_event_buffer is being invoked from
frost_coordinator_disconnect without holding the context mutex, risking races
with concurrent event-handler callbacks; modify frost_coordinator_disconnect to
acquire the same mutex used to guard g_ctx.event_buffer (e.g., lock g_ctx.mutex
or g_ctx.event_mutex) before calling clear_event_buffer and release it after the
call, ensuring all buffer mutations are protected; alternatively, if
clear_event_buffer is called from multiple call sites where locking is not
guaranteed, add locking inside clear_event_buffer itself and document that it
performs its own locking to avoid double-locks.
🧹 Nitpick comments (1)
main/frost_coordinator.c (1)

92-95: Potential integer overflow before the cap is applied.

WS_RECONNECT_BASE_MS << attempt can overflow before the comparison with WS_RECONNECT_MAX_MS if attempt is large. For a signed 32-bit base of 1000, a left shift of 22 or more overflows, which is undefined behavior in C; a shift count of 32 or more is undefined for any 32-bit operand.

Consider capping early or using multiplication with overflow check:

♻️ Suggested fix
 static uint32_t calculate_backoff(uint8_t attempt) {
+    if (attempt > 20) {
+        return WS_RECONNECT_MAX_MS;
+    }
     uint32_t delay = WS_RECONNECT_BASE_MS << attempt;
     return delay > WS_RECONNECT_MAX_MS ? WS_RECONNECT_MAX_MS : delay;
 }


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
main/frost_coordinator.c (1)

627-697: Make subscription state updates thread-safe.

frost_coordinator_subscribe/unsubscribe update g_ctx.current_subscription, g_ctx.has_subscription, and g_ctx.state without COORDINATOR_LOCK. With reconnect paths invoking subscribe from the WS event thread, this becomes a data race. Lock these state mutations (and consider snapshotting relay targets if you need to avoid holding the lock during sends).

🔒 Proposed fix
-    copy_subscription_id(g_ctx.current_subscription, sizeof(g_ctx.current_subscription),
-                         subscription_id);
-    g_ctx.has_subscription = true;
+    COORDINATOR_LOCK();
+    copy_subscription_id(g_ctx.current_subscription, sizeof(g_ctx.current_subscription),
+                         subscription_id);
+    g_ctx.has_subscription = true;
+    COORDINATOR_UNLOCK();
@@
-    g_ctx.state = COORDINATOR_STATE_ACTIVE;
+    COORDINATOR_LOCK();
+    g_ctx.state = COORDINATOR_STATE_ACTIVE;
+    COORDINATOR_UNLOCK();
@@
-    g_ctx.has_subscription = false;
-    memset(g_ctx.current_subscription, 0, sizeof(g_ctx.current_subscription));
+    COORDINATOR_LOCK();
+    g_ctx.has_subscription = false;
+    memset(g_ctx.current_subscription, 0, sizeof(g_ctx.current_subscription));
+    COORDINATOR_UNLOCK();
♻️ Duplicate comments (1)
main/frost_coordinator.c (1)

603-611: Guard disconnect state resets with the mutex.

frost_coordinator_disconnect mutates shared coordinator state without COORDINATOR_LOCK, while other paths read it under lock. This can race with WS callbacks or poll().

🔒 Proposed fix
-    clear_event_buffer();
-    g_ctx.has_subscription = false;
-    g_ctx.disconnect_time = 0;
-    g_ctx.state = COORDINATOR_STATE_IDLE;
+    COORDINATOR_LOCK();
+    clear_event_buffer_unlocked();
+    g_ctx.has_subscription = false;
+    g_ctx.disconnect_time = 0;
+    g_ctx.state = COORDINATOR_STATE_IDLE;
+    COORDINATOR_UNLOCK();
🧹 Nitpick comments (2)
main/frost_coordinator.c (2)

708-745: Consider buffering when any relay is reconnecting.

Right now buffering happens only when published == 0, so reconnecting relays miss events if at least one relay is connected. If the intent is replay for any disconnected relay, consider buffering whenever any_reconnecting is true (and dedupe as needed).
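If the intended semantics are per-relay replay, the publish path could take roughly the shape below. This is a toy sketch of the suggestion only: relay_connected, send_to_relay, and the counters are hypothetical stand-ins for the PR's relay_connection_t state and send/buffer helpers, and it omits the locking and dedup concerns discussed elsewhere in this review.

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_RELAYS 3

/* Toy stand-ins for relay_connection_t state and the send/buffer paths. */
static bool relay_connected[NUM_RELAYS] = { true, false, true };
static int events_published = 0;
static int events_buffered = 0;

static void send_to_relay(int relay, const char *json) {
    (void)relay; (void)json;
    events_published++;
}

static void buffer_event(const char *json) {
    (void)json;
    events_buffered++;
}

/* Buffer whenever ANY relay is down, not only when every publish failed,
 * so a reconnecting relay still receives the event on replay. */
static void publish_event(const char *json) {
    bool any_reconnecting = false;
    for (int i = 0; i < NUM_RELAYS; i++) {
        if (relay_connected[i]) send_to_relay(i, json);
        else any_reconnecting = true;
    }
    if (any_reconnecting) buffer_event(json);
}
```

The cost of this variant is that replay will re-send events to relays that already received them, which is why the comment suggests deduplication if this path is taken.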


816-890: Reduce time under COORDINATOR_LOCK in the poll path.

frost_coordinator_poll holds the mutex while invoking ping/reconnect/replay helpers. Consider snapshotting work under the lock, then performing I/O outside it to reduce contention.



Development

Successfully merging this pull request may close these issues.

Add WebSocket health monitoring and reconnection to coordinator
