Add WebSocket health monitoring and reconnection #133
wksantiago wants to merge 5 commits into `main` from
Conversation
Walkthrough

Adds WebSocket health monitoring (ping/pong), per-relay reconnect/backoff state, in-memory event buffering with replay on reconnection, subscription and publish validation, per-relay success/fail scoring, and a public status API exposing aggregated coordinator and relay state.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant App as Client/App
    participant Coord as Coordinator
    participant Buffer as EventBuffer
    participant WS as Relay (WebSocket)
    rect rgba(200,150,255,0.5)
        Note over App,WS: Normal publish flow
        App->>Coord: publish_event(json)
        Coord->>WS: send event (per-relay, timeout)
        WS-->>Coord: ACK
        Coord-->>App: success
    end
    rect rgba(255,150,150,0.5)
        Note over WS,Coord: Connection loss & buffering
        WS--xCoord: disconnect / missed pong
        Coord->>Coord: mark RECONNECTING, schedule backoff
        App->>Coord: publish_event(json)
        Coord->>Buffer: buffer event (circular buffer)
    end
    rect rgba(150,200,255,0.5)
        Note over Coord,WS: Reconnect & health check
        Coord->>WS: reconnect attempt (backoff)
        WS-->>Coord: connected
        Coord->>WS: ping
        WS-->>Coord: pong
        Coord->>Coord: mark CONNECTED
    end
    rect rgba(150,255,200,0.5)
        Note over Buffer,WS: Replay buffered events
        Coord->>Buffer: replay_buffered_events()
        Buffer->>WS: send buffered event 1..N
        WS-->>Coord: ACKs
        Coord->>Buffer: clear buffer
    end
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related issues
🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
main/frost_coordinator.c (1)
678-720: Guard event buffer mutations with the mutex.
`buffer_event` is called without `COORDINATOR_LOCK`, but `replay_buffered_events`/`clear_event_buffer` mutate the same buffer under lock. This can race and corrupt/free the buffer.

🔒 Suggested fix

```diff
 if (published == 0 && any_reconnecting) {
+    COORDINATOR_LOCK();
     buffer_event(msg);
+    COORDINATOR_UNLOCK();
     ESP_LOGI(TAG, "Buffered event during reconnection");
 }
```
🤖 Fix all issues with AI agents
In `@main/frost_coordinator.c`:
- Around line 494-513: The code currently truncates overlong relay URLs in
frost_coordinator_add_relay by using strncpy, which can silently connect to the
wrong endpoint; update frost_coordinator_add_relay to first check the length of
url against RELAY_URL_LEN (e.g. if (strlen(url) >= RELAY_URL_LEN) return an
error like -4), log an error via ESP_LOGE mentioning the URL is too long, and
only then copy the string safely into relay->url (or use a bounded copy after
the length check); reference validate_websocket_url,
frost_coordinator_add_relay, relay_connection_t, and RELAY_URL_LEN so the length
check and failure path are added before the current memset/strncpy block.
- Around line 601-613: The validate_subscription_id function allows IDs of
length WS_MAX_SUBSCRIPTION_ID due to an off-by-one check (if (len >
WS_MAX_SUBSCRIPTION_ID)), which then get truncated by copy_subscription_id;
change the length check to reject IDs with length >= WS_MAX_SUBSCRIPTION_ID so
valid IDs cannot exceed the buffer (update the condition in
validate_subscription_id to use >=), and keep the existing checks for control
chars and quotes; reference validate_subscription_id and copy_subscription_id
and the WS_MAX_SUBSCRIPTION_ID constant when making the change.
🧹 Nitpick comments (1)
main/frost_coordinator.c (1)
790-865: Avoid holding the mutex while performing WS I/O.
`send_ping` and `replay_buffered_events` perform `esp_websocket_client_send_*` while `COORDINATOR_LOCK` is held, which can block and starve other operations. Consider copying the needed state under lock, then sending outside the lock.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
main/frost_coordinator.c (2)
621-655: Thread safety: `frost_coordinator_subscribe` modifies shared state without lock.

This function modifies `g_ctx.current_subscription` and `g_ctx.has_subscription` (lines 629-631) without holding the mutex. These fields are read by `handle_ws_connected` under lock (lines 236-240), creating a potential data race.

🔒 Suggested fix: protect shared state modification

```diff
 int frost_coordinator_subscribe(const char *subscription_id) {
     if (!g_initialized || !g_ctx.has_group) return -1;
     if (!validate_subscription_id(subscription_id)) {
         ESP_LOGE(TAG, "Invalid subscription ID");
         return -2;
     }
+    COORDINATOR_LOCK();
     copy_subscription_id(g_ctx.current_subscription,
                          sizeof(g_ctx.current_subscription), subscription_id);
     g_ctx.has_subscription = true;
+    COORDINATOR_UNLOCK();
     char pubkey_hex[65];
     ...
```
657-682: Thread safety: `frost_coordinator_unsubscribe` modifies shared state without lock.

Similar to `frost_coordinator_subscribe`, this function modifies `g_ctx.has_subscription` and `g_ctx.current_subscription` without holding the mutex.
🤖 Fix all issues with AI agents
In `@main/frost_coordinator.c`:
- Around line 308-322: handle_ws_error may double-increment relay->fail_count
because it increments on the max-attempts path and also calls
save_reconnect_state (which itself increments fail_count) on the other path;
change the control so fail_count is incremented exactly once by either moving
the increment into save_reconnect_state only, or by adding a flag/parameter to
save_reconnect_state (e.g., save_reconnect_state(relay, is_final)) and call it
with is_final=false on transient path and true on final path, or skip calling
save_reconnect_state when attempt_count >= WS_RECONNECT_MAX_ATTEMPTS; update
handle_ws_error and save_reconnect_state accordingly and ensure
relay->reconnect.attempt_count/WS_RECONNECT_MAX_ATTEMPTS logic remains
consistent.
- Around line 97-115: buffer_event currently mutates shared state
(g_ctx.event_buffer, buffer_head, buffer_count) without synchronization; ensure
these mutations occur under COORDINATOR_LOCK by acquiring the same lock used in
frost_coordinator_poll before calling buffer_event from publish_event (or
alternatively, add lock/unlock inside buffer_event itself). Update publish_event
to lock COORDINATOR_LOCK around the call to buffer_event (and any other accesses
to g_ctx.event_buffer) so replay_buffered_events (which runs under the same
lock) cannot race, referencing the buffer_event, publish_event,
replay_buffered_events, COORDINATOR_LOCK, and frost_coordinator_poll symbols
when making the change.
- Around line 117-126: clear_event_buffer is being invoked from
frost_coordinator_disconnect without holding the context mutex, risking races
with concurrent event-handler callbacks; modify frost_coordinator_disconnect to
acquire the same mutex used to guard g_ctx.event_buffer (e.g., lock g_ctx.mutex
or g_ctx.event_mutex) before calling clear_event_buffer and release it after the
call, ensuring all buffer mutations are protected; alternatively, if
clear_event_buffer is called from multiple call sites where locking is not
guaranteed, add locking inside clear_event_buffer itself and document that it
performs its own locking to avoid double-locks.
🧹 Nitpick comments (1)
main/frost_coordinator.c (1)
92-95: Potential integer overflow before the cap is applied.

`WS_RECONNECT_BASE_MS << attempt` can overflow before the comparison with `WS_RECONNECT_MAX_MS` if `attempt` is large. For example, shifting a 32-bit value left by 22+ positions causes undefined behavior in C.

Consider capping early or using multiplication with overflow check:
♻️ Suggested fix
```diff
 static uint32_t calculate_backoff(uint8_t attempt) {
+    if (attempt > 20) {
+        return WS_RECONNECT_MAX_MS;
+    }
     uint32_t delay = WS_RECONNECT_BASE_MS << attempt;
     return delay > WS_RECONNECT_MAX_MS ? WS_RECONNECT_MAX_MS : delay;
 }
```
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
main/frost_coordinator.c (1)
627-697: Make subscription state updates thread-safe.

`frost_coordinator_subscribe`/`unsubscribe` update `g_ctx.current_subscription`, `g_ctx.has_subscription`, and `g_ctx.state` without `COORDINATOR_LOCK`. With reconnect paths invoking subscribe from the WS event thread, this becomes a data race. Lock these state mutations (and consider snapshotting relay targets if you need to avoid holding the lock during sends).

🔒 Proposed fix

```diff
-    copy_subscription_id(g_ctx.current_subscription,
-                         sizeof(g_ctx.current_subscription), subscription_id);
-    g_ctx.has_subscription = true;
+    COORDINATOR_LOCK();
+    copy_subscription_id(g_ctx.current_subscription,
+                         sizeof(g_ctx.current_subscription), subscription_id);
+    g_ctx.has_subscription = true;
+    COORDINATOR_UNLOCK();
@@
-    g_ctx.state = COORDINATOR_STATE_ACTIVE;
+    COORDINATOR_LOCK();
+    g_ctx.state = COORDINATOR_STATE_ACTIVE;
+    COORDINATOR_UNLOCK();
@@
-    g_ctx.has_subscription = false;
-    memset(g_ctx.current_subscription, 0, sizeof(g_ctx.current_subscription));
+    COORDINATOR_LOCK();
+    g_ctx.has_subscription = false;
+    memset(g_ctx.current_subscription, 0, sizeof(g_ctx.current_subscription));
+    COORDINATOR_UNLOCK();
```
♻️ Duplicate comments (1)
main/frost_coordinator.c (1)
603-611: Guard disconnect state resets with the mutex.

`frost_coordinator_disconnect` mutates shared coordinator state without `COORDINATOR_LOCK`, while other paths read it under lock. This can race with WS callbacks or `poll()`.

🔒 Proposed fix

```diff
-    clear_event_buffer();
-    g_ctx.has_subscription = false;
-    g_ctx.disconnect_time = 0;
-    g_ctx.state = COORDINATOR_STATE_IDLE;
+    COORDINATOR_LOCK();
+    clear_event_buffer_unlocked();
+    g_ctx.has_subscription = false;
+    g_ctx.disconnect_time = 0;
+    g_ctx.state = COORDINATOR_STATE_IDLE;
+    COORDINATOR_UNLOCK();
```
🧹 Nitpick comments (2)
main/frost_coordinator.c (2)
708-745: Consider buffering when any relay is reconnecting.

Right now buffering happens only when `published == 0`, so reconnecting relays miss events if at least one relay is connected. If the intent is replay for any disconnected relay, consider buffering whenever `any_reconnecting` is true (and dedupe as needed).
816-890: Reduce time under COORDINATOR_LOCK in the poll path.

`frost_coordinator_poll` holds the mutex while invoking ping/reconnect/replay helpers. Consider snapshotting work under the lock, then performing I/O outside it to reduce contention.
Summary
Test plan
Summary by CodeRabbit
New Features
Improvements