fix(ssh): add WebSocket ping/pong to per-session revdial connections #5947
Conversation
I wonder if gorilla/websocket already responds to pings with pongs during `NextReader()`.

In `pkg/revdial/revdial.go`, `grabConn()`, line 442:

```go
c := wsconnadapter.New(wsConn)
c.Ping()
select {
case ln.connc <- c:
case <-ln.donec:
}
```
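For context, a minimal sketch of why that would suffice: gorilla/websocket only processes control frames while a read is in flight, so the automatic pong depends on an active reader loop (this is not the adapter's actual code).

```go
// Assuming wsConn is the *websocket.Conn above: gorilla's default ping
// handler replies with a pong, but the ping frame is only seen (and the
// handler only runs) while a read such as NextReader is in progress.
go func() {
	for {
		if _, _, err := wsConn.NextReader(); err != nil {
			return // connection closed or errored
		}
		// Data messages are discarded here; ping/pong/close control
		// frames were already handled inside NextReader.
	}
}()
```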
Force-pushed from 7031a95 to fe11347
@gustavosbarreto Good call. I've moved the `Ping()` call to the server side. This fix works for the immediate issue, but we'll need to think more deeply about how we want to implement this long-term. Given that we've never had agent integrations in other languages, we need to find a way to enable this back-and-forth communication that accounts for all clients. Some of our customers rely on a large number of devices, and in those cases this extra WebSocket ping/pong transmission on every active session could impact performance.
Sharing some research on cloud load balancer behavior that reinforces why this keepalive fix is necessary. All major cloud vendors enforce idle timeouts on load-balanced TCP/WebSocket connections, and some require application-level data (not just TCP keepalives) to reset the timer:
AWS ALB and GCP TCP Proxy both require actual data in the payload to keep the connection alive; TCP-level keepalive packets aren't enough. WebSocket ping/pong frames count as application-level data, which is why our fix in #5947 works.
@gustavosbarreto, requesting your review on this one. The changes go beyond the original per-session keepalive fix and touch shared infrastructure, so I want to make sure nothing has unintended side effects.

What to pay attention to:

1. The old `Close()` could run its cleanup more than once; the new `sync.Once` version makes it idempotent. This matters because several code paths can trigger `Close()` concurrently (for example, the pong-timeout `AfterFunc` firing while normal teardown runs). If any code path currently depends on getting an error back from a second `Close()` call, its behavior changes.
2. The old nil-check guard (`if a.pongCh != nil`) in `Ping()` was racy under concurrent calls; it is now a `sync.Once` initialization.
3. Gateway Nginx `proxy_read_timeout 0`. This disables Nginx's read timeout for the WebSocket locations. The concern here is zombie connections: if the application-level keepalive ever stops working (bug, deadlock), Nginx won't clean up the connection; it will hang indefinitely. Previously, the 60s default acted as a safety net (albeit a fragile one). Worth considering whether we want a large finite value instead.
In V1 (revdial) transport, the main control connection has bidirectional keepalive (ping/pong + JSON keep-alive), but per-session WebSocket connections created via dial-back have none. During idle SSH sessions, the only traffic on the per-session WebSocket is the agent's SSH keepalive (agent→server, one-way every 30s). The server→agent direction is completely silent, causing intermediaries (load balancers, NAT, firewalls) to detect a half-idle connection and close it.

Call Ping() on the agent-side wsconnadapter in grabConn() to start sending WebSocket ping frames every 30 seconds. gorilla/websocket automatically responds to pings with pong frames during NextReader(), creating bidirectional traffic that keeps the connection alive through all intermediaries. Placing this on the agent side distributes the goroutine cost across agents rather than concentrating it on the server.

Fixes: #5946
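For illustration, a minimal sketch of what such a `Ping()` loop could look like, using gorilla/websocket's `WriteControl`; the struct layout, field names (`conn`, `done`), and the 10s write deadline are assumptions, not the real wsconnadapter code.

```go
// Minimal sketch (not the real wsconnadapter): a Ping() that sends a
// WebSocket ping frame every 30s until done is closed. The peer's
// gorilla/websocket default ping handler replies with a pong.
package wsconnadapter

import (
	"time"

	"github.com/gorilla/websocket"
)

type adapter struct {
	conn *websocket.Conn
	done chan struct{}
}

func (a *adapter) Ping() {
	go func() {
		ticker := time.NewTicker(30 * time.Second)
		defer ticker.Stop()
		for {
			select {
			case <-ticker.C:
				deadline := time.Now().Add(10 * time.Second) // assumed write deadline
				if err := a.conn.WriteControl(websocket.PingMessage, nil, deadline); err != nil {
					return // write failed; the connection is likely gone
				}
			case <-a.done:
				return
			}
		}
	}()
}
```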
…close race

- Add proxy_read_timeout 0 to all WebSocket locations in the gateway Nginx config (/ssh/connection, /ssh/revdial, /agent/connection, /ws). The default 60s timeout only worked because keepalive is 30s, which is fragile coupling. Application-level ping/pong handles liveness instead.
- Fix race condition in wsconnadapter Close() where concurrent callers (e.g. pong timeout AfterFunc and normal teardown) could panic on send-to-closed-channel or double-close the WebSocket connection. Use sync.Once to guarantee both channel and connection cleanup happen exactly once.
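Extending the sketch above, the sync.Once shape this commit describes might look like the following; the `once` field and the bool pong channel are illustrative, not the adapter's real layout.

```go
type adapter struct {
	once   sync.Once // add "sync" to the imports of the sketch above
	pongCh chan bool
	conn   *websocket.Conn
	done   chan struct{}
}

// Close is safe for concurrent callers: the pong-timeout AfterFunc and
// normal teardown can both call it, yet the cleanup body runs only once.
func (a *adapter) Close() error {
	var err error
	a.once.Do(func() {
		close(a.done) // stop the ping goroutine exactly once
		if a.pongCh != nil {
			close(a.pongCh) // closed exactly once, never twice
		}
		err = a.conn.Close() // close the underlying WebSocket exactly once
	})
	return err
}
```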
The previous nil-check guard on pongCh was racy: two concurrent callers could both see nil and create duplicate channels and goroutines, leaking the first set. Use sync.Once to guarantee initialization happens exactly once, consistent with the Close() fix in the previous commit.
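Continuing the same illustrative sketch, the guarded initialization could be (`pingOnce` and `pingLoop` are assumed names):

```go
// Ping is now idempotent: sync.Once guarantees exactly one pong channel
// and one ping goroutine, no matter how many callers race to start it.
func (a *adapter) Ping() {
	a.pingOnce.Do(func() { // pingOnce: another sync.Once field, illustrative
		a.pongCh = make(chan bool)
		go a.pingLoop() // hypothetical helper wrapping the 30s ticker loop
	})
}
```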
The terminal_window_size_change test had a 2s per-attempt timeout and 5s overall deadline for reading stty output over a PTY. Under load, terminal I/O can exceed these tight limits, causing ~40% flakiness locally. Increase to 10s per-attempt and 30s overall. Verified with 20 consecutive runs (0 failures).
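For reference, a sketch of the retry shape with the new limits; `waitForOutput` and `attempt` are illustrative names, not the test's actual helpers.

```go
import (
	"context"
	"errors"
	"time"
)

// waitForOutput retries attempt with a 10s per-attempt timeout under a
// 30s overall deadline, mirroring the new limits in the test.
func waitForOutput(attempt func(context.Context) (string, error)) (string, error) {
	deadline := time.Now().Add(30 * time.Second)
	for time.Now().Before(deadline) {
		ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
		out, err := attempt(ctx)
		cancel()
		if err == nil {
			return out, nil
		}
	}
	return "", errors.New("timed out waiting for terminal output")
}
```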
The sync.Once Close() previously returned nil on repeated calls. Store the error from the first close so callers always receive it.
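In the same sketch, storing the error might look like this (`closeErr` is an assumed field name):

```go
// Close records the result of the first, real close in a closeErr field
// so every later caller observes it instead of nil.
func (a *adapter) Close() error {
	a.once.Do(func() {
		close(a.done)
		a.closeErr = a.conn.Close() // remembered for repeated calls
	})
	return a.closeErr
}
```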
Force-pushed from 36a5b6e to 2eb7d2d
Summary
Fix idle SSH session disconnections caused by intermediaries (load balancers, NAT, reverse proxies) closing silent WebSocket connections. This PR addresses #5946 with three layers of defense:
- fe11347: adds ping/pong to per-session revdial connections, preventing intermediaries from killing idle SSH sessions
- 6d52d11: sets `proxy_read_timeout 0` on all WebSocket locations, eliminating a fragile coupling between Nginx's default 60s timeout and the 30s keepalive interval
- 6d52d11, bad0285: uses `sync.Once` in both `Close()` and `Ping()` to prevent concurrent callers from panicking on double-close or creating duplicate goroutines

Changes
Per-session WebSocket ping/pong (`pkg/revdial`, `ssh/`)

The V1 (revdial) per-session WebSocket connections had no server→agent keepalive. During idle SSH sessions, the only traffic was the agent's SSH keepalive (one-way, every `KEEPALIVE_INTERVAL` seconds). The server→agent direction was silent, causing intermediaries to detect a half-idle connection and close it.

Now `Ping()` is called on the server-side adapter in `ConnHandler()`, sending WebSocket ping frames every 30s. The agent automatically responds with pongs (gorilla/websocket handles this), creating bidirectional traffic.
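For illustration, the server-side hookup likely reduces to something of this shape (a sketch, not the actual `ConnHandler()` code):

```go
// Wrap the upgraded WebSocket and start the 30s ping loop; the agent's
// gorilla/websocket stack answers each ping with a pong automatically.
conn := wsconnadapter.New(wsConn)
conn.Ping()
```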
Gateway Nginx `proxy_read_timeout 0` (`gateway/nginx/conf.d/shellhub.conf`)

Added to all four WebSocket locations:

- `/ssh/connection`: main V1 control connection
- `/ssh/revdial`: per-session dial-back
- `/agent/connection`: V2 yamux connection
- `/ws`: web terminal SSH

Previously, Nginx's default 60s read timeout only worked because application-level keepalive runs at 30s; if anyone changed the keepalive interval, Nginx would start killing connections. Setting it to `0` (disabled) delegates liveness detection entirely to the application layer, which already has robust ping/pong mechanisms.
WebSocket adapter `Close()` race fix (`pkg/wsconnadapter/wsconnadapter.go`)

The old `Close()` had a race condition: concurrent callers (e.g., pong timeout `AfterFunc` firing while normal teardown runs) could panic by sending to an already-closed channel, or call `conn.Close()` multiple times. Now `sync.Once` ensures the entire cleanup, stopping the ping goroutine and closing the WebSocket connection, happens exactly once.
WebSocket adapter `Ping()` init race fix (`pkg/wsconnadapter/wsconnadapter.go`)

The old `Ping()` guarded against re-initialization with a nil check (`if a.pongCh != nil`), which is racy under concurrent calls: both callers see nil, both create channels and goroutines, leaking the first set. Now `sync.Once` guarantees initialization happens exactly once.

Cross-reference analysis (from #5946)
Two community members tested the initial fix (fe11347). @ltan10's issue is the main control connection dying (device goes offline), which is a separate problem: the main connection already has four independent keepalive sources at 30s intervals. Their 20–27 minute disconnect pattern points to external proxy behavior (max connection duration, connection pool limits), not a ShellHub bug. We've asked for their proxy configuration details in #5946.
Test plan

- `go build`: all services compile
- No `close connection due pong timeout` messages in the logs

Review
@gustavosbarreto, these changes touch the WebSocket adapter used by all agent↔server connections (V1 revdial, V2 yamux, web terminal). The `sync.Once` changes make `Close()` idempotent (subsequent calls return the first call's error instead of re-closing the connection) and `Ping()` initialization atomic. The Nginx timeout change affects all WebSocket locations in the gateway. Would appreciate your review on whether these behavioral changes could have deeper impact on the platform, particularly the idempotent `Close()` semantics and whether any code path depends on calling `conn.Close()` multiple times.

Fixes #5946