
fix(ssh): add WebSocket ping/pong to per-session revdial connections#5947

Open
otavio wants to merge 5 commits into master from fix/revdial-session-keepalive

Conversation

@otavio (Member) commented Mar 5, 2026

Summary

Fix idle SSH session disconnections caused by intermediaries (load balancers, NAT, reverse proxies) closing silent WebSocket connections. This PR addresses #5946 with three layers of defense:

  1. Per-session WebSocket keepalive (fe11347) — adds ping/pong to per-session revdial connections, preventing intermediaries from killing idle SSH sessions
  2. Gateway Nginx timeout removal (6d52d11) — sets proxy_read_timeout 0 on all WebSocket locations, eliminating a fragile coupling between Nginx's default 60s timeout and the 30s keepalive interval
  3. WebSocket adapter race condition fixes (6d52d11, bad0285) — uses sync.Once in both Close() and Ping() to prevent concurrent callers from panicking on double-close or creating duplicate goroutines

Changes

Per-session WebSocket ping/pong (pkg/revdial, ssh/)

The V1 (revdial) per-session WebSocket connections had no server→agent keepalive. During idle SSH sessions, the only traffic was the agent's SSH keepalive (one-way, every KEEPALIVE_INTERVAL seconds). The server→agent direction was silent, causing intermediaries to detect a half-idle connection and close it.

Now Ping() is called on the server-side adapter in ConnHandler(), sending WebSocket ping frames every 30s. The agent automatically responds with pongs (gorilla/websocket handles this), creating bidirectional traffic.

Gateway Nginx proxy_read_timeout 0 (gateway/nginx/conf.d/shellhub.conf)

Added to all four WebSocket locations:

  • /ssh/connection — main V1 control connection
  • /ssh/revdial — per-session dial-back
  • /agent/connection — V2 yamux connection
  • /ws — web terminal SSH

Previously, Nginx's default 60s read timeout only worked because application-level keepalive runs at 30s — if anyone changed the keepalive interval, Nginx would start killing connections. Setting it to 0 (disabled) delegates liveness detection entirely to the application layer, which already has robust ping/pong mechanisms.
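The per-location change has roughly this shape. This is a sketch, not the literal shellhub.conf: the upstream name and the other proxy directives shown here are illustrative placeholders, and the real location blocks carry additional settings.

```nginx
# Disable Nginx's read timeout for WebSocket upgrades; liveness is
# handled by application-level ping/pong instead of a proxy timer.
location /ssh/revdial {
    proxy_pass http://ssh:8080;          # illustrative upstream
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
    proxy_read_timeout 0;                # 0 = disabled; Nginx default is 60s
}
```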

WebSocket adapter Close() race fix (pkg/wsconnadapter/wsconnadapter.go)

The old Close() had a race condition: concurrent callers (e.g., pong timeout AfterFunc firing while normal teardown runs) could panic by sending to an already-closed channel, or call conn.Close() multiple times. Now sync.Once ensures the entire cleanup — stopping the ping goroutine and closing the WebSocket connection — happens exactly once.
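The fix can be sketched with this stdlib-only model. The real adapter closes a `*websocket.Conn` (abstracted here as a `closeConn` callback) and stops its ping goroutine via a channel; the names `adapter`, `stopPing`, and `closeConn` are illustrative, not the actual wsconnadapter fields.

```go
package main

import (
	"fmt"
	"sync"
)

// adapter models wsconnadapter: closeOnce guarantees that stopping the
// ping goroutine and closing the underlying connection happen exactly
// once, even when Close is called concurrently (e.g. the pong-timeout
// AfterFunc racing normal teardown).
type adapter struct {
	closeOnce sync.Once
	closeErr  error
	stopPing  chan struct{} // closed to stop the ping goroutine
	closeConn func() error  // stands in for websocket.Conn.Close
}

func (a *adapter) Close() error {
	a.closeOnce.Do(func() {
		close(a.stopPing)          // runs at most once: no double-close panic
		a.closeErr = a.closeConn() // underlying connection closed exactly once
	})
	return a.closeErr // later calls return the stored first-close result
}

func main() {
	calls := 0
	a := &adapter{
		stopPing:  make(chan struct{}),
		closeConn: func() error { calls++; return nil },
	}
	var wg sync.WaitGroup
	for i := 0; i < 10; i++ { // concurrent closers must not panic
		wg.Add(1)
		go func() { defer wg.Done(); _ = a.Close() }()
	}
	wg.Wait()
	fmt.Println("underlying close calls:", calls) // exactly 1
}
```

Storing the first call's error (rather than returning nil on repeats) matches the later refinement in this PR's commits, so callers that check the result still see the real close error.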

WebSocket adapter Ping() init race fix (pkg/wsconnadapter/wsconnadapter.go)

The old Ping() guarded against re-initialization with a nil check (if a.pongCh != nil), which is racy under concurrent calls — both callers see nil, both create channels and goroutines, leaking the first set. Now sync.Once guarantees initialization happens exactly once.
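The guard change can be sketched the same way (stdlib-only model; the `inits` counter is added purely to make the exactly-once behavior observable, and in the real code the `Do` body also starts the ping goroutine):

```go
package main

import (
	"fmt"
	"sync"
)

// adapter models the Ping() initialization guard. The old code checked
// `if a.pongCh != nil` before initializing, which two concurrent callers
// can both pass, each creating its own channel and goroutine and leaking
// the first set. pingOnce makes the initialization atomic instead.
type adapter struct {
	pingOnce sync.Once
	pongCh   chan struct{}
	inits    int // counts how many times initialization actually ran
}

func (a *adapter) Ping() {
	a.pingOnce.Do(func() {
		a.pongCh = make(chan struct{}, 1)
		a.inits++
		// the real code also starts the ping goroutine here
	})
}

func main() {
	a := &adapter{}
	var wg sync.WaitGroup
	for i := 0; i < 10; i++ { // concurrent Ping calls race the old nil check
		wg.Add(1)
		go func() { defer wg.Done(); a.Ping() }()
	}
	wg.Wait()
	fmt.Println("initializations:", a.inits) // exactly 1
}
```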

Cross-reference analysis (from #5946)

Two community members tested the initial fix (fe11347):

|                   | @zoopp                 | @ltan10                 |
|-------------------|------------------------|-------------------------|
| What drops        | Per-session revdial WS | Main control connection |
| Fix addresses it? | Yes                    | No (different issue)    |
| Intermediary      | GCP Load Balancer      | External reverse proxy  |

@ltan10's issue is the main control connection dying (device goes offline), which is a separate problem — the main connection already has four independent keepalive sources at 30s intervals. Their 20–27 minute disconnect pattern points to external proxy behavior (max connection duration, connection pool limits), not a ShellHub bug. We've asked for their proxy configuration details in #5946.

Test plan

  • go build — all services compile
  • Integration tests pass locally (postgres)
  • Deploy behind a load balancer, leave SSH session idle 10+ minutes
  • Verify session stays alive and no "close connection due pong timeout" in logs
  • @zoopp / @ltan10: please re-test with this branch

Review

@gustavosbarreto — these changes touch the WebSocket adapter used by all agent↔server connections (V1 revdial, V2 yamux, web terminal). The sync.Once changes make Close() idempotent (subsequent calls return nil instead of re-closing the connection) and Ping() initialization atomic. The Nginx timeout change affects all WebSocket locations in the gateway. Would appreciate your review on whether these behavioral changes could have deeper impact on the platform — particularly the idempotent Close() semantics and whether any code path depends on calling conn.Close() multiple times.

Fixes #5946

@gustavosbarreto (Member) commented:

I wonder if the Ping() would be better placed in grabConn() (agent-side) instead of ConnHandler() (server-side). Adding a goroutine per active session on the server doesn't scale. On the agent side each process only manages its own sessions, so the cost is distributed.

gorilla/websocket already responds to pings with pongs during NextReader(). No extra goroutine on the server, same bidirectional traffic for the LB.

```go
// pkg/revdial/revdial.go — grabConn(), line 442
c := wsconnadapter.New(wsConn)
c.Ping()

select {
case ln.connc <- c:
case <-ln.donec:
}
```

@otavio force-pushed the fix/revdial-session-keepalive branch from 7031a95 to fe11347 on March 5, 2026 18:08
@otavio (Member, Author) commented Mar 5, 2026

@gustavosbarreto Good call — I've moved Ping() to grabConn() on the agent side as you suggested. This distributes the goroutine cost across agents instead of concentrating it on the server.

This fix works for the immediate issue, but we'll need to think more deeply about how we want to implement this long-term. Given that we've never had agent integration in other languages, we need to find a way to enable this back-and-forth communication that accounts for all clients. Some of our customers rely on a large number of devices, and in those cases this extra WebSocket ping/pong transmission on every active session could impact performance.

@otavio (Member, Author) commented Mar 5, 2026

Sharing some research on cloud load balancer behavior that reinforces why this keepalive fix is necessary.

All major cloud vendors enforce idle timeouts on load-balanced TCP/WebSocket connections, and some require application-level data (not just TCP keepalives) to reset the timer:

| Vendor | LB Type     | Default Idle Timeout | TCP Keepalive Resets Timer? | App-Level Data Required? |
|--------|-------------|----------------------|-----------------------------|--------------------------|
| AWS    | NLB         | 350s                 | ✅ Yes                      | No                       |
| AWS    | ALB         | 60s                  | ❌ No                       | Yes                      |
| Azure  | Standard LB | 4 min                | ✅ Yes                      | Recommended              |
| GCP    | TCP Proxy   | 30s                  | Conflicting docs            | Yes                      |

AWS ALB and GCP TCP Proxy both require actual data in the payload to keep the connection alive — TCP-level keepalive packets aren't enough. WebSocket ping/pong frames count as application-level data, which is why our fix in #5947 works.

Ref: GCE discussion on TCP proxy LB timeouts

@otavio (Member, Author) commented Mar 6, 2026

@gustavosbarreto — requesting your review on this one. The changes go beyond the original per-session keepalive fix and touch shared infrastructure, so I want to make sure nothing has unintended side effects.

What to pay attention to

1. wsconnadapter.Close() is now idempotent

The old Close() called conn.Close() on the underlying gorilla/websocket connection every time it was invoked. The new version uses sync.Once — only the first call actually closes the connection; subsequent calls return nil.

This matters because several code paths can trigger Close() concurrently or multiple times:

  • The pong timeout time.AfterFunc fires and calls Close()
  • The revdial Dialer.close() calls conn.Close()
  • yamux session teardown calls Close() on the underlying connection

If any code path currently depends on getting an error back from a second Close() call (to detect "already closed"), it will now silently get nil. I reviewed the callers and didn't find such a case, but you know the platform better.

2. wsconnadapter.Ping() initialization is now atomic

The old nil-check guard (if a.pongCh != nil) was racy under concurrent calls. The new sync.Once ensures channels and the ping goroutine are created exactly once. Current callers only call Ping() once per adapter, so this is defensive — but worth verifying no future or cloud-specific code path calls it differently.

3. Gateway Nginx proxy_read_timeout 0 on all WebSocket locations

This disables Nginx's read timeout for /ssh/connection, /ssh/revdial, /agent/connection, and /ws. Application-level keepalive (30s ping/pong + JSON keep-alive) handles liveness instead.

The concern here is zombie connections: if the application-level keepalive ever stops working (bug, deadlock), Nginx won't clean up the connection — it will hang indefinitely. Previously, the 60s default acted as a safety net (albeit a fragile one). Worth considering whether we want a large finite value (e.g., proxy_read_timeout 3600) instead of 0 as a last-resort safety net.

otavio added 5 commits March 9, 2026 21:06
In V1 (revdial) transport, the main control connection has bidirectional
keepalive (ping/pong + JSON keep-alive), but per-session WebSocket
connections created via dial-back have none. During idle SSH sessions,
the only traffic on the per-session WebSocket is the agent's SSH keepalive
(agent→server, one-way every 30s). The server→agent direction is
completely silent, causing intermediaries (load balancers, NAT, firewalls)
to detect a half-idle connection and close it.

Call Ping() on the agent-side wsconnadapter in grabConn() to start
sending WebSocket ping frames every 30 seconds. gorilla/websocket
automatically responds to pings with pong frames during NextReader(),
creating bidirectional traffic that keeps the connection alive through
all intermediaries. Placing this on the agent side distributes the
goroutine cost across agents rather than concentrating it on the server.

Fixes: #5946
…close race

- Add proxy_read_timeout 0 to all WebSocket locations in the gateway
  Nginx config (/ssh/connection, /ssh/revdial, /agent/connection, /ws).
  The default 60s timeout only worked because keepalive runs at 30s — a
  fragile coupling. Application-level ping/pong handles liveness instead.
- Fix race condition in wsconnadapter Close() where concurrent callers
  (e.g. pong timeout AfterFunc and normal teardown) could panic on
  send-to-closed-channel or double-close the WebSocket connection. Use
  sync.Once to guarantee both channel and connection cleanup happen
  exactly once.
The previous nil-check guard on pongCh was racy: two concurrent callers
could both see nil and create duplicate channels and goroutines, leaking
the first set. Use sync.Once to guarantee initialization happens exactly
once, consistent with the Close() fix in the previous commit.
The terminal_window_size_change test had a 2s per-attempt timeout
and 5s overall deadline for reading stty output over a PTY. Under
load, terminal I/O can exceed these tight limits, causing ~40%
flakiness locally. Increase to 10s per-attempt and 30s overall.

Verified with 20 consecutive runs (0 failures).
The sync.Once Close() previously returned nil on repeated calls.
Store the error from the first close so callers always receive it.
@otavio force-pushed the fix/revdial-session-keepalive branch from 36a5b6e to 2eb7d2d on March 10, 2026 00:07


Development

Successfully merging this pull request may close these issues.

"Idle" SSH connections are forcefully disconnected
