
fix(ssh): add WebSocket ping/pong to per-session revdial connections#5947

Open
otavio wants to merge 5 commits into master from fix/revdial-session-keepalive

Conversation

@otavio (Member) commented Mar 5, 2026

Summary

Fix idle SSH session disconnections caused by intermediaries (load balancers, NAT, reverse proxies) closing silent WebSocket connections. This PR addresses #5946 with three layers of defense:

  1. Per-session WebSocket keepalive (fe11347) — adds ping/pong to per-session revdial connections, preventing intermediaries from killing idle SSH sessions
  2. Gateway Nginx timeout removal (6d52d11) — sets proxy_read_timeout 0 on all WebSocket locations, eliminating a fragile coupling between Nginx's default 60s timeout and the 30s keepalive interval
  3. WebSocket adapter race condition fixes (6d52d11, bad0285) — uses sync.Once in both Close() and Ping() to prevent concurrent callers from panicking on double-close or creating duplicate goroutines

Changes

Per-session WebSocket ping/pong (pkg/revdial, ssh/)

The V1 (revdial) per-session WebSocket connections had no server→agent keepalive. During idle SSH sessions, the only traffic was the agent's SSH keepalive (one-way, every KEEPALIVE_INTERVAL seconds). The server→agent direction was silent, causing intermediaries to detect a half-idle connection and close it.

Now Ping() is called on the server-side adapter in ConnHandler(), sending WebSocket ping frames every 30s. The agent automatically responds with pongs (gorilla/websocket handles this), creating bidirectional traffic.

Gateway Nginx proxy_read_timeout 0 (gateway/nginx/conf.d/shellhub.conf)

Added to all four WebSocket locations:

  • /ssh/connection — main V1 control connection
  • /ssh/revdial — per-session dial-back
  • /agent/connection — V2 yamux connection
  • /ws — web terminal SSH

Previously, Nginx's default 60s read timeout only worked because application-level keepalive runs at 30s — if anyone changed the keepalive interval, Nginx would start killing connections. Setting it to 0 (disabled) delegates liveness detection entirely to the application layer, which already has robust ping/pong mechanisms.
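The per-location change has roughly this shape. This is a sketch, not the literal shellhub.conf: the upstream name and the other proxy directives shown here are illustrative placeholders, and the real location blocks carry additional settings.

```nginx
# Disable Nginx's read timeout for WebSocket upgrades; liveness is
# handled by application-level ping/pong instead of a proxy timer.
location /ssh/revdial {
    proxy_pass http://ssh:8080;          # illustrative upstream
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
    proxy_read_timeout 0;                # 0 = disabled; Nginx default is 60s
}
```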

WebSocket adapter Close() race fix (pkg/wsconnadapter/wsconnadapter.go)

The old Close() had a race condition: concurrent callers (e.g., pong timeout AfterFunc firing while normal teardown runs) could panic by sending to an already-closed channel, or call conn.Close() multiple times. Now sync.Once ensures the entire cleanup — stopping the ping goroutine and closing the WebSocket connection — happens exactly once.
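The fix can be sketched with this stdlib-only model. The real adapter closes a `*websocket.Conn` (abstracted here as a `closeConn` callback) and stops its ping goroutine via a channel; the names `adapter`, `stopPing`, and `closeConn` are illustrative, not the actual wsconnadapter fields.

```go
package main

import (
	"fmt"
	"sync"
)

// adapter models wsconnadapter: closeOnce guarantees that stopping the
// ping goroutine and closing the underlying connection happen exactly
// once, even when Close is called concurrently (e.g. the pong-timeout
// AfterFunc racing normal teardown).
type adapter struct {
	closeOnce sync.Once
	closeErr  error
	stopPing  chan struct{} // closed to stop the ping goroutine
	closeConn func() error  // stands in for websocket.Conn.Close
}

func (a *adapter) Close() error {
	a.closeOnce.Do(func() {
		close(a.stopPing)          // runs at most once: no double-close panic
		a.closeErr = a.closeConn() // underlying connection closed exactly once
	})
	return a.closeErr // later calls return the stored first-close result
}

func main() {
	calls := 0
	a := &adapter{
		stopPing:  make(chan struct{}),
		closeConn: func() error { calls++; return nil },
	}
	var wg sync.WaitGroup
	for i := 0; i < 10; i++ { // concurrent closers must not panic
		wg.Add(1)
		go func() { defer wg.Done(); _ = a.Close() }()
	}
	wg.Wait()
	fmt.Println("underlying close calls:", calls) // exactly 1
}
```

Storing the first call's error (rather than returning nil on repeats) matches the later refinement in this PR's commits, so callers that check the result still see the real close error.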

WebSocket adapter Ping() init race fix (pkg/wsconnadapter/wsconnadapter.go)

The old Ping() guarded against re-initialization with a nil check (if a.pongCh != nil), which is racy under concurrent calls — both callers see nil, both create channels and goroutines, leaking the first set. Now sync.Once guarantees initialization happens exactly once.
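The guard change can be sketched the same way (stdlib-only model; the `inits` counter is added purely to make the exactly-once behavior observable, and in the real code the `Do` body also starts the ping goroutine):

```go
package main

import (
	"fmt"
	"sync"
)

// adapter models the Ping() initialization guard. The old code checked
// `if a.pongCh != nil` before initializing, which two concurrent callers
// can both pass, each creating its own channel and goroutine and leaking
// the first set. pingOnce makes the initialization atomic instead.
type adapter struct {
	pingOnce sync.Once
	pongCh   chan struct{}
	inits    int // counts how many times initialization actually ran
}

func (a *adapter) Ping() {
	a.pingOnce.Do(func() {
		a.pongCh = make(chan struct{}, 1)
		a.inits++
		// the real code also starts the ping goroutine here
	})
}

func main() {
	a := &adapter{}
	var wg sync.WaitGroup
	for i := 0; i < 10; i++ { // concurrent Ping calls race the old nil check
		wg.Add(1)
		go func() { defer wg.Done(); a.Ping() }()
	}
	wg.Wait()
	fmt.Println("initializations:", a.inits) // exactly 1
}
```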

Cross-reference analysis (from #5946)

Two community members tested the initial fix (fe11347):

|                   | @zoopp                 | @ltan10                 |
|-------------------|------------------------|-------------------------|
| What drops        | Per-session revdial WS | Main control connection |
| Fix addresses it? | Yes                    | No (different issue)    |
| Intermediary      | GCP Load Balancer      | External reverse proxy  |

@ltan10's issue is the main control connection dying (device goes offline), which is a separate problem — the main connection already has four independent keepalive sources at 30s intervals. Their 20–27 minute disconnect pattern points to external proxy behavior (max connection duration, connection pool limits), not a ShellHub bug. We've asked for their proxy configuration details in #5946.

Test plan

  • go build — all services compile
  • Integration tests pass locally (postgres)
  • Deploy behind a load balancer, leave SSH session idle 10+ minutes
  • Verify session stays alive and no "close connection due pong timeout" in logs
  • @zoopp / @ltan10: please re-test with this branch

Review

@gustavosbarreto — these changes touch the WebSocket adapter used by all agent↔server connections (V1 revdial, V2 yamux, web terminal). The sync.Once changes make Close() idempotent (subsequent calls return nil instead of re-closing the connection) and Ping() initialization atomic. The Nginx timeout change affects all WebSocket locations in the gateway. Would appreciate your review on whether these behavioral changes could have deeper impact on the platform — particularly the idempotent Close() semantics and whether any code path depends on calling conn.Close() multiple times.

Fixes #5946

@gustavosbarreto (Member) commented:

I wonder if the Ping() would be better placed in grabConn() (agent-side) instead of ConnHandler() (server-side). Adding a goroutine per active session on the server doesn't scale. On the agent side each process only manages its own sessions, so the cost is distributed.

gorilla/websocket already responds to pings with pongs during NextReader(). No extra goroutine on the server, same bidirectional traffic for the LB.

```go
// pkg/revdial/revdial.go — grabConn(), line 442
c := wsconnadapter.New(wsConn)
c.Ping()

select {
case ln.connc <- c:
case <-ln.donec:
}
```

@otavio force-pushed the fix/revdial-session-keepalive branch from 7031a95 to fe11347 on March 5, 2026 18:08
@otavio (Member, Author) commented Mar 5, 2026

@gustavosbarreto Good call — I've moved Ping() to grabConn() on the agent side as you suggested. This distributes the goroutine cost across agents instead of concentrating it on the server.

This fix works for the immediate issue, but we'll need to think more deeply about how we want to implement this long-term. Given that we've never had agent integration in other languages, we need to find a way to enable this back-and-forth communication that accounts for all clients. Some of our customers rely on a large number of devices, and in those cases this extra WebSocket ping/pong transmission on every active session could impact performance.

@otavio (Member, Author) commented Mar 5, 2026

Sharing some research on cloud load balancer behavior that reinforces why this keepalive fix is necessary.

All major cloud vendors enforce idle timeouts on load-balanced TCP/WebSocket connections, and some require application-level data (not just TCP keepalives) to reset the timer:

| Vendor | LB Type     | Default Idle Timeout | TCP Keepalive Resets Timer? | App-Level Data Required? |
|--------|-------------|----------------------|-----------------------------|--------------------------|
| AWS    | NLB         | 350s                 | ✅ Yes                      | No                       |
| AWS    | ALB         | 60s                  | ❌ No                       | Yes                      |
| Azure  | Standard LB | 4 min                | ✅ Yes                      | Recommended              |
| GCP    | TCP Proxy   | 30s                  | Conflicting docs            | Yes                      |

AWS ALB and GCP TCP Proxy both require actual data in the payload to keep the connection alive — TCP-level keepalive packets aren't enough. WebSocket ping/pong frames count as application-level data, which is why our fix in #5947 works.

Ref: GCE discussion on TCP proxy LB timeouts

@otavio (Member, Author) commented Mar 6, 2026

@gustavosbarreto — requesting your review on this one. The changes go beyond the original per-session keepalive fix and touch shared infrastructure, so I want to make sure nothing has unintended side effects.

What to pay attention to

1. wsconnadapter.Close() is now idempotent

The old Close() called conn.Close() on the underlying gorilla/websocket connection every time it was invoked. The new version uses sync.Once — only the first call actually closes the connection; subsequent calls return nil.

This matters because several code paths can trigger Close() concurrently or multiple times:

  • The pong timeout time.AfterFunc fires and calls Close()
  • The revdial Dialer.close() calls conn.Close()
  • yamux session teardown calls Close() on the underlying connection

If any code path currently depends on getting an error back from a second Close() call (to detect "already closed"), it will now silently get nil. I reviewed the callers and didn't find such a case, but you know the platform better.

2. wsconnadapter.Ping() initialization is now atomic

The old nil-check guard (if a.pongCh != nil) was racy under concurrent calls. The new sync.Once ensures channels and the ping goroutine are created exactly once. Current callers only call Ping() once per adapter, so this is defensive — but worth verifying no future or cloud-specific code path calls it differently.

3. Gateway Nginx proxy_read_timeout 0 on all WebSocket locations

This disables Nginx's read timeout for /ssh/connection, /ssh/revdial, /agent/connection, and /ws. Application-level keepalive (30s ping/pong + JSON keep-alive) handles liveness instead.

The concern here is zombie connections: if the application-level keepalive ever stops working (bug, deadlock), Nginx won't clean up the connection — it will hang indefinitely. Previously, the 60s default acted as a safety net (albeit a fragile one). Worth considering whether we want a large finite value (e.g., proxy_read_timeout 3600) instead of 0 as a last-resort safety net.

otavio added 5 commits March 9, 2026 21:06
In V1 (revdial) transport, the main control connection has bidirectional
keepalive (ping/pong + JSON keep-alive), but per-session WebSocket
connections created via dial-back have none. During idle SSH sessions,
the only traffic on the per-session WebSocket is the agent's SSH keepalive
(agent→server, one-way every 30s). The server→agent direction is
completely silent, causing intermediaries (load balancers, NAT, firewalls)
to detect a half-idle connection and close it.

Call Ping() on the agent-side wsconnadapter in grabConn() to start
sending WebSocket ping frames every 30 seconds. gorilla/websocket
automatically responds to pings with pong frames during NextReader(),
creating bidirectional traffic that keeps the connection alive through
all intermediaries. Placing this on the agent side distributes the
goroutine cost across agents rather than concentrating it on the server.

Fixes: #5946
…close race

- Add proxy_read_timeout 0 to all WebSocket locations in the gateway
  Nginx config (/ssh/connection, /ssh/revdial, /agent/connection, /ws).
  The default 60s timeout only worked because keepalive runs at 30s — a
  fragile coupling. Application-level ping/pong handles liveness instead.
- Fix race condition in wsconnadapter Close() where concurrent callers
  (e.g. pong timeout AfterFunc and normal teardown) could panic on
  send-to-closed-channel or double-close the WebSocket connection. Use
  sync.Once to guarantee both channel and connection cleanup happen
  exactly once.
The previous nil-check guard on pongCh was racy: two concurrent callers
could both see nil and create duplicate channels and goroutines, leaking
the first set. Use sync.Once to guarantee initialization happens exactly
once, consistent with the Close() fix in the previous commit.
The terminal_window_size_change test had a 2s per-attempt timeout
and 5s overall deadline for reading stty output over a PTY. Under
load, terminal I/O can exceed these tight limits, causing ~40%
flakiness locally. Increase to 10s per-attempt and 30s overall.

Verified with 20 consecutive runs (0 failures).
The sync.Once Close() previously returned nil on repeated calls.
Store the error from the first close so callers always receive it.
@otavio force-pushed the fix/revdial-session-keepalive branch from 36a5b6e to 2eb7d2d on March 10, 2026 00:07


Development

Successfully merging this pull request may close these issues.

"Idle" SSH connections are forcefully disconnected
