fix: send HTTP 102 keepalive during long-poll to prevent middlebox timeouts #4106

Open
msfstef wants to merge 4 commits into main from msfstef/keepalive-during-longpoll

@msfstef msfstef commented Apr 9, 2026

Summary

  • Sends periodic HTTP 102 Processing informational responses during long-poll holds to prevent network middleboxes (Cloudflare edge nodes on long paths to origin) from dropping idle connections (522 errors)
  • Zero client changes required — fully backwards compatible with existing TypeScript and Elixir clients
  • All 51 existing API tests pass with no modifications

Problem

Long-poll requests to /v1/shape?live=true hold connections idle for up to 20 seconds inside hold_until_change. During this time, no bytes flow on the wire. Middleboxes on long network paths (e.g., Cloudflare colos BAH, HKG hitting ALB in us-east-1) can drop these connections, causing sporadic 522 errors.

Approach

Uses Plug.Conn.inform(conn, 102) to send HTTP 1xx informational responses every 5–15 seconds during the hold. These responses:

  1. Send bytes on the wire — keeps middlebox connections alive
  2. Don't commit the final response — headers and status code are sent only after the hold resolves, with correct values
  3. Are invisible to HTTP clients: fetch(), Req, and all standard HTTP clients transparently ignore 1xx responses per RFC 9110 §15.2
  4. Preserve CDN request collapsing — since 1xx responses are not final responses, the CDN's coalescing window remains open for the full hold duration (see CDN section below)

Implementation details

  • A :timer.send_interval starts when a live non-SSE request enters the Plug path (serve_shape_response/2 or serve_shape_log/2)
  • A register_before_send callback cancels the timer and flushes stale messages when the response is sent
  • The keepalive callback is stored as a closure (on_keepalive) on the Request struct, keeping Plug.Conn out of the domain struct
  • hold_until_change is split into two functions: the outer sets up a Process.send_after timeout timer (replacing receive...after which would reset on keepalive re-entry), the inner do_hold_until_change handles the receive loop with a new :long_poll_keepalive clause
  • Keepalive interval is div(long_poll_timeout, 4) clamped to 5–15 seconds
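
The interval rule and the two-function hold described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the module name KeepaliveSketch, the {:new_changes, offset} message, the return values, and the optional interval argument are all hypothetical. In the PR the on_keepalive closure presumably wraps Plug.Conn.inform(conn, 102); here it is any zero-arity function, so the sketch runs without Plug.

```elixir
defmodule KeepaliveSketch do
  # Clamp bounds in milliseconds (5–15 seconds, as stated in the description).
  @min_interval 5_000
  @max_interval 15_000

  # A quarter of the long-poll timeout, clamped to the 5–15 s range.
  def keepalive_interval(long_poll_timeout) do
    long_poll_timeout
    |> div(4)
    |> max(@min_interval)
    |> min(@max_interval)
  end

  # Outer function: arms a one-shot timeout and a repeating keepalive tick,
  # then cleans both up however the hold ends. The `interval` override is a
  # hypothetical convenience for trying the sketch with short timings.
  def hold_until_change(long_poll_timeout, on_keepalive, interval \\ nil) do
    interval = interval || keepalive_interval(long_poll_timeout)
    ref = make_ref()
    timeout_timer = Process.send_after(self(), {:long_poll_timeout, ref}, long_poll_timeout)
    {:ok, keepalive_timer} = :timer.send_interval(interval, :long_poll_keepalive)

    try do
      do_hold_until_change(ref, on_keepalive)
    after
      Process.cancel_timer(timeout_timer)
      :timer.cancel(keepalive_timer)
      flush(ref)
    end
  end

  # Inner loop: a keepalive tick re-enters the receive without touching the
  # timeout, because the timeout is a separate message rather than a
  # `receive ... after` clause that would restart on every re-entry.
  defp do_hold_until_change(ref, on_keepalive) do
    receive do
      # Hypothetical message shape standing in for "the shape has new data".
      {:new_changes, offset} ->
        {:changes, offset}

      {:long_poll_timeout, ^ref} ->
        :timeout

      :long_poll_keepalive ->
        on_keepalive.()
        do_hold_until_change(ref, on_keepalive)
    end
  end

  # Drops timer messages that were already delivered before cancellation.
  defp flush(ref) do
    receive do
      :long_poll_keepalive -> flush(ref)
      {:long_poll_timeout, ^ref} -> flush(ref)
    after
      0 -> :ok
    end
  end
end
```

With the 20-second timeout mentioned in the problem statement, KeepaliveSketch.keepalive_interval(20_000) evaluates to 5_000, which is the 5-second cadence referred to elsewhere in this description.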

Alternatives considered

Chunked transfer encoding with whitespace keepalive bytes

Send Transfer-Encoding: chunked immediately and emit " " (space) chunks as keepalive during the hold, then send the real JSON payload as the final chunk.

Rejected for two reasons:

  1. Breaks backwards compatibility. HTTP headers are committed when chunked encoding starts, before the hold resolves. The electric-offset header would be stale when new data arrives during the hold. This requires a TypeScript client change (using body-based offset instead of header-based) to avoid an infinite re-fetch loop — breaking old clients on the new server.

  2. Breaks CDN request collapsing. Live long-poll responses are cacheable (Cache-Control: public, max-age=5, stale-while-revalidate=5). With the current non-chunked approach, CDNs like Cloudflare hold the coalescing window open for the full hold duration — multiple clients requesting the same shape+offset are collapsed into one origin request. Starting a chunked response immediately closes this window on Cloudflare and CloudFront because the response is already "in flight." With 102, the coalescing window stays open since 1xx responses are not final responses.

Stream.resource pattern (modeled on existing SSE keepalive)

Replace hold_until_change entirely with a Stream.resource that emits keepalive chunks and data chunks as a streaming response body.

Rejected because it changes the response semantics: the Response struct fields (offset, up_to_date, status) are frozen at stream creation time rather than set after the hold resolves. This broke 11 existing tests and required changing error behavior (shape rotation, out-of-bounds, and stack failure all became 200 with an empty body instead of their proper status codes). It also has the same CDN collapsing and backwards-compatibility issues as the chunked approach.

Shorter long-poll timeout

Reduce long_poll_timeout to well under the middlebox timeout.

Rejected because it raises the request rate for idle shapes in proportion to the reduction (halving the timeout doubles it) without solving the root cause.

On HTTP 102 Processing

102 Processing was defined in RFC 2518 (WebDAV, 1999) specifically for preventing client timeouts on long-running requests — the exact use case here. It was removed in RFC 4918 (2007) with the stated reason: "due to lack of implementation" — not because it was harmful or architecturally unsound.

MDN labels 102 as "deprecated," but this is an editorial judgment, not a standards-body action. No RFC has ever formally deprecated 102. The status is more accurately described as "registered with IANA but no longer defined by any active RFC." Being defined outside the core HTTP specification is not unusual in itself: 429 Too Many Requests, for example, is defined in RFC 6585 rather than in RFC 9110.

Why it's safe to use

  • RFC 9110 §15.2 requires all HTTP/1.1+ clients to parse unknown 1xx responses and allows ignoring them — sending 102 cannot break spec-compliant implementations
  • RFC 9110 §15.1 explicitly acknowledges status codes "outside the scope of this specification" as legitimate, provided they are IANA-registered
  • 102 is permanently registered in the IANA HTTP Status Code Registry with no deregistration attempts
  • Cloudflare explicitly supports and recommends it for keepalive: "If Cloudflare receives a 102 Processing response, it expects a final response within 120 seconds"
  • All major proxies forward it: nginx, HAProxy, Envoy (since envoyproxy/envoy#19023)
  • HTTP/2 (RFC 9113 §8.8.5) explicitly handles 1xx informational responses including 102
  • Bandit supports it natively via Plug.Conn.inform/2; Node.js has response.writeProcessing() since v10; Go added 1xx support in Go 1.19
  • No other 1xx code fits: 100 Continue is for request bodies, 103 Early Hints is for resource preloading and would confuse browsers

Known limitations

  • Go's net/http has a default limit of 5 consecutive 1xx responses. Our 3–4 responses per hold (at 5s intervals over 20s) are within this limit.
  • AWS ALB may silently drop 1xx responses — this is harmless since ALB's idle timeout (default 60s, configurable up to 4000s) already exceeds typical long-poll durations.
  • Spring Framework 7.0 deprecated the HttpStatus.PROCESSING enum constant. This is the only major framework taking action, and it's the constant, not the protocol behavior.
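
The "3–4 responses per hold" figure can be checked with a little arithmetic. The sketch below is illustrative only: the InformBudget module and its function names are hypothetical, and the limit of 5 is simply the figure quoted in the first bullet above. A tick that lands exactly on the timeout races with it, which is why the result is a range rather than a single number.

```elixir
defmodule InformBudget do
  # Limit on 1xx responses per request, as quoted for Go's net/http above.
  @go_max_1xx 5

  # Range of keepalive ticks a full hold can produce. Ticks strictly before
  # the timeout always fire; a tick exactly at the timeout may or may not.
  def tick_range(hold_ms, interval_ms) when hold_ms > 0 and interval_ms > 0 do
    guaranteed = div(hold_ms - 1, interval_ms)
    racing = if rem(hold_ms, interval_ms) == 0, do: 1, else: 0
    guaranteed..(guaranteed + racing)
  end

  # True when even the worst case stays within the quoted limit.
  def within_go_limit?(hold_ms, interval_ms) do
    Enum.max(tick_range(hold_ms, interval_ms)) <= @go_max_1xx
  end
end
```

For a 20-second hold at a 5-second interval, tick_range(20_000, 5_000) gives 3..4. The margin depends on the configured timeout: once the interval is clamped at 15 seconds, a sufficiently long hold (for example 120 seconds) would produce more ticks than the quoted limit, so the check is worth repeating if long_poll_timeout is raised.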

CDN request collapsing

Electric's live long-poll responses are cacheable (Cache-Control: public, max-age=5, stale-while-revalidate=5), which enables CDN request collapsing — multiple clients requesting the same shape+offset are collapsed into a single origin request.

The 102 approach preserves this because 1xx informational responses are not final responses. The CDN continues to hold the coalescing window open until the final 200 arrives. This is significant under high concurrency where many clients subscribe to the same shape.

By contrast, the alternative chunked-whitespace approach would close the coalescing window immediately by committing a 200 response before the hold resolves.

Test plan

  • All 51 existing api_test.exs tests pass with no modifications
  • All 311 TypeScript client unit tests pass with no modifications
  • Integration test with real HTTP server verifying 102 responses arrive on the wire
  • Manual verification against Cloudflare-fronted deployment

🤖 Generated with Claude Code

…meouts

Long-poll requests hold connections idle for up to 20s in a receive
block. Network middleboxes (particularly Cloudflare edge nodes on long
paths to origin) can drop these idle connections, causing 522 errors.

Send periodic HTTP 102 Processing informational responses via
Plug.Conn.inform during the hold to keep the connection alive. The 1xx
responses are invisible to HTTP clients (they only see the final
response), so this requires zero client changes and is fully backwards
compatible.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@msfstef msfstef added the claude label Apr 9, 2026
@msfstef msfstef requested review from alco and icehaunter April 9, 2026 14:26

codecov bot commented Apr 9, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 88.67%. Comparing base (11b151b) to head (54a026e).
⚠️ Report is 3 commits behind head on main.
✅ All tests successful. No failed tests found.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #4106   +/-   ##
=======================================
  Coverage   88.67%   88.67%           
=======================================
  Files          25       25           
  Lines        2438     2438           
  Branches      612      611    -1     
=======================================
  Hits         2162     2162           
  Misses        274      274           
  Partials        2        2           
Flag                          Coverage Δ
packages/experimental         87.73% <ø> (ø)
packages/react-hooks          86.48% <ø> (ø)
packages/start                82.83% <ø> (ø)
packages/typescript-client    93.81% <ø> (ø)
packages/y-electric           56.05% <ø> (ø)
typescript                    88.67% <ø> (ø)
unit-tests                    88.67% <ø> (ø)

claude bot commented Apr 9, 2026

Claude Code Review

Summary

This PR adds periodic HTTP 102 Processing informational responses during long-poll holds to prevent middlebox connection drops. The implementation remains correct and clean; the latest commit (54a026eb7) cleanly addresses reviewer feedback from alco.

What's Working Well

  • Timer start co-located with timeout timer: start_keepalive_timer(request) and Process.send_after(self(), {:long_poll_timeout, ref}, long_poll_timeout) are both inside hold_until_change — started together, cancelled together in the try...after block.
  • Clean two-phase setup: set_long_poll_keepalive/2 (needs conn, called at Plug boundary) is correctly separated from start_keepalive_timer/1 (called only when the process enters the hold). This is the right trade-off given conn is not available inside hold_until_change.
  • Stale comment removed: The comment referencing receive...after re-entry (alco's first flag) is gone.
  • All prior positives remain: try...after cleanup, flush_long_poll_keepalive, lux macro correctness, SSE guard, interval clamping.

Issues Found

No new issues.

Suggestions (Nice to Have)

No linked issue — This PR has no linked GitHub issue. Per project convention, PRs should reference the issue they address.

Missing integration test for wire-level 102 delivery — Acknowledged in the PR description. The lux curl_shape macro now correctly strips 1xx responses, making a future integration test straightforward to write.

Issue Conformance

No linked issue. The PR description is thorough and self-contained as problem statement and acceptance criteria.

Previous Review Status

  • Resolved (iteration 3→4): Stale comment about receive...after re-entry removed.
  • Resolved (iteration 3→4): start_keepalive_timer moved into hold_until_change, co-located with the timeout timer start — directly addresses alco's feedback.
  • Acknowledged (open): Integration test for wire-level 102 delivery.

Review iteration: 4 | 2026-04-09

msfstef and others added 2 commits April 9, 2026 17:35
Address review feedback:
- Add on_keepalive field to Request @type t() spec for Dialyzer
- Document intentional ignore of Plug.Conn.inform/2 return value

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The curl_shape awk script assumed a single status line followed by
headers and body. With the new 102 Processing keepalive, curl outputs
the 1xx response before the final response, causing awk to treat the
blank line after 102 as the header/body separator and pipe the real
headers into jq.

Skip 1xx informational response blocks entirely so only the final
response headers and body are processed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
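
The idea of skipping 1xx blocks can be illustrated in Elixir as well. This is not the awk script from the commit, only a hedged sketch of the same technique: the StripInformational module and its function names are hypothetical, and the input is assumed to look like `curl -i` output, where each response block ends with a blank line.

```elixir
defmodule StripInformational do
  # Returns {headers, body} for the final response, discarding any leading
  # 1xx informational blocks (for example "HTTP/1.1 102 Processing").
  def final_response(raw) do
    # Split only at the first blank line so blank lines in the body survive.
    case String.split(raw, ~r{\r?\n\r?\n}, parts: 2) do
      [head, rest] ->
        if informational?(head), do: final_response(rest), else: {head, rest}

      [head] ->
        {head, ""}
    end
  end

  # Matches status lines such as "HTTP/1.1 102 ..." or "HTTP/2 103 ...".
  defp informational?(head) do
    Regex.match?(~r{\AHTTP/\d(\.\d)? 1\d\d\b}, head)
  end
end
```

Without this step, the first blank line after a 102 block would be taken as the header/body separator, which is the failure the commit message describes.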
- Move keepalive timer start into hold_until_change so it only runs
  when the process actually enters the hold (not for immediate responses)
- Separate set_long_poll_keepalive (sets closure, needs conn) from
  start_keepalive_timer (starts interval, called in hold_until_change)
- Clean up comments to reference what exists, not what was removed

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

netlify bot commented Apr 9, 2026

Deploy Preview for electric-next ready!

Name Link
🔨 Latest commit 54a026e
🔍 Latest deploy log https://app.netlify.com/projects/electric-next/deploys/69d7cb75186e26000875e7ef
😎 Deploy Preview https://deploy-preview-4106--electric-next.netlify.app

@msfstef msfstef self-assigned this Apr 9, 2026