fix: send HTTP 102 keepalive during long-poll to prevent middlebox timeouts by msfstef · Pull Request #4106 · electric-sql/electric

msfstef · 2026-04-09T14:26:33Z

Summary

Sends periodic HTTP 102 Processing informational responses during long-poll holds so that network middleboxes (for example, Cloudflare edge nodes on long paths to origin) do not drop idle connections, which surfaces as 522 errors
Zero client changes required — fully backwards compatible with existing TypeScript and Elixir clients
All 51 existing API tests pass with no modifications

Problem

Long-poll requests to /v1/shape?live=true hold connections idle for up to 20 seconds inside hold_until_change. During this time, no bytes flow on the wire. Middleboxes on long network paths (e.g., Cloudflare colos BAH, HKG hitting ALB in us-east-1) can drop these connections, causing sporadic 522 errors.

Approach

Uses Plug.Conn.inform(conn, 102) to send HTTP 1xx informational responses every 5–15 seconds during the hold. These responses:

Send bytes on the wire — keeps middlebox connections alive
Don't commit the final response — headers and status code are sent only after the hold resolves, with correct values
Are invisible to HTTP clients: fetch(), Req, and other spec-compliant HTTP clients skip 1xx responses, since RFC 9110 §15.2 requires clients to parse them and allows user agents to ignore them
Preserve CDN request collapsing — since 1xx responses are not final responses, the CDN's coalescing window remains open for the full hold duration (see CDN section below)
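
As a rough illustration of the points above, the sketch below builds the byte sequence a client might see for one long-poll request: a few 102 Processing blocks followed by a single final response. It is plain Python rather than the PR's Elixir code, and `response_head`, `long_poll_wire_bytes`, and the `0_0` offset value are hypothetical names and values chosen for the example.

```python
def response_head(status, reason, headers=()):
    """Serialize a status line plus headers, terminated by an empty line."""
    lines = [f"HTTP/1.1 {status} {reason}"]
    lines += [f"{name}: {value}" for name, value in headers]
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")


def long_poll_wire_bytes(keepalives, final_headers, body):
    """Bytes for N interim 102 blocks followed by one final 200 response.

    Each 102 block is only a status line and an empty header section, so it
    puts bytes on the wire without fixing the final status or headers.
    """
    interim = response_head(102, "Processing") * keepalives
    final = response_head(
        200, "OK", list(final_headers) + [("content-length", str(len(body)))]
    )
    return interim + final + body


# Illustrative values only: two keepalives, then the real response.
wire = long_poll_wire_bytes(2, [("electric-offset", "0_0")], b"[]")
```

The final headers are passed in only when the final response is built, which mirrors the claim that the offset header can still reflect data that arrived during the hold.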

Implementation details

A :timer.send_interval starts when a live non-SSE request enters the Plug path (serve_shape_response/2 or serve_shape_log/2)
A register_before_send callback cancels the timer and flushes stale messages when the response is sent
The keepalive callback is stored as a closure (on_keepalive) on the Request struct, keeping Plug.Conn out of the domain struct
hold_until_change is split into two functions. The outer one arms the timeout with Process.send_after, replacing receive...after, whose timeout would restart each time the loop re-enters receive after a keepalive. The inner do_hold_until_change runs the receive loop, with a new :long_poll_keepalive clause
Keepalive interval is div(long_poll_timeout, 4) clamped to 5–15 seconds
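
The two-function split can be sketched outside Elixir as follows. This is a Python analogy, not the PR's code: a `queue.Queue` stands in for the process mailbox, `threading.Timer` stands in for Process.send_after, and the message names are assumptions modeled on the description above. The point it tries to show is that the timeout is armed once, so handling a keepalive message does not push it back.

```python
import queue
import threading


def hold_until_change(mailbox, timeout_s, on_keepalive):
    """Outer function: arm the timeout once, then run the receive loop."""
    ref = object()  # unique marker so a stale timeout message is never matched
    timer = threading.Timer(
        timeout_s, mailbox.put, args=(("long_poll_timeout", ref),)
    )
    timer.daemon = True
    timer.start()
    try:
        return _do_hold_until_change(mailbox, ref, on_keepalive)
    finally:
        # Cancel the timer on every exit path, like a try...after block.
        timer.cancel()


def _do_hold_until_change(mailbox, ref, on_keepalive):
    """Inner loop: keepalives are handled and the loop simply continues."""
    while True:
        msg = mailbox.get()  # no per-receive timeout, so nothing resets here
        if msg == "long_poll_keepalive":
            on_keepalive()
            continue
        if msg == ("long_poll_timeout", ref):
            return "timeout"
        return msg
```

Because the deadline arrives as an ordinary message, any number of keepalive messages can be processed without extending the hold beyond `timeout_s`.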

Alternatives considered

Chunked transfer encoding with whitespace keepalive bytes

Send Transfer-Encoding: chunked immediately and emit " " (space) chunks as keepalive during the hold, then send the real JSON payload as the final chunk.

Rejected for two reasons:

Breaks backwards compatibility. HTTP headers are committed when chunked encoding starts, before the hold resolves. The electric-offset header would be stale when new data arrives during the hold. This requires a TypeScript client change (using body-based offset instead of header-based) to avoid an infinite re-fetch loop — breaking old clients on the new server.
Breaks CDN request collapsing. Live long-poll responses are cacheable (Cache-Control: public, max-age=5, stale-while-revalidate=5). With the current non-chunked approach, CDNs like Cloudflare hold the coalescing window open for the full hold duration — multiple clients requesting the same shape+offset are collapsed into one origin request. Starting a chunked response immediately closes this window on Cloudflare and CloudFront because the response is already "in flight." With 102, the coalescing window stays open since 1xx responses are not final responses.

Stream.resource pattern (modeled on existing SSE keepalive)

Replace hold_until_change entirely with a Stream.resource that emits keepalive chunks and data chunks as a streaming response body.

Rejected because it changes the response semantics: the Response struct fields (offset, up_to_date, status) are frozen at stream creation time rather than set after the hold resolves. This broke 11 existing tests and required changing error behavior (shape rotation, out-of-bounds, stack failure all became 200 with empty body instead of their proper status codes). Also has the same CDN collapsing and backwards-compatibility issues as the chunked approach.

Shorter long-poll timeout

Reduce long_poll_timeout to well under the middlebox timeout.

Rejected because it doubles the request rate for idle shapes without solving the root cause.

On HTTP 102 Processing

102 Processing was defined in RFC 2518 (WebDAV, 1999) specifically for preventing client timeouts on long-running requests — the exact use case here. It was removed in RFC 4918 (2007) with the stated reason: "due to lack of implementation" — not because it was harmful or architecturally unsound.

MDN labels 102 as "deprecated," but this is an editorial judgment, not a standards-body action. No RFC has ever formally deprecated 102. The status is more accurately described as "no longer defined by any active RFC". Living outside the core HTTP specification is not unusual in itself: 429 Too Many Requests, for example, is defined in RFC 6585 rather than in RFC 9110.

Why it's safe to use

RFC 9110 §15.2 requires all HTTP/1.1+ clients to parse unknown 1xx responses and allows ignoring them — sending 102 cannot break spec-compliant implementations
RFC 9110 §15.1 explicitly acknowledges status codes "outside the scope of this specification" as legitimate, provided they are IANA-registered
102 is permanently registered in the IANA HTTP Status Code Registry with no deregistration attempts
Cloudflare explicitly supports and recommends it for keepalive: "If Cloudflare receives a 102 Processing response, it expects a final response within 120 seconds"
All major proxies forward it: nginx, HAProxy, Envoy (since envoyproxy/envoy#19023)
HTTP/2 (RFC 9113 §8.8.5) explicitly handles 1xx informational responses including 102
Bandit supports it natively via Plug.Conn.inform/2; Node.js has response.writeProcessing() since v10; Go added 1xx support in Go 1.19
No other 1xx code fits: 100 Continue is for request bodies, 103 Early Hints is for resource preloading and would confuse browsers

Known limitations

Go's net/http has a default limit of 5 consecutive 1xx responses. Our 3–4 responses per hold (at 5s intervals over 20s) are within this limit.
AWS ALB may silently drop 1xx responses — this is harmless since ALB's idle timeout (default 60s, configurable up to 4000s) already exceeds typical long-poll durations.
Spring Framework 7.0 deprecated the HttpStatus.PROCESSING enum constant. This is the only major framework taking action, and it's the constant, not the protocol behavior.
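
The "3–4 responses per hold" figure can be checked with a few lines of arithmetic. The sketch below assumes `long_poll_timeout` is expressed in milliseconds and that the clamp bounds are 5000 and 15000 ms; the function names and the `GO_MAX_1XX` constant are hypothetical and only restate numbers from this description.

```python
GO_MAX_1XX = 5  # limit on 1xx responses quoted above for Go's net/http


def keepalive_interval_ms(long_poll_timeout_ms):
    """A quarter of the long-poll timeout, clamped to the 5-15 second range."""
    return max(5_000, min(15_000, long_poll_timeout_ms // 4))


def max_keepalives_per_hold(long_poll_timeout_ms):
    """Upper bound on interval ticks that can fire within one hold.

    The last tick may coincide with the timeout itself, so the number
    actually sent can be one lower than this bound.
    """
    return long_poll_timeout_ms // keepalive_interval_ms(long_poll_timeout_ms)


interval = keepalive_interval_ms(20_000)
count = max_keepalives_per_hold(20_000)
```

For a 20 second timeout this gives a 5 second interval and at most 4 keepalives, which stays under the quoted limit of 5.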

CDN request collapsing

Electric's live long-poll responses are cacheable (Cache-Control: public, max-age=5, stale-while-revalidate=5), which enables CDN request collapsing — multiple clients requesting the same shape+offset are collapsed into a single origin request.

The 102 approach preserves this because 1xx informational responses are not final responses. The CDN continues to hold the coalescing window open until the final 200 arrives. This is significant under high concurrency where many clients subscribe to the same shape.

By contrast, the alternative chunked-whitespace approach would close the coalescing window immediately by committing a 200 response before the hold resolves.

Test plan

All 51 existing api_test.exs tests pass with no modifications
All 311 TypeScript client unit tests pass with no modifications
Integration test with real HTTP server verifying 102 responses arrive on the wire
Manual verification against Cloudflare-fronted deployment

🤖 Generated with Claude Code

…meouts

Long-poll requests hold connections idle for up to 20s in a receive block. Network middleboxes (particularly Cloudflare edge nodes on long paths to origin) can drop these idle connections, causing 522 errors.

Send periodic HTTP 102 Processing informational responses via Plug.Conn.inform during the hold to keep the connection alive. The 1xx responses are invisible to HTTP clients (they only see the final response), so this requires zero client changes and is fully backwards compatible.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

codecov · 2026-04-09T14:30:11Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 88.67%. Comparing base (11b151b) to head (54a026e).
⚠️ Report is 3 commits behind head on main.
✅ All tests successful. No failed tests found.

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #4106   +/-   ##
=======================================
  Coverage   88.67%   88.67%           
=======================================
  Files          25       25           
  Lines        2438     2438           
  Branches      612      611    -1     
=======================================
  Hits         2162     2162           
  Misses        274      274           
  Partials        2        2

Flag	Coverage Δ
packages/experimental	`87.73% <ø> (ø)`
packages/react-hooks	`86.48% <ø> (ø)`
packages/start	`82.83% <ø> (ø)`
packages/typescript-client	`93.81% <ø> (ø)`
packages/y-electric	`56.05% <ø> (ø)`
typescript	`88.67% <ø> (ø)`
unit-tests	`88.67% <ø> (ø)`

claude · 2026-04-09T14:30:22Z

Claude Code Review

Summary

This PR adds periodic HTTP 102 Processing informational responses during long-poll holds to prevent middlebox connection drops. The implementation remains correct and clean; the latest commit (54a026eb7) cleanly addresses reviewer feedback from alco.

What's Working Well

Timer start co-located with timeout timer: start_keepalive_timer(request) and Process.send_after(self(), {:long_poll_timeout, ref}, long_poll_timeout) are both inside hold_until_change — started together, cancelled together in the try...after block.
Clean two-phase setup: set_long_poll_keepalive/2 (needs conn, called at Plug boundary) is correctly separated from start_keepalive_timer/1 (called only when the process enters the hold). This is the right trade-off given conn is not available inside hold_until_change.
Stale comment removed: The comment referencing receive...after re-entry (alco's first flag) is gone.
All prior positives remain: try...after cleanup, flush_long_poll_keepalive, lux macro correctness, SSE guard, interval clamping.

Issues Found

No new issues.

Suggestions (Nice to Have)

No linked issue — This PR has no linked GitHub issue. Per project convention, PRs should reference the issue they address.

Missing integration test for wire-level 102 delivery — Acknowledged in the PR description. The lux curl_shape macro now correctly strips 1xx responses, making a future integration test straightforward to write.

Issue Conformance

No linked issue. The PR description is thorough and self-contained as problem statement and acceptance criteria.

Previous Review Status

Resolved (iteration 3→4): Stale comment about receive...after re-entry removed.
Resolved (iteration 3→4): start_keepalive_timer moved into hold_until_change, co-located with the timeout timer start — directly addresses alco's feedback.
Acknowledged (open): Integration test for wire-level 102 delivery.

Review iteration: 4 | 2026-04-09

@type

Address review feedback:
- Add on_keepalive field to Request @type t() spec for Dialyzer
- Document intentional ignore of Plug.Conn.inform/2 return value

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The curl_shape awk script assumed a single status line followed by headers and body. With the new 102 Processing keepalive, curl outputs the 1xx response before the final response, causing awk to treat the blank line after 102 as the header/body separator and pipe the real headers into jq.

Skip 1xx informational response blocks entirely so only the final response headers and body are processed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
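
The skipping rule described in this commit message can be sketched in Python. This is not the awk script from the PR; `final_response` is a hypothetical helper, and the sample header value is made up. It assumes input shaped like `curl -i` output, where each response block is a status line plus headers followed by a blank line.

```python
def final_response(raw):
    """Return (status, headers, body) of the final response in curl -i output.

    Blocks whose status is 1xx are skipped; the first non-1xx block is
    treated as the real response and everything after its blank line is body.
    """
    rest = raw.replace("\r\n", "\n")
    while True:
        head, _, rest = rest.partition("\n\n")
        lines = head.split("\n")
        status = int(lines[0].split(" ", 2)[1])
        if 100 <= status < 200:
            continue  # informational block: drop it and read the next one
        headers = {}
        for line in lines[1:]:
            name, _, value = line.partition(":")
            headers[name.strip().lower()] = value.strip()
        return status, headers, rest


# Illustrative input: two keepalives, then the final response.
sample = (
    "HTTP/1.1 102 Processing\r\n\r\n"
    "HTTP/1.1 102 Processing\r\n\r\n"
    "HTTP/1.1 200 OK\r\n"
    "content-type: application/json\r\n"
    "electric-offset: 0_0\r\n\r\n"
    "[]"
)
status, headers, body = final_response(sample)
```

Without the 1xx check, the blank line after the first 102 block would be taken as the header/body separator, which is the failure the commit describes.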

packages/sync-service/lib/electric/shapes/api.ex

- Move keepalive timer start into hold_until_change so it only runs when the process actually enters the hold (not for immediate responses)
- Separate set_long_poll_keepalive (sets closure, needs conn) from start_keepalive_timer (starts interval, called in hold_until_change)
- Clean up comments to reference what exists, not what was removed

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

netlify · 2026-04-09T15:54:52Z

✅ Deploy Preview for electric-next ready!

Name	Link
🔨 Latest commit	`54a026e`
🔍 Latest deploy log	https://app.netlify.com/projects/electric-next/deploys/69d7cb75186e26000875e7ef
😎 Deploy Preview	https://deploy-preview-4106--electric-next.netlify.app

msfstef added the claude label Apr 9, 2026

msfstef requested review from alco and icehaunter April 9, 2026 14:26

msfstef and others added 2 commits April 9, 2026 17:35

alco approved these changes Apr 9, 2026

msfstef self-assigned this Apr 9, 2026

icehaunter approved these changes Apr 10, 2026

msfstef wants to merge 4 commits into main from msfstef/keepalive-during-longpoll
