
Fix replication connection drops from wal_sender_timeout during backpressure#4105

Open
msfstef wants to merge 5 commits into main from msfstef/repl-connection-timeout

Conversation


@msfstef msfstef commented Apr 9, 2026

Problem

When the BEAM is busy or downstream event processing is slow/failing, the replication client's apply_with_retries blocks inside handle_data, preventing the gen_statem from responding to PostgreSQL's keepalive requests. After wal_sender_timeout (default 60s), PostgreSQL terminates the replication connection with errors like:

%DBConnection.ConnectionError{message: "ssl send: closed", severity: :error, reason: :closed}

The root cause is that PostgreSQL's replication protocol requires the client to periodically send StandbyStatusUpdate messages. These are application-level heartbeats — TCP keepalives (handled by the kernel) don't help. When the gen_statem is stuck in a blocking retry loop inside handle_data, it cannot send these messages, and PostgreSQL kills the connection.

Investigation

We confirmed this is a well-known problem across PostgreSQL logical replication consumers (Debezium, pgjdbc, pg_recvlogical). PostgreSQL's protocol has no flow control mechanism — keepalive handling is coupled with data processing on a single connection.

Key findings:

  • TCP keepalives don't help — they're kernel-level and work regardless of process scheduling, but wal_sender_timeout checks for application-level StandbyStatusUpdate messages
  • SSL has no heartbeat — TLS is just a record layer on top of TCP
  • wal_sender_timeout is the culprit — it fires when the client fails to send any StandbyStatusUpdate within the timeout window
  • Sending StandbyStatusUpdate without advancing the LSN is safe — it resets last_reply_timestamp without affecting the replication slot's confirmed_flush_lsn
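The last finding hinges on the shape of the StandbyStatusUpdate message itself. As a hedged sketch (module name and helper are illustrative; the wire layout follows the PostgreSQL streaming replication protocol docs — byte `?r`, three 64-bit LSNs, a timestamp in microseconds since the PostgreSQL epoch, and a reply-requested flag):

```elixir
defmodule StandbyStatusSketch do
  # Microseconds since the PostgreSQL epoch (2000-01-01 00:00:00 UTC).
  @pg_epoch_us DateTime.to_unix(~U[2000-01-01 00:00:00Z], :microsecond)

  defp pg_now_us do
    DateTime.to_unix(DateTime.utc_now(), :microsecond) - @pg_epoch_us
  end

  # Re-sending the last confirmed LSNs without advancing them is safe:
  # it resets the walsender's last_reply_timestamp without moving the
  # slot's confirmed_flush_lsn.
  def standby_status_update(written_lsn, flushed_lsn, applied_lsn, reply \\ 0) do
    <<?r, written_lsn::64, flushed_lsn::64, applied_lsn::64, pg_now_us()::64, reply::8>>
  end
end
```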

Solution

The fix addresses two requirements that are in tension:

  1. Send keepalives while event processing is blocked, to prevent wal_sender_timeout
  2. Stop the replication stream from pushing more data over the socket, to provide backpressure

Vendored Postgrex.ReplicationConnection with socket pause/resume

We vendor Postgrex.ReplicationConnection as Electric.Postgres.ReplicationConnection, adding two new callback return types:

  • {:noreply_and_pause, ack, state} — send ack messages, then pause socket reads. The gen_statem stops receiving new WAL data but remains responsive to handle_info messages (timers, notifications).
  • {:noreply_and_resume, ack, state} — send acks, process any buffered data, resume socket reads.

The pause mechanism leverages {active, :once} — by not re-arming the socket after processing a batch, no new data enters the process. At most one Erlang message is buffered. TCP flow control naturally backpressures PostgreSQL's walsender when the kernel receive buffer fills (~128KB-6MB depending on OS).
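As an illustrative sketch of those `{active, :once}` mechanics (hypothetical module, not the vendored code's actual internals) — "resuming" re-arms the socket for exactly one message, and "pausing" is simply not re-arming:

```elixir
defmodule SocketPauseSketch do
  # Deliver at most one more {:tcp, socket, data} message, then go passive.
  def resume(socket), do: :inet.setopts(socket, active: :once)

  def handle_info({:tcp, socket, data}, state) do
    state = process_batch(data, state)

    if state.paused? do
      # Don't re-arm: no new data enters the mailbox. Once the kernel
      # receive buffer fills, TCP flow control backpressures the walsender.
      {:noreply, state}
    else
      resume(socket)
      {:noreply, state}
    end
  end

  defp process_batch(_data, state), do: state
end
```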

Non-blocking event dispatch in ReplicationClient

Instead of blocking in handle_data via apply_with_retries:

handle_data(XLogData) → dispatch_event(event) → {:noreply_and_pause, [], state}

The event is dispatched as a {:process_event, ...} message to self. The gen_statem returns immediately and is free to process other messages. On success, {:noreply_and_resume, acks, state} resumes the socket. On failure, retries are scheduled via Process.send_after.
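A minimal sketch of that dispatch shape, assuming hypothetical `decode_event`/`apply_event` helpers and a `retry_backoff_ms` field (the PR's actual clause heads will differ):

```elixir
defmodule DispatchSketch do
  # XLogData arrives; hand it to ourselves as a message and pause the
  # socket instead of blocking the gen_statem in a retry loop.
  def handle_data(<<?w, _wal_start::64, _wal_end::64, _clock::64, data::binary>>, state) do
    send(self(), {:process_event, decode_event(data)})
    {:noreply_and_pause, [], state}
  end

  def handle_info({:process_event, event}, state) do
    case apply_event(event, state) do
      {:ok, acks, new_state} ->
        # Success: ack and re-arm the socket for more WAL data.
        {:noreply_and_resume, acks, new_state}

      {:error, _reason} ->
        # Failure: schedule a retry; the process stays free to handle
        # keepalive timers in the meantime.
        Process.send_after(self(), {:process_event, event}, state.retry_backoff_ms)
        {:noreply, state}
    end
  end

  defp decode_event(data), do: data
  defp apply_event(_event, state), do: {:ok, [], state}
end
```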

Periodic keepalive timer

A StandbyStatusUpdate is sent every min(wal_sender_timeout/3, 15s). The interval is derived from PostgreSQL's wal_sender_timeout (queried from pg_settings during connection setup). The 15s cap ensures responsiveness even if the timeout is set very high or changes after connection.
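The interval derivation is simple enough to sketch directly (assuming `wal_sender_timeout` arrives as milliseconds from `pg_settings`; the zero-disables-timeout fallback is my assumption, not confirmed by the PR):

```elixir
defmodule KeepaliveIntervalSketch do
  @cap_ms 15_000

  # wal_sender_timeout = 0 disables the server-side timeout entirely;
  # falling back to the cap is an assumption for this sketch.
  def interval_ms(0), do: @cap_ms
  def interval_ms(timeout_ms), do: min(div(timeout_ms, 3), @cap_ms)
end

# KeepaliveIntervalSketch.interval_ms(60_000) → 15_000 (default timeout, capped)
# KeepaliveIntervalSketch.interval_ms(3_000)  → 1_000  (the test's 3s timeout)
```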

Event-driven retry with StatusMonitor.wait_until_async

Replaces the blocking StatusMonitor.wait_until_active(timeout: :infinity) with a new generic wait_until_async/2 that subscribes to status transitions and notifies the caller via a tagged message. This also replaces the previous wait_until_conn_up_async with a single generic mechanism.
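Based on the tagged reply format described in the commit message, the async-wait flow might look roughly like this (the ref-returning signature and the `wait_for_active_ref`/`pending_event` fields are assumptions drawn from the review discussion):

```elixir
defmodule AsyncWaitSketch do
  def wait_for_active_and_retry(event, state) do
    ref = Electric.StatusMonitor.wait_until_async(state.stack_id, :active)
    {:noreply, %{state | wait_for_active_ref: ref, pending_event: event}}
  end

  # Matching ref: the stack became :active, so retry the pending event.
  def handle_info({{Electric.StatusMonitor, ref}, {:ok, :active}},
                  %{wait_for_active_ref: ref} = state) do
    send(self(), {:process_event, state.pending_event})
    {:noreply, %{state | wait_for_active_ref: nil}}
  end

  # Catch-all: silently discard stale notifications from a previous wait
  # instead of crashing with a FunctionClauseError.
  def handle_info({{Electric.StatusMonitor, _stale_ref}, _result}, state) do
    {:noreply, state}
  end
end
```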

Changes

| File | Change |
| --- | --- |
| lib/electric/postgres/replication_connection.ex | New — vendored Postgrex.ReplicationConnection with socket pause/resume |
| lib/electric/postgres/replication_client.ex | Non-blocking event dispatch, keepalive timer, async retry |
| lib/electric/postgres/replication_client/connection_setup.ex | Query wal_sender_timeout from pg_settings during setup |
| lib/electric/replication/shape_log_collector.ex | suspend/resume_event_processing for integration testing |
| lib/electric/status_monitor.ex | Generic wait_until_async/2 (replaces wait_until_conn_up_async) |
| lib/electric/connection/restarter.ex | Use wait_until_async |
| test/electric/status_monitor_test.exs | Tests for wait_until_async |
| test/electric/postgres/replication_client_test.exs | Test: connection survives wal_sender_timeout during backpressure |
| integration-tests/tests/replication-keepalive-during-backpressure.lux | Lux integration test |

Test plan

  • All 1537 existing unit tests pass
  • 6 new wait_until_async tests pass
  • New Elixir test: connection survives wal_sender_timeout=3s with 6s of stuck processing — fails on old code with tcp send: closed, passes with fix
  • Lux integration test: connection survives wal_sender_timeout=5s with 10s of stuck processing — fails on old code with gen_statem termination, passes with fix
  • 21/22 existing lux integration tests pass (1 pre-existing flaky failure unrelated to this change)

🤖 Generated with Claude Code

msfstef and others added 2 commits April 9, 2026 13:03
Replace the single-purpose wait_until_conn_up_async with a generic
wait_until_async/2 that supports any status level (:active, :read_only).
Uses a tagged reply format {{Electric.StatusMonitor, ref}, {:ok, level}}
to avoid message collisions in the caller's mailbox.

This removes the dedicated :wait_until_conn_up handler and conn_waiters
state, consolidating all async waiting through the existing level-based
waiter infrastructure.

The Restarter is updated to use wait_until_async(stack_id, :active).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ressure

When downstream event processing is slow or failing, the replication client
retries inside handle_data via apply_with_retries. This blocks the gen_statem
process, preventing it from responding to PostgreSQL's keepalive requests.
After wal_sender_timeout (default 60s), PostgreSQL terminates the connection.

The fix makes event processing non-blocking by:

1. Vendoring Postgrex.ReplicationConnection with socket pause/resume support.
   When handle_data receives WAL data, it dispatches the event asynchronously
   and pauses the socket — stopping PostgreSQL from pushing more data while
   the gen_statem remains responsive to timer and info messages.

2. Adding a periodic keepalive timer that sends StandbyStatusUpdate messages
   at wal_sender_timeout/3 intervals (queried from pg_settings during setup,
   capped at 15s). These fire between async retries, keeping the connection
   alive even during prolonged processing failures.

3. Using StatusMonitor.wait_until_async for event-driven retry notification
   instead of the blocking wait_until_active(timeout: :infinity) call.

The socket pause mechanism leverages {active, :once} — by not re-arming
the socket after processing a batch, no new data enters the process. TCP
flow control naturally backpressures PostgreSQL's walsender when the kernel
receive buffer fills. On resume, buffered copies and socket messages are
drained in order.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@msfstef msfstef added the claude label Apr 9, 2026

netlify bot commented Apr 9, 2026

Deploy Preview for electric-next ready!

  • 🔨 Latest commit: d5d75d6
  • 🔍 Latest deploy log: https://app.netlify.com/projects/electric-next/deploys/69d7902dd5cfe10008ec128f
  • 😎 Deploy Preview: https://deploy-preview-4105--electric-next.netlify.app


codecov bot commented Apr 9, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 88.67%. Comparing base (d69c59f) to head (0393830).
⚠️ Report is 5 commits behind head on main.
✅ All tests successful. No failed tests found.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #4105   +/-   ##
=======================================
  Coverage   88.67%   88.67%           
=======================================
  Files          25       25           
  Lines        2438     2438           
  Branches      614      615    +1     
=======================================
  Hits         2162     2162           
  Misses        274      274           
  Partials        2        2           
| Flag | Coverage | Δ |
| --- | --- | --- |
| packages/experimental | 87.73% <ø> | (ø) |
| packages/react-hooks | 86.48% <ø> | (ø) |
| packages/start | 82.83% <ø> | (ø) |
| packages/typescript-client | 93.81% <ø> | (ø) |
| packages/y-electric | 56.05% <ø> | (ø) |
| typescript | 88.67% <ø> | (ø) |
| unit-tests | 88.67% <ø> | (ø) |



claude bot commented Apr 9, 2026

Claude Code Review

Summary

This PR fixes wal_sender_timeout crashes by switching from synchronous blocking retry loops to async event dispatch with periodic keepalives. The implementation is solid, and the latest commit (0393830d) addressed four of the five previously raised issues. The one remaining concern (Issue 2) is closed below after verifying the supervision guarantees.


What's Working Well

  • Catch-all for stale StatusMonitor notifications (new): The new handle_info clause at replication_client.ex:357 silently discards stale or unexpected StatusMonitor notifications, preventing a FunctionClauseError crash.
  • Logger.debug for keepalive interval (new): Downgraded from Logger.info — correct level for a per-reconnect diagnostic.
  • Retry budget comment (new): Lines 530-532 clearly explain why _remaining is intentionally discarded and how this matches the old apply_with_retries behavior.
  • Changeset file (new): fix-replication-keepalive-timeout.md correctly targets @core/sync-service as a patch.
  • Overall architecture: The pause/resume socket approach with async event dispatch is a clean solution to a genuinely tricky protocol constraint.

Issues Found

Critical (Must Fix)

None.

Important (Should Fix)

None remaining.

Suggestions (Nice to Have)

None remaining.


Issue Conformance

No linked issue — same as before. The PR description is thorough and self-contained, with a clear problem statement, investigation findings, and solution rationale.


Previous Review Status

| # | Issue | Status |
| --- | --- | --- |
| 1 | Missing catch-all handle_info for unmatched StatusMonitor refs | FIXED — catch-all added at replication_client.ex:357 |
| 2 | StatusMonitor crash leaves ReplicationClient permanently stuck | CLOSED — see note below |
| 3 | Missing changeset file | FIXED — .changeset/fix-replication-keepalive-timeout.md added |
| 4 | Logger.info fires on every reconnect | FIXED — downgraded to Logger.debug |
| 5 | No comment explaining intentional _remaining discard | FIXED — comment added at replication_client.ex:530-532 |

On Issue 2 (StatusMonitor crash): after reviewing the supervision tree, this concern is mitigated by design. MonitoredCoreSupervisor uses :rest_for_one with StatusMonitor as its first child (monitored_core_supervisor.ex:33-37). A StatusMonitor crash cascades through the restart chain: CoreSupervisor → Connection.Supervisor → ConnectionManager → ReplicationClient (linked via start_link). The ReplicationClient is always restarted alongside StatusMonitor, so a stuck paused-socket state cannot persist across a StatusMonitor restart. No code change needed.
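The :rest_for_one guarantee this relies on can be sketched in a few lines (child specs are illustrative, not the actual MonitoredCoreSupervisor code):

```elixir
# With :rest_for_one, a crash in StatusMonitor (the first child) restarts
# every child listed after it, so the ReplicationClient's whole ancestry
# comes back up alongside it.
children = [
  {Electric.StatusMonitor, stack_id: stack_id},
  {Electric.Connection.Supervisor, stack_id: stack_id}
]

Supervisor.start_link(children, strategy: :rest_for_one)
```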


Review iteration: 3 | 2026-04-09

msfstef and others added 3 commits April 9, 2026 15:05
…ndored ReplicationConnection

- Document the Postgrex version (0.22.0) and Protocol internals we depend on
- Turn the unreachable dead-code clause for duplicate socket messages into
  a live defensive warning log
- Vendor LSN encode/decode tests from upstream Postgrex to validate the
  vendored functions remain correct

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add catch-all handle_info for stale StatusMonitor notifications to
  prevent FunctionClauseError on unmatched refs
- Downgrade keepalive interval log from info to debug (fires on every
  reconnect)
- Add comment explaining intentional retry budget reset in
  wait_for_active_and_retry
- Add changeset file
- Fix field name to wait_for_active_ref consistently

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

msfstef commented Apr 9, 2026

Thanks for the thorough review! Addressed the actionable items in 0393830:

Fixed:

  • 1 (catch-all for stale StatusMonitor messages) — Added a catch-all handle_info clause that silently discards {{Electric.StatusMonitor, _ref}, _result} messages when wait_for_active_ref doesn't match. Prevents FunctionClauseError on stale notifications.
  • 4 (Logger level) — Downgraded keepalive interval log from info to debug.
  • 6 (_remaining parameter) — Added comment explaining the intentional budget reset (matches old apply_with_retries behavior).
  • 3 (changeset) — Added .changeset/fix-replication-keepalive-timeout.md.

Not addressed (by design):

  • 2 (StatusMonitor restart) — Not an issue. StatusMonitor is a child of MonitoredCoreSupervisor which uses :rest_for_one strategy, with StatusMonitor listed before CoreSupervisor. If StatusMonitor crashes, everything downstream (including ReplicationClient) is restarted. The supervision tree comment in monitored_core_supervisor.ex documents this explicitly.
  • 5 (dead code clause) — Already addressed in an earlier commit (42ca443) — turned it into a live defensive Logger.warning that fires if a second socket message arrives while paused (shouldn't happen with {active, :once} but guards against future OTP changes).


msfstef commented Apr 9, 2026

benchmark this


github-actions bot commented Apr 9, 2026

Benchmark results, triggered for 03938

  • write fanout completed (results)
  • concurrent shape creation completed (results)
  • diverse shape fanout completed (results)
  • many shapes one client latency completed (results)
  • unrelated shapes one client latency completed (results)

@msfstef msfstef requested review from alco and icehaunter April 9, 2026 12:56
@msfstef msfstef marked this pull request as ready for review April 9, 2026 12:56

alco commented Apr 9, 2026

Wow, these benchmark results are music!


msfstef commented Apr 9, 2026

benchmark this

@msfstef msfstef self-assigned this Apr 9, 2026

msfstef commented Apr 9, 2026

@alco turns out it was against 1.2.8, so I updated the reference image and ran them again! There's no regression, which is good (and mostly what I expected; the previous results were unexpectedly better).

@alco alco left a comment

Brilliant!
