
Fix replication connection drops from wal_sender_timeout during backpressure#4105

Open
msfstef wants to merge 5 commits into main from msfstef/repl-connection-timeout

Conversation


@msfstef msfstef commented Apr 9, 2026

Problem

When the BEAM is busy or downstream event processing is slow/failing, the replication client's apply_with_retries blocks inside handle_data, preventing the gen_statem from responding to PostgreSQL's keepalive requests. After wal_sender_timeout (default 60s), PostgreSQL terminates the replication connection with errors like:

%DBConnection.ConnectionError{message: "ssl send: closed", severity: :error, reason: :closed}

The root cause is that PostgreSQL's replication protocol requires the client to periodically send StandbyStatusUpdate messages. These are application-level heartbeats — TCP keepalives (handled by the kernel) don't help. When the gen_statem is stuck in a blocking retry loop inside handle_data, it cannot send these messages, and PostgreSQL kills the connection.

Investigation

We confirmed this is a well-known problem across PostgreSQL logical replication consumers (Debezium, pgjdbc, pg_recvlogical). PostgreSQL's protocol has no flow control mechanism — keepalive handling is coupled with data processing on a single connection.

Key findings:

  • TCP keepalives don't help — they're kernel-level and work regardless of process scheduling, but wal_sender_timeout checks for application-level StandbyStatusUpdate messages
  • SSL has no heartbeat — TLS is just a record layer on top of TCP
  • wal_sender_timeout is the culprit — it fires when the client fails to send any StandbyStatusUpdate within the timeout window
  • Sending StandbyStatusUpdate without advancing the LSN is safe — it resets last_reply_timestamp without affecting the replication slot's confirmed_flush_lsn
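The last finding hinges on the shape of the StandbyStatusUpdate message itself. As a hedged sketch (module name and helper are illustrative; the wire layout follows the PostgreSQL streaming replication protocol docs — byte `?r`, three 64-bit LSNs, a timestamp in microseconds since the PostgreSQL epoch, and a reply-requested flag):

```elixir
defmodule StandbyStatusSketch do
  # Microseconds since the PostgreSQL epoch (2000-01-01 00:00:00 UTC).
  @pg_epoch_us DateTime.to_unix(~U[2000-01-01 00:00:00Z], :microsecond)

  defp pg_now_us do
    DateTime.to_unix(DateTime.utc_now(), :microsecond) - @pg_epoch_us
  end

  # Re-sending the last confirmed LSNs without advancing them is safe:
  # it resets the walsender's last_reply_timestamp without moving the
  # slot's confirmed_flush_lsn.
  def standby_status_update(written_lsn, flushed_lsn, applied_lsn, reply \\ 0) do
    <<?r, written_lsn::64, flushed_lsn::64, applied_lsn::64, pg_now_us()::64, reply::8>>
  end
end
```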

Solution

The fix addresses two requirements that are in tension:

  1. Send keepalives while event processing is blocked, to prevent wal_sender_timeout
  2. Stop the replication stream from pushing more data over the socket, to provide backpressure

Vendored Postgrex.ReplicationConnection with socket pause/resume

We vendor Postgrex.ReplicationConnection as Electric.Postgres.ReplicationConnection, adding two new callback return types:

  • {:noreply_and_pause, ack, state} — send ack messages, then pause socket reads. The gen_statem stops receiving new WAL data but remains responsive to handle_info messages (timers, notifications).
  • {:noreply_and_resume, ack, state} — send acks, process any buffered data, resume socket reads.

The pause mechanism leverages {active, :once} — by not re-arming the socket after processing a batch, no new data enters the process. At most one Erlang message is buffered. TCP flow control naturally backpressures PostgreSQL's walsender when the kernel receive buffer fills (~128KB-6MB depending on OS).
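As an illustrative sketch of those `{active, :once}` mechanics (hypothetical module, not the vendored code's actual internals) — "resuming" re-arms the socket for exactly one message, and "pausing" is simply not re-arming:

```elixir
defmodule SocketPauseSketch do
  # Deliver at most one more {:tcp, socket, data} message, then go passive.
  def resume(socket), do: :inet.setopts(socket, active: :once)

  def handle_info({:tcp, socket, data}, state) do
    state = process_batch(data, state)

    if state.paused? do
      # Don't re-arm: no new data enters the mailbox. Once the kernel
      # receive buffer fills, TCP flow control backpressures the walsender.
      {:noreply, state}
    else
      resume(socket)
      {:noreply, state}
    end
  end

  defp process_batch(_data, state), do: state
end
```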

Non-blocking event dispatch in ReplicationClient

Instead of blocking in handle_data via apply_with_retries:

handle_data(XLogData) → dispatch_event(event) → {:noreply_and_pause, [], state}

The event is dispatched as a {:process_event, ...} message to self. The gen_statem returns immediately and is free to process other messages. On success, {:noreply_and_resume, acks, state} resumes the socket. On failure, retries are scheduled via Process.send_after.
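A minimal sketch of that dispatch shape, assuming hypothetical `decode_event`/`apply_event` helpers and a `retry_backoff_ms` field (the PR's actual clause heads will differ):

```elixir
defmodule DispatchSketch do
  # XLogData arrives; hand it to ourselves as a message and pause the
  # socket instead of blocking the gen_statem in a retry loop.
  def handle_data(<<?w, _wal_start::64, _wal_end::64, _clock::64, data::binary>>, state) do
    send(self(), {:process_event, decode_event(data)})
    {:noreply_and_pause, [], state}
  end

  def handle_info({:process_event, event}, state) do
    case apply_event(event, state) do
      {:ok, acks, new_state} ->
        # Success: ack and re-arm the socket for more WAL data.
        {:noreply_and_resume, acks, new_state}

      {:error, _reason} ->
        # Failure: schedule a retry; the process stays free to handle
        # keepalive timers in the meantime.
        Process.send_after(self(), {:process_event, event}, state.retry_backoff_ms)
        {:noreply, state}
    end
  end

  defp decode_event(data), do: data
  defp apply_event(_event, state), do: {:ok, [], state}
end
```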

Periodic keepalive timer

A StandbyStatusUpdate is sent every min(wal_sender_timeout/3, 15s). The interval is derived from PostgreSQL's wal_sender_timeout (queried from pg_settings during connection setup). The 15s cap ensures responsiveness even if the timeout is set very high or changes after connection.
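The interval derivation is simple enough to sketch directly (assuming `wal_sender_timeout` arrives as milliseconds from `pg_settings`; the zero-disables-timeout fallback is my assumption, not confirmed by the PR):

```elixir
defmodule KeepaliveIntervalSketch do
  @cap_ms 15_000

  # wal_sender_timeout = 0 disables the server-side timeout entirely;
  # falling back to the cap is an assumption for this sketch.
  def interval_ms(0), do: @cap_ms
  def interval_ms(timeout_ms), do: min(div(timeout_ms, 3), @cap_ms)
end

# KeepaliveIntervalSketch.interval_ms(60_000) → 15_000 (default timeout, capped)
# KeepaliveIntervalSketch.interval_ms(3_000)  → 1_000  (the test's 3s timeout)
```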

Event-driven retry with StatusMonitor.wait_until_async

Replaces the blocking StatusMonitor.wait_until_active(timeout: :infinity) with a new generic wait_until_async/2 that subscribes to status transitions and notifies the caller via a tagged message. This also replaces the previous wait_until_conn_up_async with a single generic mechanism.
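Based on the tagged reply format described in the commit message, the async-wait flow might look roughly like this (the ref-returning signature and the `wait_for_active_ref`/`pending_event` fields are assumptions drawn from the review discussion):

```elixir
defmodule AsyncWaitSketch do
  def wait_for_active_and_retry(event, state) do
    ref = Electric.StatusMonitor.wait_until_async(state.stack_id, :active)
    {:noreply, %{state | wait_for_active_ref: ref, pending_event: event}}
  end

  # Matching ref: the stack became :active, so retry the pending event.
  def handle_info({{Electric.StatusMonitor, ref}, {:ok, :active}},
                  %{wait_for_active_ref: ref} = state) do
    send(self(), {:process_event, state.pending_event})
    {:noreply, %{state | wait_for_active_ref: nil}}
  end

  # Catch-all: silently discard stale notifications from a previous wait
  # instead of crashing with a FunctionClauseError.
  def handle_info({{Electric.StatusMonitor, _stale_ref}, _result}, state) do
    {:noreply, state}
  end
end
```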

Changes

| File | Change |
| --- | --- |
| lib/electric/postgres/replication_connection.ex | New — vendored Postgrex.ReplicationConnection with socket pause/resume |
| lib/electric/postgres/replication_client.ex | Non-blocking event dispatch, keepalive timer, async retry |
| lib/electric/postgres/replication_client/connection_setup.ex | Query wal_sender_timeout from pg_settings during setup |
| lib/electric/replication/shape_log_collector.ex | suspend/resume_event_processing for integration testing |
| lib/electric/status_monitor.ex | Generic wait_until_async/2 (replaces wait_until_conn_up_async) |
| lib/electric/connection/restarter.ex | Use wait_until_async |
| test/electric/status_monitor_test.exs | Tests for wait_until_async |
| test/electric/postgres/replication_client_test.exs | Test: connection survives wal_sender_timeout during backpressure |
| integration-tests/tests/replication-keepalive-during-backpressure.lux | Lux integration test |

Test plan

  • All 1537 existing unit tests pass
  • 6 new wait_until_async tests pass
  • New Elixir test: connection survives wal_sender_timeout=3s with 6s of stuck processing — fails on old code with tcp send: closed, passes with fix
  • Lux integration test: connection survives wal_sender_timeout=5s with 10s of stuck processing — fails on old code with gen_statem termination, passes with fix
  • 21/22 existing lux integration tests pass (1 pre-existing flaky failure unrelated to this change)

🤖 Generated with Claude Code

msfstef and others added 2 commits April 9, 2026 13:03
Replace the single-purpose wait_until_conn_up_async with a generic
wait_until_async/2 that supports any status level (:active, :read_only).
Uses a tagged reply format {{Electric.StatusMonitor, ref}, {:ok, level}}
to avoid message collisions in the caller's mailbox.

This removes the dedicated :wait_until_conn_up handler and conn_waiters
state, consolidating all async waiting through the existing level-based
waiter infrastructure.

The Restarter is updated to use wait_until_async(stack_id, :active).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ressure

When downstream event processing is slow or failing, the replication client
retries inside handle_data via apply_with_retries. This blocks the gen_statem
process, preventing it from responding to PostgreSQL's keepalive requests.
After wal_sender_timeout (default 60s), PostgreSQL terminates the connection.

The fix makes event processing non-blocking by:

1. Vendoring Postgrex.ReplicationConnection with socket pause/resume support.
   When handle_data receives WAL data, it dispatches the event asynchronously
   and pauses the socket — stopping PostgreSQL from pushing more data while
   the gen_statem remains responsive to timer and info messages.

2. Adding a periodic keepalive timer that sends StandbyStatusUpdate messages
   at wal_sender_timeout/3 intervals (queried from pg_settings during setup,
   capped at 15s). These fire between async retries, keeping the connection
   alive even during prolonged processing failures.

3. Using StatusMonitor.wait_until_async for event-driven retry notification
   instead of the blocking wait_until_active(timeout: :infinity) call.

The socket pause mechanism leverages {active, :once} — by not re-arming
the socket after processing a batch, no new data enters the process. TCP
flow control naturally backpressures PostgreSQL's walsender when the kernel
receive buffer fills. On resume, buffered copies and socket messages are
drained in order.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@msfstef msfstef added the claude label Apr 9, 2026

netlify bot commented Apr 9, 2026

Deploy Preview for electric-next ready!

  • 🔨 Latest commit: d5d75d6
  • 🔍 Latest deploy log: https://app.netlify.com/projects/electric-next/deploys/69d7902dd5cfe10008ec128f
  • 😎 Deploy Preview: https://deploy-preview-4105--electric-next.netlify.app


codecov bot commented Apr 9, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 88.67%. Comparing base (d69c59f) to head (0393830).
⚠️ Report is 5 commits behind head on main.
✅ All tests successful. No failed tests found.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #4105   +/-   ##
=======================================
  Coverage   88.67%   88.67%           
=======================================
  Files          25       25           
  Lines        2438     2438           
  Branches      614      615    +1     
=======================================
  Hits         2162     2162           
  Misses        274      274           
  Partials        2        2           
| Flag | Coverage | Δ |
| --- | --- | --- |
| packages/experimental | 87.73% <ø> | (ø) |
| packages/react-hooks | 86.48% <ø> | (ø) |
| packages/start | 82.83% <ø> | (ø) |
| packages/typescript-client | 93.81% <ø> | (ø) |
| packages/y-electric | 56.05% <ø> | (ø) |
| typescript | 88.67% <ø> | (ø) |
| unit-tests | 88.67% <ø> | (ø) |



claude bot commented Apr 9, 2026

Claude Code Review

Summary

This PR fixes wal_sender_timeout crashes by switching from synchronous blocking retry loops to async event dispatch with periodic keepalives. The implementation is solid, and the latest commit (0393830d) addressed four of the five previously raised issues. The one remaining concern (Issue 2) is closed below after verifying the supervision guarantees.


What's Working Well

  • Catch-all for stale StatusMonitor notifications (new): The new handle_info clause at replication_client.ex:357 silently discards stale or unexpected StatusMonitor notifications, preventing a FunctionClauseError crash.
  • Logger.debug for keepalive interval (new): Downgraded from Logger.info — correct level for a per-reconnect diagnostic.
  • Retry budget comment (new): Lines 530-532 clearly explain why _remaining is intentionally discarded and how this matches the old apply_with_retries behavior.
  • Changeset file (new): fix-replication-keepalive-timeout.md correctly targets @core/sync-service as a patch.
  • Overall architecture: The pause/resume socket approach with async event dispatch is a clean solution to a genuinely tricky protocol constraint.

Issues Found

Critical (Must Fix)

None.

Important (Should Fix)

None remaining.

Suggestions (Nice to Have)

None remaining.


Issue Conformance

No linked issue — same as before. The PR description is thorough and self-contained, with a clear problem statement, investigation findings, and solution rationale.


Previous Review Status

| # | Issue | Status |
| --- | --- | --- |
| 1 | Missing catch-all handle_info for unmatched StatusMonitor refs | FIXED — catch-all added at replication_client.ex:357 |
| 2 | StatusMonitor crash leaves ReplicationClient permanently stuck | CLOSED — see note below |
| 3 | Missing changeset file | FIXED — .changeset/fix-replication-keepalive-timeout.md added |
| 4 | Logger.info fires on every reconnect | FIXED — downgraded to Logger.debug |
| 5 | No comment explaining intentional _remaining discard | FIXED — comment added at replication_client.ex:530-532 |

On Issue 2 (StatusMonitor crash): after reviewing the supervision tree, this concern is mitigated by design. MonitoredCoreSupervisor uses :rest_for_one with StatusMonitor as its first child (monitored_core_supervisor.ex:33-37). A StatusMonitor crash cascades through the restart chain: CoreSupervisor → Connection.Supervisor → ConnectionManager → ReplicationClient (linked via start_link). The ReplicationClient is always restarted alongside StatusMonitor, so a stuck paused-socket state cannot persist across a StatusMonitor restart. No code change needed.
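The :rest_for_one guarantee this relies on can be sketched in a few lines (child specs are illustrative, not the actual MonitoredCoreSupervisor code):

```elixir
# With :rest_for_one, a crash in StatusMonitor (the first child) restarts
# every child listed after it, so the ReplicationClient's whole ancestry
# comes back up alongside it.
children = [
  {Electric.StatusMonitor, stack_id: stack_id},
  {Electric.Connection.Supervisor, stack_id: stack_id}
]

Supervisor.start_link(children, strategy: :rest_for_one)
```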


Review iteration: 3 | 2026-04-09

msfstef and others added 3 commits April 9, 2026 15:05
…ndored ReplicationConnection

- Document the Postgrex version (0.22.0) and Protocol internals we depend on
- Turn the unreachable dead-code clause for duplicate socket messages into
  a live defensive warning log
- Vendor LSN encode/decode tests from upstream Postgrex to validate the
  vendored functions remain correct

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add catch-all handle_info for stale StatusMonitor notifications to
  prevent FunctionClauseError on unmatched refs
- Downgrade keepalive interval log from info to debug (fires on every
  reconnect)
- Add comment explaining intentional retry budget reset in
  wait_for_active_and_retry
- Add changeset file
- Fix field name to wait_for_active_ref consistently

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

msfstef commented Apr 9, 2026

Thanks for the thorough review! Addressed the actionable items in 0393830:

Fixed:

  • 1 (catch-all for stale StatusMonitor messages) — Added a catch-all handle_info clause that silently discards {{Electric.StatusMonitor, _ref}, _result} messages when wait_for_active_ref doesn't match. Prevents FunctionClauseError on stale notifications.
  • 4 (Logger level) — Downgraded keepalive interval log from info to debug.
  • 6 (_remaining parameter) — Added comment explaining the intentional budget reset (matches old apply_with_retries behavior).
  • 3 (changeset) — Added .changeset/fix-replication-keepalive-timeout.md.

Not addressed (by design):

  • 2 (StatusMonitor restart) — Not an issue. StatusMonitor is a child of MonitoredCoreSupervisor which uses :rest_for_one strategy, with StatusMonitor listed before CoreSupervisor. If StatusMonitor crashes, everything downstream (including ReplicationClient) is restarted. The supervision tree comment in monitored_core_supervisor.ex documents this explicitly.
  • 5 (dead code clause) — Already addressed in an earlier commit (42ca443) — turned it into a live defensive Logger.warning that fires if a second socket message arrives while paused (shouldn't happen with {active, :once} but guards against future OTP changes).


msfstef commented Apr 9, 2026

benchmark this


github-actions bot commented Apr 9, 2026

Benchmark results, triggered for 03938

  • write fanout completed (results)
  • concurrent shape creation completed (results)
  • diverse shape fanout completed (results)
  • many shapes one client latency completed (results)
  • unrelated shapes one client latency completed (results)

@msfstef msfstef requested review from alco and icehaunter April 9, 2026 12:56
@msfstef msfstef marked this pull request as ready for review April 9, 2026 12:56

alco commented Apr 9, 2026

Wow, these benchmark results are music!


msfstef commented Apr 9, 2026

benchmark this

@msfstef msfstef self-assigned this Apr 9, 2026

msfstef commented Apr 9, 2026

@alco turns out it was against 1.2.8, so I updated the reference image and ran them again! There's no regression, which is good (and mostly what I expected; the previous results were unexpectedly better).

@alco alco left a comment

Brilliant!
