Fix replication connection drops from wal_sender_timeout during backpressure#4105
Conversation
Replace the single-purpose wait_until_conn_up_async with a generic
wait_until_async/2 that supports any status level (:active, :read_only).
Uses a tagged reply format {{Electric.StatusMonitor, ref}, {:ok, level}}
to avoid message collisions in the caller's mailbox.
This removes the dedicated :wait_until_conn_up handler and conn_waiters
state, consolidating all async waiting through the existing level-based
waiter infrastructure.
The Restarter is updated to use wait_until_async(stack_id, :active).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ressure
When downstream event processing is slow or failing, the replication client
retries inside handle_data via apply_with_retries. This blocks the gen_statem
process, preventing it from responding to PostgreSQL's keepalive requests.
After wal_sender_timeout (default 60s), PostgreSQL terminates the connection.
The fix makes event processing non-blocking by:
1. Vendoring Postgrex.ReplicationConnection with socket pause/resume support.
When handle_data receives WAL data, it dispatches the event asynchronously
and pauses the socket — stopping PostgreSQL from pushing more data while
the gen_statem remains responsive to timer and info messages.
2. Adding a periodic keepalive timer that sends StandbyStatusUpdate messages
at wal_sender_timeout/3 intervals (queried from pg_settings during setup,
capped at 15s). These fire between async retries, keeping the connection
alive even during prolonged processing failures.
3. Using StatusMonitor.wait_until_async for event-driven retry notification
instead of the blocking wait_until_active(timeout: :infinity) call.
The socket pause mechanism leverages {active, :once} — by not re-arming
the socket after processing a batch, no new data enters the process. TCP
flow control naturally backpressures PostgreSQL's walsender when the kernel
receive buffer fills. On resume, buffered copies and socket messages are
drained in order.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
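The `{active, :once}` pause mechanism described in point 1 can be demonstrated in isolation. A minimal, self-contained sketch using plain `:gen_tcp` (none of this is Electric's code; it only illustrates the OTP socket semantics the vendored connection relies on):

```elixir
# Open a loopback socket pair; the client starts in {active, :once} mode.
{:ok, listen} = :gen_tcp.listen(0, [:binary, active: false])
{:ok, port} = :inet.port(listen)
{:ok, client} = :gen_tcp.connect({127, 0, 0, 1}, port, [:binary, active: :once])
{:ok, server} = :gen_tcp.accept(listen)

:ok = :gen_tcp.send(server, "batch1")

first =
  receive do
    {:tcp, ^client, data} -> data
  end

# Delivering that one message flipped the socket to passive: "paused".
# Further data from the server is buffered by the kernel, not delivered.
:ok = :gen_tcp.send(server, "batch2")

paused? =
  receive do
    {:tcp, ^client, _data} -> false
  after
    200 -> true
  end

# "Resume": re-arm the socket and the buffered data arrives.
:ok = :inet.setopts(client, active: :once)

second =
  receive do
    {:tcp, ^client, data} -> data
  end

IO.inspect({first, paused?, second})
```

While paused, the kernel receive buffer fills and TCP flow control pushes back on the sender, which is exactly the backpressure behaviour the commit describes.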
✅ Deploy Preview for electric-next ready!
Codecov Report

✅ All modified and coverable lines are covered by tests.

```diff
@@           Coverage Diff           @@
##             main    #4105   +/-   ##
=======================================
  Coverage   88.67%   88.67%
=======================================
  Files          25       25
  Lines        2438     2438
  Branches      614      615       +1
=======================================
  Hits         2162     2162
  Misses        274      274
  Partials        2        2
```
Claude Code Review

Summary

This PR fixes replication connection drops caused by `wal_sender_timeout` during backpressure.

What's Working Well
Issues Found

- Critical (Must Fix): None.
- Important (Should Fix): None remaining.
- Suggestions (Nice to Have): None remaining.

Issue Conformance

No linked issue — same as before. The PR description is thorough and self-contained, with a clear problem statement, investigation findings, and solution rationale.

Previous Review Status

On Issue 2 (StatusMonitor crash): After reviewing the supervision tree, this concern is mitigated by design.

Review iteration: 3 | 2026-04-09
…ndored ReplicationConnection

- Document the Postgrex version (0.22.0) and Protocol internals we depend on
- Turn the unreachable dead-code clause for duplicate socket messages into a live defensive warning log
- Vendor LSN encode/decode tests from upstream Postgrex to validate the vendored functions remain correct

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add catch-all handle_info for stale StatusMonitor notifications to prevent FunctionClauseError on unmatched refs
- Downgrade keepalive interval log from info to debug (fires on every reconnect)
- Add comment explaining intentional retry budget reset in wait_for_active_and_retry
- Add changeset file
- Fix field name to wait_for_active_ref consistently

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Thanks for the thorough review! Addressed the actionable items in 0393830.

Fixed:
Not addressed (by design):
benchmark this
Wow, these benchmark results are music!
benchmark this
@alco turns out it was against 1.2.8, so I updated the reference image and ran them again! There's no regression, which is good (and mostly what I expected; the previous results were unexpectedly better).
Problem
When the BEAM is busy or downstream event processing is slow or failing, the replication client's `apply_with_retries` blocks inside `handle_data`, preventing the gen_statem from responding to PostgreSQL's keepalive requests. After `wal_sender_timeout` (default 60s), PostgreSQL terminates the replication connection.

The root cause is that PostgreSQL's replication protocol requires the client to periodically send `StandbyStatusUpdate` messages. These are application-level heartbeats — TCP keepalives (handled by the kernel) don't help. When the gen_statem is stuck in a blocking retry loop inside `handle_data`, it cannot send these messages, and PostgreSQL kills the connection.

Investigation
We confirmed this is a well-known problem across PostgreSQL logical replication consumers (Debezium, pgjdbc, pg_recvlogical). PostgreSQL's protocol has no flow control mechanism — keepalive handling is coupled with data processing on a single connection.
Key findings:

- `wal_sender_timeout` checks for application-level `StandbyStatusUpdate` messages
- `wal_sender_timeout` is the culprit — it fires when the client fails to send any `StandbyStatusUpdate` within the timeout window
- Sending a `StandbyStatusUpdate` without advancing the LSN is safe — it resets `last_reply_timestamp` without affecting the replication slot's `confirmed_flush_lsn`
Solution

The fix addresses two requirements that are in tension: staying responsive enough to satisfy `wal_sender_timeout`, while still letting slow downstream processing backpressure PostgreSQL instead of buffering unbounded WAL.

Vendored `Postgrex.ReplicationConnection` with socket pause/resume

We vendor `Postgrex.ReplicationConnection` as `Electric.Postgres.ReplicationConnection`, adding two new callback return types:

- `{:noreply_and_pause, ack, state}` — send ack messages, then pause socket reads. The gen_statem stops receiving new WAL data but remains responsive to `handle_info` messages (timers, notifications).
- `{:noreply_and_resume, ack, state}` — send acks, process any buffered data, resume socket reads.

The pause mechanism leverages `{active, :once}` — by not re-arming the socket after processing a batch, no new data enters the process. At most one Erlang message is buffered. TCP flow control naturally backpressures PostgreSQL's walsender when the kernel receive buffer fills (~128KB-6MB depending on OS).

Non-blocking event dispatch in `ReplicationClient`

Instead of blocking in `handle_data` via `apply_with_retries`, the event is dispatched as a `{:process_event, ...}` message to self. The gen_statem returns immediately and is free to process other messages. On success, `{:noreply_and_resume, acks, state}` resumes the socket. On failure, retries are scheduled via `Process.send_after`.
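The dispatch-and-retry flow can be sketched as follows. This is a hedged illustration, not the PR's actual code: `apply_event/2` and the one-second base delay are invented for the example; only the return-tuple shapes and the `Process.send_after` retry come from the description above.

```elixir
defmodule DispatchSketch do
  # handle_data no longer processes the event inline: it mails the event to
  # itself (attempt counter 1) and pauses the socket, so the gen_statem
  # stays responsive to timers and other messages.
  def handle_data(event, state) do
    send(self(), {:process_event, event, 1})
    {:noreply_and_pause, [], state}
  end

  # Processing happens in handle_info. Success resumes the socket; failure
  # schedules a retry message instead of sleeping in place.
  def handle_info({:process_event, event, attempt}, state) do
    case apply_event(event, state) do
      {:ok, new_state} ->
        {:noreply_and_resume, [], new_state}

      {:error, _reason} ->
        Process.send_after(self(), {:process_event, event, attempt + 1}, 1_000 * attempt)
        {:noreply, state}
    end
  end

  # Stand-in for the real downstream processing: events tagged :fail error out.
  defp apply_event({:fail, _}, _state), do: {:error, :downstream_busy}
  defp apply_event(event, state), do: {:ok, Map.put(state, :last_event, event)}
end
```

Because the retry is a message rather than a sleep, keepalive timer messages interleave freely between attempts.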
Periodic keepalive timer

A `StandbyStatusUpdate` is sent every `min(wal_sender_timeout / 3, 15s)`. The interval is derived from PostgreSQL's `wal_sender_timeout` (queried from `pg_settings` during connection setup). The 15s cap ensures responsiveness even if the timeout is set very high or changes after connection.
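The interval rule above as a pure function; a minimal sketch, assuming the timeout arrives in milliseconds and that a value of 0 (which disables `wal_sender_timeout` in PostgreSQL) falls back to the cap:

```elixir
defmodule KeepaliveIntervalSketch do
  @cap_ms 15_000

  # wal_sender_timeout = 0 disables the server-side timeout entirely;
  # falling back to the cap here is this sketch's assumption.
  def interval_ms(0), do: @cap_ms
  def interval_ms(wal_sender_timeout_ms), do: min(div(wal_sender_timeout_ms, 3), @cap_ms)
end
```

With the default 60s timeout this yields the 15s cap; with the 3s timeout used in the integration test it yields 1s, comfortably inside the window.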
Event-driven retry with `StatusMonitor.wait_until_async`

Replaces the blocking `StatusMonitor.wait_until_active(timeout: :infinity)` with a new generic `wait_until_async/2` that subscribes to status transitions and notifies the caller via a tagged message. This also replaces the previous `wait_until_conn_up_async` with a single generic mechanism.
Changes

- `lib/electric/postgres/replication_connection.ex` — vendored `Postgrex.ReplicationConnection` with socket pause/resume
- `lib/electric/postgres/replication_client.ex`
- `lib/electric/postgres/replication_client/connection_setup.ex` — queries `wal_sender_timeout` from `pg_settings` during setup
- `lib/electric/replication/shape_log_collector.ex` — `suspend`/`resume_event_processing` for integration testing
- `lib/electric/status_monitor.ex` — `wait_until_async/2` (replaces `wait_until_conn_up_async`)
- `lib/electric/connection/restarter.ex` — uses `wait_until_async`
- `test/electric/status_monitor_test.exs` — `wait_until_async` tests
- `test/electric/postgres/replication_client_test.exs` — `wal_sender_timeout` during backpressure
- `integration-tests/tests/replication-keepalive-during-backpressure.lux`

Test plan

- `wait_until_async` tests pass
- `wal_sender_timeout=3s` with 6s of stuck processing — fails on old code with `tcp send: closed`, passes with fix
- `wal_sender_timeout=5s` with 10s of stuck processing — fails on old code with gen_statem termination, passes with fix

🤖 Generated with Claude Code