Skip to content

fix: handle missing last processed LSN gracefully#4109

Open
msfstef wants to merge 2 commits intomainfrom
msfstef/fix-last-processed-lsn-missing
Open

fix: handle missing last processed LSN gracefully#4109
msfstef wants to merge 2 commits intomainfrom
msfstef/fix-last-processed-lsn-missing

Conversation

@msfstef
Copy link
Copy Markdown
Contributor

@msfstef msfstef commented Apr 9, 2026

Summary

Fixes #4107MatchError: no match of right hand side value: [] in LsnTracker.get_last_processed_lsn/1 during hold_until_change.

Root cause

The crash occurs when a long-polling request's timeout fires and determine_global_last_seen_lsn calls LsnTracker.get_last_processed_lsn, but the ETS key hasn't been populated.

The hold_until_change after block enters the read-only branch based on runtime status (status.shape == :read_only), but determine_global_last_seen_lsn dispatched on the request-time flag (request.read_only?). When status.shape degraded to :read_only during the long poll but request.read_only? was still false, the non-read-only clause called LsnTracker.get_last_processed_lsn which crashed on the empty ETS table.

The LsnTracker data is lost when the ShapeStatusOwner process (which owns the ETS table) crashes and restarts — it recreates an empty table. Bandit HTTP handler processes survive this because they live in a separate supervision tree. A request validated during active mode (read_only?: false) can then hit the empty table when its long-poll timeout fires.

Changes

  • LsnTracker.get_last_processed_lsn/1 returns nil instead of crashing when the ETS key is absent
  • determine_global_last_seen_lsn/1 falls back to the shape's own offset when LsnTracker returns nil — this is the same value the read-only path used, so the separate read_only?: true clause was removed
  • hold_until_change after block aligns request.read_only? with runtime status when entering the read-only branch, so downstream functions (determine_log_chunk_offset, get_merged_log_stream) use correct read strategies
  • replication_client handle_result was using the pre-query state.flushed_wal (always 0) instead of the updated state from process_query_result when populating the LsnTracker during slot creation/query — fixed by returning the updated state from process_query_result
  • shape_log_collector mark_as_ready adds an explicit assertion for the LsnTracker invariant (must be populated before the collector starts) with a clear error message

Test plan

  • New unit test: LsnTracker.get_last_processed_lsn returns nil when not populated
  • New API test: non-read-only request falls back to shape offset when LsnTracker is empty
  • New API test: live long-poll timeout doesn't crash when LsnTracker is not populated
  • Full test suite passes (1535 tests, 0 failures)

🤖 Generated with Claude Code

The LsnTracker ETS table can be empty when a request handler survives
a MonitoredCoreSupervisor restart (ShapeStatusOwner owns the table and
recreates it empty on restart, while Bandit handlers live in a separate
supervision tree). A long-polling request validated during active mode
(read_only?: false) could then crash on MatchError when the timeout
fires and determine_global_last_seen_lsn calls get_last_processed_lsn.

- Make get_last_processed_lsn return nil instead of crashing when the
  ETS key is absent
- determine_global_last_seen_lsn falls back to the shape's own offset
  when LsnTracker returns nil, removing the need for a separate
  read_only? clause
- Align request.read_only? with runtime status in hold_until_change's
  after block so downstream functions use correct read strategies
- Fix stale flushed_wal in replication client: handle_result was using
  the pre-query state (flushed_wal=0) instead of the updated state
  from process_query_result when populating the LsnTracker
- Add explicit assertion in shape_log_collector's mark_as_ready for
  the LsnTracker invariant (must be populated before collector starts)

Closes #4107

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@msfstef msfstef added the claude label Apr 9, 2026
@msfstef msfstef requested review from alco and icehaunter April 9, 2026 15:53
@netlify
Copy link
Copy Markdown

netlify bot commented Apr 9, 2026

Deploy Preview for electric-next ready!

Name Link
🔨 Latest commit 8b7194d
🔍 Latest deploy log https://app.netlify.com/projects/electric-next/deploys/69d7cb9197ee3b00084c169b
😎 Deploy Preview https://deploy-preview-4109--electric-next.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify project configuration.

@claude
Copy link
Copy Markdown

claude bot commented Apr 9, 2026

Claude Code Review

Summary

This PR correctly fixes the MatchError crash in LsnTracker.get_last_processed_lsn/1 caused by a race between long-poll request lifecycle and ShapeStatusOwner restarts. All issues from the previous review have been addressed; the PR is ready to merge.

What's Working Well

  • Root cause analysis is thorough: The PR description correctly identifies the primary crash path (ETS empty after ShapeStatusOwner restart) and a secondary latent bug (flushed_wal: 0 being written to LsnTracker on slot creation because the pre-update state was used instead of the updated state).
  • flushed_wal fix in process_query_result: Returning a 5-tuple {step, next_step, extra_info, state, callback_return} is the right approach — the updated state had already been computed inside process_query_result but was silently discarded by the caller.
  • Defensive assertion in mark_as_ready: Raising with a clear message instead of silently producing LogOffset.new(0, :infinity) makes invariant violations immediately diagnosable.
  • Code simplification: Collapsing the two-clause determine_global_last_seen_lsn/1 into one by handling nil from the LsnTracker eliminates the flag mismatch bug at its source and handles both read-only mode and transient ETS reset uniformly.
  • request.read_only? alignment: Rebinding the flag before entering the check_for_disk_updates branch ensures downstream functions use the correct read strategy.
  • New connection_setup_test.exs: Two well-targeted tests verify that process_query_result returns the updated flushed_wal (actual slot LSN, not 0) for both :create_slot and :query_slot_flushed_lsn steps. Correct use of async: true and matching the exact Postgrex.Result structure the production code pattern-matches against.

Issues Found

None.

Issue Conformance

Issue #4107 is well-specified with a Sentry stack trace and a clear reproduction hypothesis. The implementation directly addresses the reported crash, and the PR goes further by fixing the root cause of why the LsnTracker value was unreliable in the first place. No scope creep.

Previous Review Status

Both issues from iteration 1 have been resolved:

  • Missing changeset file: .changeset/fix-last-processed-lsn-missing.md added with the correct "@core/sync-service": patch package name.
  • flushed_wal fix had no test: connection_setup_test.exs added, covering both :create_slot and :query_slot_flushed_lsn result handling.

Review iteration: 2 | 2026-04-09

Address Claude review feedback:
- Add patch changeset for sync-service
- Add unit tests for process_query_result verifying that the updated
  state returned in the 5-tuple contains the correct flushed_wal from
  slot creation and slot query results

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@codecov
Copy link
Copy Markdown

codecov bot commented Apr 9, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 88.67%. Comparing base (11b151b) to head (c69bc79).
⚠️ Report is 3 commits behind head on main.
✅ All tests successful. No failed tests found.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #4109   +/-   ##
=======================================
  Coverage   88.67%   88.67%           
=======================================
  Files          25       25           
  Lines        2438     2438           
  Branches      612      610    -2     
=======================================
  Hits         2162     2162           
  Misses        274      274           
  Partials        2        2           
Flag Coverage Δ
packages/experimental 87.73% <ø> (ø)
packages/react-hooks 86.48% <ø> (ø)
packages/start 82.83% <ø> (ø)
packages/typescript-client 93.81% <ø> (ø)
packages/y-electric 56.05% <ø> (ø)
typescript 88.67% <ø> (ø)
unit-tests 88.67% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@msfstef msfstef self-assigned this Apr 9, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Last processed LSN not populated on request handled

3 participants