@zhijun42 zhijun42 commented Nov 6, 2025

Summary

This PR handles a network race condition that can cause a replica to read stale cluster state and then incorrectly promote itself to an empty primary within an existing shard.

Issues

When multiple replicas attempt to synchronize from the same primary concurrently, the first replica to issue PSYNC triggers an RDB child fork (RDB_CHILD_TYPE_SOCKET) for replication.

Any replica connecting shortly after the fork misses the join window: its PING command is accepted but not answered until the primary's fork completes. As a result, those replicas remain blocked in REPL_STATE_RECEIVE_PING_REPLY on the main thread, unable to process any cluster or client traffic, effectively dead to the outside world.

If a failover occurs during this period (e.g., the primary goes down and another replica is elected leader), the blocked replica will:

  1. Time out after a while (server.repl_syncio_timeout is 5 seconds by default), resume its main thread, and reconnect to the other nodes.
  2. Receive a fresh PONG reply from the new leader.
  3. Then receive delayed, stale PING messages from the new leader that were buffered before the failover (they never reached the replica earlier, because the replica was "dead").
  4. Misinterpret those stale messages as current truth, conclude that it is now a sub-replica, decide to follow the old primary again, and, after detecting that primary as FAILED, start and win a new election.

The result is two primaries in the same shard, one of them empty (0 slots) but still considered authoritative by every node in the cluster.

For a concrete example, read the test case test_blocked_replica_stale_state_race and its comment in file tests/unit/cluster/replica-migration.tcl.

Overall, there are two issues:

  • Replicas waiting for a replication reply block on the main thread and stop all activity. This is true even with rdb-key-save-delay = 0, because the underlying cause is replication enrollment timing, not an artificial delay.
  • The blocked replica can go through the events above and become an empty primary.

This PR handles the second issue. Note that the issue is flaky: if event [3] happens before [2] (the inbound and outbound links are independent, so the ordering is not guaranteed), the replica never considers itself a sub-replica and the empty-primary problem does not occur.

I added guardrail logic for the case where the race does happen, and wrote a new test case to cover it. Since the issue cannot be reliably reproduced (it passes on my machine but not on CI), the test case is disabled for now.

Fix

There are several possible approaches to this problem. The one taken here is:

  • A replica should not trigger a failover when its replication offset is 0.
  • When a replica's primary has FAILED and the replica then receives a PING/PONG packet from a sender claiming to be the primary of the same shard, it should accept and follow that sender.

The file tests/unit/cluster/replica-migration.tcl mostly tests the logic in the function clusterUpdateSlotsConfigWith, so the new test case lives there. Since the file contained some duplicated setup code, I extracted it into helper functions to simplify the tests.

Signed-off-by: Zhijun <dszhijun@gmail.com>

codecov bot commented Nov 6, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 72.40%. Comparing base (d16788e) to head (e3a98ce).

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #2811      +/-   ##
============================================
- Coverage     72.45%   72.40%   -0.05%     
============================================
  Files           128      128              
  Lines         70485    70497      +12     
============================================
- Hits          51068    51042      -26     
- Misses        19417    19455      +38     
Files with missing lines  | Coverage Δ
src/cluster_legacy.c      | 87.58% <100.00%> (+0.04%) ⬆️
src/replication.c         | 85.82% <ø> (-0.08%) ⬇️
src/socket.c              | 94.21% <100.00%> (ø)

... and 7 files with indirect coverage changes


@zhijun42 zhijun42 force-pushed the race-empty-primary-with-test branch 2 times, most recently from 61bbeab to 6133920 Compare November 28, 2025 04:18
@zhijun42 zhijun42 force-pushed the race-empty-primary-with-test branch from 6133920 to e3a98ce Compare November 28, 2025 04:35
@zhijun42
Contributor Author

@PingXie @enjoy-binbin @hpatro Friendly ping 👀 Could you take a look at this when you get some time?
