@zhijun42 zhijun42 commented Nov 6, 2025

Summary

This PR handles a network race condition that can cause a replica to read stale cluster state and then incorrectly promote itself to an empty primary within an existing shard.

Issues

When multiple replicas attempt to synchronize from the same primary concurrently, the first replica to issue PSYNC triggers an RDB child fork (RDB_CHILD_TYPE_SOCKET) for replication.

Any replica connecting shortly after the fork misses the join window: its PING command is accepted but not answered until the primary's fork completes. As a result, those replicas remain blocked in REPL_STATE_RECEIVE_PING_REPLY on the main thread, unable to process any cluster or client traffic, effectively dead to the outside world.

If a failover occurs during this period (e.g., the primary goes down and another replica is elected leader), the blocked replica will:

  1. Time out after a while (server.repl_syncio_timeout is 5 seconds by default), resume its main thread, and reconnect to the other nodes.
  2. Receive a fresh PONG reply from the new leader.
  3. Then receive delayed, stale PING messages from the new leader that were buffered before the failover (they never reached the replica earlier, because the replica was "dead").
  4. Misinterpret those stale messages as current truth, conclude that it is now a sub-replica, decide to follow the old primary again, and, after detecting that primary as FAILED, start and win a new election.

The result is two primaries in the same shard, one of them empty (0 slots) but still considered authoritative by every node in the cluster.

For a concrete example, read the test case test_blocked_replica_stale_state_race and its comment in file tests/unit/cluster/replica-migration.tcl.

Overall, there are two issues:

  • Replicas waiting for a replication reply block on the main thread and stop all activity. This is true even with rdb-key-save-delay = 0, because the underlying cause is replication enrollment timing, not an artificial delay.
  • The blocked replica can go through the events above and become an empty primary.

This PR handles the second issue. Note that the issue is flaky: if event [3] happens before [2] (the inbound and outbound links are independent, so the ordering is not guaranteed), the replica never considers itself a sub-replica and the empty-primary problem does not occur.

I added guardrail logic for the case where the race does happen, and wrote a new test case to cover it. Since the issue cannot be reliably reproduced (it passes on my machine but not on CI), the test case is disabled for now.

Fix

There are several possible approaches to this problem. The one taken here is:

  • A replica should not trigger a failover when its replication offset is 0.
  • When a replica's primary has FAILED and the replica then receives a PING/PONG packet from a sender claiming to be the primary of the same shard, it should accept and follow that sender.

The file tests/unit/cluster/replica-migration.tcl mostly tests the logic in the function clusterUpdateSlotsConfigWith, so the new test case lives there. Since the file contained some duplicated setup code, I extracted it into helper functions to simplify the tests.

Signed-off-by: Zhijun <dszhijun@gmail.com>

codecov bot commented Nov 6, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 72.40%. Comparing base (d16788e) to head (e3a98ce).

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #2811      +/-   ##
============================================
- Coverage     72.45%   72.40%   -0.05%     
============================================
  Files           128      128              
  Lines         70485    70497      +12     
============================================
- Hits          51068    51042      -26     
- Misses        19417    19455      +38     
Files with missing lines  | Coverage Δ
src/cluster_legacy.c      | 87.58% <100.00%> (+0.04%) ⬆️
src/replication.c         | 85.82% <ø> (-0.08%) ⬇️
src/socket.c              | 94.21% <100.00%> (ø)

... and 7 files with indirect coverage changes


@zhijun42 zhijun42 force-pushed the race-empty-primary-with-test branch 2 times, most recently from 61bbeab to 6133920 Compare November 28, 2025 04:18
@zhijun42 zhijun42 force-pushed the race-empty-primary-with-test branch from 6133920 to e3a98ce Compare November 28, 2025 04:35
@zhijun42
Contributor Author

@PingXie @enjoy-binbin @hpatro Friendly ping 👀 Could you take a look at this when you get some time?
