election won't work if wrong NodeId is passed when adding node to cluster #918

pfwang80s · 2023-10-27T10:53:31Z

version: v0.8.3

To Reproduce

1. Start node 0, {0, addr0}, as a singleton cluster. Here {0, addr0} means {NodeId, BasicNode.addr}.
2. Start node 1, {1, addr1}, as a non-voter.
3. Start node 2, {2, addr2}, as a non-voter.
4. Call node0.add_learner(2, <node 1>). Note that the NodeId passed here deliberately differs from <node 1>'s real id.
5. Call node0.add_learner(1, <node 2>).
6. Call node0.change_membership(AddVoterIds(1,2)).
7. Kill node 0.

Expected behavior
Either of the following would be acceptable:

1. steps 4/5 fail, or
2. election works, regardless of whether the NodeIds of node 1/2 were changed.

Actual behavior
node 1/2 stop working, "openraft::core::tick: Tick fails to send, receiving end quit: channel closed"

github-actions · 2023-10-27T10:53:41Z

👋 Thanks for opening this issue!

Get help or engage by:

/help : to print help messages.
/assignme : to assign this issue to you.

drmingdrmer · 2023-10-29T07:06:58Z

What script were you using to produce this issue?

I think you're using one of the shell scripts in the examples.
You can just push the modified code to a branch so I can reproduce this issue using your branch.

And please attach the full log at DEBUG level so I can see what was happening.

And I don't understand why you believe that step two/three should fail.
Adding non-voters without joining them to a cluster should always work.

pfwang80s · 2023-10-29T13:14:43Z

> And I don't understand why you believe that step two/three should fail.

Sorry, my bad, I meant steps 4/5.

Adding non-voters with a different NodeId always works, but after promoting them to voters and killing the leader, the whole cluster stops.

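I think you can reproduce it with the mem kv example. To express myself better, here is a fuller description (translated from Chinese):

How to reproduce
There are three nodes, node0/1/2, whose RaftNetwork addresses are addr 0/1/2 respectively. When these three nodes are started, the NodeIds passed to openraft::Raft::new are 0/1/2 respectively.

1. Start node0 as a single-node cluster.
2. Start node1 as a learner.
3. Start node2 as a learner.
4. On node0, call add_learner, but use NodeId = 2 to add node1 (addr1).
5. On node0, call add_learner, using NodeId = 1 to add node2 (addr2).
6. On node0, call change_membership to promote the two newly added nodes to voters.
7. Kill node0, the leader.

Expected result
Either of the following two outcomes is acceptable:

1. Steps 4/5 return an error directly. The id recorded in the change_membership log does not match the id the raft node was started with, and this is what makes the whole cluster stop working once the learners are promoted to voters and the leader dies. Or
2. Election works normally, even though the ids of node1/2 have been changed.

By comparison, option 1 is simpler. Option 2 would probably require a learner to check its id when it is added to the cluster and update it; at first glance this needs a lot of logic to be sorted out to guarantee that the id update is correct.

Actual result
node 1/2 stop working, "openraft::core::tick: Tick fails to send, receiving end quit: channel closed"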
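
The first acceptable outcome (rejecting steps 4/5) could be approximated on the application side with a check before add_learner is called. The following is only a minimal sketch under stated assumptions: `check_learner_identity` and `IdMismatch` are hypothetical names, not part of Openraft, and how the caller learns the id a remote process was started with (the `reported` argument) is left open.

```rust
use std::fmt;

type NodeId = u64;

/// Error returned when the id used for add_learner differs from the id
/// the target process says it was started with. Hypothetical type.
#[derive(Debug, PartialEq)]
struct IdMismatch {
    claimed: NodeId,
    reported: NodeId,
    addr: String,
}

impl fmt::Display for IdMismatch {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(
            f,
            "node at {} was started with id {}, but is being added as id {}",
            self.addr, self.reported, self.claimed
        )
    }
}

/// Hypothetical pre-check, run by the application before add_learner.
/// `claimed` is the id about to be passed to add_learner;
/// `reported` is the id the process at `addr` says it was started with.
fn check_learner_identity(
    claimed: NodeId,
    reported: NodeId,
    addr: &str,
) -> Result<(), IdMismatch> {
    if claimed == reported {
        Ok(())
    } else {
        Err(IdMismatch { claimed, reported, addr: addr.to_string() })
    }
}

fn main() {
    // Step 4 of the report: node1 (listening on addr1) is added under NodeId 2.
    match check_learner_identity(2, 1, "addr1") {
        Ok(()) => println!("ok to call add_learner"),
        Err(e) => println!("rejected: {}", e),
    }
}
```

With a check like this, steps 4 and 5 would be refused before the wrong ids ever reach the membership log.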

drmingdrmer · 2023-10-29T14:30:09Z

OK, I see. The root cause of this problem is that node-2 expected itself to be a follower but it is actually a leader:

thread 'main' panicked at openraft/openraft/src/engine/engine_impl.rs:793:9:
assertion failed: self.internal_server_state.is_following()

This happens because node-3 was configured with the wrong network address: the endpoint address of node-3 is actually the address of node-2.
When node-1 is killed and node-2 becomes the leader, node-2 tries to replicate a message to node-3, but the message is sent to node-2 itself.
Openraft only allows a follower to receive replication messages.
Because node-2 is itself a leader, the assertion fails.
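
To make the failure path concrete, here is a small model in plain Rust. It is a sketch only and does not use Openraft: `Process`, `ReplicationOutcome` and `replicate_from` are hypothetical names, and the ids follow the 0-based numbering of the reproduction steps (membership maps id 2 to addr1 and id 1 to addr2). It shows how a newly elected leader that resolves its peers through the swapped membership ends up addressing a replication request to its own address.

```rust
use std::collections::BTreeMap;

type NodeId = u64;

/// A running process: the id it was started with and the address it listens on.
#[derive(Debug)]
struct Process {
    started_with_id: NodeId,
    addr: &'static str,
}

#[derive(Debug, PartialEq)]
enum ReplicationOutcome {
    /// The request is addressed to some other process.
    ToPeer { addr: &'static str },
    /// The peer id resolved to the leader's own address.
    SentToSelf { target_id: NodeId },
}

/// For every member other than the leader's own id, look up the address in
/// the membership and report where the replication request would be sent.
fn replicate_from(
    leader: &Process,
    membership: &BTreeMap<NodeId, &'static str>,
) -> Vec<ReplicationOutcome> {
    membership
        .iter()
        .filter(|(id, _)| **id != leader.started_with_id)
        .map(|(id, addr)| {
            if *addr == leader.addr {
                ReplicationOutcome::SentToSelf { target_id: *id }
            } else {
                ReplicationOutcome::ToPeer { addr: *addr }
            }
        })
        .collect()
}

fn main() {
    // Membership as recorded by the old leader after the swapped add_learner calls.
    let membership = BTreeMap::from([(0, "addr0"), (2, "addr1"), (1, "addr2")]);
    // The process on addr1 was started with id 1 and wins the election.
    let new_leader = Process { started_with_id: 1, addr: "addr1" };
    for outcome in replicate_from(&new_leader, &membership) {
        println!("{:?}", outcome);
    }
}
```

In this model the `SentToSelf` case corresponds to the leader receiving its own replication message, which is the situation the `is_following()` assertion rejects.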


drmingdrmer added a commit to drmingdrmer/openraft that referenced this issue Oct 29, 2023

Doc: add FAQ: can not survive incorrectly configured cluster

eb7b62a

- Related issue: databendlabs#918

drmingdrmer mentioned this issue Oct 29, 2023

Doc: add FAQ: can not survive incorrectly configured cluster #919

Merged

drmingdrmer closed this as completed Oct 29, 2023
