Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

election won't work if wrong NodeId is passed when adding node to cluster #918

Closed
pfwang80s opened this issue Oct 27, 2023 · 4 comments
Closed

Comments

@pfwang80s
Copy link
Contributor

pfwang80s commented Oct 27, 2023

version: v0.8.3

To Reproduce

  1. start node 0: {0, addr0} as a singleton cluster, {0, addr1} means {NodeId, BasicNode.addr}
  2. start node 1: {1, addr1} as non-voter
  3. start node 2: {2, addr2} as non-voter
  4. call node0.add_learner(2, <node 1>), yes, we use NodeId different with <node 1>'s real id
  5. call node0.add_learner(1, <node 2>)
  6. call node0.change_membership(AddVoterIds(1,2))
  7. kill node 0

Expected behavior
the following two are both OK:

  1. step 4/5 failed, or
  2. election works, no matter whether node 1/2's NodeId are changed.

Actual behavior
node 1/2 stop working, "openraft::core::tick: Tick fails to send, receiving end quit: channel closed"

@github-actions
Copy link

👋 Thanks for opening this issue!

Get help or engage by:

  • /help : to print help messages.
  • /assignme : to assign this issue to you.

@drmingdrmer
Copy link
Member

drmingdrmer commented Oct 29, 2023

What script were you using to produce this issue?

I think you're using one of the shell script in the example.
You can just push the modify code base to a branch so I can reproduce this issue using your branch.

And please attach the full log in DEBUG level so I can see what was happening.

And I don't understand why you believe that step two/three should fail.
Adding non-voter without join them to a cluster should always work.

@pfwang80s
Copy link
Contributor Author

And I don't understand why you believe that step two/three should fail.

sorry, my bad, they should be step 4/5.

Adding non-voters with different NodeId always works, but after upgrading them to voter, killing the leader, the whole cluster stopped.

I think you can reproduce it with the mem kv example. To express myself better, I redescribe it in Chinese:
如何重现
有三个节点,node0/1/2,对应的 RaftNetwork 地址分别是 addr 0/1/2,启动这三个节点时,传给 openraft::Raft::new 的 NodeId 分别是 0/1/2

  1. 启动 node0,作为单节点 cluster。
  2. 启动 node1,作为 learner。
  3. 启动 node2,作为 learner。
  4. 在 node0 上调用 add_learner,但是用 NodeId = 2 来添加 node1 (addr1)
  5. 在 node0 上调用 add_learner,用 NodeId = 1 来添加 node2 (addr2)
  6. 在 node0 上调用 change_membership,把两个新加进来的 node 升级为 voter
  7. kill node0, the leader

期待结果
下面两种结果都可以接受:

  1. 步骤 4/5 直接报错。因为 change_membership 日志中记录 id 和 raft 启动时的 id 不符,这导致了在 learner 升级成 voter 且 leader 挂掉之后整个集群不再工作。或者
  2. 选举正常工作,哪怕 node1/2 的 id 被改掉了。
    相比之下 1 会简单一些,2 可能需要在 learner 被加入 cluster 时去做个判断,并更新自己的 id,目测这个需要梳理很多逻辑来保证更新 id 的正确性。

实际结果
node 1/2 stop working, "openraft::core::tick: Tick fails to send, receiving end quit: channel closed"

@drmingdrmer
Copy link
Member

drmingdrmer commented Oct 29, 2023

OK, I see the root cause of this problem is this, node-2 expected itself to be a follower but it is acutally a leader:

thread 'main' panicked at openraft/openraft/src/engine/engine_impl.rs:793:9:
assertion failed: self.internal_server_state.is_following()

Because you mistakenly configured a wrong network addresses for node-3: the endpoint address of node-3 is the address of node-2.
When node-1 is killed and node-2 becomes the leader, node-2 tries to replicate a message to node-3 by replicating it to itself.
And Openraft only allows a follower to receive replication messages.
Because node-2 itself is a leader, the assertion fails.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants