
[BUG] Auto promotion not triggered when the master leader experiences network failure/degradation #16848

amberzsy opened this issue Dec 15, 2024 · 16 comments
Labels: bug, Cluster Manager

@amberzsy

amberzsy commented Dec 15, 2024

Describe the bug

The OpenSearch cluster has 3 master nodes and 50+ data nodes. During a network failure/severe network degradation on the master leader node, a bunch of data nodes failed the leader check and became "disconnected" from the master leader. On the master side, those data nodes were excluded/removed from the cluster because the follower checks and the cluster state publish process failed for them. (Note: at this point the master leader was still processing, publishing logs, updating cluster state, etc.)
This further led to massive shard relocation, or a red cluster state in some extreme cases (60% of the data nodes were marked as disconnected and removed by the master).

Related component

Cluster Manager

To Reproduce

  1. Set up a cluster with 3 master nodes (1 leader and 2 standby) and a couple of data nodes.
  2. Trigger network degradation only on the master leader node (or inject packet drops at the network layer; see the sketch after this list) for more than 5 minutes.
  3. Check the master leader and data node logs for follower/leader check failures and for data nodes starting to get removed by the master leader.
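
For step 2, one way to inject the degradation is with Linux traffic control; this is only a sketch, assuming the leader's primary interface is eth0 and tc/netem is available on the host (any equivalent fault-injection tool works):

  # Run on the elected master leader; drops ~50% of packets on eth0
  sudo tc qdisc add dev eth0 root netem loss 50%
  # ...leave in place for 5+ minutes while watching leader/follower check logs...
  # Remove the rule to end the disruption
  sudo tc qdisc del dev eth0 root netem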

Expected behavior

Ideally, during network degradation/failures on the master leader, the cluster would automatically promote/elect one of the two standby master-eligible nodes as the new leader. However, this didn't happen.

We tried the other scenarios listed below, and auto promotion worked properly in both.

  1. Trigger a graceful shutdown of the master leader. The standby master-eligible node is promoted.
  2. Trigger an ungraceful shutdown of the leader (e.g. kill -9 the master leader process while it's running; see the sketch after this list). The standby master-eligible node is promoted and keeps running the cluster. Data nodes update the leader info and re-establish communication with the newly elected leader.
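
For reference, a minimal sketch of how these two shutdown scenarios can be triggered (assumptions: OpenSearch runs as a systemd service named opensearch, and the pkill pattern matches the deployment's JVM main class):

  # Scenario 1: graceful shutdown of the leader
  sudo systemctl stop opensearch
  # Scenario 2: ungraceful shutdown (kill -9 the leader's JVM)
  sudo pkill -9 -f org.opensearch.bootstrap.OpenSearch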

Additional Details

Plugins
opensearch-alerting
opensearch-anomaly-detection
opensearch-asynchronous-search
opensearch-cross-cluster-replication
opensearch-custom-codecs
opensearch-flow-framework
opensearch-geospatial
opensearch-index-management
opensearch-job-scheduler
opensearch-knn
opensearch-ml
opensearch-neural-search
opensearch-notifications
opensearch-notifications-core
opensearch-observability
opensearch-performance-analyzer
opensearch-reports-scheduler
opensearch-security
opensearch-skills
opensearch-sql
prometheus-exporter
repository-gcs
repository-s3

Screenshots
N/A

Host/Environment (please complete the following information):

  • OS: Linux
  • Version 2.16.1

Additional context
N/A

@shwetathareja
Member

shwetathareja commented Dec 17, 2024

@amberzsy can you share the logs from all 3 cluster manager (aka master) nodes? Also, did you check whether the standby masters were also network partitioned and unable to ping each other?

@rajiv-kv
Contributor

[Triage Attendees - 1, 2, 3]

Thanks @amberzsy for filing the issue. Could you provide us with the following details?

  • Logs from the stand-by nodes showing why the n/w disruption was not detected
  • Timeline of events (did the cluster eventually recover?)
  • Steps to reproduce (could you explain how the packet loss / network isolation was achieved?)

Please take a look at the existing disruption tests and see if there is a relevant one that can be referenced for this issue.

@andrross
Member

Unfortunately I haven't found any good documentation, but the setting index.unassigned.node_left.delayed_timeout can be tuned to increase the time before the system will reallocate replicas when a node is lost for any reason. This avoids starting expensive reallocations if the node is likely to return (i.e. in the case of an unstable network).
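
A minimal sketch of bumping that setting across existing indices via the index settings API (assuming the cluster listens on localhost:9200 and that a 5m delay is acceptable for the workload):

  curl -XPUT "http://localhost:9200/_all/_settings" -H 'Content-Type: application/json' -d'
  {
    "settings": {
      "index.unassigned.node_left.delayed_timeout": "5m"
    }
  }'

Since it is a dynamic per-index setting, it can also be put in an index template so that indices created later pick it up.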

@rajiv-kv rajiv-kv moved this from 🆕 New to Later (6 months plus) in Cluster Manager Project Board Jan 16, 2025
@patelsmit32123

patelsmit32123 commented Jan 28, 2025

@andrross we tried replicating the issue by injecting random 50% packet drops on the leader master. Our setup had 3 master nodes (1 leader, 2 master-eligible), but only 1 master-eligible node detected that the leader had an issue (its leader check failed), while the other master-eligible node didn't detect any issue. So during the pre-voting round, this other master-eligible node rejected the vote from the master-eligible node that did see the issue with the leader, and hence pre-voting did not proceed to an election. What can we do here?

@andrross
Member

@shwetathareja @rajiv-kv Do you have any suggestions for @patelsmit32123 regarding his previous comment? Are there any settings to tune the sensitivity of the health check between the cluster manager nodes?

@rajiv-kv
Contributor

The following settings can be tuned on the stand-by nodes to see if they detect the leader failure earlier:
cluster.fault_detection.leader_check.interval
cluster.fault_detection.leader_check.timeout
cluster.fault_detection.leader_check.retry_count

However, the default values of these settings are good, and they generally only need to be increased for large clusters.

Before tuning these settings it would be good to understand whether any failures were reported on the stand-by nodes during the network disruption. @patelsmit32123 - would it be possible to capture logs after turning on TRACE for the class org.opensearch.cluster.coordination.LeaderChecker during your experiments?
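
For reference, a sketch of both pieces (assumptions: the defaults shown below are unchanged in your build, the fault_detection settings are static and therefore go into opensearch.yml with a node restart, and the node listens on localhost:9200):

  # opensearch.yml on the stand-by cluster-manager nodes (values shown are the defaults)
  # cluster.fault_detection.leader_check.interval: 1s
  # cluster.fault_detection.leader_check.timeout: 10s
  # cluster.fault_detection.leader_check.retry_count: 3

  # Turn on TRACE logging for the LeaderChecker at runtime
  curl -XPUT "http://localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
  {
    "transient": {
      "logger.org.opensearch.cluster.coordination.LeaderChecker": "TRACE"
    }
  }'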

@patelsmit32123

Here are some logs we collected after enabling debug mode:

org.opensearch.OpenSearchException: node [{esdock-databook-opensearch-staging-cluster-phx-bofuv-ponog}{bFU9nUwDQxOeItYYHd6LIg}{T2rzQdCsQRif_N43RenriA}{10.76.3.152}{10.76.3.152:25603}{m}{rack=phxb2-c1-p1-r8, zone=phx7, shard_indexing_pressure_enabled=true, isolation-group=isolation-group-1}] failed [3] consecutive checks

[esdock-databook-opensearch-staging-cluster-phx-sunag-bovat] onLeaderFailure: coordinator becoming CANDIDATE in term 34 (was FOLLOWER, lastKnownLeader was [Optional[{esdock-databook-opensearch-staging-cluster-phx-bofuv-ponog}{bFU9nUwDQxOeItYYHd6LIg}{T2rzQdCsQRif_N43RenriA}{10.76.3.152}{10.76.3.152:25603}{m}{rack=phxb2-c1-p1-r8, zone=phx7, shard_indexing_pressure_enabled=true, isolation-group=isolation-group-1}]])

Caused by: org.opensearch.cluster.coordination.CoordinationStateRejectedException: rejecting PreVoteRequest{sourceNode={esdock-databook-opensearch-staging-cluster-phx-sunag-bovat}{RE72RaO5Tx6zF6DuTclV-w}{XV8Kjl3CSeCoDNIz6nJbyQ}{10.76.19.152}{10.76.19.152:25603}{m}{rack=phxb2-c1-p3-r8, zone=phx7, shard_indexing_pressure_enabled=true, isolation-group=isolation-group-2}, currentTerm=34} as there is already a leader

[esdock-databook-opensearch-staging-cluster-phx-sunag-bovat] PreVotingRound{preVotesReceived={{esdock-databook-opensearch-staging-cluster-phx-sunag-bovat}{RE72RaO5Tx6zF6DuTclV-w}{XV8Kjl3CSeCoDNIz6nJbyQ}{10.76.19.152}{10.76.19.152:25603}{m}{rack=phxb2-c1-p3-r8, zone=phx7, shard_indexing_pressure_enabled=true, isolation-group=isolation-group-2}=PreVoteResponse{currentTerm=34, lastAcceptedTerm=34, lastAcceptedVersion=9807}}, electionStarted=false, preVoteRequest=PreVoteRequest{sourceNode={esdock-databook-opensearch-staging-cluster-phx-sunag-bovat}{RE72RaO5Tx6zF6DuTclV-w}{XV8Kjl3CSeCoDNIz6nJbyQ}{10.76.19.152}{10.76.19.152:25603}{m}{rack=phxb2-c1-p3-r8, zone=phx7, shard_indexing_pressure_enabled=true, isolation-group=isolation-group-2}, currentTerm=34}, isClosed=false} added PreVoteResponse{currentTerm=34, lastAcceptedTerm=34, lastAcceptedVersion=9807} from {esdock-databook-opensearch-staging-cluster-phx-sunag-bovat}{RE72RaO5Tx6zF6DuTclV-w}{XV8Kjl3CSeCoDNIz6nJbyQ}{10.76.19.152}{10.76.19.152:25603}{m}{rack=phxb2-c1-p3-r8, zone=phx7, shard_indexing_pressure_enabled=true, isolation-group=isolation-group-2}, no quorum yet

@shwetathareja
Member

shwetathareja commented Jan 30, 2025

Thanks @patelsmit32123 for sharing more insights into the issue.

@andrross we tried replicating the issue by injecting random 50% packet drops on the leader master. Our setup had 3 master nodes (1 leader, 2 master-eligible), but only 1 master-eligible node detected that the leader had an issue (its leader check failed), while the other master-eligible node didn't detect any issue. So during the pre-voting round, this other master-eligible node rejected the vote from the master-eligible node that did see the issue with the leader, and hence pre-voting did not proceed to an election. What can we do here?

@patelsmit32123 it is important to establish what you are trying to achieve with the packet loss. Let's say there are 3 cluster manager (aka master) nodes in the cluster, namely node1 (leader), node2, and node3. In this case, as you mentioned, only one cluster manager node detected the issue; let's say it is node3. For any pre-voting round and election to succeed, we need a majority of votes (2 out of 3 here), otherwise it will not succeed. A single node cannot conclude an election on its own: what if that node (node3 here) has the problem instead of the others? If node1 and node2 are connected to each other and their leader and follower checks are passing, there is still a quorum of cluster manager nodes in the cluster, so no new election is triggered and any attempt from node3 is rejected. If you ensure node2 also fails to connect to node1 during the packet loss simulation, then pre-voting and a new election round should go through. You can also test with a 5-node dedicated cluster manager setup, which increases the quorum size from 2 (in the 3-node setup) to 3 (in the 5-node setup). This increases the probability of detecting the leader failure in the packet drop simulation.

Now, coming to the problem of too many shard relocations: you can tune the setting @andrross pointed out, index.unassigned.node_left.delayed_timeout, from 1 min to 5 mins or so to delay shard allocations.

With 60% of the nodes dropping, a red cluster is more or less inevitable.

You can also look into tuning the leader/follower check settings shared by @rajiv-kv. Those generally help with large clusters, where small network blips shouldn't cause node drops and too much shard allocation churn in the cluster.

In general, with 2.17, OpenSearch has a lot of improvements in the shard allocation algorithm and in optimizing shard info fetches across all nodes when there are too many unassigned shards in the cluster, so that once the issue subsides, the cluster can recover fast.

@patelsmit32123

You can also test with a 5-node dedicated cluster manager setup

Yes, but then we would need a quorum of 3 out of the 4 remaining master-eligible nodes, so it's still bad in terms of probability. I will try tuning the leader/follower settings first, share the results, and then take it forward from there.

@amberzsy
Author

amberzsy commented Feb 3, 2025

Now, coming to the problem of too many shard relocations: you can tune the setting @andrross pointed out, index.unassigned.node_left.delayed_timeout, from 1 min to 5 mins or so to delay shard allocations.

If it's planned, or we know it's a network blip and the node isn't actually going bad or getting lost, bumping it to 5 mins makes sense. But in the scenario where the node holding the primary shard gets disconnected due to a host or other hardware issue, it will cause 5 mins of write downtime (the primary shard is completely down without a replica being promoted).

In general with 2.17, OpenSearch has lot of improvements in shard allocation algorithm and optimizing shard info fetches across all nodes when there are too many unassigned shards in the cluster so that once the issue subsides, the cluster can recover fast.

Do you have the GH links, or can you elaborate more on the shard allocation algorithm improvements?

@patelsmit32123

patelsmit32123 commented Feb 4, 2025

I tested with leader check retry_count = 2 and 1; in both cases the leader election succeeded, as the master-eligible nodes were able to form a quorum. @shwetathareja if retry_count = 1 and there is a false positive leader check failure, what happens to the master-eligible node that has become a candidate but isn't getting a quorum? Can it join back in the same leader term?

Also, the metric exported for leader check failures is emitted only when the configured number of consecutive checks has failed. It would be great if we could emit each individual check failure so that users can create alerts for leader check failures in a given window, i.e. it failed 20 times out of the last 100.

@andrross
Member

andrross commented Feb 4, 2025

It would be great if we could emit each individual check failure so that users can create alerts for leader check failures in a given window, i.e. it failed 20 times out of the last 100.

@patelsmit32123 Please feel free to open an issue/pull request for this specific enhancement.

@shwetathareja
Member

But in the scenario where the node holding the primary shard gets disconnected due to a host or other hardware issue, it will cause 5 mins of write downtime (the primary shard is completely down without a replica being promoted).

Replica promotion doesn't wait for this delayed allocation. That will happen immediately.

@shwetathareja
Member

shwetathareja commented Feb 5, 2025

what happens to the master-eligible node that has become a candidate but isn't getting a quorum? Can it join back in the same leader term?

Yes, if the candidate doesn't get a quorum, it will discard that round, and if some other node becomes a candidate, it will join that new candidate.

@shwetathareja
Member

shwetathareja commented Feb 5, 2025

but then we would need a quorum of 3 out of the 4 remaining master-eligible nodes

The voting configuration is always kept at an odd size. So, out of the 4 remaining nodes, only 3 will participate in the election process, and the quorum will still be 2.
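
The committed voting configuration (and hence the effective quorum) can be inspected from the cluster state metadata; a minimal sketch, assuming the node listens on localhost:9200:

  curl -XGET "http://localhost:9200/_cluster/state/metadata?filter_path=metadata.cluster_coordination"
  # The last_committed_config array lists the node IDs that currently participate in elections.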

@patelsmit32123

Yes, if the candidate doesn't get a quorum, it will discard that round, and if some other node becomes a candidate, it will join that new candidate.

What happens if there are no other candidates? Assume only this particular master-eligible node had a network blip due to which its leader checks failed, while all other master-eligible nodes and the leader are well connected.
