
[BUG] Auto promotion not triggered when the master leader experiences network failure/degradation #16848

amberzsy opened this issue Dec 15, 2024 · 16 comments
Labels: bug, Cluster Manager

@amberzsy

amberzsy commented Dec 15, 2024

Describe the bug

The OpenSearch cluster has 3 master nodes and 50+ data nodes. During a network failure/severe network degradation on the master leader node, a bunch of data nodes failed the leader check and became "disconnected" from the master leader. On the master side, those data nodes were excluded/removed from the cluster because the follower checks and the cluster state publish process failed for them. (Note: at this point the master leader was still processing, publishing logs, updating cluster state, etc.)
This further led to massive shard relocation, or a red cluster state in some extreme cases (60% of the data nodes were marked as disconnected and removed by the master).

Related component

Cluster Manager

To Reproduce

  1. Set up a cluster with 3 master nodes (1 leader and 2 standby) and a couple of data nodes.
  2. Trigger network degradation only on the master leader node (or inject packet drops at the network layer; see the sketch after this list) for more than 5 minutes.
  3. Check the master leader and data node logs for follower/leader check failures and for data nodes starting to get removed by the master leader.
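
For step 2, one way to inject the degradation is with Linux traffic control; this is only a sketch, assuming the leader's primary interface is eth0 and tc/netem is available on the host (any equivalent fault-injection tool works):

  # Run on the elected master leader; drops ~50% of packets on eth0
  sudo tc qdisc add dev eth0 root netem loss 50%
  # ...leave in place for 5+ minutes while watching leader/follower check logs...
  # Remove the rule to end the disruption
  sudo tc qdisc del dev eth0 root netem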

Expected behavior

Ideally, during network degradation/failures on the master leader, the cluster would automatically promote/elect one of the two standby master-eligible nodes as the new leader. However, this didn't happen.

We tried the other scenarios listed below, and auto promotion worked properly in both.

  1. Trigger a graceful shutdown of the master leader. The standby master-eligible node is promoted.
  2. Trigger an ungraceful shutdown of the leader (e.g. kill -9 the master leader process while it's running; see the sketch after this list). The standby master-eligible node is promoted and keeps running the cluster. Data nodes update the leader info and re-establish communication with the newly elected leader.
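
For reference, a minimal sketch of how these two shutdown scenarios can be triggered (assumptions: OpenSearch runs as a systemd service named opensearch, and the pkill pattern matches the deployment's JVM main class):

  # Scenario 1: graceful shutdown of the leader
  sudo systemctl stop opensearch
  # Scenario 2: ungraceful shutdown (kill -9 the leader's JVM)
  sudo pkill -9 -f org.opensearch.bootstrap.OpenSearch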

Additional Details

Plugins
opensearch-alerting
opensearch-anomaly-detection
opensearch-asynchronous-search
opensearch-cross-cluster-replication
opensearch-custom-codecs
opensearch-flow-framework
opensearch-geospatial
opensearch-index-management
opensearch-job-scheduler
opensearch-knn
opensearch-ml
opensearch-neural-search
opensearch-notifications
opensearch-notifications-core
opensearch-observability
opensearch-performance-analyzer
opensearch-reports-scheduler
opensearch-security
opensearch-skills
opensearch-sql
prometheus-exporter
repository-gcs
repository-s3

Screenshots
N/A

Host/Environment (please complete the following information):

  • OS: Linux
  • Version 2.16.1

Additional context
N/A

@shwetathareja
Member

shwetathareja commented Dec 17, 2024

@amberzsy can you share the logs from all 3 cluster manager (aka master) nodes? Also, did you check whether the standby masters were also network partitioned and unable to ping each other?

@rajiv-kv
Contributor

[Triage Attendees - 1, 2, 3]

Thanks @amberzsy for filing the issue. Could you provide us with the following details?

  • Logs from the stand-by nodes showing why the n/w disruption was not detected
  • Timeline of events (did the cluster eventually recover?)
  • Steps to reproduce (could you explain how the packet loss / network isolation was achieved?)

Please take a look at the existing disruption tests and see if there is a relevant one that can be referenced for this issue.

@andrross
Member

Unfortunately I haven't found any good documentation, but the setting index.unassigned.node_left.delayed_timeout can be tuned to increase the time before the system will reallocate replicas when a node is lost for any reason. This avoids starting expensive reallocations if the node is likely to return (i.e. in the case of an unstable network).
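
A minimal sketch of bumping that setting across existing indices via the index settings API (assuming the cluster listens on localhost:9200 and that a 5m delay is acceptable for the workload):

  curl -XPUT "http://localhost:9200/_all/_settings" -H 'Content-Type: application/json' -d'
  {
    "settings": {
      "index.unassigned.node_left.delayed_timeout": "5m"
    }
  }'

Since it is a dynamic per-index setting, it can also be put in an index template so that indices created later pick it up.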

@rajiv-kv rajiv-kv moved this from 🆕 New to Later (6 months plus) in Cluster Manager Project Board Jan 16, 2025
@patelsmit32123

patelsmit32123 commented Jan 28, 2025

@andrross we tried replicating the issue by injecting random 50% packet drops on the leader master. Our setup had 3 master nodes (1 leader, 2 master-eligible), but only 1 master-eligible node detected that the leader had an issue (its leader check failed), while the other master-eligible node didn't detect any issue. So during the pre-voting round, this other master-eligible node rejected the vote from the master-eligible node that did see the issue with the leader, and hence pre-voting did not proceed to an election. What can we do here?

@andrross
Member

@shwetathareja @rajiv-kv Do you have any suggestions for @patelsmit32123 regarding his previous comment? Are there any settings to tune the sensitivity of the health check between the cluster manager nodes?

@rajiv-kv
Contributor

The following settings can be tuned on the stand-by nodes to see if they detect the leader failure earlier:
cluster.fault_detection.leader_check.interval
cluster.fault_detection.leader_check.timeout
cluster.fault_detection.leader_check.retry_count

However, the default values of these settings are good, and they generally only need to be increased for large clusters.

Before tuning these settings it would be good to understand whether any failures were reported on the stand-by nodes during the network disruption. @patelsmit32123 - would it be possible to capture logs after turning on TRACE for the class org.opensearch.cluster.coordination.LeaderChecker during your experiments?
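
For reference, a sketch of both pieces (assumptions: the defaults shown below are unchanged in your build, the fault_detection settings are static and therefore go into opensearch.yml with a node restart, and the node listens on localhost:9200):

  # opensearch.yml on the stand-by cluster-manager nodes (values shown are the defaults)
  # cluster.fault_detection.leader_check.interval: 1s
  # cluster.fault_detection.leader_check.timeout: 10s
  # cluster.fault_detection.leader_check.retry_count: 3

  # Turn on TRACE logging for the LeaderChecker at runtime
  curl -XPUT "http://localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
  {
    "transient": {
      "logger.org.opensearch.cluster.coordination.LeaderChecker": "TRACE"
    }
  }'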

@patelsmit32123

Here are some logs we collected after enabling debug mode:

org.opensearch.OpenSearchException: node [{esdock-databook-opensearch-staging-cluster-phx-bofuv-ponog}{bFU9nUwDQxOeItYYHd6LIg}{T2rzQdCsQRif_N43RenriA}{10.76.3.152}{10.76.3.152:25603}{m}{rack=phxb2-c1-p1-r8, zone=phx7, shard_indexing_pressure_enabled=true, isolation-group=isolation-group-1}] failed [3] consecutive checks

[esdock-databook-opensearch-staging-cluster-phx-sunag-bovat] onLeaderFailure: coordinator becoming CANDIDATE in term 34 (was FOLLOWER, lastKnownLeader was [Optional[{esdock-databook-opensearch-staging-cluster-phx-bofuv-ponog}{bFU9nUwDQxOeItYYHd6LIg}{T2rzQdCsQRif_N43RenriA}{10.76.3.152}{10.76.3.152:25603}{m}{rack=phxb2-c1-p1-r8, zone=phx7, shard_indexing_pressure_enabled=true, isolation-group=isolation-group-1}]])

Caused by: org.opensearch.cluster.coordination.CoordinationStateRejectedException: rejecting PreVoteRequest{sourceNode={esdock-databook-opensearch-staging-cluster-phx-sunag-bovat}{RE72RaO5Tx6zF6DuTclV-w}{XV8Kjl3CSeCoDNIz6nJbyQ}{10.76.19.152}{10.76.19.152:25603}{m}{rack=phxb2-c1-p3-r8, zone=phx7, shard_indexing_pressure_enabled=true, isolation-group=isolation-group-2}, currentTerm=34} as there is already a leader

[esdock-databook-opensearch-staging-cluster-phx-sunag-bovat] PreVotingRound{preVotesReceived={{esdock-databook-opensearch-staging-cluster-phx-sunag-bovat}{RE72RaO5Tx6zF6DuTclV-w}{XV8Kjl3CSeCoDNIz6nJbyQ}{10.76.19.152}{10.76.19.152:25603}{m}{rack=phxb2-c1-p3-r8, zone=phx7, shard_indexing_pressure_enabled=true, isolation-group=isolation-group-2}=PreVoteResponse{currentTerm=34, lastAcceptedTerm=34, lastAcceptedVersion=9807}}, electionStarted=false, preVoteRequest=PreVoteRequest{sourceNode={esdock-databook-opensearch-staging-cluster-phx-sunag-bovat}{RE72RaO5Tx6zF6DuTclV-w}{XV8Kjl3CSeCoDNIz6nJbyQ}{10.76.19.152}{10.76.19.152:25603}{m}{rack=phxb2-c1-p3-r8, zone=phx7, shard_indexing_pressure_enabled=true, isolation-group=isolation-group-2}, currentTerm=34}, isClosed=false} added PreVoteResponse{currentTerm=34, lastAcceptedTerm=34, lastAcceptedVersion=9807} from {esdock-databook-opensearch-staging-cluster-phx-sunag-bovat}{RE72RaO5Tx6zF6DuTclV-w}{XV8Kjl3CSeCoDNIz6nJbyQ}{10.76.19.152}{10.76.19.152:25603}{m}{rack=phxb2-c1-p3-r8, zone=phx7, shard_indexing_pressure_enabled=true, isolation-group=isolation-group-2}, no quorum yet

@shwetathareja
Member

shwetathareja commented Jan 30, 2025

Thanks @patelsmit32123 for sharing more insights into the issue.

@andrross we tried replicating the issue by injecting random 50% packet drops on the leader master. Our setup had 3 master nodes (1 leader, 2 master-eligible), but only 1 master-eligible node detected that the leader had an issue (its leader check failed), while the other master-eligible node didn't detect any issue. So during the pre-voting round, this other master-eligible node rejected the vote from the master-eligible node that did see the issue with the leader, and hence pre-voting did not proceed to an election. What can we do here?

@patelsmit32123 it is important to establish what you are trying to achieve with the packet loss. Let's say there are 3 cluster manager (aka master) nodes in the cluster, namely node1 (leader), node2, and node3. In this case, as you mentioned, only one cluster manager node detected the issue; let's say it is node3. For any pre-voting round and election to succeed, we need a majority of votes (2 out of 3 here), otherwise it will not succeed. A single node cannot conclude an election on its own: what if that node (node3 here) has the problem instead of the others? If node1 and node2 are connected to each other and their leader and follower checks are passing, there is still a quorum of cluster manager nodes in the cluster, so no new election is triggered and any attempt from node3 is rejected. If you ensure node2 also fails to connect to node1 during the packet loss simulation, then pre-voting and a new election round should go through. You can also test with a 5-node dedicated cluster manager setup, which increases the quorum size from 2 (in the 3-node setup) to 3 (in the 5-node setup). This increases the probability of detecting the leader failure in the packet drop simulation.

Now, coming to the problem of too many shard relocations: you can tune the setting @andrross pointed out, index.unassigned.node_left.delayed_timeout, from 1 min to 5 mins or so to delay shard allocations.

With 60% of the nodes dropping, a red cluster is more or less inevitable.

You can also look into tuning the leader/follower check settings shared by @rajiv-kv. Those generally help with large clusters, where small network blips shouldn't cause node drops and too much shard allocation churn in the cluster.

In general, with 2.17, OpenSearch has a lot of improvements in the shard allocation algorithm and in optimizing shard info fetches across all nodes when there are too many unassigned shards in the cluster, so that once the issue subsides, the cluster can recover fast.

@patelsmit32123

You can also test with a 5-node dedicated cluster manager setup

Yes, but then we would need a quorum of 3 out of the 4 remaining master-eligible nodes, so it's still bad in terms of probability. I will try tuning the leader/follower settings first, share the results, and then take it forward from there.

@amberzsy
Author

amberzsy commented Feb 3, 2025

Now, coming to the problem of too many shard relocations: you can tune the setting @andrross pointed out, index.unassigned.node_left.delayed_timeout, from 1 min to 5 mins or so to delay shard allocations.

If it's planned, or we know it's a network blip and the node isn't actually going bad or getting lost, bumping it to 5 mins makes sense. But in the scenario where the node holding the primary shard gets disconnected due to a host or other hardware issue, it will cause 5 mins of write downtime (the primary shard is completely down without a replica being promoted).

In general with 2.17, OpenSearch has lot of improvements in shard allocation algorithm and optimizing shard info fetches across all nodes when there are too many unassigned shards in the cluster so that once the issue subsides, the cluster can recover fast.

Do you have the GH links, or can you elaborate more on the shard allocation algorithm improvements?

@patelsmit32123

patelsmit32123 commented Feb 4, 2025

I tested with leader check retry_count = 2 and 1; in both cases the leader election succeeded, as the master-eligible nodes were able to form a quorum. @shwetathareja if retry_count = 1 and there is a false positive leader check failure, what happens to the master-eligible node that has become a candidate but isn't getting a quorum? Can it join back in the same leader term?

Also, the metric exported for leader check failures is emitted only when the configured number of consecutive checks has failed. It would be great if we could emit each individual check failure so that users can create alerts for leader check failures in a given window, i.e. it failed 20 times out of the last 100.

@andrross
Member

andrross commented Feb 4, 2025

It would be great if we could emit each individual check failure so that users can create alerts for leader check failures in a given window, i.e. it failed 20 times out of the last 100.

@patelsmit32123 Please feel free to open an issue/pull request for this specific enhancement.

@shwetathareja
Member

But in the scenario where the node holding the primary shard gets disconnected due to a host or other hardware issue, it will cause 5 mins of write downtime (the primary shard is completely down without a replica being promoted).

Replica promotion doesn't wait for this delayed allocation. That will happen immediately.

@shwetathareja
Member

shwetathareja commented Feb 5, 2025

what happens to the master-eligible node that has become a candidate but isn't getting a quorum? Can it join back in the same leader term?

Yes, if the candidate doesn't get a quorum, it will discard that round, and if some other node becomes a candidate, it will join that new candidate.

@shwetathareja
Member

shwetathareja commented Feb 5, 2025

but then we would need a quorum of 3 out of the 4 remaining master-eligible nodes

The voting configuration is always kept at an odd size. So, out of the 4 remaining nodes, only 3 will participate in the election process, and the quorum will still be 2.
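
The committed voting configuration (and hence the effective quorum) can be inspected from the cluster state metadata; a minimal sketch, assuming the node listens on localhost:9200:

  curl -XGET "http://localhost:9200/_cluster/state/metadata?filter_path=metadata.cluster_coordination"
  # The last_committed_config array lists the node IDs that currently participate in elections.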

@patelsmit32123

Yes, if the candidate doesn't get a quorum, it will discard that round, and if some other node becomes a candidate, it will join that new candidate.

What happens if there are no other candidates? Assume only this particular master-eligible node had a network blip due to which its leader checks failed, while all other master-eligible nodes and the leader are well connected.
