[BUG] Auto promotion does not get triggered when the master leader experiences network failure/degradation #16848
Comments
@amberzsy can you share the logs from all 3 cluster manager (aka master) nodes? Also, did you check whether the standby masters were also network partitioned and unable to ping each other?
Thanks @amberzsy for filing the issue. Could you provide us with the following details:
Please take a look at the existing disruption tests and see if there is a relevant one that can be referenced for this issue.
Unfortunately I haven't found any good documentation, but the setting index.unassigned.node_left.delayed_timeout may be relevant here.
@andrross we tried replicating the issue by injecting random 50% packet drops on the leader master. Our setup had 3 master nodes (1 leader, 2 eligible), but only 1 master-eligible node detected that the leader had an issue (leader check failed), while the other master-eligible node didn't detect any issue. So during the pre-voting round, that other master-eligible node rejected the vote from the node that did see the issue with the leader, and hence pre-voting does not proceed to an election. What can we do here?
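For reference, a minimal sketch of the kind of packet-loss injection described above, using Linux tc/netem. The interface name and the exact loss percentage are assumptions, and this is not necessarily the tooling used in the test, just one way to reproduce a lossy leader:

```python
# Sketch: inject ~50% random packet loss on the leader node's interface via tc/netem.
# INTERFACE is a hypothetical value; run with sufficient privileges on the leader host.
import subprocess

INTERFACE = "eth0"  # assumed network interface on the leader node
LOSS_PCT = "50%"

def inject_packet_loss() -> None:
    # Attach a netem qdisc that randomly drops packets on the interface.
    subprocess.run(
        ["tc", "qdisc", "add", "dev", INTERFACE, "root", "netem", "loss", LOSS_PCT],
        check=True,
    )

def clear_packet_loss() -> None:
    # Remove the netem qdisc to restore normal networking.
    subprocess.run(
        ["tc", "qdisc", "del", "dev", INTERFACE, "root", "netem"],
        check=True,
    )

if __name__ == "__main__":
    inject_packet_loss()
```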
@shwetathareja @rajiv-kv Do you have any suggestions for @patelsmit32123 regarding his previous comment? Are there any settings to tune the sensitivity of the health check between the cluster manager nodes?
The following settings can be tuned on the standby nodes to see if they detect the leader failure earlier. However, the default values of these settings are good, and they generally only need to be increased for large clusters. Before tuning these settings, it would be good to understand whether any failures were reported on the standby nodes during the network disruption. @patelsmit32123 - would it be possible to capture logs after turning on TRACE for the class?
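As a side note, log levels can usually be raised at runtime through the cluster settings API without a restart. The sketch below is only illustrative: the exact class to trace was not specified above, so the org.opensearch.cluster.coordination package (which contains the leader/follower checkers) is an assumption, as are the endpoint and credentials:

```python
# Sketch: raise the log level for the cluster coordination package to TRACE at runtime
# via the dynamic "logger.*" cluster settings. Package name and endpoint are assumptions.
import requests

CLUSTER = "http://localhost:9200"  # assumed cluster endpoint

resp = requests.put(
    f"{CLUSTER}/_cluster/settings",
    json={"transient": {"logger.org.opensearch.cluster.coordination": "TRACE"}},
)
resp.raise_for_status()
print(resp.json())

# To revert, set the logger back to null:
# requests.put(f"{CLUSTER}/_cluster/settings",
#              json={"transient": {"logger.org.opensearch.cluster.coordination": None}})
```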
Here are some logs that we collected by enabling debug mode:
Thanks @patelsmit32123 for sharing more insights into the issue.
@patelsmit32123 it is important to establish what you are trying to achieve with packet loss. Let's say there are 3 cluster manager (aka master) nodes in the cluster, namely node1 (leader), node2, and node3. In this case, like you mentioned, only one cluster manager node detected the issue; let's say it is node3. For any pre-voting round and election to succeed, we need a majority of the votes (2/3 here), otherwise it will not succeed. One node alone can't conclude the election, because that node (node3 in this case) might be the one with the problem instead of the others. If node1 and node2 are connected to each other and their leader and follower checks are passing, it means there is a quorum of cluster manager nodes in the cluster, so no new election will be triggered and any attempts from node3 will be rejected.

You can ensure node2 also fails to connect to node1 during the packet loss simulation; then the pre-voting and new election round should go through. You can also test with a 5-node dedicated cluster manager setup, basically increasing the quorum size from 2 in a 3-node setup to 3 in a 5-node setup. This will increase the probability of leader failure in the packet drop simulation.

Now coming to the too-many-shard-relocations problem: you can tune the setting @andrross pointed to, index.unassigned.node_left.delayed_timeout, from 1 min to 5 mins or so to delay shard allocations. With 60% of the nodes dropping, a red cluster is sort of inevitable. You can also look into tuning the leader/follower settings shared by @rajiv-kv. Those generally help on large clusters, where small network blips shouldn't cause node drops and too much churn in the cluster from shard allocations.

In general, as of 2.17, OpenSearch has a lot of improvements in the shard allocation algorithm and in optimizing shard info fetches across all nodes when there are too many unassigned shards in the cluster, so that once the issue subsides, the cluster can recover fast.
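For completeness, a minimal sketch of tuning the delayed allocation setting mentioned above through the index settings API. The endpoint and the "_all" index pattern are assumptions; adjust them for your cluster:

```python
# Sketch: raise index.unassigned.node_left.delayed_timeout from the default 1m to 5m
# so that shard reallocation is delayed when a node briefly drops out of the cluster.
import requests

CLUSTER = "http://localhost:9200"  # assumed cluster endpoint

resp = requests.put(
    f"{CLUSTER}/_all/_settings",  # apply to all indexes; narrow the pattern as needed
    json={"settings": {"index.unassigned.node_left.delayed_timeout": "5m"}},
)
resp.raise_for_status()
print(resp.json())
```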
Yes, but then we would need a quorum from 3 out of the 4 remaining master-eligible nodes, so it's still bad in terms of probability. I will try tuning the leader/follower settings first and share the results, then take it forward from there.
If it's planned, or we know it's a network blip and the node is not actually going bad or getting lost, bumping it to 5 mins makes sense. But in the scenario where the node holding a primary shard gets disconnected due to a host or other hardware issue, it would cause 5 mins of write downtime (the primary shard is completely down without a replica being promoted).
Do you have those GitHub links, or can you elaborate more on the shard allocation algorithm improvements?
I tested with leader check retry_count = 2 and 1; in both cases the leader election succeeded, as the master-eligible nodes were able to form a quorum. @shwetathareja if retry_count = 1 and there is a false positive on the leader check failure, what happens to the master-eligible node that has become a candidate but is not getting quorum? Can it join back in the same leader term? Also, the metric exported for leader check failure is emitted only when the configured number of consecutive checks has failed; it would be great if we could emit each individual check failure so that users can create alerts for leader check failures in a given window, i.e. it failed 20 times out of the last 100.
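As a rough illustration of the trade-off being tuned here: the leader is only declared failed after retry_count consecutive check failures, each of which can take up to the check timeout. The formula below is my own approximation of how those checks accumulate, not an exact model of the implementation, and the default values shown are assumptions to verify against your version's documentation:

```python
# Back-of-the-envelope estimate of worst-case leader-failure detection delay.
# Assumed defaults: interval=1s, timeout=10s, retry_count=3. The formula is an
# approximation (each failed check takes up to `timeout`, spaced `interval` apart).
def detection_delay_seconds(interval: float, timeout: float, retry_count: int) -> float:
    return retry_count * (interval + timeout)

print(detection_delay_seconds(1, 10, 3))  # assumed defaults: ~33s worst case
print(detection_delay_seconds(1, 10, 1))  # retry_count=1: ~11s, but more false positives
```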
@patelsmit32123 Please feel free to open an issue/pull request for this specific enhancement.
Replica promotion doesn't wait for this delayed allocation. That will happen immediately.
Yes, if the candidate doesn't get quorum, it will discard that round, and if some other node becomes a candidate, it will join that new candidate.
The voting configuration is always odd. So, out of 4 nodes only 3 will participate in the election process, and the quorum will still be 2.
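To make the arithmetic in that rule concrete, here is a small sketch of how an odd voting configuration keeps the required majority stable. This only illustrates the counting described above, not OpenSearch's actual implementation:

```python
# Illustration of the majority rule: keep the voting configuration odd, and require
# more than half of the voting nodes for pre-voting/election to succeed.
def voting_config_size(eligible_nodes: int) -> int:
    # Drop one node from the voting configuration when the count is even.
    return eligible_nodes if eligible_nodes % 2 == 1 else eligible_nodes - 1

def quorum(eligible_nodes: int) -> int:
    size = voting_config_size(eligible_nodes)
    return size // 2 + 1

for n in (3, 4, 5):
    print(f"{n} eligible -> {voting_config_size(n)} voting, quorum {quorum(n)}")
# 3 eligible -> 3 voting, quorum 2
# 4 eligible -> 3 voting, quorum 2
# 5 eligible -> 5 voting, quorum 3
```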
What happens if there are no other candidates? Assume only this particular master-eligible node had a network blip due to which its leader checks failed, while all other master-eligible nodes and the leader are well connected.
Describe the bug
The cluster has 3 master nodes and 50+ data nodes. During a network failure or heavy network degradation on the master leader node, a bunch of data nodes failed the master leader check and got "disconnected" from the master leader. On the master side, those data nodes were excluded/removed from the cluster due to failures of the follower check and failures in the cluster state publish process. (Note: the master leader at this point was still processing, publishing logs, updating cluster state, etc.)
This further leads to massive shard relocation, or a red state in some extreme cases (60% of data nodes marked as disconnected and removed by the master).
Related component
Cluster Manager
To Reproduce
Expected behavior
Ideally, what would be expected is that during network degradation/failures on the master leader, the cluster would automatically promote or elect one of the two standbys as leader. However, that didn't happen.
We tried other scenarios, as mentioned below, and auto promotion worked properly.
Additional Details
Plugins
opensearch-alerting
opensearch-anomaly-detection
opensearch-asynchronous-search
opensearch-cross-cluster-replication
opensearch-custom-codecs
opensearch-flow-framework
opensearch-geospatial
opensearch-index-management
opensearch-job-scheduler
opensearch-knn
opensearch-ml
opensearch-neural-search
opensearch-notifications
opensearch-notifications-core
opensearch-observability
opensearch-performance-analyzer
opensearch-reports-scheduler
opensearch-security
opensearch-skills
opensearch-sql
prometheus-exporter
repository-gcs
repository-s3
Screenshots
N/A
Host/Environment (please complete the following information):
Additional context
N/A