-
Notifications
You must be signed in to change notification settings - Fork 24
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve the resync logic for node network status #112
Merged
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
heypnus
force-pushed
the
fix/2731098
branch
3 times, most recently
from
March 23, 2021 01:09
b24c9ac
to
2922830
Compare
heypnus
force-pushed
the
fix/2731098
branch
2 times, most recently
from
March 23, 2021 07:56
805d6d1
to
aa003ee
Compare
dantingl
reviewed
Mar 24, 2021
heypnus
changed the title
Remove redundant judgment during updating node status
Improve the resync logic for node network status
Mar 25, 2021
heypnus
force-pushed
the
fix/2731098
branch
4 times, most recently
from
March 25, 2021 09:33
dd30c91
to
18a5e8d
Compare
dantingl
reviewed
Mar 25, 2021
dantingl
approved these changes
Mar 25, 2021
heypnus
force-pushed
the
fix/2731098
branch
2 times, most recently
from
March 26, 2021 13:26
4698114
to
9bb9b04
Compare
Sometimes when a nsx-node-agent pod is created and running for less than 180 seconds, the operator will try to update the node status twice (firstly set network-unavailable=true, then sleep and try to set network-unavailable=false after 180 seconds)[1]. The code has a redundant check before sleeping, and in the check logic, Get API reads the node status from cache which may be not synced after the first update operation was executed, so an unexpected "Node condition is not changed" will be reported, then the taints cannot be removed until the removal logic was triggerred accidentally by another event from nsx-node-agent pod. This patch will remove the redundant check. And we will assume that the data read by client will eventually be correct, but may be slightly out of date. So this patch introduced the logic assertNodeStatus to ensure the final status is expected. This patch also replace the goroutine with RequeueAfter, the latter is a more native and less error-prone implementation. [1] The following logs show this case: {"level":"info","ts":"2021-03-08T14:56:37.864Z","logger":"status_manager","msg":"nsx-node-agent-p8ss5/nsx-kube-proxy for node compute-2 started for less than 17.864554094s"} {"level":"info","ts":"2021-03-08T14:56:37.864Z","logger":"status_manager","msg":"nsx-node-agent-p8ss5/nsx-node-agent for node compute-2 started for less than 17.864554094s"} {"level":"info","ts":"2021-03-08T14:56:37.864Z","logger":"status_manager","msg":"nsx-node-agent-p8ss5/nsx-ovs for node compute-2 started for less than 17.864554094s"} {"level":"info","ts":"2021-03-08T14:56:37.864Z","logger":"status_manager","msg":"Setting status NetworkUnavailable to true for node compute-2"} {"level":"info","ts":"2021-03-08T14:56:37.876Z","logger":"status_manager","msg":"Updated node condition NetworkUnavailable to true for node compute-2"} {"level":"info","ts":"2021-03-08T14:56:37.876Z","logger":"status_manager","msg":"Node condition is not changed"} ... {"level":"info","ts":"2021-03-08T15:26:13.541Z","logger":"status_manager","msg":"Setting status NetworkUnavailable to false for node compute-2"} {"level":"info","ts":"2021-03-08T15:26:13.541Z","logger":"status_manager","msg":"Setting status NetworkUnavailable to false for node compute-2 after -26m53.541741583s"}
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Sometimes when a nsx-node-agent pod is created and
running for less than 180 seconds, the operator will
try to update the node status twice (firstly set
network-unavailable=true, then sleep and try to set
network-unavailable=false after 180 seconds)[1].
The code has a redundant check before sleeping, and
in the check logic, Get API reads the node status
from cache which may be not synced after the first
update operation was executed, so an unexpected
"Node condition is not changed" will be reported,
then the taints cannot be removed until the removal
logic was triggerred accidentally by another event
from nsx-node-agent pod. This patch will remove the
redundant check.
And we will assume that the data read by client will
eventually be correct, but may be slightly out of
date. So this patch introduced the logic
assertNodeStatus to ensure the final status is
expected.
This patch also replace the goroutine with RequeueAfter,
the latter is a more native and less error-prone
implementation.
[1] The following logs show this case:
{"level":"info","ts":"2021-03-08T14:56:37.864Z","logger":"status_manager","msg":"nsx-node-agent-p8ss5/nsx-kube-proxy for node compute-2 started for less than 17.864554094s"}
{"level":"info","ts":"2021-03-08T14:56:37.864Z","logger":"status_manager","msg":"nsx-node-agent-p8ss5/nsx-node-agent for node compute-2 started for less than 17.864554094s"}
{"level":"info","ts":"2021-03-08T14:56:37.864Z","logger":"status_manager","msg":"nsx-node-agent-p8ss5/nsx-ovs for node compute-2 started for less than 17.864554094s"}
{"level":"info","ts":"2021-03-08T14:56:37.864Z","logger":"status_manager","msg":"Setting status NetworkUnavailable to true for node compute-2"}
{"level":"info","ts":"2021-03-08T14:56:37.876Z","logger":"status_manager","msg":"Updated node condition NetworkUnavailable to true for node compute-2"}
{"level":"info","ts":"2021-03-08T14:56:37.876Z","logger":"status_manager","msg":"Node condition is not changed"}
...
{"level":"info","ts":"2021-03-08T15:26:13.541Z","logger":"status_manager","msg":"Setting status NetworkUnavailable to false for node compute-2"}
{"level":"info","ts":"2021-03-08T15:26:13.541Z","logger":"status_manager","msg":"Setting status NetworkUnavailable to false for node compute-2 after -26m53.541741583s"}