
Improve the resync logic for node network status #112

Merged
heypnus merged 1 commit into vmware:master from the fix/2731098 branch on Mar 26, 2021

Conversation

heypnus (Contributor) commented on Mar 18, 2021

Sometimes when an nsx-node-agent pod is created and
has been running for less than 180 seconds, the
operator will try to update the node status twice
(first set network-unavailable=true, then sleep and
try to set network-unavailable=false after 180
seconds) [1].

The code has a redundant check before sleeping. In
that check, the Get API reads the node status from
the cache, which may not yet be synced after the
first update operation was executed, so an
unexpected "Node condition is not changed" is
reported and the taints cannot be removed until the
removal logic is accidentally triggered by another
event from the nsx-node-agent pod. This patch
removes the redundant check.
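
To illustrate the stale-cache hazard, here is a minimal sketch with hypothetical names (not the operator's actual code): the default controller-runtime client serves reads from the informer cache, so a Get issued immediately after a status Update can still return the pre-update object and make a "has the condition changed?" check misfire.

```go
package example

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// setThenCheck is a hypothetical helper showing the race: the second Get
// may be served from a cache that does not yet reflect the Update above.
func setThenCheck(ctx context.Context, c client.Client, nodeName string) error {
	node := &corev1.Node{}
	if err := c.Get(ctx, types.NamespacedName{Name: nodeName}, node); err != nil {
		return err
	}

	// Set NetworkUnavailable=true (simplified: assumes the condition
	// already exists on the node) and write it to the API server.
	for i := range node.Status.Conditions {
		if node.Status.Conditions[i].Type == corev1.NodeNetworkUnavailable {
			node.Status.Conditions[i].Status = corev1.ConditionTrue
		}
	}
	if err := c.Status().Update(ctx, node); err != nil {
		return err
	}

	// A cached Get right after the Update may still show the old condition,
	// so comparing against it can wrongly report "Node condition is not changed".
	stale := &corev1.Node{}
	return c.Get(ctx, types.NamespacedName{Name: nodeName}, stale)
}
```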

We assume that the data read by the client will
eventually be correct but may be slightly out of
date, so this patch introduces assertNodeStatus
logic to ensure that the final status is as
expected.
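
A minimal sketch of what an assertNodeStatus-style check could look like (the helper name comes from the description above; its signature and details here are assumptions, not the patch's exact code): re-read the node and report whether the NetworkUnavailable condition has converged to the expected value, so the caller can retry or requeue when it has not.

```go
package example

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// assertNodeStatus re-reads the node and reports whether the
// NetworkUnavailable condition matches the expected value. A single read
// may be slightly stale, but repeated checks will eventually observe the
// final state.
func assertNodeStatus(ctx context.Context, c client.Client, nodeName string,
	expected corev1.ConditionStatus) (bool, error) {
	node := &corev1.Node{}
	if err := c.Get(ctx, types.NamespacedName{Name: nodeName}, node); err != nil {
		return false, fmt.Errorf("failed to get node %s: %w", nodeName, err)
	}
	for _, cond := range node.Status.Conditions {
		if cond.Type == corev1.NodeNetworkUnavailable {
			return cond.Status == expected, nil
		}
	}
	// No NetworkUnavailable condition found on the node.
	return false, nil
}
```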

This patch also replaces the goroutine with
RequeueAfter, which is a more native and less
error-prone implementation.
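
For reference, a minimal sketch of the RequeueAfter pattern in controller-runtime (the constant and function names are illustrative, not the patch's code): instead of spawning a goroutine that sleeps and then updates the node, the reconciler returns a Result asking the framework to call it again once the remaining time has elapsed.

```go
package example

import (
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// nsx-node-agent pods must have been running this long before
// NetworkUnavailable is cleared (value taken from the description).
const startedThreshold = 180 * time.Second

// resyncResult asks controller-runtime to reconcile again after the
// remaining wait instead of sleeping in a detached goroutine.
func resyncResult(runningFor time.Duration) ctrl.Result {
	if runningFor < startedThreshold {
		return ctrl.Result{RequeueAfter: startedThreshold - runningFor}
	}
	return ctrl.Result{} // pod has been running long enough; no requeue needed
}
```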

[1] The following logs show this case:

{"level":"info","ts":"2021-03-08T14:56:37.864Z","logger":"status_manager","msg":"nsx-node-agent-p8ss5/nsx-kube-proxy for node compute-2 started for less than 17.864554094s"}
{"level":"info","ts":"2021-03-08T14:56:37.864Z","logger":"status_manager","msg":"nsx-node-agent-p8ss5/nsx-node-agent for node compute-2 started for less than 17.864554094s"}
{"level":"info","ts":"2021-03-08T14:56:37.864Z","logger":"status_manager","msg":"nsx-node-agent-p8ss5/nsx-ovs for node compute-2 started for less than 17.864554094s"}
{"level":"info","ts":"2021-03-08T14:56:37.864Z","logger":"status_manager","msg":"Setting status NetworkUnavailable to true for node compute-2"}
{"level":"info","ts":"2021-03-08T14:56:37.876Z","logger":"status_manager","msg":"Updated node condition NetworkUnavailable to true for node compute-2"}
{"level":"info","ts":"2021-03-08T14:56:37.876Z","logger":"status_manager","msg":"Node condition is not changed"}
...
{"level":"info","ts":"2021-03-08T15:26:13.541Z","logger":"status_manager","msg":"Setting status NetworkUnavailable to false for node compute-2"}
{"level":"info","ts":"2021-03-08T15:26:13.541Z","logger":"status_manager","msg":"Setting status NetworkUnavailable to false for node compute-2 after -26m53.541741583s"}

Review threads (resolved) on pkg/controller/pod/pod_controller.go and pkg/controller/statusmanager/pod_status.go
@heypnus changed the title from "Remove redundant judgment during updating node status" to "Improve the resync logic for node network status" on Mar 25, 2021
@heypnus force-pushed the fix/2731098 branch 4 times, most recently from dd30c91 to 18a5e8d on March 25, 2021 at 09:33
Review threads (resolved) on pkg/controller/statusmanager/pod_status.go
@heypnus force-pushed the fix/2731098 branch 2 times, most recently from 4698114 to 9bb9b04 on March 26, 2021 at 13:26
@heypnus merged commit e129229 into vmware:master on Mar 26, 2021