
Improve the resync logic for node network status #112

Merged
heypnus merged 1 commit into vmware:master from the fix/2731098 branch on Mar 26, 2021

Conversation

heypnus (Contributor) commented on Mar 18, 2021

Sometimes when an nsx-node-agent pod is created and
has been running for less than 180 seconds, the
operator will try to update the node status twice
(first set network-unavailable=true, then sleep and
try to set network-unavailable=false after 180
seconds) [1].

The code has a redundant check before sleeping. In
that check, the Get API reads the node status from
the cache, which may not yet be synced after the
first update operation was executed, so an
unexpected "Node condition is not changed" is
reported and the taints cannot be removed until the
removal logic is accidentally triggered by another
event from the nsx-node-agent pod. This patch
removes the redundant check.
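
To illustrate the stale-cache hazard, here is a minimal sketch with hypothetical names (not the operator's actual code): the default controller-runtime client serves reads from the informer cache, so a Get issued immediately after a status Update can still return the pre-update object and make a "has the condition changed?" check misfire.

```go
package example

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// setThenCheck is a hypothetical helper showing the race: the second Get
// may be served from a cache that does not yet reflect the Update above.
func setThenCheck(ctx context.Context, c client.Client, nodeName string) error {
	node := &corev1.Node{}
	if err := c.Get(ctx, types.NamespacedName{Name: nodeName}, node); err != nil {
		return err
	}

	// Set NetworkUnavailable=true (simplified: assumes the condition
	// already exists on the node) and write it to the API server.
	for i := range node.Status.Conditions {
		if node.Status.Conditions[i].Type == corev1.NodeNetworkUnavailable {
			node.Status.Conditions[i].Status = corev1.ConditionTrue
		}
	}
	if err := c.Status().Update(ctx, node); err != nil {
		return err
	}

	// A cached Get right after the Update may still show the old condition,
	// so comparing against it can wrongly report "Node condition is not changed".
	stale := &corev1.Node{}
	return c.Get(ctx, types.NamespacedName{Name: nodeName}, stale)
}
```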

We assume that the data read by the client will
eventually be correct but may be slightly out of
date, so this patch introduces assertNodeStatus
logic to ensure that the final status is as
expected.
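
A minimal sketch of what an assertNodeStatus-style check could look like (the helper name comes from the description above; its signature and details here are assumptions, not the patch's exact code): re-read the node and report whether the NetworkUnavailable condition has converged to the expected value, so the caller can retry or requeue when it has not.

```go
package example

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// assertNodeStatus re-reads the node and reports whether the
// NetworkUnavailable condition matches the expected value. A single read
// may be slightly stale, but repeated checks will eventually observe the
// final state.
func assertNodeStatus(ctx context.Context, c client.Client, nodeName string,
	expected corev1.ConditionStatus) (bool, error) {
	node := &corev1.Node{}
	if err := c.Get(ctx, types.NamespacedName{Name: nodeName}, node); err != nil {
		return false, fmt.Errorf("failed to get node %s: %w", nodeName, err)
	}
	for _, cond := range node.Status.Conditions {
		if cond.Type == corev1.NodeNetworkUnavailable {
			return cond.Status == expected, nil
		}
	}
	// No NetworkUnavailable condition found on the node.
	return false, nil
}
```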

This patch also replaces the goroutine with
RequeueAfter, which is a more native and less
error-prone implementation.
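
For reference, a minimal sketch of the RequeueAfter pattern in controller-runtime (the constant and function names are illustrative, not the patch's code): instead of spawning a goroutine that sleeps and then updates the node, the reconciler returns a Result asking the framework to call it again once the remaining time has elapsed.

```go
package example

import (
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// nsx-node-agent pods must have been running this long before
// NetworkUnavailable is cleared (value taken from the description).
const startedThreshold = 180 * time.Second

// resyncResult asks controller-runtime to reconcile again after the
// remaining wait instead of sleeping in a detached goroutine.
func resyncResult(runningFor time.Duration) ctrl.Result {
	if runningFor < startedThreshold {
		return ctrl.Result{RequeueAfter: startedThreshold - runningFor}
	}
	return ctrl.Result{} // pod has been running long enough; no requeue needed
}
```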

[1] The following logs show this case:

{"level":"info","ts":"2021-03-08T14:56:37.864Z","logger":"status_manager","msg":"nsx-node-agent-p8ss5/nsx-kube-proxy for node compute-2 started for less than 17.864554094s"}
{"level":"info","ts":"2021-03-08T14:56:37.864Z","logger":"status_manager","msg":"nsx-node-agent-p8ss5/nsx-node-agent for node compute-2 started for less than 17.864554094s"}
{"level":"info","ts":"2021-03-08T14:56:37.864Z","logger":"status_manager","msg":"nsx-node-agent-p8ss5/nsx-ovs for node compute-2 started for less than 17.864554094s"}
{"level":"info","ts":"2021-03-08T14:56:37.864Z","logger":"status_manager","msg":"Setting status NetworkUnavailable to true for node compute-2"}
{"level":"info","ts":"2021-03-08T14:56:37.876Z","logger":"status_manager","msg":"Updated node condition NetworkUnavailable to true for node compute-2"}
{"level":"info","ts":"2021-03-08T14:56:37.876Z","logger":"status_manager","msg":"Node condition is not changed"}
...
{"level":"info","ts":"2021-03-08T15:26:13.541Z","logger":"status_manager","msg":"Setting status NetworkUnavailable to false for node compute-2"}
{"level":"info","ts":"2021-03-08T15:26:13.541Z","logger":"status_manager","msg":"Setting status NetworkUnavailable to false for node compute-2 after -26m53.541741583s"}

Review threads (resolved) on pkg/controller/pod/pod_controller.go and pkg/controller/statusmanager/pod_status.go
@heypnus changed the title from "Remove redundant judgment during updating node status" to "Improve the resync logic for node network status" on Mar 25, 2021
@heypnus force-pushed the fix/2731098 branch 4 times, most recently from dd30c91 to 18a5e8d on March 25, 2021 at 09:33
Review threads (resolved) on pkg/controller/statusmanager/pod_status.go
@heypnus force-pushed the fix/2731098 branch 2 times, most recently from 4698114 to 9bb9b04 on March 26, 2021 at 13:26
@heypnus merged commit e129229 into vmware:master on Mar 26, 2021