Skip to content

Commit

Permalink
Remove redundant judgment during updating node status
Browse files Browse the repository at this point in the history
Sometimes when a nsx-node-agent pod are created and
running for less than 180 seconds, the operator will
try to update the node status twice (firstly set
network-unavailable=true, then sleep and try to set
network-unavailable=false after 180 seconds).

The code has a redundant check before sleeping, and
in the check logic, Get API reads the node status
from cache which may be not synced after the first
update operator was executed, so an unexpected
"Node condition is not changed" will be reported,
then the taints cannot be removed until the removal
logic was triggerred accidentally by another event
from nsx-node-agent pod.

The following logs show this case:

{"level":"info","ts":"2021-03-08T14:56:37.864Z","logger":"status_manager","msg":"nsx-node-agent-p8ss5/nsx-kube-proxy for node compute-2 started for less than 17.864554094s"}
{"level":"info","ts":"2021-03-08T14:56:37.864Z","logger":"status_manager","msg":"nsx-node-agent-p8ss5/nsx-node-agent for node compute-2 started for less than 17.864554094s"}
{"level":"info","ts":"2021-03-08T14:56:37.864Z","logger":"status_manager","msg":"nsx-node-agent-p8ss5/nsx-ovs for node compute-2 started for less than 17.864554094s"}
{"level":"info","ts":"2021-03-08T14:56:37.864Z","logger":"status_manager","msg":"Setting status NetworkUnavailable to true for node compute-2"}
{"level":"info","ts":"2021-03-08T14:56:37.876Z","logger":"status_manager","msg":"Updated node condition NetworkUnavailable to true for node compute-2"}
{"level":"info","ts":"2021-03-08T14:56:37.876Z","logger":"status_manager","msg":"Node condition is not changed"}
...
{"level":"info","ts":"2021-03-08T15:26:13.541Z","logger":"status_manager","msg":"Setting status NetworkUnavailable to false for node compute-2"}
{"level":"info","ts":"2021-03-08T15:26:13.541Z","logger":"status_manager","msg":"Setting status NetworkUnavailable to false for node compute-2 after -26m53.541741583s"}

This patch will remove the redundant check and put
the SetNodeConditionFromPod logic in a periodic
resync to ensure the taints can always be deleted.
  • Loading branch information
heypnus committed Mar 23, 2021
1 parent be658eb commit 2922830
Show file tree
Hide file tree
Showing 2 changed files with 2 additions and 6 deletions.
1 change: 1 addition & 0 deletions pkg/controller/pod/pod_controller.go
Original file line number Diff line number Diff line change
Expand Up @@ -199,6 +199,7 @@ func (r *ReconcilePod) Reconcile(request reconcile.Request) (reconcile.Result, e
if r.isForNsxNodeAgentPod(request) {
reqLogger.Info("Reconciling pod update for network status")
r.status.SetNodeConditionFromPod(request.NamespacedName, r.sharedInfo, nil)
return reconcile.Result{RequeueAfter: ResyncPeriod}, nil
}
return reconcile.Result{}, nil
}
Expand Down
7 changes: 1 addition & 6 deletions pkg/controller/statusmanager/pod_status.go
Original file line number Diff line number Diff line change
Expand Up @@ -144,11 +144,6 @@ func (status *StatusManager) setNodeNetworkUnavailable(nodeName string, ready bo
} else {
delete(sharedInfo.LastNetworkAvailable, nodeName)
}
_, changed := status.combineNode(nodeName, ready, reason, message)
if !changed {
log.Info("Node condition is not changed")
return
}
log.Info(fmt.Sprintf("Setting status NetworkUnavailable to %v for node %s", !ready, nodeName))
if !ready {
status.updateNodeStatus(nodeName, ready, reason, message)
Expand Down Expand Up @@ -202,7 +197,7 @@ func (status *StatusManager) updateNodeStatus(nodeName string, ready bool, reaso
// need to retrieve the node info, otherwise, the update may fail because the node has been modified during monitor slept
node, changed := status.combineNode(nodeName, ready, reason, message)
if !changed {
log.Info("Node condition is not changed, skip updating")
log.Info(fmt.Sprintf("Node condition is not changed(nodeName: %s, reason: %s, message: %s), skip updating", nodeName, reason, message))
return
}
err := status.client.Status().Update(context.TODO(), node)
Expand Down

0 comments on commit 2922830

Please sign in to comment.