Currently in sriov-network-operator, whether a cluster is a single worker node cluster is determined by checking the node count without any filtering (https://github.com/k8snetworkplumbingwg/sriov-network-operator/blob/master/pkg/utils/cluster.go#L78). If it is a single worker node cluster, the operator marks it as "skip drain" (https://github.com/k8snetworkplumbingwg/sriov-network-operator/blob/master/main.go#L249-L263), and the daemon then decides whether to skip the drain in https://github.com/k8snetworkplumbingwg/sriov-network-operator/blob/master/pkg/daemon/daemon.go#L519.
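For reference, that unfiltered check behaves roughly like the following sketch, assuming a controller-runtime client; the names are illustrative, not the exact upstream code:

```go
// Minimal sketch of the current behavior: every node in the cluster is
// listed, with no label filtering, and the cluster counts as
// "single node" when exactly one node comes back.
package cluster

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func isSingleNodeCluster(ctx context.Context, c client.Client) (bool, error) {
	nodes := &corev1.NodeList{}
	if err := c.List(ctx, nodes); err != nil {
		return false, err
	}
	return len(nodes.Items) == 1, nil
}
```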
Now, when we use sriov-network-operator in a cluster where there is only one "worker node" that data plane pods are deployed on (including sriov-device-plugin and sriov-network-config-daemon; we steer those pods to the worker node by setting configDaemonNodeSelector in SriovOperatorConfig, see https://cloud.google.com/anthos/clusters/docs/bare-metal/latest/how-to/sriov#configure_the_sr-iov_operator), while the control-plane nodes run kube-apiserver, the worker node gets stuck in SchedulingDisabled after applying a SriovNetworkNodePolicy:
$ k get nodes
NAME              STATUS                     ROLES           AGE   VERSION
control-plane-0   Ready                      control-plane   41d   v1.26.2-gke.1001
control-plane-1   Ready                      control-plane   41d   v1.26.2-gke.1001
control-plane-2   Ready                      control-plane   41d   v1.26.2-gke.1001
worker-node       Ready,SchedulingDisabled   <none>          41d   v1.25.5-gke.1001
The reason is that we have a PodDisruptionBudget for some pods on the worker node:
$ k get PodDisruptionBudget/istio-ingress -n gke-system
NAME            MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
istio-ingress   1               N/A               0                     41d
$ k get PodDisruptionBudget/istiod -n gke-system
NAME     MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
istiod   1               N/A               0                     41d
and those pods are supposed to be scheduled on worker nodes only, as the control-plane nodes have taints which those pods don't tolerate. So when sriov-network-config-daemon tries to drain the node and evict those pods, they have no other node to go to, and sriov-network-config-daemon logs the following error:
2023-06-26T18:57:43.787766559Z stderr F I0626 18:57:43.787737 11036 writer.go:132] setNodeStateStatus(): syncStatus: InProgress, lastSyncError:
2023-06-26T18:57:44.764460102Z stderr F I0626 18:57:44.764395 11036 daemon.go:133] evicting pod gke-system/istiod-665ccd8cfb-dtcc9
2023-06-26T18:57:44.765505368Z stderr F I0626 18:57:44.765470 11036 daemon.go:133] evicting pod gke-system/istio-ingress-77cbf5d986-8cvhc
2023-06-26T18:57:44.783806543Z stderr F E0626 18:57:44.783775 11036 daemon.go:133] error when evicting pods/"istiod-665ccd8cfb-dtcc9" -n "gke-system" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
2023-06-26T18:57:44.784855495Z stderr F E0626 18:57:44.784818 11036 daemon.go:133] error when evicting pods/"istio-ingress-77cbf5d986-8cvhc" -n "gke-system" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
and it just keeps retrying, so the node drain never succeeds. I think for this kind of cluster setup we should simply skip the node drain, since it can never succeed.
Given that configDaemonNodeSelector in SriovOperatorConfig already decides which nodes the sriov config daemon is deployed to (and therefore which nodes are candidates for draining), I think it makes more sense to decide whether there is effectively only one "node" in the cluster by listing nodes that match the labels in configDaemonNodeSelector. I am thinking of the following two solutions:
1. In pkg/utils/cluster.go, change the node listing to filter by the labels from the sriovnetworkv1.SriovOperatorConfig named default in the namespace os.Getenv("NAMESPACE"). Then controllers/sriovoperatorconfig_controller.go sets DisableDrain based on whether only one node matches (see the sketch after this list).
2. Don't make the controller manipulate the SriovOperatorConfig CR spec by adding DisableDrain. Instead, when DisableDrain is set in the operator config, we just follow it; when it is not set, we still decide whether it is a single-node cluster based on node label filtering, but we then need to expose the "is this a single-node cluster" determination somewhere through a CR. I would like to hear people's ideas on where that status should live.
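To make solution 1 concrete, here is a minimal sketch of the label-filtered counting, again assuming a controller-runtime client; countEligibleNodes and its exact wiring are hypothetical, not existing upstream code:

```go
// Sketch of solution 1: count only the nodes the config daemon can be
// scheduled on, i.e. nodes matching configDaemonNodeSelector from the
// default SriovOperatorConfig in the operator's namespace.
package cluster

import (
	"context"
	"os"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"

	sriovnetworkv1 "github.com/k8snetworkplumbingwg/sriov-network-operator/api/v1"
)

func countEligibleNodes(ctx context.Context, c client.Client) (int, error) {
	cfg := &sriovnetworkv1.SriovOperatorConfig{}
	key := client.ObjectKey{Name: "default", Namespace: os.Getenv("NAMESPACE")}
	if err := c.Get(ctx, key, cfg); err != nil {
		return 0, err
	}

	// An empty selector matches every node, which preserves today's
	// unfiltered behavior when configDaemonNodeSelector is not set.
	nodes := &corev1.NodeList{}
	if err := c.List(ctx, nodes,
		client.MatchingLabels(cfg.Spec.ConfigDaemonNodeSelector)); err != nil {
		return 0, err
	}
	return len(nodes.Items), nil
}
```

controllers/sriovoperatorconfig_controller.go could then set DisableDrain only when this count is exactly one, which covers the cluster layout described above.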
@SchSeba Hey Sebastian, can you comment more on what we discussed in the Jul 17, 2023 meeting?
We are trying to determine the node count by filtering on the node labels in defaultSriovOperatorConfig.Spec.ConfigDaemonNodeSelector if they are provided (since only nodes with those labels will have the sriov daemon deployed). That way we can make the right call to disable draining when there is only one node the sriov daemon can be deployed on.
Would like to get more clarification on whether this is a feasible change, as I think the current node counting is a bug. Thanks!
We don't want to do it automatically when you have more than one node, because the user needs to understand the implications of that.
Meaning: when you have one node, you need to handle the pod restarts after a configuration. When you have multiple nodes, the user must manually configure skipDrain and understand that the implication is that they will need to restart the workloads.