Redesign device plugin reset #747
base: master
Conversation
Thanks for your PR.
To skip the vendors' CIs, maintainers can use one of:
Pull Request Test Coverage Report for Build 11273255954 (Coveralls)
Force-pushed a3e8a58 to ba8d6ef
Force-pushed ba8d6ef to ef25135
@zeeke @adrianchiris @ykulazhenkov when you have time, please take a look at this :)
Left a few comments
Force-pushed d719a8c to 0981cec
Left a few comments. Overall the design looks good to me, but we must be absolutely sure about all of these annotation states. I'm pretty scared of a deadlock in some production cluster :)
Force-pushed 0981cec to a71ec00
Force-pushed 146b706 to 95e9d84
Force-pushed 324dfb1 to 8152db1
…device plugin

* use a general nodeSelector to avoid updating the daemonset yaml
* remove the config-daemon removing pod (better security)
* make the operator in charge of resetting the device plugin via annotations
* mark the node as cordoned BEFORE we remove the device plugin (without drain), to avoid scheduling new pods until the device plugin is back up

Signed-off-by: Sebastian Sch <sebassch@gmail.com>
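The cordon-then-evict sequence the commit message describes could look roughly like the sketch below. This is a minimal illustration, not the PR's code: the helper name `resetDevicePlugin` and the `"Disabled"` label value are assumptions; only the label key mirrors the constant added in this PR.

```go
package drain

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

const (
	// The label key mirrors the constant added in this PR; the "Disabled"
	// value is an assumption for illustration.
	devicePluginLabel         = "sriovnetwork.openshift.io/device-plugin"
	devicePluginLabelDisabled = "Disabled"
)

// resetDevicePlugin cordons the node BEFORE flipping the label the device
// plugin daemonset selects on, so no new pods can be scheduled while the
// device plugin pod is evicted and brought back up.
func resetDevicePlugin(ctx context.Context, c client.Client, node *corev1.Node) error {
	patch := client.MergeFrom(node.DeepCopy())
	node.Spec.Unschedulable = true // cordon first, without a full drain
	if node.Labels == nil {
		node.Labels = map[string]string{}
	}
	node.Labels[devicePluginLabel] = devicePluginLabelDisabled
	return c.Patch(ctx, node, patch)
}
```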
Force-pushed 8152db1 to e2e2ad2
Partial review; I still have the drain controller and its helper left.
It's hard to review the added logic, since a bunch of code was moved while new code was added to existing functions in the same commit :\
```go
	}

	if newObj.GetLabels()[key] != value {
		log.Log.V(2).Info("LabelObject(): Annotate object",
```
nit: Annotate -> Label
```go
	err := c.Patch(ctx,
		newObj, patch)
	if err != nil {
		log.Log.Error(err, "annotateObject(): Failed to patch object")
```
nit: fix the func name in the log message
```diff
@@ -161,3 +162,40 @@ func AnnotateNode(ctx context.Context, nodeName string, key, value string, c cli
 	return AnnotateObject(ctx, node, key, value, c)
 }
+
+// LabelObject adds label to a kubernetes object
+func LabelObject(ctx context.Context, obj client.Object, key, value string, c client.Client) error {
```
nit: keep it internal to the package for now? Since you only use LabelNode in the controller.
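For context, a plausible completion of `LabelObject`, mirroring the annotate helpers shown in the earlier snippets and applying the "Annotate -> Label" nit to the log message. This is a sketch; the PR's exact body may differ.

```go
import (
	"context"

	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/log"
)

// LabelObject adds a label to a kubernetes object. The DeepCopy/merge-patch
// pattern here is assumed from the annotate helpers in the same file.
func LabelObject(ctx context.Context, obj client.Object, key, value string, c client.Client) error {
	newObj := obj.DeepCopyObject().(client.Object)
	if newObj.GetLabels() == nil {
		newObj.SetLabels(map[string]string{})
	}
	if newObj.GetLabels()[key] != value {
		log.Log.V(2).Info("LabelObject(): Label object",
			"objectName", obj.GetName(), "labelKey", key, "labelValue", value)
		newObj.GetLabels()[key] = value
		patch := client.MergeFrom(obj)
		if err := c.Patch(ctx, newObj, patch); err != nil {
			log.Log.Error(err, "LabelObject(): Failed to patch object")
			return err
		}
	}
	return nil
}
```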
```diff
@@ -228,6 +228,18 @@ func (c *openshiftContext) OpenshiftAfterCompleteDrainNode(ctx context.Context,
 		return false, err
 	}
+
+	value, exist := mcp.Annotations[consts.MachineConfigPoolPausedAnnotation]
```
Is this change related to this PR? If not, would you prefer to submit a separate PR for it?
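For readers following along, the added lookup presumably feeds a check like the sketch below. The helper name, return convention, and import paths are assumptions; only the annotation lookup itself is from the hunk.

```go
import (
	mcfgv1 "github.com/openshift/machine-config-operator/pkg/apis/machineconfiguration.openshift.io/v1"

	"github.com/k8snetworkplumbingwg/sriov-network-operator/pkg/consts"
)

// mcpPaused reports whether the operator already marked this
// MachineConfigPool as paused via its annotation. Sketch only.
func mcpPaused(mcp *mcfgv1.MachineConfigPool) bool {
	value, exist := mcp.Annotations[consts.MachineConfigPoolPausedAnnotation]
	return exist && value == consts.MachineConfigPoolPausedAnnotationPaused
}
```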
```go
		consts.NodeStateDrainAnnotationCurrent,
		consts.DrainIdle) {
		log.Log.Info("nodeStateSyncHandler(): apply 'Device_Plugin_Reset_Required' annotation for node")
		err := utils.AnnotateNode(context.Background(), vars.NodeName, consts.NodeDrainAnnotation, consts.DevicePluginResetRequired, dn.client)
```
Do we still need to annotate both the node and the nodestate?
Why are we requesting the device plugin restart in a different place than the original?
```diff
@@ -67,12 +67,17 @@ const (
 	MachineConfigPoolPausedAnnotationIdle   = "Idle"
 	MachineConfigPoolPausedAnnotationPaused = "Paused"
+
+	SriovDevicePluginLabel        = "sriovnetwork.openshift.io/device-plugin"
+	SriovDevicePluginLabelEnabled = "Enabled"
```
Should we use lowercase for the label values? No special preference on my end, but I seldom see label values that are CamelCase.
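If lowercase wins, the constants from the hunk above would become something like the following. Note that only `Enabled` appears in the diff; the `Disabled` constant is assumed for symmetry.

```go
const (
	SriovDevicePluginLabel        = "sriovnetwork.openshift.io/device-plugin"
	SriovDevicePluginLabelEnabled = "enabled"
	// Assumed for symmetry; not present in the hunk above.
	SriovDevicePluginLabelDisabled = "disabled"
)
```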
```diff
@@ -185,42 +173,17 @@ func syncPluginDaemonObjs(ctx context.Context,
 	data.Data["ReleaseVersion"] = os.Getenv("RELEASEVERSION")
 	data.Data["ResourcePrefix"] = vars.ResourcePrefix
 	data.Data["ImagePullSecrets"] = GetImagePullSecrets()
-	data.Data["NodeSelectorField"] = GetDefaultNodeSelector()
+	data.Data["NodeSelectorField"] = GetDefaultNodeSelectorForDevicePlugin()
```
Should we merge this with the labels we set on the sriov network config daemon? That way the sriov device plugin is deployed wherever the sriov network config daemon is deployed AND the node has the device-plugin enabled label, just in case we have leftovers OR we changed the selector for the config daemon.

Also, should we clean this label (and possibly some annotations while at it) from the node object if the config daemon is not targeting it?

General comment: to me it seems (much?) simpler for the config daemon to be in charge of updating its own node label to evict the DP pod. The drain controller would then handle only what it was originally supposed to do (cordon/drain related operations). We should discuss this.
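A minimal sketch of the merged selector the reviewer suggests, assuming `GetDefaultNodeSelector()` is the config daemon's selector helper and returns the selector as a plain map (as the template data usage suggests). The merge itself is the reviewer's suggestion, not code from this PR.

```go
// GetDefaultNodeSelectorForDevicePlugin targets nodes matched by the config
// daemon's selector AND carrying the device-plugin enabled label, so the
// device plugin only runs where the config daemon runs.
func GetDefaultNodeSelectorForDevicePlugin() map[string]interface{} {
	selector := GetDefaultNodeSelector() // config daemon's node selector
	selector[constants.SriovDevicePluginLabel] = constants.SriovDevicePluginLabelEnabled
	return selector
}
```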
```go
	}

	// if we manage to cordon we label the node state with drain completed and finish
	err = utils.AnnotateObject(ctx, nodeNetworkState, constants.NodeStateDrainAnnotationCurrent, constants.DrainComplete, dr.Client)
```
A bit confusing for the device plugin reset flow, where we did not really drain.
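One hypothetical way to address this: use a dedicated completion state when only the device plugin was reset, so `DrainComplete` keeps meaning "a real drain happened". Both the `onlyDevicePluginReset` flag and the `"DevicePluginResetComplete"` value below are invented for illustration and do not appear in this PR.

```go
// Hypothetical alternative to reusing DrainComplete for the reset-only flow.
state := constants.DrainComplete
if onlyDevicePluginReset { // assumed flag for the reset-only flow
	state = "DevicePluginResetComplete" // invented value, not in this PR
}
err = utils.AnnotateObject(ctx, nodeNetworkState, constants.NodeStateDrainAnnotationCurrent, state, dr.Client)
```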
This commit introduces a redesign of how the operator resets the device plugin.