
Redesign device plugin reset #747

Open · wants to merge 1 commit into base: master

Conversation

@SchSeba (Collaborator) commented Jul 25, 2024

This commit introduces a redesign of how the operator resets the device plugin:

  • use a general nodeSelector to avoid updating the daemonset yaml
  • remove the pod-deletion logic from the config-daemon (better security)
  • make the operator in charge of resetting the device plugin via annotations
  • mark the node as cordoned BEFORE we remove the device plugin (without a drain) to avoid scheduling new pods until the device plugin is back up
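The annotation-driven flow above can be sketched as a small state machine. This is a minimal illustration, not the operator's actual implementation: the annotation names follow the constants visible in this PR (`DrainIdle`, `DevicePluginResetRequired`, `DrainComplete`), but their string values and the `step` helper are hypothetical.

```go
package main

import "fmt"

// Annotation names follow the PR's constants; the string values and the
// transition logic below are an illustrative simplification.
const (
	DrainIdle                 = "Idle"
	DevicePluginResetRequired = "Device_Plugin_Reset_Required"
	DrainComplete             = "DrainComplete"
)

// node models the small subset of node state the reset flow touches.
type node struct {
	Annotation  string // current drain/reset annotation
	Cordoned    bool
	PluginLabel string // e.g. sriovnetwork.openshift.io/device-plugin
}

// step advances the reset flow by one state. The ordering mirrors the PR:
// cordon the node BEFORE removing the device-plugin label, so no new pods
// are scheduled while the plugin is down.
func step(n *node) {
	switch n.Annotation {
	case DrainIdle:
		n.Annotation = DevicePluginResetRequired
	case DevicePluginResetRequired:
		n.Cordoned = true  // cordon first (no drain)
		n.PluginLabel = "" // then evict the device plugin via label removal
		n.Annotation = DrainComplete
	case DrainComplete:
		n.PluginLabel = "Enabled" // device plugin restored
		n.Cordoned = false
		n.Annotation = DrainIdle
	}
}

func main() {
	n := &node{Annotation: DrainIdle, PluginLabel: "Enabled"}
	step(n) // request reset
	step(n) // cordon + remove plugin
	fmt.Println(n.Cordoned, n.PluginLabel, n.Annotation)
	step(n) // restore plugin, uncordon, back to idle
	fmt.Println(n.Cordoned, n.PluginLabel, n.Annotation)
}
```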

Thanks for your PR.
To run vendor CIs, maintainers can use one of:

  • /test-all: To run all tests for all vendors.
  • /test-e2e-all: To run all E2E tests for all vendors.
  • /test-e2e-nvidia-all: To run all E2E tests for NVIDIA vendor.

To skip the vendor CIs, maintainers can use one of:

  • /skip-all: To skip all tests for all vendors.
  • /skip-e2e-all: To skip all E2E tests for all vendors.
  • /skip-e2e-nvidia-all: To skip all E2E tests for NVIDIA vendor.

Best regards.

@coveralls commented Jul 25, 2024

Pull Request Test Coverage Report for Build 11273255954

Details

  • 273 of 435 (62.76%) changed or added relevant lines in 8 files are covered.
  • 18 unchanged lines in 5 files lost coverage.
  • Overall coverage increased (+0.2%) to 45.18%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
| --- | --- | --- | --- |
| pkg/drain/drainer.go | 12 | 14 | 85.71% |
| pkg/daemon/daemon.go | 11 | 16 | 68.75% |
| pkg/utils/cluster.go | 27 | 32 | 84.38% |
| pkg/platforms/openshift/openshift.go | 0 | 9 | 0.0% |
| controllers/drain_controller_helper.go | 196 | 337 | 58.16% |

| Files with Coverage Reduction | New Missed Lines | % |
| --- | --- | --- |
| controllers/helper.go | 1 | 70.37% |
| pkg/daemon/daemon.go | 2 | 46.9% |
| pkg/utils/cluster.go | 3 | 38.58% |
| controllers/drain_controller.go | 4 | 75.0% |
| api/v1/helper.go | 8 | 75.07% |

Totals Coverage Status
Change from base Build 11258053806: 0.2%
Covered Lines: 6721
Relevant Lines: 14876

💛 - Coveralls

@SchSeba SchSeba changed the title This commit introduces a new redesign on how the operator resets the … Redesign device plugin reset Jul 25, 2024
@SchSeba SchSeba force-pushed the device_plugin_redesign branch 5 times, most recently from a3e8a58 to ba8d6ef Compare July 30, 2024 09:05
@SchSeba (Collaborator, Author) commented Aug 19, 2024

@zeeke @adrianchiris @ykulazhenkov when you have time, please take a look at this :)

@zeeke (Member) left a comment

Left a few comments

@SchSeba SchSeba force-pushed the device_plugin_redesign branch 4 times, most recently from d719a8c to 0981cec Compare September 3, 2024 13:04
@zeeke (Member) left a comment

Left a few comments. Overall the design looks good to me, but we must be very sure about all of these annotation states. I'm pretty scared of ending up with a deadlock in some production cluster :)

@SchSeba SchSeba force-pushed the device_plugin_redesign branch 3 times, most recently from 146b706 to 95e9d84 Compare October 7, 2024 13:01
@SchSeba SchSeba force-pushed the device_plugin_redesign branch 2 times, most recently from 324dfb1 to 8152db1 Compare October 10, 2024 10:58
…device plugin

* use a general nodeSelector to avoid updating the daemonset yaml
* remove the config-daemon removing pod (better security)
* make the operator in charge of resetting the device plugin via annotations
* mark the node as cordon BEFORE we remove the device plugin (without drain) to avoid scheduling new pods until the device plugin is backed up

Signed-off-by: Sebastian Sch <sebassch@gmail.com>
@adrianchiris (Collaborator) left a comment

Partial review.

I'm left with the drain controller and its helper.

It's hard to review the added logic, since a bunch of code was moved while new code was added to existing functions in the same commit :\

}

if newObj.GetLabels()[key] != value {
log.Log.V(2).Info("LabelObject(): Annotate object",
nit: Annotate -> Label

err := c.Patch(ctx,
newObj, patch)
if err != nil {
log.Log.Error(err, "annotateObject(): Failed to patch object")

nit: fix func name

@@ -161,3 +162,40 @@ func AnnotateNode(ctx context.Context, nodeName string, key, value string, c cli

return AnnotateObject(ctx, node, key, value, c)
}

// LabelObject adds label to a kubernetes object
func LabelObject(ctx context.Context, obj client.Object, key, value string, c client.Client) error {

nit: keep it internal to the package for now? Since you only use LabelNode in the controller.
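The idempotency check discussed in this thread can be shown in isolation. This standalone sketch only mirrors the `GetLabels()[key] != value` guard from the snippet above (helper name and shape are hypothetical); the real `LabelObject` additionally issues a `client.Patch` call against the API server.

```go
package main

import "fmt"

// labelObject applies key=value to a labels map and reports whether a
// change was actually made. Skipping the update when the label already
// holds the value is what lets the real helper avoid a needless Patch.
func labelObject(labels map[string]string, key, value string) (map[string]string, bool) {
	if labels == nil {
		labels = map[string]string{}
	}
	if labels[key] == value {
		return labels, false // nothing to do, no API call needed
	}
	labels[key] = value
	return labels, true
}

func main() {
	l, changed := labelObject(nil, "sriovnetwork.openshift.io/device-plugin", "Enabled")
	fmt.Println(changed, l)
	_, changed = labelObject(l, "sriovnetwork.openshift.io/device-plugin", "Enabled")
	fmt.Println(changed) // second call is a no-op
}
```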

@@ -228,6 +228,18 @@ func (c *openshiftContext) OpenshiftAfterCompleteDrainNode(ctx context.Context,
return false, err
}

value, exist := mcp.Annotations[consts.MachineConfigPoolPausedAnnotation]

Is this change related to this PR?

If not, would you prefer to submit a separate PR for it?

consts.NodeStateDrainAnnotationCurrent,
consts.DrainIdle) {
log.Log.Info("nodeStateSyncHandler(): apply 'Device_Plugin_Reset_Required' annotation for node")
err := utils.AnnotateNode(context.Background(), vars.NodeName, consts.NodeDrainAnnotation, consts.DevicePluginResetRequired, dn.client)

do we still need to annotate both the node and the nodeState?

@adrianchiris (Collaborator) commented Oct 10, 2024

why are we requesting device plugin restart in a different place than the original?

@@ -67,12 +67,17 @@ const (
MachineConfigPoolPausedAnnotationIdle = "Idle"
MachineConfigPoolPausedAnnotationPaused = "Paused"

SriovDevicePluginLabel = "sriovnetwork.openshift.io/device-plugin"
SriovDevicePluginLabelEnabled = "Enabled"

should we use lower case for the label values?

No special preference on my end, but I seldom see label values that are CamelCase.

@@ -185,42 +173,17 @@ func syncPluginDaemonObjs(ctx context.Context,
data.Data["ReleaseVersion"] = os.Getenv("RELEASEVERSION")
data.Data["ResourcePrefix"] = vars.ResourcePrefix
data.Data["ImagePullSecrets"] = GetImagePullSecrets()
data.Data["NodeSelectorField"] = GetDefaultNodeSelector()
data.Data["NodeSelectorField"] = GetDefaultNodeSelectorForDevicePlugin()

should we merge this with the labels we set on the sriov network config daemon?

That way the sriov device plugin is deployed wherever the sriov network config daemon is deployed AND the node has the device-plugin enabled label.

Just in case we have leftovers OR we changed the selector for the config daemon.

Also, should we clean this label (and possibly some annotations while at it) from the node object if the config daemon is not targeting it?
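The suggestion above, combining the config-daemon node selector with the device-plugin gating label, amounts to a simple map merge. This is an illustrative sketch (the helper name is hypothetical; the real operator renders the selector into the daemonset template), with the label key and value taken from the constants shown in this PR.

```go
package main

import "fmt"

// mergeNodeSelectors combines the config-daemon node selector with the
// device-plugin gating label, so the device plugin only lands on nodes
// the config daemon targets AND that carry the enabled label.
func mergeNodeSelectors(configDaemonSelector map[string]string) map[string]string {
	merged := map[string]string{
		"sriovnetwork.openshift.io/device-plugin": "Enabled",
	}
	for k, v := range configDaemonSelector {
		merged[k] = v
	}
	return merged
}

func main() {
	sel := mergeNodeSelectors(map[string]string{
		"node-role.kubernetes.io/worker": "",
	})
	fmt.Println(len(sel), sel["sriovnetwork.openshift.io/device-plugin"])
}
```

A daemonset with this combined nodeSelector is scheduled only where every key matches, which is exactly the "config daemon targets it AND plugin enabled" intersection the comment asks for.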

@adrianchiris (Collaborator) commented Oct 10, 2024

General comment:

To me it seems (much?) simpler for the config daemon to be in charge of updating its own node label to evict the DP pod; in that case it just requests a cordon from the drain controller.

The drain controller would then handle only what it was originally supposed to (cordon/drain-related operations), which is easier to comprehend IMO.

We should discuss this.

}

// if we manage to cordon we label the node state with drain completed and finish
err = utils.AnnotateObject(ctx, nodeNetworkState, constants.NodeStateDrainAnnotationCurrent, constants.DrainComplete, dr.Client)

A bit confusing for the device plugin reset flow, where we did not really drain.


5 participants