From 03b497edc7dc2eaaf5aa9c91a81f70a0bc0efd2d Mon Sep 17 00:00:00 2001
From: Ivan Kolodiazhnyi
Date: Wed, 19 Jul 2023 09:13:35 +0300
Subject: [PATCH] Add 'Parallel SR-IOV configuration' design document

---
 README.md                          |   2 +-
 doc/design/parallel-node-config.md | 195 +++++++++++++++++++++++++++++
 2 files changed, 196 insertions(+), 1 deletion(-)
 create mode 100644 doc/design/parallel-node-config.md

diff --git a/README.md b/README.md
index 7117ab61ee..85b7d4ab69 100644
--- a/README.md
+++ b/README.md
@@ -10,7 +10,7 @@ SR-IOV network is an optional feature of an Openshift cluster. To make it work,
 - Initialize the supported SR-IOV NIC types on selected nodes.
 - Provision/upgrade SR-IOV device plugin executable on selected node.
-- Provision/upgrade SR-IOV CNI plugin executable on selected nodes.
+- Provision/upgrade SR-IOV CNI plugin executable on selected nodes.
 - Manage configuration of SR-IOV device plugin on host.
 - Generate net-att-def CRs for SR-IOV CNI plugin
 - Supports operation in a virtualized Kubernetes deployment
diff --git a/doc/design/parallel-node-config.md b/doc/design/parallel-node-config.md
new file mode 100644
index 0000000000..4fd9e4145f
--- /dev/null
+++ b/doc/design/parallel-node-config.md
@@ -0,0 +1,195 @@
---
title: Parallel SR-IOV configuration
authors:
  - e0ne
reviewers:
  - adrianchiris
  - SchSeba
creation-date: 18-07-2023
last-updated: 18-07-2023
---

# Parallel SR-IOV configuration

## Summary
SR-IOV Network Operator configures NICs one by one, so on large cluster deployments we need to wait hours or even days
until all NICs are configured. On multi-NIC deployments the operator also configures the NICs of each node one by one,
which leads to a lot of unnecessary node drain calls during SR-IOV configuration.

## Motivation
Allow SR-IOV Network Operator to configure more than one node at a time.

### Use Cases

### Goals
* The number of nodes that may be drained at the same time should be 1 by default
* The number of nodes that may be drained at the same time should be configurable per pool
* A node pool should be defined by a node selector
* A node could be included in several pools; in that case the pool is selected by its priority, as implemented for
  `SriovNetworkNodePolicy`

### Non-Goals
Parallel NIC configuration on the same node is out of scope of this proposal.

## Proposal

Introduce a node pool drain configuration that meets the requirements listed in the Goals section above: pools are
defined by node selectors, each pool limits how many of its nodes may be drained at the same time (1 by default), and
overlapping pools are resolved by priority (see the sketch below).
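To make the priority-based pool selection concrete, here is a minimal sketch in Go. All of the type and function names
are hypothetical, and the "lower value wins" semantics is inferred from the examples later in this document (the
default pool has priority 100, described as the lowest):

```golang
// DrainPool is a hypothetical in-memory view of a configured drain pool.
type DrainPool struct {
	Name           string
	Priority       int // a lower value means a higher priority, as in SriovNetworkNodePolicy
	MaxUnavailable int
	Matches        func(nodeLabels map[string]string) bool // stands in for nodeSelectorTerms matching
}

// PoolForNode returns the highest-priority pool whose selector matches the node,
// falling back to the default pool when no other pool matches.
func PoolForNode(nodeLabels map[string]string, pools []DrainPool, defaultPool DrainPool) DrainPool {
	selected := defaultPool
	for _, pool := range pools {
		if pool.Matches(nodeLabels) && pool.Priority < selected.Priority {
			selected = pool
		}
	}
	return selected
}
```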
### Workflow Description
A new Drain controller will be introduced to manage the node cordon and drain procedures. That means the config daemon
no longer needs to perform the drain itself or hold the `drain lock`.
The overall drain process will be covered by the following states:

```golang
AnnoIdle          = "Idle"
AnnoDrainRequired = "Drain_Required"
AnnoMcpPaused     = "Draining_MCP_Paused"
AnnoDraining      = "Draining"
AnnoDrainComplete = "Draining_Complete"
```

The Drain controller will watch for node annotation changes as well as `SriovNetworkPoolConfig` and
`SriovNetworkNodeState` changes:

```golang
// +kubebuilder:rbac:groups="",resources=nodes,verbs=get;list;watch;update;patch
// +kubebuilder:rbac:groups=sriovnetwork.openshift.io,resources=sriovnetworkpoolconfigs,verbs=get;list;watch;update;patch
// +kubebuilder:rbac:groups=sriovnetwork.openshift.io,resources=sriovnetworknodestates,verbs=get;list;watch;update;patch
```

When a new `SriovNetworkNodeState` appears with `syncState=InProgress`, the controller will set the `Drain_Required`
annotation on the node. Another reconcile loop will check whether the current number of `Draining` nodes is less than
`maxUnavailable` and, if so, will mark the node as `Draining` to start the drain procedure. If the number of `Draining`
nodes already equals `maxUnavailable`, no annotation updates are applied in the current reconcile loop.

When a `SriovNetworkNodeState` appears with `syncState=Succeeded` and the node is annotated `Draining_Complete`, the
controller should change the annotation back to `Idle`.

We can add a validating webhook to block lowering `maxUnavailable` while nodes are draining. Without such a webhook,
the controller will continue the drain procedure for the nodes that are already `Draining` until their count drops to
the specified `maxUnavailable` number.

We need this logic in the controller to keep it centralized and to avoid race conditions during annotation updates.

The config daemon will no longer contain any drain-related logic, which simplifies it significantly.

### API Extensions

#### Option 1: extend the existing SriovNetworkPoolConfig CRD
`SriovNetworkPoolConfig` is currently used only on OpenShift to provide the configuration for OVS Hardware Offloading.
We can extend it with a drain pool configuration. E.g.:

```yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkPoolConfig
metadata:
  name: pool-1
  namespace: network-operator
spec:
  priority: 44
  nodeSelectorTerms:
    - matchExpressions:
        - key: some-label
          operator: In
          values:
            - val-2
    - matchExpressions:
        - key: other-label
          operator: "Exists"
  drainConfig:
    maxUnavailable: 5
  ovsHardwareOffloadConfig: {}
```

A default pool should always be created with the lowest priority:

```yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkPoolConfig
metadata:
  name: default
  namespace: network-operator
spec:
  priority: 100
  drainConfig:
    maxUnavailable: 1
```

#### Option 2: extend the SriovOperatorConfig CRD
We can extend the SriovOperatorConfig CRD to include the drain pool configuration. E.g.:

```yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovOperatorConfig
metadata:
  name: default
  namespace: network-operator
spec:
  enableInjector: false
  enableOperatorWebhook: false
  configDaemonNodeSelector: {}
  disableDrain: false
  drainConfig:
    - name: default
      maxUnavailable: 1
      priority: 100 # the lowest priority
    - name: pool-1
      maxUnavailable: 5
      priority: 44
      # empty nodeSelectorTerms means 'all nodes'
      nodeSelectorTerms:
        - matchExpressions:
            - key: some-label
              operator: In
              values:
                - val-1
                - val-2
        - matchExpressions:
            - key: other-label
              operator: "Exists"
```
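If Option 2 were chosen, the `drainConfig` entries could be backed by Go API types along the following lines. This is
only a sketch: the field names mirror the YAML above and are not part of any existing API.

```golang
import corev1 "k8s.io/api/core/v1"

// DrainConfig is a hypothetical API type backing one drainConfig list entry.
type DrainConfig struct {
	// Name identifies the drain pool; "default" is created by the operator.
	Name string `json:"name"`
	// MaxUnavailable is how many nodes of this pool may be drained in parallel.
	MaxUnavailable int `json:"maxUnavailable"`
	// Priority resolves overlapping pools; a lower value means a higher priority.
	Priority int `json:"priority,omitempty"`
	// NodeSelectorTerms scopes the pool; empty means 'all nodes'.
	NodeSelectorTerms []corev1.NodeSelectorTerm `json:"nodeSelectorTerms,omitempty"`
}
```

Reusing `corev1.NodeSelectorTerm` would keep the selector semantics identical to the upstream Kubernetes node affinity
API rather than inventing a new matching syntax.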
#### Option 3: New CRD
Add a new `SriovDrainConfig` CRD with the fields mentioned in the previous options. E.g.:

```yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovDrainConfig
metadata:
  name: default
  namespace: network-operator
spec:
  maxUnavailable: 1
  priority: 100 # the lowest priority
  # empty nodeSelectorTerms means 'all nodes'
  nodeSelectorTerms:
    - matchExpressions:
        - key: some-label
          operator: In
          values:
            - val-1
```

### Implementation Details/Notes/Constraints
The implementation of this feature is complex and requires changes to several parts of SR-IOV Network Operator. To
avoid introducing breaking changes, we have to split this effort into several phases:
* implement the new Drain controller without API changes:
  it will process only one node configuration at a time and does not require API changes
* introduce the API changes to support draining pools of nodes in parallel:
  in this phase we introduce the new API from one of the options proposed above and modify the Drain controller to
  watch the specified CRs and perform the drain in parallel according to the per-pool configuration
* drop the `NodeDrainAnnotation` usage and move the annotation values into the `SriovNetworkNodeState` object

The phases should be implemented one by one, in separate PRs, in the order above.

### Upgrade & Downgrade considerations
This feature introduces changes to CRDs, which means the updated CRDs must be applied during an upgrade or a downgrade.

### Test Plan
* Unit tests should be implemented for the new Drain controller.
* E2E, manual, or automated functional tests should cover the following cases:
  * verify that SR-IOV is actually configured on `maxUnavailable` nodes at the same time
  * verify that no more than `maxUnavailable` nodes are configured at the same time
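As a sketch of how the parallelism checks above could be implemented in a client-go based e2e suite: the helper below
counts nodes currently marked as `Draining`. The annotation key is an assumption for illustration, not a confirmed part
of the operator's API.

```golang
import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// countDrainingNodes returns how many nodes are currently annotated as Draining.
func countDrainingNodes(ctx context.Context, client kubernetes.Interface) (int, error) {
	nodes, err := client.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
	if err != nil {
		return 0, err
	}
	draining := 0
	for _, node := range nodes.Items {
		// "sriovnetwork.openshift.io/state" is an assumed key; substitute the
		// operator's actual node drain annotation.
		if node.Annotations["sriovnetwork.openshift.io/state"] == "Draining" {
			draining++
		}
	}
	return draining, nil
}
```

A test would poll this count while a policy rollout is in progress and assert that it never exceeds the pool's
`maxUnavailable`.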