---
title: Parallel SR-IOV configuration
authors:
  - e0ne
reviewers:
  - adrianchiris
  - SchSeba
creation-date: 18-07-2023
last-updated: 18-07-2023
---

# Parallel SR-IOV configuration

## Summary
The SR-IOV Network Operator configures NICs one by one. That means we'll need to wait hours or even days to configure
all NICs on large cluster deployments. On multi-NIC deployments the Operator also configures the NICs of each node one
by one, which leads to a lot of unnecessary node drain calls during SR-IOV configuration.

## Motivation
Allow the SR-IOV Network Operator to configure more than one node at a time.

### Use Cases

### Goals
* The number of nodes drainable at a time should be 1 by default
* The number of nodes drainable at a time should be configurable per pool
* A node pool should be defined by a node selector
* A node could be included in several pools; in that case the pool should be
  selected by its priority, as is implemented for SriovNetworkNodePolicy

### Non-Goals
Parallel NIC configuration on the same node is out of scope of this proposal.

## Proposal

Introduce a node pool drain configuration to meet the following requirements:
* The number of nodes drainable at a time should be 1 by default
* The number of nodes drainable at a time should be configurable per pool
* A node pool should be defined by a node selector
* A node could be included in several pools; in that case the pool should be selected by its priority, as is
  implemented for `SriovNetworkNodePolicy`

### Workflow Description
A new Drain controller will be introduced to manage the node drain and cordon procedures. That means the config daemon
no longer needs to perform the drain itself or use a `drain lock`. The overall drain process will be covered by the
following states:

```golang
const (
	AnnoIdle          = "Idle"
	AnnoDrainRequired = "Drain_Required"
	AnnoMcpPaused     = "Draining_MCP_Paused"
	AnnoDraining      = "Draining"
	AnnoDrainComplete = "Draining_Complete"
)
```
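
As an illustration, the legal transitions between the states above could be encoded in a simple map the controller
consults before updating a node's annotation. This is a hypothetical sketch (the transition set and the
`canTransition` helper are assumptions, not the actual controller code):

```go
package main

import "fmt"

// Annotation values from the design above.
const (
	AnnoIdle          = "Idle"
	AnnoDrainRequired = "Drain_Required"
	AnnoMcpPaused     = "Draining_MCP_Paused"
	AnnoDraining      = "Draining"
	AnnoDrainComplete = "Draining_Complete"
)

// validTransitions is a hypothetical encoding of the drain state machine;
// the real controller may allow additional transitions.
var validTransitions = map[string][]string{
	AnnoIdle:          {AnnoDrainRequired},
	AnnoDrainRequired: {AnnoMcpPaused, AnnoDraining},
	AnnoMcpPaused:     {AnnoDraining},
	AnnoDraining:      {AnnoDrainComplete},
	AnnoDrainComplete: {AnnoIdle},
}

// canTransition reports whether moving a node from one drain state
// to another is allowed by the sketch above.
func canTransition(from, to string) bool {
	for _, s := range validTransitions[from] {
		if s == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(canTransition(AnnoIdle, AnnoDrainRequired)) // true
	fmt.Println(canTransition(AnnoIdle, AnnoDraining))      // false
}
```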

The Drain controller will watch for node annotation changes and for `SriovNetworkPoolConfig` and
`SriovNetworkNodeState` changes:
```golang
// +kubebuilder:rbac:groups="",resources=nodes,verbs=get;list;watch;update;patch
// +kubebuilder:rbac:groups=sriovnetwork.openshift.io,resources=sriovnetworkpoolconfigs,verbs=get;list;watch;update;patch
// +kubebuilder:rbac:groups=sriovnetwork.openshift.io,resources=sriovnetworknodestates,verbs=get;list;watch;update;patch
```

When a new `SriovNetworkNodeState` appears with `syncState=InProgress`, the controller will set the `Drain_Required`
annotation. Another reconcile loop will check whether the current count of `Draining` nodes is less than
`maxUnavailable` and, if so, will mark the node as `Draining` to start the drain procedure. If the current count of
`Draining` nodes equals `maxUnavailable`, no annotation updates will be applied in that reconcile loop.
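
That gating step can be sketched as follows. This is a simplified, hypothetical model (the function name and the
in-memory `map` of node annotations are assumptions standing in for the watched node objects):

```go
package main

import "fmt"

// nodesToStartDraining picks which Drain_Required nodes may be marked
// Draining, given how many nodes are already draining and the pool's
// maxUnavailable budget. nodeStates maps node name -> drain annotation.
func nodesToStartDraining(nodeStates map[string]string, maxUnavailable int) []string {
	draining := 0
	for _, s := range nodeStates {
		if s == "Draining" || s == "Draining_MCP_Paused" {
			draining++
		}
	}
	var selected []string
	for name, s := range nodeStates {
		if s != "Drain_Required" {
			continue
		}
		if draining >= maxUnavailable {
			break // budget exhausted; wait for a node to finish draining
		}
		selected = append(selected, name)
		draining++
	}
	return selected
}

func main() {
	states := map[string]string{
		"node-a": "Draining",
		"node-b": "Drain_Required",
		"node-c": "Drain_Required",
	}
	// One slot left in the budget, so only one node starts draining.
	fmt.Println(len(nodesToStartDraining(states, 2))) // 1
}
```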

When a `SriovNetworkNodeState` appears with `syncState=Succeed` and the node annotation is `Draining_Complete`, the
controller should change the annotation back to `Idle`.

We can add a webhook to block lowering the configured `maxUnavailable` while nodes are draining. Without a validating
webhook, the controller will continue the drain procedure for the current `Draining` nodes until their count decreases
to the specified `maxUnavailable` number.

We need this logic in the controller to keep it centralized and to avoid race conditions during annotation updates.

The config daemon will no longer contain drain-related logic, which will simplify it a lot.

### API Extensions

#### Option 1: extend existing CR SriovNetworkPoolConfig
SriovNetworkPoolConfig is currently used only on OpenShift to provide configuration for OVS Hardware Offloading. We
can extend it to add configuration for the drain pool, e.g.:

```yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkPoolConfig
metadata:
  name: pool-1
  namespace: network-operator
spec:
  priority: 44
  nodeSelectorTerms:
  - matchExpressions:
    - key: some-label
      operator: In
      values:
      - val-2
  - matchExpressions:
    - key: other-label
      operator: "Exists"
  drainConfig:
    maxUnavailable: 5
  ovsHardwareOffloadConfig: {}
```
The default pool should always be created with the lowest priority:
```yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkPoolConfig
metadata:
  name: default
  namespace: network-operator
spec:
  priority: 100
  drainConfig:
    maxUnavailable: 1
```
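
When a node matches several pools, the pool with the stronger priority wins, as with `SriovNetworkNodePolicy`, where
a lower `priority` value takes precedence. A minimal sketch of that selection, using a simplified hypothetical
`DrainPool` type rather than the real CRD structs:

```go
package main

import "fmt"

// DrainPool is a simplified stand-in for SriovNetworkPoolConfig.
type DrainPool struct {
	Name           string
	Priority       int // lower value = higher precedence, as in SriovNetworkNodePolicy
	MaxUnavailable int
}

// poolForNode picks, among the pools whose selector already matched the
// node, the one with the lowest Priority value.
func poolForNode(matching []DrainPool) *DrainPool {
	var best *DrainPool
	for i := range matching {
		if best == nil || matching[i].Priority < best.Priority {
			best = &matching[i]
		}
	}
	return best
}

func main() {
	pools := []DrainPool{
		{Name: "default", Priority: 100, MaxUnavailable: 1},
		{Name: "pool-1", Priority: 44, MaxUnavailable: 5},
	}
	fmt.Println(poolForNode(pools).Name) // pool-1
}
```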

#### Option 2: extend SriovOperatorConfig CRD
We can extend the SriovOperatorConfig CRD to include drain pool configuration, e.g.:
```yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovOperatorConfig
metadata:
  name: default
  namespace: network-operator
spec:
  # Add fields here
  enableInjector: false
  enableOperatorWebhook: false
  configDaemonNodeSelector: {}
  disableDrain: false
  drainConfig:
  - name: default
    maxUnavailable: 1
    priority: 100 # the lowest priority
  - name: pool-1
    maxUnavailable: 5
    priority: 44
    # empty nodeSelectorTerms means 'all nodes'
    nodeSelectorTerms:
    - matchExpressions:
      - key: some-label
        operator: In
        values:
        - val-1
        - val-2
    - matchExpressions:
      - key: other-label
        operator: "Exists"
```
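
In both options, `nodeSelectorTerms` is assumed to follow the usual Kubernetes node-affinity convention: terms are
ORed together, while the expressions inside one term are ANDed. A minimal pure-Go sketch of that matching, with
simplified types and only the `In` and `Exists` operators used in the examples above:

```go
package main

import "fmt"

// Requirement mirrors one matchExpressions entry (In / Exists only).
type Requirement struct {
	Key      string
	Operator string // "In" or "Exists"
	Values   []string
}

// Term is one nodeSelectorTerms entry: all requirements must match (AND).
type Term []Requirement

// matches reports whether the node labels satisfy any term (OR across
// terms). An empty term list means 'all nodes', as noted in the example.
func matches(terms []Term, labels map[string]string) bool {
	if len(terms) == 0 {
		return true
	}
	for _, t := range terms {
		ok := true
		for _, r := range t {
			v, present := labels[r.Key]
			switch r.Operator {
			case "Exists":
				ok = ok && present
			case "In":
				found := false
				for _, want := range r.Values {
					if present && v == want {
						found = true
					}
				}
				ok = ok && found
			default:
				ok = false // unsupported operator in this sketch
			}
		}
		if ok {
			return true
		}
	}
	return false
}

func main() {
	terms := []Term{
		{{Key: "some-label", Operator: "In", Values: []string{"val-1", "val-2"}}},
		{{Key: "other-label", Operator: "Exists"}},
	}
	fmt.Println(matches(terms, map[string]string{"some-label": "val-2"})) // true
	fmt.Println(matches(terms, map[string]string{"some-label": "val-3"})) // false
}
```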

#### Option 3: New CRD
Add a new `SriovDrainConfig` CRD with the fields mentioned in the previous options, e.g.:
```yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovDrainConfig
metadata:
  name: default
  namespace: network-operator
spec:
  maxUnavailable: 1
  priority: 100 # the lowest priority
  # empty nodeSelectorTerms means 'all nodes'
  nodeSelectorTerms:
  - matchExpressions:
    - key: some-label
      operator: In
```

### Implementation Details/Notes/Constraints
The implementation of this feature is complex and requires changes to many parts of the SR-IOV Network Operator.
To avoid introducing breaking changes, we have to split this effort into several phases:
* implement the new Drain controller without API changes:
  it will process only one node configuration at a time and does not require API changes
* introduce new API changes to support draining a pool of nodes in parallel:
  at this phase we introduce the new API from one of the proposed options above and modify the Drain controller to
  watch the specified CRs and perform drains in parallel per node pool configuration
* drop `NodeDrainAnnotation` usage and move these annotation values into the `SriovNetworkNodeState` object

All phases should be implemented one by one, in separate PRs, in the order above.

### Upgrade & Downgrade considerations
This feature introduces changes to CRDs, which means the CRDs should be applied during upgrade or downgrade.

### Test Plan
* Unit tests should be implemented for the new Drain controller.
* E2E, manual or automated functional testing should cover such test cases:
  * verify that we actually configure SR-IOV on `maxUnavailable` nodes at the same time
  * check that we don't configure more than `maxUnavailable` nodes at the same time