Add 'Parallel SR-IOV configuration' design document
e0ne committed Jul 19, 2023
1 parent 52e9d20 commit 03b497e
Showing 2 changed files with 196 additions and 1 deletion.
README.md: 1 addition, 1 deletion

@@ -10,7 +10,7 @@ SR-IOV network is an optional feature of an Openshift cluster. To make it work,

- Initialize the supported SR-IOV NIC types on selected nodes.
- Provision/upgrade SR-IOV device plugin executable on selected node.
- Provision/upgrade SR-IOV CNI plugin executable on selected nodes.
- Manage configuration of SR-IOV device plugin on host.
- Generate net-att-def CRs for SR-IOV CNI plugin
- Supports operation in a virtualized Kubernetes deployment
doc/design/parallel-node-config.md: 195 additions, 0 deletions
@@ -0,0 +1,195 @@
---
title: Parallel SR-IOV configuration
authors:
- e0ne
reviewers:
- adrianchiris
- SchSeba
creation-date: 18-07-2023
last-updated: 18-07-2023
---

# Parallel SR-IOV configuration

## Summary
The SR-IOV Network Operator configures NICs one by one, which means it can take hours or even days to configure all
NICs in a large cluster deployment. In multi-NIC deployments, the Operator configures the NICs on each node one by one,
which leads to a lot of unnecessary node drain calls during SR-IOV configuration.

## Motivation
Allow the SR-IOV Network Operator to configure more than one node at a time.

### Use Cases

### Goals
* The number of nodes that can be drained at a time should be 1 by default
* The number of nodes that can be drained at a time should be configurable per pool
* A node pool should be defined by a node selector
* A node could be included in several pools. In this case the pool should be
  selected by its priority, as is implemented for SriovNetworkNodePolicy


### Non-Goals
Parallel NIC configuration on the same node is out of the scope of this proposal.

## Proposal

Introduce a node pool drain configuration to meet the following requirements:
* The number of nodes that can be drained at a time should be 1 by default
* The number of nodes that can be drained at a time should be configurable per pool
* A node pool should be defined by a node selector
* A node could be included in several pools. In this case the pool should be selected by its priority, as is
  implemented for `SriovNetworkNodePolicy` (see the sketch below)
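
The pool-selection rule could look roughly like the following sketch. It is a minimal, self-contained illustration,
not the operator's actual code: the types are assumptions and a plain label-equality map stands in for full
`nodeSelectorTerms`. As with `SriovNetworkNodePolicy`, a lower `priority` value is assumed to take precedence.

```golang
package main

import "fmt"

// DrainPool is a simplified stand-in for a drain pool definition.
type DrainPool struct {
	Name           string
	Priority       int // lower value means higher priority
	MaxUnavailable int
	// NodeLabels is a simplified selector: every key/value here must be present on the node.
	NodeLabels map[string]string
}

func matches(pool DrainPool, nodeLabels map[string]string) bool {
	for k, v := range pool.NodeLabels {
		if nodeLabels[k] != v {
			return false
		}
	}
	return true
}

// selectPool returns the matching pool with the lowest priority value;
// a pool with an empty selector matches all nodes.
func selectPool(pools []DrainPool, nodeLabels map[string]string) *DrainPool {
	var best *DrainPool
	for i := range pools {
		if !matches(pools[i], nodeLabels) {
			continue
		}
		if best == nil || pools[i].Priority < best.Priority {
			best = &pools[i]
		}
	}
	return best
}

func main() {
	pools := []DrainPool{
		{Name: "default", Priority: 100, MaxUnavailable: 1},
		{Name: "pool-1", Priority: 44, MaxUnavailable: 5, NodeLabels: map[string]string{"some-label": "val-2"}},
	}
	fmt.Println(selectPool(pools, map[string]string{"some-label": "val-2"}).Name) // prints "pool-1"
}
```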


### Workflow Description
A new Drain controller will be introduced to manage the node drain and cordon procedures. That means the config daemon
no longer needs to perform the drain itself or use the `drain lock`. The overall drain process will be covered by the
following states:

```golang
const (
	AnnoIdle          = "Idle"
	AnnoDrainRequired = "Drain_Required"
	AnnoMcpPaused     = "Draining_MCP_Paused"
	AnnoDraining      = "Draining"
	AnnoDrainComplete = "Draining_Complete"
)
```

The Drain controller will watch for node annotation changes as well as `SriovNetworkPoolConfig` and
`SriovNetworkNodeState` changes:
```golang
// +kubebuilder:rbac:groups="",resources=nodes,verbs=get;list;watch;update;patch
// +kubebuilder:rbac:groups="",resources=SriovNetworkPoolConfig,verbs=get;list;watch;update;patch
// +kubebuilder:rbac:groups="",resources=SriovNetworkNodeState,verbs=get;list;watch;update;patch
```
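
For illustration, the watch setup could look like the sketch below. This is not the actual implementation: it assumes
controller-runtime v0.15+ style APIs, and the reconciler type, import path, and mapping function are placeholders.

```golang
package controllers

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/handler"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"

	sriovnetworkv1 "github.com/k8snetworkplumbingwg/sriov-network-operator/api/v1"
)

// DrainReconciler is a placeholder for the proposed Drain controller.
type DrainReconciler struct {
	client.Client
}

// SetupWithManager wires the controller to reconcile Nodes and to re-queue the
// affected nodes when SriovNetworkPoolConfig or SriovNetworkNodeState objects change.
func (r *DrainReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&corev1.Node{}).
		Watches(&sriovnetworkv1.SriovNetworkPoolConfig{},
			handler.EnqueueRequestsFromMapFunc(r.mapToNodeRequests)).
		Watches(&sriovnetworkv1.SriovNetworkNodeState{},
			handler.EnqueueRequestsFromMapFunc(r.mapToNodeRequests)).
		Complete(r)
}

// mapToNodeRequests would translate a pool config or node state event into
// reconcile requests for the affected nodes (details elided in this sketch).
func (r *DrainReconciler) mapToNodeRequests(ctx context.Context, obj client.Object) []reconcile.Request {
	return nil
}
```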

When a new `SriovNetworkNodeState` appears with `syncState=InProgress`, the controller will set the `Drain_Required`
annotation on the node. Another reconcile loop will check whether the current count of `Draining` nodes is less than
`maxUnavailable` and, if so, will mark the node as `Draining` to start the drain procedure. If the current count of
`Draining` nodes equals `maxUnavailable`, no annotation updates will be applied in the current reconcile loop.

When a `SriovNetworkNodeState` appears with `syncState=Succeed` and the node annotation is `Draining_Complete`, the
controller should change the annotation to `Idle`.
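
A simplified sketch of the promotion step described above, written as a pure function over the per-node drain
annotations of a single pool (the annotation values come from the states listed earlier; the function name and shape
are illustrative, not the operator's actual code):

```golang
// nodesToStartDraining decides which nodes may move from "Drain_Required" to
// "Draining" without exceeding the pool's maxUnavailable budget.
func nodesToStartDraining(nodeAnnotations map[string]string, maxUnavailable int) []string {
	draining := 0
	for _, state := range nodeAnnotations {
		if state == "Draining" || state == "Draining_MCP_Paused" {
			draining++
		}
	}

	var selected []string
	for node, state := range nodeAnnotations {
		if draining >= maxUnavailable {
			break // the pool already has maxUnavailable nodes draining
		}
		if state == "Drain_Required" {
			selected = append(selected, node)
			draining++
		}
	}
	return selected
}
```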

We can add a validating webhook to block configuration changes that lower `maxUnavailable` while nodes are draining.
Without such a webhook, the controller will continue the drain procedure for the nodes that are currently `Draining`
until their count drops to the newly specified `maxUnavailable` value.
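
A hedged sketch of the check such a validating webhook could perform (names are illustrative):

```golang
// validateMaxUnavailableUpdate rejects lowering maxUnavailable below the number
// of nodes that are already draining in the pool.
func validateMaxUnavailableUpdate(newMaxUnavailable, currentlyDraining int) error {
	if newMaxUnavailable < currentlyDraining {
		return fmt.Errorf("maxUnavailable (%d) cannot be lower than the number of currently draining nodes (%d)",
			newMaxUnavailable, currentlyDraining)
	}
	return nil
}
```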

We need this logic in the controller so that it is centralized and to avoid race condition issues during annotation
updates.

The config daemon will no longer contain drain-related logic, which will simplify it a lot.

### API Extensions

#### Option 1: extend existing CR SriovNetworkPoolConfig
SriovNetworkPoolConfig is currently used only on OpenShift to provide configuration for OVS Hardware Offloading. We can
extend it to add configuration for the drain pool, e.g.:

```yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkPoolConfig
metadata:
  name: pool-1
  namespace: network-operator
spec:
  priority: 44
  nodeSelectorTerms:
    - matchExpressions:
        - key: some-label
          operator: In
          values:
            - val-2
    - matchExpressions:
        - key: other-label
          operator: "Exists"
  drainConfig:
    maxUnavailable: 5
  ovsHardwareOffloadConfig: {}
```
The default pool should always be created with the lowest priority:
```yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkPoolConfig
metadata:
  name: default
  namespace: network-operator
spec:
  priority: 100
  drainConfig:
    maxUnavailable: 1
```
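
In Go terms, this extension could look roughly like the sketch below. It is a sketch only: the new field and type
names are assumptions, and the existing `OvsHardwareOffloadConfig` type is stubbed out.

```golang
package v1

import (
	corev1 "k8s.io/api/core/v1"
)

// OvsHardwareOffloadConfig stands in for the field the CRD already carries today;
// its real definition is elided here.
type OvsHardwareOffloadConfig struct{}

// DrainConfigSpec holds the proposed per-pool drain settings.
type DrainConfigSpec struct {
	// MaxUnavailable is the number of nodes in this pool that may be drained
	// and reconfigured in parallel (1 by default).
	MaxUnavailable int `json:"maxUnavailable,omitempty"`
}

// SriovNetworkPoolConfigSpec as it could look after the proposed extension.
type SriovNetworkPoolConfigSpec struct {
	// Priority selects between overlapping pools; a lower value wins.
	Priority int `json:"priority,omitempty"`
	// NodeSelectorTerms define which nodes belong to this pool.
	NodeSelectorTerms []corev1.NodeSelectorTerm `json:"nodeSelectorTerms,omitempty"`
	// DrainConfig holds the drain settings for the pool.
	DrainConfig DrainConfigSpec `json:"drainConfig,omitempty"`
	// OvsHardwareOffloadConfig is the existing OVS hardware offload configuration.
	OvsHardwareOffloadConfig OvsHardwareOffloadConfig `json:"ovsHardwareOffloadConfig,omitempty"`
}
```
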
#### Option 2: extend SriovOperatorConfig CRD
We can extend the SriovOperatorConfig CRD to include the drain pool configuration, e.g.:
```yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovOperatorConfig
metadata:
  name: default
  namespace: network-operator
spec:
  enableInjector: false
  enableOperatorWebhook: false
  configDaemonNodeSelector: {}
  disableDrain: false
  drainConfig:
    - name: default
      maxUnavailable: 1
      priority: 100 # the lowest priority
    - name: pool-1
      maxUnavailable: 5
      priority: 44
      # empty nodeSelectorTerms means 'all nodes'
      nodeSelectorTerms:
        - matchExpressions:
            - key: some-label
              operator: In
              values:
                - val-1
                - val-2
        - matchExpressions:
            - key: other-label
              operator: "Exists"
```
#### Option 3: New CRD
Add a new `DrainConfiguration` CRD (shown as `SriovDrainConfig` in the example below) with the fields mentioned in the
previous options, e.g.:
```yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovDrainConfig
metadata:
  name: default
  namespace: network-operator
spec:
  maxUnavailable: 1
  priority: 100 # the lowest priority
  # empty nodeSelectorTerms means 'all nodes'
  nodeSelectorTerms:
    - matchExpressions:
        - key: some-label
          operator: In
```

### Implementation Details/Notes/Constraints
The implementation of this feature is very complex and requires a lot of changes to different parts of the SR-IOV
Network Operator. To avoid introducing breaking changes, we have to split this effort into several phases:
* implement the new Drain controller without API changes:
  it will process only one node configuration at a time and doesn't require API changes
* introduce the new API changes to support pools of nodes to be drained in parallel:
  in this phase we introduce the new API from one of the proposed options above and modify the Drain controller to
  watch the specified CRs and proceed with draining in parallel according to the per-pool configuration
* drop `NodeDrainAnnotation` usage and move these annotation values into the `SriovNetworkNodeState` object

All phases should be implemented one by one, in separate PRs, in the order above.

### Upgrade & Downgrade considerations
This feature introduces changes to the CRDs, which means the updated CRDs should be applied during an upgrade or
downgrade.

### Test Plan
* Unit tests should be implemented for the new Drain controller.
* E2E, manual, or automated functional testing should cover the following cases:
  * verify that SR-IOV is actually configured on up to `maxUnavailable` nodes at the same time
  * verify that no more than `maxUnavailable` nodes are configured at the same time
