doc(*): add proposal for enhance orm by nri
Signed-off-by: Airren <qiang.ren@intel.com>
Showing 5 changed files with 373 additions and 0 deletions.
docs/proposals/qos-management/orm-nri/20240303-orm-nri.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,373 @@ | ||
--- | ||
title: Enhance ORM by NRI | ||
authors: | ||
- "airren" | ||
- "hle2" | ||
reviewers: | ||
- "caohe" | ||
creation-date: 2024-03-03 | ||
last-updated: 2024-04-24 | ||
status: implementable | ||
|
||
--- | ||
|
||
# Enhance ORM by NRI | ||
|
||
<!--ts--> | ||
* [Enhance ORM by NRI](#enhance-orm-by-nri) | ||
   * [Summary](#summary) | ||
   * [Motivation](#motivation) | ||
      * [Goals](#goals) | ||
      * [Non-Goals/Future Work](#non-goalsfuture-work) | ||
   * [Proposal](#proposal) | ||
      * [User Stories](#user-stories) | ||
         * [Story1: Use origin kubernetes without intrusive modifications](#story1-use-origin-kubernetes-without-intrusive-modifications) | ||
         * [Story2: Synchronous configuration of QoS policies and injection of environment variables](#story2-synchronous-configuration-of-qos-policies-and-injection-of-environment-variables) | ||
      * [Requirements](#requirements) | ||
         * [Functional Requirements](#functional-requirements) | ||
         * [Non-Functional Requirements](#non-functional-requirements) | ||
      * [Design Details](#design-details) | ||
         * [Detailed working flow](#detailed-working-flow) | ||
         * [Addon](#addon) | ||
         * [Modification](#modification) | ||
         * [Test Plan](#test-plan) | ||
   * [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire) | ||
      * [Feature Enablement and Rollback](#feature-enablement-and-rollback) | ||
         * [How can this feature be enabled / disabled in a live cluster?](#how-can-this-feature-be-enabled--disabled-in-a-live-cluster) | ||
      * [Troubleshooting](#troubleshooting) | ||
         * [How does this feature react if the NRI not supported?](#how-does-this-feature-react-if-the-nri-not-supported) | ||
         * [How to handle resource allocation failures?](#how-to-handle-resource-allocation-failures) | ||
         * [What happens if the NRI stub times out or if the socket connection fails?](#what-happens-if-the-nri-stub-times-out-or-if-the-socket-connection-fails) | ||
   * [Appendix](#appendix) | ||
   * [Implementation History](#implementation-history) | ||
|
||
<!-- Created by https://github.com/ekalinin/github-markdown-toc --> | ||
<!-- Added by: airren, at: Wed Mar 27 14:55:54 CST 2024 --> | ||
|
||
<!--te--> | ||
|
||
## Summary | ||
|
||
To meet the needs of various business application scenarios, ensuring sufficient | ||
resource guarantees for latency-sensitive services is necessary, especially when | ||
online and offline tasks are mixed. This requires Kubernetes to provide more | ||
granular resource management capabilities, enhance container isolation, and reduce | ||
interference between containers. | ||
|
||
As of now, Kubernetes does not offer a fully comprehensive resource management | ||
solution. Many open-source projects in the Kubernetes ecosystem have devised | ||
their own methods to modify the deployment and management processes of pods, enabling | ||
fine-grained resource allocation. | ||
|
||
There are various approaches to extending Kubernetes, which we have summarized | ||
as follows. | ||
|
||
![kubernetes-enhance-overview](kubernetes-enhance-overview.png) | ||
|
||
All the methods listed above can enhance Kubernetes, but except for the standalone | ||
approach, they unavoidably involve intrusive modifications to the upstream Kubernetes | ||
components, making it difficult for users to stay synchronized with upstream | ||
components. Although the standalone approach avoids modifications to upstream | ||
components, this asynchronous update method also has numerous drawbacks. | ||
|
||
NRI emerged to remove the need for intrusive modifications to Kubernetes and for | ||
changes to its default workflow, and to give developers a more unified way to | ||
implement such extensions. | ||
|
||
[NRI](https://github.com/containerd/nri) is a plugin-based node resource management approach introduced by | ||
the upstream community. Using NRI, Kubernetes' node resource management capabilities | ||
can be enhanced through plugins without intrusive modifications to the upstream | ||
Kubernetes components. | ||
|
||
> NRI allows plugging domain- or vendor-specific custom logic into OCI-compatible | ||
> runtimes. This logic can make controlled changes to containers or perform extra | ||
> actions outside the scope of OCI at certain points in a containers lifecycle. | ||
> This can be used, for instance, for improved allocation and management of devices | ||
> and other container resources. | ||
![nri-architecture](nri-architecture.png) | ||
|
||
This proposal describes how to enhance Katalyst with NRI so that Katalyst can be | ||
deployed on unmodified upstream Kubernetes and is easier to maintain and use. | ||
|
||
## Motivation | ||
|
||
Katalyst enhances Kubernetes resource management policies on a single node through | ||
the QoS Resource Manager (QRM). However, the current QRM mode requires intrusive | ||
modifications to the Kubelet, which is inconvenient for users who run upstream | ||
Kubernetes rather than the KubeWharf distribution. To address this, Katalyst | ||
proposes the ORM architecture, a solution decoupled from the Kubelet that | ||
supplements QRM. | ||
|
||
In the ORM architecture, there are two implementation approaches. The first approach | ||
is named Bypass, which polls Kubelet's API for pod events on the current node and | ||
updates pod resources. This approach is asynchronous and cannot inject parameters | ||
such as environment variables. The other approach is based on NRI. NRI (Node | ||
Resource Interface) is a general framework for CRI-compatible container runtime | ||
plugin extensions. It offers a mechanism for extensions to monitor pod/container | ||
states and make limited configuration modifications. Using NRI, Katalyst can | ||
synchronously modify resources and inject other information, such as environment | ||
variables, during pod events. | ||
|
||
### Goals | ||
|
||
- Extend Katalyst's ORM mode with NRI to enhance the resource management capabilities | ||
of Kubernetes. | ||
- Support fine-grained resource control when containerd is used as the CRI runtime. | ||
|
||
### Non-Goals/Future Work | ||
|
||
- Support for runtimes other than containerd, such as CRI-O and Docker. | ||
|
||
## Proposal | ||
|
||
Unlike QRM or ORM's Bypass Mode, the Katalyst-agent works as an NRI plugin that | ||
subscribes to pod/container lifecycle events from the CRI runtime (containerd in | ||
this proposal). The Katalyst-agent either returns an adjusted container spec from | ||
the hook event or actively pushes an update to the container spec. | ||
|
||
- Get pod/container lifecycle events and pod or container information from NRI. | ||
- Transform the NRI-format information into CRI format to reuse the existing admit | ||
implementation of the QRM plugins. | ||
- Return the updated NRI-format container spec to the CRI runtime. | ||
- While reconciling, use NRI's UpdateContainer to reconfigure resources. | ||
|
||
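The format conversion in the list above can be sketched as follows. This is a minimal illustration, not Katalyst's implementation: `nriLinuxResources`, `criLinuxResources`, and `toCRI` are hypothetical, simplified stand-ins for the NRI and CRI resource messages, and the choice to model optional NRI values as pointers is an assumption made for the example.

```go
package main

import "fmt"

// nriLinuxResources is a hypothetical, simplified stand-in for the resource
// section of an NRI container. Optional values are modeled as pointers so
// that "unset" can be told apart from zero (an assumption of this sketch).
type nriLinuxResources struct {
	CPUQuota    *int64
	CPUPeriod   *uint64
	CpusetCpus  string
	MemoryLimit *int64
}

// criLinuxResources is a hypothetical, simplified stand-in for the CRI
// LinuxContainerResources message that the existing admit path consumes.
type criLinuxResources struct {
	CPUQuota           int64
	CPUPeriod          int64
	CpusetCpus         string
	MemoryLimitInBytes int64
}

// toCRI copies every field that is set and leaves the others at zero.
func toCRI(in nriLinuxResources) criLinuxResources {
	out := criLinuxResources{CpusetCpus: in.CpusetCpus}
	if in.CPUQuota != nil {
		out.CPUQuota = *in.CPUQuota
	}
	if in.CPUPeriod != nil {
		out.CPUPeriod = int64(*in.CPUPeriod)
	}
	if in.MemoryLimit != nil {
		out.MemoryLimitInBytes = *in.MemoryLimit
	}
	return out
}

func main() {
	quota := int64(200000)
	period := uint64(100000)
	res := toCRI(nriLinuxResources{CPUQuota: &quota, CPUPeriod: &period, CpusetCpus: "0-3"})
	fmt.Printf("%+v\n", res)
}
```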
**NRI Enhanced ORM (along with kubelet polling)** | ||
|
||
![orm-architecture](orm-architecture.png) | ||
|
||
### User Stories | ||
|
||
#### Story1: Use origin kubernetes without intrusive modifications | ||
|
||
Extending and enhancing Kubernetes' resource management capabilities is a common | ||
requirement in many business scenarios. At the same time, users usually want all | ||
Kubernetes components to remain consistent with the upstream community, without | ||
intrusive modifications to the original Kubernetes components. With NRI mode | ||
enabled, Katalyst can be deployed on an existing cluster without restarting it, | ||
and enhancements to the original Kubernetes are delivered through plugins. | ||
|
||
#### Story2: Synchronous configuration of QoS policies and injection of environment variables | ||
|
||
When enhancing QoS policies in Kubernetes, synchronous modification is the most | ||
efficient method. With NRI Mode enabled, Katalyst plugins can synchronously modify | ||
pod resources during pod creation, ensuring QoS policy allocation before pod | ||
execution. Additionally, through NRI Mode, dynamic updates to pod resources | ||
are possible. During pod creation, adjustments to pod resources, device binding, | ||
RDT, and environment variable injection can be achieved via NRI Mode. | ||
|
||
### Requirements | ||
|
||
- containerd must be upgraded to v1.7.0 or later. | ||
|
||
#### Functional Requirements | ||
|
||
- Support all functionality of Bypass Mode under the existing ORM architecture, | ||
including adjusting a container's cpuset, CFS quota, and memory QoS. | ||
- Support injecting environment variables into containers. | ||
|
||
#### Non-Functional Requirements | ||
|
||
- It can achieve synchronous configuration of QoS policies, improving the | ||
responsiveness of QoS policy configuration. | ||
- Fully compatible with upstream native Kubernetes components, requiring no | ||
intrusive modifications. | ||
|
||
### Design Details | ||
|
||
#### Detailed working flow | ||
|
||
![orm-nri-details](orm-nir-details.png) | ||
|
||
In this part, the method based on the Kubelet API polling is referred to as | ||
**_Bypass_** Mode, while another method based on NRI is referred to as **_NRI_** Mode. | ||
|
||
#### Addon | ||
|
||
- The ORM supports two operational modes: Bypass and NRI. Only one mode can be active | ||
at any given time. When a new ORM Manager is created, the operational mode is | ||
determined by reading the configuration; the mode cannot be changed at | ||
runtime. | ||
|
||
```go | ||
type workMode string | ||
const ( | ||
workModeNri workMode = "nri" | ||
workModeBypass workMode = "bypass" | ||
) | ||
|
||
|
||
type ManagerImpl struct { | ||
ctx context.Context | ||
.... | ||
// ORM run mode: bypass or nri. | ||
// Bypass mode is triggered by polling kubelet api to get the pod event. | ||
// NRI mode is required containerd version >= 1.7.0 and NRI enabled. | ||
mode workMode | ||
.... | ||
} | ||
func NewManager(... config *config.Configuration) { | ||
	// init orm work mode with essential components | ||
	m.initORMWorkMode(config, metaServer, emitter) | ||
} | ||
func (m *ManagerImpl) initORMWorkMode(config *config.Configuration, metaServer *metaserver.MetaServer, emitter metrics.MetricEmitter) { | ||
	// init ORM work mode according to the configuration and NRI status | ||
} | ||
``` | ||
|
||
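One way the mode could be resolved at startup is sketched below. This is an illustration under stated assumptions, not the body of `initORMWorkMode`: `resolveWorkMode` and its `nriAvailable` parameter are hypothetical, and the fallback to Bypass mirrors the behavior this proposal describes for an unusable NRI environment.

```go
package main

import "fmt"

type workMode string

const (
	workModeNri    workMode = "nri"
	workModeBypass workMode = "bypass"
)

// resolveWorkMode is a hypothetical helper: it returns NRI mode only when NRI
// is both requested in the configuration and usable on the node, and Bypass
// mode in every other case (including an empty or unknown setting).
func resolveWorkMode(configured string, nriAvailable bool) workMode {
	if workMode(configured) == workModeNri && nriAvailable {
		return workModeNri
	}
	return workModeBypass
}

func main() {
	fmt.Println(resolveWorkMode("nri", true))
	fmt.Println(resolveWorkMode("nri", false))
	fmt.Println(resolveWorkMode("bypass", true))
}
```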
- The ORM ManagerImpl functions as an NRI stub, implementing processing logic | ||
within the corresponding hook event functions. | ||
|
||
```go | ||
import "github.com/containerd/nri/pkg/stub" | ||
|
||
type ManagerImpl struct { | ||
ctx context.Context | ||
.... | ||
// nriStub is the implementation of the NRI event handlers | ||
nriStub stub.Stub | ||
// nriMask stores the specific events that need to be hooked | ||
nriMask stub.EventMask | ||
nriOptions []stub.Option | ||
nriConf nriConfig | ||
.... | ||
} | ||
``` | ||
|
||
- In enhancing the ORM implementation, three hook functions are required: | ||
`RunPodSandbox()`, `CreateContainer()`, and `RemovePodSandbox()`. | ||
|
||
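The idea of subscribing only to the three required hooks can be sketched with a plain bitmask. The `eventMask` type and its constants are hypothetical stand-ins for the `nriMask` field shown earlier; the bit values are illustrative and are not NRI's own.

```go
package main

import "fmt"

// eventMask is a hypothetical bitmask standing in for the nriMask field;
// the constants below are illustrative and do not match NRI's values.
type eventMask uint32

const (
	eventRunPodSandbox eventMask = 1 << iota
	eventCreateContainer
	eventRemovePodSandbox
	eventStopContainer // an event ORM does not subscribe to in this sketch
)

// subscribed reports whether event e is part of mask m.
func (m eventMask) subscribed(e eventMask) bool {
	return m&e != 0
}

// ormEvents returns the three events that the ORM hooks need.
func ormEvents() eventMask {
	return eventRunPodSandbox | eventCreateContainer | eventRemovePodSandbox
}

func main() {
	mask := ormEvents()
	fmt.Println(mask.subscribed(eventCreateContainer))
	fmt.Println(mask.subscribed(eventStopContainer))
}
```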
**Step 1**, during `RunPodSandbox()`, the `Admit()` function is triggered. | ||
If `Admit()` succeeds, resources are allocated for the container, and the pod | ||
creation process continues. If `Admit()` fails, pod creation also fails. | ||
```go | ||
func (m *ManagerImpl) RunPodSandbox(pod *api.PodSandbox) error { | ||
	// Admit the pod; a non-nil error is returned to the runtime and fails pod creation. | ||
	err := m.processAddPod(pod.Uid) | ||
	if err != nil { | ||
		klog.Errorf("[ORM] RunPodSandbox processAddPod fail, pod: %s/%s/%s, err: %v", | ||
			pod.Namespace, pod.Name, pod.Uid, err) | ||
	} | ||
	return err | ||
} | ||
``` | ||
|
||
**Step 2**, after a successful `Admit()`, the process proceeds to the | ||
`CreateContainer()` event. At this point, resources have been allocated for the | ||
container by `Admit()`. The corresponding resources are updated in the container's | ||
spec and returned. | ||
```go | ||
func (m *ManagerImpl) CreateContainer(pod *api.PodSandbox, container *api.Container) (*api.ContainerAdjustment, []*api.ContainerUpdate, error) { | ||
	// Update the container spec from the resources recorded by Admit() | ||
	adjust, err := m.updateContainer(pod, container) | ||
	return adjust, nil, err | ||
} | ||
``` | ||
|
||
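How an allocation might be turned into the adjustment returned from `CreateContainer()` is sketched below. The `allocation` and `adjustment` types and the `buildAdjustment` helper are hypothetical simplifications, and the environment variable names are made up for the example; the real adjustment type is `api.ContainerAdjustment`.

```go
package main

import (
	"fmt"
	"sort"
)

// allocation is a hypothetical view of what Admit() recorded for a container.
type allocation struct {
	CpusetCpus string
	Envs       map[string]string
}

// adjustment is a hypothetical, simplified stand-in for api.ContainerAdjustment.
type adjustment struct {
	CpusetCpus string
	Env        []string // KEY=VALUE entries, sorted for deterministic output
}

// buildAdjustment copies the cpuset and flattens the environment variables.
func buildAdjustment(a allocation) adjustment {
	adj := adjustment{CpusetCpus: a.CpusetCpus}
	for k, v := range a.Envs {
		adj.Env = append(adj.Env, k+"="+v)
	}
	sort.Strings(adj.Env)
	return adj
}

func main() {
	adj := buildAdjustment(allocation{
		CpusetCpus: "0-3",
		// Illustrative variable names only.
		Envs: map[string]string{"EXAMPLE_NUMA_NODE": "0", "EXAMPLE_CPUSET": "0-3"},
	})
	fmt.Println(adj.CpusetCpus, adj.Env)
}
```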
**Step 3**, during `RemovePodSandbox()`, all resource allocations related to | ||
the pod are released. | ||
|
||
```go | ||
func (m *ManagerImpl) RemovePodSandbox(pod *api.PodSandbox) error { | ||
	err := m.processDeletePod(pod.Uid) | ||
	if err != nil { | ||
		klog.Errorf("[ORM] RemovePodSandbox processDeletePod fail, pod: %s/%s/%s, err: %v", | ||
			pod.Namespace, pod.Name, pod.Uid, err) | ||
	} | ||
	return err | ||
} | ||
``` | ||
|
||
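The admit and release bookkeeping across the three steps can be sketched with a small in-memory store. Everything here is hypothetical: `podResources`, its CPU-count capacity model, and the `admit`/`release` methods only illustrate the lifecycle (reserve on sandbox start, free on sandbox removal) and do not reflect Katalyst's actual state handling.

```go
package main

import (
	"errors"
	"fmt"
)

var errInsufficient = errors.New("insufficient cpus")

// podResources is a hypothetical in-memory record of allocations per pod UID.
type podResources struct {
	capacity int            // total CPUs available (illustrative only)
	used     map[string]int // pod UID -> CPUs allocated
}

func newPodResources(capacity int) *podResources {
	return &podResources{capacity: capacity, used: map[string]int{}}
}

// inUse returns the number of CPUs currently allocated.
func (p *podResources) inUse() int {
	total := 0
	for _, n := range p.used {
		total += n
	}
	return total
}

// admit reserves cpus for the pod. On failure nothing is recorded, so the
// caller can fail pod creation without any cleanup. Re-admitting a known pod
// is treated as a no-op.
func (p *podResources) admit(uid string, cpus int) error {
	if _, ok := p.used[uid]; ok {
		return nil
	}
	if p.inUse()+cpus > p.capacity {
		return errInsufficient
	}
	p.used[uid] = cpus
	return nil
}

// release frees everything held by the pod; unknown pods are ignored.
func (p *podResources) release(uid string) {
	delete(p.used, uid)
}

func main() {
	pr := newPodResources(4)
	fmt.Println(pr.admit("pod-a", 3)) // fits
	fmt.Println(pr.admit("pod-b", 2)) // exceeds the remaining capacity
	pr.release("pod-a")
	fmt.Println(pr.admit("pod-b", 2)) // fits after the release
}
```

Keeping `release` idempotent means a repeated `RemovePodSandbox()` event for the same pod would be harmless in this sketch.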
#### Modification | ||
|
||
- In NRI Mode, once `Admit()` has completed the resource allocation, `Allocate()` | ||
does not need to execute `syncContainer()`; it simply returns after the | ||
resources have been allocated. | ||
|
||
```go | ||
func (m *ManagerImpl) Allocate(pod *v1.Pod, container *v1.Container) error { | ||
.... | ||
err := m.addContainer(pod, container) | ||
// return after resource allocate when run in NRIMode | ||
if err != nil || m.mode == workModeNri { | ||
return err | ||
} | ||
err = m.syncContainer(pod, container) | ||
return err | ||
} | ||
``` | ||
|
||
- In NRI Mode, the executor in `syncContainer()` can be implemented through NRI's | ||
`updateContainer()`. | ||
|
||
```go | ||
if m.mode == workModeNri { | ||
m.updateContainerByNRI(pod, container) | ||
} else { | ||
m.syncContainer(pod, &container) | ||
} | ||
``` | ||
|
||
- The `metaServer` becomes a member variable of the ORM `ManagerImpl` because it is | ||
used in both Bypass and NRI modes. | ||
- In NRI mode, halt the MetaManager's Reconcile and use NRI to hook the Pod/Container events. | ||
- In NRI mode, execution is handled by NRI, so no Executor needs to be created. | ||
|
||
#### Test Plan | ||
|
||
We will test the NRI enhancement of ORM in a real cluster by deploying simulated | ||
tasks that invoke resource management plugins to configure QoS policies. The tests | ||
cover the key points listed below: | ||
|
||
- ORM registers with containerd as an NRI plugin and establishes a connection. | ||
- ORM can apply the correct LinuxContainerResources configuration, based on allocation | ||
results, to containers through NRI. | ||
- ORM can add environment variables to containers through NRI. | ||
- `reconcileState()` of ORM updates the cgroup configs of containers with the latest | ||
resource allocation results. | ||
|
||
## Production Readiness Review Questionnaire | ||
|
||
### Feature Enablement and Rollback | ||
|
||
#### How can this feature be enabled / disabled in a live cluster? | ||
|
||
This feature is disabled by default and can be enabled through configuration. | ||
If a failure is detected in the NRI runtime environment while NRI mode is enabled, | ||
ORM falls back to Bypass Mode. | ||
|
||
### Troubleshooting | ||
|
||
#### How does this feature react if the NRI not supported? | ||
|
||
It will fall back to Bypass mode of ORM. | ||
|
||
#### How to handle resource allocation failures? | ||
|
||
If admit fails, the pod enters a retry loop. | ||
|
||
#### What happens if the NRI stub times out or if the socket connection fails? | ||
|
||
Currently, if the NRI plugin times out, containerd stops invoking the plugin. | ||
To address this, the following strategy is adopted: | ||
|
||
On timeout, invoke `stub.Restart` in `OnClose()` to re-create the connection to containerd. | ||
|
||
In addition, run `Admit()` with a context that carries a configured timeout; if it times out, try the creation again. | ||
|
||
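One possible reading of the timeout strategy is a bounded retry around the admit step, with a fresh deadline for each attempt. The sketch below uses only the Go standard library; `admitFunc`, `admitWithRetry`, and `demoAdmit` are hypothetical names, and the decision to retry only on deadline expiry (not on other errors) is an assumption of this example rather than something the proposal specifies.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// admitFunc is a hypothetical signature for the admit step.
type admitFunc func(ctx context.Context) error

// admitWithRetry runs admit up to attempts times, giving every attempt its own
// deadline. Only deadline expiry triggers another attempt; any other error is
// returned immediately.
func admitWithRetry(parent context.Context, timeout time.Duration, attempts int, admit admitFunc) error {
	if attempts < 1 {
		attempts = 1
	}
	var lastErr error
	for i := 0; i < attempts; i++ {
		ctx, cancel := context.WithTimeout(parent, timeout)
		lastErr = admit(ctx)
		cancel()
		if lastErr == nil {
			return nil
		}
		if !errors.Is(lastErr, context.DeadlineExceeded) {
			return lastErr
		}
	}
	return lastErr
}

// demoAdmit simulates an admit step that exceeds its deadline for the first
// `timeouts` calls and then succeeds; it returns the number of calls made.
func demoAdmit(timeouts int) (int, error) {
	calls := 0
	err := admitWithRetry(context.Background(), 10*time.Millisecond, timeouts+1,
		func(ctx context.Context) error {
			calls++
			if calls <= timeouts {
				<-ctx.Done() // block until the per-attempt deadline fires
				return ctx.Err()
			}
			return nil
		})
	return calls, err
}

func main() {
	calls, err := demoAdmit(1)
	fmt.Println(calls, err)
}
```

A per-attempt deadline keeps a single slow admit from consuming the whole budget, at the cost of repeating work when the step is not idempotent.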
## Appendix | ||
|
||
NRI : [https://github.com/containerd/nri](https://github.com/containerd/nri) | ||
|
||
ORM PR: [#406](https://github.com/kubewharf/katalyst-core/pull/406) [#430](https://github.com/kubewharf/katalyst-core/issues/430) | ||
|
||
## Implementation History | ||
- [x] 01/16/2024 Proposed idea in community meeting | ||
- [x] 03/12/2024 Compile a document following the proposal template | ||
- [x] 03/19/2024 Present proposal at a community meeting | ||
- [x] 04/20/2024 Complete the basic functionalities of NRI as covered in the detailed | ||
design | ||
- [ ] 05/10/2024 Commence the first round of testing | ||
- [ ] 05/20/2024 Open proposal PR for code |