doc(*): add proposal for enhance orm by nri #525

Merged
merged 1 commit into from
Apr 24, 2024
373 changes: 373 additions & 0 deletions docs/proposals/qos-management/orm-nri/20240303-orm-nri.md
@@ -0,0 +1,373 @@
---
title: Enhance ORM by NRI
authors:
- "airren"
- "hle2"
reviewers:
- "caohe"
creation-date: 2024-03-03
last-updated: 2024-04-24
status: implementable

---

# Enhance ORM by NRI

<!--ts-->
* [Enhance ORM by NRI](#enhance-orm-by-nri)
* [Summary](#summary)
* [Motivation](#motivation)
* [Goals](#goals)
* [Non-Goals/Future Work](#non-goalsfuture-work)
* [Proposal](#proposal)
* [User Stories](#user-stories)
* [Story1: Use origin kubernetes without intrusive modifications](#story1-use-origin-kubernetes-without--intrusive-modifications)
* [Story2: Synchronous configuration of QoS policies and injection of environment variables](#story2-synchronous-configuration-of-qos-policies-and-injection-of-environment-variables)
* [Requirements](#requirements)
* [Functional Requirements](#functional-requirements)
* [Non-Functional Requirements](#non-functional-requirements)
* [Design Details](#design-details)
* [Detailed working flow](#detailed-working-flow)
* [Addon](#addon)
* [Modification](#modification)
* [Test Plan](#test-plan)
* [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
* [Feature Enablement and Rollback](#feature-enablement-and-rollback)
* [How can this feature be enabled / disabled in a live cluster?](#how-can-this-feature-be-enabled--disabled-in-a-live-cluster)
* [Troubleshooting](#troubleshooting)
* [How does this feature react if the NRI not supported?](#how-does-this-feature-react-if-the-nri-not-supported)
* [How to handle resource allocation failures?](#how-to-handle-resource-allocation-failures)
* [What happens if the NRI stub times out or if the socket connection fails?](#what-happens-if-the-nri-stub-times-out-or-if-the-socket-connection-fails)
* [Appendix](#appendix)
* [Implementation History](#implementation-history)

<!-- Created by https://github.com/ekalinin/github-markdown-toc -->
<!-- Added by: airren, at: Wed Mar 27 14:55:54 CST 2024 -->

<!--te-->

## Summary

To meet the needs of various business application scenarios, ensuring sufficient
resource guarantees for latency-sensitive services is necessary, especially when
online and offline tasks are mixed. This requires Kubernetes to provide more
granular resource management capabilities, enhance container isolation, and reduce
interference between containers.

As of now, Kubernetes does not offer a fully comprehensive resource management
solution. Many open-source projects in the Kubernetes ecosystem have devised
their own methods to modify the deployment and management processes of pods,
enabling fine-grained resource allocation.

There are various approaches to extending Kubernetes, which we have summarized
as follows.

![kubernetes-enhance-overview](kubernetes-enhance-overview.png)

All the methods listed above can enhance Kubernetes, but except for the standalone
approach, they unavoidably involve intrusive modifications to the upstream Kubernetes
components, making it difficult for users to stay synchronized with upstream
components. Although the standalone approach avoids modifications to upstream
components, this asynchronous update method also has numerous drawbacks.

NRI emerged to remove the need for intrusive modifications to Kubernetes and
changes to its default workflow, and to give developers a more unified way to
implement such extensions.

[NRI](https://github.com/containerd/nri) is a plugin-based node resource management approach introduced by
the upstream community. Using NRI, Kubernetes' node resource management capabilities
can be enhanced through plugins without intrusive modifications to the upstream
Kubernetes components.

> NRI allows plugging domain- or vendor-specific custom logic into OCI-compatible
> runtimes. This logic can make controlled changes to containers or perform extra
> actions outside the scope of OCI at certain points in a container's lifecycle.
> This can be used, for instance, for improved allocation and management of devices
> and other container resources.

![nri-architecture](nri-architecture.png)

This proposal describes how to enhance Katalyst using NRI, allowing Katalyst to
be deployed on unmodified upstream Kubernetes and making it easier to maintain and use.

## Motivation

Katalyst enhances Kubernetes resource management policies on a single node through
the QoS Resource Manager (QRM). However, the current QRM mode involves intrusive
modifications to the Kubelet, which is inconvenient for users who run upstream
Kubernetes rather than the KubeWharf distribution. To address this, Katalyst
proposes the ORM architecture, which provides a solution decoupled from the Kubelet
as a supplement to the QRM solution.

In the ORM architecture, there are two implementation approaches. The first approach
is named Bypass, which polls Kubelet's API for pod events on the current node and
updates pod resources. This approach is asynchronous and cannot inject parameters
such as environment variables. The other approach is based on NRI. NRI (Node
Resource Interface) is a general framework for CRI-compatible container runtime
plugin extensions. It offers a mechanism for extensions to monitor pod/container
states and make limited configuration modifications. Using NRI, Katalyst can
synchronously modify resources and inject other information, such as environment
variables, during pod events.

### Goals

- Expand Katalyst's ORM mode using NRI to enhance the resource management capabilities
of Kubernetes.
- Support for fine-grained resource control when containerd is used as the CRI runtime.

### Non-Goals/Future Work

- Support for other runtimes besides containerd, such as cri-o and docker.

## Proposal

Unlike QRM or ORM's Bypass Mode, the Katalyst-agent will work as an NRI
plugin that subscribes to pod/container lifecycle events from the CRI runtime (in
this proposal, containerd). The Katalyst-agent will then either return an adjusted
container spec from the hook events or update the container spec through an active update.

- Get pod/container lifecycle events and pod or container information from NRI.
- Transform the NRI format information into CRI format to reuse the existing admit
implementation of the QRM plugins.
- Update the NRI format container spec to the CRI runtime.
- While reconciling, use NRI UpdateContainer to reconfigure resources.
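The transformation step in the list above can be sketched as follows. This is a minimal illustration, not the Katalyst implementation: `nriPodSandbox`, `nriContainer`, `admitRequest`, and `toAdmitRequest` are hypothetical stand-ins defined locally, so the example does not depend on the real NRI or Kubernetes types.

```go
package main

import "fmt"

// nriPodSandbox is a hypothetical, trimmed stand-in for the pod data
// delivered with an NRI event.
type nriPodSandbox struct {
	UID         string
	Name        string
	Namespace   string
	Annotations map[string]string
}

// nriContainer is a hypothetical stand-in for the container data in an NRI event.
type nriContainer struct {
	Name string
}

// admitRequest is a hypothetical shape for what existing admit logic consumes.
type admitRequest struct {
	PodUID        string
	PodNamespace  string
	PodName       string
	ContainerName string
	Annotations   map[string]string
}

// toAdmitRequest copies the fields the admit logic needs out of the NRI event.
// Annotations are copied so later changes cannot leak back into the event data.
func toAdmitRequest(pod nriPodSandbox, c nriContainer) admitRequest {
	annotations := make(map[string]string, len(pod.Annotations))
	for k, v := range pod.Annotations {
		annotations[k] = v
	}
	return admitRequest{
		PodUID:        pod.UID,
		PodNamespace:  pod.Namespace,
		PodName:       pod.Name,
		ContainerName: c.Name,
		Annotations:   annotations,
	}
}

func main() {
	pod := nriPodSandbox{UID: "uid-1", Name: "web", Namespace: "default"}
	req := toAdmitRequest(pod, nriContainer{Name: "app"})
	fmt.Printf("%s/%s/%s container=%s\n", req.PodNamespace, req.PodName, req.PodUID, req.ContainerName)
}
```

Keeping the conversion in one small function means the admit path stays unaware of whether the event came from NRI or from kubelet polling.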

**NRI Enhanced ORM (along with kubelet polling)**

![orm-architecture](orm-architecture.png)

### User Stories

#### Story1: Use origin kubernetes without intrusive modifications

Extending and enhancing Kubernetes' resource management capabilities is a common
requirement in many business scenarios. At the same time, users often need all
Kubernetes components to remain consistent with the upstream community, without
any intrusive modifications to the original Kubernetes components. After enabling
NRI mode, deploying Katalyst on existing clusters does not require restarting the
original cluster. Enhancements to the original Kubernetes can be achieved through
a plugin-based approach.

#### Story2: Synchronous configuration of QoS policies and injection of environment variables

When enhancing QoS policies in Kubernetes, synchronous modification is the most
efficient method. With NRI Mode enabled, Katalyst plugins can synchronously modify
pod resources during pod creation, ensuring QoS policy allocation before pod
execution. Additionally, through NRI Mode, dynamic updates to pod resources
are possible. During pod creation, adjustments to pod resources, device binding,
RDT, and environment variable injection can be achieved via NRI Mode.

### Requirements

- containerd must be upgraded to v1.7.0 or later.

#### Functional Requirements

- Support all functionalities corresponding to Bypass Mode under the existing ORM
architecture. This includes adjusting a container's cpuset, CFS quota, and memory QoS.
- Support injecting environment variables into containers.

#### Non-Functional Requirements

- It can achieve synchronous configuration of QoS policies, improving the
responsiveness of QoS policy configuration.
- Fully compatible with upstream native Kubernetes components, requiring no
intrusive modifications.

### Design Details

#### Detailed working flow

![orm-nri-details](orm-nir-details.png)

In this part, the method based on the Kubelet API polling is referred to as
**_Bypass_** Mode, while another method based on NRI is referred to as **_NRI_** Mode.

#### Addon

- The ORM supports two operational modes: Bypass or NRI. Only one mode can be active
at any given time. When a new ORM Manager is created, the operational mode is
determined by reading the configuration, and it cannot be changed at runtime.

```go
type workMode string

const (
	workModeNri    workMode = "nri"
	workModeBypass workMode = "bypass"
)

type ManagerImpl struct {
	ctx context.Context
	// ...
	// ORM run mode: bypass or nri.
	// Bypass mode polls the kubelet API to obtain pod events.
	// NRI mode requires containerd version >= 1.7.0 with NRI enabled.
	mode workMode
	// ...
}

func NewManager(... config *config.Configuration) {
	// init the ORM work mode with essential components
	m.initORMWorkMode(config, metaServer, emitter)
}

func (m *ManagerImpl) initORMWorkMode(config *config.Configuration, metaServer *metaserver.MetaServer, emitter metrics.MetricEmitter) {
	// init the ORM work mode according to the configuration and NRI status
}
```
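One way the mode decision could be expressed is sketched below. It assumes the fallback behavior described later in this proposal (Bypass is used whenever NRI is not usable); `selectWorkMode` and its `nriAvailable` parameter are hypothetical names introduced for illustration only.

```go
package main

import "fmt"

type workMode string

const (
	workModeNri    workMode = "nri"
	workModeBypass workMode = "bypass"
)

// selectWorkMode is a hypothetical helper: it returns NRI mode only when the
// configuration asks for it and the NRI runtime environment is usable.
// Every other combination results in Bypass mode.
func selectWorkMode(configured string, nriAvailable bool) workMode {
	if workMode(configured) == workModeNri && nriAvailable {
		return workModeNri
	}
	return workModeBypass
}

func main() {
	fmt.Println(selectWorkMode("nri", true))    // nri
	fmt.Println(selectWorkMode("nri", false))   // bypass
	fmt.Println(selectWorkMode("bypass", true)) // bypass
}
```

Because the mode is fixed once at construction time, the decision stays a pure function of the configuration and the NRI status observed at startup.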

- The ORM ManagerImpl functions as an NRI stub, implementing processing logic
within the corresponding hook event functions.

```go
import "github.com/containerd/nri/pkg/stub"

type ManagerImpl struct {
	ctx context.Context
	// ...
	// nriStub is the implementation of the NRI event handlers
	nriStub stub.Stub
	// nriMask stores the specific events that need to be hooked
	nriMask    stub.EventMask
	nriOptions []stub.Option
	nriConf    nriConfig
	// ...
}
```

- In enhancing the ORM implementation, three hook functions are required:
`RunPodSandbox()`, `CreateContainer()`, and `RemovePodSandbox()`.

**Step 1**, during `RunPodSandbox()`, the `Admit()` function is triggered.
If `Admit()` succeeds, resources are allocated for the container, and the pod
creation process continues. If `Admit()` fails, pod creation also fails.

```go
func (m *ManagerImpl) RunPodSandbox(pod *api.PodSandbox) error {
	err := m.processAddPod(pod.Uid)
	if err != nil {
		klog.Errorf("[ORM] RunPodSandbox processAddPod fail, pod: %s/%s/%s, err: %v",
			pod.Namespace, pod.Name, pod.Uid, err)
	}
	return err
}
```

**Step 2**, after a successful `Admit()`, the process proceeds to the
`CreateContainer()` event. At this point, resources have been allocated for the
container by `Admit()`. The corresponding resources are updated in the container's
spec and returned.

```go
func (m *ManagerImpl) CreateContainer(pod *api.PodSandbox, container *api.Container) (*api.ContainerAdjustment, []*api.ContainerUpdate, error) {
	// Update the container spec from the podResources
	adjust, err := m.updateContainer(pod, container)
	return adjust, nil, err
}
```
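To illustrate what the spec update in Step 2 might produce, the sketch below turns an allocation result into an adjustment carrying a cpuset and environment variables. All types here are local stand-ins rather than the real NRI `api.ContainerAdjustment`, and `allocation`, `buildAdjustment`, and the environment variable names are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// allocation is a hypothetical result of a successful Admit().
type allocation struct {
	CPUSet string            // for example "0-3"
	Env    map[string]string // variables to inject into the container
}

// containerAdjustment is a local stand-in for the adjustment returned to the runtime.
type containerAdjustment struct {
	CPUSetCPUs string
	Env        []string // "KEY=VALUE" entries
}

// buildAdjustment converts an allocation into an adjustment. Keys are sorted
// so that the resulting environment list is deterministic.
func buildAdjustment(a allocation) containerAdjustment {
	keys := make([]string, 0, len(a.Env))
	for k := range a.Env {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	env := make([]string, 0, len(keys))
	for _, k := range keys {
		env = append(env, k+"="+a.Env[k])
	}
	return containerAdjustment{CPUSetCPUs: a.CPUSet, Env: env}
}

func main() {
	adj := buildAdjustment(allocation{
		CPUSet: "0-3",
		Env:    map[string]string{"QOS_LEVEL": "dedicated_cores", "NUMA_NODE": "0"},
	})
	fmt.Println(adj.CPUSetCPUs, adj.Env)
}
```

Since the adjustment is returned from the hook itself, the container starts with these values already applied, which is the synchronous behavior this proposal is after.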

**Step 3**, during `RemovePodSandbox()`, all resource allocations related to
the pod are released.

```go
func (m *ManagerImpl) RemovePodSandbox(pod *api.PodSandbox) error {
	err := m.processDeletePod(pod.Uid)
	if err != nil {
		klog.Errorf("[ORM] RemovePodSandbox processDeletePod fail, pod: %s/%s/%s, err: %v",
			pod.Namespace, pod.Name, pod.Uid, err)
	}
	return err
}
```
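The three steps can be read as one allocate, apply, and release cycle. The following sketch models that cycle with an in-memory CPU counter; it is an assumption-laden toy rather than the ORM code, and `manager`, `runPodSandbox`, `createContainer`, `removePodSandbox`, and `errInsufficient` are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
)

var errInsufficient = errors.New("insufficient free CPUs")

// manager is a toy model of the ORM bookkeeping: free CPUs plus per-pod allocations.
type manager struct {
	freeCPUs  int
	allocated map[string]int // pod UID -> number of CPUs
}

func newManager(total int) *manager {
	return &manager{freeCPUs: total, allocated: map[string]int{}}
}

// runPodSandbox models Step 1: admit the pod or fail its creation.
func (m *manager) runPodSandbox(uid string, cpus int) error {
	if _, ok := m.allocated[uid]; ok {
		return nil // already admitted, keep the existing allocation
	}
	if cpus > m.freeCPUs {
		return errInsufficient
	}
	m.freeCPUs -= cpus
	m.allocated[uid] = cpus
	return nil
}

// createContainer models Step 2: look up the allocation made during admit.
func (m *manager) createContainer(uid string) (int, error) {
	cpus, ok := m.allocated[uid]
	if !ok {
		return 0, fmt.Errorf("pod %s has no allocation", uid)
	}
	return cpus, nil
}

// removePodSandbox models Step 3: release everything held by the pod.
func (m *manager) removePodSandbox(uid string) {
	m.freeCPUs += m.allocated[uid] // zero when the pod is unknown
	delete(m.allocated, uid)
}

func main() {
	m := newManager(4)
	fmt.Println(m.runPodSandbox("pod-a", 3)) // admitted
	fmt.Println(m.runPodSandbox("pod-b", 2)) // rejected, only 1 CPU is free
	m.removePodSandbox("pod-a")
	fmt.Println(m.runPodSandbox("pod-b", 2)) // admitted after the release
}
```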

#### Modification

- In NRI Mode, once `Admit()` has completed the resource allocation, `Allocate()`
does not need to execute `syncContainer()`; it should simply return after the
resources have been allocated.

```go
func (m *ManagerImpl) Allocate(pod *v1.Pod, container *v1.Container) error {
	// ...
	err := m.addContainer(pod, container)
	// return right after resource allocation when running in NRI mode
	if err != nil || m.mode == workModeNri {
		return err
	}
	err = m.syncContainer(pod, container)
	return err
}
```

- In NRI Mode, the executor in `syncContainer()` can be implemented through NRI's
`updateContainer()`.

```go
if m.mode == workModeNri {
	m.updateContainerByNRI(pod, container)
} else {
	m.syncContainer(pod, &container)
}
```

- The `metaServer` becomes a member variable of the ORM `ManagerImpl` because it is
used in both Bypass and NRI modes.
- In NRI mode, halt the MetaManager's Reconcile and use NRI to hook the pod/container events.
- In NRI mode, execution is carried out through NRI, so no Executor needs to be created.

#### Test Plan

We will test the NRI-based ORM enhancement in a real cluster by deploying resource
management plugins that simulate task invocations and configure QoS policies. The
tests will cover the key points listed below:

- ORM completes registration to containerd as an NRI plugin and establishes a connection.
- ORM can set the correct LinuxContainerResources configuration for containers
through NRI, based on the allocation results.
- ORM can add environment variables to containers through NRI.
- Validate that `reconcileState()` of ORM updates the cgroup configs of containers
according to the latest resource allocation results.

## Production Readiness Review Questionnaire

### Feature Enablement and Rollback

#### How can this feature be enabled / disabled in a live cluster?

This feature is disabled by default and can be enabled through configuration.
If a failure is detected in the NRI runtime environment while NRI mode is enabled,
ORM falls back to Bypass Mode.

### Troubleshooting

#### How does this feature react if the NRI not supported?

It will fall back to Bypass mode of ORM.

#### How to handle resource allocation failures?

If `Admit()` fails, the pod enters a retry loop.

#### What happens if the NRI stub times out or if the socket connection fails?

Currently, if the NRI plugin times out, containerd stops invoking the plugin.
To address this, the following strategy needs to be adopted.

On timeout, invoke `stub.Restart` in `OnClose()` to re-create the connection to containerd.

In addition, run `Admit()` with a context that carries a configured timeout; if it times out, try again.

## Appendix

NRI : [https://github.com/containerd/nri](https://github.com/containerd/nri)

ORM PR: [#406](https://github.com/kubewharf/katalyst-core/pull/406) [#430](https://github.com/kubewharf/katalyst-core/issues/430)

## Implementation History
- [x] 01/16/2024 Proposed idea in community meeting
- [x] 03/12/2024 Compile a document following the proposal template
- [x] 03/19/2024 Present proposal at a community meeting
- [x] 04/20/2024 Complete the basic functionalities of NRI as covered in the detailed
design
- [ ] 05/10/2024 Commence the first round of testing
- [ ] 05/20/2024 Open proposal PR for code