Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

support for sriovgpu and vgpu management #60

Merged
merged 1 commit into from
Jan 23, 2024

Conversation

ibrokethecloud
Copy link
Collaborator

IMPORTANT: Please do not create a Pull Request without creating an issue first.

Problem:

PR introduces the initial work to allow management of NVIDIA GPU's capable of sriov vgpus.

Solution:

Related Issue:
harvester/harvester#2764
Test plan:

@scydas
Copy link

scydas commented Dec 28, 2023

Hi, is srivgpu also supported? I haven't used srivgpu, only passthrough and nvidia vgpu.

Copy link

@tserong tserong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've read the HEP (harvester/harvester#4833), and I've reviewed all the code (I went commit by commit because it's easier to try to understand that way), and this all looks well thought out, and like it does what it says. Also the tests are passing :-) But I have to disclaim that as I'm still relatively new to the codebase, there's things I don't really understand (e.g. exactly how kubevirt device plugins work), and as it's a big review there might be stuff I missed.

I've got a couple of small suggestions and a couple of questions/comments, but in general LGTM.

tests/snapshots/vgpu-node.umockdev Outdated Show resolved Hide resolved
pkg/controller/gpudevice/vgpu_controller.go Show resolved Hide resolved
pkg/controller/gpudevice/vgpu_controller.go Outdated Show resolved Hide resolved
pkg/deviceplugins/vgpu_device_manager.go Show resolved Hide resolved
pkg/util/executor/interface.go Outdated Show resolved Hide resolved
pkg/util/executor/remote.go Show resolved Hide resolved
Copy link
Member

@WebberHuang1118 WebberHuang1118 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just please help to answer some questions, thanks.

pkg/controller/gpudevice/vgpu_controller.go Show resolved Hide resolved
pkg/controller/gpudevice/vgpu_controller.go Show resolved Hide resolved
pkg/controller/gpudevice/vgpu_controller.go Outdated Show resolved Hide resolved
pkg/deviceplugins/vgpu_device_manager.go Outdated Show resolved Hide resolved
pkg/deviceplugins/vgpu_device_manager.go Show resolved Hide resolved
pkg/deviceplugins/vgpu_device_manager.go Show resolved Hide resolved
pkg/util/executor/interface.go Outdated Show resolved Hide resolved
pkg/util/executor/remote.go Outdated Show resolved Hide resolved
pkg/util/executor/remote.go Outdated Show resolved Hide resolved
pkg/controller/gpudevice/gpu_controller.go Show resolved Hide resolved
Copy link
Member

@WebberHuang1118 WebberHuang1118 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks.

Copy link
Member

@bk201 bk201 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just need to fix ci

updated gpuhelper and generated client/controller for gpu crds

initial logic for SRIOVGPUDevice controller

initial wiring logic for gpu/vgpu device setup

initial wiring and reconcile of sriovGPU and vGPU devices

fixed up reconcile of vgpu devices

reconcile of gpu/vgpu devices

added device plugin for vgpu

fixed up device health in vgpu plugin

added integration to upgrade kubevirt CR

improved reconcile for related vgpudevices

fixed up gpu status reconcile and env variable sent for VGPU plugin

Webhook to valid sriovgpu and vgpu changes

Cleaned up disable vgpu

fixed reconcile issue for device health to ensure plugin is shutdown

stage changes for remote command executor

fix up missing error return and remote execution

modified pod mutation logic when gpu's are present

changes to label nodes with custom labels to ensure driver is deployed to specific nodes only

leverage dynamic lookup for driver container

remove dapper files

refactor health check and pod mutation methods to pass codefactor complexity checks

refactor how vgpu / gpu status is reconcilled post reboots

add readiness check

cleanup un-used methods and address feedback from PR review

change mutating webhook to ensure only compute pod is mutated, changes to Dockerfile to include awk which is needed by sriov-manage in localExecutor mode

revert to SYS_RESOURCE condition

fix up vgpu device plugin shutdown sequence and fix up failing integration tests

fix revive linting errors
@ibrokethecloud ibrokethecloud merged commit 5417cee into harvester:master Jan 23, 2024
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants