support for sriovgpu and vgpu management #60
Hi, is sriovgpu also supported? I haven't used sriovgpu, only passthrough and NVIDIA vGPU.
I've read the HEP (harvester/harvester#4833) and reviewed all the code (I went commit by commit because it's easier to understand that way). This all looks well thought out, and it does what it says. Also the tests are passing :-) But I have to add a disclaimer: as I'm still relatively new to the codebase, there are things I don't really understand (e.g. exactly how KubeVirt device plugins work), and as it's a big review there might be stuff I missed.
I've got a couple of small suggestions and a couple of questions/comments, but in general LGTM.
LGTM, just please help to answer some questions, thanks.
LGTM, thanks.
Just need to fix CI.
updated gpuhelper and generated client/controller for gpu crds
initial logic for SRIOVGPUDevice controller
initial wiring logic for gpu/vgpu device setup
initial wiring and reconcile of sriovGPU and vGPU devices
fixed up reconcile of vgpu devices
reconcile of gpu/vgpu devices
added device plugin for vgpu
fixed up device health in vgpu plugin
added integration to upgrade kubevirt CR
improved reconcile for related vgpudevices
fixed up gpu status reconcile and env variable sent for VGPU plugin
Webhook to valid sriovgpu and vgpu changes
Cleaned up disable vgpu
fixed reconcile issue for device health to ensure plugin is shutdown
stage changes for remote command executor
fix up missing error return and remote execution
modified pod mutation logic when gpu's are present
changes to label nodes with custom labels to ensure driver is deployed to specific nodes only
leverage dynamic lookup for driver container
remove dapper files
refactor health check and pod mutation methods to pass codefactor complexity checks
refactor how vgpu / gpu status is reconcilled post reboots
add readiness check
cleanup un-used methods and address feedback from PR review
change mutating webhook to ensure only compute pod is mutated, changes to Dockerfile to include awk which is needed by sriov-manage in localExecutor mode
revert to SYS_RESOURCE condition
fix up vgpu device plugin shutdown sequence and fix up failing integration tests
fix revive linting errors
IMPORTANT: Please do not create a Pull Request without creating an issue first.
Problem:
This PR introduces the initial work to allow management of NVIDIA GPUs capable of SR-IOV vGPUs.
Solution:
Related Issue:
harvester/harvester#2764
Test plan: