initial work to reconcile devices from launcher pods #91
Problem:
Rancher allows provisioning of downstream clusters that leverage vGPUs.
When used with a machine deployment of more than one node, the actual GPU allocated can differ from the name specified, since KubeVirt uses the DeviceName field to calculate the launcher pod's resource requirements, which the scheduler then uses to identify suitable nodes.
Since the actual name is not used for device allocation, any arbitrary string can be supplied, and this makes it difficult to track vGPU allocation in the cluster.
Solution:
This PR introduces a minor change that leverages the pod environment variables set by the device plugin during the ContainerAllocateResponse. An additional VMI controller execs into the launcher pod to identify the device ID set for each resource, and subsequently maps it to the corresponding GPU/HostDevice resources. Once the devices are identified, an annotation is set on the VM describing the actual devices, from the pool of devices passed through, that were allocated to the VM.
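The core of this step is parsing the launcher pod's environment for device-plugin variables. A minimal sketch of that parsing, assuming the KubeVirt-style convention of `PCI_RESOURCE_*` / `MDEV_PCI_RESOURCE_*` env vars with comma-separated device IDs (the prefixes, names, and values below are illustrative, not taken from this PR):

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Env var prefixes under which device plugins are assumed to export the
// allocated device IDs into the launcher pod (verify against the
// device plugin actually in use).
var resourceEnvPrefixes = []string{"MDEV_PCI_RESOURCE_", "PCI_RESOURCE_"}

// parseDeviceEnv takes "KEY=VALUE" lines (e.g. the output of running
// `env` inside the virt-launcher pod) and returns a map from the
// resource-derived env suffix to the allocated device IDs.
func parseDeviceEnv(envLines []string) map[string][]string {
	out := map[string][]string{}
	for _, line := range envLines {
		key, val, ok := strings.Cut(line, "=")
		if !ok || val == "" {
			continue
		}
		for _, prefix := range resourceEnvPrefixes {
			if strings.HasPrefix(key, prefix) {
				// Multiple allocated devices are comma-separated.
				out[strings.TrimPrefix(key, prefix)] = strings.Split(val, ",")
				break
			}
		}
	}
	return out
}

func main() {
	// Illustrative env dump from a launcher pod.
	env := []string{
		"HOME=/root",
		"PCI_RESOURCE_NVIDIA_COM_GPU=0000:04:00.0,0000:05:00.0",
	}
	ids := parseDeviceEnv(env)
	keys := make([]string, 0, len(ids))
	for k := range ids {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Println(k, ids[k])
	}
}
```

The controller would obtain the env lines via a pod exec (client-go `remotecommand`) and then map each suffix back to the GPU/HostDevice entries in the VM spec.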
For example, for a host device this looks as follows:
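A hypothetical sketch of the resulting annotation (the annotation key prefix and JSON schema here are assumptions for illustration, not confirmed by this PR):

```yaml
metadata:
  annotations:
    # hypothetical key and schema, for illustration only
    harvesterhci.io/deviceAllocationDetails: |
      {"hostdevices":{"intel.com/QAT":["0000:3d:00.0"]}}
```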
For GPU devices it looks as follows:
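Again as a hypothetical sketch (key and schema assumed, resource name and PCI address illustrative):

```yaml
metadata:
  annotations:
    # hypothetical key and schema, for illustration only
    harvesterhci.io/deviceAllocationDetails: |
      {"gpus":{"nvidia.com/GA104GL_A2":["0000:08:00.0"]}}
```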
When a VM is shut down, the annotation is used to replace the `name` field for vGPU or host devices with the actual device names recorded in the deviceAllocationDetails annotation. This ensures that the VM can be edited and devices can be removed from the Harvester UI post provisioning.
Related Issue:
Test plan: