I have an issue with the nvidia-gpu-operator: when I set a limit of `nvidia.com/gpu: 1`, my pod gets scheduled with a GPU that is already allocated to another container.
Additionally, I previously had trouble with containers seeing one additional GPU even though the limit was set to 1.
What I want: Only allocate a GPU that is not already in use by another pod.
What it does: Allocates a GPU that is already in use by another pod.
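For reference, a minimal pod spec illustrating how I request the GPU (the pod name, container name, and image are placeholders, not my actual workload):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test              # placeholder name
spec:
  restartPolicy: Never
  containers:
    - name: cuda-container    # placeholder name
      image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04  # placeholder image
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1   # expectation: exactly one GPU, not in use by any other pod
```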
Environment:
GPU model: H100, NVIDIA-SMI: 550.90.12, Driver Version: 550.90.12, CUDA Version: 12.4
I ended up exposing all GPUs to unprivileged containers and stopped using the `nvidia.com/gpu` resource altogether, to get rid of the extra GPUs that the device plugin was allocating seemingly at random (a sketch of this workaround is below). However, it would be great if there were a more granular approach to allocating GPUs. I also don't want to rely on `nvidia.com/gpu` because of GPU locality: if I set `nvidia.com/gpu: 2`, will it respect locality and allocate GPUs that are connected via NVLink?
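Roughly what my current workaround looks like: I skip the `nvidia.com/gpu` resource entirely and pin a specific device through `NVIDIA_VISIBLE_DEVICES`. The GPU UUID and names below are placeholders, and this assumes the NVIDIA container runtime is the default runtime on the node and is configured to honor the env var for unprivileged containers:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pinned            # placeholder name
spec:
  restartPolicy: Never
  containers:
    - name: cuda-container    # placeholder name
      image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04  # placeholder image
      command: ["nvidia-smi"]
      env:
        - name: NVIDIA_VISIBLE_DEVICES
          # placeholder UUID; the real one comes from `nvidia-smi -L` on the node
          value: "GPU-11111111-2222-3333-4444-555555555555"
```

This avoids the double-allocation, but it pushes GPU bookkeeping onto me instead of the scheduler, which is why I'd prefer a proper fix on the device-plugin side.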
Installation steps:
Would appreciate any help I can get, thank you