GPU already used, showing up in multiple containers #1021

Open
astranero opened this issue Sep 30, 2024 · 1 comment

I have an issue with the nvidia-gpu-operator: when I set the limit "nvidia.com/gpu: 1", my pod gets scheduled with a GPU that is already allocated to another container.
Additionally, I previously had trouble with containers showing one additional GPU even though the limit was set to 1.

What I want: Only allocate a GPU that is not already in use by another pod.
What it does: Allocates a GPU that is already in use by another pod.
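For reference, this is roughly how the GPU is requested and how the overlap shows up when comparing the GPU UUIDs that two single-GPU pods actually see (the pod names and CUDA image tag below are placeholders, not my exact manifests):

# Minimal sketch of a pod requesting a single GPU through the device plugin
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test-a                                        # placeholder name
spec:
  runtimeClassName: nvidia
  containers:
  - name: cuda
    image: nvcr.io/nvidia/cuda:12.4.0-base-ubuntu22.04    # placeholder image
    command: ["sleep", "infinity"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF

# With correct allocation, the UUIDs printed by two such pods should never overlap;
# in my case they sometimes do.
kubectl exec gpu-test-a -- nvidia-smi -L
kubectl exec gpu-test-b -- nvidia-smi -L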

Environment:
GPU model: H100, NVIDIA-SMI 550.90.12, Driver Version: 550.90.12, CUDA Version: 12.4

Installation steps:

  1. Installing gpu-operator resources
microk8s.helm3 install gpu-operator -n gpu-operator-resources --create-namespace nvidia/gpu-operator --version v24.6.1 \
  --set toolkit.env[0].name=CONTAINERD_CONFIG \
  --set toolkit.env[0].value=/var/snap/microk8s/current/args/containerd-template.toml \
  --set toolkit.env[1].name=CONTAINERD_SOCKET \
  --set toolkit.env[1].value=/var/snap/microk8s/common/run/containerd.sock \
  --set toolkit.env[2].name=CONTAINERD_RUNTIME_CLASS \
  --set toolkit.env[2].value=nvidia \
  --set toolkit.env[3].name=CONTAINERD_SET_AS_DEFAULT \
  --set-string toolkit.env[3].value=true \
  --set cdi.default=false \
  --set cdi.enabled=true \
  --set toolkit.enabled=true \
  --set driver.enabled=false
  2. Patching CDI manually
kubectl patch clusterpolicies.nvidia.com/cluster-policy --type='json' \
    -p='[{"op": "replace", "path": "/spec/cdi/default", "value":true}]'
  3. Removing the default runtime so that containers are not handed an extra GPU
vi /var/snap/microk8s/current/args/containerd-template.toml
default_runtime_name = "nvidia"   # removed this line
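After these steps, this is roughly how I check the resulting state (the node name is a placeholder, and I am assuming the device plugin pods carry the usual app=nvidia-device-plugin-daemonset label):

# Verify the CDI patch took effect on the cluster policy
kubectl get clusterpolicies.nvidia.com cluster-policy -o jsonpath='{.spec.cdi}'

# Check how many GPUs the device plugin advertises on the node
kubectl describe node <node-name> | grep nvidia.com/gpu

# The device plugin logs show which device IDs were handed out on each Allocate call
kubectl logs -n gpu-operator-resources -l app=nvidia-device-plugin-daemonset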

I would appreciate any help I can get, thank you.

astranero (Author) commented Oct 9, 2024

I ended up exposing all GPUs to unprivileged containers and stopped using the "nvidia.com/gpu" resource, to avoid the additional GPUs that get allocated seemingly at random by the device plugin. However, it would be great if there were a solution with a more granular approach to allocating GPUs. I do not want to use "nvidia.com/gpu" because of the GPU locality issue: if I set "nvidia.com/gpu: 2", will it respect locality and allocate GPUs that have NVLink interconnects?

microk8s helm install nvidia/gpu-operator --generate-name -n gpu-operator-resources --version 24.6.1 $HELM_OPTIONS \
  --set toolkit.env[0].name=CONTAINERD_CONFIG \
  --set toolkit.env[0].value=/var/snap/microk8s/current/args/containerd-template.toml \
  --set toolkit.env[1].name=CONTAINERD_SOCKET \
  --set toolkit.env[1].value=/var/snap/microk8s/common/run/containerd.sock \
  --set toolkit.env[2].name=CONTAINERD_RUNTIME_CLASS \
  --set toolkit.env[2].value=nvidia \
  --set toolkit.env[3].name=CONTAINERD_SET_AS_DEFAULT \
  --set-string toolkit.env[3].value=true \
  --set toolkit.env[4].name=ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED \
  --set-string toolkit.env[4].value=true \
  --set toolkit.env[5].name=ACCEPT_NVIDIA_VISIBLE_DEVICES_AS_VOLUME_MOUNTS \
  --set-string toolkit.env[5].value=false \
  --set devicePlugin.env[0].name=DEVICE_LIST_STRATEGY \
  --set devicePlugin.env[0].value="envvar" \
  --set driver.enabled=false
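
For completeness, a rough sketch of how a workload pod picks its devices under this setup (the pod name, image and device indices are placeholders): NVIDIA_VISIBLE_DEVICES is honoured for unprivileged pods because of the ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED toolkit option above, and nvidia-smi topo -m is what I use on the node to check which GPUs actually share NVLink.

# Pick GPUs explicitly via NVIDIA_VISIBLE_DEVICES instead of the nvidia.com/gpu resource
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload                                      # placeholder name
spec:
  runtimeClassName: nvidia
  containers:
  - name: cuda
    image: nvcr.io/nvidia/cuda:12.4.0-base-ubuntu22.04    # placeholder image
    command: ["sleep", "infinity"]
    env:
    - name: NVIDIA_VISIBLE_DEVICES
      value: "0,1"                                        # placeholder indices (GPU UUIDs also work)
EOF

# Topology matrix: shows which GPU pairs are connected over NVLink
nvidia-smi topo -m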
