GPU already used, showing up in multiple containers #1021

Open
astranero opened this issue Sep 30, 2024 · 1 comment

I have an issue with the nvidia-gpu-operator: when I set the limit "nvidia.com/gpu: 1", my pod gets scheduled with a GPU that is already allocated to another container.
Additionally, I previously had trouble with containers showing one additional GPU even though the limit was set to 1.

What I want: Only allocate a GPU that is not already in use by another pod.
What it does: Allocates a GPU that is already in use by another pod.
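For reference, this is roughly how the GPU is requested and how the overlap shows up when comparing the GPU UUIDs that two single-GPU pods actually see (the pod names and CUDA image tag below are placeholders, not my exact manifests):

# Minimal sketch of a pod requesting a single GPU through the device plugin
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test-a                                        # placeholder name
spec:
  runtimeClassName: nvidia
  containers:
  - name: cuda
    image: nvcr.io/nvidia/cuda:12.4.0-base-ubuntu22.04    # placeholder image
    command: ["sleep", "infinity"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF

# With correct allocation, the UUIDs printed by two such pods should never overlap;
# in my case they sometimes do.
kubectl exec gpu-test-a -- nvidia-smi -L
kubectl exec gpu-test-b -- nvidia-smi -L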

Environment:
GPU model: H100, NVIDIA-SMI 550.90.12, Driver Version: 550.90.12, CUDA Version: 12.4

Installation steps:

  1. Installing gpu-operator resources
microk8s.helm3 install gpu-operator -n gpu-operator-resources --create-namespace nvidia/gpu-operator --version v24.6.1 \
  --set toolkit.env[0].name=CONTAINERD_CONFIG \
  --set toolkit.env[0].value=/var/snap/microk8s/current/args/containerd-template.toml \
  --set toolkit.env[1].name=CONTAINERD_SOCKET \
  --set toolkit.env[1].value=/var/snap/microk8s/common/run/containerd.sock \
  --set toolkit.env[2].name=CONTAINERD_RUNTIME_CLASS \
  --set toolkit.env[2].value=nvidia \
  --set toolkit.env[3].name=CONTAINERD_SET_AS_DEFAULT \
  --set-string toolkit.env[3].value=true \
  --set cdi.default=false \
  --set cdi.enabled=true \
  --set toolkit.enabled=true \
  --set driver.enabled=false
  2. Patching CDI manually
kubectl patch clusterpolicies.nvidia.com/cluster-policy --type='json' \
    -p='[{"op": "replace", "path": "/spec/cdi/default", "value":true}]'
  3. Removing the default runtime so that containers are not handed an extra GPU
vi /var/snap/microk8s/current/args/containerd-template.toml
default_runtime_name = "nvidia"   # removed this line
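After these steps, this is roughly how I check the resulting state (the node name is a placeholder, and I am assuming the device plugin pods carry the usual app=nvidia-device-plugin-daemonset label):

# Verify the CDI patch took effect on the cluster policy
kubectl get clusterpolicies.nvidia.com cluster-policy -o jsonpath='{.spec.cdi}'

# Check how many GPUs the device plugin advertises on the node
kubectl describe node <node-name> | grep nvidia.com/gpu

# The device plugin logs show which device IDs were handed out on each Allocate call
kubectl logs -n gpu-operator-resources -l app=nvidia-device-plugin-daemonset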

I would appreciate any help I can get, thank you.

astranero (Author) commented Oct 9, 2024

I ended up exposing all GPUs to unprivileged containers and stopped using the "nvidia.com/gpu" resource, to avoid the additional GPUs that get allocated seemingly at random by the device plugin. However, it would be great if there were a solution with a more granular approach to allocating GPUs. I do not want to use "nvidia.com/gpu" because of the GPU locality issue: if I set "nvidia.com/gpu: 2", will it respect locality and allocate GPUs that have NVLink interconnects?

microk8s helm install nvidia/gpu-operator --generate-name -n gpu-operator-resources --version 24.6.1 $HELM_OPTIONS \
  --set toolkit.env[0].name=CONTAINERD_CONFIG \
  --set toolkit.env[0].value=/var/snap/microk8s/current/args/containerd-template.toml \
  --set toolkit.env[1].name=CONTAINERD_SOCKET \
  --set toolkit.env[1].value=/var/snap/microk8s/common/run/containerd.sock \
  --set toolkit.env[2].name=CONTAINERD_RUNTIME_CLASS \
  --set toolkit.env[2].value=nvidia \
  --set toolkit.env[3].name=CONTAINERD_SET_AS_DEFAULT \
  --set-string toolkit.env[3].value=true \
  --set toolkit.env[4].name=ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED \
  --set-string toolkit.env[4].value=true \
  --set toolkit.env[5].name=ACCEPT_NVIDIA_VISIBLE_DEVICES_AS_VOLUME_MOUNTS \
  --set-string toolkit.env[5].value=false \
  --set devicePlugin.env[0].name=DEVICE_LIST_STRATEGY \
  --set devicePlugin.env[0].value="envvar" \
  --set driver.enabled=false
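
For completeness, a rough sketch of how a workload pod picks its devices under this setup (the pod name, image and device indices are placeholders): NVIDIA_VISIBLE_DEVICES is honoured for unprivileged pods because of the ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED toolkit option above, and nvidia-smi topo -m is what I use on the node to check which GPUs actually share NVLink.

# Pick GPUs explicitly via NVIDIA_VISIBLE_DEVICES instead of the nvidia.com/gpu resource
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload                                      # placeholder name
spec:
  runtimeClassName: nvidia
  containers:
  - name: cuda
    image: nvcr.io/nvidia/cuda:12.4.0-base-ubuntu22.04    # placeholder image
    command: ["sleep", "infinity"]
    env:
    - name: NVIDIA_VISIBLE_DEVICES
      value: "0,1"                                        # placeholder indices (GPU UUIDs also work)
EOF

# Topology matrix: shows which GPU pairs are connected over NVLink
nvidia-smi topo -m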
