nvidia-operator-validator cannot check host driver installation state but cuda-sample:vectoradd-cuda runs successfully #1276

@davidshen84

Description

  • OS: Gentoo Linux
  • k8s: k3s v1.31.5+k3s1
  • nvidia driver version: 550.144.03
  • libnvidia-container version: 1.17.2
  • nvidia-container-toolkit version: 1.17.3
  helm upgrade --install --wait gpu-operator-1739580441 \
       --namespace gpu-operator --create-namespace \
       nvidia/gpu-operator \
       --version=v24.9.2 \
       --set driver.enabled=false,toolkit.enabled=false
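To confirm the values actually applied to the release (release name taken from the command above), the deployed chart values can be inspected with standard Helm commands:

```shell
# Show the user-supplied values for the release; driver.enabled and
# toolkit.enabled should both report false.
helm get values gpu-operator-1739580441 -n gpu-operator

# Show the full merged values (user-supplied plus chart defaults).
helm get values gpu-operator-1739580441 -n gpu-operator --all
```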

In the nvidia-operator-validator pod, the driver-validation container keeps logging the following message, while the other containers remain stuck in the initializing state.

Attempting to validate a driver container installation
failed to validate the driver, retrying after 5 seconds
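For reference, the validator's log above can be tailed with something like the following (the `app=nvidia-operator-validator` label selector is an assumption based on the operator's defaults; adjust to match your deployment):

```shell
# Tail the driver-validation init container of the validator pod.
kubectl logs -n gpu-operator \
  -l app=nvidia-operator-validator \
  -c driver-validation --tail=50

# Show the validator pod and its init container states.
kubectl get pods -n gpu-operator \
  -l app=nvidia-operator-validator \
  -o wide
```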

However, if I run a pod like the following (note that I removed the resource request and added runtimeClassName), it works.

  apiVersion: v1
  kind: Pod
  metadata:
    name: cuda-vectoradd
  spec:
    restartPolicy: OnFailure
    runtimeClassName: nvidia
    containers:
      - name: cuda-vectoradd
        image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04

The pod ran successfully with these messages in the log.

[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done

Note that I have set driver.enabled=false for the gpu-operator chart, yet the message still says "Attempting to validate a driver container installation". Should the validator instead validate the driver installed in my host OS?
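Since the driver is installed directly on the host, it may help to confirm it is visible from the node itself. A quick check on the Gentoo host (these are standard nvidia-smi/lsmod invocations, not operator-specific commands):

```shell
# Confirm the NVIDIA kernel module is loaded on the host.
lsmod | grep -E '^nvidia '

# Report the host driver version; should match the version
# noted above (550.144.03 in this report).
nvidia-smi --query-gpu=driver_version --format=csv,noheader
```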

Thanks
