nvidia-operator-validator cannot check host driver installation state but cuda-sample:vectoradd-cuda runs successfully #1276

@davidshen84

Description

  • OS: Gentoo Linux
  • k8s: k3s v1.31.5+k3s1
  • nvidia driver version: 550.144.03
  • libnvidia-container version: 1.17.2
  • nvidia-container-toolkit version: 1.17.3
  helm upgrade --install --wait gpu-operator-1739580441 \
       --namespace gpu-operator --create-namespace \
       nvidia/gpu-operator \
       --version=v24.9.2 \
       --set driver.enabled=false,toolkit.enabled=false
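To confirm the values actually applied to the release (release name taken from the command above), the deployed chart values can be inspected with standard Helm commands:

```shell
# Show the user-supplied values for the release; driver.enabled and
# toolkit.enabled should both report false.
helm get values gpu-operator-1739580441 -n gpu-operator

# Show the full merged values (user-supplied plus chart defaults).
helm get values gpu-operator-1739580441 -n gpu-operator --all
```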

In the nvidia-operator-validator pod, the driver-validation container keeps logging the following message, while the other containers remain stuck in the initializing state.

Attempting to validate a driver container installation
failed to validate the driver, retrying after 5 seconds
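For reference, the validator's log above can be tailed with something like the following (the `app=nvidia-operator-validator` label selector is an assumption based on the operator's defaults; adjust to match your deployment):

```shell
# Tail the driver-validation init container of the validator pod.
kubectl logs -n gpu-operator \
  -l app=nvidia-operator-validator \
  -c driver-validation --tail=50

# Show the validator pod and its init container states.
kubectl get pods -n gpu-operator \
  -l app=nvidia-operator-validator \
  -o wide
```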

However, if I run a pod like the following (note that I removed the resource request and added runtimeClassName), it works.

  apiVersion: v1
  kind: Pod
  metadata:
    name: cuda-vectoradd
  spec:
    restartPolicy: OnFailure
    runtimeClassName: nvidia
    containers:
      - name: cuda-vectoradd
        image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04

The pod ran successfully with these messages in the log.

[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done

Note that I have set driver.enabled=false for the gpu-operator chart, yet the message still says "Attempting to validate a driver container installation". Should the validator instead validate the driver installed in my host OS?
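Since the driver is installed directly on the host, it may help to confirm it is visible from the node itself. A quick check on the Gentoo host (these are standard nvidia-smi/lsmod invocations, not operator-specific commands):

```shell
# Confirm the NVIDIA kernel module is loaded on the host.
lsmod | grep -E '^nvidia '

# Report the host driver version; should match the version
# noted above (550.144.03 in this report).
nvidia-smi --query-gpu=driver_version --format=csv,noheader
```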

Thanks
