- OS: Gentoo Linux
- k8s: k3s v1.31.5+k3s1
- nvidia driver version: 550.144.03
- libnvidia-container version: 1.17.2
- nvidia-container-toolkit version: 1.17.3
I installed the gpu-operator chart with:

```shell
helm upgrade --install --wait gpu-operator-1739580441 \
  --namespace gpu-operator --create-namespace \
  nvidia/gpu-operator \
  --version=v24.9.2 \
  --set driver.enabled=false,toolkit.enabled=false
```
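One way to confirm the overrides were actually applied is to read the user-supplied values back from the release (release name as used above):

```shell
# Show only the values overridden at install time for this release;
# driver.enabled and toolkit.enabled should both report false.
helm get values gpu-operator-1739580441 --namespace gpu-operator
```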
In the `nvidia-operator-validator` pod, the `driver-validation` container keeps logging the following message, while the other containers are stuck in the initializing state.
```
Attempting to validate a driver container installation
failed to validate the driver, retrying after 5 seconds\n
```
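For reference, a sketch of one way to pull those logs; the label selector is an assumption about the default labels the operator puts on the validator pod, so adjust it to whatever `kubectl get pods -n gpu-operator` shows:

```shell
# Find the validator pod and tail the driver-validation init container.
kubectl get pods -n gpu-operator -l app=nvidia-operator-validator
kubectl logs -n gpu-operator -l app=nvidia-operator-validator \
  -c driver-validation --tail=50
```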
However, if I run a pod like the following (note that I removed the resource request and added `runtimeClassName`):
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  runtimeClassName: nvidia
  containers:
    - name: cuda-vectoradd
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04
```
The pod ran successfully with these messages in the log:

```
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
```
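A minimal sketch of running that manifest (the file name `cuda-vectoradd.yaml` is just for illustration):

```shell
# Apply the manifest above and read the pod's output once it completes.
kubectl apply -f cuda-vectoradd.yaml
kubectl logs cuda-vectoradd
```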
Note that I have set `driver.enabled=false` for the gpu-operator chart, but the message still says "Attempting to validate a driver container installation". Shouldn't the validator instead validate the driver installed in my host OS?
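For what it's worth, a sketch of checks on the node itself; the `/run/nvidia/driver` path is an assumption about where the validator looks for a driver-container install, while the driver here was installed through the host package manager:

```shell
# Host-installed driver responds normally.
nvidia-smi

# Path typically used by driver-container installs; expected to be empty or
# absent here because the driver lives on the host, not in a container.
ls /run/nvidia/driver
```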
Thanks