Include /usr/bin/nvidia-smi for nvidia-kmod extension #385

Open
rothgar opened this issue May 14, 2024 · 4 comments

Comments

@rothgar
Member

rothgar commented May 14, 2024

When I attempt to run the NVIDIA gpu-operator, it fails to fully initialize. From what I can tell, this is because the nvidia-validator tries to run the nvidia-smi binary from the host in /usr/bin/.

NAMESPACE     NAME                                                          READY   STATUS     RESTARTS      AGE
kube-system   coredns-85b955d87b-9cx56                                      1/1     Running    0             70m
kube-system   coredns-85b955d87b-nfdgb                                      1/1     Running    0             70m
kube-system   gpu-feature-discovery-jn6ps                                   0/1     Init:0/1   0             49m
kube-system   gpu-operator-7bbf8bb6b7-g4pd2                                 1/1     Running    0             50m
kube-system   gpu-operator-node-feature-discovery-gc-79d6d968bb-jkn2s       1/1     Running    0             50m
kube-system   gpu-operator-node-feature-discovery-master-6d9f8d497c-xvttn   1/1     Running    0             50m
kube-system   gpu-operator-node-feature-discovery-worker-6cgnv              1/1     Running    0             50m
kube-system   gpu-operator-node-feature-discovery-worker-tdc8j              1/1     Running    0             50m
kube-system   kube-apiserver-up                                             1/1     Running    0             69m
kube-system   kube-controller-manager-up                                    1/1     Running    1 (70m ago)   68m
kube-system   kube-flannel-ffftw                                            1/1     Running    0             69m
kube-system   kube-flannel-q972c                                            1/1     Running    0             69m
kube-system   kube-proxy-mrc75                                              1/1     Running    0             69m
kube-system   kube-proxy-n5qdc                                              1/1     Running    0             69m
kube-system   kube-scheduler-up                                             1/1     Running    2 (70m ago)   68m
kube-system   nvidia-dcgm-exporter-jlqbb                                    0/1     Init:0/1   0             49m
kube-system   nvidia-device-plugin-daemonset-q89xh                          0/1     Init:0/1   0             49m
kube-system   nvidia-operator-validator-jfs6m                               0/1     Init:0/4   0             49m

I installed the operator via helm with the following values.yaml

driver:
  enabled: false

toolkit:
  enabled: false
  env:
    - name: CONTAINERD_CONFIG
      value: /etc/cri/conf.d/nvidia-container-runtime.part
    - name: CONTAINERD_SET_AS_DEFAULT
      value: "true"

This should skip installing drivers and changing containerd config (already included with the extensions), but it apparently doesn't skip checking them.

The chart was installed with

helm install gpu-operator \
    -n kube-system nvidia/gpu-operator --values values.yaml

I tried manually touching the files that the validator creates, but it still attempts to execute the nvidia-smi command:

running command chroot with args [/run/nvidia/driver nvidia-smi]
chroot: failed to run command 'nvidia-smi': No such file or directory

more information in the repo
https://github.com/NVIDIA/gpu-operator/tree/master
and installation docs
https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html#operator-install-guide

@jfroy
Contributor

jfroy commented Jul 1, 2024

The Talos Nvidia driver extension installs nvidia-smi under /usr/local/bin, which is a somewhat non-standard location for an Nvidia driver component (other components are under /usr/local/lib, which is also non-standard; this will come up later if you read on). The current release version of nvidia-validator will not find nvidia-smi at that path. However, the main branch of the operator and operator-validator has significantly different code (to handle driver container images). If you override the images for operator and operator-validator to use one of the daily CI builds on GitHub, you should get past that issue.
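
To confirm which location actually holds the binary on a given node, here is a minimal sketch. It assumes a shell in a privileged debug pod with the host filesystem mounted at some root (Talos itself ships no shell). The function name `find_driver_binary` and the `/host` mount point are hypothetical and not part of the operator or of Talos.

```shell
# Hypothetical helper, not part of gpu-operator or Talos.
# Usage: find_driver_binary ROOT BINARY DIR...
# Prints the first DIR/BINARY (path relative to ROOT) that is executable
# and returns 0; returns 1 if BINARY is in none of the directories.
find_driver_binary() {
    root="$1"
    bin="$2"
    shift 2
    for dir in "$@"; do
        if [ -x "${root}${dir}/${bin}" ]; then
            printf '%s\n' "${dir}/${bin}"
            return 0
        fi
    done
    return 1
}

# Example call, assuming the host root is mounted at /host:
#   find_driver_binary /host nvidia-smi /usr/bin /usr/local/bin
```

A non-zero exit status means the binary was in none of the listed directories, which is the situation the validator runs into when it only looks in the standard location.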

However, you will then find that the device plug-in will not find a core CUDA library as part of its driver detection process. This is because of the aforementioned custom install path for other driver components. Furthermore, Talos applies a patch to the container toolkit to change the ldcache path (which the toolkit uses to find libraries), because Talos needs to maintain separate glibc and musl LD caches and thus stores them in custom locations. You will need to patch the device plug-in, build and publish a custom image, and use that image to get past that issue. Something like this:

diff --git a/vendor/github.com/NVIDIA/nvidia-container-toolkit/internal/ldcache/ldcache.go b/vendor/github.com/NVIDIA/nvidia-container-toolkit/internal/ldcache/ldcache.go
index 2f6de2fe..35f62f45 100644
--- a/vendor/github.com/NVIDIA/nvidia-container-toolkit/internal/ldcache/ldcache.go
+++ b/vendor/github.com/NVIDIA/nvidia-container-toolkit/internal/ldcache/ldcache.go
@@ -33,7 +33,7 @@ import (
 	"github.com/NVIDIA/nvidia-container-toolkit/internal/lookup/symlinks"
 )
 
-const ldcachePath = "/etc/ld.so.cache"
+const ldcachePath = "/usr/local/glibc/etc/ld.so.cache"
 
 const (
 	magicString1 = "ld.so-1.7.0"
diff --git a/vendor/github.com/NVIDIA/nvidia-container-toolkit/internal/lookup/library.go b/vendor/github.com/NVIDIA/nvidia-container-toolkit/internal/lookup/library.go
index 7f5cf7c8..85fd1db9 100644
--- a/vendor/github.com/NVIDIA/nvidia-container-toolkit/internal/lookup/library.go
+++ b/vendor/github.com/NVIDIA/nvidia-container-toolkit/internal/lookup/library.go
@@ -36,6 +36,7 @@ func NewLibraryLocator(opts ...Option) Locator {
 
 	// If search paths are already specified, we return a locator for the specified search paths.
 	if len(b.searchPaths) > 0 {
+		b.logger.Infof("Returning symlink locator with paths: %v", b.searchPaths)
 		return NewSymlinkLocator(
 			WithLogger(b.logger),
 			WithSearchPaths(b.searchPaths...),
@@ -56,6 +57,7 @@ func NewLibraryLocator(opts ...Option) Locator {
 			"/lib/aarch64-linux-gnu",
 			"/lib/x86_64-linux-gnu/nvidia/current",
 			"/lib/aarch64-linux-gnu/nvidia/current",
+			"/usr/local/lib",
 		}...),
 	)
 	// We construct a symlink locator for expected library locations.
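
The two things this patch changes (the ld.so.cache path and the library search directories) can be checked against a node's filesystem before building a custom image. This is a minimal sketch, assuming a shell in a privileged debug pod with the host filesystem mounted at some root. `find_ldcache` and `find_library` are hypothetical helpers, and the library name is left as a parameter because the thread does not say which CUDA library the plug-in looks for.

```shell
# Hypothetical helpers, not part of the device plug-in or the toolkit.

# Usage: find_ldcache ROOT
# Prints the first ld.so.cache path (relative to ROOT) that exists, trying
# the stock path and then the Talos glibc path from the patch.
find_ldcache() {
    root="$1"
    for cache in /etc/ld.so.cache /usr/local/glibc/etc/ld.so.cache; do
        if [ -f "${root}${cache}" ]; then
            printf '%s\n' "${cache}"
            return 0
        fi
    done
    return 1
}

# Usage: find_library ROOT NAME_PREFIX DIR...
# Prints every file under ROOT whose name starts with NAME_PREFIX in the
# given directories; returns 1 if nothing matched.
find_library() {
    root="$1"
    name="$2"
    shift 2
    found=1
    for dir in "$@"; do
        for candidate in "${root}${dir}/${name}"*; do
            # An unmatched glob stays literal, so check that the file exists.
            if [ -e "$candidate" ]; then
                printf '%s\n' "${candidate#"$root"}"
                found=0
            fi
        done
    done
    return "$found"
}

# Example calls, assuming the host root is mounted at /host and that the
# library of interest is libcuda.so (a guess, not confirmed in the thread):
#   find_ldcache /host
#   find_library /host libcuda.so /lib/x86_64-linux-gnu/nvidia/current /usr/local/lib
```

If the library only shows up under /usr/local/lib, that matches the reason for the extra search path in the patch.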

With the previously mentioned upcoming support for driver container images in the GPU operator, Talos may want to consider reworking their Nvidia extensions to deliver all the components as container images. That should hopefully provide a more supported and long-term stable solution.

@TimJones
Member

TimJones commented Jul 1, 2024

Hi @jfroy, one issue here is that Talos requires signed drivers, and the signing key is ephemeral to each build process, which is why each release of Talos has a specific corresponding release of each system extension.
We would love to be able to work through some potential ideas with you if you'd be interested in joining our Slack community, or even be available for a call to run through them.

@jfroy
Contributor

jfroy commented Jul 1, 2024

> Hi @jfroy, one issue here is that Talos requires signed drivers, and the signing key is ephemeral to each build process, which is why each release of Talos has a specific corresponding release of each system extension.
>
> We would love to be able to work through some potential ideas with you if you'd be interested in joining our Slack community, or even be available for a call to run through them.

Yeah I like that Talos provides a chain of trust. You would need a per-release driver container just like you have a per-release extension.

I work at Nvidia, but I only speak for myself here. It would be inappropriate to engage beyond the occasional comment and bug fix PR on GitHub. I will however reach out to the folks working on our container technologies.

@TimJones
Member

TimJones commented Jul 1, 2024

> I will however reach out to the folks working on our container technologies.

That would be greatly appreciated, and thank you for reaching out in the first instance.
