
WSL2 Support #318

Open
mchikyt3 opened this issue Feb 2, 2022 · 10 comments

Comments

mchikyt3 commented Feb 2, 2022

Hi, I wonder if it's possible to use the gpu-operator in a single-node Microk8s cluster hosted on a wsl2 Ubuntu distribution. Thanks.

shivamerla (Contributor) commented:

@elezar to comment if this is supported by our container-toolkit.

elezar (Member) commented Mar 28, 2022

Hi @mchikyt3, the combination you mention is untested by us, so I cannot provide a concrete answer.

The NVIDIA Container Toolkit, which ensures that a launched container includes the devices and libraries required to use GPUs, does offer some support for WSL2. Note, however, that there may be some use cases that do not work as expected.

Also note that I am not sure whether the other operands, such as GPU Feature Discovery or the NVIDIA Device Plugin, will function as expected.

valxv commented Jul 7, 2023

It appears that GPU Feature Discovery does not work properly. @elezar, are there any plans to address this? I have no problems running CUDA code in containers on WSL2 with Docker or Podman, but it doesn't work with any of the Kubernetes distributions I've tried. I posted several logs from my laptop in this MicroK8s thread and would be grateful if someone could help me solve this issue.

Maybe the problem could be solved by creating a couple of symlinks.
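
For reference, a heavily hedged sketch of the kind of symlinks I have in mind; the paths assume the standard WSL2 driver layout, where the Windows driver exposes its user-space libraries under /usr/lib/wsl/lib, so adjust them to your system:

    # Assumption: the NVIDIA user-space libraries provided by the Windows driver
    # live under /usr/lib/wsl/lib (the usual WSL2 location).
    # Link them into a directory that ldconfig already searches so NVML-based tools can find them.
    sudo ln -sf /usr/lib/wsl/lib/libcuda.so.1 /usr/lib/x86_64-linux-gnu/libcuda.so.1
    sudo ln -sf /usr/lib/wsl/lib/libnvidia-ml.so.1 /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1
    sudo ldconfig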

wizpresso-steve-cy-fan commented Jul 25, 2023

Can someone fix this?

(combined from similar events): Error: failed to generate container "0520a1a018b798ce299be6171c3daa405d549219457b6c1e42cb1774b1b92e9e" spec: failed to generate spec: path "/" is mounted on "/" but it is not a shared or slave mount

    - name: host-root
      hostPath:
        path: /

This is not working in WSL2; I confirmed this on k0s.

EDIT: fixed, just run mount --make-rshared /
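
To keep that fix across WSL2 restarts, one option is the [boot] command in /etc/wsl.conf (assuming a WSL version on Windows 11 that supports the [boot] section), e.g.:

    # Appends a boot command that re-applies the mount propagation fix on startup.
    cat <<'EOF' | sudo tee -a /etc/wsl.conf
    [boot]
    command = mount --make-rshared /
    EOF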

wizpresso-steve-cy-fan commented Oct 6, 2023

Also make sure you edit the labels on the specific WSL2 node to trick the GPU operator:

    feature.node.kubernetes.io/pci-10de.present: 'true'
    nvidia.com/device-plugin.config: RTX-4070-Ti # needed because GFD is not available
    nvidia.com/gpu.count: '1'
    nvidia.com/gpu.deploy.container-toolkit: 'true'
    nvidia.com/gpu.deploy.dcgm: 'true' # optional
    nvidia.com/gpu.deploy.dcgm-exporter: 'true' # optional
    nvidia.com/gpu.deploy.device-plugin: 'true' 
    nvidia.com/gpu.deploy.driver: 'false' # need special treatments
    nvidia.com/gpu.deploy.gpu-feature-discovery: 'false' # incompatible with WSL2
    nvidia.com/gpu.deploy.node-status-exporter: 'false' # optional
    nvidia.com/gpu.deploy.operator-validator: 'true'
    nvidia.com/gpu.present: 'true'
    nvidia.com/gpu.replicas: '16'

You can either auto-insert those labels if you use k0sctl or add them manually once the node is onboarded.
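
If you add them manually, a minimal sketch (the node name is a placeholder, and the label set mirrors the list above):

    # <wsl-node> is whatever `kubectl get nodes` reports for your WSL2 node.
    kubectl label node <wsl-node> --overwrite \
      feature.node.kubernetes.io/pci-10de.present=true \
      nvidia.com/gpu.present=true \
      nvidia.com/gpu.count=1 \
      nvidia.com/gpu.deploy.container-toolkit=true \
      nvidia.com/gpu.deploy.device-plugin=true \
      nvidia.com/gpu.deploy.operator-validator=true \
      nvidia.com/gpu.deploy.driver=false \
      nvidia.com/gpu.deploy.gpu-feature-discovery=false \
      nvidia.com/device-plugin.config=RTX-4070-Ti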

The driver and container-toolkit are technically optional since WSL2 already installs all the prerequisites... but we still need to trick the system.
We will need to create the following files:

$ touch /run/nvidia/validations/host-driver-ready
$ touch /run/nvidia/validations/toolkit-ready # if you skipped validator
$ touch /run/nvidia/validations/cuda-ready # if you skipped validator
$ touch /run/nvidia/validations/plugin-ready # if you skipped validator

This effectively bypasses the GPU operator's checks, so the operator will register the node as compatible with the nvidia runtime and run on it. You can use a DaemonSet to create these files, as sketched below.
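
A minimal sketch of such a DaemonSet, assuming the operator's namespace is gpu-operator; the name, image, and node selector are my own choices, so adapt as needed:

    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: wsl2-fake-validations   # hypothetical name
      namespace: gpu-operator       # assumes the operator's namespace
    spec:
      selector:
        matchLabels:
          app: wsl2-fake-validations
      template:
        metadata:
          labels:
            app: wsl2-fake-validations
        spec:
          nodeSelector:
            nvidia.com/gpu.present: "true"
          containers:
          - name: touch-validations
            image: busybox:1.36
            command: ["sh", "-c"]
            args:
            - |
              touch /run/nvidia/validations/host-driver-ready \
                    /run/nvidia/validations/toolkit-ready \
                    /run/nvidia/validations/cuda-ready \
                    /run/nvidia/validations/plugin-ready
              while true; do sleep 3600; done
            volumeMounts:
            - name: nvidia-validations
              mountPath: /run/nvidia/validations
          volumes:
          - name: nvidia-validations
            hostPath:
              path: /run/nvidia/validations
              type: DirectoryOrCreate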

Note also that if you have preinstalled drivers, you don't need to touch the files at all, but then you need to figure out how to pass the validation condition.

I'm using k0s in my company's local cluster under WSL2, but this should apply to all k8s distributions that run under WSL2.

By the way, these are the Helm values for the GPU operator that should work on k0s:

cdi:
  enabled: false
daemonsets:
  tolerations:
  - effect: NoSchedule
    key: nvidia.com/gpu
    operator: Exists
  - effect: NoSchedule
    key: k8s.wizpresso.com/wsl-node
    operator: Exists
devicePlugin:
  config:
    name: time-slicing-config
driver:
  enabled: true
operator:
  defaultRuntime: containerd
toolkit:
  enabled: true
  env:
  - name: CONTAINERD_CONFIG
    value: /etc/k0s/containerd.d/nvidia.toml
  - name: CONTAINERD_SOCKET
    value: /run/k0s/containerd.sock
  - name: CONTAINERD_RUNTIME_CLASS
    value: nvidia
  - name: CONTAINERD_SET_AS_DEFAULT
    value: "false"

AntonOfTheWoods commented:

@wizpresso-steve-cy-fan, you wouldn't happen to have an install doc or script you could share for getting k0s set up on WSL2, by any chance?

wizpresso-steve-cy-fan commented Nov 6, 2023

@AntonOfTheWoods let me push the changes to GitLab first
https://gitlab.com/nvidia/kubernetes/gpu-operator/-/merge_requests/881
https://gitlab.com/nvidia/container-toolkit/container-toolkit/-/merge_requests/481

alexeadem commented Jan 30, 2024

@AntonOfTheWoods see comment for full instructions on how to make this work locally

  • Windows 11
  • WSL2
  • Docker cgroup v2
  • Nvidia GPU operator
  • Kubeflow

on kind or qbo Kubernetes

cbrendanprice commented:

@alexeadem thanks so much for this! I was dragging my feet on creating images for wizpresso-steve-cy-fan's PRs, so this saved me some time.

I am curious to know if you've had success running a CUDA workload with this implemented. I am able to successfully get the gpu-operator Helm chart running with these values:

cdi:
  enabled: false
daemonsets:
  tolerations:
  - effect: NoSchedule
    key: nvidia.com/gpu
    operator: Exists
devicePlugin:
  image: k8s-device-plugin
  repository: eadem
  version: v0.14.3-ubuntu20.04
driver:
  enabled: true
operator:
  defaultRuntime: containerd
  image: gpu-operator
  repository: eadem
  version: v23.9.1-ubi8
runtimeClassName: "nvidia"
toolkit:
  enabled: true
  env:
  - name: CONTAINERD_RUNTIME_CLASS
    value: nvidia
  - name: CONTAINERD_SET_AS_DEFAULT
    value: "false"
  image: container-toolkit
  repository: eadem
  version: 1.14.3-ubuntu20.04
validator:
  driver:
    env:
    - name: DISABLE_DEV_CHAR_SYMLINK_CREATION
      value: "true"
  image: gpu-operator-validator
  repository: eadem
  version: v23.9.1-ubi8

My pods are now successfully getting past preemption when specifying GPU limits. However, when I try to run a GPU workload (e.g. nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04) it fails with the error:

Failed to allocate device vector A (error code CUDA driver version is insufficient for CUDA runtime version)!

Just curious to know if you're having this problem or not.

Oh, and not that it matters much, but just a heads-up that the custom Docker image you linked for the operator in your comment actually links to your custom validator image.

Thanks again!

alexeadem commented Jan 30, 2024

No problem @cbrendanprice, thanks for the Docker links. I fixed it.

The NVIDIA driver, CUDA, toolkit, and operator are tightly coupled when it comes to versions, so that error should be easily fixed by using the right versions. Here is an example of the versions needed for CUDA 12.2, plus a full example of a CUDA workload both in Kubeflow and directly in a pod under the operator. I don't see that error with the eadem images.

https://ce.qbo.io/#/ai_and_ml
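
A quick way to check which driver version the WSL2 host actually exposes, so the toolkit/operator/image versions can be matched against it:

    $ nvidia-smi --query-gpu=driver_version --format=csv,noheader
    $ nvidia-smi   # the header also shows the highest CUDA version the driver supports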

Try this one instead; the link you provided points to an old version:
https://ce.qbo.io/#/ai_and_ml?id=_3-deploy-vector-add
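
For reference, a minimal sketch of a vector-add Pod along those lines, assuming the operator has registered the nvidia runtime class; the image tag is the one mentioned earlier in this thread, so swap it for one that matches your CUDA version:

    apiVersion: v1
    kind: Pod
    metadata:
      name: cuda-vectoradd
    spec:
      restartPolicy: OnFailure
      runtimeClassName: nvidia   # created by the toolkit via CONTAINERD_RUNTIME_CLASS
      containers:
      - name: cuda-vectoradd
        image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04
        resources:
          limits:
            nvidia.com/gpu: 1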
