
WSLg/Cuda suddenly broken due to nvidia-smi unable to find GPU #9099

Open
devttebayo opened this issue Nov 1, 2022 · 45 comments

devttebayo commented Nov 1, 2022

Version

10.0.22000.1098

WSL Version

  • WSL 2
  • WSL 1

Kernel Version

5.15.68.1

Distro Version

Ubuntu 22.04

Other Software

WSL version: 0.70.5.0
WSLg version: 1.0.45
Direct3D version: 1.606.4
DXCore version: 10.0.25131.1002-220531-1700.rs-onecore-base2-hyp

Nvidia Driver: 526.47, Game Ready Driver, released 10/27/2022

Repro Steps

  1. Open a WSL terminal
  2. Execute the command nvidia-smi

Expected Behavior

The nvidia-smi utility dumps diagnostic details about the GPU.

nvidia-smi.exe on Windows is able to display the expected output:

┖[~]> nvidia-smi
Tue Nov  1 10:18:07 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 526.47       Driver Version: 526.47       CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ... WDDM  | 00000000:0B:00.0 Off |                  N/A |
|  0%   37C    P8    17W / 350W |   1619MiB / 12288MiB |      8%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

Actual Behavior

nvidia-smi on WSL/Ubuntu 22.04 outputs a generic error instead:

dattebayo@<NGP'd>:~/dev$ nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

Failed to properly shut down NVML: Driver Not Loaded

Diagnostic Logs

I'll admit I'm kinda dumb when it comes to Linux diagnostics, which is part of what brought me here. Here's what I've been able to gather from various Googling, though:

dpkg -l | grep nvidia

ii  libnvidia-compute-495:amd64     510.85.02-0ubuntu0.22.04.1              amd64        Transitional package for libnvidia-compute-510
ii  libnvidia-compute-510:amd64     510.85.02-0ubuntu0.22.04.1              amd64        NVIDIA libcompute package
rc  libnvidia-compute-520:amd64     520.56.06-0ubuntu0.20.04.1              amd64        NVIDIA libcompute package
ii  libnvidia-ml-dev:amd64          11.5.50~11.5.1-1ubuntu1                 amd64        NVIDIA Management Library (NVML) development files
ii  nvidia-cuda-dev:amd64           11.5.1-1ubuntu1                         amd64        NVIDIA CUDA development files
ii  nvidia-cuda-gdb                 11.5.114~11.5.1-1ubuntu1                amd64        NVIDIA CUDA Debugger (GDB)
rc  nvidia-cuda-toolkit             11.5.1-1ubuntu1                         amd64        NVIDIA CUDA development toolkit
ii  nvidia-cuda-toolkit-doc         11.5.1-1ubuntu1                         all          NVIDIA CUDA and OpenCL documentation
ii  nvidia-opencl-dev:amd64         11.5.1-1ubuntu1                         amd64        NVIDIA OpenCL development files
ii  nvidia-profiler                 11.5.114~11.5.1-1ubuntu1                amd64        NVIDIA Profiler for CUDA and OpenCL
ii  nvidia-visual-profiler          11.5.114~11.5.1-1ubuntu1                amd64        NVIDIA Visual Profiler for CUDA and OpenCL

lsmod | grep nvidia

No output

(Truncated) DxDiag Output

------------------
System Information
------------------
      Time of this report: 11/1/2022, 10:23:47
             Machine name: NGP'd
               Machine Id: {534C5435-EB65-464A-801F-79979E08B34E}
         Operating System: Windows 11 Pro 64-bit (10.0, Build 22000) (22000.co_release.210604-1628)
                 Language: English (Regional Setting: English)
      System Manufacturer: ASUS
             System Model: System Product Name
                     BIOS: 3601 (type: UEFI)
                Processor: AMD Ryzen 9 5950X 16-Core Processor             (32 CPUs), ~3.4GHz
                   Memory: 131072MB RAM
      Available OS Memory: 130980MB RAM
                Page File: 53817MB used, 96617MB available
              Windows Dir: C:\Windows
          DirectX Version: DirectX 12
      DX Setup Parameters: Not found
         User DPI Setting: 144 DPI (150 percent)
       System DPI Setting: 96 DPI (100 percent)
          DWM DPI Scaling: Disabled
                 Miracast: Available, no HDCP
Microsoft Graphics Hybrid: Not Supported
 DirectX Database Version: 1.2.2
           DxDiag Version: 10.00.22000.0653 64bit Unicode
...
---------------
Display Devices
---------------
           Card name: NVIDIA GeForce RTX 3080 Ti
        Manufacturer: NVIDIA
           Chip type: NVIDIA GeForce RTX 3080 Ti
            DAC type: Integrated RAMDAC
         Device Type: Full Device (POST)
          Device Key: Enum\PCI\VEN_10DE&DEV_2208&SUBSYS_261219DA&REV_A1
       Device Status: 0180200A [DN_DRIVER_LOADED|DN_STARTED|DN_DISABLEABLE|DN_NT_ENUMERATOR|DN_NT_DRIVER] 
 Device Problem Code: No Problem
 Driver Problem Code: Unknown
      Display Memory: Unknown
    Dedicated Memory: n/a
       Shared Memory: n/a
        Current Mode: Unknown
         HDR Support: Unknown
    Display Topology: Unknown
 Display Color Space: Unknown
     Color Primaries: Unknown
   Display Luminance: Unknown
         Driver Name: C:\Windows\System32\DriverStore\FileRepository\nv_dispi.inf_amd64_ade64cd54ec2f9ed\nvldumdx.dll,C:\Windows\System32\DriverStore\FileRepository\nv_dispi.inf_amd64_ade64cd54ec2f9ed\nvldumdx.dll,C:\Windows\System32\DriverStore\FileRepository\nv_dispi.inf_amd64_ade64cd54ec2f9ed\nvldumdx.dll,C:\Windows\System32\DriverStore\FileRepository\nv_dispi.inf_amd64_ade64cd54ec2f9ed\nvldumdx.dll
 Driver File Version: 31.00.0015.2647 (English)
      Driver Version: 31.0.15.2647
         DDI Version: unknown
      Feature Levels: Unknown
        Driver Model: WDDM 3.0
 Hardware Scheduling: DriverSupportState:Stable Enabled:True 
 Graphics Preemption: Pixel
  Compute Preemption: Dispatch
            Miracast: Not Supported by Graphics driver
      Detachable GPU: No
 Hybrid Graphics GPU: Discrete
      Power P-states: Not Supported
      Virtualization: Paravirtualization 
          Block List: No Blocks
  Catalog Attributes: Universal:False Declarative:True 
   Driver Attributes: Final Retail
    Driver Date/Size: 10/24/2022 5:00:00 PM, 772488 bytes
         WHQL Logo'd: Yes
     WHQL Date Stamp: Unknown
   Device Identifier: Unknown
           Vendor ID: 0x10DE
           Device ID: 0x2208
           SubSys ID: 0x261219DA
         Revision ID: 0x00A1
  Driver Strong Name: oem52.inf:0f066de3b91c4385:Section071:31.0.15.2647:pci\ven_10de&dev_2208
      Rank Of Driver: 00CF2001
         Video Accel: Unknown
         DXVA2 Modes: Unknown
      Deinterlace Caps: n/a
        D3D9 Overlay: Unknown
             DXVA-HD: Unknown
        DDraw Status: Enabled
          D3D Status: Not Available
          AGP Status: Enabled
       MPO MaxPlanes: Unknown
            MPO Caps: Unknown
         MPO Stretch: Unknown
     MPO Media Hints: Unknown
         MPO Formats: Unknown
    PanelFitter Caps: Unknown
 PanelFitter Stretch: Unknown
....
@devttebayo (Author)

Taking some shots in the dark here (mainly because I'm really motivated to fix this 😅)

Looking at the dmesg trace turns up something potentially interesting. I don't know if these ioctls are expected to fail...

[    3.894321] misc dxg: dxgk: dxgkio_query_adapter_info: Ioctl failed: -22
[    3.894829] misc dxg: dxgk: dxgkio_query_adapter_info: Ioctl failed: -22
[    3.895274] misc dxg: dxgk: dxgkio_query_adapter_info: Ioctl failed: -22
[    3.895634] misc dxg: dxgk: dxgkio_query_adapter_info: Ioctl failed: -2

These messages come immediately after some BAR assignment operations and a log warning about libcuda not being a symlink.

Something else I noticed is that dmesg goes quiet for a really long time, and then later there's more spew from dxg:

[   49.226483] hv_balloon: Max. dynamic memory size: 65488 MB
[ 3305.059848] misc dxg: dxgk: dxgkio_query_adapter_info: Ioctl failed: -22
[ 3305.060250] misc dxg: dxgk: dxgkio_query_adapter_info: Ioctl failed: -22
[ 3305.060464] misc dxg: dxgk: dxgkio_query_adapter_info: Ioctl failed: -22
[ 3305.060744] misc dxg: dxgk: dxgkio_query_adapter_info: Ioctl failed: -2
[ 3489.978602] misc dxg: dxgk: dxgkio_query_adapter_info: Ioctl failed: -22
[ 3489.979056] misc dxg: dxgk: dxgkio_query_adapter_info: Ioctl failed: -22
[ 3489.979465] misc dxg: dxgk: dxgkio_query_adapter_info: Ioctl failed: -22
[ 3489.979955] misc dxg: dxgk: dxgkio_query_adapter_info: Ioctl failed: -2
[ 3573.200318] misc dxg: dxgk: dxgkio_query_adapter_info: Ioctl failed: -22
[ 3573.200674] misc dxg: dxgk: dxgkio_query_adapter_info: Ioctl failed: -22
[ 3573.200914] misc dxg: dxgk: dxgkio_query_adapter_info: Ioctl failed: -22
[ 3573.201269] misc dxg: dxgk: dxgkio_query_adapter_info: Ioctl failed: -2
[ 3593.582798] misc dxg: dxgk: dxgkio_query_adapter_info: Ioctl failed: -22
[ 3593.583127] misc dxg: dxgk: dxgkio_query_adapter_info: Ioctl failed: -22
[ 3593.583354] misc dxg: dxgk: dxgkio_query_adapter_info: Ioctl failed: -22
[ 3593.583633] misc dxg: dxgk: dxgkio_query_adapter_info: Ioctl failed: -2
<EOF>

It looks like my issue might be related to (but not the same failure mode as?) #8937, possibly?
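
In case it helps anyone else reproduce this, a minimal way to correlate the errors with a specific nvidia-smi run (just a sketch, assuming the stock util-linux dmesg is available in the distro):

# in one WSL shell, follow the kernel log and filter for the dxg driver
sudo dmesg --follow | grep --line-buffered -i dxgk

# in a second WSL shell, trigger the failure
/usr/lib/wsl/lib/nvidia-smi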

benhillis added the GPU label Nov 1, 2022
iourit (Collaborator) commented Nov 1, 2022

Taking some shots in the dark here (mainly because I'm really motivated to fix this 😅)
[…]
It looks like my issue might be related to (but not the same failure mode as?) #8937, possibly?

The error messages are most likely benign. How was the nvidia-smi utility installed? I installed it using "sudo apt install nvidia-utils-520" and it works for me just fine with the same host driver version.
Iouri


elsaco commented Nov 2, 2022

@devttebayo it is not sound practice to add any nvidia-utils packages or drivers on the WSL side. The nvidia-smi located at /usr/lib/wsl/lib/nvidia-smi is exported from the Windows side, being part of the Windows NVIDIA driver. It is located at c:\windows\System32\lxss\lib\ along with some NVIDIA libraries used by WSL.
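
A quick way to sanity-check this (a sketch based on the paths above) is to confirm which nvidia-smi a shell actually resolves and that no distro-side driver utilities are shadowing it:

which -a nvidia-smi                 # should list /usr/lib/wsl/lib/nvidia-smi
ls -l /usr/lib/wsl/lib/nvidia-smi   # the copy mapped in from the Windows driver
dpkg -l | grep nvidia-utils         # ideally empty on WSL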

@devttebayo (Author)

@elsaco Thanks for explaining that, I should have known that it wouldn't be wise to add utils intended for native hw to a virtualized guest. Being honest, I can't remember at which point I installed them (or if it was a side effect of careless debug copy+paste...)

More embarrassing, it appears I somehow lost the repro? I'm not exactly sure how though, seeing as I rebooted both WSL and my host PC a few times prior to opening this issue. Ah well, thanks for the helpful pointers! I think I'm going to go ahead and close this out for now. Sorry for the noise!

devttebayo reopened this Nov 10, 2022
@devttebayo (Author)

Reopening this because it looks like I hit a repro again.

Currently in a state where WSL is unable to detect my GPU and running a wsl.exe -d Ubuntu --shutdown didn't resolve the issue. I verified that I don't have the nvidia-utils package installed either.

Going to hope my PC doesn't reboot and lose the repro in case someone has ideas of next steps I could take to investigate.

@tvwenger

I am encountering this issue as well. I start WSL via the Task Scheduler on login, and nvidia-smi reports an error connecting with the driver. If I manually shut down WSL and restart it, then nvidia-smi successfully contacts the driver and all works fine. It seems that something is broken on the first WSL launch.

@anubhavashok

I have the same issue as described in #9134, so you are not alone.

I haven't installed any external NVIDIA libraries in WSL either. I can run nvidia-smi.exe in WSL successfully, but the nvidia-smi binaries at /usr/lib/wsl/lib/nvidia-smi and /mnt/c/Windows/System32/lxss/lib/nvidia-smi both produce the same error as above.

Manually shutting down and restarting doesn't seem to yield any results either.

@anubhavashok

So, I was able to get this to work on my end (possibly temporarily) after trying a few things. I'm not sure what exactly got it to work but I did the following.

  1. Roll back the NVIDIA driver to 522.06 and install CUDA 11.8 on Windows (it was still failing after this)
  2. Install nvidia-settings in WSL (it was still failing after this)
  3. wsl.exe -d Ubuntu --shutdown (tried this a couple times and it was still failing)
  4. wsl.exe -d Ubuntu --shutdown + wsl.exe --terminate Ubuntu (after this, it started working again)

I think the last step is what got it to work, let me know if you can reproduce it.
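
For reference, the step 4 combo spelled out as commands (a sketch only, assuming the distro is named Ubuntu; run from PowerShell or cmd on the Windows side):

wsl.exe --shutdown
wsl.exe --terminate Ubuntu
wsl.exe -d Ubuntu -- nvidia-smi    # start a fresh session and re-test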


tvwenger commented Nov 11, 2022

I fixed my problem by reinstalling my graphics driver.

Never mind... Initially, after reinstalling the graphics driver and rebooting, there was no issue. After rebooting again, however, the issue reappeared. nvidia-smi works on Windows but not on WSL. Manually shutting down WSL and restarting it fixes the problem.

@devttebayo (Author)

Just updated to the latest 526.86 driver (released today) and ran the shutdown + terminate combo @anubhavashok called out above, with no luck.

Was able to verify nvidia-smi in Windows is still working correctly.

@devttebayo (Author)

So this is a strange development... I updated to WSL 0.70.8 and I'm now in a state where nvidia-smi works in some WSL windows but not others.

What I mean is:

  • Launch the 'Ubuntu on Windows' app from the Start Menu; this loads a standalone terminal for the Ubuntu instance
  • In this standalone terminal, run nvidia-smi. At this point I observe the expected output
  • Launch my WSL Ubuntu on Windows terminal as a tab in an existing Windows Terminal instance
  • Run nvidia-smi in the WSL Windows Terminal tab. Observe the NVML Driver load error??

What's super strange to me is I can have the two terminals open side by side and run nvidia-smi repeatedly with the same results in each terminal. I guess this is a workaround for me, but I have no idea why it works?


cq01 commented Nov 13, 2022

I've had the same problem since September. I can run CUDA in Docker on WSL2, but not in the kali-linux distro.
I found that:

  • If I start from Windows Terminal (Admin): nvidia-smi fails like this
  • If I start from Windows Terminal (without Admin): nvidia-smi succeeds

@devttebayo (Author)

@cq01 Just tried this to verify and that's 100% the difference in my setup above - my Windows Terminal always launches as Admin (and nvidia-smi fails 100% of the time)

Re-launching without Admin rights gets nvidia-smi working. At least I have a workaround I understand now :)


NanoNM commented Nov 13, 2022

I have the same problem.
I think my problem is related to WSLg: when I installed WSLg, gedit stopped working, and after I uninstalled WSLg it went back to working.

I am a rookie; I don't know why.

environment:
Win32NT 10.0.22621.0 Microsoft Windows NT 10.0.22621.0

WSL version: 0.70.4.0
Kernel version: 5.15.68.1
WSLg version: 1.0.45
MSRDC version: 1.2.3575
Direct3D version: 1.606.4
DXCore version: 10.0.25131.1002-220531-1700.rs-onecore-base2-hyp
Windows version: 10.0.22621.819

wang@wang:~$ glxinfo -B
name of display: :0
display: :0 screen: 0
direct rendering: Yes
Extended renderer info (GLX_MESA_query_renderer):
Vendor: Mesa/X.org (0xffffffff)
Device: llvmpipe (LLVM 14.0.6, 256 bits) (0xffffffff) <========= here
Version: 22.2.3
Accelerated: no
Video memory: 7873MB
Unified memory: no
Preferred profile: core (0x1)
Max core profile version: 4.5
Max compat profile version: 4.5
Max GLES1 profile version: 1.1
Max GLES[23] profile version: 3.2
OpenGL vendor string: Mesa/X.org
OpenGL renderer string: llvmpipe (LLVM 14.0.6, 256 bits)
OpenGL core profile version string: 4.5 (Core Profile) Mesa 22.2.3 - kisak-mesa PPA
OpenGL core profile shading language version string: 4.50
OpenGL core profile context flags: (none)
OpenGL core profile profile mask: core profile

OpenGL version string: 4.5 (Compatibility Profile) Mesa 22.2.3 - kisak-mesa PPA
OpenGL shading language version string: 4.50
OpenGL context flags: (none)
OpenGL profile mask: compatibility profile

OpenGL ES profile version string: OpenGL ES 3.2 Mesa 22.2.3 - kisak-mesa PPA
OpenGL ES profile shading language version string: OpenGL ES GLSL ES 3.20


NanoNM commented Nov 13, 2022

I've had the same problem since September: I can run CUDA in Docker on WSL2, but not in the kali-linux distro. I found that:

  • If I start from Windows Terminal (Admin): nvidia-smi fails like this
  • If I start from Windows Terminal (without Admin): nvidia-smi succeeds

Thanks, brother. This thing took me a whole night. So ridiculous, what a headache.


hanzec commented Nov 18, 2022

In my case, nvidia-smi only worked when executed from Windows Terminal as Admin.


babeal commented Nov 26, 2022

This is the same behavior I am observing as well. In addition, when running WSL from an elevated Windows Terminal session, I have to use sudo for a GPU test (sudo docker run --gpus all nvcr.io/nvidia/k8s/cuda-sample:nbody nbody -gpu -benchmark) or I get errors. When running WSL from a non-elevated Windows Terminal session, I no longer need sudo to use --gpus. Would love to know why this is.
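
For clarity, the GPU smoke test referred to above, as run from the two kinds of sessions (same image and arguments as in the comment; whether sudo is needed appears to depend on how the terminal was elevated):

# from an elevated Windows Terminal session (sudo was required here)
sudo docker run --gpus all nvcr.io/nvidia/k8s/cuda-sample:nbody nbody -gpu -benchmark

# from a non-elevated session (sudo was not needed)
docker run --gpus all nvcr.io/nvidia/k8s/cuda-sample:nbody nbody -gpu -benchmark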


fzhan commented Dec 6, 2022

So, I was able to get this to work on my end (possibly temporarily) after trying a few things. I'm not sure what exactly got it to work but I did the following.

  1. Roll back the NVIDIA driver to 522.06 and install CUDA 11.8 on Windows (it was still failing after this)
  2. Install nvidia-settings in WSL (it was still failing after this)
  3. wsl.exe -d Ubuntu --shutdown (tried this a couple times and it was still failing)
  4. wsl.exe -d Ubuntu --shutdown + wsl.exe --terminate Ubuntu (after this, it started working again)

I think the last step is what got it to work, let me know if you can reproduce it.

Just tried reinstalling 522.06 and CUDA 11.8, then did all the shutdown and terminate steps; it still produces

Failed to initialize NVML: Unknown Error


fzhan commented Dec 6, 2022

Just trying to link all the relevant threads here:

canonical/microk8s#3024
#9254
#8174
#9134

This cannot be a coincidence.

iourit (Collaborator) commented Dec 6, 2022

Just trying to link all the relevant threads here:

canonical/microk8s#3024 #9254 #8174 #9134

This cannot be a coincidence.

@fzhan

nvidia-smi needs to be from the Windows driver package. It is mapped to /usr/lib/wsl/lib/nvidia-smi.

There is an issue when nvidia-smi and other CUDA applications are run from a WSL window, depending on whether it was started as Administrator or not.
For example, if you start WSL as Administrator the very first time after boot, nvidia-smi works; if you then start another WSL window as non-Administrator, it fails. The opposite is also true: if you start WSL the very first time as non-Administrator, nvidia-smi works, and if you then start another WSL window as Administrator, nvidia-smi fails. This is under investigation. It might be related to your case.
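
A minimal way to observe the pattern described above (a sketch; it assumes an Ubuntu distro, PowerShell on the Windows side, and no WSL session running yet after the reboot):

# from a NON-elevated PowerShell window (the first WSL session after boot)
wsl.exe -d Ubuntu -- nvidia-smi    # works in this scenario

# then from an ELEVATED PowerShell window, without shutting WSL down in between
wsl.exe -d Ubuntu -- nvidia-smi    # fails with the NVML driver error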


fzhan commented Dec 9, 2022

nvidia-smi needs to be from the Windows driver package. It is mapped to /usr/lib/wsl/lib/nvidia-smi. […] This is under investigation. It might be related to your case.

I have nvidia-smi installed only on the Windows side, and had WSL installed as Administrator; that is, the user is literally named "Administrator" and is an Administrator-level account.
I have also tried creating a "non-Admin" account and redoing the entire WSL setup under that account; it still fails.

I've noticed a couple of similarities in these issues:

  1. Latest Windows 11 with WSL up to date.
  2. Latest NVIDIA driver, or at least newer than 520.
  3. CUDA 11.8 or 12 (the latest).
  4. wsl-ubuntu set up per the instructions on the NVIDIA website, though some are using it on Ubuntu 22.04 where the instructions cover 20.04.

@TMBJ-Jerry

I've had the same problem since September: I can run CUDA in Docker on WSL2, but not in the kali-linux distro. I found that:

  • If I start from Windows Terminal (Admin): nvidia-smi fails like this
  • If I start from Windows Terminal (without Admin): nvidia-smi succeeds

My situation is the same as yours, it's amazing!

@CharlesSL

I've had the same problem since September: I can run CUDA in Docker on WSL2, but not in the kali-linux distro. I found that:

  • If I start from Windows Terminal (Admin): nvidia-smi fails like this
  • If I start from Windows Terminal (without Admin): nvidia-smi succeeds

Thanks. I'm using Windows Terminal Preview; I turned admin mode off, restarted the terminal, and yeah, it's OK now.


fzhan commented Jan 3, 2023

@CharlesSL may I know your version of Windows?

@CharlesSL

@fzhan win11 insider build 25267


hijkzzz commented Jan 3, 2023

same problem

I think this is a regression bug.


fzhan commented Jan 4, 2023

@CharlesSL Cool, thanks. I have the issue with the latest Win 11, fresh-installed, not upgraded.

@Nabu-thinker-ru

In my case nvidia-smi works fine

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.65       Driver Version: 527.56       CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:01:00.0  On |                  N/A |
| N/A   53C    P8     7W /  74W |    110MiB /  4096MiB |      7%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A        29      G   /Xwayland                       N/A      |
+-----------------------------------------------------------------------------+

But PyTorch cannot allocate GPU memory:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 3.16 GiB (GPU 0; 4.00 GiB total capacity; 2.55 GiB already allocated; 0 bytes free; 2.59 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

and exactly at that moment these lines appear in dmesg:

[  619.372029] misc dxg: dxgk: dxgkio_query_adapter_info: Ioctl failed: -22
[ 1095.227415] misc dxg: dxgk: dxgkio_query_adapter_info: Ioctl failed: -22
[ 1095.255181] misc dxg: dxgk: dxgkio_reserve_gpu_va: Ioctl failed: -75
[ 1100.182822] misc dxg: dxgk: dxgkio_query_adapter_info: Ioctl failed: -22
[ 1105.528132] misc dxg: dxgk: dxgkio_make_resident: Ioctl failed: -12
[ 1105.658121] misc dxg: dxgk: dxgkio_make_resident: Ioctl failed: -12
[ 1105.747456] misc dxg: dxgk: dxgkio_make_resident: Ioctl failed: -12
[ 1105.835194] misc dxg: dxgk: dxgkio_make_resident: Ioctl failed: -12
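
Not a fix for the dxg ioctl errors, but as the allocator message itself suggests, the fragmentation side of this can sometimes be worked around by capping the allocator's split size before launching the script (a sketch; the 128 MiB value is only an example and it does not add physical GPU memory):

export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
python3 train.py    # train.py is a placeholder for whatever script hits the OOM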


dp1795 commented Jan 5, 2023

The WSLg problems started right after upgrading to the Store version of WSL.

Run under administrator terminal:
nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

Failed to properly shut down NVML: Driver Not Loaded

$ glxinfo -B
name of display: :0
display: :0 screen: 0
direct rendering: Yes
Extended renderer info (GLX_MESA_query_renderer):
Vendor: Mesa/X.org (0xffffffff)
Device: llvmpipe (LLVM 13.0.1, 256 bits) (0xffffffff)
Version: 22.0.5
Accelerated: no
Video memory: 31950MB
Unified memory: no
Preferred profile: core (0x1)
Max core profile version: 4.5
Max compat profile version: 4.5
Max GLES1 profile version: 1.1
Max GLES[23] profile version: 3.2
OpenGL vendor string: Mesa/X.org
OpenGL renderer string: llvmpipe (LLVM 13.0.1, 256 bits)
OpenGL core profile version string: 4.5 (Core Profile) Mesa 22.0.5
OpenGL core profile shading language version string: 4.50
OpenGL core profile context flags: (none)
OpenGL core profile profile mask: core profile

OpenGL version string: 4.5 (Compatibility Profile) Mesa 22.0.5
OpenGL shading language version string: 4.50
OpenGL context flags: (none)
OpenGL profile mask: compatibility profile

OpenGL ES profile version string: OpenGL ES 3.2 Mesa 22.0.5
OpenGL ES profile shading language version string: OpenGL ES GLSL ES 3.20

Running under a non-admin user, nvidia-smi runs but segfaults at the end, and applications trying to use the GPU afterwards have various error and exit problems.

Output from Standard Terminal:

$ nvidia-smi
Wed Jan 4 22:21:33 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.65       Driver Version: 527.56       CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:01:00.0 Off |                  N/A |
| N/A   42C    P8    10W / 115W |     33MiB /  6144MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A        23      G   /Xwayland                       N/A      |
|    0   N/A  N/A        23      G   /Xwayland                       N/A      |
|    0   N/A  N/A        26      G   /Xwayland                       N/A      |
+-----------------------------------------------------------------------------+
$ glxinfo -B
name of display: :0
display: :0 screen: 0
direct rendering: Yes
Extended renderer info (GLX_MESA_query_renderer):
Vendor: Microsoft Corporation (0xffffffff)
Device: D3D12 (NVIDIA GeForce RTX 3060 Laptop GPU) (0xffffffff)
Version: 22.0.5
Accelerated: yes
Video memory: 38627MB
Unified memory: no
Preferred profile: core (0x1)
Max core profile version: 3.3
Max compat profile version: 3.3
Max GLES1 profile version: 1.1
Max GLES[23] profile version: 3.1
OpenGL vendor string: Microsoft Corporation
OpenGL renderer string: D3D12 (NVIDIA GeForce RTX 3060 Laptop GPU)
OpenGL core profile version string: 3.3 (Core Profile) Mesa 22.0.5
OpenGL core profile shading language version string: 3.30
OpenGL core profile context flags: (none)
OpenGL core profile profile mask: core profile

OpenGL version string: 3.3 (Compatibility Profile) Mesa 22.0.5
OpenGL shading language version string: 3.30
OpenGL context flags: (none)
OpenGL profile mask: compatibility profile

OpenGL ES profile version string: OpenGL ES 3.1 Mesa 22.0.5
OpenGL ES profile shading language version string: OpenGL ES GLSL ES 3.10

Segmentation fault

System Version Info:
wsl --version
WSL version: 1.0.3.0
Kernel version: 5.15.79.1
WSLg version: 1.0.47
MSRDC version: 1.2.3575
Direct3D version: 1.606.4
DXCore version: 10.0.25131.1002-220531-1700.rs-onecore-base2-hyp
Windows version: 10.0.22000.1335

All worked before installing the WSL Store version.


hanzec commented Jan 10, 2023

A similar problem also existed in snapd:

ubuntu/WSL#318

@Ziy1-Tan

I've had the same problem since September: I can run CUDA in Docker on WSL2, but not in the kali-linux distro. I found that:

  • If I start from Windows Terminal (Admin): nvidia-smi fails like this
  • If I start from Windows Terminal (without Admin): nvidia-smi succeeds

Thx, bro! It works for me:)


MattWolf74 commented Jan 31, 2023

Make sure you set the distro to be the default in WSL.

  1. wsl --shutdown
  2. wsl --setdefault <distro name>
    then start the distro
    wsl -d <distro name>
    and run nvidia-smi

This worked for me. Needless to say, make sure nvidia-smi works fine from within Windows before trying any of the above.
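
Spelled out (a sketch, assuming the distro is named Ubuntu):

wsl --shutdown
wsl --setdefault Ubuntu
wsl -d Ubuntu
# then, inside the distro shell that opens:
nvidia-smi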


EricJin2002 commented Feb 22, 2023

In my case, nvidia-smi only worked when executed from Windows Terminal as Admin.

Same for me at first, until I closed both windows and restarted them; I was surprised to find nvidia-smi only works as non-admin now...

By the way, I believe the problem is associated with WSLg, since my nvidia-smi was always functioning well until I modified .wslgconf.


arcayi commented Feb 23, 2023

This may help:
netsh winsock reset

iourit (Collaborator) commented Mar 9, 2023

This issue should be fixed in WSL version 1.1.3: https://github.com/microsoft/WSL/releases/tag/1.1.3

@kyohei-utf

I've had the same problem since September: I can run CUDA in Docker on WSL2, but not in the kali-linux distro. I found that:

  • If I start from Windows Terminal (Admin): nvidia-smi fails like this
  • If I start from Windows Terminal (without Admin): nvidia-smi succeeds

Thank you!

What I tried: connecting via SSH from another machine on the local network to the Ubuntu distribution in WSL2 through the Windows machine; the problem does not occur there. Of course, no Windows Terminal (PowerShell) administrator execution is involved in that path, so it is not surprising.


jacobdang commented Apr 1, 2023

During a program compilation, I need to link with a system library, so I set the environment variable like this:
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib/x86_64-linux-gnu

I found that after this setting, nvidia-smi no longer works and generates an error message similar to the one in this discussion. However, after removing this environment variable setting, things go back to normal again.

A pretty irritating issue; I spent quite some time debugging it.
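
If the extra library path is still needed for the build, one possible workaround (a sketch, based on the /usr/lib/wsl/lib mapping discussed earlier in this thread) is to keep the WSL driver library directory ahead of the system path so the driver libraries are not shadowed:

export LD_LIBRARY_PATH=/usr/lib/wsl/lib:$LD_LIBRARY_PATH:/usr/lib/x86_64-linux-gnu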


fzhan commented Apr 25, 2023

With everything updated, nvidia-smi still reports: "Failed to initialize NVML: Unknown Error"

iourit (Collaborator) commented Apr 25, 2023

With everything updated, nvidia-smi still reports:"Failed to initialize NVML: Unknown Error"

Based on the error message it looks like a separate issue.
What do you mean by "everything updated"? What are the repro steps?

@Desquenazi-creator

Using CUDA 12.2 with WSL, I was hitting the same issue: nvidia-smi worked from the Command Prompt but not in WSL. After clicking through to another page, I came across this solution, which worked:

  1. Run wsl --shutdown to stop all running WSL instances.
  2. Run wsl --setdefault <distro name> to set your desired distribution as the default. Replace <distro name> with the name of your WSL distribution, such as "Ubuntu".
  3. Start the distribution by running wsl -d <distro name>. Again, replace <distro name> with the name of your chosen distribution. Finally, run nvidia-smi to check if it works within the WSL environment.


gserbanut commented Aug 18, 2023

What works for me...

OS: Windows 11
GPU: NVIDIA GeForce GTX 1650
WSL: 2
WSL distro: Ubuntu 20.04 (it should work on 22.04 as well)
NVIDIA drivers: R536
CUDA: 11.8 (required by TensorFlow 2.13.0)
cuDNN: 8.9 (compatibility requirement with CUDA 11.8)
TensorRT: 8.6 (compatibility requirement with CUDA 11.8)

Add /mnt/c/Windows/System32/lxss/lib to LD_LIBRARY_PATH as below:
export LD_LIBRARY_PATH=/mnt/c/Windows/System32/lxss/lib:$LD_LIBRARY_PATH

NB! The lxss lib path must be before the other lib paths.

This should fix the problem.

My environment:

(tf2) root@DESKTOP-WIN11:~# cat /etc/environment
PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/usr/local/cuda/bin:/usr/include"
LD_LIBRARY_PATH="/mnt/c/Windows/System32/lxss/lib:/usr/local/cuda/lib64:/usr/local/cuda/lib64/stubs:/usr/lib/x86_64-linux-gnu"

and my .bashrc related to this:

(tf2) root@DESKTOP-WIN11:~# grep environment .bashrc
set -a; source /etc/environment; set +a;

Test on TensorFlow 2.13.0:

(tf2) root@DESKTOP-WIN11:~# python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
2023-08-18 09:06:27.738396: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-08-18 09:06:31.057671: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-08-18 09:06:31.096912: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-08-18 09:06:31.098038: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

Do not mind NUMA warnings. WSL does not support NUMA at this time.

Enjoy your stable environment! At least, for the time being. :)

@rohit7044

Found this on the WSL installation page. In section 4.1 it mentions that:

Root user on bare metal (not containers) will not find nvidia-smi at the expected location. Use /usr/lib/wsl/lib/nvidia-smi or manually add /usr/lib/wsl/lib/ to the PATH.
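
A sketch of the corresponding shell setup for that case:

export PATH=/usr/lib/wsl/lib:$PATH
# or call the mapped binary directly:
/usr/lib/wsl/lib/nvidia-smi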


chozzz commented Dec 2, 2023

What works for me...
[…]
Add /mnt/c/Windows/System32/lxss/lib to LD_LIBRARY_PATH as below: export LD_LIBRARY_PATH=/mnt/c/Windows/System32/lxss/lib:$LD_LIBRARY_PATH
[…]
Enjoy your stable environment! At least, for the time being. :)

Thanks, this works for me on WSL2 when having problems with import torch.


Habush commented Jan 2, 2024

Add /mnt/c/Windows/System32/lxss/lib to LD_LIBRARY_PATH as below:
export LD_LIBRARY_PATH=/mnt/c/Windows/System32/lxss/lib:$LD_LIBRARY_PATH
NB! The lxss lib path must be before the other lib paths.

Thank you @gserbanut! After countless hours of trying to fix this (including installing/uninstalling the CUDA toolkit multiple times), your simple suggestion fixed my issue. Now nvidia-smi works and PyTorch can see the GPU.


nhjoy commented Jun 8, 2024

Hi! So here’s what I did to fix the issue. First, I installed the Nvidia driver 537.58. After giving my system a quick reboot, that pesky “Segmentation fault” disappeared. Just to be on the safe side, I then updated to the latest driver and guess what? No more problems! Hope this helps! 😊
