
Installation issues #9

Closed
wd15 opened this issue May 15, 2018 · 37 comments

@wd15
Contributor

wd15 commented May 15, 2018

I tried to install amgx / pyamgx using this Nix recipe. I encountered three problems while testing with demo.py.

  • After installation I need to place import numpy as the first import in demo.py. Otherwise, I get the following error.
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "__init__.pxd", line 163, in init pyamgx
  File "/nix/store/mq134cl5nfiy422cjzvjms90az1zxnwh-python2.7-numpy-1.14.0/lib/python2.7/site-packages/numpy/__init__.py", line 142, in <module>
    from . import add_newdocs
...
ImportError: 
Importing the multiarray numpy extension module failed.  Most
likely you are trying to import a failed build of numpy.
If you're working with a numpy git repo, try `git clean -xdf` (removes all
files not under version control).  Otherwise reinstall numpy.
  • The configuration in demo.py on line 6 needs to be changed to cfg = pyamgx.Config().create_from_file(os.environ['AMGX_DIR']+'/lib/configs/core/FGMRES_AGGREGATION.json'), since I'm accessing the JSON file under lib/ of the AMGX install.

  • When running demo.py I now get the following error.

AMGX version 2.0.0.130-opensource
Built on May 14 2018, 23:06:38
Failed while initializing CUDA runtime in cudaRuntimeGetVersion
Variable 'solver' not registered
Converting config string to current config version
Error parsing parameter string: Incorrect config entry (number of equal signs is not 1) :  "config_version": 2

Error parsing parameter string obtained from file: 
Traceback (most recent call last):
  File "demo.py", line 8, in <module>
    cfg = pyamgx.Config().create_from_file(os.environ['AMGX_DIR']+'/lib/configs/core/FGMRES_AGGREGATION.json')
  File "pyamgx/Config.pyx", line 49, in pyamgx.Config.create_from_file
  File "pyamgx/Errors.pyx", line 62, in pyamgx.check_error
pyamgx.AMGXError: Incorrect amgx configuration provided.

I'm using version 6cb23fed266 of amgx and version df32133 of pyamgx. Are those versions compatible?

@shwina
Owner

shwina commented May 15, 2018

Thanks.

  1. I am still trying to play with nix and figure this one out.

  2. AMGX_DIR is supposed to be set to the cloned AMGX repo directory, but yes, it can also be set to the AMGX install directory (whose lib/ subdirectory holds the configs in this case). I will try to improve the wording in the setup instructions about what exactly AMGX_DIR is; see the sketch at the end of this comment.

  3. This error means that pyamgx was not correctly initialized. Within the nix environment, I see the following error when running pyamgx.initialize:

Failed while initializing CUDA runtime in cudaRuntimeGetVersion

This might indicate some issue with the CUDA libraries in nix, or that they are incompatible with our systems. Might have to look into this deeper.
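
To illustrate (2), the two conventions differ roughly as follows (the store path below is a hypothetical placeholder):

# AMGX_DIR pointing at the cloned source tree:
export AMGX_DIR="$HOME/AMGX"
ls "$AMGX_DIR"/core/configs/FGMRES_AGGREGATION.json

# AMGX_DIR pointing at the install prefix (the Nix case), where configs sit under lib/:
export AMGX_DIR=/nix/store/<hash>-AmgX
ls "$AMGX_DIR"/lib/configs/core/FGMRES_AGGREGATION.json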

@shwina
Owner

shwina commented May 15, 2018

This last issue definitely seems to be unrelated to (py)amgx, as I'm unable to run the AMGX example program:

[nix-shell:/nix/store/2mmp5wjg7f829lvgxg7gfl5fcyav9bf5-AmgX/lib/examples]$ pwd
/nix/store/2mmp5wjg7f829lvgxg7gfl5fcyav9bf5-AmgX/lib/examples

[nix-shell:/nix/store/2mmp5wjg7f829lvgxg7gfl5fcyav9bf5-AmgX/lib/examples]$ ./amgx_capi -c ../configs/core/CG_DILU.json 
AMGX version 2.0.0.130-opensource
Built on May 15 2018, 15:33:04
AMGX ERROR: file /tmp/nix-build-AmgX.drv-0/lafn8qxabfn95rh3bh3y0bi113kzwl8w-source/examples/amgx_capi.c line    245
AMGX ERROR: Error initializing amgx core.
Failed while initializing CUDA runtime in cudaRuntimeGetVersion

@wd15
Contributor Author

wd15 commented May 15, 2018

@shwina, thanks for looking into that. I'll try to debug it or switch to Conda.

@wd15
Contributor Author

wd15 commented May 15, 2018

@shwina, I submitted an issue on amgx, NVIDIA/AMGX#27

@shwina
Owner

shwina commented May 15, 2018

Thanks, but I think the problem may be at a lower level than AMGX. I cannot even run the deviceQuery sample from the CUDA toolkit provided by Nix, and the error indicates an incompatibility between the NVIDIA driver and CUDA toolkit version.

[nix-shell:/nix/store/l7xmd5899g9789saqkd9bm7fh2hp3jlq-cudatoolkit-9.1.85.1/samples/bin/x86_64/linux/release]$ ./deviceQuery
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

cudaGetDeviceCount returned 35
-> CUDA driver version is insufficient for CUDA runtime version
Result = FAIL

I'm trying to see if installing NVIDIA drivers via Nix fixes anything.

@shwina
Owner

shwina commented May 15, 2018

So far that doesn't seem to be working.

@shwina
Owner

shwina commented May 16, 2018

OK, some progress after going down a small Nix rabbit hole:

  1. First (and sort of unrelated), I added a fix to pyamgx so that it produces the appropriate error message when pyamgx.initialize() fails:
[nix-shell:~/tmp/nixes/amgx]$ python demo.py
AMGX version 2.0.0.130-opensource
Built on May 16 2018, 13:20:54
Failed while initializing CUDA runtime in cudaRuntimeGetVersion
Traceback (most recent call last):
  File "demo.py", line 7, in <module>
    pyamgx.initialize()
  File "pyamgx/pyamgx.pyx", line 14, in pyamgx.initialize
  File "pyamgx/Errors.pyx", line 62, in pyamgx.check_error
pyamgx.AMGXError: Error initializing amgx core.

  2. From the discussion here it looks like the way to get the previous command to work is to ensure that libcuda.so is picked up from the host system. The NVIDIA drivers must be installed on the host system.

So first I tried:

[nix-shell:~/tmp/nixes/amgx]$ LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libcuda.so python demo.py
python: error while loading shared libraries: libnvidia-fatbinaryloader.so.375.82: cannot open shared object file: No such file or directory

So I also added the path to libnvidia-fatbinaryloader.so.375.82 to LD_PRELOAD, but now I get:

LD_PRELOAD="/usr/lib/x86_64-linux-gnu/nvidia/current/libcuda.so /usr/lib/x86_64-linux-gnu/libnvidia-fatbinaryloader.so.375.82" python demo.py

AMGX version 2.0.0.130-opensource
Built on May 16 2018, 13:20:54
Compiled with CUDA Runtime 8.0, using CUDA driver 8.0
Traceback (most recent call last):
  File "demo.py", line 20, in <module>
    import scipy.sparse as sparse
  File "/nix/store/097b7q6jlzsl5fa0zcvbkpcf3gnk53yw-python2.7-scipy-1.0.0/lib/python2.7/site-packages/scipy/sparse/__init__.py", line 229, in <module>
    from .csr import *
  File "/nix/store/097b7q6jlzsl5fa0zcvbkpcf3gnk53yw-python2.7-scipy-1.0.0/lib/python2.7/site-packages/scipy/sparse/csr.py", line 15, in <module>
    from ._sparsetools import csr_tocsc, csr_tobsr, csr_count_blocks, \
ImportError: /nix/store/nvdymgkdcp7cmyvh318bzs397sy2hrxp-gcc-4.8.5-lib/lib/libstdc++.so.6: version `CXXABI_1.3.9' not found (required by /nix/store/097b7q6jlzsl5fa0zcvbkpcf3gnk53yw-python2.7-scipy-1.0.0/lib/python2.7/site-packages/scipy/sparse/_sparsetools.so)

So as with NumPy, for some reason I had to import scipy before importing pyamgx:

[nix-shell:~/tmp/nixes/amgx]$ head -5 demo.py
import numpy as np
import scipy.sparse as sparse
import scipy.sparse.linalg as splinalg
import pyamgx
import os

Finally:


[nix-shell:~/tmp/nixes/amgx]$ LD_PRELOAD="/usr/lib/x86_64-linux-gnu/nvidia/current/libcuda.so /usr/lib/x86_64-linux-gnu/libnvidia-fatbinaryloader.so.375.82" python demo.py
AMGX version 2.0.0.130-opensource
Built on May 16 2018, 13:20:54
Compiled with CUDA Runtime 8.0, using CUDA driver 8.0
AMG Grid:
         Number of Levels: 1
            LVL         ROWS               NNZ    SPRSTY       Mem (GB)
         --------------------------------------------------------------
           0(D)            5                25         1       3.69e-07
         --------------------------------------------------------------
         Grid Complexity: 1
         Operator Complexity: 1
         Total Memory Usage: 3.68804e-07 GB
         --------------------------------------------------------------
           iter      Mem Usage (GB)       residual           rate
         --------------------------------------------------------------
            Ini            0.406494   1.248728e+00
              0            0.406494   1.441213e-15         0.0000
         --------------------------------------------------------------
         Total Iterations: 1
         Avg Convergence Rate: 		         0.0000
         Final Residual: 		   1.441213e-15
         Total Reduction in Residual: 	   1.154145e-15
         Maximum Memory Usage: 		          0.406 GB
         --------------------------------------------------------------
Total Time: 0.00239411
    setup: 0.00153699 s
    solve: 0.00085712 s
    solve(per iteration): 0.00085712 s
('pyamgx solution: ', array([  5.10430199, -12.78780803,   1.91116712,  -7.30070306,
        13.65398112]))
('scipy solution: ', array([  5.10430199, -12.78780803,   1.91116712,  -7.30070306,
        13.65398112]))

OK so a more elegant way to do this (also recommended in the above discussion) is to create symlinks to the above libraries in a folder /nix/var/nix/lib and add that folder to LD_LIBRARY_PATH (setup commands are sketched below):

[nix-shell:~/tmp/nixes/amgx]$ ls -l /nix/var/nix/lib/
total 4
lrwxrwxrwx 1 root root 51 May 16 09:50 libcuda.so -> /usr/lib/x86_64-linux-gnu/nvidia/current/libcuda.so
lrwxrwxrwx 1 root root 61 May 16 09:50 libnvidia-fatbinaryloader.so.375.82 -> /usr/lib/x86_64-linux-gnu/libnvidia-fatbinaryloader.so.375.82

[nix-shell:~/tmp/nixes/amgx]$ $LD_LIBRARY_PATH
bash: /nix/var/nix/lib:/home/ashwin/local/bin:/usr/lib/x86_64-linux-gnu/nvidia/current/: No such file or directory
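
(For reference, the symlinks were created with something like the following, done once as root; the driver paths are the ones on this machine:)

mkdir -p /nix/var/nix/lib
ln -s /usr/lib/x86_64-linux-gnu/nvidia/current/libcuda.so /nix/var/nix/lib/libcuda.so
ln -s /usr/lib/x86_64-linux-gnu/libnvidia-fatbinaryloader.so.375.82 /nix/var/nix/lib/
export LD_LIBRARY_PATH=/nix/var/nix/lib:$LD_LIBRARY_PATH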

[nix-shell:~/tmp/nixes/amgx]$ python demo.py
AMGX version 2.0.0.130-opensource
Built on May 16 2018, 13:20:54
Compiled with CUDA Runtime 8.0, using CUDA driver 8.0
AMG Grid:
         Number of Levels: 1
            LVL         ROWS               NNZ    SPRSTY       Mem (GB)
         --------------------------------------------------------------
           0(D)            5                25         1       3.69e-07
         --------------------------------------------------------------
         Grid Complexity: 1
         Operator Complexity: 1
         Total Memory Usage: 3.68804e-07 GB
         --------------------------------------------------------------
           iter      Mem Usage (GB)       residual           rate
         --------------------------------------------------------------
            Ini            0.406494   1.393182e+00
              0            0.406494   4.370313e-16         0.0000
         --------------------------------------------------------------
         Total Iterations: 1
         Avg Convergence Rate: 		         0.0000
         Final Residual: 		   4.370313e-16
         Total Reduction in Residual: 	   3.136930e-16
         Maximum Memory Usage: 		          0.406 GB
         --------------------------------------------------------------
Total Time: 0.0024313
    setup: 0.00156934 s
    solve: 0.000861952 s
    solve(per iteration): 0.000861952 s
('pyamgx solution: ', array([-0.76851926,  5.01697985, -1.61899648,  2.52336589, -1.42913961]))
('scipy solution: ', array([-0.76851926,  5.01697985, -1.61899648,  2.52336589, -1.42913961]))

[nix-shell:~/tmp/nixes/amgx]$

@shwina
Owner

shwina commented May 16, 2018

I also made a few small changes to the .nix files:


[nix-shell:~/tmp/nixes/amgx]$ cat amgx.nix
{ nixpkgs ? import <nixpkgs> {} }:
let
  stdenv48 = nixpkgs.overrideCC nixpkgs.stdenv nixpkgs.pkgs.gcc48;
in
  stdenv48.mkDerivation rec {
    name = "AmgX";

    src = nixpkgs.fetchFromGitHub {
      owner = "NVIDIA";
      repo = "AMGX";
      rev = "6cb23fed26602e4873d5c1deb694a2c8480feac3";
      sha256 = "1g5zj7wzxc8b2lyn00xp7jqq70bz550q8fmzcb5mzzapa44xjk7q";
    };

    buildInputs = [
      nixpkgs.pkgs.cmake
      nixpkgs.pkgs.cudatoolkit8
    ];

    unpackPhase = ''
      cp --recursive "$src" ./
      chmod --recursive u=rwx ./"$(basename "$src")"
      cd ./"$(basename "$src")"
    '';

    configurePhase = ''
      mkdir -p build
      cd build
      mkdir --parents "$out"
      cmake -DCMAKE_INSTALL_PREFIX:PATH="$out" ../
    '';

    buildPhase = ''
      make -j"$NIX_BUILD_CORES" all
    '';
  }

[nix-shell:~/tmp/nixes/amgx]$ cat pyamgx.nix
{ nixpkgs ? import <nixpkgs> {} }:
let
  amgx = import ./amgx.nix { inherit nixpkgs; };
in
  nixpkgs.python27Packages.buildPythonPackage rec {
    pname = "pyamgx";
    version = "";
    src = nixpkgs.fetchFromGitHub {
      owner = "shwina";
      repo = pname;
      rev = "fac3c841e1527942da64c7d1805d1ffe94f58766";
      sha256 = "1752yhhq82980qhn5i8mngjlybkgvp96qlgnv6y5cdn8921m8h2s";
    };
    doCheck=false;
    buildInputs = [
      nixpkgs.python27Packages.scipy
      nixpkgs.python27Packages.numpy
      amgx
      nixpkgs.python27Packages.cython
    ];
    AMGX_DIR = "/blah";
    # shellHook = ''
    #   export AMGX_DIR = "/blah"
    # '';
  }



@wd15
Contributor Author

wd15 commented May 16, 2018

Awesome work! I'm going to try to follow along. Apologies for sending you down the Nix rabbit hole. I've been trying to get into Nix lately, and I think it's an improvement over Conda. I hope you enjoy it.

Please do add the final nix recipes to this repository if you/we do get things working. I can submit a pull request if you'd like some outside contributions. I haven't tried submitting anything to nixpkgs yet, but maybe that is also an option down the road.

I think LD_LIBRARY_PATH can probably be set during the nix build via a shell hook. I'll look into that, assuming I can reproduce your work above.
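
A minimal sketch of what I have in mind for pyamgx.nix (untested, and it assumes the /nix/var/nix/lib symlink directory from your comment above):

shellHook = ''
  export LD_LIBRARY_PATH=/nix/var/nix/lib:$LD_LIBRARY_PATH
'';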

Also, I wasn't previously using a machine with a GPU or drivers, but I am now. I don't use GPUs much, so I didn't really know what I was doing.

@shwina
Owner

shwina commented May 17, 2018

I've figured out what causes the issues with the numpy/scipy imports: it's the different GCC versions used to compile AMGX (gcc-4.8) versus numpy/scipy (the Nix default, which I think is gcc-7.3.0).
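
One way to confirm which GCC runtime an extension module actually resolves to is ldd (the scipy store path is the one from the tracebacks below):

ldd /nix/store/097b7q6jlzsl5fa0zcvbkpcf3gnk53yw-python2.7-scipy-1.0.0/lib/python2.7/site-packages/scipy/sparse/_sparsetools.so | grep -E 'libstdc|libgomp'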

See below: when pyamgx is imported first, numpy and scipy are not happy about finding the gcc-4.8.5 versions of libgomp.so and libstdc++.so instead of the gcc-7.3.0 versions:

[nix-shell:~/tmp/nixes/amgx]$ python demo.py
Traceback (most recent call last):
  File "demo.py", line 1, in <module>
    import pyamgx
  File "__init__.pxd", line 163, in init pyamgx
  File "/nix/store/mq134cl5nfiy422cjzvjms90az1zxnwh-python2.7-numpy-1.14.0/lib/python2.7/site-packages/numpy/__init__.py", line 142, in <module>
    from . import add_newdocs
  File "/nix/store/mq134cl5nfiy422cjzvjms90az1zxnwh-python2.7-numpy-1.14.0/lib/python2.7/site-packages/numpy/add_newdocs.py", line 13, in <module>
    from numpy.lib import add_newdoc
  File "/nix/store/mq134cl5nfiy422cjzvjms90az1zxnwh-python2.7-numpy-1.14.0/lib/python2.7/site-packages/numpy/lib/__init__.py", line 8, in <module>
    from .type_check import *
  File "/nix/store/mq134cl5nfiy422cjzvjms90az1zxnwh-python2.7-numpy-1.14.0/lib/python2.7/site-packages/numpy/lib/type_check.py", line 11, in <module>
    import numpy.core.numeric as _nx
  File "/nix/store/mq134cl5nfiy422cjzvjms90az1zxnwh-python2.7-numpy-1.14.0/lib/python2.7/site-packages/numpy/core/__init__.py", line 26, in <module>
    raise ImportError(msg)
ImportError:
Importing the multiarray numpy extension module failed.  Most
likely you are trying to import a failed build of numpy.
If you're working with a numpy git repo, try `git clean -xdf` (removes all
files not under version control).  Otherwise reinstall numpy.

Original error was: /nix/store/nvdymgkdcp7cmyvh318bzs397sy2hrxp-gcc-4.8.5-lib/lib/libgomp.so.1: version `GOMP_4.0' not found (required by /nix/store/lhd411dd1r4nzqpi52l2sxhyfjqiwlph-openblas-0.2.20/lib/libopenblas.so.0)

LD_PRELOAD libgomp.so to make numpy happy:

[nix-shell:~/tmp/nixes/amgx]$ LD_PRELOAD=/nix/store/236pskgkb440l3q0458fbd4gikgplw5w-gcc-7.3.0-lib/lib/libgomp.so.1 python demo.py
Traceback (most recent call last):
  File "demo.py", line 4, in <module>
    import scipy.sparse as sparse
  File "/nix/store/097b7q6jlzsl5fa0zcvbkpcf3gnk53yw-python2.7-scipy-1.0.0/lib/python2.7/site-packages/scipy/sparse/__init__.py", line 229, in <module>
    from .csr import *
  File "/nix/store/097b7q6jlzsl5fa0zcvbkpcf3gnk53yw-python2.7-scipy-1.0.0/lib/python2.7/site-packages/scipy/sparse/csr.py", line 15, in <module>
    from ._sparsetools import csr_tocsc, csr_tobsr, csr_count_blocks, \
ImportError: /nix/store/nvdymgkdcp7cmyvh318bzs397sy2hrxp-gcc-4.8.5-lib/lib/libstdc++.so.6: version `CXXABI_1.3.9' not found (required by /nix/store/097b7q6jlzsl5fa0zcvbkpcf3gnk53yw-python2.7-scipy-1.0.0/lib/python2.7/site-packages/scipy/sparse/_sparsetools.so

LD_PRELOAD libstdc++.so to make scipy happy:


[nix-shell:~/tmp/nixes/amgx]$ LD_PRELOAD="/nix/store/236pskgkb440l3q0458fbd4gikgplw5w-gcc-7.3.0-lib/lib/libgomp.so.1 /nix/store/236pskgkb440l3q0458fbd4gikgplw5w-gcc-7.3.0-lib/lib/libstdc++.so" python demo.py
AMGX version 2.0.0.130-opensource
Built on May 16 2018, 14:45:25
Compiled with CUDA Runtime 8.0, using CUDA driver 8.0
AMG Grid:
         Number of Levels: 1
            LVL         ROWS               NNZ    SPRSTY       Mem (GB)
         --------------------------------------------------------------
           0(D)            5                25         1       3.69e-07
         --------------------------------------------------------------
         Grid Complexity: 1
         Operator Complexity: 1
         Total Memory Usage: 3.68804e-07 GB
         --------------------------------------------------------------
           iter      Mem Usage (GB)       residual           rate
         --------------------------------------------------------------
            Ini            0.477661   1.113422e+00
              0            0.477661   1.189315e-15         0.0000
         --------------------------------------------------------------
         Total Iterations: 1
         Avg Convergence Rate: 		         0.0000
         Final Residual: 		   1.189315e-15
         Total Reduction in Residual: 	   1.068162e-15
         Maximum Memory Usage: 		          0.478 GB
         --------------------------------------------------------------
Total Time: 0.0023961
    setup: 0.00153821 s
    solve: 0.000857888 s
    solve(per iteration): 0.000857888 s
('pyamgx solution: ', array([-13.10772991,  -8.11496672,  -2.27576125,   6.75924547,
        12.55481743]))
('scipy solution: ', array([-13.10772991,  -8.11496672,  -2.27576125,   6.75924547,
        12.55481743]))

Use LD_LIBRARY_PATH to avoid LD_PRELOAD - this isn't a great solution, but at least it's tidier:

[nix-shell:~/tmp/nixes/amgx]$ export LD_LIBRARY_PATH=/nix/store/236pskgkb440l3q0458fbd4gikgplw5w-gcc-7.3.0-lib/lib/:$LD_LIBRARY_PATH

[nix-shell:~/tmp/nixes/amgx]$ python demo.py
AMGX version 2.0.0.130-opensource
Built on May 16 2018, 14:45:25
Compiled with CUDA Runtime 8.0, using CUDA driver 8.0
AMG Grid:
         Number of Levels: 1
            LVL         ROWS               NNZ    SPRSTY       Mem (GB)
         --------------------------------------------------------------
           0(D)            5                25         1       3.69e-07
         --------------------------------------------------------------
         Grid Complexity: 1
         Operator Complexity: 1
         Total Memory Usage: 3.68804e-07 GB
         --------------------------------------------------------------
           iter      Mem Usage (GB)       residual           rate
         --------------------------------------------------------------
            Ini            0.477661   1.584991e+00
              0            0.477661   1.029213e-15         0.0000
         --------------------------------------------------------------
         Total Iterations: 1
         Avg Convergence Rate: 		         0.0000
         Final Residual: 		   1.029213e-15
         Total Reduction in Residual: 	   6.493497e-16
         Maximum Memory Usage: 		          0.478 GB
         --------------------------------------------------------------
Total Time: 0.00246966
    setup: 0.00157434 s
    solve: 0.000895328 s
    solve(per iteration): 0.000895328 s
('pyamgx solution: ', array([ 9.27278245,  6.04026388, -1.18125993, -5.33643819, -3.43642485]))
('scipy solution: ', array([ 9.27278245,  6.04026388, -1.18125993, -5.33643819, -3.43642485]))

@wd15
Contributor Author

wd15 commented May 17, 2018

Currently, the following happens:

$ LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libcuda.so python demo.py
python: error while loading shared libraries: libnvidia-fatbinaryloader.so.384.111: cannot open shared object file: No such file or directory

but when libnvidia-fatbinaryloader.so.384.111 is included, the following happens:

$ LD_PRELOAD="/usr/lib/x86_64-linux-gnu/libcuda.so /usr/lib/nvidia-384/libnvidia-fatbinaryloader.so.384.111 " python demo.py
AMGX version 2.0.0.130-opensource
Built on May 16 2018, 21:41:17
Failed while initializing CUDA runtime in cudaRuntimeGetVersion
Traceback (most recent call last):
  File "demo.py", line 5, in <module>
    pyamgx.initialize()
  File "pyamgx/pyamgx.pyx", line 14, in pyamgx.initialize
  File "pyamgx/Errors.pyx", line 62, in pyamgx.check_error
pyamgx.AMGXError: Error initializing amgx core.

It's built using cudatoolkit9, not 8, since the machine apparently has the drivers for CUDA 9. Could this be an issue for pyamgx / amgx?

@shwina
Owner

shwina commented May 17, 2018

Could you try with cudatoolkit8? See here for the minimum driver versions required for different CUDA toolkit versions.

It's only the NVIDIA driver (which provides both libcuda.so [confusingly] and libnvidia-fatbinaryloader) that needs to be installed on the host system, so it shouldn't matter what CUDA toolkit is installed on the host system.
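
A quick way to compare the two on a given machine (nvidia-smi reports the driver; nvcc, which ships with the toolkit, reports the toolkit version):

nvidia-smi | head -n 3    # driver version appears in the banner
nvcc --version            # toolkit version inside the nix shell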

@shwina
Owner

shwina commented May 17, 2018

It looks like cudatoolkit9 installs CUDA toolkit 9.1, which needs a minimum driver version of 387.xx.

@wd15
Contributor Author

wd15 commented May 17, 2018

Sorry, I'm confused: does that mean I should try cudatoolkit8 or not? I think I got a different error pointing out the incompatibility with 9, but I can't remember now.

@shwina
Owner

shwina commented May 17, 2018

Yes, try with cudatoolkit8, because it looks like you have NVIDIA driver version 384.111, which is apparently not sufficient to support CUDA toolkit 9.1.

@shwina
Owner

shwina commented May 17, 2018

I can confirm that I get the same error with cudatoolkit9, but not with cudatoolkit8.

@wd15
Contributor Author

wd15 commented May 17, 2018

Different error this time (using cudatoolkit8):

$ LD_PRELOAD="/usr/lib/x86_64-linux-gnu/libcuda.so /usr/lib/nvidia-384/libnvidia-fatbinaryloader.so.384.111" python demo.py
AMGX version 2.0.0.130-opensource
Built on May 17 2018, 17:46:34
Compiled with CUDA Runtime 8.0, using CUDA driver 9.0
Thrust failure: function_attributes(): after cudaFuncGetAttributes: invalid device function

[snip]

  File "pyamgx/Solver.pyx", line 28, in pyamgx.Solver.create
  File "pyamgx/Errors.pyx", line 62, in pyamgx.check_error
pyamgx.AMGXError: CUDA kernel launch error.

@shwina
Owner

shwina commented May 17, 2018

Hmm, what GPU is on the system? Can you provide the output of

$ nvidia-smi

on the host system?

@shwina
Owner

shwina commented May 17, 2018

I think we are close. I suspect that the final piece is the CUDA_ARCH CMake variable described in the AMGX README. Depending on the GPU on your system, you may have to set this to a different value, e.g., -DCUDA_ARCH="30" for some older GPUs.

The appropriate value for different GPUs can be found here. For example, a Quadro K2200 GPU supports compute capability 3.0, so -DCUDA_ARCH="30".

I don't think AMGX supports anything lower than 3.0 though.
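
In the amgx.nix above, that would just mean adding the flag to the cmake line in configurePhase, e.g. for compute capability 3.0:

cmake -DCMAKE_INSTALL_PREFIX:PATH="$out" -DCUDA_ARCH="30" ../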

@wd15
Contributor Author

wd15 commented May 17, 2018

$ nvidia-smi
Thu May 17 15:42:54 2018       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.111                Driver Version: 384.111                   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla C2075         On   | 00000000:03:00.0 Off |                    0 |
| 30%   52C   P12    30W /  N/A |      1MiB /  5301MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

@wd15
Contributor Author

wd15 commented May 17, 2018

Oh dear! It looks like the "Tesla C2075" is compute capability 2.0. Looks like I need to switch to another GPU. Correct?

@shwina
Owner

shwina commented May 17, 2018

Let me try compiling AMGX with only 2.0 compute capability and see what happens.

@shwina
Owner

shwina commented May 17, 2018

Sorry that the process is so drawn out! I forgot how many things there are to keep in mind.

@shwina
Owner

shwina commented May 17, 2018

Unfortunately, that didn't work. I can try to investigate further, but if you have access to a newer GPU for testing, that would probably be the way to go!

@wd15
Contributor Author

wd15 commented May 18, 2018

My current situation:

  • GPU Name: Tesla K40m (compute capability 3.5)

  • Building amgx with -DCUDA_ARCH="35" (updated amgx.nix version)

I cloned the AMGX repo so that I can run:

LD_PRELOAD="/usr/lib/x86_64-linux-gnu/libcuda.so /usr/lib/nvidia-384/libnvidia-fatbinaryloader.so.384.111" $AMGX_DIR/lib/examples/amgx_capi -m examples/matrix.mtx -c core/configs/FGMRES_AGGREGATION.json

This gives:

AMGX version 2.0.0.130-opensource
Built on May 17 2018, 20:34:58
Compiled with CUDA Runtime 8.0, using CUDA driver 9.0
Warning: No mode specified, using dDDI by default.
Caught amgx exception: Could not create the CUDENSE handle
 at: /tmp/nix-build-AmgX.drv-0/lafn8qxabfn95rh3bh3y0bi113kzwl8w-source/core/src/solvers/dense_lu_solver.cu:733
Stack trace:
 /nix/store/6s94g7q56wkc7i3sd3zd9jhihwnwjrrg-AmgX/lib/libamgxsh.so : amgx::dense_lu_solver::DenseLUSolver<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >::DenseLUSolver(amgx::AMG_Config&, std::string const&, amgx::ThreadManager*)+0x159

...

Reading data...
RHS vector was not found. Using RHS b=[1,…,1]^T
Solution vector was not found. Setting initial solution to x=[0,…,0]^T
Finished reading
Caught amgx exception: Mode not found.

...

@wd15
Contributor Author

wd15 commented May 21, 2018

As reported here, AMGX does actually seem to work with the nix recipe; however, running demo.py still gives the above error.

@wd15
Contributor Author

wd15 commented May 21, 2018

OK, the following version of demo.py now works for me.

import numpy as np
import scipy.sparse as sparse
import scipy.sparse.linalg as splinalg

import pyamgx
import os

pyamgx.initialize()

# Initialize config and resources:
cfg = pyamgx.Config().create_from_file(os.environ['AMGX_DIR']+'/lib/configs/core/FGMRES_NOPREC.json')
...

@shwina
Owner

shwina commented May 21, 2018

OK, so the problem definitely has something to do with the creation of the dense LU solver, although I have no idea what exactly. FGMRES_NOPREC is a purely iterative solver without any preconditioning, so no dense system is created or solved.

At least the Nix pieces are coming together nicely 🎉

@shwina
Owner

shwina commented May 22, 2018

I noticed that a similar (but not identical) error is raised with a singular matrix. This may be a long shot, but @wd15, could you please post the output of the following program:

import pyamgx
import os

pyamgx.initialize()

# Initialize config and resources:
cfg = pyamgx.Config().create_from_file(os.environ['AMGX_DIR']+'/core/configs/FGMRES_AGGREGATION.json')
rsc = pyamgx.Resources().create_simple(cfg)

# Create matrices and vectors:
A = pyamgx.Matrix().create(rsc)
x = pyamgx.Vector().create(rsc)
b = pyamgx.Vector().create(rsc)

# Create solver:
solver = pyamgx.Solver().create(rsc, cfg)

# Upload system:
import numpy as np
import scipy.sparse as sparse
import scipy.sparse.linalg as splinalg

R = np.random.rand(5, 5)
print(R)
M = sparse.csr_matrix(R)
rhs = np.random.rand(5)
sol = np.zeros(5, dtype=np.float64)

A.upload_CSR(M)
b.upload(rhs)
x.upload(sol)

# Setup and solve system:
solver.setup(A)
solver.solve(b, x)

# Download solution
x.download(sol)
print("pyamgx solution: ", sol)
print("scipy solution: ", splinalg.spsolve(M, rhs))

# Clean up:
A.destroy()
x.destroy()
b.destroy()
solver.destroy()
rsc.destroy()
cfg.destroy()

pyamgx.finalize()

The above just prints the random matrix before uploading it to AMGX.

@shwina
Owner

shwina commented May 22, 2018

Sorry, I think you'll have to rearrange the imports as before.

@wd15
Contributor Author

wd15 commented May 22, 2018

I ran the above and I get the same "Could not create the CUDENSE handle" error:

$ LD_PRELOAD="/usr/lib/x86_64-linux-gnu/libcuda.so /usr/lib/nvidia-384/libnvidia-fatbinaryloader.so.384.111" python demo_shwina.py 
AMGX version 2.0.0.130-opensource
Built on May 18 2018, 19:09:36
Compiled with CUDA Runtime 8.0, using CUDA driver 9.0
Caught amgx exception: Could not create the CUDENSE handle
 at: /tmp/nix-build-AmgX.drv-0/lafn8qxabfn95rh3bh3y0bi113kzwl8w-source/core/src/solvers/dense_lu_solver.cu:733
Stack trace:

@shwina
Owner

shwina commented May 22, 2018 via email

@wd15
Contributor Author

wd15 commented May 22, 2018

It doesn't get that far; the Python traceback is:

Traceback (most recent call last):
  File "demo_shwina.py", line 16, in <module>
    solver = pyamgx.Solver().create(rsc, cfg)
  File "pyamgx/Solver.pyx", line 28, in pyamgx.Solver.create
  File "pyamgx/Errors.pyx", line 62, in pyamgx.check_error
pyamgx.AMGXError: CUDA kernel launch error.

@shwina
Owner

shwina commented May 22, 2018 via email

@wd15
Contributor Author

wd15 commented May 23, 2018

The cuSolver wasn't required for testing FiPy with pyamgx, so I'm no longer concerned about this. Please close this if you like.

@shwina
Owner

shwina commented May 23, 2018

OK thanks for trying!

@shwina shwina closed this as completed May 23, 2018
