
[feature request] Support for AMD GPU #907

Open
cregouby opened this issue Oct 15, 2022 · 19 comments

Comments

@cregouby
Collaborator

Following up #455, I'd love to be able to run torch workloads on my AMD GPU.
My hardware is available for any test / debug / experiment around it.
Thanks!

@cregouby cregouby changed the title Support for AMD GPU [feature request] Support for AMD GPU Oct 15, 2022
@dfalbel
Member

dfalbel commented Oct 17, 2022

Cool @cregouby !

In order to get support for AMD GPUs, we will need to figure out:

  1. How to build lantern targeting ROCm, probably adding another set of conditions here to download the pre-built binaries for ROCm.
  2. Set up a workflow to build lantern for ROCm and upload the pre-built binaries here.
  3. Then modify install.R to allow installing from ROCm builds.
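A hedged sketch of what step 1's condition could look like (the `USE_ROCM` environment variable and the URL-selection logic here are illustrative assumptions, not lantern's actual CMakeLists.txt code; the URL patterns follow the download.pytorch.org layout used elsewhere in this thread):

```cmake
# Illustrative sketch: pick a libtorch binary by backend,
# mirroring hypothetical CPU/CUDA branches with a ROCm one.
if(DEFINED ENV{USE_ROCM})
  set(TORCH_URL "https://download.pytorch.org/libtorch/rocm5.4.2/libtorch-cxx11-abi-shared-with-deps-2.0.1%2Brocm5.4.2.zip")
elseif(DEFINED ENV{CUDA})
  set(TORCH_URL "https://download.pytorch.org/libtorch/cu118/libtorch-cxx11-abi-shared-with-deps-2.0.1%2Bcu118.zip")
else()
  set(TORCH_URL "https://download.pytorch.org/libtorch/cpu/libtorch-cxx11-abi-shared-with-deps-2.0.1%2Bcpu.zip")
endif()
```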

@cregouby
Collaborator Author

cregouby commented Oct 19, 2022

Nice push! I'm on it in https://github.com/cregouby/torch/tree/platform/amd_gpu
Currently, step 1 seems to be off to a good start:

~/R/_packages/torch/lantern/build$ cmake ..
-- The C compiler identification is GNU 11.2.0
-- The CXX compiler identification is GNU 11.2.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Downloading /home/___/R/_packages/torch/lantern/build/libtorch.zip: https://download.pytorch.org/libtorch/rocm5.1.1/libtorch-cxx11-abi-shared-with-deps-1.12.1%2Brocm5.1.1.zip

I still need to add a version-matching check (as I currently do not match the ROCm version available on my machine).
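Such a version-matching check could be sketched roughly as below (everything here is an assumption for illustration: `/opt/rocm/.info/version` is one place ROCm installs commonly record their version, but the location and format vary across releases):

```cmake
# Hypothetical sketch: warn when the locally installed ROCm version
# does not match the version the downloaded libtorch was built for.
set(EXPECTED_ROCM_VERSION "5.1.1")
if(EXISTS "/opt/rocm/.info/version")
  file(READ "/opt/rocm/.info/version" LOCAL_ROCM_VERSION)
  string(STRIP "${LOCAL_ROCM_VERSION}" LOCAL_ROCM_VERSION)
  if(NOT LOCAL_ROCM_VERSION MATCHES "^${EXPECTED_ROCM_VERSION}")
    message(WARNING "libtorch targets ROCm ${EXPECTED_ROCM_VERSION}, "
                    "but ROCm ${LOCAL_ROCM_VERSION} is installed.")
  endif()
endif()
```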

@dfalbel
Member

dfalbel commented Oct 19, 2022

Nice! This is looking great! Maybe ROCm can work with minor version mismatches? That's not the case for CUDA, but you could try.

@cregouby
Collaborator Author

Sure!
Currently dealing with the GitHub Actions workflow, I'm wondering which `runs-on` value should be selected to get AMD GPU hardware to run on. Any idea on this? (I have to admit that the hardware part of GitHub runners is unclear to me.)

@dfalbel
Member

dfalbel commented Oct 20, 2022

I think you can cross-compile on the default Ubuntu runner and install the ROCm compilers, i.e., I think you can compile for ROCm on a machine that doesn't include an AMD GPU.

See eg: https://rocmdocs.amd.com/en/latest/Installation_Guide/Installation-Guide.html#installing-development-packages-for-cross-compilation

@cregouby
Collaborator Author

I've made good progress on step 3 (maybe the easiest one).
I'm still fighting hard with step 1, making step-by-step progress. I've now fixed the hipBLAS requirement, and I'm dealing with three more required packages: hipFFT, hipRAND, hipSPARSE. I'll keep you up to date...

@cregouby
Collaborator Author

cregouby commented Nov 4, 2022

Some news on the task:

  • cmake now succeeds on lantern
  • make -j8 fails with a weird error:
....
[ 39%] Building CXX object CMakeFiles/lantern.dir/src/Dimname.cpp.o
In file included from /home/____/R/_packages/torch/lantern/src/Dtype.cpp:8:
In file included from /home/____/R/_packages/torch/lantern/src/utils.hpp:2:
/home/____/R/_packages/torch/lantern/include/lantern/types.h:13:10: warning: pack fold expression is a C++17 extension [-Wc++17-extensions]
         ...);
         ^
/home/____/R/_packages/torch/lantern/include/lantern/types.h:9:3: error: no member named 'apply' in namespace 'std'; did you mean 'torch::apply'?
  std::apply(
  ^~~~~~~~~~
  torch::apply
/home/____/R/_packages/torch/lantern/build/libtorch/include/torch/csrc/utils/variadic.h:118:6: note: 'torch::apply' declared here
void apply(Function function, Ts&&... ts) {
     ^
1 warning and 1 error generated when compiling for gfx900.
...
make[2]: *** [CMakeFiles/lantern.dir/build.make:76 : CMakeFiles/lantern.dir/src/lantern.cpp.o] Erreur 1
make[1]: *** [CMakeFiles/Makefile2:85 : CMakeFiles/lantern.dir/all] Erreur 2
make: *** [Makefile:91 : all] Erreur 2

Any suggestion would be appreciated.

@dfalbel
Member

dfalbel commented Nov 4, 2022

Great!!

Perhaps something equivalent to the line below, but for ROCm, is missing?

set_property(TARGET lantern PROPERTY CUDA_STANDARD 17)

@dfalbel
Member

dfalbel commented Nov 4, 2022

It seems that setting this would help: https://cmake.org/cmake/help/latest/prop_tgt/HIP_STANDARD.html
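Applied to lantern, that would look something like the sketch below (the target name follows the `CUDA_STANDARD` example above; note that `HIP_STANDARD` requires CMake 3.21 or later and only governs sources compiled as HIP, not plain CXX sources):

```cmake
# Sketch: mirror the CUDA_STANDARD property for HIP sources.
# Requires CMake >= 3.21; plain C++ sources are governed by
# CXX_STANDARD instead.
set_property(TARGET lantern PROPERTY HIP_STANDARD 17)
set_property(TARGET lantern PROPERTY HIP_STANDARD_REQUIRED ON)
```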

@cregouby
Collaborator Author

cregouby commented Nov 4, 2022

Thanks for the hint; setting it to 14 or 17 did not remove the C++17 extension warning...

For the error `lantern/types.h:9:3: error: no member named 'apply' in namespace 'std'; did you mean 'torch::apply'?`, I made the change in types.h (I must admit I'm completely lost about what to do / not to do in .h files):
https://github.com/cregouby/torch/blob/9c67675d43862cb53c7b47df7c5451eb741798ec/lantern/include/lantern/types.h#L9
and now the lantern build target reaches 100%.

My two big uncertainties right now are:

  • what is the impact of changing types.h std::apply into torch::apply?
  • is src/Contrib/SortVertices/sort_vert_cpu.cpp sufficient to build on ROCm, i.e. without including src/AllocatorCuda.cpp and src/Contrib/SortVertices/sort_vert_kernel.cu?

@dfalbel
Member

dfalbel commented Nov 4, 2022

I don't think torch::apply is equivalent to std::apply...
I think torch::apply is equivalent to https://pytorch.org/docs/stable/generated/torch.Tensor.apply_.html, while std::apply is metaprogramming machinery from C++: https://en.cppreference.com/w/cpp/utility/apply

std::apply is a C++17 feature, so that warning is probably caused by the compiler not supporting C++17, or maybe that HIP standard flag is not being correctly propagated. AFAICT, in the CUDA world, nvcc (the compiler that supports CUDA) works like a preprocessor: it takes the CUDA parts and compiles them, and the part of the code that is not CUDA-related is forwarded to a C++ compiler, and that's where those flags matter.

Yeah, I think you don't need to provide a HIP kernel for the Contrib stuff, so just building the CPU version should be fine.

@cregouby
Collaborator Author

cregouby commented Nov 7, 2022

Thanks for those hints, I'll try to rework based on that!
FYI, the 100% build of lantern makes `install_torch_from_file()` fail with:

install_torch(version = version, type = type, install_config = install_config)
Erreur dans cpp_lantern_init(file.path(install_path(), "lib")) : 
  /home/____/R/x86_64-pc-linux-gnu-library/4.2/torch/lib/liblantern.so - /home/____/R/x86_64-pc-linux-gnu-library/4.2/torch/lib/liblantern.so: undefined symbol: _ZN2at4_ops4rand4callEN3c108ArrayRefIlEENS2_8optionalINS2_10ScalarTypeEEENS5_INS2_6LayoutEEENS5_INS2_6DeviceEEENS5_IbEE

And despite my efforts, I can't get the HIP compiler to accept C++17 code... I'll ask the authors... or maybe try something else based on https://github.com/ROCm-Developer-Tools/HIP/blob/809149ecc8d751acd3c1595b590090cd86ada8df/bin/hipcc.pl#L397

    # nvcc does not handle standard compiler options properly
    # This can prevent hipcc being used as standard CXX/C Compiler
    # To fix this we need to pass -Xcompiler for options

@dfalbel
Member

dfalbel commented Nov 9, 2022

That's great progress!! 👍

Hmm, this seems to be related to the clang version, perhaps? Or something like this?

@cregouby
Collaborator Author

Ah, some news here after some deeper investigation:

Support and Compatibility

| libtorch public / nightly | ROCm | Ubuntu installer | gfx card support | R torch |
|---|---|---|---|---|
| - | 5.0 | - | 908, 90a | - |
| 1.13.0 - 1.13.1 / 1.13.0 - 2.0.0 | 5.2 | 18.04/20.04 (1) | add 1011 (2) | 0.10.0 |
| - | 5.3.0 | 22.04 | add 11xx | - |
| 2.0.0 - 2.0.1 / 2.0.0 - 2.1.0 | 5.4.2 | 22.04 | add 1100, 1102 | 0.12.0 |

Liblantern build

Strictly following the compatibility table, I've been able to build liblantern.so for

  • ROCm 5.2
  • ROCm 5.4.2

using the official buildlantern.R

{torch}

I've tweaked the torch download a bit and got to the following success:

>   # copy lantern
>   source("R/install.R")
>   source("R/lantern_sync.R")
>   lantern_sync(TRUE)
[1] TRUE
> library(torch)

Attachement du package : ‘torch’

Les objets suivants sont masqués _par_ ‘.GlobalEnv’ :

    get_install_libs_url, install_torch, install_torch_from_file, torch_install_path, torch_is_installed

> torch_version
[1] "2.0.1"
> tt <- torch_tensor(c(1,2,3,4), device = "cuda")
> tt
torch_tensor
 1
 2
 3
 4
[ CUDAFloatType{4} ]

which is amazing!

I still have a discrepancy: R currently crashes when running `tt + 1`, due to a possible version mismatch between libtorch and {torch}.

But I can feel the taste of success...

@RMHogervorst

This is very exciting! Is there a way I can help test? I have an AMD ROCm computer, and I would love it if torch worked on the GPU, just like PyTorch!

@cregouby
Collaborator Author

Hello @RMHogervorst,
I'm glad you want to help!
You should clone the repo and switch to the platform/amd_gpu branch, where building the ROCm lantern is documented, following /.github/CONTRIBUTING.md.
In order to build lantern for torch 0.12, you will need the ROCm 5.4.2 suite on your machine.
Let us know if you can build it.

@RMHogervorst

RMHogervorst commented Feb 21, 2024

@cregouby
After cloning your repository:

  1. I installed all packages (I used renv to do that)
  2. I had to create the lantern directory (otherwise the build_lantern condition is not true)
  3. I installed cmake
  4. I ran `source("tools/build_lantern.R")` and got:

CMake Error: The source directory "/home/roel/Documents/projecten/experimenten/torch/lantern" does not appear to contain CMakeLists.txt.

object path not found in lantern_sync

I think I'm missing something.

I have installed the latest ROCm (6.0.2). I can probably install the 5.4.2 version too, but I don't think this error is related to the ROCm version.

@RMHogervorst

I realized that there are CMakeLists.txt files in the src directory.
(I don't have much experience building C projects, so I'll probably learn a lot / do some stupid stuff.)

  • from the src directory
  • run `cmake .`
  • run `cmake --build . --target lantern --config Release --parallel 8`

This builds a library, but it seems to build it for CPU only.

@cregouby
Collaborator Author

cregouby commented Feb 21, 2024

Sorry @RMHogervorst, I didn't commit my experimental lantern/CMakeLists.txt.
You should now get it if you git pull again from the cregouby/torch repo on branch platform/amd_gpu.

Feel free to question or improve every line inside the CMakeLists.txt file, as makefiles are far beyond my comfort zone.

After lantern is compiled, you may want to set up some environment variables.

These are mine, stored in .Renviron (again, they may need some changes):

# --- torch  / lantern build
# change ARCH target at `make` time
HCC_AMDGPU_TARGET=gfx900
USE_ROCM=1
BUILD_LANTERN=1

# ---- torch lantern package build ----
MAKE=make -j10
LD_LIBRARY_PATH=/opt/rocm-5.4.2/lib:/opt/rocm-5.4.2/llvm/lib:~/R/_packages/torch/inst/lib:~/R/x86_64-pc-linux-gnu-library/4.3/torch/lib
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/snap/bin:/opt/rocm-5.4.2:/opt/rocm-5.4.2/bin
ROCM_PATH=/opt/rocm
# ---- local liblantern.so usage ----
# may need a ln -s of a liblantern_<version>.so in the same directory
# The library URL can be 3 different things:
# - a real URL
# - a path to a zip file containing the library
# - a path to a directory containing the files to be installed
# if set, skip the download within lantern/CMakeLists.txt
# TORCH_URL=https://download.pytorch.org/libtorch/rocm5.4.2/libtorch-cxx11-abi-shared-with-deps-2.0.1%2Brocm5.4.2.zip
# local cache of the previous
TORCH_URL="~/R/_packages/torch_experiment/libtorch-cxx11-abi-shared-with-deps-2.0.1%2Brocm5.4.2.zip"
TORCH_INSTALL_DEBUG=1
