[BUG] Unable to run CUTLASS example 65_distributed_gemm #2097

Open
manishucsd opened this issue Feb 10, 2025 · 4 comments
Labels
? - Needs Triage, bug (Something isn't working)

Comments

@manishucsd
Contributor

manishucsd commented Feb 10, 2025

Using patch #2086 to compile with CUDA Toolkit 12.6.3

cmake

cmake -B../build -S../cutlass -DCUTLASS_NVCC_ARCHS="90a" -DCUTLASS_ENABLE_GDC_FOR_SM90=1 
-- CMake Version: 3.31.4
-- CUTLASS 3.8.0
-- CUDART: /home/manish_magic_dev/sdk/cuda/12.6.3/lib64/libcudart.so
-- CUDA Driver: /home/manish_magic_dev/sdk/cuda/12.6.3/lib64/stubs/libcuda.so
-- NVRTC: /home/manish_magic_dev/sdk/cuda/12.6.3/lib64/libnvrtc.so
-- Default Install Location: install
-- Make cute::tuple be the new standard-layout tuple type
-- CUDA Compilation Architectures: 90a
-- Enable caching of reference results in conv unit tests
-- Enable rigorous conv problem sizes in conv unit tests
-- Grid Dependency Control (GDC) is enabled for SM90 kernels (required for programmatic dependent launches).
-- Using the following NVCC flags: 
  --expt-relaxed-constexpr
  -DCUTE_USE_PACKED_TUPLE=1
  -DCUTLASS_TEST_LEVEL=0
  -DCUTLASS_TEST_ENABLE_CACHED_RESULTS=1
  -DCUTLASS_CONV_UNIT_TEST_RIGOROUS_SIZE_ENABLED=1
  -DCUTLASS_DEBUG_TRACE_LEVEL=0
  -DCUTLASS_ENABLE_GDC_FOR_SM90=1
  -Xcompiler=-Wconversion
  -Xcompiler=-fno-strict-aliasing
-- CUTLASS Revision: 833f6990
CMake Warning (dev) at /home/manish_magic_dev/.local/lib/python3.10/site-packages/cmake/data/share/cmake-3.31/Modules/FetchContent.cmake:1953 (message):
  Calling FetchContent_Populate(googletest) is deprecated, call
  FetchContent_MakeAvailable(googletest) instead.  Policy CMP0169 can be set
  to OLD to allow FetchContent_Populate(googletest) to be called directly for
  now, but the ability to call it with declared details will be removed
  completely in a future version.
Call Stack (most recent call first):
  cmake/googletest.cmake:47 (FetchContent_Populate)
  CMakeLists.txt:759 (include)
This warning is for project developers.  Use -Wno-dev to suppress it.

-- Found Python3: /usr/bin/python3.10 (found version "3.10.12") found components: Interpreter
-- Configuring cublas ...
-- cuBLAS Disabled.
-- Configuring cuBLAS ... done.
-- Completed generation of library instances. See /home/manish_magic_dev/repos/cutlass/cutlass_tree_2/build/tools/library/library_instance_generation.log for more information.
-- Found Python3: /usr/bin/python3.10 (found suitable version "3.10.12", minimum required is "3.5") found components: Interpreter
-- Enable device reference verification in conv unit tests
-- Configuring done (3.2s)
-- Generating done (3.5s)
-- Build files have been written to: /home/manish_magic_dev/repos/cutlass/cutlass_tree_2/build

NVIDIA SMI

nvidia-smi topo -p2p r
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7
 GPU0   X       OK      OK      OK      OK      OK      OK      OK
 GPU1   OK      X       OK      OK      OK      OK      OK      OK
 GPU2   OK      OK      X       OK      OK      OK      OK      OK
 GPU3   OK      OK      OK      X       OK      OK      OK      OK
 GPU4   OK      OK      OK      OK      X       OK      OK      OK
 GPU5   OK      OK      OK      OK      OK      X       OK      OK
 GPU6   OK      OK      OK      OK      OK      OK      X       OK
 GPU7   OK      OK      OK      OK      OK      OK      OK      X

Legend:

  X    = Self
  OK   = Status Ok
  CNS  = Chipset not supported
  GNS  = GPU not supported
  TNS  = Topology not supported
  NS   = Not supported
  U    = Unknown

Build example 65

make -C ../build/ 65_distributed_gemm

Running it on 8xH100

../build/examples/65_distributed_gemm/65_distributed_gemm 
Got cutlass error: Error Internal at: 682

nvidia-smi while it was running

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A    241707      C   ...istributed_gemm/65_distributed_gemm    15658MiB |
|    1   N/A  N/A    241707      C   ...istributed_gemm/65_distributed_gemm      520MiB |
|    2   N/A  N/A    241707      C   ...istributed_gemm/65_distributed_gemm      520MiB |
|    3   N/A  N/A    241707      C   ...istributed_gemm/65_distributed_gemm      520MiB |
|    4   N/A  N/A    241707      C   ...istributed_gemm/65_distributed_gemm      520MiB |
|    5   N/A  N/A    241707      C   ...istributed_gemm/65_distributed_gemm      520MiB |
|    6   N/A  N/A    241707      C   ...istributed_gemm/65_distributed_gemm      520MiB |
|    7   N/A  N/A    241707      C   ...istributed_gemm/65_distributed_gemm      520MiB |
+---------------------------------------------------------------------------------------+
manishucsd added the "? - Needs Triage" and "bug" labels Feb 10, 2025
@manishucsd
Contributor Author

cc: @alihassanijr

@alihassanijr
Contributor

Thanks for the information, @manishucsd.

It looks like it's failing gemm.initialize, which means one of the following failed:

  • Per-gpu workspace initialization: probably not the reason
  • Setting shmem size: shouldn't fail as long as the kernel doesn't require more shmem than is available.
  • Constructing CUDA graphs: this is the most likely reason.
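
For context on the reported message: the `at: 682` suffix is most likely a source line number printed by an error-check macro wrapped around the failing call, which is why it points at a single call site rather than at one of the three steps above. A minimal sketch of that pattern follows. The `Status` enum, `status_string`, `format_error`, and `EXAMPLE_CHECK` are hypothetical stand-ins so the block compiles without CUTLASS; they are not the library's actual definitions.

```cpp
#include <cstdlib>
#include <iostream>
#include <sstream>
#include <string>

// Hypothetical stand-in for a library status type (not cutlass::Status).
enum class Status { kSuccess, kErrorInternal };

// Hypothetical stand-in for a status-to-string helper.
inline const char* status_string(Status s) {
  return s == Status::kSuccess ? "Success" : "Error Internal";
}

// Builds the message; kept separate so it can be inspected without exiting.
inline std::string format_error(Status s, int line) {
  std::ostringstream os;
  os << "Got cutlass error: " << status_string(s) << " at: " << line;
  return os.str();
}

// On failure, report the status and the line of the call site, then exit.
#define EXAMPLE_CHECK(expr)                                      \
  do {                                                           \
    Status status_ = (expr);                                     \
    if (status_ != Status::kSuccess) {                           \
      std::cerr << format_error(status_, __LINE__) << std::endl; \
      std::exit(EXIT_FAILURE);                                   \
    }                                                            \
  } while (0)
```

With a macro of this shape, every failure inside the wrapped call collapses into one status and one line number, so a more verbose trace is needed to tell the steps apart.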

Could you confirm:

  1. Whether the kernel in the example was modified, and if so, to what tile shape (so we can rule out shmem size).
  2. Your CUDA driver version. (If its corresponding CTK version is older than 12.5, the specific CUDA graph APIs required by the example won't work, even if the compiler is 12.5 or later and has the APIs available.)
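
The driver check in item 2 could be sketched as below. CUDA reports versions as a single integer, 1000 × major + 10 × minor (for example, 12050 for 12.5), which is the format returned by `cudaDriverGetVersion` and `cudaRuntimeGetVersion`. The helper names are hypothetical, and the functions take plain integers so the sketch compiles without the CUDA toolkit; the 12.5 threshold is the one mentioned above.

```cpp
#include <utility>

// Split CUDA's packed version integer (1000 * major + 10 * minor)
// into {major, minor}.
inline std::pair<int, int> decode_cuda_version(int packed) {
  return {packed / 1000, (packed % 1000) / 10};
}

// True if the CUDA version supported by the driver is at least
// min_major.min_minor. Pairs compare lexicographically, so the major
// version is compared first and the minor version breaks ties.
inline bool driver_meets_minimum(int driver_packed, int min_major, int min_minor) {
  return decode_cuda_version(driver_packed) >= std::make_pair(min_major, min_minor);
}
```

In a real program, `driver_packed` would come from `cudaDriverGetVersion`. A value such as 12020 would fail `driver_meets_minimum(v, 12, 5)` regardless of which toolkit was used to compile.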

Meanwhile, could you also try building with -DCUTLASS_DEBUG_TRACE_LEVEL=5 and trying again to see if the traces tell us anything?

@manishucsd
Contributor Author

Thanks @alihassanijr for the quick response.

  1. Example is unmodified.
  2. I compiled using 12.6.3 and also tried 12.8, but my driver is archaic (Driver Version: 535.183.01), so this is probably the reason. Maybe we can add a minimum driver version requirement to the example's README.md, just as it states that the minimum required toolkit version is 12.6.

I will see if we can update the driver and try example 65 again!

@alihassanijr
Contributor

Sorry about that; we'll definitely update the readme.
We've had at least one other report of driver version 535 failing to run the example.
