[BUG] Unable to run CUTLASS example 65_distributed_gemm #2097

manishucsd · 2025-02-10T22:02:50Z

Using patch #2086 to compile with CUDA Toolkit 12.6.3

cmake

cmake -B../build -S../cutlass -DCUTLASS_NVCC_ARCHS="90a" -DCUTLASS_ENABLE_GDC_FOR_SM90=1 
-- CMake Version: 3.31.4
-- CUTLASS 3.8.0
-- CUDART: /home/manish_magic_dev/sdk/cuda/12.6.3/lib64/libcudart.so
-- CUDA Driver: /home/manish_magic_dev/sdk/cuda/12.6.3/lib64/stubs/libcuda.so
-- NVRTC: /home/manish_magic_dev/sdk/cuda/12.6.3/lib64/libnvrtc.so
-- Default Install Location: install
-- Make cute::tuple be the new standard-layout tuple type
-- CUDA Compilation Architectures: 90a
-- Enable caching of reference results in conv unit tests
-- Enable rigorous conv problem sizes in conv unit tests
-- Grid Dependency Control (GDC) is enabled for SM90 kernels (required for programmatic dependent launches).
-- Using the following NVCC flags: 
  --expt-relaxed-constexpr
  -DCUTE_USE_PACKED_TUPLE=1
  -DCUTLASS_TEST_LEVEL=0
  -DCUTLASS_TEST_ENABLE_CACHED_RESULTS=1
  -DCUTLASS_CONV_UNIT_TEST_RIGOROUS_SIZE_ENABLED=1
  -DCUTLASS_DEBUG_TRACE_LEVEL=0
  -DCUTLASS_ENABLE_GDC_FOR_SM90=1
  -Xcompiler=-Wconversion
  -Xcompiler=-fno-strict-aliasing
-- CUTLASS Revision: 833f6990
CMake Warning (dev) at /home/manish_magic_dev/.local/lib/python3.10/site-packages/cmake/data/share/cmake-3.31/Modules/FetchContent.cmake:1953 (message):
  Calling FetchContent_Populate(googletest) is deprecated, call
  FetchContent_MakeAvailable(googletest) instead.  Policy CMP0169 can be set
  to OLD to allow FetchContent_Populate(googletest) to be called directly for
  now, but the ability to call it with declared details will be removed
  completely in a future version.
Call Stack (most recent call first):
  cmake/googletest.cmake:47 (FetchContent_Populate)
  CMakeLists.txt:759 (include)
This warning is for project developers.  Use -Wno-dev to suppress it.

-- Found Python3: /usr/bin/python3.10 (found version "3.10.12") found components: Interpreter
-- Configuring cublas ...
-- cuBLAS Disabled.
-- Configuring cuBLAS ... done.
-- Completed generation of library instances. See /home/manish_magic_dev/repos/cutlass/cutlass_tree_2/build/tools/library/library_instance_generation.log for more information.
-- Found Python3: /usr/bin/python3.10 (found suitable version "3.10.12", minimum required is "3.5") found components: Interpreter
-- Enable device reference verification in conv unit tests
-- Configuring done (3.2s)
-- Generating done (3.5s)
-- Build files have been written to: /home/manish_magic_dev/repos/cutlass/cutlass_tree_2/build

NVIDIA SMI

nvidia-smi topo -p2p r
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7
 GPU0   X       OK      OK      OK      OK      OK      OK      OK
 GPU1   OK      X       OK      OK      OK      OK      OK      OK
 GPU2   OK      OK      X       OK      OK      OK      OK      OK
 GPU3   OK      OK      OK      X       OK      OK      OK      OK
 GPU4   OK      OK      OK      OK      X       OK      OK      OK
 GPU5   OK      OK      OK      OK      OK      X       OK      OK
 GPU6   OK      OK      OK      OK      OK      OK      X       OK
 GPU7   OK      OK      OK      OK      OK      OK      OK      X

Legend:

  X    = Self
  OK   = Status Ok
  CNS  = Chipset not supported
  GNS  = GPU not supported
  TNS  = Topology not supported
  NS   = Not supported
  U    = Unknown

Builds example 65

make -C ../build/ 65_distributed_gemm

Running it on 8xH100

../build/examples/65_distributed_gemm/65_distributed_gemm 
Got cutlass error: Error Internal at: 682

nvidia-smi while it was running

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A    241707      C   ...istributed_gemm/65_distributed_gemm    15658MiB |
|    1   N/A  N/A    241707      C   ...istributed_gemm/65_distributed_gemm      520MiB |
|    2   N/A  N/A    241707      C   ...istributed_gemm/65_distributed_gemm      520MiB |
|    3   N/A  N/A    241707      C   ...istributed_gemm/65_distributed_gemm      520MiB |
|    4   N/A  N/A    241707      C   ...istributed_gemm/65_distributed_gemm      520MiB |
|    5   N/A  N/A    241707      C   ...istributed_gemm/65_distributed_gemm      520MiB |
|    6   N/A  N/A    241707      C   ...istributed_gemm/65_distributed_gemm      520MiB |
|    7   N/A  N/A    241707      C   ...istributed_gemm/65_distributed_gemm      520MiB |
+---------------------------------------------------------------------------------------+

manishucsd · 2025-02-10T22:04:22Z

cc: @alihassanijr

alihassanijr · 2025-02-10T22:15:02Z

Thanks for the information @manishucsd .

It looks like it's failing in gemm.initialize, which means one of the following steps failed:

- Per-GPU workspace initialization: probably not the reason.
- Setting the shmem size: shouldn't fail as long as the kernel doesn't require more shmem than is available.
- Constructing the CUDA graphs: this seems the most likely reason.

Could you confirm:

- Whether the kernel in the example was modified, and if so, to what tile shape (so we can rule out shmem size).
- Your CUDA driver version. If its corresponding CTK version is older than 12.5, the specific CUDA graph APIs required by the example won't work, even if the compiler is 12.5 or later and has those APIs available.
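
One way to check this is to read the "CUDA Version" field in the nvidia-smi banner, which reports the highest CUDA version the installed driver supports, and compare it against 12.5. The following is a minimal sketch, not part of CUTLASS: the helper names `driver_cuda_version` and `version_at_least` are hypothetical, and the 12.5 threshold is taken from the comment above.

```shell
#!/bin/sh
# Hypothetical helpers (not part of CUTLASS or the CUDA toolkit).

# Read nvidia-smi banner text on stdin and print the "CUDA Version: X.Y" value.
driver_cuda_version() {
    sed -n 's/.*CUDA Version: *\([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p' | head -n 1
}

# Succeed (exit status 0) if version $1 (MAJOR.MINOR) is at least version $2.
version_at_least() {
    have_major=${1%%.*}; have_minor=${1#*.}
    want_major=${2%%.*}; want_minor=${2#*.}
    [ "$have_major" -gt "$want_major" ] ||
        { [ "$have_major" -eq "$want_major" ] && [ "$have_minor" -ge "$want_minor" ]; }
}

# Only query the driver when nvidia-smi is actually installed.
if command -v nvidia-smi >/dev/null 2>&1; then
    v=$(nvidia-smi | driver_cuda_version)
    if [ -n "$v" ] && version_at_least "$v" 12.5; then
        echo "driver supports CUDA $v: new enough"
    else
        echo "driver supports CUDA ${v:-unknown}: older than 12.5"
    fi
fi
```

This compares the driver-side version only. The toolkit used for compilation is a separate matter and is already visible in the CMake output (the CUDART path).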

Meanwhile, could you also try building with -DCUTLASS_DEBUG_TRACE_LEVEL=5 and trying again to see if the traces tell us anything?
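
As a sketch of that rebuild, the commands below reuse the configure, build, and run steps from the original report with only the trace level added. The relative paths are the reporter's and would need adjusting for another checkout, and `trace.log` is just an illustrative file name.

```shell
# Reconfigure with CUTLASS debug tracing enabled; all other flags are as in the report.
cmake -B../build -S../cutlass -DCUTLASS_NVCC_ARCHS="90a" -DCUTLASS_ENABLE_GDC_FOR_SM90=1 -DCUTLASS_DEBUG_TRACE_LEVEL=5

# Rebuild the example.
make -C ../build/ 65_distributed_gemm

# Rerun it, keeping a copy of everything it prints for the issue thread.
../build/examples/65_distributed_gemm/65_distributed_gemm 2>&1 | tee trace.log
```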

manishucsd · 2025-02-10T22:22:58Z

Thanks @alihassanijr for the quick response.

Example is unmodified.
I compiled with 12.6.3 and also tried 12.8, but my driver is archaic (Driver Version: 535.183.01), so this is probably the reason. Maybe we can add a minimum driver version requirement to the example's README.md, just as it already states a minimum toolkit version of 12.6.

I will see if we can update the driver and try example 65 again!

alihassanijr · 2025-02-11T00:13:35Z

Sorry about that; we'll definitely update the readme.
We've had at least one other report of driver version 535 failing to run the example.

manishucsd added the "? - Needs Triage" and "bug" (Something isn't working) labels on Feb 10, 2025
