
[BUG] Stream-K kernel breaks for some GEMM Problem-K #2100

Open
manishucsd opened this issue Feb 12, 2025 · 10 comments
Labels
? - Needs Triage bug Something isn't working

Comments

@manishucsd
Contributor

manishucsd commented Feb 12, 2025

GEMM Problem Shape --m=8 --n=8192 --k=8192 Does NOT Work

./tools/profiler/cutlass_profiler --dist=uniform,min:-2.3,max:2.3,scale:-1 --kernels=cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_4x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_nosmem --m=8 --n=8192 --k=8192 --verification-enabled=false



=============================
  Problem ID: 1

        Provider: CUTLASS
   OperationKind: gemm
       Operation: cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_4x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_nosmem

          Status: Success
    Verification: OFF
     Disposition: Failed


       Arguments: --gemm_kind=universal --m=8 --n=8192 --k=8192 --A=bf16:row --B=bf16:column --C=bf16:column --D=bf16:column  \
                  --alpha=1 --beta=0 --split_k_mode=serial --split_k_slices=1 --batch_count=1 --raster_order=heuristic  \
                  --runtime_input_datatype_a=invalid --runtime_input_datatype_b=invalid --use_pdl=false --enable_sm90_mixed_dtype_shuffle_test=false  \
                  --swizzle_size=1 --op_class=tensorop --accum=f32 --cta_m=128 --cta_n=128 --cta_k=64 --cluster_m=1 --cluster_n=1  \
                  --cluster_k=1 --cluster_m_fallback=0 --cluster_n_fallback=0 --cluster_k_fallback=0 --stages=7 --warps_m=4  \
                  --warps_n=2 --warps_k=1 --inst_m=64 --inst_n=128 --inst_k=16 --min_cc=90 --max_cc=90

           Bytes: 134479872  bytes
           FLOPs: 1073872896  flops
           FLOPs/Byte: 7

GEMM Problem Shape --m=8 --n=8192 --k=128 Works

./tools/profiler/cutlass_profiler --dist=uniform,min:-2.3,max:2.3,scale:-1 --kernels=cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_4x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_nosmem --m=8 --n=8192  --k=128 --verification-enabled=false



=============================
  Problem ID: 1

        Provider: CUTLASS
   OperationKind: gemm
       Operation: cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_4x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_nosmem

          Status: Success
    Verification: OFF
     Disposition: Not verified


       Arguments: --gemm_kind=universal --m=8 --n=8192 --k=128 --A=bf16:row --B=bf16:column --C=bf16:column --D=bf16:column  \
                  --alpha=1 --beta=0 --split_k_mode=serial --split_k_slices=1 --batch_count=1 --raster_order=heuristic  \
                  --runtime_input_datatype_a=invalid --runtime_input_datatype_b=invalid --use_pdl=false --enable_sm90_mixed_dtype_shuffle_test=false  \
                  --swizzle_size=1 --op_class=tensorop --accum=f32 --cta_m=128 --cta_n=128 --cta_k=64 --cluster_m=1 --cluster_n=1  \
                  --cluster_k=1 --cluster_m_fallback=0 --cluster_n_fallback=0 --cluster_k_fallback=0 --stages=7 --warps_m=4  \
                  --warps_n=2 --warps_k=1 --inst_m=64 --inst_n=128 --inst_k=16 --min_cc=90 --max_cc=90

           Bytes: 2230272  bytes
           FLOPs: 16908288  flops
           FLOPs/Byte: 7

         Runtime: 0.0130992  ms
          Memory: 158.567 GiB/s

            Math: 1290.79 GFLOP/s
@manishucsd manishucsd added ? - Needs Triage bug Something isn't working labels Feb 12, 2025
@manishucsd
Contributor Author

  Problem ID: 1

        Provider: CUTLASS
   OperationKind: gemm
       Operation: cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_4x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma

          Status: Success
    Verification: OFF
     Disposition: Failed

Another one that failed. The common pattern among these is cluster=4x1x1, stream_k, ws.

@manishucsd
Contributor Author

manishucsd commented Feb 12, 2025

  Problem ID: 1

        Provider: CUTLASS
   OperationKind: gemm
       Operation: cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_4x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_nosmem

          Status: Success
    Verification: OFF
     Disposition: Failed

  Problem ID: 1

        Provider: CUTLASS
   OperationKind: gemm
       Operation: cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_4x4x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_nosmem

          Status: Error: internal
    Verification: OFF
     Disposition: Failed

@hwu36
Collaborator

hwu36 commented Feb 12, 2025

@jackkosaian

@jackkosaian
Contributor

Can you provide the full CMake config you used?

Also, we recently fixed a similar issue internally (also occurring with larger clusters only). It is planned to be pushed here soon.

Can you see if the following change (which is the one that will be upstreamed) fixes the issue for you?
Add the following line here:

      new_hw_info.max_active_clusters = hw_info.max_active_clusters;

@manishucsd
Contributor Author

cmake -DCMAKE_BUILD_TYPE:STRING=Release \
      -DCUTLASS_NVCC_ARCHS:STRING=90a \
      -DCUTLASS_NVCC_KEEP:STRING=OFF \
      -DCUTLASS_ENABLE_F16C:STRING=ON \
      -DCUTLASS_LIBRARY_KERNELS:STRING=cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16*tnn*align8 \
      -DCUTLASS_LIBRARY_INSTANTIATION_LEVEL:STRING=max \
      -DCUTLASS_LIBRARY_IGNORE_KERNELS:STRING=gemm_grouped*,gemm_planar* \
      -DCUTLASS_ENABLE_CUBLAS:STRING=ON \
      -DCMAKE_EXPORT_COMPILE_COMMANDS:BOOL=TRUE \
      -DCMAKE_C_COMPILER:FILEPATH=/usr/bin/gcc \
      -DCMAKE_CXX_COMPILER:FILEPATH=/usr/bin/g++ \
      --no-warn-unused-cli \
      -S/home/manish_magic_dev/repos/cutlass/cutlass_tree_2/cutlass \
      -B/home/manish_magic_dev/repos/cutlass/cutlass_tree_2/build

@manishucsd
Contributor Author

manishucsd commented Feb 12, 2025

Does hw_info.sm_count, if set, restrict the kernel to run on fewer than the maximum number of SMs? For example, if I set this to 128, then 4 SMs on an H100 with 132 SMs will be left out. Is this understanding correct, or is there more to it?

@jackkosaian, the code in your comment suggests the change is in hw_info.max_active_clusters, but the hyperlink takes me to hw_info.sm_count. I will wait for the full fix to merge into main; hopefully we can get it merged soon.

@jackkosaian
Contributor

Sorry, the suggestion I was trying to make was to add the code that I pasted in the comment (max_active_clusters) below the code linked.

Here's the diff:

diff --git a/include/cutlass/gemm/kernel/tile_scheduler_params.h b/include/cutlass/gemm/kernel/tile_scheduler_params.h
index aa599a35..a4467e8a 100644
--- a/include/cutlass/gemm/kernel/tile_scheduler_params.h
+++ b/include/cutlass/gemm/kernel/tile_scheduler_params.h
@@ -1204,6 +1204,7 @@ struct PersistentTileSchedulerSm90StreamKParams {
       KernelHardwareInfo new_hw_info;
       new_hw_info.device_id = hw_info.device_id;
       new_hw_info.sm_count = hw_info.sm_count;
+      new_hw_info.max_active_clusters = hw_info.max_active_clusters;
       if (new_hw_info.sm_count <= 0) {
         CUTLASS_TRACE_HOST("  WARNING: Arguments do not include a valid SM count.\n"
             "  For optimal performance, populate the arguments KernelHardwareInfo struct with the SM count.");

I was able to reproduce the issue you mentioned before making this diff, and the issue went away after the diff.

@manishucsd
Contributor Author

Let us check that into mainline.

@jackkosaian
Contributor

It will be merged in when we tag 3.8 (soon).

@manishucsd
Contributor Author

Please close it once this is merged and you have verified it on your end. We will enable stream_k again and reopen if we see an issue.
