
[BUG] Stream-K kernel breaks for some GEMM Problem-K #2100

Open
manishucsd opened this issue Feb 12, 2025 · 10 comments
Labels
? - Needs Triage bug Something isn't working

Comments

@manishucsd
Contributor

manishucsd commented Feb 12, 2025

GEMM Problem Shape --m=8 --n=8192 --k=8192 Does NOT Work

./tools/profiler/cutlass_profiler --dist=uniform,min:-2.3,max:2.3,scale:-1 --kernels=cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_4x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_nosmem --m=8 --n=8192 --k=8192 --verification-enabled=false



=============================
  Problem ID: 1

        Provider: CUTLASS
   OperationKind: gemm
       Operation: cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_4x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_nosmem

          Status: Success
    Verification: OFF
     Disposition: Failed


       Arguments: --gemm_kind=universal --m=8 --n=8192 --k=8192 --A=bf16:row --B=bf16:column --C=bf16:column --D=bf16:column  \
                  --alpha=1 --beta=0 --split_k_mode=serial --split_k_slices=1 --batch_count=1 --raster_order=heuristic  \
                  --runtime_input_datatype_a=invalid --runtime_input_datatype_b=invalid --use_pdl=false --enable_sm90_mixed_dtype_shuffle_test=false  \
                  --swizzle_size=1 --op_class=tensorop --accum=f32 --cta_m=128 --cta_n=128 --cta_k=64 --cluster_m=1 --cluster_n=1  \
                  --cluster_k=1 --cluster_m_fallback=0 --cluster_n_fallback=0 --cluster_k_fallback=0 --stages=7 --warps_m=4  \
                  --warps_n=2 --warps_k=1 --inst_m=64 --inst_n=128 --inst_k=16 --min_cc=90 --max_cc=90

           Bytes: 134479872  bytes
           FLOPs: 1073872896  flops
           FLOPs/Byte: 7

GEMM Problem Shape --m=8 --n=8192 --k=128 Works

./tools/profiler/cutlass_profiler --dist=uniform,min:-2.3,max:2.3,scale:-1 --kernels=cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_4x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_nosmem --m=8 --n=8192  --k=128 --verification-enabled=false



=============================
  Problem ID: 1

        Provider: CUTLASS
   OperationKind: gemm
       Operation: cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_4x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_nosmem

          Status: Success
    Verification: OFF
     Disposition: Not verified


       Arguments: --gemm_kind=universal --m=8 --n=8192 --k=128 --A=bf16:row --B=bf16:column --C=bf16:column --D=bf16:column  \
                  --alpha=1 --beta=0 --split_k_mode=serial --split_k_slices=1 --batch_count=1 --raster_order=heuristic  \
                  --runtime_input_datatype_a=invalid --runtime_input_datatype_b=invalid --use_pdl=false --enable_sm90_mixed_dtype_shuffle_test=false  \
                  --swizzle_size=1 --op_class=tensorop --accum=f32 --cta_m=128 --cta_n=128 --cta_k=64 --cluster_m=1 --cluster_n=1  \
                  --cluster_k=1 --cluster_m_fallback=0 --cluster_n_fallback=0 --cluster_k_fallback=0 --stages=7 --warps_m=4  \
                  --warps_n=2 --warps_k=1 --inst_m=64 --inst_n=128 --inst_k=16 --min_cc=90 --max_cc=90

           Bytes: 2230272  bytes
           FLOPs: 16908288  flops
           FLOPs/Byte: 7

         Runtime: 0.0130992  ms
          Memory: 158.567 GiB/s

            Math: 1290.79 GFLOP/s
@manishucsd manishucsd added ? - Needs Triage bug Something isn't working labels Feb 12, 2025
@manishucsd
Contributor Author

  Problem ID: 1

        Provider: CUTLASS
   OperationKind: gemm
       Operation: cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_4x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma

          Status: Success
    Verification: OFF
     Disposition: Failed

Another one that failed. The common pattern among these is cluster=4x1x1, stream_k, ws.

@manishucsd
Contributor Author

manishucsd commented Feb 12, 2025

  Problem ID: 1

        Provider: CUTLASS
   OperationKind: gemm
       Operation: cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_4x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_nosmem

          Status: Success
    Verification: OFF
     Disposition: Failed

  Problem ID: 1

        Provider: CUTLASS
   OperationKind: gemm
       Operation: cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_4x4x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_nosmem

          Status: Error: internal
    Verification: OFF
     Disposition: Failed

@hwu36
Collaborator

hwu36 commented Feb 12, 2025

@jackkosaian

@jackkosaian
Contributor

Can you provide the full CMake config you used?

Also, we recently fixed a similar issue internally (also occurring with larger clusters only). It is planned to be pushed here soon.

Can you see if the following change (which is the one that will be upstreamed) fixes the issue for you?
Add the following line here:

      new_hw_info.max_active_clusters = hw_info.max_active_clusters;

@manishucsd
Contributor Author

cmake -DCMAKE_BUILD_TYPE:STRING=Release \
      -DCUTLASS_NVCC_ARCHS:STRING=90a \
      -DCUTLASS_NVCC_KEEP:STRING=OFF \
      -DCUTLASS_ENABLE_F16C:STRING=ON \
      -DCUTLASS_LIBRARY_KERNELS:STRING=cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16*tnn*align8 \
      -DCUTLASS_LIBRARY_INSTANTIATION_LEVEL:STRING=max \
      -DCUTLASS_LIBRARY_IGNORE_KERNELS:STRING=gemm_grouped*,gemm_planar* \
      -DCUTLASS_ENABLE_CUBLAS:STRING=ON \
      -DCMAKE_EXPORT_COMPILE_COMMANDS:BOOL=TRUE \
      -DCMAKE_C_COMPILER:FILEPATH=/usr/bin/gcc \
      -DCMAKE_CXX_COMPILER:FILEPATH=/usr/bin/g++ \
      --no-warn-unused-cli \
      -S/home/manish_magic_dev/repos/cutlass/cutlass_tree_2/cutlass \
      -B/home/manish_magic_dev/repos/cutlass/cutlass_tree_2/build

@manishucsd
Contributor Author

manishucsd commented Feb 12, 2025

Does hw_info.sm_count, if set, restrict the kernel to run on fewer than the maximum number of SMs? For example, if I set this to 128, then 4 SMs on an H100 with 132 SMs will be left out. Is this understanding correct, or is there more to it?

@jackkosaian, the code in your comment suggests the change is in hw_info.max_active_clusters, but the hyperlink takes me to hw_info.sm_count. I will wait for the full fix to merge into main; hopefully we can get it merged soon.

@jackkosaian
Contributor

Sorry, the suggestion I was trying to make was to add the code that I pasted in the comment (max_active_clusters) below the code linked.

Here's the diff:

diff --git a/include/cutlass/gemm/kernel/tile_scheduler_params.h b/include/cutlass/gemm/kernel/tile_scheduler_params.h
index aa599a35..a4467e8a 100644
--- a/include/cutlass/gemm/kernel/tile_scheduler_params.h
+++ b/include/cutlass/gemm/kernel/tile_scheduler_params.h
@@ -1204,6 +1204,7 @@ struct PersistentTileSchedulerSm90StreamKParams {
       KernelHardwareInfo new_hw_info;
       new_hw_info.device_id = hw_info.device_id;
       new_hw_info.sm_count = hw_info.sm_count;
+      new_hw_info.max_active_clusters = hw_info.max_active_clusters;
       if (new_hw_info.sm_count <= 0) {
         CUTLASS_TRACE_HOST("  WARNING: Arguments do not include a valid SM count.\n"
             "  For optimal performance, populate the arguments KernelHardwareInfo struct with the SM count.");

I was able to reproduce the issue you mentioned before making this diff, and the issue went away after the diff.

@manishucsd
Contributor Author

Let us check that into mainline.

@jackkosaian
Contributor

It will be merged in when we tag 3.8 (soon).

@manishucsd
Contributor Author

Please close it once this is merged and you have verified it on your end. We will enable stream_k again and reopen if we see an issue.
