[QST]Inquiry About the Computation Size in a Single cute::gemm Call in CUTLASS #2092

ziyuhuang123 · 2025-02-08T11:34:09Z

What is your question?
Could you please explain how large a single cute::gemm computation is in CUTLASS? Since multiple threads compute together, and it doesn’t explicitly state the number of iterations like CUDA cores do, I find it a bit confusing.

Junkai-Wu · 2025-02-13T07:42:24Z

At the lowest level, a single cute::gemm call executes a single MMA instruction, so the size of the computation depends on the MMA instruction you use. For example, the Ampere MMA with fp16 inputs computes a 16x8x16 (MxNxK) product in one instruction, so the corresponding cute::gemm call computes a 16x8x16 matrix multiplication. (If the tensors you pass in carry additional M, N, or K modes beyond a single MMA's fragment, the call loops over those modes and issues one MMA per step.) Because this is a warp-level MMA, all threads in the warp execute the instruction together.
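To make the 16x8x16 figure concrete, here is a minimal sketch of the arithmetic in plain C++ (no CUTLASS headers needed). The names `MmaShape`, `kAmpereF16`, and `mmas_per_tile` are hypothetical helpers invented for this illustration, not CuTe APIs, and the 64x64x32 tile in the usage note is just an assumed example size.

```cpp
#include <cassert>
#include <cstdio>

// Shape (M x N x K) of one warp-level MMA instruction.
// Hypothetical helper type for illustration; not part of CuTe.
struct MmaShape {
    int m, n, k;
};

// The Ampere fp16 shape mentioned in the answer above (16x8x16).
constexpr MmaShape kAmpereF16{16, 8, 16};
constexpr int kWarpSize = 32;

// Multiply-accumulate operations performed by one MMA instruction.
constexpr int macs_per_mma(MmaShape s) { return s.m * s.n * s.k; }

// Elements of A (M x K), B (N x K), and C (M x N) held by each thread
// when the operands are spread evenly across the 32 threads of a warp.
constexpr int a_elems_per_thread(MmaShape s) { return s.m * s.k / kWarpSize; }
constexpr int b_elems_per_thread(MmaShape s) { return s.n * s.k / kWarpSize; }
constexpr int c_elems_per_thread(MmaShape s) { return s.m * s.n / kWarpSize; }

// Number of MMA instructions needed to cover a larger
// tile_m x tile_n x tile_k tile; the tile must be a multiple of the shape.
inline int mmas_per_tile(MmaShape s, int tile_m, int tile_n, int tile_k) {
    assert(tile_m % s.m == 0 && tile_n % s.n == 0 && tile_k % s.k == 0);
    return (tile_m / s.m) * (tile_n / s.n) * (tile_k / s.k);
}

// Print a short summary for a given shape and tile.
inline void print_summary(MmaShape s, int tile_m, int tile_n, int tile_k) {
    std::printf("MACs per MMA: %d\n", macs_per_mma(s));
    std::printf("per-thread elements A/B/C: %d/%d/%d\n",
                a_elems_per_thread(s), b_elems_per_thread(s),
                c_elems_per_thread(s));
    std::printf("MMAs for a %dx%dx%d tile: %d\n", tile_m, tile_n, tile_k,
                mmas_per_tile(s, tile_m, tile_n, tile_k));
}
```

Calling `print_summary(kAmpereF16, 64, 64, 32)` reports 2048 multiply-accumulates per instruction, 8/4/4 elements of A/B/C per thread, and 64 instructions for the assumed 64x64x32 tile. So each thread only ever holds a small fragment, and the "iteration count" you would write by hand on CUDA cores is replaced by how many MMA-sized pieces the tile splits into.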
