What is your question?
Could you please explain how large a single cute::gemm computation is in CUTLASS? Multiple threads compute it together, and the call does not spell out the number of loop iterations the way a hand-written CUDA-core loop does, so I find it a bit confusing.
A single cute::gemm computes a single MMA instruction, so the size of one cute::gemm computation depends on the MMA instruction you use. For example, the Ampere MMA with fp16 inputs computes a 16x8x16 (MxNxK) tile in a single instruction, so the corresponding cute::gemm computes a 16x8x16 matrix multiplication in each call. Because this is a warp-level MMA, all 32 threads in the warp execute the instruction together, and each thread holds only a fragment of A, B, and C. Note that if the tensors you pass in carry extra M, N, or K modes beyond the per-instruction fragment, cute::gemm loops over those modes and issues one such instruction per iteration.
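To make the sizes concrete, here is a minimal host-side sketch of the arithmetic for the 16x8x16 warp-level case described above. It does not use CuTe at all: `MmaShape`, `kAmpereF16`, and `instructions_for_tile` are hypothetical names introduced only for illustration, and the 128x128x32 tile is an assumed example size, not something taken from CUTLASS.

```cpp
// Plain C++ shape arithmetic only; no CuTe or CUDA headers are needed.
// All names here are illustrative, not part of the CUTLASS/CuTe API.
struct MmaShape {
  int m, n, k;   // tile extents covered by one MMA instruction
  int threads;   // threads that cooperate on one instruction

  // Multiply-accumulate operations performed by one instruction.
  constexpr int macs() const { return m * n * k; }
  // Elements of each operand held by a single thread, assuming an even split.
  constexpr int a_per_thread() const { return m * k / threads; }
  constexpr int b_per_thread() const { return n * k / threads; }
  constexpr int c_per_thread() const { return m * n / threads; }
};

// The Ampere fp16 case from the answer: 16x8x16, executed by one warp (32 threads).
constexpr MmaShape kAmpereF16{16, 8, 16, 32};

// Number of MMA instructions needed to cover an M x N x K tile,
// assuming each extent is a multiple of the instruction shape.
constexpr int instructions_for_tile(MmaShape s, int M, int N, int K) {
  return (M / s.m) * (N / s.n) * (K / s.k);
}

// One instruction performs 16 * 8 * 16 multiply-accumulates.
static_assert(kAmpereF16.macs() == 2048, "MACs per instruction");
// Per-thread fragment sizes: 8 elements of A, 4 of B, 4 of C.
static_assert(kAmpereF16.a_per_thread() == 8, "A fragment");
static_assert(kAmpereF16.b_per_thread() == 4, "B fragment");
static_assert(kAmpereF16.c_per_thread() == 4, "C fragment");
// An assumed 128x128x32 tile needs 8 * 16 * 2 instructions.
static_assert(instructions_for_tile(kAmpereF16, 128, 128, 32) == 256,
              "instructions per tile");
```

The checks run at compile time, so the snippet compiles only if the numbers hold. The per-thread counts show why no loop bound appears in the source: the work inside one instruction is divided across the warp by the hardware, and any looping happens over whole instruction-sized tiles.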