[QST] Inquiry About the Computation Size in a Single cute::gemm Call in CUTLASS #2092

Open
ziyuhuang123 opened this issue Feb 8, 2025 · 1 comment

Comments

@ziyuhuang123

What is your question?
Could you please explain how large a single cute::gemm computation is in CUTLASS? Since multiple threads compute it together, and the call doesn't explicitly state a number of iterations the way code written for CUDA cores does, I find it a bit confusing.

@Junkai-Wu
Contributor

A single cute::gemm computes a single MMA instruction, so how large a single cute::gemm computation is depends on the MMA instruction you use. For example, the Ampere MMA with fp16 inputs computes 16x8x16 (mxnxk) in a single instruction, so the corresponding cute::gemm computes a 16x8x16 matrix multiplication in each call. And since this MMA is a warp-level MMA, all threads in a warp execute the instruction together.
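To make the 16x8x16 arithmetic concrete, here is a small host-only C++ sketch. It does not call CUTLASS or CuTe. The names `MmaShape`, `macs_per_instruction`, `instructions_for_tile` and the per-thread helpers are hypothetical, introduced only for illustration. It assumes the operand elements are split evenly across the 32 threads of a warp, which holds for the 16x8x16 fp16 shape mentioned above.

```cpp
#include <cassert>
#include <cstdio>

// Shape of one warp-level MMA instruction (m x n x k).
// Hypothetical illustration type, not a CUTLASS/CuTe type.
struct MmaShape { int m, n, k; };

// The Ampere fp16 example from the answer: 16x8x16.
constexpr MmaShape kAmpereF16{16, 8, 16};
constexpr int kWarpSize = 32;

// Multiply-accumulate operations performed by one instruction.
constexpr int macs_per_instruction(MmaShape s) { return s.m * s.n * s.k; }

// Elements of each operand held per thread, assuming an even split
// of the operand tile across the warp.
constexpr int a_elems_per_thread(MmaShape s) { return s.m * s.k / kWarpSize; }
constexpr int b_elems_per_thread(MmaShape s) { return s.k * s.n / kWarpSize; }
constexpr int c_elems_per_thread(MmaShape s) { return s.m * s.n / kWarpSize; }

// Number of MMA instructions needed to cover a larger tile.
// The tile extents must be multiples of the instruction shape.
inline int instructions_for_tile(MmaShape s, int tile_m, int tile_n, int tile_k) {
  assert(tile_m % s.m == 0 && tile_n % s.n == 0 && tile_k % s.k == 0);
  return (tile_m / s.m) * (tile_n / s.n) * (tile_k / s.k);
}

// Print a short summary of the per-instruction and per-thread sizes.
inline void print_summary(MmaShape s) {
  std::printf("MACs per instruction: %d\n", macs_per_instruction(s));
  std::printf("per thread: A=%d B=%d C=%d elements\n",
              a_elems_per_thread(s), b_elems_per_thread(s), c_elems_per_thread(s));
}
```

Under these assumptions, one 16x8x16 instruction performs 2048 multiply-accumulates, and each thread holds 8 elements of A, 4 of B, and 4 of the accumulator. A 128x128x32 tile would take `instructions_for_tile(kAmpereF16, 128, 128, 32)` such instructions.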
