[Kernel] Update cutlass_scaled_mm to support 2d group (blockwise) scaling #11868
Conversation
Really nice performance! When you are ready for e2e testing lmk and we can hook these up for a full dsv3 eval
// Check for strides and alignment
TORCH_CHECK(a.stride(1) == 1 && c.stride(1) == 1);  // Row-major
TORCH_CHECK(b.stride(0) == 1);                      // Column-major
TORCH_CHECK(c.stride(0) % 16 == 0 &&
            b.stride(1) % 16 == 0);                 // 16 byte alignment
TORCH_CHECK(a_scales.is_contiguous() && b_scales.is_contiguous());
Why would these not be contiguous coming in?
Currently, for the blockwise scaling, a_scales and b_scales must be column-major. This is something we may need to relax for a_scales (but likely won't for b_scales, since we can just transpose it offline). Figured I'd save that for a future PR though.
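As a rough illustration (a minimal sketch using libtorch; the helper name is hypothetical and not part of this PR), one way a caller could produce the column-major scale layout offline, e.g. for b_scales:

```cpp
#include <torch/torch.h>

// Hypothetical helper (illustration only): return a tensor with the same
// logical shape but column-major storage. For b_scales this transpose can
// be done once, offline; a_scales is produced per call, which is why
// relaxing its layout is left for a future PR.
torch::Tensor to_column_major(const torch::Tensor& t) {
  // Transposing, making contiguous, then transposing back yields
  // column-major storage for a 2D tensor.
  return t.t().contiguous().t();
}
```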
static constexpr int AlignmentA = 128 / cutlass::sizeof_bits<ElementA>::value;

using ElementB = ElementAB;
using LayoutB = cutlass::layout::ColumnMajor;
static constexpr int AlignmentB = 128 / cutlass::sizeof_bits<ElementB>::value;

using ElementD = OutType;
using StrideD = Stride<int64_t, Int<1>, Int<0>>;
static constexpr int AlignmentD = 128 / cutlass::sizeof_bits<ElementD>::value;
Are these alignments right? I think these are used widely in the CUTLASS examples but we use different/smaller alignments in the vLLM dense cutlass gemm kernels.
Also wondering about the difference between AlignmentC and AlignmentD
We don't actually use smaller alignments in the dense CUTLASS GEMM kernels; we just "hardcoded" them in terms of elements. For example, here for fp8/int8 we "hardcoded" the A and B alignment to 16 elements:

vllm/csrc/quantization/cutlass_w8a8/scaled_mm_c3x.cuh (lines 90 to 91 in 9666369)

ElementAB, cutlass::layout::RowMajor, 16,
ElementAB, cutlass::layout::ColumnMajor, 16,

which is the same as the 128 bits / 8 bits used here, so I'm not really opposed to either. You are correct that I was working off an example, so I went with this style; happy to change it to be "hardcoded" to be more terse.
Also wondering about the difference between AlignmentC and AlignmentD
We don't use C (its type is void), so I had just set the alignment to a dummy value (float32 alignment: 128 bits / 32 bits = 4), but I updated it to be the same as AlignmentD to avoid confusion 👍 (probably a better dummy value anyway).
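For concreteness, a small sketch (assuming the CUTLASS headers are available; the constant names are illustrative) showing that the 128-bit form reduces to the familiar hard-coded element counts:

```cpp
#include <cstdint>
#include <cutlass/numeric_types.h>

// Alignment in elements = 128 bits / element width in bits.
static constexpr int kAlignFp8  = 128 / cutlass::sizeof_bits<cutlass::float_e4m3_t>::value;  // 128/8  = 16
static constexpr int kAlignInt8 = 128 / cutlass::sizeof_bits<int8_t>::value;                 // 128/8  = 16
static constexpr int kAlignFp16 = 128 / cutlass::sizeof_bits<cutlass::half_t>::value;        // 128/16 = 8
static constexpr int kAlignFp32 = 128 / cutlass::sizeof_bits<float>::value;                  // 128/32 = 4

// Matches the hard-coded 16-element alignment used for fp8/int8 A and B
// in scaled_mm_c3x.cuh.
static_assert(kAlignFp8 == 16 && kAlignInt8 == 16, "alignment mismatch");
```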
A few minor comments, but looks good to me!
This pull request has merge conflicts that must be resolved before it can be merged.
…DeepSeekV3 (vllm-project#12587): Integrates the block-quantized kernels introduced in vllm-project#11868 for use in linear layers. Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
Hi @LucasWilkinson, thanks for your awesome work. I am doing similar work but using the latest CUTLASS version, and it does not work correctly. I find you made some other adaptations in …
@LucasWilkinson Just wondering which device you tested these results on? Thank you!
@yizhang2077 Sorry, I haven't gotten around to playing with the latest CUTLASS yet, but I'm hoping to get to it soon. I do remember running into issues with the original kernels, but I don't remember exactly what the issues were (it was a little over a month ago since I last touched the kernel code). I think it was something along the lines of not all threads participating in the copy and the copy not being predicated correctly. I think this commit: d963eb4 encompasses most of the changes relative to @soundOfDestiny's original kernel; not sure about the one that ultimately landed in upstream CUTLASS. Hope this helps!
@LucasWilkinson Thanks for your reply. I also noticed that d963eb4 solved my problem, but I don't know why (I am not familiar with CUTLASS kernels). Anyway, thank you again!
@Andy0422 This was an H100.
@yizhang2077 No worries. After reviewing the commit, it's coming back to me a bit. The original implementation was pretty heavily bugged, so I ended up heavily modifying it. The main issue was that the … I'll open a PR to fix this on upstream CUTLASS soon.
@LucasWilkinson Nice work! I also raised an issue, NVIDIA/cutlass#2087; maybe it can help your PR.
Hi, is there any plan to support smaller tensor core instructions? @LucasWilkinson
I'm contemplating it with the latest CUTLASS updates and NVIDIA/cutlass#2095. Is this something you can help with? If so, and you can handle the vLLM side, then I can update NVIDIA/cutlass#2095 to support partial N tiles.
Thanks for the reply. We are going to use DeepGEMM and maybe support smaller tensor core instructions based on it.
Currently this only supports scale_a block shapes of 1x128 and scale_b block shapes of 128x128 (for DeepSeek V3).
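To make those block shapes concrete, a minimal sketch (libtorch; the names and helper are purely illustrative, not this PR's API) of the scale tensor shapes implied for an M x K activation and K x N weight:

```cpp
#include <torch/torch.h>

int64_t ceil_div(int64_t a, int64_t b) { return (a + b - 1) / b; }

// Illustration only: one scale per 1x128 block of A (per row, per 128-wide
// chunk of K) and one scale per 128x128 block of B.
void blockwise_scale_shapes(int64_t M, int64_t N, int64_t K) {
  auto opts = torch::dtype(torch::kFloat32).device(torch::kCUDA);
  auto a_scales = torch::ones({M, ceil_div(K, 128)}, opts);                 // [M, ceil(K/128)]
  auto b_scales = torch::ones({ceil_div(K, 128), ceil_div(N, 128)}, opts);  // [ceil(K/128), ceil(N/128)]
  (void)a_scales;
  (void)b_scales;
}
```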
Shout-out to @manishucsd and @soundOfDestiny for the kernel; adapted from: https://github.com/soundOfDestiny/cutlass/tree/f8_blockwise_scaling_pr_branch
This PR also splits up scaled_mm_c3x.cu to help parallelize the building of the kernels.

TODO: per_token_group_quant_fp8 to output M-major

Benchmarking (scroll horizontally to see cutlass_fp8_fp8_fp16_scaled_mm_blockwise):