Set __launch_bounds__ in kernel whenever we are able #3794
base: main
Conversation
Currently we only set the number of threads per block via `__launch_bounds__` when register sharing is enabled. This PR enables it whenever possible, i.e. whenever we know the CTA size at compile time. It also adds `ParallelDimensionMap::getNumThreadsEachBlock()`, which is similar to `ParallelDimensionMap::getNumComputeThreadsEachBlock()` but includes all threads and doesn't skip DMA threads. See https://docs.nvidia.com/cuda/cuda-c-programming-guide/#launch-bounds for more background.
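For background on what the attribute buys us, here is a minimal CUDA sketch (a hypothetical kernel, not one generated by nvFuser) of the kind of signature this PR emits. The bound promises ptxas that the kernel will never be launched with more threads per block than stated, which lets it budget registers per thread more aggressively:

```cuda
// Hypothetical kernel: __launch_bounds__ caps the CTA size at 128 threads,
// so ptxas can allocate registers per thread against that cap instead of
// the architectural maximum of 1024 threads per block.
__global__ void __launch_bounds__(/*MAX_THREADS_PER_BLOCK=*/128)
    scale_kernel(float* out, const float* in, float s) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  out[idx] = in[idx] * s;
}
```

The flip side is that launching such a kernel with more than 128 threads per block fails at runtime, which is why the bound can only be emitted when the CTA size cannot vary across launches.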
!test
PR Reviewer Guide 🔍 (review updated until commit b78ae42)
Makes sense to me, but will let Ryan decide.
!test --pybench
`num_threads_per_cta` should always have a value because it is the inferred launch bounds. https://github.com/NVIDIA/Fuser/blob/main/csrc/runtime/executor.cpp#L268-L286
This seems different from what you intended. I thought you wanted to set the launch bounds when TIDx, TIDy, and TIDz extents are constant.
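For context, the distinction here is between a value inferred from the `LaunchParams` of the inputs used at compilation and a value that is constant in the kernel IR itself. A minimal sketch of the intended check, reusing the `ParallelDimensionMap` names that appear later in this thread (the free-standing function is illustrative, not the actual diff):

```cpp
// Sketch: emit __launch_bounds__ only when the CTA size is a compile-time
// constant in the kernel IR, independent of the inputs the kernel happened
// to be compiled with.
bool ctaSizeIsCompileTimeConstant(const kir::Kernel* kernel) {
  Val* num_threads =
      kernel->summary().parallel_dimension_map.getNumThreadsEachBlock();
  return num_threads->isConstInt();
}
```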
Ah, you're right. I did intend for it not to be derived from the inputs, since that would interfere with dynamic shapes. I'll give it another try.
!test --diff
!test --diff
!test --diff
PR Reviewer Guide 🔍 (review updated until commit 0088161)
!test --diff
!test --diff
The codediff doesn't show anything concerning, IMO. Register usage is usually reduced, though sometimes increased by a few registers.
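As an aside, for anyone wanting to reproduce such register comparisons by hand: per-kernel register counts can be inspected by passing `-Xptxas -v` to nvcc. A minimal sketch with hypothetical kernels (not taken from the codediff):

```cuda
// Compile with: nvcc -arch=sm_80 -Xptxas -v bounds_demo.cu
// ptxas then prints a "Used N registers" line per kernel, so the bounded
// and unbounded variants can be compared directly.
__global__ void __launch_bounds__(/*MAX_THREADS_PER_BLOCK=*/128)
    bounded(float* out, const float* in) {
  out[threadIdx.x] = in[threadIdx.x] + 1.0f;
}

__global__ void unbounded(float* out, const float* in) {
  out[threadIdx.x] = in[threadIdx.x] + 1.0f;
}
```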
```cpp
if (kernel_->hasManaged("enable_register_sharing") &&
    kernel_->getManaged<bool>("enable_register_sharing")) {
  NVF_ERROR(
      num_threads_per_cta.has_value(),
      "__launch_bounds__ must be set for register sharing warp specialization");
  code_ << "__launch_bounds__(/*MAX_THREADS_PER_BLOCK=*/"
```
Isn't this still necessary, because the number of threads can be determined at runtime? How about changing the predicate to:
```cpp
// Always set __launch_bounds__ when register sharing is enabled.
if (kernel_->hasManaged("enable_register_sharing") &&
    kernel_->getManaged<bool>("enable_register_sharing")) {
  ...
} else {
  // Enable __launch_bounds__ when the number of threads is known at
  // compile time.
  // NOLINTNEXTLINE(cppcoreguidelines-pro-type-const-cast)
  FusionGuard fg(const_cast<kir::Kernel*>(kernel_));
  Val* num_threads =
      kernel_->summary().parallel_dimension_map.getNumThreadsEachBlock();
  if (num_threads->isConstInt()) {
    code_ << "__launch_bounds__(/*MAX_THREADS_PER_BLOCK=*/"
          << num_threads->evaluate().as<int64_t>() << ") ";
  }
}
```
Good call. I now just try to set `num_threads_per_cta` if it is unset, and then use that to set the launch bounds argument. So in the warp specialization case we will skip the check and just use the provided number of threads.
!test
We hit an error because of this: Fuser/csrc/runtime/executor.cpp, lines 283 to 286 at fc719a7.
When we compile the kernel, we launch it one way, but when we reuse the kernel we can use different launch params. I was thinking that for warp specialization in matmul this is probably not needed, because we will have a fixed block size. @rdspring1, what is the case where we might need runtime determination of the block size? If needed, we could consider caching and recompiling after lowering, as part of …
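A standalone sketch of the reuse hazard described above (a hypothetical kernel, not nvFuser code): a kernel compiled with a bound of 128 threads launches fine at that size but fails when a reusing launch asks for 256.

```cuda
#include <cstdio>

__global__ void __launch_bounds__(/*MAX_THREADS_PER_BLOCK=*/128)
    bounded_kernel(float* data) {
  data[threadIdx.x] += 1.0f;
}

int main() {
  float* d = nullptr;
  cudaMalloc(&d, 256 * sizeof(float));
  bounded_kernel<<<1, 128>>>(d);  // within the bound: launches fine
  printf("128 threads: %s\n", cudaGetErrorString(cudaGetLastError()));
  bounded_kernel<<<1, 256>>>(d);  // exceeds the bound: launch error
  // (typically reported as "too many resources requested for launch")
  printf("256 threads: %s\n", cudaGetErrorString(cudaGetLastError()));
  cudaFree(d);
  return 0;
}
```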
```cpp
if (kernel_->hasManaged("enable_register_sharing") &&
    kernel_->getManaged<bool>("enable_register_sharing")) {
  NVF_ERROR(
      num_threads_per_cta.has_value(),
      "__launch_bounds__ must be set for register sharing warp specialization");
}
if (num_threads_per_cta.has_value()) {
```
`num_threads_per_cta` is always known at compile time, so setting the launch bounds with a runtime value should be guarded by `enable_register_sharing`.
Yeah, I get that now. But due to kernel re-use this is not safe. We might have different `LaunchParams` during compilation than we do for a later launch.
The persistent schedulers can use dynamic block sizes. There was interest in exploring warp specialization with register sharing for those schedulers. https://github.com/NVIDIA/Fuser/blob/main/csrc/scheduler/normalization_inner_outer.cpp#L903-L908