Set __launch_bounds__ in kernel whenever we are able #3794
base: main
Conversation
Currently we set the number of threads per block via `__launch_bounds__` only when register sharing is enabled. This PR enables it whenever possible, i.e. whenever we know the CTA size at compile time. See https://docs.nvidia.com/cuda/cuda-c-programming-guide/#launch-bounds for more background.
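For background on what the generated kernels gain from this: annotating a kernel with `__launch_bounds__` tells nvcc the maximum block size it will be launched with, so the compiler can budget registers per thread accordingly. A minimal sketch (the kernel and its names are illustrative, not code from this PR):

```cuda
// Sketch: when the CTA size (here 128 threads) is known at compile time,
// __launch_bounds__(128) lets nvcc assume at most 128 threads per block
// and allocate registers under that bound. The optional second argument
// (minimum blocks per SM) is omitted.
__global__ void __launch_bounds__(128)
scale_kernel(float* out, const float* in, float s, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    out[i] = in[i] * s;
  }
}

// The launch must not exceed the declared bound, e.g.:
//   scale_kernel<<<num_blocks, 128>>>(out, in, 2.0f, n);
// Launching with more than 128 threads per block would fail.
```

Launching a kernel with more threads than its declared bound is a launch error, which is why the bound can only be emitted when the CTA size is known at compile time.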
!test
PR Reviewer Guide 🔍 (Review updated until commit b78ae42)
Here are some key observations to aid the review process:
Makes sense to me, but will let Ryan decide.
!test --pybench
`num_threads_per_cta` should always have a value because it is the inferred launch bounds. https://github.com/NVIDIA/Fuser/blob/main/csrc/runtime/executor.cpp#L268-L286
This seems different from what you intended. I thought you wanted to set the launch bounds when TIDx, TIDy, and TIDz extents are constant.
Ah, you're right. I did intend it to not be derived from inputs since that would interfere with dynamic shapes. I'll give it another try.
!test --diff
!test --diff
!test --diff
Review updated until commit 508d7ae
Description
Changes walkthrough 📝
PR Reviewer Guide 🔍
Here are some key observations to aid the review process:
!test --diff
Currently we set the number of threads per block via `__launch_bounds__` when register sharing is enabled. This PR enables this whenever it is possible, i.e. whenever we know the CTA size at compile time.

Adds the method `ParallelDimensionMap::getNumThreadsEachBlock()`, which is similar to `ParallelDimensionMap::getNumComputeThreadsEachBlock()` but includes all threads and doesn't skip DMA threads.

See https://docs.nvidia.com/cuda/cuda-c-programming-guide/#launch-bounds for more background.
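The distinction matters because `__launch_bounds__` must cover every thread launched, not just the compute threads. A hypothetical sketch of the counting logic (not the actual Fuser implementation; names and signature are assumptions), treating each thread parallel dimension as either a known constant extent or unmapped:

```cuda
#include <cstdint>

// Hypothetical sketch: the total threads per block is the product of the
// constant extents of TIDx, TIDy, and TIDz. Unmapped dimensions count as 1.
// Unlike a "compute threads" variant, this total includes all threads
// (e.g. DMA/producer warps), which is the value __launch_bounds__ needs:
// declaring a bound smaller than the real block size makes launches fail.
int64_t numThreadsEachBlock(int64_t bdimx, int64_t bdimy, int64_t bdimz) {
  return bdimx * bdimy * bdimz;
}

// Usage sketch: a kernel with a 128x2 compute tile plus extra threads for
// data movement would pass the full count, not just the compute count.
```

If any of the three extents is not a compile-time constant, no bound can be emitted, which matches the PR's "whenever we know the CTA size at compile time" condition.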