
Set __launch_bounds__ in kernel whenever we are able #3794


Open

wants to merge 9 commits into base: main

Conversation

@jacobhinkle (Collaborator) commented Jan 29, 2025

Currently we set the number of threads per block via __launch_bounds__ when register sharing is enabled. This PR enables it whenever possible, i.e. whenever the CTA size is known at compile time.

Adds the method ParallelDimensionMap::getNumThreadsEachBlock(), which is similar to ParallelDimensionMap::getNumComputeThreadsEachBlock() but includes all threads and does not skip DMA threads.

See https://docs.nvidia.com/cuda/cuda-c-programming-guide/#launch-bounds for more background.
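
For illustration, here is a minimal hand-written CUDA sketch of what the qualifier expresses; the kernel name and the 256-thread bound are made up for the example, not taken from the generated code:

// Tell the compiler this kernel will never be launched with more than 256
// threads per block, so it can budget registers against that bound.
__global__ void __launch_bounds__(/*MAX_THREADS_PER_BLOCK=*/256) scale_kernel(
    float* data, float factor, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    data[i] *= factor;
  }
}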

@jacobhinkle (Collaborator Author)

!test


github-actions bot commented Jan 29, 2025

PR Reviewer Guide 🔍

(Review updated until commit b78ae42)

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 2 🔵🔵⚪⚪⚪
🧪 No relevant tests
⚡ Recommended focus areas for review

Const Cast

The code uses a const_cast to avoid a const correctness issue. Consider re-designing the code to avoid the need for const_cast.

// Avoid a const_cast that would be required to use kernel_ by picking the
// fusion of the first kernel output
FusionGuard fg(kernel_->outputs().front()->fusion());
Thread Count Calculation

The getNumThreadsEachBlock method calculates the total number of threads per block by multiplying the number of threads for each parallel type. Verify that this calculation is correct and accounts for all possible parallel types.

Val* ParallelDimensionMap::getNumThreadsEachBlock() const {
  Val* num_threads = FusionGuard::getCurFusion()->oneVal();
  for (auto pt : kParallelTypeTIDs) {
    num_threads = SimplifyingIrBuilder::mulExpr(num_threads, getRaw(pt));
  }
  return num_threads;
}

@zasdfgbnm (Collaborator) left a comment

Makes sense to me, but will let Ryan decide.

@rdspring1 (Collaborator)

!test --pybench

@rdspring1 (Collaborator) left a comment

num_threads_per_cta should always have a value because it is the inferred launch bounds. https://github.com/NVIDIA/Fuser/blob/main/csrc/runtime/executor.cpp#L268-L286

This seems different from what you intended. I thought you wanted to set the launch bounds when TIDx, TIDy, and TIDz extents are constant.

@jacobhinkle (Collaborator Author)

This seems different from what you intended. I thought you wanted to set the launch bounds when TIDx, TIDy, and TIDz extents are constant.

Ah, you're right. I did intend it to not be derived from inputs since that would interfere with dynamic shapes. I'll give it another try.

@jacobhinkle (Collaborator Author)

!test --diff

@jacobhinkle (Collaborator Author)

!test --diff

@jacobhinkle (Collaborator Author)

!test --diff


github-actions bot commented Jan 31, 2025

Review updated until commit 0088161

Description

  • Set __launch_bounds__ whenever CTA size is known at compile time.

  • Add ParallelDimensionMap::getNumThreadsEachBlock() method.

  • Update tests to reflect new kernel launch bounds.


Changes walkthrough 📝

Relevant files

Enhancement

  • csrc/codegen.cpp: Add logic for setting launch bounds (+12/-0)
      • Added logic to evaluate block size and set __launch_bounds__ if CTA size is known.
      • Updated kernel declaration to include __launch_bounds__ when available.

  • csrc/parallel_dimension_map.cpp: Implement getNumThreadsEachBlock method (+8/-0)
      • Implemented getNumThreadsEachBlock() to calculate total threads per block.

  • csrc/parallel_dimension_map.h: Add getNumThreadsEachBlock declaration (+3/-0)
      • Added declaration for getNumThreadsEachBlock().

Tests

  • tests/cpp/test_loop_rotation.cpp: Update expected kernel strings (+6/-6)
      • Updated expected kernel strings to include __launch_bounds__.

  • tests/cpp/test_scalar_hoisting.cpp: Update expected kernel strings (+2/-2)
      • Updated expected kernel strings to include __launch_bounds__.

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🧪 PR contains tests
⚡ Recommended focus areas for review

Launch Bounds Calculation

Ensure that the calculation of num_threads_per_cta is correct and handles all edge cases, especially when num_threads_per_cta is not initially provided.

if (!num_threads_per_cta.has_value()) {
  // Try to evaluate the block size so that we can set launch bounds.
  // NOLINTNEXTLINE(cppcoreguidelines-pro-type-const-cast)
  FusionGuard fg(const_cast<kir::Kernel*>(kernel_));
  Val* num_threads =
      kernel_->summary().parallel_dimension_map.getNumThreadsEachBlock();
  if (num_threads->isConstInt()) {
    num_threads_per_cta = num_threads->evaluate().as<int64_t>();
  }
}

Thread Count Calculation

Verify that getNumThreadsEachBlock correctly calculates the total number of threads per block, including all types of threads.

Val* ParallelDimensionMap::getNumThreadsEachBlock() const {
  Val* num_threads = FusionGuard::getCurFusion()->oneVal();
  for (auto pt : kParallelTypeTIDs) {
    num_threads = SimplifyingIrBuilder::mulExpr(num_threads, getRaw(pt));
  }
  return num_threads;
}

Expected Kernel Output

Confirm that the expected kernel output strings in the tests accurately reflect the changes made to the kernel launch bounds.

const std::string expected_kernel = R"(
__global__ void __launch_bounds__(/*MAX_THREADS_PER_BLOCK=*/256) CUDAGeneratedKernel(Tensor<float, 2, 2> T0, Tensor<float, 2, 2> T2) {

@jacobhinkle (Collaborator Author)

!test --diff

@jacobhinkle (Collaborator Author)

!test --diff

@jacobhinkle (Collaborator Author)

Codediff doesn't show anything concerning IMO. Register usage is usually reduced, though sometimes increased by a few registers.

if (kernel_->hasManaged("enable_register_sharing") &&
    kernel_->getManaged<bool>("enable_register_sharing")) {
  NVF_ERROR(
      num_threads_per_cta.has_value(),
      "__launch_bounds__ must be set for register sharing warp specialization");
  code_ << "__launch_bounds__(/*MAX_THREADS_PER_BLOCK=*/"
Collaborator

Isn't this still necessary because we can determine the number of threads at runtime?

How about changing the predicate to

    // Always set __launch_bounds__ when register sharing is enabled.
    if (kernel_->hasManaged("enable_register_sharing") &&
        kernel_->getManaged<bool>("enable_register_sharing")) {
      ...
    } else {
      // Enable __launch_bounds__ when the number of threads is known at compile time.
      // NOLINTNEXTLINE(cppcoreguidelines-pro-type-const-cast)
      FusionGuard fg(const_cast<kir::Kernel*>(kernel_));
      Val* num_threads =
          kernel_->summary().parallel_dimension_map.getNumThreadsEachBlock();
      if (num_threads->isConstInt()) {
        code_ << "__launch_bounds__(/*MAX_THREADS_PER_BLOCK=*/"
              << num_threads->evaluate().as<int64_t>() << ") ";
      }
    }


@jacobhinkle (Collaborator Author)

Good call. I now just try to set num_threads_per_cta if it is unset, then use that to set the launch bounds argument. So in the warp specialization case we will skip the check and just use the provided number of threads.
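
For reference, a rough sketch of how the pieces quoted in this thread fit together after that change, assembled from the snippets above; the final stream insertion is paraphrased rather than copied from the diff:

// If launch params did not provide a CTA size, try to infer a compile-time
// constant one from the parallel dimension map.
if (!num_threads_per_cta.has_value()) {
  // NOLINTNEXTLINE(cppcoreguidelines-pro-type-const-cast)
  FusionGuard fg(const_cast<kir::Kernel*>(kernel_));
  Val* num_threads =
      kernel_->summary().parallel_dimension_map.getNumThreadsEachBlock();
  if (num_threads->isConstInt()) {
    num_threads_per_cta = num_threads->evaluate().as<int64_t>();
  }
}
// Register sharing warp specialization requires launch bounds to be set.
if (kernel_->hasManaged("enable_register_sharing") &&
    kernel_->getManaged<bool>("enable_register_sharing")) {
  NVF_ERROR(
      num_threads_per_cta.has_value(),
      "__launch_bounds__ must be set for register sharing warp specialization");
}
// Emit __launch_bounds__ whenever a thread count is available.
if (num_threads_per_cta.has_value()) {
  code_ << "__launch_bounds__(/*MAX_THREADS_PER_BLOCK=*/"
        << num_threads_per_cta.value() << ") ";
}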

@jacobhinkle (Collaborator Author)

!test

@jacobhinkle (Collaborator Author)

We hit an error because of this:

// Now that we have launch parameters we can compile the kernel. It's a bit
// odd we need launch parameters for compilation, need to go back and check
// why this is the case.
compiled_kernel_->compile(launch_params.nThreads());

When we compile the kernel we compile against one set of launch params, but when we reuse the kernel later it may be launched with different ones. I was thinking that for warp specialization in matmul this is probably not needed because we will have a fixed block size. @rdspring1 what is the case where we might need to determine the block size at runtime?

If needed, we could consider caching and recompiling after lowering, as part of CompiledKernel. This is a very similar challenge to index type: see #3850.

if (kernel_->hasManaged("enable_register_sharing") &&
    kernel_->getManaged<bool>("enable_register_sharing")) {
  NVF_ERROR(
      num_threads_per_cta.has_value(),
      "__launch_bounds__ must be set for register sharing warp specialization");
}
if (num_threads_per_cta.has_value()) {
Collaborator

num_threads_per_cta is always known at compile time, so setting launch bounds with a runtime-determined value should be guarded by enable_register_sharing.

@jacobhinkle (Collaborator Author)

Yeah, I get that now. But due to kernel re-use this is not safe. We might have different LaunchParams during compilation than we do for a later launch.

@rdspring1 (Collaborator) commented Feb 11, 2025

The persistent schedulers can use dynamic block sizes. There was interest in exploring warp specialization with register sharing for those schedulers.

https://github.com/NVIDIA/Fuser/blob/main/csrc/scheduler/normalization_inner_outer.cpp#L903-L908
