Set __launch_bounds__ in kernel whenever we are able #3794

Open · wants to merge 6 commits into main

Conversation

@jacobhinkle (Collaborator) commented Jan 29, 2025

Currently we set the number of threads per block via __launch_bounds__ only when register sharing is enabled. This PR enables it whenever possible, i.e. whenever the CTA size is known at compile time.

Adds the method ParallelDimensionMap::getNumThreadsEachBlock(), which is similar to ParallelDimensionMap::getNumComputeThreadsEachBlock() but counts all threads and does not skip DMA threads.

See https://docs.nvidia.com/cuda/cuda-c-programming-guide/#launch-bounds for more background.
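
As a sketch of the intended effect on the generated code (argument types simplified here; real generated kernels take nvFuser Tensor<> arguments, and the 256 assumes TID extents that multiply to 256, as in the updated tests):

// Sketch only: simplified arguments, not the exact generated signature.
// With a CTA size known at compile time, codegen now emits the bound on
// the kernel declaration.
__global__ void __launch_bounds__(/*MAX_THREADS_PER_BLOCK=*/256)
    CUDAGeneratedKernel(const float* in, float* out) {
  // ... generated kernel body ...
}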

@jacobhinkle (Collaborator, Author)

!test

github-actions bot commented Jan 29, 2025

PR Reviewer Guide 🔍

(Review updated until commit b78ae42)

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 2 🔵🔵⚪⚪⚪
🧪 No relevant tests
⚡ Recommended focus areas for review

Const Cast

The code uses a const_cast to avoid a const correctness issue. Consider re-designing the code to avoid the need for const_cast.

// Avoid a const_cast that would be required to use kernel_ by picking the
// fusion of the first kernel output
FusionGuard fg(kernel_->outputs().front()->fusion());
Thread Count Calculation

The getNumThreadsEachBlock method calculates the total number of threads per block by multiplying the number of threads for each parallel type. Verify that this calculation is correct and accounts for all possible parallel types.

Val* ParallelDimensionMap::getNumThreadsEachBlock() const {
  Val* num_threads = FusionGuard::getCurFusion()->oneVal();
  for (auto pt : kParallelTypeTIDs) {
    num_threads = SimplifyingIrBuilder::mulExpr(num_threads, getRaw(pt));
  }
  return num_threads;
}
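
For a concrete check, a standalone sketch of the same folding with hypothetical constant TID extents (128, 2, 1); this is only the arithmetic performed when every extent is constant, not nvFuser code:

#include <cstdint>
#include <iostream>
#include <vector>

int main() {
  // Hypothetical compile-time extents for TIDx, TIDy, TIDz.
  const std::vector<int64_t> tid_extents{128, 2, 1};
  int64_t num_threads = 1;
  for (int64_t extent : tid_extents) {
    num_threads *= extent;  // mirrors the mulExpr chain when all extents fold
  }
  std::cout << num_threads << "\n";  // prints 256
  return 0;
}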

@zasdfgbnm (Collaborator) left a comment

Makes sense to me, but will let Ryan decide.

@rdspring1 (Collaborator)

!test --pybench

@rdspring1 (Collaborator) left a comment

num_threads_per_cta should always have a value because it is the inferred launch bounds. https://github.com/NVIDIA/Fuser/blob/main/csrc/runtime/executor.cpp#L268-L286

This seems different from what you intended. I thought you wanted to set the launch bounds when TIDx, TIDy, and TIDz extents are constant.

@jacobhinkle (Collaborator, Author)

> This seems different from what you intended. I thought you wanted to set the launch bounds when TIDx, TIDy, and TIDz extents are constant.

Ah, you're right. I did intend it to not be derived from inputs since that would interfere with dynamic shapes. I'll give it another try.
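
To illustrate the intended condition, an abstract sketch (not nvFuser code, names are hypothetical): emit a bound only when every TID extent folds to a compile-time constant, and skip it when an extent is derived from the inputs, so dynamic shapes are not constrained:

#include <cstdint>
#include <optional>

// Hypothetical model of a TID extent: either a compile-time constant or a
// value only known once inputs are bound.
struct TidExtent {
  std::optional<int64_t> compile_time_constant;
};

// Return a launch bound only when every TID extent is constant; otherwise
// return nothing so no __launch_bounds__ is emitted.
std::optional<int64_t> launchBound(const TidExtent (&tids)[3]) {
  int64_t product = 1;
  for (const TidExtent& e : tids) {
    if (!e.compile_time_constant.has_value()) {
      return std::nullopt;  // input-derived extent: no static bound
    }
    product *= *e.compile_time_constant;
  }
  return product;
}

int main() {
  TidExtent tids[3] = {{128}, {2}, {1}};
  return launchBound(tids).value_or(0) == 256 ? 0 : 1;
}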

@jacobhinkle (Collaborator, Author)

!test --diff

@jacobhinkle (Collaborator, Author)

!test --diff

@jacobhinkle (Collaborator, Author)

!test --diff

github-actions bot commented Jan 31, 2025

Review updated until commit 508d7ae

Description

  • Added getNumThreadsEachBlock() to ParallelDimensionMap

  • Set __launch_bounds__ whenever CTA size is known at compile time

  • Updated tests to reflect new __launch_bounds__ values

  • Fixed errors related to const_cast usage


Changes walkthrough 📝

Relevant files

Enhancement

  • codegen.cpp (csrc/codegen.cpp, +10/-2): Enhance launch bounds handling
      - Added logic to set __launch_bounds__ using getNumThreadsEachBlock()
      - Removed redundant __launch_bounds__ setting for register sharing

  • parallel_dimension_map.cpp (csrc/parallel_dimension_map.cpp, +8/-0): Add method for total threads calculation
      - Implemented getNumThreadsEachBlock() to calculate total threads per block

  • parallel_dimension_map.h (csrc/parallel_dimension_map.h, +3/-0): Add method declaration
      - Added declaration for getNumThreadsEachBlock()

Tests

  • test_loop_rotation.cpp (tests/cpp/test_loop_rotation.cpp, +6/-6): Update test expectations
      - Updated expected kernel strings to include __launch_bounds__

  • test_scalar_hoisting.cpp (tests/cpp/test_scalar_hoisting.cpp, +2/-2): Update test expectations
      - Updated expected kernel strings to include __launch_bounds__

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🧪 PR contains tests
⚡ Recommended focus areas for review

Performance Impact

Ensure that setting __launch_bounds__ more frequently does not introduce performance regressions or unexpected behavior.

// NOLINTNEXTLINE(cppcoreguidelines-pro-type-const-cast)
FusionGuard fg(const_cast<kir::Kernel*>(kernel_));
Val* num_threads =
    kernel_->summary().parallel_dimension_map.getNumThreadsEachBlock();
if (num_threads->isConstInt()) {
  code_ << "__launch_bounds__(/*MAX_THREADS_PER_BLOCK=*/"
        << num_threads->evaluate().as<int64_t>() << ") ";
}

Correctness

Verify that getNumThreadsEachBlock() correctly calculates the total number of threads per block, including all types of threads.

Val* ParallelDimensionMap::getNumThreadsEachBlock() const {
  Val* num_threads = FusionGuard::getCurFusion()->oneVal();
  for (auto pt : kParallelTypeTIDs) {
    num_threads = SimplifyingIrBuilder::mulExpr(num_threads, getRaw(pt));
  }
  return num_threads;
}

Test Coverage

Ensure that the tests cover a variety of scenarios and that the expected kernel strings accurately reflect the changes made.

  KernelExecutor ke;
  ke.compile(fusion.get(), {t0});
  auto cg_outputs = ke.run({t0});

  const std::string expected_kernel = R"(
__global__ void __launch_bounds__(/*MAX_THREADS_PER_BLOCK=*/256) CUDAGeneratedKernel(Tensor<float, 2, 2> T0, Tensor<float, 2, 2> T2) {

@jacobhinkle (Collaborator, Author)

!test --diff
