Fix stmatrix scheduling for persistent GEMM #3791

rdspring1 · 2025-01-29T17:16:50Z

Problem

The current stmatrix scheduling assumes that every iterDomain to the left of the mma output allocation domain is parallelized.
It therefore moves the stmatrix serial iterDomain to the 0th position. This is incompatible with persistent GEMM kernels, which have a grid-strided serial iterDomain.

Solution

This PR fixes this by moving the stmatrix serial iterDomain back one position and using the 3rd from the end for-loop during index generation. The 3rd from the end for-loop corresponds to the 0th position of the mma output allocation domain.
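
As a rough illustration of why indexing from the end is robust, here is a toy model (not nvFuser code). `LoopNest`, `thirdFromEnd`, and the loop labels are hypothetical names chosen for this sketch. The loop nest is represented as a list of strings, and it is assumed that a persistent kernel adds one extra serial loop on the outside.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for the loop-nest stack seen during index
// generation; the real entries are for-loop nodes, not strings.
using LoopNest = std::vector<std::string>;

// Pick the loop by counting from the innermost end, so that extra outer
// loops (e.g. a grid-strided serial loop) do not shift the result.
const std::string& thirdFromEnd(const LoopNest& loops) {
  assert(loops.size() >= 3);
  return loops[loops.size() - 3];
}
```

With a nest such as `{"no", "TIDx", "vectorize"}` and a persistent variant `{"grid_stride", "no", "TIDx", "vectorize"}`, `thirdFromEnd` selects `"no"` in both cases, whereas position 0 would select `"grid_stride"` in the second.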

rdspring1 · 2025-01-29T17:16:56Z

!test

github-actions · 2025-01-29T17:17:30Z

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 2 🔵🔵⚪⚪⚪

🧪 No relevant tests

⚡ Recommended focus areas for review

Index Generation

The PR changes the index generation for stmatrix to use the 3rd from the end for-loop. This change may have implications for the correctness and performance of the stmatrix scheduling.

      ldst, for_loops_[for_loops_.size() - 3], m_tile, n_tile, m, n);
  break;
case MmaInputSmemSwizzle::B128:
case MmaInputSmemSwizzle::B64:
case MmaInputSmemSwizzle::B32:
  out = hardCodedIndexGenerationForStMatrixSwizzle(
      ldst, for_loops_[for_loops_.size() - 3], m_tile, n_tile, m, n);

IterDomain Reordering

The PR reorders the iterDomains to accommodate the stmatrix scheduling. This change may have implications for the correctness and performance of the mma output allocation.

  tv->reorder({{-4, -5}});
  // [128(TIDx), 2(no), 2(ni), 2, 2] -> [2(no), 128(TIDx), 8 (vectorize)]
  tv->merge(-3);
  tv->merge(-2);
} else if (tile_m == 16 && tile_n == 8) {
  // Let [M, N] be [64, 16]
  // After scheduleMmaOutputAllocation: [128(TIDx), 2, 2, 2]
  // [128(TIDx), 2, 2, 2] -> [2, 128(TIDx), 2, 2]
  tv->reorder({{-3, -4}});
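
The reorder comments above can be checked with a small standalone model. This is a sketch only: `reorderAxes` is a hypothetical helper, and it assumes that a reorder maps old positions to new positions, that negative positions count from the end, and that axes not named in the map keep their relative order.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Toy model of an old-position -> new-position reorder (requires C++17).
// Negative positions count from the end; unmapped axes keep their order.
std::vector<std::string> reorderAxes(
    const std::vector<std::string>& axes,
    const std::map<int, int>& old2new) {
  const int n = static_cast<int>(axes.size());
  auto wrap = [n](int i) { return i < 0 ? i + n : i; };

  std::vector<std::string> out(n);
  std::vector<bool> new_taken(n, false), old_moved(n, false);
  for (const auto& [old_pos, new_pos] : old2new) {
    assert(wrap(old_pos) >= 0 && wrap(old_pos) < n);
    assert(wrap(new_pos) >= 0 && wrap(new_pos) < n);
    out[wrap(new_pos)] = axes[wrap(old_pos)];
    new_taken[wrap(new_pos)] = true;
    old_moved[wrap(old_pos)] = true;
  }
  // Fill the remaining slots, left to right, with the axes that did not move.
  int next = 0;
  for (int i = 0; i < n; ++i) {
    if (old_moved[i]) continue;
    while (new_taken[next]) ++next;
    out[next++] = axes[i];
  }
  return out;
}
```

Under these assumptions, applying `{{-4, -5}}` to `[128(TIDx), 2(no), 2(ni), 2, 2]` moves `2(no)` in front of `128(TIDx)`, which is the layout the two subsequent merges collapse into `[2(no), 128(TIDx), 8 (vectorize)]`.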

jacobhinkle

LGTM. Thanks!

cherry-pick: fix stmatrix indexing

1554e02

rdspring1 added the Matmuls label Jan 29, 2025

jacobhinkle approved these changes Jan 29, 2025

rdspring1 merged commit a3f6ba9 into main Jan 30, 2025
51 checks passed

rdspring1 deleted the fix_stmatrix_for_loop branch January 30, 2025 02:09
