Automate performant Hopper matmul #3819

jacobhinkle · 2025-02-04T16:19:36Z

This is a collection of issues that surfaced during our recent perf sprint. Not all of them are necessarily required for decent perf.

K loop efficiency

Encourage uniform register use in K loop. See Opportunistically encourage uniform register usage for some integer scalars #2323
Simplify indexing in K loop. Specifically, the mma descriptors can be simplified, and subexpressions hoisted outside the loop.
Pipelining matmuls. We can currently set prefetch_gap > 1, but in some cases the PTX compiler serializes the MMAs anyway, particularly for persistent kernels.

Persistent matmul

Implement persistent matmul using cooperative strategy (Implement persistent matmul scheduling #3812)
Solve mapping problem with grid swizzling or persistent scheduling and translated MatmulOp or LinearOp nodes. See the comment on Inlining error in Hopper matmul with AxisMapping and grid swizzling #3671
(long term) Implement ping-pong persistent matmul

Syncing issues

Ensure that wgmma commit_group and wait_group happen only once, at the end of the K loop for the math warps. See Add WAR sync for wgmma expressions in compute warp #3863 and Use a single wgmma wait_group to flush async wgmma pipeline #3843
Ensure that the wgmma fence is inserted only once, just before the mma loop in each K loop iteration.
Wait for smem used for the TMA store before the stmatrix loop, not immediately after the store. This lets us overlap the TMA store with the next tile’s MMAs.

Syncing persistent matmuls

Initialize mbarriers outside the persistent loop.
Compute the number of stages processed so far in order to determine the circular buffer stage. This prevents us from resetting to stage 0 of the circular buffer when beginning a new tile.
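
A minimal sketch of the stage computation described above, in plain Python rather than nvFuser code. The function name and parameters are hypothetical, and it assumes every tile processed by a CTA has the same number of K iterations. The parity value is an additional assumption: it models the usual practice of flipping the mbarrier wait parity on each wrap of the circular buffer.

```python
def stage_and_parity(tile_iter, k_iter, k_iters_per_tile, num_stages):
    """Return (stage, parity) for one K iteration of a persistent kernel.

    tile_iter: how many tiles this CTA has already finished (hypothetical name).
    k_iter: index of the current K iteration within the current tile.
    k_iters_per_tile: K iterations per tile, assumed equal for all tiles.
    num_stages: depth of the circular buffer.
    """
    # Count K iterations across all tiles so far instead of restarting at 0.
    total = tile_iter * k_iters_per_tile + k_iter
    stage = total % num_stages
    # Assumption: the wait parity flips every time the buffer wraps around.
    parity = (total // num_stages) % 2
    return stage, parity


# With 4 stages and 6 K iterations per tile, the second tile starts at
# stage 2, continuing from where the first tile stopped.
first_stage_of_second_tile = stage_and_parity(1, 0, 6, 4)
```

Indexing by `k_iter % num_stages` alone would restart every tile at stage 0, which is the reset the item above aims to avoid.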

Register usage/spilling

Automatically enable register sharing with warp specialized kernels and test that the PTX compiler does not override it.
Fix the stack frame observed in persistent kernels due to volatile state in mbarrier arrive. Add checks that we do not have a stack frame or spills in each of our Hopper tests.

Operand load efficiency/L2 locality/grid swizzling

Support large swizzles (like grid_swizzle_factor = 16) without introducing more waves due to nondivisible splits.
ZSwizzle tile grid Y wrt X. See Remove z-shape block swizzle #92
ZSwizzle K loop wrt persistent loop. When the K loop is long enough that L2 is thrashed, turning around during each alternating loop lets us get some L2 hits instead of starting with the coldest region first.
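
To illustrate the "turning around" idea in the last item, here is a small Python sketch of a back-and-forth traversal. It assumes "ZSwizzle" means reversing the inner loop direction on every other outer iteration; the function name is hypothetical and this is not how nvFuser expresses the schedule.

```python
def zigzag_order(num_outer, num_inner):
    """Yield (outer, inner) index pairs, reversing the inner direction
    on every odd outer iteration."""
    for outer in range(num_outer):
        inner_range = range(num_inner)
        if outer % 2 == 1:
            # Start from the most recently touched inner index.
            inner_range = reversed(inner_range)
        for inner in inner_range:
            yield outer, inner


# Outer loop = persistent (tile) loop, inner loop = K loop.
order = list(zigzag_order(2, 3))
```

Under this ordering, the first K step of each tile revisits the K index that the previous tile touched last, which is the region most likely to still be resident in L2. The same function could describe the tile grid case by treating X as the outer index and Y as the inner one.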

Epilogue

Epilogue inputs

Schedule epilogue inputs with TMA, avoiding excessive bank conflicts.
Investigate overlapping TMA loads of those inputs with the K loop, or potentially circular buffering them.

TMA store and stmatrix

Predicate TMA store with electSync. Use PredicateType::ElectSync with TMA store expressions #3814
Reuse smem for TMA stores when there are multiple outputs by waiting for earlier TMA stores.
Split the TMA store into smaller chunks and inline the stmatrix/epilogue for each chunk. We currently split into 64x64 chunks but use separate stmatrix loops and TMA loops. We can re-use memory if we inline these and do them serially.
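
A rough back-of-the-envelope sketch of the memory re-use in the last item, in Python. The 128x256 tile size and 2-byte element size are assumptions for illustration only; the 64x64 chunk size comes from the description. It also assumes the inlined version waits for each chunk's TMA store to finish reading smem before the next stmatrix overwrites it.

```python
def epilogue_smem_bytes(tile_m, tile_n, chunk_m, chunk_n, bytes_per_elem, inlined):
    """Estimate the smem staging buffer needed for one output tile.

    inlined=False: all stmatrix chunks are written before any TMA store is
    issued, so the whole tile must be resident in smem.
    inlined=True: each chunk is written and stored serially, so a single
    chunk-sized buffer can be re-used.
    """
    if tile_m % chunk_m != 0 or tile_n % chunk_n != 0:
        raise ValueError("tile must be divisible by the chunk size")
    num_chunks = (tile_m // chunk_m) * (tile_n // chunk_n)
    chunk_bytes = chunk_m * chunk_n * bytes_per_elem
    return chunk_bytes if inlined else num_chunks * chunk_bytes


# Hypothetical 128x256 tile of 2-byte elements, split into 64x64 chunks.
separate_loops = epilogue_smem_bytes(128, 256, 64, 64, 2, inlined=False)
inlined_loops = epilogue_smem_bytes(128, 256, 64, 64, 2, inlined=True)
```

With these assumed sizes the staging buffer shrinks from 65536 bytes to 8192 bytes, at the cost of serializing the stmatrix and TMA store for each chunk.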

Other

Support TMA for problem sizes requiring 64-bit indexing: Support TMA with 64-bit indexing #3599 Allow 64-bit indexing for TMA instructions, with validation #3850


rdspring1 · 2025-02-11T17:34:54Z

Roadmap update on 2/11/2025

Solve mapping problem with grid swizzling or persistent scheduling and translated MatmulOp or LinearOp nodes. (Thunder Interface and Performance)
Optimize TMA Store and Epilogue (Performance - Unknown)
Implement TMA Multicast (Performance - 1 month)
Implement persistent matmul using cooperative strategy (1 month)
Implement ping-pong persistent matmul (1 month)

Misc Improvements:

Support TMA for problem sizes requiring 64-bit indexing
K loop efficiency
Z-Swizzle grid swizzling

jacobhinkle · 2025-02-11T19:00:12Z

I'll add one to the roadmap:
0. Enable TMA with Int64 problem sizes (last item in description). Estimate: 1 week.

jacobhinkle · 2025-02-12T19:48:22Z

Here are a couple of burn-down lists of the highest-priority items, in decreasing order of priority:

Functionality:

Support translation of MatmulOp and LinearOp with grid swizzling and persistent kernels.
Support TMA with Int64 problem sizes

Performance:

Optimize TMA store and epilogue inputs
Fix syncing in persistent kernels: use long-lived mbarriers, properly place syncs.

jacobhinkle added the H100 Perf (improve performance on H100) and Matmuls labels on Feb 4, 2025

rdspring1 assigned rdspring1 and jacobhinkle on Feb 5, 2025
