
Automate performant Hopper matmul #3819

Open · 2 of 22 tasks

jacobhinkle opened this issue Feb 4, 2025 · 3 comments
Labels: H100, Perf (improve performance on H100), Matmuls

jacobhinkle (Collaborator) commented Feb 4, 2025

This is a collection of issues that were surfaced in our recent perf sprint. These issues are not necessarily all required for decent perf.

K loop efficiency

  • Encourage uniform register use in the K loop. See Opportunistically encourage uniform register usage for some integer scalars #2323
  • Simplify indexing in the K loop. Specifically, the mma descriptors can be simplified and subexpressions hoisted outside the loop.
  • Pipeline mmas. We can currently set prefetch_gap > 1, but in some cases the PTX compiler serializes the mmas anyway, particularly in persistent kernels.

Persistent matmul

Syncing issues

Syncing persistent matmuls

  • Initializing mbarriers outside persistent loop.
  • Compute the number of stages processed so far in order to determine the circular buffer stage. This prevents us from resetting to stage 0 of the circular buffer when beginning a new tile.

Register usage/spilling

  • Automatically enable register sharing with warp-specialized kernels and test that the PTX compiler does not override it.
  • Fix the stack frame observed in persistent kernels due to volatile state in the mbarrier arrive. Add checks that we have no stack frame or spills in each of our Hopper tests.

Operand load efficiency/L2 locality/grid swizzling

  • Support large swizzles (like grid_swizzle_factor = 16) without introducing more waves due to non-divisible splits.
  • Z-swizzle the tile grid Y wrt X. See Remove z-shape block swizzle #92
  • Z-swizzle the K loop wrt the persistent loop. When the K loop is long enough that L2 is thrashed, reversing the traversal direction on alternating loops lets us get some L2 hits instead of always restarting from the coldest region.

Epilogue

Epilogue inputs

  • Schedule epilogue inputs with TMA while avoiding excessive bank conflicts.
  • Investigate overlapping TMA loads of these inputs with the K loop, or potentially circular-buffering them.

TMA store and stmatrix

  • Predicate the TMA store with electSync. See Use PredicateType::ElectSync with TMA store expressions #3814
  • Reuse smem for TMA stores when there are multiple outputs by waiting on earlier TMA stores.
  • Split the TMA store into smaller chunks and inline the stmatrix/epilogue for each chunk. We currently split into 64x64 chunks but use separate stmatrix loops and TMA loops. We can reuse memory if we inline these and do them serially.

Other

@jacobhinkle jacobhinkle added H100 Perf improve performance on H100 Matmuls labels Feb 4, 2025
rdspring1 (Collaborator) commented Feb 11, 2025

Roadmap update on 2/11/2025

  1. Solve the mapping problem with grid swizzling or persistent scheduling and translated MatmulOp or LinearOp nodes. (Thunder Interface and Performance)
  2. Optimize TMA Store and Epilogue (Performance - Unknown)
  3. Implement TMA Multicast (Performance - 1 month)
  4. Implement persistent matmul using cooperative strategy (1 month)
  5. Implement ping-pong persistent matmul (1 month)

Misc Improvements:

  • Support TMA for problem sizes requiring 64-bit indexing
  • K loop efficiency
  • Z-Swizzle grid swizzling

jacobhinkle (Collaborator, Author) commented

I'll add one to the roadmap:
0. Enable TMA with Int64 problem sizes (last item in the description). Estimate: 1 week.

jacobhinkle (Collaborator, Author) commented

Here are a couple of burn-down lists of the highest-priority items, ordered by decreasing priority.

Functionality:

  • Support translation of MatmulOp and LinearOp with grid swizzling and persistent kernels.
  • Support TMA with Int64 problem sizes

Performance:

  • Optimize TMA store and epilogue inputs
  • Fix syncing in persistent kernels: use long-lived mbarriers and properly place syncs.
