You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Simplify indexing in K loop. Specifically, the mma descriptors can be simplified, and subexpressions hoisted outside the loop.
Pipelining matmuls. We currently are able to set prefetch_gap > 1, but in some cases the PTX compiler serializes the mma's anyway, particularly for persistent kernels.
Ensure that the wgmma fence is inserted only once, just before the mma loop in each K loop iteration
Wait for smem used for tma store before the stmatrix loop, not immediately after the store. This lets us overlap the tma store with the next tile’s MMAs.
Syncing persistent matmuls
Initializing mbarriers outside persistent loop.
Compute number of stages processed so far in order to determine circular buffer stage. This prevents us resetting to stage 0 of the circular buffer when beginning a new tile.
Register usage/spilling
Automatically enable register sharing with warp specialized kernels and test that the ptx compiler does not override it.
Fix stack frame observed in persistent kernels due to volatile state in mbarrier arrive. Add checks that we do not have stack frame or spills in each of our hopper tests.
ZSwizzle K loop wrt persistent loop. When the K loop is long enough that L2 is thrashed, turning around during each alternating loop lets us get some L2 hits instead of starting with the coldest region first.
Epilogue
Epilogue inputs
Schedule epilogue inputs with TMA avoiding excessive bank conflicts
Investigate overlapping TMA loads of those inputs with the K loop, or potentially circular buffering them.
Reuse smem for tma stores when there are multiple outputs by waiting for earlier TMA stores.
Split up TMA into smaller chunks and inline the stmatrix/epilogue for each chunk. We currently split into 64x64 chunks but use separate stmatrix loops and tma loops. We can re-use memory if we inline these and do them serially.
This is a collection of issues that were surfaced in our recent perf sprint. These issues are not necessarily all required for decent perf.
K loop efficiency
prefetch_gap > 1
, but in some cases the PTX compiler serializes the mma's anyway, particularly for persistent kernels.Persistent matmul
MatmulOp
orLinearOp
nodes. Inlining error in Hopper matmul with AxisMapping and grid swizzling #3671 (comment)Syncing issues
Syncing persistent matmuls
Register usage/spilling
Operand load efficiency/L2 locality/grid swizzling
grid_swizzle_factor = 16
) without introducing more waves due to nondivisible splits.Epilogue
Epilogue inputs
TMA store and stmatrix
Other
The text was updated successfully, but these errors were encountered: