Inlining error in Hopper matmul with AxisMapping and grid swizzling #3671
Comments
This updates the default (non-plugin) matmul heuristic to support Hopper matmuls. This change means that we can now run matmuls on Hopper similarly to how we do it on Ampere and Turing, including using the Python interface. I tried to make the default heuristic somewhat thoughtful and not just a placeholder. Here are some notes about the Hopper heuristic in its current form:
- I set the macro to Hopper_64_64_16. I intended to always use the largest macro for which the N size divided the problem's N, but this led to lower perf on the handful of examples I looked at. We should benchmark more and find out why this is once we have warp specialization and register stealing fully plumbed in, but for the time being I simply left it at N=64.
- Once the instruction tile is set we set the warp tile equal to the instruction tile (we can revisit this in the future). Then to find the CTA tile we double the instruction tile in the M or N dimension until we run out of registers.
- We start with 8 circular buffering stages and decrease until the circular buffers fit into smem.
- We use `use_smem_epilogue` when possible. Whenever that is possible we _always_ use `promote_prologue_smem_reuse` even if it's not needed. This is to try and avoid bugs like #3602.
- I set the tile rasterization order so that the fast axis is the axis with the fewest tiles, which should encourage more L2 hits unless there are tons of tiles in each dimension.
- I cannot yet set grid swizzling due to #3671, but I placed a TODO comment and some code to do the proper swizzling.

---------

Co-authored-by: Ryan Spring <rspring@nvidia.com>
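The stage-count and rasterization notes in the commit message above can be sketched roughly as follows. This is a minimal illustration, not the nvFuser implementation: the function names `pick_stages` and `fast_axis`, the per-stage footprint model (one A tile plus one B tile), and the 227 KiB shared memory budget used in the usage note are all assumptions made for the example.

```python
def pick_stages(cta_m, cta_n, cta_k, bytes_per_elem, smem_bytes, max_stages=8):
    """Start at max_stages and decrease until the circular buffers fit in smem.

    Assumed footprint model: each stage holds one A tile (cta_m x cta_k)
    and one B tile (cta_n x cta_k). Real kernels may need extra space.
    """
    per_stage = (cta_m * cta_k + cta_n * cta_k) * bytes_per_elem
    stages = max_stages
    while stages > 1 and stages * per_stage > smem_bytes:
        stages -= 1
    return stages


def fast_axis(m, n, cta_m, cta_n):
    """Pick the rasterization fast axis as the axis with the fewest CTA tiles."""
    tiles_m = -(-m // cta_m)  # ceiling division
    tiles_n = -(-n // cta_n)
    return "M" if tiles_m <= tiles_n else "N"
```

For example, with a hypothetical 227 KiB budget and half-precision inputs, `pick_stages(128, 256, 64, 2, 227 * 1024)` settles on 4 stages, while a small 64x64x16 tile keeps all 8. `fast_axis(4096, 512, 128, 128)` picks `"N"` because there are only 4 tiles along N versus 32 along M.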
This error also appears whenever we try and do persistent kernel scheduling when there is a translated
I think the problem is in the I am considering now whether #3372 was a mistake and if we should revisit something like #3366 instead. What do you think @naoyam ?
So, the issue is we can't inline
Here is a diagram of grid swizzling for matmul: Here we need to inline Inlining is one thing: we can pretty much repeat the logic of permissive map to create an inlining check that allows this. However, when I bypass that check currently, I hit an error in Another option (besides #3366) would be to try and make the additional IDs broadcast mapped with the corresponding consumer dimension, which would make these IDs all permissive mapped. For example we could add that as an option to
Thanks. I think I understand what's happening and what the problem is. I wonder if we could just get rid of the offending BroadcastOp during lowering. #3366 may work. I'd also try
If we left the BroadcastOps in and scheduled it similar to how we do on Ampere, then we'd have something like this:
If I understand correctly, you're saying we could potentially remove those broadcasts at lowering (say in the indexing pass) and replace
Some of the localized KIR changes are also done after indexing, e.g., https://github.com/NVIDIA/Fuser/blob/main/csrc/device_lower/lower2device.cpp#L283 Removing the ops in KIR may feel ad-hoc, but I'd suspect #3366 would be a significant change.
The inlining logic for `MmaOp` with `AxisMapping` checks that unmapped dimensions are `Broadcast`. We expect to have something like this

In this case, we are able to inline the mma operation that consumes these two tensors, but we check that the unmapped IDs 5, 6, 13, and 14 are `Broadcast` and that the operation is an `MmaOp`.

In the case of grid swizzling by a factor 4, we will do some further scheduling here. For example we will have

Now we have mixed the first two outer dimensions with this swizzle and what used to be a simple split of a loop broadcast (`bS5`) is now an iteration ID `iS22{4}` resulting from the merge.

I am not sure yet how to address this. I don't think we can just inline here without some other changes since when I disable this check I get errors in expression sorting.
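To make the "mixing" of the two outer dimensions concrete, here is a small index-level sketch of grid swizzling by a factor. It assumes the swizzle is built as split-N-by-factor, reorder, then merge with M, which is one common way to do it and is not confirmed by the text above. The helper names `swizzled_grid` and `swizzled_tile` are hypothetical and exist only for this illustration.

```python
def swizzled_grid(tiles_m, tiles_n, factor):
    """Grid shape after the assumed schedule:
    [M, N] -> split N by factor -> [M, N/factor, factor]
           -> reorder           -> [M, factor, N/factor]
           -> merge first two   -> [M*factor, N/factor]
    """
    if tiles_n % factor != 0:
        raise ValueError("this sketch assumes factor divides the N tile count")
    return tiles_m * factor, tiles_n // factor


def swizzled_tile(bx, by, factor):
    """Map a block index in the swizzled grid back to its (m_tile, n_tile)."""
    m_tile = bx // factor               # outer part of the merged axis
    n_tile = by * factor + bx % factor  # inner part comes from the N split
    return m_tile, n_tile
```

With 2 tiles along M, 8 along N, and a factor of 4, the grid becomes 8 by 2, and every original tile is still visited exactly once. The merged axis `bx` now carries a piece of N (the `bx % factor` term), which is the index-level analogue of a merged ID of extent 4 that no longer looks like a plain split of a broadcast.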