I wrote out this permutation for a TN TF32 16x8x8. I was trying to get the threads to be contiguous when writing N major. I have the following TiledMMA with a cta size of (128,128,32):
Besides working out the swizzle for B (I can see visually what it should do), am I interpreting this diagram correctly that with the right warp shfl_sync I should be able to write 128B contiguous as 16B per thread stores (again, in N major)? Am I paying anything in a non-obvious way with this permutation that I might be missing?
Edit: For example, does this break any of the cp.async or ldmatrix assumptions?
Thanks!