[QST] Permutation layout for contiguous stores #2127

Open
capybara-club opened this issue Feb 22, 2025 · 0 comments

I wrote out this permutation for the TN TF32 16x8x8 MMA atom. The goal is for the threads' stores to be contiguous when writing the output in N-major order. I have the following TiledMMA with a CTA tile size of (128,128,32):

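    TiledMMA mma = make_tiled_mma(
        SM80_16x8x8_F32TF32TF32F32_TN{},
        // Atom layout: 4 warps along M, 1 along N
        Layout<Shape<_4,_1>, Stride<_1,_4>>{},
        // Permutations for M, N, and K
        Tile<
            Layout<Shape<_16>, Stride<_1>>,        // M: identity
            Layout<Shape<_2,_16>, Stride<_1,_8>>,  // N: pairs of columns at stride 8
            _8                                     // K
        >{}
    );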

[Image: layout diagram for the TiledMMA above]

Setting aside the swizzle for B (I can see visually what it should do), am I reading this diagram correctly: with the right warp-level `__shfl_sync` exchanges, I should be able to write 128 B contiguously as 16 B stores per thread (again, in N-major order)? Is there a non-obvious cost to this permutation that I might be missing?

Edit: For example, does this break any of the cp.async or ldmatrix assumptions?

Thanks!
