[QST] Permutation layout for contiguous stores #2127

capybara-club · 2025-02-22T15:07:03Z

I wrote out this permutation for a TN TF32 16x8x8 MMA. I am trying to make each thread's output elements contiguous when writing in N-major order. I have the following TiledMMA with a CTA tile size of (128,128,32):

    TiledMMA mma = make_tiled_mma(
        SM80_16x8x8_F32TF32TF32F32_TN{},
        // Atom layout: 4 x 1 (M x N)
        Layout<Shape<_4,_1>, Stride<_1,_4>>{},
        Tile<
            // M mode
            Layout<Shape<_16>, Stride<_1>>,
            // N mode (the permutation in question)
            Layout<Shape<_2,_16>, Stride<_1,_8>>,
            // K mode
            _8
        >{}
    );
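
For readers following along, here is a minimal host-side sketch of what the N-mode layout above computes when evaluated as a plain function. It assumes only CuTe's usual convention that a 1-D index is split over the shape with the leftmost mode varying fastest, and it says nothing about how TiledMMA consumes the layout. The names `n_mode_layout` and `print_n_mode_layout` are hypothetical helpers, not part of CuTe.

```cpp
#include <cassert>
#include <cstdio>

// Evaluates Layout<Shape<_2,_16>, Stride<_1,_8>> at a 1-D index in [0, 32).
// Assumption: the index splits column-major over the shape (2, 16),
// so i = idx % 2 and j = idx / 2, and the result is i*1 + j*8.
constexpr int n_mode_layout(int idx) {
    const int i = idx % 2;  // coordinate in the size-2 mode, stride 1
    const int j = idx / 2;  // coordinate in the size-16 mode, stride 8
    return i * 1 + j * 8;
}

// Prints the full index mapping for inspection.
inline void print_n_mode_layout() {
    for (int idx = 0; idx < 32; ++idx) {
        std::printf("%2d -> %3d\n", idx, n_mode_layout(idx));
    }
}
```

Under that assumption the mapping starts 0, 1, 8, 9, 16, 17 and its largest value is 121, so the image of this size-32 layout is not the set 0..31. Whether that is intended depends on how the tile is meant to be interpreted.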

Besides working out the swizzle for B (I can see visually what it should do), am I interpreting this diagram correctly: with the right warp __shfl_sync exchanges, should I be able to write 128B contiguously as 16B-per-thread stores (again, in N-major order)? Am I paying a non-obvious cost with this permutation that I might be missing?
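
As background for the 16B-per-thread question, here is a small sketch of which accumulator elements each lane owns in a single m16n8k8 atom with F32 accumulators, following the C/D fragment mapping given in the PTX documentation. It describes one atom before any tile permutation is applied. `Coord` and `accum_coord` are hypothetical names used only for illustration.

```cpp
#include <cassert>

// (row, col) position inside the 16x8 accumulator tile of one atom.
struct Coord {
    int row;
    int col;
};

// Coordinate of accumulator value i (0..3) held by the given lane (0..31),
// per the PTX fragment layout for m16n8k8 with .f32 accumulators:
//   groupID = lane / 4, threadID_in_group = lane % 4
//   row = groupID for values 0 and 1, groupID + 8 for values 2 and 3
//   col = threadID_in_group * 2 + (i & 1)
inline Coord accum_coord(int lane, int i) {
    const int group_id = lane >> 2;
    const int tid_in_group = lane & 3;
    const int row = group_id + ((i >= 2) ? 8 : 0);
    const int col = tid_in_group * 2 + (i & 1);
    return Coord{row, col};
}
```

Each lane therefore holds two adjacent N elements per row, which is 8B of F32 data. A 16B store needs four adjacent elements, so some exchange between lanes (or a different element-to-lane assignment) is required before each thread can issue a full 16B store.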

Edit: For example, does this break any of the cp.async or ldmatrix assumptions?

Thanks!

capybara-club added the "? - Needs Triage" and "question" labels on Feb 22, 2025