Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SYCL] Support shuffle algorithms for non-uniform groups #12705

Merged

Conversation

steffenlarsen
Copy link
Contributor

This commit makes the non-uniform group classes support the shift and select algorithms.

This commit makes the non-uniform group classes support the shift and
select algorithms.

Signed-off-by: Larsen, Steffen <steffen.larsen@intel.com>
@steffenlarsen steffenlarsen requested a review from a team as a code owner February 13, 2024 15:57
Signed-off-by: Larsen, Steffen <steffen.larsen@intel.com>
Copy link
Contributor

@Pennycook Pennycook left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for working on this, and for tidying up some of the SPIR-V wrappers!

sycl/include/sycl/detail/spirv.hpp Outdated Show resolved Hide resolved
sycl/include/sycl/detail/spirv.hpp Outdated Show resolved Hide resolved
sycl/include/sycl/detail/spirv.hpp Outdated Show resolved Hide resolved
sycl/test-e2e/NonUniformGroups/ballot_group_algorithms.cpp Outdated Show resolved Hide resolved
Signed-off-by: Larsen, Steffen <steffen.larsen@intel.com>
Signed-off-by: Larsen, Steffen <steffen.larsen@intel.com>
Copy link
Contributor

@Pennycook Pennycook left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes look good to me, modulo the test failures.

Signed-off-by: Larsen, Steffen <steffen.larsen@intel.com>
Signed-off-by: Larsen, Steffen <steffen.larsen@intel.com>
if constexpr (ext::oneapi::experimental::is_user_constructed_group_v<
GroupT>) {
return __nvvm_shfl_sync_up_i32(detail::ExtractMask(detail::GetMask(g))[0],
x, delta, 0);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@JackAKirk - Can you think of a reason why this would fail for the new test cases for ballot_group only?

Copy link
Contributor

@JackAKirk JackAKirk Feb 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is failing because the delta does not relate to the true delta only considering set bits in the mask which is what is desired. I think could also fail for opportunistic_group, but guess this depends on the test/execution.

I think for these cases (opportunistic_group and ballot_group, but you could also use this for all non-uniform groups) if you replace

    return __nvvm_shfl_sync_up_i32(detail::ExtractMask(detail::GetMask(g))[0],
                                   x, delta, 0);

with a call to

non_uniform_shfl

defined here:

https://github.com/intel/llvm/blob/sycl/sycl/include/sycl/ext/oneapi/experimental/cuda/non_uniform_algorithms.hpp#L151

as e.g.

unsigned localSetBit = g.get_local_id()[0] + 1;
int unfoldedSrcSetBit = localSetBit + delta;
auto MemberMask = detail::ExtractMask(detail::GetMask(g))[0];
return non_uniform_shfl(g, MemberMask, x,
                                __nvvm_fns(MemberMask, 0, unfoldedSrcSetBit));

Then it should work.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually thinking about it more, in this case I think the sign of delta might have to be changed for ballot_group/opportunistic group, when using that non_uniform_shfl function.
e.g. in my above message you might have to replace

int unfoldedSrcSetBit = localSetBit + delta;

with:

int unfoldedSrcSetBit = localSetBit - delta;

In any case I think one of those should work. I can't remember the docs offhand for which version is correct (basically whether the semantic is send to idx or receive from). I think it is the - delta version!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a ton, Jack! Let's give it a try! 😄

Copy link
Contributor Author

@steffenlarsen steffenlarsen Feb 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sadly it doesn't seem that it changed anything, but it also doesn't seem like we call non_uniform_shfl here. I wonder if it is immediately usable or if it has been specialized for the use-case with reduce and scan. Either way, I do not have an easy way of testing fixes, so would you be okay with me disabling the subset of checks for CUDA and opening an issue so you can have a look when possible? (Tag @npmiller )

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, no worries. I think I know how to fix it.

Signed-off-by: Larsen, Steffen <steffen.larsen@intel.com>
@steffenlarsen steffenlarsen requested a review from a team as a code owner February 15, 2024 18:34
Signed-off-by: Larsen, Steffen <steffen.larsen@intel.com>
@steffenlarsen
Copy link
Contributor Author

@intel/llvm-reviewers-runtime @sergey-semenov @uditagarwal97 - Friendly ping.

Copy link
Contributor

@uditagarwal97 uditagarwal97 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Within the bounds of my limited knowledge in this part of the code base, the changes LGTM!

sycl/include/sycl/detail/spirv.hpp Show resolved Hide resolved
@steffenlarsen
Copy link
Contributor Author

@intel/llvm-reviewers-runtime | @sergey-semenov - Friendly ping.

@steffenlarsen steffenlarsen merged commit 398aa20 into intel:sycl Mar 15, 2024
12 checks passed
steffenlarsen pushed a commit that referenced this pull request Apr 9, 2024
This follows on from discussion of
#12705 (comment) to
impl/fix non-uniform group shuffles on cuda.

- Non-uniform group algorithm impls fixes for permute/left/right
- Generalize group shuffles to support double/half/long/short correctly
for both uniform and non-uniform groups
- Make fixed_size_group test fail if group member "local id" mapping not
correct or removed.
- Update ballot_group_algorithms.cpp to test previously failing cases on
cuda backend.

Shuffle impls in ::detail match those in syclomatic for masked shuffle
builtins (which don't exist in oneapi outside syclomatic).

---------

Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants