Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SYCL][DOC] Add draft of sycl_ext_oneapi_async_memcpy #9439

Open
wants to merge 2 commits into
base: sycl
Choose a base branch
from

Conversation

dkhaldi
Copy link
Contributor

@dkhaldi dkhaldi commented May 12, 2023

No description provided.

@dkhaldi dkhaldi requested a review from a team as a code owner May 12, 2023 17:29
@dkhaldi dkhaldi requested review from Pennycook and jbrodman May 12, 2023 17:29
Comment on lines +87 to +89
`sycl::ext::oneapi::experimental::async_memcpy` is a free function
that asynchronously copies a number of elements specified by
`num_elements` of data of type `T` from the source pointer `src` to
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be a good idea to align with SYCL 2020's USM copy functions here. For USM, memcpy accepts void* and a numBytes, whereas copy accepts T* and a count. The function you've defined here is called memcpy but accepts T*.

My recommendation would be to define both async_memcpy and async_copy.

`syclex::barrier` as an argument that can be used to wait on the
completion of the memory copy.

Permitted types for `T` are all scalar and vector types.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect that we may need some more restrictions here, or at least some non-normative notes to highlight potentially surprising behavior.

It's unsafe to use memcpy if T isn't trivially copyable. This should probably be listed as a precondition of memcpy_async as well.

If T has a non-trivial copy constructor, developers calling async_copy would expect the copy constructor to execute. That would (probably) prevent an implementation from using any hardware-acceleration for asynchronous copies. I don't know whether it's better to limit T to trivially copyable types, or to add a non-normative note that many implementations will probably not execute asynchronously if T is not trivially copyable.

I'm leaning towards accepting all T and adding a note, because asynchrony already isn't guaranteed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought the term "scalar type" did not include classes. Therefore, aren't all scalar types trivially copyable? See cppreference definition of Scalar Type. Also see the SYCL definition of Scalar data types.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should limit this to scalar and vector. One can create a class and allocate it in local memory/registers where the hardware can use such accelerated async_memcpy for such class.
I will add a precondition that T must be a trivially copyable type.

access::decorated SrcIsDecorated, sycl::memory_scope Scope>
void async_memcpy(multi_ptr<T, DestSpace, DestIsDecorated> dest,
multi_ptr<T, SrcSpace, SrcIsDecorated> src, size_t numElements,
syclex::barrier<Scope> bar);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

syclex::barrier has two template arguments. Was it your intention to limit this to barriers with the default CompletionFunction?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I will add CompletionFunction to the arguments

access::decorated SrcIsDecorated, sycl::memory_scope Scope>
void async_memcpy(multi_ptr<T, DestSpace, DestIsDecorated> dest,
multi_ptr<T, SrcSpace, SrcIsDecorated> src, size_t numElements,
syclex::barrier<Scope> bar);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think syclex is defined anywhere here. Since you're in the namespace already, you can probably just remove it.

template <typename T, access::address_space DestSpace,
access::decorated DestIsDecorated, access::address_space SrcSpace,
access::decorated SrcIsDecorated, sycl::memory_scope Scope>
void async_memcpy(multi_ptr<T, DestSpace, DestIsDecorated> dest,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why limit to multi_ptr here? Should we allow generic T* as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I use multi_ptr to be able to carry around the address space info.
If USM is used, multi_ptr can be constructed using address_space_cast
Can the implementation of async_copy retrieve address space (local vs global) if we only pass generic T*?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It depends. T* is allowed to match both decorated pointers (which have the address space attached) and raw pointers (which don't have the address space attached).

If a function is passed a decorated pointer, then the implementation knows the address space (just as if it was passed a multi_ptr). If it's passed a raw pointer, then it can use address_space_cast to determine what the address space is using a runtime check (which may introduce some overhead).

I'm hopeful that in cases where the compiler actually does know the address space, but it's represented as a raw pointer, the runtime check can be optimized away. But I don't think that is implemented right now.

`syclex::barrier` as an argument that can be used to wait on the
completion of the memory copy.

Permitted types for `T` are all scalar and vector types.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought the term "scalar type" did not include classes. Therefore, aren't all scalar types trivially copyable? See cppreference definition of Scalar Type. Also see the SYCL definition of Scalar data types.

access::decorated IsDecorated, sycl::memory_scope Scope>
void group_async_memcpy(Group g, multi_ptr<T, Space, IsDecorated>
dest, multi_ptr<T, Space, IsDecorated> src, size_t numElements,
syclex::barrier<Scope> bar);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The barrier class has a deleted copy constructor, so I think we need to pass it by reference.

// Allocate memory for and construct the barrier
auto* bar = sycl::ext::oneapi::group_local_memory<wg_barrier>(it.get_group(), nthreads);

async_memcpy(pdest, psrc, N, bar);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand how the barrier works here. Does the implementation call arrive on the barrier when the copy is complete? If that is the case, it seems like the call to arrive_and_wait below should be just wait. If that is not how it works, then I'm confused about something here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am using the barrier defined here: https://github.com/Pennycook/llvm/blob/cc7eaf559699a759c9cde1586e3113f9c1479bda/sycl/doc/extensions/proposed/sycl_ext_oneapi_barrier.asciidoc
The WIs here call arrive, not the implementation. Then, wait to block until the predetermined number of work items have called arrive. When same number of WIs call arrive and wait at the same point of code, token is not needed and arrive_and_wait can be used.

// Allocate memory for and construct the barrier
auto* bar = sycl::ext::oneapi::group_local_memory<wg_barrier>(it.get_group(), nthreads);

group_async_memcpy(it.get_group(), pdest, psrc, N, bar);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the idea that all work-items in the group jointly copy from psrc to pdest? If so, there should be a description someplace saying that all work-items must pass the same value for pdest, psrc, and N. I presume all work-items should also pass the same value for bar?

Copy link
Contributor

@JackAKirk JackAKirk Aug 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In real applications you often need an interface whereby each thread is passed a different pointer:
imagine you are dealing with some complicated strided/and or with offset data that you wish to copy. All of these existing interfaces in this spec doc assume each thread is passed the same start address, and then (although as you mention doc doesn't state yet) the implication is that the impl assigns group items different addresses wrt the start address to copy, up to N items.
This is fine for certain applications, but if you want to support all real applications you at least need an additional interface (perhaps by e.g. introducing a single item/thread "group"), where each thread can be passed a specific pointer by the user.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The group variants adds convergence constraints. I will add description that same value of pdest, psrc, N, bar must be passed.
If these are different as @JackAKirk is suggesting, I think the non-group variants should be used.

Copy link
Contributor

This pull request is stale because it has been open 180 days with no activity. Remove stale label or comment or this will be automatically closed in 30 days.

@github-actions github-actions bot added the Stale label Sep 15, 2024
Copy link
Contributor

This pull request was closed because it has been stalled for 30 days with no activity.

@github-actions github-actions bot closed this Oct 16, 2024
@dkhaldi dkhaldi reopened this Dec 10, 2024
@github-actions github-actions bot removed the Stale label Dec 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants