-
Notifications
You must be signed in to change notification settings - Fork 751
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SYCL][DOC] Add draft of sycl_ext_oneapi_async_memcpy #9439
base: sycl
Are you sure you want to change the base?
Conversation
`sycl::ext::oneapi::experimental::async_memcpy` is a free function | ||
that asynchronously copies a number of elements specified by | ||
`num_elements` of data of type `T` from the source pointer `src` to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be a good idea to align with SYCL 2020's USM copy functions here. For USM, memcpy
accepts void*
and a numBytes
, whereas copy
accepts T*
and a count
. The function you've defined here is called memcpy
but accepts T*
.
My recommendation would be to define both async_memcpy
and async_copy
.
`syclex::barrier` as an argument that can be used to wait on the | ||
completion of the memory copy. | ||
|
||
Permitted types for `T` are all scalar and vector types. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suspect that we may need some more restrictions here, or at least some non-normative notes to highlight potentially surprising behavior.
It's unsafe to use memcpy
if T
isn't trivially copyable. This should probably be listed as a precondition of memcpy_async
as well.
If T
has a non-trivial copy constructor, developers calling async_copy
would expect the copy constructor to execute. That would (probably) prevent an implementation from using any hardware-acceleration for asynchronous copies. I don't know whether it's better to limit T
to trivially copyable types, or to add a non-normative note that many implementations will probably not execute asynchronously if T
is not trivially copyable.
I'm leaning towards accepting all T
and adding a note, because asynchrony already isn't guaranteed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought the term "scalar type" did not include classes. Therefore, aren't all scalar types trivially copyable? See cppreference definition of Scalar Type. Also see the SYCL definition of Scalar data types.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should limit this to scalar and vector. One can create a class and allocate it in local memory/registers where the hardware can use such accelerated async_memcpy for such class.
I will add a precondition that T
must be a trivially copyable type.
access::decorated SrcIsDecorated, sycl::memory_scope Scope> | ||
void async_memcpy(multi_ptr<T, DestSpace, DestIsDecorated> dest, | ||
multi_ptr<T, SrcSpace, SrcIsDecorated> src, size_t numElements, | ||
syclex::barrier<Scope> bar); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
syclex::barrier
has two template arguments. Was it your intention to limit this to barriers with the default CompletionFunction
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, I will add CompletionFunction
to the arguments
access::decorated SrcIsDecorated, sycl::memory_scope Scope> | ||
void async_memcpy(multi_ptr<T, DestSpace, DestIsDecorated> dest, | ||
multi_ptr<T, SrcSpace, SrcIsDecorated> src, size_t numElements, | ||
syclex::barrier<Scope> bar); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think syclex
is defined anywhere here. Since you're in the namespace already, you can probably just remove it.
template <typename T, access::address_space DestSpace, | ||
access::decorated DestIsDecorated, access::address_space SrcSpace, | ||
access::decorated SrcIsDecorated, sycl::memory_scope Scope> | ||
void async_memcpy(multi_ptr<T, DestSpace, DestIsDecorated> dest, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why limit to multi_ptr
here? Should we allow generic T*
as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I use multi_ptr
to be able to carry around the address space info.
If USM is used, multi_ptr
can be constructed using address_space_cast
Can the implementation of async_copy
retrieve address space (local vs global) if we only pass generic T*
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It depends. T*
is allowed to match both decorated pointers (which have the address space attached) and raw pointers (which don't have the address space attached).
If a function is passed a decorated pointer, then the implementation knows the address space (just as if it was passed a multi_ptr
). If it's passed a raw pointer, then it can use address_space_cast
to determine what the address space is using a runtime check (which may introduce some overhead).
I'm hopeful that in cases where the compiler actually does know the address space, but it's represented as a raw pointer, the runtime check can be optimized away. But I don't think that is implemented right now.
`syclex::barrier` as an argument that can be used to wait on the | ||
completion of the memory copy. | ||
|
||
Permitted types for `T` are all scalar and vector types. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought the term "scalar type" did not include classes. Therefore, aren't all scalar types trivially copyable? See cppreference definition of Scalar Type. Also see the SYCL definition of Scalar data types.
access::decorated IsDecorated, sycl::memory_scope Scope> | ||
void group_async_memcpy(Group g, multi_ptr<T, Space, IsDecorated> | ||
dest, multi_ptr<T, Space, IsDecorated> src, size_t numElements, | ||
syclex::barrier<Scope> bar); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The barrier
class has a deleted copy constructor, so I think we need to pass it by reference.
// Allocate memory for and construct the barrier | ||
auto* bar = sycl::ext::oneapi::group_local_memory<wg_barrier>(it.get_group(), nthreads); | ||
|
||
async_memcpy(pdest, psrc, N, bar); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand how the barrier works here. Does the implementation call arrive
on the barrier when the copy is complete? If that is the case, it seems like the call to arrive_and_wait
below should be just wait
. If that is not how it works, then I'm confused about something here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am using the barrier defined here: https://github.com/Pennycook/llvm/blob/cc7eaf559699a759c9cde1586e3113f9c1479bda/sycl/doc/extensions/proposed/sycl_ext_oneapi_barrier.asciidoc
The WIs here call arrive
, not the implementation. Then, wait
to block until the predetermined number of work items have called arrive
. When same number of WIs call arrive
and wait
at the same point of code, token
is not needed and arrive_and_wait
can be used.
// Allocate memory for and construct the barrier | ||
auto* bar = sycl::ext::oneapi::group_local_memory<wg_barrier>(it.get_group(), nthreads); | ||
|
||
group_async_memcpy(it.get_group(), pdest, psrc, N, bar); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the idea that all work-items in the group jointly copy from psrc
to pdest
? If so, there should be a description someplace saying that all work-items must pass the same value for pdest
, psrc
, and N
. I presume all work-items should also pass the same value for bar
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In real applications you often need an interface whereby each thread is passed a different pointer:
imagine you are dealing with some complicated strided/and or with offset data that you wish to copy. All of these existing interfaces in this spec doc assume each thread is passed the same start address, and then (although as you mention doc doesn't state yet) the implication is that the impl assigns group items different addresses wrt the start address to copy, up to N items.
This is fine for certain applications, but if you want to support all real applications you at least need an additional interface (perhaps by e.g. introducing a single item/thread "group"), where each thread can be passed a specific pointer by the user.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The group variants adds convergence constraints. I will add description that same value of pdest, psrc, N, bar must be passed.
If these are different as @JackAKirk is suggesting, I think the non-group variants should be used.
This pull request is stale because it has been open 180 days with no activity. Remove stale label or comment or this will be automatically closed in 30 days. |
This pull request was closed because it has been stalled for 30 days with no activity. |
No description provided.