[SYCL][DOC] Add draft of sycl_ext_oneapi_async_memcpy #9439
base: sycl
Changes from 1 commit
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,161 @@ | ||
= sycl_ext_oneapi_async_memcpy | ||
|
||
:source-highlighter: coderay | ||
:coderay-linenums-mode: table | ||
|
||
// This section needs to be after the document title. | ||
:doctype: book | ||
:toc2: | ||
:toc: left | ||
:encoding: utf-8 | ||
:lang: en | ||
:dpcpp: pass:[DPC++] | ||
|
||
// Set the default source code type in this document to C++, | ||
// for syntax highlighting purposes. This is needed because | ||
// docbook uses c++ and html5 uses cpp. | ||
:language: {basebackend@docbook:c++:cpp} | ||
|
||
|
||
== Notice | ||
|
||
[%hardbreaks] | ||
Copyright (C) 2023-2023 Intel Corporation. All rights reserved. | ||
|
||
Khronos(R) is a registered trademark and SYCL(TM) and SPIR(TM) are trademarks | ||
of The Khronos Group Inc. OpenCL(TM) is a trademark of Apple Inc. used by | ||
permission by Khronos. | ||
|
||
|
||
== Contact | ||
|
||
To report problems with this extension, please open a new issue at: | ||
|
||
https://github.com/intel/llvm/issues | ||
|
||
|
||
== Dependencies | ||
|
||
This extension is written against the SYCL 2020 revision 7 specification. All | ||
references below to the "core SYCL specification" or to section numbers in the | ||
SYCL specification refer to that revision. | ||
|
||
This extension also depends on the following SYCL extensions: | ||
|
||
* link:https://github.com/intel/llvm/pull/9186/[sycl_ext_oneapi_barrier] | ||
|
||
== Status | ||
|
||
This is a proposed extension specification, intended to gather community | ||
feedback. Interfaces defined in this specification may not be implemented yet | ||
or may be in a preliminary state. The specification itself may also change in | ||
incompatible ways before it is finalized. *Shipping software products should | ||
not rely on APIs defined in this specification.* | ||
|
||
|
||
== Overview | ||
|
||
This extension defines the | ||
`sycl::ext::oneapi::experimental::async_memcpy` free function to | ||
generalize and replace the current `async_work_group_copy` member | ||
functions of `sycl::group` and `sycl::nd_item`. | ||
|
||
== Specification | ||
|
||
=== Feature test macro | ||
|
||
This extension provides a feature-test macro as described in the core SYCL | ||
specification. An implementation supporting this extension must predefine the | ||
macro `SYCL_EXT_ONEAPI_ASYNC_MEMCPY` to one of the values defined in the table | ||
below. Applications can test for the existence of this macro to determine if | ||
the implementation supports this feature, or applications can test the macro's | ||
value to determine which of the extension's features the implementation | ||
supports. | ||
|
||
[%header,cols="1,5"] | ||
|=== | ||
|Value | ||
|Description | ||
|
||
|1 | ||
|The APIs of this experimental extension are not versioned, so the | ||
feature-test macro always has this value. | ||
|=== | ||
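As a rough illustration of the feature-test paragraph above, the following sketch shows how an application might guard use of the extension. The names `async_memcpy_version` and `has_async_memcpy` are hypothetical helpers invented for this example, and the value `0` is an assumed sentinel for "not available"; a real program would include `<sycl/sycl.hpp>` before testing the macro.

```cpp
// SYCL_EXT_ONEAPI_ASYNC_MEMCPY is predefined only by implementations that
// support the extension. In a real program, include <sycl/sycl.hpp> first.
#if defined(SYCL_EXT_ONEAPI_ASYNC_MEMCPY)
constexpr int async_memcpy_version = SYCL_EXT_ONEAPI_ASYNC_MEMCPY;
#else
// Assumed sentinel (not part of the extension): macro is not defined.
constexpr int async_memcpy_version = 0;
#endif

// Hypothetical helper: true when the extension's APIs may be used.
constexpr bool has_async_memcpy() { return async_memcpy_version >= 1; }
```

Code paths that call `async_memcpy` could then be selected with `if constexpr (has_async_memcpy())` or an equivalent `#if` block, with a fallback path otherwise.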
|
||
|
||
=== `async_memcpy` function | ||
`sycl::ext::oneapi::experimental::async_memcpy` is a free function | ||
that asynchronously copies `numElements` elements of type `T` from | ||
the source pointer `src` to the destination pointer `dest`. It also | ||
takes a barrier object of type `syclex::barrier` as an argument, | ||
which can be used to wait for the memory copy to complete. | ||
|
||
Permitted types for `T` are all scalar and vector types. | ||
Review comment: I suspect that we may need some more restrictions here, or at least some non-normative notes to highlight potentially surprising behavior. It's unsafe to use If I'm leaning towards accepting all
Review comment: I thought the term "scalar type" did not include classes. Therefore, aren't all scalar types trivially copyable? See cppreference definition of Scalar Type. Also see the SYCL definition of Scalar data types.
Review comment: I don't think we should limit this to scalar and vector. One can create a class and allocate it in local memory/registers where the hardware can use such accelerated async_memcpy for such class. |
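The distinction the reviewers are discussing can be checked with standard type traits. The sketch below is illustrative only: `Pixel`, `Tracked`, and `is_async_copyable_v` are hypothetical names, and constraining `T` to trivially copyable types is one possible resolution of this thread, not something the draft specifies.

```cpp
#include <type_traits>

// Hypothetical aggregate: not a scalar type, but trivially copyable.
struct Pixel { float r, g, b, a; };

// Hypothetical class with a user-provided copy constructor:
// not trivially copyable, so a raw byte copy would skip its copy logic.
struct Tracked {
  int value;
  Tracked(const Tracked& other) : value(other.value + 1) {}
};

// Scalar types (arithmetic types, pointers, enums) are trivially copyable.
static_assert(std::is_scalar_v<int> && std::is_trivially_copyable_v<int>);
static_assert(std::is_scalar_v<float*> && std::is_trivially_copyable_v<float*>);

// A class can be trivially copyable without being scalar.
static_assert(!std::is_scalar_v<Pixel> && std::is_trivially_copyable_v<Pixel>);
static_assert(!std::is_trivially_copyable_v<Tracked>);

// Assumed constraint (not in the draft): accept any trivially copyable T.
template <typename T>
constexpr bool is_async_copyable_v = std::is_trivially_copyable_v<T>;
```

Under this assumed constraint, `Pixel` would be accepted and `Tracked` rejected at compile time.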
||
|
||
This extension provides two versions of `async_memcpy`: with and | ||
without a `Group` template parameter and argument. In the case of the | ||
group variant, `group_async_memcpy` is issued by all the work-items in | ||
the group. This is a _group function_, as defined in Section 4.17.3 | ||
of the SYCL specification. In the case of the work-item variant, | ||
`async_memcpy` is issued by the current work-item. | ||
|
||
[source,c++] | ||
---- | ||
namespace sycl::ext::oneapi::experimental { | ||
|
||
template <typename T, access::address_space DestSpace, | ||
access::decorated DestIsDecorated, access::address_space SrcSpace, | ||
access::decorated SrcIsDecorated, sycl::memory_scope Scope> | ||
void async_memcpy(multi_ptr<T, DestSpace, DestIsDecorated> dest, | ||
Review comment: Why limit to
Review comment: I use
Review comment: It depends. If a function is passed a decorated pointer, then the implementation knows the address space (just as if it was passed a I'm hopeful that in cases where the compiler actually does know the address space, but it's represented as a raw pointer, the runtime check can be optimized away. But I don't think that is implemented right now. |
||
multi_ptr<T, SrcSpace, SrcIsDecorated> src, size_t numElements, | ||
syclex::barrier<Scope> bar); | ||
Review comment: No, I will add
Review comment: I don't think |
||
|
||
template <typename Group, typename T, access::address_space Space, | ||
access::decorated IsDecorated, sycl::memory_scope Scope> | ||
void group_async_memcpy(Group g, multi_ptr<T, Space, IsDecorated> | ||
dest, multi_ptr<T, Space, IsDecorated> src, size_t numElements, | ||
syclex::barrier<Scope> bar); | ||
Review comment: The |
||
|
||
} // namespace sycl::ext::oneapi::experimental | ||
---- | ||
|
||
=== `async_memcpy` Example | ||
|
||
[source,c++] | ||
---- | ||
using wg_barrier = syclex::barrier<sycl::memory_scope::work_group>; | ||
auto psrc = multi_ptr<T, sycl::access::address_space::global_space>(src); | ||
auto pdest = multi_ptr<T, sycl::access::address_space::local_space>(dest); | ||
|
||
q.parallel_for(..., [=](sycl::nd_item it) { | ||
|
||
// Allocate memory for and construct the barrier | ||
auto* bar = sycl::ext::oneapi::group_local_memory<wg_barrier>(it.get_group(), nthreads); | ||
|
||
async_memcpy(pdest, psrc, N, bar); | ||
Review comment: I don't understand how the barrier works here. Does the implementation call
Review comment: I am using the barrier defined here: https://github.com/Pennycook/llvm/blob/cc7eaf559699a759c9cde1586e3113f9c1479bda/sycl/doc/extensions/proposed/sycl_ext_oneapi_barrier.asciidoc |
||
// Use the barrier | ||
bar->arrive_and_wait(); | ||
|
||
}).wait(); | ||
---- | ||
|
||
=== `group_async_memcpy` Example | ||
|
||
[source,c++] | ||
---- | ||
using wg_barrier = syclex::barrier<sycl::memory_scope::work_group>; | ||
auto psrc = multi_ptr<T, sycl::access::address_space::global_space>(src); | ||
auto pdest = multi_ptr<T, sycl::access::address_space::local_space>(dest); | ||
|
||
q.parallel_for(..., [=](sycl::nd_item it) { | ||
|
||
// Allocate memory for and construct the barrier | ||
auto* bar = sycl::ext::oneapi::group_local_memory<wg_barrier>(it.get_group(), nthreads); | ||
|
||
group_async_memcpy(it.get_group(), pdest, psrc, N, bar); | ||
Review comment: Is the idea that all work-items in the group jointly copy from
Review comment: In real applications you often need an interface whereby each thread is passed a different pointer:
Review comment: The group variant adds convergence constraints. I will add a description stating that the same values of pdest, psrc, N, and bar must be passed. |
||
// Use the group barrier wait | ||
group_arrive_and_wait(it.get_group(), bar); | ||
|
||
}).wait(); | ||
---- |
Review comment: I think it would be a good idea to align with SYCL 2020's USM copy functions here. For USM, `memcpy` accepts `void*` and a `numBytes`, whereas `copy` accepts `T*` and a `count`. The function you've defined here is called `memcpy` but accepts `T*`. My recommendation would be to define both `async_memcpy` and `async_copy`.
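To make the naming distinction in this comment concrete, here is a minimal host-side sketch of the two conventions. The functions `memcpy_bytes` and `copy_elements` are hypothetical stand-ins written for illustration. They are synchronous, take no barrier, and are not the extension's API. They only show how a byte count and an element count relate.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <type_traits>

// Hypothetical stand-in for a byte-oriented interface (memcpy-style):
// untyped pointers plus a size in bytes.
inline void memcpy_bytes(void* dest, const void* src, std::size_t numBytes) {
  assert(dest != nullptr && src != nullptr);
  std::memcpy(dest, src, numBytes);
}

// Hypothetical stand-in for an element-oriented interface (copy-style):
// typed pointers plus a count of elements.
template <typename T>
void copy_elements(T* dest, const T* src, std::size_t count) {
  static_assert(std::is_trivially_copyable_v<T>,
                "a byte-wise copy is only valid for trivially copyable types");
  memcpy_bytes(dest, src, count * sizeof(T));
}
```

With these definitions, `copy_elements(dst, src, 4)` on `int` arrays copies `4 * sizeof(int)` bytes, which is the same work as `memcpy_bytes(dst, src, 4 * sizeof(int))`.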