Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SYCL][DOC] Add draft of sycl_ext_oneapi_async_memcpy #9439

Open
wants to merge 2 commits into
base: sycl
Choose a base branch
from
Open
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
161 changes: 161 additions & 0 deletions sycl/doc/extensions/proposed/sycl_ext_oneapi_async_memcpy.asciidoc
Original file line number Diff line number Diff line change
@@ -0,0 +1,161 @@
= sycl_ext_oneapi_async_memcpy

:source-highlighter: coderay
:coderay-linenums-mode: table

// This section needs to be after the document title.
:doctype: book
:toc2:
:toc: left
:encoding: utf-8
:lang: en
:dpcpp: pass:[DPC++]

// Set the default source code type in this document to C++,
// for syntax highlighting purposes. This is needed because
// docbook uses c++ and html5 uses cpp.
:language: {basebackend@docbook:c++:cpp}


== Notice

[%hardbreaks]
Copyright (C) 2023-2023 Intel Corporation. All rights reserved.

Khronos(R) is a registered trademark and SYCL(TM) and SPIR(TM) are trademarks
of The Khronos Group Inc. OpenCL(TM) is a trademark of Apple Inc. used by
permission by Khronos.


== Contact

To report problems with this extension, please open a new issue at:

https://github.com/intel/llvm/issues


== Dependencies

This extension is written against the SYCL 2020 revision 7 specification. All
references below to the "core SYCL specification" or to section numbers in the
SYCL specification refer to that revision.

This extension also depends on the following SYCL extensions:

* link:https://github.com/intel/llvm/pull/9186/[sycl_ext_oneapi_barrier]

== Status

This is a proposed extension specification, intended to gather community
feedback. Interfaces defined in this specification may not be implemented yet
or may be in a preliminary state. The specification itself may also change in
incompatible ways before it is finalized. *Shipping software products should
not rely on APIs defined in this specification.*


== Overview

This extension defines
`sycl::ext::oneapi::experimental::async_memcpy` free function to
generalize and replace the current `sycl::async_work_group_copy`
function.

== Specification

=== Feature test macro

This extension provides a feature-test macro as described in the core SYCL
specification. An implementation supporting this extension must predefine the
macro `SYCL_EXT_ONEAPI_ASYNC_MEMCPY` to one of the values defined in the table
below. Applications can test for the existence of this macro to determine if
the implementation supports this feature, or applications can test the macro's
value to determine which of the extension's features the implementation
supports.

[%header,cols="1,5"]
|===
|Value
|Description

|1
|The APIs of this experimental extension are not versioned, so the
feature-test macro always has this value.
|===


=== `async_memcpy` function
`sycl::ext::oneapi::experimental::async_memcpy` is a free function
that asynchronously copies a number of elements specified by
`num_elements` of data of type `T` from the source pointer `src` to
Comment on lines +88 to +90
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be a good idea to align with SYCL 2020's USM copy functions here. For USM, memcpy accepts void* and a numBytes, whereas copy accepts T* and a count. The function you've defined here is called memcpy but accepts T*.

My recommendation would be to define both async_memcpy and async_copy.

destination pointer `dest`. It also takes a barrier object of type
`syclex::barrier` as an argument that can be used to wait on the
completion of the memory copy.

Permitted types for `T` are all scalar and vector types.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect that we may need some more restrictions here, or at least some non-normative notes to highlight potentially surprising behavior.

It's unsafe to use memcpy if T isn't trivially copyable. This should probably be listed as a precondition of memcpy_async as well.

If T has a non-trivial copy constructor, developers calling async_copy would expect the copy constructor to execute. That would (probably) prevent an implementation from using any hardware-acceleration for asynchronous copies. I don't know whether it's better to limit T to trivially copyable types, or to add a non-normative note that many implementations will probably not execute asynchronously if T is not trivially copyable.

I'm leaning towards accepting all T and adding a note, because asynchrony already isn't guaranteed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought the term "scalar type" did not include classes. Therefore, aren't all scalar types trivially copyable? See cppreference definition of Scalar Type. Also see the SYCL definition of Scalar data types.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should limit this to scalar and vector. One can create a class and allocate it in local memory/registers where the hardware can use such accelerated async_memcpy for such class.
I will add a precondition that T must be a trivially copyable type.


This extension provides two versions of `async_memcpy`: with and
without `Group` template parameter and argument. In the case of the
group variant, `group_async_memcpy` is issued by all the threads in
the group. This is a _group function_, as defined in Section 4.17.3
of the SYCL specification. In the case of the work-item variant,
`async_memcpy` is issued by the current work-item.

[source,c++]
----
namespace sycl::ext::oneapi::experimental {

template <typename T, access::address_space DestSpace,
access::decorated DestIsDecorated, access::address_space SrcSpace,
access::decorated SrcIsDecorated, sycl::memory_scope Scope>
void async_memcpy(multi_ptr<T, DestSpace, DestIsDecorated> dest,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why limit to multi_ptr here? Should we allow generic T* as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I use multi_ptr to be able to carry around the address space info.
If USM is used, multi_ptr can be constructed using address_space_cast
Can the implementation of async_copy retrieve address space (local vs global) if we only pass generic T*?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It depends. T* is allowed to match both decorated pointers (which have the address space attached) and raw pointers (which don't have the address space attached).

If a function is passed a decorated pointer, then the implementation knows the address space (just as if it was passed a multi_ptr). If it's passed a raw pointer, then it can use address_space_cast to determine what the address space is using a runtime check (which may introduce some overhead).

I'm hopeful that in cases where the compiler actually does know the address space, but it's represented as a raw pointer, the runtime check can be optimized away. But I don't think that is implemented right now.

multi_ptr<T, SrcSpace, SrcIsDecorated> src, size_t numElements,
syclex::barrier<Scope> bar);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

syclex::barrier has two template arguments. Was it your intention to limit this to barriers with the default CompletionFunction?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I will add CompletionFunction to the arguments

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think syclex is defined anywhere here. Since you're in the namespace already, you can probably just remove it.


template <typename Group, typename T, access::address_space Space,
access::decorated IsDecorated, sycl::memory_scope Scope>
void group_async_memcpy(Group g, multi_ptr<T, Space, IsDecorated>
dest, multi_ptr<T, Space, IsDecorated> src, size_t numElements,
syclex::barrier<Scope> bar);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The barrier class has a deleted copy constructor, so I think we need to pass it by reference.


} // namespace sycl::ext::oneapi::experimental
----

=== `async_memcpy` Example

[source,c++]
----
using wg_barrier = syclex::barrier<sycl::memory_scope::work_group>;
auto psrc = multi_ptr<T, sycl::access::address_space::global_space>(src);
auto pdest = multi_ptr<T, sycl::access::address_space::local_space>(dest);

q.parallel_for(..., [=](sycl::nd_item it) {

// Allocate memory for and construct the barrier
auto* bar = sycl::ext::oneapi::group_local_memory<wg_barrier>(it.get_group(), nthreads);

async_memcpy(pdest, psrc, N, bar);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand how the barrier works here. Does the implementation call arrive on the barrier when the copy is complete? If that is the case, it seems like the call to arrive_and_wait below should be just wait. If that is not how it works, then I'm confused about something here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am using the barrier defined here: https://github.com/Pennycook/llvm/blob/cc7eaf559699a759c9cde1586e3113f9c1479bda/sycl/doc/extensions/proposed/sycl_ext_oneapi_barrier.asciidoc
The WIs here call arrive, not the implementation. Then, wait to block until the predetermined number of work items have called arrive. When same number of WIs call arrive and wait at the same point of code, token is not needed and arrive_and_wait can be used.

// Use the barrier
bar->arrive_and_wait();

}).wait();
----

=== `group_async_memcpy` Example

[source,c++]
----
using wg_barrier = syclex::barrier<sycl::memory_scope::work_group>;
auto psrc = multi_ptr<T, sycl::access::address_space::global_space>(src);
auto pdest = multi_ptr<T, sycl::access::address_space::local_space>(dest);

q.parallel_for(..., [=](sycl::nd_item it) {

// Allocate memory for and construct the barrier
auto* bar = sycl::ext::oneapi::group_local_memory<wg_barrier>(it.get_group(), nthreads);

group_async_memcpy(it.get_group(), pdest, psrc, N, bar);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the idea that all work-items in the group jointly copy from psrc to pdest? If so, there should be a description someplace saying that all work-items must pass the same value for pdest, psrc, and N. I presume all work-items should also pass the same value for bar?

Copy link
Contributor

@JackAKirk JackAKirk Aug 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In real applications you often need an interface whereby each thread is passed a different pointer:
imagine you are dealing with some complicated strided/and or with offset data that you wish to copy. All of these existing interfaces in this spec doc assume each thread is passed the same start address, and then (although as you mention doc doesn't state yet) the implication is that the impl assigns group items different addresses wrt the start address to copy, up to N items.
This is fine for certain applications, but if you want to support all real applications you at least need an additional interface (perhaps by e.g. introducing a single item/thread "group"), where each thread can be passed a specific pointer by the user.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The group variants adds convergence constraints. I will add description that same value of pdest, psrc, N, bar must be passed.
If these are different as @JackAKirk is suggesting, I think the non-group variants should be used.

// Use the group barrier wait
group_arrive_and_wait(it.get_group(), bar);

}).wait();
----