-
Notifications
You must be signed in to change notification settings - Fork 747
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SYCL][Graph] Support for Graph Node Profiling #12592
Changes from 29 commits
c032cf9
ae32544
f63acca
c189890
53b76bb
421f303
99c7bf9
a7b839b
ed0bc28
fbbc822
7ff2327
96812b9
f589d9b
262b44a
46bce9c
30ab2fe
df99396
7d86fb2
190136b
7d21aba
19102ef
81efcc4
1edcd01
f52745e
78d9d22
69d8dbb
74c5295
2af6f28
7b8f9bb
d58e1a5
f453a09
1c6aa95
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -1200,6 +1200,44 @@ Exceptions: | |
|
||
|=== | ||
|
||
==== New Event Member Functions | ||
|
||
Table {counter: tableNumber}. Additional member functions of the `sycl::event` class. | ||
[cols="2a,a"] | ||
|=== | ||
|Member function|Description | ||
|
||
| | ||
[source,c++] | ||
---- | ||
template <typename Param> | ||
typename Param::return_type | ||
event::ext_oneapi_get_profiling_info(node Node) const; | ||
---- | ||
|
||
| Queries the profiling information of a SYCL Graph node for the graph | ||
execution associated with this SYCL event. If the requested info is | ||
not available when this member function is called due to incompletion of | ||
graph execution associated with the event, then the call to this member | ||
function will block until the requested info is available. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. When I first read this, I was confused about which It took me a while to figure out that applications are expected to use There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @EwanC asked in #12838 how I think the graph extension could add a new member function to
This function would fail unless the graph's device had the aspect |
||
|
||
Parameters: | ||
|
||
* `Node` - Node object for which the profiling information is being queried. | ||
|
||
Exceptions: | ||
|
||
* Throws synchronously with error code `invalid` if this SYCL event is not | ||
associated with a graph execution. | ||
* Throws synchronously with error code `invalid` if the queue on which | ||
the graph was submitted was not constructed with | ||
the `property::queue::enable_profiling` property. | ||
* Throws synchronously with error code `invalid` if `Node` is not associated | ||
with the graph exectution represented by this event. | ||
|
||
|=== | ||
|
||
|
||
=== Thread Safety | ||
|
||
The new functions in this extension are thread-safe, the same as member | ||
|
@@ -1899,8 +1937,10 @@ if used in application code. | |
|
||
. Using reductions in a graph node. | ||
. Using sycl streams in a graph node. | ||
. Profiling an event returned from graph submission with | ||
`event::get_profiling_info()`. | ||
. Profiling information is not available for graphs that contain host-task nodes. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This restriction seems like another good reason not to expose per-node profiling information. I don't understand why the presence of a host task would impact the runtime's ability to reason about a different device kernel, and I think many users would struggle to understand this too. It's also unclear to me whether the presence of a host task disables all profiling. What is the intent here? Is it still possible to use the event returned by the graph submission to reason about how long the entire graph took to execute? |
||
. Profiling a node from an event returned from graph submission with | ||
`event::ext_oneapi_get_profiling_info(ext::node)` is only available for | ||
the level-zero backend. | ||
Comment on lines
+1923
to
+1925
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Again, I think this is surprising. Removing the per-node profiling completely would fix this. |
||
. Synchronization between multiple executions of the same command-buffer | ||
must be handled in the host for level-zero backend, which may involve | ||
extra latency for subsequent submissions. | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
__SYCL_PARAM_TRAITS_SPEC_PARAMT(event_profiling, command_submit, ext::oneapi::experimental::node, uint64_t, PI_PROFILING_INFO_COMMAND_SUBMIT) | ||
__SYCL_PARAM_TRAITS_SPEC_PARAMT(event_profiling, command_start, ext::oneapi::experimental::node, uint64_t, PI_PROFILING_INFO_COMMAND_START) | ||
__SYCL_PARAM_TRAITS_SPEC_PARAMT(event_profiling, command_end, ext::oneapi::experimental::node, uint64_t, PI_PROFILING_INFO_COMMAND_END) |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -56,15 +56,8 @@ endif() | |
if(SYCL_PI_UR_USE_FETCH_CONTENT) | ||
include(FetchContent) | ||
|
||
set(UNIFIED_RUNTIME_REPO "https://github.com/oneapi-src/unified-runtime.git") | ||
|
||
# commit a2757b2931daa2f8d7c9dd51b0fc846be1fd49a7 | ||
# Merge: 9b936b5 + f78d369 | ||
# Author: Kenneth Benzie (Benie) <k.benzie@codeplay.com> | ||
# Date: Tue Feb 27 11:34:58 2024 +0000 | ||
# Merge pull request #1254 from Bensuo/cmdbuf-support-hip | ||
# [EXP][CMDBUF] HIP adapter support for command buffers | ||
set(UNIFIED_RUNTIME_TAG a2757b2931daa2f8d7c9dd51b0fc846be1fd49a7 ) | ||
set(UNIFIED_RUNTIME_REPO "https://github.com/bensuo/unified-runtime.git") | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Remember to set this back to the original before merging. |
||
set(UNIFIED_RUNTIME_TAG cmdbuf-profiling-sync-point) | ||
|
||
if(SYCL_PI_UR_OVERRIDE_FETCH_CONTENT_REPO) | ||
set(UNIFIED_RUNTIME_REPO "${SYCL_PI_UR_OVERRIDE_FETCH_CONTENT_REPO}") | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Keeping the following aspect as a different thread:
We also need to consider kernel/node fusion as an optimization strategy. In this case we might not have the exact profiling info.
@sommerlukas what are your thoughts on this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there are two possible ways to handle this:
I believe that profiling info for the fused kernel would be valuable for the user, so I would prefer the first option here.
On some backends, it might even be possible to get profiling information for individual parts of the fused kernel corresponding to the original nodes, but I don't think this would be portable and implementation could quickly get very tricky, so I would prefer to not include that in the extension proposal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your feedback.
I agree that it might be interesting for users to get node profiling information when using the kernel fusion feature.
But the proposed profiling API is based on the level-zero function
zeCommandListAppendQueryKernelTimestamps
(https://spec.oneapi.io/level-zero/latest/core/api.html#zecommandlistappendquerykerneltimestamps), which in turn relies on the ZeEvents that were linked to the kernels when they were enqueued to theircommand-list
. So I don't know what happens to this ZeEvents when the kernels are fused.Could you tell me where I can find more information on the implementation of the kernel fusion feature?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Kernel fusion happens in the SYCL RT, even before commands are submitted to the underlying backend.
So, in this case, fusion would happen before anything is passed to the LevelZero backend. Instead, a single command for the execution of the fused kernel would be passed to the LevelZero backend and inserted into the command-list.
Obtaining profiling information for the execution of the fused kernel could use the same API and the
zeEvent
returned for the command corresponding to the fused kernel execution.There is a design document with more information here: https://github.com/intel/llvm/blob/sycl/sycl/doc/design/KernelFusionJIT.md
In the current implementation, is the backend (ZE) command list only created upon call to
command_graph<...>::finalize
or already before that?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The command-buffers that instantiate the command-list for LevelZero backend are only created during the
command_graph<...>::finalize
process.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think in that case it would be a viable implementation option to first perform fusion and only then create the command buffer inside
command_graph<...>::finalize
. The command-buffer would only contain the fused kernel command (and memory commands if necessary) and it would be possible to return azeEvent
that could be used for profiling of the fused kernel execution.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that profiling information is useful for a fused graph as a whole, but I don't see how it's useful to attempt to provide this profiling information on a per-node basis. How can the application know what the profiling information means? Won't it depend on the implementation's ability to fuse each node?
Let's consider that an application calls
ext_oneapi_get_profiling_info
on a node in a fused graph. The returned information might tell the time it took to execute that one node (if the node could not be fused), it might tell the time it took to execute several nodes in the graph (if the node was fused with some of its neighbors), or it might tell the time it took to execute the entire graph (if all the nodes in the graph were fused). With so much uncertainty, I don't see why this is a useful feature.I think it does make sense to let the application get profiling information for the entire graph, but the application can do this already using the
event
that is returned fromqueue::ext_oneapi_graph
.I think
ext_oneapi_get_profiling_info
should simply fail if the graph was created with fusion enabled. Per-node profiling information is not available for a fused graph.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It may also be worth noting that the SYCL 2020 wording is phrased in terms of "command groups".
Getting the profiling information for the command group in which the graph was submitted, where graph execution is the "action", is consistent with the specification wording, and is always well defined. Getting the profiling information for nodes in the graph seems less consistent to me: the mapping of nodes to command groups and "actions" is unclear.