-
Notifications
You must be signed in to change notification settings - Fork 747
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SYCL][Graph] Support for Graph Node Profiling #12592
Conversation
Implements node profiling support. Adds new pi entry points for node profiling. Updates docs. adds tests. Updates symbols.
Fixes declaration issues Changes raw pointer to weak_ptr Adds exception is node not found. Updates specs.
sycl/doc/extensions/experimental/sycl_ext_oneapi_graph.asciidoc
Outdated
Show resolved
Hide resolved
sycl/include/sycl/info/ext_oneapi_graph_node_profiling_traits.def
Outdated
Show resolved
Hide resolved
execution associated with this SYCL event. If the requested info is | ||
not available when this member function is called due to incompletion of | ||
command groups associated with the event, then the call to this member | ||
function will block until the requested info is available. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Keeping the following aspect as a different thread:
We also need to consider kernel/node fusion as an optimization strategy. In this case we might not have the exact profiling info.
@sommerlukas what are your thoughts on this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there are two possible ways to handle this:
- Return the same profiling info for all nodes that were fused. So if you use this new API with two different nodes, you would get the same information and that information would correspond to the profiling information of the fused kernel.
- Disallow to query for profiling information if fusion was applied.
I believe that profiling info for the fused kernel would be valuable for the user, so I would prefer the first option here.
On some backends, it might even be possible to get profiling information for individual parts of the fused kernel corresponding to the original nodes, but I don't think this would be portable and implementation could quickly get very tricky, so I would prefer to not include that in the extension proposal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your feedback.
I agree that it might be interesting for users to get node profiling information when using the kernel fusion feature.
But the proposed profiling API is based on the level-zero function zeCommandListAppendQueryKernelTimestamps
(https://spec.oneapi.io/level-zero/latest/core/api.html#zecommandlistappendquerykerneltimestamps), which in turn relies on the ZeEvents that were linked to the kernels when they were enqueued to their command-list
. So I don't know what happens to this ZeEvents when the kernels are fused.
Could you tell me where I can find more information on the implementation of the kernel fusion feature?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Kernel fusion happens in the SYCL RT, even before commands are submitted to the underlying backend.
So, in this case, fusion would happen before anything is passed to the LevelZero backend. Instead, a single command for the execution of the fused kernel would be passed to the LevelZero backend and inserted into the command-list.
Obtaining profiling information for the execution of the fused kernel could use the same API and the zeEvent
returned for the command corresponding to the fused kernel execution.
There is a design document with more information here: https://github.com/intel/llvm/blob/sycl/sycl/doc/design/KernelFusionJIT.md
In the current implementation, is the backend (ZE) command list only created upon call to command_graph<...>::finalize
or already before that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The command-buffers that instantiate the command-list for LevelZero backend are only created during the command_graph<...>::finalize
process.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think in that case it would be a viable implementation option to first perform fusion and only then create the command buffer inside command_graph<...>::finalize
. The command-buffer would only contain the fused kernel command (and memory commands if necessary) and it would be possible to return a zeEvent
that could be used for profiling of the fused kernel execution.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe that profiling info for the fused kernel would be valuable for the user, so I would prefer the first option here.
I agree that profiling information is useful for a fused graph as a whole, but I don't see how it's useful to attempt to provide this profiling information on a per-node basis. How can the application know what the profiling information means? Won't it depend on the implementation's ability to fuse each node?
Let's consider that an application calls ext_oneapi_get_profiling_info
on a node in a fused graph. The returned information might tell the time it took to execute that one node (if the node could not be fused), it might tell the time it took to execute several nodes in the graph (if the node was fused with some of its neighbors), or it might tell the time it took to execute the entire graph (if all the nodes in the graph were fused). With so much uncertainty, I don't see why this is a useful feature.
I think it does make sense to let the application get profiling information for the entire graph, but the application can do this already using the event
that is returned from queue::ext_oneapi_graph
.
I think ext_oneapi_get_profiling_info
should simply fail if the graph was created with fusion enabled. Per-node profiling information is not available for a fused graph.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It may also be worth noting that the SYCL 2020 wording is phrased in terms of "command groups".
Getting the profiling information for the command group in which the graph was submitted, where graph execution is the "action", is consistent with the specification wording, and is always well defined. Getting the profiling information for nodes in the graph seems less consistent to me: the mapping of nodes to command groups and "actions" is unclear.
sycl/test-e2e/Graph/Explicit/nodes_profiling_info_invalid_backend.cpp
Outdated
Show resolved
Hide resolved
# Merge pull request #1254 from Bensuo/cmdbuf-support-hip | ||
# [EXP][CMDBUF] HIP adapter support for command buffers | ||
set(UNIFIED_RUNTIME_TAG a2757b2931daa2f8d7c9dd51b0fc846be1fd49a7 ) | ||
set(UNIFIED_RUNTIME_REPO "https://github.com/bensuo/unified-runtime.git") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remember to set this back to the original before merging.
sycl/test-e2e/Graph/Explicit/nodes_profiling_info_invalid_backend.cpp
Outdated
Show resolved
Hide resolved
sycl/test-e2e/Graph/Explicit/nodes_profiling_info_invalid_backend.cpp
Outdated
Show resolved
Hide resolved
sycl/test-e2e/Graph/Explicit/nodes_profiling_info_invalid_backend.cpp
Outdated
Show resolved
Hide resolved
sycl/test-e2e/Graph/Explicit/nodes_profiling_info_invalid_backend.cpp
Outdated
Show resolved
Hide resolved
sycl/test-e2e/Graph/Explicit/nodes_profiling_info_invalid_backend.cpp
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Just remember to revert the changes related to UR repo/tag.
execution associated with this SYCL event. If the requested info is | ||
not available when this member function is called due to incompletion of | ||
graph execution associated with the event, then the call to this member | ||
function will block until the requested info is available. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When I first read this, I was confused about which event
the application would use when calling this API because there are two relevant events. If the graph is created by recording a queue, each recorded node has an event. You also get an event when you submit the graph to a queue via queue::ext_oneapi_graph
.
It took me a while to figure out that applications are expected to use ext_oneapi_get_profiling_info
on the second event (the one returned from queue::ext_oneapi_graph
). I think it would be helpful to make this more clear somehow in the description.
execution associated with this SYCL event. If the requested info is | ||
not available when this member function is called due to incompletion of | ||
graph execution associated with the event, then the call to this member | ||
function will block until the requested info is available. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@EwanC asked in #12838 how ext_oneapi_get_profiling_info
should relate to the new submit_profiling_tag
API.
I think the graph extension could add a new member function to command_graph
:
node add_profiling_tag();
This function would fail unless the graph's device had the aspect ext_oneapi_queue_profiling_tag
. Once the application uses this API to add a profiling tag to the graph, it can get the profiling information later via ext_oneapi_get_profiling_info
.
execution associated with this SYCL event. If the requested info is | ||
not available when this member function is called due to incompletion of | ||
command groups associated with the event, then the call to this member | ||
function will block until the requested info is available. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe that profiling info for the fused kernel would be valuable for the user, so I would prefer the first option here.
I agree that profiling information is useful for a fused graph as a whole, but I don't see how it's useful to attempt to provide this profiling information on a per-node basis. How can the application know what the profiling information means? Won't it depend on the implementation's ability to fuse each node?
Let's consider that an application calls ext_oneapi_get_profiling_info
on a node in a fused graph. The returned information might tell the time it took to execute that one node (if the node could not be fused), it might tell the time it took to execute several nodes in the graph (if the node was fused with some of its neighbors), or it might tell the time it took to execute the entire graph (if all the nodes in the graph were fused). With so much uncertainty, I don't see why this is a useful feature.
I think it does make sense to let the application get profiling information for the entire graph, but the application can do this already using the event
that is returned from queue::ext_oneapi_graph
.
I think ext_oneapi_get_profiling_info
should simply fail if the graph was created with fusion enabled. Per-node profiling information is not available for a fused graph.
@@ -1881,8 +1919,10 @@ if used in application code. | |||
|
|||
. Using reductions in a graph node. | |||
. Using sycl streams in a graph node. | |||
. Profiling an event returned from graph submission with | |||
`event::get_profiling_info()`. | |||
. Profiling information is not available for graphs that contain host-task nodes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This restriction seems like another good reason not to expose per-node profiling information. I don't understand why the presence of a host task would impact the runtime's ability to reason about a different device kernel, and I think many users would struggle to understand this too.
It's also unclear to me whether the presence of a host task disables all profiling. What is the intent here? Is it still possible to use the event returned by the graph submission to reason about how long the entire graph took to execute?
. Profiling a node from an event returned from graph submission with | ||
`event::ext_oneapi_get_profiling_info(ext::node)` is only available for | ||
the level-zero backend. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, I think this is surprising. Removing the per-node profiling completely would fix this.
This pull request is stale because it has been open 180 days with no activity. Remove stale label or comment or this will be automatically closed in 30 days. |
This pull request was closed because it has been stalled for 30 days with no activity. |
This PR: