Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SYCL][Graph] Support for Graph Node Profiling #12592

Closed
wants to merge 32 commits into from

Conversation

mfrancepillois
Copy link
Contributor

This PR:

  • adds a new Sycl API to query the profiling information of a node ext_oneapi_get_profiling_info(ext::node Node).
  • Implements node profiling support.
  • Throws an exception if the profiling is queried for partitioned graphs (i.e. graph with an host-task)
  • Adds new pi entry points for node profiling.
  • Updates specs.
  • Adds tests.
  • Updates symbols.

execution associated with this SYCL event. If the requested info is
not available when this member function is called due to incompletion of
command groups associated with the event, then the call to this member
function will block until the requested info is available.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keeping the following aspect as a different thread:
We also need to consider kernel/node fusion as an optimization strategy. In this case we might not have the exact profiling info.
@sommerlukas what are your thoughts on this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there are two possible ways to handle this:

  • Return the same profiling info for all nodes that were fused. So if you use this new API with two different nodes, you would get the same information and that information would correspond to the profiling information of the fused kernel.
  • Disallow to query for profiling information if fusion was applied.

I believe that profiling info for the fused kernel would be valuable for the user, so I would prefer the first option here.

On some backends, it might even be possible to get profiling information for individual parts of the fused kernel corresponding to the original nodes, but I don't think this would be portable and implementation could quickly get very tricky, so I would prefer to not include that in the extension proposal.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your feedback.
I agree that it might be interesting for users to get node profiling information when using the kernel fusion feature.
But the proposed profiling API is based on the level-zero function zeCommandListAppendQueryKernelTimestamps (https://spec.oneapi.io/level-zero/latest/core/api.html#zecommandlistappendquerykerneltimestamps), which in turn relies on the ZeEvents that were linked to the kernels when they were enqueued to their command-list. So I don't know what happens to this ZeEvents when the kernels are fused.
Could you tell me where I can find more information on the implementation of the kernel fusion feature?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kernel fusion happens in the SYCL RT, even before commands are submitted to the underlying backend.

So, in this case, fusion would happen before anything is passed to the LevelZero backend. Instead, a single command for the execution of the fused kernel would be passed to the LevelZero backend and inserted into the command-list.

Obtaining profiling information for the execution of the fused kernel could use the same API and the zeEvent returned for the command corresponding to the fused kernel execution.

There is a design document with more information here: https://github.com/intel/llvm/blob/sycl/sycl/doc/design/KernelFusionJIT.md

In the current implementation, is the backend (ZE) command list only created upon call to command_graph<...>::finalize or already before that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The command-buffers that instantiate the command-list for LevelZero backend are only created during the command_graph<...>::finalize process.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think in that case it would be a viable implementation option to first perform fusion and only then create the command buffer inside command_graph<...>::finalize. The command-buffer would only contain the fused kernel command (and memory commands if necessary) and it would be possible to return a zeEvent that could be used for profiling of the fused kernel execution.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe that profiling info for the fused kernel would be valuable for the user, so I would prefer the first option here.

I agree that profiling information is useful for a fused graph as a whole, but I don't see how it's useful to attempt to provide this profiling information on a per-node basis. How can the application know what the profiling information means? Won't it depend on the implementation's ability to fuse each node?

Let's consider that an application calls ext_oneapi_get_profiling_info on a node in a fused graph. The returned information might tell the time it took to execute that one node (if the node could not be fused), it might tell the time it took to execute several nodes in the graph (if the node was fused with some of its neighbors), or it might tell the time it took to execute the entire graph (if all the nodes in the graph were fused). With so much uncertainty, I don't see why this is a useful feature.

I think it does make sense to let the application get profiling information for the entire graph, but the application can do this already using the event that is returned from queue::ext_oneapi_graph.

I think ext_oneapi_get_profiling_info should simply fail if the graph was created with fusion enabled. Per-node profiling information is not available for a fused graph.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may also be worth noting that the SYCL 2020 wording is phrased in terms of "command groups".

Getting the profiling information for the command group in which the graph was submitted, where graph execution is the "action", is consistent with the specification wording, and is always well defined. Getting the profiling information for nodes in the graph seems less consistent to me: the mapping of nodes to command groups and "actions" is unclear.

sycl/test-e2e/Graph/Explicit/invalid_profiling.cpp Outdated Show resolved Hide resolved
sycl/test-e2e/Graph/Explicit/nodes_profiling_info.cpp Outdated Show resolved Hide resolved
sycl/include/sycl/detail/pi.h Outdated Show resolved Hide resolved
sycl/include/sycl/detail/pi.h Outdated Show resolved Hide resolved
# Merge pull request #1254 from Bensuo/cmdbuf-support-hip
# [EXP][CMDBUF] HIP adapter support for command buffers
set(UNIFIED_RUNTIME_TAG a2757b2931daa2f8d7c9dd51b0fc846be1fd49a7 )
set(UNIFIED_RUNTIME_REPO "https://github.com/bensuo/unified-runtime.git")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remember to set this back to the original before merging.

sycl/source/detail/event_info.hpp Outdated Show resolved Hide resolved
sycl/test-e2e/Graph/Explicit/invalid_profiling.cpp Outdated Show resolved Hide resolved
sycl/test-e2e/Graph/Explicit/invalid_profiling.cpp Outdated Show resolved Hide resolved
sycl/source/detail/event_impl.cpp Outdated Show resolved Hide resolved
sycl/source/detail/event_info.hpp Outdated Show resolved Hide resolved
sycl/source/detail/graph_impl.hpp Outdated Show resolved Hide resolved
Copy link
Contributor

@maarquitos14 maarquitos14 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Just remember to revert the changes related to UR repo/tag.

execution associated with this SYCL event. If the requested info is
not available when this member function is called due to incompletion of
graph execution associated with the event, then the call to this member
function will block until the requested info is available.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I first read this, I was confused about which event the application would use when calling this API because there are two relevant events. If the graph is created by recording a queue, each recorded node has an event. You also get an event when you submit the graph to a queue via queue::ext_oneapi_graph.

It took me a while to figure out that applications are expected to use ext_oneapi_get_profiling_info on the second event (the one returned from queue::ext_oneapi_graph). I think it would be helpful to make this more clear somehow in the description.

execution associated with this SYCL event. If the requested info is
not available when this member function is called due to incompletion of
graph execution associated with the event, then the call to this member
function will block until the requested info is available.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@EwanC asked in #12838 how ext_oneapi_get_profiling_info should relate to the new submit_profiling_tag API.

I think the graph extension could add a new member function to command_graph:

node add_profiling_tag();

This function would fail unless the graph's device had the aspect ext_oneapi_queue_profiling_tag. Once the application uses this API to add a profiling tag to the graph, it can get the profiling information later via ext_oneapi_get_profiling_info.

execution associated with this SYCL event. If the requested info is
not available when this member function is called due to incompletion of
command groups associated with the event, then the call to this member
function will block until the requested info is available.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe that profiling info for the fused kernel would be valuable for the user, so I would prefer the first option here.

I agree that profiling information is useful for a fused graph as a whole, but I don't see how it's useful to attempt to provide this profiling information on a per-node basis. How can the application know what the profiling information means? Won't it depend on the implementation's ability to fuse each node?

Let's consider that an application calls ext_oneapi_get_profiling_info on a node in a fused graph. The returned information might tell the time it took to execute that one node (if the node could not be fused), it might tell the time it took to execute several nodes in the graph (if the node was fused with some of its neighbors), or it might tell the time it took to execute the entire graph (if all the nodes in the graph were fused). With so much uncertainty, I don't see why this is a useful feature.

I think it does make sense to let the application get profiling information for the entire graph, but the application can do this already using the event that is returned from queue::ext_oneapi_graph.

I think ext_oneapi_get_profiling_info should simply fail if the graph was created with fusion enabled. Per-node profiling information is not available for a fused graph.

@@ -1881,8 +1919,10 @@ if used in application code.

. Using reductions in a graph node.
. Using sycl streams in a graph node.
. Profiling an event returned from graph submission with
`event::get_profiling_info()`.
. Profiling information is not available for graphs that contain host-task nodes.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This restriction seems like another good reason not to expose per-node profiling information. I don't understand why the presence of a host task would impact the runtime's ability to reason about a different device kernel, and I think many users would struggle to understand this too.

It's also unclear to me whether the presence of a host task disables all profiling. What is the intent here? Is it still possible to use the event returned by the graph submission to reason about how long the entire graph took to execute?

Comment on lines +1923 to +1925
. Profiling a node from an event returned from graph submission with
`event::ext_oneapi_get_profiling_info(ext::node)` is only available for
the level-zero backend.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, I think this is surprising. Removing the per-node profiling completely would fix this.

Copy link
Contributor

This pull request is stale because it has been open 180 days with no activity. Remove stale label or comment or this will be automatically closed in 30 days.

@github-actions github-actions bot added the Stale label Sep 13, 2024
Copy link
Contributor

This pull request was closed because it has been stalled for 30 days with no activity.

@github-actions github-actions bot closed this Oct 15, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants