[SYCL][Graph] Support for Graph Node Profiling #12592

mfrancepillois · 2024-02-02T10:55:21Z

This PR:

adds a new Sycl API to query the profiling information of a node ext_oneapi_get_profiling_info(ext::node Node).
Implements node profiling support.
Throws an exception if the profiling is queried for partitioned graphs (i.e. graph with an host-task)
Adds new pi entry points for node profiling.
Updates specs.
Adds tests.
Updates symbols.

Implements node profiling support. Adds new pi entry points for node profiling. Updates docs. adds tests. Updates symbols.

Fixes declaration issues Changes raw pointer to weak_ptr Adds exception is node not found. Updates specs.

sycl/doc/extensions/experimental/sycl_ext_oneapi_graph.asciidoc

sycl/include/sycl/info/ext_oneapi_graph_node_profiling_traits.def

reble · 2024-02-02T19:31:33Z

sycl/doc/extensions/experimental/sycl_ext_oneapi_graph.asciidoc

+  execution associated with this SYCL event. If the requested info is
+  not available when this member function is called due to incompletion of
+  command groups associated with the event, then the call to this member
+  function will block until the requested info is available.


Keeping the following aspect as a different thread:
We also need to consider kernel/node fusion as an optimization strategy. In this case we might not have the exact profiling info.
@sommerlukas what are your thoughts on this?

I think there are two possible ways to handle this:

Return the same profiling info for all nodes that were fused. So if you use this new API with two different nodes, you would get the same information and that information would correspond to the profiling information of the fused kernel.

Disallow to query for profiling information if fusion was applied.

I believe that profiling info for the fused kernel would be valuable for the user, so I would prefer the first option here.

On some backends, it might even be possible to get profiling information for individual parts of the fused kernel corresponding to the original nodes, but I don't think this would be portable and implementation could quickly get very tricky, so I would prefer to not include that in the extension proposal.

Thanks for your feedback.
I agree that it might be interesting for users to get node profiling information when using the kernel fusion feature.
But the proposed profiling API is based on the level-zero function zeCommandListAppendQueryKernelTimestamps (https://spec.oneapi.io/level-zero/latest/core/api.html#zecommandlistappendquerykerneltimestamps), which in turn relies on the ZeEvents that were linked to the kernels when they were enqueued to their command-list. So I don't know what happens to this ZeEvents when the kernels are fused.
Could you tell me where I can find more information on the implementation of the kernel fusion feature?

Kernel fusion happens in the SYCL RT, even before commands are submitted to the underlying backend.

So, in this case, fusion would happen before anything is passed to the LevelZero backend. Instead, a single command for the execution of the fused kernel would be passed to the LevelZero backend and inserted into the command-list.

Obtaining profiling information for the execution of the fused kernel could use the same API and the zeEvent returned for the command corresponding to the fused kernel execution.

There is a design document with more information here: https://github.com/intel/llvm/blob/sycl/sycl/doc/design/KernelFusionJIT.md

In the current implementation, is the backend (ZE) command list only created upon call to command_graph<...>::finalize or already before that?

The command-buffers that instantiate the command-list for LevelZero backend are only created during the command_graph<...>::finalize process.

I think in that case it would be a viable implementation option to first perform fusion and only then create the command buffer inside command_graph<...>::finalize. The command-buffer would only contain the fused kernel command (and memory commands if necessary) and it would be possible to return a zeEvent that could be used for profiling of the fused kernel execution.

I believe that profiling info for the fused kernel would be valuable for the user, so I would prefer the first option here.

I agree that profiling information is useful for a fused graph as a whole, but I don't see how it's useful to attempt to provide this profiling information on a per-node basis. How can the application know what the profiling information means? Won't it depend on the implementation's ability to fuse each node?

Let's consider that an application calls ext_oneapi_get_profiling_info on a node in a fused graph. The returned information might tell the time it took to execute that one node (if the node could not be fused), it might tell the time it took to execute several nodes in the graph (if the node was fused with some of its neighbors), or it might tell the time it took to execute the entire graph (if all the nodes in the graph were fused). With so much uncertainty, I don't see why this is a useful feature.

I think it does make sense to let the application get profiling information for the entire graph, but the application can do this already using the event that is returned from queue::ext_oneapi_graph.

I think ext_oneapi_get_profiling_info should simply fail if the graph was created with fusion enabled. Per-node profiling information is not available for a fused graph.

It may also be worth noting that the SYCL 2020 wording is phrased in terms of "command groups".

Getting the profiling information for the command group in which the graph was submitted, where graph execution is the "action", is consistent with the specification wording, and is always well defined. Getting the profiling information for nodes in the graph seems less consistent to me: the mapping of nodes to command groups and "actions" is unclear.

sycl/test-e2e/Graph/Explicit/invalid_profiling.cpp

sycl/test-e2e/Graph/Explicit/nodes_profiling_info.cpp

sycl/test-e2e/Graph/Explicit/nodes_profiling_info_invalid_backend.cpp

sycl/include/sycl/detail/pi.h

maarquitos14 · 2024-03-08T17:05:16Z

sycl/plugins/unified_runtime/CMakeLists.txt

-  #     Merge pull request #1254 from Bensuo/cmdbuf-support-hip
-  #     [EXP][CMDBUF] HIP adapter support for command buffers
-  set(UNIFIED_RUNTIME_TAG a2757b2931daa2f8d7c9dd51b0fc846be1fd49a7 )
+  set(UNIFIED_RUNTIME_REPO "https://github.com/bensuo/unified-runtime.git")


Remember to set this back to the original before merging.

sycl/source/detail/event_info.hpp

sycl/test-e2e/Graph/Explicit/invalid_profiling.cpp

sycl/test-e2e/Graph/Explicit/nodes_profiling_info_invalid_backend.cpp

sycl/source/detail/event_impl.cpp

sycl/source/detail/event_info.hpp

sycl/source/detail/graph_impl.hpp

maarquitos14

LGTM. Just remember to revert the changes related to UR repo/tag.

gmlueck · 2024-03-12T12:54:29Z

sycl/doc/extensions/experimental/sycl_ext_oneapi_graph.asciidoc

+  execution associated with this SYCL event. If the requested info is
+  not available when this member function is called due to incompletion of
+  graph execution associated with the event, then the call to this member
+  function will block until the requested info is available.


When I first read this, I was confused about which event the application would use when calling this API because there are two relevant events. If the graph is created by recording a queue, each recorded node has an event. You also get an event when you submit the graph to a queue via queue::ext_oneapi_graph.

It took me a while to figure out that applications are expected to use ext_oneapi_get_profiling_info on the second event (the one returned from queue::ext_oneapi_graph). I think it would be helpful to make this more clear somehow in the description.

gmlueck · 2024-03-12T13:08:41Z

sycl/doc/extensions/experimental/sycl_ext_oneapi_graph.asciidoc

+  execution associated with this SYCL event. If the requested info is
+  not available when this member function is called due to incompletion of
+  graph execution associated with the event, then the call to this member
+  function will block until the requested info is available.


@EwanC asked in #12838 how ext_oneapi_get_profiling_info should relate to the new submit_profiling_tag API.

I think the graph extension could add a new member function to command_graph:

node add_profiling_tag();

This function would fail unless the graph's device had the aspect ext_oneapi_queue_profiling_tag. Once the application uses this API to add a profiling tag to the graph, it can get the profiling information later via ext_oneapi_get_profiling_info.

gmlueck · 2024-03-12T13:17:55Z

sycl/doc/extensions/experimental/sycl_ext_oneapi_graph.asciidoc

+  execution associated with this SYCL event. If the requested info is
+  not available when this member function is called due to incompletion of
+  command groups associated with the event, then the call to this member
+  function will block until the requested info is available.


I believe that profiling info for the fused kernel would be valuable for the user, so I would prefer the first option here.

I agree that profiling information is useful for a fused graph as a whole, but I don't see how it's useful to attempt to provide this profiling information on a per-node basis. How can the application know what the profiling information means? Won't it depend on the implementation's ability to fuse each node?

Let's consider that an application calls ext_oneapi_get_profiling_info on a node in a fused graph. The returned information might tell the time it took to execute that one node (if the node could not be fused), it might tell the time it took to execute several nodes in the graph (if the node was fused with some of its neighbors), or it might tell the time it took to execute the entire graph (if all the nodes in the graph were fused). With so much uncertainty, I don't see why this is a useful feature.

I think it does make sense to let the application get profiling information for the entire graph, but the application can do this already using the event that is returned from queue::ext_oneapi_graph.

I think ext_oneapi_get_profiling_info should simply fail if the graph was created with fusion enabled. Per-node profiling information is not available for a fused graph.

Pennycook · 2024-03-12T15:48:33Z

sycl/doc/extensions/experimental/sycl_ext_oneapi_graph.asciidoc

@@ -1881,8 +1919,10 @@ if used in application code.

 . Using reductions in a graph node.
 . Using sycl streams in a graph node.
-. Profiling an event returned from graph submission with
-  `event::get_profiling_info()`.
+. Profiling information is not available for graphs that contain host-task nodes.


This restriction seems like another good reason not to expose per-node profiling information. I don't understand why the presence of a host task would impact the runtime's ability to reason about a different device kernel, and I think many users would struggle to understand this too.

It's also unclear to me whether the presence of a host task disables all profiling. What is the intent here? Is it still possible to use the event returned by the graph submission to reason about how long the entire graph took to execute?

Pennycook · 2024-03-12T15:49:58Z

sycl/doc/extensions/experimental/sycl_ext_oneapi_graph.asciidoc

+. Profiling a node from an event returned from graph submission with
+  `event::ext_oneapi_get_profiling_info(ext::node)` is only available for
+  the level-zero backend.


Again, I think this is surprising. Removing the per-node profiling completely would fix this.

github-actions · 2024-09-13T01:56:46Z

This pull request is stale because it has been open 180 days with no activity. Remove stale label or comment or this will be automatically closed in 30 days.

github-actions · 2024-10-15T02:00:22Z

This pull request was closed because it has been stalled for 30 days with no activity.

mfrancepillois added 16 commits January 22, 2024 14:00

initial commit (does not work)

c032cf9

Merge branch 'sycl' into maxime/profiling_v2

ae32544

Add sycl event entry point for node profiling.

f63acca

Implements node profiling support. Adds new pi entry points for node profiling. Updates docs. adds tests. Updates symbols.

Merge branch 'sycl' into maxime/profiling_v2

c189890

fixes format

53b76bb

Renames query function to ext_oneapi_get_profiling_info()

421f303

Fixes declaration issues Changes raw pointer to weak_ptr Adds exception is node not found. Updates specs.

Code format improvement

99c7bf9

Updates comments

a7b839b

Updates comments + corrects typo

ed0bc28

Moves tests to Explicit directory

fbbc822

Merge branch 'sycl' into maxime/profiling_v2

7ff2327

Updates UR branch

96812b9

Merge branch 'sycl-upstream' into maxime/profiling_v2

f589d9b

Merge branch 'sycl-upstream' into maxime/profiling_v2

262b44a

Updates function name

46bce9c

Merge branch 'sycl-upstream' into maxime/profiling_v2

30ab2fe

This was referenced Feb 2, 2024

[EXP][CMDBUF] Support for Node Profiling oneapi-src/unified-runtime#1309

Open

[SYCL][Graph] Node Profiling reble/llvm#353

Closed

mfrancepillois had a problem deploying to WindowsCILock February 2, 2024 11:16 — with GitHub Actions Failure

Corrects pi function name

df99396

mfrancepillois had a problem deploying to WindowsCILock February 2, 2024 12:00 — with GitHub Actions Failure

Adds missing symbols

7d86fb2

mfrancepillois had a problem deploying to WindowsCILock February 2, 2024 14:40 — with GitHub Actions Failure

mfrancepillois temporarily deployed to WindowsCILock February 2, 2024 15:08 — with GitHub Actions Inactive

reble reviewed Feb 2, 2024

View reviewed changes

sycl/doc/extensions/experimental/sycl_ext_oneapi_graph.asciidoc Outdated Show resolved Hide resolved

Adds windows symbols + corrects a typo

190136b

mfrancepillois temporarily deployed to WindowsCILock February 2, 2024 16:05 — with GitHub Actions Inactive

mfrancepillois temporarily deployed to WindowsCILock February 2, 2024 16:27 — with GitHub Actions Inactive

reble reviewed Feb 2, 2024

View reviewed changes

sycl/include/sycl/info/ext_oneapi_graph_node_profiling_traits.def Outdated Show resolved Hide resolved

reble reviewed Feb 2, 2024

View reviewed changes

mfrancepillois requested review from a team as code owners February 21, 2024 14:04

mfrancepillois had a problem deploying to WindowsCILock February 21, 2024 14:04 — with GitHub Actions Failure

mfrancepillois requested review from jchlanda and maarquitos14 February 21, 2024 14:04

EwanC approved these changes Mar 6, 2024

View reviewed changes

mfrancepillois added 2 commits March 6, 2024 15:29

Merge branch 'sycl-upstream' into maxime/profiling_v2

74c5295

Update new Tests

2af6f28

mfrancepillois had a problem deploying to WindowsCILock March 6, 2024 16:34 — with GitHub Actions Failure

maarquitos14 reviewed Mar 8, 2024

View reviewed changes

Typos

7b8f9bb

maarquitos14 reviewed Mar 11, 2024

View reviewed changes

sycl/source/detail/event_impl.cpp Outdated Show resolved Hide resolved

sycl/source/detail/event_info.hpp Outdated Show resolved Hide resolved

sycl/source/detail/graph_impl.hpp Outdated Show resolved Hide resolved

mfrancepillois added 2 commits March 11, 2024 11:53

more typos

d58e1a5

Merge branch 'sycl' into maxime/profiling_v2

f453a09

mfrancepillois had a problem deploying to WindowsCILock March 11, 2024 12:02 — with GitHub Actions Failure

maarquitos14 approved these changes Mar 11, 2024

View reviewed changes

move tests to Profiling directory

1c6aa95

mfrancepillois had a problem deploying to WindowsCILock March 11, 2024 12:46 — with GitHub Actions Failure

EwanC mentioned this pull request Mar 11, 2024

[SYCL] Implement sycl_ext_oneapi_profiling_tag extension #12838

Merged

EwanC requested a review from a team March 12, 2024 09:48

gmlueck reviewed Mar 12, 2024

View reviewed changes

Pennycook reviewed Mar 12, 2024

View reviewed changes

EwanC mentioned this pull request Apr 24, 2024

[SYCL][Graph] Enable in-order cmd-list #13088

Merged

github-actions bot added the Stale label Sep 13, 2024

github-actions bot closed this Oct 15, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SYCL][Graph] Support for Graph Node Profiling #12592

[SYCL][Graph] Support for Graph Node Profiling #12592

mfrancepillois commented Feb 2, 2024

reble Feb 2, 2024

sommerlukas Feb 5, 2024

mfrancepillois Feb 6, 2024

sommerlukas Feb 6, 2024

mfrancepillois Feb 6, 2024

sommerlukas Feb 6, 2024

gmlueck Mar 12, 2024

Pennycook Mar 12, 2024

maarquitos14 Mar 8, 2024

maarquitos14 left a comment

gmlueck Mar 12, 2024

gmlueck Mar 12, 2024

gmlueck Mar 12, 2024

Pennycook Mar 12, 2024

Pennycook Mar 12, 2024

github-actions bot commented Sep 13, 2024

github-actions bot commented Oct 15, 2024

[SYCL][Graph] Support for Graph Node Profiling #12592

[SYCL][Graph] Support for Graph Node Profiling #12592

Conversation

mfrancepillois commented Feb 2, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

maarquitos14 left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

github-actions bot commented Sep 13, 2024

github-actions bot commented Oct 15, 2024