FrameGraph MultiGPU Support #165
-
Could you add an example of how adding connections without dependencies would work? Maybe with a diagram of the frame graphs, because I'm having a hard time visualizing it. If I remember correctly, the current implementation of the framegraph was done for simplicity (like the read-after-read case adding an edge) in order to produce a simple ordered list of scopes to execute instead of a graph of execution.
-
Since we originally posted this discussion, we have spent more time investigating and understanding the FrameGraph and its relations in Atom. As we understand it now, the current process is basically as follows:
What we end up with is one linear execution order of Scopes, with no real information about which Scopes could run in parallel, nor other dependency information such as whether Attachments are Read, Write or ReadWrite.

What does it mean to have execution and memory dependencies? For simplicity, let's consider them together: if one Scope writes or reads some kind of memory, it needs to wait until the previous Scope doing so has finished. That's what a dependency tells us. But sometimes no dependency is necessary at all, for example:

- Two Scopes only read the same memory. This can happen in parallel, so one does not need to wait for the other.
- Two Scopes read and/or write the same buffer or image, but in different, non-overlapping memory areas. This can also happen in parallel.

So just because two Scopes use the same Attachment, it doesn't automatically mean they have a dependency. To synchronize, there are a few different techniques depending on the circumstances of the Scopes; being most familiar with it, I'm using Vulkan terminology here:
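The basic technique within a single queue is a pipeline barrier combining an execution and a memory dependency. A minimal sketch of the write -> read case (assuming two compute dispatches and Vulkan 1.3 synchronization2; the dispatch sizes and resources are placeholders):

```cpp
#include <vulkan/vulkan.h>

// Scope A writes a buffer, Scope B reads it afterwards. A single
// execution + memory dependency makes A's writes visible to B.
void RecordWriteThenRead(VkCommandBuffer cmd)
{
    vkCmdDispatch(cmd, 64, 1, 1); // Scope A: writes the buffer

    VkMemoryBarrier2 barrier{};
    barrier.sType         = VK_STRUCTURE_TYPE_MEMORY_BARRIER_2;
    barrier.srcStageMask  = VK_PIPELINE_STAGE_2_COMPUTE_SHADER_BIT;
    barrier.srcAccessMask = VK_ACCESS_2_SHADER_WRITE_BIT;
    barrier.dstStageMask  = VK_PIPELINE_STAGE_2_COMPUTE_SHADER_BIT;
    barrier.dstAccessMask = VK_ACCESS_2_SHADER_READ_BIT;

    VkDependencyInfo dep{};
    dep.sType              = VK_STRUCTURE_TYPE_DEPENDENCY_INFO;
    dep.memoryBarrierCount = 1;
    dep.pMemoryBarriers    = &barrier;
    vkCmdPipelineBarrier2(cmd, &dep);

    vkCmdDispatch(cmd, 64, 1, 1); // Scope B: waits for A's writes
}

// By contrast, the two "no dependency" cases above (read-read, or writes to
// non-overlapping ranges) need no barrier at all: recording the dispatches
// back-to-back leaves the driver free to overlap them on the GPU.
void RecordIndependent(VkCommandBuffer cmd)
{
    vkCmdDispatch(cmd, 64, 1, 1);
    vkCmdDispatch(cmd, 64, 1, 1);
}
```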
With these bases covered, let's look at two examples.

The first one I already mentioned above: two ComputePasses write to the same BufferAttachment (in different areas of memory), and a third ComputePass then reads from the Buffer, depending on both. The top half of the figure shows how we would ideally like to connect the Passes and the Buffer. There are two connections that actually represent a dependency (both are write -> read dependencies). This is currently not possible, since you cannot connect to the same Input Slot twice. There is a third, dashed connection from the first to the second ComputePass which does not represent a dependency, but does exactly what current Connections do: make sure that both ComputePasses use the same Buffer, the first Pass being the one that creates the Attachment and the second one just using it. The lower half shows how the FrameGraph is built with FrameGraphAttachments and ScopeAttachments. Currently (left side), the ScopeAttachments would be linked in a linear chain, causing the first two ComputePasses to run in sequence. The ideal version is on the right side, where the ScopeAttachments are correctly linked together based on the Connections in the upper half of the image. This would allow the first two ComputePasses to run in parallel. The dash-dotted arrow again shows that the second ComputePass gets the same FrameGraphAttachment from the first ComputePass, but that there is no execution/memory dependency between them.

The second example turns things around: the first ComputePass writes to the buffer and the following two read from it. Since they are both only reading, they could again run in parallel. Now, to illustrate another issue, we would like to inject one more Pass into this pipeline. The upper half again shows what would need to happen: the two Connections from the first to the other two ComputePasses need to be removed, and the injected Pass connects its Input to the first and its Output to the other two ComputePasses, in order to correctly update the Connections that represent dependencies. Currently, what happens instead is that the injected Pass simply inserts itself after the first ComputePass, and a Connection is established between the first ComputePass's Output and the injected Pass's Input. The other two Connections remain. This only works because the ScopeAttachments are currently linked in the order in which the FrameGraphAttachment is used, showing that even though two Connections are missing, the dependencies are still inserted. I.e., Connections are currently neither required nor sufficient for dependencies. We can see that again in the lower half: all ScopeAttachments are put into a linear chain again. It works, but it's not optimal. Probably, due to later optimization of the Barriers based on the fact that both Scopes are only reading, the two later ComputePasses could nevertheless run in parallel, but that is at least not reflected in the graph. On the right side, we see how the ScopeAttachments should actually be linked, where it's easy to see that after the injected Pass, the other two can run in parallel.

We are slowly realising that changing the current system to the one we are envisioning would again be a substantial project, where:
Why do we need this in the first place? With multi-device scheduling we would like to have much more control, and we need much more information about the dependencies to do this efficiently and to potentially be able to migrate Scopes between devices automatically. Now, it's possible we are completely wrong, or there is a better/different way to accomplish what we would like to do. So we are really curious about your thoughts.
-
Even though the frame graph sorts and flattens the scopes, that doesn't mean the GPU execution is linear as well. Based on a scope's input and output resources, we tell the drivers how these resources need to be transitioned. In the first example, if we have CP1 -> CP2 -> CP3, then when we encode the GPU commands we specify that for CP1, Buffer 1 is in a Write state. Next, for CP2 and CP3, we specify that Buffer 1 will be in a Read state. Hence the drivers should ensure that CP1 is executed first and that CP2 and CP3 can then run in parallel. @akioCL, can you confirm this for Vulkan? The places where we add manual synchronization constructs are between scopes for aliased memory, or where the synchronization needs to happen between queues.
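If this holds, the CP1 -> CP2 -> CP3 case boils down to a single write -> read barrier after CP1 and nothing between CP2 and CP3. A hedged Vulkan sketch of what that encoding could look like (this is an illustration under those assumptions, not Atom's actual backend code):

```cpp
#include <vulkan/vulkan.h>

void RecordCp1Cp2Cp3(VkCommandBuffer cmd)
{
    vkCmdDispatch(cmd, 64, 1, 1); // CP1: Buffer 1 declared as Write

    // One write -> read dependency after CP1 covers both readers.
    VkMemoryBarrier2 writeToRead{};
    writeToRead.sType         = VK_STRUCTURE_TYPE_MEMORY_BARRIER_2;
    writeToRead.srcStageMask  = VK_PIPELINE_STAGE_2_COMPUTE_SHADER_BIT;
    writeToRead.srcAccessMask = VK_ACCESS_2_SHADER_WRITE_BIT;
    writeToRead.dstStageMask  = VK_PIPELINE_STAGE_2_COMPUTE_SHADER_BIT;
    writeToRead.dstAccessMask = VK_ACCESS_2_SHADER_READ_BIT;

    VkDependencyInfo dep{};
    dep.sType              = VK_STRUCTURE_TYPE_DEPENDENCY_INFO;
    dep.memoryBarrierCount = 1;
    dep.pMemoryBarriers    = &writeToRead;
    vkCmdPipelineBarrier2(cmd, &dep);

    // No barrier between CP2 and CP3: both only read Buffer 1, so the
    // driver is free to execute them in parallel.
    vkCmdDispatch(cmd, 64, 1, 1); // CP2: Buffer 1 declared as Read
    vkCmdDispatch(cmd, 64, 1, 1); // CP3: Buffer 1 declared as Read
}
```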
For multi-device resources, I think the synchronization needs to be made explicitly by the render pipeline rather than implicitly by the RHI. This essentially moves the complexity to the render pipeline developers instead of the RHI. So the render pipeline needs to embed copy passes between Write passes (on device 1) and Read passes (on device 2). Are you able to give an example where this does not work?
-
Thanks for your response! It's good to hear that multiple reads can still run in parallel, by properly inserting PipelineBarriers (Vulkan terminology), right? I guess that's what you mean by "tell the drivers"? But as far as we understand, it's still not possible at the moment to have two Passes write to the same buffer simultaneously, since we would tell the driver that both Passes write, which means we would insert a PipelineBarrier?! We decided for now to go the manual route as you suggested, where CopyPasses have to be inserted into the pipeline manually whenever data needs to be transferred from one device to another, rather than letting the RHI handle that. Unfortunately, that prevents automatic scheduling of Passes onto devices, but that is not something we could tackle yet anyway.
-
Ok, so we implemented our fixes in the FrameGraph that are necessary for the manual route, please have a look here: o3de/o3de#18140 We will investigate the actual Barriers that are set further and come back with more detailed information after that. What we can already mention is this: even when two Passes are writing to the same Buffer/Image/memory, you might not actually have a write-write dependency. The Passes could use atomics to access the same memory location, or they could write to memory in the same area but not actually the same locations (e.g. there's a 4-member struct in the Buffer and P1 writes members 1 and 3 while P2 writes members 2 and 4).
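To make the second case concrete, a small sketch of the 4-member struct example (the names are illustrative only):

```cpp
#include <cstdint>

// Both passes write the same buffer, but never the same bytes, so there is
// no write-write hazard even though both accesses are declared as writes.
struct BufferElement
{
    uint32_t m1; // written by P1
    uint32_t m2; // written by P2
    uint32_t m3; // written by P1
    uint32_t m4; // written by P2
};
```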
-
We do have support for sub-resource dependencies for textures, whereby if pass1 is writing out to mip0 and pass2 is writing out to mip1, then we only transition those sub-resources accordingly for those passes, and in theory the drivers should execute them in parallel. This also applies to writing out to the depth or stencil aspect plane. However, I don't think we have anything for two passes writing out to different parts of a buffer. Tracking this information and passing it to the RHI would be complicated to add, as we would need to consider different types of buffers and how they are updated. @akio to confirm in case I am wrong.
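In Vulkan terms, such a sub-resource dependency can be expressed by restricting the barrier's subresourceRange to the mip level in question. A hedged sketch (the layouts, stages and image handle are assumptions, not the engine's actual transition code):

```cpp
#include <vulkan/vulkan.h>

// A barrier that only covers mip 0 of the image: a pass touching mip 1 is
// not ordered against it, so the two passes may overlap.
VkImageMemoryBarrier2 MakeMip0Barrier(VkImage image)
{
    VkImageMemoryBarrier2 barrier{};
    barrier.sType               = VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER_2;
    barrier.srcStageMask        = VK_PIPELINE_STAGE_2_COMPUTE_SHADER_BIT;
    barrier.srcAccessMask       = VK_ACCESS_2_SHADER_WRITE_BIT;
    barrier.dstStageMask        = VK_PIPELINE_STAGE_2_COMPUTE_SHADER_BIT;
    barrier.dstAccessMask       = VK_ACCESS_2_SHADER_READ_BIT;
    barrier.oldLayout           = VK_IMAGE_LAYOUT_GENERAL;
    barrier.newLayout           = VK_IMAGE_LAYOUT_GENERAL;
    barrier.srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
    barrier.dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
    barrier.image               = image;
    barrier.subresourceRange    = {VK_IMAGE_ASPECT_COLOR_BIT,
                                   /*baseMipLevel*/ 0, /*levelCount*/ 1,
                                   /*baseArrayLayer*/ 0, /*layerCount*/ 1};
    return barrier;
}
```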
-
We continued to investigate synchronization in O3DE to understand how it works. To do so, we set up a simple pipeline that just runs 3 compute passes, followed by the ImGui pass and a copy-to-swapchain pass. The compute passes get a transient buffer and a transient image attached as
It's extremely hard to get barriers exactly right through automated means. For example, you would have to know exactly what happens in a shader to know which memory regions are actually accessed, which is of course impractical. Moreover, buffer/image memory barriers should not even be used when not doing a queue family ownership transfer or an image layout transition; global memory barriers should be used instead [1-4]. Consequently, there would, for example, only be a single memory barrier between the compute passes in the example above. That's an improvement that could be made to the engine. You are already aware of the possible improvement of correcting the access masks in terms of read/write. Another improvement would be to give pipeline designers the option to declare independence. That would, for example, allow them to resolve current read-read dependencies, but as we've discussed before, that's not the only case; it could also cover other use cases (writes to different parts of a buffer, or the use of atomic instructions, for example). It's good that the engine is currently conservative in terms of synchronization, since that only impacts performance rather than correctness. However, it would be nice to also be able to improve performance with this feature. We don't know how you would implement this in the current system.

[1] https://github.com/KhronosGroup/Vulkan-Docs/wiki/Synchronization-Examples
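As a sketch of the "single memory barrier" point above (assuming Vulkan 1.3 synchronization2; not the engine's actual barrier code): one global VkMemoryBarrier2 replaces the per-resource buffer/image barriers whenever no layout transition or queue family ownership transfer is involved.

```cpp
#include <vulkan/vulkan.h>

void RecordGlobalBarrier(VkCommandBuffer cmd)
{
    // One global memory barrier covering all compute writes -> compute
    // reads/writes, instead of one VkBufferMemoryBarrier2 or
    // VkImageMemoryBarrier2 per attachment.
    VkMemoryBarrier2 global{};
    global.sType         = VK_STRUCTURE_TYPE_MEMORY_BARRIER_2;
    global.srcStageMask  = VK_PIPELINE_STAGE_2_COMPUTE_SHADER_BIT;
    global.srcAccessMask = VK_ACCESS_2_SHADER_WRITE_BIT;
    global.dstStageMask  = VK_PIPELINE_STAGE_2_COMPUTE_SHADER_BIT;
    global.dstAccessMask = VK_ACCESS_2_SHADER_READ_BIT | VK_ACCESS_2_SHADER_WRITE_BIT;

    VkDependencyInfo dep{};
    dep.sType                    = VK_STRUCTURE_TYPE_DEPENDENCY_INFO;
    dep.memoryBarrierCount       = 1;
    dep.pMemoryBarriers          = &global;
    dep.bufferMemoryBarrierCount = 0; // no per-buffer barriers needed
    dep.imageMemoryBarrierCount  = 0; // no per-image barriers (no layout change)
    vkCmdPipelineBarrier2(cmd, &dep);
}
```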
-
Goal
To allow for proper multi-GPU scheduling, the `FrameGraph` needs to be aware of the locality of `Scope`s and `ScopeAttachment`s in order to properly link them, potentially transfer data between devices, insert appropriate `Barrier`s, etc.

Problem Description
In its current form, `FrameAttachment`s and `ScopeAttachment`s both hold multi-device resources (`Buffer` or `Image`). During `FrameGraph::Begin()`, such buffers and images can be imported into the `FrameGraph`, which creates `FrameAttachment`s from them. These can then be used during the `FrameScheduler::Compile` step, which calls the first of the three methods of a `ScopeProducer`, namely `Prepare/SetupFrameGraphDependencies`, during which `FrameGraphAttachmentDatabase::EmplaceScopeAttachment()` is called. This creates the `ScopeAttachment`s, updates the `FrameAttachment`s and links the `ScopeAttachment`s together, i.e. it creates a chain of `Scope`s using a `FrameAttachment` in a certain order. This allows subsequent calls to iterate over all `ScopeAttachment`s of a `FrameAttachment`, for example to insert barriers during the later `CompileResourceBarriers` step.
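A minimal sketch of this linkage (the types and members are simplified stand-ins for the Atom classes, not their actual declarations):

```cpp
// Each use of a FrameAttachment by a Scope becomes a ScopeAttachment that is
// appended to a per-attachment chain in emplacement order.
struct ScopeAttachment
{
    ScopeAttachment* m_next = nullptr; // next use of the same FrameAttachment
    // ... owning Scope, access (Read/Write/ReadWrite), usage, etc.
};

struct FrameAttachment
{
    ScopeAttachment* m_first = nullptr; // head of the usage chain
};

// CompileResourceBarriers-style traversal: consecutive chain entries are
// treated as ordered, which is exactly what breaks once consecutive entries
// can live on different devices.
void ForEachUse(FrameAttachment& attachment)
{
    for (ScopeAttachment* use = attachment.m_first; use != nullptr; use = use->m_next)
    {
        // ... emit barriers/transitions between use and use->m_next
    }
}
```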
This linkage and traversal works "fine" with a single device (not considering that subsequent reads are also linked in the `FrameGraph`), but becomes problematic in the case of multiple devices. Consider a multi-device resource, attached to the `FrameGraph` as a `FrameAttachment`, that is used in `Scope`s on multiple devices. Currently, this leads to these `Scope`s and `ScopeAttachment`s being linked, even across devices. These links suggest a need for synchronization or ordering (even where none is necessary) and also create barriers and/or state transitions that are incorrect (as the preceding `Scope` in the chain does not have to be one on the same device, but can be from another device).

Solution
In the current system, a connection between two passes (made by adding a connection to an attachment of a pass) results in an edge being added in the `FrameGraph` between the connected `Scope`s, providing information about synchronization and, in the case of multi-GPU, data movement. This form of implicit scheduling information tasks the `FrameGraph` with the scheduling decisions and makes it clear to the user (and less error-prone to use) how `Scope`s and corresponding `Attachment`s are connected, but it doesn't allow for parallelism between `Scope`s which are not actually dependent on one another.

We envision keeping implicit dependencies between `Scope`s via the connections (as is the current state), but allowing these connections to be augmented (via a flag) so that they do not introduce synchronization and data dependencies. All this information, namely which `Scope`s are connected by which attachments, should give the `FrameGraph` enough information to schedule resources with both a fixed scheduling pattern or even dynamic scheduling at some point, while still allowing for maximum parallel execution with the cleanest and least error-prone interface for the user. It also allows for the automatic insertion of copies between multiple devices where necessary based on the data dependencies, i.e. data migrations, or, put differently, it allows us to avoid such data migrations where unnecessary.
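A hypothetical sketch of what such an augmented connection could look like in code (the flag type and the usage shown are illustrative, not existing Atom API):

```cpp
#include <cstdint>

// A connection marked NoDependency would still propagate the attachment from
// one Scope to the next (so both use the same resource), but the FrameGraph
// would add no synchronization or data-movement edge for it.
enum class ScopeConnectionFlags : uint32_t
{
    None         = 0,
    NoDependency = 1 << 0,
};

// Hypothetical usage:
//   frameGraph.UseShaderAttachment(descriptor, access,
//                                  ScopeConnectionFlags::NoDependency);
```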
Current Problems
- Edges are added between `Scope`s regardless of the type of access (i.e. even two Read accesses will result in an unnecessary edge in the `FrameGraph`)
- `ScopeAttachment`s are just linked in order, i.e. subsequent calls to `frameGraph.Use*Attachment()` will link the `Scope`s in question in the order in which this method is called, not using the information from the json
- If two `Scope`s consume an attachment from a previous `Scope`, both reading from the `Attachment`, this will still introduce an edge between the two consumers instead of an edge between each consumer and the producer
- While this can be expressed in the json, it currently is not even possible to express this in code (as one can only call `frameGraph.Use*Attachment()` in a certain order)

A sketch of the in-order linking problem from the list above follows.
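The sketch below is hedged: the real `Use*Attachment` signature takes an attachment descriptor and access/usage flags, simplified here to a scope name and an access mode.

```cpp
enum class Access { Read, Write };

struct FrameGraphSketch
{
    // Appends a ScopeAttachment to the attachment's chain purely in call
    // order; the access mode is recorded but not used for linking.
    void UseShaderAttachment(const char* scope, Access access) { /* ... */ }
};

void BuildGraph(FrameGraphSketch& frameGraph)
{
    frameGraph.UseShaderAttachment("A", Access::Write); // producer
    frameGraph.UseShaderAttachment("B", Access::Read);  // linked after A (fine)
    frameGraph.UseShaderAttachment("C", Access::Read);  // linked after B,
                                                        // although it only
                                                        // depends on A
}
```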
in a certain order)Beta Was this translation helpful? Give feedback.
All reactions