FrameGraph MultiGPU Support #165
-
Could you add an example of how adding connections without dependencies would work? Maybe with a diagram of the frame graphs, because I'm having a hard time visualizing it. If I remember correctly, the current implementation of the framegraph was done for simplicity (like the read-after-read case adding an edge) in order to produce a simple ordered list of scopes to execute instead of a graph of execution.
-
Since we originally posted this discussion, we have spent more time investigating and understanding the FrameGraph and its relations in Atom. As we understand it now, the current process is basically as follows:
What we end up with is one linear execution order of Scopes, with no real information about which Scopes could run in parallel, nor other dependency information such as whether Attachments are Read, Write or ReadWrite.

What does it mean to have execution and memory dependencies? For simplicity, let's consider them together: if one Scope writes or reads some kind of memory, it needs to wait until the previous Scope doing so has finished. That's what a dependency tells us. But sometimes no dependency is necessary at all, for example:

- Two Scopes only read the same memory. This can happen in parallel, so one does not need to wait for the other.
- Two Scopes read and/or write the same buffer or image, but in different, non-overlapping memory areas. This can also happen in parallel.

So just because two Scopes use the same Attachment, it doesn't automatically mean they have a dependency. To synchronize, there are a few different techniques depending on the circumstances of the Scopes; being most familiar with it, I'm using Vulkan terminology here:
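The basic technique within a single queue is a pipeline barrier combining an execution and a memory dependency. A minimal sketch of the write -> read case (assuming two compute dispatches and Vulkan 1.3 synchronization2; the dispatch sizes and resources are placeholders):

```cpp
#include <vulkan/vulkan.h>

// Scope A writes a buffer, Scope B reads it afterwards. A single
// execution + memory dependency makes A's writes visible to B.
void RecordWriteThenRead(VkCommandBuffer cmd)
{
    vkCmdDispatch(cmd, 64, 1, 1); // Scope A: writes the buffer

    VkMemoryBarrier2 barrier{};
    barrier.sType         = VK_STRUCTURE_TYPE_MEMORY_BARRIER_2;
    barrier.srcStageMask  = VK_PIPELINE_STAGE_2_COMPUTE_SHADER_BIT;
    barrier.srcAccessMask = VK_ACCESS_2_SHADER_WRITE_BIT;
    barrier.dstStageMask  = VK_PIPELINE_STAGE_2_COMPUTE_SHADER_BIT;
    barrier.dstAccessMask = VK_ACCESS_2_SHADER_READ_BIT;

    VkDependencyInfo dep{};
    dep.sType              = VK_STRUCTURE_TYPE_DEPENDENCY_INFO;
    dep.memoryBarrierCount = 1;
    dep.pMemoryBarriers    = &barrier;
    vkCmdPipelineBarrier2(cmd, &dep);

    vkCmdDispatch(cmd, 64, 1, 1); // Scope B: waits for A's writes
}

// By contrast, the two "no dependency" cases above (read-read, or writes to
// non-overlapping ranges) need no barrier at all: recording the dispatches
// back-to-back leaves the driver free to overlap them on the GPU.
void RecordIndependent(VkCommandBuffer cmd)
{
    vkCmdDispatch(cmd, 64, 1, 1);
    vkCmdDispatch(cmd, 64, 1, 1);
}
```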
With these bases covered, let's look at two examples.

The first one I already mentioned above: two ComputePasses write to the same BufferAttachment (in different areas of memory), and a third ComputePass then reads from the Buffer, depending on both. The top half of the figure shows how we would ideally like to connect the Passes and the Buffer. There are two connections that actually represent a dependency (both are write -> read dependencies). This is currently not possible, since you cannot connect to the same Input Slot twice. There is a third, dashed connection from the first to the second ComputePass which does not represent a dependency, but does exactly what current Connections do: make sure that both ComputePasses use the same Buffer, the first Pass being the one that creates the Attachment and the second one just using it. The lower half shows how the FrameGraph is built with FrameGraphAttachments and ScopeAttachments. Currently (left side), the ScopeAttachments would be linked in a linear chain, causing the first two ComputePasses to run in sequence. The ideal version is on the right side, where the ScopeAttachments are correctly linked together based on the Connections in the upper half of the image. This would allow the first two ComputePasses to run in parallel. The dash-dotted arrow again shows that the second ComputePass gets the same FrameGraphAttachment from the first ComputePass, but that there is no execution/memory dependency between them.

The second example turns things around: the first ComputePass writes to the buffer and the following two read from it. Since they are both only reading, they could again run in parallel. Now, to illustrate another issue, we would like to inject one more Pass into this pipeline. The upper half again shows what would need to happen: the two Connections from the first to the other two ComputePasses need to be removed, and the injected Pass connects its Input to the first and its Output to the other two ComputePasses, in order to correctly update the Connections that represent dependencies. Currently, what happens instead is that the injected Pass simply inserts itself after the first ComputePass, and a Connection is established between the first ComputePass's Output and the injected Pass's Input. The other two Connections remain. This only works because the ScopeAttachments are currently linked in the order in which the FrameGraphAttachment is used, showing that even though two Connections are missing, the dependencies are still inserted. I.e., Connections are currently neither required nor sufficient for dependencies. We can see that again in the lower half: all ScopeAttachments are put into a linear chain again. It works, but it's not optimal. Probably, due to later optimization of the Barriers based on the fact that both Scopes are only reading, the two later ComputePasses could nevertheless run in parallel, but that is at least not reflected in the graph. On the right side, we see how the ScopeAttachments should actually be linked, where it's easy to see that after the injected Pass, the other two can run in parallel.

We are slowly realising that changing the current system to the one we are envisioning would again be a substantial project, where:
Why do we need this in the first place? With multi-device scheduling we would like to have much more control, and we need much more information about the dependencies to do this efficiently and to potentially be able to migrate Scopes between devices automatically. Now, it's possible we are completely wrong, or there is a better/different way to accomplish what we would like to do. So we are really curious about your thoughts.
-
Even though the frame graph sorts and flattens the scopes, that doesn't mean the GPU execution is linear as well. Based on a scope's input and output resources, we tell the drivers how these resources need to be transitioned. In the first example, if we have CP1 -> CP2 -> CP3, then when we encode the GPU commands we specify that for CP1, Buffer 1 is in a Write state. Next, for CP2 and CP3, we specify that Buffer 1 will be in a Read state. Hence the drivers should ensure that CP1 is executed first and that CP2 and CP3 can then run in parallel. @akioCL, can you confirm this for Vulkan? The places where we add manual synchronization constructs are between scopes for aliased memory, or where the synchronization needs to happen between queues.
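If this holds, the CP1 -> CP2 -> CP3 case boils down to a single write -> read barrier after CP1 and nothing between CP2 and CP3. A hedged Vulkan sketch of what that encoding could look like (this is an illustration under those assumptions, not Atom's actual backend code):

```cpp
#include <vulkan/vulkan.h>

void RecordCp1Cp2Cp3(VkCommandBuffer cmd)
{
    vkCmdDispatch(cmd, 64, 1, 1); // CP1: Buffer 1 declared as Write

    // One write -> read dependency after CP1 covers both readers.
    VkMemoryBarrier2 writeToRead{};
    writeToRead.sType         = VK_STRUCTURE_TYPE_MEMORY_BARRIER_2;
    writeToRead.srcStageMask  = VK_PIPELINE_STAGE_2_COMPUTE_SHADER_BIT;
    writeToRead.srcAccessMask = VK_ACCESS_2_SHADER_WRITE_BIT;
    writeToRead.dstStageMask  = VK_PIPELINE_STAGE_2_COMPUTE_SHADER_BIT;
    writeToRead.dstAccessMask = VK_ACCESS_2_SHADER_READ_BIT;

    VkDependencyInfo dep{};
    dep.sType              = VK_STRUCTURE_TYPE_DEPENDENCY_INFO;
    dep.memoryBarrierCount = 1;
    dep.pMemoryBarriers    = &writeToRead;
    vkCmdPipelineBarrier2(cmd, &dep);

    // No barrier between CP2 and CP3: both only read Buffer 1, so the
    // driver is free to execute them in parallel.
    vkCmdDispatch(cmd, 64, 1, 1); // CP2: Buffer 1 declared as Read
    vkCmdDispatch(cmd, 64, 1, 1); // CP3: Buffer 1 declared as Read
}
```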
For multi-device resources, I think the synchronization needs to be made explicitly by the render pipeline rather than implicitly by the RHI. This essentially moves the complexity to the render pipeline developers instead of the RHI. So the render pipeline needs to embed copy passes between Write passes (on device 1) and Read passes (on device 2). Are you able to give an example where this does not work?
-
Thanks for your response! It's good to hear that multiple reads can still run in parallel, by properly inserting PipelineBarriers (Vulkan terminology), right? I guess that's what you mean by "tell the drivers"? But as far as we understand, it's still not possible at the moment to have two Passes write to the same buffer simultaneously, since we would tell the driver that both Passes write, which means we would insert a PipelineBarrier?! We decided for now to go the manual route as you suggested, where CopyPasses have to be inserted into the pipeline manually whenever data needs to be transferred from one device to another, rather than letting the RHI handle that. Unfortunately, that prevents automatic scheduling of Passes onto devices, but that is not something we could tackle yet anyway.
-
Ok, so we implemented our fixes in the FrameGraph that are necessary for the manual route, please have a look here: o3de/o3de#18140 We will investigate the actual Barriers that are set further and come back with more detailed information after that. What we can already mention is this: even when two Passes are writing to the same Buffer/Image/memory, you might not actually have a write-write dependency. The Passes could use atomics to access the same memory location, or they could write to memory in the same area but not actually the same locations (e.g. there's a 4-member struct in the Buffer and P1 writes members 1 and 3 while P2 writes members 2 and 4).
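To make the second case concrete, a small sketch of the 4-member struct example (the names are illustrative only):

```cpp
#include <cstdint>

// Both passes write the same buffer, but never the same bytes, so there is
// no write-write hazard even though both accesses are declared as writes.
struct BufferElement
{
    uint32_t m1; // written by P1
    uint32_t m2; // written by P2
    uint32_t m3; // written by P1
    uint32_t m4; // written by P2
};
```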
-
We do have support for sub-resource dependencies for textures, whereby if pass1 is writing out to mip0 and pass2 is writing out to mip1, then we only transition those sub-resources accordingly for those passes, and in theory the drivers should execute them in parallel. This also applies to writing out to the depth or stencil aspect plane. However, I don't think we have anything for two passes writing out to different parts of a buffer. Tracking this information and passing it to the RHI would be complicated to add, as we would need to consider different types of buffers and how they are updated. @akio to confirm in case I am wrong.
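In Vulkan terms, such a sub-resource dependency can be expressed by restricting the barrier's subresourceRange to the mip level in question. A hedged sketch (the layouts, stages and image handle are assumptions, not the engine's actual transition code):

```cpp
#include <vulkan/vulkan.h>

// A barrier that only covers mip 0 of the image: a pass touching mip 1 is
// not ordered against it, so the two passes may overlap.
VkImageMemoryBarrier2 MakeMip0Barrier(VkImage image)
{
    VkImageMemoryBarrier2 barrier{};
    barrier.sType               = VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER_2;
    barrier.srcStageMask        = VK_PIPELINE_STAGE_2_COMPUTE_SHADER_BIT;
    barrier.srcAccessMask       = VK_ACCESS_2_SHADER_WRITE_BIT;
    barrier.dstStageMask        = VK_PIPELINE_STAGE_2_COMPUTE_SHADER_BIT;
    barrier.dstAccessMask       = VK_ACCESS_2_SHADER_READ_BIT;
    barrier.oldLayout           = VK_IMAGE_LAYOUT_GENERAL;
    barrier.newLayout           = VK_IMAGE_LAYOUT_GENERAL;
    barrier.srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
    barrier.dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
    barrier.image               = image;
    barrier.subresourceRange    = {VK_IMAGE_ASPECT_COLOR_BIT,
                                   /*baseMipLevel*/ 0, /*levelCount*/ 1,
                                   /*baseArrayLayer*/ 0, /*layerCount*/ 1};
    return barrier;
}
```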
-
We continued to investigate synchronization in O3DE to understand how it works. To do so, we set up a simple pipeline that just runs 3 compute passes, followed by the ImGui pass and a copy-to-swapchain pass. The compute passes get a transient buffer and a transient image attached as
It's extremely hard to get barriers exactly right through automated means. For example, you would have to know exactly what happens in a shader to know which memory regions are actually accessed, which is of course impractical. Moreover, buffer/image memory barriers should not even be used when not doing a queue family ownership transfer or an image layout transition; global memory barriers should be used instead [1-4]. Consequently, there would, for example, only be a single memory barrier between the compute passes in the example above. That's an improvement that could be made to the engine. You are already aware of the possible improvement of correcting the access masks in terms of read/write. Another improvement would be to give pipeline designers the option to declare independence. That would, for example, allow them to resolve current read-read dependencies, but as we've discussed before, that's not the only case; it could also cover other use cases (writes to different parts of a buffer, or the use of atomic instructions, for example). It's good that the engine is currently conservative in terms of synchronization, since that only impacts performance rather than correctness. However, it would be nice to also be able to improve performance with this feature. We don't know how you would implement this in the current system.

[1] https://github.com/KhronosGroup/Vulkan-Docs/wiki/Synchronization-Examples
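As a sketch of the "single memory barrier" point above (assuming Vulkan 1.3 synchronization2; not the engine's actual barrier code): one global VkMemoryBarrier2 replaces the per-resource buffer/image barriers whenever no layout transition or queue family ownership transfer is involved.

```cpp
#include <vulkan/vulkan.h>

void RecordGlobalBarrier(VkCommandBuffer cmd)
{
    // One global memory barrier covering all compute writes -> compute
    // reads/writes, instead of one VkBufferMemoryBarrier2 or
    // VkImageMemoryBarrier2 per attachment.
    VkMemoryBarrier2 global{};
    global.sType         = VK_STRUCTURE_TYPE_MEMORY_BARRIER_2;
    global.srcStageMask  = VK_PIPELINE_STAGE_2_COMPUTE_SHADER_BIT;
    global.srcAccessMask = VK_ACCESS_2_SHADER_WRITE_BIT;
    global.dstStageMask  = VK_PIPELINE_STAGE_2_COMPUTE_SHADER_BIT;
    global.dstAccessMask = VK_ACCESS_2_SHADER_READ_BIT | VK_ACCESS_2_SHADER_WRITE_BIT;

    VkDependencyInfo dep{};
    dep.sType                    = VK_STRUCTURE_TYPE_DEPENDENCY_INFO;
    dep.memoryBarrierCount       = 1;
    dep.pMemoryBarriers          = &global;
    dep.bufferMemoryBarrierCount = 0; // no per-buffer barriers needed
    dep.imageMemoryBarrierCount  = 0; // no per-image barriers (no layout change)
    vkCmdPipelineBarrier2(cmd, &dep);
}
```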
-
Goal
To allow for proper multi-GPU scheduling, the `FrameGraph` needs to be aware of the locality of `Scope`s and `ScopeAttachment`s in order to properly link them, potentially transfer data between devices, insert appropriate `Barrier`s, etc.

Problem Description
In its current form, `FrameAttachment`s and `ScopeAttachment`s both hold multi-device resources (`Buffer` or `Image`). During `FrameGraph::Begin()`, such buffers and images can be imported into the `FrameGraph`, which creates `FrameAttachment`s from them. These can then be used during the `FrameScheduler::Compile` step, which calls the first of the three methods of a `ScopeProducer`, namely `Prepare/SetupFrameGraphDependencies`, during which `FrameGraphAttachmentDatabase::EmplaceScopeAttachment()` is called. This creates the `ScopeAttachment`s, updates the `FrameAttachment`s and links the `ScopeAttachment`s together, i.e. it creates a chain of `Scope`s using a `FrameAttachment` in a certain order. This allows subsequent calls to iterate over all `ScopeAttachment`s of a `FrameAttachment`, for example to insert barriers during the later `CompileResourceBarriers` step.
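A minimal sketch of this linkage (the types and members are simplified stand-ins for the Atom classes, not their actual declarations):

```cpp
// Each use of a FrameAttachment by a Scope becomes a ScopeAttachment that is
// appended to a per-attachment chain in emplacement order.
struct ScopeAttachment
{
    ScopeAttachment* m_next = nullptr; // next use of the same FrameAttachment
    // ... owning Scope, access (Read/Write/ReadWrite), usage, etc.
};

struct FrameAttachment
{
    ScopeAttachment* m_first = nullptr; // head of the usage chain
};

// CompileResourceBarriers-style traversal: consecutive chain entries are
// treated as ordered, which is exactly what breaks once consecutive entries
// can live on different devices.
void ForEachUse(FrameAttachment& attachment)
{
    for (ScopeAttachment* use = attachment.m_first; use != nullptr; use = use->m_next)
    {
        // ... emit barriers/transitions between use and use->m_next
    }
}
```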
This linkage and traversal works "fine" with a single device (not considering that subsequent reads are also linked in the `FrameGraph`), but becomes problematic in the case of multiple devices. Consider a multi-device resource, attached to the `FrameGraph` as a `FrameAttachment`, that is used in `Scope`s on multiple devices. Currently, this leads to these `Scope`s and `ScopeAttachment`s being linked, even across devices. These links suggest a need for synchronization or ordering (even where none is necessary) and also create barriers and/or state transitions that are incorrect (as the preceding `Scope` in the chain does not have to be one on the same device, but can be from another device).

Solution
In the current system, a connection between two passes (made by adding a connection to an attachment of a pass) results in an edge being added in the `FrameGraph` between the connected `Scope`s, providing information about synchronization and, in the case of multi-GPU, data movement. This form of implicit scheduling information tasks the `FrameGraph` with the scheduling decisions and makes it clear to the user (and less error-prone to use) how `Scope`s and corresponding `Attachment`s are connected, but it doesn't allow for parallelism between `Scope`s which are not actually dependent on one another.

We envision keeping implicit dependencies between `Scope`s via the connections (as is the current state), but allowing these connections to be augmented (via a flag) so that they do not introduce synchronization and data dependencies. All this information, namely which `Scope`s are connected by which attachments, should give the `FrameGraph` enough information to schedule resources with both a fixed scheduling pattern or even dynamic scheduling at some point, while still allowing for maximum parallel execution with the cleanest and least error-prone interface for the user. It also allows for the automatic insertion of copies between multiple devices where necessary based on the data dependencies, i.e. data migrations, or, put differently, it allows us to avoid such data migrations where unnecessary.
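A hypothetical sketch of what such an augmented connection could look like in code (the flag type and the usage shown are illustrative, not existing Atom API):

```cpp
#include <cstdint>

// A connection marked NoDependency would still propagate the attachment from
// one Scope to the next (so both use the same resource), but the FrameGraph
// would add no synchronization or data-movement edge for it.
enum class ScopeConnectionFlags : uint32_t
{
    None         = 0,
    NoDependency = 1 << 0,
};

// Hypothetical usage:
//   frameGraph.UseShaderAttachment(descriptor, access,
//                                  ScopeConnectionFlags::NoDependency);
```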
Current Problems
- Edges are added between `Scope`s regardless of the type of access (i.e. even two Read accesses will result in an unnecessary edge in the `FrameGraph`)
- `ScopeAttachment`s are just linked in order, i.e. subsequent calls to `frameGraph.Use*Attachment()` will link the `Scope`s in question in the order in which this method is called, not using the information from the json
- If two `Scope`s consume an attachment from a previous `Scope`, both reading from the `Attachment`, this will still introduce an edge between the two consumers instead of an edge between each consumer and the producer
- While this can be expressed in the json, it currently is not even possible to express this in code (as one can only call `frameGraph.Use*Attachment()` in a certain order)

A sketch of the in-order linking problem from the list above follows.
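The sketch below is hedged: the real `Use*Attachment` signature takes an attachment descriptor and access/usage flags, simplified here to a scope name and an access mode.

```cpp
enum class Access { Read, Write };

struct FrameGraphSketch
{
    // Appends a ScopeAttachment to the attachment's chain purely in call
    // order; the access mode is recorded but not used for linking.
    void UseShaderAttachment(const char* scope, Access access) { /* ... */ }
};

void BuildGraph(FrameGraphSketch& frameGraph)
{
    frameGraph.UseShaderAttachment("A", Access::Write); // producer
    frameGraph.UseShaderAttachment("B", Access::Read);  // linked after A (fine)
    frameGraph.UseShaderAttachment("C", Access::Read);  // linked after B,
                                                        // although it only
                                                        // depends on A
}
```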
in a certain order)Beta Was this translation helpful? Give feedback.
All reactions