Atom Direction 2022-2023 #48
-
"View, scene, pass, and object data is replicated into draw packet state at a per-draw frequency" and "draw packets that reference such state must be re-recorded to uniform buffers every time the state changes". |
-
I tend to agree that the material type system needs to be more modular. I'm curious how other engines support disparate pipelines: whether they rely on separating content (different materials for different platforms), or whether there is some system to swap out the shaders that back their equivalent of material types. To a certain extent, we could use our existing system to swap out shaders for different pipelines while using the same materials. We already do this for the low-end pipeline vs the main pipeline, and we could expand it a bit further to support a deferred pipeline, for example. But I agree with you that it won't scale well. If you have deferred vs forward, low end vs high end, and VR, plus game-specific pipelines, etc., it's going to be difficult to maintain a common material library. There's also the question of timing for this project. We could delay the work in favor of other projects for now, use a more bespoke solution for the few use cases we are facing at this time, and ramp up modularity as needed. But depending on how drastically we would want to modify the system, it could be better to tackle this project sooner, before the user base grows. (I have yet to form an opinion about which way we should go.)
-
Just throwing this out there: I am working on a tech demo and noticed that each object needs a distinct constant buffer to pass its specific transformation matrix (and other material setup flags, for instance). Unfortunately, from the CPU point of view, a constant buffer cannot be reused until the command list has executed (and been waited on), because writes through Map() are not recorded on the command list. And because of stringent alignment constraints (256-byte CBV alignment, and 64 KiB placement alignment for the buffer resource in D3D12, IIRC), we lose a lot of space to padding and fragmentation; it is also inconvenient to manage on the CPU side, since we need a whole collection of constant buffers. So I said screw it, kept a single buffer object, and turned it into a structured buffer so that it can be an array. Instead, I pass a 32-bit root constant on the command list that indexes into that structured buffer. The fetch of draw-call-frequency data is indirected through the root constant into the structured buffer, and that lifts a headache.
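A minimal sketch of that approach on the D3D12 side, assuming a root signature whose parameter 0 is a single 32-bit root constant and parameter 1 is a root SRV over the per-object structured buffer (struct layout and names here are illustrative, not taken from the actual demo):

```cpp
#include <d3d12.h>
#include <DirectXMath.h>
#include <cstdint>
#include <vector>

// Per-object data lives in one large structured buffer instead of many small
// constant buffers; a single 32-bit root constant selects the slot.
// HLSL side would declare: StructuredBuffer<ObjectData> g_objects;
struct ObjectData
{
    DirectX::XMFLOAT4X4 m_worldMatrix;
    uint32_t            m_materialFlags;
    uint32_t            m_pad[3]; // keep 16-byte aligned stride
};

struct Draw
{
    uint32_t m_objectIndex; // slot in the structured buffer
    uint32_t m_indexCount;
    uint32_t m_firstIndex;
    int32_t  m_baseVertex;
};

void RecordDraws(ID3D12GraphicsCommandList* cmdList,
                 ID3D12Resource* objectDataBuffer, // the structured buffer
                 const std::vector<Draw>& draws)
{
    // Parameter 1: the whole per-object array, bound once for all draws.
    cmdList->SetGraphicsRootShaderResourceView(1, objectDataBuffer->GetGPUVirtualAddress());

    for (const Draw& draw : draws)
    {
        // Parameter 0: the index of this draw's slot in the structured buffer.
        cmdList->SetGraphicsRoot32BitConstant(0, draw.m_objectIndex, 0);
        cmdList->DrawIndexedInstanced(draw.m_indexCount, 1, draw.m_firstIndex, draw.m_baseVertex, 0);
    }
}
```

The per-draw CPU cost then shrinks to one root-constant update, and the structured buffer can be updated independently of command-list lifetime.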
-
This post is meant to help steer discussion, planning, and architecture for the Atom renderer in the coming months. By "help steer," I mean that this is a set of recommendations that we can iterate on, and it would be wonderful to have partner and community feedback. For each item herein, I will explain the current state of affairs, and offer some rationale for why I think a change will be beneficial. This set of topics is by no means exhaustive, and skips over a lot of smaller scale spot optimizations we should consider.
General themes:
Global GPU-resident Scene Representation
Currently, state is "smeared" out in a number of places. View, scene, pass, and object data is replicated into draw packet state at a per-draw frequency. We have GPU-resident state for object transforms and skinned geometry, but in general, the state currently resides in command buffers.
This poses a few issues for scalability. First, draw packets that reference such state must have the state re-recorded to uniform buffers every time it changes, resulting in a memory write-amplification cost that scales with scene size rather than with the frequency of the state changes themselves. Second, systems that rely on indirection to gather state and render (e.g. ray tracing, terrain, deferred materials, deferred lighting) must each duplicate state that a unified scene representation would let them share.
Having a global GPU-resident scene representation unlocks several new opportunities, including but not restricted to the indirection-based systems mentioned above and the GPU-driven submission discussed in the next section.
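To make the idea concrete, here is a hedged sketch (all names hypothetical, not existing Atom types) of per-object state living in one persistent GPU-resident buffer, where only the slots that actually changed are re-uploaded each frame:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical per-object record stored in a GPU-resident structured buffer.
struct GpuObjectRecord
{
    float    m_worldMatrix[16];
    uint32_t m_materialIndex;
    uint32_t m_meshIndex;
    uint32_t m_flags;
    uint32_t m_pad;
};

// CPU-side mirror of the scene buffer. Only slots touched since the last
// flush are uploaded, so write traffic scales with change frequency rather
// than with total scene size.
class GpuSceneBuffer
{
public:
    uint32_t Allocate()
    {
        m_records.emplace_back();
        return static_cast<uint32_t>(m_records.size() - 1);
    }

    void Update(uint32_t slot, const GpuObjectRecord& record)
    {
        m_records[slot] = record;
        m_dirtySlots.push_back(slot);
    }

    // 'uploadFn(offsetBytes, data, sizeBytes)' stands in for whatever
    // buffer-upload path the RHI provides.
    template<typename UploadFn>
    void Flush(UploadFn&& uploadFn)
    {
        for (uint32_t slot : m_dirtySlots)
        {
            uploadFn(slot * sizeof(GpuObjectRecord), &m_records[slot], sizeof(GpuObjectRecord));
        }
        m_dirtySlots.clear();
    }

private:
    std::vector<GpuObjectRecord> m_records;
    std::vector<uint32_t> m_dirtySlots;
};
```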
Fewer draws, not faster draws
Atom's draw submission architecture is predicated on leveraging CPU-side parallelism for wide submission in the style of D3D12 bundles or Vulkan secondary command buffers. While this architecture leverages available concurrency well (CPU occupancy is high), much of the work is either unnecessary (excessive SRG/descriptor versioning) or could be done more efficiently on the GPU.
An alternative frame of reference for us as Atom contributors should be to minimize draws altogether. There are many downstream ramifications of this:
- Reduce VkDescriptorSetLayout and RootSignature counts to more effectively coalesce draws

This mindset will get us closer to a more modern approach of GPU-driven scene submission, unlock GPU culling, and, more importantly, free up CPU time to do useful work in other systems (physics, animation, etc.).
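One common shape this takes (shown here as a D3D12 sketch, not a description of Atom's current submission path) is a GPU culling pass that compacts surviving draw arguments into a buffer, followed by a single indirect submission call on the CPU:

```cpp
#include <d3d12.h>

// One entry per candidate draw; a culling compute pass compacts surviving
// entries into 'argsBuffer' and writes the survivor count into 'countBuffer'.
// The layout must match the D3D12_COMMAND_SIGNATURE_DESC used for 'signature'.
struct IndirectDrawArgs
{
    D3D12_DRAW_INDEXED_ARGUMENTS m_draw; // generated by the GPU culling pass
};

void SubmitScene(ID3D12GraphicsCommandList* cmdList,
                 ID3D12CommandSignature* signature,
                 ID3D12Resource* argsBuffer,
                 ID3D12Resource* countBuffer,
                 UINT maxDraws)
{
    // A single CPU-side call replaces per-object draw recording; the GPU
    // decides how many of the 'maxDraws' candidates are actually drawn.
    cmdList->ExecuteIndirect(signature, maxDraws, argsBuffer, 0, countBuffer, 0);
}
```

The equivalent path on Vulkan would be vkCmdDrawIndexedIndirectCount; either way, the CPU cost per frame becomes nearly independent of scene size.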
Decouple materials, lighting, and geometry
The "material type" is the location where everything comes together, describing how draws should be constructed for every render pass in the main render pipeline. This includes the depth prepass, shadow passes, motion vector pass, and main pass. Critically, this means that changes to the main render pipeline affect all material types, immediately restricting the ease with which alternative pipelines may be implemented.
There is a natural division that may be possible by decoupling the material type: separating primitive state from surface state promotes experimentation with geometry-only passes as well as surface and lighting passes. An ideal system in my mind has the following properties:
The first point is the most critical one. As Atom's usage expands to different domains (mobile, console, VR, virtual production, simulation, etc.), it will be necessary to author new pipelines specialized to the platform's needs. Currently, this is manifested as a "low end" pipeline and a main pipeline, but the existing data abstraction will not scale if every material type needs to integrate with every future pipeline and their respective render passes.
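Purely as an illustration of the division being proposed (these interfaces are hypothetical, not existing Atom code), the material type could be reduced to describing the surface, while each pipeline owns its passes and decides how that surface code is combined with its geometry and lighting stages:

```cpp
#include <string>
#include <vector>

// Illustrative sketch only: a material type contributes surface evaluation
// code, while each render pipeline owns its passes and stitches that code
// into its own geometry and lighting stages.
struct SurfaceShaderFragment
{
    std::string m_shaderSource;                     // evaluates albedo, normal, roughness, ...
    std::vector<std::string> m_requiredVertexStreams;
};

class IMaterialType
{
public:
    virtual ~IMaterialType() = default;
    // The material type no longer enumerates passes for a specific pipeline;
    // it only describes the surface.
    virtual SurfaceShaderFragment GetSurfaceFragment() const = 0;
};

class IRenderPipeline
{
public:
    virtual ~IRenderPipeline() = default;
    // Each pipeline (forward, deferred, shadow-only, VR, ...) builds the
    // shaders for the passes it owns from the material's surface fragment.
    virtual void BuildShadersFor(const IMaterialType& materialType) = 0;
};
```

Under a split like this, adding a new pipeline means implementing new passes against the surface contract, not re-authoring every material type.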
This task naturally has ramifications for the "Material Canvas" work, which will enable content creators to create procedural effects in shaders. We must work to ensure that materials authored via Material Canvas are content-stable across future renderer upgrades and architectural changes.
Device memory management
Currently, we don't have suitable mechanisms for backpressure when VRAM is oversubscribed. Entire mesh LOD chains and texture MIP chains are resident on the GPU, and if existing budgets are exhausted, draws are simply dropped from submission. We also don't respond to actual physical hardware conditions - our memory pools don't grow and shrink as available device-local memory changes.
A system is needed to resolve how to efficiently distribute memory across objects in the scene, balancing primitive and image resolution against draw distance, estimated texel density, and physical hardware constraints. Some engines do this computation on a world-cell basis, precomputing the MIPs and LODs needed based on camera world-space position and object world-cell assignment. Other engines use a GPU-feedback mechanism. Still other engines use a heuristic based on projected bounding-box areas and information computed offline. The approach we take is not yet determined, and an RFC will likely be needed here to guide further discussion (my preference is the third approach, sketched below).
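A hedged sketch of that third approach, assuming the texture's full UV range maps roughly onto the object's projected bounds (a per-material texel-density factor computed offline could refine the estimate):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Rough heuristic: pick the highest-resolution mip a texture actually needs
// from the object's projected screen coverage. Assumes the texture's full UV
// range maps approximately onto the projected bounds.
uint32_t EstimateRequiredMip(float projectedBoundsAreaPixels,
                             uint32_t textureWidth,
                             uint32_t textureHeight,
                             uint32_t mipCount)
{
    const float texelCount = float(textureWidth) * float(textureHeight);
    const float areaPixels = std::max(projectedBoundsAreaPixels, 1.0f);

    // The texel-to-pixel ratio per axis is sqrt(texels / pixels); each mip
    // halves that ratio, so the required mip is half the log2 of the ratio.
    const float mip = 0.5f * std::log2(std::max(texelCount / areaPixels, 1.0f));
    return std::min(static_cast<uint32_t>(mip), mipCount - 1);
}
```

Results from such a heuristic could then drive both streaming decisions and budget redistribution when the pool shrinks.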
Deferred lighting, visibility buffer pipelines
A successful Material Canvas unlocks tremendous creative expression capabilities... and a lot of shaders. Segmenting opaque and transparent geometry is a near-critical optimization for many engines to keep shader counts low, and a Forward+-only renderer realistically scales only for projects that can constrain artists to a fixed palette of curated uber-shaders. While it is debatable what the correct approach is for a given project, I believe Atom should not dictate this for the user. The more important opportunity is that implementing, say, a deferred lighting pipeline is a forcing function to ensure that our abstractions provide suitable degrees of flexibility and modularity. If changing an existing pipeline or adding a new one requires content changes, we know we have more work to do.
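For reference, a visibility-buffer pipeline keeps the geometry pass output down to a single packed ID per pixel and defers material and lighting evaluation to later passes; a minimal sketch of the packing (the bit split here is arbitrary and purely illustrative):

```cpp
#include <cstdint>

// The geometry pass writes only a packed ID per pixel; material and lighting
// evaluation happen in a later deferred pass that looks up scene data by ID.
// The 25/7 split assumes clusters of at most 128 triangles; real splits vary.
inline uint32_t PackVisibility(uint32_t clusterId, uint32_t triangleId)
{
    return (clusterId << 7) | (triangleId & 0x7Fu);
}

inline void UnpackVisibility(uint32_t packed, uint32_t& clusterId, uint32_t& triangleId)
{
    clusterId  = packed >> 7;
    triangleId = packed & 0x7Fu;
}
```

The key property is that per-pixel shading work is decoupled from how many shaders and draws produced the geometry, which is exactly the kind of flexibility the abstractions should permit.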
General Tenets
A general guiding recommendation I would offer is for contributors to consider, "For each feature, how would such a feature be customized or overridden, and how can I ensure that content is not affected?"
Atom has the tall task of being a renderer used across a broad gamut of domains and use cases, and compared to other renderers, I believe Atom should lean far more heavily into modularity and extensibility. The primary focus at the moment should be to welcome new contributors and tackle new workloads. The "break-even point" for such investments will naturally be further out in the future, but we should consider this an investment not just in the current working group, but in the future working group (which will hopefully be considerably larger).