Major performance problems with multithreading. #5525
-
I've just been reading through the Rend3 and WGPU code, and I'm starting to understand what's slowing down the render thread. The single command queue in WGPU seems to be the bottleneck. I think this is why arcanization didn't deliver the expected performance improvement for multithreaded users.

I have many (usually 16) content-loading threads which fetch content from the content delivery network. They decompress, cache, etc., and then call Rend3's add_2d_texture. I'd thought that call would load the asset into the GPU immediately and then return. That's not how it works. The call generates a command item to load the texture content and puts it on Rend3's single wgpu::Queue. Nothing processes that queue at that time. There's a callback mechanism that informs Rend3 when those commands have been processed. It's an "async" kind of system, although it doesn't use Rust's async syntax or machinery. At the beginning of the next frame, as Rend3's render graph is processed, the render thread calls WGPU's "submit". That causes the queued commands to be processed and the asset to be loaded into the GPU, using time from the render thread. So that's why loading content impacts the frame rate so badly: the render thread is doing all the work.

So, how to fix this? It might not be that hard. If add_2d_texture at the Rend3 level, and related calls, simply blocked until the texture was in the GPU, that would work fine. You can't do anything with a new texture or mesh until the "add" call has returned and given you a handle, which eliminates any possibility of trying to use the texture before it is loaded. Rend3 handles are all refcounted (Rust Arc), so they can't disappear until they are no longer needed. So the necessary interlocking can be obtained for free from standard Rust.

So, at the Rend3 level, make a distinction between commands which add content that can't be used until you have a handle, and other commands. "add_2d_texture", "add_mesh", and "add_skeleton" are such commands. Those commands involve transferring large blocks of asset data, so they are the ones where this matters. What's needed is a separate queue for only those safe commands, processed in parallel with rendering. Everything else retains the present ordering constraints.

Blocking longer on "add_2d_texture" is not a problem for my application, because it only blocks one of N loading threads. Those threads are fed from a single prioritized work queue, using the Rust crate priority-queue, and priorities change as the viewpoint moves. It's much better to have the backlog in the priority-queue system than in a FIFO portion of the pipeline. So I see this as a win. Are there cases where it is not? Would making that operation block longer degrade single-threaded graphics applications? Unclear, but probably not. For a single-threaded application, the work eventually gets done in the main thread either way.

Currently, it looks like you can't get WGPU to give you two queues for a device, because WGPU's request_device only returns one queue. The Vulkan level seems to support multiple queues, but the WGPU level does not export that functionality. Some backends may not support it, so WGPU has to be able to refuse a request for a second queue, but it should support one for backends which can handle it.

Now, the question is whether this can be made safe at the WGPU level. Attempts to use something before it's loaded have to be detected, but that can be considered an error, not something WGPU has to synchronize. If you only have one queue, you don't see this potential problem. If you have two queues, and something like Rend3 above WGPU, it's handled for you. So this looks like it can be made sound. I haven't been down in those internals before, so I'd appreciate comments from those who have. Thanks.
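To make the loading-thread side of this concrete, here is a minimal sketch of the pattern described above: N worker threads fed from a prioritized work queue, each making a blocking "add" call. Everything here is illustrative: the Task type and renderer call are stand-ins, the blocking add_2d_texture is the *proposed* behavior rather than current Rend3, and std's BinaryHeap is used instead of the priority-queue crate to keep the example self-contained.

```rust
use std::collections::BinaryHeap;
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

/// A content-loading task; higher priority pops first. Priorities would be
/// recomputed as the viewpoint moves (illustrative stand-in for priority-queue).
#[derive(PartialEq, Eq, PartialOrd, Ord)]
struct Task {
    priority: u32,
    asset_url: String,
}

/// A shared max-priority work queue that loading threads block on.
struct WorkQueue {
    heap: Mutex<BinaryHeap<Task>>,
    ready: Condvar,
}

impl WorkQueue {
    fn push(&self, task: Task) {
        self.heap.lock().unwrap().push(task);
        self.ready.notify_one();
    }
    fn pop(&self) -> Task {
        let mut heap = self.heap.lock().unwrap();
        loop {
            if let Some(task) = heap.pop() {
                return task;
            }
            heap = self.ready.wait(heap).unwrap();
        }
    }
}

fn main() {
    let queue = Arc::new(WorkQueue {
        heap: Mutex::new(BinaryHeap::new()),
        ready: Condvar::new(),
    });

    // 16 loading threads. It's fine for the "add" call to block: only the
    // worker waits, not the render thread, and the backlog stays in the
    // priority queue where it can be reordered.
    for _ in 0..16 {
        let queue = Arc::clone(&queue);
        thread::spawn(move || loop {
            let task = queue.pop();
            println!("fetching and decoding {}", task.asset_url);
            let _texture_bytes: Vec<u8> = vec![0; 1024]; // decompressed asset data
            // renderer.add_2d_texture(...);  // hypothetical blocking call that
            //                                // returns a refcounted handle once
            //                                // the texture is in the GPU
        });
    }

    queue.push(Task { priority: 10, asset_url: "https://cdn.example/asset.tex".into() });
    // (In the real client, main would go on to run the render loop.)
}
```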
-
See this NVidia example of how to use multiple command queues with Vulkan to speed up asset loading. It describes best practices for concurrent content loading.
-
Looking around for who's using transfer queues and what's known about them. The queues the Vulkan level reports are apparently only loosely connected to what the hardware offers. Some platforms report a large number of available queues, but they're really being multiplexed at some lower level (NVidia?). Others really have dedicated DMA controllers for each queue (AMD?). Some (Android?) don't offer a transfer-only queue at all. Most discussions about performance are speculative, not operational results. I get the feeling that Unreal Engine users care strongly about this, and their engine does most of the work for them, probably because the UE AAA title developers with their 100GB asset games hit this problem in normal operation. Nobody else seems to understand the issue very well.

A good starting point is probably for WGPU to ask Vulkan for one graphics queue and one transfer queue. If Vulkan won't offer a second queue, put transfer work on the main graphics queue, as at present. You can get more queues, but being able to overlap drawing and transfer is the big win; overlapping multiple transfers, not so much, since there's only so much PCI bandwidth. Vulkan supports separate graphics and transfer queues, but some devices won't have the hardware. DX12 apparently works like Vulkan in this area. Apple's Metal supports multiple queues, but there is only one queue type, I think.

The obvious way to implement this is: user allocates a buffer, fills the buffer, creates a command, the command goes on the queue, the queue is submitted, a completion callback fires, and the user gets a handle to the texture or mesh back. But "submit" is said to be an expensive operation, and batching up commands is suggested. Does that really matter on a transfer queue, where there's not much setup? Batching complicates things. If you do one at a time, the caller gets a handle back. If not, you have to return something like a future and wait for completion when the asset is needed. WGPU seems to have some internal machinery for that already, although it doesn't use async/future terminology or Rust features. All this is from a cursory reading of the code and what's out there. Comments from people into the innards of this would help.
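For illustration only, here's roughly what the "ask for a dedicated transfer queue, fall back to the graphics queue" selection might look like, written against the ash Vulkan bindings. This is not wgpu code; the function name and the fallback policy are assumptions, and only the queue-family selection logic is shown.

```rust
use ash::vk;

/// Pick queue family indices: one graphics-capable family, plus (if available)
/// a family that supports TRANSFER but not GRAPHICS, i.e. a dedicated
/// transfer/DMA queue. Returns (graphics_family, optional_transfer_family).
fn pick_queue_families(families: &[vk::QueueFamilyProperties]) -> (u32, Option<u32>) {
    let graphics = families
        .iter()
        .position(|f| f.queue_flags.contains(vk::QueueFlags::GRAPHICS))
        .expect("no graphics-capable queue family") as u32;

    // Prefer a transfer-only family. Some devices (often mobile) won't have
    // one, in which case the caller falls back to doing uploads on the
    // graphics queue, as at present.
    let transfer = families
        .iter()
        .enumerate()
        .find(|(_, f)| {
            f.queue_flags.contains(vk::QueueFlags::TRANSFER)
                && !f.queue_flags.contains(vk::QueueFlags::GRAPHICS)
        })
        .map(|(i, _)| i as u32);

    (graphics, transfer)
}
```

The `families` slice would come from Vulkan's vkGetPhysicalDeviceQueueFamilyProperties (ash's Instance::get_physical_device_queue_family_properties).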
-
To illustrate the use case for this, here's Second Life's Fantasy Faire 2024, being rendered with the Sharpview/Rend3/WGPU/Vulkan/Linux/NVidia 3070 stack. Sharpview can get all that content from the network fast enough, but due to the bottleneck discussed above, it can't push it into the GPU fast enough, so the frame rate takes a hit. If you stay in one place and let the content loader threads catch up, the frame rate goes up to 60FPS, but it can drop to 10FPS when content loading interferes with the render thread. This is not a theoretical problem.
-
So, your uploading threads are making calls to things like Rend3's add_2d_texture, which stage data onto the queue? You should be able to kick off those transfers immediately by calling "submit" on the queue with an empty list of command buffers.
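For what it's worth, a minimal sketch of that pattern on a loading thread might look like this (the texture, layout, and size arguments are placeholders supplied by the caller): Queue::write_texture stages the copy, and an empty Queue::submit flushes the pending writes without recording any commands.

```rust
/// Sketch: flush a staged texture upload from a loading thread.
/// All arguments are placeholders the caller provides.
fn upload_now(
    queue: &wgpu::Queue,
    texture: &wgpu::Texture,
    data: &[u8],
    layout: wgpu::ImageDataLayout,
    size: wgpu::Extent3d,
) {
    // Stage the copy. This does not, by itself, guarantee when the copy runs.
    queue.write_texture(
        wgpu::ImageCopyTexture {
            texture,
            mip_level: 0,
            origin: wgpu::Origin3d::ZERO,
            aspect: wgpu::TextureAspect::All,
        },
        data,
        layout,
        size,
    );
    // Submitting an empty set of command buffers still flushes pending
    // write_texture/write_buffer work, so the transfer can start now rather
    // than waiting for the render thread's next submit (the "empty submit"
    // trick discussed in this thread).
    queue.submit(std::iter::empty());
}
```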
-
I'm confident enough that the locks post-arcanization are finer-grained and less contentious to say that, if switching between pre-arcanization and post-arcanization wgpu doesn't improve your throughput, then your throughput is probably not limited by wgpu-core lock contention. Famous last words, but that's my bet. But to answer your overall question: yes, wgpu is very concerned with performance, and although right now correctness and security are our priorities, to prepare for shipping WebGPU in Firefox, we will definitely be looking at performance as soon as those are under control.
-
Just in case the "empty submit" trick helps, I filed #5636 to make the documentation for this more explicit.
-
It's hard. I'm using Rend3, which is calling WGPU, so I'm one level removed from the problem and somewhat fumbling around here. The update threads are interfering badly with render performance, but neither I nor Connor Fitzgerald is sure why. Here are some Tracy traces showing what looks like a lock conflict at one of the locks in Texture::create_view in WGPU. There's enough indirection in that code that I can't figure out where the bottleneck is. Here's the raw Tracy file, which can be viewed with Tracy profiler 0.10.0. Calling "submit" on the main queue from an update thread while the render thread is in the middle of rendering seems questionable. Will that work? Is that queue locked? The Vulkan spec says "Host access to queue must be externally synchronized".
-
The
-
The submit-queue problem may not be the dominant performance problem. Above, I wrote:

> Here is BVE-Reborn/rend3#579, showing what looks like a lock conflict at one of the locks in Texture::create_view in WGPU. There's enough indirection in that code that I can't figure out where the bottleneck is. Here's the raw Tracy file, which can be viewed with Tracy profiler 0.10.0.

This, in texture_create_view, seems to be where Tracy shows a stall. Multiple threads are stuck in texture_create_view. Somebody is waiting too long for something, but I can't tell what. Locks requested include:

One of those is probably stalling things for several milliseconds. It's not a full deadlock, just a stall. I see that work is going on at #5586 to add a lock observer. That might help here. How's that coming along?
-
Here are screenshots of the key periods in Tracy. Three threads are stuck in texture_create_view for far longer than it usually takes.
-
I have a stack backtrace from interrupting the program under GDB. I see multiple asset-loading threads waiting in assign in registry.rs to get the lock on data.
This seems to be the same bottleneck previously reported with profiling data. In the debugger, it can be seen exactly which lock is holding things up. The assign function seems to write-lock a Registry's entire Storage, even though "insert" and "get" look like they were meant to be fast operations. I'd be tempted to instrument that lock acquisition to find out where the time is going; a rough sketch follows below.

Backtrace attached. I'm seeing multiple threads waiting at that lock, and I don't know why yet. This seems to be at least part of why asset loading hits rendering performance so hard.
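Something along these lines could measure the wait. This is only a sketch of the idea, not a drop-in patch: wgpu-core's Storage uses its own lock types rather than std's RwLock, and the label and 1 ms threshold here are made up.

```rust
use std::sync::{RwLock, RwLockWriteGuard};
use std::time::Instant;

/// Acquire a write lock, logging if the wait was long. Sketch only; the real
/// lock lives inside wgpu-core's registry/storage code.
fn timed_write<'a, T>(label: &str, lock: &'a RwLock<T>) -> RwLockWriteGuard<'a, T> {
    let start = Instant::now();
    let guard = lock.write().unwrap();
    let waited = start.elapsed();
    if waited.as_millis() >= 1 {
        eprintln!("{label}: waited {waited:?} for write lock");
    }
    guard
}

fn main() {
    let storage = RwLock::new(Vec::<u32>::new());
    // In wgpu-core, this would wrap the lock taken in Registry::assign.
    timed_write("Registry::assign(storage)", &storage).push(1);
}
```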
-
I've upgraded Rend3 from WGPU 0.19 to WGPU 0.20. No major changes to Rend3; just enough to match breaking changes in other crates. Performance is somewhat better and substantially more consistent. Asset fetch and create activity in other threads doesn't seem to be slowing the rendering thread as severely. When a new region connects and a large number of assets are added, the transient impact is less. I used to have transients where FPS dropped below 10; now the worst I'm seeing is 20, and most of the time it's 60FPS. This is encouraging. I can work with this. New Rend3: https://github.com/John-Nagle/rend3-hp
-
I'm looking at Tracy traces for performance problems, and finding some.
Interestingly, both of these are simple bookkeeping; they're not doing much computation. I'm seeing some other spots where that might be the case. It looks like the way to approach this for now is to spend time looking at Tracy traces and noticing where things that should be trivially fast are slow.
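One low-effort way to narrow such spots down is to wrap suspect sections in their own Tracy zones so they show up by name in the trace. A sketch using the profiling crate (which, as far as I can tell, wgpu and Rend3 already integrate with); the bookkeeping function here is made up for illustration.

```rust
// Cargo.toml (sketch): profiling = { version = "1", features = ["profile-with-tracy"] }

/// A made-up "should be trivially fast" bookkeeping function, instrumented so
/// it shows up as its own zone in the Tracy trace.
#[profiling::function]
fn update_bookkeeping(items: &mut Vec<u64>) {
    // Narrower zone around the part suspected of stalling.
    profiling::scope!("sort_items");
    items.sort_unstable();
}

fn main() {
    let mut items = vec![3, 1, 2];
    loop {
        update_bookkeeping(&mut items);
        profiling::finish_frame!(); // marks a frame boundary for Tracy
        break; // a real render loop would keep going
    }
}
```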
-
I've been using the Rend3/WGPU/Egui/Winit/Vulkan stack for over three years now to develop a metaverse viewer. It works for Second Life and Open Simulator. It's a big job, and it's coming along well. Here are some videos. The older viewers, in C++, are mostly single-threaded, and they keep running out of CPU time in the main thread. The frame rate then drops, sometimes to single digits. My goal was to use all available CPUs and the GPU effectively to get past that.
The basic metaverse rendering problem for high-detail metaverses is that there's a huge amount of unique content, far more than in commercial games. Second Life has a world the size of Los Angeles and three petabytes of assets stored on AWS servers. You can drive and fly around that world. When you do that, up to 400mb/s of content is coming in from the network. Content can be highly detailed. Entire cities as detailed as the glTF "bistro" demo exist. Getting data into the GPU fast while rendering is in progress dominates the problem.
So the basic architecture in Sharpview, my client, is to have the render thread just render, and do all updating from other threads. Other threads load assets, move and modify objects, and deal with the network traffic. The rendering thread has higher priority than most of the asset wrangling. That's how people get things done in UE5.
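As a very rough illustration of that division of labor (render thread just renders; other threads do the updating), here's a generic message-passing sketch. This is not Sharpview's or Rend3's actual design; the update types and channel-based hand-off are illustrative only, since in practice updates go through shared, refcounted objects.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Illustrative update messages produced by worker threads.
enum SceneUpdate {
    MoveObject { id: u64, position: [f32; 3] },
    AssetReady { id: u64 },
}

fn main() {
    let (tx, rx) = mpsc::channel::<SceneUpdate>();

    // Asset/network worker: fetches, decodes, and uploads off the render thread.
    thread::spawn(move || {
        tx.send(SceneUpdate::AssetReady { id: 42 }).ok();
        tx.send(SceneUpdate::MoveObject { id: 42, position: [1.0, 2.0, 3.0] }).ok();
    });

    // Render thread: just renders, draining cheap bookkeeping between frames.
    for _frame in 0..3 {
        while let Ok(update) = rx.try_recv() {
            match update {
                SceneUpdate::MoveObject { id, position } => println!("move {id} to {position:?}"),
                SceneUpdate::AssetReady { id } => println!("asset {id} ready"),
            }
        }
        // draw the frame here
        thread::sleep(Duration::from_millis(16));
    }
}
```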
Now, in theory, the Rend3/WGPU/Vulkan stack should handle this well. All levels claim to support concurrent updating. In practice, not so much. There are a lot of lock conflicts. (Tracy profiling log; use Tracy 0.10.0 to view.)
Rend3's occlusion culling and the updating threads are contending for some of the same locks. (Rend3 may be removing occlusion culling; it's a huge win when you're in a windowless room, but not too useful outdoors, especially when most buildings have windows into which you can look.) Shadow rendering seems to be a big bottleneck. Depth sorting for translucent objects is a global delay. Lighting supports only one light without a slowdown. Frame rate is OK (about 60FPS) when no content loading is in progress, but can drop to single digits during heavy content loading. That's where things are today.
"Arcanization" was supposed to help. It didn't. I have a benchmark, "render-bench". , to test this. Branch "arc2" uses arcanization, the main branch does not. In both cases, when content is being added from another thread, frame rate drops substantially.
Can these bottlenecks be fixed? Will they be fixed?