Major performance problems with multithreading. #5525
-
I've just been reading through the Rend3 and WGPU code, and I'm starting to understand what's slowing down the render thread. The single command queue in WGPU seems to be the bottleneck. I think this is why arcanization didn't deliver the expected performance improvement for multithreaded users.

I have many (usually 16) content-loading threads which fetch content from the content delivery network. They decompress, cache, etc., and then call Rend3's add_2d_texture. I'd thought that call would load the asset into the GPU immediately and then return. That's not how it works. The call generates a command item to load the texture content and puts it on Rend3's single wgpu::Queue. Nothing processes that queue at that time. There's a callback mechanism that informs Rend3 when those commands have been processed. It's an "async" kind of system, although it doesn't use Rust's async syntax or machinery. At the beginning of the next frame, as Rend3's render graph is processed, the render thread calls WGPU's "submit". That causes the queued commands to be processed and the asset to be loaded into the GPU, using time from the render thread. So that's why loading content impacts the frame rate so badly: the render thread is doing all the work.

So, how to fix this? It might not be that hard. If add_2d_texture at the Rend3 level, and related calls, simply blocked until the texture was in the GPU, that would work fine. You can't do anything with a new texture or mesh until the "add" call has returned and given you a handle, which eliminates any possibility of trying to use the texture before it is loaded. Rend3 handles are all refcounted (Rust Arc), so they can't disappear until they are no longer needed. So the necessary interlocking can be obtained for free from standard Rust.

So, at the Rend3 level, make a distinction between commands which add content that can't be used until you have a handle, and other commands. "add_2d_texture", "add_mesh", and "add_skeleton" are such commands. Those commands involve transferring large blocks of asset data, so they are the ones where this matters. What's needed is a separate queue for only those safe commands, processed in parallel with rendering. Everything else retains the present ordering constraints.

Blocking longer on "add_2d_texture" is not a problem for my application, because it only blocks one of N loading threads. Those threads are fed from a single prioritized work queue, using the Rust crate priority-queue, and priorities change as the viewpoint moves. It's much better to have the backlog in the priority-queue system than in a FIFO portion of the pipeline. So I see this as a win. Are there cases where it is not? Would making that operation block longer degrade single-threaded graphics applications? Unclear, but probably not. For a single-threaded application, the work eventually gets done in the main thread either way.

Currently, it looks like you can't get WGPU to give you two queues for a device, because WGPU's request_device only returns one queue. The Vulkan level seems to support multiple queues, but the WGPU level does not export that functionality. Some backends may not support it, so WGPU has to be able to refuse a request for a second queue, but it should support one for backends which can handle it.

Now, the question is whether this can be made safe at the WGPU level. Attempts to use something before it's loaded have to be detected, but that can be considered an error, not something WGPU has to synchronize. If you only have one queue, you don't see this potential problem. If you have two queues, and something like Rend3 above WGPU, it's handled for you. So this looks like it can be made sound. I haven't been down in those internals before, so I'd appreciate comments from those who have. Thanks.
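To make the loading-thread side of this concrete, here is a minimal sketch of the pattern described above: N worker threads fed from a prioritized work queue, each making a blocking "add" call. Everything here is illustrative: the Task type and renderer call are stand-ins, the blocking add_2d_texture is the *proposed* behavior rather than current Rend3, and std's BinaryHeap is used instead of the priority-queue crate to keep the example self-contained.

```rust
use std::collections::BinaryHeap;
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

/// A content-loading task; higher priority pops first. Priorities would be
/// recomputed as the viewpoint moves (illustrative stand-in for priority-queue).
#[derive(PartialEq, Eq, PartialOrd, Ord)]
struct Task {
    priority: u32,
    asset_url: String,
}

/// A shared max-priority work queue that loading threads block on.
struct WorkQueue {
    heap: Mutex<BinaryHeap<Task>>,
    ready: Condvar,
}

impl WorkQueue {
    fn push(&self, task: Task) {
        self.heap.lock().unwrap().push(task);
        self.ready.notify_one();
    }
    fn pop(&self) -> Task {
        let mut heap = self.heap.lock().unwrap();
        loop {
            if let Some(task) = heap.pop() {
                return task;
            }
            heap = self.ready.wait(heap).unwrap();
        }
    }
}

fn main() {
    let queue = Arc::new(WorkQueue {
        heap: Mutex::new(BinaryHeap::new()),
        ready: Condvar::new(),
    });

    // 16 loading threads. It's fine for the "add" call to block: only the
    // worker waits, not the render thread, and the backlog stays in the
    // priority queue where it can be reordered.
    for _ in 0..16 {
        let queue = Arc::clone(&queue);
        thread::spawn(move || loop {
            let task = queue.pop();
            println!("fetching and decoding {}", task.asset_url);
            let _texture_bytes: Vec<u8> = vec![0; 1024]; // decompressed asset data
            // renderer.add_2d_texture(...);  // hypothetical blocking call that
            //                                // returns a refcounted handle once
            //                                // the texture is in the GPU
        });
    }

    queue.push(Task { priority: 10, asset_url: "https://cdn.example/asset.tex".into() });
    // (In the real client, main would go on to run the render loop.)
}
```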
-
See this NVidia example of how to use multiple command queues with Vulkan to speed up asset loading. It describes best practices for concurrent content loading.
-
Looking around for who's using transfer queues and what's known about them. The queues the Vulkan level reports are apparently only loosely connected to what the hardware offers. Some platforms report a large number of available queues, but they're really being multiplexed at some lower level (NVidia?). Others really have dedicated DMA controllers for each queue (AMD?). Some (Android?) don't offer a transfer-only queue at all. Most discussions about performance are speculative, not operational results. I get the feeling that Unreal Engine users care strongly about this, and their engine does most of the work for them, probably because the UE AAA title developers with their 100GB asset games hit this problem in normal operation. Nobody else seems to understand the issue very well.

A good starting point is probably for WGPU to ask Vulkan for one graphics queue and one transfer queue. If Vulkan won't offer a second queue, put transfer work on the main graphics queue, as at present. You can get more queues, but being able to overlap drawing and transfer is the big win; overlapping multiple transfers, not so much, since there's only so much PCI bandwidth. Vulkan supports separate graphics and transfer queues, but some devices won't have the hardware. DX12 apparently works like Vulkan in this area. Apple's Metal supports multiple queues, but there is only one queue type, I think.

The obvious way to implement this is: user allocates a buffer, fills the buffer, creates a command, the command goes on the queue, the queue is submitted, a completion callback fires, and the user gets a handle to the texture or mesh back. But "submit" is said to be an expensive operation, and batching up commands is suggested. Does that really matter on a transfer queue, where there's not much setup? Batching complicates things. If you do one at a time, the caller gets a handle back. If not, you have to return something like a future and wait for completion when the asset is needed. WGPU seems to have some internal machinery for that already, although it doesn't use async/future terminology or Rust features. All this is from a cursory reading of the code and what's out there. Comments from people into the innards of this would help.
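For illustration only, here's roughly what the "ask for a dedicated transfer queue, fall back to the graphics queue" selection might look like, written against the ash Vulkan bindings. This is not wgpu code; the function name and the fallback policy are assumptions, and only the queue-family selection logic is shown.

```rust
use ash::vk;

/// Pick queue family indices: one graphics-capable family, plus (if available)
/// a family that supports TRANSFER but not GRAPHICS, i.e. a dedicated
/// transfer/DMA queue. Returns (graphics_family, optional_transfer_family).
fn pick_queue_families(families: &[vk::QueueFamilyProperties]) -> (u32, Option<u32>) {
    let graphics = families
        .iter()
        .position(|f| f.queue_flags.contains(vk::QueueFlags::GRAPHICS))
        .expect("no graphics-capable queue family") as u32;

    // Prefer a transfer-only family. Some devices (often mobile) won't have
    // one, in which case the caller falls back to doing uploads on the
    // graphics queue, as at present.
    let transfer = families
        .iter()
        .enumerate()
        .find(|(_, f)| {
            f.queue_flags.contains(vk::QueueFlags::TRANSFER)
                && !f.queue_flags.contains(vk::QueueFlags::GRAPHICS)
        })
        .map(|(i, _)| i as u32);

    (graphics, transfer)
}
```

The `families` slice would come from Vulkan's vkGetPhysicalDeviceQueueFamilyProperties (ash's Instance::get_physical_device_queue_family_properties).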
-
To illustrate the use case for this, here's Second Life's Fantasy Faire 2024, being rendered with the Sharpview/Rend3/WGPU/Vulkan/Linux/NVidia 3070 stack. Sharpview can get all that content from the network fast enough, but due to the bottleneck discussed above, it can't push it into the GPU fast enough, so the frame rate takes a hit. If you stay in one place and let the content loader threads catch up, the frame rate goes up to 60FPS, but it can drop to 10FPS when content loading interferes with the render thread. This is not a theoretical problem.
-
So, your uploading threads are making calls to things like Rend3's add_2d_texture, which stage data onto the queue? You should be able to kick off those transfers immediately by calling "submit" on the queue with an empty list of command buffers.
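For what it's worth, a minimal sketch of that pattern on a loading thread might look like this (the texture, layout, and size arguments are placeholders supplied by the caller): Queue::write_texture stages the copy, and an empty Queue::submit flushes the pending writes without recording any commands.

```rust
/// Sketch: flush a staged texture upload from a loading thread.
/// All arguments are placeholders the caller provides.
fn upload_now(
    queue: &wgpu::Queue,
    texture: &wgpu::Texture,
    data: &[u8],
    layout: wgpu::ImageDataLayout,
    size: wgpu::Extent3d,
) {
    // Stage the copy. This does not, by itself, guarantee when the copy runs.
    queue.write_texture(
        wgpu::ImageCopyTexture {
            texture,
            mip_level: 0,
            origin: wgpu::Origin3d::ZERO,
            aspect: wgpu::TextureAspect::All,
        },
        data,
        layout,
        size,
    );
    // Submitting an empty set of command buffers still flushes pending
    // write_texture/write_buffer work, so the transfer can start now rather
    // than waiting for the render thread's next submit (the "empty submit"
    // trick discussed in this thread).
    queue.submit(std::iter::empty());
}
```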
-
I'm confident enough that the locks post-arcanization are finer-grained and less contentious to say that, if switching between pre-arcanization and post-arcanization wgpu doesn't improve your throughput, then your throughput is probably not limited by wgpu-core lock contention. Famous last words, but that's my bet. But to answer your overall question: yes, wgpu is very concerned with performance, and although right now correctness and security are our priorities, to prepare for shipping WebGPU in Firefox, we will definitely be looking at performance as soon as those are under control.
-
Just in case the "empty submit" trick helps, I filed #5636 to make the documentation for this more explicit.
-
It's hard. I'm using Rend3, which is calling WGPU, so I'm one level removed from the problem and somewhat fumbling around here. The update threads are interfering badly with render performance, but neither I nor Connor Fitzgerald is sure why. Here are some Tracy traces showing what looks like a lock conflict at one of the locks in Texture::create_view in WGPU. There's enough indirection in that code that I can't figure out where the bottleneck is. Here's the raw Tracy file, which can be viewed with Tracy profiler 0.10.0. Calling "submit" on the main queue from an update thread while the render thread is in the middle of rendering seems questionable. Will that work? Is that queue locked? The Vulkan spec says "Host access to queue must be externally synchronized".
-
The
-
The submit-queue problem may not be the dominant performance problem. Above, I wrote:

> Here is BVE-Reborn/rend3#579, showing what looks like a lock conflict at one of the locks in Texture::create_view in WGPU. There's enough indirection in that code that I can't figure out where the bottleneck is. Here's the raw Tracy file, which can be viewed with Tracy profiler 0.10.0.

This, in texture_create_view, seems to be where Tracy shows a stall. Multiple threads are stuck in texture_create_view. Somebody is waiting too long for something, but I can't tell what. Locks requested include:

One of those is probably stalling things for several milliseconds. It's not a full deadlock, just a stall. I see that work is going on at #5586 to add a lock observer. That might help here. How's that coming along?
-
Here are screenshots of the key periods in Tracy. Three threads are stuck in texture_create_view for far longer than it usually takes.
-
I have a stack backtrace from interrupting the program under GDB. I see multiple asset-loading threads waiting in assign in registry.rs to get the lock on data.
This seems to be the same bottleneck previously reported with profiling data. In the debugger, it can be seen exactly which lock is holding things up. The assign function seems to write-lock a Registry's entire Storage, even though "insert" and "get" look like they were meant to be fast operations. I'd be tempted to instrument that lock acquisition to find out where the time is going; a rough sketch follows below.

Backtrace attached. I'm seeing multiple threads waiting at that lock, and I don't know why yet. This seems to be at least part of why asset loading hits rendering performance so hard.
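Something along these lines could measure the wait. This is only a sketch of the idea, not a drop-in patch: wgpu-core's Storage uses its own lock types rather than std's RwLock, and the label and 1 ms threshold here are made up.

```rust
use std::sync::{RwLock, RwLockWriteGuard};
use std::time::Instant;

/// Acquire a write lock, logging if the wait was long. Sketch only; the real
/// lock lives inside wgpu-core's registry/storage code.
fn timed_write<'a, T>(label: &str, lock: &'a RwLock<T>) -> RwLockWriteGuard<'a, T> {
    let start = Instant::now();
    let guard = lock.write().unwrap();
    let waited = start.elapsed();
    if waited.as_millis() >= 1 {
        eprintln!("{label}: waited {waited:?} for write lock");
    }
    guard
}

fn main() {
    let storage = RwLock::new(Vec::<u32>::new());
    // In wgpu-core, this would wrap the lock taken in Registry::assign.
    timed_write("Registry::assign(storage)", &storage).push(1);
}
```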
-
I've upgraded Rend3 from WGPU 0.19 to WGPU 0.20. No major changes to Rend3; just enough to match breaking changes in other crates. Performance is somewhat better and substantially more consistent. Asset fetch and create activity in other threads doesn't seem to be slowing the rendering thread as severely. When a new region connects and a large number of assets are added, the transient impact is less. I used to have transients where FPS dropped below 10; now the worst I'm seeing is 20, and most of the time it's 60FPS. This is encouraging. I can work with this. New Rend3: https://github.com/John-Nagle/rend3-hp
-
I'm looking at Tracy traces for performance problems, and finding some.
Interestingly, both of these are simple bookkeeping; they're not doing much computation. I'm seeing some other spots where that might be the case. It looks like the way to approach this for now is to spend time looking at Tracy traces and noticing where things that should be trivially fast are slow.
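One low-effort way to narrow such spots down is to wrap suspect sections in their own Tracy zones so they show up by name in the trace. A sketch using the profiling crate (which, as far as I can tell, wgpu and Rend3 already integrate with); the bookkeeping function here is made up for illustration.

```rust
// Cargo.toml (sketch): profiling = { version = "1", features = ["profile-with-tracy"] }

/// A made-up "should be trivially fast" bookkeeping function, instrumented so
/// it shows up as its own zone in the Tracy trace.
#[profiling::function]
fn update_bookkeeping(items: &mut Vec<u64>) {
    // Narrower zone around the part suspected of stalling.
    profiling::scope!("sort_items");
    items.sort_unstable();
}

fn main() {
    let mut items = vec![3, 1, 2];
    loop {
        update_bookkeeping(&mut items);
        profiling::finish_frame!(); // marks a frame boundary for Tracy
        break; // a real render loop would keep going
    }
}
```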
-
I've been using the Rend3/WGPU/Egui/Winit/Vulkan stack for over three years now to develop a metaverse viewer. It works for Second Life and Open Simulator. It's a big job, and it's coming along well. Here are some videos. The older viewers, in C++, are mostly single-threaded, and they keep running out of CPU time in the main thread. The frame rate then drops, sometimes to single digits. My goal was to use all available CPUs and the GPU effectively to get past that.
The basic metaverse rendering problem for high-detail metaverses is that there's a huge amount of unique content, far more than in commercial games. Second Life has a world the size of Los Angeles and three petabytes of assets stored on AWS servers. You can drive and fly around that world. When you do that, up to 400mb/s of content is coming in from the network. Content can be highly detailed. Entire cities as detailed as the glTF "bistro" demo exist. Getting data into the GPU fast while rendering is in progress dominates the problem.
So the basic architecture in Sharpview, my client, is to have the render thread just render, and do all updating from other threads. Other threads load assets, move and modify objects, and deal with the network traffic. The rendering thread has higher priority than most of the asset wrangling. That's how people get things done in UE5.
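As a very rough illustration of that division of labor (render thread just renders; other threads do the updating), here's a generic message-passing sketch. This is not Sharpview's or Rend3's actual design; the update types and channel-based hand-off are illustrative only, since in practice updates go through shared, refcounted objects.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Illustrative update messages produced by worker threads.
enum SceneUpdate {
    MoveObject { id: u64, position: [f32; 3] },
    AssetReady { id: u64 },
}

fn main() {
    let (tx, rx) = mpsc::channel::<SceneUpdate>();

    // Asset/network worker: fetches, decodes, and uploads off the render thread.
    thread::spawn(move || {
        tx.send(SceneUpdate::AssetReady { id: 42 }).ok();
        tx.send(SceneUpdate::MoveObject { id: 42, position: [1.0, 2.0, 3.0] }).ok();
    });

    // Render thread: just renders, draining cheap bookkeeping between frames.
    for _frame in 0..3 {
        while let Ok(update) = rx.try_recv() {
            match update {
                SceneUpdate::MoveObject { id, position } => println!("move {id} to {position:?}"),
                SceneUpdate::AssetReady { id } => println!("asset {id} ready"),
            }
        }
        // draw the frame here
        thread::sleep(Duration::from_millis(16));
    }
}
```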
Now, in theory, the Rend3/WGPU/Vulkan stack should handle this well. All levels claim to support concurrent updating. In practice, not so much. There are a lot of lock conflicts. (Tracy profiling log; use Tracy 0.10.0 to view.)
Rend3's occlusion culling and the updating threads are contending for some of the same locks. (Rend3 may be removing occlusion culling; it's a huge win when you're in a windowless room, but not too useful outdoors, especially when most buildings have windows into which you can look.) Shadow rendering seems to be a big bottleneck. Depth sorting for translucent objects is a global delay. Lighting supports only one light without a slowdown. Frame rate is OK (about 60FPS) when no content loading is in progress, but can drop to single digits during heavy content loading. That's where things are today.
"Arcanization" was supposed to help. It didn't. I have a benchmark, "render-bench". , to test this. Branch "arc2" uses arcanization, the main branch does not. In both cases, when content is being added from another thread, frame rate drops substantially.
Can these bottlenecks be fixed? Will they be fixed?