-
Hello @mikke89, 👋

Once again, thank you for your continuous work on the library. The upcoming 6.0 release looks promising, and I cannot wait for it!

Our TruckersMP team has been progressively porting from our previous GUI library to RmlUi. One of the components currently in progress is the player list, which can show up to 300 players, if not more. It is very important for community interaction, and for cases where players misbehave and must be reportable. :)

Unfortunately, one of RmlUi's most significant downsides is performance. For our complex UI document, the context update can take up to 15 ms (for around 120 players) on a fairly powerful CPU, which is unacceptable; the player list is updated often, as it must show the most recent information and newly arrived players. After modifying compilation settings and playing around with CSS (see below for why), I can get down to around 5 ms, which is still a lot, since this does not include rendering. It would be great to get under 2 ms, and in the best case under 1 ms.

What is such a player list, anyway? For us, a real-time MMO modification for a truck simulation game, the list is a mix between the end-of-round result screen and the live scoreboard of an FPS game. It must show all players in the area while accurately displaying each player's distance, latency, tag, and name. The user can sort the list or query only some players. As for styling, the user can resize the window, and the bottom controls, such as the query input, must always remain at the bottom of the window.

I created an isolated sample to showcase, test, and profile the problem, stripped of any more advanced processing or complex styling: https://github.com/ShawnCZek/RmlUi/tree/sample_player_list

I enabled Tracy and Lua, used the Win32_VK backend, and modified the following variables:
My PC configuration is the following:

The first thing you may notice is that the styling is off. I am attaching a comparison between RmlUi and Firefox. The styling is not 1:1 due to browser differences, but both should show a similar result: the input overflows, and the margin between the top text and the player list table is off. You can see this even better when resizing the player list in RmlUi. Even more worrying, though, is that this table causes the layout to reformat four times!

Anyway, setting that issue aside, formatting the player list still takes 2.63 ms! And the provided sample has only 100 entries in a simple table, as opposed to how complex our real solution will be (e.g., with additional controls and images).

One of the problems, I believe, is the high number of calls and deep nesting. Playing around with the compilation settings to enforce as much inlining as possible and enabling link-time optimization brings a slight improvement, but does not fix the root problem.

Another area for improvement, apparent from sampling, is the number of allocations. It may not be obvious from reading the code, but using strings and vectors with a dynamic allocator causes a sheer amount of allocations, most of them very small. Using a page frame/buddy allocator would probably improve performance. Alternatively, the burden could be shifted to the library user by providing an option to set a custom allocator.

Sure, profiling adds overhead, which exaggerates all these issues. Nonetheless, some of this is a price we pay on every frame, and a huge price for multiple (or more complex) documents, not just for this scenario.

After writing it all down, I realize I diverged slightly from my original point. Nevertheless, to summarize this entire discussion post, there are three fundamental discoveries from this case/sample:
I wonder what your recommendation is for working around these issues, or whether you plan to improve these aspects of the library.
-
I think we actually ran into point 1 accidentally when making a custom element that uses flex: we cleared the children and added new ones each frame (which obviously was not great, and we've since fixed it), and it caused the UI to get progressively slower over time. That is strange, since you'd expect a consistent slowdown given that we're always removing and adding the same number of elements. I haven't had time to pull it out into a test project, though.
-
Hey, I appreciate the thorough post! It's quite a bit to chew over, but real-world use cases are always very interesting.

For your first point, I would suggest that you open a separate issue. I will focus on the performance side here.

So, to begin: I absolutely agree that performance is important. It was one of the first things I dived into after forking the library from libRocket. There was room for orders-of-magnitude improvements there, so at least I feel we are in a much more workable state now. Still, there is a lot of room left for improvement; there are many really dumb things we do.

Now, let's get to your example. It's neat, and I definitely see the use for it; it's a great example of where performance becomes critical. I might actually steal it for our benchmarks, if you don't mind.

The library just isn't optimized the same way as web browsers, which have to handle huge pages. In game development, and most applications, documents are usually relatively small, and the complexity isn't that high. Huge tables are one of those complicated things that are very tricky in the general case, but with some assumptions they could in principle be made much faster (such as fixed row heights). In fact, there has been some discussion on large tables before; see this gitter link and the nearby discussion, where I go through some aspects of possibly optimizing this. I think a lot of it still applies, so I'll quote it here:
Just to be clear, I think your list with 300 players is fine. Today I would even say 5000 is okay, as long as there are good filtering abilities and such. But obviously, with that many rows, I also think it's reasonable to expect more from the user in terms of accommodating that, such as making some additional assumptions or using specialized containers. Here, I am thinking of something like a "big table" layout mode, where we know most rows are out of view, and the user can, for example, specify the exact height of each row up-front. That would be massively helpful for making something fast. Perhaps it wouldn't be so hard to add to our existing table implementation? I think that's a more fruitful approach than the layout proposal I mention in the quote.
Right, there are some cases where we need to do layouting multiple times. Taking a closer look at the example, let me quickly go through what's going on here.

First of all, one culprit is the auto overflow. It's really bad when combined with a huge table, because with auto overflow we always do layout without scrollbars first. Then, if we detect that scrollbars are necessary due to overflow, we have to start over with the scrollbars present and the new layout width. This is another one of those "dumb" things, but it is great for ensuring correctness. I have been thinking of ways to possibly reduce this, such as checking regularly for overflow, but that would add overhead in cases where it doesn't apply. Replacing the auto overflow with always-present scrollbars avoids this extra pass.

The next ones are all about flex layout with content-based sizing. Flex layout is just not great for performance when used naively, and I would generally avoid it for large layout structures if you can. The issue is that many of its calculations require the size of the contents to be known, as part of the algorithm. Of course, this is also one of its strengths, but it should be used with care when performance is critical. In this case, it's all documented behavior; see the performance section in the flexbox documentation. The two points here directly relate to your observations.

By updating the example with this advice, I get the following:
We're down to 1 layout iteration, which is where we want to be for such a large table; much better. It wouldn't surprise me if there are situations where we could skip some of these iterations automatically, or early-out somehow; it hasn't really been looked at thoroughly.

With these optimizations, most of the layout time is spent actually laying out each individual table cell once (something like 2.8 ms out of 3 ms). The context render update is 2 ms in total. If we could skip layout of all the cells we cannot see, and also skip the render update on all of them, then we might be left with only 20/100 (the visible rows) of that, so about 1 ms, and that would be the same for any number of rows. The element update step is currently 0.4 ms, and that would scale linearly with new elements. I see that, for some reason, all the elements need to update their definitions; I'm not exactly sure why that happens here, so perhaps something to look at. Otherwise, during the "no-op" frames the element updates take 0.1 ms, not the most critical thing to optimize in my view.

These repeated layout iterations are quite "dumb" in many ways. But there are reasons for all of them being there: they are easy ways to ensure correctness. Working around them in general can be a great effort, but I'm sure there is a lot of potential for better approaches, and probably some trivial cases where we can skip them, or at least some early-outs. In particular, I'm not sure if the layout update caused by the missing property here is strictly necessary.

I generally agree that memory allocation is one area where we are taking a pretty naive approach. From my measurements, something like 15% of CPU time is spent here during some intensive operations (off the top of my head). It's one thing to study more; I actually have some local branches that reduce a lot of these allocations. However, what we're really looking for are order-of-magnitude improvements, so it might not be the first place we should put our efforts.
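For reference, the overflow and flex advice above translates to RCSS along these lines. The selectors and exact values here are illustrative, not taken from the linked sample:

```css
/* Always reserve the scrollbar, so layout never has to run twice
   just to discover that one is needed. */
#player-table-container {
	overflow-y: scroll;   /* instead of 'auto' */
	flex: 1 1 0;          /* definite basis instead of content-based sizing */
	min-height: 0;
}

/* Keep flex containers small and give items explicit sizes where possible,
   so the flex algorithm does not need to measure their contents first. */
#player-window {
	display: flex;
	flex-direction: column;
}
```

Whether `flex: 1 1 0` is appropriate depends on the surrounding layout; the general idea is simply to prefer definite sizes over content-derived ones on large subtrees.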
By the way, using the built-in thirdparty containers has a pretty big impact here in my measurements; I would really encourage you to measure and consider those. And all our types use aliases that you can override, so you can set a (global) custom allocator that way.

Some workarounds should be considered. One approach is to sidestep the library altogether: make your own element and do all the layouting manually. You could still take advantage of our text layouting and such. Another approach is to only add the visible rows to the table. Above the visible rows, you add a fake row that takes the height of all the omitted rows, and the same below; this way, the scroll height stays correct. It would essentially be the same as some of the culling discussed earlier, just with a bit more manual work, and you would automatically get the culling for all parts of the update and render steps.

So this post became longer than I expected; I'm not sure how useful it really is. While there are a lot of approaches we could pursue, the main thing is finding the time and effort to sit down and implement them; ideas are quite plentiful. But it's nice that you bring this to the forefront, and I certainly think large tables in particular are a valuable case. A sample with one of the workarounds could be nice in the meantime. Also, I would encourage everyone to take note of the performance notes in the documentation; there are always good reasons for writing them, though perhaps they should be elaborated upon.
-
While going through your first comment here and trying out the suggested optimizations, I noticed one pain point that I completely missed and that is significantly related to this sample: updating elements with data from the model. We are looking at 0.6 ms (using the same build configuration as in the initial post) for a data model update when dirtying the collection of players. Again, it looks like there are a lot of small allocations and deallocations, but I believe this is not as important to optimize here.

Right now, there is no way to dirty only one structure member in a collection, which would be very beneficial in cases like this. For instance, every 500 ms, one could update only the score.

What do you think of this?