OTP27 performance regression #8322
Hi! We haven't forgotten about this. The problem is that we had a bug in how virtual heap sizes were calculated, leading to full-sweep garbage collections being less frequent than intended, which improved performance in this case by chance. This was made worse by […]. Fixing this in a way that doesn't make anything else worse will take some time. I'm hoping to have something before RC3.
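For anyone wanting to check whether full sweeps are happening more or less often on their own workload, here is a minimal sketch (not part of any benchmark in this thread) that counts minor vs. major collections for a process using the standard garbage_collection trace flag; the module name is made up:

```erlang
-module(gc_watch).
-export([count_gcs/2]).

%% Count minor vs. major (full-sweep) collections in Pid for DurationMs.
%% The calling process becomes the tracer and receives the trace messages.
count_gcs(Pid, DurationMs) ->
    erlang:trace(Pid, true, [garbage_collection]),
    Deadline = erlang:monotonic_time(millisecond) + DurationMs,
    Counts = loop(Deadline, #{minor => 0, major => 0}),
    erlang:trace(Pid, false, [garbage_collection]),
    Counts.

loop(Deadline, Counts) ->
    TimeLeft = Deadline - erlang:monotonic_time(millisecond),
    if
        TimeLeft =< 0 ->
            Counts;
        true ->
            receive
                {trace, _Pid, gc_minor_start, _Info} ->
                    loop(Deadline, maps:update_with(minor, fun(N) -> N + 1 end, Counts));
                {trace, _Pid, gc_major_start, _Info} ->
                    loop(Deadline, maps:update_with(major, fun(N) -> N + 1 end, Counts));
                {trace, _Pid, _OtherGcEvent, _Info} ->
                    %% ignore gc_minor_end / gc_major_end and similar events
                    loop(Deadline, Counts)
            after TimeLeft ->
                Counts
            end
    end.
```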
Not sure if it helps to have further evidence, but we have a performance test on leveled (a KV store used as a backend in Riak). Using our OTP 27 branch and running the perf_SUITE test:

From commit 49024e8:
From commit 24ef4cb:
The test has some natural variation between runs, but we consistently see a difference in the timings.

Running the same test on […]:
There appears to be some small improvement, but still a significant slowdown when compared to OTP 26.

The test will output eprof profiles for TotalLoadTime and TotalFullFoldTime when run as […].

From commit 49024e8:
From commit 24ef4cb:
From […]:
It looks as if there was an issue with […]. However, there remain some key deltas around […]. Issues with […].
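For readers unfamiliar with how such profiles are produced, here is a minimal sketch using OTP's stock eprof module; the module name and the workload fun are placeholders, not the actual leveled test code:

```erlang
-module(profile_example).
-export([run/1]).

%% Run a workload fun under eprof and print a per-function time breakdown,
%% similar in spirit to the TotalLoadTime / TotalFullFoldTime profiles above.
run(WorkloadFun) ->
    eprof:start(),                        %% returns an error tuple if already running; ignored here
    Result = eprof:profile(WorkloadFun),  %% traces a fresh process running the fun
    eprof:analyze(total),                 %% print time aggregated per function
    eprof:stop(),
    Result.
```

It would be called as, for example, profile_example:run(fun() -> some_load_phase() end), where some_load_phase/0 stands in for whichever phase is being measured.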
I've made some tweaks in #8347 that bring most performance back on small sizes, and make larger ones faster than before. Can you try it out?
As far as RabbitMQ tests go, there's definitely an improvement with #8347, but still a regression compared to 26.2:
The biggest difference is with the test in the middle, which actually doesn't write to disk - in this test we have 2 publishers and 2 consumers (…).
From leveled tests, initial test runs of #8347 look positive, but I'm getting some variation between runs. The eprof profiles look more aligned with OTP 26. I will run a series of longer tests over the weekend which should give more reliable results and report back on Monday.
Thanks, I've pushed another tweak of the virtual heap growth curve that brings performance back to where it was (or better) on my machine (edit: it varies a lot though, both before and after). This is not without downsides however, so I might have to make changes later on in case this reduces performance or increases garbage retention too much for other uses. I'll try to strike the best balance I can, but chances are that it'll still be slower than before for your specific code, so you may want to look into tweaking the minimum virtual heap sizes for the processes involved.
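For reference, per-process tuning of the minimum binary virtual heap size can be done either at spawn time or from within the process itself. A minimal sketch follows; min_bin_vheap_size is the standard flag for the binary virtual heap (the thread elides the exact flag name below), and the size used here is an arbitrary placeholder in words, not a recommendation:

```erlang
-module(vheap_tuning).
-export([start_worker/0, init_worker/0]).

%% Placeholder size, in words (not bytes); tune per workload.
-define(MIN_BIN_VHEAP_WORDS, 1024 * 1024).

%% Option 1: set the minimum binary virtual heap size at spawn time.
start_worker() ->
    spawn_opt(fun init_worker/0, [{min_bin_vheap_size, ?MIN_BIN_VHEAP_WORDS}]).

%% Option 2: raise it from inside an already-running process,
%% e.g. at the top of a gen_server's init/1.
init_worker() ->
    _OldSize = process_flag(min_bin_vheap_size, ?MIN_BIN_VHEAP_WORDS),
    loop().

loop() ->
    receive stop -> ok end.
```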
I just have to ask you folks (cc @ThomasArts), have these benchmarks been used as a target for performance improvements for a long time? If so, I fear there's not much more I can do. I've managed to reproduce this slowdown and #8329 in earlier versions by fixing the off-heap accounting bug alone, which strongly suggests that the code has been optimized to minimize inopportune GCs for these specific benchmarks with these specific parameters (probably inadvertently).
If I understand correctly we should be able to get similar performance as before by tweaking virtual heap configuration. Could you list the process flags one would need to tinker with to achieve that? Just the […]?
Currently just […].
The leveled perf_SUITE test we're using was only recently created; we've been using this test just for the past few months. Most of the resultant optimisations have been easy to justify (e.g. they align with the efficiency guide, or update to a more modern feature like maps), but I can't entirely rule out this sort of inadvertent improvement that was dependent on bugs. We have no production use of OTP 26 at present, nothing higher than OTP 24. So a small degradation from 26 to 27 isn't a big issue for us as long as the test is improving from OTP 24. Over the weekend I will compare OTP 24, OTP 26, OTP 27-RC2 and OTP 27 with PR #8347. Are you interested in a comparison between different commits on PR #8347, or should I just be testing 436568a?
Thanks, that's really good to know :)
If you could test both, that'd be great.
With the most recent changes I get these results (same machines as above):

So indeed some tests are even a bit better than with 26.2, while some (the middle one, which doesn't write to disk) are still quite a bit worse. We will play around with […].

As for how we are using these tests - the specific tests used in this issue and related ones were picked out of a much larger suite of tests we run regularly, mostly to check for improvements/regressions in RabbitMQ itself. We've been using these tests for a few years now (with some adjustments and additions). We have a dozen or so scenarios that test different code paths (which also may trigger very different behaviours - some will use more and more memory, some will write to disk, some will only read from disk, etc.) and we test with at least 3 different message sizes. Based on that we produce internal results to validate PRs etc., but also occasionally share results publicly, as in this example: https://www.rabbitmq.com/blog/2023/05/17/rabbitmq-3.12-performance-improvements#312-cqv1-vs-cqv2

Comparing OTP changes is more of a side effect of these tests, but since we have a branch dedicated to preparing the codebase for OTP 27, I ran the tests as soon as we had a correct build, and many tests were significantly worse across the board (although the 100B message size seems to be the most impacted).
Only a quick and dirty test for now, but: setting […].
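For completeness, and still assuming the elided flag is the binary virtual heap minimum, the default for all newly spawned processes can also be raised VM-wide, either from code via erlang:system_flag/2 or at start-up with the +hmbs emulator flag. A sketch with a placeholder value and a made-up module name:

```erlang
-module(vheap_defaults).
-export([raise_default/0]).

%% Raise the default minimum binary virtual heap size for processes spawned
%% after this call. The value is in words and is only a placeholder; the call
%% returns the previous default. The +hmbs emulator flag sets the same
%% default at VM start-up.
raise_default() ->
    erlang:system_flag(min_bin_vheap_size, 1024 * 1024).
```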
Test results from experiments with PR #8347 on leveled ....

The test starts up six different databases (in series) with different key counts, and then runs phases of different types of intense activity - i.e. data load, head requests, get requests, folds etc. We then total the times across all six tests, and then take the average across 3 separate runs.

After each phase of intense activity, a check of the memory used is done by calling garbage_collect/0 on the test process and then checking erlang:memory/0. So the memory measurement is a snapshot once each phase of activity is over - there are six measurements, on each of the six databases, on each of the 3 runs (108 measurements overall).

The control for this was running the test on OTP 24.3.4.13, and the results of this run are the 100% line on the chart. The results of the other runs are the columns - with timings on the left, and memory used on the right. There were six comparisons taken:
Couple of observations:
So overall the vheap resizing algorithm changes are probably not a great trade-off for our test case. It looks positive for our case that we can improve performance through an established configurable. I will work further on this next week, and try to enhance the memory monitoring in the test - in particular to try to capture memory usage during the phases, not just after. Also, based on the understanding of what appears to be driving performance, I'm going to look at what re-coding/re-design we might be able to do to achieve equivalent benefits without tweaking default emulator settings. Thanks for all the help so far.
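As an aside, the "snapshot after each phase" measurement described above amounts to something like the following sketch, and a crude in-phase sampler only needs to call the same snapshot periodically from a helper process. This is illustrative only, not leveled's actual test code:

```erlang
-module(mem_snapshot).
-export([snapshot/0, sample_every/2]).

%% One-off snapshot: force a major collection on the calling process,
%% then read the VM-wide memory counters.
snapshot() ->
    true = erlang:garbage_collect(),
    Mem = erlang:memory(),
    #{total     => proplists:get_value(total, Mem),
      processes => proplists:get_value(processes, Mem),
      binary    => proplists:get_value(binary, Mem)}.

%% Crude periodic sampler for use during a phase: sends a snapshot to
%% ReportTo every IntervalMs until it receives 'stop'.
sample_every(IntervalMs, ReportTo) ->
    spawn(fun() -> loop(IntervalMs, ReportTo) end).

loop(IntervalMs, ReportTo) ->
    ReportTo ! {mem_sample, erlang:system_time(millisecond), snapshot()},
    receive
        stop -> ok
    after IntervalMs ->
        loop(IntervalMs, ReportTo)
    end.
```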
Huge thanks for putting that together. I'll remove the vheap curve stuff from #8347 and focus more on adding flags to tweak it in the near term. 🙂
This issue came up, maybe inadvertently, in the Erlang forums. I was wondering if this is still an ongoing problem? Most of the related PRs, issues, and branches seem to be getting merged.
Yes, we still haven't merged any fixes for this, so RabbitMQ on OTP 27 works but is significantly slower for some message sizes (from 65 bytes up to 1-2kb - above that it usually doesn't matter that much). Setting […]. For example, we had a situation where:
Here's an example of a PR where we are playing with different values and you can see the results: […]

The graphs might be hard to decipher without additional context, but they run different workloads for 5 minutes each with different message sizes, and for each there are three environments - […]

Feel free to ping me on Erlanger Slack if you want to chat about any of that.
This is a follow-up to #8229. The PR that came out of that discussion improved the situation but hasn't resolved the problem. We still see up to 30% slowdown in many RabbitMQ tests when switching from OTP 26 to today's master branch (including the copy_binary_to_buffer change).

I don't want to put pressure on this, but it's not completely clear whether further improvements are in progress or planned soon; I just want to make it explicit that the problem is still there. I'm happy to share additional details, but following the steps from #8229 is one of the ways to reproduce the problem.
Thank you