OTP27 performance regression #8322

Open
mkuratczyk opened this issue Mar 28, 2024 · 17 comments
Labels: bug (Issue is reported as a bug), team:VM (Assigned to OTP team VM)

@mkuratczyk
Contributor

This is a follow-up to #8229. The PR that came out of that discussion improved the situation but hasn't resolved the problem. We still see up to a 30% slowdown in many RabbitMQ tests when switching from OTP 26 to today's master branch (including the copy_binary_to_buffer change).

I don't want to put pressure on this, but it's not completely clear whether further improvements are in progress or planned soon, so I just want to make it explicit that the problem is still there. I'm happy to share additional details, but following the steps from #8229 is one way to reproduce the problem.

Thank you

mkuratczyk added the bug label Mar 28, 2024
@jhogberg
Contributor

jhogberg commented Apr 2, 2024

Hi! We haven't forgotten about this. The problem is that we had a bug in how virtual heap sizes were calculated, leading to full-sweep garbage collections being less frequent than intended, which improved performance in this case by chance. This was made worse by inet_drv creating shady pseudo-sub-binaries that hit the bug really hard.

Fixing this in a way that doesn't make anything else worse will take some time. I'm hoping to have something before RC3.
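
For anyone wanting to see how often full sweeps happen on their own workload, a small hypothetical helper (made up for illustration, not something from this thread) that counts minor versus major (full-sweep) collections for a process via the standard garbage_collection trace flag could look like this:

%% Hypothetical helper: count minor vs. major (full-sweep) garbage collections
%% performed by Pid over a window of Millis milliseconds. The calling process
%% becomes the tracer and receives the trace messages.
-module(gc_watch).
-export([count/2]).

count(Pid, Millis) ->
    erlang:trace(Pid, true, [garbage_collection]),
    Deadline = erlang:monotonic_time(millisecond) + Millis,
    Counts = loop(Pid, Deadline, #{minor => 0, major => 0}),
    erlang:trace(Pid, false, [garbage_collection]),
    Counts.

loop(Pid, Deadline, Acc) ->
    Timeout = max(0, Deadline - erlang:monotonic_time(millisecond)),
    receive
        {trace, Pid, gc_minor_start, _Info} ->
            loop(Pid, Deadline, maps:update_with(minor, fun(N) -> N + 1 end, Acc));
        {trace, Pid, gc_major_start, _Info} ->
            loop(Pid, Deadline, maps:update_with(major, fun(N) -> N + 1 end, Acc));
        {trace, Pid, _Tag, _Info} ->
            loop(Pid, Deadline, Acc)
    after Timeout ->
        Acc
    end.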

@martinsumner

martinsumner commented Apr 2, 2024

Not sure if it helps to have further evidence, but we have a performance test on leveled (a KV store used as a backend in Riak).

Using our OTP 27 branch, and running ./rebar3 as perf_mini do ct --suite=test/end_to_end/perf_SUITE ...

From commit 49024e8:

%%% perf_SUITE: 
Outputs from profiling with KeyCount 2000000 ObjSize 2048 Compression {zstd,as_store}:
TotalLoadTime - 219100 ms
TotalHeadTime - 38673 ms
TotalGetTime - 60652 ms
TotalQueryTime - 4841 ms
TotalMiniQueryTime - 7733 ms
TotalFullFoldTime - 38398 ms
TotalAAEFoldTime - 3597 ms
TotalUpdateTime - 10109 ms
Disk space required for test - 5.5G
Average Memory usage for test - Total 200 Proc 141 Bin 18 MB
Closing count of SST Files - 118
Closing count of CDB Files - 14

From commit 24ef4cb:

%%% perf_SUITE: 
Outputs from profiling with KeyCount 2000000 ObjSize 2048 Compression {zstd,as_store}:
TotalLoadTime - 257687 ms
TotalHeadTime - 40266 ms
TotalGetTime - 59248 ms
TotalQueryTime - 6429 ms
TotalMiniQueryTime - 12075 ms
TotalFullFoldTime - 52816 ms
TotalAAEFoldTime - 3888 ms
TotalUpdateTime - 10753 ms
Disk space required for test - 5.6G
Average Memory usage for test - Total 147 Proc 90 Bin 19 MB
Closing count of SST Files - 120
Closing count of CDB Files - 12

The test has some natural variation between runs, but we consistently see a difference in the TotalLoadTime and TotalFullFoldTime.

Running the same test on master:

%%% perf_SUITE: 
Outputs from profiling with KeyCount 2000000 ObjSize 2048 Compression {zstd,as_store}:
TotalLoadTime - 239077 ms
TotalHeadTime - 40771 ms
TotalGetTime - 45050 ms
TotalQueryTime - 6285 ms
TotalMiniQueryTime - 10388 ms
TotalFullFoldTime - 48978 ms
TotalAAEFoldTime - 3870 ms
TotalUpdateTime - 10816 ms
Disk space required for test - 5.6G
Average Memory usage for test - Total 153 Proc 91 Bin 22 MB
Closing count of SST Files - 120
Closing count of CDB Files - 12

There appears to be some small improvement, but still a significant slowdown when compared to OTP 26.

The test will output eprof outputs for TotalLoadTime and TotalFullFoldTime when run as ./rebar3 as perf_prof do ct --suite=test/end_to_end/perf_SUITE. Focusing on the FullFold profile ...

From commit 49024e8:

Profile full:

FUNCTION                                             CALLS        %      TIME  [uS / CALLS]
--------                                             -----  -------      ----  [----------]
prim_file:pread_nif/3                                43999     0.33    231892  [      5.27]
erlang:'++'/2                                      1204983     0.37    257207  [      0.21]
lists:reverse/2                                    1078968     0.42    293538  [      0.27]
gen:do_call/4                                        44673     0.51    356412  [      7.98]
perf_SUITE:'-counter/2-fun-0-'/4                  22400000     1.06    737655  [      0.03]
leveled_head:build_head/2                         22400000     1.14    794253  [      0.04]
leveled_codec:striphead_to_v1details/1            22400000     1.19    829215  [      0.04]
leveled_codec:from_ledgerkey/1                    22400000     1.26    879291  [      0.04]
leveled_penciller:'-find_nextkeys/6-fun-1-'/1     21700000     1.31    910607  [      0.04]
leveled_head:get_size/2                           22400000     1.44   1003059  [      0.04]
leveled_codec:to_ledgerkey/3                      22400020     1.58   1101796  [      0.05]
maps:update_with/3                                22400010     2.69   1868585  [      0.08]
leveled_codec:maybe_accumulate/5                  23100010     2.75   1911402  [      0.08]
leveled_runner:'-accumulate_objects/4-fun-0-'/7   22400000     5.75   3998070  [      0.18]
leveled_sst:deserialise_checkedblock/2              891366     6.03   4196508  [      4.71]
leveled_head:riak_metadata_to_binary/2            22400000     6.65   4622823  [      0.21]
leveled_codec:return_proxy/4                      22400000     8.17   5680153  [      0.25]
zstd:quick_decompress/1                             889356    11.69   8129847  [      9.14]
erlang:term_to_binary/1                           22416100    12.27   8532287  [      0.38]
erlang:binary_to_term/1                             891377    13.26   9221506  [     10.35]
leveled_penciller:find_nextkeys/6                128750550    16.72  11631140  [      0.09]
-----------------------------------------------  ---------  -------  --------  [----------]
Total:                                           446446438  100.00%  69558305  [      0.16]

From commit 24ef4cb:

Profile full:

FUNCTION                                             CALLS        %      TIME  [uS / CALLS]
--------                                             -----  -------      ----  [----------]
prim_file:pread_nif/3                                43724     0.34    274407  [      6.28]
erlang:'++'/2                                      1219505     0.35    283447  [      0.23]
lists:reverse/2                                    1068464     0.37    297614  [      0.28]
lists:reverse/1                                    1051325     0.54    443675  [      0.42]
perf_SUITE:'-counter/2-fun-0-'/4                  22400000     0.92    748641  [      0.03]
leveled_head:build_head/2                         22400000     0.98    795345  [      0.04]
leveled_codec:striphead_to_v1details/1            22400000     1.04    848665  [      0.04]
leveled_codec:from_ledgerkey/1                    22400000     1.14    929506  [      0.04]
leveled_penciller:'-find_nextkeys/6-fun-1-'/1     21700000     1.23   1005327  [      0.05]
leveled_codec:to_ledgerkey/3                      22400020     1.39   1129934  [      0.05]
leveled_head:get_size/2                           22400000     1.46   1184771  [      0.05]
leveled_codec:maybe_accumulate/5                  23100010     2.32   1890196  [      0.08]
gen:do_call/4                                        44254     2.64   2147989  [     48.54]
maps:update_with/3                                22400010     2.67   2174414  [      0.10]
leveled_runner:'-accumulate_objects/4-fun-0-'/7   22400000     4.91   4000124  [      0.18]
leveled_sst:deserialise_checkedblock/2              875370     6.76   5502848  [      6.29]
leveled_codec:return_proxy/4                      22400000     8.28   6739167  [      0.30]
leveled_head:riak_metadata_to_binary/2            22400000     9.59   7804844  [      0.35]
zstd:quick_decompress/1                             875370     9.63   7837760  [      8.95]
erlang:binary_to_term/1                             875371    12.31  10021829  [     11.45]
erlang:term_to_binary/1                           22402561    14.04  11429797  [      0.51]
leveled_penciller:find_nextkeys/6                126312251    14.66  11932893  [      0.09]
-----------------------------------------------  ---------  -------  --------  [----------]
Total:                                           439167469  100.00%  81422280  [      0.19]

From master:

Profile full:

FUNCTION                                             CALLS        %      TIME  [uS / CALLS]
--------                                             -----  -------      ----  [----------]
erlang:crc32/1                                     1116050     0.25    203522  [      0.18]
erlang:'++'/2                                      1558272     0.30    242790  [      0.16]
prim_file:pread_nif/3                                43841     0.34    274562  [      6.26]
lists:reverse/2                                    1163715     0.39    323075  [      0.28]
lists:reverse/1                                    1063781     0.45    366984  [      0.34]
zstd:quick_compress/2                                27972     0.58    470871  [     16.83]
perf_SUITE:'-counter/2-fun-0-'/4                  22400000     0.92    751929  [      0.03]
leveled_head:build_head/2                         22400000     0.98    803197  [      0.04]
leveled_codec:from_ledgerkey/1                    22400000     1.12    913178  [      0.04]
leveled_codec:striphead_to_v1details/1            22400000     1.12    919072  [      0.04]
leveled_penciller:'-find_nextkeys/6-fun-1-'/1     21700000     1.44   1177810  [      0.05]
leveled_codec:to_ledgerkey/3                      22400020     1.62   1323444  [      0.06]
leveled_head:get_size/2                           22400000     1.90   1550695  [      0.07]
leveled_codec:maybe_accumulate/5                  23100010     2.32   1900101  [      0.08]
maps:update_with/3                                22400010     2.95   2410153  [      0.11]
gen:do_call/4                                        44946     3.00   2450715  [     54.53]
leveled_runner:'-accumulate_objects/4-fun-0-'/7   22400000     4.92   4024900  [      0.18]
leveled_head:riak_metadata_to_binary/2            22400000     5.91   4830970  [      0.22]
leveled_codec:return_proxy/4                      22400000     7.69   6291738  [      0.28]
zstd:quick_decompress/1                             901963     9.70   7931114  [      8.79]
erlang:term_to_binary/1                           22427979    10.22   8356691  [      0.37]
leveled_sst:deserialise_checkedblock/2              901963    10.87   8893493  [      9.86]
erlang:binary_to_term/1                             901981    13.27  10854579  [     12.03]
leveled_penciller:find_nextkeys/6                129134363    14.59  11932884  [      0.09]
-----------------------------------------------  ---------  -------  --------  [----------]
Total:                                           456443503  100.00%  81793457  [      0.18]

It looks as if there was an issue with erlang:term_to_binary/1 and leveled_head:riak_metadata_to_binary/2 (which just creates a small binary) that is now resolved. @ThomasArts had raised #8314 related to these slowdowns in binary construction, so the fix there appears to have been successful.

However, there remain some key deltas around gen:do_call/4, erlang:binary_to_term/1 and leveled_sst:deserialise_checkedblock/2 (which is just a function that does binary_to_term after running zstd decompression). The maps:update_with/3 calls, which appear to be slower, will be passing the tl/1 function for the update (e.g. https://github.com/martinsumner/leveled/blob/develop-3.1-otp27/src/leveled_penciller.erl#L1771).
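
For reference, the call pattern being described is roughly the following (illustrative only, not the actual leveled code):

%% The value stored under the key is a list, and the update function simply
%% drops its head (names and data are made up).
M0 = #{k => [a, b, c]},
M1 = maps:update_with(k, fun tl/1, M0).
%% M1 =:= #{k => [b, c]}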

Issues with binary_to_term/1 have been isolated and raised by @ThomasArts - #8329.

@jhogberg
Contributor

jhogberg commented Apr 5, 2024

I've made some tweaks in #8347 that bring most of the performance back on small sizes, and make larger ones faster than before. Can you try it out?

@mkuratczyk
Contributor Author

As far as RabbitMQ tests go, there's definitely an improvement with #8347, but still a regression compared to 26.2:

Linux, i9-10900X CPU @ 3.70GHz (columns: 26.2.3 / master / pr8347):
java -jar perf-test.jar -x 2 -y 0 -qa x-queue-version=2 -c 1000 -u q -ad false -f persistent -s 100 -z 10   113000 / 95500 / 102000
java -jar perf-test.jar -x 2 -y 2 -qa x-queue-version=2 -c 1000 -u q -ad false -f persistent -s 100 -z 10   85000 / 55000 / 64000
java -jar stream-perf-test.jar -s 100 -z 10   900000 / 775000 / 900000

M1 Pro, macOS 14.4.1 (columns: 26.2.2 / master / pr8347):
java -jar perf-test.jar -x 2 -y 0 -qa x-queue-version=2 -c 1000 -u q -ad false -f persistent -s 100 -z 10   80000 / 65000 / 76000
java -jar perf-test.jar -x 2 -y 2 -qa x-queue-version=2 -c 1000 -u q -ad false -f persistent -s 100 -z 10   52000 / 32000 / 37000
java -jar stream-perf-test.jar -s 100 -z 10   830000 / 650000 / 785000

The biggest difference is with the test in the middle, which actually doesn't write to disk: in this test we have 2 publishers and 2 consumers (-x 2 -y 2), so messages (RabbitMQ/AMQP messages, not Erlang messages) are consumed immediately and never written to disk (I only put one number in the table, but the same number of messages is consumed at the same time).

@martinsumner

From leveled tests, initial test runs of #8347 look positive, but I'm getting some variation between runs. The eprof profiles look more aligned with OTP 26.

I will run a series of longer tests over the weekend which should give more reliable results and report back on Monday.

@jhogberg
Contributor

jhogberg commented Apr 5, 2024

Thanks, I've pushed another tweak of the virtual heap growth curve that brings performance back to where it was (or better) on my machine (edit: it varies a lot though, both before and after). This is not without downsides however, so I might have to make changes later on in case this reduces performance or increases garbage retention too much for other uses.

I'll try to strike the best balance I can, but chances are that it'll still be slower than before for your specific code, so you may want to look into tweaking the minimum virtual heap sizes for the processes involved.
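
For reference, tweaking this for a single process looks roughly like the following (a minimal sketch; the sizes are illustrative and given in words, and worker_loop/0 is a made-up function):

%% Raise the minimum binary virtual heap size for the current process ...
process_flag(min_bin_vheap_size, 186000),

%% ... or set it when the process is spawned.
Pid = spawn_opt(fun() -> worker_loop() end,
                [{min_bin_vheap_size, 186000}]).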

@jhogberg
Contributor

jhogberg commented Apr 5, 2024

I just have to ask you folks (cc @ThomasArts): have these benchmarks been used as a target for performance improvements for a long time?

If so I fear there's not much more I can do. I've managed to reproduce this slowdown and #8329 in earlier versions by fixing the off-heap accounting bug alone, which strongly suggests that the code has been optimized to minimize inopportune GCs for these specific benchmarks with these specific parameters (probably inadvertently).

@lhoguin
Contributor

lhoguin commented Apr 5, 2024

If I understand correctly, we should be able to get performance similar to before by tweaking the virtual heap configuration. Could you list the process flags one would need to tinker with to achieve that? Just the min_bin_vheap_size flag? More? Thank you!

@jhogberg
Contributor

jhogberg commented Apr 5, 2024

Currently just min_bin_vheap_size, but we might add options to change the per-process growth curve etc in the future.

@martinsumner

The leveled perf_SUITE test we're using was only recently created, we've been using this test just for the past few months.

Most of the resultant optimisations have been easy to justify (e.g. they align with the efficiency guide, or update to a more modern feature like maps), but I can't entirely rule out this sort of inadvertent improvement that was dependent on bugs.

We have no production use of OTP 26 at present, nothing higher than OTP 24. So a small degradation from 26 to 27 isn't a big issue for us as long as the test is improving from OTP 24.

I will compare over the weekend OTP24, OTP26, OTP27-RC2 and OTP27-PR8347.

Are you interested in a comparison between different commits on PR8347, or should I just be testing 436568a?

@jhogberg
Contributor

jhogberg commented Apr 5, 2024

> The leveled perf_SUITE test we're using was only recently created, we've been using this test just for the past few months.
>
> Most of the resultant optimisations have been easy to justify (e.g. they align with the efficiency guide, or update to a more modern feature like maps), but I can't entirely rule out this sort of inadvertent improvement that was dependent on bugs.
>
> We have no production use of OTP 26 at present, nothing higher than OTP 24. So a small degradation from 26 to 27 isn't a big issue for us as long as the test is improving from OTP 24.
>
> I will compare over the weekend OTP24, OTP26, OTP27-RC2 and OTP27-PR8347.

Thanks, that's really good to know :)

> Are you interested in a comparison between different commits on PR8347, or should I just be testing 436568a?

If you could test both, that'd be great.

@mkuratczyk
Contributor Author

mkuratczyk commented Apr 5, 2024

With the most recent changes I get these results (same machines as above):
linux: 103000 / 73000 / 920000
macos: 80000 / 40000 / 840000

So indeed some tests are even a bit better than with 26.2, while the middle one (which doesn't write to disk) is still quite a bit worse. We will play around with min_bin_vheap_size soon and report back. Either way, thank you very much for all your work on this!

As for how we are using these tests: the specific tests used in this issue and related ones were picked out of a much larger suite of tests we run regularly, mostly to check for improvements/regressions in RabbitMQ itself. We've been using these tests for a few years now (with some adjustments and additions). We have a dozen or so scenarios that test different code paths (which may also trigger very different behaviours: some will use more and more memory, some will write to disk, some will only read from disk, etc) and we test with at least 3 different message sizes. Based on that we produce internal results to validate PRs etc, but also occasionally share results publicly, as in this example: https://www.rabbitmq.com/blog/2023/05/17/rabbitmq-3.12-performance-improvements#312-cqv1-vs-cqv2

Comparing OTP changes is more of a side effect of these tests, but since we have a branch dedicated to preparing the codebase for OTP 27, I ran the tests as soon as we had a correct build, and many tests were significantly worse across the board (although the 100B message size seems to be the most impacted).

@mkuratczyk
Contributor Author

Only a quick and dirty test for now, but: setting +hmbs (default binary virtual heap size) doesn't change anything, while setting +hms (default heap size) restores the performance of OTP 26, in this particular test at least.
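
For reference, these emulator defaults are set on the erl command line; the values below are purely illustrative (sizes are in words), not the ones used in this test:

# Illustrative values only: raise the default heap size (+hms) and the default
# binary virtual heap size (+hmbs) for all processes on the node.
erl +hms 10000 +hmbs 186000
# The same defaults can also be changed at runtime for newly spawned processes:
#   erlang:system_flag(min_heap_size, 10000),
#   erlang:system_flag(min_bin_vheap_size, 186000).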

@martinsumner

Test results from experiments with PR8347 on leveled ....

The test starts up six different databases (in series) with different key counts, and then runs phases of different types of intense activity, i.e. data load, head requests, get requests, folds etc. We then total the times across all six tests, and take the average across 3 separate runs.

After each phase of intense activity, a check of the memory used is done by calling garbage_collect/0 on the test process and then checking erlang:memory/0. So the memory measurement is a snapshot once each phase of activity is over: there are six measurements on each of the six databases on each of the 3 runs (108 measurements overall).
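
The per-phase measurement described above boils down to something like the following (a sketch, not the actual perf_SUITE code):

%% Force a GC of the calling (test) process, then read the node-wide memory
%% counters. erlang:memory/0 reports sizes in bytes.
memory_snapshot() ->
    true = erlang:garbage_collect(),
    Mem = erlang:memory(),
    {proplists:get_value(total, Mem),
     proplists:get_value(processes, Mem),
     proplists:get_value(binary, Mem)}.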

The control for this was running the test on OTP 24.3.4.13, and the results of this run are the 100% line on the chart. The results of the other runs are the columns, with timings on the left and memory used on the right.

[chart: perfSUITE-v7, timings and memory usage relative to the OTP 24 baseline]

There were six comparisons taken:

  • OTP 26.2.3
  • OTP 27.0 RC2
  • OTP 27.0 PR8347 commit ecd8df8
  • OTP 27.0 PR8347 commit 968cb0b
  • OTP 27.0 PR8347 commit 436568a
  • OTP 27.0 PR8347 commit ecd8df8 with +hmbs set to 131072 words (4 times the default)

A couple of observations:

  • the initial commit on PR8347 (ecd8df8) meets the baseline of making things as-good-as or better-than OTP 24, but there is a cost in terms of memory used (when compared with both OTP 24 and OTP 26).
  • the third commit on PR8347 (436568a) has a memory cost (by the measure in this test) of over 30% when compared to OTP 24, so this would not be a desirable change in the context of leveled.
  • the best results overall (better than with OTP 26) are made by simply increasing the default binary virtual heap size in OTP 27. It should be noted that in this test the large volume of binaries is handled by a small number of short-lived processes (which handle individual fold or merge events). Most of the memory in the system is taken by the large number of long-lived processes that control access to the individual files (and do relatively little direct processing of binaries).

So overall the vheap resizing algorithm changes are probably not a great trade-off for our test case. It is positive for our case that we can improve performance through an established configuration option.

I will work further on this next week, and try to enhance the memory monitoring in the test, in particular to try to capture memory usage during the phases, not just after. Also, based on the understanding of what appears to be driving performance, I'm going to look at what re-coding/re-design we might be able to do to achieve equivalent benefits without tweaking default emulator settings.

Thanks for all the help so far.

@jhogberg
Contributor

jhogberg commented Apr 8, 2024

Huge thanks for putting that together, I’ll remove the vheap curve stuff from #8347 and focus more on adding flags to tweak it in the near term. 🙂

@harrisi

harrisi commented Sep 6, 2024

This issue came up, maybe inadvertently, on the Erlang forums. I was wondering whether this is still an ongoing problem? Most of the related PRs, issues, and branches seem to have been merged.

@mkuratczyk
Contributor Author

mkuratczyk commented Sep 6, 2024

Yes, we still haven't merged any fixes for this, so RabbitMQ on OTP 27 works but is significantly slower for some message sizes (from 65 bytes up to 1-2 kB; above that it usually doesn't matter that much).

Setting min_heap_size and min_bin_vheap_size generally helps, but overall this issue is still quite confusing: since RabbitMQ supports different protocols and queue types, some code paths are more affected than others, so we need to set these values in many places (in the past we never used these flags) and test a lot of combinations of values in different processes across different message sizes. The results also vary between environments.
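
To give a sense of what setting these values in many places looks like, here is a sketch (not taken from RabbitMQ; the sizes are illustrative and given in words) of passing the flags as spawn options when starting a gen_server:

%% Sketch only: the heap settings have to be threaded through as spawn options
%% in every affected start_link/start call; sizes are illustrative.
start_link(Args) ->
    gen_server:start_link(?MODULE, Args,
                          [{spawn_opt, [{min_heap_size, 10000},
                                        {min_bin_vheap_size, 186000}]}]).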

For example we had a situation where:

  • on one machine there's a 40% drop in throughput on OTP27 and setting min_heap_size and min_bin_vheap_size did not help at all
  • on another machine the same workload was "only" 20% slower and min_bin_vheap_size restored the performance to the OTP26 level (in fact, made it a little faster)
  • on a third machine the same workload was roughly 25% slower, but setting min_bin_vheap_size restored the performance only partially (reduced the gap to 15%)

Here's an example of a PR where we are playing with different values and you can see the results:
rabbitmq/rabbitmq-server#12208

The graphs might be hard to decipher without additional context, but they show different workloads run for 5 minutes each with different message sizes, and for each there are three environments: main (no changes, OTP 26), pr-12208-with-channel-tweaks (this PR as it is right now, running on OTP 26) and pr-12208-otp27-with-channel-tweaks (the same PR but actually running on OTP 27).

Feel free to ping me on Erlanger slack if you want to chat about any of that.
