Reduce repeated work in concurrent code loading situations #7503
Conversation
CT Test Results: 2 files, 65 suites, 1h 2m 43s ⏱️. Results for commit cb85704. ♻️ This comment has been updated with the latest results. To speed up review, make sure that you have read Contributing to Erlang/OTP and that all checks pass. See the TESTING and DEVELOPMENT HowTo guides for details about how to run tests locally. (Erlang/OTP GitHub Action Bot)
Force-pushed from 874bf2c to 43add23 (Compare)
We have a server with a lot of concurrent processes doing the same work in parallel at system startup. The server was stuck after a few seconds and we manually needed to reduce the load to get it working. The server was running OTP-26.0.2 and a coredump showed 18K I have tried this fix in production on OTP-26.0.2 and I can confirm that it resolves the issue for us. Thank you @michalmuskala :)
This PR seems to cause crashes in

I'll take a look
monitor_loader(#state{loading = Loading0} = St, Mod, Pid, Bin, FName) ->
    Tag = {'LOADER_DOWN', {Mod, Bin, FName}},
    Ref = erlang:monitor(process, Pid, [{tag, Tag}]),
Very neat use of the tag to avoid the need for keeping stuff in state. Love it.
Yeah, I've wanted to do this ever since this feature was introduced. Saves on the "back" mapping.
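A minimal sketch (not the OTP code itself; the module and the tag payload are illustrative) of the `{tag, Tag}` monitor option being discussed: by embedding context in the tag, the `'DOWN'` message itself carries everything needed to handle the exit, so no separate Ref-to-context mapping has to be kept in the server state.

```erlang
-module(tag_demo).
-export([run/0]).

run() ->
    %% A stand-in for a loader process that finishes normally on request.
    Pid = spawn(fun() -> receive stop -> ok end end),
    %% Pack the loading context directly into the monitor tag.
    Tag = {'LOADER_DOWN', {my_mod, <<"beam">>, "my_mod.beam"}},
    _MRef = erlang:monitor(process, Pid, [{tag, Tag}]),
    Pid ! stop,
    receive
        %% The 'DOWN' message arrives as {Tag, Ref, process, Pid, Reason},
        %% so the context can be pattern-matched straight out of it.
        {{'LOADER_DOWN', {Mod, _Bin, _FName}}, _Ref, process, Pid, Reason} ->
            {Mod, Reason}
    after 1000 ->
            timeout
    end.
```

Calling `tag_demo:run()` returns `{my_mod, normal}`: the module name comes out of the tag, not out of any state the monitoring process had to keep.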
With the changes in erlang#6736, significant work during code loading was moved away from the code server and into the requesting process. However, there could be a lot of repeated work, especially on system startup, if many similar processes are started, all trying to load the same module at the same time, overwhelming the code server with `get_object_code` requests. In this change, we add additional synchronisation to `ensure_loaded` that makes sure only one process at a time tries to load the module. In the added test, showcasing the worst-case scenario of a long load path and many concurrent requests, this changes the runtime (on my local machine) from 8s+ to around 200ms. Furthermore, the special sync operation can be merged with this, saving one round-trip to the code server.
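The synchronisation described above is a "single-flight" pattern. The following is a hypothetical sketch of the idea, not the actual code_server implementation (all names, including `single_flight` and `do_load/1`, are illustrative): the first caller for a module triggers the load, and later callers for the same module are parked until the one load finishes, so the expensive work runs once instead of once per caller.

```erlang
-module(single_flight).
-export([start/0, ensure_loaded/2]).

start() ->
    spawn(fun() -> loop(#{}) end).

%% Synchronous client call: ask the server and wait for the result.
ensure_loaded(Server, Mod) ->
    Server ! {ensure_loaded, self(), Mod},
    receive {loaded, Mod, Result} -> Result end.

%% Loading :: #{Mod => [WaiterPid]} tracks in-flight loads.
loop(Loading) ->
    receive
        {ensure_loaded, From, Mod} ->
            case maps:find(Mod, Loading) of
                {ok, Waiters} ->
                    %% Already being loaded: just park this caller.
                    loop(Loading#{Mod => [From | Waiters]});
                error ->
                    %% First request: do the expensive load exactly once,
                    %% off the server loop so other requests keep flowing.
                    Server = self(),
                    spawn(fun() -> Server ! {done, Mod, do_load(Mod)} end),
                    loop(Loading#{Mod => [From]})
            end;
        {done, Mod, Result} ->
            %% Fan the single result out to every parked waiter.
            Waiters = maps:get(Mod, Loading),
            [W ! {loaded, Mod, Result} || W <- Waiters],
            loop(maps:remove(Mod, Loading))
    end.

do_load(Mod) ->
    timer:sleep(50),  %% stand-in for get_object_code + loading the binary
    {module, Mod}.
```

With many concurrent callers of `ensure_loaded/2` for the same module, `do_load/1` runs once and all callers receive `{module, Mod}`, which is the shape of the saving reported above.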
Force-pushed from 43add23 to cb85704 (Compare)
Sorry for the delay, I got a bit sick. I've sent an update that should fix the

Thanks, I've added it back to our tests :)

Any updates on the results of the latest tests?

The tests look good!
The first regression is that code loading was attempted even when the mode is embedded. Erlang/OTP 25 only attempted to perform code loading if the mode was interactive: https://github.com/erlang/otp/blob/maint-25/lib/kernel/src/code_server.erl#L301. This check was removed in erlang#6736 as part of the decentralization. However, we received reports of increased CPU/memory usage in Erlang/OTP 26.1 in code that was calling code:ensure_loaded/1 on a hot path. The underlying code was fixed but, given that erlang#7503 added the server back into the equation for ensure_loaded, we can add the mode check back to preserve Erlang/OTP 25 behaviour.

The second regression would cause the caller process to deadlock when attempting to load a file with an invalid .beam more than once.
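A hedged sketch of the restored mode check (the function names here, `ensure_loaded_sketch/1` and `try_load/1`, are illustrative and not the real code_server internals): in embedded mode the system must not go to disk for code, so an unloaded module is reported as an error instead of triggering a load, mirroring the Erlang/OTP 25 behaviour.

```erlang
-module(mode_check).
-export([ensure_loaded_sketch/1]).

ensure_loaded_sketch(Mod) ->
    case erlang:module_loaded(Mod) of
        true ->
            %% Already loaded: nothing to do in any mode.
            {module, Mod};
        false ->
            %% code:get_mode/0 returns embedded | interactive.
            case code:get_mode() of
                embedded ->
                    %% Embedded systems preload everything; never hit disk.
                    {error, embedded};
                interactive ->
                    try_load(Mod)
            end
    end.

try_load(Mod) ->
    %% Stand-in for the real get_object_code + load path.
    code:ensure_loaded(Mod).
```

For a module that is already loaded, such as `lists`, the sketch returns `{module, lists}` without ever consulting the mode.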
We were accidentally returning the old closed-over state, which led to loss of information, causing the code_server to crash. This was introduced in erlang#7503.
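An illustrative reduction of this class of bug (not the actual code_server diff; the module and state shape are hypothetical): a closure built before the state update captures the old variable, so returning through the closure silently discards the new information.

```erlang
-module(stale_state).
-export([buggy/0, fixed/0]).

buggy() ->
    State0 = #{loading => []},
    State1 = State0#{loading := [my_mod]},
    _ = State1,
    %% Bug: the fun closes over State0, the pre-update state.
    Reply = fun() -> State0 end,
    Reply().  %% the update to 'loading' is lost

fixed() ->
    State0 = #{loading => []},
    State1 = State0#{loading := [my_mod]},
    %% Fix: close over the updated state instead.
    Reply = fun() -> State1 end,
    Reply().
```

`stale_state:buggy()` returns `#{loading => []}` while `stale_state:fixed()` returns `#{loading => [my_mod]}`; in the code server, losing such an update left the state inconsistent and crashed it.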
With the changes in #6736, significant work during code loading was moved away from the code server and into the requesting process. However, there could be a lot of repeated work, especially on system startup, if many similar processes are started, all trying to load the same module at the same time, overwhelming the code server with get_object_code requests.

In this change, we add additional synchronisation to ensure_loaded that makes sure only one process at a time tries to load the module. In the added test, showcasing the worst-case scenario of a long load path and many concurrent requests, this changes the runtime (on my local machine) from 8s+ to around 200ms.

Furthermore, the special sync operation can be merged with this, saving one round-trip to the code server.

Fixes #7479