[draft] erts: kill spawned child processes on VM exit #9453

Open
wants to merge 3 commits into
base: master

adamwight

This is a very rough proof-of-concept for discussion, which ensures all children spawned with open_port are terminated along with the BEAM.

Will be discussed in https://erlangforums.com/t/open-port-and-zombie-processes

CLAassistant commented Feb 17, 2025

CLA assistant check
All committers have signed the CLA.

Contributor

github-actions bot commented Feb 17, 2025

CT Test Results

    3 files    141 suites   49m 55s ⏱️
1 602 tests 1 553 ✅ 49 💤 0 ❌
2 311 runs  2 242 ✅ 69 💤 0 ❌

Results for commit dba896f.

♻️ This comment has been updated with latest results.

To speed up review, make sure that you have read Contributing to Erlang/OTP and that all checks pass.

See the TESTING and DEVELOPMENT HowTo guides for details about how to run test locally.

Artifacts

// Erlang/OTP Github Action Bot

}

static Eterm get_port_id(pid_t os_pid)
{
ErtsSysExitStatus est, *es;
Eterm port_id;
est.os_pid = os_pid;
-es = hash_remove(forker_hash, &est);
+es = hash_get(forker_hash, &est);
if (!es) return THE_NON_VALUE;
port_id = es->port_id;
Author

This seems to preserve the original behavior of sending exit_status only to callers that requested it. The caller still sets port_id conditionally, and the guard around the send still checks port_id rather than merely whether the os_pid exists in the hash table.

@adamwight adamwight force-pushed the aw-orphans branch 5 times, most recently from 9f87bc1 to dba896f Compare February 20, 2025 07:45
@garazdawi garazdawi self-assigned this Feb 20, 2025
@garazdawi garazdawi added the team:VM Assigned to OTP team VM label Feb 20, 2025
@garazdawi
Contributor

Hello!

I think that we can move forward with this. There is no need for an option to disable it for now (unless our existing tests show that one is needed...), but there need to be test cases verifying that it works as expected on both Unix and Windows.

I do wonder however if we should send some other signal than KILL? Should we allow the child to be able to catch it and deal with it if they want to?

@adamwight
Author

> I do wonder however if we should send some other signal than KILL? Should we allow the child to be able to catch it and deal with it if they want to?

Good point, TIL that SIGKILL cannot be trapped. Looking at erlexec for precedent, its default behavior is to send SIGTERM to the direct child process, wait for a configurable timeout (5 seconds by default), and then send SIGKILL.

I experimented a bit locally to see whether sh reacts to SIGTERM by stopping its own children, and it does not. So that spawn commands benefit as well as spawn_executable, I would stick with sending the signal to the entire child process group, but send TERM to be more polite. I like that this also gives descendant processes a second, more straightforward workaround to prevent the termination if needed.
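For discussion, the TERM-then-KILL sequence described above might look roughly like the following. This is only a sketch under stated assumptions: `terminate_group` and its `grace_ms` parameter are hypothetical names, not code from this PR or from erlexec, and it assumes the target children were placed in their own process group when spawned.

```c
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper (not from the patch): politely ask every process in
 * group `pgid` to exit with SIGTERM, poll for up to `grace_ms` milliseconds,
 * then fall back to SIGKILL.
 * Returns 0 if the group disappeared before the deadline,
 *         1 if SIGKILL had to be sent, -1 on error. */
static int terminate_group(pid_t pgid, int grace_ms)
{
    struct timespec tick = { 0, 10 * 1000 * 1000 }; /* 10 ms poll interval */
    int waited;

    if (pgid <= 1)
        return -1; /* never signal "every process" (-1) or init's group */

    /* A negative pid addresses the whole process group. */
    if (kill(-pgid, SIGTERM) < 0)
        return errno == ESRCH ? 0 : -1;

    for (waited = 0; waited < grace_ms; waited += 10) {
        /* Reap our own children in that group; zombies still count as
         * existing processes for the probe below. */
        while (waitpid(-pgid, NULL, WNOHANG) > 0)
            ;
        /* Signal 0 only probes: ESRCH means the group is empty. */
        if (kill(-pgid, 0) < 0 && errno == ESRCH)
            return 0;
        nanosleep(&tick, NULL);
    }

    /* Grace period expired: SIGKILL cannot be caught or ignored. */
    kill(-pgid, SIGKILL);
    while (waitpid(-pgid, NULL, 0) > 0)
        ;
    return 1;
}
```

A child that wants to survive or clean up can install a SIGTERM handler; one that ignores the signal entirely is still removed once the grace period ends.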

Keep the mapping of all living child processes so that we'll be able
to iterate over them to clean up, rather than only storing the
children which have an associated port.
If the uds_fd connection to the parent BEAM is broken or closed, react
by killing all children and any descendants in the same process group.

A concise demonstration of the problem being solved is to run this
command with and without the patch, then kill the BEAM.  Without the
patch, the "sleep" process will continue:

    erl -noshell -eval 'os:cmd("sleep 60")'

To intentionally start a child process that can outlive BEAM
termination, give it a new process group, for example by using
`setsid`:

    erl -noshell -eval 'os:cmd("setsid sleep 60")'
FIXME: Not working yet—the grandchild must never get TERM?
FIXME: Don't sleep for fixed intervals, use messages for timing.