Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Possible race condition on socket when calling ssh:connect #7571

Closed
yarisx opened this issue Aug 17, 2023 · 3 comments · Fixed by #7849
Closed

Possible race condition on socket when calling ssh:connect #7571

yarisx opened this issue Aug 17, 2023 · 3 comments · Fixed by #7849
Assignees
Labels
bug Issue is reported as a bug team:PS Assigned to OTP team PS

Comments

@yarisx
Copy link
Contributor

yarisx commented Aug 17, 2023

Describe the bug
Race condition is possible when a process that called ssh:connect() has died due to some reason before calling ssh_connection_handler:takeover(). In such case the next operation on socket (either inet:peername() or inet:sockname()) will result in {error, einval}
The error report is confusing and sometimes it is hard to tell the race condition from real problems with the socket.

To Reproduce
It is hard to reproduce without mingling with process scheduling in Erlang VM or patching the ssh_connection_handler. In our tests the race has appeared when multiple worker processes were created, each starting own SSH connection as a client, and then some workers could be terminated by the parent calling exit(Pid, shutdown) based on timeouts/load/etc. Some of the terminated workers reported {error, einval}.

Expected behavior
Exit without "einval"

Affected versions
OTP-25, OTP-26

Additional context
The following patch for OTP-25 fixed the race in our tests:

diff --git a/lib/ssh/src/ssh_connection_handler.erl b/lib/ssh/src/ssh_connection_handler.erl
index 4ef45516ca..a2a368af9a 100644
--- a/lib/ssh/src/ssh_connection_handler.erl
+++ b/lib/ssh/src/ssh_connection_handler.erl
@@ -401,6 +401,11 @@ alg(ConnectionHandler) ->
 %%====================================================================

 init([Role, Socket, Opts]) when Role==client ; Role==server ->
+    %% ssh_params will be changed in post_init() to values derived from Opts
+    D = #data{socket = Socket, ssh_params = #ssh{opts = Opts}},
+    {ok, {post_init, Role}, D}.
+
+post_init(Role, #data{socket = Socket, ssh_params = #ssh{opts = Opts}}) ->
     case inet:peername(Socket) of
         {ok, PeerAddr} ->
             try
@@ -414,7 +419,8 @@ init([Role, Socket, Opts]) when Role==client ; Role==server ->
                           connection_state = init_connection_record(Role, Socket, Opts)
                          },
                 process_flag(trap_exit, true),
-                {ok, {hello,Role}, D}
+                NextEvent = {next_event, internal, socket_controlled},
+                {next_state, {hello,Role}, D, NextEvent}
             catch
                 _:{error,Error} -> {stop, {error,Error}};
                 error:Error ->     {stop, {error,Error}}
@@ -584,7 +590,11 @@ callback_mode() ->

 %%% ######## {hello, client|server} ####
 %% The very first event that is sent when the we are set as controlling process of Socket
-handle_event(cast, socket_control, {hello,_}=StateName, #data{ssh_params = Ssh0} = D) ->
+handle_event(cast, socket_control, {post_init, Role}, DIn) ->
+    post_init(Role, DIn);
+
+handle_event(internal, socket_controlled, {hello, _Role} = StateName, D) ->
+    Ssh0 = D#data.ssh_params,
     VsnMsg = ssh_transport:hello_version_msg(string_version(Ssh0)),
     send_bytes(VsnMsg, D),
     case inet:getopts(Socket=D#data.socket, [recbuf]) of
@@ -1364,6 +1374,11 @@ handle_event(Type, Ev, StateName, D0) ->

 %% . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

+terminate(_, {post_init, _}, _) ->
+    %% No need to to anything - maybe we have not yet gotten
+    %% control over the socket
+    ok;
+
 terminate(normal, _StateName, D) ->
     close_transport(D);
     
@yarisx yarisx added the bug Issue is reported as a bug label Aug 17, 2023
@IngelaAndin IngelaAndin added the team:PS Assigned to OTP team PS label Aug 18, 2023
@u3s u3s self-assigned this Aug 30, 2023
@u3s
Copy link
Contributor

u3s commented Nov 3, 2023

sorry for delay. I hope this can be investigated soon.

@u3s
Copy link
Contributor

u3s commented Nov 8, 2023

I agree this might be a problem and reproduced it with patched ssh_connection_handler. I will create PR based on your git patch.

@u3s u3s linked a pull request Nov 9, 2023 that will close this issue
@u3s
Copy link
Contributor

u3s commented Nov 20, 2023

Fix will be included in next versions of OTP-26, 25 and 24.

@u3s u3s closed this as completed Nov 20, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Issue is reported as a bug team:PS Assigned to OTP team PS
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants