Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CA-400275 Tolerate a restarting fairlock service while connecting #713

Conversation

TimSmithCtx
Copy link
Contributor

Cope with the case where the fairlock service is restarted while something is in the middle of connecting to it.

@MarkSymsCtx
Copy link
Contributor

Is this going to be sufficient? In the failed test case the python process looked it it was stuck in the recv. Does it need something to cause it to fail out? Timeout on recv maybe?

@TimSmithCtx
Copy link
Contributor Author

Is this going to be sufficient? In the failed test case the python process looked it it was stuck in the recv. Does it need something to cause it to fail out? Timeout on recv maybe?

Being stuck in a recv() from a unix-domain socket when the server process has terminated would be an extremely serious kernel bug.

@MarkSymsCtx
Copy link
Contributor

tuck in a recv() from a unix-domain socket when the server process has terminated would be

Well, that what the log said

root      465084  0.0  0.8 233528 21904 ?        Ss   Sep30   0:00 /usr/bin/python3 /opt/xensource/sm/LVMSR <methodCall><methodName>vdi_snapshot</methodName><params><param><value><struct><member><name>host_ref</name><value>OpaqueRef:4f3e63e0-3e59-d3bb-2457-6d6e7975b24f</value></member><member><name>command</name><value>vdi_snapshot</value></member><member><name>args</name><value><array><data/></array></value></member><member><name>device_config</name><value><struct><member><name>SRmaster</name><value>true</value></member><member><name>device</name><value>/dev/disk/by-id/ata-ST1000NX0423_W470NRX9-part3</value></member></struct></value></member><member><name>session_ref</name><value>OpaqueRef:b2a1344d-3693-98c3-57ec-49283d1aeb35</value></member><member><name>sr_ref</name><value>OpaqueRef:740264bb-1a98-f187-2353-6e8b043b586e</value></member><member><name>sr_uuid</name><value>18759869-73f1-fe91-442c-131be537da1b</value></member><member><name>vdi_ref</name><value>OpaqueRef:12e58463-4eb9-dfef-5e37-fc8da9993977</value></member><member><name>vdi_location</name><value>ee8288c4-9c4e-4552-ae2f-6034f707a7e1</value></member><member><name>vdi_uuid</name><value>ee8288c4-9c4e-4552-ae2f-6034f707a7e1</value></member><member><name>driver_params</name><value><struct><member><name>epochhint</name><value>96f965b9-43c9-8c9b-9590-56e377e0e48b</value></member></struct></value></member><member><name>subtask_of</name><value>DummyRef:|38688b39-732e-536c-3d65-87e93f212fa3|VDI.snapshot</value></member><member><name>vdi_on_boot</name><value>persist</value></member><member><name>vdi_allow_caching</name><value>false</value></member></struct></value></param></params></methodCall>
---
[root@lcy2-dt71 ~]# cat /proc/465084/stack
[<0>] unix_stream_read_generic+0x66d/0x890
[<0>] unix_stream_recvmsg+0x53/0x70
[<0>] __sys_recvfrom+0xc5/0x130
[<0>] __x64_sys_recvfrom+0x24/0x30
[<0>] do_syscall_64+0x4e/0x100
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[<0>] 0xffffffffffffffff
---
[root@lcy2-dt71 ~]# ls -l /proc/465084/fd
total 0
l-wx------ 1 root root 64 Sep 30 00:30 0 -> /dev/null
l-wx------ 1 root root 64 Sep 30 00:30 1 -> /run/nonpersistent/forkexecd/execute_command_get_out655f7b.log
l-wx------ 1 root root 64 Sep 30 00:30 2 -> /run/nonpersistent/forkexecd/execute_command_get_err09be6e.log
lrwx------ 1 root root 64 Sep 30 00:30 3 -> socket:[261952377]
lrwx------ 1 root root 64 Sep 30 00:30 4 -> /run/lock/sm/18759869-73f1-fe91-442c-131be537da1b/sr
lrwx------ 1 root root 64 Sep 30 00:30 5 -> socket:[261952380]

@TimSmithCtx
Copy link
Contributor Author

That just means something else had a connection and wasn't letting go of it. The question is: what?

@rdn32
Copy link
Contributor

rdn32 commented Oct 17, 2024

If the service might be restarted at any time, isn't there a danger that it might restart when one process thinks it is already holding a lock, and then grants the lock to another process?

Copy link
Contributor

@MarkSymsCtx MarkSymsCtx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems fine in and of itself, I'm not sure it will address the original issue but defence in depth against failures is always good.

@MarkSymsCtx
Copy link
Contributor

If the service might be restarted at any time, isn't there a danger that it might restart when one process thinks it is already holding a lock, and then grants the lock to another process?

@MarkSymsCtx MarkSymsCtx reopened this Oct 24, 2024
Cope with the case where the fairlock service is restarted while
something is in the middle of connecting to it.

Signed-off-by: Tim Smith <tim.smith@cloud.com>
@TimSmithCtx TimSmithCtx force-pushed the private/timsmi/private/timsmi/CA-400275 branch from 748c022 to 61a575b Compare November 15, 2024 17:01
@TimSmithCtx
Copy link
Contributor Author

Removed the ticket from the commit, since it's not entirely clear what's happening in that instance or whether this fixes it.

@TimSmithCtx TimSmithCtx merged commit b5e86fe into xapi-project:master Nov 25, 2024
2 checks passed
@TimSmithCtx TimSmithCtx deleted the private/timsmi/private/timsmi/CA-400275 branch November 28, 2024 08:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants