CA-400275 Tolerate a restarting fairlock service while connecting #713

TimSmithCtx · 2024-10-15T12:08:01Z

Cope with the case where the fairlock service is restarted while something is in the middle of connecting to it.

MarkSymsCtx · 2024-10-15T13:24:13Z

Is this going to be sufficient? In the failed test case the python process looked it it was stuck in the recv. Does it need something to cause it to fail out? Timeout on recv maybe?

TimSmithCtx · 2024-10-15T14:41:00Z

Is this going to be sufficient? In the failed test case the python process looked it it was stuck in the recv. Does it need something to cause it to fail out? Timeout on recv maybe?

Being stuck in a recv() from a unix-domain socket when the server process has terminated would be an extremely serious kernel bug.

MarkSymsCtx · 2024-10-15T14:44:11Z

tuck in a recv() from a unix-domain socket when the server process has terminated would be

Well, that what the log said

root      465084  0.0  0.8 233528 21904 ?        Ss   Sep30   0:00 /usr/bin/python3 /opt/xensource/sm/LVMSR <methodCall><methodName>vdi_snapshot</methodName><params><param><value><struct><member><name>host_ref</name><value>OpaqueRef:4f3e63e0-3e59-d3bb-2457-6d6e7975b24f</value></member><member><name>command</name><value>vdi_snapshot</value></member><member><name>args</name><value><array><data/></array></value></member><member><name>device_config</name><value><struct><member><name>SRmaster</name><value>true</value></member><member><name>device</name><value>/dev/disk/by-id/ata-ST1000NX0423_W470NRX9-part3</value></member></struct></value></member><member><name>session_ref</name><value>OpaqueRef:b2a1344d-3693-98c3-57ec-49283d1aeb35</value></member><member><name>sr_ref</name><value>OpaqueRef:740264bb-1a98-f187-2353-6e8b043b586e</value></member><member><name>sr_uuid</name><value>18759869-73f1-fe91-442c-131be537da1b</value></member><member><name>vdi_ref</name><value>OpaqueRef:12e58463-4eb9-dfef-5e37-fc8da9993977</value></member><member><name>vdi_location</name><value>ee8288c4-9c4e-4552-ae2f-6034f707a7e1</value></member><member><name>vdi_uuid</name><value>ee8288c4-9c4e-4552-ae2f-6034f707a7e1</value></member><member><name>driver_params</name><value><struct><member><name>epochhint</name><value>96f965b9-43c9-8c9b-9590-56e377e0e48b</value></member></struct></value></member><member><name>subtask_of</name><value>DummyRef:|38688b39-732e-536c-3d65-87e93f212fa3|VDI.snapshot</value></member><member><name>vdi_on_boot</name><value>persist</value></member><member><name>vdi_allow_caching</name><value>false</value></member></struct></value></param></params></methodCall>
---
[root@lcy2-dt71 ~]# cat /proc/465084/stack
[<0>] unix_stream_read_generic+0x66d/0x890
[<0>] unix_stream_recvmsg+0x53/0x70
[<0>] __sys_recvfrom+0xc5/0x130
[<0>] __x64_sys_recvfrom+0x24/0x30
[<0>] do_syscall_64+0x4e/0x100
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[<0>] 0xffffffffffffffff
---
[root@lcy2-dt71 ~]# ls -l /proc/465084/fd
total 0
l-wx------ 1 root root 64 Sep 30 00:30 0 -> /dev/null
l-wx------ 1 root root 64 Sep 30 00:30 1 -> /run/nonpersistent/forkexecd/execute_command_get_out655f7b.log
l-wx------ 1 root root 64 Sep 30 00:30 2 -> /run/nonpersistent/forkexecd/execute_command_get_err09be6e.log
lrwx------ 1 root root 64 Sep 30 00:30 3 -> socket:[261952377]
lrwx------ 1 root root 64 Sep 30 00:30 4 -> /run/lock/sm/18759869-73f1-fe91-442c-131be537da1b/sr
lrwx------ 1 root root 64 Sep 30 00:30 5 -> socket:[261952380]

TimSmithCtx · 2024-10-16T07:59:27Z

That just means something else had a connection and wasn't letting go of it. The question is: what?

rdn32 · 2024-10-17T11:10:26Z

If the service might be restarted at any time, isn't there a danger that it might restart when one process thinks it is already holding a lock, and then grants the lock to another process?

MarkSymsCtx

This seems fine in and of itself, I'm not sure it will address the original issue but defence in depth against failures is always good.

MarkSymsCtx · 2024-10-24T08:46:49Z

If the service might be restarted at any time, isn't there a danger that it might restart when one process thinks it is already holding a lock, and then grants the lock to another process?

Cope with the case where the fairlock service is restarted while something is in the middle of connecting to it. Signed-off-by: Tim Smith <tim.smith@cloud.com>

TimSmithCtx · 2024-11-15T17:04:21Z

Removed the ticket from the commit, since it's not entirely clear what's happening in that instance or whether this fixes it.

TimSmithCtx requested review from rdn32, MarkSymsCtx and KevinLCtx October 15, 2024 12:08

MarkSymsCtx approved these changes Oct 24, 2024

View reviewed changes

MarkSymsCtx closed this Oct 24, 2024

MarkSymsCtx reopened this Oct 24, 2024

Tolerate a restarting fairlock service while connecting

61a575b

Cope with the case where the fairlock service is restarted while something is in the middle of connecting to it. Signed-off-by: Tim Smith <tim.smith@cloud.com>

TimSmithCtx force-pushed the private/timsmi/private/timsmi/CA-400275 branch from 748c022 to 61a575b Compare November 15, 2024 17:01

KevinLCtx approved these changes Nov 19, 2024

View reviewed changes

TimSmithCtx merged commit b5e86fe into xapi-project:master Nov 25, 2024
2 checks passed

TimSmithCtx deleted the private/timsmi/private/timsmi/CA-400275 branch November 28, 2024 08:48

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CA-400275 Tolerate a restarting fairlock service while connecting #713

CA-400275 Tolerate a restarting fairlock service while connecting #713

TimSmithCtx commented Oct 15, 2024

MarkSymsCtx commented Oct 15, 2024

TimSmithCtx commented Oct 15, 2024

MarkSymsCtx commented Oct 15, 2024

TimSmithCtx commented Oct 16, 2024

rdn32 commented Oct 17, 2024

MarkSymsCtx left a comment

MarkSymsCtx commented Oct 24, 2024

TimSmithCtx commented Nov 15, 2024

CA-400275 Tolerate a restarting fairlock service while connecting #713

CA-400275 Tolerate a restarting fairlock service while connecting #713

Conversation

TimSmithCtx commented Oct 15, 2024

MarkSymsCtx commented Oct 15, 2024

TimSmithCtx commented Oct 15, 2024

MarkSymsCtx commented Oct 15, 2024

TimSmithCtx commented Oct 16, 2024

rdn32 commented Oct 17, 2024

MarkSymsCtx left a comment

Choose a reason for hiding this comment

MarkSymsCtx commented Oct 24, 2024

TimSmithCtx commented Nov 15, 2024