Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sometimes sd-log shutdown error makes provisioning fail #1247

Open
rocodes opened this issue Feb 14, 2025 · 3 comments
Open

Sometimes sd-log shutdown error makes provisioning fail #1247

rocodes opened this issue Feb 14, 2025 · 3 comments

Comments

@rocodes
Copy link
Contributor

rocodes commented Feb 14, 2025

Noticed eg in https://ws-ci-runner.securedrop.org/2025-02-13-222929262445-46d18ffdae2edf06935352b67dc690db2a2625d8-Qubes_4.2_D-update_20250213043413.log.txt although I've seen it on and off: qvm-shutdown --wait on sd-log doesn't always succeed, so sometimes (CI) provisioining fails (but presumably would also be true of real world systems).

Could be because logs are being streamed in via qrexec - could be that we need to force-shut down instead of just shutting down, or that there's a more graceful way to handle all of it (a pre-shutdown signal to stop the qrexec logging service and allow clean shutdown, if that is at all the issue?).

Merging #1246 (where this most recently occurred) and tracking this as a separate issue.

@deeplow
Copy link
Contributor

deeplow commented Feb 19, 2025

Could be because logs are being streamed in via qrexec - could be that we need to force-shut down instead of just shutting down, or that there's a more graceful way to handle all of it (a pre-shutdown signal to stop the qrexec logging service and allow clean shutdown, if that is at all the issue?).

I think so and I've managed to reproduce the original issue.

Reproduction Steps:

  1. open sd-app's terminal and run while true; do cat /dev/random | qrexec-client-vm sd-log securedrop.Log; done
  2. in dom0 run manually several times `qvm-shutdown --wait sd-log && echo worked this time

Here is the output in my case:

[user@dom0 ~]$ qvm-shutdown --wait sd-log && echo worked this time
worked this time
[user@dom0 ~]$ qvm-shutdown --wait sd-log && echo worked this time
worked this time
[user@dom0 ~]$ qvm-shutdown --wait sd-log && echo worked this time
qvm-shutdown: error: Failed to shut down: sd-log
[user@dom0 ~]$ echo $?
1
[user@dom0 ~]$ qvm-shutdown --wait sd-log && echo worked this time
worked this time
[user@dom0 ~]$ qvm-shutdown --wait sd-log && echo worked this time
qvm-shutdown: error: Failed to shut down: sd-log

As you can see, in some cases sd-log fails to shutdown because something else has caused it to start. So it's a race condition and sometimes it happens sometimes it doesn't.

And I agree that the solution is a temporary policy disabling sd-log. How critical is logging during installation? We can always access logs in dom0, if needed. So we may want disable it entirely during install.

Sadly, I don't think 4.2 has this patch, but it introduces a new policy directory /run/qubes/policy.d which gets deleted on shutdown. Otherwise, we could have overrided there.

@rocodes
Copy link
Contributor Author

rocodes commented Feb 19, 2025

Thanks for confirming. My opinion is that a force-shutdown as I've proposed in the linked PR is an acceptable way to balance efficiacy vs complexity, and a future "suspend logging to facilitate clean shutdown" idea is an enhancement but not necessarily a priority at this stage.

We may find that there's an upstream way of handling this (not saying this is where to look, but for example looking the qrexec cleanup methods to see if there is a soft interrupt option, or looking at qvm modules, or looking at specific domains like sys-gui that would have similar behaviour pattern to see how it's handled there).

@rocodes
Copy link
Contributor Author

rocodes commented Feb 19, 2025

Update after talking to deeplow: We are going to defer the change I suggested, because it resolves some but not all of the edge cases, which are because of the sd-log autostart parameter meaning that the log vm could receive conflicting instructions (to start up and shut down). We decided that configuring autostart at the end of the provisioning in addition to the changes in the linked PR would address the issue, but introduce additional complexity, so we're going to leave this for now and address that more systematically later.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants