-
Notifications
You must be signed in to change notification settings - Fork 100
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
query: interrupt query script on SIGTERM #858
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #858 +/- ##
==========================================
- Coverage 87.61% 87.60% -0.02%
==========================================
Files 128 128
Lines 11385 11431 +46
Branches 1540 1543 +3
==========================================
+ Hits 9975 10014 +39
- Misses 1023 1028 +5
- Partials 387 389 +2
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
36fb854
to
ff93db2
Compare
Deploying datachain-documentation with Cloudflare Pages
|
ff93db2
to
68e5cfd
Compare
Is there a way to test this? |
I don't see an easy way to test this unfortunately. I don't want to write some heavy tests, and simple mock tests are not going to be useful. |
68e5cfd
to
d5c10f6
Compare
a50bc43
to
9525334
Compare
This now supports gracefully terminating script on SIGTERM. Also added logs.
9525334
to
89b9323
Compare
src/datachain/catalog/catalog.py
Outdated
# ignore SIGINT in the main process | ||
signal.signal(signal.SIGINT, signal.SIG_IGN) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need two different behaviors from this query
API that this has become complicated.
CLI should ideally ignore SIGINT
and forward SIGTERM
to the child.
However, Celery tasks should not ignore SIGINT
, but consider SIGTERM
as a special signal for graceful termination.
Here, I have made a compromise: to ignore SIGINT
and use SIGTERM
as a special signal for graceful termination everywhere.
So, SIGINT
still works for CLI (where it matters most), and SIGTERM
will still work for canceling celery tasks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so, I'm a bit lost - are we ignoring SIGINT in CLI or not?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are ignoring Ctrl-C
in the main process. If you are running in terminal, OS (tty
driver) will send the signal to all processes in the foreground process group, so the script will also get the signal. If we forward signals, then the subprocess will receive the signal twice.
If you are running a background process (e.g.: via a script), you'd have to interrupt the script subprocess rather than the CLI process.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will send the signal to all processes in the foreground process group
okay, I see
I think it would be helpful to describe these details in the comment for the next generations, it's a dense, low level code
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good
signal.signal(signal.SIGTERM, orig_sigterm_handler) | ||
signal.signal(signal.SIGINT, orig_sigint_handler) | ||
if thread: | ||
thread.join() # wait for the reader thread |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[Q] Will this still run into deadlocks while we are waiting for tqdm/tqdm#1649?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Deadlock was happening when fp.write()
was raising exception. That won't happen now that we use SIGTERM
to cancel and wait.
But I can imagine, when the conditions are favorable, the tqdm
may still deadlock.
But the chances are slim. Eg: when you press Ctrl-C
right at the time display
runs, it may deadlock if there are other tqdm
processes running in other thread.
Unfortunately, we need an e2e test in Studio. I can help with this. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me.
SIGINT = signal.SIGINT | ||
|
||
|
||
def shutdown_process( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unit test this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's no point in unittesting this. We'll check mock_popen.terminate.assert_called_once()
which is not that useful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean function test this - it should not be about testing "mocks" or course
# ignore SIGINT in the main process | ||
signal.signal(signal.SIGINT, signal.SIG_IGN) | ||
|
||
orig_sigterm_handler = signal.getsignal(signal.SIGTERM) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should be happening inside try (to avoid getting the TerminationSignal raised in between)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TerminationSignal
is raised only after we set a signal handler, which happens after this line.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so it can be raised right before try
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Theoretically yes, it could.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so, then we should move it inside and catch and make sure that we handle if for example of the original handlers is not yet initialized, etc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this a few lines code change, no?
also, doing multithreading / multiprocessing is always hard exactly because lot of problems manifest themselves rarely but bite hard in terms of time someone can then spend debugging them later ... so, yes, if there are obvious improvements while it is still possible w/o obvious downsides I would always err on the side of doing safer code
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The best thing to do here is move sigterm
handler out of this with
loop.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also, doing multithreading / multiprocessing is always hard exactly because lot of problems manifest themselves rarely but bite hard in terms of time someone can then spend debugging them later ... so, yes, if there are obvious improvements while it is still possible w/o obvious downsides I would always err on the side of doing safer code
There is no multithreading
or multiprocessing
involved here. The main thread always gets a signal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The best thing to do here is move sigterm handler out of this with loop.
probably, and in except
clause and in finally
check if proc is setup, if original handlers are set
There is no multithreading or multiprocessing involved here. The main thread always gets a signal.
there is to my mind. An external process sends a signal to a another one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#873.
@@ -63,3 +66,15 @@ def test_non_zero_exitcode(catalog, mock_popen): | |||
catalog.query("pass") | |||
assert e.value.return_code == 1 | |||
assert "Query script exited with error code 1" in str(e.value) | |||
|
|||
|
|||
def test_shutdown_process_on_sigterm(mocker, catalog, mock_popen): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure tbh this is a very meaningful test :) seems we are testing mocks primarily
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The main goal of this test here is to assert that shutdown_process
was called (and with expected values).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep, I understand. Doesn't see to be very useful though. Or let me put it in a different way - I think we rarely test things this way (to check if function is called w/o testing logic in some way). Still seems to be not very meaningful to my mind. While we can do in this case most likely a proper test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Handling signals is a tricky thing, especially when the main process needs to receive it (where it happens to be a test process for tests).
I think we rarely test things this way (to check if function is called w/o testing logic in some way
We do that in plenty of places in DVC, especially in CLI tests.
While we can do in this case most likely a proper test
If you have some example unittest that we can replace this with (without crossing boundaries into the working of shutdown_process
), I'd be very interested.
Again, my goal with this test is to make sure that the shutdown_process
gets called with expected values. If you see a way to do that, I'll be happy to write it, or review it.
I did add a functional test, but it's complex, and you have to tread very carefully not to get main process killed/blocked/waiting forever.
f7e7723
to
bd4d98a
Compare
I have added a functional test for |
bd4d98a
to
fbab6bb
Compare
This now supports gracefully terminating script on SIGTERM. Graceful shutdown is also done on Ctrl-C.
Also added logs.