Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

query: interrupt query script on SIGTERM #858

Merged
merged 2 commits into from
Jan 28, 2025
Merged

query: interrupt query script on SIGTERM #858

merged 2 commits into from
Jan 28, 2025

Conversation

skshetry
Copy link
Member

This now supports gracefully terminating script on SIGTERM. Graceful shutdown is also done on Ctrl-C.

Also added logs.

Copy link

codecov bot commented Jan 24, 2025

Codecov Report

Attention: Patch coverage is 96.07843% with 2 lines in your changes missing coverage. Please review.

Project coverage is 87.60%. Comparing base (ef23a20) to head (fbab6bb).
Report is 5 commits behind head on main.

Files with missing lines Patch % Lines
src/datachain/catalog/catalog.py 96.07% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #858      +/-   ##
==========================================
- Coverage   87.61%   87.60%   -0.02%     
==========================================
  Files         128      128              
  Lines       11385    11431      +46     
  Branches     1540     1543       +3     
==========================================
+ Hits         9975    10014      +39     
- Misses       1023     1028       +5     
- Partials      387      389       +2     
Flag Coverage Δ
datachain 87.52% <94.11%> (-0.02%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link

cloudflare-workers-and-pages bot commented Jan 24, 2025

Deploying datachain-documentation with  Cloudflare Pages  Cloudflare Pages

Latest commit: fbab6bb
Status: ✅  Deploy successful!
Preview URL: https://1bcd9044.datachain-documentation.pages.dev
Branch Preview URL: https://handle-sigterm.datachain-documentation.pages.dev

View logs

@shcheklein
Copy link
Member

Is there a way to test this?

@skshetry
Copy link
Member Author

skshetry commented Jan 27, 2025

Is there a way to test this?

I don't see an easy way to test this unfortunately. I don't want to write some heavy tests, and simple mock tests are not going to be useful.

@skshetry skshetry requested a review from a team January 27, 2025 12:11
@skshetry skshetry force-pushed the handle-sigterm branch 4 times, most recently from a50bc43 to 9525334 Compare January 27, 2025 13:48
This now supports gracefully terminating script on SIGTERM.

Also added logs.
Comment on lines 1557 to 1558
# ignore SIGINT in the main process
signal.signal(signal.SIGINT, signal.SIG_IGN)
Copy link
Member Author

@skshetry skshetry Jan 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need two different behaviors from this query API that this has become complicated.

CLI should ideally ignore SIGINT and forward SIGTERM to the child.
However, Celery tasks should not ignore SIGINT, but consider SIGTERM as a special signal for graceful termination.

Here, I have made a compromise: to ignore SIGINT and use SIGTERM as a special signal for graceful termination everywhere.

So, SIGINT still works for CLI (where it matters most), and SIGTERM will still work for canceling celery tasks.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so, I'm a bit lost - are we ignoring SIGINT in CLI or not?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are ignoring Ctrl-C in the main process. If you are running in terminal, OS (tty driver) will send the signal to all processes in the foreground process group, so the script will also get the signal. If we forward signals, then the subprocess will receive the signal twice.

If you are running a background process (e.g.: via a script), you'd have to interrupt the script subprocess rather than the CLI process.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will send the signal to all processes in the foreground process group

okay, I see

I think it would be helpful to describe these details in the comment for the next generations, it's a dense, low level code

Copy link
Member

@mattseddon mattseddon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good

signal.signal(signal.SIGTERM, orig_sigterm_handler)
signal.signal(signal.SIGINT, orig_sigint_handler)
if thread:
thread.join() # wait for the reader thread
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Q] Will this still run into deadlocks while we are waiting for tqdm/tqdm#1649?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deadlock was happening when fp.write() was raising exception. That won't happen now that we use SIGTERM to cancel and wait.

But I can imagine, when the conditions are favorable, the tqdm may still deadlock.
But the chances are slim. Eg: when you press Ctrl-C right at the time display runs, it may deadlock if there are other tqdm processes running in other thread.

@mattseddon
Copy link
Member

Is there a way to test this?

Unfortunately, we need an e2e test in Studio. I can help with this.

Copy link
Contributor

@dreadatour dreadatour left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me.

SIGINT = signal.SIGINT


def shutdown_process(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unit test this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's no point in unittesting this. We'll check mock_popen.terminate.assert_called_once() which is not that useful.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean function test this - it should not be about testing "mocks" or course

# ignore SIGINT in the main process
signal.signal(signal.SIGINT, signal.SIG_IGN)

orig_sigterm_handler = signal.getsignal(signal.SIGTERM)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be happening inside try (to avoid getting the TerminationSignal raised in between)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TerminationSignal is raised only after we set a signal handler, which happens after this line.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so it can be raised right before try?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Theoretically yes, it could.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so, then we should move it inside and catch and make sure that we handle if for example of the original handlers is not yet initialized, etc

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this a few lines code change, no?

also, doing multithreading / multiprocessing is always hard exactly because lot of problems manifest themselves rarely but bite hard in terms of time someone can then spend debugging them later ... so, yes, if there are obvious improvements while it is still possible w/o obvious downsides I would always err on the side of doing safer code

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The best thing to do here is move sigterm handler out of this with loop.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also, doing multithreading / multiprocessing is always hard exactly because lot of problems manifest themselves rarely but bite hard in terms of time someone can then spend debugging them later ... so, yes, if there are obvious improvements while it is still possible w/o obvious downsides I would always err on the side of doing safer code

There is no multithreading or multiprocessing involved here. The main thread always gets a signal.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The best thing to do here is move sigterm handler out of this with loop.

probably, and in except clause and in finally check if proc is setup, if original handlers are set

There is no multithreading or multiprocessing involved here. The main thread always gets a signal.

there is to my mind. An external process sends a signal to a another one.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#873.

@@ -63,3 +66,15 @@ def test_non_zero_exitcode(catalog, mock_popen):
catalog.query("pass")
assert e.value.return_code == 1
assert "Query script exited with error code 1" in str(e.value)


def test_shutdown_process_on_sigterm(mocker, catalog, mock_popen):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure tbh this is a very meaningful test :) seems we are testing mocks primarily

Copy link
Member Author

@skshetry skshetry Jan 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main goal of this test here is to assert that shutdown_process was called (and with expected values).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, I understand. Doesn't see to be very useful though. Or let me put it in a different way - I think we rarely test things this way (to check if function is called w/o testing logic in some way). Still seems to be not very meaningful to my mind. While we can do in this case most likely a proper test

Copy link
Member Author

@skshetry skshetry Jan 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Handling signals is a tricky thing, especially when the main process needs to receive it (where it happens to be a test process for tests).

I think we rarely test things this way (to check if function is called w/o testing logic in some way

We do that in plenty of places in DVC, especially in CLI tests.

While we can do in this case most likely a proper test

If you have some example unittest that we can replace this with (without crossing boundaries into the working of shutdown_process), I'd be very interested.
Again, my goal with this test is to make sure that the shutdown_process gets called with expected values. If you see a way to do that, I'll be happy to write it, or review it.

I did add a functional test, but it's complex, and you have to tread very carefully not to get main process killed/blocked/waiting forever.

@skshetry skshetry force-pushed the handle-sigterm branch 3 times, most recently from f7e7723 to bd4d98a Compare January 28, 2025 10:30
@skshetry
Copy link
Member Author

Unfortunately, we need an e2e test in Studio. I can help with this.

I have added a functional test for SIGTERM shutdown. It looks very complicated, and I don't see any simpler way to test this. Open to suggestions.

@skshetry skshetry merged commit c0f23b4 into main Jan 28, 2025
37 checks passed
@skshetry skshetry deleted the handle-sigterm branch January 28, 2025 10:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants