
[mfserv] Conflict between process termination by circus and exit handler functions coded in our plugin #636

Open
corentin-bl opened this issue Oct 16, 2024 · 10 comments


@corentin-bl

We have an mfserv plugin that responds to HTTP requests using data fetched from a database, and we want to control how the database connections are closed.

Normally, this plugin is configured so that circus spawns 8 processes, each lasting 1 hour. The process hierarchy is as follows: circus -> signal_wrapper.py -> bjoern_wrapper.py. When max_age is reached, circus sends a signal to signal_wrapper.py, which forwards it to bjoern_wrapper.py, and eventually to our plugin. We understand that SIGTERM is sent first, and then, after some timeout (3 s according to the arguments given to signal_wrapper.py), SIGKILL is sent.

We have implemented a function in our plugin that is triggered upon signal reception. It simply retrieves the database connection and closes it. In practice, we see that SIGTERM is indeed caught and that the function starts. However, we can tell from the log that the function is never fully executed, and SIGKILL is eventually sent after 3 s.
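
Conceptually, the handler we register looks like this (a minimal sketch using sqlite3 purely as a stand-in for our real database client, not the actual plugin code):

import signal
import sqlite3
import sys

# stand-in for the connection our plugin keeps open during its lifetime
CONNECTION = sqlite3.connect(":memory:")


def exit_handler(signum, frame):
    # close the connection cleanly before the process goes away
    CONNECTION.close()
    sys.exit(0)


signal.signal(signal.SIGTERM, exit_handler)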

From the log messages, it is clear that the function did not exceed the timeout: the part of the function that did execute lasted ~0.4 ms, far below 3 s, so it is not clear why it does not run to completion.

@mrechte and I tried to investigate the propagation of signals through the processes, but everything seems to work normally. @matthieumarrast and I tried to dive deeper into the code of the socket_down function in signal_wrapper; one idea would be to keep the socket busy during the execution of our exit handler, but this remains to be tested, and it would still feel like a patch-up job. If anybody has a better understanding of what happens when circus terminates processes, it would be much appreciated.

Steps to reproduce (mfserv 2.2):

  • create an mfserv plugin for a python WSGI application,
  • implement an exit handler function and use the signal module from the standard library to bind it to SIGTERM; make the function long enough (e.g. using time.sleep()) and print messages in the log file,
  • observe the logs; if the function is long enough, not all messages will be in the log, indicating that the function has not been fully executed.
@matthieumarrast
Contributor

Here is a way to reproduce:

# create a python/flask plugin
bootstrap_plugin.py create foo
cd foo && make develop

in the plugin's config.ini, set:

max_age=30

so that circus restarts the plugin every 30 to 60 seconds

Edit the main/wsgi.py script to add a SIGTERM handler, and make the handler and the request processing last 30 s each:

import os
import mflog
import time
import signal
from flask import Flask, render_template

app = Flask(__name__)
if os.environ.get('MFSERV_CURRENT_APP_DEBUG', '0') == '1':
    app.config['PROPAGATE_EXCEPTIONS'] = True
logger = mflog.get_logger(__name__)

def handle_sigterm(signum, frame):
    logger.info(f"signal {signum} received")
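    # sleep longer than the 3 s timeout-after-signal, so SIGKILL arrives before the handler finishes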
    time.sleep(30)
    logger.info(f"end of signal {signum} handling")
    exit(0)

signal.signal(signal.SIGTERM, handle_sigterm)

@app.route("/foo/")
def hello_world():
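    # simulate a long request, so a connection is still open on the socket when circus asks for the restart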
    time.sleep(30)
    logger.info("This is an info message")
    return "Hello World !"

Check log:

# plugin is up and running
2024-10-16T13:54:52.204106Z    [DEBUG] (bjoern_wrapper#4068258) process start
2024-10-16T13:54:54.936047Z    [DEBUG] (signal_wrapper#4068173) socket_up on nginx
2024-10-16T13:53:31.698795Z    [DEBUG] (urllib3.connectionpool#4066145) Starting new HTTP connection (1): 127.0.0.1:18868
2024-10-16T13:53:31.700266Z    [DEBUG] (urllib3.connectionpool#4066145) http://127.0.0.1:18868 "GET /__socket_up/L2hvbWUvbWZzZXJ2L3Zhci9hcHBfZm9vX21haW5fMS5zb2NrZXQ= HTTP/1.1" 200 None
# we execute the program through url http://localhost:18868/foo/ (30s duration) in browser
# circus sends a SIGTERM to restart the plugin
2024-10-16T13:55:38.177021Z     [INFO] (signal_wrapper#4068173) received SIGTERM => smart closing...
2024-10-16T13:55:38.178173Z     [INFO] (signal_wrapper#4068173) stopping new connections and waiting for no connection left on /home/mfserv/var/app_foo_main_1.socket (timeout: 60 seconds)
2024-10-16T13:55:38.180883Z    [DEBUG] (urllib3.connectionpool#4068173) Starting new HTTP connection (1): 127.0.0.1:18868
# metwork is waiting for no connection left on socket, calling __socket_down...
# end of the program execution:
2024-10-16T13:55:54.411389Z     [INFO] (main.wsgi#4068258) This is an info message
# __socket_down now answers OK (no connection left on the socket):
2024-10-16T13:55:54.729325Z    [DEBUG] (urllib3.connectionpool#4068173) http://127.0.0.1:18868 "GET /__socket_down/L2hvbWUvbWZzZXJ2L3Zhci9hcHBfZm9vX21haW5fMS5zb2NrZXQ=?wait=1 HTTP/1.1" 200 None
# SIGTERM is sent to the bjoern_wrapper child process:
2024-10-16T13:55:54.730175Z     [INFO] (signal_wrapper#4068173) sending SIGTERM to 4068258
# so the handle_sigterm function is called:
2024-10-16T13:55:54.817808Z     [INFO] (main.wsgi#4068258) signal 15 received
# SIGKILL is received by the process, so the handle_sigterm function cannot finish!
2024-10-16T13:55:58.015018Z  [WARNING] (signal_wrapper#4068173) sending SIGKILL to 4068258 and its children
2024-10-16T13:55:58.050907Z     [INFO] (signal_wrapper#4068173) process: 4068258 (smart) stopped
2024-10-16T13:55:58.431029Z    [DEBUG] (bjoern_wrapper#4071232) process start

=> we notice that the handle_sigterm function cannot finish because a SIGKILL is received...

Do you agree with this way to reproduce, @corentin-bl?

@matthieumarrast
Contributor

matthieumarrast commented Oct 16, 2024

If we look at the above log and at the code of https://github.com/metwork-framework/mfserv/blob/master/adm/signal_wrapper.py, we are in this part of the code when the SIGKILL is triggered:

if SIG_SHUTDOWN is not None and args.timeout_after_signal > 0:
    now = datetime.datetime.now()
    delta = now - SIG_SHUTDOWN
    if delta.total_seconds() > args.timeout_after_signal:
        LOGGER.warning("sending SIGKILL to %i and its children", pid)
        kill_process_and_children(pid)

moreover we notice the 3 seconds of delay in the log:
2024-10-16T13:55:54.817808Z [INFO] (main.wsgi#4068258) signal 15 received
2024-10-16T13:55:58.015018Z [WARNING] (signal_wrapper#4068173) sending SIGKILL to 4068258 and its children

so the timeout_after_signal delay (3 s) is responsible for the SIGKILL during the SIGTERM handling.

If we look at the command in circus.ini:
args = --stdout /home/mfserv/log/app_foo_main.log --stderr STDOUT -- plugin_wrapper foo -- signal_wrapper.py --timeout=60 --signal=15 --timeout-after-signal=3 --socket-up-after=3 /home/mfserv/var/app_foo_main_$(circus.wid).socket -- bjoern_wrapper.py --timeout=60 main.wsgi:app "/home/mfserv/var/app_foo_main_$(circus.wid).socket"
we notice this --timeout-after-signal=3; it comes from smart_stop_delay=3 in the plugin's config.ini (cf. plugins.py).
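
For example, assuming the same config.ini placement as the max_age setting shown earlier, raising the delay would be:

smart_stop_delay=40

which should then show up as --timeout-after-signal=40 in the generated circus.ini command.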

So I set smart_stop_delay=40 and then it works fine:

2024-10-16T16:34:43.876512Z     [INFO] (signal_wrapper#158942) sending SIGTERM to 158983
2024-10-16T16:34:43.917286Z     [INFO] (main.wsgi#158983) signal 15 received
2024-10-16T16:35:13.917849Z     [INFO] (main.wsgi#158983) end of signal 15 handling

@corentin-bl
Author

Hi Matthieu, thanks for looking at this so quickly! I will try changing the configuration parameter to see if it does anything. However, as mentioned in the initial post, it seems that the part of our function that was executed lasted only 0.4 ms, far below 3 s, so I don't know if increasing the timeout will help... I'll test it today and let you know!

@corentin-bl
Author

Unfortunately, and as expected, increasing smart_stop_delay to a greater value (60s here) did not change anything:

2024-10-17T13:54:56.452578Z     [INFO] (signal_wrapper#2491393) sending SIGTERM to 2491470
2024-10-17T13:54:56.525893Z     [INFO] (sacdp#2491470) entering exit_handler function
2024-10-17T13:54:56.526036Z     [INFO] (sacdp#2491470) current_thread_name='MainThread'
2024-10-17T13:54:56.526088Z     [INFO] (sacdp#2491470) exit_handler: received signal SIGTERM
2024-10-17T13:54:56.526129Z     [INFO] (sacdp#2491470) exit_handler: getting connection variable
2024-10-17T13:54:56.526168Z     [INFO] (sacdp#2491470) after get_local
2024-10-17T13:55:56.858723Z  [WARNING] (signal_wrapper#2491393) sending SIGKILL to 2491470 and its children

@corentin-bl
Author

Regarding the way to reproduce, we did not use Flask to build our plugin. We have a script that defines the application(environ, start_response) function, which is called by bjoern_wrapper.py. I don't know whether this is different from a Flask app; I would say it doesn't matter, but I'd rather let you know.
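
For reference, the shape of that kind of plain WSGI entry point is roughly the following (a minimal sketch, not our actual script):

def application(environ, start_response):
    """Plain WSGI entry point called by bjoern_wrapper.py."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Hello World!\n"]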

@corentin-bl
Author

corentin-bl commented Oct 18, 2024

Hello @matthieumarrast, I have made some progress on debugging and I have found two things:

  • on the one hand, we have wrapped the built-in function getattr in a custom function. The wrapper calls getattr and is set up so that None is returned if the attribute is not found. During my tests, I saw that, even when getattr returns None, our wrapper function does not return anything (I don't understand why so far). The variable is later reused in the function, which is fine if it exists, but causes a NameError if not. This looks like a problem on our side (for me and for @mrechte, who wrote the initial code).
  • on the other hand, I think there is something to look at on metwork's side. As said above, we get a NameError when we use a non-existing variable, but we could never see it before. I only discovered it because you advised me to use try/except statements (which I combined with traceback.format_exc(), see the sketch after this list); otherwise mfserv would never have told me. I think mfdata is able to catch bubbled-up errors and to display them in the logs; is it normal that mfserv acts differently?
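
For reference, the try/except pattern I used inside the handler looks roughly like this (a minimal sketch with a hypothetical close_connection() standing in for our real database code, not the actual plugin):

import signal
import sys
import traceback

import mflog

logger = mflog.get_logger(__name__)


def close_connection():
    # hypothetical stand-in for closing our real database connection
    logger.info("closing database connection")


def exit_handler(signum, frame):
    try:
        logger.info(f"exit_handler: received signal {signum}")
        close_connection()
        logger.info("exit_handler: cleanup done")
    except Exception:
        # without this, the error never shows up in the plugin's log
        logger.error(traceback.format_exc())
    sys.exit(0)


signal.signal(signal.SIGTERM, exit_handler)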

And a final question: now that I am able to catch the NameError, I managed to run the function until the end. The last statement is a sys.exit(0), which should mean that the SIGTERM alone is enough to stop the process. However, we still see in the logs that a SIGKILL is sent; is that normal?

@matthieumarrast
Contributor

  • Can you provide here a piece of code that shows mfserv and mfdata behave differently regarding "bubbled-up" errors?

  • I also noticed that the sys.exit(0) has no effect on the loop waiting for the smart stop delay, and a SIGKILL is sent anyway... but I have not analysed that point. I agree that this doesn't seem normal.

@corentin-bl
Author

corentin-bl commented Oct 23, 2024

Hi Matthieu, here are the two examples:

mfdata:

#!/usr/bin/python3
"""
Demo code for GitHub issue
"""

from acquisition import AcquisitionStep


class GitHubDemo(AcquisitionStep):
    """Demo class."""

    plugin_name = "insert_cdp"
    step_name = "github_issue_demo"

    def process(self, xaf):
        """Demo method"""
        self.info(f"process for file {xaf.filepath}")

        my_var = 1

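        # deliberately failing assertion, used to compare error reporting with the mfserv example below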
        assert my_var == 0

        return True

    def plugin_path(self):
        """Other method"""
        path = self.get_plugin_directory_path()

        return path


if __name__ == "__main__":
    x = GitHubDemo()
    x.run()

mfserv:

"""
Demo code for GitHub issue
"""

import signal


def exit_handler(sig, frame):
    """Exit function where the error happens."""
    print("Entering exit_handler")
    my_var = 1
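    # deliberately failing assertion raised from inside the SIGTERM handler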
    assert my_var == 0
    print("After the assertion error")
    return True


signal.signal(signal.SIGTERM, exit_handler)


def application(environ, start_response):
    """Main function"""
    print("Entering application function")
    return True

Let me know if this works as is or if you need further info (e.g. from the config file).

While preparing the examples, I noticed that in the case of mfserv, errors happening in the application function were caught, but as you will see in the example, errors in the exit_handler are not...

@corentin-bl
Author

Hi @matthieumarrast, a quick follow-up regarding my message from Oct. 18th. On our side, I found the cause of the problem: I made a stupid typo which prevented the code from getting the correct variable, hence the NameError. However, the fact that the error was not caught by the parent function (or at least not visible in the app logs) is still worth looking into in my opinion. If you investigate this further, let me know if I can help.

@matthieumarrast
Contributor

https://docs.python.org/3/library/signal.html

"If the handler raises an exception, it will be raised “out of thin air” in the main thread. See the note below for a discussion."

Interesting article to read here:
Never ever, ever raise a regular exception in a Python signal handler. -> https://anonbadger.wordpress.com/2018/12/15/python-signal-handlers-and-exceptions/

The key to understanding this is to understand that when the signal handler is invoked, Python adds the call to the signal handler onto its current call stack (The call stack roughly represents the nesting of functions that lead up to where the code is currently executing. So, in our example, inside of the signal handler, the call stack would start with the module toplevel, then long_function(), and finally handler(). If you look at the traceback from the first example, you can see exactly that call stack leading you through all the function calls to the point where the exception was raised.) When the signal handler raises its exception, Python unwinds the call stack one function at a time to find an exception handler (not to be confused with our signal handler) which can handle the exception....
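
To illustrate the article's point, here is a minimal sketch (my own example, not taken from the article or from mfserv) of the safer pattern: the handler only sets a flag, and the real cleanup and the exit itself happen in the main flow, where exceptions propagate and can be logged normally:

import signal
import sys
import time

SHUTDOWN_REQUESTED = False


def handle_sigterm(signum, frame):
    # keep the handler trivial: just record that a shutdown was requested
    global SHUTDOWN_REQUESTED
    SHUTDOWN_REQUESTED = True


signal.signal(signal.SIGTERM, handle_sigterm)

while not SHUTDOWN_REQUESTED:
    time.sleep(0.1)  # stand-in for the real work loop

# cleanup (e.g. closing a database connection) happens here,
# outside the signal handler, where errors can be caught and logged
sys.exit(0)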
