Skip to content

Conversation

@mmac-m3a
Copy link

@mmac-m3a mmac-m3a commented Nov 1, 2024

Summary

Adding new command line and config option worker_healthcheck_timeout which sets the timeout for worker liveness from the supervisor when multiple workers are in use. Default timeout is unchanged as well as frequency of health checks.

Rationale

Applications with CPU intensive synchronous startup may starve the worker process for CPU cycles and make the pong thread generate response too late, which in turn makes the supervisor kill and relaunch the worker.

Checklist

  • I understand that this PR may be closed in case there was no previous discussion. (This doesn't apply to typos!)
  • I've added a test for each change that was introduced, and I tried as much as possible to make a single atomic change.
  • I've updated the documentation accordingly.

Rationale for no explicit test

I was not able to create simple unit test that would reliably trigger the health check timeout. I can 100% reliably trigger it in my application and I've also verified that longer health check timeout resolves the problem.

@omid-jf
Copy link

omid-jf commented Dec 5, 2024

I was receiving many "Child Process Died" messages in my FastAPI application. Manually increasing the timeout mentioned in this PR fixed my issue.

@racinmat
Copy link

Hi, is anything blocking merging this PR? It will solve #2450 for many cases, because you could pass higher timeout.

@vjeranc
Copy link

vjeranc commented May 20, 2025

This was discussed a while ago, the fix to increase the timeout was known then, so not sure what's the blocker.

IMO, there should be no pinging or ponging. OS is not stupid, either process is suspended or it is not. If process is stuck, either in a deadlock or in CPU heavy operation and would be considered "inactive", that's another issue.

Anyone hosting web apps should have their own health checks and restart the process/container.

Similarly, processes/workers in uvicorn are created with a very expensive fork. It would be better to just invoke uvicorn N times at the web app level and completely remove "workers" concept. The savings on the socket level (all processes reading from the same socket is abysmal), cheaper fork would allow lower memory footprint, a much better saving, which is not the case now.

If a cheaper fork was done instead, it would make some sense to manage process liveness at uvicorn level. With current expensive fork, nothing is gained by the workers=N argument, as we see in this case, it just created more bugs.

@Dhavaleswar
Copy link

waiting for this change to be merged

@markdoliner
Copy link

vjeranc commented on May 20:

Similarly, processes/workers in uvicorn are created with a very expensive fork. ... cheaper fork would allow lower memory footprint, a much better saving, which is not the case now.

I saw this comment and just wanted to leave a warning: I believe that fork() can have undefined behavior if an application has running threads. That's the reason multiprocessing moved away from it as the default and has been showing DeprecationWarnings about it. See this section of the docs. "On macOS, the spawn start method is now the default. The fork start method should be considered unsafe as it can lead to crashes of the subprocess as macOS system libraries may start threads. See bpo-33725."

@vjeranc
Copy link

vjeranc commented Oct 6, 2025

This is a normal way fork works and forking threads are copied to the child process while others vanish.

Using spawn is just a monkey patch over unclear fork misbehavior. Instead of fixing the underlying issue around thread lifetimes, we're deciding on inferior performance. This choice created a 1+ year issue where child processes were killed because their imports were being redone and were taking more than the health check.

Uvicorn having a ping-pong protocol to figure out if child process is responsive is monkey-patch to handle child processes stuck in infinite noncooperative loops. OS never lies about children and we can easily ask OS to check if child process is alive. I don't think it is uvicorn's job to kill processes stuck in noncooperative loops, we already deploy apps in docker images and have separate health checking protocols for our http servers.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants