DagFileProcessorManager Death Spiral and Defunct Scheduler Processes #63749
Unanswered
zacharyloeffler-creator
asked this question in Q&A
Replies: 0 comments
Description
We are experiencing a persistent "death spiral" in which the DagFileProcessorManager fails its heartbeats and enters a restart loop. Scheduler processes become immune to `SIGTERM` and even `SIGKILL`, which eventually leads to a build-up of defunct processes and scheduler failure. The environment then becomes inoperable: DAGs cannot run and our code base will not parse.

The scheduler processes stay in the process table as defunct or in an uninterruptible sleep state. Our theory is that an interaction between top-level Python code and the shared filesystem causes an issue at the OS level that prevents these processes from being killed gracefully. However, we have not been able to reproduce the issue on demand to identify a single cause.
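A process that ignores even `SIGKILL` is almost always blocked in uninterruptible sleep ("D" state) inside a kernel call. A minimal Linux-only sketch (standard library only, not our actual tooling) for reading the kernel wait channel of a stuck PID, which can support or refute the shared-filesystem theory:

```python
# Read the kernel wait channel of a process from /proc (Linux only).
# For a scheduler PID stuck in "D" state, NFS/RPC symbols such as
# rpc_wait_bit_killable would point at the shared filesystem; "0"
# means the process is not currently blocked inside the kernel.
def wait_channel(pid: int) -> str:
    with open(f"/proc/{pid}/wchan") as f:
        return f.read().strip() or "0"

# Example: inspect a stuck scheduler PID found with ps (placeholder PID).
# print(wait_channel(12345))
```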
Environment
Key Log Signature
Based on the scheduler logs, the following is where the DagFileProcessorManager "death spiral" actually starts:
Additionally, here are some of the defunct scheduler processes we are seeing:
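As a hedged sketch of how such a listing can be gathered on Linux with only the standard library (not our actual tooling):

```python
import os

def stuck_processes():
    """List (pid, state, comm) for processes in Z (defunct) or D
    (uninterruptible sleep) -- the two states described above that
    do not respond to signals."""
    stuck = []
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open(f"/proc/{pid}/stat") as f:
                raw = f.read()
        except OSError:
            continue  # process exited while we were scanning
        # comm is parenthesized and may contain spaces; split on the
        # last ")" so the state field that follows is parsed reliably.
        head, _, rest = raw.rpartition(")")
        comm = head.split("(", 1)[1]
        state = rest.split()[0]
        if state in ("Z", "D"):
            stuck.append((int(pid), state, comm))
    return stuck
```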

Code Pattern
We believe these DAGs are responsible for the issue, because in both occurrences the problem appeared after these DAGs were scheduled and executed. The code snippet of our DAG definition file has been redacted somewhat, but the logic is identical to our setup.
This loop generates 4 DAGs from the yaml config file, and we are not importing any heavy libraries directly in the DAG definition file.
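A hypothetical reconstruction of that pattern (the real file is redacted; a plain dict stands in for `airflow.DAG` so the parse-time behavior is visible without an Airflow install, and the yaml read is inlined as a literal):

```python
# Hypothetical sketch of the dynamic-DAG pattern described above; names
# and values are placeholders. In the real file each entry constructs an
# airflow.DAG, and CONFIG comes from yaml.safe_load() on a config file
# that lives on the shared filesystem.
#
# The config read happens at module top level, so it re-runs on every
# parse pass of the DagFileProcessorManager -- a hung read here runs
# inside the parsing subprocess.
CONFIG = {  # stands in for: yaml.safe_load(open(".../config.yaml"))
    "dag_configs": [
        {"dag_id": f"example_dag_{i}", "schedule": "@daily"} for i in range(4)
    ]
}

for cfg in CONFIG["dag_configs"]:
    # In reality: dag = DAG(dag_id=cfg["dag_id"], schedule=cfg["schedule"], ...)
    dag = {"dag_id": cfg["dag_id"], "schedule": cfg["schedule"]}
    # Top-level assignment into globals() is how Airflow's DagBag
    # discovers dynamically generated DAGs in a definition file.
    globals()[cfg["dag_id"]] = dag
```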
Troubleshooting Performed
We isolated the dynamic DAGs into a separate, clean Airflow environment, and the issue did not immediately recur.
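One way to extend that isolation test is to import the DAG file standalone, roughly the way the DagFileProcessor does, with an alarm so a slow parse surfaces as an exception. This is a sketch with a hypothetical entry point, not our actual harness; note that `SIGALRM` can interrupt a slow but interruptible read, while a true D-state hang will not be interrupted:

```python
import importlib.util
import signal

def parse_dag_file(path: str, timeout_s: int = 30):
    """Import a DAG definition file standalone, failing loudly if the
    top-level code (e.g. a config read from a shared filesystem) takes
    longer than timeout_s. SIGALRM cannot break a true uninterruptible
    (D-state) hang, only a slow-but-interruptible one."""
    def _alarm(signum, frame):
        raise TimeoutError(f"parsing {path} exceeded {timeout_s}s")

    previous = signal.signal(signal.SIGALRM, _alarm)
    signal.alarm(timeout_s)
    try:
        spec = importlib.util.spec_from_file_location("dag_under_test", path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        return module
    finally:
        signal.alarm(0)
        signal.signal(signal.SIGALRM, previous)
```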
Questions