DagFileProcessorManager Death Spiral and Defunct Scheduler Processes #63749
Unanswered
zacharyloeffler-creator
asked this question in Q&A
Replies: 0 comments
Description
We are experiencing a persistent "death spiral" in which the DagFileProcessorManager fails its heartbeats and enters a restart loop. Scheduler processes become immune to `SIGTERM` and even `SIGKILL`, which eventually leads to a build-up of defunct processes and scheduler failure. The environment then becomes inoperable: DAGs cannot run and our code base will not parse.

The scheduler processes stay in the process table as defunct or in an uninterruptible sleep state. Our theory is that an interaction between top-level Python code and the shared filesystem causes an issue at the OS level that prevents these processes from being killed gracefully. However, we have not been able to reproduce the issue on demand to identify a single cause.
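A process that ignores even `SIGKILL` is almost always blocked in uninterruptible sleep ("D" state) inside a kernel call. A minimal Linux-only sketch (standard library only, not our actual tooling) for reading the kernel wait channel of a stuck PID, which can support or refute the shared-filesystem theory:

```python
# Read the kernel wait channel of a process from /proc (Linux only).
# For a scheduler PID stuck in "D" state, NFS/RPC symbols such as
# rpc_wait_bit_killable would point at the shared filesystem; "0"
# means the process is not currently blocked inside the kernel.
def wait_channel(pid: int) -> str:
    with open(f"/proc/{pid}/wchan") as f:
        return f.read().strip() or "0"

# Example: inspect a stuck scheduler PID found with ps (placeholder PID).
# print(wait_channel(12345))
```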
Environment
Key Log Signature
Based on the scheduler logs, the following is where the DagFileProcessorManager "death spiral" actually starts:
Additionally, here are some of the defunct scheduler processes we are seeing:
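As a hedged sketch of how such a listing can be gathered on Linux with only the standard library (not our actual tooling):

```python
import os

def stuck_processes():
    """List (pid, state, comm) for processes in Z (defunct) or D
    (uninterruptible sleep) -- the two states described above that
    do not respond to signals."""
    stuck = []
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open(f"/proc/{pid}/stat") as f:
                raw = f.read()
        except OSError:
            continue  # process exited while we were scanning
        # comm is parenthesized and may contain spaces; split on the
        # last ")" so the state field that follows is parsed reliably.
        head, _, rest = raw.rpartition(")")
        comm = head.split("(", 1)[1]
        state = rest.split()[0]
        if state in ("Z", "D"):
            stuck.append((int(pid), state, comm))
    return stuck
```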

Code Pattern
We believe these DAGs are responsible for the issue, because in both occurrences the problem appeared after these DAGs were scheduled and executed. The code snippet of our DAG definition file has been redacted somewhat, but the logic is identical to our setup.
This loop generates 4 DAGs from the yaml config file, and we are not importing any heavy libraries directly in the DAG definition file.
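A hypothetical reconstruction of that pattern (the real file is redacted; a plain dict stands in for `airflow.DAG` so the parse-time behavior is visible without an Airflow install, and the yaml read is inlined as a literal):

```python
# Hypothetical sketch of the dynamic-DAG pattern described above; names
# and values are placeholders. In the real file each entry constructs an
# airflow.DAG, and CONFIG comes from yaml.safe_load() on a config file
# that lives on the shared filesystem.
#
# The config read happens at module top level, so it re-runs on every
# parse pass of the DagFileProcessorManager -- a hung read here runs
# inside the parsing subprocess.
CONFIG = {  # stands in for: yaml.safe_load(open(".../config.yaml"))
    "dag_configs": [
        {"dag_id": f"example_dag_{i}", "schedule": "@daily"} for i in range(4)
    ]
}

for cfg in CONFIG["dag_configs"]:
    # In reality: dag = DAG(dag_id=cfg["dag_id"], schedule=cfg["schedule"], ...)
    dag = {"dag_id": cfg["dag_id"], "schedule": cfg["schedule"]}
    # Top-level assignment into globals() is how Airflow's DagBag
    # discovers dynamically generated DAGs in a definition file.
    globals()[cfg["dag_id"]] = dag
```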
Troubleshooting Performed
We isolated the dynamic DAGs into a separate, clean Airflow environment, and the issue did not immediately recur.
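One way to extend that isolation test is to import the DAG file standalone, roughly the way the DagFileProcessor does, with an alarm so a slow parse surfaces as an exception. This is a sketch with a hypothetical entry point, not our actual harness; note that `SIGALRM` can interrupt a slow but interruptible read, while a true D-state hang will not be interrupted:

```python
import importlib.util
import signal

def parse_dag_file(path: str, timeout_s: int = 30):
    """Import a DAG definition file standalone, failing loudly if the
    top-level code (e.g. a config read from a shared filesystem) takes
    longer than timeout_s. SIGALRM cannot break a true uninterruptible
    (D-state) hang, only a slow-but-interruptible one."""
    def _alarm(signum, frame):
        raise TimeoutError(f"parsing {path} exceeded {timeout_s}s")

    previous = signal.signal(signal.SIGALRM, _alarm)
    signal.alarm(timeout_s)
    try:
        spec = importlib.util.spec_from_file_location("dag_under_test", path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        return module
    finally:
        signal.alarm(0)
        signal.signal(signal.SIGALRM, previous)
```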
Questions