@@ -167,55 +167,20 @@ These can be useful if your code has extra knowledge about its environment and w

.. _concepts:zombies:

- Zombie/Undead Tasks
- -------------------
+ Zombie Tasks
+ ------------

- No system runs perfectly, and task instances are expected to die once in a while. Airflow detects two kinds of task/process mismatch:
+ No system runs perfectly, and task instances are expected to die once in a while.

- * *Zombie tasks* are ``TaskInstances`` stuck in a ``running`` state despite their associated jobs being inactive
- (e.g. their process did not send a recent heartbeat as it got killed, or the machine died). Airflow will find these
- periodically, clean them up, and either fail or retry the task depending on its settings. Tasks can become zombies for
- many reasons, including:
+ *Zombie tasks* are ``TaskInstances`` stuck in a ``running`` state despite their associated jobs being inactive
+ (e.g. their process did not send a recent heartbeat as it got killed, or the machine died). Airflow will find these
+ periodically, clean them up, and either fail or retry the task depending on its settings. Tasks can become zombies for
+ many reasons, including:

- * The Airflow worker ran out of memory and was OOMKilled.
- * The Airflow worker failed its liveness probe, so the system (for example, Kubernetes) restarted the worker.
- * The system (for example, Kubernetes) scaled down and moved an Airflow worker from one node to another.
+ * The Airflow worker ran out of memory and was OOMKilled.
+ * The Airflow worker failed its liveness probe, so the system (for example, Kubernetes) restarted the worker.
+ * The system (for example, Kubernetes) scaled down and moved an Airflow worker from one node to another.

- * *Undead tasks* are tasks that are *not* supposed to be running but are, often caused when you manually edit Task
- Instances via the UI. Airflow will find them periodically and terminate them.
-
-
- Below is the code snippet from the Airflow scheduler that runs periodically to detect zombie/undead tasks.
-
- .. exampleinclude:: /../../airflow/jobs/scheduler_job_runner.py
-     :language: python
-     :start-after: [START find_and_purge_zombies]
-     :end-before: [END find_and_purge_zombies]
-
-
- The explanation of the criteria used in the above snippet to detect zombie tasks is as below:
-
- 1. **Task Instance State**
-
- Only task instances in the RUNNING state are considered potential zombies.
-
- 2. **Job State and Heartbeat Check**
-
- Zombie tasks are identified if the associated job is not in the RUNNING state or if the latest heartbeat of the job is
- earlier than the calculated time threshold (limit_dttm). The heartbeat is a mechanism to indicate that a task or job is
- still alive and running.
-
- 3. **Job Type**
-
- The job associated with the task must be of type ``LocalTaskJob``.
-
- 4. **Queued by Job ID**
-
- Only tasks queued by the same job that is currently being processed are considered.
-
- These conditions collectively help identify running tasks that may be zombies based on their state, associated job
- state, heartbeat status, job type, and the specific job that queued them. If a task meets these criteria, it is
- considered a potential zombie, and further actions, such as logging and sending a callback request, are taken.
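
Note on the removed text: the four detection criteria listed in the removed lines can be sketched in plain Python. This is only an illustration under stated assumptions, not the scheduler's actual code. The `Job` and `TaskInstance` dataclasses, the `find_zombies` function, the lowercase string state values, and the 300-second default threshold are all hypothetical stand-ins chosen for the example; only the criteria themselves and the name `limit_dttm` come from the text.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Job:
    """Hypothetical stand-in for the job that supervises a task instance."""
    job_type: str
    state: str
    latest_heartbeat: datetime


@dataclass
class TaskInstance:
    """Hypothetical stand-in for a task instance and its associated job."""
    task_id: str
    state: str
    queued_by_job_id: int
    job: Job


def find_zombies(task_instances, scheduler_job_id, now, threshold_secs=300):
    """Return the task instances that meet all four criteria from the text.

    threshold_secs is an illustrative value, not taken from the source.
    """
    # Heartbeats older than this cutoff count as stale.
    limit_dttm = now - timedelta(seconds=threshold_secs)
    zombies = []
    for ti in task_instances:
        # 1. Only RUNNING task instances are candidates.
        if ti.state != "running":
            continue
        # 3. The associated job must be a LocalTaskJob.
        if ti.job.job_type != "LocalTaskJob":
            continue
        # 4. Only tasks queued by the job doing the check are considered.
        if ti.queued_by_job_id != scheduler_job_id:
            continue
        # 2. The job is no longer RUNNING, or its heartbeat is too old.
        job_inactive = ti.job.state != "running"
        heartbeat_stale = ti.job.latest_heartbeat < limit_dttm
        if job_inactive or heartbeat_stale:
            zombies.append(ti)
    return zombies
```

In this sketch a task instance whose job is still `running` but last sent a heartbeat before `limit_dttm` is reported, as is one whose job has already left the `running` state; a task instance with a fresh heartbeat and a running job is not.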
219
184
220
185
Reproducing zombie tasks locally
221
186
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~