Airflow scheduler will crash when connection to the database drops, but container will not stop #2661
Labels
💻 aspect: code (Concerns the software code in the repository)
🛠 goal: fix (Bug fix)
help wanted (Open to participation from the community)
🟨 priority: medium (Not blocking but should be addressed soon)
🧱 stack: catalog (Related to the catalog and Airflow DAGs)
🐳 tech: docker (Involves Docker)
Description
We have had a few occasions where the upstream database gets restarted, which means the scheduler cannot communicate with it. This crashes the scheduler with the following logs:
Crucially, though, this does not stop the scheduler container. If it did, the restart policy on the containers should spin up a new scheduler container, repeating until the database is back and the scheduler can communicate with the backend once again. However, because the process hangs and the container does not exit, the scheduler stops running and no DAGs are run. We will be alerted to this in the future with #2335, but we should also find a way to exit the container appropriately when crashes like this happen.
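One possible direction, as a rough sketch rather than a proposed implementation: run the scheduler under a small supervisor that becomes the container's main process, so the container exits whenever the scheduler exits or stops responding. The function name `supervise`, the `is_healthy` callback, and the default interval and failure threshold are all hypothetical; which health probe to use (and whether the crashed scheduler exits or hangs) would need to be checked against our actual setup.

```python
import subprocess
import sys
import time
from typing import Callable, Sequence


def supervise(
    cmd: Sequence[str],
    is_healthy: Callable[[], bool],
    interval: float = 30.0,   # hypothetical default: seconds between health probes
    max_failures: int = 3,    # hypothetical default: consecutive failures tolerated
) -> int:
    """Run `cmd` and return the exit code the container should exit with.

    Returns the child's own exit code if it exits by itself, or 1 if the
    child had to be stopped after `max_failures` consecutive failed probes.
    """
    child = subprocess.Popen(cmd)
    failures = 0
    while True:
        try:
            # Returns as soon as the child exits; otherwise times out after `interval`.
            code = child.wait(timeout=interval)
            # A negative code means the child was killed by a signal; report failure.
            return code if code >= 0 else 1
        except subprocess.TimeoutExpired:
            pass  # child is still running, so probe its health below

        # Reset the counter on success so only consecutive failures count.
        failures = 0 if is_healthy() else failures + 1
        if failures >= max_failures:
            child.terminate()
            try:
                child.wait(timeout=10)
            except subprocess.TimeoutExpired:
                child.kill()  # escalate if the child ignores SIGTERM
                child.wait()
            return 1


# Hypothetical usage as the container entrypoint (probe left as a placeholder):
#   sys.exit(supervise(["airflow", "scheduler"], is_healthy=my_probe))
```

With something like this as the entrypoint, a hung scheduler results in a non-zero container exit, which lets the existing restart policy take over. The probe itself is deliberately left open; it could be a database connectivity check or a scheduler health command, whichever turns out to be reliable for us.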
Reproduction
1. `just c`, then `just logs scheduler`. Wait for the scheduler logs to stabilize after initialization.
2. `docker stop openverse-upstream_db-1`
3. `docker ps` to see the scheduler container is still running.