Airflow scheduler will crash when connection to the database drops, but container will not stop #2661

Closed
AetherUnbound opened this issue Jul 17, 2023 · 2 comments
Assignees: AetherUnbound
Labels: 💻 aspect: code · 🛠 goal: fix · help wanted · 🟨 priority: medium · 🧱 stack: catalog · 🐳 tech: docker

Comments

@AetherUnbound
Collaborator

Description

We have had a few occasions where the upstream database gets restarted, which means the scheduler cannot communicate with it. This crashes the scheduler with the following logs:

[2023-07-15T08:34:12.556+0000] {local_executor.py:400} INFO - Shutting down LocalExecutor; waiting for running tasks to finish.  Signal again if you don't want to wait.
[2023-07-15T15:09:53.307+0000] {base.py:73} INFO - Using connection ID 'aws_default' for task execution.
[2023-07-15T15:09:53.308+0000] {connection_wrapper.py:337} INFO - AWS Connection (conn_id='aws_default', conn_type='aws') credentials retrieved from login and password.
[2023-07-15T15:09:54.631+0000] {base.py:73} INFO - Using connection ID 'aws_default' for task execution.
[2023-07-15T15:09:54.632+0000] {connection_wrapper.py:337} INFO - AWS Connection (conn_id='aws_default', conn_type='aws') credentials retrieved from login and password.
[2023-07-15T15:09:54.940+0000] {process_utils.py:131} INFO - Sending Signals.SIGTERM to group 123. PIDs of all processes in the group: []
[2023-07-15T15:09:54.941+0000] {process_utils.py:86} INFO - Sending the signal Signals.SIGTERM to group 123
[2023-07-15T15:09:54.941+0000] {process_utils.py:100} INFO - Sending the signal Signals.SIGTERM to process 123 as process group is missing.
[2023-07-15T15:09:54.941+0000] {scheduler_job_runner.py:871} INFO - Exited execute loop
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 3366, in _wrap_pool_connect
    return fn()
  File "/home/airflow/.local/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 327, in connect
    return _ConnectionFairy._checkout(self)
  File "/home/airflow/.local/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 894, in _checkout
    fairy = _ConnectionRecord.checkout(pool)
  File "/home/airflow/.local/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 497, in checkout
    with util.safe_reraise():
  File "/home/airflow/.local/lib/python3.10/site-packages/sqlalchemy/util/langhelpers.py", line 70, in __exit__
    compat.raise_(
  File "/home/airflow/.local/lib/python3.10/site-packages/sqlalchemy/util/compat.py", line 211, in raise_
    raise exception
  File "/home/airflow/.local/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 495, in checkout
    dbapi_connection = rec.get_connection()
  File "/home/airflow/.local/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 630, in get_connection
    self.__connect()
  File "/home/airflow/.local/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 690, in __connect
    with util.safe_reraise():
  File "/home/airflow/.local/lib/python3.10/site-packages/sqlalchemy/util/langhelpers.py", line 70, in __exit__
    compat.raise_(
  File "/home/airflow/.local/lib/python3.10/site-packages/sqlalchemy/util/compat.py", line 211, in raise_
    raise exception
  File "/home/airflow/.local/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 686, in __connect
    self.dbapi_connection = connection = pool._invoke_creator(self)
  File "/home/airflow/.local/lib/python3.10/site-packages/sqlalchemy/engine/create.py", line 574, in connect
    return dialect.connect(*cargs, **cparams)
  File "/home/airflow/.local/lib/python3.10/site-packages/sqlalchemy/engine/default.py", line 598, in connect
    return self.dbapi.connect(*cargs, **cparams)
  File "/home/airflow/.local/lib/python3.10/site-packages/psycopg2/__init__.py", line 122, in connect
    conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
psycopg2.OperationalError: connection to server at "[REDACTED]" ([IP REDACTED]), port 5432 failed: Connection refused
	Is the server running on that host and accepting TCP/IP connections?


The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/airflow/.local/bin/airflow", line 8, in <module>
    sys.exit(main())
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/__main__.py", line 48, in main
    args.func(args)
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/cli/cli_config.py", line 51, in command
    return func(*args, **kwargs)
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/utils/cli.py", line 112, in wrapper
    return f(*args, **kwargs)
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/cli/commands/scheduler_command.py", line 77, in scheduler
    _run_scheduler_job(job_runner, skip_serve_logs=args.skip_serve_logs)
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/cli/commands/scheduler_command.py", line 42, in _run_scheduler_job
    run_job(job=job_runner.job, execute_callable=job_runner._execute)
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/utils/session.py", line 76, in wrapper
    return func(*args, session=session, **kwargs)
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/jobs/job.py", line 284, in run_job
    return execute_job(job, execute_callable=execute_callable)
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/jobs/job.py", line 313, in execute_job
    ret = execute_callable()
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/jobs/scheduler_job_runner.py", line 842, in _execute
    self._run_scheduler_loop()
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/jobs/scheduler_job_runner.py", line 976, in _run_scheduler_loop
    num_queued_tis = self._do_scheduling(session)
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/jobs/scheduler_job_runner.py", line 1050, in _do_scheduling
    self._create_dagruns_for_dags(guard, session)
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/utils/retries.py", line 78, in wrapped_function
    for attempt in run_with_db_retries(max_retries=retries, logger=logger, **retry_kwargs):
  File "/home/airflow/.local/lib/python3.10/site-packages/tenacity/__init__.py", line 347, in __iter__
    do = self.iter(retry_state=retry_state)
  File "/home/airflow/.local/lib/python3.10/site-packages/tenacity/__init__.py", line 325, in iter
    raise retry_exc.reraise()
  File "/home/airflow/.local/lib/python3.10/site-packages/tenacity/__init__.py", line 158, in reraise
    raise self.last_attempt.result()
  File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 451, in result
    return self.__get_result()
  File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/utils/retries.py", line 87, in wrapped_function
    return func(*args, **kwargs)
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/jobs/scheduler_job_runner.py", line 1112, in _create_dagruns_for_dags
    query, dataset_triggered_dag_info = DagModel.dags_needing_dagruns(session)
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/models/dag.py", line 3412, in dags_needing_dagruns
    .all()
  File "/home/airflow/.local/lib/python3.10/site-packages/sqlalchemy/orm/query.py", line 2773, in all
    return self._iter().all()
  File "/home/airflow/.local/lib/python3.10/site-packages/sqlalchemy/orm/query.py", line 2916, in _iter
    result = self.session.execute(
  File "/home/airflow/.local/lib/python3.10/site-packages/sqlalchemy/orm/session.py", line 1716, in execute
    conn = self._connection_for_bind(bind)
  File "/home/airflow/.local/lib/python3.10/site-packages/sqlalchemy/orm/session.py", line 1555, in _connection_for_bind
    return self._transaction._connection_for_bind(
  File "/home/airflow/.local/lib/python3.10/site-packages/sqlalchemy/orm/session.py", line 750, in _connection_for_bind
    conn = bind.connect()
  File "/home/airflow/.local/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 3320, in connect
    return self._connection_cls(self, close_with_result=close_with_result)
  File "/home/airflow/.local/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 96, in __init__
    else engine.raw_connection()
  File "/home/airflow/.local/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 3399, in raw_connection
    return self._wrap_pool_connect(self.pool.connect, _connection)
  File "/home/airflow/.local/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 3369, in _wrap_pool_connect
    Connection._handle_dbapi_exception_noconnection(
  File "/home/airflow/.local/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 2203, in _handle_dbapi_exception_noconnection
    util.raise_(
  File "/home/airflow/.local/lib/python3.10/site-packages/sqlalchemy/util/compat.py", line 211, in raise_
    raise exception
  File "/home/airflow/.local/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 3366, in _wrap_pool_connect
    return fn()
  File "/home/airflow/.local/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 327, in connect
    return _ConnectionFairy._checkout(self)
  File "/home/airflow/.local/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 894, in _checkout
    fairy = _ConnectionRecord.checkout(pool)
  File "/home/airflow/.local/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 497, in checkout
    with util.safe_reraise():
  File "/home/airflow/.local/lib/python3.10/site-packages/sqlalchemy/util/langhelpers.py", line 70, in __exit__
    compat.raise_(
  File "/home/airflow/.local/lib/python3.10/site-packages/sqlalchemy/util/compat.py", line 211, in raise_
    raise exception
  File "/home/airflow/.local/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 495, in checkout
    dbapi_connection = rec.get_connection()
  File "/home/airflow/.local/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 630, in get_connection
    self.__connect()
  File "/home/airflow/.local/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 690, in __connect
    with util.safe_reraise():
  File "/home/airflow/.local/lib/python3.10/site-packages/sqlalchemy/util/langhelpers.py", line 70, in __exit__
    compat.raise_(
  File "/home/airflow/.local/lib/python3.10/site-packages/sqlalchemy/util/compat.py", line 211, in raise_
    raise exception
  File "/home/airflow/.local/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 686, in __connect
    self.dbapi_connection = connection = pool._invoke_creator(self)
  File "/home/airflow/.local/lib/python3.10/site-packages/sqlalchemy/engine/create.py", line 574, in connect
    return dialect.connect(*cargs, **cparams)
  File "/home/airflow/.local/lib/python3.10/site-packages/sqlalchemy/engine/default.py", line 598, in connect
    return self.dbapi.connect(*cargs, **cparams)
  File "/home/airflow/.local/lib/python3.10/site-packages/psycopg2/__init__.py", line 122, in connect
    conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) connection to server at "[REDACTED]" ([IP REDACTED]), port 5432 failed: Connection refused
	Is the server running on that host and accepting TCP/IP connections?

(Background on this error at: https://sqlalche.me/e/14/e3q8)

Crucially, though, this does not stop the scheduler container. If it did, the restart policy on the containers would spin up a new scheduler container until the database is back up and the scheduler can communicate with the backend again. However, because the process hangs and the container does not exit, the scheduler stops running and no DAGs are run. We will be alerted to this in the future with #2335, but we should also find a way to exit the container appropriately when crashes like this happen.
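For context, a Docker restart policy only kicks in once the container's main process (PID 1) actually exits; if that process hangs, nothing restarts it. A throwaway sketch (using a plain alpine container, not our actual compose setup) of the behaviour we want from the scheduler container:

```bash
# Sketch only: a container whose PID 1 exits non-zero is restarted under
# `--restart always`; one that hangs without exiting never is.
docker run -d --name restart-demo --restart always alpine sh -c 'sleep 2; exit 1'
sleep 10
docker inspect -f '{{.RestartCount}}' restart-demo   # non-zero: Docker is restarting it
docker rm -f restart-demo
```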

Reproduction

  1. Run just c, then just logs scheduler. Wait for the scheduler logs to stabilize after initialization.
  2. In a separate window, run docker stop openverse-upstream_db-1
  3. Observe in the logs that the above error occurs, but the container does not exit. This can be confirmed by running docker ps to see the scheduler container is still running (see the snippet after this list).
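A quick way to confirm step 3 (the container name here is a guess based on compose naming defaults; use whatever docker ps reports locally):

```bash
# Show the scheduler container's status and exit code, if any.
docker ps --filter name=scheduler --format '{{.Names}}: {{.Status}}'
docker inspect -f 'status={{.State.Status}} exit_code={{.State.ExitCode}}' openverse-scheduler-1
```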
AetherUnbound added the 💻 aspect: code, 🛠 goal: fix, 🟧 priority: high, and 🧱 stack: catalog labels on Jul 17, 2023
github-project-automation bot moved this to 📋 Backlog in Openverse Backlog on Jul 17, 2023
AetherUnbound self-assigned this on Jul 17, 2023
AetherUnbound moved this from 📋 Backlog to 📅 To do in Openverse Backlog on Jul 17, 2023
@sarayourfriend
Collaborator

Just a guess, but if the scheduler is run as part of ENTRYPOINT rather than CMD or with run, I don't think the container would exit if the scheduler process exits. I don't know if that is how the scheduler is run; it's just the only thing I can think of off the top of my head that would cause that.
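For illustration, an entrypoint along these lines (purely a sketch of that failure mode, not what the image actually does) would keep the container alive after the scheduler process dies:

```bash
#!/usr/bin/env bash
# Illustrative entrypoint only -- not necessarily how the Airflow image is built.
# Because the scheduler is run as a child process (no `exec`) and the script
# continues afterwards, the shell stays PID 1 and the container keeps running
# even after the scheduler has crashed.
airflow scheduler
tail -f /dev/null

# The usual fix is to hand PID 1 over to the scheduler itself:
#   exec airflow scheduler
# so its exit code becomes the container's exit code and the restart policy applies.
```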

AetherUnbound added the 🟨 priority: medium label and removed the 🟧 priority: high label on Oct 4, 2023
AetherUnbound moved this from 📅 To do to 📋 Backlog in Openverse Backlog on Oct 4, 2023
AetherUnbound added the help wanted and 🐳 tech: docker labels on Oct 24, 2023
@AetherUnbound
Collaborator Author

I revisited this issue recently and was unable to reproduce the original behavior! When the database becomes unreachable, the scheduler container does indeed shut down. No action needed here.

AetherUnbound closed this as not planned (won't fix, can't repro, duplicate, stale) on Dec 26, 2023
openverse-bot moved this from 📋 Backlog to 🗑 Discarded in Openverse Backlog on Dec 26, 2023