FAB Provider - 'Server has gone away' #63718
Replies: 8 comments
-
Please check your server logs. "Server has gone away" is always accompanied by a message on the server side explaining what went wrong. Alternatively, inspect your firewall rules. Often, when there is a lack of activity on connections, firewalls close such open connections. The same applies if your infrastructure has similar behaviour for closing long-running connections to a database. Such issues can be solved by enabling pooling, per-connection pings, or several other techniques described in https://docs.sqlalchemy.org/en/21/core/pooling.html#dealing-with-disconnects -> look at the docs; you can pass engine parameters when you configure Airflow, so you should be able to experiment with that. Let us know how your experiments went.
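As a concrete sketch of that suggestion, here are settings one might try in airflow.cfg. The option names are from Airflow's `[database]` configuration section; the numeric values are illustrative assumptions, not recommendations:

```ini
[database]
# Keep pooling on and bound the pool size (example values only).
sql_alchemy_pool_enabled = True
sql_alchemy_pool_size = 5

# Issue a lightweight "ping" before reusing a pooled connection,
# so stale connections are transparently replaced.
sql_alchemy_pool_pre_ping = True

# Recycle connections older than 30 minutes; set this below your
# firewall / MySQL idle timeout so connections are refreshed first.
sql_alchemy_pool_recycle = 1800
```

The same options can also be set via environment variables, e.g. `AIRFLOW__DATABASE__SQL_ALCHEMY_POOL_PRE_PING=True`.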
-
It happened in two different isolated environments - two Airflow deployments with two different database servers - roughly the same amount of time after deployment, with the same stacktrace. Meanwhile, Airflow 3.1.6 with FAB provider 3.1.2 has worked well under exactly the same conditions for months. I don't think this has any relation to the firewall, network configuration, etc.
-
Without you looking at your server logs, it's impossible to help you with your problem.
-
Sometimes there are other circumstances, and you can unknowingly have differences you are not aware of - it's simply impossible to figure out all the possibilities and guess what the problem could be, or whether it was caused by the upgrade or by deployment differences. But looking at the logs on your server might actually bring some answers. I know there is an easy temptation to say "the only thing that changed is the Airflow version" and hope that maintainers will magically guess what your problem is, but maintainers are volunteers, and they can help others only if they have at least some information that lets them make intelligent guesses. Generally, this is the kind of "help" and "support" you get for software you receive absolutely for free - it's free as a puppy: you need to take care of it, even if you got it for free. Usually that means people will gladly help and even direct you what to do (for example, look for logs telling you what happened, or at some firewall settings), but sometimes that is the maximum they can do, because ultimately you are the only person who has access to your deployment and can diagnose things (aka the Deployment Manager).
-
Also, this is what Claude says: That's a classic MySQL error. It typically means the connection between your application and the MySQL server was dropped. The most common causes are:
-
The captured pod logs don't contain timestamps, so I can't correlate them with the RDS server logs. I also can't reproduce the installation of Airflow 3.1.8, because downgrading the database in this version fails and requires recreating the entire database.

What I can add is that after installation, Airflow ran without issues for several hours and only then failed (in two separate environments). This error doesn't appear when using the same database on the same cluster but with a different version of the FAB provider. So, let's close this issue as "unreproducible". Next time, when a new version of Airflow is released and it fails, I'll try to capture more info.

P.S. Claude offered a long list of possible causes (Valkey cache tuning, Celery settings, and so on) for the "Task state changed externally" error. But the real root cause turned out to be simple: the API server was restarting because of a memory leak.
-
Converted to discussion.
-
This looks like the classic "MySQL server has gone away" issue with Airflow, usually caused by stale or idle connections in the SQLAlchemy pool. Since it happens after a few hours, it's likely that MySQL (RDS) is closing idle connections (wait_timeout) and Airflow is trying to reuse them. The fix is to enable connection recycling and health checks: set sql_alchemy_pool_recycle to a value lower than the MySQL timeout, and sql_alchemy_pool_pre_ping = True so dead connections are refreshed automatically. Also double-check RDS settings like wait_timeout and max_allowed_packet. This is more of a connection-lifecycle issue than one of load or max connections. That said, in rare cases, if the MySQL database itself is corrupted or tables become inaccessible, similar errors can show up. If you start seeing InnoDB errors in the logs or queries failing inconsistently, you can try third-party recovery tools such as Stellar Repair for MySQL to repair and extract data from damaged tables.
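The pre-ping behavior described above can be illustrated with a small self-contained sketch. This is not SQLAlchemy's implementation - it uses Python's built-in sqlite3 module, and the PrePingPool class is invented for the example - but it shows the idea: test each pooled connection with a cheap SELECT 1 before reuse, and silently discard it if the ping fails.

```python
import sqlite3
from collections import deque

class PrePingPool:
    """Minimal sketch of the pool_pre_ping idea: validate a pooled
    connection with a cheap query before handing it out, and replace
    it with a fresh one if the ping fails."""

    def __init__(self, dsn=":memory:"):
        self.dsn = dsn
        self._idle = deque()

    def _connect(self):
        return sqlite3.connect(self.dsn)

    def checkout(self):
        while self._idle:
            conn = self._idle.popleft()
            try:
                conn.execute("SELECT 1")  # the "ping"
                return conn               # still alive: reuse it
            except sqlite3.Error:
                pass                      # stale: discard, try next
        return self._connect()            # pool empty: fresh connection

    def checkin(self, conn):
        self._idle.append(conn)

pool = PrePingPool()
conn = pool.checkout()
conn.close()             # simulate the server dropping the connection
pool.checkin(conn)
fresh = pool.checkout()  # ping fails on the dead one, a new one is made
print(fresh.execute("SELECT 1").fetchone()[0])  # -> 1
```

With pre-ping disabled, the application would receive the dead connection and fail mid-request; with it enabled, the cost is one trivial round-trip per checkout.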
-
Apache Airflow Provider(s)
fab
Versions of Apache Airflow Providers
apache-airflow-providers-fab 3.4.0
Apache Airflow version
3.1.8
Operating System
linux
Deployment
Official Apache Airflow Helm Chart
Deployment details
Airflow - AWS EKS K8s cluster, database - MySQL AWS RDS.
What happened
After several hours, the Airflow UI becomes unavailable and an exception is raised. Confirmed in two different environments (with different database servers), so it is not a problem with a specific database.
What you think should happen instead
No response
How to reproduce
Deploy Airflow 3.1.8 with fab provider v3.4.0 and MySQL as Airflow's metadata database.
Anything else
No response
Are you willing to submit PR?
Code of Conduct