recovery after temporary DB connection close #38
Comments
I'm looking into the Supabase Realtime code for hints, but I haven't figured it out yet. |
This happens in my prod app about once a week, but I'm closing the issue for now while I migrate my app from v2.3.0 to the latest version of WalEx. I will reopen it if it happens again on the latest version. |
FYI it works really well on the latest version. |
Super! Thank you for your help on this. I also pushed up a few other changes (mainly supervisor strategies): e2a2c0f, but I haven't cut a release for them yet. |
It happened again today with the latest version. @cpursley can you please give me some advice as this is a really critical issue on my prod service? |
Hmm, sorry to hear that. Can you try the latest master (reference the GitHub repo instead of the last release)? Supervisors are set to restart more often there. Maybe also fork the latest master and add some logging in various places? |
will try that. thanks. |
Happened today with the latest master branch. Same symptom. |
@DohanKim Is there anything showing up in logs or your error reporting system (it sounds like no, but thought I'd ask). I'd like to help but not sure how to set up a scenario that reproduces the issue. |
I am only getting this error from Ecto. I spent a couple of days trying to reproduce the issue but was not successful. The scenario I suspect is I will try more and share the results here. |
Thank you for the update and sorry for the trouble. Perhaps we could write some additional tests here to test the scenario of a restarting db: https://github.com/cpursley/walex/blob/master/test/walex/database_test.exs#L36 |
happened again with v3.8.0. now investigating 🥲 |
I wonder if we should instead of creating a slot with https://github.com/cpursley/walex/blob/master/lib/walex/replication/server.ex#L41 |
@cpursley that would be a good idea to create a slot name with app_name. let me first write a test case reproducing the error. |
@DohanKim could you create a PR with the experiment you are doing? That way I could also pull it down and investigate. Thank you for your help on this! |
@cpursley This is the test code I'm currently working on. You can just replace it with
|
@DohanKim I took your idea and created a test branch here: #46 I also set the slot name to the app name: https://github.com/cpursley/walex/pull/46/files#diff-f7aa5bafef0b9d259456d1b5344450f3ae79ce730a61d65d6e0cae665592ad4cR43 Please let me know if this covers the situation you've been experiencing. Feel free to make your own changes. I want to be sure we cover all possible connection cases. Thanks! |
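As an illustration of the slot-naming idea discussed above, here is a minimal sketch that derives a stable replication slot name from an application name. It is not the code from the linked PR; `SlotNameSketch` and `from_app_name/1` are hypothetical names. The only outside facts it relies on are that PostgreSQL slot names may contain only lowercase letters, digits, and underscores, and that identifiers are limited to 63 bytes by default.

```elixir
defmodule SlotNameSketch do
  # Hypothetical helper, not part of WalEx.
  # PostgreSQL's default identifier limit is 63 bytes (NAMEDATALEN - 1).
  @max_len 63

  # Turns an app name (atom or string) into a valid, stable slot name.
  def from_app_name(app_name) when is_atom(app_name) or is_binary(app_name) do
    app_name
    |> to_string()
    |> String.downcase()
    # Replace every character that is not allowed in a slot name.
    |> String.replace(~r/[^a-z0-9_]/, "_")
    |> String.slice(0, @max_len)
  end
end
```

For example, `SlotNameSketch.from_app_name("My-App.Prod")` returns `"my_app_prod"`. A name that stays the same across restarts makes it easier to see in `pg_replication_slots` which slot belongs to which app, whereas random names can pile up.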
I added another test case that attempts to stop Postgres via the command line. It seems to work on macOS where Postgres was installed via Postgres.app. Postgres on macOS via Homebrew is also covered but untested. Linux (Debian) is covered too, but I haven't tested it locally. It does not work in the GitHub workflow because there is no sudo access (and because Postgres runs in Docker there, so I don't believe it can actually be started or stopped from the test runner). I'm not sure what type of local machine you use, but I would appreciate you testing this and reporting back. Thanks! walex/test/walex/database_test.exs Line 116 in e13a9cb
|
@cpursley Thanks for the effort! I'm using an M1 Mac (Apple Silicon) locally (and Supabase on prod). But the test cases still pass with the random slot names, meaning that they do not cover the case my prod server is experiencing. I'll try to write a test case covering my error case. |
I've spent roughly a week attempting to replicate the issue, but haven't succeeded 🥲. |
Feel free to put in logging! We can always remove later when the issue is resolved. |
Also, could you submit a change to this branch with your Homebrew-related changes? But please modify it so that the version number is dynamic instead of hardcoded. |
Hi @DohanKim ~ any thoughts on my previous comments? |
@cpursley sorry for the late reply. I'm running late on my service upgrade 🥲. Anyway, I added some logging inside my app instead of WalEx and found some clues. I'll spend at most a couple of days trying to reproduce the error again in the test code, and if it's not successful,
Also, I'll submit a PR with the Homebrew-related code. |
@DohanKim Thanks for the changes. I went ahead and merged them to make it easier (so we can start new branches as needed). |
I finally figured out the problem, and it was more of a Postgrex problem (with the auto_reconnect option turned on). Closing the issue. |
Great sleuthing, thank you @DohanKim |
Sometimes the DB connection is closed.
Other processes restart and reconnect shortly after the temporary connection issue is resolved,
but WalEx just stops working.
Can you give me some ideas for handling this, and how to implement them?
(e.g., reconnecting with exponential backoff)
@cpursley
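For reference, the exponential backoff idea mentioned above could be sketched roughly as follows. This is a generic illustration, not WalEx or Postgrex code: `BackoffSketch`, `connect_fun`, and `sleep_fun` are hypothetical names, and the 500 ms base and 30 s cap are arbitrary assumptions.

```elixir
defmodule BackoffSketch do
  # Hypothetical helper, not part of WalEx or Postgrex.
  @base_ms 500
  @max_ms 30_000

  # Delay before retry number `attempt` (0-based): base * 2^attempt, capped at max_ms.
  def delay_ms(attempt, base_ms \\ @base_ms, max_ms \\ @max_ms)
      when is_integer(attempt) and attempt >= 0 do
    min(base_ms * Integer.pow(2, attempt), max_ms)
  end

  # Calls connect_fun (expected to return {:ok, conn} or {:error, reason})
  # until it succeeds or max_attempts calls have been made.
  # sleep_fun is injectable so the waiting can be skipped in tests.
  def connect_with_backoff(connect_fun, max_attempts, sleep_fun \\ &Process.sleep/1)
      when is_function(connect_fun, 0) and is_integer(max_attempts) and max_attempts > 0 do
    attempt(connect_fun, 0, max_attempts, sleep_fun)
  end

  defp attempt(connect_fun, n, max_attempts, sleep_fun) do
    case connect_fun.() do
      {:ok, conn} ->
        {:ok, conn}

      # Last allowed attempt failed: give up and return the error.
      {:error, reason} when n + 1 >= max_attempts ->
        {:error, reason}

      {:error, _reason} ->
        sleep_fun.(delay_ms(n))
        attempt(connect_fun, n + 1, max_attempts, sleep_fun)
    end
  end
end
```

With the defaults, the delays are 500 ms, 1 s, 2 s, 4 s, and so on up to 30 s. Inside a GenServer, scheduling the retry with `Process.send_after/3` would usually be preferable to a blocking `Process.sleep/1`, and adding random jitter helps avoid many clients reconnecting at the same instant.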