Intermittent deadlock when closing a channel using CloseAsync in 7.x #1751
Comments
Hi, thanks for the report. As I'm sure you're aware, there's not much to work with here 😸 Obviously, the gold standard is to provide code that reproduces this issue, or at least some idea of steps to do so.
What does this mean? Do you have some way in your application to increase the frequency of channel closure?
We're running tests that create and close channels very frequently, and the test suite that does this the most is the one that usually gets stuck. Anyhow, I can try to dig into this further and see if I can provide something that will help you reproduce it. Thanks
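(For readers following along, a minimal sketch of the kind of open/close churn loop described above, assuming the RabbitMQ.Client 7.x async API (`CreateConnectionAsync` / `CreateChannelAsync` / `CloseAsync`); the host name and iteration count are placeholders, and this is not the reporter's actual test suite.)

```csharp
// Not the reporter's actual tests: a minimal churn loop that repeatedly opens and
// closes channels on a single connection, the pattern that seems to surface the hang.
// Assumes RabbitMQ.Client 7.x; "localhost" and the iteration count are placeholders.
using RabbitMQ.Client;

var factory = new ConnectionFactory { HostName = "localhost" };
await using var connection = await factory.CreateConnectionAsync();

for (int i = 0; i < 10_000; i++)
{
    await using var channel = await connection.CreateChannelAsync();
    await channel.CloseAsync(); // the call that intermittently never returns on 7.x
}
```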
@Andersso channel and connection churn are workloads explicitly recommended against.
It would be extremely helpful for you to share your test code. If you can't do that, describe the test as best you can:
My guess is that you could be hitting a
This is a related issue:
Also note that the management UI has connection and channel churn metrics, on the Overview page but also on the node page IIRC. So at the very least it should be easy to see the churn rate: is it 50 channels opened per second? Is it 200?
@Andersso @ZajacPiotr98 - I've modified a test app in this project to try and trigger the error in this issue, or the error in #1749, and it works fine every time in my environment:
Hi again, and sorry for the delayed response. I hope you guys had a good Christmas and new year!
I've been working on reproducing the issue in a test project but haven't had any success. I've tried experimenting with different thread pool sizes, but it didn't seem to affect the outcome. Based on my investigation of my latest memory dump, there's no indication of thread pool starvation; all the threads in the pool are idle and waiting for work. It is also worth mentioning that my application is a console app, so it does not have a synchronization context.

Regarding the connection churn: wouldn't that have caused issues in the 6.x versions as well? We've had this setup running fine for years without any problems until the upgrade to 7.x.

I've done some additional digging by analyzing the memory dump. Specifically, I've looked at the tasks being awaited in the method that always seems to get stuck (according to the async dump):
It appears that the channel never gets completed, which prevents the method from ever completing.
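(For context, this is the general shape of that symptom: a method awaiting a System.Threading.Channels reader whose writer side is never completed will wait forever. This is only an illustration of the suspicion above, not RabbitMQ.Client's actual internals.)

```csharp
// Illustration only, not RabbitMQ.Client's code: if the writer side of a
// System.Threading.Channels channel is never completed, anything waiting for the
// reader to finish never observes completion and waits forever.
using System;
using System.Threading.Channels;
using System.Threading.Tasks;

var channel = Channel.CreateUnbounded<int>();

// Consumer: drains items and then waits for the channel to be marked complete.
var consumer = Task.Run(async () =>
{
    await foreach (var item in channel.Reader.ReadAllAsync())
    {
        Console.WriteLine(item);
    }
});

channel.Writer.TryWrite(42);
// channel.Writer.Complete();  // <-- without this call, the await below never finishes
await consumer;                // hangs here, mirroring the stuck close path
```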
@Andersso I'm going to publish a 7.1.0 alpha release right now. When it's ready, I'll follow up here. There have been a couple of fixes merged that might help here. Any luck reproducing this issue reliably?
@Andersso please give this version a try! https://www.nuget.org/packages/RabbitMQ.Client/7.1.0-alpha.0
Hey,
I also performed the test with the alpha version, with the same results. I also tried a workaround of passing a cancellation token to the
In my case it was around 500 close requests in 2 minutes from one instance of my application (6 instances overall, 5 connections each, 5 RabbitMQ nodes with a 3 GiB high watermark). A second instance of the app had the same issue after around 1000 close requests in 4 minutes.
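(The cancellation-token workaround mentioned above presumably looks roughly like the sketch below; the exact `CloseAsync(CancellationToken)` overload is assumed from the description, the 30-second timeout is arbitrary, and whether the token is still observed once the call is stuck is exactly what is in question.)

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using RabbitMQ.Client;

// Sketch of the cancellation-token workaround (the timeout value is arbitrary).
// Whether CloseAsync still observes the token once it is already stuck is exactly
// what is being discussed in this thread.
public static class ChannelCloseExtensions
{
    public static async Task CloseWithTimeoutAsync(this IChannel channel, TimeSpan timeout)
    {
        using var cts = new CancellationTokenSource(timeout);
        try
        {
            await channel.CloseAsync(cts.Token);
        }
        catch (OperationCanceledException)
        {
            // Close did not finish in time; log and move on instead of hanging the caller.
        }
    }
}
```

This would then be called as `await channel.CloseWithTimeoutAsync(TimeSpan.FromSeconds(30));` instead of calling `CloseAsync` directly.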
Thanks for your reports. I'll try to reproduce this issue locally, though I've had no luck so far.
Fixes #1751 Attempt to fix deadlock by waiting on channel dispatcher first, then channel reader.
@Andersso @ZajacPiotr98 I'm wondering if you're running into this condition - https://stackoverflow.com/a/66521303 Is it possible to test my PR branch in your environments? If not, I can publish another alpha release. Thank you!
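(Assuming the linked answer refers to the well-known pitfall of continuations being inlined on the thread that completes a `TaskCompletionSource`, the usual mitigation looks like this; illustrative only, not the library's code.)

```csharp
// Illustrative only, not RabbitMQ.Client's internals. Without
// RunContinuationsAsynchronously, TrySetResult() may run the awaiter's continuation
// synchronously on the completing thread (e.g. a connection's reader loop); if that
// continuation then waits for something the completing thread still owns, both sides
// deadlock. The flag forces the continuation onto the thread pool instead.
using System.Threading.Tasks;

var tcs = new TaskCompletionSource<bool>(TaskCreationOptions.RunContinuationsAsynchronously);

var waiter = Task.Run(async () => await tcs.Task);

tcs.TrySetResult(true); // continuation is queued to the thread pool, not inlined here
await waiter;
```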
Hey, I do not have the infrastructure to use your repo directly. A nuget package would be perfect! Thanks
@Andersso - I built the packages locally on my branch and uploaded them here: https://www.myget.org/feed/rabbitmq-dotnet-client/package/nuget/RabbitMQ.Client/7.1.0-alpha.0.1
I tested this PR and the issue is still there. I added logs, and it seems that for some reason
Thanks for the follow-up. I wish I could reproduce this! I think the best fix will be to either not await
@ZajacPiotr98 @Andersso I've uploaded a new version to MyGet - https://www.myget.org/feed/rabbitmq-dotnet-client/package/nuget/RabbitMQ.Client/7.1.0-alpha.0.2 When the
I will run 7.1.0-alpha.0.2 over the weekend, fingers crossed! Sorry for my ignorance, but where does the log end up?
You have to configure Use that class as a starting point in your own project. Instead of writing to
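(For anyone else testing these builds, an in-process `EventListener` along these lines can capture the client's EventSource output to a file. The "rabbitmq" name filter is an assumption on my part; check the class referenced above for the exact source name. The log path is a placeholder.)

```csharp
using System;
using System.Diagnostics.Tracing;
using System.IO;

// Minimal in-process listener sketch: writes the client's EventSource output to a file
// instead of the console. The "rabbitmq" name filter is an assumption; check the class
// referenced above for the exact EventSource name. The file path is a placeholder.
public sealed class FileRabbitMqEventListener : EventListener
{
    private static readonly object Sync = new();

    protected override void OnEventSourceCreated(EventSource eventSource)
    {
        if (eventSource.Name.Contains("rabbitmq", StringComparison.OrdinalIgnoreCase))
        {
            EnableEvents(eventSource, EventLevel.Verbose);
        }
    }

    protected override void OnEventWritten(EventWrittenEventArgs eventData)
    {
        var payload = eventData.Payload is null ? "" : string.Join(", ", eventData.Payload);
        lock (Sync)
        {
            File.AppendAllText("rabbitmq-client.log",
                $"{DateTime.UtcNow:O} {eventData.EventName}: {payload}{Environment.NewLine}");
        }
    }
}
```

Instantiate it once at startup (e.g. `using var listener = new FileRabbitMqEventListener();`) and keep it alive for the lifetime of the process.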
Hey again, sorry for the delayed response. Unfortunately, the issue is still present, and no log output has been observed (I did verify that the event listener is working). I will take another dive once I have a fresh memory dump. Thanks
@Andersso thanks for the report. Argh, I wish I could reproduce this issue here. I will try some other ideas and will publish a new release to MyGet. I REALLY appreciate you being willing to test and investigate.
@ZajacPiotr98 @Andersso - could either of you share some or all of your code that reproduces this issue? Does your code start consumers on the channels that refuse to stop? Can you share your full consumer code? I'm wondering which events or methods they use. Do they subscribe or handle the channel shutdown event?
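(To make those questions concrete, this is roughly the consumer shape being asked about, written from memory of the 7.x async API — `AsyncEventingBasicConsumer`, `ReceivedAsync`, `ChannelShutdownAsync` — so exact member names may differ slightly between 7.x versions; "my-queue" and "localhost" are placeholders.)

```csharp
using System;
using System.Threading.Tasks;
using RabbitMQ.Client;
using RabbitMQ.Client.Events;

// Rough consumer sketch from memory of the 7.x async API; member names may vary
// slightly between 7.x versions. "my-queue" and "localhost" are placeholders.
var factory = new ConnectionFactory { HostName = "localhost" };
await using var connection = await factory.CreateConnectionAsync();
await using var channel = await connection.CreateChannelAsync();

channel.ChannelShutdownAsync += (sender, args) =>
{
    // Does your handler here do anything that could block or re-enter the channel?
    Console.WriteLine($"Channel shut down: {args.ReplyText}");
    return Task.CompletedTask;
};

var consumer = new AsyncEventingBasicConsumer(channel);
consumer.ReceivedAsync += async (sender, ea) =>
{
    // ... handle the delivery ...
    await channel.BasicAckAsync(ea.DeliveryTag, false); // multiple: false
};

await channel.BasicConsumeAsync("my-queue", false, consumer); // autoAck: false

Console.ReadLine(); // keep the process alive while consuming
```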
@ZajacPiotr98 @Andersso I have created version 7.1.0-alpha.0.5 - https://www.myget.org/feed/rabbitmq-dotnet-client/package/nuget/RabbitMQ.Client @Andersso this adds logging here: Thanks for giving it a try and reporting back if any of the
Hey again, I used your latest package overnight and I was able to reproduce the issue. However, I get the same log message regardless of whether it gets stuck or not, so I'm not sure how helpful it actually is. I'm also getting these log messages duplicated for each instance I'm running, but I think that is because I have not correctly set up the trace listener.
Sorry, but I am limited in what I can share; there are a lot of moving parts, which makes it very difficult for me to give a sample that has the same behavior as the real thing. I have already tried to create a sample, but I eventually gave up. I think the best I can offer you right now is to test the code and do memory analysis. I appreciate the help!
Describe the bug
Hi there,
Ever since upgrading from 6.x to 7.x, I've been running into intermittent deadlocks whenever I try to close a channel via `CloseAsync`. I haven't been able to reproduce it locally. I have done some remote debugging, but I could not get any insight (all thread pool threads are waiting for work).
I did, however, manage to run `dotnet-dump dumpasync` during one of these deadlocks and got the following info:
First dump
Second dump (another instance)
I noticed that in both dump instances, the stacks aren't displayed with the usual `Awaiting:` notation you often see in async stack traces, but it might be normal.

Reproduction steps
I haven't pinned down a reliable way to reproduce this, but calling `CloseAsync` more frequently seems to increase the chances of hitting the deadlock. It also appears more common on Linux than Windows, though that might just be due to hardware differences rather than OS behavior.

Expected behavior
When calling `CloseAsync`, I'd expect the channel to close normally without causing a deadlock.

Additional context
No response