Intermittent deadlock when closing a channel using CloseAsync in 7.x #1751

Open
Andersso opened this issue Dec 19, 2024 · 27 comments · May be fixed by #1771
@Andersso

Describe the bug

Hi there,

Ever since upgrading from 6.x to 7.x, I've been running into intermittent deadlocks whenever I try to close a channel via CloseAsync.
I haven't been able to reproduce it locally. I have done some remote debugging, but it didn't give me any insight (all thread pool threads are idle, waiting for work).

I did however manage to run a dotnet-dump dumpasync during one of these deadlocks and got the following info:

First dump

00007ebcbe050400 00007efcd1b1e298 ( ) System.Threading.Tasks.Task
  00007ebcbe9d7c48 00007efcd74554b8 (0) RabbitMQ.Client.ConsumerDispatching.ConsumerDispatcherChannelBase+<WaitForShutdownAsync>d__17
    00007ebcbe915d68 00007efcd7453028 (3) RabbitMQ.Client.Impl.Channel+<CloseAsync>d__73
      00007ebcbe915e10 00007efcd74533e0 (0) RabbitMQ.Client.Impl.AutorecoveringChannel+<CloseAsync>d__53
        00007ebcbe915ea0 00007efcd7453788 (0) <My code>

Second dump (another instance)

00007f0a56a17290 00007f4a69375380 ( ) System.Threading.Tasks.Task
  00007f0a597ed238 00007f4a6d2dd968 (0) RabbitMQ.Client.ConsumerDispatching.ConsumerDispatcherChannelBase+<WaitForShutdownAsync>d__17
    00007f0a59573cf8 00007f4a6d2da998 (3) RabbitMQ.Client.Impl.Channel+<CloseAsync>d__73
      00007f0a59573da0 00007f4a6d2dad50 (0) RabbitMQ.Client.Impl.AutorecoveringChannel+<CloseAsync>d__53
        00007f0a59573e30 00007f4a6d2db0f8 (0) <My code>

I noticed that in both dump instances, the stacks aren’t displayed with the usual Awaiting: notation you often see in async stack traces, but it might be normal.

Reproduction steps

I haven’t pinned down a reliable way to reproduce this, but calling CloseAsync more frequently seems to increase the chances of hitting the deadlock. It also appears more common on Linux than Windows, though that might just be due to hardware differences rather than OS behavior.

Expected behavior

When calling CloseAsync, I’d expect the channel to close normally without causing a deadlock.

Additional context

No response

@Andersso Andersso added the bug label Dec 19, 2024
@lukebakken
Contributor

Hi, thanks for the report. As I'm sure you're aware, there's not much to work with here 😸 Obviously, the gold standard is to provide code that reproduces this issue, or at least some idea of the steps to do so.

calling CloseAsync more frequently

What does this mean? Do you have some way in your application to increase the frequency of channel closure?

@lukebakken lukebakken self-assigned this Dec 19, 2024
@lukebakken lukebakken added this to the 7.1.0 milestone Dec 19, 2024
@Andersso
Author

What does this mean? Do you have some way in your application to increase the frequency of channel closure?

We're running tests that create and close channels very frequently, and it appears that the test suite that does this the most is usually the one that gets stuck.

Anyhow, I can try to dig into this further and see if I can provide something that will help you reproduce it.

Thanks

@michaelklishin
Member

@Andersso channel and connection churn are workloads explicitly recommended against.

@lukebakken
Contributor

lukebakken commented Dec 19, 2024

We're running tests that create and close channels very frequently, and it appears that the test suite that does this the most is usually the one that gets stuck.

It would be extremely helpful for you to share your test code. If you can't do that, describe the test as best you can:

  • How many channels are created at any given point?
  • Are they created concurrently?
  • Is your test code a separate console app, or does it use a test framework like xunit?

My guess is that you could be hitting a ThreadPool limit which prevents a Task from being scheduled, while another Task waits for the result. If you'd like to test that theory, please add the following code to the startup of your test program / test suite:

ThreadPool.SetMinThreads(16 * Environment.ProcessorCount, 16 * Environment.ProcessorCount);

This is a related issue:
#1354

@michaelklishin
Member

Also note that the management UI has connection and channel churn metrics, on the Overview page and also on the node page, IIRC.

So at the very least it should be easy to see the churn rate: is it 50 channels opened per second? Is it 200?

@lukebakken
Contributor

lukebakken commented Dec 26, 2024

@Andersso @ZajacPiotr98 - I've modified a test app in this project to try and trigger the error in this issue, or the error in #1749, and it works fine every time in my environment:

https://github.com/rabbitmq/rabbitmq-dotnet-client/blob/rabbitmq-dotnet-client-1749/projects/Applications/CreateChannel/Program.cs

@Andersso
Author

Andersso commented Jan 13, 2025

Hi again, and sorry for the delayed response. I hope you guys had a good Christmas and new year!

  • How many channels are created at any given point?
    The application creates a total of 20 channels, with 5 being recreated between each test case.

  • Are they created concurrently?
    No, they are created one at a time and stopped one at a time.

  • Is your test code a separate, console app, or using a test framework like xunit?
    The actual application is a console application, running separately from the test application (different instances). The tests use RabbitMQ to send messages to the application but maintain their own connection/channel, which is kept alive for the entire test suite.

I’ve been working on reproducing the issue in a test project but haven’t had any success. I’ve tried experimenting with different thread pool sizes, but it didn’t seem to affect the outcome. Based on my investigation of my latest memory dump, there’s no indication of thread pool starvation; all the threads in the pool are idle and waiting for work. It is also worth mentioning that my application is a console app so it does not have any synchronization context.

Regarding the connection churn, wouldn’t that have caused issues in the 6.x versions as well? We’ve had this setup running fine for years without any problems until the upgrade to 7.x.

I’ve done some additional digging by analyzing the memory dump. Specifically, I’ve looked at the tasks being awaited in the method that always seems to get stuck (according to the async dump):

    class ConsumerDispatcherChannelBase
    {
        public async Task WaitForShutdownAsync()
        {
            if (_disposed) // Value=0
            {
                return;
            }

            if (_quiesce) // Value=1
            {
                try
                {
                    await _reader.Completion; // m_stateFlags=33555520 (WaitingForActivation)
                    await _worker; // m_stateFlags=117441536 (RanToCompletion)
                }
                // catch/finally blocks elided from this excerpt
            }
        }
    }

It appears that the channel never gets completed, which prevents the method from ever completing.
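For illustration, here is a minimal standalone sketch (assumed names; not RabbitMQ.Client code) of the System.Threading.Channels behavior that matches what the dump shows: reader.Completion only completes once the writer has been marked complete and every buffered item has been read, so a single unread work item is enough to keep it in WaitingForActivation forever.

    // Standalone sketch: reader.Completion stays pending while an unread item remains buffered.
    using System;
    using System.Threading.Channels;
    using System.Threading.Tasks;

    class ChannelCompletionDemo
    {
        static async Task Main()
        {
            Channel<int> channel = Channel.CreateUnbounded<int>();

            channel.Writer.TryWrite(42); // one unread item left in the channel
            channel.Writer.Complete();   // mark the channel as complete

            // Still pending, because an item is buffered and unread.
            Console.WriteLine(channel.Reader.Completion.Status); // WaitingForActivation

            channel.Reader.TryRead(out _); // drain the last item

            await channel.Reader.Completion; // now completes
            Console.WriteLine(channel.Reader.Completion.Status); // RanToCompletion
        }
    }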

@lukebakken
Contributor

@Andersso I'm going to publish a 7.1.0 alpha release right now. When it's ready, I'll follow-up here. There have been a couple fixes merged that might help here.

Any luck reproducing this issue reliably?

@lukebakken
Contributor

@Andersso please give this version a try! https://www.nuget.org/packages/RabbitMQ.Client/7.1.0-alpha.0

@Andersso
Author

@Andersso please give this version a try! https://www.nuget.org/packages/RabbitMQ.Client/7.1.0-alpha.0

Hey,
I tested it over the weekend; unfortunately, it did not change anything.

@ZajacPiotr98

I also performed the test with the alpha version, with the same results. I also tried a workaround of passing a cancellation token to the CloseAsync method, but it does not help, as CloseAsync keeps running after the cancellation request. So I think it blocks on awaiting ChannelCloseAsyncRpcContinuation or ConsumerDispatcher.WaitForShutdownAsync().

In my case it was around 500 close requests in 2 minutes from one instance of my application (6 instances overall, 5 connections each, 5 RabbitMQ nodes with a 3 GiB high watermark). A second instance of the app hit the same issue at around 1000 close requests in 4 minutes.

@lukebakken
Contributor

Thanks for your reports. I'll try to reproduce this issue locally, though I've had no luck so far.

lukebakken added a commit that referenced this issue Jan 22, 2025
Fixes #1751

Attempt to fix deadlock by waiting on channel dispatcher first, then channel reader.
@lukebakken lukebakken linked a pull request Jan 22, 2025 that will close this issue
@lukebakken
Contributor

@Andersso @ZajacPiotr98 I'm wondering if you're running into this condition - https://stackoverflow.com/a/66521303

Is it possible to test my PR branch in your environments?
#1771

If not, I can publish another alpha release. Thank you!

@Andersso
Author

Hey, I do not have the infrastructure to use your repo directly. A nuget package would be perfect!

Thanks

@lukebakken
Contributor

@Andersso - I built the packages locally from my branch and uploaded them here:

https://www.myget.org/feed/rabbitmq-dotnet-client/package/nuget/RabbitMQ.Client/7.1.0-alpha.0.1

@ZajacPiotr98

ZajacPiotr98 commented Jan 24, 2025

I tested this PR and the issue is still there. I added logs, and it seems that for some reason _worker finished, but it's stuck on _reader.Completion. I think this may be the case where _worker was stopped for some reason (an exception?) while there was still work that was never read from the channel, most likely WorkType.Shutdown. I tried to reproduce it with more logs, but unfortunately it never happened again.

@lukebakken
Contributor

Thanks for the follow-up. I wish I could reproduce this! I think the best fix will be to either not await reader.Completion or drain the reader first.
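To make the "drain the reader first" idea concrete, here is a hedged sketch (not the actual PR change); the _reader field name comes from the WaitForShutdownAsync excerpt quoted earlier, and TWork stands in for the dispatcher's internal work item type:

    using System.Threading.Channels;
    using System.Threading.Tasks;

    // Sketch only: drain leftover work items before awaiting Completion.
    class DrainingShutdown<TWork>
    {
        private readonly ChannelReader<TWork> _reader;

        public DrainingShutdown(ChannelReader<TWork> reader) => _reader = reader;

        public async Task WaitForShutdownAsync()
        {
            // Discard anything that was written but never processed; Completion
            // only completes once the writer is done AND the channel is empty,
            // so leftover items would keep it pending forever.
            while (_reader.TryRead(out _))
            {
                // intentionally discarded
            }

            await _reader.Completion.ConfigureAwait(false);
        }
    }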

@lukebakken
Contributor

Related PR that may have caused this behavior - #1606

@bollhals if you have a second to read this issue and comment, I would appreciate it. Thanks!

@lukebakken
Contributor

@ZajacPiotr98 @Andersso I've uploaded a new version to MyGet -

https://www.myget.org/feed/rabbitmq-dotnet-client/package/nuget/RabbitMQ.Client/7.1.0-alpha.0.2

When the AsyncConsumerDispatcher processing loop exits it should drain the _reader, allowing WaitForShutdownAsync to complete. It'll log the WorkType that is discarded.

@Andersso
Author

I will run 7.1.0-alpha.0.2 over the weekend, fingers crossed!

Sorry for my ignorance, but where does the log end up?

@lukebakken
Contributor

You have to configure EventSource output. There is an example in the integration tests:

https://github.com/rabbitmq/rabbitmq-dotnet-client/blob/main/projects/Test/Common/TestOutputWriterEventListener.cs

Use that class as a starting point in your own project. Instead of writing to ITestOutputHelper like that code does, you'd log to the console, your own logging infrastructure, etc.
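For example, a minimal console-logging listener might look like the sketch below (the "rabbitmq" name filter is an assumption; check the client's actual EventSource name and adjust it for your environment):

    using System;
    using System.Diagnostics.Tracing;

    // Sketch of a console EventListener, modeled on TestOutputWriterEventListener.
    public sealed class ConsoleEventListener : EventListener
    {
        protected override void OnEventSourceCreated(EventSource eventSource)
        {
            // Assumed filter: enable any EventSource whose name mentions "rabbitmq".
            if (eventSource.Name.Contains("rabbitmq", StringComparison.OrdinalIgnoreCase))
            {
                EnableEvents(eventSource, EventLevel.Verbose);
            }
        }

        protected override void OnEventWritten(EventWrittenEventArgs eventData)
        {
            string payload = eventData.Payload is null
                ? string.Empty
                : string.Join(", ", eventData.Payload);
            Console.WriteLine($"[{eventData.EventSource.Name}] {eventData.EventName}: {payload}");
        }
    }

Create one instance at startup and keep it alive for the lifetime of the process; disposing the listener stops event delivery.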

@Andersso
Author

Hey again, sorry for the delayed response.

Unfortunately, the issue is still present, and no log output has been observed (I did verify that the event listener is working).
I've been spending some time poking around in the memory dump to get further insight into the state of the Channel.
From what I have seen so far in one dump, everything points to the channel being completed, but the Completion task state remains WaitingForActivation, which I find very strange. But take my observations with a grain of salt.

I will have another dive once I have a fresh memory dump.

Thanks

@lukebakken
Contributor

@Andersso thanks for the report. Argh, I wish I could reproduce this issue here. I will try some other ideas and will publish a new release to MyGet. I REALLY appreciate you being willing to test and investigate.

lukebakken added a commit that referenced this issue Feb 5, 2025
Fixes #1751

Attempt to fix deadlock by waiting on channel dispatcher first, then channel reader.
@lukebakken
Contributor

@ZajacPiotr98 @Andersso - could either of you share some or all of your code that reproduces this issue? Does your code start consumers on the channels that refuse to stop? Can you share your full consumer code? I'm wondering which events or methods they use. Do they subscribe or handle the channel shutdown event?

@lukebakken
Contributor

@ZajacPiotr98 @Andersso I have created version 7.1.0-alpha.0.5 -

https://www.myget.org/feed/rabbitmq-dotnet-client/package/nuget/RabbitMQ.Client

@Andersso this adds logging here:
https://github.com/rabbitmq/rabbitmq-dotnet-client/blob/rabbitmq-dotnet-client-1751/projects/RabbitMQ.Client/ConsumerDispatching/AsyncConsumerDispatcher.cs#L29-L32

Thanks for giving it a try and reporting back if any of the consumer work log messages from AsyncConsumerDispatcher show up.

@Andersso
Author

Andersso commented Feb 7, 2025

Hey again,

I used your latest package overnight and I was able to reproduce the issue.
When I look in the logs I see the following message: processing consumer work while quiescing: Shutdown

However, I get the same log message regardless of whether it gets stuck or not, so I'm not sure how helpful it actually is.

I'm also getting these log messages duplicated for each instance I'm running, but I think that is because I have not correctly set up the trace listener.

@Andersso
Author

Andersso commented Feb 7, 2025

@ZajacPiotr98 @Andersso - could either of you share some or all of your code that reproduces this issue? Does your code start consumers on the channels that refuse to stop? Can you share your full consumer code? I'm wondering which events or methods they use. Do they subscribe or handle the channel shutdown event?

Sorry, but I am limited in what I can share; there are a lot of moving parts, which makes it very difficult for me to give a sample that has the same behavior as the real thing. I have already tried to create a sample, but I eventually gave up.

I think the best I can offer you right now is to test the code and do memory analysis.

I appreciate the help!
Thanks
