
ObjectDisposedException when channel is closed by RabbitMQ due to a channel exception #1749

Closed
ramonsmits opened this issue Dec 16, 2024 · 21 comments · Fixed by #1752 or #1764

@ramonsmits
Contributor

ramonsmits commented Dec 16, 2024

See @ZajacPiotr98's comment for a description of this bug:

#1749 (comment)

The issue reported by @ramonsmits is NOT fixed by #1752, and has been copied to a new issue, #1760

@ramonsmits ramonsmits added the bug label Dec 16, 2024
@lukebakken lukebakken self-assigned this Dec 16, 2024
@lukebakken lukebakken added this to the 7.1.0 milestone Dec 16, 2024
@ZajacPiotr98

ZajacPiotr98 commented Dec 19, 2024

I encountered a similar problem, but in reverse: this ObjectDisposedException caused the connection to close with the following close reason.

```
CloseReason: AMQP close-reason, initiated by Library, code=541, text='Cannot access a disposed object. Object name: 'System.Threading.SemaphoreSlim'.', classId=0, methodId=0, exception=System.ObjectDisposedException: Cannot access a disposed object. Object name: 'System.Threading.SemaphoreSlim'.
   at System.Threading.SemaphoreSlim.WaitAsync(Int32 millisecondsTimeout, CancellationToken cancellationToken)
   at RabbitMQ.Client.Impl.Channel.MaybeHandlePublisherConfirmationTcsOnChannelShutdownAsync(ShutdownEventArgs reason)
   at RabbitMQ.Client.Impl.Channel.OnChannelShutdownAsync(ShutdownEventArgs reason)
   at RabbitMQ.Client.Impl.Channel.OnSessionShutdownAsync(Object sender, ShutdownEventArgs reason)
   at RabbitMQ.Client.Impl.AsyncEventingWrapper`1.InternalInvoke(Delegate[] handlers, Object sender, TEvent event)
   at RabbitMQ.Client.Impl.AsyncEventingWrapper`1.InternalInvoke(Delegate[] handlers, Object sender, TEvent event)
   at RabbitMQ.Client.Impl.Channel.HandleChannelCloseAsync(IncomingCommand cmd, CancellationToken cancellationToken)
   at RabbitMQ.Client.Impl.Channel.HandleCommandAsync(IncomingCommand cmd, CancellationToken cancellationToken)
   at RabbitMQ.Client.Framing.Connection.ProcessFrameAsync(InboundFrame frame, CancellationToken cancellationToken)
   at RabbitMQ.Client.Framing.Connection.ReceiveLoopAsync(CancellationToken mainLoopCancellationToken)
   at RabbitMQ.Client.Framing.Connection.MainLoop()
```

The error happened under significant load, when RabbitMQ responses were several times slower than usual. But I haven't gotten to the bottom of the issue and I don't know the root cause. I'm not sure whether the error is in the library itself or in my usage of it, but I can't find any documentation about handling this.
The code that I use tries to close a channel before disposal, but only when the channel IsOpen, and the task is awaited until the timeout is reached:
```csharp
// IChannel Instance
// ...
var timeoutCts = new CancellationTokenSource(timeout);
var closeChannelTask = this.Instance.CloseAsync();
await closeChannelTask.WaitAsync(timeoutCts.Token);
```

Without this timeout, some close requests ran for a very long time (maybe even indefinitely? Some deadlock? There was no channel on the server side), so the timeout was added to unblock the thread.

To be sure, I also tried without calling CloseAsync on IChannel at all, but it didn't help.

lukebakken added a commit that referenced this issue Dec 19, 2024
@lukebakken
Contributor

lukebakken commented Dec 19, 2024

@ramonsmits I still have yet to hit an ObjectDisposedException using my test project. If you could provide the entire stack trace, it would help me out.

Here is my test application:
https://github.com/rabbitmq/rabbitmq-dotnet-client/tree/rabbitmq-dotnet-client-1749/projects/Applications/GH-1749

It starts a consumer, then repeatedly uses the HTTP API to close the connection. I have left it running a long time in my test environment, and have yet to see ObjectDisposedException.

Maybe there's something up with my code? If you can take a look, I'd appreciate it.

@ZajacPiotr98

I managed to reproduce the problem I described. I called the DisposeAsync method twice simultaneously and extended the Close response time using Clumsy. The first Dispose invocation called CloseAsync and set CloseReason, so IsClosed was true. The second invocation went straight to disposing the semaphores. When the Close response was returned, the semaphore was already disposed. That threw ObjectDisposedException and closed the entire Connection.

I think the ObjectDisposedException is the cause of this, and a bug. However, I'm not sure closing the connection on any exception is a good approach. There should probably be a distinction between server exceptions (or others that should close the connection) and exceptions thrown by bugs in the implementation.
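The disposed-semaphore failure mode described above can be reproduced in isolation with plain .NET types (a standalone sketch, not the client's code): once one code path disposes a `SemaphoreSlim`, any late `WaitAsync` on it throws `ObjectDisposedException`.

```csharp
// Standalone sketch: a SemaphoreSlim disposed by one code path throws
// ObjectDisposedException when another code path still tries to wait on it,
// which mirrors the race between Dispose and the late Close handler.
using System;
using System.Threading;
using System.Threading.Tasks;

var semaphore = new SemaphoreSlim(1, 1);

// Simulates the second DisposeAsync winning the race and disposing state.
semaphore.Dispose();

// Simulates the in-flight close handler still trying to take the semaphore.
try
{
    await semaphore.WaitAsync();
}
catch (ObjectDisposedException ex)
{
    Console.WriteLine($"caught: {ex.GetType().Name}");
}
```

In the library, the "late waiter" is the server-initiated close handler rather than application code, but the exception is the same.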

@michaelklishin
Member

For I/O exceptions, it is. And it can and does trigger a connection recovery.

For when a connection is closed, the methods should simply be made idempotent. And Dispose should likely be an equivalent of calling CloseAsync the first time, and equally idempotent the 2nd, 3rd time, and so on.
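The idempotent-dispose idea above can be sketched with plain .NET primitives (illustrative only; `isDisposing` and `DisposeOnceAsync` are made-up names, not the client's implementation). `Interlocked.Exchange` lets exactly one caller through to the real teardown, no matter how many threads race into dispose.

```csharp
// Sketch of an idempotent, thread-safe dispose guard: the first caller flips
// the flag from 0 to 1 and performs cleanup; every later or concurrent caller
// sees 1 and bails out, so the teardown runs exactly once.
using System;
using System.Threading;
using System.Threading.Tasks;

int isDisposing = 0;   // 0 = live, 1 = disposal started
int cleanupCount = 0;  // how many times cleanup actually ran

async Task DisposeOnceAsync()
{
    if (Interlocked.Exchange(ref isDisposing, 1) == 1)
    {
        return; // another caller already started disposal
    }
    await Task.Yield(); // stand-in for CloseAsync / semaphore teardown
    Interlocked.Increment(ref cleanupCount);
}

await Task.WhenAll(
    Task.Run(DisposeOnceAsync),
    Task.Run(DisposeOnceAsync),
    Task.Run(DisposeOnceAsync));

Console.WriteLine(cleanupCount); // prints 1
```

Note that a guard like this only makes later calls no-ops; it does not make them wait for the first disposal to finish, which is one reason the fix discussed later in this thread also waits on an in-flight close before tearing down shared state.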

@lukebakken
Contributor

> For when a connection is closed, the methods should simply be made idempotent. And Dispose should likely be an equivalent of calling CloseAsync the first time, and equally idempotent the 2nd, 3rd time, and so on.

Yep, that's how it's supposed to work 😉

Thank you for the details, @ZajacPiotr98. Since you can reproduce this issue, would you be able to test a pull request once I open it?

lukebakken added a commit that referenced this issue Dec 20, 2024
Fixes #1749

* Add test project
* Add `FirstChanceException` logging.
* Add code to repeatedly close connection.
* Allow passing hostname to GH-1749 program as the first arg.
@ZajacPiotr98

ZajacPiotr98 commented Dec 23, 2024

I tested these changes and they don't resolve the problem for the case I ran. It was a simple test:

```csharp
var task1 = channel.DisposeAsync();
var task2 = channel.DisposeAsync();
await task1;
await task2;
```

Also, for this test the Connection's CloseReason has HandleChannelCloseOkAsync in the stack trace, since the close was invoked from the app. But when running load tests on the app that uses this library, I more frequently see HandleChannelCloseAsync in the CloseReason stack trace. I tried to debug it locally when an exception is thrown from invoking a command (e.g. QueueDeclarePassiveAsync with a nonexistent queue name). When Dispose finishes executing before HandleChannelCloseAsync has finished, the error occurs. In the app, the channel is disposed on every exception, as it shouldn't be reused for subsequent requests:

```csharp
try
{
    return await channel.QueueDeclarePassiveAsync(queueName);
}
catch (Exception)
{
    await channel.DisposeAsync();
    throw;
}
```

I'm not completely sure what HandleChannelCloseOkAsync and HandleChannelCloseAsync handle. Is it necessary to always invoke these methods? If so, maybe Dispose should delay semaphore disposal until these handlers finish (probably with some timeout). Or these methods should skip waiting on the semaphore if it's already disposed.

@michaelklishin
Member

@ZajacPiotr98 as far as I understand, #1752 focuses on connection objects, and you are testing with channels.

Their (channels') behavior should generally be identical to the idea above, but you and @lukebakken may be focusing on the same problem in two different places.

@lukebakken
Contributor

@ZajacPiotr98 did you follow the same test procedure you detailed in this comment?

#1752 adds a _disposedValue check to the Channel class, so your second DisposeAsync() should do nothing. I'm going to add a basic test to the PR, but it won't use Clumsy, of course.

@ZajacPiotr98

ZajacPiotr98 commented Dec 23, 2024

Yes, maybe I was not precise about whether it was a Channel or a Connection. Now I've done a quick test for a Connection: the same thing, invoking DisposeAsync twice and awaiting both tasks later. I also got an ObjectDisposedException, in this case from the Connection disposal's invocation of await _channel0.DisposeAsync().

@lukebakken I've even tried without Clumsy and had the same results. _disposedValue is set at the end of the Dispose method invocation. If the second invocation starts before the first one has finished, the problem occurs: the _disposedValue check in the second invocation finishes before the first one sets it to true.

Additionally, as a workaround to mitigate this behavior, I added an isDisposing field in the app that is set to true at the start of disposing the channel and never lets another DisposeAsync run on that channel (or only once the first one has finished). But it still doesn't cover the HandleChannelCloseAsync case I described, or the case when disposal reaches the timeout on the app side (underneath, commands are still received; only the await on disposal is cancelled in the app). I added that because of #1751.

@lukebakken
Contributor

Ah, you're not awaiting the DisposeAsync(). Yes, I see the problem. Hopefully nobody does that in practice. I'll come up with a fix.

lukebakken added a commit that referenced this issue Dec 23, 2024
Fixes #1749

* Add test project
* Add `FirstChanceException` logging.
* Add code to repeatedly close connection.
* Allow passing hostname to GH-1749 program as the first arg.
* Toxiproxy tests fixup.
* Standardize on `_disposedValue` name.
* Ensure `_disposedValue` is set in `finally` block.
* Add lock object for disposal of Channels and Connections. Note: a `SemaphoreSlim` can't be used because it must be disposed as well, and that can't happen cleanly in a `Dispose` method.
@lukebakken
Contributor

@ZajacPiotr98 - please give my PR a test. I've added what amounts to a "Disposing" state to channels and connections, that only allows one thread to do the disposal.

@ZajacPiotr98

I will be able to test it next week.

But the problem is not multiple disposes. I am awaiting the dispose invocation; the double-dispose issue only happens when multiple threads dispose at the same time, and, as I mentioned, I corrected that behavior with a workaround on my side.

The problem that is really impacting the application is this:
HandleChannelCloseAsync and HandleChannelCloseOkAsync are invoked, or at least finish, when the semaphores are already disposed, and they throw an exception that is not handled, thus closing the entire connection with all channels opened on it (so all operations in progress are canceled). This impacts the performance of the application, and I don't see any option to suppress invocation of these HandleChannelClose* methods.

I think invoking dispose on an exception, or when a channel is no longer needed, should not close the entire connection. But it sometimes does, when disposal of the semaphores finishes before HandleChannelClose.

I only used the double dispose without awaiting as an example to reproduce the issue. I couldn't find a way to reproduce it exactly as it happens in the load-test environment, where there is significant load and closing a channel takes more time.

I asked about it here:

> I'm not completely sure what is handled in the HandleChannelCloseOkAsync and HandleChannelCloseAsync. Is it necessary to always invoke these methods? If so, maybe Dispose should wait with semaphore disposal until these handles finish (with some timeout probably). Or these methods should ignore waiting for semaphore if it's already disposed.

@michaelklishin
Member

Well, concurrent disposal is a scenario I haven't seen mentioned here; I only thought of the case where a connection or channel is disposed N times sequentially.

Given that shared use of channels was not supported by this library until 7.0, and that concurrent disposal is not necessarily supported by many libraries for other data services, this may be out of scope for this client.

Sharing one-off channels for passive declares between threads seems unnecessary to me, but maybe that's an inevitable effect of adopting tasks for the parts that interact with RabbitMQ :(

@lukebakken
Contributor

lukebakken commented Dec 26, 2024

@ZajacPiotr98 - I'm looking to see if there is a straightforward way to address what you have pointed out with the HandleChannelClose* methods.

Your comment here gives me the best description of how to reproduce the error.

When you do a passive declaration of a queue that doesn't exist, RabbitMQ closes the channel by sending channel.close to the .NET client, which is handled by HandleChannelCloseAsync. This raises an exception, which your code catches, and then your code disposes the channel. DisposeAsync will abort an open channel, which should ignore all exceptions, but it seems some have been missed.

This gives me a simple test case to try.

@lukebakken
Contributor

lukebakken commented Dec 26, 2024

> Also, for this test the CloseReason of Connection has HandleChannelCloseOkAsync is the stack trace as close was invoked from the app. But for running load tests on the app using this library, I see more frequently HandleChannelCloseAsync in the CloseReason stack trace.

Can you please save all stack traces to text files, and attach the files to this issue?

> HandleChannelCloseAsync and HandleChannelCloseOkAsync are invoked or at least finish when the semaphore are already disposed

Which semaphore?

If I had a more complete set of information from you, I could probably figure this out 😄

@lukebakken
Contributor

The code you provide here should never result in ObjectDisposedException, unless you call that code concurrently, which could result in DisposeAsync being called concurrently (which this library does not yet support).

Before I spend any more time investigating, I need code with instructions for reproducing the errors you describe with the HandleChannelClose* methods. I think I'm going in circles at this point.

@ZajacPiotr98

ZajacPiotr98 commented Jan 2, 2025

I managed to reproduce this issue by simulating a delay between _continuationQueue.HandleChannelShutdown(reason); and await MaybeHandlePublisherConfirmationTcsOnChannelShutdownAsync(reason).ConfigureAwait(false); in the OnChannelShutdownAsync method of the Channel class. I did this by adding a ChannelShutdownAsync event handler with some delay. I'm not exactly sure why it sometimes happens in my application (I don't use any handler for ChannelShutdownAsync); maybe it's thread scheduling, or some operation occasionally taking more time. Essentially, when _continuationQueue.HandleChannelShutdown(reason); causes some action (e.g. QueueDeclarePassive) to throw OperationInterruptedException, and Dispose is invoked on that exception and finishes before await MaybeHandlePublisherConfirmationTcsOnChannelShutdownAsync(reason).ConfigureAwait(false); does, then ObjectDisposedException occurs in that method and closes the entire connection with all channels opened on it. Also, the channel has to have publisher confirms enabled.

Here's the code that reproduces the issue 100% of the time:


```csharp
public static async Task Main()
{
    var connectionFactory = new ConnectionFactory
    {
        UserName = "myuser",
        Password = "mypassword",
        VirtualHost = "/",
        RequestedConnectionTimeout = TimeSpan.FromSeconds(600),
        RequestedHeartbeat = TimeSpan.FromSeconds(600),
        AutomaticRecoveryEnabled = false,
        TopologyRecoveryEnabled = false
    };

    await using (IConnection connection = await connectionFactory.CreateConnectionAsync())
    {
        IChannel ch = await connection.CreateChannelAsync(new CreateChannelOptions(true, true));
        try
        {
            ch.ChannelShutdownAsync += async (sender, args) =>
            {
                await Task.Delay(1000);
            };

            await ch.QueueDeclarePassiveAsync(Guid.NewGuid().ToString());
        }
        catch (OperationInterruptedException)
        {
            await ch.DisposeAsync();
            await Task.Delay(2000);
        }
    }
}
```

I posted the stack trace in my first post here (it's the close reason of a connection):

```
CloseReason: AMQP close-reason, initiated by Library, code=541, text='Cannot access a disposed object. Object name: 'System.Threading.SemaphoreSlim'.', classId=0, methodId=0, exception=System.ObjectDisposedException: Cannot access a disposed object. Object name: 'System.Threading.SemaphoreSlim'.
   at System.Threading.SemaphoreSlim.WaitAsync(Int32 millisecondsTimeout, CancellationToken cancellationToken)
   at RabbitMQ.Client.Impl.Channel.MaybeHandlePublisherConfirmationTcsOnChannelShutdownAsync(ShutdownEventArgs reason)
   at RabbitMQ.Client.Impl.Channel.OnChannelShutdownAsync(ShutdownEventArgs reason)
   at RabbitMQ.Client.Impl.Channel.OnSessionShutdownAsync(Object sender, ShutdownEventArgs reason)
   at RabbitMQ.Client.Impl.AsyncEventingWrapper`1.InternalInvoke(Delegate[] handlers, Object sender, TEvent event)
   at RabbitMQ.Client.Impl.AsyncEventingWrapper`1.InternalInvoke(Delegate[] handlers, Object sender, TEvent event)
   at RabbitMQ.Client.Impl.Channel.HandleChannelCloseAsync(IncomingCommand cmd, CancellationToken cancellationToken)
   at RabbitMQ.Client.Impl.Channel.HandleCommandAsync(IncomingCommand cmd, CancellationToken cancellationToken)
   at RabbitMQ.Client.Framing.Connection.ProcessFrameAsync(InboundFrame frame, CancellationToken cancellationToken)
   at RabbitMQ.Client.Framing.Connection.ReceiveLoopAsync(CancellationToken mainLoopCancellationToken)
   at RabbitMQ.Client.Framing.Connection.MainLoop()
```

lukebakken added a commit that referenced this issue Jan 3, 2025
Fixes #1749

* Add test project
* Add `FirstChanceException` logging.
* Add code to repeatedly close connection.
* Allow passing hostname to GH-1749 program as the first arg.
* Toxiproxy tests fixup.
* Standardize on `_disposedValue` name.
* Ensure `_disposedValue` is set in `finally` block.
* Add lock object for disposal of Channels and Connections. Note: a `SemaphoreSlim` can't be used because it must be disposed as well, and that can't happen cleanly in a `Dispose` method.
@lukebakken
Contributor

Thank you very much. I can reproduce the issue every time and am looking into it.

lukebakken added a commit that referenced this issue Jan 7, 2025
Fixes #1749

* Add test project for GH #1749.
* Add `FirstChanceException` logging in #1749 test project.
* Add code to repeatedly close connection.
* Allow passing hostname to GH-1749 program as the first arg.
* Toxiproxy tests fixup.
* Ensure `_disposed` is set in `finally` block.
* Add lock object for disposal of Channels and Connections. Note: a
  `SemaphoreSlim` can't be used because it must be disposed as well,
  and that can't happen cleanly in a `Dispose` method.
* Add basic test to see what dispose after a channel exception does.
* Modify `CreateChannel` app to try and trigger GH1751
* Add TCS for server-originated channel closure. This is the "real" fix
  for the issue.
lukebakken added a commit that referenced this issue Jan 7, 2025
Fixes #1749

* Add test project for GH #1749.
* Add `FirstChanceException` logging in #1749 test project.
* Add code to repeatedly close connection.
* Allow passing hostname to GH-1749 program as the first arg.
* Toxiproxy tests fixup.
* Ensure `_disposed` is set in `finally` block.
* Add lock object for disposal of Channels and Connections. Note: a
  `SemaphoreSlim` can't be used because it must be disposed as well,
  and that can't happen cleanly in a `Dispose` method.
* Add basic test to see what dispose after a channel exception does.
* Modify `CreateChannel` app to try and trigger GH1751
* Add TCS for server-originated channel closure. This is the "real" fix
  for the issue.
@lukebakken
Contributor

@ZajacPiotr98 if you could try the latest code in this PR, I would appreciate it:

#1752

Basically, this is the problem:

  • The QueueDeclarePassiveAsync call causes a broker-initiated channel closure.
  • The repro code immediately calls DisposeAsync, which disposes the SemaphoreSlim objects in the Channel instance.
  • The server-initiated closure still takes place, and tries to use one of those disposed SemaphoreSlim objects.

The fix is to ensure that DisposeAsync waits for a broker-initiated channel closure to complete, if one is in progress. To accomplish this, I added a TaskCompletionSource here:

https://github.com/rabbitmq/rabbitmq-dotnet-client/blob/rabbitmq-dotnet-client-1749/projects/RabbitMQ.Client/Impl/Channel.cs#L1511-L1516

...and, if this completion source is not-null during disposal, the code waits up to 5 seconds for it to complete here:

https://github.com/rabbitmq/rabbitmq-dotnet-client/blob/rabbitmq-dotnet-client-1749/projects/RabbitMQ.Client/Impl/Channel.cs#L588-L589
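The wait-for-close idea can be sketched in isolation with plain .NET types (a hedged sketch; `serverClosedTcs` is an illustrative name, not the client's actual field): disposal waits, with a bound, for an in-flight server-initiated close to finish before tearing down shared state.

```csharp
// Sketch: disposal waits (bounded) on a TaskCompletionSource that the
// server-initiated close handler completes, so semaphores are only torn
// down after the close handler is done using them.
using System;
using System.Threading.Tasks;

var serverClosedTcs = new TaskCompletionSource(
    TaskCreationOptions.RunContinuationsAsynchronously);

// Simulates HandleChannelCloseAsync finishing shortly after disposal starts.
_ = Task.Run(async () =>
{
    await Task.Delay(100);
    serverClosedTcs.TrySetResult();
});

// In DisposeAsync: wait up to 5 seconds for the close handler to complete
// before disposing semaphores and other shared state.
await serverClosedTcs.Task.WaitAsync(TimeSpan.FromSeconds(5));
Console.WriteLine("close handler finished; safe to dispose");
```

The bounded wait matters: if the close handler never completes (e.g. the connection died), disposal should still make progress rather than hang forever.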

@lukebakken lukebakken changed the title ObjectDisposedException when connection is closed from the server ObjectDisposedException when channel is closed by the AMQP broker. Jan 7, 2025
lukebakken added a commit that referenced this issue Jan 7, 2025
Fixes #1749

* Ensure `Dispose` and `DisposeAsync` are idempotent and thread-safe.
* Use TaskCompletionSource when `HandleChannelCloseAsync` runs to allow dispose methods to wait.
@michaelklishin michaelklishin changed the title ObjectDisposedException when channel is closed by the AMQP broker. ObjectDisposedException when channel is closed by RabbitMQ Jan 7, 2025
@michaelklishin michaelklishin changed the title ObjectDisposedException when channel is closed by RabbitMQ ObjectDisposedException when channel is closed by RabbitMQ due to a channel exception Jan 7, 2025
@ZajacPiotr98

I tested these changes and I haven't seen the problem anymore. Seems good to me. When will it be possible to get a new version of the library with these changes?

@lukebakken
Contributor

> When will it be possible to get a new version of the library with these changes?

Hopefully by the end of this month.

lukebakken added a commit that referenced this issue Jan 9, 2025
Fixes #1749

* Ensure `Dispose` and `DisposeAsync` are idempotent and thread-safe.
* Use TaskCompletionSource when `HandleChannelCloseAsync` runs to allow dispose methods to wait.
* Use `Interlocked` for thread safety.
* I like `_isDisposing` better. So sue me!
* Move the `Interlocked.Exchange` code to a getter, for readability.
* Minor nullable change.

Co-authored-by: Daniel Marbach <daniel.marbach@openplace.net>
lukebakken added a commit that referenced this issue Jan 9, 2025
Fixes #1749

* Add test project for GH #1749.
* Add `FirstChanceException` logging in #1749 test project.
* Add code to repeatedly close connection.
* Allow passing hostname to GH-1749 program as the first arg.
* Toxiproxy tests fixup.
* Ensure `_disposed` is set in `finally` block.
* Add lock object for disposal of Channels and Connections. Note: a
  `SemaphoreSlim` can't be used because it must be disposed as well,
  and that can't happen cleanly in a `Dispose` method.
* Add basic test to see what dispose after a channel exception does.
* Modify `CreateChannel` app to try and trigger GH1751
* Add TCS for server-originated channel closure. This is the "real" fix
  for the issue.
lukebakken added a commit that referenced this issue Jan 9, 2025
Fixes #1749

* Add test project for GH #1749.
* Add `FirstChanceException` logging in #1749 test project.
* Add code to repeatedly close connection.
* Allow passing hostname to GH-1749 program as the first arg.
* Toxiproxy tests fixup.
* Ensure `_disposed` is set in `finally` block.
* Add lock object for disposal of Channels and Connections. Note: a
  `SemaphoreSlim` can't be used because it must be disposed as well,
  and that can't happen cleanly in a `Dispose` method.
* Add basic test to see what dispose after a channel exception does.
* Modify `CreateChannel` app to try and trigger GH1751
* Add TCS for server-originated channel closure. This is the "real" fix
  for the issue.
lukebakken added a commit that referenced this issue Jan 23, 2025
Fixes #1760

* Start by re-adding #1749 test project.
lukebakken added a commit that referenced this issue Feb 11, 2025
Fixes #1760

* Start by re-adding #1749 test project.

Only log first-chance `ObjectDisposedException` instances.