Skip to content
This repository has been archived by the owner on Dec 14, 2020. It is now read-only.

No way to detect a connection loss or recover a connection #188

Open
borlinp opened this issue Nov 19, 2019 · 14 comments
Open

No way to detect a connection loss or recover a connection #188

borlinp opened this issue Nov 19, 2019 · 14 comments

Comments

@borlinp
Copy link

borlinp commented Nov 19, 2019

When a connection is lost to the server, there is no way for an application to know the connection is lost. The client, conn, and session sit idle as if nothing is wrong. Even publishing gives no error.

Would recommend a way to detect a closed connection internally, and providing a channel to a user of the amp library which notifies of a connection loss to the user.

@vcabbage
Copy link
Owner

There are both TCP and AMQP level timeouts that will be reported. Can you please provide some more details about what you're experiencing?

@borlinp
Copy link
Author

borlinp commented Nov 19, 2019

Thank you for the quick response. How do you set those timeouts? Are there defaults?

@vcabbage
Copy link
Owner

The AMQP timeout is configured via https://godoc.org/pack.ag/amqp#ConnIdleTimeout, default of 1 minute.

TCP is more complex and is the default set by net.Dial unless you provide your own net.Conn to https://godoc.org/pack.ag/amqp#New. I think the docs here apply https://golang.org/pkg/net/#Dialer.KeepAlive

@borlinp
Copy link
Author

borlinp commented Nov 19, 2019

Great, thank you for that. How do you listen for the timeouts?

@vcabbage
Copy link
Owner

Currently you have to do some sort of operation. This typically isn't a problem with receivers, but may not be ideal for senders.

I feel there's a high bar to justify exposing channels in an external API, but perhaps it's warranted here. The alternative would be to add a blocking Done() error method to Client. With the channel approach it'd likely need to be Done() <-chan struct{} and Err() error (very similar to context.Context).

@alanconway You've thought a lot about these APIs, do you have an opinion?

@alanconway
Copy link
Contributor

The simplest solution is an identified exception like:
// ErrConnAborted returned if the connection closed unexpectedly or timed out.
That is different form ErrConnClosed which means the remote closed politely and without error.
If any call returns ErrConnAborted you know the entire connection is gone.

Electron has the Done()/Err() API on all endpoints: Connection, Session, Sender and Receiver.
In retrospect I'm not sure it was necessary, the only place it's used in the examples could have been handled more simply with:
err, m := x.Send()
if err == ErrConnAborted { ...

I'd avoid it until you're sure you need it, it risks giving a false sense of completion - Done() closing means the AMQP thing has closed but it does not tell you anything about the state of the goroutines that were calling Send(), Receive() etc. at the time - which may still be executing. So if you are using multiple goroutines you really should use your own sync.Group or errgroup.Group to wait till they all see the ErrConnAborted or are otherwise in a safe place. If you're not using multiple goroutines, you don't need Done()

@borlinp
Copy link
Author

borlinp commented Nov 19, 2019

Great ideas.

What I'm struggling with at the moment is:

Currently you have to do some sort of operation. This typically isn't a problem with receivers, but may not be ideal for senders.

The operations return an error of "EOF". Not very clear what is happening without lots of digging. Some sort of mechanism to clearly convey from the API that a connection is lost would be desired.

Second, why is this not a problem with receivers? Do they try to reconnect or something?

Lastly, it appears that the keep-alive just swallows the error if a keep-alive is unsuccessful. I would think that is a great opportunity to convey to the user of the API that the connection has gone stale, or attempt to reconnect.

Thank you both for exploring this and giving me more insight.

@vcabbage
Copy link
Owner

@alanconway Thanks!

@borlinp I should have been more clear. Since send operations may happen infrequently you may not find out that the connection has been disconnected until one is tried. Receiving tends to happen constantly and will block until a message is received or an error occurs.

I agree that more context should be added to errors where relevant.

it appears that the keep-alive just swallows the error if a keep-alive is unsuccessful.

It should not be ignored. Have you observed this and can you describe the details and/or share code?

@borlinp
Copy link
Author

borlinp commented Nov 22, 2019

Line 600 of conn.go gets the error from the keepalive frame.
599 case <-keepalive:
600 _, err = c.net.Write(keepaliveFrame)

This error gets bubbled up to conn.err. The value of conn.err isn't checked until close(). Perhaps a good place to re-establish the connection is when the error is assigned to conn.err.

@vcabbage
Copy link
Owner

Once conn.mux receives the error on conn.connErr it sets conn.err and returns (L347). When conn.mux returns the deferred conn.close() runs and conn.done is closed. At that point all sessions and links will return conn.err.

This is intended to be a low-level library, reconnection is up to the user.

@borlinp
Copy link
Author

borlinp commented Nov 22, 2019

Understandable that you wouldn't want to reconnect; that makes sense.

I would love to see something like Streadway's AMQP 0.9 library where you can subscribe and be notified of a connection closure: https://godoc.org/github.com/streadway/amqp#Channel.NotifyClose

We have a work around that is functioning for us. I'll leave this issue up to you to close or not, depending if you think the effort/feature would be valuable. Thanks for the conversation about it.

@vcabbage
Copy link
Owner

I'm happy to keep this open since I think there are some things to be done to allow detecting the errors more proactively and include more information about where the error originated from. Though I don't know when I'll get around to making any changes. (I try to respond to issues promptly and fix bugs, but as it's a personal side project for me I don't dedicate a lot of time to other improvements.)

To add a little more detail as to why I defer to users to handle reconnection: Doing a connection reconnect probably wouldn't be all that difficult (even that may be involved depending on the auth method), but recreating the sessions and links is non-trivial since reusing the initial parameters may or may not be appropriate depending on the broker semantics.

@borlinp
Copy link
Author

borlinp commented Nov 22, 2019

If I get some time, I might be willing to come up with a patch to contribute to the project. Thanks.

@vcabbage
Copy link
Owner

Contributions are welcome! If you do decide to do that please share what you have in mind before putting in too much effort. The streadway API takes a very different approach from this library and would not fit in well. I'd like to see something more along the lines of #188 (comment)

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants