Skip to content
This repository has been archived by the owner on Jul 24, 2024. It is now read-only.

Dead session detection #57

Open
wants to merge 2 commits into
base: master
Choose a base branch
from
Open

Conversation

lynxbat
Copy link

@lynxbat lynxbat commented Sep 3, 2014

This PR adds a feature to handle unreliable client connections within a server implementation.

If a client loses connection or closes their connection poorly a server goroutine can get stuck waiting on a blocking method. Over a period of time these waiting routines stack up. In a server that handles a high frequency of sessions even a smaller session failure rate can add up quickly.

This PR adds the following features:

  • A timeout option for listener. It will return an error from spdy.TransportListener.AcceptTransport() if the timeout expires before an initial connection is opened.
  • A heartbeat for spdy.Transport.WaitReceiveChannel() that will trigger an error if the client session is detected as dead.
  • A heartbeat for spdy.channel.Receive() that will trigger an error like the above.

This provides the ability to handle client dead session for each of the blocking methods run inside a server goroutine allowing the goroutine for that session to exit gracefully.

Notes:

  • Tests are provided and show the usage.
  • Heartbeat defaults to 30 second pings with a failure limit of 3 lost pings in a row (120 seconds)
  • Heartbeat can be overridden by each server for both ping interval and failure limit.
  • Heartbeat uses the spdystream.Connection.Ping() method.

Docker-DCO-1.1-Signed-off-by: Nicholas Weaver lynxbat@gmail.com (github: lynxbat)

Docker-DCO-1.1-Signed-off-by: Nicholas Weaver <lynxbat@gmail.com> (github: lynxbat)
@lynxbat
Copy link
Author

lynxbat commented Sep 7, 2014

Not sure if needed but since this is a big add I can spend time on Hangouts with someone demoing or doing a code review if desired.

@dmcgowan
Copy link
Contributor

dmcgowan commented Sep 8, 2014

Spending some time on code review will be useful. I will slot some time later in the week to review and test this change.

@lynxbat
Copy link
Author

lynxbat commented Sep 10, 2014

Feel free to ping me if needed.

@dmcgowan
Copy link
Contributor

There seems to be two distinct changes in this PR, atleast from the review perspective. Let me address the accept timeout first and then heartbeat/dead session detection second.

Do you have a specific use case which is requiring the accept timeout? I have not seen this pattern widely used since normally accept is done until an explicit close. An explicit close on the listener is still allowed which would cause any further accepts to throw an error, effectively ending any accept loop. As for the code, when using the timeout method in that case I would advocate for setting the timeoutChan to nil when the timeout value is 0. Making the channel nil will have the same block forever effect in the select statement. On a more trivial note the waitForTimeout also seems to have the same functionality as time.After. Likewise errors.New(fmt.Sprintf()) can be replaced by fmt.Errorf, although in the case of timeouts creating an ErrTimeout variable maybe more useful for the caller to easily check if the error was a timeout vs an error caused by a closed listener.

I like the idea behind detecting the dead sessions and heartbeat. Will need more time to fully review and test the code. I am wondering if there is a different way to do the teardown upon bad connection detection. When the transport goes bad such as what is detected by the dead session detector, any attempt to receive should already cause EOF. If it doesn't then the normal connection teardown should begin so any callers of receive will be able to end their receive loops. In the case of a detected bad session, it may be useful to store the error received somewhere that can be checked by the transport initiator, but not necessarily throwing a non-EOF error to callers of receive. Would have to think some more about how this error is expected to be handled by the caller. In an ideal world the libchan transport would be able to fallback to use another connection, but that would be in the future with quite a bit more work.

@lynxbat
Copy link
Author

lynxbat commented Sep 12, 2014

I debated even adding the listener timeout since it seems like it would be more specific to a server implementation. I left it in since I made it work basically. I can understand removing from the heartbeat work and approaching separately or maybe not at all right now.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants