Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Revisiting gracefulClose with STM racing #589

Conversation

kazu-yamamoto
Copy link
Collaborator

First of all, please read Implementing graceful-close in Haskell network library. This PR revisits "Approach 4: callbacks of the IO/Timer manager".

What we know so far:

  • timeout cannot be used in the resource-release clause of bracket. This is because an asynchronous exception, which is used in timeout, cannot be delivered in the clause.
  • Callback functions of the system IO/Timer manager MUST not be blocked. If blocked, the managers are also blocked.

The approach 4 uses MVar. Even we use tryPutMVar, the IO manager is sometime blocked. This results in so many CLOSE_WAITs on Linux. This name is confusing but it appeared that they represent entries of the listen queue, not established connections waiting close on Linux. Anyway, it is a clear evidence of the blocked IO manager.

@khibino suggested me to use STM actions both for timeout and recv-ready. Since they are a single action for earch, they are not blocked. And they can be composed because they are STM actions. This PR implements his idea instead of using MVar.

I deployed this network library on my test server. So far, I don't see any CLOSE_WAIT.

Cc: @larskuhtz @snoyberg @fosskers @nh2

@nh2
Copy link
Member

nh2 commented Oct 21, 2024

@kazu-yamamoto There are stilla couple things I don't understand, you need to elaborate:

  • timeout cannot be used in the resource-release clause of bracket. This is because an asynchronous exception, which is used in timeout, cannot be delivered in the clause.

Where is this timeout interaction in bracket? Which bracket are we talking about?

It is only shortly mentioned in your post, but I don't really understand what scenario/implementation that's referring to.

Thus, I do not currently understand why we can't just do

timeout lingerTimeout (forever (recv s))
  • Callback functions of the system IO/Timer manager MUST not be blocked. If blocked, the managers are also blocked.

Can you outline the (problematic) Approach 4 with sample code, so that I can understand it, and its problem?


I also don't understand what's wrong with the current code (which is Approach 3).

Is that explained somewhere? What's the current problem, is there an open issue I missed?

The commit message should mention what the problem is that the commit solves.


Thank you -- I'm eager to participate in this!

@kazu-yamamoto
Copy link
Collaborator Author

  • timeout cannot be used in the resource-release clause of bracket. This is because an asynchronous exception, which is used in timeout, cannot be delivered in the clause.

Where is this timeout interaction in bracket? Which bracket are we talking about?

Sorry for your confusion. This is about approach 2.

  • Callback functions of the system IO/Timer manager MUST not be blocked. If blocked, the managers are also blocked.

Can you outline the (problematic) Approach 4 with sample code, so that I can understand it, and its problem?

77ee187 is the commit to remove the final code of approach 4.

I also don't understand what's wrong with the current code (which is Approach 3).

Unnecessary delay happens.

For instance, if we delay 30 us but a packet arrives after 10us, we wait for unnecesary 20 us.

The commit message should mention what the problem is that the commit solves.

Thank you for the suggestion. I will.

@nh2
Copy link
Member

nh2 commented Oct 21, 2024

Sorry for your confusion. This is about approach 2.

Yes, that I understand, but in the post it only says:

Of course, a very easy way is combine the timeout function and the original recv function which may trigger the IO manager. This does not work if used with bracket because timeout throws an asynchronous exception but it does not reach the resource-release action in bracket.

I don't understand which bracket is being referred to here. What's the proposed example code that doesn't work? It it something like

timeout lingerTimeout (forever (recvUntilEof s))

Why doesn't that work?

@kazu-yamamoto
Copy link
Collaborator Author

Suppose we define:

gracefulClose t s = timeout t (recv s ...)

We cannot use this function as follows:

bracket (openSocket ...) gracefulClose $ \sock ...

@nh2
Copy link
Member

nh2 commented Oct 22, 2024

bracket (openSocket ...) gracefulClose $ \sock ...

Maybe we (in warp, and users in general) should not use the function that way anyway!

Because then you get an uninterruptible cleanup action that runs for a long time. You keep pressing Ctrl+C and the program does not terminate.

Shouldn't we write the function so that the second async exception immediately calls close and stops the graceful-close loop?

For example, inlining the definition of bracket (also described in docs of mask and modifying it a bit:

-- | The first async exception will initiate graceful closing,
-- a second one will close the socket (which may result in
-- TCP connection resets of the other side).
withInterruptibleGracefulClose :: (Socket -> IO a) -> IO a
withInterruptibleGracefulClose action = do

  mask $ \restore -> do
    s <- openSocket

    let gracefulCloseOnExceptionClose = (gracefulClose s) `onException` (close s)

    r <- restore (action s) `onException` (restore gracefulCloseOnExceptionClose)
    _ <- restore gracefulCloseOnExceptionClose
    return r

That said, I do see the appeal of having a function that can be used as you say, with

bracket (openSocket ...) gracefulClose $ \sock ...

To make terminology clear, we would then have to declare that gracefulClose is an "interruptible" function, as referred to by these docs:

Interruptible functions

Docs of bracket say:

Bracket wraps the release action with mask, which is sufficient to ensure that the release action executes to completion when it does not invoke any interruptible actions, even in the presence of asynchronous exceptions. For example, hClose is uninterruptible when it is not racing other uses of the handle. Similarly, closing a socket (from "network" package) is also uninterruptible under similar conditions. An example of an interruptible action is killThread. Completion of interruptible release actions can be ensured by wrapping them in uninterruptibleMask_, but this risks making the program non-responsive to Control-C, or timeouts.

And Applying mask to an exception handler says more:

Some operations are interruptible, which means that they can receive asynchronous exceptions even in the scope of a mask. Any function which may itself block is defined as interruptible; this includes takeMVar (but not tryTakeMVar), and most operations which perform some I/O with the outside world.
...
It is useful to think of mask not as a way to completely prevent asynchronous exceptions, but as a way to switch from asynchronous mode to polling mode. The main difficulty with asynchronous exceptions is that they normally can occur anywhere, but within a mask an asynchronous exception is only raised by operations that are interruptible (or call other interruptible operations). In many cases these operations may themselves raise exceptions, such as I/O errors, so the caller will usually be prepared to handle exceptions arising from the operation anyway. To perform an explicit poll for asynchronous exceptions inside mask, use allowInterrupt.

Idea

So, could we use interruptible?

For example, could we do

gracefulClose s tmout0 = sendRecvFIN `E.finally` close s
  where
    sendRecvFIN = do
        -- Sending TCP FIN.
        ex <- E.try $ shutdown s ShutdownSend
        case ex of
          Left (E.SomeException _) -> return ()
          Right () -> do
              -- Giving CPU time to other threads hoping that
              -- FIN arrives meanwhile.
              yield
              -- Waiting TCP FIN.

              -- Make the graceful-close waiting `interruptible`
              -- even when `gracefulClose` is called in `mask`, e.g. as
              --
              --     bracket (openSocket ...) gracefulClose $ \sock ...
              --
              -- If the timeout part gets interrupted by an async exception,
              -- the `finally` above will close the socket.
              interruptible (timeout lingerTimeout recvUntilEof)

(Note interruptible has the same implementation as the restore function of mask; both use unsafeUnmask.)

@kazu-yamamoto
Copy link
Collaborator Author

Your research is great!

Sorry but it seems to me that I misunderstand timeout.
Through some experiments, timeout works well in Control.Exception.bracket.
I need to understand why it works.

The reason why I misunderstand is probably that I use UnliftIO.Exception.bracket.
timeout does not work in it.

import Control.Exception
-- import UnliftIO.Exception

import Control.Concurrent
import System.Timeout

main :: IO ()
main = bracket newEmptyMVar (timeout 3000000 . takeMVar) $ \_ -> return ()

@kazu-yamamoto
Copy link
Collaborator Author

Since timeout works in bracket, recvEOFloop is now based on timeout.
recvEOFevent is nicer than recvEOFloop since recvEOFevent does not spawn a thread.

@edsko
Copy link

edsko commented Oct 23, 2024

The reason timeout does not work with the unliftio version of bracket is that it runs the cleanup handler using uninterruptibleMask_, so even interruptible operations won't be interruptible anymore (see https://hackage.haskell.org/package/unliftio-0.2.25.0/docs/src/UnliftIO.Exception.html#bracket). I must admit that personally I don't like the design of unliftio; it uses functions with the same name (bracket, catch), but with subtly different semantics (uninterruptibleMask versus mask in bracket, only catching synchronous exceptions in catch). Not only are these differences subtle, they are very difficult to debug and can likely result in non-deterministic failures.

@kazu-yamamoto
Copy link
Collaborator Author

I have already removed the dependency for unliftio from my packages on github.
I'm now trying to remove it from warp.

@nh2
Copy link
Member

nh2 commented Oct 25, 2024

I have already removed the dependency for unliftio from my packages on github.
I'm now trying to remove it from warp.

@kazu-yamamoto Not sure that's a good idea, see below:

I must admit that personally I don't like the design of unliftio; it uses functions with the same name (bracket, catch), but with subtly different semantics (uninterruptibleMask versus mask in bracket, only catching synchronous exceptions in catch). Not only are these differences subtle, they are very difficult to debug and can likely result in non-deterministic failures.

@edsko That behaviour is arguably the correct thing to do in most cases, and the reason why using unliftio makes applications drastically less buggy.

Most uses of bracket do not intend to handle (and then not re-throw) async exceptions, but accidentally do, which makes the programs wrong (usually deadlock).

This is quite nicely explained in https://tech.fpcomplete.com/haskell/tutorial/exceptions/#sync-vs-async

An asynchronous exception is totally different. It is a demand from outside of our control to shut down as soon as possible. If we were to catch such an exception and recover from it, we would be breaking the expectations of the thread that tried to shut us down. Instead, with asynchronous exceptions, exception handling best practices tell us we're allowed to clean up, but not recover. For example, the timeout function uses asynchronous exceptions. What should the expected behavior here be? [Example]

The example shows exactly the case discussed in this issue: "timeout + delay-on-interruption"

In most application code, the additional delay should not happen; an async exception should reliably kill the entire thread. That should be the default; thus unliftio's timeout is good.

But gracefulClose has special semantics, where we want only the second exception to kill the thread. In that case, we need to use the normal timeout with it's pitfalls (but it's OK to use here, because we're implementing a very special case).

@kazu-yamamoto Thus my recommendation is be to stick to unliftio as the default, and use Control.Exception only for special cases such as "timeout + delay-on-interruption".

@kazu-yamamoto
Copy link
Collaborator Author

@kazu-yamamoto Thus my recommendation is be to stick to unliftio as the default, and use Control.Exception only for special cases such as "timeout + delay-on-interruption".

As a heavy user of unliftio, I should admit the following two:

  1. With unliftio, some clean-up functions cannot work. Examples are not only gracefulClose but also bye of tls. When these functions are used, Control.Exception must be used.
  2. We cannot ask all package maintainers to introduce unliftio. If we could use unliftio only over packages relating our project, it would give us sense of security. However, the reality is not. In the real world, our project based on both unliftio and Control.Exception. This introduces new anxiety.

That said, I would like to go with only Control.Exception and carefully write up exception handlers to rethrow asynchronous exceptions.

@kazu-yamamoto
Copy link
Collaborator Author

I'm going to adopt the asynchronous-exception-less programming with Control.Exception.
An example is gracefulClose in the case where EventManager is available.
If not available, timeout is used.

@kazu-yamamoto kazu-yamamoto merged commit bda5016 into haskell:master Nov 7, 2024
10 checks passed
@kazu-yamamoto kazu-yamamoto deleted the gracefulClose-based-on-racing-again branch November 7, 2024 00:03
@nh2
Copy link
Member

nh2 commented Nov 7, 2024

I would like to go with only Control.Exception and carefully write up exception handlers to rethrow asynchronous exceptions.

I think any choice you make for network in this regard is fine -- as long as the person that writes the code is aware what happens (and you are), both approaches are OK. I just wanted to support the idea of unliftio in general, because I think it's the saner default.

@kazu-yamamoto
Copy link
Collaborator Author

I found no problems in my field testing.
So, I have merged this PR and released a new version.
If we find problems, let's open a new issue.

@kazu-yamamoto
Copy link
Collaborator Author

Unfortunately, I saw CLOSE_WAIT today.
I will open a new issue.

@kazu-yamamoto kazu-yamamoto mentioned this pull request Nov 17, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants