[SWF] shutting down an activity worker is a nightmare of exceptions and lost tasks #2
Comments
From @fulghum on May 10, 2016 18:39
Thanks for the feedback on the SWF Flow library. We've pinged the SWF team and passed along the feedback to them.
Not sure how helpful (or consistent) this is, but I was able to reproduce inconsistent downscaling behavior. Here is at least one key difference that might be useful in debugging the problem: a FAILED shutdown that never hits the hook initiates with the following AbortedException: I haven't dug deep enough in the SDK code to see why one would be handled differently than the other, but it may be a good lead... still digging...
Update: It seems our shutdown hook is being called at least some of the time. When the above AbortedException occurs, my current hypothesis is that since the abort is happening in the actual stream read: I was able to reproduce this on one of our activity workers that consistently takes at least 1-2 seconds to do its work (so my window of interruption is slightly longer). I sat in a second SSH window with the first window tailing our log. When the log started to move, I ran systemctl stop on the service. The above logs came from a 1-1 run (first shutdown successful, second one failed). Today I succeeded twice before failing. Either way, there is a fairly high risk of occurrence.
@manikandanrs @danwashusen I've also encountered this issue, and thank you for posting this solution, it helped me a lot. I've also found a couple of interesting things.
First of all, if a worker is shutting down and a new task is accepted during this process (by one of the already running poll requests), then the whole process hangs, since Amazon uses a client-blocking rejection handler for the task executor's pool (see GenericActivityWorker.java:99). It looks like it's impossible to change this implementation (to, say, the default one, which would just throw an exception). It's possible to extend GenericWorker and override the createPoller method, but it's impossible to use this extended variant in SpringActivityWorker, so I had to create my own SpringActivityWorker that uses my implementation of GenericWorker.
Another thing: by default the poll requests have no timeout, so it would probably be a good idea to set their timeout to something less than the time that the graceful shutdown worker sleeps after suspending polling. This way all the running poll requests will have either accepted a task or disconnected and started waiting on the countdown latch (because of the suspension). This should guarantee that after the sleep() call there are no running long poll requests. Unfortunately, as far as I can tell, this timeout is not configurable, so I had to implement my own MyActivityTaskPoller, which extends ActivityTaskPoller by essentially copying the existing one and changing only the timeout.
Also, in conjunction with the items above (default rejection handler and custom ActivityTaskPoller), it's possible to implement a safety net: MyActivityTaskPoller would catch the RejectedExecutionException and fail the task that arrived. This way the task would be immediately marked as failed in SWF and the workflow would continue instead of eventually timing out.
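To make the safety net idea concrete, here is a minimal sketch using only java.util.concurrent. It is not the Flow framework's code: `RejectionSafetyNet`, `FailureReporter`, `dispatch` and `newExecutor` are hypothetical names made up for illustration, and `FailureReporter` only stands in for whatever call reports the failure back to SWF. The executor uses the JDK's default `AbortPolicy`, so a task handed over after `shutdown()` is rejected with a `RejectedExecutionException` instead of blocking the caller.

```java
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class RejectionSafetyNet {

    // Hypothetical stand-in for reporting a task failure back to the service.
    public interface FailureReporter {
        void fail(String taskToken, String reason);
    }

    // Fixed-size pool with no queueing and the JDK default AbortPolicy:
    // a rejected task throws instead of blocking the polling thread.
    public static ThreadPoolExecutor newExecutor(int threads) {
        return new ThreadPoolExecutor(threads, threads, 1, TimeUnit.MINUTES,
                new SynchronousQueue<>(), new ThreadPoolExecutor.AbortPolicy());
    }

    // Hands the task to the executor. If the executor rejects it (for example
    // because it is already shut down), the task is failed right away so the
    // workflow does not have to wait for a timeout. Returns true if accepted.
    public static boolean dispatch(ThreadPoolExecutor executor, String taskToken,
                                   Runnable work, FailureReporter reporter) {
        try {
            executor.execute(work);
            return true;
        } catch (RejectedExecutionException e) {
            reporter.fail(taskToken, "worker is shutting down");
            return false;
        }
    }
}
```

Whether failing the task immediately is preferable to letting it time out depends on the workflow's retry logic, so treat this as one possible design rather than a recommendation.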
This was all done as a PoC: copying the existing classes isn't good practice, so a more robust and clean approach is needed. Is it possible to submit a pull request to the SWF framework addressing my points? Just to sum up:
Please let me know if I missed anything and/or if this can be done with the existing code. I tried to look carefully, but of course I may have missed something. UPD: Another improvement I made is in ActivityTaskPoller: instead of waiting on the semaphore indefinitely, I added a timeout. This way, there won't be dangling polling tasks left during the shutdown process.
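The semaphore timeout mentioned in the UPD can be sketched roughly as follows. This is an illustration with plain JDK classes, not the actual ActivityTaskPoller code: `TimedPermitPoller`, `awaitSlot`, `releaseSlot` and `requestShutdown` are hypothetical names. The idea is that a thread waiting for a free execution slot wakes up periodically via `Semaphore.tryAcquire(timeout, unit)` and rechecks a shutdown flag, so it cannot be left parked forever.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

public class TimedPermitPoller {
    private final Semaphore permits;
    private final AtomicBoolean shuttingDown = new AtomicBoolean(false);

    public TimedPermitPoller(int maxConcurrentTasks) {
        this.permits = new Semaphore(maxConcurrentTasks);
    }

    // Signals waiting threads to give up at their next wake-up.
    public void requestShutdown() {
        shuttingDown.set(true);
    }

    // Waits for a free slot, rechecking the shutdown flag every waitMillis.
    // Returns true if a permit is now held, false if shutdown was requested
    // or the waiting thread was interrupted.
    public boolean awaitSlot(long waitMillis) {
        try {
            while (!shuttingDown.get()) {
                if (permits.tryAcquire(waitMillis, TimeUnit.MILLISECONDS)) {
                    return true;
                }
            }
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
            return false;
        }
    }

    // Frees a slot once a task has finished.
    public void releaseSlot() {
        permits.release();
    }
}
```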
From @danwashusen on February 25, 2016 5:2
We use SWF to coordinate long-running task processing in combination with EC2 auto-scaling, and we've noticed a bunch of issues with SpringActivityWorker and the JVM shutdown process.
We've managed to work around issue 3 by overriding the 'stop' method and looping on 'awaitTermination' until it returns true (see below).
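The "loop on 'awaitTermination' until it returns true" pattern looks roughly like this when applied to a plain `ExecutorService`. This is a generic sketch, not the SpringActivityWorker override itself: `AwaitUntilTerminated`, `shutdownAndWait` and the `pollMillis` parameter are made up for illustration.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.TimeUnit;

public class AwaitUntilTerminated {

    // Stops accepting new work, then keeps waiting in pollMillis slices until
    // every running task has finished. Returns true once the executor has
    // terminated, false if the waiting thread was interrupted first.
    public static boolean shutdownAndWait(ExecutorService executor, long pollMillis) {
        executor.shutdown();
        try {
            while (!executor.awaitTermination(pollMillis, TimeUnit.MILLISECONDS)) {
                // Tasks are still running; keep waiting instead of giving up
                // after a single timeout.
            }
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
            return false;
        }
    }
}
```

Note that an unbounded loop like this blocks JVM shutdown for as long as the slowest task runs, which is presumably the intent here given the long-running activities.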
Issue 2 seems to be the real kicker: during JVM shutdown Spring calls 'stop' on SpringActivityWorker, which starts the shutdown process, skipping the 'service' shutdown (as configured) and stopping the 'pollExecutor' and 'poller'. However (as far as I can tell), because the 'service' hasn't been shut down (which would send an abort to any open requests), it's possible for an existing long-polling 'pollForActivityTask' request to still be running and fetching new tasks. These tasks end up failing with a 'start to close' timeout because they are associated with an instance that was in the process of shutting down.
Maybe I'm missing something obvious because I can't find anyone else complaining about this. We've managed to work around all these issues with the following extension of 'SpringActivityWorker', but honestly the whole shebang makes me anxious.
Copied from original issue: aws/aws-sdk-java#642