[SWF] shutting down an activity worker is a nightmare of exceptions and lost tasks #642

Closed
danwashusen opened this issue Feb 25, 2016 · 2 comments

Comments

@danwashusen

We use SWF to coordinate long-running task processing in combination with EC2 auto-scaling, and we've noticed a number of issues with SpringActivityWorker and the JVM shutdown process.

  1. As far as I can tell, with the default behaviour (disableServiceShutdownOnStop=false) tasks fail to complete because the 'service' (SWF client) is shut down before 'pollExecutor' and 'poller'; as a result, tasks can't report their results.
  2. Setting disableServiceShutdownOnStop=true seems to fix the task results issue described in 1, but causes other activity tasks to time out because the poller could still be long-polling. If a task arrives during the shutdown process, the JVM locks up trying to submit the task to an executor service that is shutting down.
  3. The 'stop' method on SpringActivityWorker gives no indication (logs, etc.) that tasks are being abandoned if they don't complete within terminationTimeoutSeconds (it could check the result of 'awaitTermination').
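
To make the ordering problem in point 1 concrete, here is a minimal sketch using plain java.util.concurrent types rather than the Flow library. `FakeClient` and `ShutdownOrderSketch` are hypothetical stand-ins I made up for illustration; they are not part of the SWF SDK, and the real worker is more involved than this.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

final class ShutdownOrderSketch {
    /** Hypothetical stand-in for the SWF client: reporting fails once it is closed. */
    static final class FakeClient {
        private volatile boolean closed;
        final List<String> reported = new CopyOnWriteArrayList<>();

        void report(String result) {
            if (closed) {
                throw new IllegalStateException("client already shut down");
            }
            reported.add(result);
        }

        void close() {
            closed = true;
        }
    }

    /** Runs one in-flight task, shuts down in the given order, returns how many results were reported. */
    static int run(boolean closeClientFirst) {
        FakeClient client = new FakeClient();
        ExecutorService taskExecutor = Executors.newSingleThreadExecutor();
        CountDownLatch mayFinish = new CountDownLatch(1);

        taskExecutor.execute(() -> {
            try {
                mayFinish.await();          // the task is still running when shutdown begins
                client.report("done");
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } catch (IllegalStateException e) {
                // the client is gone, so the result is lost
            }
        });

        if (closeClientFirst) {
            client.close();                 // client closed while the task is still running
        }
        mayFinish.countDown();
        taskExecutor.shutdown();
        try {
            taskExecutor.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        if (!closeClientFirst) {
            client.close();                 // client closed only after tasks have drained
        }
        return client.reported.size();
    }

    public static void main(String[] args) {
        System.out.println("client closed first:  reported=" + run(true));
        System.out.println("tasks drained first: reported=" + run(false));
    }
}
```

In this toy model the result only survives when the executor is drained before the client is closed, which is the order I'd expect 'stop' to follow.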

We've managed to work around issue 3 by overriding the 'stop' method and looping on 'awaitTermination' until it returns true (see below).
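
For what it's worth, the check suggested in point 3 would be small. The following is only a sketch against a plain ExecutorService (the class and method names are hypothetical, not the library's), showing how the boolean returned by 'awaitTermination' could be turned into a warning instead of being ignored.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

final class TerminationCheck {
    /**
     * Requests shutdown and waits up to the given timeout.
     * Returns true if every task finished, false if some were still running.
     */
    static boolean stopAndReport(ExecutorService executor, long timeout, TimeUnit unit) {
        executor.shutdown(); // stop accepting new tasks; running tasks continue
        try {
            boolean terminated = executor.awaitTermination(timeout, unit);
            if (!terminated) {
                System.err.println("WARN: tasks still running after " + timeout + " " + unit
                        + ", they will be abandoned");
            }
            return terminated;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return false;
        }
    }

    public static void main(String[] args) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        executor.execute(() -> {
            try {
                Thread.sleep(500); // simulated long-running activity
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        boolean clean = stopAndReport(executor, 50, TimeUnit.MILLISECONDS);
        System.out.println("terminated cleanly: " + clean);
        executor.shutdownNow();
    }
}
```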

Issue 2 seems to be the real kicker: during JVM shutdown Spring calls 'stop' on SpringActivityWorker, which starts the shutdown process, skipping the 'service' shutdown (as configured) and stopping the 'pollExecutor' and 'poller'. However (as far as I can tell), because the 'service' hasn't been shut down (which would send an abort to any open requests), it's possible for an existing long-polling 'pollForActivityTask' request to be running and fetching new tasks. These tasks end up failing with a 'start to close' timeout because they are associated with an instance that was in the process of shutting down.
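
The race I think is happening can be reproduced in miniature. This is a sketch under my own assumptions, not the library's code: a BlockingQueue stands in for the SWF task list, a timed `poll` stands in for the long poll, and the stop flag is only checked between polls. All names here are hypothetical.

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

final class InFlightPollSketch {
    private final BlockingQueue<String> taskList = new LinkedBlockingQueue<>(); // stand-in for the task list
    private final List<String> acceptedAfterStop = new CopyOnWriteArrayList<>();
    private final CountDownLatch pollStarted = new CountDownLatch(1);
    private volatile boolean stopping;

    private void pollLoop() {
        while (!stopping) {                 // the flag is only checked between polls
            pollStarted.countDown();
            try {
                String task = taskList.poll(2, TimeUnit.SECONDS); // stand-in for a long poll
                if (task != null && stopping) {
                    acceptedAfterStop.add(task); // arrived after shutdown was requested
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    }

    /** Starts a poller, requests stop while a poll is in flight, then delivers a task. */
    static List<String> demo() {
        InFlightPollSketch sketch = new InFlightPollSketch();
        Thread poller = new Thread(sketch::pollLoop);
        poller.start();
        try {
            sketch.pollStarted.await();      // poller has passed the flag check
            sketch.stopping = true;          // shutdown requested
            sketch.taskList.offer("task-1"); // the service hands out a task anyway
            poller.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return sketch.acceptedAfterStop;
    }

    public static void main(String[] args) {
        System.out.println("accepted after stop: " + demo());
    }
}
```

The task is accepted even though stop was already requested, so whatever shuts the worker down still has to run it to completion (or the task times out on the service side).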

Maybe I'm missing something obvious because I can't find anyone else complaining about this. We've managed to work around all these issues with the following extension of 'SpringActivityWorker', but honestly the whole shebang makes me anxious.

public class GracefulShutdownSpringActivityWorker extends SpringActivityWorker implements ApplicationListener<ContextClosedEvent> {
    private static final Logger logger = LoggerFactory.getLogger(GracefulShutdownSpringActivityWorker.class);

    public GracefulShutdownSpringActivityWorker() {
    }

    public GracefulShutdownSpringActivityWorker(AmazonSimpleWorkflow service, String domain, String taskListToPoll) {
        super(service, domain, taskListToPoll);

        // this value must be true to avoid delayed shutdown issues:
        // activities will not be able to report their completed/failed state if the SWF client is shut down...
        setDisableServiceShutdownOnStop(true);
    }

    // the generic parameter means Spring only delivers ContextClosedEvent here, so no class check is needed
    @Override
    public void onApplicationEvent(ContextClosedEvent event) {
        // tell the poller not to fetch any more tasks (this blocks the poller threads until resumePolling() is called)
        logger.info("Suspend polling for new activity tasks...");
        super.suspendPolling();
    }

    /* the default impl. of this method leaves dangling activities that eventually 'time out', this shutdown process ensures that activities complete and report back */
    @Override
    public void stop() {
        if (!isDisableServiceShutdownOnStop()) {
            logger.warn("disableServiceShutdownOnStop is set to false, activities will not be able to report their completed/failed state!");
        }

        // it's possible that despite suspending polling we still get a task (a poll might already be in flight)
        // sleeping for 90 seconds ensures that any tasks accepted before we suspended polling get processed
        sleep();

        // request a shutdown (all running activities should complete)
        logger.info("Stopping the worker...");
        super.stop();

        // now release the suspended polling latch; the poller threads should exit because the poller is terminating
        super.resumePolling();

        // wait until all activities complete
        try {
            while (!super.awaitTermination(10, TimeUnit.SECONDS)) {
                logger.info("Still waiting for activity worker to shutdown...");
            }
            logger.info("Done waiting for activity worker to shutdown...");
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Failed while waiting for task executor to complete currently running tasks...", e);
        }
    }

    private void sleep() {
        final Duration duration = Duration.standardSeconds(90);
        logger.info(String.format("Sleeping for %s to allow dangling activity tasks to complete...", duration));
        ThreadUtils.sleepIfYouCan(duration);
    }
}

zhangzhx added the SWF label Mar 3, 2016

@fulghum
Contributor

fulghum commented May 10, 2016

Thanks for the feedback on the SWF Flow library. We've pinged the SWF team and passed along the feedback to them.

@manikandanrs
Contributor

This issue was moved to aws/aws-swf-flow-library#2
