We use SWF to coordinate long-running task processing in combination with EC2 auto-scaling, and we've noticed a number of issues with SpringActivityWorker and the JVM shutdown process.
As far as I can tell there are three problems:
1. With the default behaviour (disableServiceShutdownOnStop=false), tasks fail to complete because the 'service' (the SWF client) is shut down before 'pollExecutor' and 'poller'; as a result, tasks cannot report their results.
2. Setting disableServiceShutdownOnStop=true fixes the task-results problem from issue 1, but causes other activity tasks to time out because the poller can still be long-polling when shutdown starts (a configuration sketch follows this list). If a task arrives during the shutdown process, the JVM locks up trying to submit the task to an executor service that is shutting down.
3. The 'stop' method on SpringActivityWorker gives no indication (logs, etc.) that tasks are being abandoned when they fail to complete within terminationTimeoutSeconds (it could check the return value of 'awaitTermination').
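For context, this is roughly how the worker is configured in relation to issues 1 and 2. It's a minimal sketch only: the domain and task-list names are placeholders and activity-implementation registration is omitted.

import com.amazonaws.services.simpleworkflow.AmazonSimpleWorkflow;
import com.amazonaws.services.simpleworkflow.flow.spring.SpringActivityWorker;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class ActivityWorkerConfig {

    // Placeholder domain/task-list names; activity implementations omitted for brevity.
    @Bean
    public SpringActivityWorker activityWorker(AmazonSimpleWorkflow swfClient) {
        SpringActivityWorker worker = new SpringActivityWorker(swfClient, "SomeDomain", "SomeTaskList");
        // false (the default): stop() shuts down the SWF client before the poller, so issue 1 applies.
        // true: the client stays open so in-flight activities can report back, but issue 2 applies.
        worker.setDisableServiceShutdownOnStop(true);
        return worker;
    }
}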
We've managed to work around issue 3 by overriding the 'stop' method and looping on 'awaitTermination' until it returns true (see below).
Issue 2 seems to be the real kicker: during JVM shutdown Spring calls 'stop' on SpringActivityWorker, which starts the shutdown process, skipping the 'service' shutdown (as configured) and stopping the 'pollExecutor' and 'poller'. However, because the 'service' has not been shut down (which would abort any open requests), it's possible (as far as I can tell) for an existing long-polling 'pollForActivityTask' request to still be running and fetching new tasks. These tasks end up failing with a 'start to close' timeout because they are assigned to an instance that was already shutting down.
Maybe I'm missing something obvious because I can't find anyone else complaining about this. We've managed to work around all these issues with the following extension of 'SpringActivityWorker', but honestly the whole shebang makes me anxious.
import java.util.concurrent.TimeUnit;

import org.joda.time.Duration;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.context.ApplicationEvent;
import org.springframework.context.ApplicationListener;
import org.springframework.context.event.ContextClosedEvent;

import com.amazonaws.services.simpleworkflow.AmazonSimpleWorkflow;
import com.amazonaws.services.simpleworkflow.flow.spring.SpringActivityWorker;

public class GracefulShutdownSpringActivityWorker extends SpringActivityWorker implements ApplicationListener<ApplicationEvent> {

    private final Logger logger = LoggerFactory.getLogger(GracefulShutdownSpringActivityWorker.class);

    public GracefulShutdownSpringActivityWorker() {
    }

    public GracefulShutdownSpringActivityWorker(AmazonSimpleWorkflow service, String domain, String taskListToPoll) {
        super(service, domain, taskListToPoll);
        // This must be true to avoid the delayed-shutdown issues described above:
        // activities cannot report their completed/failed state if the SWF client has been shut down.
        setDisableServiceShutdownOnStop(true);
    }

    @Override
    public void onApplicationEvent(ApplicationEvent event) {
        if (event instanceof ContextClosedEvent) {
            // Tell the poller not to fetch any more tasks (this parks the poller threads).
            logger.info("Suspending polling for new activity tasks...");
            super.suspendPolling();
        }
    }

    /*
     * The default implementation of this method leaves dangling activities that eventually time out.
     * This shutdown process ensures that in-flight activities complete and report back.
     */
    @Override
    public void stop() {
        if (!isDisableServiceShutdownOnStop()) {
            logger.warn("disableServiceShutdownOnStop is set to false; activities will not be able to report their completed/failed state!");
        }

        // Despite suspending polling, a task may still arrive because a long poll can already be in flight.
        // Sleeping for 90 seconds ensures that any tasks accepted before polling was suspended get processed.
        sleep();

        // Request a shutdown (all running activities should complete).
        logger.info("Stopping the worker...");
        super.stop();

        // Release the suspended-polling latch, which should now exit because the poller is terminating.
        super.resumePolling();

        // Wait until all activities complete.
        try {
            while (!super.awaitTermination(10, TimeUnit.SECONDS)) {
                logger.info("Still waiting for activity worker to shut down...");
            }
            logger.info("Done waiting for activity worker to shut down.");
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for the task executor to complete currently running tasks", e);
        }
    }

    private void sleep() {
        final Duration duration = Duration.standardSeconds(90);
        logger.info(String.format("Sleeping for %s to allow dangling activity tasks to complete...", duration));
        ThreadUtils.sleepIfYouCan(duration); // project-local helper: sleeps for the given duration, ignoring interruption
    }
}
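For completeness, this is roughly how we register the extended worker; the wiring below is illustrative (domain and task-list names are placeholders). Because the worker is an ordinary bean, Spring publishes ContextClosedEvent to it before lifecycle beans are stopped, so suspendPolling() runs ahead of stop() during context shutdown.

import com.amazonaws.services.simpleworkflow.AmazonSimpleWorkflow;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class GracefulWorkerConfig {

    // Illustrative wiring only; domain and task-list names are placeholders.
    // Registering the worker as a bean is what makes it receive ContextClosedEvent
    // (suspending polling) before Spring drives its lifecycle stop().
    @Bean
    public GracefulShutdownSpringActivityWorker activityWorker(AmazonSimpleWorkflow swfClient) {
        return new GracefulShutdownSpringActivityWorker(swfClient, "SomeDomain", "SomeTaskList");
    }
}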