Allow long-running jobs to finish in-between deployments #491

yoelcabo · 2023-09-22T20:44:54Z

Context

At Happy Scribe, we are using Kamal to orchestrate a bunch of workers that run FFmpeg to do video transcoding. FFmpeg is not great at distributed computing, and some video transcoding jobs can take more than 30 minutes.

Kamal has been extremely useful to us because it lets us run those jobs on cheap hardware at Hetzner while the rest of our application keeps running on Heroku. Thanks a lot for this project 🙌

The only problem we have is that we cannot find a way to prevent our jobs from being stopped midway while also keeping deployments quick as part of our monolith's CI/CD system.

Configuration that almost works

Right now, we have Kamal configured with a stop_timeout of 10 minutes, which makes it call docker stop -t 600. This first sends SIGTERM to the worker and then waits up to 10 minutes before sending SIGKILL. In most queuing systems, a worker that receives SIGTERM stops picking up new jobs but still finishes the ones in progress. While the old worker container is finishing its jobs, Kamal has already started a new one with the new code. We have also configured our queuing system to re-enqueue a job if it is still running close to 10 minutes after SIGTERM.
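To make the SIGTERM behaviour concrete, here is a minimal Python sketch of a worker loop that stops taking new jobs once it receives SIGTERM but lets the in-flight job finish. It is not our actual worker or any particular queuing library; the GracefulWorker class and its in-memory queue are hypothetical stand-ins.

```python
import signal
from collections import deque


class GracefulWorker:
    """Hypothetical worker: after SIGTERM it finishes the current job and takes no new ones."""

    def __init__(self, jobs):
        # jobs is an iterable of (name, callable) pairs; a real worker would poll a queue.
        self.queue = deque(jobs)
        self.stopping = False
        self.completed = []

    def _on_sigterm(self, signum, frame):
        # Only set a flag; the job that is currently running is not interrupted.
        self.stopping = True

    def run(self):
        # signal.signal must be called from the main thread.
        previous = signal.signal(signal.SIGTERM, self._on_sigterm)
        try:
            while self.queue and not self.stopping:
                name, job = self.queue.popleft()
                job()  # runs to completion even if SIGTERM arrives meanwhile
                self.completed.append(name)
        finally:
            signal.signal(signal.SIGTERM, previous)
        # Jobs that were never started stay queued for the new container.
        return [name for name, _ in self.queue]
```

If the job is still running when the docker stop timeout expires, the container receives SIGKILL, which cannot be handled; that is the case our re-enqueue setting covers.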

This works reasonably well. However, we would like the stop timeout to be something more like 30 minutes or 1 hour, while at the same time not locking subsequent deployments.

Proposal

What I am envisioning is Kamal just calling docker stop -t stop_timeout in the background and "forgetting" about the container. To have a bit of persistence and be able to monitor this, we could also create some files on the host's file system that record which PID is responsible for stopping which container. Subsequent deployments could then check those files to confirm that everything is in order.
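A rough Python sketch of that idea follows, only to illustrate the mechanics; it is not how Kamal works today. The function names, the one-JSON-file-per-container layout, and the state_dir location are all assumptions made for illustration. The only real command involved is docker stop -t.

```python
import json
import os
import subprocess
from pathlib import Path


def stop_command(container, stop_timeout):
    # The same command Kamal runs in the foreground today.
    return ["docker", "stop", "-t", str(stop_timeout), container]


def stop_in_background(container, stop_timeout, state_dir, command=None):
    """Start the stop command detached and record its PID (hypothetical file layout)."""
    command = command or stop_command(container, stop_timeout)
    proc = subprocess.Popen(
        command,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        start_new_session=True,  # detach so the stop outlives the deploy session
    )
    state_dir = Path(state_dir)
    state_dir.mkdir(parents=True, exist_ok=True)
    record = {"container": container, "pid": proc.pid, "stop_timeout": stop_timeout}
    (state_dir / f"{container}.json").write_text(json.dumps(record))
    return record


def pending_stops(state_dir):
    """Return records whose stopper process still exists; delete stale records."""
    pending = []
    for path in sorted(Path(state_dir).glob("*.json")):
        record = json.loads(path.read_text())
        try:
            os.kill(record["pid"], 0)  # signal 0 checks existence, sends nothing
            pending.append(record)
        except ProcessLookupError:
            path.unlink()  # the stopper has exited, so the container is gone
    return pending
```

A later deployment could call pending_stops to see which old containers are still draining. Note that a bare PID check can be fooled by PID reuse, so a real implementation would probably also confirm the container's state with Docker.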

Is this something you'd be willing to explore? (perhaps under a configuration flag) If so, I'll work on a PR and submit it.
If not, is there any alternative approach you'd recommend?

Note: I know this would be "easy" with Kubernetes, but honestly I'd rather maintain my own fork of Kamal than our own Kubernetes cluster.

yoelcabo · 2023-11-12T07:57:21Z

I've submitted a PR:
#579
