[docker driver] - Add support for Docker's live-restore feature to prevent unnecessary allocation respawns. #24105

Closed
eduardolmedeiros opened this issue Oct 2, 2024 · 1 comment

@eduardolmedeiros
Contributor

Description

When using Docker's live-restore feature, Nomad currently respawns containers (creates new allocations) unnecessarily during Docker daemon restarts. This behavior counteracts the benefits of using live-restore and might cause disruptions in environments where Docker daemon updates or restarts are necessary.

Current Behavior

  • Docker daemon is configured with live-restore enabled
  • When the Docker daemon restarts, containers continue running as expected
  • However, Nomad creates new allocations for these containers, effectively respawning them
  • This results in unnecessary disruption and resource usage

Expected Behavior

  • Nomad should be aware of Docker's live-restore feature
  • When the Docker daemon restarts, Nomad should:
    1. Detect that the daemon is unavailable
    2. Wait for a configurable timeout period
    3. Once the daemon is available again, check the actual state of containers
    4. Only create new allocations if the containers are genuinely not running
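
To make the four steps above concrete, here is a minimal sketch in Python. It is not Nomad code and does not call the Docker API: `is_available` and `list_running` are hypothetical callbacks standing in for "can the daemon be reached?" and "which tracked containers are running?", and the function names are made up for illustration.

```python
import time
from typing import Callable, Iterable, Set


def wait_for_daemon(
    is_available: Callable[[], bool],
    timeout_s: float,
    poll_interval_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Steps 1-2: poll until the daemon answers or the timeout elapses."""
    deadline = clock() + timeout_s
    while not is_available():
        if clock() >= deadline:
            return False  # daemon never came back within the timeout
        sleep(poll_interval_s)
    return True


def allocations_to_replace(
    tracked: Iterable[str],
    is_available: Callable[[], bool],
    list_running: Callable[[], Iterable[str]],
    timeout_s: float,
    **wait_kwargs,
) -> Set[str]:
    """Steps 3-4: return only the tracked IDs that are genuinely not running.

    `tracked` holds hypothetical container IDs the scheduler believes it owns.
    """
    tracked_ids = set(tracked)
    if not wait_for_daemon(is_available, timeout_s, **wait_kwargs):
        # Timeout expired: treat every tracked container as lost.
        return tracked_ids
    # Daemon is back: trust its view of what is still running.
    return tracked_ids - set(list_running())
```

The `clock` and `sleep` parameters are injectable only so the wait can be exercised without real delays; a container that survived the restart thanks to live-restore never appears in the returned set.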

Proposed Solution

Add a new configuration option for the Docker driver in Nomad, such as:

config {
  docker_live_restore_timeout = "5m"
}

This would allow Nomad to wait for the specified duration before deciding to create new allocations when it loses connection to the Docker daemon.
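
For illustration, a value such as "5m" would have to be turned into a number of seconds before it can be used as a deadline. Nomad is written in Go, where the standard library's `time.ParseDuration` accepts strings of this shape; the Python function below is only a simplified stand-in that handles `h`, `m`, `s`, and `ms`, and `parse_duration` is a hypothetical name. The option `docker_live_restore_timeout` itself is the proposal made in this issue, not an existing setting.

```python
import re

# Seconds per supported unit (a subset of what Go duration strings allow).
_UNIT_SECONDS = {"ms": 0.001, "s": 1.0, "m": 60.0, "h": 3600.0}
_TOKEN = re.compile(r"(\d+(?:\.\d+)?)(ms|s|m|h)")


def parse_duration(text: str) -> float:
    """Convert a string like "5m" or "1h30m" into seconds.

    Raises ValueError if any part of the string is not a number+unit pair.
    """
    pos = 0
    total = 0.0
    for match in _TOKEN.finditer(text):
        if match.start() != pos:
            raise ValueError(f"invalid duration: {text!r}")
        total += float(match.group(1)) * _UNIT_SECONDS[match.group(2)]
        pos = match.end()
    if pos == 0 or pos != len(text):
        raise ValueError(f"invalid duration: {text!r}")
    return total


timeout_s = parse_duration("5m")  # 300.0 seconds
```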

Additional Context

  • This feature would be particularly useful for environments where Docker daemon updates or restarts are necessary, such as for security patches or version upgrades
  • It would allow for more seamless operations and reduce unnecessary container churn

Possible Implementation

  1. Add a new configuration option to the Docker driver
  2. Modify the Docker driver's health check mechanism to be aware of this timeout
  3. Implement a reconciliation process that checks the actual state of containers with Docker after a daemon restart

Impact

This feature would improve Nomad's behavior in environments using Docker's live-restore, reducing unnecessary allocation churn and making Docker daemon maintenance less disruptive.

@Juanadelacuesta
Member

Hello @eduardolmedeiros, thanks for suggesting this. In the latest version, when the daemon goes down, Nomad is unable to determine whether the containers are running, so the allocations are classified as pending. Once the daemon comes back up, it reports the running containers to Nomad again and the agent picks them up. No new allocations are spawned. If you have a good example where the containers are redeployed, we would love to see it and learn whether something needs fixing. Feel free to reach out again if you keep running into problems; we are always looking to make Nomad better.
