[docker driver] - Add support for Docker's live-restore feature to prevent unnecessary allocation respawns. #24105

eduardolmedeiros · 2024-10-02T08:32:35Z

Description

When using Docker's live-restore feature, Nomad currently respawns containers (creates new allocations) unnecessarily during Docker daemon restarts. This behavior counteracts the benefits of using live-restore and might cause disruptions in environments where Docker daemon updates or restarts are necessary.

Current Behavior

Docker daemon is configured with live-restore enabled
When the Docker daemon restarts, containers continue running as expected
However, Nomad creates new allocations for these containers, effectively respawning them
This results in unnecessary disruption and resource usage

Expected Behavior

Nomad should be aware of Docker's live-restore feature
When the Docker daemon restarts, Nomad should:
1. Detect that the daemon is unavailable
2. Wait for a configurable timeout period
3. Once the daemon is available again, check the actual state of containers
4. Only create new allocations if the containers are genuinely not running

Proposed Solution

Add a new configuration option for the Docker driver in Nomad, such as:

config {
  docker_live_restore_timeout = "5m"
}

This would allow Nomad to wait for the specified duration before deciding to create new allocations when it loses connection to the Docker daemon.

Additional Context

This feature would be particularly useful for environments where Docker daemon updates or restarts are necessary, such as for security patches or version upgrades
It would allow for more seamless operations and reduce unnecessary container churn

Possible Implementation

Add a new configuration option to the Docker driver
Modify the Docker driver's health check mechanism to be aware of this timeout
Implement a reconciliation process that checks the actual state of containers with Docker after a daemon restart
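
The reconciliation step could look something like the sketch below: compare what Nomad believes it is running against what Docker reports after the restart, and flag only the difference. The function `lostAllocations` and the shape of its inputs are hypothetical and chosen for illustration; how Nomad actually tracks containers is not described in this issue.

```go
package main

import (
	"fmt"
	"sort"
)

// lostAllocations returns the allocation IDs whose container is absent from
// the set Docker reports as running. tracked maps allocation ID to container
// ID; running holds the container IDs found after the daemon came back.
func lostAllocations(tracked map[string]string, running map[string]bool) []string {
	var lost []string
	for allocID, containerID := range tracked {
		if !running[containerID] {
			lost = append(lost, allocID)
		}
	}
	sort.Strings(lost) // map iteration order is random, so sort for stable output
	return lost
}

func main() {
	tracked := map[string]string{"alloc-a": "c1", "alloc-b": "c2", "alloc-c": "c3"}
	running := map[string]bool{"c1": true, "c3": true}
	fmt.Println(lostAllocations(tracked, running)) // [alloc-b]
}
```

Only `alloc-b` would need a new allocation here; the other two containers kept running through the restart and can be picked up again.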

Impact

This feature would improve Nomad's behavior in environments using Docker's live-restore, reducing unnecessary allocation churn and making Docker daemon maintenance less disruptive.

Juanadelacuesta · 2024-10-22T17:31:18Z

Hello @eduardolmedeiros, thanks for suggesting this. In the latest version, when the daemon goes down, Nomad is unable to determine whether the containers are running, so the allocations are classified as pending. Once the daemon comes back up, it reports the running containers to Nomad again and the agent picks them up. No new allocations are spawned. If you have a good example where the containers are redeployed, we would love to see it and learn whether something needs fixing. Feel free to reach out again if you keep running into problems; we are always looking to make Nomad better.

eduardolmedeiros added the type/enhancement label Oct 2, 2024

tgross added this to Nomad - Community Issues Triage Oct 2, 2024

github-project-automation bot moved this to Needs Triage in Nomad - Community Issues Triage Oct 2, 2024

Juanadelacuesta self-assigned this Oct 22, 2024

Juanadelacuesta closed this as completed Oct 22, 2024

github-project-automation bot moved this from Needs Triage to Done in Nomad - Community Issues Triage Oct 22, 2024
