Regression in Nomad 1.9.0: Stdin might not be closed in Docker exec #24171

Closed
MrSerth opened this issue Oct 11, 2024 · 3 comments · Fixed by #24202

Comments

MrSerth commented Oct 11, 2024

Since updating to Nomad 1.9.0, I am observing weird behavior I cannot explain. I am filing this as a bug report, since I would classify it as a regression introduced with Nomad 1.9.0. The same code works flawlessly in Nomad 1.8.4 and earlier.

This issue is potentially caused by the switch to the native Docker library in #23966.

The issue is that, in some circumstances (with a relatively short input buffer), an allocation Exec doesn't close stdin, so the command within the allocation keeps waiting. The attached job file and Go code were extracted from our production setup. They demonstrate the erroneous behavior, but might still be larger than necessary; so far, I haven't managed to reduce the example further.

I tried many different steps to nail down the problem, but couldn't. Any help would be greatly appreciated!

Nomad version

Nomad v1.9.0
BuildDate 2024-10-10T07:13:43Z
Revision 7ad36851ec02f875e0814775ecf1df0229f0a615

Operating system and Environment details

Ubuntu 24.04.1 LTS and macOS 15.0.1

# Ubuntu
Docker version 27.3.1, build ce12230
# macOS
Docker version 27.2.0, build 3ab4256

Issue

When a command that waits for stdin is executed in an allocation, the stream might not be closed as expected. This prevents the command from completing successfully.
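
For context, an allocation exec with stdin attached is typically driven through the official Go api package roughly as sketched below. This is not the attached example.go but a minimal, illustrative sketch: the allocation lookup and the cat command are placeholders, and the Allocations().Exec parameter list is quoted from recent releases of github.com/hashicorp/nomad/api, so treat the details as an approximation.

package main

import (
	"bytes"
	"context"
	"log"
	"os"

	"github.com/hashicorp/nomad/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Placeholder lookup: any running allocation of the batch job would do.
	stubs, _, err := client.Allocations().List(nil)
	if err != nil || len(stubs) == 0 {
		log.Fatal("no allocation found: ", err)
	}
	alloc, _, err := client.Allocations().Info(stubs[0].ID, nil)
	if err != nil {
		log.Fatal(err)
	}

	// stdin is a finite buffer; once it is drained, the exec stream should be
	// closed so that the command inside the container sees EOF and can exit.
	stdin := bytes.NewBufferString("hello from stdin\n")
	exitCode, err := client.Allocations().Exec(
		context.Background(),
		alloc, "default-task", // task name from the job file below
		false,           // no TTY
		[]string{"cat"}, // placeholder command that reads stdin until EOF
		stdin, os.Stdout, os.Stderr,
		nil, // terminal size channel (unused without a TTY)
		nil, // query options
	)
	if err != nil {
		log.Fatal(err)
	}
	log.Println("exit code:", exitCode)
}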

Reproduction steps

Download the attached reproduction_example.zip and execute it with go run example.go.

Please note a few things:

  • The Nomad server / agent version is relevant, not the library version used by the Go script.
  • There are two "TODO" notes in the reproduction example:
    • The first one is at line 40, where the actual file size is controlled. A file with 8100 bytes will not be processed correctly, but a file with 8200 bytes will (a stdlib sketch of this archive-building step follows after this list).
    • The second one is at line 71, where the actual buffer size used for the Exec command is measured. This length is important and triggers the issue: a buffer size below 10,000 does not work, while a buffer size of 10,000 or more does.
  • The same code, without any changes and with either file / buffer size, works in Nomad 1.8.4 but no longer works in Nomad 1.9.0.
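
As referenced in the list above, the archive-building step can be sketched with the Go standard library alone. This is not the code from the attached example.go; buildTarArchive, the file name, and the size passed to it are purely illustrative, using the values quoted above.

package main

import (
	"archive/tar"
	"bytes"
	"fmt"
	"log"
)

// buildTarArchive is a hypothetical helper (not taken from example.go): it
// packs a single file of the given size into an in-memory tar archive so the
// payload size (cf. the first TODO) can be varied easily.
func buildTarArchive(fileSize int) *bytes.Buffer {
	var buf bytes.Buffer
	tw := tar.NewWriter(&buf)

	content := bytes.Repeat([]byte("A"), fileSize)
	hdr := &tar.Header{
		Name: "fixed_filename.txt",
		Mode: 0o644,
		Size: int64(len(content)),
	}
	if err := tw.WriteHeader(hdr); err != nil {
		log.Fatal(err)
	}
	if _, err := tw.Write(content); err != nil {
		log.Fatal(err)
	}
	if err := tw.Close(); err != nil {
		log.Fatal(err)
	}
	return &buf
}

func main() {
	// File size controlled per the first TODO; see the thresholds quoted above.
	archive := buildTarArchive(8100)
	fmt.Printf("Tar archive created with %d bytes\n", archive.Len())
	// The archive would then be passed as the stdin reader of the exec call.
}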

Expected Result

The reproduction example succeeds:

Tar archive created with 9728 bytes
fixed_filename.txt
Command executed with exit code: 0

Actual Result

The reproduction example hangs:

Tar archive created with 9728 bytes
# hang

Job file (if appropriate)

job "template-0" {
  datacenters = ["dc1"]
  type = "batch"

  group "default-group" {
    ephemeral_disk {
      migrate = false
      size    = 10
      sticky  = false
    }
    count = 1
    spread {
      attribute = "${node.unique.name}"
      weight = 100
    }
    restart {
      attempts = 3
      delay = "15s"
      interval = "1h"
      mode = "fail"
    }
    reschedule {
      unlimited = false
      attempts = 3
      interval = "6h"
      delay = "1m"
      max_delay = "4m"
      delay_function = "exponential"
    }

    task "default-task" {
      driver = "docker"
      kill_timeout = "0s"
      kill_signal = "SIGKILL"

      config {
        image = "openhpi/docker_exec_phusion"
        command = "sleep"
        args = ["infinity"]
        network_mode = "none"
      }

      logs {
        max_files     = 1
        max_file_size = 1
      }

      resources {
        cpu    = 40
        memory = 50
      }
    }
  }

  group "config" {
    count = 0
    task "config" {
      driver = "exec"
      config {
        command = "true"
      }
      logs {
        max_files     = 1
        max_file_size = 1
      }
      resources {
        cpu    = 1
        memory = 10
      }
    }
    meta {
      used = "false"
      prewarmingPoolSize = "0"
    }
  }
}

Nomad Server logs (if appropriate)

    2024-10-11T16:20:21.488+0200 [DEBUG] http: request complete: method=GET path=/v1/allocations duration="238.666µs"
    2024-10-11T16:20:21.490+0200 [DEBUG] http: request complete: method=GET path=/v1/allocation/e358ab13-8452-c952-93cc-70bfec08c553 duration="304.375µs"
    2024-10-11T16:20:21.492+0200 [DEBUG] http: request complete: method=GET path=/v1/node/93f25c78-a983-5af2-085a-4f25d53e9a52 duration="264.416µs"

Nomad Client logs (if appropriate)

    2024-10-11T16:20:21.493+0200 [INFO]  client: task exec session starting: exec_id=6523a6e3-01cc-ff4c-1173-5c740a1d38bf alloc_id=e358ab13-8452-c952-93cc-70bfec08c553 task=default-task command=["tar", "--extract", "--verbose", "--file=/dev/stdin"] tty=false action=""

MrSerth commented Oct 11, 2024

I just added the reproduction example as a small repository on GitHub: https://github.com/MrSerth/nomad-exec

There, I also included the reproduction example with a GitHub Actions pipeline, clearly showing the impact of the Nomad version: https://github.com/MrSerth/nomad-exec/actions/runs/11294633497


shoenig commented Oct 17, 2024

Thanks for the detailed report @MrSerth! In #24202, I was able to create a fix and even turn your reproduction example into an e2e test to help make sure something like this doesn't break again.


MrSerth commented Oct 19, 2024

Awesome, thanks @shoenig for your work on this issue and the e2e test. I can confirm that your changes solve the reported issue. Furthermore, our test suite now completes successfully with a Nomad binary built from the current main branch. Looking forward to the next release of Nomad! 👍
