Regression in Nomad 1.9.0: Stdin might not be closed in Docker exec #24171

Closed
MrSerth opened this issue Oct 11, 2024 · 3 comments · Fixed by #24202

Comments

MrSerth commented Oct 11, 2024

Since updating to Nomad 1.9.0, I am observing weird behavior I cannot explain. I am filing this as a bug report, since I would classify it as a regression introduced with Nomad 1.9.0. The same code works flawlessly in Nomad 1.8.4 and earlier.

This issue is potentially caused by the switch to the native Docker library in #23966.

The issue is that, in some circumstances (with a relatively short input buffer), an allocation Exec doesn't close stdin, so the command within the allocation keeps waiting. The attached job file and Go code were extracted from our production setup. They demonstrate the erroneous behavior, but might still be larger than necessary; so far, I haven't managed to reduce the example further.

I tried many different steps to nail down the problem, but couldn't. Any help would be greatly appreciated!

Nomad version

Nomad v1.9.0
BuildDate 2024-10-10T07:13:43Z
Revision 7ad36851ec02f875e0814775ecf1df0229f0a615

Operating system and Environment details

Ubuntu 24.04.1 LTS and macOS 15.0.1

# Ubuntu
Docker version 27.3.1, build ce12230
# macOS
Docker version 27.2.0, build 3ab4256

Issue

When a command that waits for stdin is executed in an allocation, the stream might not be closed as expected. This prevents the command from completing successfully.
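
For context, an allocation exec with stdin attached is typically driven through the official Go api package roughly as sketched below. This is not the attached example.go but a minimal, illustrative sketch: the allocation lookup and the cat command are placeholders, and the Allocations().Exec parameter list is quoted from recent releases of github.com/hashicorp/nomad/api, so treat the details as an approximation.

package main

import (
	"bytes"
	"context"
	"log"
	"os"

	"github.com/hashicorp/nomad/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Placeholder lookup: any running allocation of the batch job would do.
	stubs, _, err := client.Allocations().List(nil)
	if err != nil || len(stubs) == 0 {
		log.Fatal("no allocation found: ", err)
	}
	alloc, _, err := client.Allocations().Info(stubs[0].ID, nil)
	if err != nil {
		log.Fatal(err)
	}

	// stdin is a finite buffer; once it is drained, the exec stream should be
	// closed so that the command inside the container sees EOF and can exit.
	stdin := bytes.NewBufferString("hello from stdin\n")
	exitCode, err := client.Allocations().Exec(
		context.Background(),
		alloc, "default-task", // task name from the job file below
		false,           // no TTY
		[]string{"cat"}, // placeholder command that reads stdin until EOF
		stdin, os.Stdout, os.Stderr,
		nil, // terminal size channel (unused without a TTY)
		nil, // query options
	)
	if err != nil {
		log.Fatal(err)
	}
	log.Println("exit code:", exitCode)
}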

Reproduction steps

Download the attached reproduction_example.zip and execute it with go run example.go.

Please note a few things:

  • The Nomad server / agent version is relevant, not the library version used by the Go script.
  • There are two "TODO" notes in the reproduction example:
    • The first one is at line 40, where the actual file size is controlled. A file with 8100 bytes will not be processed correctly, but a file with 8200 bytes will (a stdlib sketch of this archive-building step follows after this list).
    • The second one is at line 71, where the actual buffer size used for the Exec command is measured. This length is important and triggers the issue: a buffer size below 10,000 does not work, while a buffer size of 10,000 or more does.
  • The same code, without any changes and with either file / buffer size, works in Nomad 1.8.4 but no longer works in Nomad 1.9.0.
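
As referenced in the list above, the archive-building step can be sketched with the Go standard library alone. This is not the code from the attached example.go; buildTarArchive, the file name, and the size passed to it are purely illustrative, using the values quoted above.

package main

import (
	"archive/tar"
	"bytes"
	"fmt"
	"log"
)

// buildTarArchive is a hypothetical helper (not taken from example.go): it
// packs a single file of the given size into an in-memory tar archive so the
// payload size (cf. the first TODO) can be varied easily.
func buildTarArchive(fileSize int) *bytes.Buffer {
	var buf bytes.Buffer
	tw := tar.NewWriter(&buf)

	content := bytes.Repeat([]byte("A"), fileSize)
	hdr := &tar.Header{
		Name: "fixed_filename.txt",
		Mode: 0o644,
		Size: int64(len(content)),
	}
	if err := tw.WriteHeader(hdr); err != nil {
		log.Fatal(err)
	}
	if _, err := tw.Write(content); err != nil {
		log.Fatal(err)
	}
	if err := tw.Close(); err != nil {
		log.Fatal(err)
	}
	return &buf
}

func main() {
	// File size controlled per the first TODO; see the thresholds quoted above.
	archive := buildTarArchive(8100)
	fmt.Printf("Tar archive created with %d bytes\n", archive.Len())
	// The archive would then be passed as the stdin reader of the exec call.
}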

Expected Result

The reproduction example succeeds:

Tar archive created with 9728 bytes
fixed_filename.txt
Command executed with exit code: 0

Actual Result

The reproduction example hangs:

Tar archive created with 9728 bytes
# hang

Job file (if appropriate)

job "template-0" {
  datacenters = ["dc1"]
  type = "batch"

  group "default-group" {
    ephemeral_disk {
      migrate = false
      size    = 10
      sticky  = false
    }
    count = 1
    spread {
      attribute = "${node.unique.name}"
      weight = 100
    }
    restart {
      attempts = 3
      delay = "15s"
      interval = "1h"
      mode = "fail"
    }
    reschedule {
      unlimited = false
      attempts = 3
      interval = "6h"
      delay = "1m"
      max_delay = "4m"
      delay_function = "exponential"
    }

    task "default-task" {
      driver = "docker"
      kill_timeout = "0s"
      kill_signal = "SIGKILL"

      config {
        image = "openhpi/docker_exec_phusion"
        command = "sleep"
        args = ["infinity"]
        network_mode = "none"
      }

      logs {
        max_files     = 1
        max_file_size = 1
      }

      resources {
        cpu    = 40
        memory = 50
      }
    }
  }

  group "config" {
    count = 0
    task "config" {
      driver = "exec"
      config {
        command = "true"
      }
      logs {
        max_files     = 1
        max_file_size = 1
      }
      resources {
        cpu    = 1
        memory = 10
      }
    }
    meta {
      used = "false"
      prewarmingPoolSize = "0"
    }
  }
}

Nomad Server logs (if appropriate)

    2024-10-11T16:20:21.488+0200 [DEBUG] http: request complete: method=GET path=/v1/allocations duration="238.666µs"
    2024-10-11T16:20:21.490+0200 [DEBUG] http: request complete: method=GET path=/v1/allocation/e358ab13-8452-c952-93cc-70bfec08c553 duration="304.375µs"
    2024-10-11T16:20:21.492+0200 [DEBUG] http: request complete: method=GET path=/v1/node/93f25c78-a983-5af2-085a-4f25d53e9a52 duration="264.416µs"

Nomad Client logs (if appropriate)

    2024-10-11T16:20:21.493+0200 [INFO]  client: task exec session starting: exec_id=6523a6e3-01cc-ff4c-1173-5c740a1d38bf alloc_id=e358ab13-8452-c952-93cc-70bfec08c553 task=default-task command=["tar", "--extract", "--verbose", "--file=/dev/stdin"] tty=false action=""

MrSerth commented Oct 11, 2024

I just added the reproduction example as a small repository on GitHub: https://github.com/MrSerth/nomad-exec

There, I also included the reproduction example with a GitHub Actions pipeline, clearly showing the impact of the Nomad version: https://github.com/MrSerth/nomad-exec/actions/runs/11294633497


shoenig commented Oct 17, 2024

Thanks for the detailed report @MrSerth! In #24202, I was able to create a fix and even turn your reproduction example into an e2e test to help make sure something like this doesn't break again.


MrSerth commented Oct 19, 2024

Awesome, thanks @shoenig for your work on this issue and the e2e test. I can confirm that your changes solve the reported issue. Furthermore, our test suite now completes successfully with a Nomad binary built from the current main branch. Looking forward to the next release of Nomad! 👍
