Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Nomad 1.7.6 does not pick up Docker DNS settings #20174

Closed
Jess3Jane opened this issue Mar 20, 2024 · 7 comments · Fixed by #20189
Closed

Nomad 1.7.6 does not pick up Docker DNS settings #20174

Jess3Jane opened this issue Mar 20, 2024 · 7 comments · Fixed by #20189
Assignees
Labels
stage/accepted Confirmed, and intend to work on. No timeline committment though. theme/cni theme/networking type/bug
Milestone

Comments

@Jess3Jane
Copy link

Nomad version

Version in which this functionality is broken:

root@client-1:/# nomad version
Nomad v1.7.6
BuildDate 2024-03-12T07:27:36Z
Revision 594fedbfbc4f0e532b65e8a69b28ff9403eb822e

Version in which this functionality is working (I tested down to 1.7.2 and they all behave the same):

root@client-1:/# nomad version
Nomad v1.7.5
BuildDate 2024-02-13T15:10:13Z
Revision 5f5d4646198d09b8f4f6cb90fb5d50b53fa328b8

Operating system and Environment details

root@client-1:/# uname -a
Linux client-1 5.15.0-67-generic #74-Ubuntu SMP Wed Feb 22 14:14:39 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
root@client-1:/# lsb_release -a
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 22.04.2 LTS
Release:	22.04
Codename:	jammy

This is a completely fresh digital ocean node (I will note all of the changes I made below). Of note we have also hit this issue on systems of varying other configurations (all Ubuntu 22.04 x64).

Issue

Docker allows you to configure a set of DNS servers to give to each container. In Nomad 1.7.5 if you did not set the DNS settings for a job using the docker driver the containers in that job would have this DNS configuration. In Nomad 1.7.6 the containers instead have the host's default DNS configuration which is often undesirable.

Reproduction steps

  1. Spin up a fresh Ubuntu 22.04 server (I used a Digital Ocean droplet for our reproduction but we've noticed this happening across our fleet so I don't think they're doing anything weird).
  2. Install docker-ce as per their docs (I used Docker's apt registry to install it).
  3. Install Nomad as per the docs (for the reproduction I specifically used the version of Nomad from Hashicorps repos).
  4. Install the base CNI plugins by placing the contents of https://github.com/containernetworking/plugins/releases/download/v1.0.0/cni-plugins-linux-amd64-v1.0.0.tgz into /opt/cni/bin
  5. Use a dropin file to configure DNS servers for docker. Specifically, I placed the following file at /etc/systemd/system/docker.service.d/overrides.conf:
[Service]
ExecStart=
ExecStart=/usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock --dns=8.8.8.8
  1. systemctl daemon-reload
  2. systemctl start docker
  3. systemctl start nomad
  4. nomad job run <job-file.hcl>

Expected Result

If we exec into our task on Nomad 1.7.5 we can observe the correct value of resolv.conf

root@client-1:/# nomad alloc exec -i -t -task test d4de6479 cat /etc/resolv.conf
# Generated by Docker Engine.
# This file can be edited; Docker Engine will not make further changes once it
# has been modified.

nameserver 8.8.8.8
search .

# Based on host file: '/run/systemd/resolve/resolv.conf' (legacy)
# Overrides: [nameservers]

Compare this to the result of running a container with docker directly to see that they match:

root@client-1:/# docker run -it busybox cat /etc/resolv.conf
# Generated by Docker Engine.
# This file can be edited; Docker Engine will not make further changes once it
# has been modified.

nameserver 8.8.8.8
search .

# Based on host file: '/run/systemd/resolve/resolv.conf' (legacy)
# Overrides: [nameservers]

Actual Result

If you do exactly the same with Nomad 1.7.6 you instead will find that the job has a different resolv.conf:

root@client-1:/etc/systemd/system# nomad alloc exec -i -t -task test f39dae59 cat /etc/resolv.conf
# This is /run/systemd/resolve/resolv.conf managed by man:systemd-resolved(8).
# Do not edit.
#
# This file might be symlinked as /etc/resolv.conf. If you're looking at
# /etc/resolv.conf and seeing this text, you have followed the symlink.
#
# This is a dynamic resolv.conf file for connecting local clients directly to
# all known uplink DNS servers. This file lists all configured search domains.
#
# Third party programs should typically not access this file directly, but only
# through the symlink at /etc/resolv.conf. To manage man:resolv.conf(5) in a
# different way, replace this symlink by a static file or a different symlink.
#
# See man:systemd-resolved.service(8) for details about the supported modes of
# operation for /etc/resolv.conf.

nameserver 67.207.67.2
nameserver 67.207.67.3
nameserver 67.207.67.2
# Too many DNS servers configured, the following entries may be ignored.
nameserver 67.207.67.3
nameserver 67.207.67.2
nameserver 67.207.67.3
search .

The exact values will likely differ on your system but we can confirm that this is the contents of /run/systemd/resolv/resolv.conf:

root@client-1:/# cat /run/systemd/resolve/resolv.conf
# This is /run/systemd/resolve/resolv.conf managed by man:systemd-resolved(8).
# Do not edit.
#
# This file might be symlinked as /etc/resolv.conf. If you're looking at
# /etc/resolv.conf and seeing this text, you have followed the symlink.
#
# This is a dynamic resolv.conf file for connecting local clients directly to
# all known uplink DNS servers. This file lists all configured search domains.
#
# Third party programs should typically not access this file directly, but only
# through the symlink at /etc/resolv.conf. To manage man:resolv.conf(5) in a
# different way, replace this symlink by a static file or a different symlink.
#
# See man:systemd-resolved.service(8) for details about the supported modes of
# operation for /etc/resolv.conf.

nameserver 67.207.67.2
nameserver 67.207.67.3
nameserver 67.207.67.2
# Too many DNS servers configured, the following entries may be ignored.
nameserver 67.207.67.3
nameserver 67.207.67.2
nameserver 67.207.67.3
search .

Job file (if appropriate)

This happens with all jobs that don't configure any dns settings but the specific job I've used for testing is:

job "jess-test-job" {
	type = "service"
	datacenters = ["*"]
	group "test" {
		network {
			mode = "bridge"
		}
		task "test" {
			driver = "docker"
			config {
				args = ["sleep", "infinity"]
				image = "busybox"
			}
		}
	}
}
@apollo13
Copy link
Contributor

@tgross This awfully feels like 45b2c34 ?

@lgfa29 lgfa29 added theme/networking theme/cni stage/accepted Confirmed, and intend to work on. No timeline committment though. labels Mar 21, 2024
@lgfa29
Copy link
Contributor

lgfa29 commented Mar 21, 2024

Thank you for the detailed report @Jess3Jane and thank you @apollo13 for the git log spelunking. I was able to verify that reverting that commit does fix the problem.

I suspect we need to guard the DNS config override to explicit cni/ networks, so the bridge network is not affected, but I'm not sure if this would also revert the intended fix.

// set DNS from any CNI plugins
netStatus := tr.allocHookResources.GetAllocNetworkStatus()
if netStatus != nil && netStatus.DNS != nil {
dns = &drivers.DNSConfig{
Servers: netStatus.DNS.Servers,
Searches: netStatus.DNS.Searches,
Options: netStatus.DNS.Options,
}
}

I've placed the issue for further roadmapping.

@apollo13
Copy link
Contributor

apollo13 commented Mar 22, 2024 via email

@tgross
Copy link
Member

tgross commented Mar 22, 2024

Hi folks! This is certainly a regression caused by the bug fix I did in #20007. I had a comment in that PR:

there are three potential sources for DNS configuration here: the output of CNI plugins, the user-provided DNS config, and whatever is on the client

It turns out there are four! At first glance prioritizing the four different sources would be challenging because of the work I'm working on for transparent proxy (as @apollo13 noted). But a quick check of spew.Dump shows that the DNS value we get back from mode="bridge" isn't nil, but it is empty. So that's likely a simple bug at task_runner.go#L1131 where that value is never nil, and we should be checking whether it actually has any values.

I'll do some testing of this and likely get a PR up later today if that proves to be the correct hypothesis. Thanks again @Jess3Jane, @apollo13, and @lgfa29!

Edit: bah that's two different structs, as we convert from a non-pointer CNI DNS result to a Nomad-internal struct pointer, so that's not the problem exactly but I'm sure it's something along those lines. Still investigating.

@tgross tgross self-assigned this Mar 22, 2024
@tgross tgross added this to the 1.7.x milestone Mar 22, 2024
@tgross
Copy link
Member

tgross commented Mar 22, 2024

Ok, I've got it. The cni.Result returns a slice of DNS entries, and if there's no DNS it returns a single DNS struct that is empty (because it's not a pointer). So there's a bug in the original code in networking_cni.go around how we handle the fallback in the no-DNS case, but that bug was harmless until #20007 because we always through the DNS value away anyways 🤦

Fix should be easy, just working up some unit tests to make sure the behavior is properly exercised as well.

tgross added a commit that referenced this issue Mar 22, 2024
In #20007 we fixed a bug where the DNS configuration set by CNI plugins was not
threaded through to the task configuration. This resulted in a regression where
a DNS override set by `dockerd` was not respected for `bridge` mode
networking. Our existing handling of CNI DNS incorrectly assumed that the DNS
field would be empty, when in fact it contains a single empty DNS struct.

Handle this case correctly by checking whether the DNS struct we get back from
CNI has any nameservers, and ignore it if it doesn't. Expand test coverage of
this case.

Fixes: #20174
@tgross
Copy link
Member

tgross commented Mar 22, 2024

Fixed in #20189

tgross added a commit that referenced this issue Mar 22, 2024
In #20007 we fixed a bug where the DNS configuration set by CNI plugins was not
threaded through to the task configuration. This resulted in a regression where
a DNS override set by `dockerd` was not respected for `bridge` mode
networking. Our existing handling of CNI DNS incorrectly assumed that the DNS
field would be empty, when in fact it contains a single empty DNS struct.

Handle this case correctly by checking whether the DNS struct we get back from
CNI has any nameservers, and ignore it if it doesn't. Expand test coverage of
this case.

Fixes: #20174
tgross added a commit that referenced this issue Mar 22, 2024
In #20007 we fixed a bug where the DNS configuration set by CNI plugins was not
threaded through to the task configuration. This resulted in a regression where
a DNS override set by `dockerd` was not respected for `bridge` mode
networking. Our existing handling of CNI DNS incorrectly assumed that the DNS
field would be empty, when in fact it contains a single empty DNS struct.

Handle this case correctly by checking whether the DNS struct we get back from
CNI has any nameservers, and ignore it if it doesn't. Expand test coverage of
this case.

Fixes: #20174
tgross added a commit that referenced this issue Mar 22, 2024
In #20007 we fixed a bug where the DNS configuration set by CNI plugins was not
threaded through to the task configuration. This resulted in a regression where
a DNS override set by `dockerd` was not respected for `bridge` mode
networking. Our existing handling of CNI DNS incorrectly assumed that the DNS
field would be empty, when in fact it contains a single empty DNS struct.

Handle this case correctly by checking whether the DNS struct we get back from
CNI has any nameservers, and ignore it if it doesn't. Expand test coverage of
this case.

Fixes: #20174
@tgross
Copy link
Member

tgross commented Mar 22, 2024

This fix will get shipped in the next release of Nomad, as well as backported to 1.6.x and 1.5.x

tgross added a commit that referenced this issue Mar 22, 2024
…rd into release/1.6.x (#20192)

In #20007 we fixed a bug where the DNS configuration set by CNI plugins was not
threaded through to the task configuration. This resulted in a regression where
a DNS override set by `dockerd` was not respected for `bridge` mode
networking. Our existing handling of CNI DNS incorrectly assumed that the DNS
field would be empty, when in fact it contains a single empty DNS struct.

Handle this case correctly by checking whether the DNS struct we get back from
CNI has any nameservers, and ignore it if it doesn't. Expand test coverage of
this case.

Fixes: #20174

Co-authored-by: Tim Gross <tgross@hashicorp.com>
tgross added a commit that referenced this issue Mar 22, 2024
…rd into release/1.5.x (#20191)

In #20007 we fixed a bug where the DNS configuration set by CNI plugins was not
threaded through to the task configuration. This resulted in a regression where
a DNS override set by `dockerd` was not respected for `bridge` mode
networking. Our existing handling of CNI DNS incorrectly assumed that the DNS
field would be empty, when in fact it contains a single empty DNS struct.

Handle this case correctly by checking whether the DNS struct we get back from
CNI has any nameservers, and ignore it if it doesn't. Expand test coverage of
this case.

Fixes: #20174

Co-authored-by: Tim Gross <tgross@hashicorp.com>
philrenaud pushed a commit that referenced this issue Apr 18, 2024
In #20007 we fixed a bug where the DNS configuration set by CNI plugins was not
threaded through to the task configuration. This resulted in a regression where
a DNS override set by `dockerd` was not respected for `bridge` mode
networking. Our existing handling of CNI DNS incorrectly assumed that the DNS
field would be empty, when in fact it contains a single empty DNS struct.

Handle this case correctly by checking whether the DNS struct we get back from
CNI has any nameservers, and ignore it if it doesn't. Expand test coverage of
this case.

Fixes: #20174
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
stage/accepted Confirmed, and intend to work on. No timeline committment though. theme/cni theme/networking type/bug
Projects
Development

Successfully merging a pull request may close this issue.

4 participants