dockerfile: run buildkitd within a cgroup namespace for cgroup v2 #6368
base: master
Conversation
@tonistiigi this is the alternative approach I mentioned in #6343 (comment). Note that I first tried to implement the ns creation and remounting in buildkitd using calls to ... In any case, using ...
Force-pushed from e12f7af to 3cf93c3.
Introduce a new entrypoint script for the Linux image that, if cgroup v2
is in use, creates a new cgroup and mount namespace for buildkitd using
`unshare` and remounts `/sys/fs/cgroup` to restrict its view of the
unified cgroup hierarchy. This ensures its `init` cgroup and all OCI
worker-managed cgroups are kept beneath the root cgroup of the initial
entrypoint process.
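In rough terms, the entrypoint amounts to something like the following sketch (the `BUILDKIT_SETUP_CGROUPV2_ROOT` opt-in comes from the Dockerfile change further down in the thread; the re-exec guard variable and exact mount options here are illustrative rather than the script's actual contents):

```sh
#!/bin/sh
# Sketch only: re-exec inside new cgroup + mount namespaces, then remount
# /sys/fs/cgroup so buildkitd sees the container's cgroup as the root.
set -eu

if [ "${BUILDKIT_SETUP_CGROUPV2_ROOT:-}" = "1" ] \
	&& [ -f /sys/fs/cgroup/cgroup.controllers ] \
	&& [ -z "${_CGROUPV2_ROOT_DONE:-}" ]; then
	# cgroup v2 is mounted and we have not re-executed yet.
	export _CGROUPV2_ROOT_DONE=1
	exec unshare --cgroup --mount "$0" "$@"
fi

if [ "${_CGROUPV2_ROOT_DONE:-}" = "1" ]; then
	# Inside the new namespaces: replace the inherited cgroup2 mount so the
	# visible hierarchy is rooted at this container's own cgroup.
	umount /sys/fs/cgroup
	mount -t cgroup2 cgroup2 /sys/fs/cgroup
fi

exec buildkitd "$@"
```

Because `unshare --cgroup` roots the new namespace at whatever cgroup the calling process currently occupies, the fresh cgroup2 mount exposes only the container's subtree, and buildkitd then creates its `init` cgroup beneath that root.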
When buildkitd is run in a managed environment like Kubernetes without
its own cgroup namespace (the default behavior of privileged pods in
Kubernetes where cgroup v2 is in use; see the [cgroup v2 KEP][kep]), the
OCI worker will spawn processes in cgroups outside of the cgroup
hierarchy that was created for the buildkitd container, leading to
incorrect resource accounting and enforcement, which in turn can cause
OOM errors and CPU contention on the node.
Example behavior without this change:
```console
root@k8s-node:/# cat /proc/$(pgrep -n buildkitd)/cgroup
0::/init
root@k8s-node:/# cat /proc/$(pgrep -n some-build-process)/cgroup
0::/buildkit/{runc-container-id}
```
Example behavior with this change:
```console
root@k8s-node:/# cat /proc/$(pgrep -n buildkitd)/cgroup
0::/kubepods/burstable/pod{pod-id}/{container-id}/init
root@k8s-node:/# cat /proc/$(pgrep -n some-build-process)/cgroup
0::/kubepods/burstable/pod{pod-id}/{container-id}/buildkit/{runc-container-id}
```
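An additional, generic way to confirm the separation (not part of this change; the namespace inode numbers below are illustrative) is to compare the cgroup namespace IDs of the node's init process and buildkitd, which differ only once the entrypoint has unshared the namespace:
```console
root@k8s-node:/# readlink /proc/1/ns/cgroup
cgroup:[4026531835]
root@k8s-node:/# readlink /proc/$(pgrep -n buildkitd)/ns/cgroup
cgroup:[4026532748]
```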
Note this was developed as an alternative approach to moby#6343
[kep]: https://github.com/kubernetes/enhancements/tree/6d3210f7dd5d547c8f7f6a33af6a09eb45193cd7/keps/sig-node/2254-cgroup-v2#cgroup-namespace
Signed-off-by: Dan Duvall <dduvall@wikimedia.org>
Force-pushed from 3cf93c3 to 7a50ed7.
In the long term, can we just extend Kubernetes to support unsharing cgroupns?
That would be ideal if Kubernetes had a field in ...

FWIW, we've been using a custom entrypoint based on this PR for a couple of weeks. No issues so far.
```diff
 EOF
+ENV BUILDKIT_SETUP_CGROUPV2_ROOT=1
-ENTRYPOINT ["buildkitd"]
+ENTRYPOINT ["/usr/bin/buildkitd-entrypoint"]
```
In the case of `docker run --privileged`, this script does not seem needed, as Docker unshares cgroupns even for privileged mode.
So this entrypoint script should be opt-in.
It should also be marked as a workaround until Kubernetes supports unsharing cgroupns.
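For comparison, on a cgroup v2 host running Docker with its default cgroupns mode, even a privileged container already sees only its own cgroup subtree (expected output shown, assuming a recent Docker release):
```console
$ docker run --rm --privileged alpine cat /proc/self/cgroup
0::/
```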