
Conversation

@marxarelli (Contributor)

Introduce a new entrypoint script for the Linux image that, if cgroup v2 is in use, creates new cgroup and mount namespaces for buildkitd using `unshare` and remounts `/sys/fs/cgroup` to restrict its view of the unified cgroup hierarchy. This ensures that its `init` cgroup and all OCI-worker-managed cgroups are kept beneath the root cgroup of the initial entrypoint process.
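A minimal sketch of how such an entrypoint could work, assuming util-linux `unshare` is available. The function names, the env-var guard, and the exact mount sequence below are illustrative assumptions, not this PR's actual script:

```shell
#!/bin/sh
# Illustrative sketch only: function names, env guard, and mount sequence
# are assumptions, not the exact script introduced by this PR.
set -eu

# True when the filesystem mounted at $1 (default /sys/fs/cgroup) is cgroup v2.
is_cgroup_v2() {
  [ "$(stat -f -c %T "${1:-/sys/fs/cgroup}")" = "cgroup2fs" ]
}

# Re-exec a command under fresh cgroup + mount namespaces, remounting
# /sys/fs/cgroup so the process's current cgroup becomes the visible root.
run_under_cgroup_root() {
  exec unshare --cgroup --mount sh -eu -c '
    mount --make-rslave /sys/fs/cgroup   # keep the umount from propagating out
    umount -l /sys/fs/cgroup             # drop the host-wide view
    mount -t cgroup2 none /sys/fs/cgroup # remount: namespace root only
    exec "$@"
  ' sh "$@"
}

# A real entrypoint would then dispatch along these lines:
#   if [ "${BUILDKIT_SETUP_CGROUPV2_ROOT:-}" = "1" ] && is_cgroup_v2; then
#     run_under_cgroup_root buildkitd "$@"
#   fi
#   exec buildkitd "$@"
```

The re-exec under `unshare` keeps the namespace setup out of the daemon itself, which sidesteps the Go-runtime threading issues discussed later in this thread.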

When buildkitd runs in a managed environment like Kubernetes without its own cgroup namespace (the default for privileged pods in Kubernetes where cgroup v2 is in use; see the [cgroup v2 KEP](https://github.com/kubernetes/enhancements/tree/6d3210f7dd5d547c8f7f6a33af6a09eb45193cd7/keps/sig-node/2254-cgroup-v2#cgroup-namespace)), the OCI worker spawns processes in cgroups outside the hierarchy created for the buildkitd container. This leads to incorrect resource accounting and enforcement, which in turn can cause OOM errors and CPU contention on the node.

Example behavior without this change:

```console
root@k8s-node:/# cat /proc/$(pgrep -n buildkitd)/cgroup
0::/init
root@k8s-node:/# cat /proc/$(pgrep -n some-build-process)/cgroup
0::/buildkit/{runc-container-id}
```

Example behavior with this change:

```console
root@k8s-node:/# cat /proc/$(pgrep -n buildkitd)/cgroup
0::/kubepods/burstable/pod{pod-id}/{container-id}/init
root@k8s-node:/# cat /proc/$(pgrep -n some-build-process)/cgroup
0::/kubepods/burstable/pod{pod-id}/{container-id}/buildkit/{runc-container-id}
```
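The `0::<path>` lines above are the cgroup v2 entries of `/proc/<pid>/cgroup`. A tiny helper to extract the path (purely illustrative, not part of this PR):

```shell
#!/bin/sh
# Extract the cgroup v2 path from /proc/<pid>/cgroup content on stdin.
# On a unified (v2-only) host there is a single entry of the form "0::<path>".
cgroup_path() {
  sed -n 's/^0:://p'
}

# With the entrypoint change, the extracted path stays beneath the
# kubepods hierarchy instead of starting at a bare "/init".
printf '0::/kubepods/burstable/pod123/abc/init\n' | cgroup_path
# → /kubepods/burstable/pod123/abc/init
```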

Note this was developed as an alternative approach to #6343

@marxarelli (Contributor, Author)

@tonistiigi this is the alternative approach I mentioned in #6343 (comment).

Note that I first tried to implement the namespace creation and remounting in buildkitd itself using calls to `unix.Unshare` and `unix.Mount`, but encountered some strange behavior: the main buildkitd process was placed in a new cgroup namespace but, for some reason, `buildkit-runc` was not. It may be that not all Go runtime threads were moved into the new namespace (`unshare(2)` affects only the calling OS thread, while the Go scheduler moves goroutines between threads); I'm not sure.

In any case, using unshare in the entrypoint seems less error prone.

@marxarelli marxarelli force-pushed the review/unshare-cgroupns-entrypoint branch from e12f7af to 3cf93c3 Compare November 17, 2025 20:40
@marxarelli marxarelli force-pushed the review/unshare-cgroupns-entrypoint branch from 3cf93c3 to 7a50ed7 Compare November 17, 2025 20:57
@AkihiroSuda (Member)

> When buildkitd is run in a managed environment like Kubernetes without its own cgroup namespace (the default behavior of privileged pods in Kubernetes where cgroup v2 is in use; see cgroup v2 KEP), the OCI worker will spawn processes in cgroups that are outside of the cgroup hierarchy that was created for the buildkitd container, leading to incorrect resource accounting and enforcement which in turn can cause OOM errors and CPU contention on the node.

In the long term can we just extend Kubernetes to support unsharing cgroupns?

@marxarelli (Contributor, Author)

> In the long term can we just extend Kubernetes to support unsharing cgroupns?

That would be ideal if Kubernetes had a field in SecurityContext for controlling that.

FWIW we've been using a custom entrypoint based on this PR for a couple of weeks. No issues so far.

```diff
 EOF
 ENV BUILDKIT_SETUP_CGROUPV2_ROOT=1
-ENTRYPOINT ["buildkitd"]
+ENTRYPOINT ["/usr/bin/buildkitd-entrypoint"]
```
Member (review comment on the diff above):
In the case of `docker run --privileged` this script does not seem needed, as Docker unshares cgroupns even in privileged mode.
So this entrypoint script should be opt-in.
It should also be marked as a workaround until Kubernetes supports unsharing cgroupns.
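Assuming the opt-in ends up as the `BUILDKIT_SETUP_CGROUPV2_ROOT` environment variable shown in the diff above, enabling the workaround in a Kubernetes pod spec might look like this (hypothetical manifest fragment; the container name and image tag are assumptions):

```yaml
# Hypothetical fragment: opts a privileged buildkitd pod into the
# entrypoint's cgroup v2 namespace setup.
containers:
  - name: buildkitd
    image: moby/buildkit:latest
    securityContext:
      privileged: true
    env:
      - name: BUILDKIT_SETUP_CGROUPV2_ROOT
        value: "1"
```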

@AkihiroSuda left a review comment.