
Conversation

@marxarelli (Contributor)

Introduce a new entrypoint script for the Linux image that, if cgroup v2 is in use, creates new cgroup and mount namespaces for buildkitd using `unshare` and remounts `/sys/fs/cgroup` to restrict its view of the unified cgroup hierarchy. This ensures that its `init` cgroup and all OCI-worker-managed cgroups are kept beneath the root cgroup of the initial entrypoint process.
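A minimal sketch of how such an entrypoint could work, assuming util-linux `unshare` is available. The function names, the env-var guard, and the exact mount sequence below are illustrative assumptions, not this PR's actual script:

```shell
#!/bin/sh
# Illustrative sketch only: function names, env guard, and mount sequence
# are assumptions, not the exact script introduced by this PR.
set -eu

# True when the filesystem mounted at $1 (default /sys/fs/cgroup) is cgroup v2.
is_cgroup_v2() {
  [ "$(stat -f -c %T "${1:-/sys/fs/cgroup}")" = "cgroup2fs" ]
}

# Re-exec a command under fresh cgroup + mount namespaces, remounting
# /sys/fs/cgroup so the process's current cgroup becomes the visible root.
run_under_cgroup_root() {
  exec unshare --cgroup --mount sh -eu -c '
    mount --make-rslave /sys/fs/cgroup   # keep the umount from propagating out
    umount -l /sys/fs/cgroup             # drop the host-wide view
    mount -t cgroup2 none /sys/fs/cgroup # remount: namespace root only
    exec "$@"
  ' sh "$@"
}

# A real entrypoint would then dispatch along these lines:
#   if [ "${BUILDKIT_SETUP_CGROUPV2_ROOT:-}" = "1" ] && is_cgroup_v2; then
#     run_under_cgroup_root buildkitd "$@"
#   fi
#   exec buildkitd "$@"
```

The re-exec under `unshare` keeps the namespace setup out of the daemon itself, which sidesteps the Go-runtime threading issues discussed later in this thread.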

When buildkitd runs in a managed environment like Kubernetes without its own cgroup namespace (the default for privileged pods in Kubernetes where cgroup v2 is in use; see the [cgroup v2 KEP](https://github.com/kubernetes/enhancements/tree/6d3210f7dd5d547c8f7f6a33af6a09eb45193cd7/keps/sig-node/2254-cgroup-v2#cgroup-namespace)), the OCI worker spawns processes in cgroups outside the hierarchy created for the buildkitd container. This leads to incorrect resource accounting and enforcement, which in turn can cause OOM errors and CPU contention on the node.

Example behavior without this change:

```console
root@k8s-node:/# cat /proc/$(pgrep -n buildkitd)/cgroup
0::/init
root@k8s-node:/# cat /proc/$(pgrep -n some-build-process)/cgroup
0::/buildkit/{runc-container-id}
```

Example behavior with this change:

```console
root@k8s-node:/# cat /proc/$(pgrep -n buildkitd)/cgroup
0::/kubepods/burstable/pod{pod-id}/{container-id}/init
root@k8s-node:/# cat /proc/$(pgrep -n some-build-process)/cgroup
0::/kubepods/burstable/pod{pod-id}/{container-id}/buildkit/{runc-container-id}
```
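The `0::<path>` lines above are the cgroup v2 entries of `/proc/<pid>/cgroup`. A tiny helper to extract the path (purely illustrative, not part of this PR):

```shell
#!/bin/sh
# Extract the cgroup v2 path from /proc/<pid>/cgroup content on stdin.
# On a unified (v2-only) host there is a single entry of the form "0::<path>".
cgroup_path() {
  sed -n 's/^0:://p'
}

# With the entrypoint change, the extracted path stays beneath the
# kubepods hierarchy instead of starting at a bare "/init".
printf '0::/kubepods/burstable/pod123/abc/init\n' | cgroup_path
# → /kubepods/burstable/pod123/abc/init
```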

Note this was developed as an alternative approach to #6343

@marxarelli (Contributor, Author)

@tonistiigi this is the alternative approach I mentioned in #6343 (comment).

Note that I first tried to implement the namespace creation and remounting in buildkitd itself using calls to `unix.Unshare` and `unix.Mount`, but encountered some strange behavior: the main buildkitd process was placed in a new cgroup namespace but, for some reason, `buildkit-runc` was not. It may be that not all Go runtime threads were moved into the new namespace (`unshare(2)` affects only the calling OS thread, while the Go scheduler moves goroutines between threads); I'm not sure.

In any case, using unshare in the entrypoint seems less error prone.

@marxarelli marxarelli force-pushed the review/unshare-cgroupns-entrypoint branch from e12f7af to 3cf93c3 Compare November 17, 2025 20:40
@marxarelli marxarelli force-pushed the review/unshare-cgroupns-entrypoint branch from 3cf93c3 to 7a50ed7 Compare November 17, 2025 20:57
@AkihiroSuda (Member)

> When buildkitd is run in a managed environment like Kubernetes without its own cgroup namespace (the default behavior of privileged pods in Kubernetes where cgroup v2 is in use; see cgroup v2 KEP), the OCI worker will spawn processes in cgroups that are outside of the cgroup hierarchy that was created for the buildkitd container, leading to incorrect resource accounting and enforcement which in turn can cause OOM errors and CPU contention on the node.

In the long term can we just extend Kubernetes to support unsharing cgroupns?

@marxarelli (Contributor, Author)

> In the long term can we just extend Kubernetes to support unsharing cgroupns?

That would be ideal if Kubernetes had a field in SecurityContext for controlling that.

FWIW we've been using a custom entrypoint based on this PR for a couple of weeks. No issues so far.

```diff
 EOF
 ENV BUILDKIT_SETUP_CGROUPV2_ROOT=1
-ENTRYPOINT ["buildkitd"]
+ENTRYPOINT ["/usr/bin/buildkitd-entrypoint"]
```
Member (review comment on the diff above):
In the case of `docker run --privileged` this script does not seem needed, as Docker unshares cgroupns even in privileged mode.
So this entrypoint script should be opt-in.
It should also be marked as a workaround until Kubernetes supports unsharing cgroupns.
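Assuming the opt-in ends up as the `BUILDKIT_SETUP_CGROUPV2_ROOT` environment variable shown in the diff above, enabling the workaround in a Kubernetes pod spec might look like this (hypothetical manifest fragment; the container name and image tag are assumptions):

```yaml
# Hypothetical fragment: opts a privileged buildkitd pod into the
# entrypoint's cgroup v2 namespace setup.
containers:
  - name: buildkitd
    image: moby/buildkit:latest
    securityContext:
      privileged: true
    env:
      - name: BUILDKIT_SETUP_CGROUPV2_ROOT
        value: "1"
```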

@AkihiroSuda left a review comment.