Make (private) cgroup mounts writable by default? #14322

LewisGaul · 2022-05-23T14:51:42Z


Hi everyone,

[Taken from my blog post of cgroup questions.]

Context

When running a non-privileged container, cgroup mounts are created read-only by podman by default:

[root@fedora ~]# podman run --entrypoint='["findmnt", "-R", "/sys/fs/cgroup"]' ubuntu:22.04
TARGET         SOURCE  FSTYPE  OPTIONS
/sys/fs/cgroup cgroup2 cgroup2 ro,nosuid,nodev,noexec,relatime,seclabel,nsdelegate,memory_recursiveprot

(With cgroups v1 the cgroup controller mounts are likewise read-only. For both v1 and v2 this is the case regardless of --cgroupns being 'host' or 'private'. The output above is from podman v4.1.0.)

Podman natively supports running systemd inside a container. By default systemd is detected from the entrypoint, or the systemd setup can be forced with --systemd=always. To support systemd, the cgroup mounts are created read-write instead of read-only, since systemd wants to own the container's cgroups.

Question

My question is: why is making the cgroup mounts read-write specific to running systemd? Surely it's equally valid for an alternative PID 1 entrypoint to want to own/modify the container's cgroups?

Justification

I would say that it only really makes sense for cgroup mounts to be writable when using a private cgroup namespace, since giving the container write access to the whole host's cgroups doesn't sound like a good idea. However, --cgroupns=private is the default on cgroups v2 anyway (and is an option on cgroups v1 too).

To me this seems equivalent to extra permissions allowed within a container due to namespacing:

The filesystem chroot (the root directory / is owned by the container)
The network namespace (all ports and IP addresses are available to the container unless host networking is used)
PID namespace (containers have their own PID 1 init process)
Mount namespace (container mounts are listed separately)
UID and GID namespaces (the root user is available in the container even if the container is not run by root)
Cgroup namespace (only the container’s cgroups are visible to the container, made available at /sys/fs/cgroup/)
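As a rough illustration of the host/private distinction, the following Python sketch checks which cgroup namespace a process is in by reading the /proc/self/ns/cgroup symlink, whose target has the form cgroup:[INODE]. The helper names are made up for this example. The comparison against 0xEFFFFFFB assumes the kernel's fixed inode number for the initial cgroup namespace, which is an implementation detail rather than a documented interface, so treat the result as a hint only.

```python
import os
import re
from typing import Optional, Tuple

# Assumption: inode number the kernel assigns to the initial (host) cgroup
# namespace. This is an implementation detail, not a stable API.
INIT_CGROUP_NS_INODE = 0xEFFFFFFB


def parse_ns_link(link: str) -> Tuple[str, int]:
    """Split a namespace symlink target such as 'cgroup:[4026531835]'."""
    match = re.fullmatch(r"(\w+):\[(\d+)\]", link)
    if match is None:
        raise ValueError(f"unexpected namespace link format: {link!r}")
    return match.group(1), int(match.group(2))


def is_initial_cgroup_ns(link: Optional[str] = None) -> bool:
    """Return True if the link refers to the initial cgroup namespace.

    With no argument, the current process's own namespace is inspected.
    """
    if link is None:
        link = os.readlink("/proc/self/ns/cgroup")
    kind, inode = parse_ns_link(link)
    if kind != "cgroup":
        raise ValueError(f"not a cgroup namespace link: {link!r}")
    return inode == INIT_CGROUP_NS_INODE
```

Run inside a container, is_initial_cgroup_ns() would be expected to return False with --cgroupns=private and True with --cgroupns=host, provided the assumption about the inode number holds on the kernel in use.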

I assume the behaviour of making the cgroup mounts read-only is copied from docker, and I'm thinking of raising the question in the docker community too. One of the key differences, though, seems to be that podman supports running a fully-fledged system (e.g. started with systemd), whereas docker seems to prioritise single-process containers. The other reason it might have ended up this way is that cgroup namespaces haven't been around for that long, and writable mounts seem like a bad idea in the --cgroupns=host case, as mentioned above.

Caveats

Note that one downside of making cgroups writable inside containers is that currently cgroup limits are applied in the cgroup that's 'delegated' to the container:

[root@fedora ~]# podman run --memory=20000000 --entrypoint='["cat", "/sys/fs/cgroup/memory.max"]' ubuntu:22.04
19996672

This means that with write access to the cgroups, a container would be able to override the limit imposed by podman. This could be 'fixed' by adding another level in the cgroup hierarchy and setting the limit on the parent of the cgroup that's 'delegated' to the container. However, as far as I can tell, in that case the container would not be able to check the cgroup limits imposed on it, though this feels like a general limitation of cgroup namespaces?
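To make the extra-level idea concrete: on cgroups v2, memory.max is enforced hierarchically, so a cgroup can never use more than the smallest limit set on itself or any of its ancestors. The Python sketch below models that rule on plain values rather than reading real cgroup files. The function names are invented for this example, and the list of values stands in for the memory.max contents along a path from some ancestor down to the container's cgroup.

```python
import math
from typing import Iterable


def parse_limit(raw: str) -> float:
    """Convert a memory.max value to a number; 'max' means unlimited."""
    raw = raw.strip()
    return math.inf if raw == "max" else int(raw)


def effective_memory_limit(limits_root_to_leaf: Iterable[str]) -> float:
    """Return the tightest limit along a cgroup path (ancestor first).

    An empty path is treated as unlimited.
    """
    return min((parse_limit(v) for v in limits_root_to_leaf), default=math.inf)


# Limit set on the delegated cgroup itself: a container with write access
# could rewrite the last entry to "max" and lift the limit.
today = ["max", "19996672"]

# Limit set one level up: the container can only write the last entry,
# so the parent's value still caps it.
with_extra_level = ["max", "19996672", "max"]
```

In the second layout, effective_memory_limit(with_extra_level) stays at 19996672 whatever the container writes into its own memory.max. It also shows the visibility problem raised above: the entry that actually constrains the container sits in a cgroup outside its namespace.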

mheon · 2022-05-23T15:11:03Z

Maintainer

@giuseppe PTAL - I think you're most qualified to answer.


giuseppe · 2022-05-25T10:40:43Z

Maintainer

the reason why it is ro by default is that even if delegation for cgroup v2 is safe, it still has a cost for the kernel (AFAIK, especially for the CPU controller). For this reason, it seemed safer to use ro by default, since anyway very few containers use cgroups.


giuseppe May 25, 2022
Maintainer

we can already do that with --security-opt unmask=/sys/fs/cgroup:

$ podman run --rm fedora grep /sys/fs/cgroup /proc/self/mountinfo 
1790 1781 0:28 / /sys/fs/cgroup ro,nosuid,nodev,noexec,relatime - cgroup2 cgroup2 rw,seclabel,nsdelegate,memory_recursiveprot

$ podman run --rm --security-opt unmask=/sys/fs/cgroup fedora grep /sys/fs/cgroup /proc/self/mountinfo 
1790 1781 0:28 / /sys/fs/cgroup rw,nosuid,nodev,noexec,relatime - cgroup2 cgroup2 rw,seclabel,nsdelegate,memory_recursiveprot
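Note that both lines end in rw: in /proc/self/mountinfo the options before the " - " separator belong to this particular mount, while the options after the filesystem type and source belong to the superblock. Only the per-mount options change from ro to rw here. A minimal Python sketch of pulling the two apart follows. The function names are made up for this example, and it ignores the octal escaping that mountinfo applies to unusual characters in paths.

```python
from typing import Dict, List, Union


def parse_mountinfo_line(line: str) -> Dict[str, Union[str, List[str]]]:
    """Split one /proc/<pid>/mountinfo line into the fields of interest."""
    left, sep, right = line.strip().partition(" - ")
    if not sep:
        raise ValueError("missing ' - ' separator in mountinfo line")
    fields = left.split()
    tail = right.split()
    if len(fields) < 6 or len(tail) < 2:
        raise ValueError("too few fields in mountinfo line")
    return {
        "mount_point": fields[4],
        "mount_options": fields[5].split(","),   # per-mount: ro/rw, nosuid, ...
        "fstype": tail[0],
        "source": tail[1],
        "super_options": tail[2].split(",") if len(tail) > 2 else [],
    }


def mount_is_read_only(line: str) -> bool:
    """True if the per-mount options (not the superblock's) include 'ro'."""
    return "ro" in parse_mountinfo_line(line)["mount_options"]


default_line = (
    "1790 1781 0:28 / /sys/fs/cgroup ro,nosuid,nodev,noexec,relatime"
    " - cgroup2 cgroup2 rw,seclabel,nsdelegate,memory_recursiveprot"
)
unmasked_line = default_line.replace(" ro,nosuid", " rw,nosuid")
```

Here default_line is the first output line above and unmasked_line reproduces the second, so mount_is_read_only distinguishes the two cases even though the superblock options are identical.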

rhatdan May 25, 2022
Maintainer

@LewisGaul Does this work for you?

LewisGaul Jun 8, 2022
Author

Sorry for the late response, and thanks for the input. I'm afraid I'm not completely following.

> the reason why it is ro by default is that even if delegation for cgroup v2 is safe, it still has a cost for the kernel (AFAIK, especially for the CPU controller).

My understanding of nsdelegate is lacking, but any cost of using a private cgroup namespace seems undesirable. Do you have any more details/references on this?

What's more, I'm primarily concerned with cgroups v1 at this stage, and I'm not sure your point applies there? Although of course consistency between v1 and v2 setup is desirable.

> we can already do that with --security-opt unmask=/sys/fs/cgroup

This is interesting to know, thanks. However, I was specifically wondering why rw isn't the default (without needing extra args/configuration), with my reasoning outlined in my original post.

giuseppe Jun 8, 2022
Maintainer

delegation for v1 is not safe in any case. So for cgroup v1 it is unsafe as well as carrying the additional costs.

LewisGaul Jun 8, 2022
Author

What do you mean by it not being safe?

Aside from the practical details (e.g. performance costs, safety on cgroups v1), wouldn't it make more sense architecturally for the container to be able to own its cgroup namespace, in the same way it has full access within its other namespaces (e.g. being root within the container allows killing container processes)?
