@giuseppe PTAL - I think you're most qualified to answer.
the reason why it is
Hi everyone,
[Taken from my blog post of cgroup questions.]
Context
When running a non-privileged container, cgroup mounts are created read-only by podman by default:
(With cgroups v1 the cgroup controller mounts are read-only, and for both v1 and v2 this is the case regardless of whether `--cgroupns` is 'host' or 'private'; the above was produced using podman v4.1.0.)
Podman natively supports running systemd inside a container. This is detected by default (based on the entrypoint), or the systemd setup can be forced with `--systemd=always`. To support systemd, the cgroup mounts are created read-write instead of read-only, since systemd wants to own the container's cgroups.
Question
My question is: why is making the cgroup mounts read-write a behaviour specific to running systemd? Surely it's equally valid for an alternative PID 1 entrypoint to want to own/modify the container's cgroups?
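For anyone wanting to check which case applies to their own container, here is a minimal sketch that reports whether each cgroup mount is read-only or read-write. It assumes the standard `/proc/mounts` field layout (device, mount point, filesystem type, options); the function name `cgroup_mount_modes` is hypothetical and not part of podman.

```python
def cgroup_mount_modes(mounts_text):
    """Map each cgroup (v1) or cgroup2 (v2) mount point to 'ro' or 'rw'.

    `mounts_text` is expected to be the contents of /proc/mounts.
    """
    modes = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point, fs_type, options = fields[1], fields[2], fields[3]
        if fs_type in ("cgroup", "cgroup2"):
            # The mount options are comma-separated and include 'ro' or 'rw'.
            modes[mount_point] = "ro" if "ro" in options.split(",") else "rw"
    return modes
```

Inside a container, calling it as `cgroup_mount_modes(open("/proc/mounts").read())` should show whether the mounts were created read-only or read-write.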
Justification
I would say that it only really makes sense for cgroup mounts to be writable when using a private cgroup namespace, since giving the container write access to the whole host's cgroups doesn't sound like a good idea. However, `--cgroupns=private` is the default on cgroups v2 anyway (and is an option on cgroups v1 too).
To me this seems equivalent to extra permissions allowed within a container due to namespacing:
/sys/fs/cgroup/)
I assume the behaviour of making the cgroup mounts read-only is copied from docker, and I'm thinking of raising the question in the docker community too, but one of the key differences seems to be that podman supports running a fully-fledged system (e.g. started with systemd) whereas docker seems to prioritise single-process containers. The other reason it might have ended up this way is that cgroup namespaces haven't been around for that long, and it seems like a bad idea in the `--cgroupns=host` case, as mentioned above.
Caveats
Note that one downside of making cgroups writable inside containers is that currently cgroup limits are applied in the cgroup that's 'delegated' to the container:
This means that with write access to the cgroups, a container would be able to override the limit imposed by podman. This could be 'fixed' by adding another level in the cgroup hierarchy and setting the limit on the parent of the cgroup that's 'delegated' to the container. However, as far as I can tell, in this case the container would not be able to check the cgroup limits imposed on it, but this feels like a general limitation of cgroup namespaces?
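To make the extra-level idea concrete, here is a rough sketch of the layout I have in mind, assuming cgroups v2, where a limit written to a cgroup's `memory.max` applies to the whole subtree below it. The function name `create_delegated_cgroup` and the child name `container` are hypothetical, and this is not how podman currently sets things up.

```python
from pathlib import Path


def create_delegated_cgroup(root, name, memory_max):
    """Create <root>/<name> to hold the limit and <root>/<name>/container to delegate.

    Only the child would be made the root of the container's cgroup
    namespace, so the parent's memory.max stays outside the container's view.
    """
    parent = Path(root) / name
    child = parent / "container"  # hypothetical name for the delegated cgroup
    child.mkdir(parents=True, exist_ok=True)
    # The limit lives on the parent; the child is deliberately left without one.
    (parent / "memory.max").write_text(f"{memory_max}\n")
    return parent, child
```

On a real system `root` would be a directory under `/sys/fs/cgroup` and the memory controller would need to be enabled for it; pointing `root` at a plain directory just creates ordinary files, which is enough to see the shape of the hierarchy.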