This is a hands-on guide I put together to help others understand cgroups. It is the first in a planned series on how containers work at a low level.
It is designed to be easy to follow along: run the commands in order and read the explanations as you interact with the cgroups on your system. If there is anything here you think is missing, feel free to let me know, or contribute to the repository.
Please be careful where you run these commands, as they affect your kernel configuration. Even though none of the commands will destroy your system, it stands to reason that you shouldn't run these in production or directly on the host OS you use for work; run them in a VM instead. To write this guide I used OrbStack on macOS, as it's great for running low-footprint VMs where you can quickly test things out, but any other VM running Linux should work.
Broadly speaking containers are made up of the following:
- cgroups: how many resources a container is entitled to
- namespaces: what resources a container can access
- chroot: what files a container can access
cgroups control the amount of resources that processes are entitled to. In the context of containers they are very important to ensure that no single container hogs all of the resources of a system, leaving nothing for others.
In this section we will cover exclusively cgroups.
I am not covering cgroups v1 in this guide. To determine which cgroup version your system is using, run the following command:
stat -fc %T /sys/fs/cgroup/
The output should be:
cgroup2fs
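If you want to script that check, here is a minimal sketch. The filesystem type is hard-coded to the value printed above, since the live value depends on the system you run it on; on a real machine you would capture it from `stat` as shown in the comment:

```shell
# Hard-coded for illustration; on a real system capture it with:
#   fs_type=$(stat -fc %T /sys/fs/cgroup/)
fs_type="cgroup2fs"

if [ "$fs_type" = "cgroup2fs" ]; then
    echo "cgroups v2 (unified hierarchy) - this guide applies"
else
    echo "cgroups v1 or hybrid hierarchy - this guide does not apply"
fi
```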
If you want to know the differences between v1 and v2, read here or watch this talk. Here is a list of issues with cgroup v1 that cgroup v2 solves.
Let's begin our journey by learning how the cgroup filesystem is organized. Run the following command:
ls /sys/fs/cgroup/
You should get an output like this:
cgroup.controllers cpu.max.burst io.max memory.swap.current
cgroup.events cpuset.cpus io.stat memory.swap.events
cgroup.freeze cpuset.cpus.effective memory.current memory.swap.high
cgroup.kill cpuset.cpus.exclusive memory.events memory.swap.max
cgroup.max.depth cpuset.cpus.exclusive.effective memory.events.local memory.swap.peak
cgroup.max.descendants cpuset.cpus.partition memory.high pids.current
cgroup.procs cpuset.mems memory.low pids.events
cgroup.stat cpuset.mems.effective memory.max pids.max
cgroup.subtree_control cpu.stat memory.min pids.peak
cgroup.threads cpu.stat.local memory.oom.group system.slice
cgroup.type cpu.weight memory.peak user.slice
cpu.idle cpu.weight.nice memory.reclaim
cpu.max init.scope memory.stat
The directory `/sys/fs/cgroup` represents the root cgroup of your system; directories and subdirectories created under it share its resources.
I am not going to describe all of these files in detail; that's what the man page is for. But I will give an overview of their types so we can understand their classification.
The directories here represent child cgroups. The ones present now are `init.scope`, `user.slice` and `system.slice`. I won't talk about them here, but if you are curious you can learn more here.
cgroups are composed of core and controllers.
- Controllers: as the name implies, they control a type of system resource within the cgroup and its hierarchy. Controller files in the cgroup filesystem are prefixed by the controller name, such as `memory.` or `cpu.`
- Core: primarily responsible for hierarchically organizing processes. Core files are prefixed by `cgroup.`
Some of the files in the cgroup filesystem are read-only and dynamically show the status or configuration of the cgroup or controller. For example:
cat /sys/fs/cgroup/cgroup.controllers
--> Shows a space-separated list of all controllers available to the cgroup.
watch cat /sys/fs/cgroup/memory.stat
--> If you watch the output of this file you will see it change slightly every two seconds, showing the status of the cgroup's memory footprint.
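`memory.stat` is a flat keyed file, one `key value` pair per line, which makes it easy to pull single fields out with `awk`. A sketch using a made-up sample instead of the live file so it runs anywhere (the keys are real, the values are illustrative only):

```shell
# Hypothetical excerpt of memory.stat; values are illustrative only
sample="anon 1052672
file 316628992
kernel_stack 16384
slab 9437184"

# Extract the page-cache ("file") figure, in bytes
file_bytes=$(echo "$sample" | awk '$1 == "file" { print $2 }')
echo "$file_bytes"
```

On a live system you would replace `echo "$sample"` with `cat /sys/fs/cgroup/memory.stat`.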
Other files are writable; you can write to them directly to change the configuration of the cgroup. For example:
echo $$ > /sys/fs/cgroup/cgroup.procs
--> Adds your current shell process to the root cgroup.
echo "max" > /sys/fs/cgroup/containers/pids.max
--> Sets the pids limit to the maximum allowed (i.e., no limit).
Except for the `cgroup.` prefix, which covers core configuration and status of the cgroup, all other prefixes configure and show the status of controllers:
- `cpu`: regulates the distribution of CPU cycles. For example, setting `cpu.max` will limit the amount of CPU a process can use, whereas `cpu.stat` reports the status of the CPU, and so on.
- `memory`: regulates the distribution of memory. For example, `memory.max` will try to keep the memory of a process within that limit, and `memory.min` will ensure a process always has that amount of memory available to it. It's worth noting that unlike CPU, which is simply constrained at its limit, memory can only be throttled for so long before the process is killed if it goes over its limits. If you really want to ensure that a process does not "cheat" its way into an inordinate amount of memory by paging, you will also want to set `memory.swap.max`.
- `io`: regulates the distribution of IO resources, in other words, access to disk (HDD, SSD, etc.).
- `pids`: regulates the number of processes. For example, `pids.max` sets a hard limit on the number of processes, whereas `pids.current` returns the number of processes currently running in the cgroup and its descendants.
- `cpuset`: restricts tasks to specific CPUs and memory nodes, optimizing performance by minimizing memory access latency and contention; particularly useful on large systems with Non-Uniform Memory Access (NUMA) architecture.
- `cgroup`: not a controller. `cgroup.` files provide configuration for the core.
Many of the commands going forward will not work unless you are root; you can either prefix them with `sudo` or switch to root now with `sudo su`.
Before we move on, export these variables; you can change the values to whatever you want:
export PARENT_CGROUP=containers
export CHILD_CGROUP=goofytimes
Also, you may need to install `cgroup-tools` on your system; on Debian-based systems you can do so with the following command:
apt -y install cgroup-tools
Now let's create our parent cgroup:
cgcreate -g memory,cpu:/${PARENT_CGROUP}
If you got no output, that's probably a good sign. Let's check that it was indeed created and what's inside it:
ls /sys/fs/cgroup/${PARENT_CGROUP}
Output:
cgroup.controllers cgroup.subtree_control cpu.weight memory.min memory.swap.max
cgroup.events cgroup.threads cpu.weight.nice memory.oom.group memory.swap.peak
cgroup.freeze cgroup.type memory.current memory.peak pids.current
cgroup.kill cpu.idle memory.events memory.reclaim pids.events
cgroup.max.depth cpu.max memory.events.local memory.stat pids.max
cgroup.max.descendants cpu.max.burst memory.high memory.swap.current pids.peak
cgroup.procs cpu.stat memory.low memory.swap.events
cgroup.stat cpu.stat.local memory.max memory.swap.high
You will notice that our cgroup has the `pids` controller enabled, but why? We specifically created a control group with the `cpu` and `memory` controllers only, so we shouldn't get the `pids` controller, right?
Let's take a look at the controllers enabled in this directory with the following command:
cat /sys/fs/cgroup/${PARENT_CGROUP}/cgroup.controllers
Which gives the following output:
cpu memory pids
So what gives then? Before I explain why, run this command:
cat /sys/fs/cgroup/cgroup.subtree_control
This gives the exact same output as the file `cgroup.controllers` in our new cgroup:
cpu memory pids
That means that all children of the root cgroup will have those controllers enabled (but not necessarily its grandchildren). On the other hand, if you look at the same file under `$PARENT_CGROUP`:
cat /sys/fs/cgroup/${PARENT_CGROUP}/cgroup.subtree_control
It returns nothing, meaning we can set our own `subtree_control` here, but only for the controllers enabled in this cgroup.
If we run the same command from earlier again to create a child for our cgroup named `$CHILD_CGROUP`:
cgcreate -g memory,cpu:/${PARENT_CGROUP}/${CHILD_CGROUP}
And if we then look again at `cgroup.subtree_control` under `${PARENT_CGROUP}`, we see that the right controllers have now been added:
# cat /sys/fs/cgroup/${PARENT_CGROUP}/cgroup.subtree_control
cpu memory
That means that any children under `${PARENT_CGROUP}` will now have the `cpu` and `memory` controllers enabled by default.
Note that any cgroup that has `subtree_control` enabled cannot have any processes of its own, EXCEPT the root cgroup (`/sys/fs/cgroup`). That means you won't be able to assign any processes to `${PARENT_CGROUP}`, only to `${PARENT_CGROUP}/${CHILD_CGROUP}` or any other directory created under `${PARENT_CGROUP}` that doesn't have `subtree_control` enabled. Read more about it in the man page under No Internal Process Constraint.
And if we look at the new cgroup folder for `${CHILD_CGROUP}`:
ls /sys/fs/cgroup/${PARENT_CGROUP}/${CHILD_CGROUP}
We get the following output:
cgroup.controllers cgroup.stat cpu.stat memory.low memory.swap.current
cgroup.events cgroup.subtree_control cpu.weight memory.max memory.swap.events
cgroup.freeze cgroup.threads cpu.weight.nice memory.min memory.swap.high
cgroup.kill cgroup.type memory.current memory.oom.group memory.swap.max
cgroup.max.depth cpu.idle memory.events memory.peak memory.swap.peak
cgroup.max.descendants cpu.max memory.events.local memory.reclaim
cgroup.procs cpu.max.burst memory.high memory.stat
Notice how the `io`, `cpuset` and `pids` controller files are not there. Exactly what we want.
In this section we are going to test adding maximum cpu limits and memory limits.
You can check out resource distribution models in the man page for more information about how you can distribute your resource allocation along the cgroup hierarchy.
- `cpu.max`: the processes assigned to this cgroup won't be able to use more CPU than what we assign here; we will set a limit that equals 10% of CPU.
- `memory.max`: maximum RAM (around 100 MB in bytes), but the process can still use swap!
- `memory.swap.max`: limits the amount of swap the process can use, also 100 MB.
`cpu.max` sets a hard limit on the process: the CPU usage of the entire cgroup will not go above around 10%, shared among all of its processes. CPU throttling never kills a process.
`memory.max`: if memory usage goes above the limit specified here, the kernel will enforce it by throttling the memory usage of the cgroup. Throttling means increasing memory pressure, which leads the kernel to reduce memory allocation requests and aggressively reclaim memory from the cgroup; this involves freeing data from memory that's not needed or swapping it out to disk (depending on the swappiness setting). If that fails, it will trigger an OOM kill for the process in the cgroup with the highest OOM score. By setting both `memory.max` and `memory.swap.max` we are essentially forcing an early OOM in the tests we are going to perform later in this guide, because the kernel will be unable to reclaim much memory into swap. The criteria that determine the OOM score, and what you can do to protect a process from being OOM killed, can be found here.
Let's set these three:
cgset -r memory.max=100000000 ${PARENT_CGROUP}/${CHILD_CGROUP}
cgset -r memory.swap.max=100000000 ${PARENT_CGROUP}/${CHILD_CGROUP}
cgset -r cpu.max="100000 1000000" ${PARENT_CGROUP}/${CHILD_CGROUP}
The values for `memory` are set in bytes here (but you could also write `100M` for 100 megs, `1G` for one gig, and so on). The value for `cpu` is set as `$MAX $PERIOD`: you can think of it as stating that for every one million microseconds (one second) the CPU can only be used for one hundred thousand microseconds (a tenth of a second), so around a 10% limit. You can adjust this ratio to as low as `"1000 10000"` if you prefer, for a similar result.
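Since only the `$MAX $PERIOD` ratio matters, such pairs are interchangeable. A quick sketch that computes the effective CPU percentage for a few equivalent settings:

```shell
# Each pair is "MAX PERIOD" in microseconds; all of them work out to 10%
for spec in "100000 1000000" "10000 100000" "1000 10000"; do
    set -- $spec                          # $1 = MAX, $2 = PERIOD
    echo "cpu.max = \"$1 $2\" -> $(( $1 * 100 / $2 ))% CPU"
done
```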
We can double check the values were added with this command:
cat /sys/fs/cgroup/${PARENT_CGROUP}/${CHILD_CGROUP}/{memory,cpu,memory.swap}.max
Which will yield the output:
99999744
100000 1000000
99999744
If you are curious why the `100000000` of memory we assigned got changed to `99999744`, that's because it is adjusted internally to align with the kernel page size. Running the command `getconf PAGE_SIZE` returns `4096`. If you divide `100000000 / 4096` you get `24414.0625`, so you can effectively use `24414` pages of memory here. Multiply `24414 * 4096` and you get `99999744` - the closest you can get to your original number based on the page size.
Let's start first by seeing what happens if we stress the CPU without a process being part of our cgroup:
yes >/dev/null &
sleep 0.5
ps -p $! -o %cpu
kill $!
When you run these commands you will likely see a CPU usage of nearly 100%. This is because the `yes` command doesn't have a cap on its resources:
%CPU
98.1
Now let's enter a bash terminal that's inside the cgroup we created like so:
cgexec -g memory,cpu:${PARENT_CGROUP}/${CHILD_CGROUP} bash
Run the same command block to test our CPU above and notice the difference:
%CPU
11.8
As you can see, the CPU barely gets above 10%, and if you keep watching the running command with a tool like `htop` you will notice it rarely goes above that.
Let's stay in our cgrouped `bash` fork and observe what happens to our process when we fill its memory up.
Let's now run this command within our bash cgrouped process from above:
echo $(tr -d '\0' < /dev/urandom | head -c200M) > /dev/null &
This creates a very large variable of 200 MB, which blows past the roughly 100 MB limit we imposed earlier. Hence, the process will eventually be killed, but in the meantime we can watch the memory and swap grow on its way to death with the following command:
watch ps -p $! -o rss,sz
The output will look more or less like this:
RSS SZ
19432 7056
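To make sense of those two columns: `ps` reports `RSS` (resident set size) in KiB and `SZ` in pages. A sketch converting the sample row above to MiB, assuming a 4096-byte page size:

```shell
# Sample values taken from the ps output above
RSS_KIB=19432        # resident set size, in KiB
SZ_PAGES=7056        # virtual size, in pages
PAGE_SIZE=4096       # assumed page size (getconf PAGE_SIZE)

echo "RSS: $(( RSS_KIB / 1024 )) MiB"                       # 18 MiB
echo "SZ:  $(( SZ_PAGES * PAGE_SIZE / 1024 / 1024 )) MiB"   # 27 MiB
```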
You will eventually see this:
[2]+ Killed echo $(tr -d '\0' < /dev/urandom | head -c200M) > /dev/null
Once you see the process die, you can exit the bash process that's on the cgroup:
exit
If you use `memory.high` and `memory.swap.high` instead, what will happen in the above scenario is that the process will fill the memory as much as it can and then enter a `D` (uninterruptible) state until an agent or another process frees memory up. This may be a more desirable scenario for most cases, as it acts as a softer limit.
This graph shows the different memory protection levels you can assign to a process within a cgroup:
`systemd-cgls` and `systemd-cgtop` are the equivalents of `ls` and `top` for cgroups.
First, it's worth noting that if a cgroup doesn't have any processes assigned to it, it will not show up in `systemd-cgtop`, so we will run a few commands inside our cgroup to check it out.
for p in {1..5} ; do cgexec -g memory,cpu:${PARENT_CGROUP}/${CHILD_CGROUP} sleep 2000 & done
cgexec -g memory,cpu:${PARENT_CGROUP}/${CHILD_CGROUP} yes > /dev/null &
Now that we have a few processes running:
systemd-cgtop
Yields the following output:
Control Group Tasks %CPU Memory Input/s Output/s
/ 78 11.6 1.3G - -
/containers 6 10.0 680.0K - -
/containers/goofytimes - 10.0 680.0K - -
/system.slice - - 41.6M - -
Notice that the CPU sits just above 10%, as we defined earlier.
Now let's list all our cgroups and processes inside them:
systemd-cgls /containers
And we get:
Control group /containers:
└─goofytimes (#17303)
├─12149 sleep 2000
├─12150 sleep 2000
├─12151 sleep 2000
├─12152 sleep 2000
├─12153 sleep 2000
└─12156 yes
This one is fun. Now that we are done with our processes, let's kill them all by running the following command:
echo 1 > /sys/fs/cgroup/${PARENT_CGROUP}/${CHILD_CGROUP}/cgroup.kill
After you press enter twice you will see all the processes have been killed:
echo 1 > /sys/fs/cgroup/${PARENT_CGROUP}/${CHILD_CGROUP}/cgroup.kill
[6] Killed cgexec -g memory,cpu:${PARENT_CGROUP}/${CHILD_CGROUP} sleep 2000
[7] Killed cgexec -g memory,cpu:${PARENT_CGROUP}/${CHILD_CGROUP} sleep 2000
[8] Killed cgexec -g memory,cpu:${PARENT_CGROUP}/${CHILD_CGROUP} sleep 2000
[9] Killed cgexec -g memory,cpu:${PARENT_CGROUP}/${CHILD_CGROUP} sleep 2000
[10]- Killed cgexec -g memory,cpu:${PARENT_CGROUP}/${CHILD_CGROUP} sleep 2000
[11]+ Killed cgexec -g memory,cpu:${PARENT_CGROUP}/${CHILD_CGROUP} yes > /dev/null
# systemd-cgls /containers
Control group /containers:
# No processes
So far we have been running new processes with `cgexec`, which places them directly under a cgroup, but what happens when we already have a running process that we want to assign to a cgroup? That's where the command `cgclassify` comes in. Consider this scenario:
yes >/dev/null &
sleep 0.5
ps -p $! -o %cpu
The output:
# yes > /dev/null &
[2] 12210
# ps -p $! -o%cpu
%CPU
99.6
Our `yes` process is using 100% CPU but we want to limit it to 10% again. Time to cage that baby:
cgclassify -g cpu,memory:${PARENT_CGROUP}/${CHILD_CGROUP} $!
We are using `$!` to substitute the PID of the last command we put in the background. Now here something strange happens: if you run our command to check the CPU:
ps -p $! -o %cpu
%CPU
69.0
You can see the CPU is still not at 10%, but if you run `htop`, the process has already come down to 10% usage, while `ps` only slowly reports it coming down. This is likely because `ps` computes `%CPU` as the CPU time used over the entire lifetime of the process rather than instantaneous usage, so the average takes a while to fall. Let's put our process out of its misery:
kill $!
If you want to get an overview of the configuration parameters for a cgroup, we can use the command `cgget`. Running:
cgget ${PARENT_CGROUP}/${CHILD_CGROUP}
Yields something like this (output truncated):
containers/goofytimes:
cpu.weight: 100
cpu.stat: usage_usec 267987160
user_usec 91171028
system_usec 176816131
nr_periods 26986
nr_throttled 26758
throttled_usec 3146294769
nr_bursts 0
burst_usec 0
cpu.weight.nice: 0
cpu.idle: 0
cpu.max.burst: 0
cpu.max: 10000 100000
memory.events: low 0
high 0
max 1419
oom 43
oom_kill 2
oom_group_kill 0
...
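The `cpu.stat` fields in that output tell the throttling story: `nr_throttled` out of `nr_periods` is the fraction of scheduling periods in which the cgroup exhausted its `cpu.max` budget. Computing it from the numbers above:

```shell
# Values from the cpu.stat lines in the cgget output above
nr_periods=26986
nr_throttled=26758

# Integer percentage of periods in which the cgroup was throttled
pct=$(( nr_throttled * 100 / nr_periods ))
echo "throttled in ${pct}% of scheduling periods"   # 99%
```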
You have reached the end of our journey and can now delete the cgroups we created; this command will recursively delete both:
cgdelete -r -g cpu:/${PARENT_CGROUP}
Did you notice how in the commands `cgexec` and `cgdelete` we specify the controller even though it seemingly doesn't matter? Well, it doesn't really matter. The command requires it only because in cgroups v1 the hierarchy was organized by controller. In v2 this is no longer the case, but the cgroup commands work with both versions, so I am guessing the requirement will stay for a while for backwards compatibility.
In this guide we have mostly used `cgroup-tools` to manage our cgroups, but you can do largely all of this by interacting directly with the cgroup filesystem. This is a list of commands and their equivalents:
cgcreate -g memory,cpu:/${PARENT_CGROUP}/${CHILD_CGROUP}
--> mkdir /sys/fs/cgroup/${PARENT_CGROUP}/ && echo "+cpu +memory" > /sys/fs/cgroup/${PARENT_CGROUP}/cgroup.subtree_control && mkdir /sys/fs/cgroup/${PARENT_CGROUP}/${CHILD_CGROUP}
cgset -r memory.max=100000000 ${PARENT_CGROUP}/${CHILD_CGROUP}
--> echo 100000000 > /sys/fs/cgroup/${PARENT_CGROUP}/${CHILD_CGROUP}/memory.max
cgexec -g memory,cpu:${PARENT_CGROUP}/${CHILD_CGROUP} sleep 200 &
--> sleep 200 & echo $! > /sys/fs/cgroup/${PARENT_CGROUP}/${CHILD_CGROUP}/cgroup.procs
cgclassify -g cpu,memory:${PARENT_CGROUP}/${CHILD_CGROUP} $$
--> echo $$ > /sys/fs/cgroup/${PARENT_CGROUP}/${CHILD_CGROUP}/cgroup.procs
systemd-cgls -a
--> find /sys/fs/cgroup -type f -name cgroup.procs -exec sh -c 'echo \\n"\e[1;31m$1\e[0m" :; output=$(tr "\n" " " < "$1"); if [ -n "$output" ]; then echo "$output" | xargs ps -o pid -o cmd -p; else echo "No processes found on this cgroup folder" ;fi' _ {} \;
systemd-cgtop
--> ps -eo pid,comm,cgroup,%cpu,%mem
cgdelete -r -g cpu:/${PARENT_CGROUP}
--> rmdir -p /sys/fs/cgroup/${PARENT_CGROUP}/${CHILD_CGROUP}
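To see that sequence of raw filesystem operations end to end without root (and without touching your real cgroup tree), here is a dry-run sketch that mimics the layout in a temporary directory. The kernel is not involved, so these are plain files rather than a real cgroup; it only demonstrates the order of the `mkdir`/`echo` calls:

```shell
# Dry run: mimic the cgroup v2 directory dance in a temp dir (no root needed)
root=$(mktemp -d)

mkdir "$root/containers"                                   # cgcreate (parent)
echo "+cpu +memory" > "$root/containers/cgroup.subtree_control"
mkdir "$root/containers/goofytimes"                        # cgcreate (child)
echo 100000000 > "$root/containers/goofytimes/memory.max"  # cgset

# On a real cgroup the kernel would round this down to the page size,
# as we saw earlier; here it's just a plain file, so it stays as written
value=$(cat "$root/containers/goofytimes/memory.max")
echo "$value"

rm -rf "$root"                                             # cgdelete -r
```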
This is a list of resources that were helpful in the creation of this guide:
Kubernetes On cgroup v2 - Giuseppe Scrivano, Red Hat (video)
Diving deeper into control groups (cgroups) v2 - Michael Kerrisk (video)
An introduction to control groups (cgroups) version 2 - Michael Kerrisk (video)
cgroupv2: Linux's new unified control group hierarchy - Chris Down (video)
https://chrisdown.name/talks/cgroupv2/cgroupv2-fosdem.pdf
7 years of cgroup v2: the future of Linux resource control - Chris Down (Video)
https://docs.kernel.org/admin-guide/cgroup-v2.html
https://medium.com/@charles.vissol/cgroup-introduction-45017140493d
https://medium.com/@charles.vissol/cgroup-v2-in-details-8c138088f9ba
https://medium.com/@charles.vissol/systemd-and-cgroup-7eb80a08234d
https://facebookmicrosites.github.io/cgroup2/docs/overview.html
https://kubernetes.io/docs/concepts/architecture/cgroups/
https://systemd.io/CGROUP_DELEGATION/
https://btholt.github.io/complete-intro-to-containers/
Fernando Villalba has over a decade of miscellaneous IT experience. He started in IT support ("Have you tried turning it on and off?"), veered to become a SysAdmin ("Don't you dare turn it off") and later segued into DevOps type of roles ("Destroy and replace!"). He has been a consultant for various multi-billion dollar organizations helping them achieve their highest potential with their DevOps processes.