Cgroups are a kernel feature that allows sysadmins and human beings (:P) to group processes/tasks in order to assign system resources like CPU, memory, IO, etc. in a more fine-grained way. Cgroups are arranged in hierarchies, and every process in the system belongs to exactly one cgroup in each hierarchy at any single point in time. In conjunction with Linux namespaces (next post :D), cgroups are a cornerstone of things like Docker containers.
So how do cgroups provide access/accounting control over CPU, memory, etc.?
There's another concept involved in cgroups: subsystems. A subsystem is the resource scheduler/controller in charge of enforcing the limits for a particular resource. Some of the most common subsystems are:
- cpuset: this subsystem lets you pin a group of tasks to a particular set of CPUs and memory nodes, which is particularly interesting on NUMA systems.
- memory: this subsystem, as you have probably guessed, allows you to control the amount of memory a group of tasks can use.
- freezer: this subsystem allows you to easily freeze a group of processes and eventually unfreeze them later on so they can continue running (see the quick sketch right after this list).
- blkio: yes, that's correct, this subsystem allows you to define IO limits for processes. The throttling policy works regardless of the IO scheduler, while the proportional-weight policy only works with the CFQ IO scheduler.
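To make the freezer bullet a bit more concrete, here's a minimal sketch of freezing and thawing a group of processes (not something I ran for this post; the mount point, cgroup name and PID are made up):

# mount a hierarchy with only the freezer controller (skip if the distro already provides one)
mkdir -p /sys/fs/cgroup/freezer
mount -t cgroup -o freezer freezer /sys/fs/cgroup/freezer

# create a child cgroup and move a hypothetical PID into it
mkdir /sys/fs/cgroup/freezer/frozen_stuff
echo 1234 > /sys/fs/cgroup/freezer/frozen_stuff/tasks

# freeze the whole group, then thaw it later
echo FROZEN > /sys/fs/cgroup/freezer/frozen_stuff/freezer.state
echo THAWED > /sys/fs/cgroup/freezer/frozen_stuff/freezer.state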
You can get a full list of the subsystems supported on your system from the /proc/cgroups file:
root@ubuntu-server:/etc# cat /proc/cgroups
#subsys_name hierarchy num_cgroups enabled
cpuset 0 1 1
cpu 0 1 1
cpuacct 0 1 1
memory 0 1 1
devices 0 1 1
freezer 0 1 1
net_cls 0 1 1
blkio 0 1 1
perf_event 0 1 1
net_prio 0 1 1
hugetlb 0 1 1
root@ubuntu-server:/etc#
this is what it looks like on an Ubuntu 14.04.4 VM.
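If you only care about which controllers are actually enabled, a quick one-liner over that same file does the trick (just a convenience; the column layout is the one shown above):

# print the name of every subsystem whose "enabled" column is 1
awk '$1 !~ /^#/ && $4 == 1 {print $1}' /proc/cgroups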
Ok, so far so good, but how can we access these so called cgroups and subsystems?
Fortunately the cgroups interface is accessible through its Virtual File System representation (there are also a couple of fancy tools available on RHEL systems). We can see, for example, the default cgroups setup that comes with Ubuntu 14.04.4:
root@ubuntu-server:/etc# mount |grep cgroup
none on /sys/fs/cgroup type tmpfs (rw)
systemd on /sys/fs/cgroup/systemd type cgroup (rw,noexec,nosuid,nodev,none,name=systemd)
root@ubuntu-server:/etc#
Note: there's a systemd "device" mounted on /sys/fs/cgroup/systemd and the file system type is cgroup. No subsystems are included in this cgroup hierarchy (it's a purely named hierarchy, hence the none,name=systemd options).
But in order to start from scratch I'll build a new cgroup hierarchy and ignore the one that comes by default.
I created a new directory under the same tmpfs:
root@ubuntu-server:/etc# mkdir /sys/fs/cgroup/MyHierarchy
root@ubuntu-server:/etc# ls /sys/fs/cgroup/
MyHierarchy systemd
root@ubuntu-server:/etc#
then mounted a cgroup hierarchy on the new directory:
root@ubuntu-server:/etc# mount -t cgroup MyHierarchy -o none,name=MyHierarchy /sys/fs/cgroup/MyHierarchy/
root@ubuntu-server:/etc# mount |grep cgroup
none on /sys/fs/cgroup type tmpfs (rw)
systemd on /sys/fs/cgroup/systemd type cgroup (rw,noexec,nosuid,nodev,none,name=systemd)
MyHierarchy on /sys/fs/cgroup/MyHierarchy type cgroup (rw,none,name=MyHierarchy)
root@ubuntu-server:/etc#
Note: -o none causes the mount not to include any subsystem in the hierarchy.
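For contrast, a hierarchy that does carry a controller is mounted by naming the subsystem in -o; a minimal sketch (not run in this session, the cpu_only mount point and device name are just illustrative; we'll do something similar for real further down):

# mount a hierarchy that includes only the cpu subsystem
mkdir -p /sys/fs/cgroup/cpu_only
mount -t cgroup -o cpu cpu_only /sys/fs/cgroup/cpu_only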
So, is that it? Well..., no xD, we've done nothing so far but create a cgroup hierarchy. Let's take a look at the files under it:
root@ubuntu-server:/etc# ls /sys/fs/cgroup/MyHierarchy/
cgroup.clone_children cgroup.procs cgroup.sane_behavior notify_on_release release_agent tasks
root@ubuntu-server:/etc#
these are the basic files of a cgroup. They describe, for example, which processes belong to it (tasks and cgroup.procs) and what command should be executed after the last task leaves the cgroup (notify_on_release, release_agent; there's a small sketch of that mechanism a bit further down). The most interesting one is the tasks file, which keeps the list of the processes that belong to the cgroup; remember that by default all the processes will be listed there since this is a root cgroup:
root@ubuntu-server:/etc# head /sys/fs/cgroup/MyHierarchy/tasks
1
2
3
5
7
8
9
10
11
12
root@ubuntu-server:/etc# wc -l /sys/fs/cgroup/MyHierarchy/tasks
132 /sys/fs/cgroup/MyHierarchy/tasks
root@ubuntu-server:/etc#
Note: IDs can show up more than once and not in order (the tasks file isn't guaranteed to be sorted or free of duplicates); however, a particular process will belong to a single cgroup within the same hierarchy.
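Since notify_on_release and release_agent came up above, here's a rough sketch of how that mechanism is wired up (the script path and the child cgroup are hypothetical; the kernel invokes the agent with the relative path of the emptied cgroup as its argument):

# the release agent can only be configured on the root cgroup of the hierarchy
echo /usr/local/bin/cleanup_cgroup.sh > /sys/fs/cgroup/MyHierarchy/release_agent

# hypothetical child cgroup; when its last task exits or is moved away,
# the kernel runs the agent for us
mkdir /sys/fs/cgroup/MyHierarchy/SomeChild
echo 1 > /sys/fs/cgroup/MyHierarchy/SomeChild/notify_on_release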
We can easily create a sub cgroup (child cgroup) by creating a folder under the root one:
root@ubuntu-server:/etc# mkdir /sys/fs/cgroup/MyHierarchy/SubCgroup1
root@ubuntu-server:/etc# ls /sys/fs/cgroup/MyHierarchy/SubCgroup1/
cgroup.clone_children cgroup.procs notify_on_release tasks
root@ubuntu-server:/etc#
this new cgroup again has similar files, but no tasks are associated with it by default:
root@ubuntu-server:/etc# cat /sys/fs/cgroup/MyHierarchy/SubCgroup1/tasks
root@ubuntu-server:/etc#
We can easily move a task to this new cgroup by just writing the task PID into the tasks file:
root@ubuntu-server:/etc# echo $$
4826
root@ubuntu-server:/etc# ps aux|grep 4826
root 4826 0.0 0.7 21332 3952 pts/0 S Apr02 0:00 bash
root 17971 0.0 0.4 11748 2220 pts/0 S+ 04:30 0:00 grep --color=auto 4826
root@ubuntu-server:/etc# echo $$ > /sys/fs/cgroup/MyHierarchy/SubCgroup1/tasks
root@ubuntu-server:/etc# cat /sys/fs/cgroup/MyHierarchy/SubCgroup1/tasks
4826
17979
root@ubuntu-server:/etc#
in the previous example I moved the root bash process to SubCgroup1, and you can see its PID inside the tasks file. But there's another PID there as well, why? That PID belongs to the cat command: any forked process is assigned to the same cgroup its parent belongs to. We can also confirm PID 4826 doesn't belong to the root cgroup anymore:
root@ubuntu-server:/etc# grep 4826 /sys/fs/cgroup/MyHierarchy/tasks
root@ubuntu-server:/etc#
What if we want to know which cgroup a particular process belongs to?
We can easily find that out from our lovely /proc:
root@ubuntu-server:/etc# cat /proc/4826/cgroup
3:name=MyHierarchy:/SubCgroup1
1:name=systemd:/user/1000.user/1.session
root@ubuntu-server:/etc#
Note: see how our shell belongs to two different cgroups; this is possible because they are part of different hierarchies.
Ok, but by themselves cgroups just segregate tasks into groups; they become really powerful when combined with the subsystems.
Testing a few Subsystems:
I rolled back the hierarchy I mounted before using umount, as with any other file system, so we are back to square one:
root@ubuntu-server:/etc# umount /sys/fs/cgroup/MyHierarchy/
root@ubuntu-server:/etc# mount |grep cgroup
none on /sys/fs/cgroup type tmpfs (rw)
systemd on /sys/fs/cgroup/systemd type cgroup (rw,noexec,nosuid,nodev,none,name=systemd)
root@ubuntu-server:/etc#
I mounted the hierarchy again, but this time enabling cpu, memory and blkio subsystems, like this:
root@ubuntu-server:/etc# mount -t cgroup MyHierarchy -o cpu,memory,blkio,name=MyHierarchy /sys/fs/cgroup/MyHierarchy
mount: MyHierarchy already mounted or /sys/fs/cgroup/MyHierarchy busy
root@ubuntu-server:/etc#
WTF??? according to the error the mount point is busy or still mounted... well it turns out that
When a cgroup filesystem is unmounted, if there are any child cgroups created
below the top-level cgroup, that hierarchy will remain active even though
unmounted; if there are no child cgroups then the hierarchy will be deactivated.
so, I have to either move process 4826 and its children back to the root cgroup or kill 'em all!!! Of course killing them was the easy way out (a gentler alternative is sketched after the mount output below). With that done, voilà:
root@ubuntu-server:/home/juan# mount -v -t cgroup MyHierarchy -o cpu,memory,blkio,name=MyHierarchy /sys/fs/cgroup/MyHierarchy
MyHierarchy on /sys/fs/cgroup/MyHierarchy type cgroup (rw,cpu,memory,blkio,name=MyHierarchy)
root@ubuntu-server:/home/juan#
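For the record, the gentler way out would have been to drain and remove the child cgroup while the hierarchy was still mounted (or after remounting it with the original options); roughly, and untested here:

# move every remaining task back to the root cgroup of the hierarchy
for pid in $(cat /sys/fs/cgroup/MyHierarchy/SubCgroup1/tasks); do
    echo "$pid" > /sys/fs/cgroup/MyHierarchy/tasks
done

# an empty child cgroup can simply be removed with rmdir
rmdir /sys/fs/cgroup/MyHierarchy/SubCgroup1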
now the hierarchy includes three subsystems: cpu, memory and blkio. Let's see what it looks like now:
root@ubuntu-server:/home/juan# ls /sys/fs/cgroup/MyHierarchy/
blkio.io_merged blkio.sectors_recursive cpu.cfs_quota_us memory.move_charge_at_immigrate
blkio.io_merged_recursive blkio.throttle.io_service_bytes cpu.shares memory.numa_stat
blkio.io_queued blkio.throttle.io_serviced cpu.stat memory.oom_control
blkio.io_queued_recursive blkio.throttle.read_bps_device memory.failcnt memory.pressure_level
blkio.io_service_bytes blkio.throttle.read_iops_device memory.force_empty memory.soft_limit_in_bytes
blkio.io_service_bytes_recursive blkio.throttle.write_bps_device memory.kmem.failcnt memory.stat
blkio.io_serviced blkio.throttle.write_iops_device memory.kmem.limit_in_bytes memory.swappiness
blkio.io_serviced_recursive blkio.time memory.kmem.max_usage_in_bytes memory.usage_in_bytes
blkio.io_service_time blkio.time_recursive memory.kmem.slabinfo memory.use_hierarchy
blkio.io_service_time_recursive blkio.weight memory.kmem.tcp.failcnt notify_on_release
blkio.io_wait_time blkio.weight_device memory.kmem.tcp.limit_in_bytes release_agent
blkio.io_wait_time_recursive cgroup.clone_children memory.kmem.tcp.max_usage_in_bytes SubCgroup1
blkio.leaf_weight cgroup.event_control memory.kmem.tcp.usage_in_bytes tasks
blkio.leaf_weight_device cgroup.procs memory.kmem.usage_in_bytes
blkio.reset_stats cgroup.sane_behavior memory.limit_in_bytes
blkio.sectors cpu.cfs_period_us memory.max_usage_in_bytes
root@ubuntu-server:/home/juan#
yeah... many files, right? We should find a few files per active subsystem:
root@ubuntu-server:/home/juan# ls /sys/fs/cgroup/MyHierarchy/|grep -c "cpu\."
4
root@ubuntu-server:/home/juan# ls /sys/fs/cgroup/MyHierarchy/|grep -c "memory\."
22
root@ubuntu-server:/home/juan# ls /sys/fs/cgroup/MyHierarchy/|grep -c "blkio\."
27
root@ubuntu-server:/home/juan#
These are the files we can tune, and of course I'm not going to explain all of them (not that I know them all, to be honest xD). Something interesting is that some parameters can't be applied to the root cgroup, for quite obvious reasons: remember that by default all the processes belong to the root cgroup of the hierarchy, and you certainly don't want to throttle some critical system process.
So, for testing purposes, I'll set up a second child cgroup called SubCgroup2 (another mkdir under MyHierarchy, just like before) and I will set two different throttle limits on write operations to the root volume /dev/sda. First things first, I need to identify the major and minor numbers of the device:
root@ubuntu-server:/home/juan# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 5G 0 disk
├─sda1 8:1 0 4.5G 0 part /
├─sda2 8:2 0 1K 0 part
└─sda5 8:5 0 510M 0 part
sr0 11:0 1 1024M 0 rom
root@ubuntu-server:/home/juan#
Ok, 8 and 0 should do the trick here. Now we set the throttle using the file called blkio.throttle.write_bps_device, like this:
root@ubuntu-server:/home/juan# echo "8:0 10240000" > /sys/fs/cgroup/MyHierarchy/SubCgroup1/blkio.throttle.write_bps_device
root@ubuntu-server:/home/juan# echo "8:0 20480000" > /sys/fs/cgroup/MyHierarchy/SubCgroup2/blkio.throttle.write_bps_device
root@ubuntu-server:/home/juan#
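These files can also be read back to check what's currently configured and, as far as I know, writing a 0 rate for a device removes its limit; a quick sketch:

# should echo back the per-device limits, e.g. "8:0 10240000"
cat /sys/fs/cgroup/MyHierarchy/SubCgroup1/blkio.throttle.write_bps_device

# remove the limit for 8:0 by setting its rate to 0 (don't run this before the test below!)
# echo "8:0 0" > /sys/fs/cgroup/MyHierarchy/SubCgroup1/blkio.throttle.write_bps_device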
see that I've throttled the processes under SubCgroup1 to 10240000 bytes per second and the ones under SubCgroup2 to 20480000 bytes per second. In order to test this I've opened two new shells:
root@ubuntu-server:/home/juan# ps aux|grep bash
juan 1440 0.0 1.0 22456 5152 pts/0 Ss 22:28 0:00 -bash
root 2389 0.0 0.7 21244 3984 pts/0 S 22:35 0:00 bash
juan 6897 0.0 0.9 22456 4996 pts/2 Ss+ 23:49 0:00 -bash
juan 6961 0.1 1.0 22456 5076 pts/3 Ss+ 23:49 0:00 -bash
root 7276 0.0 0.4 11748 2132 pts/0 S+ 23:50 0:00 grep --color=auto bash
root@ubuntu-server:/home/juan#
and pushed their PIDs to the cgroups:
root@ubuntu-server:/home/juan# echo 6897 > /sys/fs/cgroup/MyHierarchy/SubCgroup1/tasks
root@ubuntu-server:/home/juan# echo 6961 > /sys/fs/cgroup/MyHierarchy/SubCgroup2/tasks
root@ubuntu-server:/home/juan#
now let's see what happens when doing some intensive writes to the drive:
- Shell under SubCgroup1:
6897
juan@ubuntu-server:~$ dd oflag=dsync if=/dev/zero of=test bs=1M count=512
512+0 records in
512+0 records out
536870912 bytes (537 MB) copied, 52.6549 s, 10.2 MB/s
juan@ubuntu-server:~$
- Shell under SubCgroup2:
6961
juan@ubuntu-server:~$ dd oflag=dsync if=/dev/zero of=test bs=1M count=512
512+0 records in
512+0 records out
536870912 bytes (537 MB) copied, 26.3397 s, 20.4 MB/s
juan@ubuntu-server:~$
- Shell under the root cgroup (no throttling here :D):
512+0 records in
512+0 records out
536870912 bytes (537 MB) copied, 3.98033 s, 135 MB/s
juan@ubuntu-server:~$
Cool, isn't it? The throttle worked fine: dd under shell 6897 was capped at ~10 MB/s while the one under shell 6961 got ~20 MB/s, which is exactly what you'd expect (536870912 bytes / 10240000 B/s ≈ 52.4 s and 536870912 / 20480000 ≈ 26.2 s, matching the times reported above). But what if two processes under the same cgroup try to write at the same time, how does the throttle work then?
juan@ubuntu-server:~$ echo $$
6897
juan@ubuntu-server:~$ dd oflag=dsync if=/dev/zero of=test bs=1M count=512 & dd oflag=dsync if=/dev/zero of=test1 bs=1M count=512 &
[1] 8265
[2] 8266
juan@ubuntu-server:~$ 512+0 records in
512+0 records out
536870912 bytes (537 MB) copied, 105.142 s, 5.1 MB/s
512+0 records in
512+0 records out
536870912 bytes (537 MB) copied, 105.24 s, 5.1 MB/s
[1]- Done dd oflag=dsync if=/dev/zero of=test bs=1M count=512
[2]+ Done dd oflag=dsync if=/dev/zero of=test1 bs=1M count=512
juan@ubuntu-server:~$
the throughput limit is shared among the processes under the same cgroup, which makes perfect sense considering the limit is applied to the group as a whole. There are tons of other blkio parameters to play with, like weights, sectors, service time, etc. So be brave and have fun with them :P.
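As a taste of the proportional-weight side (which, unlike the throttling above, does depend on the CFQ IO scheduler), here's a rough sketch; the weight values are just illustrative, and a higher weight means a bigger relative share of the disk time when cgroups compete:

# give SubCgroup1 four times the IO weight of SubCgroup2
echo 800 > /sys/fs/cgroup/MyHierarchy/SubCgroup1/blkio.weight
echo 200 > /sys/fs/cgroup/MyHierarchy/SubCgroup2/blkio.weight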
Last but not least, let's take a look at the cpu subsystem. This subsystem allows you to put limits on CPU utilization; here's a list of the files you can use to tune it:
root@ubuntu-server:/home/juan# ls /sys/fs/cgroup/MyHierarchy/cpu.*
/sys/fs/cgroup/MyHierarchy/cpu.cfs_period_us
/sys/fs/cgroup/MyHierarchy/cpu.cfs_quota_us
/sys/fs/cgroup/MyHierarchy/cpu.shares
/sys/fs/cgroup/MyHierarchy/cpu.stat
root@ubuntu-server:/home/juan#
for the sake of me going to bed early I will only test the cpu.shares feature. This value defines a relative weight that the processes under a cgroup get compared to processes in other cgroups, which directly impacts the amount of CPU time they receive (it only matters when there's actual contention for the CPU). For example, let's take the default value of the root cgroup:
root@ubuntu-server:/home/juan# cat /sys/fs/cgroup/MyHierarchy/cpu.shares
1024
root@ubuntu-server:/home/juan#
this means all the processes in this cgroup have that particular weight, so if we set the following weights on SubCgroup1 and SubCgroup2:
root@ubuntu-server:/home/juan# echo 512 > /sys/fs/cgroup/MyHierarchy/SubCgroup1/cpu.shares
root@ubuntu-server:/home/juan# echo 256 > /sys/fs/cgroup/MyHierarchy/SubCgroup2/cpu.shares
root@ubuntu-server:/home/juan#
what we mean is that processes under SubCgroup1 will get half the CPU time of processes in the root cgroup and twice that of processes under SubCgroup2. This is easy to see in the following ps output:
root@ubuntu-server:/home/juan# ps aux --sort=-pcpu|head
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
juan 23289 58.0 0.5 8264 2616 pts/2 R 04:02 0:11 dd if=/dev/zero of=/dev/null bs=1M count=512000000
juan 23288 32.4 0.5 8264 2616 pts/5 R 04:02 0:06 dd if=/dev/zero of=/dev/null bs=1M count=512000000
juan 23290 14.5 0.5 8264 2680 pts/0 R 04:02 0:02 dd if=/dev/zero of=/dev/null bs=1M count=512000000
root 1 0.0 0.5 33492 2772 ? Ss Apr03 0:00 /sbin/init
root 2 0.0 0.0 0 0 ? S Apr03 0:00 [kthreadd]
root 3 0.0 0.0 0 0 ? S Apr03 0:00 [ksoftirqd/0]
root 5 0.0 0.0 0 0 ? S< Apr03 0:00 [kworker/0:0H]
root 6 0.0 0.0 0 0 ? S Apr03 0:02 [kworker/u2:0]
root 7 0.0 0.0 0 0 ? S Apr03 0:01 [rcu_sched]
root@ubuntu-server:/home/juan# cat /proc/23288/cgroup
4:cpu,memory,blkio,name=MyHierarchy:/SubCgroup1
1:name=systemd:/user/1000.user/5.session
root@ubuntu-server:/home/juan# cat /proc/23289/cgroup
4:cpu,memory,blkio,name=MyHierarchy:/
1:name=systemd:/user/1000.user/7.session
root@ubuntu-server:/home/juan# cat /proc/23290/cgroup
4:cpu,memory,blkio,name=MyHierarchy:/SubCgroup2
1:name=systemd:/user/1000.user/6.session
root@ubuntu-server:/home/juan#
We can see from the %CPU column how the process under the root cgroup (23289) is using 58% of the CPU, while the process under SubCgroup1 (23288) is using 32.4% and the one in SubCgroup2 (23290) 14.5%, roughly the 4:2:1 split you'd expect from shares of 1024, 512 and 256 on a single CPU.
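Before wrapping up: cpu.cfs_period_us and cpu.cfs_quota_us, which showed up in the listing above but I didn't get to test, let you set a hard cap instead of a relative weight; a minimal sketch (the values are just an example):

# allow the tasks in SubCgroup1 to run for at most 50ms out of every 100ms
# period, i.e. roughly half of one CPU, no matter how idle the system is
echo 100000 > /sys/fs/cgroup/MyHierarchy/SubCgroup1/cpu.cfs_period_us
echo 50000 > /sys/fs/cgroup/MyHierarchy/SubCgroup1/cpu.cfs_quota_us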