Monday, April 4, 2016

cgroups 101 - keep your processes under control!

A few days ago, while reading a bit about systemd and all its beauty, I came across Control Groups, aka cgroups. I had read about them before, but I had never had the chance to play around with them face to face.

Cgroups are a kernel feature that allows sysadmins and human beings (:P) to group processes/tasks in order to assign system resources like CPU, memory, IO, etc. in a more fine-grained way. Cgroups can be arranged hierarchically, and within a given hierarchy every process in the system belongs to exactly one cgroup at any point in time. In conjunction with Linux namespaces (next post :D), cgroups are a cornerstone for things like Docker containers.
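
If you want a quick peek right away, every process already exposes its cgroup memberships through a file under /proc (we'll come back to this file later in the post):

cat /proc/self/cgroup    # one line per hierarchy: hierarchy-id:subsystems-or-name:cgroup-path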

So how do cgroups provide access/accounting control over CPU, memory, etc.?

 

There's another concept involved in cgroups: subsystems. A subsystem is the resource scheduler/controller in charge of setting the limits for a particular resource. Some of the most common subsystems are:

  • cpuset: this subsystem lets you pin a group of tasks to particular CPUs and memory nodes, which is particularly interesting on NUMA systems.
  • memory: this subsystem, as you have probably guessed, lets you control the amount of memory a group of tasks can use.
  • freezer: this subsystem lets you easily freeze a group of processes and eventually unfreeze them later so they can continue running (there's a short sketch right after this list).
  • blkio: yes, that's correct, this subsystem lets you put IO limits on processes; its proportional-weight policy relies on the CFQ IO scheduler, while the throttling policy works independently of it.
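
Just to make the freezer one a bit more concrete, here is a minimal sketch of how it is typically used (assuming a hierarchy with the freezer subsystem mounted at /sys/fs/cgroup/freezer and a child cgroup called MyGroup that already holds some tasks; both names are just examples):

# stop scheduling every task in the cgroup
echo FROZEN > /sys/fs/cgroup/freezer/MyGroup/freezer.state
# ...and later let them run again
echo THAWED > /sys/fs/cgroup/freezer/MyGroup/freezer.state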

You can get the full list of subsystems supported on your system from the /proc/cgroups file:

root@ubuntu-server:/etc# cat /proc/cgroups
#subsys_name    hierarchy       num_cgroups     enabled
cpuset  0       1       1
cpu     0       1       1
cpuacct 0       1       1
memory  0       1       1
devices 0       1       1
freezer 0       1       1
net_cls 0       1       1
blkio   0       1       1
perf_event      0       1       1
net_prio        0       1       1
hugetlb 0       1       1
root@ubuntu-server:/etc#


this is what it looks like on an Ubuntu 14.04.4 VM.

Ok, so far so good, but how can we access these so-called cgroups and subsystems?

 

Fortunately, the cgroups interface is accessible through its virtual file system representation (there are also a couple of fancy tools available on RHEL systems). We can see, for example, the default cgroup setup that comes with Ubuntu 14.04.4:

root@ubuntu-server:/etc# mount |grep cgroup
none on /sys/fs/cgroup type tmpfs (rw)
systemd on /sys/fs/cgroup/systemd type cgroup (rw,noexec,nosuid,nodev,none,name=systemd)
root@ubuntu-server:/etc#


Note: there's a systemd "device" mounted on /sys/fs/cgroup/systemd and the file system type is cgroup. No subsystems are included in this cgroup hierarchy.

But in order to start from scratch, I'll build a new cgroup hierarchy and ignore the one that comes by default.

I created a new directory under the same tmpfs:

root@ubuntu-server:/etc# mkdir /sys/fs/cgroup/MyHierarchy
root@ubuntu-server:/etc# ls /sys/fs/cgroup/
MyHierarchy  systemd
root@ubuntu-server:/etc#


then mounted a cgroup hierarchy on the new directory:

root@ubuntu-server:/etc# mount -t cgroup MyHierarchy -o none,name=MyHierarchy /sys/fs/cgroup/MyHierarchy/
root@ubuntu-server:/etc# mount |grep cgroup
none on /sys/fs/cgroup type tmpfs (rw)
systemd on /sys/fs/cgroup/systemd type cgroup (rw,noexec,nosuid,nodev,none,name=systemd)
MyHierarchy on /sys/fs/cgroup/MyHierarchy type cgroup (rw,none,name=MyHierarchy)
root@ubuntu-server:/etc#


Note: -o none causes the mount not to include any subsystem in the hierarchy.

So, is that it? Well... no xD, we've done nothing so far but create a cgroup hierarchy. Let's take a look at the files under it:

root@ubuntu-server:/etc# ls /sys/fs/cgroup/MyHierarchy/
cgroup.clone_children  cgroup.procs  cgroup.sane_behavior  notify_on_release  release_agent  tasks
root@ubuntu-server:/etc#


These are the basic files for a cgroup. They describe, for example, which processes belong to it (tasks and cgroup.procs) and what command should be executed after the last task leaves the cgroup (notify_on_release, release_agent). The most interesting one is the tasks file, which keeps the list of processes that belong to the cgroup; remember that by default all processes are listed there, since this is a root cgroup:

root@ubuntu-server:/etc# head /sys/fs/cgroup/MyHierarchy/tasks
1
2
3
5
7
8
9
10
11
12
root@ubuntu-server:/etc# wc -l /sys/fs/cgroup/MyHierarchy/tasks
132 /sys/fs/cgroup/MyHierarchy/tasks
root@ubuntu-server:/etc#


Note: process IDs can show up more than once and not necessarily in order; however, a particular process belongs to a single cgroup within the same hierarchy.
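
By the way, this is roughly how the notify_on_release / release_agent pair mentioned above is used (just a sketch; the cleanup script path is a made-up example):

# enable the empty-cgroup notification (children created afterwards inherit the setting)
echo 1 > /sys/fs/cgroup/MyHierarchy/notify_on_release
# the kernel runs this program with the relative path of the emptied cgroup as its argument
echo /usr/local/sbin/cgroup-cleanup > /sys/fs/cgroup/MyHierarchy/release_agent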

We can easily create a sub cgroup (child cgroup) by creating a folder under the root one:

root@ubuntu-server:/etc# mkdir /sys/fs/cgroup/MyHierarchy/SubCgroup1
root@ubuntu-server:/etc# ls /sys/fs/cgroup/MyHierarchy/SubCgroup1/
cgroup.clone_children  cgroup.procs  notify_on_release  tasks
root@ubuntu-server:/etc#


This new cgroup again has similar files, but no tasks are associated with it by default:

root@ubuntu-server:/etc# cat /sys/fs/cgroup/MyHierarchy/SubCgroup1/tasks
root@ubuntu-server:/etc#


We can easily move a task to this new cgroup by just writing the task PID into the tasks file:

root@ubuntu-server:/etc# echo $$
4826
root@ubuntu-server:/etc# ps aux|grep 4826
root      4826  0.0  0.7  21332  3952 pts/0    S    Apr02   0:00 bash
root     17971  0.0  0.4  11748  2220 pts/0    S+   04:30   0:00 grep --color=auto 4826
root@ubuntu-server:/etc# echo $$ > /sys/fs/cgroup/MyHierarchy/SubCgroup1/tasks
root@ubuntu-server:/etc# cat /sys/fs/cgroup/MyHierarchy/SubCgroup1/tasks
4826
17979
root@ubuntu-server:/etc#


In the previous example I moved the root bash process to SubCgroup1, and you can see its PID inside the tasks file. But there's another PID there as well, why? That PID belongs to the cat command: any forked process is assigned to the same cgroup its parent belongs to. We can also confirm that PID 4826 doesn't belong to the root cgroup anymore:

root@ubuntu-server:/etc# grep 4826 /sys/fs/cgroup/MyHierarchy/tasks
root@ubuntu-server:/etc#
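
A small but useful detail here: the tasks file operates on individual thread IDs, while cgroup.procs moves a whole process (all of its threads) in one shot. Roughly, with placeholder IDs:

# move a single thread into the cgroup
echo <tid> > /sys/fs/cgroup/MyHierarchy/SubCgroup1/tasks
# move an entire process, threads included
echo <pid> > /sys/fs/cgroup/MyHierarchy/SubCgroup1/cgroup.procs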

 

What if we want to know which cgroup a particular process belongs to?

 

We can easily find that out from our lovely /proc:

root@ubuntu-server:/etc# cat /proc/4826/cgroup
3:name=MyHierarchy:/SubCgroup1
1:name=systemd:/user/1000.user/1.session
root@ubuntu-server:/etc#


Note: see how our shell belongs to two different cgroups; this is possible because they belong to different hierarchies.

Ok, by themselves cgroups just segregate tasks into groups; they only become really powerful when combined with the subsystems.

Testing a few Subsystems:

 

I rolled back the hierarchy I mounted before using umount, just as with any other file system, so we are back to square one:

root@ubuntu-server:/etc# umount /sys/fs/cgroup/MyHierarchy/
root@ubuntu-server:/etc# mount |grep cgroup
none on /sys/fs/cgroup type tmpfs (rw)
systemd on /sys/fs/cgroup/systemd type cgroup (rw,noexec,nosuid,nodev,none,name=systemd)
root@ubuntu-server:/etc#


I mounted the hierarchy again, but this time enabling cpu, memory and blkio subsystems, like this:

root@ubuntu-server:/etc# mount -t cgroup MyHierarchy -o cpu,memory,blkio,name=MyHierarchy /sys/fs/cgroup/MyHierarchy
mount: MyHierarchy already mounted or /sys/fs/cgroup/MyHierarchy busy
root@ubuntu-server:/etc# 


WTF??? According to the error, the mount point is busy or still mounted... well, it turns out that

When a cgroup filesystem is unmounted, if there are any child cgroups created 
below the top-level cgroup, that hierarchy will remain active even though 
unmounted; if there are no child cgroups then the hierarchy will be deactivated.

So, I had to either move process 4826 and its children back to the root cgroup or kill 'em all!!! Of course, killing them was the easy way out.
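
For the record, the gentler route (moving the tasks instead of killing them) would have looked roughly like this; just a sketch, assuming the named hierarchy is mounted again with its original options first:

# bring the old named hierarchy back so its cgroups are reachable
mount -t cgroup MyHierarchy -o none,name=MyHierarchy /sys/fs/cgroup/MyHierarchy
# push every task in the child cgroup back to the root of the hierarchy
for pid in $(cat /sys/fs/cgroup/MyHierarchy/SubCgroup1/tasks); do
    echo "$pid" > /sys/fs/cgroup/MyHierarchy/tasks
done
umount /sys/fs/cgroup/MyHierarchy

Anyway, with SubCgroup1 emptied (killed tasks in my case), the mount with the subsystems finally works, voilà: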

root@ubuntu-server:/home/juan# mount -v -t cgroup MyHierarchy -o cpu,memory,blkio,name=MyHierarchy /sys/fs/cgroup/MyHierarchy
MyHierarchy on /sys/fs/cgroup/MyHierarchy type cgroup (rw,cpu,memory,blkio,name=MyHierarchy)
root@ubuntu-server:/home/juan#


Now the hierarchy includes three subsystems: cpu, memory and blkio. Let's see what it looks like now:

root@ubuntu-server:/home/juan# ls /sys/fs/cgroup/MyHierarchy/
blkio.io_merged                   blkio.sectors_recursive           cpu.cfs_quota_us                    memory.move_charge_at_immigrate
blkio.io_merged_recursive         blkio.throttle.io_service_bytes   cpu.shares                          memory.numa_stat
blkio.io_queued                   blkio.throttle.io_serviced        cpu.stat                            memory.oom_control
blkio.io_queued_recursive         blkio.throttle.read_bps_device    memory.failcnt                      memory.pressure_level
blkio.io_service_bytes            blkio.throttle.read_iops_device   memory.force_empty                  memory.soft_limit_in_bytes
blkio.io_service_bytes_recursive  blkio.throttle.write_bps_device   memory.kmem.failcnt                 memory.stat
blkio.io_serviced                 blkio.throttle.write_iops_device  memory.kmem.limit_in_bytes          memory.swappiness
blkio.io_serviced_recursive       blkio.time                        memory.kmem.max_usage_in_bytes      memory.usage_in_bytes
blkio.io_service_time             blkio.time_recursive              memory.kmem.slabinfo                memory.use_hierarchy
blkio.io_service_time_recursive   blkio.weight                      memory.kmem.tcp.failcnt             notify_on_release
blkio.io_wait_time                blkio.weight_device               memory.kmem.tcp.limit_in_bytes      release_agent
blkio.io_wait_time_recursive      cgroup.clone_children             memory.kmem.tcp.max_usage_in_bytes  SubCgroup1
blkio.leaf_weight                 cgroup.event_control              memory.kmem.tcp.usage_in_bytes      tasks
blkio.leaf_weight_device          cgroup.procs                      memory.kmem.usage_in_bytes
blkio.reset_stats                 cgroup.sane_behavior              memory.limit_in_bytes
blkio.sectors                     cpu.cfs_period_us                 memory.max_usage_in_bytes
root@ubuntu-server:/home/juan#


Yeah... lots of files, right? Each active subsystem contributes its own set of files:

root@ubuntu-server:/home/juan# ls /sys/fs/cgroup/MyHierarchy/|grep -c "cpu\."
4
root@ubuntu-server:/home/juan# ls /sys/fs/cgroup/MyHierarchy/|grep -c "memory\."
22
root@ubuntu-server:/home/juan# ls /sys/fs/cgroup/MyHierarchy/|grep -c "blkio\."
27
root@ubuntu-server:/home/juan#


These files are the ones we can tune, and of course I'm not going to explain all of them (not that I know them all, to be honest xD). Something interesting is that some parameters can't be applied to the root cgroup, for quite obvious reasons: remember that by default all processes belong to the root cgroup of the hierarchy, and you certainly don't want to throttle some critical system processes.

So, for testing purposes, I'll set up a second child cgroup called SubCgroup2 and set two different throttle limits on write operations to the root volume /dev/sda.
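
Creating the second child cgroup works exactly like the first one; it didn't make it into the captured output, but it's nothing more than another directory under the hierarchy:

mkdir /sys/fs/cgroup/MyHierarchy/SubCgroup2

First things first, I need to identify the major and minor numbers of the device: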

root@ubuntu-server:/home/juan# lsblk
NAME   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda      8:0    0     5G  0 disk
├─sda1   8:1    0   4.5G  0 part /
├─sda2   8:2    0     1K  0 part
└─sda5   8:5    0   510M  0 part
sr0     11:0    1  1024M  0 rom
root@ubuntu-server:/home/juan#

Ok, 8 and 0 should do the trick here. Now we set the throttle using the file called blkio.throttle.write_bps_device, like this:

root@ubuntu-server:/home/juan# echo "8:0 10240000" > /sys/fs/cgroup/MyHierarchy/SubCgroup1/blkio.throttle.write_bps_device
root@ubuntu-server:/home/juan# echo "8:0 20480000" > /sys/fs/cgroup/MyHierarchy/SubCgroup2/blkio.throttle.write_bps_device
root@ubuntu-server:/home/juan#


Note that I've throttled the processes under SubCgroup1 to 10240000 bytes per second (roughly 10 MB/s) and the processes under SubCgroup2 to 20480000 bytes per second (roughly 20 MB/s). In order to test this I opened two new shells:

root@ubuntu-server:/home/juan# ps aux|grep bash
juan      1440  0.0  1.0  22456  5152 pts/0    Ss   22:28   0:00 -bash
root      2389  0.0  0.7  21244  3984 pts/0    S    22:35   0:00 bash
juan      6897  0.0  0.9  22456  4996 pts/2    Ss+  23:49   0:00 -bash
juan      6961  0.1  1.0  22456  5076 pts/3    Ss+  23:49   0:00 -bash

root      7276  0.0  0.4  11748  2132 pts/0    S+   23:50   0:00 grep --color=auto bash
root@ubuntu-server:/home/juan#


and pushed their PIDs to the cgroups:

root@ubuntu-server:/home/juan# echo 6897 > /sys/fs/cgroup/MyHierarchy/SubCgroup1/tasks
root@ubuntu-server:/home/juan# echo 6961 > /sys/fs/cgroup/MyHierarchy/SubCgroup2/tasks
root@ubuntu-server:/home/juan#


Now let's see what happens when doing some intensive writes to the drive:
  • Shell under SubCgroup1:
juan@ubuntu-server:~$ echo $$
6897

juan@ubuntu-server:~$ dd oflag=dsync if=/dev/zero of=test bs=1M count=512
512+0 records in
512+0 records out
536870912 bytes (537 MB) copied, 52.6549 s, 10.2 MB/s
juan@ubuntu-server:~$

  • Shell under SubCgroup2:
juan@ubuntu-server:~$ echo $$
6961
juan@ubuntu-server:~$ dd oflag=dsync if=/dev/zero of=test bs=1M count=512
512+0 records in
512+0 records out
536870912 bytes (537 MB) copied, 26.3397 s, 20.4 MB/s
juan@ubuntu-server:~$

  •  Shell under the root cgroup (no throttling here :D):
juan@ubuntu-server:~$ dd oflag=dsync if=/dev/zero of=test bs=1M count=512
512+0 records in
512+0 records out
536870912 bytes (537 MB) copied, 3.98033 s, 135 MB/s
juan@ubuntu-server:~$


Cool, isn't it? The throttling worked fine: dd under shell 6897 was throttled at 10 MB/s while the one under shell 6961 was throttled at 20 MB/s. But what if two processes under the same cgroup try to write at the same time? How does the throttling work then?

juan@ubuntu-server:~$ echo $$
6897

juan@ubuntu-server:~$ dd oflag=dsync if=/dev/zero of=test bs=1M count=512 & dd oflag=dsync if=/dev/zero of=test1 bs=1M count=512 &
[1] 8265
[2] 8266
juan@ubuntu-server:~$ 512+0 records in
512+0 records out
536870912 bytes (537 MB) copied, 105.142 s, 5.1 MB/s
512+0 records in
512+0 records out
536870912 bytes (537 MB) copied, 105.24 s, 5.1 MB/s

[1]-  Done                    dd oflag=dsync if=/dev/zero of=test bs=1M count=512
[2]+  Done                    dd oflag=dsync if=/dev/zero of=test1 bs=1M count=512
juan@ubuntu-server:~$


The throughput limit is shared among the processes under the same cgroup, which makes perfect sense considering the limit is applied to the group as a whole. There are tons of other parameters to play with in the blkio subsystem, like weights, sectors, service time, etc., so be brave and have fun with them :P.
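
If you want to try the proportional-weight side of blkio (which, as mentioned earlier, relies on the CFQ scheduler), the knob is the blkio.weight file from the listing above. A minimal sketch, keeping in mind that weights are relative values roughly in the 10-1000 range:

# when both cgroups compete for the disk, SubCgroup1 gets about four times the IO bandwidth of SubCgroup2
echo 800 > /sys/fs/cgroup/MyHierarchy/SubCgroup1/blkio.weight
echo 200 > /sys/fs/cgroup/MyHierarchy/SubCgroup2/blkio.weight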

Last but not least, let's take a look at the cpu subsystem. This subsystem allows you to put limits on CPU utilization; here's the list of files you can use to tune it:

root@ubuntu-server:/home/juan# ls /sys/fs/cgroup/MyHierarchy/cpu.*
/sys/fs/cgroup/MyHierarchy/cpu.cfs_period_us
/sys/fs/cgroup/MyHierarchy/cpu.cfs_quota_us
/sys/fs/cgroup/MyHierarchy/cpu.shares
/sys/fs/cgroup/MyHierarchy/cpu.stat
root@ubuntu-server:/home/juan#


For the sake of me going to bed early I will only test the cpu.shares feature. The share value defines a relative weight that the processes under a cgroup have compared to processes in other cgroups, and this directly impacts the amount of CPU time they can get. For example, let's take the default value for the root cgroup:

root@ubuntu-server:/home/juan# cat /sys/fs/cgroup/MyHierarchy/cpu.shares
1024
root@ubuntu-server:/home/juan#


this means all the processes in this cgroup have that particular weight, so if we set the following weights on SubCgroup1 and SubCgroup2:

root@ubuntu-server:/home/juan# echo 512 > /sys/fs/cgroup/MyHierarchy/SubCgroup1/cpu.shares
root@ubuntu-server:/home/juan# echo 256 > /sys/fs/cgroup/MyHierarchy/SubCgroup2/cpu.shares
root@ubuntu-server:/home/juan#

What we mean is that processes under SubCgroup1 will get half the CPU time of processes in the root cgroup, and twice that of processes under SubCgroup2.
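
To make the contention visible I started one CPU-bound dd from a shell sitting in each of the three cgroups (root, SubCgroup1 and SubCgroup2), basically:

dd if=/dev/zero of=/dev/null bs=1M count=512000000

The effect is easy to see in the following ps output: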

root@ubuntu-server:/home/juan# ps aux --sort=-pcpu|head
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
juan     23289 58.0  0.5   8264  2616 pts/2    R    04:02   0:11 dd if=/dev/zero of=/dev/null bs=1M count=512000000
juan     23288 32.4  0.5   8264  2616 pts/5    R    04:02   0:06 dd if=/dev/zero of=/dev/null bs=1M count=512000000
juan     23290 14.5  0.5   8264  2680 pts/0    R    04:02   0:02 dd if=/dev/zero of=/dev/null bs=1M count=512000000

root         1  0.0  0.5  33492  2772 ?        Ss   Apr03   0:00 /sbin/init
root         2  0.0  0.0      0     0 ?        S    Apr03   0:00 [kthreadd]
root         3  0.0  0.0      0     0 ?        S    Apr03   0:00 [ksoftirqd/0]
root         5  0.0  0.0      0     0 ?        S<   Apr03   0:00 [kworker/0:0H]
root         6  0.0  0.0      0     0 ?        S    Apr03   0:02 [kworker/u2:0]
root         7  0.0  0.0      0     0 ?        S    Apr03   0:01 [rcu_sched]
root@ubuntu-server:/home/juan# cat /proc/23288/cgroup
4:cpu,memory,blkio,name=MyHierarchy:/SubCgroup1
1:name=systemd:/user/1000.user/5.session
root@ubuntu-server:/home/juan# cat /proc/23289/cgroup
4:cpu,memory,blkio,name=MyHierarchy:/
1:name=systemd:/user/1000.user/7.session
root@ubuntu-server:/home/juan# cat /proc/23290/cgroup
4:cpu,memory,blkio,name=MyHierarchy:/SubCgroup2
1:name=systemd:/user/1000.user/6.session
root@ubuntu-server:/home/juan# 



We can see from the CPU utilization column that the process under the root cgroup (23289) is using 58% of the CPU, while the process under SubCgroup1 (23288) is using 32.4% and the one under SubCgroup2 (23290) 14.5%; that's roughly the 4:2:1 ratio you would expect from shares of 1024, 512 and 256.
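
One more thing worth noting: cpu.shares only kicks in when there is actual competition for the CPU; if the other cgroups are idle, a low-share group can still use the whole CPU. If you want a hard cap regardless of how idle the system is, the cpu.cfs_period_us / cpu.cfs_quota_us pair from the listing above does that. A minimal sketch:

# hard-cap SubCgroup1 to half a CPU: 50ms of CPU time every 100ms period
echo 100000 > /sys/fs/cgroup/MyHierarchy/SubCgroup1/cpu.cfs_period_us
echo 50000 > /sys/fs/cgroup/MyHierarchy/SubCgroup1/cpu.cfs_quota_us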



Wrapping up:

 

Cgroups are awesome! They provide a simple interface to a set of resource control capabilities that you can leverage on your Linux systems. There are many subsystems you can use, so for sure you will find the right one for your use case, no matter how weird it is xD. If you can choose... go with a RHEL-like distribution, since they come with a set of tools that can make your life way easier when it comes to handling cgroups; if you can't... be patient and have fun with mount/umount hahaha.
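
In case you're wondering, the tools I mean are the libcgroup ones (cgcreate, cgexec, cgclassify and the cgconfig service). Just as a taste, the manual workflow from this post collapses into something like this (a sketch, assuming the libcgroup tools package is installed):

# create a cgroup under the cpu and blkio hierarchies, then run a command inside it
cgcreate -g cpu,blkio:/MyGroup
cgexec -g cpu,blkio:/MyGroup dd oflag=dsync if=/dev/zero of=test bs=1M count=512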