Sunday, December 18, 2016

Process mating season 101 - Fork and Clone

Yeah, I know, the title is awesome, isn't it? Come on! At least it should be if you understand what fork() and clone() do in the context of Linux syscalls.

These two calls are the ones in charge of creating life on your system, basically spawning new processes. We'll have a look at the fundamental differences between them, some examples to understand them, and some interesting facts.

That said, before moving forward I'd like to outline the following:
  • Syscalls are the interface the kernel exposes to user space processes, or if you prefer, they are the way processes access the kernel's services.
  • Usually processes don't use syscalls directly; they use them through wrapper functions provided by glibc (in GNU/Linux at least). These wrapper functions provide a sort of abstraction layer, handling syscall parameters, return values and other situations.
Note: by the way, I've already talked a bit about syscalls here; you may want to have a look.
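Just to make the wrapper idea concrete, here is a minimal sketch (mine, not from the original post) that gets the process ID twice: once through the glibc wrapper and once by issuing the raw syscall directly with syscall(2):
#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

int main()
{
        pid_t pid_wrapper=getpid();//glibc wrapper: it handles the calling convention for us
        long pid_raw=syscall(SYS_getpid);//raw syscall: we pass the syscall number ourselves

        printf("getpid()=%d syscall(SYS_getpid)=%ld\n",(int)pid_wrapper,pid_raw);
        return 0;
}
Both lines print the same PID; the wrapper just hides the syscall number and the error handling.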

Fork, the mitosis of processes

[off topic] I'm getting extremely good at writing titles![/off topic]

Pretty much like the replication process of eukaryotic cells, the result of a fork() call is a new, almost identical process known as the child process. The "almost" is key here; some properties will be different, like (for an exhaustive list, please have a look at this):
  • PID will be different, the kernel will assign a new/unused PID to the child process. The parent PID of the child process will be its parent's PID (kind of makes sense, doesn't it?)
  • The child process doesn't inherit timers or memory locks.
  • Others
On the other side of the "almost" we have:
  • The child process will have an exact copy of its parent's entire virtual address space (fork() is implemented using copy-on-write pages, so the only penalty it incurs is the time and memory required to duplicate the parent's page tables and to create a unique task structure for the child).
  • The child process inherits copies of structures like open file descriptors, open directory streams, etc.
Right after the fork() call, although they still share some resources, the two processes are different entities, and they can run two different code paths. This will be easier to understand with a simple example.

Simple fork() example


The code is self-explanatory (or at least I tried), so I won't explain it in detail:
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <errno.h>
#include <string.h>

int main()
{
        int pid,ppid,childpid,keep_it;
        pid=getpid();//Get process ID
        ppid=getppid();//Get parent process ID

        childpid=fork();//fork() returns the child PID on the parent's code path and 0 on the child's. On failure returns -1
        //From this point on, there are 2 processes running, unless fork failed of course :D.
        keep_it=errno;
        if(childpid==-1)//Check if fork() failed
        {
                printf("Fork failed due to \"%s\"\n",strerror(keep_it));//print the system error that corresponds to errno
                return -1;
        }

        if(childpid==0)//Here is where the paths change, it could be done differently.
        {//Child code path here
                printf("Child process: \nPID\tPPID\n%d\t%d\n",pid,ppid);
                pid=getpid();
                ppid=getppid();
                printf("Child process: \nPID\tPPID\n%d\t%d\n",pid,ppid);
                sleep(5);
        }
        else
        {//Parent code path here
                sleep(10);
                printf("Parent process: \nPID\tPPID\n%d\t%d\n",pid,ppid);
                printf("Parent process: child PID was %d\n",childpid);
        }
        return 1;
}

Let's run it to see what happens:
juan@test:~/clone_fork$ gcc -o fork_simple_example fork_simple_example.c
juan@test:~/clone_fork$ ./fork_simple_example
Child process:
PID     PPID
3213    2965
Child process:
PID     PPID
3214    3213
Parent process:
PID     PPID
3213    2965
Parent process: child PID was 3214
juan@test:~/clone_fork$
On a different shell I also captured the processes with ps:
juan@test:~$ ps axo stat,user,comm,ppid,pid|grep fork
S+   juan     fork_simple_exa  2965  3213
S+   juan     fork_simple_exa  3213  3214
juan@test:~$
So, what do we see from both outputs? The child process printed its PID and PPID twice just to show that the first time, the values in those variables were the ones collected by its parent before the fork() call.

File descriptors are preserved - example


As we mentioned before, certain kernel structures are copied to the new child process; one of them is the set of open file descriptors. Now let's see that in an example using a pipe.

Note: a pipe is a type of inter-process communication mechanism; you can think of it simply as a tube with two ends, one you can write to and one you can read from. For more details please have a look at this.

The code is:
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <errno.h>
#include <string.h>
#define SIZE 250

int main()
{
        char buffer[SIZE],ch;
        int keep_it,childpid,aux,count;
        int my_pipe[2];//my_pipe[] will keep two FD, my_pipe[1] to write into the pipe and my_pipe[0] to read from the pipe.

        aux=pipe(my_pipe);//Note how the pipe is created BEFORE the fork() call
        keep_it=errno;//save errno right away in case pipe() failed

        if(aux==-1)//Check if pipe() failed
        {
                printf("Pipe failed due to \"%s\"\n",strerror(keep_it));//print the system error that corresponds to errno
                return -1;
        }

        childpid=fork();//fork() returns the child PID on the parent's code path and 0 on the child's. On failure returns -1
        //From this point on, there are 2 processes running, unless fork failed of course :D.
        keep_it=errno;
        if(childpid==-1)//Check if fork() failed
        {
                printf("Fork failed due to \"%s\"\n",strerror(keep_it));//print the system error that corresponds to errno
                return -1;
        }

        if(childpid==0)//Here is where the paths change, it could be done differently.
        {//Child code path here
                close(my_pipe[0]);//On the child process we can close the read end of the pipe
                printf("Hi, this is the child process, insert message here (:P less than %d letters please): ",SIZE);
                fgets(buffer,sizeof(buffer),stdin);
                count=write(my_pipe[1],buffer,strlen(buffer)+1);//send only the message (plus its terminating NUL) instead of the whole buffer
                printf("message sent to parent process.\n");
        }
        else
        {//Parent code path here
                close(my_pipe[1]);//On the parent process we can close the write end of the pipe
                read(my_pipe[0],buffer,SIZE);
                printf("Parent process received message: %s",buffer);
        }
        return 1;
}

This code is a bit more complex, but the comments should help.

You can see how the pipe was opened in the parent process (before the fork() call) and yet it was used in the child process without any problems! Also worth noting how the pipe requires 2 file descriptors, one to read from the pipe (stored in my_pipe[0]) and one to write into it (stored in my_pipe[1]). Since child and parent have copies of these open file descriptors after the fork() call, each can safely close the end it won't use, and they end up with a unidirectional inter-process communication channel (child -> PIPE -> parent).

A funny fact

 
At this point I wanted to run strace to show how the fork syscall was being used, and noticed the following. This is the strace output of running the first simple example:
juan@test:~/clone_fork$ strace ./fork_simple_example
execve("./fork_simple_example", ["./fork_simple_example"], [/* 22 vars */]) = 0
brk(0)                                  = 0x16ba000
access("/etc/ld.so.nohwcap", F_OK)      = -1 ENOENT (No such file or directory)
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fa76a34a000
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=95253, ...}) = 0
mmap(NULL, 95253, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7fa76a332000
close(3)                                = 0
access("/etc/ld.so.nohwcap", F_OK)      = -1 ENOENT (No such file or directory)
open("/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0P \2\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=1840928, ...}) = 0
mmap(NULL, 3949248, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7fa769d65000
mprotect(0x7fa769f1f000, 2097152, PROT_NONE) = 0
mmap(0x7fa76a11f000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1ba000) = 0x7fa76a11f000
mmap(0x7fa76a125000, 17088, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7fa76a125000
close(3)                                = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fa76a331000
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fa76a32f000
arch_prctl(ARCH_SET_FS, 0x7fa76a32f740) = 0
mprotect(0x7fa76a11f000, 16384, PROT_READ) = 0
mprotect(0x600000, 4096, PROT_READ)     = 0
mprotect(0x7fa76a34c000, 4096, PROT_READ) = 0
munmap(0x7fa76a332000, 95253)           = 0
getpid()                                = 4981
getppid()                               = 4978
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fa76a32fa10) = 4982
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
nanosleep({10, 0}, Child process:
PID     PPID
4981    4978
Child process:
PID     PPID
4982    4981
{4, 995479188})      = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=4982, si_status=1, si_utime=0, si_stime=0} ---
restart_syscall(<... resuming interrupted call ...>
) = 0
fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 8), ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fa76a349000
write(1, "Parent process: \n", 17Parent process:
)      = 17
write(1, "PID\tPPID\n", 9PID    PPID
)              = 9
write(1, "4981\t4978\n", 104981 4978
)            = 10
write(1, "Parent process: child PID was 49"..., 35Parent process: child PID was 4982
) = 35
exit_group(1)                           = ?
+++ exited with 1 +++
juan@test:~/clone_fork$

Do you see any fork() call there? ... Exactly, there's no fork call!!! But I said fork is a Linux syscall and blah blah blah, right? Well, worry not, I wasn't lying :D everything I said is true, however...

Since version 2.3.3, rather than invoking the kernel's fork() system call, the glibc fork() wrapper that is provided as part of the NPTL threading implementation invokes clone(2) with flags that provide the same effect as the traditional system call. (A call to fork() is equivalent to a call to clone(2) specifying flags as just SIGCHLD.) 

That's the reason why we see a clone() call instead!
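Just to see that equivalence from the other side, here is a small sketch of my own (not part of the man page quote) that skips the glibc wrapper and issues a raw clone with only SIGCHLD as flags. Note the raw clone argument order is architecture specific; this assumes x86_64:
#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <signal.h>
#include <sys/syscall.h>
#include <sys/wait.h>

int main()
{
        //Raw clone(2) with flags=SIGCHLD and no new stack behaves like fork():
        //the child gets a copy-on-write copy of the parent's address space and stack.
        long childpid=syscall(SYS_clone,SIGCHLD,NULL,NULL,NULL,NULL);

        if(childpid==-1)
        {
                perror("clone");
                return -1;
        }
        if(childpid==0)
        {//Child code path
                printf("Child process: \nPID\tPPID\n%d\t%d\n",(int)getpid(),(int)getppid());
                return 0;
        }
        //Parent code path
        printf("Parent process: child PID was %ld\n",childpid);
        waitpid((pid_t)childpid,NULL,0);//reap the child; SIGCHLD is delivered when it exits
        return 0;
}
Running this under strace should show the same clone(..., flags=SIGCHLD, ...) line we saw above.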

Now my brain needs some rest so I'll finish this post here, any feedback will be more than welcome!

On the next post I'll describe clone() and we'll see some examples to understand even better the differences with fork().

Thursday, November 17, 2016

Where did my disk space go??? Is df lying to me? or is it du lying?

I had an interesting situation today and thought it might be a good idea to share it here. Basically the problem was more or less the following:
  1. A critical production system was running out of disk space and things were going downhill scarily fast; df was already showing utilization at 99%.
  2. The user decided to delete some big log files in order to free up some space.
  3. After that, df was still showing 99%; however, running du on the root directory showed less space being used.
You can imagine how unhappy the user was when he realized that deleting files was not freeing any disk space and he was about to see the system become totally useless. On top of that, he couldn't explain why df and du wouldn't agree on the free space on the system.

What was happening here? Was df or du lying to the user?


Well... software in general doesn't lie; usually what actually happens is that we don't really understand how it works and therefore expect a different behavior. The root cause of the problem mentioned above is the unlink syscall. Why? We can get the answer by asking the oracle (man 2 unlink):
DESCRIPTION
       unlink() deletes a name from the filesystem.  If that name was the last
       link to a file and no processes have the file open, the file is deleted
       and the space it was using is made available for reuse.

       If  the  name  was the last link to a file but any processes still have
       the file open, the file will remain in existence until  the  last  file
       descriptor referring to it is closed.

       ...
Yes :D, it turns out that if you delete a file while at least one process still has a file descriptor pointing to it, the file's space won't become available right away.
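To make the man page's point concrete, here is a minimal, hypothetical sketch of the kind of program that keeps the space "hidden": it opens a file and then just sits on the file descriptor. The service binary that shows up later in the reproduction presumably does something along these lines, although its source isn't shown:
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>

int main(int argc,char *argv[])
{
        const char *path=(argc>1)?argv[1]:"big_file";//hypothetical default file name
        int fd=open(path,O_RDONLY);

        if(fd==-1)
        {
                perror("open");
                return -1;
        }
        printf("Holding fd %d on %s, now delete the file and watch df...\n",fd,path);
        //As long as this process lives, the inode and its data blocks can't be freed,
        //even after the last directory entry is unlinked.
        while(1)
                sleep(60);
        return 0;
}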

Let's reproduce this:


The easiest way to understand this is by actually reproducing the behavior. I just used dd to fill up the drive and the result was this:
[ec2-user@ip-172-31-16-117 ~]$ df -h
Filesystem      Size  Used Avail Use% Mounted on
devtmpfs        3.9G   56K  3.9G   1% /dev
tmpfs           3.9G     0  3.9G   0% /dev/shm
/dev/xvda1      7.8G  7.7G     0 100% /
[ec2-user@ip-172-31-16-117 ~]$ du -sch / 2>/dev/null
7.7G    /
7.7G    total
[ec2-user@ip-172-31-16-117 ~]$ echo a > file
-bash: echo: write error: No space left on device
[ec2-user@ip-172-31-16-117 ~]$
The system has clearly run out of disk space at this point and we can't even add a single byte to a file. The next step is to free up some space, so we check with du where we can find some files to delete:
[ec2-user@ip-172-31-16-117 ~]$ du -sch /* 2>/dev/null
7.0M    /bin
25M     /boot
4.0K    /cgroup
56K     /dev
9.1M    /etc
6.7G    /home
46M     /lib
23M     /lib64
4.0K    /local
16K     /lost+found
4.0K    /media
4.0K    /mnt
43M     /opt
0       /proc
4.0K    /root
8.0K    /run
12M     /sbin
4.0K    /selinux
4.0K    /srv
0       /sys
8.0K    /tmp
802M    /usr
48M     /var
7.7G    total
[ec2-user@ip-172-31-16-117 ~]$
The size of /home is a bit crazy considering the size of the partition itself, so let's delete some files from /home/ec2-user:
[ec2-user@ip-172-31-16-117 ~]$ ls -lah
total 6.7G
drwx------ 3 ec2-user ec2-user 4.0K Nov 17 20:10 .
drwxr-xr-x 3 root     root     4.0K Nov 17 18:42 ..
-rw-r--r-- 1 ec2-user ec2-user   18 Aug 15 23:52 .bash_logout
-rw-r--r-- 1 ec2-user ec2-user  193 Aug 15 23:52 .bash_profile
-rw-r--r-- 1 ec2-user ec2-user  124 Aug 15 23:52 .bashrc
-rw-rw-r-- 1 ec2-user ec2-user 306M Nov 17 20:09 big_file
-rw-rw-r-- 1 ec2-user ec2-user    0 Nov 17 20:10 file
-rw-rw-r-- 1 ec2-user ec2-user 6.4G Nov 17 18:44 fillingup
-rwxrwxr-x 1 ec2-user ec2-user 7.0K Nov 17 20:09 service
-rw-rw-r-- 1 ec2-user ec2-user  309 Nov 17 20:09 service.c
drwx------ 2 ec2-user ec2-user 4.0K Nov 17 18:42 .ssh
-rw------- 1 ec2-user ec2-user 2.1K Nov 17 20:09 .viminfo
[ec2-user@ip-172-31-16-117 ~]$ rm big_file
[ec2-user@ip-172-31-16-117 ~]$
[ec2-user@ip-172-31-16-117 ~]$ echo a > file
-bash: echo: write error: No space left on device
[ec2-user@ip-172-31-16-117 ~]$
I've just deleted a 306 MB file and still can't write a single byte!!! Having a look at df and du, things look like this:
[ec2-user@ip-172-31-16-117 ~]$ df -h
Filesystem      Size  Used Avail Use% Mounted on
devtmpfs        3.9G   56K  3.9G   1% /dev
tmpfs           3.9G     0  3.9G   0% /dev/shm
/dev/xvda1      7.8G  7.7G     0 100% /
[ec2-user@ip-172-31-16-117 ~]$ du -sch / 2>/dev/null
7.4G    /
7.4G    total
[ec2-user@ip-172-31-16-117 ~]$
While du shows the expected size, df insists that the partition is at 100% utilization. Cool, with just a few steps we have reproduced the situation!

The reason behind the difference between df and du is that du calculates the size of each folder by adding up the sizes of the files inside it, and big_file no longer exists in the folder (its directory entry has been removed), while df reports the real available space of the file system.
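That is also, roughly, where df gets its numbers from: it asks the filesystem itself (through statfs/statvfs) instead of walking directories and adding up file sizes. A small sketch of that call (mine, not part of the original post):
#include <stdio.h>
#include <sys/statvfs.h>

int main()
{
        struct statvfs vfs;

        if(statvfs("/",&vfs)==-1)
        {
                perror("statvfs");
                return -1;
        }
        //df-style numbers: they come straight from the filesystem metadata
        unsigned long long block=vfs.f_frsize;
        unsigned long long total=vfs.f_blocks*block;
        unsigned long long avail=vfs.f_bavail*block;//space available to unprivileged users

        printf("Total: %llu bytes, available: %llu bytes\n",total,avail);
        return 0;
}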

With all this in mind, the only reason the file system usage wouldn't drop after deleting a file is that there's at least one process keeping a file descriptor that points to the deleted file (you can read more about open files and file descriptors in Linux limits 102 - Open Files). We can easily identify the process that holds the file descriptor by simply using our lovely lsof:
[ec2-user@ip-172-31-16-117 ~]$ lsof |grep "\(deleted\)"
service   23177 ec2-user    3r      REG  202,1 319934464  14708 /home/ec2-user/big_file (deleted)
[ec2-user@ip-172-31-16-117 ~]$
Ha! Turns out process "service" with PID 23177 has a file descriptor that points to the file /home/ec2-user/big_file, and that file has been deleted!!! So if we get rid of the service process, the space should finally become available. Let's confirm that:
[ec2-user@ip-172-31-16-117 ~]$ kill -9 23177
[ec2-user@ip-172-31-16-117 ~]$ 
[1]+  Killed                  ./service
[ec2-user@ip-172-31-16-117 ~]$ df -h
Filesystem      Size  Used Avail Use% Mounted on
devtmpfs        3.9G   56K  3.9G   1% /dev
tmpfs           3.9G     0  3.9G   0% /dev/shm
/dev/xvda1      7.8G  7.4G  306M  97% /
[ec2-user@ip-172-31-16-117 ~]$ du -sch / 2>/dev/null
7.4G    /
7.4G    total
[ec2-user@ip-172-31-16-117 ~]$
And magic happened!!! When the process was terminated, the fd was finally closed and the last reference to the file disappeared, making the space available again; that's why df now matches du.

Depending on which process is keeping the file open you may have other options besides killing it, for example:
  • Restarting it if it's a service.
  • Perhaps sending another signal (for example to rotate logs?). 
  • If for some reason (e.g. the process is stuck in an uninterruptible state) you can't kill it, then the only easy way out is to restart the whole system.

Sunday, October 16, 2016

Linux limits 102 - Open files

This post is kind of a follow-up to Linux limits 101 - Ulimit from a few weeks ago, where we went through ulimits and how they are used to keep users' behavior in check. This time we'll take a look at another limit the kernel can impose on us and therefore make our lives a bit harder: the number of open files.

I think it's worth mentioning (I may have said this before xD) that when we talk about open files in Linux systems we are basically talking about file descriptors (aka file handles). A file descriptor is a data structure used by processes to access files, Unix sockets, network sockets, pipes, etc. Every new process comes by default with 3 file descriptors:
  • FD 0: standard input
  • FD 1: standard output
  • FD 2: standard error
So, just by having, let's say:
juan@test:~$ ps aux|wc -l
142
juan@test:~$
about 142 processes running on the system, we should expect at least around 426 file descriptors to be in use (142x3=426). What if there was a way to know how many file descriptors a particular process is using?

File descriptors a process is using:


Yeahp, of course that's possible! And as always in Linux, there are at least two different ways. The first approach is of course the easiest one: each process has a folder under /proc that provides loads of information; in this case we'll focus on the fd subfolder, where, guess what shows up? Indeed, the file descriptors of the process:
root@test:/home/juan# ls /proc/1/fd | wc -l
24
root@test:/home/juan#
The init process (PID=1) has 24 file descriptors in use! We can see more details about them in the following output:
root@test:/home/juan# ls -la /proc/1/fd
total 0
dr-x------ 2 root root  0 Oct 15 10:39 .
dr-xr-xr-x 9 root root  0 Oct 15 10:38 ..
lrwx------ 1 root root 64 Oct 15 10:39 0 -> /dev/null
lrwx------ 1 root root 64 Oct 15 10:39 1 -> /dev/null
lrwx------ 1 root root 64 Oct 15 10:39 10 -> socket:[8662]
lrwx------ 1 root root 64 Oct 15 10:39 11 -> socket:[9485]
l-wx------ 1 root root 64 Oct 15 10:39 12 -> /var/log/upstart/network-manager.log.1 (deleted)
lrwx------ 1 root root 64 Oct 15 10:39 14 -> socket:[10329]
l-wx------ 1 root root 64 Oct 15 10:39 16 -> /var/log/upstart/systemd-logind.log.1 (deleted)
lrwx------ 1 root root 64 Oct 15 10:39 17 -> socket:[8637]
lrwx------ 1 root root 64 Oct 15 10:39 18 -> /dev/ptmx
lrwx------ 1 root root 64 Oct 15 10:39 2 -> /dev/null
lrwx------ 1 root root 64 Oct 15 10:39 20 -> /dev/ptmx
lrwx------ 1 root root 64 Oct 15 10:39 22 -> /dev/ptmx
l-wx------ 1 root root 64 Oct 15 10:39 24 -> /var/log/upstart/modemmanager.log.1 (deleted)
lrwx------ 1 root root 64 Oct 15 10:39 29 -> /dev/ptmx
lr-x------ 1 root root 64 Oct 15 10:39 3 -> pipe:[8403]
lrwx------ 1 root root 64 Oct 15 10:39 30 -> /dev/ptmx
l-wx------ 1 root root 64 Oct 15 10:39 31 -> /var/log/upstart/mysql.log.1 (deleted)
lrwx------ 1 root root 64 Oct 15 10:39 34 -> /dev/ptmx
lrwx------ 1 root root 64 Oct 15 10:39 36 -> /dev/ptmx
l-wx------ 1 root root 64 Oct 15 10:39 4 -> pipe:[8403]
lr-x------ 1 root root 64 Oct 15 10:39 5 -> anon_inode:inotify
lr-x------ 1 root root 64 Oct 15 10:39 6 -> anon_inode:inotify
lrwx------ 1 root root 64 Oct 15 10:39 7 -> socket:[8404]
lrwx------ 1 root root 64 Oct 15 10:39 9 -> socket:[12675]
root@test:/home/juan#
We can see how the file descriptors are represented as symbolic links; a summary of the output would be:
  • Default file descriptors (0, 1 and 2) point to /dev/null, which is fine for a non-interactive process like init.
  • There are a couple of UNIX sockets "socket:[XXXX]" open (7, 9, 10, 11, etc.), probably used to connect to other processes.
  • There's also a pipe "pipe:[8403]" using two fds (3 and 4); that's normal, pipes provide one fd to write and one to read while the data is buffered in the kernel.
  • The rest of the fds point to:
    • /dev/ptmx, the pseudo terminal device.
    • inotify, a way to monitor changes on files; this means init is watching for events through two particular fds.
    • some deleted log files like /var/log/upstart/mysql.log.1; this is odd. Probably the files were rotated or something like that.
If for some reason these details weren't enough, you can go hardcore and try the second way: lsof (list open files, makes sense, right?) is the program you need for that.

Let's list all the open files for a particular process using lsof, init in this case:
root@test:/home/juan# lsof -p 1
lsof: WARNING: can't stat() fuse.gvfsd-fuse file system /run/user/112/gvfs
      Output information may be incomplete.
COMMAND PID USER   FD   TYPE             DEVICE SIZE/OFF   NODE NAME
init      1 root  cwd    DIR                8,1     4096      2 /
init      1 root  rtd    DIR                8,1     4096      2 /
init      1 root  txt    REG                8,1   265848 261189 /sbin/init
init      1 root  mem    REG                8,1    43616 581960 /lib/x86_64-linux-gnu/libnss_files-2.19.so
init      1 root  mem    REG                8,1    47760 555508 /lib/x86_64-linux-gnu/libnss_nis-2.19.so
init      1 root  mem    REG                8,1    97296 555504 /lib/x86_64-linux-gnu/libnsl-2.19.so
init      1 root  mem    REG                8,1    39824 555503 /lib/x86_64-linux-gnu/libnss_compat-2.19.so
init      1 root  mem    REG                8,1    14664 555500 /lib/x86_64-linux-gnu/libdl-2.19.so
init      1 root  mem    REG                8,1   252032 540246 /lib/x86_64-linux-gnu/libpcre.so.3.13.1
init      1 root  mem    REG                8,1   141574 555505 /lib/x86_64-linux-gnu/libpthread-2.19.so
init      1 root  mem    REG                8,1  1840928 581957 /lib/x86_64-linux-gnu/libc-2.19.so
init      1 root  mem    REG                8,1    31792 581956 /lib/x86_64-linux-gnu/librt-2.19.so
init      1 root  mem    REG                8,1    43464 527349 /lib/x86_64-linux-gnu/libjson-c.so.2.0.0
init      1 root  mem    REG                8,1   134296 527439 /lib/x86_64-linux-gnu/libselinux.so.1
init      1 root  mem    REG                8,1   281552 527323 /lib/x86_64-linux-gnu/libdbus-1.so.3.7.6
init      1 root  mem    REG                8,1    38920 527371 /lib/x86_64-linux-gnu/libnih-dbus.so.1.0.0
init      1 root  mem    REG                8,1    96280 527373 /lib/x86_64-linux-gnu/libnih.so.1.0.0
init      1 root  mem    REG                8,1   149120 555506 /lib/x86_64-linux-gnu/ld-2.19.so
init      1 root    0u   CHR                1,3      0t0   1029 /dev/null
init      1 root    1u   CHR                1,3      0t0   1029 /dev/null
init      1 root    2u   CHR                1,3      0t0   1029 /dev/null
init      1 root    3r  FIFO                0,9      0t0   8403 pipe
init      1 root    4w  FIFO                0,9      0t0   8403 pipe
init      1 root    5r  0000               0,10        0   7661 anon_inode
init      1 root    6r  0000               0,10        0   7661 anon_inode
init      1 root    7u  unix 0xffff8800b37c8780      0t0   8404 @/com/ubuntu/upstart
init      1 root    9u  unix 0xffff8800a14e7c00      0t0  12675 @/com/ubuntu/upstart
init      1 root   10u  unix 0xffff8800b37c9a40      0t0   8662 @/com/ubuntu/upstart
init      1 root   11u  unix 0xffff8800b37a1e00      0t0   9485 @/com/ubuntu/upstart
init      1 root   12w   REG                8,1      283 551619 /var/log/upstart/network-manager.log.1 (deleted)
init      1 root   14u  unix 0xffff8800b37a3c00      0t0  10329 @/com/ubuntu/upstart
init      1 root   16w   REG                8,1      451 522345 /var/log/upstart/systemd-logind.log.1 (deleted)
init      1 root   17u  unix 0xffff8800b37cb0c0      0t0   8637 socket
init      1 root   18u   CHR                5,2      0t0   1932 /dev/ptmx
init      1 root   20u   CHR                5,2      0t0   1932 /dev/ptmx
init      1 root   22u   CHR                5,2      0t0   1932 /dev/ptmx
init      1 root   24w   REG                8,1      502 527289 /var/log/upstart/modemmanager.log.1 (deleted)
init      1 root   29u   CHR                5,2      0t0   1932 /dev/ptmx
init      1 root   30u   CHR                5,2      0t0   1932 /dev/ptmx
init      1 root   31w   REG                8,1      881 552236 /var/log/upstart/mysql.log.1 (deleted)
init      1 root   34u   CHR                5,2      0t0   1932 /dev/ptmx
init      1 root   36u   CHR                5,2      0t0   1932 /dev/ptmx
root@test:/home/juan#
Now we can see a few more things, like:
  • Details of each FD, like its type, the device it belongs to, etc.
  • We can also see some entries that aren't really open fds but extra process information:
    • cwd current working directory
    • rtd root directory
    • txt init's binary code file
    • mem memory mapped files, in this case a bunch of system libraries. YES, these had an fd when they were mapped, but the fd was closed right after the mmap call succeeded (you can see that in this entry about strace).
Ok, now we know what a file descriptor (or file handle) is and how to identify them and map them to our processes. Is there any system-wide limit on the number of file descriptors you can open?
 

Max open files, system wide:


If the answer to the previous question were no, I guess I wouldn't have a reason to write this article in the first place xD, therefore the answer is YES :P. The kernel is cool, and in order to play it safe it has to set limits (I sound like a father now...) to avoid bigger problems.

The maximum number of open files the kernel can handle can be obtained from our beloved /proc, particularly from the file-nr file under the sys/fs directory. Here we can see the numbers for my current system:
root@test:/home/juan# cat /proc/sys/fs/file-nr
2944 0 298505
root@test:/home/juan#
These values mean the following:
  • The first value (2944) indicates the number of allocated file descriptors; these are allocated dynamically by the kernel.
  • The second value (0) is the number of allocated but unused file descriptors. Since kernel 2.6.something, unused fds are freed, so this value should always be 0.
  • The third value (298505) indicates the maximum number of file descriptors the kernel can allocate (also visible in the file-max file).
Summing up, there are 2944 file descriptors allocated and in use at this precise moment. Almost 3k file descriptors allocated for about 142 processes, interesting, right?

Just for the sake of it, let's track down the processes using the most file descriptors:
root@test:/home/juan# for i in `ps -Ao pid|grep -v PID`;do count=`ls /proc/$i/fd/ 2> /dev/null|wc -l`; echo "$count $i";done | sort -nr | head
61 1075
48 393
32 1264
31 1365
27 1132
24 1
21 440
20 1265
19 1325
19 1311
root@test:/home/juan#
There we see the top ten (the first column is the number of FDs and the second is the PID). Interestingly enough, there's a process using 61 file descriptors; turns out I had mysqld installed on this VM (had no idea...):
root@test:/home/juan# ps aux|grep 1075
mysql     1075  0.1  1.9 624040 57792 ?        Ssl  10:38   0:07 /usr/sbin/mysqld
root      9664  0.0  0.0  15948  2232 pts/1    S+   12:35   0:00 grep --color=auto 1075
root@test:/home/juan#
 

Increasing the limit


If by any chance the almost 300k file descriptors the kernel allows are not enough (some busy systems may reach that limit), you will notice messages like "VFS: file-max limit reached" in dmesg and probably in the messages or syslog files. In that case, you can increase the limit in one of the following ways:
  • The iminproductionpainchangeitrightnowgoddamnit way, by just updating /proc with the new value, like:
root@test:/home/juan# echo 400000 > /proc/sys/fs/file-max
root@test:/home/juan# cat /proc/sys/fs/file-nr
3072    0       400000
root@test:/home/juan#
  • Or you can be more elegant and use the sysctl command:
root@test:/home/juan# sysctl -w fs.file-max=500000
fs.file-max = 500000
root@test:/home/juan# cat /proc/sys/fs/file-nr
3072    0       500000
root@test:/home/juan#

In any case, don't forget to make the change persistent by doing:
root@test:/home/juan# echo "fs.file-max=500000" >> /etc/sysctl.conf
root@test:/home/juan#

If the error showing up in your logs is instead "Too many open files", then the limit you've reached is most likely the user's ulimit :D, which you know how to deal with because you've read this.

And that's about it! 

Thursday, September 22, 2016

Hard links and Soft links - Do you really understand them?

I've noticed the hard/soft link concept seems to be a bit confusing for some people; some don't even know hard links exist. Well, I must confess that I didn't quite get them at the beginning either, but after playing around a bit and breaking stuff you get to understand (and love) them. Therefore this post is intended to explain them in detail, their differences and the reasons behind them.

Test scenario


The test scenario is pretty simple:
juan@test:~/hard_soft$ ls -lai
total 16
457232 drwxrwxr-x  2 juan juan 4096 Sep 21 17:45 .
415465 drwxr-xr-x 39 juan juan 4096 Sep 21 17:45 ..
457337 -rw-rw-r--  1 juan juan   14 Sep 21 17:45 File1
juan@test:~/hard_soft$
Just a folder with one file that we are going to use as the target for the links. The content of the file is just the string "File1 content\n", so 14 bytes. We can see the inode number is 457337 for File1 (every file has an inode number). We also see the number 1 in the third column; that is the number of "hard links" the inode has, we'll see this in more detail later on :D.

We can also get similar information (and more) by using the stat program:
juan@test:~/hard_soft$ stat File1
File: ‘File1’
  Size: 14              Blocks: 8          IO Block: 4096   regular file
Device: 801h/2049d      Inode: 457337      Links: 1
Access: (0664/-rw-rw-r--)  Uid: ( 1000/    juan)   Gid: ( 1000/    juan)
Access: 2016-09-21 17:45:43.744729861 +0100
Modify: 2016-09-21 17:45:43.744729861 +0100
Change: 2016-09-21 17:45:43.744729861 +0100
 Birth: -
juan@test:~/hard_soft$ 
So this is the file we'll use to play with.

Soft links aka Symbolic link


Let's take a look at the soft ones first. Long story short, creating a soft link means creating a new file that points to a target file; therefore you end up having 2 files, yeahp, trust me! You can see that happening here:
juan@test:~/hard_soft$ ln -s File1 LinkToFile1
juan@test:~/hard_soft$ ls -lai
total 16
457232 drwxrwxr-x  2 juan juan 4096 Sep 21 18:06 .
415465 drwxr-xr-x 39 juan juan 4096 Sep 21 17:45 ..
457337 -rw-rw-r--  1 juan juan   14 Sep 21 17:45 File1
457342 lrwxrwxrwx  1 juan juan    5 Sep 21 18:06 LinkToFile1 -> File1
juan@test:~/hard_soft$
I just used "ln -s" to create a soft link called LinkToFile1 that points to File1. The particular syscall in use here is symlink; we can see that here:
juan@test:~/hard_soft$ strace ln -s File1 Link1ToFile1
execve("/bin/ln", ["ln", "-s", "File1", "Link1ToFile1"], [/* 22 vars */]) = 
brk(0)                                  = 0x13e0000
...
stat("Link1ToFile1", 0x7ffc1f9a1470)    = -1 ENOENT (No such file or directory)
symlink("File1", "Link1ToFile1")        = 0
lseek(0, 0, SEEK_CUR)                   = -1 ESPIPE (Illegal seek)
close(0)                                = 0
close(1)                                = 0
close(2)                                = 0
exit_group(0)                           = ?
+++ exited with 0 +++
juan@test:~/hard_soft$
ln first checks if there's a file already called Link1ToFile1 and since there isn't any, it moves forward and creates the symbolic link with that name pointing to file File1.

Taking a look at the "ls" output:
  • The file LinkToFile1 has its own inode number, 457342, therefore it is an independent file. Also, the 1 in the third column indicates there's a single link to that inode.
  • Before the permissions we have an "l", which indicates this is not a regular file but a symbolic link.
  • The file size is different, isn't that funny? Just 5 bytes! Why?
  • Permissions are kind of open, right? Yeahp, that's normal for soft links; the permissions of the target file are the ones that matter.
Regarding the size of LinkToFile1, what are these 5 bytes?
juan@test:~/hard_soft$ cat LinkToFile1
File1 content
juan@test:~/hard_soft$
Oops... of course, doing "cat LinkToFile1" is in the end doing cat on File1! So how can we actually read the content of LinkToFile1 itself? Let's see if strace can help here (wanna know more about strace? take a look at this post):
juan@test:~/hard_soft$ strace cat LinkToFile1
execve("/bin/cat", ["cat", "LinkToFile1"], [/* 22 vars */]) = 0
brk(0)                                  = 0x149f000
...
open("LinkToFile1", O_RDONLY)           = 3
fstat(3, {st_mode=S_IFREG|0664, st_size=14, ...}) = 0
fadvise64(3, 0, 0, POSIX_FADV_SEQUENTIAL) = 0
read(3, "File1 content\n", 65536)       = 14
write(1, "File1 content\n", 14File1 content
)         = 14
read(3, "", 65536)                      = 0
close(3)                                = 0
close(1)                                = 0
close(2)                                = 0
exit_group(0)                           = ?
+++ exited with 0 +++
juan@test:~/hard_soft$
It turns out that by default the open syscall recognizes the file as a soft link and follows it (you can avoid this with certain flags, O_NOFOLLOW for example). In the end, the returned FD 3 actually points to File1 and that's why the read syscall returns "File1 content\n". So how can we actually retrieve the content of LinkToFile1? Well, we can use the readlink program (which actually uses the readlink syscall xD) to read the content of a symbolic link, just like this:
juan@test:~/hard_soft$ readlink LinkToFile1
File1
juan@test:~/hard_soft$
Yes :D, the content of LinkToFile1 is "File1", the name of the file (the relative path actually); that's why the size is 5 bytes!!! But if the content of LinkToFile1 is a path to File1, what happens if I move File1 somewhere else?
Lets have a look:
juan@test:~/hard_soft$ mv File1 ..
juan@test:~/hard_soft$ ls -lai
total 12
457232 drwxrwxr-x  2 juan juan 4096 Sep 21 18:44 .
415465 drwxr-xr-x 39 juan juan 4096 Sep 21 18:44 ..
457342 lrwxrwxrwx  1 juan juan    5 Sep 21 18:06 LinkToFile1 -> File1
juan@test:~/hard_soft$ readlink LinkToFile1
File1
juan@test:~/hard_soft$ cat LinkToFile1
cat: LinkToFile1: No such file or directory
juan@test:~/hard_soft$
Exactly, the link breaks and doesn't work anymore! The same thing happens if we remove the target file or if we move LinkToFile1 instead:
juan@test:~/hard_soft$ ls
File1  File2  LinkToFile1
juan@test:~/hard_soft$ mv LinkToFile1 ../
juan@test:~/hard_soft$ ll -i ../LinkToFile1
457342 lrwxrwxrwx 1 juan juan 5 Sep 21 18:06 ../LinkToFile1 -> File1
juan@test:~/hard_soft$ cat ../LinkToFile1
cat: ../LinkToFile1: No such file or directory
juan@test:~/hard_soft$
We could work around the issue of moving the link file by using the full path to File1 as the target when creating the link, instead of the relative one:
juan@test:~/hard_soft$ ln -s /home/juan/hard_soft/File1 Link2ToFile1
juan@test:~/hard_soft$ ls -lai
total 16
457232 drwxrwxr-x  2 juan juan 4096 Sep 21 18:50 .
415465 drwxr-xr-x 39 juan juan 4096 Sep 21 18:50 ..
457337 -rw-rw-r--  1 juan juan   14 Sep 21 17:45 File1
457343 lrwxrwxrwx  1 juan juan   26 Sep 21 18:50 Link2ToFile1 -> /home/juan/hard_soft/File1
457342 lrwxrwxrwx  1 juan juan    5 Sep 21 18:06 LinkToFile1 -> File1
juan@test:~/hard_soft$ readlink Link2ToFile1
/home/juan/hard_soft/File1
juan@test:~/hard_soft$
What if I delete a soft link? Since soft links are just files, deleting one will just make it go away and nothing will happen to the target/linked file:
juan@test:~/hard_soft$ rm Link*
juan@test:~/hard_soft$ ls -lai
total 12
457232 drwxrwxr-x  2 juan juan 4096 Sep 21 19:14 .
415465 drwxr-xr-x 39 juan juan 4096 Sep 21 18:50 ..
457337 -rw-rw-r--  1 juan juan   14 Sep 21 17:45 File1
juan@test:~/hard_soft$
What actually happens when you delete a file is that the number of links in the inode is decreased by one (and the entry gets removed from the directory). Once the number of links reaches 0, the file is officially gone (unless there's a running process that has a FD using it).

So... soft links are just files whose content is the path to the targeted/linked file. Makes perfect sense now, right?!
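The same thing can be done straight from C with the symlink and readlink syscalls we saw in the strace outputs; here is a minimal sketch of mine (the file names are just the ones from this test scenario):
#include <stdio.h>
#include <unistd.h>

int main()
{
        char target[256];
        ssize_t len;

        if(symlink("File1","LinkToFile1")==-1)//the link's content will literally be the string "File1"
        {
                perror("symlink");
                return -1;
        }
        //readlink() returns the link's content, NOT the target file's data.
        //Note it does not NUL-terminate the buffer for us.
        len=readlink("LinkToFile1",target,sizeof(target)-1);
        if(len==-1)
        {
                perror("readlink");
                return -1;
        }
        target[len]='\0';
        printf("LinkToFile1 -> %s (%zd bytes)\n",target,len);
        return 0;
}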

Hard links


The story is a bit different with hard links. When you create one you DO NOT get an extra file, nope you don't. Creating a hard link increases the number of links for a particular inode; let's see an example:
juan@test:~/hard_soft$ ln File1 Link1ToFile1
juan@test:~/hard_soft$ ls -lai
total 16
457232 drwxrwxr-x  2 juan juan 4096 Sep 21 19:24 .
415465 drwxr-xr-x 39 juan juan 4096 Sep 21 18:50 ..
457337 -rw-rw-r--  2 juan juan   14 Sep 21 17:45 File1
457337 -rw-rw-r--  2 juan juan   14 Sep 21 17:45 Link1ToFile1
juan@test:~/hard_soft$
Note: the syscall in play here is link or linkat.

Interesting! It looks like 2 files, but:
  • They both have the same inode number! Therefore they are the same file. Directory-wise there are indeed two entries, one for File1 and one for Link1ToFile1, but both point to the same inode.
  • Permissions are the same; that makes sense because it is the same file (xD you are getting my point, right?), and that also applies to the rest of the properties, MAC times for example.
Does that mean that if I add a third hard link I'll get 3 as the number of links for inode 457337? Yeahp, that's correct:
juan@test:~/hard_soft$ ln File1 Link2ToFile1
juan@test:~/hard_soft$ ls -lai
total 20
457232 drwxrwxr-x  2 juan juan 4096 Sep 21 19:56 .
415465 drwxr-xr-x 39 juan juan 4096 Sep 21 18:50 ..
457337 -rw-rw-r--  3 juan juan   14 Sep 21 17:45 File1
457337 -rw-rw-r--  3 juan juan   14 Sep 21 17:45 Link1ToFile1
457337 -rw-rw-r--  3 juan juan   14 Sep 21 17:45 Link2ToFile1
juan@test:~/hard_soft$
The good thing about hard links is that you can move them around and they just keep working:
juan@test:~/hard_soft$ mv Link1ToFile1 ../
juan@test:~/hard_soft$ cat ../Link1ToFile1
File1 content
juan@test:~/hard_soft$
That's because by moving the hard link, the directory entry in the hard_soft directory was removed and a corresponding one was created in the parent directory (/home/juan), so accessing the link keeps working. Did this change the number of links on inode 457337?
juan@test:~/hard_soft$ ls -lai
total 16
457232 drwxrwxr-x  2 juan juan 4096 Sep 21 20:01 .
415465 drwxr-xr-x 39 juan juan 4096 Sep 21 20:01 ..
457337 -rw-rw-r--  3 juan juan   14 Sep 21 17:45 File1
457337 -rw-rw-r--  3 juan juan   14 Sep 21 17:45 Link2ToFile1
juan@test:~/hard_soft$ ls -lai ../Link1ToFile1
457337 -rw-rw-r-- 3 juan juan 14 Sep 21 17:45 ../Link1ToFile1
juan@test:~/hard_soft$
Of course not, inode 457337 still has 3 links.

Then what if I delete a hard link? As we mentioned before, deleting a file decreases the link counter on the inode by one; therefore if we have 3 hard links and we delete one of them, we'll be back to 2, as you can see here:
juan@test:~/hard_soft$ ls -lai
total 20
457232 drwxrwxr-x  2 juan juan 4096 Sep 22 19:52 .
415465 drwxr-xr-x 39 juan juan 4096 Sep 22 19:51 ..
457337 -rw-rw-r--  3 juan juan   14 Sep 21 17:45 File1
457337 -rw-rw-r--  3 juan juan   14 Sep 21 17:45 Link1ToFile1
457337 -rw-rw-r--  3 juan juan   14 Sep 21 17:45 Link2ToFile1
juan@test:~/hard_soft$ rm Link2ToFile1
juan@test:~/hard_soft$ ls -lai
total 16
457232 drwxrwxr-x  2 juan juan 4096 Sep 22 19:53 .
415465 drwxr-xr-x 39 juan juan 4096 Sep 22 19:51 ..
457337 -rw-rw-r--  2 juan juan   14 Sep 21 17:45 File1
457337 -rw-rw-r--  2 juan juan   14 Sep 21 17:45 Link1ToFile1
juan@test:~/hard_soft$
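The same bookkeeping can be watched from C using the link syscall mentioned above plus stat to read the link counter (st_nlink); here is a small sketch of mine that assumes a File1 like the one in this test scenario:
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>

void show_links(const char *path)
{
        struct stat st;
        if(stat(path,&st)==0)
                printf("%s: inode %lu, %lu link(s)\n",path,(unsigned long)st.st_ino,(unsigned long)st.st_nlink);
}

int main()
{
        show_links("File1");//typically 1 link to start with

        if(link("File1","Link1ToFile1")==-1)//new directory entry, same inode
        {
                perror("link");
                return -1;
        }
        show_links("File1");//now 2 links

        unlink("Link1ToFile1");//removing one entry decreases the counter
        show_links("File1");//back to 1 link
        return 0;
}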

Summary


To wrap up the idea, I'll summarize the most important points here:

  • A soft link is another file (of link type though); its content indicates the location of the targeted file.
  • You can have a soft link pointing to a file in a different partition, which is NOT possible with hard links.
  • Hard links don't require extra space or inodes, and they can be moved around (in the same partition) and will keep working fine.
  • Every time you create a file, a hard link is created and that's link 1 in the inode :D.

Monday, September 5, 2016

Linux limits 101 - Ulimit

Resources aren't infinite, and that's old news, right? We are usually worried about disk space and memory utilization; however, these are far from the only resources on a Linux system you should worry about. A few months ago I wrote an entry about cgroups and mentioned they are a way to limit/assign resources to processes; this time we'll see a different kind of restriction that can also cause some production pain.

Ulimit - Users Limits


User limits are restrictions enforced on processes spawned from your shell, and they are in place to keep users under control, somehow. For every resource tracked by the kernel there are 2 limits, a soft limit and a hard limit. While an unprivileged process can only lower its hard limit (irreversibly), the soft limit can be raised up to the hard limit if necessary (more details here).
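From inside a process, the same pair of values is visible through the getrlimit syscall (the same one used in the C examples at the end of this post); a quick sketch for the open files limit:
#include <stdio.h>
#include <sys/time.h>
#include <sys/resource.h>

int main()
{
        struct rlimit rl;

        if(getrlimit(RLIMIT_NOFILE,&rl)==-1)
        {
                perror("getrlimit");
                return -1;
        }
        //rlim_cur is the soft limit and rlim_max the hard limit, the same values ulimit -n and ulimit -Hn show
        printf("open files: soft=%llu hard=%llu\n",(unsigned long long)rl.rlim_cur,(unsigned long long)rl.rlim_max);
        return 0;
}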

We can see the user limits by using the bash builtin command ulimit; for example, we can see the soft limits with ulimit -a:
juan@test:~$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 11664
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 11664
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
juan@test:~$
and we can see the hard limits with -aH:
juan@test:~$ ulimit -aH
core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 11664
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 4096
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 11664
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
juan@test:~$
There are two different situations here:
  • Some resources like "open files" and "core file size" have a soft limit lower than the hard limit, which means the process itself can increase it if necessary. 
  • Other resources like "max memory size" and "max user processes" have the same value for both soft and hard limits, which means the user can only decrease the value.
These limits are inherited by the child processes after a fork call and they are maintained after an execve call.  

Using the ulimit command you can also update the values; for example, if we believe the maximum number of open files per process (1024) is not enough, we can go ahead and raise the soft limit with -S, like this:
juan@test:~$ ulimit -n
1024
juan@test:~$ ulimit -S -n 2048
juan@test:~$ ulimit -n
2048
juan@test:~$
Now the process (our shell in this case) can open up to 2048 files. If we spawn a new process from this shell, we'll see the limit is still there:
juan@test:~$ /bin/bash
juan@test:~$ ulimit -n
2048
juan@test:~$ exit
exit
juan@test:~$
Using -H we can decrease (or increase, if it's a privileged process) the hard limit for a particular resource, but be careful, you can't increase it back!!!
juan@test:~$ ulimit -H -n 1027
juan@test:~$ ulimit -Hn
1027
juan@test:~$ ulimit -H -n 1028
-bash: ulimit: open files: cannot modify limit: Operation not permitted
juan@test:~$
At this point we have decreased the hard limit from 4096 to 1027, so this particular process won't be able to open more than 1027 files.
All these changes to the soft and hard limits last only as long as the shell is alive; if we close that shell and open a new one, the default limits come back into play. So how the heck do I make them persistent?

File /etc/security/limits.conf


This is the file used by the pam_limits module to enforce ulimits on all the user sessions on the system. Just by reading the comments in the file you will be able to understand its syntax; for more, check here. I could easily change the default ulimits for user juan by adding, for example:
juan               soft           nofile             2048
This would increase the soft limit for the number of files a process can open. The change will take effect for the next session, not for the current one.

C examples


Just for fun, I wrote a small C program that tries to open 2048 files and aborts if it doesn't succeed. The first version, open_files.c, is here:
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <errno.h>
#define SIZE 2048

int main()
{
        int open_files[SIZE];
        int index=0;
        int i,keep_it;

        for(i=0;i<SIZE;i++)
        {
                printf("Opening file number %d:\n",i);
                open_files[i]=open("/etc/passwd",O_RDONLY);
                keep_it=errno;//we save errno before doing anything else
                if(open_files[i] == -1)
                {
                        printf("%s\n",strerror(keep_it));//we print the system error that corresponds to errno
                        return open_files[i];
                }
                printf("Opened file number %d, assigned FD=%d:\n",i,open_files[i]);
        }
        printf("%d files have been opened.\n",SIZE);

        return 0;
}
If you compile and run it you should see something like:
juan@test:~/ulimit$ ./open_files
Opening file number 0:
Opened file number 0, assigned FD=3:
Opening file number 1:
Opened file number 1, assigned FD=4:
Opening file number 2:
Opened file number 2, assigned FD=5:
Opening file number 3:
Opened file number 3, assigned FD=6:
Opening file number 4:
Opened file number 4, assigned FD=7:
...
Opening file number 1018:
Opened file number 1018, assigned FD=1021:
Opening file number 1019:
Opened file number 1019, assigned FD=1022:
Opening file number 1020:
Opened file number 1020, assigned FD=1023:
Opening file number 1021:
Too many open files
juan@test:~/ulimit$
A few things to take away from the previous run:
  • The first file descriptor returned by the open syscall is 3. Why is that? :D Exactly, because FD 0 is STDIN, FD 1 is STDOUT and FD 2 is STDERR, so the first available file descriptor for a new process is 3.
  • As soon as the process tries to open one more file beyond FD 1023 (file number 1021 in the output), the open call returns -1 and sets errno to EMFILE ("Too many open files"). This is because the soft limit of 1024 open files has been reached.
How could we address this? Well, the easiest way would be changing the soft limit before running the program, but that would allow all processes spawned from that shell to open 2048 files, and we might not want that side effect. So let's change the soft limit inside the C program:
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <sys/resource.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <errno.h>
#define SIZE 2048
#define JUMP 100

int main()
{
        int open_files[SIZE];
        int index=0;
        int i,keep_it,aux;
        struct rlimit old, new;

        for(i=0;i<SIZE;i++)
        {
                printf("Opening file number %d:\n",i);
                open_files[i]=open("/etc/passwd",O_RDONLY);
                keep_it=errno;//we save errno before doing anything else
                if(open_files[i] == -1)
                {
                        if(keep_it == EMFILE)//Too many open files (EMFILE is 24 on Linux)
                        {
                                printf("%s\n",strerror(keep_it));//we print the system error that corresponds to errno
                                printf("Increasing NOFILE in %d\n",JUMP);
                                getrlimit(RLIMIT_NOFILE,&old);
                                printf("Current soft limit %d, current hard limit %d\n",(int)old.rlim_cur,(int)old.rlim_max);
                                new.rlim_max=old.rlim_max;
                                new.rlim_cur=old.rlim_cur+JUMP;
                                aux=setrlimit(RLIMIT_NOFILE,&new);
                                keep_it=errno;
                                if(aux==0)
                                {
                                        i=i-1;//reduce i in 1 to "move back" the loop one cycle.
                                }
                                else
                                {
                                        printf("Couldn't raise the soft limit: %s\n",strerror(keep_it));
                                        return -1;
                                }
                        }
                        else
                        {//some different error
                                return -1;
                        }
                }
                else
                {
                        printf("Opened file number %d, assigned FD=%d:\n",i,open_files[i]);
                }
        }
        printf("%d files have been opened.\n",SIZE);

        return 0;
}

The example will get the current soft and hard limit using getrlimit syscall and then update the soft limit using setrlimit. Two rlimit structures were added to the code, old and new, in order to update the limit. We can see the update is done by adding JUMP to the current limit, in this case adding 100. The rest of the code is pretty much the same :D.

If we run the new code we'll see something like:
juan@test:~/ulimit$ ./open_files_increase_soft
Opening file number 0:
Opened file number 0, assigned FD=3:
Opening file number 1:
Opened file number 1, assigned FD=4:
Opening file number 2:
Opened file number 2, assigned FD=5:
Opening file number 3:
Opened file number 3, assigned FD=6:
Opening file number 4:
Opened file number 4, assigned FD=7:
...
Opening file number 1019:
Opened file number 1019, assigned FD=1022:
Opening file number 1020:
Opened file number 1020, assigned FD=1023:
Opening file number 1021:
Too many open files
Increasing NOFILE in 100
Current soft limit 1024, current hard limit 4096
Opening file number 1021:
Opened file number 1021, assigned FD=1024:
Opening file number 1022:
Opened file number 1022, assigned FD=1025:
...
Opened file number 2043, assigned FD=2046:
Opening file number 2044:
Opened file number 2044, assigned FD=2047:
Opening file number 2045:
Opened file number 2045, assigned FD=2048:
Opening file number 2046:
Opened file number 2046, assigned FD=2049:
Opening file number 2047:
Opened file number 2047, assigned FD=2050:
2048 files have been opened.
juan@test:~/ulimit$
Now the process was able to open 2048 files by increasing its soft limit gradually, on demand.

Wrapping up 


So whenever you work with production systems you need to be aware of these limits, unless of course you enjoy getting paged randomly haha. I've seen production systems go unresponsive because they reached these limits. Bear in mind that when we talk about open files we talk about file descriptors, therefore this limit also applies to network connections, not just files! On top of that, if the application doesn't catch the error and show it in its logs, it can be pretty hard to spot...

Friday, August 19, 2016

Strace 101 - stracing your stuff!

It's been a while since the last blog post, like 4 months went by! I remember once I thought I could write one post per week xD, that did not work at all hahaha. Anyway... over the last few days I had the chance to spend some time working with strace and perf, so I decided to write something about it here as well.

Strace? What the heck is it?

Strace is a nice debugging/troubleshooting tool for Linux that helps you identify the syscalls a particular process is using and the signals a process receives (more details here). Syscalls are basically the interface used to access kernel services. Therefore, knowing which syscalls a program is issuing and what the results of those calls are is really interesting when debugging software problems.

But how is it possible that a user space process is able to see the syscalls another user space process issues? Well, there's a kernel feature called ptrace that makes that possible (yeah... ptrace is also a system call xD). By definition ptrace is:

The ptrace() system call provides a means by which one process (the "tracer") may observe and control the execution of another process (the "tracee"), and examine and change the tracee's memory and registers. It is primarily used to implement breakpoint debugging and system call tracing. 

There are basically 2 ways to strace a process and see its syscalls: you can either launch the process using strace or attach strace to a running process (under certain conditions :D).
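Just to show the mechanism in its simplest form, here is a minimal, hedged sketch of a tracer built directly on ptrace (it assumes x86_64 Linux and traces /bin/ls as an arbitrary example): the child asks to be traced and execs the program, while the parent stops it at every syscall and prints the syscall number (each number shows up twice, once on entry and once on exit):
#include <stdio.h>
#include <unistd.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <sys/user.h>

int main()
{
        pid_t child=fork();
        int status;
        struct user_regs_struct regs;

        if(child==0)
        {//Tracee: allow the parent to trace us, then run the target program
                ptrace(PTRACE_TRACEME,0,NULL,NULL);
                execl("/bin/ls","ls",(char *)NULL);
                return -1;//only reached if execl() failed
        }
        waitpid(child,&status,0);//the child stops right after execve
        while(!WIFEXITED(status))
        {
                ptrace(PTRACE_SYSCALL,child,NULL,NULL);//resume until the next syscall stop
                waitpid(child,&status,0);
                if(WIFEXITED(status))
                        break;
                if(ptrace(PTRACE_GETREGS,child,NULL,&regs)==0)
                        printf("syscall %lld\n",(long long)regs.orig_rax);//x86_64: syscall number is in orig_rax
        }
        return 0;
}
Strace does essentially this, plus decoding the syscall numbers, arguments and return values into the readable output we'll see next.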

How does it work? Show me the money!

I strongly believe the best way to explain something is by showing working examples, so here we go... Let's see how many syscalls (and which ones) the classic "Hello World" issues. The C code is:
#include <stdio.h>
int main()
{
     printf("Hello World\n");
     return 0;
}
Compile and run:
juan@juan-VirtualBox:~$ gcc -o classic classic.c 
juan@juan-VirtualBox:~$ ./classic 
Hello World
juan@juan-VirtualBox:~$ 
now we run it using strace instead:
juan@juan-VirtualBox:~$ strace ./classic
execve("./classic", ["./classic"], [/* 60 vars */]) = 0
brk(0)  = 0x7f0000
access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f0362d17000
access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=85679, ...}) = 0
mmap(NULL, 85679, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f0362d02000
close(3) = 0
access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
open("/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0P \2\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=1840928, ...}) = 0
mmap(NULL, 3949248, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f0362732000
mprotect(0x7f03628ec000, 2097152, PROT_NONE) = 0
mmap(0x7f0362aec000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1ba000) = 0x7f0362aec000
mmap(0x7f0362af2000, 17088, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f0362af2000
close(3) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f0362d01000
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f0362cff000
arch_prctl(ARCH_SET_FS, 0x7f0362cff740) = 0
mprotect(0x7f0362aec000, 16384, PROT_READ) = 0
mprotect(0x600000, 4096, PROT_READ) = 0
mprotect(0x7f0362d19000, 4096, PROT_READ) = 0
munmap(0x7f0362d02000, 85679) = 0
fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 2), ...}) = 0 
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f0362d16000
write(1, "Hello World\n", 12Hello World
) = 12
exit_group(0) = ?
+++ exited with 12 +++
juan@juan-VirtualBox:~$
The output provides one system call per line; each line includes the syscall name, the parameters used to invoke it and the result returned (after the =). As we can see, even the simplest piece of code you can imagine will actually make use of many syscalls. I will only comment on a few of them (probably in a future post we can dive deeper there :D):
  • execve (the first one): just before this call strace forked itself and the child process called the ptrace syscall, allowing the strace parent to trace it. After all that, the child runs execve, replacing its running code with our classic C program (there's a toy sketch of this dance right after this list).
  • brk changes the location of the program break, which defines the end of the process's data segment. Increasing it allocates memory to the process, while decreasing it deallocates memory. In this case it is called with an increment of 0, which makes the call return the current program break (0x7f0000).
  • we see a couple of mmap syscalls mapping anonymous memory regions and some others mapping the libc.so.6 library. This is the dynamic linker doing its job, adding all the necessary libraries to the process memory space.
  • there are also a few open syscalls, opening files like /etc/ld.so.cache where it can find a list of the available system libraries.
  • just before finishing we see a write syscall, sending our classic "Hello World" to file descriptor 1, also known as STDOUT (standard output). Since both strace and classic send their standard output to the console we can see how they collided on lines 29 and 30.
  • the last call was exit_group; it's equivalent to the exit syscall but it terminates not only the calling thread but all the threads in the thread group (this particular example was single threaded).
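To make that first bullet a bit more tangible, here is a toy tracer in C. It's nowhere near what strace actually does (no argument decoding, no error handling, x86_64 only, and every syscall shows up twice because the tracee stops at both syscall entry and exit), but it shows the fork + PTRACE_TRACEME + execvp + PTRACE_SYSCALL dance:

#include <stdio.h>
#include <unistd.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/user.h>
#include <sys/wait.h>

int main(int argc, char *argv[])
{
        pid_t child;

        if (argc < 2) {
                fprintf(stderr, "usage: %s program [args...]\n", argv[0]);
                return 1;
        }

        child = fork();
        if (child == 0) {
                ptrace(PTRACE_TRACEME, 0, NULL, NULL);  /* ask to be traced by the parent */
                execvp(argv[1], &argv[1]);              /* the kernel stops us here and the parent takes over */
                perror("execvp");
                _exit(1);
        }

        while (1) {
                int status;
                struct user_regs_struct regs;

                waitpid(child, &status, 0);             /* wait for the next tracee stop */
                if (WIFEXITED(status))
                        break;
                ptrace(PTRACE_GETREGS, child, NULL, &regs);
                printf("syscall %llu\n", (unsigned long long)regs.orig_rax); /* x86_64 keeps the nr in orig_rax */
                ptrace(PTRACE_SYSCALL, child, NULL, NULL); /* resume until the next syscall entry/exit stop */
        }
        return 0;
}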
All of the above should provide a fair idea of what a particular piece of software does and which kernel services it's accessing. However, sometimes we don't want to go into so much detail and would rather see a summary. We can easily get that with the -c flag:
juan@juan-VirtualBox:~$ strace -c ./classic 
Hello World
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
  0.00    0.000000           0         1           read
  0.00    0.000000           0         1           write
  0.00    0.000000           0         2           open
  0.00    0.000000           0         2           close
  0.00    0.000000           0         3           fstat
  0.00    0.000000           0         8           mmap
  0.00    0.000000           0         4           mprotect
  0.00    0.000000           0         1           munmap
  0.00    0.000000           0         1           brk
  0.00    0.000000           0         3         3 access
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         1           arch_prctl
------ ----------- ----------- --------- --------- ----------------
100.00    0.000000                    28         3 total
juan@juan-VirtualBox:~$ 
The output shows the list of syscalls issued by the process (and its threads if there was more than one), the number of times each one was called, the number of times they failed, and the CPU time they consumed (kernel space time, aka system time). With this output we can easily identify that 3 syscalls failed; if we go back to the full strace output we'll see that lines 4, 6 and 11 returned ENOENT "No such file or directory".

I mentioned before that it is also possible to strace an already running process, so let's take a look at that. First we have to identify a target process, in this case nc with PID 8739:
juan@juan-VirtualBox:~$ ps aux|grep -i nc
root       954  0.0  0.0  19196  2144 ?        Ss   01:33   0:04 /usr/sbin/irqbalance
juan      2055  0.0  0.2 355028  8396 ?        Ssl  01:33   0:00 /usr/lib/at-spi2-core/at-spi-bus-launcher --launch-immediately
juan      8739  0.0  0.0   9132   800 pts/0    S    11:52   0:00 nc -l 9999
juan      2585  0.0  0.0  24440  1964 ?        S    01:33   0:00 dbus-launch --autolaunch 0c0058daf07f369dd9b0d1605654eff1 --binary-syntax --close-stderr
juan      9477  0.0  0.0  15948  2304 pts/2    R+   14:03   0:00 grep --color=auto -i nc
juan@juan-VirtualBox:~$ 
Now let's try to attach strace to it in order to inspect the syscalls:
juan@juan-VirtualBox:~$ strace -p 8739
strace: attach: ptrace(PTRACE_ATTACH, ...): Operation not permitted
Could not attach to process. If your uid matches the uid of the target
process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
again as the root user. For more details, see /etc/sysctl.d/10-ptrace.conf
juan@juan-VirtualBox:~$ 
Interesting :D. It turns out that by default Ubuntu doesn't allow attaching to arbitrary processes with ptrace. Why? We'll see an example later :P. Ptrace has different scopes (4 actually, 0 to 3, with 3 being the most restrictive one):

A PTRACE scope of "0" is the more permissive mode.  A scope of "1" limits PTRACE only to direct child processes (e.g. "gdb name-of-program" and "strace -f name-of-program" work, but gdb's "attach" and "strace -fp $PID" do not). The PTRACE scope is ignored when a user has CAP_SYS_PTRACE, so "sudo strace -fp $PID" will work as before.

If we take a look at the current ptrace_scope value we'll see we are at scope 1:
juan@juan-VirtualBox:~$ cat /proc/sys/kernel/yama/ptrace_scope
1
juan@juan-VirtualBox:~$ 
At this point we have two options: we either try with sudo or we relax the restriction system wide by setting /proc/sys/kernel/yama/ptrace_scope to 0 (this might be dangerous). Let's go with sudo for now:
juan@juan-VirtualBox:~$ sudo strace -p 8739
Process 8739 attached
accept(3, {sa_family=AF_INET, sin_port=htons(34404), sin_addr=inet_addr("127.0.0.1")}, [16]) = 4
close(3) = 0
poll([{fd=4, events=POLLIN}, {fd=0, events=POLLIN}], 2, 4294967295) = 1 ([{fd=4, revents=POLLIN}])
read(4, "Hello World through NC\n", 2048) = 23
write(1, "Hello World through NC\n", 23) = 23
poll([{fd=4, events=POLLIN}, {fd=0, events=POLLIN}], 2, 4294967295) = 1 ([{fd=4, revents=POLLIN}])
read(4, "", 2048) = 0
shutdown(4, SHUT_RD) = 0
close(4) = 0
close(3) = -1 EBADF (Bad file descriptor)
close(3) = -1 EBADF (Bad file descriptor)
exit_group(0) = ?
+++ exited with 0 +++
juan@juan-VirtualBox:~$ 
It worked indeed! Let's do a brief review of the syscalls (there's a toy C sketch reproducing this sequence right after the list):

  • First an accept call. It extracts the first connection request from the queue of pending connections for the listening socket (3), creates a new connected socket, and returns a new file descriptor referring to that socket (4). The newly created socket is not in the listening state. The original socket 3 is unaffected by this call.
  • Then a close call closes the listening socket 3.
  • The next call was poll, it waits for one of a set of file descriptors to become ready to perform I/O. In this case we can see it waits for fd 4 (the recent socket created due to the incoming connection) and fd 0 (the standard input).
  • Right after the poll we see the read call, reading 23 bytes from fd 4 using a 2048 byte buffer.
  • After finishing reading, nc uses write to send the received bytes to fd 1, the standard output.
  • Then nc polls again for any extra data coming in; this time the following read returns 0 (end of file), which tells us the connection was closed on the other side.
  • The shutdown call shuts down all or part of a full-duplex connection on a given socket, in this case the one pointed to by fd 4.
  • Then we have 3 close calls, the first one closes the file descriptor used by the socket created by the accept call, while the next two calls try to close a fd that has already been closed on line 4 (fd 3), which is kind of weird and could be a bug.
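For reference, this is a toy C listener (my own sketch, not nc's source) that issues roughly the same syscall sequence we just reviewed: socket/bind/listen, accept, close the listener, then poll on the connection and stdin, read and write:

#include <string.h>
#include <unistd.h>
#include <poll.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

int main(void)
{
        struct sockaddr_in addr;
        char buf[2048];
        ssize_t n;
        int lfd, cfd;

        lfd = socket(AF_INET, SOCK_STREAM, 0);          /* the listening socket, fd 3 in the trace */
        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons(9999);                    /* same port as the nc example */
        bind(lfd, (struct sockaddr *)&addr, sizeof(addr));
        listen(lfd, 1);

        cfd = accept(lfd, NULL, NULL);                  /* shows up as accept(3, ...) = 4 */
        close(lfd);                                     /* nc also closes the listener here */

        for (;;) {
                struct pollfd fds[2] = {
                        { .fd = cfd, .events = POLLIN },
                        { .fd = 0,   .events = POLLIN },        /* stdin, just like nc */
                };

                poll(fds, 2, -1);                       /* block until something is readable */
                if (fds[0].revents & POLLIN) {
                        n = read(cfd, buf, sizeof(buf));
                        if (n <= 0)                     /* 0 means the peer closed the connection */
                                break;
                        write(1, buf, n);               /* echo it to stdout */
                }
                if (fds[1].revents & POLLIN) {
                        n = read(0, buf, sizeof(buf));
                        if (n <= 0)
                                break;
                        write(cfd, buf, n);             /* data typed locally goes to the peer */
                }
        }
        shutdown(cfd, SHUT_RD);
        close(cfd);
        return 0;
}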

Why ptrace could be dangerous?

Usually debugging tools are double-edged swords, right? Well, ptrace is no exception to that rule. Having access to the interface between user space and kernel space of a process can leak important information, like credentials.

Let's see an extremely simple example. My virtual machine has vsftpd 3.0.2 running, so let's capture the credentials of a system user that logs into the FTP service. In this case we'll set a few extra flags on strace to make things easier:

  • -f traces child processes as they are created by currently traced processes as a result of the fork system call.
  • -e trace=read,write is a filter that tells strace to record only the read and write syscalls.
  • -o sets an output file where the syscalls will be recorded.
So let's strace:
juan@juan-VirtualBox:~$ sudo strace -f -e trace=read,write -o output -p $(pidof vsftpd)
Process 10040 attached
Process 10280 attached
Process 10281 attached
Process 10282 attached
^CProcess 10040 detached
juan@juan-VirtualBox:~$ 
We can see vsftpd forked a couple of times while we connected to it using the ftp client. Now let's take a look at the content of the output file:
juan@juan-VirtualBox:~$ cat output 
10280 read(3, "TZif2\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\n\0\0\0\n\0\0\0\0"..., 4096) = 3533
10280 read(3, "TZif2\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\v\0\0\0\v\0\0\0\0"..., 4096) = 2248
10280 read(4,  
10281 read(4, "# /etc/nsswitch.conf\n#\n# Example"..., 4096) = 507
10281 read(4, "", 4096)                 = 0
10281 read(4, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\260\23\0\0\0\0\0\0"..., 832) = 832
10281 read(4, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\240!\0\0\0\0\0\0"..., 832) = 832
10281 read(4, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\240\"\0\0\0\0\0\0"..., 832) = 832
10281 write(3, "Sat Aug 20 17:06:02 2016 [pid 10"..., 65) = 65
10281 write(0, "220 (vsFTPd 3.0.2)\r\n", 20) = 20
10281 read(0, "USER juan\r\n", 11)      = 11
10281 write(0, "331 Please specify the password."..., 34) = 34
10281 read(0, "PASS MyPassw0rd\r\n", 15)  = 15
10281 write(5, "\1", 1)                 = 1
10280 <... read resumed> "\1", 1)       = 1
10281 write(5, "\4\0\0\0", 4 
10280 read(4,  
10281 <... write resumed> )             = 4
10280 <... read resumed> "\4\0\0\0", 4) = 4
10281 write(5, "juan", 4 
10280 read(4,  
...
10282 write(0, "230 Login successful.\r\n", 23) = 23
10282 read(0, "SYST\r\n", 6)            = 6
10282 write(0, "215 UNIX Type: L8\r\n", 19) = 19
10282 read(0, "QUIT\r\n", 6)            = 6
10282 write(0, "221 Goodbye.\r\n", 14)  = 14
10282 +++ exited with 0 +++
10280 <... read resumed> 0x7ffe4994960f, 1) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
10280 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=10282, si_status=0, si_utime=0, si_stime=0} ---
10280 +++ killed by SIGSYS +++
10040 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_KILLED, si_pid=10280, si_status=SIGSYS, si_utime=0, si_stime=1} ---
juan@juan-VirtualBox:~$ 
As we anticipated, we can see both the username and the password on lines 12 and 14. Ok, yes, this is FTP and we could have captured the credentials with a simple network capture as well, but this is just an example :D of how strace (ptrace actually) can be used to leak sensitive information.

I hope this was interesting, at least it was for me xD. Here I list some interesting links I found on the way:

http://linux.die.net/man/1/strace
http://linux.die.net/man/2/ptrace
http://man7.org/linux/man-pages/man2/syscalls.2.html
https://www.kernel.org/doc/Documentation/security/Yama.txt

lunes, 4 de abril de 2016

cgroups 101 - keep your processes under control!

A few days ago, reading a bit about systemd and all its beauty, I came across Control Groups, aka cgroups. I had read about them before, however I never had the chance to play around with them face to face.

Cgroups are a kernel feature that allows sysadmins and human beings (:P) to group processes/tasks in order to assign system resources like CPU, memory, IO, etc. in a more fine-grained way. Cgroups can be arranged in a hierarchical manner, and every process in the system belongs to exactly one cgroup per hierarchy at any point in time. In conjunction with Linux namespaces (next post :D), cgroups are a cornerstone for things like Docker containers.

So how do cgroups provide access/accounting control over CPU, memory, etc.?

 

There's another concept involved in cgroups: subsystems. A subsystem is the resource scheduler/controller in charge of setting the limits for a particular resource. Some of the most common subsystems are:

  • cpuset: this subsystem provides the possibility to assign a group of tasks to a particular CPU and Memory Node, this is particularly interesting in NUMA systems.
  • memory: this subsystem, as you have probably guessed, allows you to control the amount of memory a group of tasks can utilize.
  • freezer: this subsystem allows you to easily freeze a group of processes, and eventually unfreeze them later on so they can continue running.
  • blkio: yes, that's correct, this subsystem allows you to define IO limits for processes; the proportional weight part relies on the CFQ IO scheduler, while the throttling part works at the generic block layer.

You can get a full list of the supported subsystems on your system from /proc/cgroups file:

root@ubuntu-server:/etc# cat /proc/cgroups
#subsys_name    hierarchy       num_cgroups     enabled
cpuset  0       1       1
cpu     0       1       1
cpuacct 0       1       1
memory  0       1       1
devices 0       1       1
freezer 0       1       1
net_cls 0       1       1
blkio   0       1       1
perf_event      0       1       1
net_prio        0       1       1
hugetlb 0       1       1
root@ubuntu-server:/etc#


this is what it looks like on an Ubuntu 14.04.4 VM.

Ok, so far so good, but how can we access these so called cgroups and subsystems?

 

Fortunately the cgroups interface is accessible through its virtual file system representation (there are also a couple of fancy tools available on RHEL systems). We can see, for example, the default cgroups setup that comes with Ubuntu 14.04.4:

root@ubuntu-server:/etc# mount |grep cgroup
none on /sys/fs/cgroup type tmpfs (rw)
systemd on /sys/fs/cgroup/systemd type cgroup (rw,noexec,nosuid,nodev,none,name=systemd)
root@ubuntu-server:/etc#


Note: there's a systemd "device" mounted on /sys/fs/cgroup/systemd and the file system type is cgroup. No subsystems are included in this cgroup hierarchy.

But in order to start from scratch I'll build a new cgroup hierarchy, ignoring the one that comes by default.

I created a new directory under the same tmpfs:

root@ubuntu-server:/etc# mkdir /sys/fs/cgroup/MyHierarchy
root@ubuntu-server:/etc# ls /sys/fs/cgroup/
MyHierarchy  systemd
root@ubuntu-server:/etc#


then mounted a cgroup hierarchy on the new directory:

root@ubuntu-server:/etc# mount -t cgroup MyHierarchy -o none,name=MyHierarchy /sys/fs/cgroup/MyHierarchy/
root@ubuntu-server:/etc# mount |grep cgroup
none on /sys/fs/cgroup type tmpfs (rw)
systemd on /sys/fs/cgroup/systemd type cgroup (rw,noexec,nosuid,nodev,none,name=systemd)
MyHierarchy on /sys/fs/cgroup/MyHierarchy type cgroup (rw,none,name=MyHierarchy)
root@ubuntu-server:/etc#


Note: -o none will cause the mount not to include any subsystem on the hierarchy.

So, is that it? Well..., no xD, we've done nothing so far but create a cgroup hierarchy. Let's take a look at the files under it:

root@ubuntu-server:/etc# ls /sys/fs/cgroup/MyHierarchy/
cgroup.clone_children  cgroup.procs  cgroup.sane_behavior  notify_on_release  release_agent  tasks
root@ubuntu-server:/etc#


These are the basic files for a cgroup. They describe, for example, which processes belong to it (tasks and cgroup.procs) and what command should be executed after the last task leaves the cgroup (notify_on_release, release_agent). The most interesting one is the tasks file, which keeps the list of processes that belong to the cgroup; remember that by default all processes will be listed there since this is a root cgroup:

root@ubuntu-server:/etc# head /sys/fs/cgroup/MyHierarchy/tasks
1
2
3
5
7
8
9
10
11
12
root@ubuntu-server:/etc# wc -l /sys/fs/cgroup/MyHierarchy/tasks
132 /sys/fs/cgroup/MyHierarchy/tasks
root@ubuntu-server:/etc#


Note: Process IDs can show up more than once and not in order, however a particular process will belong to a single cgroup within the same hierarchy.

We can easily create a sub cgroup (child cgroup) by creating a folder under the root one:

root@ubuntu-server:/etc# mkdir /sys/fs/cgroup/MyHierarchy/SubCgroup1
root@ubuntu-server:/etc# ls /sys/fs/cgroup/MyHierarchy/SubCgroup1/
cgroup.clone_children  cgroup.procs  notify_on_release  tasks
root@ubuntu-server:/etc#


This new cgroup again has similar files, but no tasks are associated with it by default:

root@ubuntu-server:/etc# cat /sys/fs/cgroup/MyHierarchy/SubCgroup1/tasks
root@ubuntu-server:/etc#


We can easily move a task to this new cgroup by just writing the task PID into the tasks file:

root@ubuntu-server:/etc# echo $$
4826
root@ubuntu-server:/etc# ps aux|grep 4826
root      4826  0.0  0.7  21332  3952 pts/0    S    Apr02   0:00 bash
root     17971  0.0  0.4  11748  2220 pts/0    S+   04:30   0:00 grep --color=auto 4826
root@ubuntu-server:/etc# echo $$ > /sys/fs/cgroup/MyHierarchy/SubCgroup1/tasks
root@ubuntu-server:/etc# cat /sys/fs/cgroup/MyHierarchy/SubCgroup1/tasks
4826
17979
root@ubuntu-server:/etc#


In the previous example I moved the root bash process to SubCgroup1, and you can see its PID inside the tasks file. But there's another PID there as well, why? That PID belongs to the cat command: any forked process is assigned to the same cgroup its parent belongs to. We can also confirm PID 4826 doesn't belong to the root cgroup anymore:

root@ubuntu-server:/etc# grep 4826 /sys/fs/cgroup/MyHierarchy/tasks
root@ubuntu-server:/etc#
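The same move can be done from C instead of the shell. Here is a minimal sketch (using the hierarchy created above, error handling kept to a minimum) that writes its own PID into the SubCgroup1 tasks file and then execs a command, so the command and anything it forks stay in that cgroup:

#include <stdio.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
        const char *tasks = "/sys/fs/cgroup/MyHierarchy/SubCgroup1/tasks";
        FILE *f;

        if (argc < 2) {
                fprintf(stderr, "usage: %s command [args...]\n", argv[0]);
                return 1;
        }

        f = fopen(tasks, "w");
        if (f == NULL) {
                perror("fopen tasks");          /* needs root, and the hierarchy must be mounted */
                return 1;
        }
        fprintf(f, "%d\n", getpid());           /* same effect as: echo $$ > .../tasks */
        fclose(f);

        execvp(argv[1], &argv[1]);              /* the command (and its children) stay in SubCgroup1 */
        perror("execvp");
        return 1;
}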

 

What if we want to know which cgroup a particular process belongs to?

 

We can easily find that out from our lovely /proc:

root@ubuntu-server:/etc# cat /proc/4826/cgroup
3:name=MyHierarchy:/SubCgroup1
1:name=systemd:/user/1000.user/1.session
root@ubuntu-server:/etc#


Note: see how our shell belongs to two different cgroups; this is possible because they belong to different hierarchies.

Ok, by themselves cgroups just segregate tasks into groups; they become really powerful when combined with the subsystems.

Testing a few Subsystems:

 

I rolled back the hierarchy I mounted before using umount, just as with any other file system, so we are back to square one:

root@ubuntu-server:/etc# umount /sys/fs/cgroup/MyHierarchy/
root@ubuntu-server:/etc# mount |grep cgroup
none on /sys/fs/cgroup type tmpfs (rw)
systemd on /sys/fs/cgroup/systemd type cgroup (rw,noexec,nosuid,nodev,none,name=systemd)
root@ubuntu-server:/etc#


I mounted the hierarchy again, but this time enabling cpu, memory and blkio subsystems, like this:

root@ubuntu-server:/etc# mount -t cgroup MyHierarchy -o cpu,memory,blkio,name=MyHierarchy /sys/fs/cgroup/MyHierarchy
mount: MyHierarchy already mounted or /sys/fs/cgroup/MyHierarchy busy
root@ubuntu-server:/etc# 


WTF??? According to the error the mount point is busy or still mounted... well, it turns out that

When a cgroup filesystem is unmounted, if there are any child cgroups created 
below the top-level cgroup, that hierarchy will remain active even though 
unmounted; if there are no child cgroups then the hierarchy will be deactivated.

So, I have to either move process 4826 and its children back to the root cgroup or kill 'em all!!! Of course killing them was the easy way out. With that done, voilà:

root@ubuntu-server:/home/juan# mount -v -t cgroup MyHierarchy -o cpu,memory,blkio,name=MyHierarchy /sys/fs/cgroup/MyHierarchy
MyHierarchy on /sys/fs/cgroup/MyHierarchy type cgroup (rw,cpu,memory,blkio,name=MyHierarchy)
root@ubuntu-server:/home/juan#


Now the hierarchy includes 3 subsystems: cpu, memory and blkio. Let's see what it looks like now:

root@ubuntu-server:/home/juan# ls /sys/fs/cgroup/MyHierarchy/
blkio.io_merged                   blkio.sectors_recursive           cpu.cfs_quota_us                    memory.move_charge_at_immigrate
blkio.io_merged_recursive         blkio.throttle.io_service_bytes   cpu.shares                          memory.numa_stat
blkio.io_queued                   blkio.throttle.io_serviced        cpu.stat                            memory.oom_control
blkio.io_queued_recursive         blkio.throttle.read_bps_device    memory.failcnt                      memory.pressure_level
blkio.io_service_bytes            blkio.throttle.read_iops_device   memory.force_empty                  memory.soft_limit_in_bytes
blkio.io_service_bytes_recursive  blkio.throttle.write_bps_device   memory.kmem.failcnt                 memory.stat
blkio.io_serviced                 blkio.throttle.write_iops_device  memory.kmem.limit_in_bytes          memory.swappiness
blkio.io_serviced_recursive       blkio.time                        memory.kmem.max_usage_in_bytes      memory.usage_in_bytes
blkio.io_service_time             blkio.time_recursive              memory.kmem.slabinfo                memory.use_hierarchy
blkio.io_service_time_recursive   blkio.weight                      memory.kmem.tcp.failcnt             notify_on_release
blkio.io_wait_time                blkio.weight_device               memory.kmem.tcp.limit_in_bytes      release_agent
blkio.io_wait_time_recursive      cgroup.clone_children             memory.kmem.tcp.max_usage_in_bytes  SubCgroup1
blkio.leaf_weight                 cgroup.event_control              memory.kmem.tcp.usage_in_bytes      tasks
blkio.leaf_weight_device          cgroup.procs                      memory.kmem.usage_in_bytes
blkio.reset_stats                 cgroup.sane_behavior              memory.limit_in_bytes
blkio.sectors                     cpu.cfs_period_us                 memory.max_usage_in_bytes
root@ubuntu-server:/home/juan#


yeah... many files, right? We should find a few files per active subsystem:

root@ubuntu-server:/home/juan# ls /sys/fs/cgroup/MyHierarchy/|grep -c "cpu\."
4
root@ubuntu-server:/home/juan# ls /sys/fs/cgroup/MyHierarchy/|grep -c "memory\."
22
root@ubuntu-server:/home/juan# ls /sys/fs/cgroup/MyHierarchy/|grep -c "blkio\."
27
root@ubuntu-server:/home/juan#


These files are the ones we can tune, and of course I'm not going to explain all of them (not that I know them all, to be honest xD). Something interesting is that some parameters can't be applied to the root cgroup, for quite obvious reasons: remember that by default all the processes belong to the root cgroup of the hierarchy, and you certainly don't want to throttle some critical system processes.

So for testing purposes I'll set up a second child cgroup called SubCgroup2, and I will set two different throttle limits on write operations to the root volume /dev/sda. First things first, I need to identify the major and minor numbers of the device:

root@ubuntu-server:/home/juan# lsblk
NAME   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda      8:0    0     5G  0 disk
├─sda1   8:1    0   4.5G  0 part /
├─sda2   8:2    0     1K  0 part
└─sda5   8:5    0   510M  0 part
sr0     11:0    1  1024M  0 rom
root@ubuntu-server:/home/juan#

Ok, 8 and 0 should do the trick here. Now we set the throttle using the file called blkio.throttle.write_bps_device, like this:

root@ubuntu-server:/home/juan# echo "8:0 10240000" > /sys/fs/cgroup/MyHierarchy/SubCgroup1/blkio.throttle.write_bps_device
root@ubuntu-server:/home/juan# echo "8:0 20480000" > /sys/fs/cgroup/MyHierarchy/SubCgroup2/blkio.throttle.write_bps_device
root@ubuntu-server:/home/juan#


Note that I've throttled the processes under SubCgroup1 to 10240000 bytes per second and the processes under SubCgroup2 to 20480000 bytes per second. In order to test this I've opened two new shells:

root@ubuntu-server:/home/juan# ps aux|grep bash
juan      1440  0.0  1.0  22456  5152 pts/0    Ss   22:28   0:00 -bash
root      2389  0.0  0.7  21244  3984 pts/0    S    22:35   0:00 bash
juan      6897  0.0  0.9  22456  4996 pts/2    Ss+  23:49   0:00 -bash
juan      6961  0.1  1.0  22456  5076 pts/3    Ss+  23:49   0:00 -bash

root      7276  0.0  0.4  11748  2132 pts/0    S+   23:50   0:00 grep --color=auto bash
root@ubuntu-server:/home/juan#


and pushed their PIDs to the cgroups:

root@ubuntu-server:/home/juan# echo 6897 > /sys/fs/cgroup/MyHierarchy/SubCgroup1/tasks
root@ubuntu-server:/home/juan# echo 6961 > /sys/fs/cgroup/MyHierarchy/SubCgroup2/tasks
root@ubuntu-server:/home/juan#


Now let's see what happens when doing some intensive writes to the drive:
  • Shell under SubCgroup1:
juan@ubuntu-server:~$ echo $$
6897

juan@ubuntu-server:~$ dd oflag=dsync if=/dev/zero of=test bs=1M count=512
512+0 records in
512+0 records out
536870912 bytes (537 MB) copied, 52.6549 s, 10.2 MB/s
juan@ubuntu-server:~$

  • Shell under SubCgroup2:
juan@ubuntu-server:~$ echo $$
6961
juan@ubuntu-server:~$ dd oflag=dsync if=/dev/zero of=test bs=1M count=512
512+0 records in
512+0 records out
536870912 bytes (537 MB) copied, 26.3397 s, 20.4 MB/s
juan@ubuntu-server:~$

  •  Shell under the root cgroup (no throttling here :D):
juan@ubuntu-server:~$ dd oflag=dsync if=/dev/zero of=test bs=1M count=512
512+0 records in
512+0 records out
536870912 bytes (537 MB) copied, 3.98033 s, 135 MB/s
juan@ubuntu-server:~$


Cool, isn't it? The throttle worked fine: dd under shell 6897 was throttled at 10 MB/s while the one under shell 6961 was throttled at 20 MB/s. But what if two processes under the same cgroup try to write at the same time, how does the throttle work then?

juan@ubuntu-server:~$ echo $$
6897

juan@ubuntu-server:~$ dd oflag=dsync if=/dev/zero of=test bs=1M count=512 & dd oflag=dsync if=/dev/zero of=test1 bs=1M count=512 &
[1] 8265
[2] 8266
juan@ubuntu-server:~$ 512+0 records in
512+0 records out
536870912 bytes (537 MB) copied, 105.142 s, 5.1 MB/s
512+0 records in
512+0 records out
536870912 bytes (537 MB) copied, 105.24 s, 5.1 MB/s

[1]-  Done                    dd oflag=dsync if=/dev/zero of=test bs=1M count=512
[2]+  Done                    dd oflag=dsync if=/dev/zero of=test1 bs=1M count=512
juan@ubuntu-server:~$


The throughput limit is shared among the processes under the same cgroup, which makes perfect sense considering the limit is applied to the group as a whole: the 10240000 bytes per second assigned to SubCgroup1, split between the two dd processes, gives roughly 5 MB/s each, which is exactly what we see. There are tons of different parameters to play with in the blkio subsystem, like weights, sectors, service time, etc. So be brave and have fun with them :P.

Last but not least, let's take a look at the cpu subsystem. This subsystem allows you to put some limits on CPU utilization; here's the list of files you can use to tune it:

root@ubuntu-server:/home/juan# ls /sys/fs/cgroup/MyHierarchy/cpu.*
/sys/fs/cgroup/MyHierarchy/cpu.cfs_period_us
/sys/fs/cgroup/MyHierarchy/cpu.cfs_quota_us
/sys/fs/cgroup/MyHierarchy/cpu.shares
/sys/fs/cgroup/MyHierarchy/cpu.stat
root@ubuntu-server:/home/juan#


For the sake of me going to bed early I will only test the cpu.shares feature. This value defines a relative weight that the processes under a cgroup have compared to processes in other cgroups, which directly impacts the amount of CPU time those processes can get. For example, let's take the default value for the root cgroup:

root@ubuntu-server:/home/juan# cat /sys/fs/cgroup/MyHierarchy/cpu.shares
1024
root@ubuntu-server:/home/juan#


This means all the processes in this cgroup have that particular weight, so if we set the following weights on SubCgroup1 and SubCgroup2:

root@ubuntu-server:/home/juan# echo 512 > /sys/fs/cgroup/MyHierarchy/SubCgroup1/cpu.shares
root@ubuntu-server:/home/juan# echo 256 > /sys/fs/cgroup/MyHierarchy/SubCgroup2/cpu.shares
root@ubuntu-server:/home/juan#

what we are saying is that processes under SubCgroup1 will get half the CPU time of processes in the root cgroup and twice that of processes under SubCgroup2. This is easy to see in the following ps output:

root@ubuntu-server:/home/juan# ps aux --sort=-pcpu|head
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
juan     23289 58.0  0.5   8264  2616 pts/2    R    04:02   0:11 dd if=/dev/zero of=/dev/null bs=1M count=512000000
juan     23288 32.4  0.5   8264  2616 pts/5    R    04:02   0:06 dd if=/dev/zero of=/dev/null bs=1M count=512000000
juan     23290 14.5  0.5   8264  2680 pts/0    R    04:02   0:02 dd if=/dev/zero of=/dev/null bs=1M count=512000000

root         1  0.0  0.5  33492  2772 ?        Ss   Apr03   0:00 /sbin/init
root         2  0.0  0.0      0     0 ?        S    Apr03   0:00 [kthreadd]
root         3  0.0  0.0      0     0 ?        S    Apr03   0:00 [ksoftirqd/0]
root         5  0.0  0.0      0     0 ?        S<   Apr03   0:00 [kworker/0:0H]
root         6  0.0  0.0      0     0 ?        S    Apr03   0:02 [kworker/u2:0]
root         7  0.0  0.0      0     0 ?        S    Apr03   0:01 [rcu_sched]
root@ubuntu-server:/home/juan# cat /proc/23288/cgroup
4:cpu,memory,blkio,name=MyHierarchy:/SubCgroup1
1:name=systemd:/user/1000.user/5.session
root@ubuntu-server:/home/juan# cat /proc/23289/cgroup
4:cpu,memory,blkio,name=MyHierarchy:/
1:name=systemd:/user/1000.user/7.session
root@ubuntu-server:/home/juan# cat /proc/23290/cgroup
4:cpu,memory,blkio,name=MyHierarchy:/SubCgroup2
1:name=systemd:/user/1000.user/6.session
root@ubuntu-server:/home/juan# 



We can see from the CPU utilization column how the process under the root cgroup (23289) is using 58% of the CPU, while the process under SubCgroup1 (23288) is using 32.4% and the one in SubCgroup2 (23290) 14.5%.
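As a quick sanity check (assuming those three dd processes are the only CPU-hungry tasks competing for the single CPU), the shares 1024:512:256 should split the CPU roughly as 1024/1792 ≈ 57%, 512/1792 ≈ 29% and 256/1792 ≈ 14%, which is pretty close to the 58%, 32.4% and 14.5% reported by ps.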



Wrapping up:

 

Cgroups are awesome! They provide a simple interface to a set of resource control capabilities that you can leverage on your Linux systems. There are many subsystems you can use, so you will surely find the right one for your use case, no matter how weird it is xD. If you can choose... go with some RHEL-like distribution, since they come with a set of scripts that can make your life way easier when it comes to handling cgroups; if you can't... be patient and have fun with mount/umount hahaha.