As its name suggests, a process in uninterruptible sleep (D in ps/top/etc.) cannot be interrupted, hence it can't be killed either! And unlike a process in Z state, it might actually be holding on to memory you can't get back!
How do you get into D state?
Most of the time people associate processes in D state with IO activity, and that's not wrong; it's exactly what man ps says:
PROCESS STATE CODES
Here are the different values that the s, stat and state output specifiers (header "STAT" or "S") will display to describe the state of a process:
  D    uninterruptible sleep (usually IO)
  R    running or runnable (on run queue)
  S    interruptible sleep (waiting for an event to complete)
  T    stopped, either by a job control signal or because it is being traced
  W    paging (not valid since the 2.6.xx kernel)
  X    dead (should never be seen)
  Z    defunct ("zombie") process, terminated but not reaped by its parent
However, processes that go into D state due to IO usually come back to R once the IO operation that was holding them completes (unless something goes nuts). Now, there are some scenarios that don't involve IO and can also lead to D state, for example when using serialization mechanisms like mutexes. When you try to acquire a mutex that is already locked you have two options: you either go into interruptible sleep (S) or into uninterruptible sleep (D). Which one happens depends on what the code is supposed to do and on whether it wants to be able to receive signals and such while waiting for the mutex to become available. If you are curious about mutexes in the kernel, have a look at this or a shorter and perhaps more human version here.
So how do you detect a process in D state?
That's rather simple! You can use the regular tools like ps and top and look for the State field/column. For example:
juan@test:~$ ps aux|grep test
root     20592  0.0  0.0   4220   656 pts/2    S+   21:36   0:00 ./test
root     20593  0.0  0.0   4216   640 pts/24   D+   21:36   0:00 ./test
juan     20982  0.0  0.0  15964  2268 pts/27   R+   23:26   0:00 grep --color=auto test
juan@test:~$
in this case, process 20593 is in D+ state, i.e. uninterruptible sleep, and it is a foreground process.
What to do with a process in D state? How to know what is actually blocking it:
This is where things get interesting. As I mentioned before, D state is usually temporary: most of the time, if you see a process in that state, as soon as you run ps again the process will have changed to either R or S. BUT... if you believe the process is simply refusing to leave D state, then what?
Lots of people answer this question with "I kill the process"... well, that's exactly why it's called UNinterruptible: it will absolutely ignore your most violent attempts to kill it. Any signal you send to the process will be queued and delivered eventually, once the process leaves D state... if it ever does, of course.
Fortunately there are a few things that can help you debug this lovely situation and prevent it from happening in the future.
Have a look at the process stack:
This is a good initial approach: by having a look at the process stack you can get an idea of which code path took the process into D state. Luckily, looking at the stack isn't that hard nowadays; you can do it through /proc/<PID>/stack:
juan@test:~$ sudo cat /proc/20593/stack
[<ffffffffc04ee045>] dev_open+0x15/0x40 [ebbcharmutex]
[<ffffffff811f0a7f>] chrdev_open+0x9f/0x1d0
[<ffffffff811e9527>] do_dentry_open+0x1f7/0x340
[<ffffffff811ead87>] vfs_open+0x57/0x60
[<ffffffff811faccc>] do_last+0x4ec/0x1190
[<ffffffff811fb9f0>] path_openat+0x80/0x600
[<ffffffff811fd10a>] do_filp_open+0x3a/0x90
[<ffffffff811eb109>] do_sys_open+0x129/0x280
[<ffffffff811eb27e>] SyS_open+0x1e/0x20
[<ffffffff817b788d>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
juan@test:~$
Nice! So the process is blocked in kernel space, having entered through an open syscall (SyS_open+0x1e/0x20) on some sort of character device (chrdev_open+0x9f/0x1d0), and there's a module involved here called ebbcharmutex. Since this is a stack trace, we know with high certainty that the process is blocked inside the dev_open call in particular, and that the code for that function belongs to the ebbcharmutex module.
For more details about ebbcharmutex, I'd suggest going to the source (a really nice website about Linux and embedded systems); anyway, here's a piece of the code:
/** @brief The device open function that is called each time the device is opened
 *  This will only increment the numberOpens counter in this case.
 *  @param inodep A pointer to an inode object (defined in linux/fs.h)
 *  @param filep A pointer to a file object (defined in linux/fs.h)
 */
static int dev_open(struct inode *inodep, struct file *filep){
   // if(!mutex_trylock(&ebbchar_mutex)){   // Try to acquire the mutex (returns 0 on fail)
   //    printk(KERN_ALERT "EBBChar: Device in use by another process");
   //    return -EBUSY;
   // }
   mutex_lock(&ebbchar_mutex);
   numberOpens++;
   printk(KERN_INFO "EBBChar: Device has been opened %d time(s)\n", numberOpens);
   return 0;
}
as you can see, the first active line of the function calls mutex_lock(), which sleeps uninterruptibly, hence this is where our process is stuck.
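For contrast, and assuming we were free to change the module, the usual way to avoid parking callers in D state here would be the kernel's mutex_lock_interruptible(), which puts the waiter in S state so a signal can still wake it. A hedged sketch, not the module's actual code:

```c
/* Hypothetical variant of dev_open(), NOT the real ebbcharmutex code:
 * mutex_lock_interruptible() sleeps in S (interruptible) state, so a
 * signal such as SIGKILL can still get through while we wait. */
static int dev_open(struct inode *inodep, struct file *filep){
   if (mutex_lock_interruptible(&ebbchar_mutex))
      return -ERESTARTSYS;   /* a signal arrived while we were waiting */
   numberOpens++;
   printk(KERN_INFO "EBBChar: Device has been opened %d time(s)\n", numberOpens);
   return 0;
}
```

The trade-off is that every caller now has to handle the "interrupted" error path, which is exactly the "depends on what the code is supposed to do" choice mentioned earlier.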
Now what? Well... technically there's not much you can do to unlock the process; all you can try to do is identify who might be holding the lock and not releasing it (which, of course, requires some serious understanding of the problem and the code itself). In my simple scenario you can see that I had two test processes, one of them in S+ state (scroll up, I'm not going to paste it twice :P). If we check the open files for that process we can see it opened /dev/ebbchar, so it is the one holding the lock:
root@test:/home/juan# lsof -p 20592
COMMAND   PID USER   FD   TYPE DEVICE SIZE/OFF   NODE NAME
test    20592 root  cwd    DIR    8,1     4096 466847 /home/juan/ModTest/exploringBB/extras/kernel/ebbcharmutex
test    20592 root  rtd    DIR    8,1     4096      2 /
test    20592 root  txt    REG    8,1     9102 466927 /home/juan/ModTest/exploringBB/extras/kernel/ebbcharmutex/test
test    20592 root  mem    REG    8,1  1857312 583879 /lib/x86_64-linux-gnu/libc-2.19.so
test    20592 root  mem    REG    8,1   149120 583895 /lib/x86_64-linux-gnu/ld-2.19.so
test    20592 root    0u   CHR  136,2      0t0      5 /dev/pts/2
test    20592 root    1u   CHR  136,2      0t0      5 /dev/pts/2
test    20592 root    2u   CHR  136,2      0t0      5 /dev/pts/2
test    20592 root    3u   CHR  250,0      0t0  69224 /dev/ebbchar
root@test:/home/juan#
If you can't really understand why the process isn't releasing the lock, then, since it is in S+ state, you can kill it, and this should free the lock and in turn release the other, stuck process:
root@test:/home/juan# kill -9 20592
root@test:/home/juan# ps aux|grep test
root     20593  0.0  0.0   4220   640 pts/24   S+   05:39   0:00 ./test
root     21460  0.0  0.0  15964  2140 pts/26   S+   12:06   0:00 grep --color=auto test
root@test:/home/juan#
perfect, now process 20593 is in S state!
The kernel doesn't like processes in D state either:
Can't blame it, right? A long time in D state is usually a sign of some kind of issue, either high load or locking problems. Because of this, the kernel ships a hung task detector with four tunables to detect and act on processes stuck in D state. Here you can see the default setup for Ubuntu:
root@test:/home/juan# cat /proc/sys/kernel/hung_task_timeout_secs
120
root@test:/home/juan# cat /proc/sys/kernel/hung_task_panic
0
root@test:/home/juan# cat /proc/sys/kernel/hung_task_check_count
4194304
root@test:/home/juan# cat /proc/sys/kernel/hung_task_warnings
0
root@test:/home/juan#
What do these values mean (for more precision read this):
- hung_task_timeout_secs: if a task stays in D state without being scheduled for more than this number of seconds, a warning is reported.
- hung_task_panic: tells the kernel whether it should panic when it detects a hung task. 0 for no panic, 1 for panic. This could be used, for example, to force a reboot of the system.
- hung_task_check_count: upper bound on the number of tasks the detector will examine on each run; on most systems the default is effectively "all of them".
- hung_task_warnings: maximum number of hung task warnings to report. Every time a task is detected as hung this number gets decreased; when it reaches 0, no more reports are issued.
Here is what one of these hung task reports looks like in the kernel log:

Dec 20 19:57:25 test kernel: [78751.283729] INFO: task test:20593 blocked for more than 120 seconds.
Dec 20 19:57:25 test kernel: [78751.283737] Tainted: G OE 3.19.0-51-generic #58~14.04.1-Ubuntu
Dec 20 19:57:25 test kernel: [78751.283739] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 20 19:57:25 test kernel: [78751.283741] test            D ffff88003610bb68     0 20593  18722 0x00000000
Dec 20 19:57:25 test kernel: [78751.283748]  ffff88003610bb68 ffff88007ce9f5c0 0000000000013e80 ffff88003610bfd8
Dec 20 19:57:25 test kernel: [78751.283753]  0000000000013e80 ffff8800b771f5c0 ffff88007ce9f5c0 ffff88007ce9f5c0
Dec 20 19:57:25 test kernel: [78751.283756]  ffffffffc04f0100 ffffffffc04f0104 ffff88007ce9f5c0 00000000ffffffff
Dec 20 19:57:25 test kernel: [78751.283760] Call Trace:
Dec 20 19:57:25 test kernel: [78751.283773] [<ffffffff817b3a29>] schedule_preempt_disabled+0x29/0x70
Dec 20 19:57:25 test kernel: [78751.283778] [<ffffffff817b5745>] __mutex_lock_slowpath+0x95/0x100
Dec 20 19:57:25 test kernel: [78751.283781] [<ffffffff817b57b5>] ? mutex_lock+0x5/0x37
Dec 20 19:57:25 test kernel: [78751.283785] [<ffffffff817b57d3>] mutex_lock+0x23/0x37
Dec 20 19:57:25 test kernel: [78751.283790] [<ffffffffc04ee045>] ? dev_open+0x15/0x40 [ebbcharmutex]
Dec 20 19:57:25 test kernel: [78751.283793] [<ffffffffc04ee045>] dev_open+0x15/0x40 [ebbcharmutex]
Dec 20 19:57:25 test kernel: [78751.283798] [<ffffffff811f0a7f>] chrdev_open+0x9f/0x1d0
Dec 20 19:57:25 test kernel: [78751.283802] [<ffffffff811e9527>] do_dentry_open+0x1f7/0x340
Dec 20 19:57:25 test kernel: [78751.283805] [<ffffffff811f09e0>] ? cdev_put+0x30/0x30
Dec 20 19:57:25 test kernel: [78751.283808] [<ffffffff811ead35>] ? vfs_open+0x5/0x60
Dec 20 19:57:25 test kernel: [78751.283812] [<ffffffff811ead87>] vfs_open+0x57/0x60
Dec 20 19:57:25 test kernel: [78751.283815] [<ffffffff811faccc>] do_last+0x4ec/0x1190
Dec 20 19:57:25 test kernel: [78751.283820] [<ffffffff811fb9f0>] path_openat+0x80/0x600
Dec 20 19:57:25 test kernel: [78751.283825] [<ffffffff810a6ef5>] ? update_stats_wait_end+0x5/0xd0
Dec 20 19:57:25 test kernel: [78751.283828] [<ffffffff817b38b5>] ? _cond_resched+0x5/0x40
Dec 20 19:57:25 test kernel: [78751.283832] [<ffffffff811fd10a>] do_filp_open+0x3a/0x90
Dec 20 19:57:25 test kernel: [78751.283836] [<ffffffff81209cd7>] ? __alloc_fd+0xa7/0x130
Dec 20 19:57:25 test kernel: [78751.283839] [<ffffffff811fd0d5>] ? do_filp_open+0x5/0x90
Dec 20 19:57:25 test kernel: [78751.283843] [<ffffffff811eb109>] do_sys_open+0x129/0x280
Dec 20 19:57:25 test kernel: [78751.283846] [<ffffffff811eafe5>] ? do_sys_open+0x5/0x280
Dec 20 19:57:25 test kernel: [78751.283850] [<ffffffff811eb27e>] SyS_open+0x1e/0x20
Dec 20 19:57:25 test kernel: [78751.283853] [<ffffffff817b788d>] system_call_fastpath+0x16/0x1b
The report provides an even more detailed stack trace along with some other details. And if you look at the top of that trace, it becomes crystal clear that the process is stuck trying to acquire a mutex (__mutex_lock_slowpath).
Summing this up:
Processes in D state are rare and can be an indication of system issues like intense IO, high load, or locking problems. Being uninterruptible, they can't be woken up to handle signals: anything you send them stays queued and undelivered, so the process can't be killed. Fortunately, there are a few things you can do to gather more information and get a chance to troubleshoot the problem.