jueves, 21 de diciembre de 2017

Processes in uninterruptable state (D) - What to do with them?

On the previous entry a I talked about the infamous zombie state and how they just can't be killed. Now is time to some even more scary than the Z state, [suspense sounds].... the D state [/suspense sounds].

As its name precisely suggests, processes in uninterruptible state (D in ps/top/etc) cannot be interrupted hence they can't be killed either!!! An unlike a process Z state, the former might be indeed occupying memory you can't get back!

How do you get into D state?


Most of the time people associates processes in D state with IO activity, and that's not wrong actually that's exactly what man ps says:

PROCESS STATE CODES
       Here are the different values that the s, stat and state output specifiers (header "STAT" or "S") will display to describe the state of a
       process:

               D    uninterruptible sleep (usually IO)
               R    running or runnable (on run queue)
               S    interruptible sleep (waiting for an event to complete)
               T    stopped, either by a job control signal or because it is being traced
               W    paging (not valid since the 2.6.xx kernel)
               X    dead (should never be seen)
               Z    defunct ("zombie") process, terminated but not reaped by its parent

However processes that go into D state due to IO usually come back to R once the IO operation that was holding them is completed (unless something goes nuts). Now, there are some scenarios that don't involve IO and can also lead to D state, for example when using serialization mechanisms like mutexes. When you try to acquire a mutex that has been locked already you have two options, either you go into interruptable sleep (S) or you go into uninterruptible sleep (D) this is something that will depend on what the code is supposed to do and whether you want to be able to receive signals and stuff while waiting for the mutex to be available. If you are curious about mutexes in the kernel have a look at this or a shorter and perhaps more human version here.

So how do you detect a process in D state?


That's rather simple! You can use the regular tools like ps and top and look for the State field/column. For example:

juan@test:~$ ps aux|grep test
root     20592  0.0  0.0   4220   656 pts/2    S+   21:36   0:00 ./test
root     20593  0.0  0.0   4216   640 pts/24   D+   21:36   0:00 ./test
juan     20982  0.0  0.0  15964  2268 pts/27   R+   23:26   0:00 grep --color=auto test
juan@test:~$

in this case, process 20593 is in D+ state, so Uninterruptable sleep and is a foreground process.

What to do with a process in D state? How to know what is actually locking the process:


This is where things get interesting. As I mentioned before D state is usually temporary and most of the times if you see a process in that state as soon as you run ps again the process will have changed to either R or S. BUT... if you believe the process is just refusing to leave D state, then what to do?

Lots of people answer this question with "I kill the process"... well that's exactly why is called UNinterruptible, it will absolutely ignore your most violent attempts to kill it. Any signals you send to the process will be queued and delivered eventually once the process leaves the D state, if it ever does of course.

Fortunately there are a few things that can help you debug this lovely situation and prevent it from happening in the future.

Have a look at the process stack:

This would be an initial approach, by having a look at the process stack you can get an idea of what code path took the process to that D condition. Luckily having a look at the stack isn't that hard now a days and you can do that by using /proc/:

juan@test:~$ sudo cat /proc/20593/stack
[<ffffffffc04ee045>] dev_open+0x15/0x40 [ebbcharmutex]
[<ffffffff811f0a7f>] chrdev_open+0x9f/0x1d0
[<ffffffff811e9527>] do_dentry_open+0x1f7/0x340
[<ffffffff811ead87>] vfs_open+0x57/0x60
[<ffffffff811faccc>] do_last+0x4ec/0x1190
[<ffffffff811fb9f0>] path_openat+0x80/0x600
[<ffffffff811fd10a>] do_filp_open+0x3a/0x90
[<ffffffff811eb109>] do_sys_open+0x129/0x280
[<ffffffff811eb27e>] SyS_open+0x1e/0x20
[<ffffffff817b788d>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
juan@test:~$

nice! So the process is locked in Kernel space and started by running an open syscall (SyS_Open+0x1e/0x20) on some sort of character device (chrdev_open+0x9f/0x1d0) and there's a module intervening here called [ebbcharmutex]. Since this is a stack trace we know with high certainty that the process is blocked particularly inside dev_open call, and that the code for the function belongs to module ebbcharmutex.
For more details about ebbcharmutex, I'd suggest to go to the source (really nice website around Linux and embedded systems) anyways here a piece of the code:
/** @brief The device open function that is called each time the device is opened
 *  This will only increment the numberOpens counter in this case.
 *  @param inodep A pointer to an inode object (defined in linux/fs.h)
 *  @param filep A pointer to a file object (defined in linux/fs.h)
 */
static int dev_open(struct inode *inodep, struct file *filep){

//   if(!mutex_trylock(&ebbchar_mutex)){                  // Try to acquire the mutex (returns 0 on fail)
//      printk(KERN_ALERT "EBBChar: Device in use by another process");
//      return -EBUSY;
//  }
   mutex_lock(&ebbchar_mutex);
   numberOpens++;
   printk(KERN_INFO "EBBChar: Device has been opened %d time(s)\n", numberOpens);
   return 0;
}

as you can see the very first line of the function calls mutex_lock() in its uninterrtible fashion hence this is where our process is stuck.

Now what? well... technically there's not much you can do to unlock the process, all you can try to do is identify who could be the one holding the lock and not releasing it (of course this would require some serious understanding of the problem and the code itself). In my simple case scenario you can see that I had two test processes one of them in S+ state (scroll up, I'm not going to paste it twice :P), if we check the opened files for that process we can see it opened /dev/ebbchar so he is the one holding the lock:
root@test:/home/juan# lsof -p 20592
COMMAND   PID USER   FD   TYPE DEVICE SIZE/OFF   NODE NAME
test    20592 root  cwd    DIR    8,1     4096 466847 /home/juan/ModTest/exploringBB/extras/kernel/ebbcharmutex
test    20592 root  rtd    DIR    8,1     4096      2 /
test    20592 root  txt    REG    8,1     9102 466927 /home/juan/ModTest/exploringBB/extras/kernel/ebbcharmutex/test
test    20592 root  mem    REG    8,1  1857312 583879 /lib/x86_64-linux-gnu/libc-2.19.so
test    20592 root  mem    REG    8,1   149120 583895 /lib/x86_64-linux-gnu/ld-2.19.so
test    20592 root    0u   CHR  136,2      0t0      5 /dev/pts/2
test    20592 root    1u   CHR  136,2      0t0      5 /dev/pts/2
test    20592 root    2u   CHR  136,2      0t0      5 /dev/pts/2
test    20592 root    3u   CHR  250,0      0t0  69224 /dev/ebbchar
root@test:/home/juan#

If you can't really understand why the process isn't releasing the lock, since it is in S+ state, you can kill it and this should free the lock hence releasing the other stuck process:

root@test:/home/juan# kill -9 20592
root@test:/home/juan# ps aux|grep test
root     20593  0.0  0.0   4220   640 pts/24   S+   05:39   0:00 ./test
root     21460  0.0  0.0  15964  2140 pts/26   S+   12:06   0:00 grep --color=auto test
root@test:/home/juan#

perfect, now process 20593 is in S state!

The kernel doesn't like processes in D state either:


Can't blame it, right? Long time in D state is usually sign of some kind of issue, either high load or locking problems. Because of this, the kernel has 4 features to detect and act on processes in D state. Here you can see the default setup for Ubuntu:

root@test:/home/juan# cat /proc/sys/kernel/hung_task_timeout_secs
120
root@test:/home/juan# cat /proc/sys/kernel/hung_task_panic
0
root@test:/home/juan# cat /proc/sys/kernel/hung_task_check_count
4194304
root@test:/home/juan# cat /proc/sys/kernel/hung_task_warnings
0
root@test:/home/juan#

What do these values mean (for more precision read this):
  • hung_task_timeout_secs: if a task doesn't get scheduled in more than that number of seconds a warning will be reported.
  • hung_task_panic: to indicate the kernel whether it should panic or not when detecting a hung task. 0 for no panic, 1 for panic. This for example could be use to reboot the system.
  • hung_task_check_count: maximum number of tasks to be checked, don't really get this one...
  • hung_task_warnings: maximum number of warnings due to hung tasks to be reported, every time a task is detected as hung this number gets decreased, when reaching 0 no more reports will be issued.
 How does a warning report look like? Oh.. they are lovely:

Dec 20 19:57:25 test kernel: [78751.283729] INFO: task test:20593 blocked for more than 120 seconds.
Dec 20 19:57:25 test kernel: [78751.283737]       Tainted: G           OE  3.19.0-51-generic #58~14.04.1-Ubuntu
Dec 20 19:57:25 test kernel: [78751.283739] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 20 19:57:25 test kernel: [78751.283741] test            D ffff88003610bb68     0 20593  18722 0x00000000
Dec 20 19:57:25 test kernel: [78751.283748]  ffff88003610bb68 ffff88007ce9f5c0 0000000000013e80 ffff88003610bfd8
Dec 20 19:57:25 test kernel: [78751.283753]  0000000000013e80 ffff8800b771f5c0 ffff88007ce9f5c0 ffff88007ce9f5c0
Dec 20 19:57:25 test kernel: [78751.283756]  ffffffffc04f0100 ffffffffc04f0104 ffff88007ce9f5c0 00000000ffffffff
Dec 20 19:57:25 test kernel: [78751.283760] Call Trace:
Dec 20 19:57:25 test kernel: [78751.283773]  [<ffffffff817b3a29>] schedule_preempt_disabled+0x29/0x70
Dec 20 19:57:25 test kernel: [78751.283778]  [<ffffffff817b5745>] __mutex_lock_slowpath+0x95/0x100
Dec 20 19:57:25 test kernel: [78751.283781]  [<ffffffff817b57b5>] ? mutex_lock+0x5/0x37
Dec 20 19:57:25 test kernel: [78751.283785]  [<ffffffff817b57d3>] mutex_lock+0x23/0x37
Dec 20 19:57:25 test kernel: [78751.283790]  [<ffffffffc04ee045>] ? dev_open+0x15/0x40 [ebbcharmutex]
Dec 20 19:57:25 test kernel: [78751.283793]  [<ffffffffc04ee045>] dev_open+0x15/0x40 [ebbcharmutex]
Dec 20 19:57:25 test kernel: [78751.283798]  [<ffffffff811f0a7f>] chrdev_open+0x9f/0x1d0
Dec 20 19:57:25 test kernel: [78751.283802]  [<ffffffff811e9527>] do_dentry_open+0x1f7/0x340
Dec 20 19:57:25 test kernel: [78751.283805]  [<ffffffff811f09e0>] ? cdev_put+0x30/0x30
Dec 20 19:57:25 test kernel: [78751.283808]  [<ffffffff811ead35>] ? vfs_open+0x5/0x60
Dec 20 19:57:25 test kernel: [78751.283812]  [<ffffffff811ead87>] vfs_open+0x57/0x60
Dec 20 19:57:25 test kernel: [78751.283815]  [<ffffffff811faccc>] do_last+0x4ec/0x1190
Dec 20 19:57:25 test kernel: [78751.283820]  [<ffffffff811fb9f0>] path_openat+0x80/0x600
Dec 20 19:57:25 test kernel: [78751.283825]  [<ffffffff810a6ef5>] ? update_stats_wait_end+0x5/0xd0
Dec 20 19:57:25 test kernel: [78751.283828]  [<ffffffff817b38b5>] ? _cond_resched+0x5/0x40
Dec 20 19:57:25 test kernel: [78751.283832]  [<ffffffff811fd10a>] do_filp_open+0x3a/0x90
Dec 20 19:57:25 test kernel: [78751.283836]  [<ffffffff81209cd7>] ? __alloc_fd+0xa7/0x130
Dec 20 19:57:25 test kernel: [78751.283839]  [<ffffffff811fd0d5>] ? do_filp_open+0x5/0x90
Dec 20 19:57:25 test kernel: [78751.283843]  [<ffffffff811eb109>] do_sys_open+0x129/0x280
Dec 20 19:57:25 test kernel: [78751.283846]  [<ffffffff811eafe5>] ? do_sys_open+0x5/0x280
Dec 20 19:57:25 test kernel: [78751.283850]  [<ffffffff811eb27e>] SyS_open+0x1e/0x20
Dec 20 19:57:25 test kernel: [78751.283853]  [<ffffffff817b788d>] system_call_fastpath+0x16/0x1b

The report provides an even more detailed stack trace along with some other details. Now, if you have a look at the top of the stack trace becomes crystal clear the fact that the process is stuck trying to acquire a lock a mutex.

Summing this up:


Processes in D state are rare and could be an indication of system issues like intense IO, high load, or maybe locking problems. Because of being uninterruptible, they don't get to be scheduled hence signals can't be delivered so the process can't be killed. Fortunately there are a few things you can do in order to gather more information to have the chance to troubleshoot the problem.

miércoles, 13 de diciembre de 2017

Some Zombies just can't be killed - Zombie Processes

We've been told by so many movies and TV series that no matter how smelly or ugly Zombies can be, you can easily kill them by either blowing up their heads with a shotgun or gently detaching it from the rest of the body (this doesn't really kill the zombie but renders it almost harmless, fun fact, head can't run while detached from the legs). Unfortunately Zombie processes don't really follow the previous "Zombie expected" behavior and they just can't be killed (kill -9 PID).

What in the world is a Zombie process?


Wikipedia has a pretty nice description here, but essentially we are talking about a process that has finished its execution through either exit or exit_group syscalls but remains present in the system due to some unfinished housekeeping tasks. You can easily identify a zombie process from its Z state in for example top or ps. A nice description can be found here the wait man page:

       A child that terminates, but has not been waited for becomes a
       "zombie".  The kernel maintains a minimal set of information about
       the zombie process (PID, termination status, resource usage
       information) in order to allow the parent to later perform a wait to
       obtain information about the child.  As long as a zombie is not
       removed from the system via a wait, it will consume a slot in the
       kernel process table, and if this table fills, it will not be
       possible to create further processes.  If a parent process
       terminates, then its "zombie" children (if any) are adopted by
       init(1), (or by the nearest "subreaper" process as defined through
       the use of the prctl(2) PR_SET_CHILD_SUBREAPER operation); init(1)
       automatically performs a wait to remove the zombies.

For some reason there's a lot of confusion around this Z state, for example:
  • Orphan processes aren't zombie processes, in fact for a process to become zombie its parent process has to be still alive. If the parent dies it will be adopted by init and waitED properly.
  •  A Zombie process isn't consuming memory (or CPU), to be more precise, isn't consuming as much memory as it was consuming while it was running. However it still occupies a slot in the process table and this could be a problem.

 What happens when exit or exit_group are called?


As I mentioned before these are the syscalls used to terminate a process gracefully. You can find the details of these two for kernel 4.14 here and here. I'm not going to describe the whole code because first I don't understand all of it and second I would probably do it wrong :), but... I'm going to point out some interesting parts:
  • Things begin in do_group_exit which takes care of killing all the threads in the current process thread group then it calls do_exit to finish the process itself.
  • do_exit takes care of a few things as well, for example releasing some of the resources associated to the process like files and shared memory (lines 858-866). Later on in line 885, calls to exit_notify, which is in charge of sending to the relative processes the bad news, kind of "Hey!, I'm dying" thing. Now within this function a few things will happen being the most important ones the following:
    • do_notify_parent will be called to signal the parent process with a SIGCHLD signal. This function will return either true or false depending on whether the dying process should become a Zombie or just go regular dead.
      • Have a look at the comment in here, long story short, IF the father decides to ignore (by setting the handler to SIG_IGN) the SIGCHLD signal (or SA_NOCLDWAIT flag is set) the dying process can reap itself and doesn't become a zombie and in that case do_notify_parent will return true. Otherwise it will return false.
    • Back to exit_notify, you can see in here how the process decides whether or not to become a zombie depending on the autoreap variable.
Something worth mentioning is the fact that Zombie is an exit state (tsk ->exit_state) while the scheduling state of the process is actually Dead, you can see this is defined by calling here do_task_dead function. So a Zombie process isn't consuming CPU resources at all, for the scheduling perspective this process is gone, and so it is all the other resources that had been allocated to it.

Why can't it be killed?


Well, you can still send SIGKILL signal to a Zombie process and as you may know, you can't escape from that signal. However technically speaking even though the process is still in the process table, there's nothing it can do, there's no code attached to it anymore, no stack, no files, nothing.

Example


Lets have a look at all this zombie stuff with a simple example. You can get the code here, it's a simple C program that will fork 2 child processes and allocate some memory in each. The parent will sleep for 30 seconds, while the child processes will wait for 15 seconds, hence dying while the father is still running (sleeping) becoming spoooky zombiesssss:

juan@test:~$gcc -o zombie_maker zombie_maker.c
juan@test:~$ ./zombie_maker &
juan@test:~$ Parent PID = 2050, PGRP = 2050, PPID = 1934, PSID = 1934
Child 0 -> PID = 2051, PGRP = 2050, PPID = 2050, PSID = 1934
Child 1 -> PID = 2052, PGRP = 2050, PPID = 2050, PSID = 1934

juan@test:~$ ps aux|grep zombie
juan      2050  0.0  0.0   4216   732 pts/1    S    21:34   0:00 ./zombie_maker
juan      2051  0.0  0.0   6268    92 pts/1    S    21:34   0:00 ./zombie_maker
juan      2052  0.0  0.0   6268    92 pts/1    S    21:34   0:00 ./zombie_maker
juan      2054  0.0  0.0  15960  2268 pts/1    R+   21:34   0:00 grep --color=auto zombie
juan@test:~$ ps aux|grep zombie
juan      2050  0.0  0.0   4216   732 pts/1    S    21:34   0:00 ./zombie_maker
juan      2051  0.0  0.0      0     0 pts/1    Z    21:34   0:00 [zombie_maker] 
juan      2052  0.0  0.0      0     0 pts/1    Z    21:34   0:00 [zombie_maker] 
juan      2056  0.0  0.0  15960  2268 pts/1    S+   21:34   0:00 grep --color=auto zombie
juan@test:~$ ps aux|grep zombie
juan      2058  0.0  0.0  15960  2288 pts/1    S+   21:35   0:00 grep --color=auto zombie
[2]+  Done                    ./zombie_maker
juan@test:~$

After the 15 second, we can see how both child processes not only became Zombies but they also show no memory footprint whatsoever on the system (VSZ is 0 and RSS is 0 as well). After the 30 seconds we can see that none of the 3 processes exist anymore, but who reaped the Zombie children from the system???

Yeahp, init did! Attaching to init with strace we can see the waitid calls returning with the details from the Zombie processes:

root@test:/home/juan# strace -p 1
Process 1 attached
select(54, [3 5 6 7 10 11 15 16 17 19 20 24 26 30 34 53], [], [7 10 11 15 16 17], NULL) = ? ERESTARTNOHAND (To be restarted if no handler)
...
waitid(P_ALL, 0, {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=2051, si_status=3, si_utime=0, si_stime=0}, WNOHANG|WEXITED|WSTOPPED|WCONTINUED, NULL) = 0
waitid(P_ALL, 0, {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=2052, si_status=3, si_utime=0, si_stime=0}, WNOHANG|WEXITED|WSTOPPED|WCONTINUED, NULL) = 0
select(54, [3 5 6 7 10 11 15 16 17 19 20 24 26 30 34 53], [], [7 10 11 15 16 17], NULL^CProcess 1 detached
 
root@test:/home/juan#

Running zombie_maker with strace will show you how the parent process gets indeed interrupted by the SIGCHLD signals but since the handler isn't defined it is ignored. PLEASE, don't confuse that ignore with the one that happens when the handler is set to SIG_IGN, literally this is what the documentation says:

       POSIX.1-2001 specifies that if the disposition of SIGCHLD is set to
       SIG_IGN or the SA_NOCLDWAIT flag is set for SIGCHLD (see
       sigaction(2)), then children that terminate do not become zombies and
       a call to wait() or waitpid() will block until all children have
       terminated, and then fail with errno set to ECHILD.  (The original
       POSIX standard left the behavior of setting SIGCHLD to SIG_IGN
       unspecified.  Note that even though the default disposition of
       SIGCHLD is "ignore", explicitly setting the disposition to SIG_IGN
       results in different treatment of zombie process children.)

Summary


No matter how brave or root you are, you can't kill a Zombie process :D, killing its parent is usually the way to get rid of them. When the parent dies, init will adopt the little zombie bastards and reap them off the process table, as we saw on the example. Also as mentioned, a Zombie process doesn't consume much system resources (actually almost none), so unless you have hundreds of them you shouldn't worry much.

The kernel code is going crazy, it would be nice if we had more comments in place :P.