Wednesday, January 23, 2019

When ps let me down

Some tools have been around for ages helping systems administrators through rough times, making our lives and jobs somewhat easier (ok, not all the time xD). One of these marvels is, without a doubt, the almighty ps. If I told you that ps stole a few hours of my life with no remorse, would you believe me? Well... you should xD.


Let's assume you have a Linux system and notice a high system load average, along with unusually high CPU utilization. Having a look at the metrics, you realize these have been increasing for a few weeks now.

A quick glance at top shows something like this:
top - 20:58:27 up  1:18,  3 users,  load average: 50.27, 29.91, 14.09
Tasks: 20215 total,  13 running, 202 sleeping,   0 stopped, 20000 zombie
%Cpu(s): 29.7 us, 70.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.3 si,  0.0 st
KiB Mem:   3007548 total,  2584796 used,   422752 free,   214556 buffers
KiB Swap:  1048568 total,        0 used,  1048568 free.   983056 cached Mem

25244 juan      20   0   14228   2180   2060 R  14.1  0.1   0:00.77 ps
25246 juan      20   0   14228   2188   2064 R  13.5  0.1   0:00.70 ps
25248 juan      20   0   14228   2036   1916 R  13.5  0.1   0:00.61 ps
25250 juan      20   0   14228   2100   1976 R  13.5  0.1   0:00.57 ps
25252 juan      20   0   14228   2092   1968 R  13.5  0.1   0:00.49 ps
25254 juan      20   0   14228   2180   2056 R  13.2  0.1   0:00.45 ps
25256 juan      20   0   14228   2092   1968 R  11.5  0.1   0:00.39 ps
23469 root      20   0   47308  20812   2572 R  10.9  0.7   1:18.58 top
25258 juan      20   0   14228   2228   2104 R   9.4  0.1   0:00.32 ps
25260 juan      20   0   14228   2212   2088 R   7.4  0.1   0:00.25 ps
25262 juan      20   0   14228   2040   1916 R   5.0  0.1   0:00.17 ps
25264 juan      20   0   14228   2188   2064 R   2.9  0.1   0:00.10 ps
 2290 juan      20   0 2335644 128804  78096 S   1.2  4.3   0:06.22 skypeforli+
    9 root      20   0       0      0      0 S   0.9  0.0   0:05.64 rcuos/0
   18 root      20   0       0      0      0 S   0.9  0.0   0:04.86 rcuos/1
    7 root      20   0       0      0      0 S   0.6  0.0   0:03.59 rcu_sched
 1071 root      20   0  470184 113424  30860 S   0.6  3.8   0:35.99 Xorg

So, what's wrong with this system?

Well, a bunch of things as you can see:

  • To begin with, it kind of looks like an episode of "The Walking Dead": there are 20000 zombie processes xD. Not a big deal, right? Zombie processes don't take much memory and don't increase the load. However, they could easily consume all your process slots within the kernel, and that would be a big deal.
  • Then, the system load is literally insane for 2 vCPUs (13 processes in the running queue).
  • On top of all that, we have several freaking ps processes each taking up to 14% of the CPU time; how can ps possibly be taking that much CPU time?

Where do the zombies come from?

Unlike human babies, which we all know are delivered by storks, zombie processes are the result of parent processes not cleaning up after their terminated children (more details here). And in this case, we clearly have a good bunch of them. However, in this particular scenario I literally faked the zombie processes, so we won't go into details as to why they were created.
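To see one being born, here's a small Python sketch (my own illustration, not from the actual incident) that forks a child, lets it exit, and deliberately skips wait() so the child lingers as a zombie until we finally reap it:

```python
import os
import time

def proc_state(pid):
    """Read the state letter (field 3) of /proc/<pid>/stat.
    The comm field may contain spaces, so split after the ')'."""
    with open(f"/proc/{pid}/stat") as f:
        return f.read().rsplit(")", 1)[1].split()[0]

def make_zombie():
    pid = os.fork()
    if pid == 0:
        os._exit(0)              # child terminates immediately...
    # ...but the parent doesn't wait(), so the child becomes a zombie
    deadline = time.time() + 5
    while proc_state(pid) != "Z" and time.time() < deadline:
        time.sleep(0.01)
    state = proc_state(pid)
    os.waitpid(pid, 0)           # finally reap it, freeing the process slot
    return state
```

Calling make_zombie() returns "Z": the dead child sticks around solely because nobody collected its exit status.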

Why is the load so freaking high?

The reason for the high load is pretty clear: we have tons of processes in the running queue, plus potentially some others in D state, blocked on some resource. All of these add up to the system load, and what's concerning is the fact that the load was increasing, so at some point this system would have become unresponsive.

Why in the world is ps taking so much CPU time?

Having a look at the ps processes, it looks as if they were all trying to pull the state of a given process:
root@test:/home/juan# ps ha -o state,cmd|grep ps
R ps h --pid 3234 -o state
R ps h --pid 3234 -o state
R ps h --pid 3234 -o state
R ps h --pid 3234 -o state
R ps h --pid 3234 -o state
R ps h --pid 3234 -o state
R ps h --pid 3234 -o state
R ps h --pid 3234 -o state
R ps h --pid 3234 -o state
R ps h --pid 3234 -o state
R ps ha -o state,cmd
S grep --color=auto ps
R ps h --pid 3234 -o state
R ps h --pid 3234 -o state

A pretty harmless move, you would say, so let's time that operation:
root@test:/home/juan# time ps h --pid 3234 -o state

real    0m1.003s
user    0m0.188s
sys     0m0.552s

Wait... what? It took 1 second to run ps? What age are we living in?

So ps spent a bit more than half of that time running in kernel space; it also spent about ~200 ms in user space, and then somewhere around ~260 ms waiting: maybe in the scheduler queue, maybe on IO, maybe all of the above, who knows.

Half a second spent in the kernel seems like a lot for such a simple task ("get the process state of PID 3234"), right? So what's going on?

Strace to the rescue:
root@test:/home/juan# strace -c ps h --pid 3234 -o state
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 71.16    0.014693         735        20           getdents
 26.54    0.005481           0     40418           read
  1.54    0.000317           0     40423         4 open
  0.47    0.000097           0     40420           close
  0.30    0.000061           0     20210           stat
  0.00    0.000000           0         1           write
  0.00    0.000000           0         6           fstat
  0.00    0.000000           0         2           lseek
  0.00    0.000000           0        16           mmap
  0.00    0.000000           0         7           mprotect
  0.00    0.000000           0         3           munmap
  0.00    0.000000           0         3           brk
  0.00    0.000000           0        23           rt_sigaction
  0.00    0.000000           0         2           ioctl
  0.00    0.000000           0         4         4 access
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         1           uname
  0.00    0.000000           0         1           geteuid
  0.00    0.000000           0         1           arch_prctl
  0.00    0.000000           0         1           openat
------ ----------- ----------- --------- --------- ----------------
100.00    0.020649                142959         8 total
Turns out it took ps "only" 142959 system calls to get the process state of PID 3234, and that is indeed a pretty insane number of syscalls.

  • ~70% of the time in kernel space was consumed by 20 getdents syscalls; these calls were the ones pulling the directory information from /proc. Keep in mind that /proc is a pretty crowded place when there are ~20k processes in the system. So despite these calls being few, they actually pulled lots of data (about 600 KB, in 32 KB fragments).
root@test:/home/juan# strace -T -e getdents ps h --pid 3234 -o state
getdents(5, /* 1350 entries */, 32768)  = 32768
getdents(5, /* 1365 entries */, 32768)  = 32760
getdents(5, /* 1365 entries */, 32768)  = 32760
getdents(5, /* 1365 entries */, 32768)  = 32760
getdents(5, /* 1365 entries */, 32768)  = 32760
getdents(5, /* 1024 entries */, 32768)  = 32752
getdents(5, /* 1024 entries */, 32768)  = 32768
getdents(5, /* 1024 entries */, 32768)  = 32768
getdents(5, /* 1024 entries */, 32768)  = 32768
getdents(5, /* 1024 entries */, 32768)  = 32768
getdents(5, /* 1024 entries */, 32768)  = 32768
getdents(5, /* 1024 entries */, 32768)  = 32768
getdents(5, /* 1024 entries */, 32768)  = 32768
getdents(5, /* 1024 entries */, 32768)  = 32768
getdents(5, /* 1024 entries */, 32768)  = 32768
getdents(5, /* 1024 entries */, 32768)  = 32768
getdents(5, /* 1024 entries */, 32768)  = 32768
getdents(5, /* 1024 entries */, 32768)  = 32768
getdents(5, /* 309 entries */, 32768)   = 9888
getdents(5, /* 0 entries */, 32768)     = 0
+++ exited with 1 +++
  • ~26% of the time was spent on read syscalls (40418 calls); these are effectively the ones reading data from the files within /proc.
  • The interesting part here is the fact that the number of read, open and close syscalls is almost exactly twice the number of stat calls (20210 of them). And of course, the number of stat calls is almost exactly the number of processes in the system.

Wait... are you trying to tell me that ps, despite being asked for a single process's state, goes through all the processes in the system anyway?

Yep, that's exactly what it does. You don't trust me? (Can't blame you, I've been wrong before xD.)

That single write syscall in the list above is the one ps uses to print the command output (the process state), right? Ok, so if we dump the strace output to a file, we can get a better picture:
root@test:/home/juan# grep ^write -A5 -B5  strace.txt
read(6, "3234 (apache) S 3222 3234 3204 3"..., 1024) = 306
close(6)                                = 0
open("/proc/3234/status", O_RDONLY)     = 6
read(6, "Name:\tapache\nState:\tS (sleeping)"..., 1024) = 780
close(6)                                = 0
fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 4), ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7ff10e10f000
write(1, "S\n", 2)                      = 2
stat("/proc/3235", {st_mode=S_IFDIR|0555, st_size=0, ...}) = 0
open("/proc/3235/stat", O_RDONLY)       = 6
read(6, "3235 (apache) Z 3234 3234 3234 3"..., 1024) = 162
close(6)                                = 0
open("/proc/3236/status", O_RDONLY)     = 6
Clearly, ps goes one by one through all the directories available in /proc, and once it comes across the one requested with "--pid" it writes the output; however, it keeps moving forward until all directories have been checked (this is disturbingly fun when you try it with PID 1).

Now... why would ps do this?

This seems to be a design decision after all. The ps man page states:

"ps - report a snapshot of the current processes."

Note: this suggests ps may actually stand for "process snapshot" (who knows).

"Except as described below, process selection options are additive. The default selection is discarded, and then the selected processes are added to the set of processes to be displayed. A process will thus be shown if it meets any of the given selection criteria."

So what you actually do with the options is describe to ps which processes you want printed, and which information; ps will then go through all of them and print out the ones matching your description.
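To make that selection model concrete, here's a rough Python approximation (my own sketch, not ps's actual code) of what it implies: even when you ask for a single PID, every directory in /proc gets visited.

```python
import os

def ps_like_state(target_pid):
    """Mimic ps's approach: walk every /proc/<pid> entry and return
    (state of target_pid, how many processes were scanned)."""
    scanned, found = 0, None
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue                      # skip non-process entries
        scanned += 1
        try:
            with open(f"/proc/{entry}/stat") as f:
                state = f.read().rsplit(")", 1)[1].split()[0]
        except OSError:
            continue                      # process vanished mid-scan
        if int(entry) == target_pid:
            found = state                 # got it, but keep scanning anyway
    return found, scanned
```

With ~20k processes on the box, that loop does ~20k stat-file reads for one answer, which is exactly the syscall pattern strace showed.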


With all this in mind, using ps to get details of a process whose PID you already know doesn't seem efficient at all, and can be a terrible idea on systems hosting a large number of processes.

After all, it looks like you would be better off reading the state directly from /proc yourself:
root@test:/home/juan# strace -c grep ^State /proc/3234/status
State:  S (sleeping)
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
  0.00    0.000000           0         5           read
  0.00    0.000000           0         1           write
  0.00    0.000000           0         6           open
  0.00    0.000000           0         9           close
  0.00    0.000000           0         9           fstat
  0.00    0.000000           0         1         1 lseek
  0.00    0.000000           0        14           mmap
  0.00    0.000000           0         8           mprotect
  0.00    0.000000           0         2           munmap
  0.00    0.000000           0         3           brk
  0.00    0.000000           0         2         1 ioctl
  0.00    0.000000           0         5         5 access
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         1           arch_prctl
  0.00    0.000000           0         1           openat
------ ----------- ----------- --------- --------- ----------------
100.00    0.000000                    68         7 total

It only took grep 68 syscalls to retrieve the state of the process :P. Yeah, I know, now I'm just being a jerk hahaha.
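If you'd rather do the same trick programmatically, a few lines of Python (a hypothetical helper of my own, not part of any tool) perform the single open/read/close that grep does, with no directory scan at all:

```python
import os

def state_direct(pid):
    """Fetch the state of one process straight from /proc/<pid>/status."""
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("State:"):
                return line.split()[1]    # e.g. 'S' from 'State:\tS (sleeping)'
    return None
```

One open, a handful of reads, one close: constant work regardless of how many processes the system is hosting.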

Monday, April 30, 2018

Where do your bytes go when you send them to /dev/null?

I've been playing around with some interesting tools lately so decided to use one of them to answer a rather ontological question. No, I'm not talking about the meaning of life and stuff, but the one on the title!

I'm pretty sure you have at least once come across the idea of running some IO benchmark (I didn't say it's a good idea, but let's face it... you've done it) on your drive and decided to run something like the following:

dd if=/home/juan/myfile of=/dev/null

You know, trying to measure the read throughput of the device where myfile is stored. Or maybe you've redirected some annoying stderr output to prevent it from showing up on your stdout. The point is that I've always blindly assumed /dev/null is some kind of black hole or one-way ticket for your bytes, but never actually understood how it works.

Spoiler alert: I wrote this article while working on my Ubuntu Trusty VM, so all the kernel source links will point to kernel 3.19; however, most (if not all) of the code/content should still be valid in newer kernels. If it's not, then sorry xD, I'm not going to write this again, so get your hands dirty and read some source code.

What's /dev/null after all?

According to Wikipedia:

The null device is typically used for disposing of unwanted output streams of a process, or as a convenient empty file for input streams. This is usually done by redirection.

Something interesting here is the concept of "device": /dev/null isn't a regular file, but a special device file, particularly a character device file. You may have heard that "in Unix everything is a file" (ok, except network cards :P); that's quite accurate, and there are several different file types, the most common being the following:

  • Regular files
  • Directory
  • Named Pipe
  • Socket
  • Symbolic Link
  • Device file
    • Character
    • Block
Let's have a quick look at this special file and see what's so special about it! Using stat we already get some nice details:

juan@test:~$ stat /dev/null
  File: ‘/dev/null’
  Size: 0             Blocks: 0          IO Block: 4096   character special file
Device: 6h/6d    Inode: 1029        Links: 1     Device type: 1,3
Access: (0666/crw-rw-rw-)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2018-04-17 18:57:48.359574000 +0100
Modify: 2018-04-17 18:57:48.359574000 +0100
Change: 2018-04-17 18:57:48.359574000 +0100
 Birth: -

What does "Device type: 1,3" stand for? Well, that's the key here: these two numbers are the Major number (which identifies the driver used) and the Minor number (which identifies the device).

According to Linux Devices documentation, the character device with Major number 1 and Minor Number 3 is the Memory character device /dev/null (aka Null device). If you feel like getting your hands really dirty, you can see the Memory character implementation code here :P.
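You can pull those same numbers programmatically; this little Python snippet (just an illustration of mine) decodes the device type from the st_rdev field that stat exposes:

```python
import os
import stat

st = os.stat("/dev/null")
assert stat.S_ISCHR(st.st_mode)           # it's a character special file
# split st_rdev into the driver (major) and device (minor) numbers
major, minor = os.major(st.st_rdev), os.minor(st.st_rdev)
print(f"major={major} minor={minor}")     # major=1 minor=3 on a standard Linux box
```

The same decoding works for any device node under /dev, block or character.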

So what happens when you send stuff to /dev/null?

As with every file operation, at least the following things have to happen:

- Open the file and get a file descriptor
- Write to the file using the file descriptor obtained previously
- Close the file using the file descriptor obtained previously

These 3 actions require kernel intervention, and they map directly to the syscalls open, write and close. Let's have a look at this with strace:

root@test:/home/juan# strace -e "open,read,write,close,dup2" dd if=/dev/zero of=/dev/null count=1
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
close(3)                                = 0
open("/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0P \2\0\0\0\0\0"..., 832) = 832
close(3)                                = 0
open("/usr/lib/locale/locale-archive", O_RDONLY|O_CLOEXEC) = 3
close(3)                                = 0
open("/dev/zero", O_RDONLY)             = 3
dup2(3, 0)                              = 0
close(3)                                = 0
open("/dev/null", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3
dup2(3, 1)                              = 1
close(3)                                = 0
read(0, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 512) = 512
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 512) = 512
close(0)                                = 0
close(1)                                = 0
open("/usr/share/locale/locale.alias", O_RDONLY|O_CLOEXEC) = 0
read(0, "# Locale name alias data base.\n#"..., 4096) = 2570
read(0, "", 4096)                       = 0
close(0)                                = 0
open("/usr/share/locale/en_IE/LC_MESSAGES/coreutils.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/share/locale/en/LC_MESSAGES/coreutils.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/share/locale-langpack/en_IE/LC_MESSAGES/coreutils.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/share/locale-langpack/en/LC_MESSAGES/coreutils.mo", O_RDONLY) = 0
close(0)                                = 0
write(2, "1+0 records in\n1+0 records out\n", 311+0 records in
1+0 records out
) = 31
write(2, "512 bytes (512 B) copied", 24512 bytes (512 B) copied) = 24
write(2, ", 0.000607882 s, 842 kB/s\n", 26, 0.000607882 s, 842 kB/s
) = 26
close(2)                                = 0
+++ exited with 0 +++

So the /dev/null file was opened, and in return we got FD 3, which was then copied onto FD 1 (dup2(3, 1)). Right after this, write was used to write to it; we can see the exact lines here:

open("/dev/null", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3
dup2(3, 1)                              = 1
close(3)                                = 0
read(0, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 512) = 512
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 512) = 512

Once the process hands control over to the kernel to execute the syscall code, we totally lose track of what's going on. To be honest, that's the whole point of an API like the syscall layer; however, sometimes it's good to have the chance to see how things work in the background, and in order to answer our question we need exactly that.

Ftrace, like strace but way more fun!

Ftrace was introduced into the Linux kernel long ago, somewhere around late 2008 in kernel 2.6.27. It allows users to trace certain kernel functions, giving incredible visibility into what's going on and how things really work.

In order for ftrace to be available, the kernel needs to be compiled with this tracing feature. You can confirm this by looking for CONFIG_FUNCTION_TRACER, like so:

juan@test:~$ grep FUNCTION_TRACER /boot/config-3.19.0-51-generic

Many distributions ship their kernels with ftrace enabled and even the debugfs (the way to access ftrace data) file system mounted (Ubuntu for example):

juan@test:~$ mount | grep debug
none on /sys/kernel/debug type debugfs (rw)

What essentially happens is pretty much summed up in these lines:

It uses a compiler feature to insert a small, 5-byte no-operation instruction at the beginning of every kernel function; this NOP sequence is then dynamically patched into a tracer call when tracing is enabled by the administrator.

So these few extra bytes are used later on to hook in a tracing function that gathers the details of the invoked functions. Pretty neat!

There's a LOT of ftrace documentation around, so I won't waste time repeating any of it; I do suggest you read it, though. In this particular case I'll be using the trace-cmd frontend tool instead of directly using the files under the debugfs directory.

BTW: ftrace is a great way to inspect and learn about the Linux Kernel, so do give it a go.


Ok, ok... In order to trace the kernel functions I'm going to use the function_graph plugin, with a little filtering to only get the events related to the write syscall:

root@test:/home/juan# trace-cmd record -p function_graph -g "SyS_write" dd if=/dev/zero of=/dev/null count=1
  plugin 'function_graph'
1+0 records in
1+0 records out
512 bytes (512 B) copied, 0.00255381 s, 200 kB/s
Kernel buffer statistics:
  Note: "entries" are the entries left in the kernel ring buffer and are not
        recorded in the trace data. They should all be zero.

CPU: 0
entries: 0
overrun: 0
commit overrun: 0
bytes: 2476
oldest event ts: 180098.264687
now ts: 180098.267521
dropped events: 0
read events: 468

CPU: 1
entries: 0
overrun: 0
commit overrun: 0
bytes: 2184
oldest event ts: 180098.266223
now ts: 180098.267677
dropped events: 0
read events: 1882

CPU0 data recorded at offset=0x42b000
    20480 bytes in size
CPU1 data recorded at offset=0x430000
    77824 bytes in size

The execution of the command creates a "trace.dat" file where the captured events are written. You can then access these events using "trace-cmd report". Since we know we are looking for the first write call done by the dd command, I'll limit the output of the report to the following:

root@test:/home/juan# trace-cmd report | grep "dd-" | head -25
              dd-3770  [001] 180098.263240: funcgraph_entry:                   |  SyS_write() {
              dd-3770  [001] 180098.263269: funcgraph_entry:                   |    __fdget_pos() {
              dd-3770  [001] 180098.263269: funcgraph_entry:        0.077 us   |      __fget_light();
              dd-3770  [001] 180098.263270: funcgraph_exit:         0.669 us   |    }
              dd-3770  [001] 180098.263270: funcgraph_entry:                   |    vfs_write() {
              dd-3770  [001] 180098.263270: funcgraph_entry:                   |      rw_verify_area() {
              dd-3770  [001] 180098.263271: funcgraph_entry:                   |        security_file_permission() {
              dd-3770  [001] 180098.263271: funcgraph_entry:                   |          apparmor_file_permission() {
              dd-3770  [001] 180098.263271: funcgraph_entry:                   |            common_file_perm() {
              dd-3770  [001] 180098.263272: funcgraph_entry:        0.087 us   |              aa_file_perm();
              dd-3770  [001] 180098.263272: funcgraph_exit:         0.679 us   |            }
              dd-3770  [001] 180098.263272: funcgraph_exit:         1.230 us   |          }
              dd-3770  [001] 180098.263273: funcgraph_exit:         1.783 us   |        }
              dd-3770  [001] 180098.263273: funcgraph_exit:         2.330 us   |      }
              dd-3770  [001] 180098.263273: funcgraph_entry:        0.068 us   |      write_null();
              dd-3770  [001] 180098.263274: funcgraph_entry:                   |      __fsnotify_parent() {
              dd-3770  [001] 180098.263274: funcgraph_entry:        0.103 us   |        dget_parent();
              dd-3770  [001] 180098.263275: funcgraph_entry:        0.078 us   |        dput();
              dd-3770  [001] 180098.263275: funcgraph_exit:         1.255 us   |      }
              dd-3770  [001] 180098.263276: funcgraph_entry:                   |      fsnotify() {
              dd-3770  [001] 180098.263276: funcgraph_entry:        0.308 us   |        __srcu_read_lock();
              dd-3770  [001] 180098.263277: funcgraph_entry:        0.087 us   |        __srcu_read_unlock();
              dd-3770  [001] 180098.263277: funcgraph_exit:         1.548 us   |      }
              dd-3770  [001] 180098.263278: funcgraph_exit:         7.318 us   |    }
              dd-3770  [001] 180098.263278: funcgraph_exit:         9.498 us   |  }

Pretty neat, right? We can clearly see the entry and exit of "every" kernel function within the SyS_write syscall. Now, with the function names, we can go to our one and only https://elixir.bootlin.com and read the code of each function to get a better picture:
  • SyS_write: the write syscall entry point
  • vfs_write: the generic VFS write path, performing security checks and permission validation (among other things :))
  • write_null: the write function for this specific device, /dev/null (indicated by the device's file_operations structure); if this were a regular file, we would see a file-system-specific write function here.
Now if we explore the code of write_null, we can see the following:

static ssize_t write_null(struct file *file, const char __user *buf,
                          size_t count, loff_t *ppos)
{
        return count;
}

the parameters:
  • *file the file structure (created during the open syscall and linked to the fd)
  • *buf the buffer that keeps the data that is supposed to be written
  • count the number of bytes to write 
  • *ppos the position within the file where to start writing. 

As you can see, the function itself doesn't do a thing (which was kind of expected after all xD); it just returns the number of bytes that were supposed to be written, as a sort of acknowledgement confirming the write operation took place.
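That return value is visible from user space: every write to /dev/null "succeeds" for the full byte count, no matter how much you throw at it. A quick Python check (purely illustrative):

```python
import os

fd = os.open("/dev/null", os.O_WRONLY)
written = os.write(fd, b"x" * 4096)   # write_null just hands the count back
os.close(fd)
print(written)                        # 4096 -- all "written", none stored
```

The kernel copies nothing anywhere; the only cost is the syscall round trip itself.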

So now you not only know what /dev/null really is :P, you can also go and have fun with ftrace exploring the Linux kernel!!!

To answer the question, your bytes literally go nowhere!

Thursday, December 21, 2017

Processes in uninterruptible state (D) - What to do with them?

In the previous entry I talked about the infamous zombie state and how those processes just can't be killed. Now it's time for something even scarier than the Z state, [suspense sounds]... the D state [/suspense sounds].

As its name precisely suggests, processes in uninterruptible sleep (D in ps/top/etc) cannot be interrupted, hence they can't be killed either!!! And unlike a process in Z state, the former might indeed be occupying memory you can't get back!

How do you get into D state?

Most of the time people associate processes in D state with IO activity, and that's not wrong; actually, that's exactly what man ps says:

       Here are the different values that the s, stat and state output specifiers (header "STAT" or "S") will display to describe the state of a process:

               D    uninterruptible sleep (usually IO)
               R    running or runnable (on run queue)
               S    interruptible sleep (waiting for an event to complete)
               T    stopped, either by a job control signal or because it is being traced
               W    paging (not valid since the 2.6.xx kernel)
               X    dead (should never be seen)
               Z    defunct ("zombie") process, terminated but not reaped by its parent

However, processes that go into D state due to IO usually come back to R once the IO operation holding them completes (unless something goes nuts). Now, there are some scenarios that don't involve IO and can also lead to D state, for example when using serialization mechanisms like mutexes. When you try to acquire a mutex that is already locked, you have two options: either you go into interruptible sleep (S) or into uninterruptible sleep (D). Which one depends on what the code is supposed to do and whether you want to be able to receive signals and such while waiting for the mutex to become available. If you are curious about mutexes in the kernel, have a look at this, or a shorter and perhaps more human version here.

So how do you detect a process in D state?

That's rather simple! You can use regular tools like ps and top and look for the state field/column. For example:

juan@test:~$ ps aux|grep test
root     20592  0.0  0.0   4220   656 pts/2    S+   21:36   0:00 ./test
root     20593  0.0  0.0   4216   640 pts/24   D+   21:36   0:00 ./test
juan     20982  0.0  0.0  15964  2268 pts/27   R+   23:26   0:00 grep --color=auto test

In this case, process 20593 is in D+ state: uninterruptible sleep, and a foreground process.
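If you'd rather hunt for them programmatically, here's a small Python sketch (my own, not a standard tool) that walks /proc and collects the PIDs sitting in a given state:

```python
import os

def pids_in_state(wanted):
    """Return the PIDs whose state letter (field 3 of /proc/<pid>/stat)
    matches `wanted`, e.g. 'D' for uninterruptible sleep."""
    pids = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/stat") as f:
                state = f.read().rsplit(")", 1)[1].split()[0]
        except OSError:
            continue                      # process exited mid-scan
        if state == wanted:
            pids.append(int(entry))
    return pids
```

On a healthy box pids_in_state("D") is usually empty or changes between runs; a PID that keeps showing up across several scans is the one worth investigating.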

What to do with a process in D state? How to know what is actually locking the process:

This is where things get interesting. As I mentioned before, D state is usually temporary, and most of the time if you see a process in that state, by the next run of ps it will have changed to either R or S. BUT... if you believe the process is just refusing to leave D state, then what?

Lots of people answer this question with "I kill the process"... well, that's exactly why it's called UNinterruptible: it will absolutely ignore your most violent attempts to kill it. Any signals you send to the process will be queued and delivered eventually, once the process leaves the D state, if it ever does, of course.

Fortunately there are a few things that can help you debug this lovely situation and prevent it from happening in the future.

Have a look at the process stack:

This would be an initial approach: by having a look at the process stack you can get an idea of which code path took the process into that D condition. Luckily, looking at the stack isn't that hard nowadays, and you can do it through /proc:

juan@test:~$ sudo cat /proc/20593/stack
[<ffffffffc04ee045>] dev_open+0x15/0x40 [ebbcharmutex]
[<ffffffff811f0a7f>] chrdev_open+0x9f/0x1d0
[<ffffffff811e9527>] do_dentry_open+0x1f7/0x340
[<ffffffff811ead87>] vfs_open+0x57/0x60
[<ffffffff811faccc>] do_last+0x4ec/0x1190
[<ffffffff811fb9f0>] path_openat+0x80/0x600
[<ffffffff811fd10a>] do_filp_open+0x3a/0x90
[<ffffffff811eb109>] do_sys_open+0x129/0x280
[<ffffffff811eb27e>] SyS_open+0x1e/0x20
[<ffffffff817b788d>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff

Nice! So the process is locked in kernel space, and it got there by running an open syscall (SyS_open+0x1e/0x20) on some sort of character device (chrdev_open+0x9f/0x1d0), with a module called [ebbcharmutex] intervening. Since this is a stack trace, we know with high certainty that the process is blocked inside the dev_open call, and that the code for that function belongs to the ebbcharmutex module.
For more details about ebbcharmutex, I'd suggest going to the source (a really nice website about Linux and embedded systems); anyway, here's a piece of the code:
/** @brief The device open function that is called each time the device is opened
 *  This will only increment the numberOpens counter in this case.
 *  @param inodep A pointer to an inode object (defined in linux/fs.h)
 *  @param filep A pointer to a file object (defined in linux/fs.h)
 */
static int dev_open(struct inode *inodep, struct file *filep){
   mutex_lock(&ebbchar_mutex);                          // Acquire the mutex (blocks uninterruptibly until free)
//   if(!mutex_trylock(&ebbchar_mutex)){                  // Try to acquire the mutex (returns 0 on fail)
//      printk(KERN_ALERT "EBBChar: Device in use by another process");
//      return -EBUSY;
//   }
   numberOpens++;
   printk(KERN_INFO "EBBChar: Device has been opened %d time(s)\n", numberOpens);
   return 0;
}

As you can see, the very first line of the function calls mutex_lock() in its uninterruptible fashion, hence this is where our process is stuck.

Now what? Well... technically there's not much you can do to unlock the process; all you can try to do is identify who might be holding the lock without releasing it (of course, this requires some serious understanding of the problem and the code itself). In my simple scenario, you can see that I had two test processes, one of them in S+ state (scroll up, I'm not going to paste it twice :P). If we check the open files of that process, we can see it opened /dev/ebbchar, so it is the one holding the lock:
root@test:/home/juan# lsof -p 20592
test    20592 root  cwd    DIR    8,1     4096 466847 /home/juan/ModTest/exploringBB/extras/kernel/ebbcharmutex
test    20592 root  rtd    DIR    8,1     4096      2 /
test    20592 root  txt    REG    8,1     9102 466927 /home/juan/ModTest/exploringBB/extras/kernel/ebbcharmutex/test
test    20592 root  mem    REG    8,1  1857312 583879 /lib/x86_64-linux-gnu/libc-2.19.so
test    20592 root  mem    REG    8,1   149120 583895 /lib/x86_64-linux-gnu/ld-2.19.so
test    20592 root    0u   CHR  136,2      0t0      5 /dev/pts/2
test    20592 root    1u   CHR  136,2      0t0      5 /dev/pts/2
test    20592 root    2u   CHR  136,2      0t0      5 /dev/pts/2
test    20592 root    3u   CHR  250,0      0t0  69224 /dev/ebbchar
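You can also go the other way around: start from the device node and ask which processes have it open (lsof /dev/ebbchar, or fuser -v /dev/ebbchar). Under the hood both tools just walk /proc/&lt;pid&gt;/fd, which you can do by hand. A sketch against a regular file, since the demo doesn't need a real character device:

```shell
#!/bin/sh
tmp=$(mktemp)                   # a regular file standing in for /dev/ebbchar
tail -f "$tmp" & holder=$!      # background process that keeps the file open
sleep 1                         # give tail a moment to open it

# every open fd of a process is a symlink under /proc/<pid>/fd:
found=no
ls -l /proc/$holder/fd | grep -q "$tmp" && found=yes
echo "pid $holder holding $tmp open: $found"

kill "$holder"
rm -f "$tmp"
```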

If you can't figure out why the process isn't releasing the lock, you can simply kill it, since it is in S+ state (and therefore killable); this should free the lock and release the other stuck process:

root@test:/home/juan# kill -9 20592
root@test:/home/juan# ps aux|grep test
root     20593  0.0  0.0   4220   640 pts/24   S+   05:39   0:00 ./test
root     21460  0.0  0.0  15964  2140 pts/26   S+   12:06   0:00 grep --color=auto test

Perfect, process 20593 is now in S state!
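Since the state letter comes straight from ps, you can scan the whole system for uninterruptible tasks the same way; a quick sketch:

```shell
#!/bin/sh
# list every task whose state starts with D (uninterruptible sleep)
ps -eo state,pid,comm | awk 'NR==1 || $1 ~ /^D/'

# and count them
dcount=$(ps -eo state= | grep -c '^D')
echo "$dcount task(s) in D state"
```

On a healthy box the count should be 0 (or briefly non-zero during heavy IO).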

The kernel doesn't like processes in D state either:

Can't blame it, right? A long time in D state is usually a sign of some kind of issue, either high load or locking problems. Because of this, the kernel provides four sysctl settings to detect and act on hung tasks. Here you can see the default setup on Ubuntu:

root@test:/home/juan# cat /proc/sys/kernel/hung_task_timeout_secs
120
root@test:/home/juan# cat /proc/sys/kernel/hung_task_panic
0
root@test:/home/juan# cat /proc/sys/kernel/hung_task_check_count
4194304
root@test:/home/juan# cat /proc/sys/kernel/hung_task_warnings
10

What do these values mean? (For more precision, read the kernel's sysctl documentation.)
  • hung_task_timeout_secs: if a task stays in D state without being scheduled for more than this many seconds, a warning is reported.
  • hung_task_panic: tells the kernel whether it should panic when a hung task is detected. 0 for no panic, 1 for panic. This can be used, for example, to force a reboot of the system.
  • hung_task_check_count: the maximum number of tasks the hung task detector will examine on each run.
  • hung_task_warnings: the maximum number of hung-task reports to issue; every time a task is detected as hung this number is decreased, and once it reaches 0 no more reports are issued.

What does a warning report look like? Oh... they are lovely:
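All four knobs live under /proc/sys/kernel (equivalently, the kernel.* sysctl namespace), so checking them is easily scriptable; a sketch that degrades gracefully on kernels built without CONFIG_DETECT_HUNG_TASK:

```shell
#!/bin/sh
for f in hung_task_timeout_secs hung_task_panic hung_task_check_count hung_task_warnings; do
    p=/proc/sys/kernel/$f
    if [ -r "$p" ]; then
        printf '%-24s %s\n' "$f" "$(cat "$p")"
    else
        printf '%-24s %s\n' "$f" "(not available on this kernel)"
    fi
done
```

Writes go the same way (echo 300 > /proc/sys/kernel/hung_task_timeout_secs as root, or sysctl -w kernel.hung_task_timeout_secs=300), and an entry in /etc/sysctl.conf makes them persistent.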

Dec 20 19:57:25 test kernel: [78751.283729] INFO: task test:20593 blocked for more than 120 seconds.
Dec 20 19:57:25 test kernel: [78751.283737]       Tainted: G           OE  3.19.0-51-generic #58~14.04.1-Ubuntu
Dec 20 19:57:25 test kernel: [78751.283739] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 20 19:57:25 test kernel: [78751.283741] test            D ffff88003610bb68     0 20593  18722 0x00000000
Dec 20 19:57:25 test kernel: [78751.283748]  ffff88003610bb68 ffff88007ce9f5c0 0000000000013e80 ffff88003610bfd8
Dec 20 19:57:25 test kernel: [78751.283753]  0000000000013e80 ffff8800b771f5c0 ffff88007ce9f5c0 ffff88007ce9f5c0
Dec 20 19:57:25 test kernel: [78751.283756]  ffffffffc04f0100 ffffffffc04f0104 ffff88007ce9f5c0 00000000ffffffff
Dec 20 19:57:25 test kernel: [78751.283760] Call Trace:
Dec 20 19:57:25 test kernel: [78751.283773]  [<ffffffff817b3a29>] schedule_preempt_disabled+0x29/0x70
Dec 20 19:57:25 test kernel: [78751.283778]  [<ffffffff817b5745>] __mutex_lock_slowpath+0x95/0x100
Dec 20 19:57:25 test kernel: [78751.283781]  [<ffffffff817b57b5>] ? mutex_lock+0x5/0x37
Dec 20 19:57:25 test kernel: [78751.283785]  [<ffffffff817b57d3>] mutex_lock+0x23/0x37
Dec 20 19:57:25 test kernel: [78751.283790]  [<ffffffffc04ee045>] ? dev_open+0x15/0x40 [ebbcharmutex]
Dec 20 19:57:25 test kernel: [78751.283793]  [<ffffffffc04ee045>] dev_open+0x15/0x40 [ebbcharmutex]
Dec 20 19:57:25 test kernel: [78751.283798]  [<ffffffff811f0a7f>] chrdev_open+0x9f/0x1d0
Dec 20 19:57:25 test kernel: [78751.283802]  [<ffffffff811e9527>] do_dentry_open+0x1f7/0x340
Dec 20 19:57:25 test kernel: [78751.283805]  [<ffffffff811f09e0>] ? cdev_put+0x30/0x30
Dec 20 19:57:25 test kernel: [78751.283808]  [<ffffffff811ead35>] ? vfs_open+0x5/0x60
Dec 20 19:57:25 test kernel: [78751.283812]  [<ffffffff811ead87>] vfs_open+0x57/0x60
Dec 20 19:57:25 test kernel: [78751.283815]  [<ffffffff811faccc>] do_last+0x4ec/0x1190
Dec 20 19:57:25 test kernel: [78751.283820]  [<ffffffff811fb9f0>] path_openat+0x80/0x600
Dec 20 19:57:25 test kernel: [78751.283825]  [<ffffffff810a6ef5>] ? update_stats_wait_end+0x5/0xd0
Dec 20 19:57:25 test kernel: [78751.283828]  [<ffffffff817b38b5>] ? _cond_resched+0x5/0x40
Dec 20 19:57:25 test kernel: [78751.283832]  [<ffffffff811fd10a>] do_filp_open+0x3a/0x90
Dec 20 19:57:25 test kernel: [78751.283836]  [<ffffffff81209cd7>] ? __alloc_fd+0xa7/0x130
Dec 20 19:57:25 test kernel: [78751.283839]  [<ffffffff811fd0d5>] ? do_filp_open+0x5/0x90
Dec 20 19:57:25 test kernel: [78751.283843]  [<ffffffff811eb109>] do_sys_open+0x129/0x280
Dec 20 19:57:25 test kernel: [78751.283846]  [<ffffffff811eafe5>] ? do_sys_open+0x5/0x280
Dec 20 19:57:25 test kernel: [78751.283850]  [<ffffffff811eb27e>] SyS_open+0x1e/0x20
Dec 20 19:57:25 test kernel: [78751.283853]  [<ffffffff817b788d>] system_call_fastpath+0x16/0x1b

The report provides a detailed stack trace along with some other details. If you look at the top of the stack trace, it becomes crystal clear that the process is stuck trying to acquire a mutex: __mutex_lock_slowpath, called from dev_open.
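You don't have to wait 120 seconds for the detector to complain, either: the kernel exposes the same stack through /proc/&lt;pid&gt;/stack (root only, and the kernel must be built with CONFIG_STACKTRACE), so a stuck process can be inspected on demand. A sketch that dumps the current shell's own kernel stack, falling back to a message when the file isn't readable:

```shell
#!/bin/sh
pid=$$                              # any PID of interest; our own shell here
if [ -r /proc/$pid/stack ]; then
    stack=$(cat /proc/$pid/stack)   # kernel-side stack, like in the report
else
    stack="cannot read /proc/$pid/stack (need root and CONFIG_STACKTRACE)"
fi
echo "$stack"
```

For a system-wide view, echo w > /proc/sysrq-trigger (as root) makes the kernel dump the stack of every blocked (D state) task to the kernel log.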

Summing this up:

Processes in D state are rare, and can be an indication of system issues such as intense IO, high load, or locking problems. Being uninterruptible, they cannot have signals delivered to them, so they can't be killed. Fortunately, there are a few things you can do to gather more information and give yourself a chance to troubleshoot the problem.