Wednesday, January 23, 2019

When ps let me down

Some tools have been around for ages, helping systems administrators through rough times and making our lives and jobs somewhat easier (ok, not all the time xD). One of these marvels is, without a doubt, the almighty ps. What if I told you that ps stole a few hours of my life with no remorse? Would you believe me? Well... you should xD.

Situation:


Let's assume you have a Linux system and you've noticed a high system load average, along with unusually high CPU utilization. Having a look at the metrics, you realize both have been increasing for a few weeks now.

A quick glance at top shows something like this:
top - 20:58:27 up  1:18,  3 users,  load average: 50.27, 29.91, 14.09
Tasks: 20215 total,  13 running, 202 sleeping,   0 stopped, 20000 zombie
%Cpu(s): 29.7 us, 70.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.3 si,  0.0 st
KiB Mem:   3007548 total,  2584796 used,   422752 free,   214556 buffers
KiB Swap:  1048568 total,        0 used,  1048568 free.   983056 cached Mem

 PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
25244 juan      20   0   14228   2180   2060 R  14.1  0.1   0:00.77 ps
25246 juan      20   0   14228   2188   2064 R  13.5  0.1   0:00.70 ps
25248 juan      20   0   14228   2036   1916 R  13.5  0.1   0:00.61 ps
25250 juan      20   0   14228   2100   1976 R  13.5  0.1   0:00.57 ps
25252 juan      20   0   14228   2092   1968 R  13.5  0.1   0:00.49 ps
25254 juan      20   0   14228   2180   2056 R  13.2  0.1   0:00.45 ps
25256 juan      20   0   14228   2092   1968 R  11.5  0.1   0:00.39 ps
23469 root      20   0   47308  20812   2572 R  10.9  0.7   1:18.58 top
25258 juan      20   0   14228   2228   2104 R   9.4  0.1   0:00.32 ps
25260 juan      20   0   14228   2212   2088 R   7.4  0.1   0:00.25 ps
25262 juan      20   0   14228   2040   1916 R   5.0  0.1   0:00.17 ps
25264 juan      20   0   14228   2188   2064 R   2.9  0.1   0:00.10 ps
 2290 juan      20   0 2335644 128804  78096 S   1.2  4.3   0:06.22 skypeforli+
    9 root      20   0       0      0      0 S   0.9  0.0   0:05.64 rcuos/0
   18 root      20   0       0      0      0 S   0.9  0.0   0:04.86 rcuos/1
    7 root      20   0       0      0      0 S   0.6  0.0   0:03.59 rcu_sched
 1071 root      20   0  470184 113424  30860 S   0.6  3.8   0:35.99 Xorg

So, what's wrong with this system?


Well, a bunch of things as you can see:

  • To begin with, it kind of looks like an episode of "The Walking Dead": there are 20,000 zombie processes xD. Not a big deal, right? Zombie processes don't take up much memory and don't add to the load. However, they can easily consume all the process slots in the kernel, and that would be a big deal (a quick way to check that limit is shown right after this list).
  • Then, the system load is absolutely insane for 2 vCPUs (13 processes in the run queue).
  • On top of all that, we have several freaking ps processes taking up to 14% of the CPU time each. How can ps possibly be taking that much CPU time?
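
For reference, a quick way to check how close you are to running out of process slots is to compare the kernel limits with the current number of entries in /proc (which knob bites first depends on the system, so take this as a rough check):
# highest PID value the kernel will hand out
cat /proc/sys/kernel/pid_max
# system-wide limit on the total number of tasks
cat /proc/sys/kernel/threads-max
# how many processes currently live in /proc
ls /proc | grep -c '^[0-9]'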

Where do the zombies come from?


Unlike human babies, which we all know are delivered by storks, zombie processes are the result of parent processes not cleaning up after their terminated children (more details here). In this case we clearly have a good bunch of them; however, in this particular scenario I faked the zombie processes myself, so we won't go into the details of why they were created.
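
For the curious, one quick and dirty way to fake a zombie from a shell is the sketch below (I'm not saying this is how I produced 20k of them, it's just an illustration):
# the subshell forks a short-lived child and then exec's into a long sleep;
# the sleep never calls wait(), so once the child exits it lingers as a zombie
# until its parent (the sleep) finally dies
(sleep 1 & exec sleep 300) &

# a couple of seconds later the child shows up in Z state (<defunct>)
sleep 2; ps -o pid,ppid,state,cmd --ppid $!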

Why is the load so freaking high?


The reason for the high load is pretty clear: we have tons of processes in the run queue, plus potentially a few others in D state (uninterruptible sleep) blocked on some resource. All of these add up to the system load, and what's really concerning is that the load kept increasing, so at some point this system would have become unresponsive.
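
By the way, you can get a quick tally of process states straight out of /proc, without spawning an army of ps processes (ironic, I know); R is runnable, D is uninterruptible sleep, Z is zombie:
# the 4th field of loadavg is runnable-entities/total-entities
cat /proc/loadavg
# count processes by state, reading /proc directly
grep -h '^State:' /proc/[0-9]*/status 2>/dev/null | awk '{print $2}' | sort | uniq -c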

Why in the world is ps taking so much CPU time?


Having a look at the ps processes, it looks as if they were all trying to pull the state of one given process:
root@test:/home/juan# ps ha -o state,cmd|grep ps
R ps h --pid 3234 -o state
R ps h --pid 3234 -o state
R ps h --pid 3234 -o state
R ps h --pid 3234 -o state
R ps h --pid 3234 -o state
R ps h --pid 3234 -o state
R ps h --pid 3234 -o state
R ps h --pid 3234 -o state
R ps h --pid 3234 -o state
R ps h --pid 3234 -o state
R ps ha -o state,cmd
S grep --color=auto ps
R ps h --pid 3234 -o state
R ps h --pid 3234 -o state
root@test:/home/juan#

A pretty harmless move, you would say, so let's time that operation:
root@test:/home/juan# time ps h --pid 3234 -o state
S

real    0m1.003s
user    0m0.188s
sys     0m0.552s
root@test:/home/juan#

Wait... what? It took 1 second to run ps? What age are we living in?

So ps spent a bit more than half of the time (~550 ms) running in kernel space, about ~190 ms in user space, and the remaining ~260 ms (real minus user minus sys) waiting: maybe in the scheduler queue, maybe on I/O, maybe both, who knows.

Half a second spent in the kernel seems like a lot for such a simple task ("get the process state of PID 3234"), right? So what's going on?

Strace to the rescue:
root@test:/home/juan# strace -c ps h --pid 3234 -o state
S
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 71.16    0.014693         735        20           getdents
 26.54    0.005481           0     40418           read
  1.54    0.000317           0     40423         4 open
  0.47    0.000097           0     40420           close
  0.30    0.000061           0     20210           stat
  0.00    0.000000           0         1           write
  0.00    0.000000           0         6           fstat
  0.00    0.000000           0         2           lseek
  0.00    0.000000           0        16           mmap
  0.00    0.000000           0         7           mprotect
  0.00    0.000000           0         3           munmap
  0.00    0.000000           0         3           brk
  0.00    0.000000           0        23           rt_sigaction
  0.00    0.000000           0         2           ioctl
  0.00    0.000000           0         4         4 access
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         1           uname
  0.00    0.000000           0         1           geteuid
  0.00    0.000000           0         1           arch_prctl
  0.00    0.000000           0         1           openat
------ ----------- ----------- --------- --------- ----------------
100.00    0.020649                142959         8 total
root@test:/home/juan#
Turns out it took ps "only" 142,959 system calls to get the process state of PID 3234, and that is indeed a pretty insane number of syscalls.

  • ~70% of the time in kernel space was consumed by 20 getdents syscalls; these are the calls pulling the directory listing from /proc. Keep in mind that /proc is a pretty crowded place when there are ~20k processes in the system, so despite there not being many of these calls, they actually pulled a lot of data (about 600 KB, in 32 KB sized chunks):
root@test:/home/juan# strace -T -e getdents ps h --pid 3234 -o state
getdents(5, /* 1350 entries */, 32768)  = 32768
getdents(5, /* 1365 entries */, 32768)  = 32760
getdents(5, /* 1365 entries */, 32768)  = 32760
getdents(5, /* 1365 entries */, 32768)  = 32760
getdents(5, /* 1365 entries */, 32768)  = 32760
getdents(5, /* 1024 entries */, 32768)  = 32752
getdents(5, /* 1024 entries */, 32768)  = 32768
getdents(5, /* 1024 entries */, 32768)  = 32768
getdents(5, /* 1024 entries */, 32768)  = 32768
getdents(5, /* 1024 entries */, 32768)  = 32768
getdents(5, /* 1024 entries */, 32768)  = 32768
getdents(5, /* 1024 entries */, 32768)  = 32768
getdents(5, /* 1024 entries */, 32768)  = 32768
getdents(5, /* 1024 entries */, 32768)  = 32768
getdents(5, /* 1024 entries */, 32768)  = 32768
getdents(5, /* 1024 entries */, 32768)  = 32768
getdents(5, /* 1024 entries */, 32768)  = 32768
getdents(5, /* 1024 entries */, 32768)  = 32768
getdents(5, /* 309 entries */, 32768)   = 9888
getdents(5, /* 0 entries */, 32768)     = 0
+++ exited with 1 +++
root@test:/home/juan#
  • ~26% of the time was spent on read syscalls (40418 of them); these are effectively the ones reading data from the files under /proc.
  • The interesting part here is that the number of read, open, and close syscalls happens to be almost exactly twice the number of stat calls (20210), and of course that the number of stat calls is almost exactly the number of processes in the system. That is consistent with ps reading two files per process, /proc/PID/stat and /proc/PID/status, as we'll see in the dump further down (a quick sanity check follows this list).
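
A rough sanity check (the counts drift a bit as processes come and go): the number of PID directories in /proc should sit right around the number of stat calls above.
# should print something very close to the ~20210 stat() calls in the strace summary
ls /proc | grep -c '^[0-9]'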

Wait... are you trying to tell me that ps, despite the fact that you asked for a single process state, goes through all the processes in the system anyway?

Yeahp, that's exactly what it's doing. You don't trust me? (Can't blame you, I've been wrong before xD.)

That single write syscall in the list above is the one used by ps to print the command output (the process state), right? OK, so let's dump the full strace output to a file and have a look around that write.
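Something along these lines does the trick (same ps invocation as before):
strace -o strace.txt ps h --pid 3234 -o state

Grepping around the write in the dump gives a better picture: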
root@test:/home/juan# grep ^write -A5 -B5  strace.txt
read(6, "3234 (apache) S 3222 3234 3204 3"..., 1024) = 306
close(6)                                = 0
open("/proc/3234/status", O_RDONLY)     = 6
read(6, "Name:\tapache\nState:\tS (sleeping)"..., 1024) = 780
close(6)                                = 0
fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 4), ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7ff10e10f000
write(1, "S\n", 2)                      = 2
stat("/proc/3235", {st_mode=S_IFDIR|0555, st_size=0, ...}) = 0
open("/proc/3235/stat", O_RDONLY)       = 6
read(6, "3235 (apache) Z 3234 3234 3234 3"..., 1024) = 162
close(6)                                = 0
open("/proc/3236/status", O_RDONLY)     = 6
root@test:/home/juan#
Clearly ps goes one by one through all the directories available in /proc, and once it comes across the one requested with "--pid" it writes the output; however, it keeps moving forward until all directories have been checked (this is disturbingly fun when you try it with PID 1, see below).
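
You can check that for yourself: PID 1 is the very first entry in /proc, yet on a box this crowded the command below should take just about as long as it did for PID 3234, since ps still walks the whole table before exiting:
time ps h --pid 1 -o state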

Now... why would ps do this?

This seems to be a design decision after all. The ps man page states:

"ps - report a snapshot of the current processes."

Note: this suggests ps may actually stand for "process snapshot" (who knows).

"Except as described below, process selection options are additive. The default selection is discarded, and then the selected processes are added to the set of processes to be displayed. A process will thus be shown if it meets any of the given selection criteria."

So what you actually do with the options is describe to ps which processes you want printed and which information to show; ps then goes through all of them and prints out the ones that match your description.
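
For instance (assuming there is a cron daemon running on the box), mixing two selection options gives you the union of both sets rather than narrowing anything down:
# selects PID 1 *plus* every process whose command name is cron
ps --pid 1 -C cron -o pid,state,comm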

Summary


With all this in mind, it looks like using ps to get the details of a process whose PID you already know isn't efficient at all, and can be a terrible idea on systems hosting a large number of processes.

After all, it looks like you would be better off reading the state directly from /proc yourself:
root@test:/home/juan# strace -c grep ^State /proc/3234/status
State:  S (sleeping)
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
  0.00    0.000000           0         5           read
  0.00    0.000000           0         1           write
  0.00    0.000000           0         6           open
  0.00    0.000000           0         9           close
  0.00    0.000000           0         9           fstat
  0.00    0.000000           0         1         1 lseek
  0.00    0.000000           0        14           mmap
  0.00    0.000000           0         8           mprotect
  0.00    0.000000           0         2           munmap
  0.00    0.000000           0         3           brk
  0.00    0.000000           0         2         1 ioctl
  0.00    0.000000           0         5         5 access
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         1           arch_prctl
  0.00    0.000000           0         1           openat
------ ----------- ----------- --------- --------- ----------------
100.00    0.000000                    68         7 total
root@test:/home/juan#

It only took grep 68 syscalls to retrieve the state of the process :P. Yeah, I know, now I'm just being a jerk hahaha.
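
And if you really want to squeeze it, a minimal sketch in plain bash (assuming the comm field of the target process contains no spaces, otherwise the field positions shift) gets the state without exec'ing anything at all:
# /proc/PID/stat fields: pid (comm) state ppid ...
read -r _ _ state _ < /proc/3234/stat
echo "$state"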