Situation:
Let's assume you have a Linux system and you notice a high system load average along with unusually high CPU utilization. Having a look at the metrics, you realize both have been increasing for a few weeks now.
A quick glance at top shows something like this:
top - 20:58:27 up  1:18,  3 users,  load average: 50.27, 29.91, 14.09
Tasks: 20215 total,  13 running, 202 sleeping,   0 stopped, 20000 zombie
%Cpu(s): 29.7 us, 70.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.3 si,  0.0 st
KiB Mem:   3007548 total,  2584796 used,   422752 free,   214556 buffers
KiB Swap:  1048568 total,        0 used,  1048568 free.   983056 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
25244 juan      20   0   14228   2180   2060 R  14.1  0.1   0:00.77 ps
25246 juan      20   0   14228   2188   2064 R  13.5  0.1   0:00.70 ps
25248 juan      20   0   14228   2036   1916 R  13.5  0.1   0:00.61 ps
25250 juan      20   0   14228   2100   1976 R  13.5  0.1   0:00.57 ps
25252 juan      20   0   14228   2092   1968 R  13.5  0.1   0:00.49 ps
25254 juan      20   0   14228   2180   2056 R  13.2  0.1   0:00.45 ps
25256 juan      20   0   14228   2092   1968 R  11.5  0.1   0:00.39 ps
23469 root      20   0   47308  20812   2572 R  10.9  0.7   1:18.58 top
25258 juan      20   0   14228   2228   2104 R   9.4  0.1   0:00.32 ps
25260 juan      20   0   14228   2212   2088 R   7.4  0.1   0:00.25 ps
25262 juan      20   0   14228   2040   1916 R   5.0  0.1   0:00.17 ps
25264 juan      20   0   14228   2188   2064 R   2.9  0.1   0:00.10 ps
 2290 juan      20   0 2335644 128804  78096 S   1.2  4.3   0:06.22 skypeforli+
    9 root      20   0       0      0      0 S   0.9  0.0   0:05.64 rcuos/0
   18 root      20   0       0      0      0 S   0.9  0.0   0:04.86 rcuos/1
    7 root      20   0       0      0      0 S   0.6  0.0   0:03.59 rcu_sched
 1071 root      20   0  470184 113424  30860 S   0.6  3.8   0:35.99 Xorg
So, what's wrong with this system?
Well, a bunch of things as you can see:
- To begin with, it kind of looks like an episode of "The Walking Dead": there are 20000 zombie processes xD. Not a big deal, right? Zombie processes don't take much memory and don't increase the load. However, they can easily consume all the process slots in the kernel, and that would be a big deal (see the quick check after this list).
- Then system load is literally insane for 2 vCPUs (13 processes in the running queue).
- Adding up to all that, we have several freaking ps processes taking up to 14% of the CPU time each. How can ps be taking that much CPU time?
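To get a feel for how close those zombies are to exhausting the process slots, a quick check like the following should do (a sketch, assuming a procps-style ps and the usual /proc/sys/kernel tunables):

# Count zombie processes and compare against the kernel's task limits,
# i.e. the point at which fork() starts failing.
ps -eo stat= | grep -c '^Z'          # number of processes in Z (zombie) state
cat /proc/sys/kernel/pid_max         # highest PID the kernel will hand out
cat /proc/sys/kernel/threads-max     # maximum number of tasks system-wide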
Where do the zombies come from?
Unlike human babies, which we all know are delivered by storks, zombie processes are the result of parent processes not cleaning up after their terminated children (more details here). In this case we clearly have a good bunch of them; however, in this particular scenario I literally faked the zombie processes, so we won't go into the details of why they were created.
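For the curious, a minimal way to fake a zombie from a shell looks something like this (just an illustration, not necessarily how these ones were produced): a subshell forks a short-lived child and then execs a long-running sleep, so the process that ends up being the child's parent never calls wait().

# The subshell forks "sleep 1", then replaces itself with "sleep 60".
# When "sleep 1" exits, its parent (now "sleep 60") never reaps it,
# so it lingers in Z state until "sleep 60" finishes.
( sleep 1 & exec sleep 60 ) &
sleep 2
ps -eo pid,stat,comm | awk '$2 ~ /^Z/'   # the defunct sleep should show up here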
Why is the load so freaking high?
The reason for the high load is pretty clear: we have tons of processes in the running queue, plus potentially some others in D state, blocked waiting on some resource. All of these add up to the system load, and what's really concerning is that the load kept increasing, so at some point this system would have become unresponsive.
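If you want to see what is feeding the load average at any given moment, a quick count of runnable and uninterruptible tasks next to the load figures does the trick (a sketch, assuming a procps-style ps):

# R = runnable (on CPU or in the run queue), D = uninterruptible sleep;
# both count towards the Linux load average.
ps -eo stat= | awk '/^R/ {r++} /^D/ {d++} END {print "running:", r+0, "  uninterruptible:", d+0}'
cat /proc/loadavg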
Why in the world is ps taking so much CPU time?
Having a look at the ps processes, it looks as if they are all trying to pull the state of a given process:
root@test:/home/juan# ps ha -o state,cmd|grep ps
R ps h --pid 3234 -o state
R ps h --pid 3234 -o state
R ps h --pid 3234 -o state
R ps h --pid 3234 -o state
R ps h --pid 3234 -o state
R ps h --pid 3234 -o state
R ps h --pid 3234 -o state
R ps h --pid 3234 -o state
R ps h --pid 3234 -o state
R ps h --pid 3234 -o state
R ps ha -o state,cmd
S grep --color=auto ps
R ps h --pid 3234 -o state
R ps h --pid 3234 -o state
root@test:/home/juan#
A pretty harmless move, you would say, so let's time that operation:
root@test:/home/juan# time ps h --pid 3234 -o state
S

real    0m1.003s
user    0m0.188s
sys     0m0.552s
root@test:/home/juan#
So ps spent a bit more than half of the wall-clock time running in kernel space, about ~200 ms in user space, and the remaining ~260 ms waiting somewhere: maybe in the scheduler queue, maybe on IO, maybe all of the above, who knows.
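That last figure is simply what is left of the wall-clock time once the CPU time is subtracted; a quick back-of-the-envelope with the numbers above (assuming bc is available):

# wall clock - (user + sys) ≈ time the process was not on a CPU at all
echo "1.003 - (0.188 + 0.552)" | bc     # prints .263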
Half a second spent in the kernel seems to be a lot for such a simple task "get the process state of PID 3234", right? So what's going on?
Strace to the rescue:
root@test:/home/juan# strace -c ps h --pid 3234 -o state
S
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 71.16    0.014693         735        20           getdents
 26.54    0.005481           0     40418           read
  1.54    0.000317           0     40423         4 open
  0.47    0.000097           0     40420           close
  0.30    0.000061           0     20210           stat
  0.00    0.000000           0         1           write
  0.00    0.000000           0         6           fstat
  0.00    0.000000           0         2           lseek
  0.00    0.000000           0        16           mmap
  0.00    0.000000           0         7           mprotect
  0.00    0.000000           0         3           munmap
  0.00    0.000000           0         3           brk
  0.00    0.000000           0        23           rt_sigaction
  0.00    0.000000           0         2           ioctl
  0.00    0.000000           0         4         4 access
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         1           uname
  0.00    0.000000           0         1           geteuid
  0.00    0.000000           0         1           arch_prctl
  0.00    0.000000           0         1           openat
------ ----------- ----------- --------- --------- ----------------
100.00    0.020649                142959         8 total
root@test:/home/juan#

Turns out it took ps "only 142959 system calls" to get the process state of PID 3234, and that is indeed a pretty insane number of syscalls.
- ~70% of the time in kernel space was consumed by just 20 getdents syscalls; these are the calls pulling the directory listing from /proc. Keep in mind that /proc is a pretty crowded place when there are ~20k processes in the system, so despite being few, these calls actually pulled a lot of data (about 600 KB, in 32 KB fragments).
root@test:/home/juan# strace -T -e getdents ps h --pid 3234 -o state
getdents(5, /* 1350 entries */, 32768)  = 32768
getdents(5, /* 1365 entries */, 32768)  = 32760
getdents(5, /* 1365 entries */, 32768)  = 32760
getdents(5, /* 1365 entries */, 32768)  = 32760
getdents(5, /* 1365 entries */, 32768)  = 32760
getdents(5, /* 1024 entries */, 32768)  = 32752
getdents(5, /* 1024 entries */, 32768)  = 32768
getdents(5, /* 1024 entries */, 32768)  = 32768
getdents(5, /* 1024 entries */, 32768)  = 32768
getdents(5, /* 1024 entries */, 32768)  = 32768
getdents(5, /* 1024 entries */, 32768)  = 32768
getdents(5, /* 1024 entries */, 32768)  = 32768
getdents(5, /* 1024 entries */, 32768)  = 32768
getdents(5, /* 1024 entries */, 32768)  = 32768
getdents(5, /* 1024 entries */, 32768)  = 32768
getdents(5, /* 1024 entries */, 32768)  = 32768
getdents(5, /* 1024 entries */, 32768)  = 32768
getdents(5, /* 1024 entries */, 32768)  = 32768
getdents(5, /* 309 entries */, 32768)   = 9888
getdents(5, /* 0 entries */, 32768)     = 0
+++ exited with 1 +++
root@test:/home/juan#
- ~26% of the time was spent on read syscalls (40418 of them); these are the ones effectively reading data from the files within /proc.
- The interesting part here is that the number of read, open and close syscalls happens to be almost exactly twice the number of stat calls (20210), and of course that the number of stat calls is almost exactly the number of processes in the system (see the quick cross-check right after this list).
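Those ratios are easy to sanity-check yourself (a sketch, reusing PID 3234 from above; on newer systems open and stat may show up as openat and statx instead):

# Numeric entries in /proc ≈ number of processes ≈ number of stat calls;
# open/read/close come out at roughly twice that, because ps reads both
# /proc/<pid>/stat and /proc/<pid>/status for every process it visits.
ls /proc | grep -c '^[0-9]'
strace -c -e trace=stat,open,read,close ps h --pid 3234 -o state > /dev/null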
Wait... are you trying to tell me that ps, despite the fact that you asked for a single process state, is going through all the processes in the system anyways?
Yeahp, that's exactly what it's doing. You don't trust me? (Can't blame you, I've been wrong before xD.)
That single write syscall in the list above is the one ps uses to print the command's output (the process state), right? OK, so if we dump the strace output to a file, we can get a better picture:
root@test:/home/juan# grep ^write -A5 -B5 strace.txt
read(6, "3234 (apache) S 3222 3234 3204 3"..., 1024) = 306
close(6)                                = 0
open("/proc/3234/status", O_RDONLY)     = 6
read(6, "Name:\tapache\nState:\tS (sleeping)"..., 1024) = 780
close(6)                                = 0
fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 4), ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7ff10e10f000
write(1, "S\n", 2)                      = 2
stat("/proc/3235", {st_mode=S_IFDIR|0555, st_size=0, ...}) = 0
open("/proc/3235/stat", O_RDONLY)       = 6
read(6, "3235 (apache) Z 3234 3234 3234 3"..., 1024) = 162
close(6)                                = 0
open("/proc/3236/status", O_RDONLY)     = 6
root@test:/home/juan#
Clearly ps goes one by one through all the directories available in /proc, and once it comes across the one requested by "--pid" it writes the output; however, it keeps moving forward until all the directories have been checked (this is disturbingly fun when you try it with PID 1).
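If ps really scans all of /proc regardless of which PID you ask for, its runtime should be roughly the same whether you ask for PID 1 or for a very high PID. An easy experiment to confirm that (a sketch; 25000 is just an example, pick a PID that exists on your box):

# Both invocations should take about the same time on a crowded system,
# since ps walks every /proc/<pid> directory either way.
time ps h --pid 1 -o state > /dev/null
time ps h --pid 25000 -o state > /dev/null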
Now... why would ps do this?
This seems to be a design decision after all. The ps man page states:
"Except as described below, process selection options are additive. The default selection is discarded, and then the selected processes are added to the set of processes to be displayed. A process will thus be shown if it meets any of the given selection criteria."
So what you actually do with the options is descrcibe to ps which processes you want to print out and which information. Then ps will go through all of them and print out the ones that match your description.
With all these in mind, looks like using ps to get details of a given proces which PID you already know doesn't seem to be efficient at all, and can be a terrible idea in systems hosting a big number of processes.
After all looks like you would be better of reading the state directly from /proc yourself:
"ps - report a snapshot of the current processes."
Note: this suggests ps may actually stand for "process snapshot" (who knows).
"Except as described below, process selection options are additive. The default selection is discarded, and then the selected processes are added to the set of processes to be displayed. A process will thus be shown if it meets any of the given selection criteria."
So what you actually do with the options is describe to ps which processes you want printed and which information you want for them. Then ps goes through all the processes and prints the ones that match your description.
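The "additive" part is easy to see in action: combine two selection options and ps prints the union of both sets, not the intersection (a sketch, assuming a procps-style ps and that at least one bash process exists on the system):

# Selects PID 1 *and* every process whose command name is "bash";
# the result is the union of both selections.
ps --pid 1 -C bash -o pid,comm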
Summary
With all this in mind, using ps to get the details of a process whose PID you already know doesn't seem to be efficient at all, and can be a terrible idea on systems hosting a big number of processes.
After all, it looks like you would be better off reading the state directly from /proc yourself:
root@test:/home/juan# strace -c grep ^State /proc/3234/status
State:  S (sleeping)
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
  0.00    0.000000           0         5           read
  0.00    0.000000           0         1           write
  0.00    0.000000           0         6           open
  0.00    0.000000           0         9           close
  0.00    0.000000           0         9           fstat
  0.00    0.000000           0         1         1 lseek
  0.00    0.000000           0        14           mmap
  0.00    0.000000           0         8           mprotect
  0.00    0.000000           0         2           munmap
  0.00    0.000000           0         3           brk
  0.00    0.000000           0         2         1 ioctl
  0.00    0.000000           0         5         5 access
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         1           arch_prctl
  0.00    0.000000           0         1           openat
------ ----------- ----------- --------- --------- ----------------
100.00    0.000000                    68         7 total
root@test:/home/juan#
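And if you want to shave off even the fork/exec of grep, the same information can be pulled with shell built-ins alone (a minimal sketch, reusing PID 3234 from above):

# Read the State line straight from /proc/<pid>/status; no external
# process is spawned, and nothing else in /proc is touched.
pid=3234
while IFS= read -r line; do
    case "$line" in State:*) printf '%s\n' "$line" ;; esac
done < "/proc/$pid/status"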