Ningún detalle es anecdótico...: c


Wednesday, December 13, 2017

Some Zombies just can't be killed - Zombie Processes

So many movies and TV series have taught us that, no matter how smelly or ugly zombies are, you can easily kill them by either blowing their heads off with a shotgun or gently detaching the head from the rest of the body (this doesn't really kill the zombie but renders it almost harmless; fun fact: a head can't run once detached from the legs). Unfortunately, zombie processes don't follow this expected zombie behavior: they just can't be killed, not even with kill -9 PID.

What in the world is a Zombie process?

Wikipedia has a pretty nice description here, but essentially we are talking about a process that has finished its execution through either the exit or exit_group syscall but remains present in the system due to some unfinished housekeeping. You can easily identify a zombie process by its Z state in, for example, top or ps. A nice description can be found in the wait man page:

       A child that terminates, but has not been waited for becomes a
       "zombie".  The kernel maintains a minimal set of information about
       the zombie process (PID, termination status, resource usage
       information) in order to allow the parent to later perform a wait to
       obtain information about the child.  As long as a zombie is not
       removed from the system via a wait, it will consume a slot in the
       kernel process table, and if this table fills, it will not be
       possible to create further processes.  If a parent process
       terminates, then its "zombie" children (if any) are adopted by
       init(1), (or by the nearest "subreaper" process as defined through
       the use of the prctl(2) PR_SET_CHILD_SUBREAPER operation); init(1)
       automatically performs a wait to remove the zombies.

For some reason there's a lot of confusion around this Z state, for example:

Orphan processes aren't zombie processes. In fact, for a process to stay a zombie its parent has to be still alive. If the parent dies, the child is adopted by init and waited for properly.
A zombie process isn't consuming CPU, and it isn't consuming the memory it used while it was running. However, it still occupies a slot in the process table, and this could be a problem.

What happens when exit or exit_group are called?

As I mentioned before, these are the syscalls used to terminate a process gracefully. You can find the details of these two for kernel 4.14 here and here. I'm not going to describe the whole code because, first, I don't understand all of it and, second, I would probably get it wrong :), but I'm going to point out some interesting parts:

Things begin in do_group_exit, which takes care of killing all the threads in the current thread group and then calls do_exit to finish the process itself.
do_exit takes care of a few things as well, for example releasing some of the resources associated with the process, like files and shared memory (lines 858-866). Later on, in line 885, it calls exit_notify, which is in charge of delivering the bad news to the related processes, a kind of "Hey! I'm dying" thing. Within this function a few things happen, the most important being the following:

do_notify_parent is called to signal the parent process with SIGCHLD. Its return value tells the caller whether the dying process can be reaped immediately (true) or has to become a zombie (false).

Have a look at the comment in here. Long story short: if the parent ignores SIGCHLD (by setting the handler to SIG_IGN) or has set the SA_NOCLDWAIT flag, the dying process can reap itself and doesn't become a zombie, and in that case do_notify_parent returns true. Otherwise it returns false.

Back to exit_notify, you can see in here how the process decides whether or not to become a zombie depending on the autoreap variable.

Something worth mentioning is that Zombie is an exit state (tsk->exit_state), while the scheduling state of the process is actually Dead; you can see this being set here by the call to do_task_dead. So a zombie process isn't consuming CPU at all: from the scheduler's perspective the process is gone, and so are all the other resources that had been allocated to it.

Why can't it be killed?

Well, you can still send the SIGKILL signal to a zombie process and, as you may know, nothing can escape that signal. However, even though the process is still in the process table, there is nothing left for the signal to act on: no code attached to it anymore, no stack, no files, nothing.

Example

Let's have a look at all this zombie stuff with a simple example. You can get the code here; it's a simple C program that forks 2 child processes and allocates some memory in each. The parent will sleep for 30 seconds, while the child processes will sleep for 15 seconds, hence dying while the parent is still running (sleeping) and becoming spoooky zombiesssss:

juan@test:~$ gcc -o zombie_maker zombie_maker.c
juan@test:~$ ./zombie_maker &
juan@test:~$ Parent PID = 2050, PGRP = 2050, PPID = 1934, PSID = 1934
Child 0 -> PID = 2051, PGRP = 2050, PPID = 2050, PSID = 1934
Child 1 -> PID = 2052, PGRP = 2050, PPID = 2050, PSID = 1934

juan@test:~$ ps aux|grep zombie
juan      2050  0.0  0.0   4216   732 pts/1    S    21:34   0:00 ./zombie_maker
juan      2051  0.0  0.0   6268    92 pts/1    S    21:34   0:00 ./zombie_maker
juan      2052  0.0  0.0   6268    92 pts/1    S    21:34   0:00 ./zombie_maker
juan      2054  0.0  0.0  15960  2268 pts/1    R+   21:34   0:00 grep --color=auto zombie
juan@test:~$ ps aux|grep zombie
juan      2050  0.0  0.0   4216   732 pts/1    S    21:34   0:00 ./zombie_maker
juan      2051  0.0  0.0      0     0 pts/1    Z    21:34   0:00 [zombie_maker] 
juan      2052  0.0  0.0      0     0 pts/1    Z    21:34   0:00 [zombie_maker] 
juan      2056  0.0  0.0  15960  2268 pts/1    S+   21:34   0:00 grep --color=auto zombie
juan@test:~$ ps aux|grep zombie
juan      2058  0.0  0.0  15960  2288 pts/1    S+   21:35   0:00 grep --color=auto zombie
[2]+  Done                    ./zombie_maker
juan@test:~$

After 15 seconds, we can see that both child processes not only became zombies but also show no memory footprint whatsoever on the system (VSZ is 0 and RSS is 0 as well). After 30 seconds none of the 3 processes exist anymore, but who reaped the zombie children from the system???

Yeahp, init did! Attaching to init with strace we can see the waitid calls returning with the details from the Zombie processes:

root@test:/home/juan# strace -p 1
Process 1 attached
select(54, [3 5 6 7 10 11 15 16 17 19 20 24 26 30 34 53], [], [7 10 11 15 16 17], NULL) = ? ERESTARTNOHAND (To be restarted if no handler)
...
waitid(P_ALL, 0, {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=2051, si_status=3, si_utime=0, si_stime=0}, WNOHANG|WEXITED|WSTOPPED|WCONTINUED, NULL) = 0
waitid(P_ALL, 0, {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=2052, si_status=3, si_utime=0, si_stime=0}, WNOHANG|WEXITED|WSTOPPED|WCONTINUED, NULL) = 0
select(54, [3 5 6 7 10 11 15 16 17 19 20 24 26 30 34 53], [], [7 10 11 15 16 17], NULL^CProcess 1 detached
 
root@test:/home/juan#

Running zombie_maker with strace will show you that the parent process does indeed get interrupted by the SIGCHLD signals, but since no handler is defined the default action (ignore) applies. PLEASE don't confuse that default ignore with what happens when the handler is explicitly set to SIG_IGN; this is literally what the documentation says:

       POSIX.1-2001 specifies that if the disposition of SIGCHLD is set to
       SIG_IGN or the SA_NOCLDWAIT flag is set for SIGCHLD (see
       sigaction(2)), then children that terminate do not become zombies and
       a call to wait() or waitpid() will block until all children have
       terminated, and then fail with errno set to ECHILD.  (The original
       POSIX standard left the behavior of setting SIGCHLD to SIG_IGN
       unspecified.  Note that even though the default disposition of
       SIGCHLD is "ignore", explicitly setting the disposition to SIG_IGN
       results in different treatment of zombie process children.)

Summary

No matter how brave or root you are, you can't kill a zombie process :D; killing its parent is usually the way to get rid of them. When the parent dies, init will adopt the little zombie bastards and reap them off the process table, as we saw in the example. Also, as mentioned, a zombie process doesn't consume many system resources (actually almost none), so unless you have hundreds of them you shouldn't worry much.

The kernel code is going crazy, it would be nice if we had more comments in place :P.

Friday, August 19, 2016

Strace 101 - stracing your stuff!

It's been a while since the last blog post, like 4 months went by! I remember once I thought I could write one post per week xD, that did not work at all hahaha. Anyway... over the last few days I had the chance to spend some time working with strace and perf, so I decided to write something about it here as well.

Strace? What the heck is it?

Strace is a nice debugging/troubleshooting tool for Linux that helps you identify the syscalls a particular process is using and the signals a process receives (more details here). Syscalls are basically the interface to kernel services. Therefore, knowing which syscalls a program is issuing and what the results of these calls are is really interesting when debugging some software problems.

But how is it possible that a User space process is able to see the syscalls another User space process issues? Well, there's a kernel feature called ptrace that makes that possible (yeah... ptrace is a system call xD). By definition ptrace is:

The ptrace() system call provides a means by which one process (the "tracer") may observe and control the execution of another process (the "tracee"), and examine and change the tracee's memory and registers. It is primarily used to implement breakpoint debugging and system call tracing.

There are basically two ways to strace a process and see its syscalls: you can either launch the process under strace or attach strace to a running process (under certain conditions :D).

How does it work? Show me the money!

I strongly believe the best way to explain something is by showing working examples, so here we go... Let's see how many syscalls (and which ones) the classic "Hello World" issues. The C code is:

#include <stdio.h>
int main()
{
     printf("Hello World\n");
     return 0;
}

Compile and run:

juan@juan-VirtualBox:~$ gcc -o classic classic.c 
juan@juan-VirtualBox:~$ ./classic 
Hello World
juan@juan-VirtualBox:~$

Now we run it using strace instead:

juan@juan-VirtualBox:~$ strace ./classic
execve("./classic", ["./classic"], [/* 60 vars */]) = 0
brk(0)  = 0x7f0000
access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f0362d17000
access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=85679, ...}) = 0
mmap(NULL, 85679, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f0362d02000
close(3) = 0
access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
open("/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0P \2\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=1840928, ...}) = 0
mmap(NULL, 3949248, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f0362732000
mprotect(0x7f03628ec000, 2097152, PROT_NONE) = 0
mmap(0x7f0362aec000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1ba000) = 0x7f0362aec000
mmap(0x7f0362af2000, 17088, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f0362af2000
close(3) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f0362d01000
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f0362cff000
arch_prctl(ARCH_SET_FS, 0x7f0362cff740) = 0
mprotect(0x7f0362aec000, 16384, PROT_READ) = 0
mprotect(0x600000, 4096, PROT_READ) = 0
mprotect(0x7f0362d19000, 4096, PROT_READ) = 0
munmap(0x7f0362d02000, 85679) = 0
fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 2), ...}) = 0 
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f0362d16000
write(1, "Hello World\n", 12Hello World
) = 12
exit_group(0) = ?
+++ exited with 0 +++
juan@juan-VirtualBox:~$

The output provides one system call per line, and each line includes the syscall name, the parameters used to invoke it and the result returned (after the =). As we can see, even the simplest piece of code you can imagine makes use of many syscalls. I will only comment on a few of them (probably in a future post we can dive deeper :D):

execve (the first one): just before this call strace forked itself and the child process called the ptrace syscall, allowing the strace parent to trace it. After all that, the child runs execve, replacing its running code with our classic C program.
brk changes the location of the program break, which defines the end of the process's data segment. Increasing the break allocates memory to the process; decreasing it deallocates memory. Here it is called with 0, an address the break can't be moved to, so the Linux syscall simply returns the current program break (0x7f0000).
we see a couple of mmap syscalls mapping memory regions as anonymous and also some mapping the library libc.so.6. This is the dynamic linker doing its job and adding all the necessary libraries to the process memory space.
there are also a few open syscalls, opening files like /etc/ld.so.cache, where the linker can find a list of the available system libraries.
just before finishing we see a write syscall, sending our classic "Hello World" to file descriptor 1, also known as STDOUT (standard output). Since strace and classic both write to the same terminal, we can see how their outputs collided in lines 29 and 30.
the last call was exit_group; it's the equivalent of the exit syscall, but it terminates not only the calling thread but all the threads in the thread group (this particular example was single threaded).

This should provide a fair idea of what a particular piece of software does and which kernel services it's accessing. However, sometimes we don't want to go into so much detail and would rather see a summary. We can easily get that with the -c flag:

juan@juan-VirtualBox:~$ strace -c ./classic 
Hello World
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
  0.00    0.000000           0         1           read
  0.00    0.000000           0         1           write
  0.00    0.000000           0         2           open
  0.00    0.000000           0         2           close
  0.00    0.000000           0         3           fstat
  0.00    0.000000           0         8           mmap
  0.00    0.000000           0         4           mprotect
  0.00    0.000000           0         1           munmap
  0.00    0.000000           0         1           brk
  0.00    0.000000           0         3         3 access
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         1           arch_prctl
------ ----------- ----------- --------- --------- ----------------
100.00    0.000000                    28         3 total
juan@juan-VirtualBox:~$

The output shows the list of syscalls issued by the process (and its threads, if there were more than one), the number of times each one was called, the number of times it failed, and the CPU time consumed (kernel space time, aka system time). From this output we can easily tell that 3 calls (all of them access) failed; if we go back to the full strace output we'll see that lines 4, 6 and 11 returned ENOENT "No such file or directory".

I mentioned before that it is also possible to strace an already running process, so let's take a look at that. First we have to pick a target process, in this case nc with PID 8739:

juan@juan-VirtualBox:~$ ps aux|grep -i nc
root       954  0.0  0.0  19196  2144 ?        Ss   01:33   0:04 /usr/sbin/irqbalance
juan      2055  0.0  0.2 355028  8396 ?        Ssl  01:33   0:00 /usr/lib/at-spi2-core/at-spi-bus-launcher --launch-immediately
juan      8739  0.0  0.0   9132   800 pts/0    S    11:52   0:00 nc -l 9999
juan      2585  0.0  0.0  24440  1964 ?        S    01:33   0:00 dbus-launch --autolaunch 0c0058daf07f369dd9b0d1605654eff1 --binary-syntax --close-stderr
juan      9477  0.0  0.0  15948  2304 pts/2    R+   14:03   0:00 grep --color=auto -i nc
juan@juan-VirtualBox:~$

Now let's try to attach strace to it in order to inspect the syscalls:

juan@juan-VirtualBox:~$ strace -p 8739
strace: attach: ptrace(PTRACE_ATTACH, ...): Operation not permitted
Could not attach to process. If your uid matches the uid of the target
process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
again as the root user. For more details, see /etc/sysctl.d/10-ptrace.conf
juan@juan-VirtualBox:~$

Interesting :D. It turns out that by default Ubuntu doesn't let an unprivileged user PTRACE_ATTACH to a process that isn't its own child. Why? We'll see an example later :P. Ptrace has different scopes (4 actually, 0 to 3, 3 being the most restrictive):

A PTRACE scope of "0" is the more permissive mode. A scope of "1" limits PTRACE only to direct child processes (e.g. "gdb name-of-program" and "strace -f name-of-program" work, but gdb's "attach" and "strace -fp $PID" do not). The PTRACE scope is ignored when a user has CAP_SYS_PTRACE, so "sudo strace -fp $PID" will work as before.

If we take a look at the current ptrace_scope value we'll see we have scope 1:

juan@juan-VirtualBox:~$ cat /proc/sys/kernel/yama/ptrace_scope
1
juan@juan-VirtualBox:~$

At this point we have two options: we either try with sudo or we enable scope 0 system-wide by changing /proc/sys/kernel/yama/ptrace_scope (this might be dangerous). Let's go with sudo for now:

juan@juan-VirtualBox:~$ sudo strace -p 8739
Process 8739 attached
accept(3, {sa_family=AF_INET, sin_port=htons(34404), sin_addr=inet_addr("127.0.0.1")}, [16]) = 4
close(3) = 0
poll([{fd=4, events=POLLIN}, {fd=0, events=POLLIN}], 2, 4294967295) = 1 ([{fd=4, revents=POLLIN}])
read(4, "Hello World through NC\n", 2048) = 23
write(1, "Hello World through NC\n", 23) = 23
poll([{fd=4, events=POLLIN}, {fd=0, events=POLLIN}], 2, 4294967295) = 1 ([{fd=4, revents=POLLIN}])
read(4, "", 2048) = 0
shutdown(4, SHUT_RD) = 0
close(4) = 0
close(3) = -1 EBADF (Bad file descriptor)
close(3) = -1 EBADF (Bad file descriptor)
exit_group(0) = ?
+++ exited with 0 +++
juan@juan-VirtualBox:~$

It worked indeed! Let's do a brief review of the syscalls:

First an accept call. It extracts the first connection request from the queue of pending connections for the listening socket, 3, creates a new connected socket, and returns a new file descriptor referring to that socket, 4. The newly created socket is not in the listening state. The original socket 3 is unaffected by this call.
Then the close call closes the listening socket 3.
The next call was poll; it waits for one of a set of file descriptors to become ready to perform I/O. In this case we can see it waits for fd 4 (the socket just created for the incoming connection) and fd 0 (the standard input).
Right after the poll we see the read call, reading 23 bytes out of fd 4 using a 2048-byte buffer.
After finishing reading, nc uses write to send the received bytes to fd 1, the standard output.
Then nc polls again for any extra data coming in, and this time there is nothing left to read, as can be seen from the next read call returning 0. This poll is probably woken up by the connection being closed on the other side.
The shutdown call shuts down all or part of a full-duplex connection on a given socket, in this case the one referred to by fd 4.
Then we have 3 close calls: the first one closes the file descriptor of the socket created by the accept call, while the next two try to close an fd that had already been closed on line 4 (fd 3), which is kind of weird and could be a bug.

Why could ptrace be dangerous?

Debugging tools are usually double-edged swords, right? Well, ptrace is no exception to that rule. Having access to the interface between user space and kernel space of a process can leak some important information, like credentials.

Let's see an extremely simple example. My virtual machine has vsftpd 3.0.2 running, so let's capture the credentials of a system user that logs into the FTP service. In this case we will set a few extra flags on strace in order to make things easier:

-f traces child processes as they are created by currently traced processes as a result of the fork system call.
-e trace=read,write is a filter that tells strace to record only read and write syscalls.
-o sets an output file where the syscalls will be recorded.

So let's strace:

juan@juan-VirtualBox:~$ sudo strace -f -e trace=read,write -o output -p $(pidof vsftpd)
Process 10040 attached
Process 10280 attached
Process 10281 attached
Process 10282 attached
^CProcess 10040 detached
juan@juan-VirtualBox:~$

We can see vsftpd forked a couple of times while we connected to it using the ftp client. Now let's take a look at the content of the output file:

juan@juan-VirtualBox:~$ cat output 
10280 read(3, "TZif2\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\n\0\0\0\n\0\0\0\0"..., 4096) = 3533
10280 read(3, "TZif2\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\v\0\0\0\v\0\0\0\0"..., 4096) = 2248
10280 read(4,  
10281 read(4, "# /etc/nsswitch.conf\n#\n# Example"..., 4096) = 507
10281 read(4, "", 4096)                 = 0
10281 read(4, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\260\23\0\0\0\0\0\0"..., 832) = 832
10281 read(4, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\240!\0\0\0\0\0\0"..., 832) = 832
10281 read(4, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\240\"\0\0\0\0\0\0"..., 832) = 832
10281 write(3, "Sat Aug 20 17:06:02 2016 [pid 10"..., 65) = 65
10281 write(0, "220 (vsFTPd 3.0.2)\r\n", 20) = 20
10281 read(0, "USER juan\r\n", 11)      = 11
10281 write(0, "331 Please specify the password."..., 34) = 34
10281 read(0, "PASS MyPassw0rd\r\n", 15)  = 15
10281 write(5, "\1", 1)                 = 1
10280 <... read resumed> "\1", 1)       = 1
10281 write(5, "\4\0\0\0", 4 
10280 read(4,  
10281 <... write resumed> )             = 4
10280 <... read resumed> "\4\0\0\0", 4) = 4
10281 write(5, "juan", 4 
10280 read(4,  
...
10282 write(0, "230 Login successful.\r\n", 23) = 23
10282 read(0, "SYST\r\n", 6)            = 6
10282 write(0, "215 UNIX Type: L8\r\n", 19) = 19
10282 read(0, "QUIT\r\n", 6)            = 6
10282 write(0, "221 Goodbye.\r\n", 14)  = 14
10282 +++ exited with 0 +++
10280 <... read resumed> 0x7ffe4994960f, 1) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
10280 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=10282, si_status=0, si_utime=0, si_stime=0} ---
10280 +++ killed by SIGSYS +++
10040 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_KILLED, si_pid=10280, si_status=SIGSYS, si_utime=0, si_stime=1} ---
juan@juan-VirtualBox:~$

As we anticipated, we can see both the username and the password on lines 12 and 14. Ok, yes, this is FTP and we could have captured the credentials with a simple network capture too, but this is just an example :D of how strace (ptrace actually) can be used to leak sensitive information.

I hope this was interesting, at least it was for me xD. Here I list some interesting links I found on the way:

http://linux.die.net/man/1/strace
http://linux.die.net/man/2/ptrace
http://man7.org/linux/man-pages/man2/syscalls.2.html
https://www.kernel.org/doc/Documentation/security/Yama.txt

Friday, November 20, 2015

Zombie Processes

No, this is not about processes that move erratically and turn other processes into zombies by biting them (although that would be a lot of fun).

When we talk about zombie processes we mean processes that have completed their execution (terminated state) but are still registered in the kernel's process table. Why does this happen? Basically, to guarantee that the parent process can obtain the final status of its children and learn how they ended. Once the parent reads the child's exit status (through the wait syscall), the child is removed from the process table and can finally rest in peace and go to process heaven.

Some points worth highlighting:

A zombie process is a process whose execution has finished and whose state is TERMINATED. It may have ended the nice way or the hard way (with kill, for example).
The memory used by the process has been released.
We can see zombie processes with ps aux; we recognize them by the Z in the STAT column.
Every process that finishes its execution becomes a zombie, although we rarely notice because the parent is usually already waiting for it with the wait call.

Creating zombies

As a proof of concept, let's look at the following code, which will be our zombie generator:

#include <sys/wait.h>
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        pid_t pid;
        int pid_status;
        pid = fork();
        if (pid == 0) {
                sleep(10);
                exit(9);
        }
        printf("PID del hijo %d\n",pid);
        sleep(40);
        waitpid(pid, &pid_status, 0);
        printf("Estado de salida %d\n",WEXITSTATUS(pid_status));
        return 0;
}

The code is simple: after the call to fork there are two identical processes running in the system. One of them, the child, executes the code inside the if (it sleeps for 10 seconds and exits with status 9), while the parent goes straight to the first printf, then sleeps for 40 seconds and calls waitpid to obtain the child's termination status and print it.

Let's see what happens during execution:

[ec2-user@ip-172-31-16-177 ~]$ gcc -o zombies zombies.c
[ec2-user@ip-172-31-16-177 ~]$ ./zombies
PID del hijo 23037
Estado de salida 9
[ec2-user@ip-172-31-16-177 ~]$

The parent printed the child's PID (23037), slept for 40 seconds, called waitpid to collect the child's exit value (9) and printed it on screen. Running ps while the program is still executing shows in more detail what is going on:

The relationship between the processes (23036 is the parent and 23037 is the child):

[ec2-user@ip-172-31-16-177 ~]$ ps -eo pid,ppid,cmd|grep zombie
23036 2407 ./zombies
23037 23036 ./zombies
23293 22781 grep --color=auto zombie
[ec2-user@ip-172-31-16-177 ~]$

Then, still within the first 10 seconds, we see the following:

[ec2-user@ip-172-31-16-177 ~]$ ps aux|grep zombies
ec2-user 23036 0.0 0.0   4176   620 pts/0    S+   22:22   0:00 ./zombies
ec2-user 23037 0.0 0.0   4172    80 pts/0    S+   22:22   0:00 ./zombies
ec2-user 23039 0.0 0.2 110460 2196 pts/1    S+   22:22   0:00 grep --color=auto zombies
[ec2-user@ip-172-31-16-177 ~]$

Both processes are in the Sleeping state (S+), and we can also see that both have practically the same VSZ (virtual memory size), which makes sense given that they are identical right after the fork call. Once the 10 seconds have passed we find the following situation:

[ec2-user@ip-172-31-16-177 ~]$ ps aux|grep zombies
ec2-user 23036 0.0 0.0   4176   620 pts/0    S+   22:22   0:00 ./zombies
ec2-user 23037 0.0 0.0      0     0 pts/0    Z+   22:22   0:00 [zombies] <defunct>
ec2-user 23041 0.0 0.2 110460 2124 pts/1    S+   22:22   0:00 grep --color=auto zombies
[ec2-user@ip-172-31-16-177 ~]$

The child process is now in the Zombie state (Z+) and its VSZ is 0 (since all the memory it used was released).

Once the 40 seconds are up, both the parent and the child disappear from the system.

[ec2-user@ip-172-31-16-177 ~]$ ps aux|grep zombies
[ec2-user@ip-172-31-16-177 ~]$

How do you kill a zombie process?

As you can imagine, cutting the process's head off doesn't look like a viable option in this context. The equivalent in the process world is the SIGKILL signal (9), but let's see what happens when we use it to kill a zombie process:

Running the zombie binary:

[ec2-user@ip-172-31-22-1 ~]$ ./zombie
PID del hijo 2765
Estado de salida 9
[ec2-user@ip-172-31-22-1 ~]$

Failed attempts to end the existence of process 2765:

[ec2-user@ip-172-31-22-1 ~]$ ps aux|grep zombie
ec2-user 2764 0.0 0.0   4176   692 pts/1    S+   21:06   0:00 ./zombie
ec2-user 2765 0.0 0.0   4172    80 pts/1    S+   21:06   0:00 ./zombie
ec2-user 2769 0.0 0.0 110460 2196 pts/0    S+   21:06   0:00 grep --color=auto zombie
[ec2-user@ip-172-31-22-1 ~]$ ps aux|grep zombie
ec2-user 2764 0.0 0.0   4176   692 pts/1    S+   21:06   0:00 ./zombie
ec2-user 2765 0.0 0.0      0     0 pts/1    Z+   21:06   0:00 [zombie]
ec2-user 2771 0.0 0.0 110460 2152 pts/0    S+   21:06   0:00 grep --color=auto zombie
[ec2-user@ip-172-31-22-1 ~]$
[ec2-user@ip-172-31-22-1 ~]$ kill -9 2765
[ec2-user@ip-172-31-22-1 ~]$ ps aux|grep zombie
ec2-user 2764 0.0 0.0   4176   692 pts/1    S+   21:06   0:00 ./zombie
ec2-user 2765 0.0 0.0      0     0 pts/1    Z+   21:06   0:00 [zombie]
ec2-user 2773 0.0 0.0 110460 2140 pts/0    S+   21:06   0:00 grep --color=auto zombie
[ec2-user@ip-172-31-22-1 ~]$ kill -9 2765
[ec2-user@ip-172-31-22-1 ~]$ ps aux|grep zombie
ec2-user 2764 0.0 0.0   4176   692 pts/1    S+   21:06   0:00 ./zombie
ec2-user 2765 0.0 0.0      0     0 pts/1    Z+   21:06   0:00 [zombie]
ec2-user 2775 0.0 0.0 110460 2192 pts/0    S+   21:07   0:00 grep --color=auto zombie
[ec2-user@ip-172-31-22-1 ~]$

Clearly kill -9 is unable to terminate the process, even though the signal is being sent to it and we get no error message or anything similar.

How can we get rid of them?

Unfortunately, the only way to get rid of them from the outside is to kill the parent process. Once the parent is dead, the zombies become orphans and are adopted by init, which then waits for them so they can finally rest in peace.

Here is a small example:

[ec2-user@ip-172-31-16-177 ~]$ ps -eo pid,ppid,cmd|grep zombie
23346 2407 ./zombies1
23347 23346 ./zombies1
23348 23346 ./zombies1
23349 23346 ./zombies1
23350 23346 ./zombies1
23351 23346 ./zombies1
23353 22781 grep --color=auto zombie
[ec2-user@ip-172-31-16-177 ~]$ ps aux|grep zombies
ec2-user 23346 0.0 0.0   4172   600 pts/0    S+   23:40   0:00 ./zombies1
ec2-user 23347 0.0 0.0      0     0 pts/0    Z+   23:40   0:00 [zombies1]
ec2-user 23348 0.0 0.0      0     0 pts/0    Z+   23:40   0:00 [zombies1]
ec2-user 23349 0.0 0.0      0     0 pts/0    Z+   23:40   0:00 [zombies1]
ec2-user 23350 0.0 0.0      0     0 pts/0    Z+   23:40   0:00 [zombies1]
ec2-user 23351 0.0 0.0      0     0 pts/0    Z+   23:40   0:00 [zombies1]
ec2-user 23355 0.0 0.2 110460 2124 pts/1    S+   23:41   0:00 grep --color=auto zombies
[ec2-user@ip-172-31-16-177 ~]$ kill -9 23346
[ec2-user@ip-172-31-16-177 ~]$ ps aux|grep zombies
ec2-user 23357 0.0 0.2 110460 2156 pts/1    S+   23:41   0:00 grep --color=auto zombies
[ec2-user@ip-172-31-16-177 ~]$

We can see that we have 5 processes in the zombie state whose parent is process 23346. Once we send the signal (with kill) to kill the parent, all the children disappear.

The code used was the following:

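#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
        pid_t pid;
        int i;

        /* Create 5 children; each one exits 3 seconds later. */
        for (i = 0; i < 5; i++)
        {
                pid = fork();
                if (pid == 0) {
                        sleep(3);
                        exit(9);
                }
        }

        /* The parent never calls wait(), so the children remain zombies
           until it exits 600 seconds later (or is killed). */
        sleep(600);
        return 0;
}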
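For contrast, here is roughly what a parent that does clean up after its children could look like. This is just a sketch, not the code used in the post: spawn_children and reap_all_children are names invented for the example, and it assumes Linux with a plain gcc build.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: forks `n` children that exit immediately with
   `code`. Returns how many children were actually created. */
int spawn_children(int n, int code)
{
    int created = 0;
    for (int i = 0; i < n; i++) {
        pid_t pid = fork();
        if (pid == -1)
            break;              /* could not fork any more */
        if (pid == 0)
            _exit(code);        /* child */
        created++;
    }
    return created;
}

/* Hypothetical helper: waits for every child of the caller, blocking
   until each one has terminated. Returns the number of children reaped. */
int reap_all_children(void)
{
    int reaped = 0;
    for (;;) {
        pid_t pid = waitpid(-1, NULL, 0);
        if (pid > 0) {
            reaped++;           /* one zombie removed from the table */
            continue;
        }
        if (pid == -1 && errno == EINTR)
            continue;           /* interrupted by a signal: retry */
        break;                  /* ECHILD: no children left */
    }
    return reaped;
}
```

Calling reap_all_children() in the parent instead of sleep(600) would leave no Z entries behind in ps, since every child gets waited for as soon as it terminates.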

Luckily, since zombie processes technically no longer consume resources (apart from the kernel structures that represent them), we shouldn't worry too much. However, we could end up in a situation where many of them pile up...

Zombie fork bomb!

OK, processes in zombie state consume no memory in theory, but they are still represented inside the kernel somehow (otherwise ps, for example, could not see them), so they must consume some memory. What happens if we create as many zombies as ulimit allows us to create processes?

[ec2-user@ip-172-31-22-1 ~]$ ulimit -u
31877
[ec2-user@ip-172-31-22-1 ~]$

According to ulimit this user can create up to 31877 processes, of which these already exist:

[ec2-user@ip-172-31-22-1 ~]$ ps aux | grep ^ec2-user | wc -l
6
[ec2-user@ip-172-31-22-1 ~]$

So with the parent process that makes 7. Let's see what happens when we use up all the available processes by trying to create 31870 zombies (I'm sure The Walking Dead will be calling me after this), using the following code:

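#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
        pid_t pid;
        int bomb;

        /* 31870 = 31877 (ulimit -u) minus the 7 processes already counted. */
        for (bomb = 0; bomb < 31870; bomb++)
        {
                pid = fork();
                if (pid == -1) {
                        /* Limit reached earlier than expected: stop forking. */
                        perror("fork");
                        break;
                }
                if (pid == 0) {
                        sleep(30);
                        exit(9);
                }
        }

        /* Keep the parent alive, without waiting, until Enter is pressed. */
        getchar();
        return 0;
}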
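The 31870 above is hard-coded from the ulimit -u output. As a rough sketch, the same number could be derived at run time: getrlimit with RLIMIT_NPROC returns the limit that ulimit -u prints (the soft one), and the rest is the subtraction done by hand earlier. The function names nproc_soft_limit and zombie_budget are invented for this example, and the count of already running processes is assumed to come from somewhere else (ps, in this post).

```c
#include <sys/resource.h>

/* Hypothetical helper: soft RLIMIT_NPROC of the calling user,
   or -1 if it is unlimited or cannot be read. */
long nproc_soft_limit(void)
{
    struct rlimit lim;
    if (getrlimit(RLIMIT_NPROC, &lim) == -1 || lim.rlim_cur == RLIM_INFINITY)
        return -1;
    return (long)lim.rlim_cur;
}

/* Hypothetical helper: how many zombies fit under `limit` when `running`
   processes already exist and one more slot goes to the parent. */
long zombie_budget(long limit, long running)
{
    long budget = limit - running - 1;
    return budget > 0 ? budget : 0;
}
```

With the numbers from this machine, zombie_budget(31877, 6) gives the 31870 used in the loop.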

We run it (I opened a console as root, just in case):

[ec2-user@ip-172-31-22-1 ~]$ gcc -o zombie_fork_bomb zombie_fork_bomb.c
[ec2-user@ip-172-31-22-1 ~]$ ./zombie_fork_bomb

And from the root console we see the following:

[root@ip-172-31-22-1 ec2-user]# free -m
             total       used       free     shared    buffers     cached
Mem:          7987        417       7569          0         13        320
-/+ buffers/cache:         83       7903
Swap:            0          0          0
[root@ip-172-31-22-1 ec2-user]# ps aux|grep ^ec2-user|wc -l
31877
[root@ip-172-31-22-1 ec2-user]# ps aux|grep ^ec2-user|grep " Z+ "|wc -l
31870
[root@ip-172-31-22-1 ec2-user]# free -m
             total       used       free     shared    buffers     cached
Mem:          7987       1445       6541          0         13        320
-/+ buffers/cache:       1111       6875
Swap:            0          0          0
[root@ip-172-31-22-1 ec2-user]#

All 31870 zombies were created, and with that we reached the limit of 31877 processes available to the ec2-user user. We can confirm this from the user's second console:

[ec2-user@ip-172-31-22-1 ~]$ ls
-bash: fork: retry: No child processes
-bash: fork: retry: No child processes
^C-bash: fork: retry: No child processes
-bash: fork: retry: No child processes
-bash: fork: Resource temporarily unavailable

[ec2-user@ip-172-31-22-1 ~]$

Clearly bash cannot fork to run ls. We can also see that memory usage grew considerably, from 417 MB to 1445 MB, more than tripling, but then again we are talking about 31871 processes: the parent plus its 31870 zombies.

From this we can infer that each zombie process costs roughly 33 KB of memory ((1445-417)/31871*1024 KB, counting the parent along with its zombies) and, perhaps more importantly, that it counts as one more running process when the user limits are enforced.
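The estimate is plain arithmetic on the two free -m readings. Spelled out as a tiny sketch (kib_per_process is a name made up for the example, and the result is only as accurate as the MB-rounded figures that free prints):

```c
/* Hypothetical helper: average memory cost per process in KB, given the
   "used" column of free -m (in MB) before and after creating `processes`
   processes. Returns 0 when `processes` is not positive. */
double kib_per_process(long used_before_mib, long used_after_mib, long processes)
{
    if (processes <= 0)
        return 0.0;
    return (double)(used_after_mib - used_before_mib) * 1024.0 / (double)processes;
}
```

kib_per_process(417, 1445, 31871) comes out at a little over 33, matching the figure above.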

Conclusion:

Zombie processes can only be removed when their parent or init calls wait for that process ID. Even though zombies are processes whose execution has finished, they still take up space in the kernel and could cause bigger problems if there were a large number of them.

As a brief final note, don't confuse processes in state Z (zombie) with processes in state D (uninterruptible sleep). The latter are usually waiting on some I/O operation and therefore cannot be interrupted. Getting rid of one of these is QUITE a bit more complicated and might even require rebooting the system. We'll look at that in another post.