
Wednesday, January 23, 2019

When ps let me down

Some tools have been around for ages helping systems administrators through rough times, making our lives and jobs somewhat easier (ok, not all the time xD). One of these marvels is, without a doubt, the almighty ps. What if I told you that ps stole a few hours of my life with no remorse, would you believe me? Well... you should xD.

Situation:


Let's assume you have a Linux system and notice a high system load average, along with unusually high CPU utilization. Having a look at the metrics, you realize both have been increasing for a few weeks now.

A quick glance at top, shows something like this:
top - 20:58:27 up  1:18,  3 users,  load average: 50.27, 29.91, 14.09
Tasks: 20215 total,  13 running, 202 sleeping,   0 stopped, 20000 zombie
%Cpu(s): 29.7 us, 70.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.3 si,  0.0 st
KiB Mem:   3007548 total,  2584796 used,   422752 free,   214556 buffers
KiB Swap:  1048568 total,        0 used,  1048568 free.   983056 cached Mem

 PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
25244 juan      20   0   14228   2180   2060 R  14.1  0.1   0:00.77 ps
25246 juan      20   0   14228   2188   2064 R  13.5  0.1   0:00.70 ps
25248 juan      20   0   14228   2036   1916 R  13.5  0.1   0:00.61 ps
25250 juan      20   0   14228   2100   1976 R  13.5  0.1   0:00.57 ps
25252 juan      20   0   14228   2092   1968 R  13.5  0.1   0:00.49 ps
25254 juan      20   0   14228   2180   2056 R  13.2  0.1   0:00.45 ps
25256 juan      20   0   14228   2092   1968 R  11.5  0.1   0:00.39 ps
23469 root      20   0   47308  20812   2572 R  10.9  0.7   1:18.58 top
25258 juan      20   0   14228   2228   2104 R   9.4  0.1   0:00.32 ps
25260 juan      20   0   14228   2212   2088 R   7.4  0.1   0:00.25 ps
25262 juan      20   0   14228   2040   1916 R   5.0  0.1   0:00.17 ps
25264 juan      20   0   14228   2188   2064 R   2.9  0.1   0:00.10 ps
 2290 juan      20   0 2335644 128804  78096 S   1.2  4.3   0:06.22 skypeforli+
    9 root      20   0       0      0      0 S   0.9  0.0   0:05.64 rcuos/0
   18 root      20   0       0      0      0 S   0.9  0.0   0:04.86 rcuos/1
    7 root      20   0       0      0      0 S   0.6  0.0   0:03.59 rcu_sched
 1071 root      20   0  470184 113424  30860 S   0.6  3.8   0:35.99 Xorg

So, what's wrong with this system?


Well, a bunch of things as you can see:

  • To begin with, it kind of looks like an episode of "The Walking Dead": there are 20000 zombie processes xD. Not a big deal, right? Zombie processes don't take much memory and don't increase the load. However, they could easily consume all the process slots in the kernel, and that would be a big deal.
  • Then the system load is literally insane for 2 vCPUs (13 processes in the running queue).
  • On top of all that, we have several freaking ps processes taking up to 14% of the CPU time each. How come ps can be taking that much CPU time?

Where do the zombies come from?


Unlike human babies, which we all know are delivered by storks, zombie processes are the result of parent processes not cleaning up after their terminated children (more details here). And in this case, we clearly have a good bunch of them. However, in this particular scenario I literally faked the zombie processes, so we won't go into details as to why they were created.

Why is the load so freaking high?


The reason for the high load is pretty clear: we have tons of processes in the running queue, plus potentially some others in D state, blocked on some resource. All of these add up to the system load, and what's concerning is the fact that the load was increasing, so at some point this system would have gone unresponsive.

Why in the world is ps taking so much CPU time?


Having a look at the ps processes, it looks as if they were trying to pull the state of a given process:
root@test:/home/juan# ps ha -o state,cmd|grep ps
R ps h --pid 3234 -o state
R ps h --pid 3234 -o state
R ps h --pid 3234 -o state
R ps h --pid 3234 -o state
R ps h --pid 3234 -o state
R ps h --pid 3234 -o state
R ps h --pid 3234 -o state
R ps h --pid 3234 -o state
R ps h --pid 3234 -o state
R ps h --pid 3234 -o state
R ps ha -o state,cmd
S grep --color=auto ps
R ps h --pid 3234 -o state
R ps h --pid 3234 -o state
root@test:/home/juan#

A pretty harmless move, you would say, so let's time that operation:
root@test:/home/juan# time ps h --pid 3234 -o state
S

real    0m1.003s
user    0m0.188s
sys     0m0.552s
root@test:/home/juan#

Wait... what? It took 1 second to run ps? What age are we living in?

So ps spent a bit more than half of the time running in kernel space, about ~200 ms in user space, and then somewhere around ~300 ms waiting: maybe in the scheduler queue, maybe on IO, maybe all of the above, who knows.

Half a second spent in the kernel seems to be a lot for such a simple task "get the process state of PID 3234", right? So what's going on?

Strace to the rescue:
root@test:/home/juan# strace -c ps h --pid 3234 -o state
S
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 71.16    0.014693         735        20           getdents
 26.54    0.005481           0     40418           read
  1.54    0.000317           0     40423         4 open
  0.47    0.000097           0     40420           close
  0.30    0.000061           0     20210           stat
  0.00    0.000000           0         1           write
  0.00    0.000000           0         6           fstat
  0.00    0.000000           0         2           lseek
  0.00    0.000000           0        16           mmap
  0.00    0.000000           0         7           mprotect
  0.00    0.000000           0         3           munmap
  0.00    0.000000           0         3           brk
  0.00    0.000000           0        23           rt_sigaction
  0.00    0.000000           0         2           ioctl
  0.00    0.000000           0         4         4 access
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         1           uname
  0.00    0.000000           0         1           geteuid
  0.00    0.000000           0         1           arch_prctl
  0.00    0.000000           0         1           openat
------ ----------- ----------- --------- --------- ----------------
100.00    0.020649                142959         8 total
root@test:/home/juan#
Turns out it took ps "only" 142959 system calls to get the process state of PID 3234, and that is indeed a pretty insane number of syscalls.

  • ~70% of the time in kernel space was consumed by 20 getdents syscalls; these were the calls pulling the directory information from /proc. Keep in mind that /proc is a pretty crowded place when there are ~20k processes in the system. So despite these calls being few, they actually pulled lots of data (about 600 KB, in 32 KB fragments).
root@test:/home/juan# strace -T -e getdents ps h --pid 3234 -o state
getdents(5, /* 1350 entries */, 32768)  = 32768
getdents(5, /* 1365 entries */, 32768)  = 32760
getdents(5, /* 1365 entries */, 32768)  = 32760
getdents(5, /* 1365 entries */, 32768)  = 32760
getdents(5, /* 1365 entries */, 32768)  = 32760
getdents(5, /* 1024 entries */, 32768)  = 32752
getdents(5, /* 1024 entries */, 32768)  = 32768
getdents(5, /* 1024 entries */, 32768)  = 32768
getdents(5, /* 1024 entries */, 32768)  = 32768
getdents(5, /* 1024 entries */, 32768)  = 32768
getdents(5, /* 1024 entries */, 32768)  = 32768
getdents(5, /* 1024 entries */, 32768)  = 32768
getdents(5, /* 1024 entries */, 32768)  = 32768
getdents(5, /* 1024 entries */, 32768)  = 32768
getdents(5, /* 1024 entries */, 32768)  = 32768
getdents(5, /* 1024 entries */, 32768)  = 32768
getdents(5, /* 1024 entries */, 32768)  = 32768
getdents(5, /* 1024 entries */, 32768)  = 32768
getdents(5, /* 309 entries */, 32768)   = 9888
getdents(5, /* 0 entries */, 32768)     = 0
+++ exited with 1 +++
root@test:/home/juan#
  • ~26% of the time was spent on read syscalls (40418 calls); these are effectively the ones reading data from the files within /proc.
  • The interesting part here is that the number of read, open and close syscalls happens to be almost exactly twice the number of stat calls (20210), and of course that the number of stat calls is almost exactly the number of processes in the system.

Wait... are you trying to tell me that ps, despite the fact that you asked for a single process state, is going through all the processes in the system anyways? 

Yep, that's exactly what it is doing. You don't trust me? (can't blame you, I've been wrong before xD).

That single write syscall in the list above is the one used by ps to show the command output (the process state), right? Ok, so if we dump the strace output to a file we can get a better picture:
root@test:/home/juan# grep ^write -A5 -B5  strace.txt
read(6, "3234 (apache) S 3222 3234 3204 3"..., 1024) = 306
close(6)                                = 0
open("/proc/3234/status", O_RDONLY)     = 6
read(6, "Name:\tapache\nState:\tS (sleeping)"..., 1024) = 780
close(6)                                = 0
fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 4), ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7ff10e10f000
write(1, "S\n", 2)                      = 2
stat("/proc/3235", {st_mode=S_IFDIR|0555, st_size=0, ...}) = 0
open("/proc/3235/stat", O_RDONLY)       = 6
read(6, "3235 (apache) Z 3234 3234 3234 3"..., 1024) = 162
close(6)                                = 0
open("/proc/3236/status", O_RDONLY)     = 6
root@test:/home/juan#
Clearly ps goes one by one through all the directories available in /proc, and once it comes across the one requested by "--pid" it writes the output; however, it keeps moving forward until all directories have been checked (this is disturbingly fun when you try it with PID 1).

Now... why would ps do this?

This seems to be a design decision after all. The ps man page states:

"ps - report a snapshot of the current processes."

Note: which suggests ps may actually stand for Processes Snapshot (who knows).

"Except as described below, process selection options are additive. The default selection is discarded, and then the selected processes are added to the set of processes to be displayed. A process will thus be shown if it meets any of the given selection criteria."

So what you actually do with the options is describe to ps which processes you want printed and which information. ps will then go through all of them and print out the ones matching your description.

Summary


With all this in mind, using ps to get details of a process whose PID you already know doesn't seem efficient at all, and can be a terrible idea on systems hosting a big number of processes.

After all, it looks like you would be better off reading the state directly from /proc yourself:
root@test:/home/juan# strace -c grep ^State /proc/3234/status
State:  S (sleeping)
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
  0.00    0.000000           0         5           read
  0.00    0.000000           0         1           write
  0.00    0.000000           0         6           open
  0.00    0.000000           0         9           close
  0.00    0.000000           0         9           fstat
  0.00    0.000000           0         1         1 lseek
  0.00    0.000000           0        14           mmap
  0.00    0.000000           0         8           mprotect
  0.00    0.000000           0         2           munmap
  0.00    0.000000           0         3           brk
  0.00    0.000000           0         2         1 ioctl
  0.00    0.000000           0         5         5 access
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         1           arch_prctl
  0.00    0.000000           0         1           openat
------ ----------- ----------- --------- --------- ----------------
100.00    0.000000                    68         7 total
root@test:/home/juan#

It only took grep 68 syscalls to retrieve the state of the process :P. Yeah I know, now I'm just being a jerk hahaha.

Saturday, May 6, 2017

Understanding SOMAXCONN parameter Part I

This post is kind of a fork of something I worked on last week and tries to shed some light on what the somaxconn parameter means in the context of the Linux network stack. I've tried to prepare something similar to what I did in TCP Keep Alive - how it works, with examples and some traffic captures, so I guess this article will be a bit longer than [I|you]'d like xD, but useful (hopefully).

somaxconn


By definition, this value is the maximum number of established connections a socket may have queued waiting to be accepted by the process holding the socket. This is completely different from tcp_max_syn_backlog, which applies to incomplete connections (connections that haven't been ACKed yet).

To give more context to that statement it's necessary to understand a bit more about sockets. When a socket is created, two things must happen before it can receive connections:
  1. It has to be bound to a local address with bind().
  2. It has to be set to listen() for connections.
The listen() syscall is key here; it receives 2 parameters: the socket file descriptor and the backlog, the maximum number of established connections that may be queued for that socket.
A long time ago (before kernel 2.4.25) the somaxconn value used to be hardcoded in the kernel; nowadays you can read and update it using sysctl, as with the rest of the network stack parameters. In most systems the default value is 128, like in my CentOS VM:

[root@server juan]# sysctl -a --pattern net.core.somaxconn
net.core.somaxconn = 128
[root@server juan]#

Test scenario


In order to test and confirm the behavior of somaxconn I have the following setup:
  • Client VM (IP 192.168.0.17) running a C binary (source here) that starts 25 TCP connections to the Server VM on port 8080; on each connection it sends 4 bytes: "juan". The binary opens 1 connection per second, and after having opened all 25 connections it just waits for 2 minutes and finishes.
  • Server VM (IP 192.168.0.26) running ncat (nc, netcat, whatever you like to call it) listening on port 8080, with the maximum number of open files set to 9 (ulimit -n 9) so we can more easily reach a condition where established connections get queued.
A few more details you should keep in mind:
  • The version of nc shipped with CentOS sets a backlog of 10 on the listen() syscall (using strace):
[juan@server ~]$ ulimit -n 9
[juan@server ~]$ strace nc -lk 8080
execve("/usr/bin/nc", ["nc", "-lk", "8080"], [/* 24 vars */]) = 0
brk(0)                                  = 0x1dd7000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fe7075e9000
...
bind(3, {sa_family=AF_INET6, sin6_port=htons(8080), inet_pton(AF_INET6, "::", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, 128) = 0
listen(3, 10)                           = 0
fcntl(3, F_GETFL)
...
  • So we have a backlog of 10 set in the listen() syscall and a limit of 9 open files. I will treat this backlog value as somaxconn, since the meaning is the same.
  • Right after nc starts, we can see it already has 5 FDs in use (STDIN, STDOUT, STDERR, and two more for the listening sockets on IPv4 and IPv6):
[root@server juan]# lsof -p 14136|grep "CHR\|IPv"
nc      14136 juan    0u   CHR     136,0      0t0        3 /dev/pts/0
nc      14136 juan    1u   CHR     136,0      0t0        3 /dev/pts/0
nc      14136 juan    2u   CHR     136,0      0t0        3 /dev/pts/0
nc      14136 juan    3u  IPv6 488270957      0t0      TCP *:webcache (LISTEN)
nc      14136 juan    4u  IPv4 488270958      0t0      TCP *:webcache (LISTEN)
[root@server juan]#

With all these details in mind and some basic math (2+2=4 :D) we can expect the following behavior:
  • Since 5 out of the 9 available file descriptors are already in use, nc should not be able to receive more than 4 new connections on port 8080. Each new connection will create a new socket that has to be accessible in user space through a file descriptor (have a look at accept() syscall), which can't be created if the max open files limit has been reached.
  • The C binary will issue 25 connections:
    • The first 4 will be answered just fine by nc and the 4 bytes will be read from the socket and printed to STDOUT.
    • The next 10 connections will indeed be accepted by the network layer of the Server VM, and even the 4 bytes will be received; however, nc won't be able to accept the sockets and read the 4 bytes. These are the established connections being queued!!!
    • The remaining 11 connections will look ESTABLISHED from the client, as they have completed the TCP handshake; however, the 4 bytes won't be ACKnowledged by the server's network stack because the queue is full (the somaxconn limit has been reached). From the server's point of view these connections won't even exist.

Binary execution:

juan@test:~/TCP_connections$ gcc -o tcp_connections tcp_connections.c
juan@test:~/TCP_connections$ ./tcp_connections
Socket opened FD=3, connection number 0
Socket opened FD=4, connection number 1
Socket opened FD=5, connection number 2
Socket opened FD=6, connection number 3
Socket opened FD=7, connection number 4
Socket opened FD=8, connection number 5
Socket opened FD=9, connection number 6
Socket opened FD=10, connection number 7
Socket opened FD=11, connection number 8
Socket opened FD=12, connection number 9
Socket opened FD=13, connection number 10
Socket opened FD=14, connection number 11
Socket opened FD=15, connection number 12
Socket opened FD=16, connection number 13
Socket opened FD=17, connection number 14
Socket opened FD=18, connection number 15
Socket opened FD=19, connection number 16
Socket opened FD=20, connection number 17
Socket opened FD=21, connection number 18
Socket opened FD=22, connection number 19
Socket opened FD=23, connection number 20
Socket opened FD=24, connection number 21
Socket opened FD=25, connection number 22
Socket opened FD=26, connection number 23
Socket opened FD=27, connection number 24
juan@test:~/TCP_connections$

Nc execution on the server


Note: I introduced the blank line after the 4th connection just to split those from the rest:
[juan@server ~]$ nc -lvk -p 8080
Ncat: Version 6.40 ( http://nmap.org/ncat )
Ncat: Listening on :::8080
Ncat: Listening on 0.0.0.0:8080
Ncat: Connection from 192.168.0.17.
Ncat: Connection from 192.168.0.17:49026.
juanNcat: Connection from 192.168.0.17.
Ncat: Connection from 192.168.0.17:49027.
juanNcat: Connection from 192.168.0.17.
Ncat: Connection from 192.168.0.17:49028.
juanNcat: Connection from 192.168.0.17.
Ncat: Connection from 192.168.0.17:49029.
juan

Ncat: Connection from 192.168.0.17.
Ncat: Connection from 192.168.0.17:49030.
Ncat: Connection from 192.168.0.17.
Ncat: Connection from 192.168.0.17:49031.
juanNcat: Connection from 192.168.0.17.
Ncat: Connection from 192.168.0.17:49032.
juanNcat: Connection from 192.168.0.17.
Ncat: Connection from 192.168.0.17:49033.
juanNcat: Connection from 192.168.0.17.
Ncat: Connection from 192.168.0.17:49034.
juanNcat: Connection from 192.168.0.17.
Ncat: Connection from 192.168.0.17:49035.
juanNcat: Connection from 192.168.0.17.
Ncat: Connection from 192.168.0.17:49036.
juanNcat: Connection from 192.168.0.17.
Ncat: Connection from 192.168.0.17:49037.
juanNcat: Connection from 192.168.0.17.
Ncat: Connection from 192.168.0.17:49038.
juanNcat: Connection from 192.168.0.17.
Ncat: Connection from 192.168.0.17:49039.
juanNcat: Connection from 192.168.0.17.
Ncat: Connection from 192.168.0.17:49040.
juanjuan 

Results


To have a look at what was happening during the test, I took measurements at 3 different places:
  1. TCP connections on the client using ss.
  2. TCP connections on the server using ss.
  3. Captured network traffic on the server using tcpdump.
These 3 sources provided enough information to understand what was going on, and even to find something interesting :D.

TCP connections on the client:


I've chopped off a couple of lines for the sake of space and time. The samples were taken every 2 seconds and only TCP connections to port 8080 were considered:

root@test:/home/juan# for i in {1..100}; do ss -ntpe|grep "State\|8080"; echo ---; sleep 2; done
State      Recv-Q Send-Q        Local Address:Port          Peer Address:Port
ESTAB      0      0              192.168.0.17:49026         192.168.0.26:8080   users:(("tcp_connections",3342,3)) uid:1000 ino:23326 sk:ffff8800b4914fc0
---
State      Recv-Q Send-Q        Local Address:Port          Peer Address:Port
ESTAB      0      0              192.168.0.17:49027         192.168.0.26:8080   users:(("tcp_connections",3342,4)) uid:1000 ino:23350 sk:ffff8800b4914140
ESTAB      0      0              192.168.0.17:49028         192.168.0.26:8080   users:(("tcp_connections",3342,5)) uid:1000 ino:23368 sk:ffff8800b4910e80
ESTAB      0      0              192.168.0.17:49026         192.168.0.26:8080   users:(("tcp_connections",3342,3)) uid:1000 ino:23326 sk:ffff8800b4914fc0
---
...
---
State      Recv-Q Send-Q        Local Address:Port          Peer Address:Port
ESTAB      0      0              192.168.0.17:49031         192.168.0.26:8080   users:(("tcp_connections",3342,8)) uid:1000 ino:23408 sk:ffff8800b4911d00
ESTAB      0      0              192.168.0.17:49027         192.168.0.26:8080   users:(("tcp_connections",3342,4)) uid:1000 ino:23350 sk:ffff8800b4914140
ESTAB      0      0              192.168.0.17:49029         192.168.0.26:8080   users:(("tcp_connections",3342,6)) uid:1000 ino:23372 sk:ffff8800b4910740
ESTAB      0      0              192.168.0.17:49030         192.168.0.26:8080   users:(("tcp_connections",3342,7)) uid:1000 ino:23390 sk:ffff8800b49115c0
ESTAB      0      0              192.168.0.17:49028         192.168.0.26:8080   users:(("tcp_connections",3342,5)) uid:1000 ino:23368 sk:ffff8800b4910e80
ESTAB      0      0              192.168.0.17:49026         192.168.0.26:8080   users:(("tcp_connections",3342,3)) uid:1000 ino:23326 sk:ffff8800b4914fc0
ESTAB      0      0              192.168.0.17:49033         192.168.0.26:8080   users:(("tcp_connections",3342,10)) uid:1000 ino:23445 sk:ffff8800b4912b80
ESTAB      0      0              192.168.0.17:49034         192.168.0.26:8080   users:(("tcp_connections",3342,11)) uid:1000 ino:23447 sk:ffff8800b49132c0
ESTAB      0      0              192.168.0.17:49032         192.168.0.26:8080   users:(("tcp_connections",3342,9)) uid:1000 ino:23426 sk:ffff8800b4912440
---
...


After a few seconds, we can see several ESTABLISHED TCP connections using source ports 49026 to 49034 (9 connections, to be precise), and they all show 0 bytes in Send-Q, which suggests the 4 bytes sent by the application were actually acknowledged by the server.

...
---
State      Recv-Q Send-Q        Local Address:Port          Peer Address:Port
ESTAB      0      0              192.168.0.17:49031         192.168.0.26:8080   users:(("tcp_connections",3342,8)) uid:1000 ino:23408 sk:ffff8800b4911d00
ESTAB      0      0              192.168.0.17:49027         192.168.0.26:8080   users:(("tcp_connections",3342,4)) uid:1000 ino:23350 sk:ffff8800b4914140
ESTAB      0      0              192.168.0.17:49029         192.168.0.26:8080   users:(("tcp_connections",3342,6)) uid:1000 ino:23372 sk:ffff8800b4910740
ESTAB      0      0              192.168.0.17:49030         192.168.0.26:8080   users:(("tcp_connections",3342,7)) uid:1000 ino:23390 sk:ffff8800b49115c0
ESTAB      0      0              192.168.0.17:49028         192.168.0.26:8080   users:(("tcp_connections",3342,5)) uid:1000 ino:23368 sk:ffff8800b4910e80
ESTAB      0      0              192.168.0.17:49037         192.168.0.26:8080   users:(("tcp_connections",3342,14)) uid:1000 ino:23502 sk:ffff8800b4917400
ESTAB      0      0              192.168.0.17:49035         192.168.0.26:8080   users:(("tcp_connections",3342,12)) uid:1000 ino:23465 sk:ffff8800b4913a00
ESTAB      0      0              192.168.0.17:49036         192.168.0.26:8080   users:(("tcp_connections",3342,13)) uid:1000 ino:23484 sk:ffff8800b4915e40
ESTAB      0      0              192.168.0.17:49026         192.168.0.26:8080   users:(("tcp_connections",3342,3)) uid:1000 ino:23326 sk:ffff8800b4914fc0
ESTAB      0      0              192.168.0.17:49038         192.168.0.26:8080   users:(("tcp_connections",3342,15)) uid:1000 ino:23520 sk:ffff8800b4916580
ESTAB      0      0              192.168.0.17:49033         192.168.0.26:8080   users:(("tcp_connections",3342,10)) uid:1000 ino:23445 sk:ffff8800b4912b80
ESTAB      0      0              192.168.0.17:49034         192.168.0.26:8080   users:(("tcp_connections",3342,11)) uid:1000 ino:23447 sk:ffff8800b49132c0
ESTAB      0      0              192.168.0.17:49040         192.168.0.26:8080   users:(("tcp_connections",3342,17)) uid:1000 ino:23557 sk:ffff8800b4914880
ESTAB      0      0              192.168.0.17:49039         192.168.0.26:8080   users:(("tcp_connections",3342,16)) uid:1000 ino:23539 sk:ffff8800b4915700
ESTAB      0      0              192.168.0.17:49032         192.168.0.26:8080   users:(("tcp_connections",3342,9)) uid:1000 ino:23426 sk:ffff8800b4912440
---
...

A few more seconds later we see even more TCP connections in ESTABLISHED state, now including source ports 49035 to 49040. So we have a total of 15 ESTABLISHED connections with 0 bytes waiting to be acknowledged by the server. Remember we were expecting 14 (4 + 10), but instead we got 15 (4 + 11); odd, right? I'll get back to this later...

...
---
State      Recv-Q Send-Q        Local Address:Port          Peer Address:Port
ESTAB      0      0              192.168.0.17:49031         192.168.0.26:8080   users:(("tcp_connections",3342,8)) uid:1000 ino:23408 sk:ffff8800b4911d00
ESTAB      0      0              192.168.0.17:49027         192.168.0.26:8080   users:(("tcp_connections",3342,4)) uid:1000 ino:23350 sk:ffff8800b4914140
ESTAB      0      0              192.168.0.17:49029         192.168.0.26:8080   users:(("tcp_connections",3342,6)) uid:1000 ino:23372 sk:ffff8800b4910740
ESTAB      0      0              192.168.0.17:49030         192.168.0.26:8080   users:(("tcp_connections",3342,7)) uid:1000 ino:23390 sk:ffff8800b49115c0
ESTAB      0      0              192.168.0.17:49028         192.168.0.26:8080   users:(("tcp_connections",3342,5)) uid:1000 ino:23368 sk:ffff8800b4910e80
ESTAB      0      0              192.168.0.17:49037         192.168.0.26:8080   users:(("tcp_connections",3342,14)) uid:1000 ino:23502 sk:ffff8800b4917400
ESTAB      0      0              192.168.0.17:49035         192.168.0.26:8080   users:(("tcp_connections",3342,12)) uid:1000 ino:23465 sk:ffff8800b4913a00
ESTAB      0      0              192.168.0.17:49036         192.168.0.26:8080   users:(("tcp_connections",3342,13)) uid:1000 ino:23484 sk:ffff8800b4915e40
ESTAB      0      0              192.168.0.17:49026         192.168.0.26:8080   users:(("tcp_connections",3342,3)) uid:1000 ino:23326 sk:ffff8800b4914fc0
ESTAB      0      0              192.168.0.17:49038         192.168.0.26:8080   users:(("tcp_connections",3342,15)) uid:1000 ino:23520 sk:ffff8800b4916580
ESTAB      0      0              192.168.0.17:49033         192.168.0.26:8080   users:(("tcp_connections",3342,10)) uid:1000 ino:23445 sk:ffff8800b4912b80
ESTAB      0      0              192.168.0.17:49034         192.168.0.26:8080   users:(("tcp_connections",3342,11)) uid:1000 ino:23447 sk:ffff8800b49132c0
ESTAB      0      0              192.168.0.17:49040         192.168.0.26:8080   users:(("tcp_connections",3342,17)) uid:1000 ino:23557 sk:ffff8800b4914880
ESTAB      0      4              192.168.0.17:49041         192.168.0.26:8080   timer:(on,1.208ms,3) users:(("tcp_connections",3342,18)) uid:1000 ino:23575 sk:ffff8800b4916cc0
ESTAB      0      0              192.168.0.17:49039         192.168.0.26:8080   users:(("tcp_connections",3342,16)) uid:1000 ino:23539 sk:ffff8800b4915700
ESTAB      0      0              192.168.0.17:49032         192.168.0.26:8080   users:(("tcp_connections",3342,9)) uid:1000 ino:23426 sk:ffff8800b4912440
ESTAB      0      4              192.168.0.17:49042         192.168.0.26:8080   timer:(on,608ms,2) users:(("tcp_connections",3342,19)) uid:1000 ino:23593 sk:ffff8800b4910000
---
...

This new sample shows 2 lines that differ from the rest, the ones using source ports 49041 and 49042. These were the first 2 connections that couldn't fit in the somaxconn queue!!! Even more, you can see how the 4 bytes sent from the client haven't been acknowledged by the server, and therefore show up in the Send-Q column. Note how these connections still look ESTABLISHED to the client, though; this is because the 3-way handshake was completed from the client's point of view.

...
---
State      Recv-Q Send-Q        Local Address:Port          Peer Address:Port
ESTAB      0      0              192.168.0.17:49031         192.168.0.26:8080   users:(("tcp_connections",3342,8)) uid:1000 ino:23408 sk:ffff8800b4911d00
ESTAB      0      0              192.168.0.17:49027         192.168.0.26:8080   users:(("tcp_connections",3342,4)) uid:1000 ino:23350 sk:ffff8800b4914140
ESTAB      0      0              192.168.0.17:49029         192.168.0.26:8080   users:(("tcp_connections",3342,6)) uid:1000 ino:23372 sk:ffff8800b4910740
ESTAB      0      4              192.168.0.17:49047         192.168.0.26:8080   timer:(on,460ms,4) users:(("tcp_connections",3342,24)) uid:1000 ino:23632 sk:ffff8800a6efba00
ESTAB      0      4              192.168.0.17:49050         192.168.0.26:8080   timer:(on,132ms,3) users:(("tcp_connections",3342,27)) uid:1000 ino:23635 sk:ffff8800a6efc140
ESTAB      0      0              192.168.0.17:49030         192.168.0.26:8080   users:(("tcp_connections",3342,7)) uid:1000 ino:23390 sk:ffff8800b49115c0
ESTAB      0      0              192.168.0.17:49028         192.168.0.26:8080   users:(("tcp_connections",3342,5)) uid:1000 ino:23368 sk:ffff8800b4910e80
ESTAB      0      4              192.168.0.17:49045         192.168.0.26:8080   timer:(on,4.740ms,5) users:(("tcp_connections",3342,22)) uid:1000 ino:23630 sk:ffff8800a6efcfc0
ESTAB      0      4              192.168.0.17:49049         192.168.0.26:8080   timer:(on,2.336ms,4) users:(("tcp_connections",3342,26)) uid:1000 ino:23634 sk:ffff8800a6efc880
ESTAB      0      0              192.168.0.17:49037         192.168.0.26:8080   users:(("tcp_connections",3342,14)) uid:1000 ino:23502 sk:ffff8800b4917400
ESTAB      0      0              192.168.0.17:49035         192.168.0.26:8080   users:(("tcp_connections",3342,12)) uid:1000 ino:23465 sk:ffff8800b4913a00
ESTAB      0      4              192.168.0.17:49043         192.168.0.26:8080   timer:(on,2.740ms,5) users:(("tcp_connections",3342,20)) uid:1000 ino:23611 sk:ffff8800a6efde40
ESTAB      0      0              192.168.0.17:49036         192.168.0.26:8080   users:(("tcp_connections",3342,13)) uid:1000 ino:23484 sk:ffff8800b4915e40
ESTAB      0      4              192.168.0.17:49044         192.168.0.26:8080   timer:(on,3.732ms,5) users:(("tcp_connections",3342,21)) uid:1000 ino:23616 sk:ffff8800a6efd700
ESTAB      0      4              192.168.0.17:49046         192.168.0.26:8080   timer:(on,5.740ms,5) users:(("tcp_connections",3342,23)) uid:1000 ino:23631 sk:ffff8800a6eff400
ESTAB      0      0              192.168.0.17:49026         192.168.0.26:8080   users:(("tcp_connections",3342,3)) uid:1000 ino:23326 sk:ffff8800b4914fc0
ESTAB      0      4              192.168.0.17:49048         192.168.0.26:8080   timer:(on,1.336ms,4) users:(("tcp_connections",3342,25)) uid:1000 ino:23633 sk:ffff8800a6efecc0
ESTAB      0      0              192.168.0.17:49038         192.168.0.26:8080   users:(("tcp_connections",3342,15)) uid:1000 ino:23520 sk:ffff8800b4916580
ESTAB      0      0              192.168.0.17:49033         192.168.0.26:8080   users:(("tcp_connections",3342,10)) uid:1000 ino:23445 sk:ffff8800b4912b80
ESTAB      0      0              192.168.0.17:49034         192.168.0.26:8080   users:(("tcp_connections",3342,11)) uid:1000 ino:23447 sk:ffff8800b49132c0
ESTAB      0      0              192.168.0.17:49040         192.168.0.26:8080   users:(("tcp_connections",3342,17)) uid:1000 ino:23557 sk:ffff8800b4914880
ESTAB      0      4              192.168.0.17:49041         192.168.0.26:8080   timer:(on,732ms,5) users:(("tcp_connections",3342,18)) uid:1000 ino:23575 sk:ffff8800b4916cc0
ESTAB      0      0              192.168.0.17:49039         192.168.0.26:8080   users:(("tcp_connections",3342,16)) uid:1000 ino:23539 sk:ffff8800b4915700
ESTAB      0      0              192.168.0.17:49032         192.168.0.26:8080   users:(("tcp_connections",3342,9)) uid:1000 ino:23426 sk:ffff8800b4912440
ESTAB      0      4              192.168.0.17:49042         192.168.0.26:8080   timer:(on,1.732ms,5) users:(("tcp_connections",3342,19)) uid:1000 ino:23593 sk:ffff8800b4910000
---
^C
root@test:/home/juan#

The last sample shows the remaining connections (src ports 49043 to 49050), all of them in the same condition: 4 bytes not yet acknowledged.

From the client-side samples we can confirm we got 15 complete TCP connections (src ports from 49026 to 49040) and 10 that, even though they showed up as ESTABLISHED, never got the 4 bytes sent by the client acknowledged (src ports from 49041 to 49050).

TCP connections on the server


Let's now have a look at the samples taken on the server and see whether they match what we found on the client.

[root@server juan]# for i in {1..100}; do ss -tpn|grep "State\|8080"; echo "---"; sleep 2;done
State      Recv-Q Send-Q Local Address:Port               Peer Address:Port
ESTAB      0      0      192.168.0.26:8080               192.168.0.17:49026               users:(("nc",pid=13705,fd=5))
---
State      Recv-Q Send-Q Local Address:Port               Peer Address:Port
ESTAB      0      0      192.168.0.26:8080               192.168.0.17:49026               users:(("nc",pid=13705,fd=5))
ESTAB      0      0      192.168.0.26:8080               192.168.0.17:49028               users:(("nc",pid=13705,fd=7))
ESTAB      0      0      192.168.0.26:8080               192.168.0.17:49027               users:(("nc",pid=13705,fd=6))
---
...

The first samples look OK, nothing fancy there: we can see the first 3 TCP connections using src ports from 49026 to 49028.

...
---
State      Recv-Q Send-Q Local Address:Port               Peer Address:Port
ESTAB      0      0      192.168.0.26:8080               192.168.0.17:49026               users:(("nc",pid=13705,fd=5))
ESTAB      4      0      192.168.0.26:8080               192.168.0.17:49031
ESTAB      4      0      192.168.0.26:8080               192.168.0.17:49033
ESTAB      0      0      192.168.0.26:8080               192.168.0.17:49029               users:(("nc",pid=13705,fd=8))
ESTAB      0      0      192.168.0.26:8080               192.168.0.17:49028               users:(("nc",pid=13705,fd=7))
ESTAB      4      0      192.168.0.26:8080               192.168.0.17:49032
ESTAB      4      0      192.168.0.26:8080               192.168.0.17:49030
ESTAB      0      0      192.168.0.26:8080               192.168.0.17:49027               users:(("nc",pid=13705,fd=6))
ESTAB      4      0      192.168.0.26:8080               192.168.0.17:49034
---
State      Recv-Q Send-Q Local Address:Port               Peer Address:Port
ESTAB      4      0      192.168.0.26:8080               192.168.0.17:49036
ESTAB      0      0      192.168.0.26:8080               192.168.0.17:49026               users:(("nc",pid=13705,fd=5))
ESTAB      4      0      192.168.0.26:8080               192.168.0.17:49035
ESTAB      4      0      192.168.0.26:8080               192.168.0.17:49031
ESTAB      4      0      192.168.0.26:8080               192.168.0.17:49033
ESTAB      0      0      192.168.0.26:8080               192.168.0.17:49029               users:(("nc",pid=13705,fd=8))
ESTAB      0      0      192.168.0.26:8080               192.168.0.17:49028               users:(("nc",pid=13705,fd=7))
ESTAB      4      0      192.168.0.26:8080               192.168.0.17:49032
ESTAB      4      0      192.168.0.26:8080               192.168.0.17:49030
ESTAB      0      0      192.168.0.26:8080               192.168.0.17:49027               users:(("nc",pid=13705,fd=6))
ESTAB      4      0      192.168.0.26:8080               192.168.0.17:49034
---
...

And here it got interesting! We now have 4 connections (src ports between 49026 and 49029) which, besides being ESTABLISHED, have an FD associated to our nc process. However, we have another 7 ESTABLISHED connections (src ports from 49030 to 49036) with no FD associated (and no process either :D)!!! These 7 connections are the ones waiting on the queue for nc to pick them up with accept(). nc is actually issuing the accept() calls, but they keep failing because nc has already reached its maximum number of open files (due to the "ulimit -n 9").
 
...
---
State      Recv-Q Send-Q Local Address:Port               Peer Address:Port
ESTAB      4      0      192.168.0.26:8080               192.168.0.17:49036
ESTAB      0      0      192.168.0.26:8080               192.168.0.17:49026               users:(("nc",pid=13705,fd=5))
ESTAB      4      0      192.168.0.26:8080               192.168.0.17:49038
ESTAB      4      0      192.168.0.26:8080               192.168.0.17:49035
ESTAB      4      0      192.168.0.26:8080               192.168.0.17:49031
ESTAB      4      0      192.168.0.26:8080               192.168.0.17:49033
ESTAB      0      0      192.168.0.26:8080               192.168.0.17:49029               users:(("nc",pid=13705,fd=8))
ESTAB      0      0      192.168.0.26:8080               192.168.0.17:49028               users:(("nc",pid=13705,fd=7))
ESTAB      4      0      192.168.0.26:8080               192.168.0.17:49032
ESTAB      4      0      192.168.0.26:8080               192.168.0.17:49039
ESTAB      4      0      192.168.0.26:8080               192.168.0.17:49030
ESTAB      0      0      192.168.0.26:8080               192.168.0.17:49027               users:(("nc",pid=13705,fd=6))
ESTAB      4      0      192.168.0.26:8080               192.168.0.17:49037
ESTAB      4      0      192.168.0.26:8080               192.168.0.17:49040
ESTAB      4      0      192.168.0.26:8080               192.168.0.17:49034
---
^C
[root@server juan]#

After the binary issued the 25 connections, on the server we can see 4 ESTABLISHED connections (src ports 49026 to 49029) with corresponding FDs and 11 ESTABLISHED connections (src ports 49030 to 49040) with no FD. So... where are the remaining 10 connections that show up as ESTABLISHED on the client??? Well, these 10 connections never reached the ESTABLISHED state on the server :D, as per the documentation of listen():

The backlog argument defines the maximum length to which the queue of pending connections for sockfd may grow.  If a connection request arrives when the queue is full, the client may receive an error with an indication of ECONNREFUSED or, if the underlying protocol supports retransmission, the request may be ignored so that a later reattempt at connection succeeds.

Considering we are using TCP, our use case matches exactly the underlined part.

To be continued...

jueves, 22 de septiembre de 2016

Hard links and Soft links - Do you really understand them?

I've noticed the Hard/Soft link concept seems to be a bit confusing for some people; some of them don't even know Hard links exist. Well, I must confess that I didn't quite get them at the beginning either, however after playing around a bit and breaking stuff you get to understand (and love) them. Therefore this post is intended to explain them in detail, covering their differences and the reasons behind them.

Test scenario


The test scenario is pretty simple:
juan@test:~/hard_soft$ ls -lai
total 16
457232 drwxrwxr-x  2 juan juan 4096 Sep 21 17:45 .
415465 drwxr-xr-x 39 juan juan 4096 Sep 21 17:45 ..
457337 -rw-rw-r--  1 juan juan   14 Sep 21 17:45 File1
juan@test:~/hard_soft$
just a folder with 1 file that we are going to use as target for the links. The content of the file is just the string "File1 content\n", so 14 bytes. We can see the inode number of File1 is 457337 (every file has an inode number). We also see a number 1 in the third column; that is the number of "Hard Links" the inode has, we'll see this in more detail later on :D.

We can also get similar information (and more) by using the stat program:
juan@test:~/hard_soft$ stat File1
File: ‘File1’
  Size: 14              Blocks: 8          IO Block: 4096   regular file
Device: 801h/2049d      Inode: 457337      Links: 1
Access: (0664/-rw-rw-r--)  Uid: ( 1000/    juan)   Gid: ( 1000/    juan)
Access: 2016-09-21 17:45:43.744729861 +0100
Modify: 2016-09-21 17:45:43.744729861 +0100
Change: 2016-09-21 17:45:43.744729861 +0100
 Birth: -
juan@test:~/hard_soft$ 
so this is the file we'll use to play with.

Soft links aka Symbolic link


Let's take a look at the soft ones first. Long story short, creating a soft link means creating a new file that points to a target file, therefore you end up having 2 files, yeahp, trust me! You can see that happening here:
juan@test:~/hard_soft$ ln -s File1 LinkToFile1
juan@test:~/hard_soft$ ls -lai
total 16
457232 drwxrwxr-x  2 juan juan 4096 Sep 21 18:06 .
415465 drwxr-xr-x 39 juan juan 4096 Sep 21 17:45 ..
457337 -rw-rw-r--  1 juan juan   14 Sep 21 17:45 File1
457342 lrwxrwxrwx  1 juan juan    5 Sep 21 18:06 LinkToFile1 -> File1
juan@test:~/hard_soft$
I just used "ln -s" to create a soft link called LinkToFile1 that points to File1. The particular syscall in use here is symlink, as we can see here:
juan@test:~/hard_soft$ strace ln -s File1 Link1ToFile1
execve("/bin/ln", ["ln", "-s", "File1", "Link1ToFile1"], [/* 22 vars */]) = 
brk(0)                                  = 0x13e0000
...
stat("Link1ToFile1", 0x7ffc1f9a1470)    = -1 ENOENT (No such file or directory)
symlink("File1", "Link1ToFile1")        = 0
lseek(0, 0, SEEK_CUR)                   = -1 ESPIPE (Illegal seek)
close(0)                                = 0
close(1)                                = 0
close(2)                                = 0
exit_group(0)                           = ?
+++ exited with 0 +++
juan@test:~/hard_soft$
ln first checks if there's a file already called Link1ToFile1 and since there isn't any, it moves forward and creates the symbolic link with that name pointing to file File1.

Taking a look at the "ls" output:
  • File LinkToFile1 has its own inode number, 457342, therefore it is an independent file. Also, the 1 in the third column suggests there's a single link to the inode.
  • Before the permissions we have an "l", which indicates this is not a regular file but a symbolic link.
  • The file size is different, isn't that funny? Just 5 bytes! Why?
  • Permissions are kind of open, right? Yeahp, that's normal for soft links, the permissions of the target file are the ones that matter.
Regarding the size of LinkToFile1, what are these 5 bytes?
juan@test:~/hard_soft$ cat LinkToFile1
File1 content
juan@test:~/hard_soft$
oops... of course, doing "cat LinkToFile1" is in the end doing cat to File1! So how can we actually read the content of LinkToFile1 itself? Let's see if strace can help here (wanna know more about strace? take a look at this post):
juan@test:~/hard_soft$ strace cat LinkToFile1
execve("/bin/cat", ["cat", "LinkToFile1"], [/* 22 vars */]) = 0
brk(0)                                  = 0x149f000
...
open("LinkToFile1", O_RDONLY)           = 3
fstat(3, {st_mode=S_IFREG|0664, st_size=14, ...}) = 0
fadvise64(3, 0, 0, POSIX_FADV_SEQUENTIAL) = 0
read(3, "File1 content\n", 65536)       = 14
write(1, "File1 content\n", 14File1 content
)         = 14
read(3, "", 65536)                      = 0
close(3)                                = 0
close(1)                                = 0
close(2)                                = 0
exit_group(0)                           = ?
+++ exited with 0 +++
juan@test:~/hard_soft$
turns out that by default the open syscall will recognize the file as a soft link and will follow it (you can avoid this with certain flags). In the end, the returned FD 3 will actually point to File1 and that's why the read syscall returns "File1 content\n". So how can we actually retrieve the content of LinkToFile1? Well, we can use the readlink program (which actually uses the readlink syscall xD) to read the content of a symbolic link, just like this:
juan@test:~/hard_soft$ readlink LinkToFile1
File1
juan@test:~/hard_soft$
Yes :D, the content of LinkToFile1 is File1, the name of the file (the relative path, actually), and that's why the size is 5 bytes!!! But if the content of LinkToFile1 is a path to File1, what happens if I move File1 somewhere else?
Lets have a look:
juan@test:~/hard_soft$ mv File1 ..
juan@test:~/hard_soft$ ls -lai
total 12
457232 drwxrwxr-x  2 juan juan 4096 Sep 21 18:44 .
415465 drwxr-xr-x 39 juan juan 4096 Sep 21 18:44 ..
457342 lrwxrwxrwx  1 juan juan    5 Sep 21 18:06 LinkToFile1 -> File1
juan@test:~/hard_soft$ readlink LinkToFile1
File1
juan@test:~/hard_soft$ cat LinkToFile1
cat: LinkToFile1: No such file or directory
juan@test:~/hard_soft$
exactly, the link breaks and it doesn't work anymore! Same thing happens if we remove the target file or if we move LinkToFile1 instead:
juan@test:~/hard_soft$ ls
File1  File2  LinkToFile1
juan@test:~/hard_soft$ mv LinkToFile1 ../
juan@test:~/hard_soft$ ll -i ../LinkToFile1
457342 lrwxrwxrwx 1 juan juan 5 Sep 21 18:06 ../LinkToFile1 -> File1
juan@test:~/hard_soft$ cat ../LinkToFile1
cat: ../LinkToFile1: No such file or directory
juan@test:~/hard_soft$
We could work around the moving-the-link issue by using the full path to File1 as target when creating the link instead of the relative one:
juan@test:~/hard_soft$ ln -s /home/juan/hard_soft/File1 Link2ToFile1
juan@test:~/hard_soft$ ls -lai
total 16
457232 drwxrwxr-x  2 juan juan 4096 Sep 21 18:50 .
415465 drwxr-xr-x 39 juan juan 4096 Sep 21 18:50 ..
457337 -rw-rw-r--  1 juan juan   14 Sep 21 17:45 File1
457343 lrwxrwxrwx  1 juan juan   26 Sep 21 18:50 Link2ToFile1 -> /home/juan/hard_soft/File1
457342 lrwxrwxrwx  1 juan juan    5 Sep 21 18:06 LinkToFile1 -> File1
juan@test:~/hard_soft$ readlink Link2ToFile1
/home/juan/hard_soft/File1
juan@test:~/hard_soft$
What if I delete a Soft link? Since soft links are just files, deleting one of them will just make it go away and nothing will happen to the target/linked file:
juan@test:~/hard_soft$ rm Link*
juan@test:~/hard_soft$ ls -lai
total 12
457232 drwxrwxr-x  2 juan juan 4096 Sep 21 19:14 .
415465 drwxr-xr-x 39 juan juan 4096 Sep 21 18:50 ..
457337 -rw-rw-r--  1 juan juan   14 Sep 21 17:45 File1
juan@test:~/hard_soft$
What actually happens when you delete a file is that the number of links in the inode is decreased by one (and the entry gets removed from the directory). Once the number of links reaches 0, the file is officially gone (unless there's a running process that still has an FD pointing to it).

So... Soft links are just files whose content is the path to the targeted/linked file; makes perfect sense now, right?!

Hard links


The story is a bit different with Hard Links. When you create one you DO NOT get an extra file, nope you don't. Creating a Hard Link increases the number of links for a particular inode; let's see an example:
juan@test:~/hard_soft$ ln File1 Link1ToFile1
juan@test:~/hard_soft$ ls -lai
total 16
457232 drwxrwxr-x  2 juan juan 4096 Sep 21 19:24 .
415465 drwxr-xr-x 39 juan juan 4096 Sep 21 18:50 ..
457337 -rw-rw-r--  2 juan juan   14 Sep 21 17:45 File1
457337 -rw-rw-r--  2 juan juan   14 Sep 21 17:45 Link1ToFile1
juan@test:~/hard_soft$
Note: the syscall in play here is link or linkat.

Interesting! looks like 2 files, but:
  • They both have the same inode number! Therefore they are the same file. Directory-wise there are indeed two directory entries, one for File1 and one for Link1ToFile1, but both point to the same inode.
  • Permissions are the same, which makes sense because it is the same file (xD you are getting my point, right?), and that also applies to the rest of the properties, MAC times for example.
Does that mean that if I add a third Hard link I'll get 3 as the number of links for inode 457337? Yeahp, that's correct:
juan@test:~/hard_soft$ ln File1 Link2ToFile1
juan@test:~/hard_soft$ ls -lai
total 20
457232 drwxrwxr-x  2 juan juan 4096 Sep 21 19:56 .
415465 drwxr-xr-x 39 juan juan 4096 Sep 21 18:50 ..
457337 -rw-rw-r--  3 juan juan   14 Sep 21 17:45 File1
457337 -rw-rw-r--  3 juan juan   14 Sep 21 17:45 Link1ToFile1
457337 -rw-rw-r--  3 juan juan   14 Sep 21 17:45 Link2ToFile1
juan@test:~/hard_soft$
The good thing about hard links is that you can move them around and they just keep working:
juan@test:~/hard_soft$ mv Link1ToFile1 ../
juan@test:~/hard_soft$ cat ../Link1ToFile1
File1 content
juan@test:~/hard_soft$
that's because by moving the hard link, the directory entry in the hard_soft directory was removed and a corresponding one was created in the parent directory (/home/juan), so accessing the link keeps working. Did this change the number of links on inode 457337?
juan@test:~/hard_soft$ ls -lai
total 16
457232 drwxrwxr-x  2 juan juan 4096 Sep 21 20:01 .
415465 drwxr-xr-x 39 juan juan 4096 Sep 21 20:01 ..
457337 -rw-rw-r--  3 juan juan   14 Sep 21 17:45 File1
457337 -rw-rw-r--  3 juan juan   14 Sep 21 17:45 Link2ToFile1
juan@test:~/hard_soft$ ls -lai ../Link1ToFile1
457337 -rw-rw-r-- 3 juan juan 14 Sep 21 17:45 ../Link1ToFile1
juan@test:~/hard_soft$
of course not, inode 457337 keeps having 3 links.

Then what if I delete a Hard link? As we mentioned before, deleting a file decreases the link counter on the inode by one, therefore if we have 3 hard links and we delete one of them we'll be back to 2, as you can see here:
juan@test:~/hard_soft$ ls -lai
total 20
457232 drwxrwxr-x  2 juan juan 4096 Sep 22 19:52 .
415465 drwxr-xr-x 39 juan juan 4096 Sep 22 19:51 ..
457337 -rw-rw-r--  3 juan juan   14 Sep 21 17:45 File1
457337 -rw-rw-r--  3 juan juan   14 Sep 21 17:45 Link1ToFile1
457337 -rw-rw-r--  3 juan juan   14 Sep 21 17:45 Link2ToFile1
juan@test:~/hard_soft$ rm Link2ToFile1
juan@test:~/hard_soft$ ls -lai
total 16
457232 drwxrwxr-x  2 juan juan 4096 Sep 22 19:53 .
415465 drwxr-xr-x 39 juan juan 4096 Sep 22 19:51 ..
457337 -rw-rw-r--  2 juan juan   14 Sep 21 17:45 File1
457337 -rw-rw-r--  2 juan juan   14 Sep 21 17:45 Link1ToFile1
juan@test:~/hard_soft$

Summary


To wrap up the idea, I'll summarize the most important points here:

  • A soft link is another file (of link type, though); its content indicates the location of the targeted file.
  • You can have a soft link pointing to a file in a different partition, which is NOT possible with hard links.
  • Hard links don't require extra space or inodes, and they can be moved around (within the same partition) and will keep working fine.
  • Every time you create a file, a hard link is created, and that's link 1 in the inode :D.

viernes, 19 de agosto de 2016

Strace 101 - stracing your stuff!

It's been a while since the last blog post now, like 4 months went by! I remember once I thought I could write one post per week xD, that did not work at all hahaha. Anyway... the last few days I had the chance to spend some time working with strace and perf, so I decided to write something about it here as well.

Strace? What the heck is it?

Strace is a nice debugging/troubleshooting tool for Linux that helps you identify the syscalls a particular process is using and the signals a process receives (more details here). Syscalls are basically the interface to access kernel services. Therefore, knowing which particular syscalls a program is issuing and what the results of these calls are is really interesting when debugging software problems.

But how is it possible that a User space process is able to see the syscalls another User space process issues? Well, there's a kernel feature called ptrace that makes that possible (yeah... ptrace is a system call xD). By definition ptrace is:

The ptrace() system call provides a means by which one process (the "tracer") may observe and control the execution of another process (the "tracee"), and examine and change the tracee's memory and registers. It is primarily used to implement breakpoint debugging and system call tracing. 

There are basically 2 ways you can strace a process to see its syscalls, you can either launch the process using strace or you can attach strace to a running process (under certain conditions :D).

How does it work? Show me the money!

I strongly believe the best way to explain something is by showing working examples, so there we go... Let's see how many syscalls (and which ones) the classic "Hello World" issues. The C code is:
#include <stdio.h>
int main()
{
     printf("Hello World\n");
     return 0;
}
Compile and run:
juan@juan-VirtualBox:~$ gcc -o classic classic.c 
juan@juan-VirtualBox:~$ ./classic 
Hello World
juan@juan-VirtualBox:~$ 
now we run it using strace instead:
juan@juan-VirtualBox:~$ strace ./classic
execve("./classic", ["./classic"], [/* 60 vars */]) = 0
brk(0)  = 0x7f0000
access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f0362d17000
access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=85679, ...}) = 0
mmap(NULL, 85679, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f0362d02000
close(3) = 0
access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
open("/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0P \2\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=1840928, ...}) = 0
mmap(NULL, 3949248, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f0362732000
mprotect(0x7f03628ec000, 2097152, PROT_NONE) = 0
mmap(0x7f0362aec000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1ba000) = 0x7f0362aec000
mmap(0x7f0362af2000, 17088, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f0362af2000
close(3) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f0362d01000
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f0362cff000
arch_prctl(ARCH_SET_FS, 0x7f0362cff740) = 0
mprotect(0x7f0362aec000, 16384, PROT_READ) = 0
mprotect(0x600000, 4096, PROT_READ) = 0
mprotect(0x7f0362d19000, 4096, PROT_READ) = 0
munmap(0x7f0362d02000, 85679) = 0
fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 2), ...}) = 0 
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f0362d16000
write(1, "Hello World\n", 12Hello World
) = 12
exit_group(0) = ?
+++ exited with 12 +++
juan@juan-VirtualBox:~$
The output provides one system call per line, and each line includes the syscall name, the parameters used to invoke it and the result returned (after the =). As we can see, the simplest piece of code you can imagine will actually make use of many syscalls. I will only comment on a few of them (probably in a future post we can dive deeper there :D):
  • execve (the first one): just before this call strace forked itself and the child process called the ptrace syscall, allowing the strace parent to trace it. After all that, the process runs execve, changing its running code to our classic C program.
  • brk changes the location of the program break, which defines the end of the process's data segment. Increasing it has the effect of allocating memory to the process; decreasing the break deallocates memory. In this case it is called with 0 as increment, which makes the call return the current program break (0x7f0000).
  • we see a couple of mmap syscalls mapping memory regions as anonymous and also some mapping the library libc.so.6. This is the Dynamic Linker doing its job and adding all the necessary libraries to the process memory space.
  • there are also a few open syscalls, opening files like /etc/ld.so.cache where it can find a list of the available system libraries.
  • just before finishing we see a write syscall, sending our classic "Hello World" to file descriptor 1, also known as STDOUT (standard output). Since both strace and classic are sending their standard output to the console we can see how they collided in lines 29 and 30.
  • the last call in place was exit_group; it's the equivalent of the exit syscall but it terminates not only the calling thread but all the threads in the thread group (this particular example was single threaded).
This should provide a fair idea of what a particular piece of software does and which kernel services it accesses. However, sometimes we don't want to go into so much detail, but instead would like to see a summary. We can easily get that with the -c flag:
juan@juan-VirtualBox:~$ strace -c ./classic 
Hello World
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
  0.00    0.000000           0         1           read
  0.00    0.000000           0         1           write
  0.00    0.000000           0         2           open
  0.00    0.000000           0         2           close
  0.00    0.000000           0         3           fstat
  0.00    0.000000           0         8           mmap
  0.00    0.000000           0         4           mprotect
  0.00    0.000000           0         1           munmap
  0.00    0.000000           0         1           brk
  0.00    0.000000           0         3         3 access
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         1           arch_prctl
------ ----------- ----------- --------- --------- ----------------
100.00    0.000000                    28         3 total
juan@juan-VirtualBox:~$ 
the output shows the list of syscalls issued by the process (and its threads if there were more than 1), the number of times each one was called, the number of times they failed, and the CPU time they consumed (kernel space time, aka system time). With this output we can easily identify that 3 syscalls returned failure states; if we go back to the full strace output we'll see that lines 4, 6 and 11 returned ENOENT "No such file or directory".

I mentioned before that it is also possible to strace an already running process, so let's take a look at that. First we have to identify a process as target, in this case nc with PID 8739:
juan@juan-VirtualBox:~$ ps aux|grep -i nc
root       954  0.0  0.0  19196  2144 ?        Ss   01:33   0:04 /usr/sbin/irqbalance
juan      2055  0.0  0.2 355028  8396 ?        Ssl  01:33   0:00 /usr/lib/at-spi2-core/at-spi-bus-launcher --launch-immediately
juan      8739  0.0  0.0   9132   800 pts/0    S    11:52   0:00 nc -l 9999
juan      2585  0.0  0.0  24440  1964 ?        S    01:33   0:00 dbus-launch --autolaunch 0c0058daf07f369dd9b0d1605654eff1 --binary-syntax --close-stderr
juan      9477  0.0  0.0  15948  2304 pts/2    R+   14:03   0:00 grep --color=auto -i nc
juan@juan-VirtualBox:~$ 
now let's try to attach strace to it in order to inspect the syscalls:
juan@juan-VirtualBox:~$ strace -p 8739
strace: attach: ptrace(PTRACE_ATTACH, ...): Operation not permitted
Could not attach to process. If your uid matches the uid of the target
process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
again as the root user. For more details, see /etc/sysctl.d/10-ptrace.conf
juan@juan-VirtualBox:~$ 
interesting :D. Turns out that by default Ubuntu doesn't allow the ptrace attach feature, why? We'll see an example later :P. Ptrace has different scopes (4 actually, 0 to 3, with 3 being the most restrictive one):

A PTRACE scope of "0" is the more permissive mode.  A scope of "1" limits PTRACE only to direct child processes (e.g. "gdb name-of-program" and "strace -f name-of-program" work, but gdb's "attach" and "strace -fp $PID" do not). The PTRACE scope is ignored when a user has CAP_SYS_PTRACE, so "sudo strace -fp $PID" will work as before.

if we take a look at the current ptrace_scope value we'll see we have scope 1:
juan@juan-VirtualBox:~$ cat /proc/sys/kernel/yama/ptrace_scope
1
juan@juan-VirtualBox:~$ 
At this point we have two options: we either try with sudo or we relax the restriction system wide to scope 0 by changing /proc/sys/kernel/yama/ptrace_scope (this might be dangerous). Let's go with sudo now:
juan@juan-VirtualBox:~$ sudo strace -p 8739
Process 8739 attached
accept(3, {sa_family=AF_INET, sin_port=htons(34404), sin_addr=inet_addr("127.0.0.1")}, [16]) = 4
close(3) = 0
poll([{fd=4, events=POLLIN}, {fd=0, events=POLLIN}], 2, 4294967295) = 1 ([{fd=4, revents=POLLIN}])
read(4, "Hello World through NC\n", 2048) = 23
write(1, "Hello World through NC\n", 23) = 23
poll([{fd=4, events=POLLIN}, {fd=0, events=POLLIN}], 2, 4294967295) = 1 ([{fd=4, revents=POLLIN}])
read(4, "", 2048) = 0
shutdown(4, SHUT_RD) = 0
close(4) = 0
close(3) = -1 EBADF (Bad file descriptor)
close(3) = -1 EBADF (Bad file descriptor)
exit_group(0) = ?
+++ exited with 0 +++
juan@juan-VirtualBox:~$ 
It worked indeed! Let's do a brief review of the syscalls:

  • First an accept call. It extracts the first connection request from the queue of pending connections for the listening socket, 3, creates a new connected socket, and returns a new file descriptor referring to that socket, 4. The newly created socket is not in the listening state. The original socket 3 is unaffected by this call.
  • Then a close call closes the listening socket 3.
  • The next call was poll; it waits for one of a set of file descriptors to become ready to perform I/O. In this case we can see it waits for fd 4 (the socket just created for the incoming connection) and fd 0 (the standard input).
  • Right after the poll we see the read call, reading 23 bytes out of fd 4 using a 2048-byte buffer.
  • After finishing reading, nc uses write to send the received bytes to fd 1, the standard output.
  • Then nc polls again for any extra data coming in, and this time the call returns empty, as can be seen in the next read call returning 0. This poll is probably triggered by the connection being closed on the other side.
  • The shutdown call shuts down all or part of a full-duplex connection on a given socket, in this case the one pointed to by fd 4.
  • Then we have 3 close calls: the first one closes the file descriptor of the socket created by the accept call, while the next two try to close an fd that has already been closed on line 4 (fd 3), which is kind of weird and could be a bug.

Why ptrace could be dangerous?

Usually debugging tools are double-edged swords, right? Well, ptrace is no exception to that rule. Having access to the interface between user space and kernel space of a process can leak some important information, like credentials.

Let's see an extremely simple example. My virtual machine has vsftpd 3.0.2 running, so let's capture the credentials of a system user that logs into the FTP service. In this case we'll set a few extra flags on strace in order to make things easier:

  • -f traces child processes as they are created by currently traced processes as a result of the fork system call.
  • -e trace=read,write is a filter telling strace to only record read and write syscalls.
  • -o sets an output file where the syscalls will be recorded.
So lets strace:
juan@juan-VirtualBox:~$ sudo strace -f -e trace=read,write -o output -p $(pidof vsftpd)
Process 10040 attached
Process 10280 attached
Process 10281 attached
Process 10282 attached
^CProcess 10040 detached
juan@juan-VirtualBox:~$ 
we can see vsftpd forked a couple of times while we connected to it using the ftp client. Now let's take a look at the content of the output file:
juan@juan-VirtualBox:~$ cat output 
10280 read(3, "TZif2\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\n\0\0\0\n\0\0\0\0"..., 4096) = 3533
10280 read(3, "TZif2\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\v\0\0\0\v\0\0\0\0"..., 4096) = 2248
10280 read(4,  
10281 read(4, "# /etc/nsswitch.conf\n#\n# Example"..., 4096) = 507
10281 read(4, "", 4096)                 = 0
10281 read(4, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\260\23\0\0\0\0\0\0"..., 832) = 832
10281 read(4, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\240!\0\0\0\0\0\0"..., 832) = 832
10281 read(4, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\240\"\0\0\0\0\0\0"..., 832) = 832
10281 write(3, "Sat Aug 20 17:06:02 2016 [pid 10"..., 65) = 65
10281 write(0, "220 (vsFTPd 3.0.2)\r\n", 20) = 20
10281 read(0, "USER juan\r\n", 11)      = 11
10281 write(0, "331 Please specify the password."..., 34) = 34
10281 read(0, "PASS MyPassw0rd\r\n", 15)  = 15
10281 write(5, "\1", 1)                 = 1
10280 <... read resumed> "\1", 1)       = 1
10281 write(5, "\4\0\0\0", 4 
10280 read(4,  
10281 <... write resumed> )             = 4
10280 <... read resumed> "\4\0\0\0", 4) = 4
10281 write(5, "juan", 4 
10280 read(4,  
...
10282 write(0, "230 Login successful.\r\n", 23) = 23
10282 read(0, "SYST\r\n", 6)            = 6
10282 write(0, "215 UNIX Type: L8\r\n", 19) = 19
10282 read(0, "QUIT\r\n", 6)            = 6
10282 write(0, "221 Goodbye.\r\n", 14)  = 14
10282 +++ exited with 0 +++
10280 <... read resumed> 0x7ffe4994960f, 1) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
10280 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=10282, si_status=0, si_utime=0, si_stime=0} ---
10280 +++ killed by SIGSYS +++
10040 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_KILLED, si_pid=10280, si_status=SIGSYS, si_utime=0, si_stime=1} ---
juan@juan-VirtualBox:~$ 
As we anticipated, we can see both the username and the password in the read calls on fd 0 (USER juan and PASS MyPassw0rd). Ok, yes, this is FTP and we could also have captured the credentials with a simple network capture, but this is just an example :D of how strace (ptrace, actually) can be used to leak sensitive information.

I hope this was interesting, at least it was for me xD. Here I list some interesting links I found on the way:

http://linux.die.net/man/1/strace
http://linux.die.net/man/2/ptrace
http://man7.org/linux/man-pages/man2/syscalls.2.html
https://www.kernel.org/doc/Documentation/security/Yama.txt

Tuesday, March 15, 2016

TCP Keep Alive - how it works

Besides being a short entry, this will be the first one in English. Why? Mmmm, good question, not sure to be honest. I suppose English will make the entries accessible to more people, and that's a good enough reason :D for me. I'm not saying all of them will be in English from now on (who knows haha).

So a few days ago I faced an awkward situation that kind of pushed me to see how the TCP keep-alive feature works in the Linux kernel. Nothing life changing, but I wanted to share it here for you out there, and for me as well xD; it's fun how many times I come back to my old entries looking for commands or answers :D.

The problem was quite simple: on the server side a simple netstat showed more than 20k TCP connections in ESTABLISHED state. However, all the clients that had started these connections were shut down... yes, shut down, they weren't even online. So, how is this possible? Well, you will understand it after this entry, hopefully XD.

Kernel TCP keep-alive configuration


Our lovely Kernel provides 3 parameters to handle TCP keep-alive behavior:
  • tcp_keepalive_time: the number of seconds a connection needs to be idle before keep-alive probes begin. The default value is 7200 (2 h). This parameter applies ONLY if the SO_KEEPALIVE option is set on the socket.
  • tcp_keepalive_intvl: once the keep-alive probes begin, the number of seconds between each probe. The default value is 75 (1 min 15 sec).
  • tcp_keepalive_probes: the number of probes that must fail before the connection gets terminated. The default value is 9.
Like every Kernel parameter, we can access these 3 guys through the /proc filesystem; here they are on one of the test instances:

[root@ip-172-31-24-218 ec2-user]# cat /proc/sys/net/ipv4/tcp_keepalive_intvl
75
[root@ip-172-31-24-218 ec2-user]# cat /proc/sys/net/ipv4/tcp_keepalive_probes
9
[root@ip-172-31-24-218 ec2-user]# cat /proc/sys/net/ipv4/tcp_keepalive_time
7200
[root@ip-172-31-24-218 ec2-user]#


The test scenario consists of 2 EC2 instances:
  • A, IP 172.31.24.219. This will be the source of the connections.
  • B, IP 172.31.24.218. This will be the so called server instance.
To speed up the tests I reduced, on instance B, tcp_keepalive_time from 7200 to 10, the following way:

[root@ip-172-31-24-218 ec2-user]# echo 10 > /proc/sys/net/ipv4/tcp_keepalive_time
[root@ip-172-31-24-218 ec2-user]# cat /proc/sys/net/ipv4/tcp_keepalive_time
10
[root@ip-172-31-24-218 ec2-user]#


Test 1 "Testing keep-alive"


To simulate the server I used the well known TCP swiss army knife, nc :P (on the client side: nc 172.31.24.218 3333). Basically I started nc in the background listening on port 3333 on instance B, and then started tcpdump to capture the traffic coming to that port from instance A:

[ec2-user@ip-172-31-24-218 ~]$ nc -l 3333 &
[1] 2536
[ec2-user@ip-172-31-24-218 ~]$ sudo tcpdump -nn port 3333
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
19:27:27.477761 IP 172.31.24.219.42115 > 172.31.24.218.3333: Flags [S], seq 2777265429, win 26883, options [mss 8961,sackOK,TS val 260487 ecr 0,nop,wscale 6], length 0
19:27:27.477983 IP 172.31.24.218.3333 > 172.31.24.219.42115: Flags [S.], seq 2747775912, ack 2777265430, win 26847, options [mss 8961,sackOK,TS val 260850 ecr 260487,nop,wscale 6], length 0
19:27:27.478456 IP 172.31.24.219.42115 > 172.31.24.218.3333: Flags [.], ack 1, win 421, options [nop,nop,TS val 260488 ecr 260850], length 0



^C
3 packets captured
3 packets received by filter
0 packets dropped by kernel

[1]+  Stopped                 nc -l 3333
[ec2-user@ip-172-31-24-218 ~]$ date
Tue Mar 15 19:33:09 UTC 2016
[ec2-user@ip-172-31-24-218 ~]$


mmmm, the first three packets above are the TCP handshake, and you can see that between the time the connection was established (19:27:27) and the date command (19:33:09) more than 5 minutes passed. Not even one single packet was exchanged between the instances, so why didn't the keep-alive process kick in, considering we set tcp_keepalive_time to 10 seconds? Well, perhaps nc is not setting SO_KEEPALIVE on the socket when it opens it? What does strace have to say about it:

[ec2-user@ip-172-31-24-218 ~]$ strace nc -l 3333
execve("/usr/bin/nc", ["nc", "-l", "3333"], [/* 32 vars */]) = 0
brk(0)                                  = 0xff9000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7ff071850000
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
...

blah blah blah
...
blah blah blah
...
set_robust_list(0x7ff071847a20, 24)     = 0
rt_sigaction(SIGRTMIN, {0x7ff070d30780, [], SA_RESTORER|SA_SIGINFO, 0x7ff070d39100}, NULL, 8) = 0
rt_sigaction(SIGRT_1, {0x7ff070d30810, [], SA_RESTORER|SA_RESTART|SA_SIGINFO, 0x7ff070d39100}, NULL, 8) = 0
rt_sigprocmask(SIG_UNBLOCK, [RTMIN RT_1], NULL, 8) = 0
getrlimit(RLIMIT_STACK, {rlim_cur=8192*1024, rlim_max=RLIM64_INFINITY}) = 0
brk(0)                                  = 0xff9000
brk(0x101a000)                          = 0x101a000
brk(0)                                  = 0x101a000
socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 3
setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
 

bind(3, {sa_family=AF_INET, sin_port=htons(3333), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
listen(3, 1)                            = 0
accept(3, ^CProcess 2690 detached
 
[1]+  Killed                  nc -l 3333
[ec2-user@ip-172-31-24-218 ~]$


Cool, in the last lines above, where the socket is created, you can see SO_KEEPALIVE is not among the options being set. So... what now? Let's get dirty!!!

Test 2 "If you don't like it, change it!"


So, why don't we just force nc to use the SO_KEEPALIVE option? Hell yeah!!! I downloaded the netcat source from the official website and made the following changes to the network.c file:

[ec2-user@ip-172-31-24-218 ~]$ diff netcat-0.7.1/src/network.c netcat-0.7.1_juan/src/network.c
374a375,383
>   sockopt = 1;
>   if (type == SOCK_STREAM){
>      ret = setsockopt(sock,SOL_SOCKET,SO_KEEPALIVE,&sockopt,sizeof(sockopt));
>      if(ret < 0){
>        close(sock);
>        return -2;
>      }
>   }
>
[ec2-user@ip-172-31-24-218 ~]$


As you can see, I just added an if statement that becomes true when the socket created is SOCK_STREAM (TCP :D), and sets the SO_KEEPALIVE option on the socket. After that I just issued ./configure and make, and now strace shows something more interesting:

[ec2-user@ip-172-31-24-218 netcat-0.7.1_juan]$ strace ./src/netcat -l -p 3333
execve("./src/netcat", ["./src/netcat", "-l", "-p", "3333"], [/* 33 vars */]) = 0
brk(0)                                  = 0x125f000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fef806cb000
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=21099, ...}) = 0
mmap(NULL, 21099, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7fef806c5000
close(3)                                = 0

...
blah blah blah
...
blah blah blah
...
read(3, " # Redwood Chat\npdb             "..., 4096) = 4096
read(3, "      # ContinuStor Monitor Port"..., 4096) = 4096
read(3, "   3107/udp                # Bus"..., 4096) = 4096
read(3, "lfap        3145/tcp            "..., 4096) = 4096
read(3, "\nh2gf-w-2m       3179/udp       "..., 4096) = 4096
read(3, "    3212/tcp                # Su"..., 4096) = 4096
read(3, "eo-fe         3245/tcp          "..., 4096) = 4096
read(3, "80/tcp                # VS Serve"..., 4096) = 4096
read(3, "        # SDT License Manager\nof"..., 4096) = 4096
close(3)                                = 0
munmap(0x7fef806ca000, 4096)            = 0
socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 3
setsockopt(3, SOL_SOCKET, SO_LINGER, {onoff=1, linger=0}, 8) = 0
setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
setsockopt(3, SOL_SOCKET, SO_KEEPALIVE, [1], 4) = 0
 

bind(3, {sa_family=AF_INET, sin_port=htons(3333), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
listen(3, 4)                            = 0
open("/usr/share/locale/locale.alias", O_RDONLY|O_CLOEXEC) = 4
...

blah blah blah
...
select(4, [3], NULL, NULL, NULL^CProcess 6656 detached
 
[ec2-user@ip-172-31-24-218 netcat-0.7.1_juan]$


So now we do have SO_KEEPALIVE in place; let's see if it works:

[ec2-user@ip-172-31-24-218 netcat-0.7.1_juan]$ ./src/netcat -l -p 3333 &
[1] 6657
[ec2-user@ip-172-31-24-218 netcat-0.7.1_juan]$ sudo tcpdump -nn port 3333
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
20:23:48.747981 IP 172.31.24.219.42122 > 172.31.24.218.3333: Flags [S], seq 723623108, win 26883, options [mss 8961,sackOK,TS val 1105802 ecr 0,nop,wscale 6], length 0
20:23:48.748190 IP 172.31.24.218.3333 > 172.31.24.219.42122: Flags [S.], seq 4243448461, ack 723623109, win 26847, options [mss 8961,sackOK,TS val 1106167 ecr 1105802,nop,wscale 6], length 0
20:23:48.748679 IP 172.31.24.219.42122 > 172.31.24.218.3333: Flags [.], ack 1, win 421, options [nop,nop,TS val 1105803 ecr 1106167], length 0


20:23:58.764759 IP 172.31.24.218.3333 > 172.31.24.219.42122: Flags [.], ack 1, win 420, options [nop,nop,TS val 1108672 ecr 1105803], length 0
20:23:58.765399 IP 172.31.24.219.42122 > 172.31.24.218.3333: Flags [.], ack 1, win 421, options [nop,nop,TS val 1108307 ecr 1106167], length 0


20:25:13.773009 IP 172.31.24.218.3333 > 172.31.24.219.42122: Flags [.], ack 1, win 420, options [nop,nop,TS val 1127424 ecr 1108307], length 0
20:25:13.773926 IP 172.31.24.219.42122 > 172.31.24.218.3333: Flags [.], ack 1, win 421, options [nop,nop,TS val 1127059 ecr 1106167], length 0


20:26:29.036973 IP 172.31.24.218.3333 > 172.31.24.219.42122: Flags [.], ack 1, win 420, options [nop,nop,TS val 1146240 ecr 1127059], length 0
20:26:29.037684 IP 172.31.24.219.42122 > 172.31.24.218.3333: Flags [.], ack 1, win 421, options [nop,nop,TS val 1145876 ecr 1106167], length 0


So our fancy workaround actually worked!! We can see how, after 10 seconds of inactivity, the keep-alive probes kick in, and they are sent every 75 seconds. The probes are basically ACK packets with no payload, and here we can see how instance A answers each probe with a plain ACK as well.

Test 3 "Let the show begin"


Ok, now we have SO_KEEPALIVE enabled, therefore we should be able to test all the parameters. Again, to speed up the process, we'll reduce some of them to the following values:

[root@ip-172-31-24-218 netcat-0.7.1]# echo 5 > /proc/sys/net/ipv4/tcp_keepalive_probes
[root@ip-172-31-24-218 netcat-0.7.1]# echo 20 > /proc/sys/net/ipv4/tcp_keepalive_intvl
[root@ip-172-31-24-218 netcat-0.7.1]# cat /proc/sys/net/ipv4/tcp_keepalive_probes
5
[root@ip-172-31-24-218 netcat-0.7.1]# cat /proc/sys/net/ipv4/tcp_keepalive_intvl
20
[root@ip-172-31-24-218 netcat-0.7.1]#


Now the probes should be sent every 20 seconds, and if 5 of them fail the connection should be terminated. But if I run the same test again, the probes won't really fail because the connection is perfectly fine, so I need to cause a problem here.

The easiest way to make sure the connection stays idle and that instance A won't answer the keep-alive probes is to drop all the packets on instance A right after the connection is ready. I did this by adding the following iptables rules right after the connection was established:

iptables -A OUTPUT --dst 172.31.24.218 -j DROP
iptables -A INPUT --src 172.31.24.218 -j DROP


Note: I dropped everything going out to instance B and everything coming in from instance B.

so there we go...

[ec2-user@ip-172-31-24-218 netcat-0.7.1]$ ./src/netcat -l -p 3333 &
[2] 27020
[ec2-user@ip-172-31-24-218 netcat-0.7.1]$ sudo tcpdump -nn port 3333
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
21:57:41.857514 IP 172.31.24.219.42126 > 172.31.24.218.3333: Flags [S], seq 2971447882, win 26883, options [mss 8961,sackOK,TS val 2514081 ecr 0,nop,wscale 6], length 0
21:57:41.857818 IP 172.31.24.218.3333 > 172.31.24.219.42126: Flags [S.], seq 1633813684, ack 2971447883, win 26847, options [mss 8961,sackOK,TS val 2514445 ecr 2514081,nop,wscale 6], length 0
21:57:41.858211 IP 172.31.24.219.42126 > 172.31.24.218.3333: Flags [.], ack 1, win 421, options [nop,nop,TS val 2514082 ecr 2514445], length 0



21:57:51.884756 IP 172.31.24.218.3333 > 172.31.24.219.42126: Flags [.], ack 1, win 420, options [nop,nop,TS val 2516952 ecr 2514082], length 0
21:58:11.948771 IP 172.31.24.218.3333 > 172.31.24.219.42126: Flags [.], ack 1, win 420, options [nop,nop,TS val 2521968 ecr 2514082], length 0
21:58:31.980765 IP 172.31.24.218.3333 > 172.31.24.219.42126: Flags [.], ack 1, win 420, options [nop,nop,TS val 2526976 ecr 2514082], length 0
21:58:52.012770 IP 172.31.24.218.3333 > 172.31.24.219.42126: Flags [.], ack 1, win 420, options [nop,nop,TS val 2531984 ecr 2514082], length 0
21:59:12.044761 IP 172.31.24.218.3333 > 172.31.24.219.42126: Flags [.], ack 1, win 420, options [nop,nop,TS val 2536992 ecr 2514082], length 0
21:59:32.076768 IP 172.31.24.218.3333 > 172.31.24.219.42126: Flags [R.], seq 1, ack 1, win 420, options [nop,nop,TS val 2542000 ecr 2514082], length 0

^C
9 packets captured
9 packets received by filter
0 packets dropped by kernel

[2]+  Stopped                 ./src/netcat -l -p 3333
[ec2-user@ip-172-31-24-218 netcat-0.7.1]$


Interesting, right? The keep-alive probes kicked in after 10 seconds as expected, then we have a probe every 20 seconds (with no answer from instance A), and after the 5th probe we can see a RST+ACK packet going from B to A, terminating the connection. This is exactly the behavior we were expecting :D.

So, coming back to the 20k+ ESTABLISHED connections situation, now we can understand how something like that is possible. For some awkward and really hard to reproduce reason, all the clients were crashing without closing their connections properly, and since the sockets on the server side weren't using the SO_KEEPALIVE option, the connections remained ESTABLISHED for eternity.

Could this be a big problem? Well, it might be:

  • The first thing that comes to my mind is wasted memory. Even though these established connections are not being actively used, they need kernel memory to exist, so do the math :D.
  • If a new connection comes in to the server and its source IP and source port match one of the connections in ESTABLISHED state, that connection will probably fail.
  • If you have a limited number of sockets available on the server, you can't afford having them allocated to these ghost connections.

So why isn't TCP keep-alive enabled by default? Good question... I guess that when we are talking about many connections, handling keep-alive could put some overhead on the kernel side: for every connection the kernel would have to keep track of when the last packet came in and set timers to trigger the keep-alive probes.

I hope I was clear enough, otherwise you can always leave a comment :P.

... it wasn't a short entry after all.