Tuesday, March 15, 2016

TCP Keep Alive - how it works

Besides being a short entry, this will be the first one in English. Why? Mmmm, good question, not sure to be honest. I suppose English will make the entries accessible to more people, and that is a good enough reason for me :D. I'm not saying all of them will be in English from now on (who knows, haha).

So a few days ago I faced an awkward situation that kind of pushed me to see how the TCP keep-alive feature works in the Linux Kernel. Nothing life-changing, but I wanted to share it here for you out there and for me as well xD; it's funny how many times I come back to my old entries looking for commands or answers :D.

The problem was quite simple: on the server side a simple netstat showed more than 20k TCP connections in ESTABLISHED state. However, all the clients that had started those connections were shut down... yes, shut down, they weren't even online. So how is this possible? Well, hopefully you will understand it after this entry XD.

Kernel TCP keep-alive configuration


Our lovely Kernel provides 3 parameters to handle TCP keep-alive behavior:
  • tcp_keepalive_time: the number of seconds a connection needs to be idle before keep-alive probes begin. The default value is 7200 (2 h). Keep in mind the Kernel sends keep-alive probes ONLY if the SO_KEEPALIVE option is set on the socket.
  • tcp_keepalive_intvl: once the probes begin, the number of seconds between each probe. The default value is 75 (1 min 15 sec).
  • tcp_keepalive_probes: the number of unanswered probes after which the connection gets terminated. The default value is 9.
Like every Kernel parameter, we can access these 3 guys through the /proc FS; here they are on one of the test instances:

[root@ip-172-31-24-218 ec2-user]# cat /proc/sys/net/ipv4/tcp_keepalive_intvl
75
[root@ip-172-31-24-218 ec2-user]# cat /proc/sys/net/ipv4/tcp_keepalive_probes
9
[root@ip-172-31-24-218 ec2-user]# cat /proc/sys/net/ipv4/tcp_keepalive_time
7200
[root@ip-172-31-24-218 ec2-user]#
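
By the way, these /proc values are system-wide defaults. On Linux, an application can also override them per socket through setsockopt(2) using TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT (from <netinet/tcp.h>). Here is a minimal sketch of mine (the helper name and the example values are made up for illustration), assuming sock is an already created TCP socket:

#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>   /* TCP_KEEPIDLE, TCP_KEEPINTVL, TCP_KEEPCNT */

/* Enable keep-alive on a TCP socket and override the 3 Kernel defaults
 * just for this socket. Returns 0 on success, -1 on error. */
int tune_keepalive(int sock, int idle, int intvl, int cnt)
{
    int on = 1;

    /* None of the timers below matter unless SO_KEEPALIVE is enabled first. */
    if (setsockopt(sock, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0)
        return -1;
    if (setsockopt(sock, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof(idle)) < 0)    /* like tcp_keepalive_time   */
        return -1;
    if (setsockopt(sock, IPPROTO_TCP, TCP_KEEPINTVL, &intvl, sizeof(intvl)) < 0) /* like tcp_keepalive_intvl  */
        return -1;
    if (setsockopt(sock, IPPROTO_TCP, TCP_KEEPCNT, &cnt, sizeof(cnt)) < 0)       /* like tcp_keepalive_probes */
        return -1;
    return 0;
}

For example, tune_keepalive(sock, 10, 20, 5) would reproduce the values we are about to use in the tests without touching /proc at all.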


The test scenario consists of basically 2 EC2 instances:
  • A, IP 172.31.24.219. This will be the source of the connections.
  • B, IP 172.31.24.218. This will be the so-called server instance.
To speed up the tests I reduced tcp_keepalive_time on instance B from 7200 to 10 in the following way (sysctl -w net.ipv4.tcp_keepalive_time=10 would do the same):

[root@ip-172-31-24-218 ec2-user]# echo 10 > /proc/sys/net/ipv4/tcp_keepalive_time
[root@ip-172-31-24-218 ec2-user]# cat /proc/sys/net/ipv4/tcp_keepalive_time
10
[root@ip-172-31-24-218 ec2-user]#


Test 1 "Testing keep-alive"


To simulate the server I used the well-known TCP Swiss Army knife, nc :P (also used on the client side: nc 172.31.24.218 3333). I basically started nc in the background listening on port 3333 on instance B, and then started tcpdump to capture the traffic coming to that port from instance A:

[ec2-user@ip-172-31-24-218 ~]$ nc -l 3333 &
[1] 2536
[ec2-user@ip-172-31-24-218 ~]$ sudo tcpdump -nn port 3333
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
19:27:27.477761 IP 172.31.24.219.42115 > 172.31.24.218.3333: Flags [S], seq 2777265429, win 26883, options [mss 8961,sackOK,TS val 260487 ecr 0,nop,wscale 6], length 0
19:27:27.477983 IP 172.31.24.218.3333 > 172.31.24.219.42115: Flags [S.], seq 2747775912, ack 2777265430, win 26847, options [mss 8961,sackOK,TS val 260850 ecr 260487,nop,wscale 6], length 0
19:27:27.478456 IP 172.31.24.219.42115 > 172.31.24.218.3333: Flags [.], ack 1, win 421, options [nop,nop,TS val 260488 ecr 260850], length 0



^C
3 packets captured
3 packets received by filter
0 packets dropped by kernel

[1]+  Stopped                 nc -l 3333
[ec2-user@ip-172-31-24-218 ~]$ date
Tue Mar 15 19:33:09 UTC 2016
[ec2-user@ip-172-31-24-218 ~]$


Mmmm, the first 3 packets are the TCP handshake, and you can see that between the time the connection was established (19:27:27) and the date command (19:33:09) more than 5 minutes passed. Not even one single packet was exchanged between the instances, so why didn't the keep-alive process kick in, considering we set tcp_keepalive_time to 10 seconds? Well, perhaps nc is not setting SO_KEEPALIVE on the socket when it opens it? Let's see what strace has to say about it:

[ec2-user@ip-172-31-24-218 ~]$ strace nc -l 3333
execve("/usr/bin/nc", ["nc", "-l", "3333"], [/* 32 vars */]) = 0
brk(0)                                  = 0xff9000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7ff071850000
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
...

blah blah blah
...
blah blah blah
...
set_robust_list(0x7ff071847a20, 24)     = 0
rt_sigaction(SIGRTMIN, {0x7ff070d30780, [], SA_RESTORER|SA_SIGINFO, 0x7ff070d39100}, NULL, 8) = 0
rt_sigaction(SIGRT_1, {0x7ff070d30810, [], SA_RESTORER|SA_RESTART|SA_SIGINFO, 0x7ff070d39100}, NULL, 8) = 0
rt_sigprocmask(SIG_UNBLOCK, [RTMIN RT_1], NULL, 8) = 0
getrlimit(RLIMIT_STACK, {rlim_cur=8192*1024, rlim_max=RLIM64_INFINITY}) = 0
brk(0)                                  = 0xff9000
brk(0x101a000)                          = 0x101a000
brk(0)                                  = 0x101a000
socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 3
setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
 

bind(3, {sa_family=AF_INET, sin_port=htons(3333), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
listen(3, 1)                            = 0
accept(3, ^CProcess 2690 detached
 
[1]+  Killed                  nc -l 3333
[ec2-user@ip-172-31-24-218 ~]$


Cool, in the socket() and setsockopt() lines above you can see the socket being created, and SO_KEEPALIVE is not being set as an option. So... what now? Let's get dirty!!!

Test 2 "If you don't like it, change it!"


So, why don't we just force nc to use the SO_KEEPALIVE option? Hell yeah!!! I downloaded the netcat source from the official website and made the following change to the network.c file:

[ec2-user@ip-172-31-24-218 ~]$ diff netcat-0.7.1/src/network.c netcat-0.7.1_juan/src/network.c
374a375,383
>   sockopt = 1;
>   if (type == SOCK_STREAM){
>      ret = setsockopt(sock,SOL_SOCKET,SO_KEEPALIVE,&sockopt,sizeof(sockopt));
>      if(ret < 0){
>        close(sock);
>        return -2;
>      }
>   }
>
[ec2-user@ip-172-31-24-218 ~]$


As you can see, I just added an if statement that fires when the socket being created is SOCK_STREAM (TCP :D) and sets the SO_KEEPALIVE option on it. After that I just issued ./configure and make.
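
If you'd rather not patch netcat at all, the same socket setup fits in a tiny standalone listener. This is just a sketch of mine (the file name, variable names and error handling are my own choices; only the setsockopt(2) calls mirror what the patch does, and port 3333 matches the tests):

/* ka_listen.c - minimal TCP listener with SO_KEEPALIVE enabled */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>

int main(void)
{
    int one = 1;
    int sock = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
    if (sock < 0) { perror("socket"); return 1; }

    /* The two options the patched netcat ends up setting */
    setsockopt(sock, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));
    if (setsockopt(sock, SOL_SOCKET, SO_KEEPALIVE, &one, sizeof(one)) < 0) {
        perror("setsockopt(SO_KEEPALIVE)");
        return 1;
    }

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(3333);            /* same port as the tests */
    addr.sin_addr.s_addr = htonl(INADDR_ANY);

    if (bind(sock, (struct sockaddr *)&addr, sizeof(addr)) < 0) { perror("bind"); return 1; }
    if (listen(sock, 1) < 0) { perror("listen"); return 1; }

    /* Accept one client and then just sit on the idle connection so
     * the Kernel keep-alive timers can do their thing. */
    if (accept(sock, NULL, NULL) < 0) { perror("accept"); return 1; }
    pause();                                /* wait forever; ^C to quit */
    return 0;
}

Anyway, back to the patched netcat; strace now shows something more interesting: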

[ec2-user@ip-172-31-24-218 netcat-0.7.1_juan]$ strace ./src/netcat -l -p 3333
execve("./src/netcat", ["./src/netcat", "-l", "-p", "3333"], [/* 33 vars */]) = 0
brk(0)                                  = 0x125f000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fef806cb000
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=21099, ...}) = 0
mmap(NULL, 21099, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7fef806c5000
close(3)                                = 0

...
blah blah blah
...
blah blah blah
...
read(3, " # Redwood Chat\npdb             "..., 4096) = 4096
read(3, "      # ContinuStor Monitor Port"..., 4096) = 4096
read(3, "   3107/udp                # Bus"..., 4096) = 4096
read(3, "lfap        3145/tcp            "..., 4096) = 4096
read(3, "\nh2gf-w-2m       3179/udp       "..., 4096) = 4096
read(3, "    3212/tcp                # Su"..., 4096) = 4096
read(3, "eo-fe         3245/tcp          "..., 4096) = 4096
read(3, "80/tcp                # VS Serve"..., 4096) = 4096
read(3, "        # SDT License Manager\nof"..., 4096) = 4096
close(3)                                = 0
munmap(0x7fef806ca000, 4096)            = 0
socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 3
setsockopt(3, SOL_SOCKET, SO_LINGER, {onoff=1, linger=0}, 8) = 0
setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
setsockopt(3, SOL_SOCKET, SO_KEEPALIVE, [1], 4) = 0
 

bind(3, {sa_family=AF_INET, sin_port=htons(3333), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
listen(3, 4)                            = 0
open("/usr/share/locale/locale.alias", O_RDONLY|O_CLOEXEC) = 4
...

blah blah blah
...
select(4, [3], NULL, NULL, NULL^CProcess 6656 detached
 
[ec2-user@ip-172-31-24-218 netcat-0.7.1_juan]$


So now we do have SO_KEEPALIVE in place, let's see if it works:

[ec2-user@ip-172-31-24-218 netcat-0.7.1_juan]$ ./src/netcat -l -p 3333 &
[1] 6657
[ec2-user@ip-172-31-24-218 netcat-0.7.1_juan]$ sudo tcpdump -nn port 3333
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
20:23:48.747981 IP 172.31.24.219.42122 > 172.31.24.218.3333: Flags [S], seq 723623108, win 26883, options [mss 8961,sackOK,TS val 1105802 ecr 0,nop,wscale 6], length 0
20:23:48.748190 IP 172.31.24.218.3333 > 172.31.24.219.42122: Flags [S.], seq 4243448461, ack 723623109, win 26847, options [mss 8961,sackOK,TS val 1106167 ecr 1105802,nop,wscale 6], length 0
20:23:48.748679 IP 172.31.24.219.42122 > 172.31.24.218.3333: Flags [.], ack 1, win 421, options [nop,nop,TS val 1105803 ecr 1106167], length 0


20:23:58.764759 IP 172.31.24.218.3333 > 172.31.24.219.42122: Flags [.], ack 1, win 420, options [nop,nop,TS val 1108672 ecr 1105803], length 0
20:23:58.765399 IP 172.31.24.219.42122 > 172.31.24.218.3333: Flags [.], ack 1, win 421, options [nop,nop,TS val 1108307 ecr 1106167], length 0


20:25:13.773009 IP 172.31.24.218.3333 > 172.31.24.219.42122: Flags [.], ack 1, win 420, options [nop,nop,TS val 1127424 ecr 1108307], length 0
20:25:13.773926 IP 172.31.24.219.42122 > 172.31.24.218.3333: Flags [.], ack 1, win 421, options [nop,nop,TS val 1127059 ecr 1106167], length 0


20:26:29.036973 IP 172.31.24.218.3333 > 172.31.24.219.42122: Flags [.], ack 1, win 420, options [nop,nop,TS val 1146240 ecr 1127059], length 0
20:26:29.037684 IP 172.31.24.219.42122 > 172.31.24.218.3333: Flags [.], ack 1, win 421, options [nop,nop,TS val 1145876 ecr 1106167], length 0


So our fancy workaround actually worked!! We can see how, after 10 seconds of inactivity, the keep-alive probes kick in and are then sent every 75 seconds (the default tcp_keepalive_intvl). The probes are basically ACK packets with no payload, and here we can see how each probe is answered by instance A with a plain ACK as well.

Test 3 "Let the show begin"


Ok, now that we have SO_KEEPALIVE enabled we should be able to test all the parameters. Again, to speed up the process, we'll reduce some of them to the following values:

[root@ip-172-31-24-218 netcat-0.7.1]# echo 5 > /proc/sys/net/ipv4/tcp_keepalive_probes
[root@ip-172-31-24-218 netcat-0.7.1]# echo 20 > /proc/sys/net/ipv4/tcp_keepalive_intvl
[root@ip-172-31-24-218 netcat-0.7.1]# cat /proc/sys/net/ipv4/tcp_keepalive_probes
5
[root@ip-172-31-24-218 netcat-0.7.1]# cat /proc/sys/net/ipv4/tcp_keepalive_intvl
20
[root@ip-172-31-24-218 netcat-0.7.1]#


Now the probes should be sent every 20 seconds, and if 5 of them go unanswered the connection should be terminated. But if I just run the same test again the probes won't really fail, because the connection is perfectly fine, so I need to cause a problem here.

The easiest way to make sure the connection stays idle and that instance A won't answer the keep-alive probes is to drop all the packets on instance A. I did this by adding the following iptables rules there right after the connection was established:

iptables -A OUTPUT --dst 172.31.24.218 -j DROP
iptables -A INPUT --src 172.31.24.218 -j DROP


Note: I dropped everything going out to instance B and everything coming in from instance B.

So there we go...

[ec2-user@ip-172-31-24-218 netcat-0.7.1]$ ./src/netcat -l -p 3333 &
[2] 27020
[ec2-user@ip-172-31-24-218 netcat-0.7.1]$ sudo tcpdump -nn port 3333
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
21:57:41.857514 IP 172.31.24.219.42126 > 172.31.24.218.3333: Flags [S], seq 2971447882, win 26883, options [mss 8961,sackOK,TS val 2514081 ecr 0,nop,wscale 6], length 0
21:57:41.857818 IP 172.31.24.218.3333 > 172.31.24.219.42126: Flags [S.], seq 1633813684, ack 2971447883, win 26847, options [mss 8961,sackOK,TS val 2514445 ecr 2514081,nop,wscale 6], length 0
21:57:41.858211 IP 172.31.24.219.42126 > 172.31.24.218.3333: Flags [.], ack 1, win 421, options [nop,nop,TS val 2514082 ecr 2514445], length 0



21:57:51.884756 IP 172.31.24.218.3333 > 172.31.24.219.42126: Flags [.], ack 1, win 420, options [nop,nop,TS val 2516952 ecr 2514082], length 0
21:58:11.948771 IP 172.31.24.218.3333 > 172.31.24.219.42126: Flags [.], ack 1, win 420, options [nop,nop,TS val 2521968 ecr 2514082], length 0
21:58:31.980765 IP 172.31.24.218.3333 > 172.31.24.219.42126: Flags [.], ack 1, win 420, options [nop,nop,TS val 2526976 ecr 2514082], length 0
21:58:52.012770 IP 172.31.24.218.3333 > 172.31.24.219.42126: Flags [.], ack 1, win 420, options [nop,nop,TS val 2531984 ecr 2514082], length 0
21:59:12.044761 IP 172.31.24.218.3333 > 172.31.24.219.42126: Flags [.], ack 1, win 420, options [nop,nop,TS val 2536992 ecr 2514082], length 0
21:59:32.076768 IP 172.31.24.218.3333 > 172.31.24.219.42126: Flags [R.], seq 1, ack 1, win 420, options [nop,nop,TS val 2542000 ecr 2514082], length 0

^C
9 packets captured
9 packets received by filter
0 packets dropped by kernel

[2]+  Stopped                 ./src/netcat -l -p 3333
[ec2-user@ip-172-31-24-218 netcat-0.7.1]$


Interesting, right? The keep-alive probes kicked in after 10 seconds as expected, then we have a probe every 20 seconds (no answer from instance A), and after the 5th unanswered probe we can see a RST+ACK packet going from B to A to terminate the connection. This is exactly the behavior we were expecting :D.
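
Putting numbers on it: a dead peer gets detected after tcp_keepalive_time + tcp_keepalive_probes × tcp_keepalive_intvl seconds of silence. With our test values that is 10 + 5 × 20 = 110 seconds, which matches the capture (21:57:41 to 21:59:32). With the Kernel defaults it would be 7200 + 9 × 75 = 7875 seconds, i.e. around 2 hours and 11 minutes.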

So, coming back to the 20k+ ESTABLISHED connections situation, now we can understand how something like that is possible. Basically, all the clients, for some awkward and really hard to reproduce reason, were crashing without closing their connections properly, and since the sockets on the server side weren't using the SO_KEEPALIVE option, the connections remained ESTABLISHED for all eternity.

Could this be a big problem? Well, it might be:

  • The first thing that comes to my mind is wasted memory. These established connections, even though they are not being actively used, need Kernel memory to exist, so do the math :D.
  • If a new connection comes in to the server and its source IP and source port match one of the connections stuck in ESTABLISHED state, that connection will probably fail.
  • If you have a limited number of sockets available on the server, you can't really afford having them allocated to these ghost connections.

So why isn't TCP keep-alive enabled by default? Good question... I guess that when we are talking about a huge number of connections, handling keep-alive could put some overhead on the Kernel side. Imagine that for every connection the Kernel would have to keep track of when the last packet came in and then set timers to trigger the keep-alive probes.

I hope I was clear enough, otherwise you can always leave a comment :P.

... it wasn't a short entry after all.