lunes, 30 de abril de 2018

Where do your bytes go when you send them to /dev/null?


I've been playing around with some interesting tools lately so decided to use one of them to answer a rather ontological question. No, I'm not talking about the meaning of life and stuff, but the one on the title!

I'm pretty sure you have at least once come across the idea of running some IO benchmark (I didn't say this is a good idea, but lets face it... you've done it) of your drive and decided to run something like the following:

dd if=/home/juan/myfile of=/dev/null

You know, trying to measure the read throughput of the device where myfile file is stored. Or maybe you've redirected some annoying stderr output to prevent it from showing up on your stdout. The point is that I've always blindly assumed /dev/null is a kind of black hole or one way ticket for your bytes, but never actually understood how it works.

Spoiler alert, I've written this article while working on my Ubuntu Trusty VM, so all the kernel source links will be pointing to kernel 3.19 however, most (if not all) of the code/content should be valid in newer kernels. If it's not valid then, sorry xD I'm not going write this again, so get your hands dirty and read some source code.

What's /dev/null after all?


According to Wikipedia:

The null device is typically used for disposing of unwanted output streams of a process, or as a convenient empty file for input streams. This is usually done by redirection.

Something interesting is the concept of "device", /dev/null isn't a regular file, but a special device file, particularly a character device file. You may have heard the concept of "in Unix everything is a file" (ok, except Network cards :P), that's quite accurate and there are several different file types, being the most common the following:

  • Regular files
  • Directory
  • Named Pipe
  • Socket
  • Symbolic Link
  • Device file
    • Character
    • Block
Lets have a quick look at this special file and see what's so special about it! Using stat we already have some nice details about it:

juan@test:~$ stat /dev/null
  File: ‘/dev/null’
  Size: 0             Blocks: 0          IO Block: 4096   character special file
Device: 6h/6d    Inode: 1029        Links: 1     Device type: 1,3
Access: (0666/crw-rw-rw-)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2018-04-17 18:57:48.359574000 +0100
Modify: 2018-04-17 18:57:48.359574000 +0100
Change: 2018-04-17 18:57:48.359574000 +0100
 Birth: -
juan@test:~$

What does "Device type: 1,3"  stand for? Well that's the key here, these two numbers are the Major (identifies the driver used) and the Minor number (identifies the device).

According to Linux Devices documentation, the character device with Major number 1 and Minor Number 3 is the Memory character device /dev/null (aka Null device). If you feel like getting your hands really dirty, you can see the Memory character implementation code here :P.

So what happens when you send stuff to /dev/null?


As with every file intervention, at least the following things have to happen:

- Open the file and get a file descriptor
- Write to the file using the file descriptor obtained previously
- Close the file using the file descriptor obtained previously

these 3 actions require kernel intervention and they map directly to syscalls open, write and close. Lets have a look at this with strace:

root@test:/home/juan# strace -e "open,read,write,close,dup2" dd if=/dev/zero of=/dev/null count=1
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
close(3)                                = 0
open("/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0P \2\0\0\0\0\0"..., 832) = 832
close(3)                                = 0
open("/usr/lib/locale/locale-archive", O_RDONLY|O_CLOEXEC) = 3
close(3)                                = 0
open("/dev/zero", O_RDONLY)             = 3
dup2(3, 0)                              = 0
close(3)                                = 0
open("/dev/null", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3
dup2(3, 1)                              = 1
close(3)                                = 0
read(0, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 512) = 512
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 512) = 512
close(0)                                = 0
close(1)                                = 0
open("/usr/share/locale/locale.alias", O_RDONLY|O_CLOEXEC) = 0
read(0, "# Locale name alias data base.\n#"..., 4096) = 2570
read(0, "", 4096)                       = 0
close(0)                                = 0
open("/usr/share/locale/en_IE/LC_MESSAGES/coreutils.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/share/locale/en/LC_MESSAGES/coreutils.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/share/locale-langpack/en_IE/LC_MESSAGES/coreutils.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/share/locale-langpack/en/LC_MESSAGES/coreutils.mo", O_RDONLY) = 0
close(0)                                = 0
write(2, "1+0 records in\n1+0 records out\n", 311+0 records in
1+0 records out
) = 31
write(2, "512 bytes (512 B) copied", 24512 bytes (512 B) copied) = 24
write(2, ", 0.000607882 s, 842 kB/s\n", 26, 0.000607882 s, 842 kB/s
) = 26
close(2)                                = 0
+++ exited with 0 +++
root@test:/home/juan# 

So /dev/null file was opened and in return we got FD 3, which was then copied to FD 1 (dup2(3,1)). Right after this, write call was used to write on it, we can see the exact lines here:

open("/dev/null", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3
dup2(3, 1)                              = 1
close(3)                                = 0
read(0, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 512) = 512
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 512) = 512

Once the process hands over control to the kernel to execute the syscalls code we totally loose track of what's going on. To be honest that's the whole point of an API like the syscall layer, however sometimes is good to have the chance to see how things work in the background, and in order to answer the question we need to understand it.

Ftrace, like strace but way more fun!


Ftrace was introduced on the Linux kernel long ago, somewhere around late 2008 on Kernel 2.6.27. It allows users to trace certain Kernel functions given an incredible visibility of what's going and how things really work.

In order for ftrace to be available, the kernel needs to be compiled with this tracing feature. You can confirm by looking for CONFIG_FUNCTION_TRACER, like:

juan@test:~$ grep FUNCTION_TRACER /boot/config-3.19.0-51-generic
CONFIG_HAVE_FUNCTION_TRACER=y
CONFIG_FUNCTION_TRACER=y
juan@test:~$

Many distributions ship their kernels with ftrace enabled and even the debugfs (the way to access ftrace data) file system mounted (Ubuntu for example):

juan@test:~$ mount | grep debug
none on /sys/kernel/debug type debugfs (rw)
juan@test:~$

What essentially happens is pretty much summed up in these lines:

It uses a compiler feature to insert a small, 5-byte No-Operation instruction to the beginning of every kernel function, which NOP sequence is then dynamically patched into a tracer call when tracing is enabled by the administrator. 

so these few extra bytes are used later on to hook a tracing function that will gather the details of the invoked functions, pretty neat!

There's a LOT of ftrace documentation around so I won't waste time repeating any of it. I do suggest you to read it though. In this particular case I'll be using trace-cmd frontend tool instead of directly use the files under the debugfs directory.

BTW: ftrace is a great way to inspect and learn about the Linux Kernel, so do give it a go.

SHOW ME THE MONEY!

Ok, ok,... In order to track the kernel functions I'm going to use function_graph plugin with a little of filtering to only get the events related to the write syscall:

root@test:/home/juan# trace-cmd record -p function_graph -g "SyS_write" dd if=/dev/zero of=/dev/null count=1
  plugin 'function_graph'
1+0 records in
1+0 records out
512 bytes (512 B) copied, 0.00255381 s, 200 kB/s
Kernel buffer statistics:
  Note: "entries" are the entries left in the kernel ring buffer and are not
        recorded in the trace data. They should all be zero.

CPU: 0
entries: 0
overrun: 0
commit overrun: 0
bytes: 2476
oldest event ts: 180098.264687
now ts: 180098.267521
dropped events: 0
read events: 468

CPU: 1
entries: 0
overrun: 0
commit overrun: 0
bytes: 2184
oldest event ts: 180098.266223
now ts: 180098.267677
dropped events: 0
read events: 1882

CPU0 data recorded at offset=0x42b000
    20480 bytes in size
CPU1 data recorded at offset=0x430000
    77824 bytes in size
root@test:/home/juan# 

The execution of the command creates a "trace.dat" file were the captured events are written. Then you can access to these events using "trace-cmd report". Since we know we are looking for the first Write call done by dd command I will limit the output of the report to the following:

root@test:/home/juan# trace-cmd report | grep "dd-" | head -25
              dd-3770  [001] 180098.263240: funcgraph_entry:                   |  SyS_write() {
              dd-3770  [001] 180098.263269: funcgraph_entry:                   |    __fdget_pos() {
              dd-3770  [001] 180098.263269: funcgraph_entry:        0.077 us   |      __fget_light();
              dd-3770  [001] 180098.263270: funcgraph_exit:         0.669 us   |    }
              dd-3770  [001] 180098.263270: funcgraph_entry:                   |    vfs_write() {
              dd-3770  [001] 180098.263270: funcgraph_entry:                   |      rw_verify_area() {
              dd-3770  [001] 180098.263271: funcgraph_entry:                   |        security_file_permission() {
              dd-3770  [001] 180098.263271: funcgraph_entry:                   |          apparmor_file_permission() {
              dd-3770  [001] 180098.263271: funcgraph_entry:                   |            common_file_perm() {
              dd-3770  [001] 180098.263272: funcgraph_entry:        0.087 us   |              aa_file_perm();
              dd-3770  [001] 180098.263272: funcgraph_exit:         0.679 us   |            }
              dd-3770  [001] 180098.263272: funcgraph_exit:         1.230 us   |          }
              dd-3770  [001] 180098.263273: funcgraph_exit:         1.783 us   |        }
              dd-3770  [001] 180098.263273: funcgraph_exit:         2.330 us   |      }
              dd-3770  [001] 180098.263273: funcgraph_entry:        0.068 us   |      write_null();
              dd-3770  [001] 180098.263274: funcgraph_entry:                   |      __fsnotify_parent() {
              dd-3770  [001] 180098.263274: funcgraph_entry:        0.103 us   |        dget_parent();
              dd-3770  [001] 180098.263275: funcgraph_entry:        0.078 us   |        dput();
              dd-3770  [001] 180098.263275: funcgraph_exit:         1.255 us   |      }
              dd-3770  [001] 180098.263276: funcgraph_entry:                   |      fsnotify() {
              dd-3770  [001] 180098.263276: funcgraph_entry:        0.308 us   |        __srcu_read_lock();
              dd-3770  [001] 180098.263277: funcgraph_entry:        0.087 us   |        __srcu_read_unlock();
              dd-3770  [001] 180098.263277: funcgraph_exit:         1.548 us   |      }
              dd-3770  [001] 180098.263278: funcgraph_exit:         7.318 us   |    }
              dd-3770  [001] 180098.263278: funcgraph_exit:         9.498 us   |  }
root@test:/home/juan#

Pretty neat, right? we can clearly see the entry and exit of "every" kernel function within the SyS_write syscall. Now, with the function names we can go to our one and only https://elixir.bootlin.com and read the code of each function to have a better picture:
  • SyS_write Write syscall enter
  • vfs_write VFS generic write path, security checks and permissions validation (among other things :))
  •  write_null this function is the write function for the specific device /dev/null (indicated by the file_operations structure of the device), if this was a regular file we would see here a File System specific write function.
Now if we explore the code of write_null, we can see the following:

static ssize_t write_null(struct file *file, const char __user *buf,
     size_t count, loff_t *ppos)
{
 return count;
}

the parameters:
  • *file the file structure (created during the open syscall and linked to the fd)
  • *buf the buffer that keeps the data that is supposed to be written
  • count the number of bytes to write 
  • *ppos the position within the file where to start writing. 

As you can see the function itself doesn't do a thing (which was kind of expected after all xD), it just returns the number of bytes that were supposed to be written as some sort of acknowledgement and confirming the write operation did actually take place.

So not only now you know about /dev/null :P but you can go and have fun with ftrace exploring the Linux Kernel!!!

To answer the question, your bytes literally go nowhere!