Tuesday, March 7, 2017

Where did my IO Scheduler go? Where are you, my beloved [noop|deadline|cfq]?

That was exactly the question I asked myself when I came across the following output:
root@ip-172-31-22-167:/home/ubuntu# cat /sys/block/xvda/queue/scheduler
none
root@ip-172-31-22-167:/home/ubuntu#
root@ip-172-31-22-167:/home/ubuntu# uname -a
Linux ip-172-31-22-167 4.4.0-64-generic #85-Ubuntu SMP Mon Feb 20 11:50:30 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
root@ip-172-31-22-167:/home/ubuntu#

Of course, my first reaction was to curse and blame politicians, global warming and even capitalism..., but somehow that didn't provide any answers.

Well, it turns out that a few things have changed in the Linux kernel without my consent (xD). How dare they, right? Developers and their creativity :P.

To understand what this none "scheduler" means, I'll try to sum up a bit of history, Linux and politics in this post.

From kernel 2.6.10 to somewhere before 3.13 / 3.16


So between these two kernel versions there were basically two different ways for block drivers to interact with the block layer. The first method was called Request Mode, and the second one you could call "Make your own Request Interface" Mode.

Request Mode 


In this mode the Linux block layer maintains a simple request queue (using a linked list structure) where all the IO requests are placed one after the other (yeah, like a queue xD). The device driver then receives requests from the head of the queue, and that's pretty much it as far as the driver's logic goes. Having the IOs waiting there in the queue actually made some interesting stuff possible, for example:
  • Why don't we reorder them to make better use of the HW? Makes perfect sense, right? Well, for the young ones... there used to be something called magnetic drives which, due to their HW design (platters, spindles, heads, etc.), perform much better with sequential operations than with random ones, since that minimizes seek times for example.
  • Also, now that we have them reordered, why don't we merge adjacent requests to get even better throughput? And so they did.
  • Another interesting feature that was implemented later on is the possibility of applying certain policies to achieve some fairness in terms of which processes get access to IO.
These are the pillars that IO schedulers like noop, deadline and CFQ were built on. You can still see them in some current Linux distributions, like Ubuntu 14.04.5:

juan@test:~$ uname -a
Linux test 3.19.0-51-generic #58~14.04.1-Ubuntu SMP Fri Feb 26 22:02:58 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
juan@test:~$ cat /sys/block/sda/queue/scheduler
noop [deadline] cfq
juan@test:~$

Note: even though blk-mq had already been added by kernel 3.19, it was not yet the default for some drivers, like the SCSI driver in this case.
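
On these kernels you can also switch the active scheduler at runtime through sysfs, and each scheduler exposes its tunables under queue/iosched/. A minimal sketch, assuming sda is still on the single-queue path and the deadline scheduler is available (the tunables listed below are deadline's, other schedulers expose different ones):
juan@test:~$ echo deadline | sudo tee /sys/block/sda/queue/scheduler
deadline
juan@test:~$ cat /sys/block/sda/queue/scheduler
noop [deadline] cfq
juan@test:~$ ls /sys/block/sda/queue/iosched/
fifo_batch  front_merges  read_expire  write_expire  writes_starved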

So why did the developers need to replace this lovely piece of software after all??? Well, there were a few reasons:
  • Today's drives (mainly SSDs) are way faster than the drives we had a couple of years ago, and magnetic drives are kind of on their way to extinction anyway...
  • Nowadays, due to virtualization, the guest OS doesn't really get to know the real physical layout of the drives, so burning CPU cycles trying to sort the requests proved to be a waste of time.
  • With really fast SSD drives/arrays the single-queue mechanism can easily become a bottleneck, mainly in multiprocessor environments (yeah, everywhere now).
    • The queue can only be accessed by one processor at a time, so this also generates contention when trying to push a high number of IO operations.

"Make your own Request Interface" Mode


This second mode is just a way to bypass Request Mode and its IO scheduling algorithms, allowing the driver to be accessed directly by higher layers of the IO stack. This is used for example by MD to build software RAID, and by other subsystems that need to process the requests themselves before actually sending them to the devices.
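
A quick way to see this on a running box: devices created by these bio-based drivers don't go through a scheduler at all, so their queue/scheduler file already reports none. A small sketch, assuming a software RAID device md0 exists (the device name is just an example):
juan@test:~$ cat /sys/block/md0/queue/scheduler
none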

The Block Multiqueue IO mechanism (that's quite a name, isn't it?)


Kernel developers realized that the previous single-queue mechanism had become a huge bottleneck in the Linux IO stack, so they came up with this new one. In kernel 3.13 the first multiqueue-compatible driver was introduced in order to start testing the future IO queuing mechanism (blk-mq); finally, in 3.16 and beyond, complete blk-mq implementations began to show up for different drivers.

This new mechanism includes two levels of queuing. At the first level, what used to be the single queue where IO operations were placed has been split into multiple submission queues, one per CPU (or per node), considerably improving the number of IO operations the system can handle (now each CPU has its own submission queue, taking the contention issue totally off the table). The second level of queuing is the hardware queues the device itself may provide.

The mapping between the submission queues and the hardware queues might be 1:1 or N:M depending on the configuration.
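
You can actually peek at this mapping through sysfs: blk-mq devices expose an mq/ directory with one subdirectory per hardware context, and each of those lists the CPUs (i.e. submission queues) mapped to it. A sketch assuming a hypothetical NVMe drive nvme0n1 on a 4-CPU box with 4 hardware queues (device names, queue counts and even the exact attribute names depend on your HW and kernel version):
juan@test:~$ ls /sys/block/nvme0n1/mq/
0  1  2  3
juan@test:~$ cat /sys/block/nvme0n1/mq/0/cpu_list
0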

Therefore, with this new multiqueue mechanism, the old IO schedulers became pretty much useless, and they are already no longer available for many drivers. Maybe in the future new schedulers/logic will show up to work on the individual submission queues.

I guess this was the longest possible way to explain where the IO schedulers went on my Ubuntu 16.04 instance :D. Hopefully it will help someone understand what is going on:
root@ip-172-31-22-167:/home/ubuntu# uname -a
Linux ip-172-31-22-167 4.4.0-64-generic #85-Ubuntu SMP Mon Feb 20 11:50:30 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
root@ip-172-31-22-167:/home/ubuntu# cat /sys/block/xvda/queue/scheduler
none
root@ip-172-31-22-167:/home/ubuntu#

Note: in kernel 4.3 the blkfront driver (the Xen block driver) was converted to blk-mq.

If I have the time I will run some tests to see the different behaviors I mentioned, just for fun xD.

Some interesting bibliography:


https://lwn.net/Articles/552904/
https://www.thomas-krenn.com/en/wiki/Linux_Multi-Queue_Block_IO_Queueing_Mechanism_(blk-mq)
https://www.thomas-krenn.com/en/wiki/Linux_I/O_Stack_Diagram#Diagram_for_Linux_Kernel_3.17
http://kernel.dk/systor13-final18.pdf
