Tuesday, March 28, 2017

Death by Real Time scheduling

A few weeks ago I had the chance to troubleshoot a system that was literally crawling, performance-wise. A simple SSH connection would take a few seconds to be established, and then the latency on the session would be really frustrating. Clearly something was bringing the system to its knees. A quick look at the logs gave the first clue:
[18253.961495] sched: RT throttling activated
These four simple words were enough to put me on the right track, so I asked: "Do you happen to have any processes using a Real Time scheduling policy?"...

Linux Scheduler


There's a "bit" of code in the kernel that decides which process gets to run next, and that code is called the scheduler. In short, the scheduler is in charge of assigning CPU time to processes according to certain parameters. In Linux, threads are scheduled considering two factors (ok, 2 + 1 if you want :P):
  • The scheduling policy: there are basically two types of scheduling policies, regular policies (SCHED_OTHER aka SCHED_NORMAL, SCHED_BATCH, SCHED_IDLE) and real time policies (SCHED_FIFO, SCHED_RR, SCHED_DEADLINE). The regular policies are implemented by the Completely Fair Scheduler (CFS):
    • SCHED_NORMAL: the regular scheduling policy, nothing fancy here; tasks get to run for a period of time and preemption is in place as well.
    • SCHED_BATCH: allows tasks to run for longer periods of time before being preempted, which improves, for example, the use of cache lines (but of course reduces interactivity).
    • SCHED_IDLE: tasks with this policy have even lower priority than nice 19.
    • SCHED_FIFO: a simple scheduling algorithm without time slicing, based on a set of queues, one per available priority. The static priority must be higher than 0.
    • SCHED_RR: unlike SCHED_FIFO, threads under SCHED_RR run for a certain time slice called a quantum, and once they use it up they are put at the end of their priority queue.
    • SCHED_DEADLINE: uses three parameters, named "runtime", "period", and "deadline", to schedule tasks. A SCHED_DEADLINE task should receive "runtime" microseconds of execution time every "period" microseconds, and these "runtime" microseconds are available within "deadline" microseconds from the beginning of the period.
  • Static priority: also known as sched_priority, a value between 1 and 99. This value only affects real time scheduling policies; it has no effect on regular policies and in fact must be specified as 0 for them.
In brief:
       Conceptually, the scheduler maintains a list of runnable threads for
       each possible sched_priority value.  In order to determine which
       thread runs next, the scheduler looks for the nonempty list with the
       highest static priority and selects the thread at the head of this
       list.
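
To see those static priority ranges in action, here's a minimal sketch (just an illustration, not from the original troubleshooting) that prints the range the kernel reports for each policy through sched_get_priority_min()/sched_get_priority_max(); regular policies should report 0-0 and real time policies 1-99:

#define _GNU_SOURCE //needed for SCHED_BATCH and SCHED_IDLE
#include <stdio.h>
#include <sched.h>

int main(void)
{
        //Pair each policy constant with a printable name
        struct { int policy; const char *name; } policies[] = {
                { SCHED_OTHER, "SCHED_OTHER" },
                { SCHED_BATCH, "SCHED_BATCH" },
                { SCHED_IDLE,  "SCHED_IDLE"  },
                { SCHED_FIFO,  "SCHED_FIFO"  },
                { SCHED_RR,    "SCHED_RR"    },
        };
        int i;

        for (i = 0; i < 5; i++)
        {
                printf("%-12s MIN=%d MAX=%d\n", policies[i].name,
                        sched_get_priority_min(policies[i].policy),
                        sched_get_priority_max(policies[i].policy));
        }
        return 0;
}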

There's another concept to mention when it comes to scheduling, and that is the "niceness" (this is the +1). This value, which only affects the SCHED_OTHER and SCHED_BATCH policies, is used to influence the scheduler's behavior to favor or disfavor certain threads. Linux's CFS (Completely Fair Scheduler) will then use the nice value to schedule the processes accordingly. Nice can take any integer value in [-20,19], with -20 being the highest priority and 19 the lowest. This is the value you can change using the nice command from your shell.
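
From C, the same knob is exposed through getpriority()/setpriority(); a minimal sketch (illustrative only) that lowers the calling process to nice 10 could look like this:

#include <stdio.h>
#include <errno.h>
#include <string.h>
#include <sys/time.h>
#include <sys/resource.h>

int main(void)
{
        int before;

        //getpriority() can legitimately return -1, so errno has to be checked too
        errno = 0;
        before = getpriority(PRIO_PROCESS, 0); //who = 0 means the calling process
        if (before == -1 && errno != 0)
        {
                printf("getpriority failed: %s\n", strerror(errno));
                return 1;
        }

        //Raising the nice value (lowering priority) is allowed for any user,
        //going below 0 requires CAP_SYS_NICE (typically root)
        if (setpriority(PRIO_PROCESS, 0, 10) == -1)
        {
                printf("setpriority failed: %s\n", strerror(errno));
                return 1;
        }

        printf("nice went from %d to %d\n", before, getpriority(PRIO_PROCESS, 0));
        return 0;
}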

How do you find the Scheduling Policy of a process?


The easiest way to get the scheduling information of a process is with our well known ps command. We can have a look at the processes of the system like in the following example:

juan@test:~$ ps -eo pid,user,class,pri,nice,rtprio|awk '{print $3,$5,$6}'|sort|uniq -c
      1 CLS NI RTPRIO
      4 FF - 99
    106 TS 0 -
      1 TS 1 -
      1 TS -10 -
      1 TS 19 -
     31 TS -20 -
      1 TS 5 -
juan@test:~$

There we can see:
  • 4 processes running with the real time policy FF (SCHED_FIFO) and real time priority 99. Also note how the NI (nice) column is empty for these processes, as we described before.
  • 141 processes with a regular policy TS (SCHED_OTHER, you can see this in "man ps"). Also note how the RTPRIO (real time priority) column is empty for these processes, as we described before.
If we have a closer look at the real time threads (FF) we can see that they are kernel threads (one per available core):

juan@test:~$ ps -eo pid,cmd,user,pcpu,class,pri,nice,rtprio|grep "FF\|PID"|grep -v grep
  PID CMD                         USER     %CPU CLS PRI  NI RTPRIO
   11 [migration/0]               root      0.0 FF  139   -     99
   12 [watchdog/0]                root      0.0 FF  139   -     99
   13 [watchdog/1]                root      0.0 FF  139   -     99
   14 [migration/1]               root      0.0 FF  139   -     99
juan@test:~$

  • watchdog threads are there to identify software lockup conditions
  • migration threads are there to handle load balancing tasks among the available cores
These threads use a real time policy because their timely execution is crucial for the health of the system; however, they only run for really short periods of time and that's why they don't cause any pain.

Real time Policy


The core of this post was actually the real time policy, right? Well, let's have a look at the particularities of this policy and why it can easily become a headache.
  • SCHED_FIFO policy doesn't use time slicing (SCHED_RR does though)
  • So a SCHED_FIFO task will run until:
    • A higher priority task becomes runnable and preempts it.
    • It blocks, for example waiting on IO.
    • It calls sched_yield().
With these details said, you should already have an idea of how an out-of-control RT task can become a big problem.
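
To make the contrast clearer, here's a minimal sketch (purely illustrative, not part of the original incident) of a SCHED_FIFO task that behaves itself: it does a short burst of work and then gives the CPU back with sched_yield(), so other tasks at the same priority still get a chance to run:

#include <stdio.h>
#include <sched.h>
#include <errno.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        struct sched_param params;
        volatile int i;

        //Ask for SCHED_FIFO with a modest priority (needs root or CAP_SYS_NICE)
        params.sched_priority = 10;
        if (sched_setscheduler(getpid(), SCHED_FIFO, &params) == -1)
        {
                printf("Setscheduler failed: %s\n", strerror(errno));
                return 1;
        }

        while (1)
        {
                //Simulate a short burst of work...
                for (i = 0; i < 1000000; i++)
                        ;
                //...and then voluntarily move to the end of our priority queue
                //so other runnable threads at the same priority are not starved
                sched_yield();
        }
}

Keep in mind sched_yield() only hands the CPU to other real time tasks of the same priority; it does nothing for SCHED_OTHER tasks, which is exactly why the tight loop in the next section is so damaging.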

CPU Starvation


The problem an RT task can cause is called CPU starvation. The idea behind starvation is that a small number of tasks (even just one) consume most or all of a certain resource, in this case CPU time. Here you can see a simple example of a real time process that goes into an infinite loop, causing CPU starvation for the rest (the scheduling policy is switched to SCHED_FIFO with the sched_setscheduler() call):

Code (remember you can pull this example or the others from https://github.com/jjpavlik):
#include <stdio.h>
#include <sched.h>
#include <errno.h>
#include <string.h>
#include <unistd.h>
#include <sys/time.h>
#include <sys/resource.h>

int main(void)
{
        int sched,pri,pid,i,aux,error;
        struct sched_param params;

        pid = getpid();

        sched = sched_getscheduler(pid);
        pri = getpriority(PRIO_PROCESS,0);
        printf("PID=%d\n",pid);
        printf("Current scheduler is %d, current priority is %d\n",sched,pri);
        printf("Priorities MIN=%d, MAX=%d, nice=%d\n",sched_get_priority_min(sched), sched_get_priority_max(sched),getpriority(PRIO_PROCESS,0));

        printf("Changing scheduling class to SCHED_FIFO\n");
        params.sched_priority=99;
        aux = sched_setscheduler(pid,SCHED_FIFO,&params);
        error=errno;
        if( aux == -1)
        {
                //You need to run this as root :D, otherwise you will get a permission denied
                printf("Setscheduler failed: %s\n",strerror(error));
                return 1;
        }
        sched = sched_getscheduler(pid);
        pri = getpriority(PRIO_PROCESS,0);
        printf("Scheduler is %d, current priority is %d\n",sched,pri);
        printf("Priorities MIN=%d, MAX=%d, nice=%d\n",sched_get_priority_min(sched), sched_get_priority_max(sched),getpriority(PRIO_PROCESS,0));

        i = 0;
        while(1)
        {//Infinite loop that never blocks nor yields, hogging the CPU
                i++;
        }
}

Execution:
root@test:/home/juan/scheduling_tests# ./sched_details &
[1] 3265
PID=3265
Current scheduler is 0, current priority is 0
Priorities MIN=0, MAX=0, nice=0
Changing scheduling class to SCHED_FIFO
Scheduler is 1, current priority is 0
Priorities MIN=1, MAX=99, nice=0
root@test:/home/juan/scheduling_tests# 
root@test:/home/juan/scheduling_tests# ps -eo pid,cmd,user,pcpu,class,pri,nice,rtprio|grep "FF\|PID"|grep -v grep
  PID CMD                         USER     %CPU CLS PRI  NI RTPRIO 
   11 [migration/0]               root      0.0 FF  139   -     99
   12 [watchdog/0]                root      0.0 FF  139   -     99
   13 [watchdog/1]                root      0.0 FF  139   -     99
   14 [migration/1]               root      0.0 FF  139   -     99
 3265 ./sched_details             root     99.3 FF  139   -     99
root@test:/home/juan/scheduling_tests#

Clearly the process is consuming almost the entirety of one of the available cores in my system (99.3%); however, we still don't see the throttling message in the logs because there's a second core available. So to push this even further I started a second instance of the same binary to make sure the other core gets hammered as well, and now:
root@test:/home/juan/scheduling_tests# ps -eo pid,cmd,user,pcpu,class,pri,nice,rtprio|grep "FF\|PID"|grep -v grep
  PID CMD                         USER     %CPU CLS PRI  NI RTPRIO 
   11 [migration/0]               root      0.0 FF  139   -     99
   12 [watchdog/0]                root      0.0 FF  139   -     99
   13 [watchdog/1]                root      0.0 FF  139   -     99
   14 [migration/1]               root      0.0 FF  139   -     99
 3299 ./sched_details             root     97.9 FF  139   -     99
 3338 ./sched_details             root     95.8 FF  139   -     99
root@test:/home/juan/scheduling_tests#

After a few seconds you should see the throttling activated message in your logs:
root@test:/home/juan/scheduling_tests# dmesg |tail -1
[ 9640.055754] sched: RT throttling activated
root@test:/home/juan/scheduling_tests#

At this stage the system is indeed crawling and barely responding to any command. But what does "sched: RT throttling activated" mean after all?

There are two kernel parameters that help us reduce the impact of this kind of situation with real time processes. Here you can see them:
root@test:/home/juan/scheduling_tests# cat /proc/sys/kernel/sched_rt_period_us
1000000
root@test:/home/juan/scheduling_tests# cat /proc/sys/kernel/sched_rt_runtime_us
950000
root@test:/home/juan/scheduling_tests#
  • sched_rt_period_us specifies a scheduling period that is equivalent to 100% CPU bandwidth.
  • sched_rt_runtime_us specifies how much of the rt_period can be used by real time and deadline tasks.
In my setup (a default Ubuntu install) the RT period is 1000000 us (1 second) and the RT runtime is 950000 us, meaning that 95% of the CPU time can be used by real time tasks, leaving 5% for other scheduling policies. This 5% of CPU time should be enough (being patient) to log in to the system and get rid of the rogue processes, unless you have many other processes running as well.

For example, changing 950000 to 500000 would allow RT tasks to use up to around 50% of the CPU time:
root@test:/home/juan/scheduling_tests# echo 500000 > /proc/sys/kernel/sched_rt_runtime_us
root@test:/home/juan/scheduling_tests# cat /proc/sys/kernel/sched_rt_runtime_us
500000
root@test:/home/juan/scheduling_tests#
root@test:/home/juan/scheduling_tests# ps -eo pid,cmd,user,pcpu,class,pri,nice,rtprio|grep "FF\|PID"|grep -v grep
  PID CMD                         USER     %CPU CLS PRI  NI RTPRIO
   11 [migration/0]               root      0.0 FF  139   -     99
   12 [watchdog/0]                root      0.0 FF  139   -     99
   13 [watchdog/1]                root      0.0 FF  139   -     99
   14 [migration/1]               root      0.0 FF  139   -     99
 3394 ./sched_details             root     58.7 FF  139   -     99
 3395 ./sched_details             root     50.0 FF  139   -     99
root@test:/home/juan/scheduling_tests#

So this way you can keep the real time processes under control.

However, considering real time tasks are supposed to be time sensitive, you should avoid capping them like this unless it's really necessary. It's probably worth having a look at why they are consuming so much CPU time instead; usually real time tasks are operations that take really short periods of time but need to be executed as soon as possible. So tools like strace or even perf are really good starting points to identify the reason behind the CPU usage, even better if you have access to the source code!
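
And if you just need immediate relief while you investigate, one option (a rough sketch, assuming you have already identified the offending PID; purely illustrative) is to demote the task back to SCHED_OTHER with a small helper like this:

#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <string.h>
#include <sched.h>

int main(int argc, char *argv[])
{
        struct sched_param params;
        int pid;

        if (argc != 2)
        {
                printf("Usage: %s <pid>\n", argv[0]);
                return 1;
        }
        pid = atoi(argv[1]);

        //SCHED_OTHER requires a static priority of 0
        params.sched_priority = 0;
        //Changing another process' policy needs root (or CAP_SYS_NICE)
        if (sched_setscheduler(pid, SCHED_OTHER, &params) == -1)
        {
                printf("Setscheduler failed: %s\n", strerror(errno));
                return 1;
        }
        printf("PID %d is back under SCHED_OTHER, CFS will take it from here\n", pid);
        return 0;
}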

Tuesday, March 7, 2017

Where did my IO Scheduler go? Where are you my beloved [noop|deadline|cfq]?

That was exactly the question I asked myself when I came across the following output:
root@ip-172-31-22-167:/home/ubuntu# cat /sys/block/xvda/queue/scheduler
none
root@ip-172-31-22-167:/home/ubuntu#
root@ip-172-31-22-167:/home/ubuntu# uname -a
Linux ip-172-31-22-167 4.4.0-64-generic #85-Ubuntu SMP Mon Feb 20 11:50:30 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
root@ip-172-31-22-167:/home/ubuntu#

Of course my first approach was to curse, blame politicians, global warming and even capitalism..., but somehow that didn't provide any answers.

Well, it turns out that a few things have changed in the Linux kernel without my consent (xD), how dare they, right? Developers and their creativity :P.

In order to understand a bit what this "none" scheduler means, I'll try to sum up a bit of history, Linux and politics in this post.

Between kernel 2.6.10 and somewhere before 3.13 / 3.16


So between these two kernel versions there were basically two different methods for block drivers to interact with the system. The first method was called Request Mode, and the second one you could call the "Make your own Request Interface" Mode.

Request Mode 


In this mode the Linux block layer maintains a simple request queue (using a linked list structure) where all the IO requests are placed one after the other (yeah, like a queue xD). The device driver then receives the requests from the head of the queue, and that's pretty much it in terms of its logic. Having the IOs waiting there in the queue actually made some interesting stuff possible, for example:
  • Why don't we reorder them in order to make better use of the HW? Makes perfect sense, right? Well, for the young ones... there used to be something called magnetic drives which, due to their HW design (platters, spindles, heads, etc), perform much better with sequential operations than with random ones, since that minimizes seek times for example.
  • Also, now that we have them reordered, why don't we merge them to get even better throughput? And so they did.
  • Another interesting feature that was implemented later on is the ability to apply certain policies to achieve a kind of fairness in terms of processes getting access to IO.
These are the pillars that IO schedulers like noop, deadline and CFQ were built on. You can still see them in some current Linux installs, like Ubuntu 14.04.5:

juan@test:~$ uname -a
Linux test 3.19.0-51-generic #58~14.04.1-Ubuntu SMP Fri Feb 26 22:02:58 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
juan@test:~$ cat /sys/block/sda/queue/scheduler
noop [deadline] cfq
juan@test:~$

Note: even though the kernel is 3.19, blk-mq had already been added by then, but it was not yet the default for some drivers, like the SCSI driver in this case.

So why was it necessary for the developers to replace this lovely piece of software after all??? Well, there were a few reasons:
  • Today's drives (mainly SSDs) are way faster than the drives we had a couple of years ago, and magnetic drives are kind of on their way to extinction as well...
  • Nowadays, due to virtualization, the guest OS doesn't really get to know the real physical layout of the drives, so spending CPU cycles trying to sort the requests proved to be a waste of time.
  • With really fast SSD drives/arrays the single queue mechanism can easily become a bottleneck, mainly in multiprocessor environments (yeah, everywhere now).
    • The queue can only be accessed by a single processor at a time, so some contention is generated as well when trying to push a high number of IO operations.

"Make your own Request Interface" Mode


This second mode is just a way to skip the Request Mode and its IO scheduling algorithms, allowing the driver to be accessed directly by the higher layers of the IO stack. This is used, for example, by MD to build software RAID, and by other tools that need to process the requests themselves before actually sending them to the devices.

The Block Multiqueue IO mechanism (that's quite a name, isn't it?)


Kernel developers realized that the previous single-queue mechanism had become a huge bottleneck in the Linux IO stack, so they came up with this new one. In kernel 3.13 the first multiqueue-compatible driver was introduced in order to start testing the future IO queuing mechanism (blk-mq), and finally, from 3.16 onwards, complete blk-mq implementations began to show up for different drivers.

This new mechanism includes two levels of queuing. On the first level, what used to be the single queue where the IO operations were placed has been split into multiple submission queues, one per CPU (or per node), considerably improving the number of IO operations the system can handle (now each CPU has its own submission queue, taking the contention issue completely off the table). The second level of queuing consists of the hardware queues the device itself may provide.

The mapping between the submission queues and the hardware queues might be 1:1 or N:M depending on the configuration.

Therefore, with this new multiqueue mechanism, IO schedulers became pretty much useless and they are no longer available for many drivers. Maybe in the future new schedulers/logic will show up to work on the individual submission queues.

I guess this was the longest possible way to explain where the IO schedulers went on my Ubuntu 16.04 instance after all :D, hopefully it will help someone understand what is going on:
root@ip-172-31-22-167:/home/ubuntu# uname -a
Linux ip-172-31-22-167 4.4.0-64-generic #85-Ubuntu SMP Mon Feb 20 11:50:30 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
root@ip-172-31-22-167:/home/ubuntu# cat /sys/block/xvda/queue/scheduler
none
root@ip-172-31-22-167:/home/ubuntu#

Note: in kernel 4.3 the blkfront driver (Xen block driver) was converted to blk-mq.

If I have the time I will run some tests to see the different behaviors I mentioned, just for fun xD.

Some interesting bibliography:


https://lwn.net/Articles/552904/
https://www.thomas-krenn.com/en/wiki/Linux_Multi-Queue_Block_IO_Queueing_Mechanism_(blk-mq)
https://www.thomas-krenn.com/en/wiki/Linux_I/O_Stack_Diagram#Diagram_for_Linux_Kernel_3.17
http://kernel.dk/systor13-final18.pdf