
Worked investigation - Catch the scheduler starving a task

Companion to Linux Kernel -> Month 02 -> Week 7: The CPU Scheduler (CFS, EEVDF). The chapter explains that the scheduler picks tasks by virtual runtime. This page makes you watch a latency-sensitive task get starved by a CPU hog, see it in the scheduler's own accounting, and fix it with the exact knobs a production engineer reaches for. ~40 minutes, any multi-core Linux box.

The symptom you're learning to diagnose

Your latency-sensitive service - a request handler, an audio thread, a control loop - gets intermittent latency spikes. CPU isn't even at 100%. But every so often a response that should take 2 ms takes 80 ms. Someone says "the box has spare CPU, it can't be scheduling." They're wrong, and you're going to prove it.

The CPU scheduler is fair by default - and fairness is exactly the problem when one task matters more than the others. We'll reproduce the starvation and then tell the kernel which task matters.

Step 0: the one fact that explains it

The default Linux scheduler (CFS, now EEVDF on 6.6+) is a fair scheduler. Its goal is to give every runnable task an equal share of CPU over time. It does not know that your request handler matters more than a batch job. If you have more runnable tasks than cores, fairness means your important task waits its turn - even if the box has "spare" CPU in aggregate, because aggregate idle time and per-task scheduling latency are different things.

The knobs that change this: niceness (a soft priority hint within the fair scheduler) and real-time scheduling classes (SCHED_FIFO/SCHED_RR, which preempt fair tasks entirely). You'll use both.

Step 1: set up a "latency-sensitive" task you can measure

We need a task that does a tiny bit of work on a fixed interval and reports how late each cycle was. Save as jitter.py:

import time

target = 0.005          # want to wake every 5 ms
worst = 0.0
next_t = time.perf_counter() + target
for _ in range(2000):
    delay = next_t - time.perf_counter()
    if delay > 0:
        time.sleep(delay)          # block until the deadline; the scheduler must wake us
    now = time.perf_counter()
    late = (now - next_t) * 1000   # ms past the deadline
    worst = max(worst, late)
    next_t += target               # fixed schedule, so lateness does not accumulate
print(f"worst wakeup lateness: {worst:.1f} ms")

Run it alone on an idle machine - the baseline:

$ python3 jitter.py
worst wakeup lateness: 0.3 ms

0.3 ms - the scheduler wakes it almost exactly on time. This is "healthy." Remember the number.
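A single worst-case number hides how often the bad cycles happen. As a minimal sketch, assuming you change jitter.py to append every `late` value to a list, a nearest-rank summary could look like this. The helper names `percentile` and `summarize` and the sample data are illustrative, not part of the original script.

```python
import math

def percentile(sorted_vals, p):
    """Nearest-rank percentile of an ascending, non-empty list; p in (0, 100]."""
    rank = max(1, math.ceil(p * len(sorted_vals) / 100))
    return sorted_vals[rank - 1]

def summarize(lateness_ms):
    """Return (p50, p99, worst) for a non-empty list of per-cycle lateness values."""
    vals = sorted(lateness_ms)
    return percentile(vals, 50), percentile(vals, 99), vals[-1]

# Hypothetical sample: 98 on-time cycles, one slightly late, one badly late.
samples = [0.1] * 98 + [0.3, 47.8]
p50, p99, worst = summarize(samples)
print(f"p50={p50:.1f} ms  p99={p99:.1f} ms  worst={worst:.1f} ms")
```

Comparing p50 against the worst case tells you whether the task is late all the time or only occasionally, which is the pattern the symptom describes.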

Step 2: starve it - flood the machine with CPU hogs

Now create contention: spawn one busy-loop per core so every CPU is busy, then run the latency task on top of them. That leaves one more runnable task than there are cores:

$ NPROC=$(nproc)
$ for i in $(seq $NPROC); do python3 -c 'while True: pass' & done
$ python3 jitter.py
worst wakeup lateness: 47.8 ms

There it is. Same task, same code - but worst-case wakeup lateness went from 0.3 ms to 47.8 ms, roughly a 160x degradation. And notice (top in another terminal) that the machine is at 100% but not overwhelmed: it's running exactly nproc busy loops plus your task, and your task needs only a sliver of CPU. Capacity isn't the bottleneck; scheduling fairness is. Your task is runnable, but it waits in line behind equal-priority hogs. This is the spike the "we have spare CPU" person can't explain.
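If you would rather drive the contention from Python than from a shell loop, here is a rough sketch of the same idea using `subprocess`. The names `spawn_hogs` and `stop_hogs` are made up for this example. Each hog is the same `while True: pass` one-liner the shell loop starts.

```python
import os
import subprocess
import sys

def spawn_hogs(count=None):
    """Start `count` busy-loop processes (default: one per online CPU)."""
    if count is None:
        count = os.cpu_count() or 1
    cmd = [sys.executable, "-c", "while True: pass"]
    return [subprocess.Popen(cmd) for _ in range(count)]

def stop_hogs(hogs):
    """Terminate every hog, then wait so none is left behind as a zombie."""
    for proc in hogs:
        proc.terminate()
    for proc in hogs:
        proc.wait()
```

A typical use would be `hogs = spawn_hogs()` followed by the measurement inside `try:` and `stop_hogs(hogs)` inside `finally:`, so the hogs are cleaned up even if the measurement fails.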

Step 3: see it in the scheduler's own books

Don't take the jitter number on faith - read the scheduler's per-task accounting. Find your task's PID and look at /proc/<pid>/schedstat and /proc/<pid>/sched:

$ python3 jitter.py & PID=$!
$ cat /proc/$PID/schedstat
4823914 89412305 1872
#   ^cpu-time  ^WAIT-time  ^timeslices

Three numbers: nanoseconds running, nanoseconds waiting on a runqueue (ready but not picked), and number of timeslices. Under contention, the middle number - time spent runnable but waiting - balloons. That wait time is your latency. The richer view:

$ cat /proc/$PID/sched | grep -E 'wait|nr_switches|nr_involuntary'
se.statistics.wait_max       :      48.213000   # worst single wait, ms - matches our jitter!
se.statistics.wait_sum       :    3917.402000
nr_involuntary_switches      :          1922    # times it was kicked off CPU against its will

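wait_max of ~48 ms matches the jitter we measured - the scheduler is telling you the same story from its own ledger. nr_involuntary_switches counts how often the task was preempted before it wanted to yield. If the wait_* lines are missing or read zero, scheduler statistics are probably switched off; sudo sysctl kernel.sched_schedstats=1 enables them. Depending on kernel version, the same counters may also be printed without the se.statistics. prefix. This is how you prove "it's scheduling" instead of arguing about it.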
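To collect these numbers repeatedly, the two files can be parsed with a few lines of Python. This is a sketch under stated assumptions, not an official interface: the function names are invented here, the parser assumes the "name : value" layout shown above, and it strips the se.statistics. prefix so the same key works whether or not your kernel prints it.

```python
from typing import NamedTuple

class SchedStat(NamedTuple):
    run_ns: int       # time spent on a CPU
    wait_ns: int      # time spent runnable but waiting on a runqueue
    timeslices: int   # number of timeslices run

def parse_schedstat(text):
    """Parse the three integers found in /proc/<pid>/schedstat."""
    run_ns, wait_ns, timeslices = (int(x) for x in text.split()[:3])
    return SchedStat(run_ns, wait_ns, timeslices)

def parse_sched(text):
    """Parse numeric 'name : value' lines from /proc/<pid>/sched into a dict."""
    prefix = "se.statistics."
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        tokens = rest.split()
        if not sep or not tokens:
            continue
        try:
            value = float(tokens[0])
        except ValueError:
            continue  # header line or other non-numeric entry
        key = name.strip()
        if key.startswith(prefix):
            key = key[len(prefix):]
        fields[key] = value
    return fields

def read_proc(pid, name):
    """Read /proc/<pid>/<name>; pid may be an int or the string 'self'."""
    with open(f"/proc/{pid}/{name}") as f:
        return f.read()

# Demo on the sample output shown above rather than a live PID.
stat = parse_schedstat("4823914 89412305 1872")
sched = parse_sched(
    "se.statistics.wait_max       :      48.213000\n"
    "nr_involuntary_switches      :          1922\n"
)
print(f"waited {stat.wait_ns / 1e6:.1f} ms over {stat.timeslices} timeslices")
print(f"wait_max={sched['wait_max']} ms, involuntary={int(sched['nr_involuntary_switches'])}")
```

Against a live task you would call `parse_sched(read_proc(pid, "sched"))` twice, a few seconds apart, and compare the results.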

Step 4: the soft fix - niceness

nice shifts a task's weight within the fair scheduler. Lower nice = higher priority. Renice the hogs to the weakest fair priority (+19) and start your task at the strongest (-20):

$ for p in $(pgrep -f 'while True'); do sudo renice 19 $p >/dev/null; done   # hogs: low priority
$ sudo nice -n -20 python3 jitter.py                                          # task: high priority
worst wakeup lateness: 6.2 ms

From 47.8 ms to 6.2 ms - much better. Niceness rebalanced the fair shares so your task gets CPU sooner. But 6.2 ms is still not the 0.3 ms baseline: niceness is a bias, not a guarantee. A fair scheduler with a strong bias still occasionally makes the favored task wait. For soft preferences (a build that should yield to interactive work) niceness is perfect. For a hard latency requirement, you need to leave the fair class entirely.
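To see why niceness is a bias rather than a guarantee, it helps to look at the arithmetic. The fair scheduler turns each nice value into a weight, with nice 0 at 1024 and each step changing the weight by a factor of about 1.25. A task's long-run CPU share is its weight divided by the total weight of the tasks competing on the same core. The sketch below uses that 1.25 approximation instead of the kernel's exact lookup table, and the function names are hypothetical.

```python
NICE_0_WEIGHT = 1024

def approx_weight(nice):
    """Approximate fair-scheduler weight for a nice value (about 1.25x per step)."""
    return NICE_0_WEIGHT / (1.25 ** nice)

def cpu_share(task_nice, competitor_nices):
    """Long-run share of one core for a task competing with the given nice values."""
    mine = approx_weight(task_nice)
    total = mine + sum(approx_weight(n) for n in competitor_nices)
    return mine / total

print(f"equal nice:        {cpu_share(0, [0]):.3f}")
print(f"nice -20 vs +19:   {cpu_share(-20, [19]):.5f}")
```

The share at -20 against a +19 hog is very close to 1, yet the measured lateness was still 6.2 ms. Weights govern how CPU time is divided over the long run; they do not promise that the favored task gets the CPU the instant it wakes.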

Step 5: the hard fix - a real-time scheduling class

SCHED_FIFO is a real-time class: a FIFO task preempts all fair (normal) tasks and runs until it blocks, yields, or is preempted by a higher-priority real-time task. Promote the latency task with chrt:

$ sudo chrt --fifo 50 python3 jitter.py     # SCHED_FIFO, priority 50
worst wakeup lateness: 0.4 ms

Back to baseline - 0.4 ms even with every core pegged by hogs. The FIFO task jumps the entire fair queue whenever it's runnable. Confirm the class took effect while the task is running (from a second terminal, with $PID set to the task's PID):

$ chrt -p $PID
pid 19233's current scheduling policy: SCHED_FIFO
pid 19233's current scheduling priority: 50
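A process can also request and inspect its own scheduling class, which is roughly what chrt does on its behalf. The following sketch uses Python's `os.sched_setscheduler` and `os.sched_getscheduler`, which are available on Linux. The helper names are invented for this example, and the promotion only succeeds with sufficient privileges (root or CAP_SYS_NICE), so the failure case is handled rather than assumed away.

```python
import os

POLICY_NAMES = {
    os.SCHED_OTHER: "SCHED_OTHER",
    os.SCHED_FIFO: "SCHED_FIFO",
    os.SCHED_RR: "SCHED_RR",
    os.SCHED_BATCH: "SCHED_BATCH",
    os.SCHED_IDLE: "SCHED_IDLE",
}

def current_policy(pid=0):
    """Return (policy name, priority) for a PID; 0 means the calling process."""
    policy = os.sched_getscheduler(pid) & ~os.SCHED_RESET_ON_FORK
    priority = os.sched_getparam(pid).sched_priority
    return POLICY_NAMES.get(policy, f"unknown({policy})"), priority

def try_promote_to_fifo(priority=50):
    """Try to move the calling process to SCHED_FIFO; return True on success."""
    try:
        os.sched_setscheduler(0, os.SCHED_FIFO, os.sched_param(priority))
    except PermissionError:
        return False  # not root and no CAP_SYS_NICE
    return True

name, prio = current_policy()
print(f"policy={name} priority={prio}")
```

Calling `try_promote_to_fifo()` at the top of jitter.py and then printing `current_policy()` would be one way to confirm the class from inside the task itself.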

This is what audio servers, trading engines, robotics control loops, and the kernel's own critical threads use. It is also a loaded gun: a SCHED_FIFO task that never yields (a real busy loop, not our deadline-sleeper) will monopolize its core and starve everything else scheduled there, including your shell if it lands on that core. That's why Linux ships kernel.sched_rt_runtime_us as a safety valve: the default is 950000 microseconds, so real-time tasks are capped at 950 ms of every 1 s period (kernel.sched_rt_period_us), leaving 50 ms for everyone else. With the valve disabled (set to -1), a runaway FIFO task holds its core until you kill it from another core or hard-reboot. Use real-time classes only for tasks that block (sleep, wait on I/O) regularly.
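Before experimenting with real-time classes, it's worth checking how the safety valve is set on your machine. This sketch reads the two sysctls through /proc/sys and turns them into a fraction. The function names are illustrative, and a runtime of -1 is treated as "throttling disabled".

```python
def rt_budget(runtime_us, period_us):
    """Fraction of each period real-time tasks may use, or None if unthrottled."""
    if runtime_us < 0:
        return None  # -1 disables real-time throttling
    return min(runtime_us, period_us) / period_us

def read_kernel_sysctl(name):
    """Read an integer from /proc/sys/kernel/<name>."""
    with open(f"/proc/sys/kernel/{name}") as f:
        return int(f.read())

# The defaults quoted above: 950000 us of runtime per 1000000 us period.
budget = rt_budget(950000, 1000000)
print(f"real-time tasks may use {budget:.0%} of each period")
```

On a live system you would pass in `read_kernel_sysctl("sched_rt_runtime_us")` and `read_kernel_sysctl("sched_rt_period_us")` instead of the hard-coded defaults.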

Healthy vs starved, side by side

HEALTHY (idle box or RT class):          STARVED (fair class under contention):
jitter worst:      0.3 ms                jitter worst:     47.8 ms
wait_max:          0.4 ms                wait_max:         48.2 ms
nr_involuntary:    ~3                    nr_involuntary:   ~1900

The tell is nr_involuntary_switches and wait_max in /proc/<pid>/sched. High involuntary switches + high wait_max + a CPU-saturated box = scheduling starvation, full stop. Now you can diagnose it in 30 seconds instead of blaming the network.

Clean up

$ kill $(pgrep -f 'while True')      # stop the hogs

Now you do it

  1. Reproduce the baseline (idle) jitter, then the starved jitter. Record both. The ratio is your starvation factor.
  2. While starved, read /proc/<pid>/sched for wait_max and confirm it matches your measured worst lateness. Proving the two agree is the skill.
  3. Try nice -n -20 (soft) vs chrt --fifo 50 (hard) and compare. Feel the difference between "biased fair" and "preempts everything."
  4. Cautiously, on a machine with at least 2 cores, pin a non-yielding busy loop to one core with taskset -c 0 chrt --fifo 90 ... and watch that core's other tasks freeze (use htop per-core view). With the default sched_rt_runtime_us they still get about 5% of the core, so expect them to crawl rather than stop completely. Then kill it. This teaches respect for real-time classes. Don't do this on a single-core VM you can't reboot.

What you might wonder

"Isn't 100% CPU the real problem - just add cores?" Sometimes. But the spike here isn't throughput starvation, it's latency starvation: even with one spare core, a fair scheduler can make a specific task wait through a full scheduling round. Adding cores reduces the odds but doesn't guarantee low tail latency the way a priority class does. For hard latency SLOs, you express priority, not just capacity.

"nice vs chrt - which in production?" nice/renice for soft preferences (batch should yield to interactive; a backup should yield to a database). chrt real-time classes only for genuine hard-real-time work that yields regularly (media, control loops) - and with care, because a misbehaving RT task is catastrophic. Most services need neither; reach for them when you've measured scheduling latency, exactly as we did.

"What changed with EEVDF (kernel 6.6+)?" EEVDF replaced CFS as the default fair scheduler. It picks tasks by an "eligibility + virtual deadline" model that improves latency for tasks that run briefly and often (like our deadline-sleeper) versus CPU hogs - so on a 6.6+ kernel your starved numbers may be somewhat better than shown, but the structure (fair = no guarantee, RT = preempt) and every command here are unchanged. The /proc/<pid>/sched fields are the same.

"How does this map to containers/Kubernetes?" cpu.weight (cgroup v2) is the cgroup-level equivalent of nice - it sets fair-share proportions between cgroups. Kubernetes CPU requests translate to cpu.weight; CPU limits translate to cpu.max (bandwidth throttling, a different mechanism - throttling makes a task wait even when CPU is idle, another latency-spike source worth its own investigation). The per-task /proc/<pid>/sched view here is how you debug inside a throttled container.

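For the throttling case mentioned above, cgroup v2 exposes counters in the cgroup's cpu.stat file, including nr_periods, nr_throttled, and throttled_usec. As a hedged sketch, the snippet below parses that file's "key value" lines and computes how often the cgroup hit its limit. The function names and the sample numbers are made up, and the path you read from depends on how your container runtime mounts the cgroup filesystem.

```python
def parse_cpu_stat(text):
    """Parse 'key value' lines from a cgroup v2 cpu.stat file into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats

def throttle_ratio(stats):
    """Fraction of enforcement periods in which the cgroup was throttled."""
    periods = stats.get("nr_periods", 0)
    return stats.get("nr_throttled", 0) / periods if periods else 0.0

# Hypothetical sample contents, not taken from a real system.
sample = """usage_usec 1200000
user_usec 900000
system_usec 300000
nr_periods 400
nr_throttled 100
throttled_usec 2500000
"""
stats = parse_cpu_stat(sample)
print(f"throttled in {throttle_ratio(stats):.0%} of periods")
```

A throttle ratio that climbs while latency spikes points to the CPU limit rather than fair-share contention.
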
What this gave you

  • You watched a latency-sensitive task degrade roughly 160x purely from scheduling fairness, on a box with "spare" CPU.
  • You read the proof in the scheduler's own books (/proc/<pid>/sched: wait_max, nr_involuntary_switches).
  • You fixed it softly with nice and hard with chrt/SCHED_FIFO, and you know when each is right.
  • You respect real-time classes and know the sched_rt_runtime_us safety valve and the never-yield footgun.
  • You can settle "is it scheduling?" with data in under a minute.

Back to the Week 7 chapter, or on to the syscall-tracing investigation.
