Worked investigation - Catch the scheduler starving a task¶
Companion to Linux Kernel -> Month 02 -> Week 7: The CPU Scheduler (CFS, EEVDF). The chapter explains that the scheduler picks tasks by virtual runtime. This page makes you watch a latency-sensitive task get starved by a CPU hog, see it in the scheduler's own accounting, and fix it with the exact knob a production engineer reaches for. ~40 minutes, any multi-core Linux box.
The symptom you're learning to diagnose¶
Your latency-sensitive service - a request handler, an audio thread, a control loop - gets intermittent latency spikes. CPU isn't even at 100%. But every so often a response that should take 2 ms takes 80 ms. Someone says "the box has spare CPU, it can't be scheduling." They're wrong, and you're going to prove it.
The CPU scheduler is fair by default - and fairness is exactly the problem when one task matters more than the others. We'll reproduce the starvation and then tell the kernel which task matters.
Step 0: the one fact that explains it¶
The default Linux scheduler (CFS, now EEVDF on 6.6+) is a fair scheduler. Its goal is to give every runnable task an equal share of CPU over time. It does not know that your request handler matters more than a batch job. If you have more runnable tasks than cores, fairness means your important task waits its turn - even if the box has "spare" CPU in aggregate, because aggregate idle time and per-task scheduling latency are different things.
The knobs that change this: niceness (a soft priority hint within the fair scheduler) and real-time scheduling classes (SCHED_FIFO/SCHED_RR, which preempt fair tasks entirely). You'll use both.
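Before turning either knob, it helps to see where a process currently stands. The following is a minimal sketch, assuming a Linux system where Python exposes the scheduling calls in the os module; the helper name describe is made up for illustration.

```python
import os

# Map the policy constants this Python build exposes to readable names.
POLICY_NAMES = {
    getattr(os, name): name
    for name in ("SCHED_OTHER", "SCHED_FIFO", "SCHED_RR", "SCHED_BATCH", "SCHED_IDLE")
    if hasattr(os, name)
}


def describe(pid: int = 0) -> str:
    """Return the nice value and scheduling policy of a process (0 = self)."""
    nice = os.getpriority(os.PRIO_PROCESS, pid)   # the "soft" knob
    policy = os.sched_getscheduler(pid)           # the scheduling class
    return f"nice={nice} policy={POLICY_NAMES.get(policy, policy)}"


print(describe())
```

An ordinary process normally reports the default fair policy (SCHED_OTHER); the later steps change first the nice value and then the policy itself.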
Step 1: set up a "latency-sensitive" task you can measure¶
We need a task that does a tiny bit of work on a fixed interval and reports how late each cycle was. Save as jitter.py:
import time
target = 0.005  # want to wake every 5 ms
worst = 0.0
next_t = time.perf_counter() + target
for _ in range(2000):
    while time.perf_counter() < next_t:
        pass  # busy-wait to the deadline
    now = time.perf_counter()
    late = (now - next_t) * 1000  # ms late
    worst = max(worst, late)
    next_t += target
print(f"worst wakeup lateness: {worst:.1f} ms")
Run it alone on an idle machine - the baseline:
$ python3 jitter.py
worst wakeup lateness: 0.3 ms
0.3 ms - the scheduler wakes it almost exactly on time. This is "healthy." Remember the number.
Step 2: starve it - flood the machine with CPU hogs¶
Now create contention: spawn one busy loop per core so every CPU is already occupied, then run the latency task on top of them. That leaves nproc + 1 runnable tasks competing for nproc cores:
$ NPROC=$(nproc)
$ for i in $(seq $NPROC); do python3 -c 'while True: pass' & done
$ python3 jitter.py
worst wakeup lateness: 47.8 ms
There it is. Same task, same code - but worst-case wakeup lateness went from 0.3 ms to 47.8 ms, a degradation of more than 150x. And notice (top in another terminal) that the machine is at 100% but not overwhelmed: it is running exactly nproc busy loops plus your task. Raw CPU capacity isn't the bottleneck; scheduling fairness is. Your task is runnable, but it waits in line behind equal-priority hogs. This is the spike the "we have spare CPU" person can't explain.
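The gap between average share and worst-case lateness can be put into numbers. This is a small sketch under an idealized assumption (equal weights and perfectly even load balancing); the function names and the 8-core machine size are illustrative, while 0.3 ms and 47.8 ms are the measurements from above.

```python
def average_share(cores: int, runnable_tasks: int) -> float:
    """Long-run fraction of one core per task, assuming equal weights
    and perfectly even load balancing (an idealization)."""
    return min(1.0, cores / runnable_tasks)


def degradation(baseline_ms: float, starved_ms: float) -> float:
    """How many times worse the starved worst-case lateness is."""
    return starved_ms / baseline_ms


cores = 8  # hypothetical machine size; substitute your own nproc
print(f"average share per task: {average_share(cores, cores + 1):.0%}")
print(f"worst-case lateness grew {degradation(0.3, 47.8):.0f}x")
```

Each task still averages most of a core, yet the worst single wait is two orders of magnitude longer. An average share says nothing about the tail.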
Step 3: see it in the scheduler's own books¶
Don't take the jitter number on faith - read the scheduler's per-task accounting. Find your task's PID and look at /proc/<pid>/schedstat and /proc/<pid>/sched:
$ python3 jitter.py & PID=$!
$ cat /proc/$PID/schedstat
4823914 89412305 1872
# ^cpu-time ^WAIT-time ^timeslices
Three numbers: nanoseconds running, nanoseconds waiting on a runqueue (ready but not picked), and number of timeslices. Under contention, the middle number - time spent runnable but waiting - balloons. That wait time is your latency. The richer view:
$ cat /proc/$PID/sched | grep -E 'wait|nr_switches|nr_involuntary'
se.statistics.wait_max : 48.213000 # worst single wait, ms - matches our jitter!
se.statistics.wait_sum : 3917.402000
nr_involuntary_switches : 1922 # times it was kicked off CPU against its will
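wait_max of ~48 ms matches the jitter we measured - the scheduler is telling you the same story from its own ledger. nr_involuntary_switches counts how often the task was preempted before it wanted to yield. This is how you prove "it's scheduling" instead of arguing about it. If the wait_* lines are absent, scheduler statistics are probably disabled; on most kernels they can be turned on with sudo sysctl kernel.sched_schedstats=1.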
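The two files read in this step are easy to parse if you want to log them over time instead of eyeballing them. The sketch below assumes the layouts shown above (three integers in schedstat, "name : value" lines in sched); the helper names are illustrative, and the demo runs on the sample values from this page rather than on a live PID.

```python
import re
from typing import NamedTuple


class SchedStat(NamedTuple):
    run_ms: float      # time spent executing on a CPU
    wait_ms: float     # time spent runnable but waiting on a runqueue
    timeslices: int    # number of timeslices run


def parse_schedstat(text: str) -> SchedStat:
    """Parse the three whitespace-separated fields of /proc/<pid>/schedstat."""
    run_ns, wait_ns, slices = (int(field) for field in text.split())
    return SchedStat(run_ns / 1e6, wait_ns / 1e6, slices)


def parse_sched(text: str) -> dict:
    """Parse 'name : number' lines of /proc/<pid>/sched into a dict."""
    fields = {}
    for line in text.splitlines():
        match = re.match(r"\s*([\w.]+)\s*:\s*(-?\d+(?:\.\d+)?)\s*$", line)
        if match:
            # Keep only the last dotted component so 'se.statistics.wait_max'
            # and a bare 'wait_max' both land under the key 'wait_max'.
            fields[match.group(1).rsplit(".", 1)[-1]] = float(match.group(2))
    return fields


def read_proc(pid: int, name: str) -> str:
    """Read /proc/<pid>/<name>; only meaningful on a live Linux system."""
    with open(f"/proc/{pid}/{name}") as handle:
        return handle.read()


# Demo on the sample values shown above rather than a live PID.
stat = parse_schedstat("4823914 89412305 1872")
sched = parse_sched(
    "se.statistics.wait_max : 48.213000\n"
    "se.statistics.wait_sum : 3917.402000\n"
    "nr_involuntary_switches : 1922\n"
)
waiting = stat.wait_ms / (stat.run_ms + stat.wait_ms)
print(f"waited {stat.wait_ms:.1f} ms of {stat.run_ms + stat.wait_ms:.1f} ms runnable ({waiting:.0%})")
print(f"wait_max {sched['wait_max']:.1f} ms, "
      f"involuntary switches {int(sched['nr_involuntary_switches'])}")
```

For a live process, pass read_proc(pid, "schedstat") and read_proc(pid, "sched") to the two parsers.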
Step 4: the soft fix - niceness¶
nice shifts a task's weight within the fair scheduler. Lower nice = higher priority. Renice the hogs to the weakest fair priority (+19) and launch your task at the strongest (-20):
$ for p in $(pgrep -f 'while True'); do sudo renice 19 $p >/dev/null; done # hogs: low priority
$ sudo nice -n -20 python3 jitter.py # task: high priority
worst wakeup lateness: 6.2 ms
From 47.8 ms to 6.2 ms - much better. Niceness rebalanced the fair shares so your task gets CPU sooner. But 6.2 ms is still not the 0.3 ms baseline: niceness is a bias, not a guarantee. A fair scheduler with a strong bias still occasionally makes the favored task wait. For soft preferences (a build that should yield to interactive work) niceness is perfect. For a hard latency requirement, you need to leave the fair class entirely.
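To see why niceness is "a bias, not a guarantee," it helps to look at the weights behind it. The sketch below uses three entries from the kernel's nice-to-weight table and assumes every task sharing a core is always runnable; the function name is illustrative.

```python
# Three entries from the kernel's nice-to-weight table (sched_prio_to_weight).
NICE_TO_WEIGHT = {-20: 88761, 0: 1024, 19: 15}


def long_run_share(task_nice: int, competitor_nices: list) -> float:
    """Fraction of one core the task gets over time when it shares that
    core with the given competitors, all of them always runnable."""
    weight = NICE_TO_WEIGHT[task_nice]
    total = weight + sum(NICE_TO_WEIGHT[n] for n in competitor_nices)
    return weight / total


print(f"equal nice:      {long_run_share(0, [0]):.2%}")
print(f"-20 vs one +19:  {long_run_share(-20, [19]):.2%}")
```

The weights fix a long-run proportion of CPU time. They do not bound how long any single wait can be, which is why the favored task can still be a few milliseconds late.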
Step 5: the hard fix - a real-time scheduling class¶
SCHED_FIFO is a real-time class: a FIFO task preempts all fair (normal) tasks and runs until it blocks, yields, or is preempted by a higher-priority real-time task. Promote the latency task with chrt:
$ sudo chrt --fifo 50 python3 jitter.py
worst wakeup lateness: 0.4 ms
Back to baseline - 0.4 ms even with every core pegged by hogs. The FIFO task jumps the entire fair queue whenever it's runnable. Confirm the class took effect:
$ chrt -p $PID
pid 19233's current scheduling policy: SCHED_FIFO
pid 19233's current scheduling priority: 50
This is what audio servers, trading engines, robotics control loops, and the kernel's own critical threads use. It is also a loaded gun: a SCHED_FIFO task that never blocks or yields will starve every fair task on its core, including your shell - you'd need another core or a hard reboot to recover. That's why Linux ships kernel.sched_rt_runtime_us (default 950000: real-time tasks are capped at 950 ms of every 1 s, leaving 50 ms for everyone else) as a safety valve. Use real-time classes only for tasks that block (sleep, wait on I/O) regularly. Note that jitter.py as written is not such a task: it busy-waits between deadlines, so under SCHED_FIFO it holds its core for the whole run of about 10 s (2000 cycles x 5 ms), which is tolerable only because it then exits.
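The same promotion can be done from inside a program, and the safety valve is simple arithmetic. The sketch below assumes Linux and the os.sched_setscheduler call; both helper names are illustrative. The block only defines promote_to_fifo and does not call it, since it needs root or CAP_SYS_NICE and changes the calling process.

```python
import os


def rt_budget_fraction(runtime_us: int, period_us: int) -> float:
    """Fraction of each period that real-time tasks may consume.

    The two values correspond to kernel.sched_rt_runtime_us and
    kernel.sched_rt_period_us; a runtime of -1 disables the cap.
    """
    if runtime_us < 0:
        return 1.0
    return runtime_us / period_us


def promote_to_fifo(priority: int = 50) -> bool:
    """Try to switch the calling process to SCHED_FIFO.

    Returns False instead of raising when the process lacks permission.
    """
    try:
        os.sched_setscheduler(0, os.SCHED_FIFO, os.sched_param(priority))
        return True
    except PermissionError:
        return False


# Defaults quoted above: 950 ms of real-time budget in every 1 s.
print(f"real-time budget: {rt_budget_fraction(950_000, 1_000_000):.0%}")
```

A program that promotes itself this way should still sleep or block between bursts of work, for the reasons given above.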
Healthy vs starved, side by side¶
HEALTHY (idle box or RT class): STARVED (fair class under contention):
jitter worst: 0.3 ms jitter worst: 47.8 ms
wait_max: 0.4 ms wait_max: 48.2 ms
nr_involuntary: ~3 nr_involuntary: ~1900
The tell is nr_involuntary_switches and wait_max in /proc/<pid>/sched. High involuntary switches + high wait_max + a CPU-saturated box = scheduling starvation, full stop. Now you can diagnose it in 30 seconds instead of blaming the network.
Clean up¶
Stop the background hogs from Step 2 once you are done, for example with pkill -f 'while True' (the same pattern Step 4 matched with pgrep).
Now you do it¶
- Reproduce the baseline (idle) jitter, then the starved jitter. Record both. The ratio is your starvation factor.
- While starved, read /proc/<pid>/sched for wait_max and confirm it matches your measured worst lateness. Proving the two agree is the skill.
- Try nice -n -20 (soft) vs chrt --fifo 50 (hard) and compare. Feel the difference between "biased fair" and "preempts everything."
- Cautiously, on a machine with at least 2 cores, pin a non-yielding busy loop to one core with taskset -c 0 chrt --fifo 90 ... and watch that core's other tasks freeze (use htop per-core view). Then kill it. This teaches respect for real-time classes. Don't do this on a single-core VM you can't reboot.
What you might wonder¶
"Isn't 100% CPU the real problem - just add cores?" Sometimes. But the spike here isn't throughput starvation, it's latency starvation: even with one spare core, a fair scheduler can make a specific task wait through a full scheduling round. Adding cores reduces the odds but doesn't guarantee low tail latency the way a priority class does. For hard latency SLOs, you express priority, not just capacity.
"nice vs chrt - which in production?" nice/renice for soft preferences (batch should yield to interactive; a backup should yield to a database). chrt real-time classes only for genuine hard-real-time work that yields regularly (media, control loops) - and with care, because a misbehaving RT task is catastrophic. Most services need neither; reach for them when you've measured scheduling latency, exactly as we did.
"What changed with EEVDF (kernel 6.6+)?" EEVDF replaced CFS as the default fair scheduler. It picks tasks by an "eligibility + virtual deadline" model that improves latency for tasks that run briefly and often (like our deadline-sleeper) versus CPU hogs - so on a 6.6+ kernel your starved numbers may be somewhat better than shown, but the structure (fair = no guarantee, RT = preempt) and every command here are unchanged. The /proc/<pid>/sched fields are the same.
"How does this map to containers/Kubernetes?" cpu.weight (cgroup v2) is the cgroup-level equivalent of nice - it sets fair-share proportions between cgroups. Kubernetes CPU requests translate to cpu.weight; CPU limits translate to cpu.max (bandwidth throttling, a different mechanism - throttling makes a task wait even when CPU is idle, another latency-spike source worth its own investigation). The per-task /proc/<pid>/sched view here is how you debug inside a throttled container.
What this gave you¶
- You watched a latency-sensitive task degrade 150x purely from scheduling fairness, on a box with "spare" CPU.
- You read the proof in the scheduler's own books (/proc/<pid>/sched: wait_max, nr_involuntary_switches).
- You fixed it softly with nice and hard with chrt/SCHED_FIFO, and you know when each is right.
- You respect real-time classes and know the sched_rt_runtime_us safety valve and the never-yield footgun.
- You can settle "is it scheduling?" with data in under a minute.
Back to the Week 7 chapter, or on to the syscall-tracing investigation.