Process & Thread Scheduling (CFS)
Process & Thread Scheduling (CFS)
The Linux Completely Fair Scheduler decides which runnable thread runs next by tracking, for each thread, a single number called virtual runtime (vruntime) — the CPU time it has consumed, scaled by its weight — and always dispatching the thread that has run the least in that weighted sense. It stores runnable threads in a per-CPU red-black tree keyed on vruntime, so “pick the neediest thread” is just “take the leftmost node,” an O(1) read of a cached pointer.
To a first approximation CFS emulates an ideal fair CPU: a hypothetical machine that runs all N runnable threads simultaneously, each at 1/N speed. Real hardware can only run one thread per core at a time, so CFS runs the thread whose vruntime has fallen furthest behind the ideal, lets it catch up, then re-picks. The scheduling policy for ordinary threads is SCHED_OTHER; CFS is its implementation. (Note: kernels since 6.6 replaced CFS with EEVDF — covered under trade-offs — but the CFS model below is the foundation every Linux systems engineer must know, and its data structures live on.)
Why it matters
Before CFS (merged in 2.6.23, 2007), the O(1) scheduler used two priority arrays and a bag of interactivity heuristics — it estimated whether a task was “interactive” from its sleep history and boosted it. Those heuristics were fragile, gameable, and produced latency cliffs: a mislabeled task would stutter. CFS threw out heuristics entirely and replaced them with one invariant — equalize weighted CPU time. Interactivity falls out for free: a thread that sleeps on I/O barely advances its vruntime, so on wakeup it is far to the left of the tree and gets the CPU almost immediately. No special-casing, no cliffs. This is the difference between a desktop that stays responsive under a kernel compile and one that freezes.
Weights, vruntime, and the fairness invariant
Each thread has a nice value from −20 (highest priority) to +19 (lowest); the default is 0. Nice maps to a weight through a fixed kernel table (sched_prio_to_weight[]), calibrated so each nice level multiplies CPU share by about 1.25×. That is a multiplicative step, not a linear one — five steps compound to roughly 1.25⁵ ≈ 3×, not 5×10%=50%; treat “~10% per step” only as a loose gloss on the table, not an independent formula:
| nice | weight | meaning |
|---|---|---|
| −5 | 3121 | ~3× the CPU of nice 0 |
| 0 | 1024 | NICE_0_LOAD (the baseline) |
| +5 | 335 | ~1/3 the CPU of nice 0 |
| +19 | 15 | near-idle |
When a thread runs for a real interval delta_exec nanoseconds, CFS charges its virtual runtime by
delta_vruntime = delta_exec * (NICE_0_LOAD / weight)
= delta_exec * (1024 / weight)A high-weight (low-nice) thread's vruntime grows slowly, so it stays left in the tree and is picked more often. That is the entire priority mechanism — no priority levels, just a scaling factor on a clock. Fairness means all runnable threads advance vruntime at the same rate; that they do different amounts of wall-clock work to achieve it is exactly proportional-share scheduling.
Time slices: sched_latency and min_granularity
CFS does not use a fixed quantum. It targets a scheduling period (sysctl_sched_latency, default 6 ms) in which every runnable thread should run once. Each thread's slice is its weighted share of that period:
slice_i = sched_latency * weight_i / total_weightBut if there are many threads, dividing 6 ms among all of them yields slices so tiny that context-switch overhead dominates. So CFS floors each slice at sched_min_granularity (default 0.75 ms); once nr_running > sched_latency / min_granularity (≈ 8), the period stretches to min_granularity * nr_running instead. This is the knob that trades latency (small period, snappy) against throughput (fewer switches, more cache reuse).
A traced example: three threads, one core
Threads A and B are nice 0 (weight 1024); thread C is nice +5 (weight 335). Total weight = 1024 + 1024 + 335 = 2383. With nr_running = 3 (below the ~8 threshold), the period stays at sched_latency = 6 ms. Slices:
- A: 6 ms × 1024/2383 = 2.58 ms
- B: 6 ms × 1024/2383 = 2.58 ms
- C: 6 ms × 335/2383 = 0.84 ms
Now watch the vruntime charge each thread accrues for its own slice, using delta_vruntime = delta_exec × 1024/weight:
| thread | ran (wall) | vruntime charged |
|---|---|---|
| A | 2.58 ms | 2.58 × 1024/1024 = 2.58 ms |
| B | 2.58 ms | 2.58 × 1024/1024 = 2.58 ms |
| C | 0.84 ms | 0.84 × 1024/335 = 2.57 ms |
The punchline: all three advance their vruntime by ~2.58 ms per period, so they interleave fairly in vruntime space and none starves. But C did far less wall-clock work — 0.84 ms vs 2.58 ms — so over any window A and B each get ~43% of the CPU while C gets ~14% (335/2383). That is precisely the ~3× ratio its five-nice-level demotion demands, achieved with one multiply and no priority queues.
Preemption: how a running thread loses the CPU
Two triggers set the TIF_NEED_RESCHED flag, which forces a reschedule at the next safe point (return to userspace, or a kernel preemption point):
- Tick preemption. On each timer tick,
task_tick_fair()checks whether the current thread has consumed its slice (delta_exec > ideal_slice) or whether itsvruntimenow exceeds the tree's leftmost by more than its slice. If so, it is marked for preemption. - Wakeup preemption. When a sleeping thread wakes,
check_preempt_wakeup()compares the waker'svruntimeto the running thread's. If the woken thread is more thansched_wakeup_granularity(≈1 ms) to the left, it preempts immediately — this is what makes a keypress or an arriving packet feel instant.
Why CPU-bound and I/O-bound threads behave differently
This is the whole payoff of the vruntime model, and it emerges without any “is this interactive?” test.
- A CPU-bound thread (video encode, kernel build) runs whenever scheduled, so its
vruntimeclimbs steadily. It keeps drifting rightward in the tree, spending most of its time waiting behind others — exactly what you want for a background hog. - An I/O-bound thread (shell, database connection, event loop) runs for microseconds then blocks on
read()/epoll_wait(). While asleep it accrues zero vruntime, so it falls far behind. On wakeup it is the leftmost node and is dispatched at once, giving it low latency despite tiny total CPU usage.
To stop a thread that slept for minutes from waking with a near-zero vruntime and monopolizing the CPU to “catch up,” CFS tracks min_vruntime — a monotonic floor equal to the smallest vruntime in the tree. In place_entity(), a waking thread's vruntime is clamped to about min_vruntime − sched_latency/2. That's sleeper credit: enough of a head start to preempt and get serviced quickly, but bounded so it cannot hog. One number gives you both interactivity and anti-starvation.
Context-switch cost — why min_granularity exists
Every preemption has a price, and it is larger than the visible part. The direct cost is saving/restoring registers, running the scheduler, and switching the address space; a same-address-space thread switch is ~1 µs on modern x86, a cross-process switch more, because it reloads CR3 and may flush the TLB (mitigated by PCID/ASID tagging, and worsened by KPTI page-table isolation after Meltdown).
The indirect cost usually dominates: the incoming thread finds cold L1/L2/LLC caches and a cold TLB, and stalls while it rebuilds its working set from memory — this can be tens of microseconds of lost throughput that never shows up in a ctxt counter. Measure switches with perf sched, vmstat (cs column), or pidstat -w; the cache effects need perf stat cache-miss counters. This overhead is the reason CFS floors slices at min_granularity: below ~0.75 ms, you'd spend more time switching and refilling caches than computing.
Run queues and multicore
There is no global scheduler lock. Each CPU owns a struct rq containing its own cfs_rq (red-black tree), so scheduling decisions are local and cheap. A periodic and idle-triggered load balancer migrates threads across CPUs to equalize load, guided by scheduling domains that model the cache/NUMA topology — it prefers to keep a thread on a CPU that still holds its warm cache, and pulls across a NUMA node only when the imbalance is worth the memory-latency penalty. Because each runqueue has its own min_vruntime, a migrated thread's vruntime is renormalized (subtract the source's min_vruntime, add the destination's) so it neither starves nor hogs on arrival.
Pitfalls a working engineer actually hits
- CFS bandwidth throttling in containers. Kubernetes/cgroup
cpu.cfs_quota_us+cpu.cfs_period_us(or a Kubernetes CPU limit) cap how much CPU a cgroup gets per 100 ms period. Burn the quota early and every thread in the cgroup is throttled — frozen until the next period — producing p99 latency spikes even when the node is idle. Diagnose vianr_throttled/throttled_timeincpu.stat. This is one of the most common surprise-latency bugs in production; often the fix is to remove the CPU limit and rely on requests (shares) instead. - nice is per-runqueue, not global. Weights arbitrate among threads on the same CPU; global fairness is only approximated by the load balancer. Renicing won't fix a multi-thread imbalance the balancer hasn't corrected.
- Never compare raw vruntime values.
vruntimeis a 64-bit counter that can wrap; the kernel compares with signed subtraction ((s64)(a − b) < 0), nota < b. Ordering, not magnitude. - sched_yield() is nearly useless under CFS. It just re-inserts you into the tree; if you're still leftmost you get the CPU right back. Spinning “politely” with yield is an anti-pattern — block on the real event instead.
- cgroup group scheduling changes the arithmetic. With
cpu.shares, weights compose hierarchically: 100 threads in a low-share cgroup collectively get that cgroup's slice, so per-thread fairness is nested, not flat.
Trade-offs & when to use it — vs named alternatives
CFS (SCHED_OTHER) vs the O(1) scheduler (its predecessor). O(1) picked next in constant time via a 140-bucket priority bitmap and active/expired arrays, but paid for it with brittle interactivity heuristics. CFS is O(log n) for enqueue/dequeue (the red-black tree) instead of O(1), yet wins decisively because a single fairness metric is robust and un-gameable where hand-tuned heuristics were not. You accept a logarithmic-per-operation cost to delete an entire class of latency bugs.
CFS vs EEVDF (its successor, Linux 6.6+). CFS guarantees only proportional share: it equalizes vruntime, but it has no notion of a deadline, so it cannot promise “this thread runs within 2 ms of becoming runnable” — the sleeper-fairness placement in place_entity() is a heuristic clamp, not a bound, and can still let a newly-woken latency-sensitive thread wait behind a thread that is merely less-far-left. EEVDF (Earliest Eligible Virtual Deadline First) keeps the same weight/vruntime accounting but adds two ideas CFS lacks: (1) eligibility — a thread only competes for the CPU once its virtual runtime is behind its fair share (lag ≥ 0), which bounds how far any one thread can get ahead; and (2) a per-wakeup virtual deadline = vruntime + requested_slice/weight, with the scheduler always picking the eligible thread whose deadline is earliest. A per-thread latency_nice lets a thread request a shorter slice (tighter deadline, more preemptions, lower latency) or a longer one (fewer switches, more throughput) independent of its CPU-share nice value — a knob CFS never had, since CFS conflated “how much CPU” and “how soon” into one number. That is why EEVDF replaced CFS: same fairness core, plus explicit, tunable latency control and provably bounded lag instead of a best-effort heuristic.
When CFS-style vruntime fairness is (and isn't) the right model. Vruntime/weighted-fair-queuing is the right choice whenever you want proportional CPU shares among many threads with no thread specifying explicit timing needs — general-purpose desktops, app servers, build farms. It's the wrong model when: (a) a thread has a real-time deadline (missing it is a correctness failure, not a slowdown) — use SCHED_DEADLINE (EDF with admission control) or SCHED_FIFO/SCHED_RR, which always preempt SCHED_OTHER and give no fairness guarantee at all, by design; (b) you need strict, low-overhead turn-taking among a small, fixed set of equal-priority tasks — plain round-robin with a fixed quantum is simpler and has O(1), heuristic-free behavior, at the cost of no weighting and no sleeper credit; (c) you need priority to strictly dominate (a lower-priority thread must never run while a higher one is runnable) — classic fixed-priority preemptive scheduling (as in RTOSes) is a cleaner fit than weight-based sharing, which by construction still lets low-weight threads make some progress. The cost of picking vruntime fairness when you actually needed a deadline guarantee is silent, occasional latency spikes that no amount of nice-tuning removes.
CFS/EEVDF vs realtime classes (SCHED_FIFO / SCHED_RR), the practical rule. Use CFS/EEVDF for general-purpose, throughput-plus-interactivity mixes (99% of workloads); reach for RT/DEADLINE only when a missed deadline is a correctness failure, not just a slowdown. The cost of RT is that a runaway RT thread can starve everything else on that CPU — hence the kernel throttles it via sched_rt_runtime_us by default.
Takeaways
- CFS reduces scheduling to one invariant: equalize
vruntime = ∫ delta_exec × 1024/weight. Priority is a scaling factor on a clock, not a queue level. - Interactivity is emergent: sleepers accrue no vruntime, wake near the
min_vruntimefloor, and preempt — no heuristic needed. - The real cost of preemption is cold caches/TLB, not the register swap;
min_granularityexists to keep that overhead bounded. - CFS gives fair shares with only heuristic latency control; EEVDF adds eligibility and virtual deadlines for bounded lag and tunable latency — that gap is why EEVDF replaced CFS, and why RT/DEADLINE classes exist alongside both for hard-deadline work.
Recall question
Two threads share a core: thread X is nice 0 and CPU-bound; thread Y is nice 0 but sleeps on I/O for 50 ms, then runs 200 µs, repeatedly. Why does Y get near-instant CPU on every wakeup even though both have identical weight — and what single mechanism stops Y from monopolizing the CPU if it had instead slept for a full minute?
Re-authored/deepened for this guide from: Linux kernel source (kernel/sched/fair.c, place_entity, sched_prio_to_weight[]); Robert Love, Linux Kernel Development (3rd ed., ch. 4); Bovet & Cesati, Understanding the Linux Kernel; Brendan Gregg, Systems Performance (2nd ed., scheduler & context-switch analysis); Peter Zijlstra's EEVDF patch series and LWN.net coverage (“An EEVDF CPU scheduler for Linux,” 2023); Ingo Molnar's original CFS design notes (Documentation/scheduler/sched-design-CFS.rst).
🤖 Don't fully get this? Learn it with Claude
Stuck on Process & Thread Scheduling (CFS)? Open Claude, copy a block below, and it'll teach you this exact concept — visually and interactively.
Build the mental picture, not memorization.
I just read a lesson on **Process & Thread Scheduling (CFS)** (System Design) and want to truly understand it. Explain Process & Thread Scheduling (CFS) from first principles using ONE vivid real-world analogy and a visual mental model — draw it as ASCII art or a clear step-by-step diagram — with a concrete example using real numbers. Then ask me one question to check I got the mental picture, and wait for my reply. If you're unsure or a claim isn't standard, say so and reason from first principles instead of guessing.
Socratic — adapts to where you're stuck.
Teach me **Process & Thread Scheduling (CFS)** interactively. Ask me ONE guiding question at a time, wait for my answer, and adapt to my confusion — build the idea with me step by step instead of explaining it all at once. If you're unsure or a claim isn't standard, say so and reason from first principles instead of guessing.
Active recall exposes what you missed.
Quiz me on **Process & Thread Scheduling (CFS)** with 5 questions, easy to tricky, ONE at a time. Tell me if each answer is right; at the end, explain clearly what I got wrong and why. If you're unsure or a claim isn't standard, say so and reason from first principles instead of guessing.
Intuition + hook + flashcards for long-term memory.
Help me remember **Process & Thread Scheduling (CFS)** for the long term: give the one-sentence intuition, a memorable hook/mnemonic, a tiny worked example, and 3 active-recall flashcards (Q -> A). If you're unsure or a claim isn't standard, say so and reason from first principles instead of guessing.