
Syscalls & the User/Kernel Boundary

A system call works because the CPU physically runs in one of two privilege levels, and the only sanctioned way to cross from the low-privilege one to the high-privilege one is a hardware trap: a single instruction (syscall on x86-64) that atomically raises the privilege level and redirects the instruction pointer to a kernel-controlled entry point, handing control to code the application can never jump into directly. (On x86-64 the instruction itself does not switch stacks; the kernel's entry code does that immediately afterwards.) Everything about syscall cost and syscall batching falls out of that one fact.

User mode vs kernel mode: why the wall exists

An x86-64 core is always executing at a current privilege level (CPL), one of four rings; in practice operating systems use only two: ring 3 (user mode) and ring 0 (kernel mode). The ring is not a software convention — it is bits in a hardware register the CPU checks on every privileged operation. In ring 3 the core will fault if you try to touch a page marked supervisor-only, execute HLT, reprogram the MMU, disable interrupts, or talk to a device port. Those powers exist only in ring 0.

This wall is what makes a multi-tenant machine possible. Your process cannot read another process's memory, corrupt the page tables, or monopolise the CPU, because the hardware refuses, and refusing means trapping into the kernel, which decides what happens next. The kernel is the one program trusted with ring-0 powers; every other program must ask it to act on its behalf. That request is a system call.

The trap mechanism, concretely

Say your program calls read(fd, buf, 4096). Glibc's wrapper does not contain the read logic — it marshals arguments into registers and executes one syscall instruction. On Linux x86-64 the contract is fixed by the ABI:

rax = syscall number (read is 0, write is 1, openat is 257)
arguments go in rdi, rsi, rdx, r10, r8, r9 — in that order
the return value comes back in rax (a negative value in -4095..-1 is an errno)
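
The register contract above can be poked at from Python. This is a minimal sketch, not how glibc is implemented: it assumes Linux on x86-64, where getpid is syscall number 39, and it goes through libc's generic syscall() function rather than executing the instruction directly. The helper names decode_raw_return and raw_getpid are made up for this illustration.

```python
import ctypes
import errno
import os
import platform

# Syscall numbers are per-architecture. 39 is getpid on Linux x86-64
# (the text's examples: read = 0, write = 1, openat = 257).
SYS_GETPID_X86_64 = 39


def decode_raw_return(rax):
    """Interpret a raw kernel return value the way a libc wrapper would.

    Values in -4095..-1 are negated errno codes; anything else is the result.
    Returns (value, None) on success or (None, errno_name) on failure.
    """
    if -4095 <= rax <= -1:
        return None, errno.errorcode.get(-rax, "errno %d" % -rax)
    return rax, None


def raw_getpid():
    """Call getpid by number through libc's syscall(); None off x86-64 Linux."""
    if platform.system() != "Linux" or platform.machine() != "x86_64":
        return None  # the hard-coded number would be wrong elsewhere
    libc = ctypes.CDLL(None, use_errno=True)
    return libc.syscall(SYS_GETPID_X86_64)


if __name__ == "__main__":
    print(decode_raw_return(-2))   # ENOENT is errno 2
    print(decode_raw_return(6))    # a successful 6-byte read
    print(raw_getpid(), os.getpid())
```

Note that libc's syscall() already converts the raw negative range into a -1 return plus errno, so decode_raw_return only models what happens to the value left in rax by the kernel.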

When syscall executes, the CPU does several things in one shot, in microcode: it loads the kernel entry address from the model-specific register IA32_LSTAR into rip, saves the old user rip into rcx and the flags into r11, and switches CPL to ring 0. It does not automatically switch the stack — the kernel entry stub (entry_SYSCALL_64) does that by swapping to the per-CPU kernel stack via the swapgs/GSBASE trick, then pushes a full register frame. Only now does ordinary C kernel code run: it validates rax against the syscall table, checks that buf points into the caller's address space (never trusting a user pointer), and dispatches to ksys_read. On the way out, sysret restores rip from rcx, flags from r11, drops back to ring 3, and your wrapper returns.

The older path, int 0x80, did the same job through the interrupt descriptor table and is noticeably slower. The ratio is not a fixed constant: it varies with microarchitecture and with whether Meltdown/Spectre mitigations are active, but it is typically a small single-digit multiple on modern chips. syscall/sysret was added precisely to make the common case cheap, and that is the mechanism worth remembering, not a specific number.

What a syscall actually costs vs a function call

A normal function call is a call/ret pair — a few cycles, ~1 ns, all in ring 3. A syscall is not "a slightly bigger function call"; it is a privilege transition, and the cost splits into two parts that people routinely conflate:

Direct cost: the mode switch. The microcode transition, register save/restore, stack swap, and pointer/permission checks. On modern x86-64 this is roughly 100–300 ns for a trivial syscall like getpid(). Meltdown/Spectre mitigations made this worse: KPTI (kernel page-table isolation) maintains separate page tables for user and kernel, so each crossing switches CR3, and on CPUs without PCID support that switch also flushes the TLB, which can add hundreds of nanoseconds.
Indirect cost: the pollution aftermath. This is the part that surprises people. The kernel runs on your core, evicting your L1/L2 cache lines, TLB entries, and branch-predictor state. When control returns to your code, it stalls on cold caches. Soares & Stumm's FlexSC (OSDI 2010) reported, in their benchmark suite (Apache, MySQL, BIND and a microbenchmark) on 2010-era hardware that predates KPTI, a direct switch cost of roughly 100–150 cycles, while the indirect penalty (degraded user-mode IPC and cache behaviour after the syscall) ran into the thousands of cycles for some of those workloads. Those are workload-dependent measurements from one paper, not a universal constant, but the qualitative point holds broadly: the mode switch you can see; the pollution you cannot, and it is often the larger bill.

So the rule of thumb is not "a syscall is expensive" but "a syscall costs on the order of 100x a function call (about 100–300 ns against about 1 ns, using the figures above), and it makes your next few thousand instructions slower." A function that does one syscall per call, invoked in a hot loop, is a performance smell.

Why batching syscalls matters: readv and io_uring

If each crossing is expensive and pollutes caches, the fix is structural: cross fewer times, do more work per crossing. Two mechanisms embody this at different scales.

Scatter/gather: readv and writev

Instead of four write() calls to emit a header, two body chunks, and a trailer, writev(fd, iov, 4) passes an array of struct iovec (pointer + length pairs) in one crossing; the kernel walks the vector and writes all four regions in order. That is one mode switch instead of four, and the kernel sees the whole payload at once rather than in four separate pieces. This is why HTTP servers commonly assemble responses with writev rather than concatenating buffers first: it avoids both the extra crossings and the memory copy.
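
A small sketch of the gather write described above, using Python's os.writev (a thin wrapper over the writev syscall on Unix). The function names send_response and demo, and the use of a pipe as the destination, are assumptions made for the illustration; a real server would write to a socket.

```python
import os


def send_response(fd, header, chunks, trailer):
    """Emit header, body chunks, and trailer with a single writev call."""
    buffers = [header, *chunks, trailer]
    # One crossing: the kernel walks the list and writes each buffer in order.
    # Like write, writev may return a short count; a robust caller would loop.
    return os.writev(fd, buffers)


def demo():
    """Write four separate buffers into a pipe and read them back."""
    read_end, write_end = os.pipe()
    try:
        written = send_response(
            write_end,
            b"HTTP/1.1 200 OK\r\n\r\n",
            [b"hello, ", b"world"],
            b"\n",
        )
        return written, os.read(read_end, 4096)
    finally:
        os.close(read_end)
        os.close(write_end)


if __name__ == "__main__":
    written, data = demo()
    print(written, data)
```

The four buffers are never concatenated in user space; the reader simply sees them back to back.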

Amortising to near-zero: io_uring

io_uring goes further and attacks the per-operation crossing itself. It sets up two ring buffers in memory shared between user space and kernel: a submission queue (SQ) and a completion queue (CQ). To issue I/O you write submission entries directly into the SQ ring — no syscall — and to reap results you read the CQ ring — no syscall. A single io_uring_enter() call can submit hundreds of queued operations and collect hundreds of completions at once. With SQ-polling mode a dedicated kernel thread continuously drains the SQ ring in the background, so a busy server can keep submitting I/O by only writing to shared memory — no io_uring_enter call needed on the steady-state hot path at all, as long as the kernel poll thread stays awake within its idle timeout. The boundary crossing, once per operation, becomes one crossing per batch — or, under SQPOLL, effectively none while the workload keeps the ring fed.

The difference for reading 1000 small records: with blocking read() that is 1000 crossings (~100–300 µs of pure switch tax at 100–300 ns per crossing, plus pollution). With plain io_uring you queue 1000 SQEs (given a ring sized for at least that many entries) and issue one io_uring_enter, a single crossing, then harvest completions from the CQ ring with zero further syscalls. With SQPOLL enabled, even that one io_uring_enter call is unnecessary once the poll thread is running: you write the 1000 SQEs to shared memory and the kernel thread picks them up on its own.
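
The crossing counts in that comparison can be made concrete with a toy model. To be clear about assumptions: this is a pure-Python simulation that only counts simulated boundary crossings. It is not the io_uring API, it performs no I/O, and the ToyRing class and helper functions are invented for this illustration. The per-crossing cost passed to switch_tax_us is the 100–300 ns estimate quoted earlier in the text.

```python
from collections import deque


class ToyRing:
    """Toy model of submission/completion queues; NOT the real io_uring API."""

    def __init__(self):
        self.sq = deque()    # submission queue, written by "user space"
        self.cq = deque()    # completion queue, written by the "kernel"
        self.crossings = 0   # simulated user/kernel boundary crossings

    def queue_read(self, record_id):
        # Queuing is only a write to shared memory: no crossing.
        self.sq.append(record_id)

    def enter(self):
        # One simulated io_uring_enter: the kernel drains the whole SQ.
        self.crossings += 1
        while self.sq:
            self.cq.append((self.sq.popleft(), "done"))

    def reap(self):
        # Reading completions is again only a shared-memory read.
        results = list(self.cq)
        self.cq.clear()
        return results


def blocking_crossings(n_records):
    """One blocking read() per record means one crossing per record."""
    return n_records


def switch_tax_us(crossings, ns_per_crossing):
    """Direct switch cost in microseconds, ignoring cache pollution."""
    return crossings * ns_per_crossing / 1000.0


if __name__ == "__main__":
    ring = ToyRing()
    for record in range(1000):
        ring.queue_read(record)
    ring.enter()
    print("ring crossings:", ring.crossings, "completions:", len(ring.reap()))
    print("blocking crossings:", blocking_crossings(1000))
    print("blocking tax (us):", switch_tax_us(1000, 100), "to", switch_tax_us(1000, 300))
```

The model deliberately leaves out SQPOLL, ring sizing, and completion ordering; it only shows why the crossing count stops scaling with the number of operations.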

Observing the boundary with strace

strace makes the invisible boundary visible. It uses ptrace(PTRACE_SYSCALL) (or the faster seccomp-BPF backend) to stop the traced process on every syscall entry and exit, printing the decoded name, arguments, and return value. Two flags matter most:

$ strace -T -e trace=read,write cat file.txt
read(3, "hello\n", 131072)  = 6 <0.000012>
read(3, "", 131072)         = 0 <0.000004>
write(1, "hello\n", 6)       = 6 <0.000019>

-T appends each call's wall-clock duration in <seconds>; -e trace= filters to the calls you care about. The killer view for performance work is -c, which aggregates instead of tracing line-by-line:

$ strace -c -f ./myserver
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 61.2    0.184000           4     46000           read
 27.4    0.082000          41      2000           futex
  8.1    0.024000          12      2000           write

That table instantly answers "is this program syscall-bound?" A million read calls of a few bytes each screams "add buffering or switch to io_uring." Note that strace itself adds two ptrace stops per syscall, so it inflates per-call latency — use it to find which and how many calls, not to measure their true production cost (use perf or eBPF for that). On Linux, strace -c for counts and perf trace for low-overhead timing are the standard pairing.
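
When the summary is longer than three rows, it can help to rank it mechanically. The following is a rough sketch of a parser for the strace -c table layout shown above; parse_strace_summary and busiest are hypothetical helpers, not part of strace, and the column handling assumes the layout in this lesson's sample (an optional errors column, syscall name last).

```python
SAMPLE = """\
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 61.2    0.184000           4     46000           read
 27.4    0.082000          41      2000           futex
  8.1    0.024000          12      2000           write
"""


def parse_strace_summary(text):
    """Parse the data rows of an `strace -c` table into a list of dicts."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows have 5 columns, or 6 when the errors column is filled in.
        if len(fields) not in (5, 6) or fields[-1] == "total":
            continue
        try:
            pct, seconds = float(fields[0]), float(fields[1])
            usecs, calls = int(fields[2]), int(fields[3])
        except ValueError:
            continue  # header or separator line
        rows.append({
            "syscall": fields[-1],
            "pct_time": pct,
            "seconds": seconds,
            "usecs_per_call": usecs,
            "calls": calls,
        })
    return rows


def busiest(rows):
    """Return the row with the highest call count."""
    return max(rows, key=lambda row: row["calls"])


if __name__ == "__main__":
    top = busiest(parse_strace_summary(SAMPLE))
    print(top["syscall"], top["calls"], top["usecs_per_call"])
```

A high call count paired with a tiny usecs/call figure, as with read in the sample, is the pattern that points at missing buffering or batching.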

Pitfalls a working engineer hits

Unbuffered I/O in a loop. Reading a file one byte at a time is one syscall per byte. BufferedReader/stdio/a 64 KB buffer collapses that to one syscall per buffer-fill. This is the single most common syscall-cost bug.
EINTR. A blocking syscall can return early with errno == EINTR if a signal arrives. Naive code treats that as failure; correct code retries the call. Installing the signal handler with the SA_RESTART flag makes the kernel restart many interrupted syscalls automatically, but not all syscalls honour it.
Partial reads/writes. write(fd, buf, 8192) can legally return 4096. It is not an error — you must loop on the remainder. Treating the return as "all or nothing" silently truncates data, especially on sockets and pipes.
Assuming strace timings are real. ptrace overhead can make a syscall look 10x slower than it is in production. Use counts from strace, timings from perf/eBPF.
vDSO surprises. Some "syscalls" like clock_gettime() and gettimeofday() usually don't trap at all: the kernel maps a vDSO page into every process so they can run entirely in user mode, falling back to a real syscall only when the clock source cannot be read from user space. If you're benchmarking syscall cost with gettimeofday, you may be measuring nothing crossing the boundary at all.
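
Two of the pitfalls above, short writes and unbuffered reads, can be sketched in a few lines. This is an illustration under stated assumptions rather than production code: write_all, count_reads, and demo are hypothetical names, and the example uses os.read/os.write because they map one-to-one onto the underlying syscalls. Python's os functions already retry on EINTR (PEP 475), so that pitfall does not appear here; in C the retry would be the caller's job.

```python
import os
import tempfile


def write_all(fd, data):
    """Keep calling write until every byte is accepted (handles short writes)."""
    view = memoryview(data)
    total = 0
    while total < len(view):
        # os.write may accept fewer bytes than offered; it returns the count.
        total += os.write(fd, view[total:])
    return total


def count_reads(path, chunk_size):
    """Read a whole file with os.read and report how many syscalls it took."""
    calls = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        while True:
            chunk = os.read(fd, chunk_size)
            calls += 1
            if not chunk:      # empty read means end of file
                break
    finally:
        os.close(fd)
    return calls


def demo():
    """Compare syscall counts for 1-byte and 64 KB reads of a 4096-byte file."""
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"x" * 4096)
        path = tmp.name
    try:
        return count_reads(path, 1), count_reads(path, 65536)
    finally:
        os.unlink(path)


if __name__ == "__main__":
    print(demo())  # (4097, 2): one read per byte plus EOF, versus one fill plus EOF
```

The same file costs 4097 crossings read byte by byte and 2 with a 64 KB buffer, which is the whole argument for buffering in miniature.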

Trade-offs & when to use what

The design question is always "how do I do the necessary work with the fewest, cheapest crossings?" — and the answer depends on scale and complexity budget.

Approach	Crossings	Use when	Cost / when NOT
Plain blocking `read`/`write`	1 per op	Low I/O rate, simple code, correctness over throughput	Dies under high op rates; each call pays the full switch tax
`readv`/`writev`	1 per group	You already have several buffers to move at once (headers + body)	Helps only when regions are naturally batched; no async benefit
`epoll` + non-blocking sockets	1 per ready op, plus one epoll_wait shared by each batch of ready fds	Many concurrent connections, mostly network I/O	Still one `read` syscall per ready fd; readiness model, not completion; awkward for disk I/O
`io_uring`	~1 per batch, ~0 with SQPOLL while the ring stays fed	High-throughput storage and network I/O; you can afford the complexity	Newer, larger API surface, security-sensitive (disabled in some sandboxes); SQPOLL burns a CPU core for the poll thread; overkill for low I/O rates

io_uring vs epoll is the sharp comparison. epoll is a readiness interface — it tells you a fd is ready, then you still issue a read syscall per ready fd, and it never covered regular-file disk I/O well. io_uring is a completion interface that batches submission and completion and covers both disk and network. Choose epoll when you have an existing reactor and moderate load; reach for io_uring when syscall count is your measured bottleneck and per-op crossings dominate your profile — and prove it with strace -c first.
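
The readiness model can be seen in miniature with Python's selectors module, which is backed by epoll on Linux. This is a sketch with assumed names (read_when_ready is hypothetical) and a socketpair standing in for a real client connection; it shows only the readiness half of the comparison, since the standard library has no io_uring binding.

```python
import selectors
import socket


def read_when_ready(timeout=1.0):
    """Wait for readiness, then read: two separate kernel crossings."""
    sender, receiver = socket.socketpair()
    selector = selectors.DefaultSelector()  # epoll-backed on Linux
    try:
        receiver.setblocking(False)
        # Registration is itself a syscall (epoll_ctl on Linux), paid once per fd.
        selector.register(receiver, selectors.EVENT_READ)
        sender.sendall(b"ping")

        received = []
        # Crossing 1: ask the kernel which registered fds are ready.
        for key, _events in selector.select(timeout):
            # Crossing 2: readiness only says data is waiting; we still
            # have to issue the read ourselves, once per ready fd.
            received.append(key.fileobj.recv(4096))
        return received
    finally:
        selector.close()
        sender.close()
        receiver.close()


if __name__ == "__main__":
    print(read_when_ready())
```

With many ready sockets the single select call is shared across all of them, but the per-fd recv is not, which is the cost a completion interface removes.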

Takeaways

A syscall is a hardware privilege transition (ring 3 → ring 0) via a trap, not a function call — that is why it costs on the order of 100x more and pollutes caches for thousands of cycles afterward (the qualitative FlexSC insight; exact numbers are workload-dependent).
The kernel never trusts a user pointer or syscall number: validation on entry is the whole point of the boundary.
Performance work on I/O is largely the art of reducing crossings: buffer, use writev, or amortise with io_uring's shared rings down toward zero syscalls per op, and all the way to zero with SQPOLL while the ring stays fed.
strace -c tells you if a program is syscall-bound and which calls dominate; use perf/eBPF for true timings.

Recall question

Your service reads 4-byte messages from a socket in a tight loop and strace -c shows 2 million read calls dominating CPU time. Name two distinct fixes and explain which boundary cost each one attacks.

Sources: Brendan Gregg, Systems Performance (2nd ed.) and BPF Performance Tools; Soares & Stumm, "FlexSC: Flexible System Call Scheduling with Exception-Less System Calls" (OSDI 2010) — numbers cited are from that paper's benchmark suite, not universal constants; Jens Axboe, io_uring design papers and liburing; the Linux x86-64 syscall ABI and entry_SYSCALL_64 kernel source; Bovet & Cesati, Understanding the Linux Kernel; strace(1) and perf-trace(1) man pages. Re-authored/Deepened for this guide.
