
I/O Models: Blocking, epoll & io_uring

A network server scales or collapses on one decision: what a thread does while it waits for bytes that have not arrived yet. Under blocking I/O the kernel takes the calling thread off the CPU entirely (it marks the task TASK_INTERRUPTIBLE, runs the scheduler, and wakes the task only once data has arrived for it), so serving N connections costs N threads. Non-blocking I/O plus a kernel readiness notifier (epoll) or a completion queue (io_uring) lets a single thread multiplex tens of thousands of connections, because it never parks on any one of them. This page traces how each model actually moves bytes across the syscall boundary, why select and poll break down long before 10,000 connections, and how epoll and io_uring got past that wall.

Blocking vs non-blocking: what read() does at the syscall boundary

Every socket read is a syscall into the kernel, and the socket's receive buffer is either empty or not. That single fact drives everything.

Blocking (the default). read(fd, buf, n) on an empty socket puts the calling thread to sleep in the kernel. The scheduler picks another runnable task; your thread consumes no CPU but still holds its full kernel and user stacks. When a packet arrives, the NIC raises an interrupt, the network softirq queues the segment on the socket's receive buffer and marks the waiting task runnable, and, after a context switch, read() copies the bytes into buf and returns. Simple to reason about: one connection, one thread, straight-line code. The cost is that waiting is expensive: every idle connection ties up a whole thread.

Non-blocking (fcntl(fd, F_SETFL, O_NONBLOCK)). Now read() on an empty buffer returns -1 immediately with errno == EAGAIN (a.k.a. EWOULDBLOCK). The thread is never parked, so it can go serve other sockets. But this only helps if something tells you when a socket becomes readable — otherwise you spin, calling read() in a hot loop and burning 100% CPU on EAGAIN. That "something" is the readiness notifier.
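The EAGAIN behaviour described above can be observed directly. The following is a minimal sketch, assuming a Linux or other POSIX system; the helper names set_nonblocking and errno_on_empty_read are hypothetical, and a Unix-domain socketpair stands in for a real network connection. Unlike the shorthand fcntl(fd, F_SETFL, O_NONBLOCK), the helper reads the current flags first so that it does not clear other status flags.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/socket.h>
#include <unistd.h>

/* Put fd into non-blocking mode while preserving its other status flags.
 * Returns 0 on success, -1 on error (errno set by fcntl). */
static int set_nonblocking(int fd)
{
    assert(fd >= 0);
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/* Hypothetical demo: read from an empty non-blocking socket and return the
 * errno that read() reported. Returns 0 if read() unexpectedly succeeded,
 * -1 if the setup itself failed. */
int errno_on_empty_read(void)
{
    int sv[2];
    char buf[16];
    int seen = -1;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1)
        return -1;

    if (set_nonblocking(sv[0]) == 0) {
        /* Nothing has been written to sv[1], so the receive buffer is empty.
         * A blocking socket would sleep here; this one returns at once. */
        ssize_t r = read(sv[0], buf, sizeof buf);
        seen = (r == -1) ? errno : 0;
    }

    close(sv[0]);
    close(sv[1]);
    return seen;
}
```

Calling errno_on_empty_read() from a main() should yield EAGAIN (the same value as EWOULDBLOCK on Linux). A loop that simply retried on that result would be the busy-spin the paragraph warns about.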

The C10k problem: why thread-per-connection stops scaling

Dan Kegel named it in 1999: how do you serve 10,000 concurrent connections on one box? The blocking, thread-per-connection model has three costs that all grow with connection count, not with actual work:

Memory. Each thread reserves a stack, commonly 1–8 MB of virtual address space (the resident part is smaller, but non-trivial). 10,000 threads × even ~1 MB touched ≈ 10 GB, most of it backing idle keep-alive connections.
Scheduler & context switches. Every wakeup is a context switch, which costs kernel time and pushes useful data out of the CPU caches (and, when the switch crosses processes, out of the TLB). With thousands of threads becoming runnable the scheduler itself becomes a bottleneck; before Linux 2.6 its run-queue scan was O(n) in runnable tasks.
The notifier is O(N). If you go non-blocking with select/poll, every event-loop iteration re-scans all N descriptors — even though at any instant only a handful have data.

The killer detail: on a busy web server most connections are idle at any moment (HTTP keep-alive, slow clients, long-poll). You are paying O(N) for O(active). The fix is to make the kernel remember your interest set and report only what changed.

select/poll → epoll: from O(N) rescan to O(ready)

select takes three fd_set bitmasks (read/write/error). On every call you rebuild the masks and copy them into the kernel; the kernel scans every bit up to the highest fd, marks the ready ones, and copies the masks back; then you scan them again to find which fds fired. Cost is O(N) per call in both directions, and FD_SETSIZE (1024 on Linux) limits it to descriptors numbered below 1024. poll replaces the bitmask with a struct pollfd[] array, which removes the 1024 limit, but it is still O(N): you pass the whole array on every call and the kernel walks all of it.
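To make the O(N) rescan concrete, here is a small sketch assuming a POSIX system. The names scan_stats and poll_scan_demo are hypothetical, and socketpairs stand in for client connections. It registers `total` descriptors, makes only `active` of them readable, and counts how many array entries the caller must inspect after one poll() call.

```c
#include <assert.h>
#include <poll.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <unistd.h>

/* Entries the caller had to examine versus entries that were readable. */
struct scan_stats { int scanned; int ready; };

/* Build `total` socketpairs, write one byte into the first `active` of them,
 * run a single poll(), and record the scan cost. Returns 0 on success. */
int poll_scan_demo(int total, int active, struct scan_stats *out)
{
    assert(out != NULL && total > 0 && active >= 0 && active <= total);

    struct pollfd *pfds = calloc((size_t)total, sizeof *pfds);
    int *peers = calloc((size_t)total, sizeof *peers);
    int made = 0, rc = -1;

    if (!pfds || !peers)
        goto done;

    for (; made < total; made++) {
        int sv[2];
        if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1)
            goto done;
        pfds[made].fd = sv[0];
        pfds[made].events = POLLIN;
        peers[made] = sv[1];
    }
    for (int i = 0; i < active; i++)
        if (write(peers[i], "x", 1) != 1)
            goto done;

    /* The whole array crosses into the kernel, which checks every entry. */
    if (poll(pfds, (nfds_t)total, 0) == -1)
        goto done;

    /* The return value is only a count, so user space must also walk all
     * `total` entries to find out which ones fired. */
    out->scanned = 0;
    out->ready = 0;
    for (int i = 0; i < total; i++) {
        out->scanned++;
        if (pfds[i].revents & POLLIN)
            out->ready++;
    }
    rc = 0;

done:
    for (int i = 0; i < made; i++) {
        close(pfds[i].fd);
        close(peers[i]);
    }
    free(pfds);
    free(peers);
    return rc;
}
```

With total = 64 and active = 3 the function reports 64 entries scanned for 3 that were ready; the ratio only gets worse as the idle population grows.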

epoll (Linux 2.6) breaks the pattern by splitting registration from waiting:

epoll_create1() makes a kernel epoll instance.
epoll_ctl(ep, EPOLL_CTL_ADD, fd, &ev) registers each fd once. The kernel stores it in a red-black tree (the "interest set") and hooks a callback (ep_poll_callback) onto that socket.
When a packet arrives, the network softirq fires that callback, which appends the fd to a ready list inside the epoll instance.
epoll_wait() just returns the ready list. Its cost is proportional to the number of ready fds, not the total registered. Nothing is copied per-call for the idle millions.

Trace it: 10,000 keep-alive connections, 3 with pending data. poll hands the kernel 10,000 pollfds and the kernel touches all 10,000. epoll_wait returns an array of length 3. Same workload, roughly 3,300× less bookkeeping per loop iteration (10,000 / 3).
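The same trace can be reproduced at a smaller scale. This is a sketch assuming Linux; epoll_ready_count and DEMO_MAX are hypothetical names, and socketpairs again stand in for connections. Each descriptor is registered exactly once, and a single epoll_wait() then returns only the entries that have data.

```c
#include <assert.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

#define DEMO_MAX 256   /* upper bound on descriptors for this sketch */

/* Register `total` sockets in level-triggered mode, make `active` of them
 * readable, and return how many events one epoll_wait() reports (-1 on error). */
int epoll_ready_count(int total, int active)
{
    assert(total >= 1 && total <= DEMO_MAX && active >= 0 && active <= total);

    int rd[DEMO_MAX], wr[DEMO_MAX];
    struct epoll_event events[DEMO_MAX];
    int made = 0, n = -1;

    int ep = epoll_create1(0);
    if (ep == -1)
        return -1;

    for (; made < total; made++) {
        int sv[2];
        struct epoll_event ev = { .events = EPOLLIN };
        if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1)
            goto done;
        rd[made] = sv[0];
        wr[made] = sv[1];
        ev.data.fd = sv[0];
        /* One-time registration: the kernel keeps this fd in its interest set. */
        if (epoll_ctl(ep, EPOLL_CTL_ADD, sv[0], &ev) == -1) {
            made++;            /* make sure this pair is closed below */
            goto done;
        }
    }
    for (int i = 0; i < active; i++)
        if (write(wr[i], "x", 1) != 1)
            goto done;

    /* No interest list is passed in; only ready entries come back. */
    n = epoll_wait(ep, events, DEMO_MAX, 0);

done:
    for (int i = 0; i < made; i++) {
        close(rd[i]);
        close(wr[i]);
    }
    close(ep);
    return n;
}
```

epoll_ready_count(100, 3) should return 3: the 97 idle sockets never appear in the output array, which is the O(ready) property in miniature.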

Level-triggered vs edge-triggered — and the drain pitfall

epoll has two notification modes. Level-triggered (default): epoll_wait keeps reporting a fd as long as it has readable data. This is forgiving, but a fd you haven't fully read stays "hot" and is returned on every call. Edge-triggered (EPOLLET): you're notified only on the transition from not-ready to ready, roughly one wakeup per arrival of new data. Edge-triggered can be faster (fewer wakeups) but has a mandatory contract: the fd must be non-blocking, and on each notification you must read until EAGAIN. Otherwise leftover bytes sit in the buffer with no further wakeup, and the connection stalls until the peer happens to send more. Level-triggered is the safe default: use it unless you have measured that the wakeups edge-triggered saves actually matter, and even then, wrap every read in a drain loop.

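/* Fragment, not a full program: assumes <sys/epoll.h>, <errno.h> and <unistd.h>,
 * a connected socket `conn` already set O_NONBLOCK, and an application-defined
 * handle(). Error checks on epoll_create1/epoll_ctl are omitted for brevity. */
char buf[4096];
int ep = epoll_create1(0);
struct epoll_event ev = { .events = EPOLLIN | EPOLLET, .data.fd = conn };
epoll_ctl(ep, EPOLL_CTL_ADD, conn, &ev);

struct epoll_event events[64];
for (;;) {
    int n = epoll_wait(ep, events, 64, -1);   /* returns ONLY ready fds; -1 (e.g. EINTR) skips the loop */
    for (int i = 0; i < n; i++) {
        int fd = events[i].data.fd;
        for (;;) {                            /* edge-triggered: must drain */
            ssize_t r = read(fd, buf, sizeof buf);
            if (r > 0)                     { handle(buf, r); continue; }
            if (r == -1 && (errno == EAGAIN || errno == EWOULDBLOCK)) break;  /* buffer drained */
            if (r == -1 && errno == EINTR)  continue;     /* retry */
            close(fd); break;                            /* r==0 peer closed, or error */
        }
    }
}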

The inner for(;;) loop is the whole point of edge-triggered epoll: a single read() is a bug.
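The difference between the two modes is easy to observe. The sketch below assumes Linux and mirrors the partial-read scenario discussed in the epoll(7) manual page; wakeups_after_partial_read is a hypothetical helper. It writes 10 bytes, consumes the first notification, reads only 4 bytes, and then asks epoll again without any new data arriving.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

/* extra_flags is 0 for level-triggered or EPOLLET for edge-triggered.
 * Returns the number of events reported by a second epoll_wait() after a
 * deliberately incomplete read, or -1 if any setup step failed. */
int wakeups_after_partial_read(uint32_t extra_flags)
{
    assert((extra_flags & ~(uint32_t)EPOLLET) == 0);

    int sv[2], ep, n = -1;
    char buf[4];
    struct epoll_event ev = { .events = EPOLLIN | extra_flags }, out;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1)
        return -1;
    ep = epoll_create1(0);
    if (ep == -1) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }

    ev.data.fd = sv[0];
    if (epoll_ctl(ep, EPOLL_CTL_ADD, sv[0], &ev) == -1)
        goto done;
    if (write(sv[1], "0123456789", 10) != 10)
        goto done;

    /* First notification: reported in both modes. */
    if (epoll_wait(ep, &out, 1, 0) != 1)
        goto done;

    /* The bug under study: a single read that leaves 6 bytes behind. */
    if (read(sv[0], buf, sizeof buf) != 4)
        goto done;

    /* No new data has arrived. Level-triggered still reports the fd;
     * edge-triggered stays silent because there was no new edge. */
    n = epoll_wait(ep, &out, 1, 0);

done:
    close(ep);
    close(sv[0]);
    close(sv[1]);
    return n;
}
```

wakeups_after_partial_read(0) should return 1, while wakeups_after_partial_read(EPOLLET) should return 0. With a blocking timeout instead of 0, the second call in the edge-triggered case would simply hang, which is the stall described above.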

io_uring: from readiness to completion, and batched syscalls

epoll still has two structural taxes. First, it is readiness-based: it tells you a fd is ready, then you make a separate read() syscall to actually move the data, so each operation costs two boundary crossings. Second, epoll never worked for regular files: epoll_ctl rejects them with EPERM, and select/poll report a disk file as always "ready," so buffered file I/O in an event loop silently blocks the thread. io_uring (Linux 5.1, 2019, by Jens Axboe) fixes both by being completion-based over two shared ring buffers.

The app and kernel share two lock-free single-producer/single-consumer rings via mmap, so no data is copied to move a request or result across the boundary:

Submission Queue (SQ). The app fills a Submission Queue Entry (SQE), such as "read fd=7 into buf, offset 0", "accept on fd=3", or "write fd=9", and advances the SQ tail. Most common I/O syscalls have an SQE opcode: read, write, accept, connect, recv, send, openat, fsync, plus timeouts and linked op chains.
Completion Queue (CQ). When the kernel finishes an op it posts a Completion Queue Entry (CQE) holding the result (bytes transferred or -errno) and your user_data tag, and advances the CQ tail.
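The head/tail protocol behind both rings can be illustrated with a toy model. This is explicitly not the kernel ABI: toy_cqe, toy_ring, toy_ring_push and toy_ring_pop are hypothetical names, the ring lives in ordinary process memory rather than an mmap shared with the kernel, and the real SQ adds an index array that is omitted here. What it does show is the single-producer/single-consumer discipline: the producer writes an entry and then publishes a new tail, and the consumer reads up to the tail and then publishes a new head.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define RING_ENTRIES 8u
#define RING_MASK (RING_ENTRIES - 1u)
static_assert((RING_ENTRIES & RING_MASK) == 0,
              "RING_ENTRIES must be a power of two");

/* Toy completion entry carrying the two fields the text describes. */
struct toy_cqe {
    uint64_t user_data;   /* caller's tag, echoed back */
    int32_t  res;         /* bytes transferred, or -errno */
};

struct toy_ring {
    _Atomic uint32_t head;               /* advanced only by the consumer */
    _Atomic uint32_t tail;               /* advanced only by the producer */
    struct toy_cqe entries[RING_ENTRIES];
};

/* Producer side (the kernel, for a CQ). Returns 0, or -1 if the ring is full. */
int toy_ring_push(struct toy_ring *r, struct toy_cqe e)
{
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);

    if (tail - head == RING_ENTRIES)     /* free-running counters, wrap-safe */
        return -1;
    r->entries[tail & RING_MASK] = e;
    /* Release: the entry must be visible before the new tail is. */
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return 0;
}

/* Consumer side (the app, for a CQ). Returns 0, or -1 if the ring is empty. */
int toy_ring_pop(struct toy_ring *r, struct toy_cqe *out)
{
    uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);

    if (head == tail)
        return -1;
    *out = r->entries[head & RING_MASK];
    /* Release: the slot is handed back only after it has been copied out. */
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return 0;
}
```

Because each index has exactly one writer, no lock is needed; a full ring in this model corresponds to the situation in which a slow consumer lets completions pile up.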

io_uring_enter() submits a batch: 128 queued reads go across the boundary in one syscall instead of 128. With IORING_SETUP_SQPOLL, a dedicated kernel thread polls the SQ ring for new entries, so a saturated server can submit and reap I/O with zero syscalls in steady state: the app just writes SQEs into shared memory and the kernel thread picks them up. The cost is that the thread busy-polls a CPU core, and once it has been idle for the configured sq_thread_idle period it goes to sleep and sets IORING_SQ_NEED_WAKEUP, after which the app must call io_uring_enter() with IORING_ENTER_SQ_WAKEUP to restart it. io_uring_register() can pre-register fds and fixed buffers so the kernel skips the per-call fd-table lookup and page-pinning that a plain read() repeats every time. If the app falls behind draining completions, the CQ can fill up; io_uring reports this as CQ overflow. On kernels that advertise IORING_FEAT_NODROP, the kernel holds overflowed completions in its own memory instead of dropping them, and further submissions fail with -EBUSY until the app reaps the backlog. Because it's completion-based, there's no second read after readiness, and it does true async for disk files that epoll never could.

io_uring vs epoll: when to reach for which

They are not strict upgrades of each other — pick based on what you're actually bottlenecked on.

Syscall count. epoll+read/write pays one epoll_wait per loop iteration plus one syscall per actual read or write. io_uring collapses N operations into one io_uring_enter(), or zero with SQPOLL. Under high fan-out (many small ops/sec) this is the single biggest win.
Regular files. epoll cannot watch disk files at all (epoll_ctl returns EPERM, and poll/select treat them as always "ready"); io_uring can do real async disk I/O. If your workload is proxying sockets only, this advantage is moot.
Maturity and portability. epoll is 20+ years old, present on every Linux since 2.6, and battle-tested in every major event loop. io_uring needs a modern kernel (meaningful features arrived 5.6–5.11+), so containers/cloud fleets pinned to older LTS kernels, or anything targeting non-Linux, can't rely on it.
Security surface. io_uring's large, fast-evolving in-kernel surface has had a disproportionate share of Linux kernel CVEs (privilege escalation bugs, e.g. those found by Google's kCTF/kernel fuzzing efforts circa 2021–2023). Several cloud providers and hardened distros (Google's ChromeOS, some managed Kubernetes/container platforms) disable io_uring by default or restrict it via seccomp, so an io_uring-based app that worked in development can fail at ring setup in production.
Complexity. Raw io_uring is a ring-buffer protocol, not a friendly API; almost everyone uses liburing to wrap it, and even then, correctly reasoning about SQE lifetime, buffer registration, and CQE overflow is harder to get right than an epoll readiness loop.

Rule of thumb: reach for epoll first — it's simpler, universally available, and sufficient for most socket-only servers. Reach for io_uring when profiling shows syscall overhead dominating (very high connection/op churn) or when you need real async file I/O, and you control the kernel version and can tolerate the newer, occasionally-patched security surface.
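Because availability depends on both kernel version and sandbox policy, a server that wants io_uring usually has to probe for it at startup and keep an epoll path as a fallback. The following is a minimal sketch assuming Linux with kernel headers new enough to provide <linux/io_uring.h>; the enum and the function probe_io_uring are hypothetical names, and the mapping from errno to cause is a reasonable guess rather than a guarantee, since a seccomp filter can be configured to return either error.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <linux/io_uring.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

enum uring_status {
    URING_OK,                 /* a ring could be created */
    URING_NO_KERNEL_SUPPORT,  /* ENOSYS: kernel too old or built without it */
    URING_BLOCKED,            /* EPERM/EACCES: likely policy (seccomp, sysctl) */
    URING_OTHER_ERROR         /* e.g. ENOMEM, EMFILE */
};

/* Try to create a tiny ring with the raw syscall and classify the outcome. */
enum uring_status probe_io_uring(void)
{
    struct io_uring_params p;
    memset(&p, 0, sizeof p);

    /* glibc has no wrapper for io_uring_setup, hence syscall(2). */
    long fd = syscall(__NR_io_uring_setup, 4, &p);
    if (fd >= 0) {
        assert(p.sq_entries >= 4);   /* kernel fills in the actual ring size */
        close((int)fd);
        return URING_OK;
    }

    switch (errno) {
    case ENOSYS:
        return URING_NO_KERNEL_SUPPORT;
    case EPERM:
    case EACCES:
        return URING_BLOCKED;
    default:
        return URING_OTHER_ERROR;
    }
}
```

A startup routine might call probe_io_uring() once and select the epoll backend for anything other than URING_OK, logging which case it hit so that a blocked deployment is visible rather than mysterious.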

The reactor pattern (and its proactor cousin)

These primitives are wired together by the reactor pattern, the backbone of Nginx, Node.js (libuv), Netty, and Redis. Its parts: a synchronous event demultiplexer (epoll), a set of non-blocking event handlers registered per fd, and a single-threaded dispatcher loop. The loop calls epoll_wait, gets the ready fds, and for each one invokes its handler, which does a non-blocking read/write and returns immediately. One thread, one loop, thousands of connections — the handler must never block or it stalls every other connection on that loop.
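The three parts named above fit in a few dozen lines. This is a minimal sketch assuming Linux, not how Nginx or libuv are actually structured; struct handler, reactor_add, reactor_run_once, count_bytes and reactor_demo are all hypothetical names. epoll is the demultiplexer, the handler is a callback stored alongside its fd, and the dispatcher is one loop iteration that invokes whichever handlers are ready.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

/* A handler: the fd it owns, its callback, and some private state. */
struct handler {
    int fd;
    void (*on_readable)(struct handler *self);
    long bytes_seen;
};

/* Register a handler. epoll stores the pointer and hands it back on readiness. */
int reactor_add(int ep, struct handler *h)
{
    assert(h != NULL && h->on_readable != NULL);
    struct epoll_event ev = { .events = EPOLLIN };   /* level-triggered */
    ev.data.ptr = h;
    return epoll_ctl(ep, EPOLL_CTL_ADD, h->fd, &ev);
}

/* One dispatcher iteration. Returns the number of handlers invoked, or -1. */
int reactor_run_once(int ep, int timeout_ms)
{
    struct epoll_event events[64];
    int n = epoll_wait(ep, events, 64, timeout_ms);

    for (int i = 0; i < n; i++) {
        struct handler *h = events[i].data.ptr;
        h->on_readable(h);       /* must not block: it stalls the whole loop */
    }
    return n;
}

/* Example handler: a single non-blocking receive. Any leftover data is
 * reported again on the next iteration because the fd is level-triggered. */
void count_bytes(struct handler *self)
{
    char buf[256];
    ssize_t r = recv(self->fd, buf, sizeof buf, MSG_DONTWAIT);
    if (r > 0)
        self->bytes_seen += r;
}

/* Wire it together over a socketpair; returns bytes the handler saw, or -1. */
long reactor_demo(void)
{
    int sv[2];
    long seen = -1;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1)
        return -1;

    int ep = epoll_create1(0);
    struct handler h = { .fd = sv[0], .on_readable = count_bytes };

    if (ep != -1 &&
        reactor_add(ep, &h) == 0 &&
        write(sv[1], "hello", 5) == 5 &&
        reactor_run_once(ep, 0) == 1)
        seen = h.bytes_seen;

    if (ep != -1)
        close(ep);
    close(sv[0]);
    close(sv[1]);
    return seen;
}
```

reactor_demo() should return 5. A real server would call reactor_run_once in a for(;;) loop with a blocking timeout, and its accept handler would call reactor_add for each new connection.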

io_uring instead fits the proactor pattern (the same shape as Windows IOCP): you initiate an operation and the OS completes it, handing you the finished result. The dispatcher reaps completions rather than readiness. Reactor asks "which fds can I act on?"; proactor says "here are the ops that are done." That shift is why completion-based I/O removes the extra syscall per op that reactor-style epoll+read/write always pays — the price is a fundamentally different, ring-buffer-shaped programming model instead of a simple readiness callback.
