
Containers: cgroups & Namespaces

A container is a normal Linux process wearing a costume

A container is not a lightweight VM and there is no "container" object in the kernel — it is an ordinary process that the kernel has been asked to lie to. When you launch one, the kernel does three cheap things on top of a normal fork()/execve(): it puts the process in a fresh set of namespaces (so its view of PIDs, mounts, network, hostname, IPC and users is a private sandbox), it attaches it to a cgroup (so the scheduler and memory allocator cap and account its CPU, memory and I/O), and it pivots its root filesystem onto an overlay mount (so it sees a private image without copying it). Everything else — the process runs on the host kernel, is visible in the host's ps, is scheduled by the same CFS/EEVDF scheduler — is completely normal. That is the whole trick.

This matters because the alternative, a virtual machine, boots an entire guest kernel and emulates virtual hardware, costing hundreds of MB of RAM and seconds of boot per instance. A container adds a few syscalls to a process that already exists, so it starts in milliseconds and shares the host kernel's page cache and scheduler. That density — thousands of isolated workloads per host — is what makes Kubernetes, Lambda, and CI runners economically possible. The cost is the flip side of the same coin: one shared kernel, so the isolation is only as strong as the kernel's namespace and cgroup boundaries, not a hardware-enforced VM boundary.

Namespaces isolate the view; cgroups limit the resources

Keep these two axes strictly separate in your head — they are orthogonal and answer different questions. Namespaces answer "what can this process see and name?" Cgroups answer "how much can this process consume?" A process could be in a private PID namespace but with no memory limit (sees only itself, can OOM the host), or in a tight memory cgroup but the host network namespace (throttled, but sees every interface). A real container uses both.
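The two axes can be read separately for any process through procfs: /proc/<pid>/ns/ holds one symlink per namespace, and /proc/<pid>/cgroup names the cgroup the process is charged to. The following is a minimal sketch, assuming a Linux host with a cgroup v2 layout. The function names and the proc_root parameter are illustrative, not part of any library.

```python
import os


def namespace_ids(pid="self", proc_root="/proc"):
    """Return {namespace_name: identifier} read from <proc_root>/<pid>/ns/.

    Each entry is a symlink whose target looks like 'pid:[4026531836]'.
    Two processes share a namespace exactly when the targets match.
    """
    ns_dir = os.path.join(proc_root, str(pid), "ns")
    return {name: os.readlink(os.path.join(ns_dir, name))
            for name in sorted(os.listdir(ns_dir))}


def cgroup_path(pid="self", proc_root="/proc"):
    """Return the cgroup v2 path from <proc_root>/<pid>/cgroup, or None.

    On a cgroup v2 host the file has one line of the form '0::/some/path'.
    """
    with open(os.path.join(proc_root, str(pid), "cgroup")) as f:
        for line in f:
            hierarchy_id, controllers, path = line.rstrip("\n").split(":", 2)
            if hierarchy_id == "0" and controllers == "":
                return path
    return None
```

Calling namespace_ids() and cgroup_path() with their defaults on a Linux host reports the calling process's own view. Comparing the output for a host shell and a containerized process should show different namespace identifiers (the "see" axis) and a different cgroup path (the "consume" axis).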

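The seven namespaces (created via clone() / unshare() flags)

These are the seven that container runtimes typically set up. The kernel has had an eighth, the time namespace (CLONE_NEWTIME), since Linux 5.6, but it is not part of the usual container setup.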

| Namespace | Flag | What it virtualizes | Concrete effect inside the container |
|---|---|---|---|
| PID | CLONE_NEWPID | Process-ID number space | Your entrypoint is PID 1; it cannot see or signal host processes |
| NET | CLONE_NEWNET | Interfaces, routes, ports, netfilter | Own lo + a veth pair; two containers can both bind :8080 |
| MNT | CLONE_NEWNS | The mount table | A private filesystem tree; /proc, / differ from host |
| UTS | CLONE_NEWUTS | Hostname & domainname | hostname returns the container ID, not the node |
| IPC | CLONE_NEWIPC | System-V IPC, POSIX msg queues | Shared-memory segments are invisible across containers |
| USER | CLONE_NEWUSER | UID/GID mapping | root (uid 0) inside can map to an unprivileged uid (e.g. 100000) outside, if the runtime opts in |
| CGROUP | CLONE_NEWCGROUP | The cgroup filesystem's root view | The container sees its own cgroup as /, hiding sibling and ancestor cgroup paths |
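Each flag in the table is a single bit, and a runtime requests several namespaces at once by OR-ing the bits into one argument to clone() or unshare(). A small sketch of that combination, using the flag values from the kernel's <linux/sched.h>. The dictionary keys and the unshare_flags helper are illustrative names, not an existing API.

```python
# Flag values as defined in the kernel's <linux/sched.h>.
CLONE_FLAGS = {
    "mnt":    0x00020000,  # CLONE_NEWNS
    "cgroup": 0x02000000,  # CLONE_NEWCGROUP
    "uts":    0x04000000,  # CLONE_NEWUTS
    "ipc":    0x08000000,  # CLONE_NEWIPC
    "user":   0x10000000,  # CLONE_NEWUSER
    "pid":    0x20000000,  # CLONE_NEWPID
    "net":    0x40000000,  # CLONE_NEWNET
}


def unshare_flags(names):
    """OR together the CLONE_NEW* bits for the requested namespaces."""
    flags = 0
    for name in names:
        flags |= CLONE_FLAGS[name]
    return flags
```

For example, unshare_flags(["pid", "net"]) yields 0x60000000. On Python 3.12 or later the result could be passed to os.unshare(), which needs the appropriate privileges (or a user namespace) to succeed.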

The user namespace is the one that can turn "root in the container" from a host-level danger into a safe illusion: when UID mapping is enabled, uid 0 inside is a mapped, unprivileged uid outside, so even a container escape lands on the host as an unprivileged user. In practice this mapping is opt-in, not the default: Docker's default configuration and most production Kubernetes clusters still run container processes in the host's own user namespace, so uid 0 in the container is uid 0 on the host. Docker's userns-remap and Kubernetes' user-namespace support (the pod-level hostUsers: false setting, which reached beta in 1.30) exist precisely because remapped root isn't yet the common case. Treat user-namespace remapping as a hardening option you should turn on, not a guarantee you already have.
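The mapping itself is visible in /proc/<pid>/uid_map, where each line reads "inside_start outside_start length". The sketch below translates an inside uid to the outside uid under that format. The map_uid name is illustrative, and the sample mapping (0 to 100000, length 65536) follows the example in the table rather than any particular runtime's defaults.

```python
def map_uid(uid_map_text, inside_uid):
    """Translate a uid inside a user namespace to the uid seen outside it.

    Each line of /proc/<pid>/uid_map is 'inside_start outside_start length'.
    Returns None when the uid falls in no mapped range.
    """
    for line in uid_map_text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        inside_start, outside_start, length = map(int, fields)
        if inside_start <= inside_uid < inside_start + length:
            return outside_start + (inside_uid - inside_start)
    return None


# Hypothetical mapping: container uids 0..65535 map to host uids 100000..165535.
SAMPLE_UID_MAP = "         0     100000      65536\n"
```

With SAMPLE_UID_MAP, map_uid(SAMPLE_UID_MAP, 0) gives 100000, so root inside is an ordinary unprivileged uid outside. Without remapping, the file typically contains the identity mapping and uid 0 maps to uid 0.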

Traced example: building a container by hand

Docker/containerd/runc do exactly this sequence — you can reproduce the core of it with a few commands and watch each layer engage. Assume a host running cgroup v2 (unified hierarchy under /sys/fs/cgroup).

Step-by-step trace

  1. t=0: create the namespaces. unshare --pid --net --mount --uts --ipc --user --fork --map-root-user /bin/sh. The unshare utility calls unshare(2) with the matching CLONE_NEW* flags and then, because of --fork, forks the shell. A new PID namespace applies only to children created after the call, so it is the forked child that becomes PID 1 in its own PID namespace.
  2. t=1: verify the PID illusion. Inside, echo $$ prints 1; ps (after mounting a private /proc) shows only your shell. On the host, ps aux | grep sh shows the same process under an ordinary host PID, say 5177. Same task_struct, two different numbers: the PID namespace is just a translation table on struct pid.
  3. t=2: pivot the root. In the mount namespace, pivot_root onto an extracted image dir so / is the image, not the host, then unmount the old root. The host's / is now unreachable from inside: not merely hidden, but absent from this namespace's mount table.
  4. t=3: attach the cgroup. From the host, as root: mkdir /sys/fs/cgroup/demo and cd into it; echo "50000 100000" > cpu.max (50 ms of CPU per 100 ms period = half a core); echo 512M > memory.max; then echo <pid> > cgroup.procs to move the process in. The cpu and memory controllers must be listed in the parent's cgroup.subtree_control, otherwise these files do not exist in the new cgroup.
  5. t=4: hit the CPU limit. Run a busy loop. On the host, top shows the process pinned close to 50% of one core. The CFS bandwidth controller enforces this: in the simple case, the process (or its threads) spend the 50 ms of quota within the 100 ms period and are then throttled until the period resets. The real accounting is a bit richer (quota is drawn from a per-cgroup pool that multiple runnable threads can spend concurrently, and the optional cpu.max.burst setting lets a cgroup carry over some unused quota from earlier periods), but the observable effect is the same: cpu.stat's nr_throttled and throttled_usec climb once quota is exhausted.
  6. t=5: hit the memory limit. Allocate and touch more than 512 MiB. The kernel first tries to reclaim memory within the cgroup (dropping page cache, and swapping if swap is available), then triggers the cgroup OOM killer, which kills a task inside this cgroup only; the host and other containers are untouched. memory.events shows oom_kill 1.
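Steps 4 to 6 can be checked by reading the cgroup's own files. The sketch below parses cpu.max into a core fraction and reads the "key value" layout shared by cpu.stat and memory.events to report throttling and OOM kills. The function names are illustrative; in practice the text would be read from files under /sys/fs/cgroup/demo/.

```python
def parse_cpu_max(text):
    """Return the CPU limit in cores from cpu.max, or None if unlimited.

    The file holds '<quota_usec> <period_usec>'; quota is 'max' when uncapped.
    """
    quota, period = text.split()
    if quota == "max":
        return None
    return int(quota) / int(period)


def parse_flat_keyed(text):
    """Parse 'key value' lines, the layout of cpu.stat and memory.events."""
    result = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        key, value = line.split()
        result[key] = int(value)
    return result


def diagnose(cpu_stat_text, memory_events_text):
    """Summarize whether the cgroup has hit its CPU or memory limit."""
    cpu = parse_flat_keyed(cpu_stat_text)
    mem = parse_flat_keyed(memory_events_text)
    return {
        "throttled": cpu.get("nr_throttled", 0) > 0,
        "throttled_usec": cpu.get("throttled_usec", 0),
        "oom_kills": mem.get("oom_kill", 0),
    }
```

With the limit from step 4, parse_cpu_max("50000 100000") returns 0.5. After steps 5 and 6, diagnose() on the real files should report throttled as True and oom_kills as at least 1.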

Notice what you never did: boot a kernel, allocate virtual RAM, or emulate a device. You added flags to one process. That is the entire mechanism runc implements — plus seccomp/capabilities/AppArmor for hardening.
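The "same task, two PIDs" observation from step 2 can be confirmed from the host: the NSpid line in /proc/<pid>/status lists the process's PID in each nested PID namespace, outermost first. A minimal sketch of reading it; the function name and the sample text are illustrative, with the PIDs taken from the trace above.

```python
def pids_by_namespace(status_text):
    """Return the NSpid values from /proc/<pid>/status as a list of ints.

    The first entry is the PID in the reader's PID namespace; each later
    entry is the PID in the next nested namespace, ending with the innermost.
    """
    for line in status_text.splitlines():
        if line.startswith("NSpid:"):
            return [int(field) for field in line.split()[1:]]
    return []


# Illustrative excerpt of a status file for the shell from the trace.
SAMPLE_STATUS = "Name:\tsh\nPid:\t5177\nNSpid:\t5177\t1\n"
```

For the sample, pids_by_namespace(SAMPLE_STATUS) returns [5177, 1]: host PID 5177 and PID 1 inside the container's namespace, for one and the same process.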

Pitfalls a working engineer actually hits

Trade-offs: containers vs virtual machines vs sandboxed runtimes

The decision is fundamentally about where the isolation boundary sits and what you are willing to pay for it.

| Property | Container (runc) | VM (KVM/Firecracker) | Sandboxed (gVisor / Kata) |
|---|---|---|---|
| Isolation boundary | Shared kernel + namespaces | Hardware (VT-x), separate kernel | User-space kernel (gVisor) or micro-VM (Kata) |
| Start time | ~10–50 ms | ~100 ms–seconds (Firecracker ~125 ms) | ~100–200 ms |
| Memory overhead | ~MB (just the process) | tens–hundreds of MB (guest kernel) | tens of MB |
| Blast radius of a kernel 0-day | Host compromise possible | Contained to the guest | Contained (syscalls intercepted) |
| Syscall performance | Native (zero overhead) | Near-native | gVisor: measurable syscall tax |

Use containers when you own the workloads and trust the code: microservices, CI, internal batch jobs. Density and speed dominate and a shared kernel is acceptable. Use VMs (or Firecracker micro-VMs) when you run untrusted, multi-tenant code. This is exactly why AWS Lambda and Fargate wrap each customer's container in a Firecracker micro-VM: they want container ergonomics with VM-grade isolation. Use gVisor/Kata when you need stronger-than-namespace isolation but can't afford a full VM per workload (GKE Sandbox). The trade-off to internalize: a container gives up the VM's hardware boundary in exchange for roughly 10× faster start and roughly 10× higher density, which is a great trade for trusted code and a dangerous one for hostile tenants on a shared kernel.

Takeaways

Recall question

Two containers on the same host both bind port 8080 successfully with no conflict, yet ps on the host shows both their main processes. Which mechanism makes the ports non-conflicting, and which fact reveals that a container is "just a process"?

Answer: separate network namespaces give each container its own port space, so :8080 is a different socket in each. The host's ps listing both proves there is no VM boundary — they are ordinary host processes sharing one kernel, merely placed in different namespaces.


Sources: The Linux Programming Interface (Kerrisk, ch. 28 & namespaces/cgroups); Linux kernel documentation (cgroup-v2.rst, namespaces(7), overlayfs.rst); Brendan Gregg, Systems Performance (2nd ed., cgroup CPU throttling & the USE method); AWS Firecracker paper (NSDI 2020); Docker userns-remap and Kubernetes user-namespaces (KEP-127) documentation; the gVisor design docs. Authored for this guide.
