
Containers: cgroups & Namespaces

A container is a normal Linux process wearing a costume

A container is not a lightweight VM and there is no "container" object in the kernel — it is an ordinary process that the kernel has been asked to lie to. When you launch one, the kernel does three cheap things on top of a normal fork()/execve(): it puts the process in a fresh set of namespaces (so its view of PIDs, mounts, network, hostname, IPC and users is a private sandbox), it attaches it to a cgroup (so the scheduler and memory allocator cap and account its CPU, memory and I/O), and it pivots its root filesystem onto an overlay mount (so it sees a private image without copying it). Everything else — the process runs on the host kernel, is visible in the host's ps, is scheduled by the same CFS/EEVDF scheduler — is completely normal. That is the whole trick.

This matters because the alternative, a virtual machine, boots an entire guest kernel and emulates virtual hardware, costing hundreds of MB of RAM and seconds of boot per instance. A container adds a few syscalls to a process that already exists, so it starts in milliseconds and shares the host kernel's page cache and scheduler. That density — thousands of isolated workloads per host — is what makes Kubernetes, Lambda, and CI runners economically possible. The cost is the flip side of the same coin: one shared kernel, so the isolation is only as strong as the kernel's namespace and cgroup boundaries, not a hardware-enforced VM boundary.

Namespaces isolate the view; cgroups limit the resources

Keep these two axes strictly separate in your head — they are orthogonal and answer different questions. Namespaces answer "what can this process see and name?" Cgroups answer "how much can this process consume?" A process could be in a private PID namespace but with no memory limit (sees only itself, can OOM the host), or in a tight memory cgroup but the host network namespace (throttled, but sees every interface). A real container uses both.
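One way to see the "view" axis on its own is to compare namespace identities. On Linux each entry under /proc/<pid>/ns is a symlink whose target looks like `net:[4026531840]`, and two processes share a namespace exactly when the type and inode number match. The sketch below assumes you have already collected those link targets (for example with `os.readlink`); the inode numbers and the names `host` and `half_isolated` are made up for illustration.

```python
import re

# Format of a /proc/<pid>/ns/<name> symlink target, e.g. "net:[4026531840]".
_NS_LINK = re.compile(r"^(\w+):\[(\d+)\]$")


def parse_ns_link(target: str) -> tuple[str, int]:
    """Split a namespace symlink target into (type, inode number)."""
    match = _NS_LINK.match(target.strip())
    if match is None:
        raise ValueError(f"not a namespace link: {target!r}")
    return match.group(1), int(match.group(2))


def shared_namespaces(links_a: list[str], links_b: list[str]) -> set[str]:
    """Return the namespace types in which both processes have the same inode."""
    a = dict(parse_ns_link(t) for t in links_a)
    b = dict(parse_ns_link(t) for t in links_b)
    return {kind for kind in a.keys() & b.keys() if a[kind] == b[kind]}


# Hypothetical link targets: a host shell, and a process with a private PID
# and mount namespace that still uses the host network namespace.
host = ["pid:[4026531836]", "net:[4026531840]", "mnt:[4026531841]"]
half_isolated = ["pid:[4026532500]", "net:[4026531840]", "mnt:[4026532498]"]

print(sorted(shared_namespaces(host, half_isolated)))  # ['net']
```

Nothing in this comparison says anything about resource limits; that information lives in the process's cgroup, which is the other axis.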

The seven namespaces containers use (created via `clone()` / `unshare()` flags)

Namespace	Flag	What it virtualizes	Concrete effect inside the container
PID	`CLONE_NEWPID`	Process-ID number space	Your entrypoint is `PID 1`; it cannot see or signal host processes
NET	`CLONE_NEWNET`	Interfaces, routes, ports, netfilter	Own `lo` + a veth pair; two containers can both bind `:8080`
MNT	`CLONE_NEWNS`	The mount table	A private filesystem tree; `/proc`, `/` differ from host
UTS	`CLONE_NEWUTS`	Hostname & domainname	`hostname` returns the container ID, not the node
IPC	`CLONE_NEWIPC`	System-V IPC, POSIX msg queues	Shared-memory segments are invisible across containers
USER	`CLONE_NEWUSER`	UID/GID mapping	`root` (uid 0) inside can map to an unprivileged uid (e.g. 100000) outside — if the runtime opts in
CGROUP	`CLONE_NEWCGROUP`	The cgroup filesystem's root view	The container sees its own cgroup as `/`, hiding sibling and ancestor cgroup paths
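The flags in the table are plain bits that get OR-ed into a single argument to `clone()` or `unshare()`. As a small sketch, the helper below builds that mask from short namespace names. The numeric values are copied from `<linux/sched.h>`; the dictionary keys and the function name `namespace_mask` are illustrative choices, not part of any API. Recent Python versions also expose the same constants as `os.CLONE_NEW*` on Linux.

```python
# Flag values as defined in <linux/sched.h>.
CLONE_FLAGS = {
    "mnt": 0x00020000,     # CLONE_NEWNS
    "cgroup": 0x02000000,  # CLONE_NEWCGROUP
    "uts": 0x04000000,     # CLONE_NEWUTS
    "ipc": 0x08000000,     # CLONE_NEWIPC
    "user": 0x10000000,    # CLONE_NEWUSER
    "pid": 0x20000000,     # CLONE_NEWPID
    "net": 0x40000000,     # CLONE_NEWNET
}


def namespace_mask(names: list[str]) -> int:
    """OR together the CLONE_NEW* bits for the requested namespace types."""
    mask = 0
    for name in names:
        if name not in CLONE_FLAGS:
            raise KeyError(f"unknown namespace type: {name!r}")
        mask |= CLONE_FLAGS[name]
    return mask


# Six namespaces requested together: pid, net, mnt, uts, ipc and user.
mask = namespace_mask(["pid", "net", "mnt", "uts", "ipc", "user"])
print(hex(mask))  # 0x7c020000
```

Because each namespace is its own bit, a runtime can request any subset, which is why partial isolation (private PID space, shared network) is possible at all.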

The user namespace is the one that can turn "root in the container" from a host-level danger into a much weaker foothold: when UID mapping is enabled, uid 0 inside is a mapped, unprivileged uid outside, so even a container escape lands as an unprivileged host user. In practice this mapping is opt-in, not the default: Docker's default runtime and most production Kubernetes clusters still run container processes as UID 0 in the host's own user namespace. Docker's userns-remap and Kubernetes' user-namespace support (the pod-level hostUsers: false field, beta since 1.30) exist precisely because rootless-by-default isn't yet the common case. Treat user-namespace remapping as a hardening option you should turn on, not a guarantee you already have.
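To check which case you are in, read /proc/<pid>/uid_map for a container process. Each line has three numbers: the first ID inside the namespace, the first ID outside, and the length of the range. The sketch below parses that format and translates an inside uid to its host uid. The sample strings are illustrative (the 100000 offset matches the example in the table), and the names `IdRange`, `parse_id_map` and `to_host_id` are hypothetical helpers.

```python
from typing import NamedTuple, Optional


class IdRange(NamedTuple):
    inside_start: int   # first ID as seen inside the user namespace
    outside_start: int  # first ID as seen from the parent namespace
    length: int         # number of consecutive IDs covered


def parse_id_map(text: str) -> list[IdRange]:
    """Parse the contents of a uid_map or gid_map file."""
    ranges = []
    for line in text.splitlines():
        if line.strip():
            inside, outside, length = (int(field) for field in line.split())
            ranges.append(IdRange(inside, outside, length))
    return ranges


def to_host_id(ranges: list[IdRange], inside_id: int) -> Optional[int]:
    """Translate an ID inside the namespace to the outside ID, or None if unmapped."""
    for r in ranges:
        if r.inside_start <= inside_id < r.inside_start + r.length:
            return r.outside_start + (inside_id - r.inside_start)
    return None


remapped = parse_id_map("0 100000 65536\n")      # remapping enabled
identity = parse_id_map("0 0 4294967295\n")      # host user namespace

print(to_host_id(remapped, 0))   # 100000: container root is unprivileged outside
print(to_host_id(identity, 0))   # 0: container root is host root
```

If the map for your container process is the identity line, uid 0 in the container is uid 0 on the host and the user namespace is not acting as a boundary.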

Traced example: building a container by hand

Docker/containerd/runc do exactly this sequence — you can reproduce the core of it with a few commands and watch each layer engage. Assume a host running cgroup v2 (unified hierarchy under /sys/fs/cgroup).

Step-by-step trace

t=0: create the namespaces. Run unshare --pid --net --mount --uts --ipc --user --fork --map-root-user /bin/sh. unshare(1) calls unshare(2) with the matching CLONE_NEW* flags and then, because of --fork, forks the shell. That forked child is PID 1 in the new PID namespace.
t=1: verify the PID illusion. Inside, echo $$ prints 1; ps (after mounting a private /proc) shows only your shell. On the host, ps aux | grep sh shows the same process under an ordinary host PID, say 5177. Same task_struct, two different numbers: the PID namespace is just a translation table on struct pid.
t=2: pivot the root. In the mount namespace, pivot_root onto an extracted image dir so / is the image, then unmount the old root. The host's / is now unreachable from inside: not merely hidden, but absent from this namespace's mount table.
t=3: attach the cgroup. From a root shell on the host (not inside the new namespaces): mkdir /sys/fs/cgroup/demo and cd into it; echo "50000 100000" > cpu.max (50 ms of CPU per 100 ms period = half a core); echo 512M > memory.max; then echo <pid> > cgroup.procs to move the process in, where <pid> is the host-side PID from t=1. The cpu and memory controllers must be enabled in the parent's cgroup.subtree_control for these files to exist.
t=4: hit the CPU limit. Run a busy loop. On the host, top shows the process pinned close to 50% of one core. The CFS bandwidth controller enforces this: in the simple case, the process (or its threads) spend the 50 ms of quota within the 100 ms period and are then throttled until the period resets. The real accounting is a bit richer (quota is drawn from a per-cgroup pool that multiple runnable threads can spend concurrently, and the separate cpu.max.burst file lets a cgroup bank some unused quota from earlier periods and spend it in a later burst), but the observable effect is the same: cpu.stat's nr_throttled and throttled_usec climb once quota is exhausted.
t=5: hit the memory limit. Allocate past 512 MiB. The kernel first tries to reclaim memory charged to the cgroup (page cache, plus swap-out if swap is available), then triggers the cgroup OOM killer, which kills a task inside this cgroup only; the host and other containers are untouched. memory.events shows oom_kill 1.
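Two of the observations in the trace can be checked by reading small text files. The sketch below assumes a kernel whose /proc/<pid>/status includes an `NSpid:` line (it lists the task's PID in each nested PID namespace, outermost first) and cgroup v2's `cpu.max` format of `<quota> <period>`, where the quota is either microseconds or the word `max`. The function names and sample strings are illustrative; the PID 5177 is the example value from the trace.

```python
from typing import Optional


def nspid_chain(status_text: str) -> list[int]:
    """Return the PIDs from the NSpid line of /proc/<pid>/status, outermost first."""
    for line in status_text.splitlines():
        if line.startswith("NSpid:"):
            return [int(field) for field in line.split()[1:]]
    raise ValueError("no NSpid line found in status text")


def cpu_limit_cores(cpu_max_text: str) -> Optional[float]:
    """Convert cpu.max contents to a core count, or None when unlimited."""
    quota, period = cpu_max_text.split()
    if quota == "max":
        return None
    return int(quota) / int(period)


# Sample contents matching the trace: host PID 5177 is PID 1 inside.
status = "Name:\tsh\nPid:\t5177\nNSpid:\t5177\t1\n"
print(nspid_chain(status))              # [5177, 1]
print(cpu_limit_cores("50000 100000"))  # 0.5
print(cpu_limit_cores("max 100000"))    # None
```

Read from the host, the first number on the `NSpid:` line is the PID to write into cgroup.procs, and the last is what the process sees when it asks for its own PID.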

Notice what you never did: boot a kernel, allocate virtual RAM, or emulate a device. You added flags to one process. That is the entire mechanism runc implements — plus seccomp/capabilities/AppArmor for hardening.

Pitfalls a working engineer actually hits

PID 1 reaps nothing → zombie storm. In a PID namespace your entrypoint is PID 1, which inherits orphaned children. A shell or app that does not reap exited children leaves zombies that accumulate until the PID table fills. Fix: run a tiny init (tini, dumb-init, or Docker's --init flag) as PID 1, or handle SIGCHLD and wait() on exited children yourself.
The JVM/Go runtime reads the host, not the cgroup. Older JREs (pre-8u191) and many tools call sysconf() or read /proc/cpuinfo, see all of the host's cores (say 64), and size thread pools and heap for a machine they can't use; the cgroup then throttles them into latency cliffs. Set GOMAXPROCS/heap explicitly or use cgroup-aware runtimes.
CFS quota throttling looks like a mystery latency spike. A service under cpu.max that bursts (GC, request spike) exhausts its quota early in the 100 ms period and is throttled until the next period starts, so p99 jumps by tens of ms while average CPU looks low. Diagnose via cpu.stat's nr_throttled/throttled_usec; often the fix is raising the quota, setting cpu.max.burst, or relying on cpu.weight (proportional shares) instead of a hard cap.
OOM kill is silent and local. Exceeding memory.max kills a process inside the cgroup with SIGKILL — no stack trace, exit code 137. Check memory.events and dmesg; page cache counts toward the limit, so heavy file I/O can trigger it even with modest heap.
"It works in Docker, not in prod" = missing capabilities/seccomp. Containers drop most Linux capabilities and apply a seccomp filter by default. Code that needs CAP_NET_ADMIN, mount, or raw sockets fails with EPERM that has nothing to do with your app logic.
Overlay copy-up on huge files. Editing one byte of a large lower-layer file copies the entire file into the upper layer, which means surprise disk usage and a write stall. Put mutable large data on a real volume, not the overlay.
Assuming rootless-by-default protects you. A service that assumes uid 0 inside the container is already unprivileged outside (because "that's how containers work now") is wrong on a default Docker or Kubernetes install: without userns-remap / hostUsers: false explicitly configured, container root is host uid 0. Verify your runtime's actual user-namespace configuration before relying on it as a security boundary.
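Several of these pitfalls come down to reading the cgroup's own files instead of the host's. The sketch below parses cgroup v2 "flat keyed" files such as cpu.stat and memory.events (one `key value` pair per line), computes the share of periods in which the cgroup was throttled, and derives a CPU count for sizing thread pools from cpu.max. Rounding the quota up to a whole CPU and never exceeding the host count is an assumption made for this example, not a rule from the kernel, and all function names and sample values are hypothetical.

```python
import math


def parse_flat_keyed(text: str) -> dict[str, int]:
    """Parse a cgroup v2 flat keyed file (e.g. cpu.stat, memory.events)."""
    values = {}
    for line in text.splitlines():
        if line.strip():
            key, value = line.split()
            values[key] = int(value)
    return values


def throttled_fraction(cpu_stat: dict[str, int]) -> float:
    """Fraction of enforcement periods in which the cgroup hit its quota."""
    periods = cpu_stat.get("nr_periods", 0)
    return cpu_stat.get("nr_throttled", 0) / periods if periods else 0.0


def effective_cpus(cpu_max_text: str, host_cpus: int) -> int:
    """CPU count to size pools with: the quota rounded up, capped at the host count."""
    quota, period = cpu_max_text.split()
    if quota == "max":
        return host_cpus
    return max(1, min(host_cpus, math.ceil(int(quota) / int(period))))


# Illustrative file contents, not measurements.
cpu_stat = parse_flat_keyed("nr_periods 1000\nnr_throttled 250\nthrottled_usec 8300000\n")
mem_events = parse_flat_keyed("low 0\nhigh 0\nmax 12\noom 1\noom_kill 1\n")

print(throttled_fraction(cpu_stat))         # 0.25
print(mem_events["oom_kill"])               # 1
print(effective_cpus("50000 100000", 64))   # 1
```

A throttled fraction well above zero alongside modest average CPU is the signature of the quota pitfall, and a nonzero oom_kill counter confirms that an exit code of 137 came from the cgroup limit rather than from the application.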

Trade-offs: containers vs virtual machines vs sandboxed runtimes

The decision is fundamentally about where the isolation boundary sits and what you are willing to pay for it.

Property	Container (runc)	VM (KVM/Firecracker)	Sandboxed (gVisor / Kata)
Isolation boundary	Shared kernel + namespaces	Hardware (VT-x), separate kernel	User-space kernel (gVisor) or micro-VM (Kata)
Start time	~10–50 ms	~100 ms–seconds (Firecracker ~125 ms)	~100–200 ms
Memory overhead	~MB (just the process)	tens–hundreds of MB (guest kernel)	tens of MB
Blast radius of a kernel 0-day	Host compromise possible	Contained to the guest	Contained (syscalls intercepted)
Syscall performance	Native (zero overhead)	Near-native	gVisor: measurable syscall tax

Use containers when you own the workloads and trust the code: microservices, CI, internal batch jobs. Density and speed dominate and a shared kernel is acceptable. Use VMs (or Firecracker micro-VMs) when you run untrusted, multi-tenant code. This is exactly why AWS Lambda and Fargate wrap each customer's container in a Firecracker micro-VM: they want container ergonomics with VM-grade isolation. Use gVisor/Kata when you need stronger-than-namespace isolation but can't afford a full VM per workload (GKE Sandbox). The trade to internalize: a container gives up the VM's hardware boundary in exchange for roughly 10× faster start and roughly 10× higher density. That is a great trade for trusted code and a dangerous one for hostile tenants on a shared kernel.

Takeaways

A container is a process the kernel lies to: namespaces virtualize its view, cgroups cap its resources, and overlayfs gives it a private image — no guest kernel, no emulated hardware.
Namespaces and cgroups are orthogonal — "what it sees" vs "what it consumes." Both are needed for real isolation; either alone leaves a hole.
The isolation is one-way and only kernel-strong: the host sees every container process, and a kernel exploit escapes the sandbox. That single shared kernel is the whole security trade-off vs a VM.
User-namespace UID remapping is what makes root-in-container much safer (a kernel exploit can still escape), but it is opt-in on Docker and most Kubernetes clusters today, so do not assume it is on without checking.
Most production pain is cgroup-adjacent: CFS-quota throttling masquerading as latency bugs, local OOM kills (exit 137), and runtimes that size themselves to the host instead of the cgroup.

Recall question

Two containers on the same host both bind port 8080 successfully with no conflict, yet ps on the host shows both their main processes. Which mechanism makes the ports non-conflicting, and which fact reveals that a container is "just a process"?

Answer: separate network namespaces give each container its own port space, so :8080 is a different socket in each. The host's ps listing both proves there is no VM boundary — they are ordinary host processes sharing one kernel, merely placed in different namespaces.

Sources: The Linux Programming Interface (Kerrisk, ch. 28 & namespaces/cgroups); Linux kernel documentation (cgroup-v2.rst, namespaces(7), overlayfs.rst); Brendan Gregg, Systems Performance (2nd ed., cgroup CPU throttling & the USE method); AWS Firecracker paper (NSDI 2020); Docker userns-remap and Kubernetes user-namespaces (KEP-127) documentation; the gVisor design docs. Authored for this guide.
