DNS


📘 Learn DNS from zero

DNS is the internet's phone book: it turns a human name like api.example.com into an IP address such as 93.184.216.34 before any TCP connection can open. In a system-design interview it almost never gets "designed" from scratch, but it is the first hop of every request, so interviewers probe whether you understand the resolution chain, how DNS caching shapes tail latency, and how you'd use DNS for load balancing, geo-routing, and failover. Getting it right means cheap global traffic steering and fast failover; getting it wrong means users pinned to a dead datacenter for minutes because of a stale TTL.

✨ Added by the guide to build intuition — not from the source course.

Lessons in this topic

🏗️ Apply it — design walkthrough

Work through this after you've learned the concepts in the lessons above.

Why resolve at all?

🤔 You type maps.example.com into a browser. Why can't the TCP/IP stack just send packets to that name — what does it actually need first, and who is asked?


Routers forward packets by IP address, not names — the name is a human convenience the network layer can't use. So the OS must first map name → IP. The chain: browser → OS stub resolver → recursive resolver (your ISP / 8.8.8.8) → root server → TLD server (.com) → authoritative server.

Cause: only the authoritative server knows the real answer for example.com.
Effect: the recursive resolver walks the hierarchy: root tells it which .com TLD server to ask, the TLD points to the authoritative nameserver, the authoritative returns the A record.

Trade-off / cost: a fully cold lookup is up to 4 sequential round trips (stub→recursive, then recursive→root, →TLD, →authoritative). Over the open internet that can be 50–200 ms before the first byte of your real request — pure overhead. That cost is exactly why every layer caches (next step).
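To make the hop count concrete, here is a minimal toy model of the cold path. The in-memory "servers" (`ROOT`, `TLD`, `AUTH`), their names, and the 25 ms per-hop figure are illustrative assumptions, not real infrastructure; no network traffic is sent.

```python
# Toy model of a cold lookup. Server names and the IP are illustrative only.
ROOT = {"com": "tld-com"}                          # root: TLD label -> TLD server
TLD = {"tld-com": {"example.com": "ns-example"}}   # TLD server: zone -> authoritative server
AUTH = {"ns-example": {"maps.example.com": "93.184.216.34"}}  # authoritative: name -> A record


def resolve_cold(name):
    """Walk root -> TLD -> authoritative and return (ip, round_trips)."""
    labels = name.split(".")
    zone = ".".join(labels[-2:])           # assumes a two-label zone such as example.com
    trips = 1                              # stub resolver -> recursive resolver
    tld_server = ROOT[labels[-1]]          # recursive -> root: "who serves .com?"
    trips += 1
    auth_server = TLD[tld_server][zone]    # recursive -> TLD: "who serves example.com?"
    trips += 1
    ip = AUTH[auth_server][name]           # recursive -> authoritative: "what is the A record?"
    trips += 1
    return ip, trips


def cold_latency_ms(rtt_ms_per_hop, trips):
    """Sequential round trips add up; nothing here can run in parallel."""
    return rtt_ms_per_hop * trips
```

With an assumed 25 ms per hop, `cold_latency_ms(25, 4)` gives 100 ms, which sits inside the 50–200 ms range quoted above.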

Where caching kicks in

🤔 If the full chain costs ~100 ms, but most repeat lookups return in under 1 ms, where did the answer come from — and what number controls that?


Every record carries a TTL (time-to-live, in seconds). Any resolver may cache the answer until the TTL expires. Caches stack: browser cache → OS cache → recursive resolver cache.

Cause: the recursive resolver caches the A record with TTL=300s.
Effect: for up to the next 5 minutes, lookups for that name skip root/TLD/authoritative entirely. A hit in the browser or OS cache is served from local memory in <1 ms; a hit at the recursive resolver costs only the single round trip to that resolver instead of ~100 ms. With millions of users behind one resolver, the authoritative server sees a tiny fraction of total query volume.

Trade-off / cost: the answer is now stale by up to TTL seconds. If you change the IP, old clients keep hitting the old IP until their cached copy expires. TTL is the dial between cheap/fast (high TTL) and agile/controllable (low TTL) — you can't max both.
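The TTL behaviour above can be sketched as a small cache with an injectable clock. `TtlCache`, `lookup`, and the `fetch_authoritative` callable are hypothetical names for illustration; real resolvers add negative caching, prefetching, and TTL clamping that this sketch leaves out.

```python
import time


class TtlCache:
    """Minimal DNS-style cache: an entry expires ttl seconds after it was stored."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._entries = {}  # name -> (ip, expires_at)

    def get(self, name):
        entry = self._entries.get(name)
        if entry is None:
            return None
        ip, expires_at = entry
        if self._clock() >= expires_at:
            del self._entries[name]   # expired: force a fresh lookup
            return None
        return ip

    def put(self, name, ip, ttl):
        self._entries[name] = (ip, self._clock() + ttl)


def lookup(cache, name, fetch_authoritative):
    """Return (ip, source). fetch_authoritative(name) is assumed to return (ip, ttl)."""
    ip = cache.get(name)
    if ip is not None:
        return ip, "cache"
    ip, ttl = fetch_authoritative(name)
    cache.put(name, ip, ttl)
    return ip, "authoritative"
```

Note the staleness: if the record changes while a cached entry is still live, `lookup` keeps returning the old IP until the entry expires.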

Picking a TTL

🤔 You're about to migrate a service to a new IP next Tuesday. Today your TTL is 86400s (24h). What do you change, when, and why?


Lower the TTL well before the change, then raise it back after.

Cause: at least 24h before (so the old day-long cached entries have time to expire), set TTL to 60s.
Effect: once those old 24h-cached entries naturally expire, every resolver is re-fetching every 60s. On cutover, clients converge on the new IP within ~60s instead of up to 24h.
After it's stable, raise TTL back to e.g. 3600s to cut query load and latency again.

Trade-off / cost: a 60s TTL means a busy resolver re-queries up to ~1440×/day (86400 / 60) instead of once, which puts more load on authoritative servers and slightly raises average lookup latency for users. Low TTL buys failover agility at the price of query volume; that's why you don't just leave it at 60s forever. (Caveat: many resolvers clamp very low TTLs to a floor of ~30–60s, so don't assume a 1s TTL is honored.)
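The arithmetic behind this plan fits in a few lines. The function names are hypothetical, times are plain seconds on one shared clock, and the sketch assumes every resolver honors the published TTL exactly, which real resolvers do not always do.

```python
SECONDS_PER_DAY = 86400


def max_queries_per_day(ttl_seconds):
    """Upper bound on authoritative queries per day from one always-busy resolver."""
    return SECONDS_PER_DAY // ttl_seconds


def migration_plan(old_ttl, new_ttl, cutover_at):
    """Return the latest safe moment to lower the TTL and the worst-case convergence time."""
    return {
        # Entries cached just before the TTL change live for old_ttl more seconds.
        "lower_ttl_no_later_than": cutover_at - old_ttl,
        # After cutover, the stalest cached answer lives for at most new_ttl seconds.
        "all_caches_converged_by": cutover_at + new_ttl,
    }
```

For a cutover at day 7 with `old_ttl=86400` and `new_ttl=60`, the TTL must be lowered by day 6, and caches converge within 60 seconds of the switch.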

DNS as a load balancer

🤔 You have 3 web servers at different IPs and want to spread traffic without buying a load balancer. How can DNS alone split the load — and what's the catch?


Use round-robin DNS: publish multiple A records for the same name.

Cause: web.example.com → [10.0.0.1, 10.0.0.2, 10.0.0.3], and the authoritative server rotates the order it returns them on each query.
Effect: different clients get a different IP first, so over ~1000 clients each server receives roughly 1/3 of new resolutions — crude traffic spreading for free, with no extra network hop.

Trade-off / cost: it's blind. DNS doesn't know server load or even whether a server is up. Caching means a heavy client can pin to one IP for the whole TTL, so distribution is uneven. And clients typically connect to the first IP returned (only falling back to the rest if it's unreachable). It balances resolutions, not requests — fine for coarse spreading, not for true load balancing.
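A small simulation shows both the even spread and the skew. `RoundRobinZone` and `spread` are illustrative names, and the model assumes each client resolves once, caches the answer for the whole TTL, and always uses the first IP.

```python
from collections import Counter, deque


class RoundRobinZone:
    """Authoritative server that rotates the A-record order on every query."""

    def __init__(self, ips):
        self._ips = deque(ips)

    def query(self):
        answer = list(self._ips)
        self._ips.rotate(-1)   # the next query sees the next IP first
        return answer


def spread(zone, requests_per_client):
    """Each client resolves once, then sends all of its requests to the first IP."""
    hits = Counter()
    for requests in requests_per_client:
        first_ip = zone.query()[0]
        hits[first_ip] += requests
    return hits
```

With 999 clients sending one request each, every server gets 333. With clients sending `[1000, 1, 1]` requests, one server absorbs 1000 of the 1002 requests even though resolutions were split evenly.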

Health checks + failover

🤔 Server 10.0.0.2 just crashed. With plain round-robin, ~1/3 of users still get sent to a dead IP. How do you stop handing out a broken address?


Use a managed DNS provider with health checks (e.g. Route 53, NS1). The provider actively probes each endpoint and only returns healthy IPs.

Cause: the health checker probes 10.0.0.2 every 30s; after N consecutive failures it marks the IP unhealthy.
Effect: that IP is removed from DNS answers, so new resolutions only return live servers — automatic failover with no human in the loop.

Trade-off / cost: failover is not instant; it is gated by detection time + TTL. Detection takes roughly N probe intervals, so with a 30s interval, N = 3, and a 60s TTL, some clients can keep hitting the dead IP for up to ~150s (3 × 30s + 60s) because of cached answers; even with N = 1 the window is ~90s. This is the core limitation of DNS failover: you can't force-expire a cache you don't control, so DNS failover is a coarse, tens-of-seconds-to-minutes safety net, not millisecond HA.
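The detection rule and the stale window can be sketched as follows. `HealthChecker` is a hypothetical class, not any provider's API, and recovering after a single successful probe is a simplifying assumption (managed providers often require several successes).

```python
class HealthChecker:
    """Marks an endpoint unhealthy after `threshold` consecutive failed probes."""

    def __init__(self, ips, threshold):
        self._threshold = threshold
        self._failures = {ip: 0 for ip in ips}

    def record_probe(self, ip, ok):
        # Assumption: one successful probe resets the failure streak.
        self._failures[ip] = 0 if ok else self._failures[ip] + 1

    def healthy_ips(self):
        """The set of IPs the DNS layer would still hand out."""
        return [ip for ip, n in self._failures.items() if n < self._threshold]


def worst_case_stale_seconds(probe_interval, threshold, ttl):
    """Detection time (threshold probes) plus one full TTL of cached answers."""
    return probe_interval * threshold + ttl
```

With a 30s interval and a 60s TTL, a threshold of 1 gives a 90s window, and a threshold of 3 stretches it to 150s. A higher threshold avoids flapping on one lost probe but lengthens the outage clients can see.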

Routing by geography

🤔 You run datacenters in Virginia and Frankfurt. A user in Berlin and a user in New York both resolve app.example.com. How do you give each the nearby IP from one hostname?


Use geo / latency-based routing at the authoritative DNS layer — the same name resolves to different answers depending on who's asking.

Cause: the authoritative server inspects the resolver's source IP (or EDNS Client Subnet) to estimate location.
Effect: Berlin user → Frankfurt IP, New York user → Virginia IP. Avoiding a cross-Atlantic hop cuts each round trip of connection setup from ~85 ms to ~10 ms, so a TCP + TLS handshake that needs at least two round trips drops from well over 150 ms to a few tens of ms. EU users' traffic also normally lands in-region, which helps with (but does not by itself guarantee) data-residency compliance.

Trade-off / cost: accuracy depends on the resolver's location, not the user's — someone in Berlin using a US-based public resolver may get routed to Virginia. EDNS Client Subnet helps but isn't universal, and per-region answers make caching and debugging harder because the "correct" answer is now relative to who asks.
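A minimal sketch of the lookup an authoritative server might do, using Python's standard `ipaddress` module. The `GEO_DB` table, the region IPs, and the default region are made-up stand-ins built from documentation address ranges; a real deployment would use a commercial or provider-supplied geo database.

```python
import ipaddress

# Hypothetical geo database: CIDR block -> region (documentation ranges as stand-ins).
GEO_DB = [
    (ipaddress.ip_network("192.0.2.0/24"), "eu"),
    (ipaddress.ip_network("198.51.100.0/24"), "us"),
]
REGION_IPS = {"eu": "203.0.113.10", "us": "203.0.113.20"}  # Frankfurt, Virginia (illustrative)
DEFAULT_REGION = "us"


def region_for(address):
    ip = ipaddress.ip_address(address)
    for network, region in GEO_DB:
        if ip in network:
            return region
    return DEFAULT_REGION


def answer(resolver_ip, client_subnet=None):
    """Prefer the ECS subnet when the resolver sent one; otherwise use the resolver's own IP."""
    if client_subnet is not None:
        net = ipaddress.ip_network(client_subnet, strict=False)
        locate = str(net.network_address)
    else:
        locate = resolver_ip
    return REGION_IPS[region_for(locate)]
```

A Berlin user behind a US resolver at `198.51.100.53` gets the Virginia IP from `answer("198.51.100.53")`, but the Frankfurt IP from `answer("198.51.100.53", "192.0.2.0/24")` once the resolver forwards the client subnet.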

Surviving a DNS outage

🤔 If the authoritative nameservers for your domain go down, no client can resolve your site even if every web server is healthy. How do you avoid this single point of failure?


Run multiple authoritative nameservers, ideally across independent providers, advertised via several NS records for the zone.

Cause: you publish ns1…ns4 on independent infrastructure, often using Anycast so one IP is announced from many global locations.
Effect: if one nameserver or region fails, resolvers retry the next NS; Anycast also routes each query to the nearest live node, so a regional outage just shifts traffic instead of breaking resolution.

Trade-off / cost: multi-provider DNS adds operational complexity — you must keep zone records in sync across providers, and a misconfiguration now exists in two places. It's insurance you pay for in config overhead, justified because DNS is the front door: if it's down, 100% of traffic is down regardless of backend health. (Note: the well-known "13" is the number of root-server identities — a limit driven by fitting the list in a UDP DNS response, not a protocol cap on how many nameservers your own zone may have.)
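The sync burden is easy to check mechanically. The sketch below compares two zone exports and reports drift; the `{(name, record_type): set_of_values}` shape is an assumed intermediate format, not any provider's export format, so you would first need to convert each provider's data into it.

```python
def zone_drift(provider_a, provider_b):
    """Compare two zones given as {(name, rtype): set_of_values}.

    Returns only the keys whose record sets differ between the two providers.
    """
    drift = {}
    for key in provider_a.keys() | provider_b.keys():
        a = provider_a.get(key, set())
        b = provider_b.get(key, set())
        if a != b:
            drift[key] = {"only_in_a": a - b, "only_in_b": b - a}
    return drift
```

Running a check like this after every change (or on a schedule) catches the case where a record was updated at one provider and forgotten at the other, which otherwise shows up as intermittent wrong answers depending on which nameserver a resolver happened to ask.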

When NOT to lean on DNS

🤔 An interviewer asks: "Why not do all your load balancing and failover in DNS and skip the L4/L7 load balancer?" What's the honest answer?


DNS is the wrong tool for fine-grained or fast control because of the cache you don't own.

Cause: once a resolver caches an answer, you cannot revoke it before its TTL; clients also tend to use just the first returned IP.
Effect: DNS gives you coarse, region-level steering and tens-of-seconds-to-minutes failover, but it can't do per-request balancing, session stickiness, connection draining, or sub-second failover.

Trade-off / cost: the standard pattern is layered: DNS routes the user to the nearest region (cheap, global), then a real load balancer inside that region does per-request distribution and instant health-based failover. Use DNS for the coarse first hop; use an LB for everything that must react in milliseconds. Saying this out loud signals you know DNS's limits, not just its features.
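The layered pattern can be sketched in a few lines. `RegionalLoadBalancer`, `handle_request`, and the VIP and server names are all hypothetical; the point is only the division of labour, with the region choice being cacheable and the server choice made fresh on every request.

```python
import itertools


class RegionalLoadBalancer:
    """Per-request round robin over the servers that are currently healthy."""

    def __init__(self, servers):
        self._servers = list(servers)
        self._healthy = set(servers)
        self._counter = itertools.count()

    def mark(self, server, healthy):
        if healthy:
            self._healthy.add(server)
        else:
            self._healthy.discard(server)   # takes effect on the very next request

    def route(self):
        live = [s for s in self._servers if s in self._healthy]
        if not live:
            raise RuntimeError("no healthy servers in region")
        return live[next(self._counter) % len(live)]


def handle_request(user_region, dns_regions, balancers):
    """DNS step: region -> VIP (coarse, cached). LB step: pick a server for this one request."""
    vip = dns_regions[user_region]
    return vip, balancers[vip].route()
```

Marking a server unhealthy changes routing immediately because the balancer sits in the request path; no client cache has to expire first.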


🎯 Guided practice

Problem 1 (Easy): Stable name, swappable server. Your app runs on a server at 11.22.33.44. You expect to migrate to a new server (new IP) someday. What DNS setup lets you migrate without users changing the URL — and what controls how fast the migration takes effect?

Identify the pattern: this is indirection — clients should depend on a name, not an IP.
Set the record: create an A record app.example.com → 11.22.33.44. Users only ever type the name.
Migrate: when the new server is ready at 55.66.77.88, update the A record's value. No client change needed.
Control the speed: the TTL governs propagation. If TTL is 3600s, old resolvers serve the stale IP for up to an hour. Pro move: lower the TTL (e.g. to 60s) at least one old-TTL period (here, an hour) before the migration window, switch, then raise it back. Key takeaway: indirection enables the swap; TTL governs the cutover speed.

Problem 2 (Medium): Global service, route users to the nearest region. You run data centers in the US and EU. A user in Paris should hit the EU servers, a user in Texas the US servers, and if a whole region dies, traffic should shift away. Design the DNS layer.

Recognize the trigger: "users worldwide, route to nearest, survive a region outage" → GeoDNS / latency-based routing + failover.
Geo step: configure the authoritative DNS to return different A records by query origin: app.example.com → EU VIP for European resolvers, → US VIP for American ones. Each VIP fronts a regional load balancer. Watch the resolver-vs-user gap: GeoDNS keys off the resolver's IP, so pick an authoritative provider that honors EDNS Client Subnet (ECS) for clients behind distant public resolvers, and remember it only helps when the resolver actually sends ECS.
Layer the responsibilities: DNS picks the region (coarse, geographic); the regional load balancer picks the healthy server within that region (fine, load/health-aware). Don't make DNS do per-server balancing — it isn't load-aware.
Failover step: attach health checks to each regional endpoint. If the EU region fails health checks, DNS stops returning the EU IP and routes EU users to US instead. Set a low TTL (e.g. 30–60s) on these records so failover propagates quickly, but accept that resolver and client/OS caches still delay it: the shift takes roughly detection time + TTL and is not instant.
State the tradeoff explicitly: low TTL = faster failover but more query volume and latency; you accept that cost to bound the stale-routing window. Core pattern: DNS for geographic/region selection, load balancers for in-region health-aware distribution.
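The DNS-layer decision in Problem 2 can be sketched as one function. The VIP addresses, the failover order, and the 60s TTL are illustrative assumptions; `region_healthy` stands in for whatever health-check state the DNS provider maintains.

```python
REGION_VIPS = {"eu": "203.0.113.10", "us": "203.0.113.20"}   # illustrative VIPs
FAILOVER_ORDER = {"eu": ["eu", "us"], "us": ["us", "eu"]}    # nearest region first
RECORD_TTL = 60                                               # low TTL to bound stale routing


def dns_answer(client_region, region_healthy):
    """Return (vip, ttl) for the nearest healthy region, or None if every region is down."""
    for region in FAILOVER_ORDER[client_region]:
        if region_healthy.get(region, False):
            return REGION_VIPS[region], RECORD_TTL
    return None
```

For example, `dns_answer("eu", {"eu": False, "us": True})` returns the US VIP, which is the region-level failover the problem asks for. Choosing a server inside the region is deliberately left to the regional load balancer.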

✨ Added by the guide — work these before the full problem set.