Kafka
Step 20 in the System Design path · 5 concepts · 0 problems
📘 Learn Kafka from zero
Kafka is a distributed, append-only commit log: producers append messages to partitioned topics, and consumers read at their own pace by tracking an offset. It is the default answer in interviews whenever you need to decouple services, absorb traffic spikes, fan one event out to many independent readers, or replay history — think "ingest 1M events/sec, let analytics, billing, and search all consume the same stream." Knowing the mechanism (partitions, offsets, consumer groups, replication) lets you justify throughput and ordering claims with numbers instead of hand-waving, which is exactly what separates a passing answer from a strong one.✨ Added by the guide to build intuition — not from the source course.
Lessons in this topic
- ○Introduction to Kafka
- ○Messaging patterns
- ○Popular Messaging Queue Systems
- ○RabbitMQ vs Kafka vs ActiveMQ
- ○Scalability and Performance (2)
🏗️ Apply it — design walkthrough
Work through this after you've learned the concepts in the lessons above.
🤔 A traditional queue (RabbitMQ) deletes a message once a consumer acks it. Kafka does NOT delete on read. Why does an append-only log that keeps messages around change what the system can do?
Reveal the reasoning
Mechanism: Kafka stores each topic as an immutable, append-only file on disk. Each message gets a monotonically increasing offset (0, 1, 2, ...). A consumer just remembers "I've read up to offset 4711" — the broker stores nothing per-consumer about delivery state.
Cause → effect chain: messages are not deleted on read → many independent consumer groups can read the same partition at different offsets (analytics at offset 9000, billing at offset 4711) without interfering → and any consumer can rewind to offset 0 to replay 7 days of history after a bug fix.
Retention is time/size based, e.g. retention.ms=604800000 (7 days) — not consumption based.
Trade-off / cost: the broker pushes delivery-tracking work onto the consumer (it must commit offsets), and you pay for disk to hold data nobody is reading yet — e.g. 7 days × 200 MB/s ≈ 120 TB before replication. A queue that deletes on ack uses far less storage and gives simpler per-message semantics.
🤔 A single topic on a single disk caps out around ~50–100 MB/s. You need to ingest 1,000,000 events/sec. What is the one structural change that unlocks horizontal scale, and what new problem does it create?
Reveal the reasoning
Mechanism: a topic is split into N partitions, each an independent log that can live on a different broker. Producers hash the message key to pick a partition (partition = hash(key) % N); keyless messages round-robin.
Cause → effect chain: split topic into 50 partitions across 10 brokers → writes and reads now happen in parallel on 50 disks → if one disk does 100 MB/s, aggregate ≈ 5 GB/s, easily covering 1M events/sec at ~1 KB each (≈ 1 GB/s) → throughput scales (almost) linearly with partition count.
Trade-off / cost: ordering is guaranteed only within a partition, never across the topic. So events for the same entity must share a key (e.g. userId) to land in the same partition and stay ordered. More partitions also means more open file handles and more leader-election work — thousands of partitions per broker increases failover time and metadata pressure. You also can't easily shrink partition count later (re-hashing breaks key→partition mapping).
🤔 You have 1 topic with 12 partitions and want to process events faster by adding more consumer instances. You scale to 12 consumers, then to 20. Throughput stops improving after 12. Why?
Reveal the reasoning
Mechanism: consumers join a consumer group (same group.id). Kafka assigns each partition to exactly one consumer in the group. This is how Kafka does competing-consumer load balancing.
Cause → effect chain: 12 partitions, 12 consumers → each consumer owns 1 partition, perfect parallelism. Add 8 more → there are no spare partitions to hand out → those 8 consumers sit idle, throughput is flat. Partition count is the hard ceiling on parallelism within one group.
Different use cases get different group IDs: billing group and analytics group each read all 12 partitions independently (fan-out), unaffected by each other.
Trade-off / cost: you must size partition count for your peak future parallelism up front, because increasing it later disrupts key→partition ordering. Also, every scale event triggers a rebalance — partitions are revoked and reassigned, pausing consumption for seconds, which is why frequent scaling/crashing hurts latency.
🤔 A producer sends a message, gets a success response, and 200 ms later the broker's disk dies. Could that message be lost? What setting decides the answer?
Reveal the reasoning
Mechanism: each partition has a leader and R-1 follower replicas (e.g. replication factor 3). Followers continuously fetch from the leader; those caught up form the in-sync replica set (ISR). The producer's acks setting decides when a write is considered done.
Cause → effect chain:
acks=0: fire-and-forget → fastest, but a crash before the write lands loses data.acks=1: leader confirms after writing locally → if the leader dies before a follower replicates, that message is lost (the scenario above).acks=allwithmin.insync.replicas=2: leader waits until ≥2 replicas have it → survives a single broker failure with zero loss.
Trade-off / cost: acks=all adds a network round-trip to the slowest in-sync follower, raising produce latency from ~1 ms to several ms and cutting throughput. And min.insync.replicas=2 means if too many replicas fall behind, the partition rejects writes rather than risk loss — you trade availability for durability.
🤔 A consumer reads a message, processes it (charges a credit card), then crashes before committing its offset. On restart it re-reads the same message. The customer is charged twice. How do you stop double-processing — and can Kafka give you true exactly-once?
Reveal the reasoning
Mechanism: the consumer's offset commit is what marks "I've handled up to here." The ordering of process-vs-commit determines the guarantee.
Cause → effect chain:
- At-most-once: commit offset before processing → a crash skips the message (lost), never duplicated.
- At-least-once (the default, the scenario above): process then commit → a crash before commit re-delivers → duplicates possible.
- Exactly-once: achievable Kafka→Kafka via idempotent producer + transactions (
read_committed), which atomically writes output and the offset commit together.
Trade-off / cost: end-to-end exactly-once into an external system (the credit-card charge) is generally impossible — Kafka can't transact with Stripe. The standard answer is at-least-once + an idempotency key (dedupe on a unique request ID) so reprocessing is harmless. Kafka transactions also add coordinator overhead and latency, so most pipelines deliberately stay at-least-once.
🤔 Kafka writes everything to disk, yet sustains hundreds of MB/s per broker — faster than many in-memory systems. Disk is supposed to be slow. What two tricks make disk-backed Kafka this fast?
Reveal the reasoning
Mechanism: Kafka exploits how operating systems and hardware actually behave, not random-access disk myths.
Cause → effect chain:
- Sequential append-only writes: messages are appended to the end of a file → the disk head/SSD does sequential I/O, which is ~100× faster than random I/O → spinning disks hit 100s of MB/s sequentially.
- Zero-copy (
sendfile): on read, data goes from page cache straight to the network socket → it never gets copied into the JVM heap and back → saves CPU and memory bandwidth, so a broker can push data near line-rate. - Plus batching + compression: the producer groups many messages into one batch (e.g.
linger.ms=10) → fewer, larger requests → higher throughput per network round-trip.
Trade-off / cost: batching with linger.ms deliberately adds latency (you wait to fill a batch) to buy throughput. Zero-copy is bypassed if you need TLS or broker-side transformation, so encryption-in-transit reduces the speedup. The design favors high-throughput streaming, not single-message low-latency request/response.
🤔 Your interviewer asks: "Why Kafka here and not RabbitMQ?" Picking the wrong one is a red flag. For (a) a real-time clickstream feeding 5 analytics jobs and replay, vs (b) routing 200 order-processing tasks to workers with per-message priority — which goes where, and why?
Reveal the reasoning
Mechanism: they are different shapes. Kafka is a dumb broker / smart consumer durable log: consumers pull and track their own offsets, and messages persist after read. RabbitMQ is a smart broker message queue: the broker routes via exchanges, pushes to consumers, and deletes each message once it is acked.
Cause → effect chain:
- (a) clickstream → Kafka. 5 analytics jobs each need the full stream → independent consumer groups each read all partitions without consuming each other's copy → and because the log is retained, you can rewind and replay after a bug. A queue that deletes on ack can't fan out to 5 independent readers or replay.
- (b) order tasks → RabbitMQ. You want work distributed to workers, per-message priority, dead-letter on failure, and TTL → RabbitMQ's smart broker does routing keys, priority queues, and per-message ack/redelivery natively. Kafka has no per-message priority (order is fixed by offset) and no built-in per-message redelivery, so forcing this onto Kafka is awkward.
Trade-off / cost: Kafka buys throughput, retention, and fan-out/replay at the cost of complex routing and per-message control. RabbitMQ buys flexible routing, priority, and simple at-most/at-least-once task delivery, but tops out far lower (tens of thousands of msg/s per queue vs Kafka's millions) and is not a replayable log — once acked, a message is gone. Choose by the shape: durable replayable event stream with many readers → Kafka; flexible task routing with per-message control → RabbitMQ.
📐 Architecture diagrams (5)
🎯 Guided practice
- Easy — "Which tool?" A team needs to send password-reset emails. ~50 messages/min, each must be processed once (or retried), and routed by priority (VIP vs normal). Should they use Kafka?
Reasoning: (1) Identify the signals: low volume, per-message ack + retry, priority routing, no replay/fan-out needed. (2) Match against Kafka's strengths — high throughput, replay, multi-consumer fan-out — none apply. (3) Recognize the anti-pattern: Kafka has no native per-message priority and no per-message redelivery (you advance an offset over an ordered log, you don't selectively ack/nack individual records). Priority queues, dead-letter routing, and per-message acks are exactly what a smart broker (RabbitMQ) provides via queues, exchanges, and bindings. Answer: use RabbitMQ, not Kafka. The lesson: pick Kafka for streams you replay/fan-out; pick RabbitMQ for task queues that need rich routing and a per-message lifecycle.
- Medium — "Design and size it." Design ingestion for an IoT fleet: 1,000,000 devices, each sending one reading/sec (≈1M msg/sec). Downstream, a real-time alerting service and a batch analytics warehouse both need every reading, in per-device order. How do you model topics, partitions, and keys, and how do you size consumers?
Reasoning: (1) Throughput: ≈1M readings/sec with fan-out to two independent consumers and per-device ordering — textbook Kafka. (2) Topic: one topic
device-readings. (3) Ordering key: key bydevice_idso every reading for a device hashes to the same partition, preserving per-device order (global order is neither needed nor achievable across partitions). (4) Partitions: ordering holds only within a partition and a consumer group's parallelism is capped at partition count, so partitions are the unit of scale. If one consumer instance sustains ~50k msg/sec, you need ≥20 partitions to keep up; pick a higher round number (e.g. 32 or 64) for headroom, since increasing partition count later remapsdevice_id→partition and breaks ordering for in-flight keys. Also set replication factor (e.g. 3) for durability. (5) Consumers: two separate consumer groups (alerting, analytics), each independently reading all partitions at its own offset; within a group, up to (partition count) instances split the partitions, one partition assigned to at most one instance at a time. (6) Delivery: default at-least-once, so make alerting idempotent (dedupe bydevice_id+ reading timestamp). Core pattern learned: the key chooses ordering, partition count caps parallelism, replication gives durability, and separate consumer groups give independent fan-out over the same durable log.
✨ Added by the guide — work these before the full problem set.