Checksum
Step 17 in the System Design path · 2 concepts · 0 problems
📘 Learn Checksum from zero
Imagine mailing a friend a 1,000-piece jigsaw puzzle. You worry a piece might fall out in transit. So before sealing the box you count the pieces and write "1000" on the lid. Your friend counts on arrival; if they get 999, they know something was lost — without re-examining every piece against the original. That number on the lid is a checksum: a small, fixed-size summary computed from a larger blob of data, used to detect whether the data changed.
In computing, when data is transmitted (over a network) or stored (on disk), bits can flip from noise, faulty hardware, or dropped packets. A checksum is a function that reads all the bytes and condenses them into a short digest. The sender computes it and ships it alongside the data. The receiver re-computes the digest from the bytes it actually received and compares. Match → almost certainly intact. Mismatch → treat the data as bad and re-request or discard (a mismatch could even mean the checksum itself was mangled, but the safe response is the same).
Worked example: send the message HI as ASCII bytes 72 and 73. Use a trivial checksum: sum the bytes modulo 256. (72 + 73) mod 256 = 145. You transmit [72, 73, 145]. Now a bit flips and the 72 arrives as 73. The receiver computes (73 + 73) mod 256 = 146, compares to the sent 145, sees 146 ≠ 145, and rejects the message. Real systems use stronger functions — a naive sum is blind to byte reordering (swap two bytes and the sum is unchanged), whereas CRC32 catches that and burst errors — but the mechanism is identical.
Key insight: a checksum is a cheap, one-way fingerprint that turns "is this data correct?" into a fast fixed-size comparison — it detects accidental corruption, but a plain checksum does not protect against a deliberate attacker.
✨ Added by the guide to build intuition — not from the source course.
Lessons in this topic
🎯 Guided practice
- Easy — Verify a downloaded file. You download a 5 GB Linux ISO; the site also publishes
sha256: a1b2.... How do you know the download isn't corrupted?Step 1: recognise the trigger — "did this large blob arrive intact?" → checksum. Step 2: run the same function the publisher used over your local file:
sha256sum ubuntu.iso, anO(n)single pass over the bytes. Step 3: compare your computed digest to the published one. Step 4: equal → accept; unequal → a byte got corrupted or you fetched the wrong file → re-download. Why it works: you never need the original 5 GB on hand — only the tiny 32-byte fingerprint, so comparison isO(32). Note: SHA-256 (cryptographic) is chosen over CRC32 here because it also makes it computationally infeasible for a malicious mirror to swap in a different file that still matches the published digest — though the integrity of the published digest itself still depends on the channel you got it from (HTTPS, a signed release). - Medium — Detect drift between two replicas. A primary DB and a replica each hold a large table. You must confirm they're identical without shipping the whole table across the network.
Step 1: reframe — "are these two copies the same?" is an integrity comparison → checksum. Step 2: naive approach: each side computes one checksum over all rows and exchanges the digest.
O(n)compute,O(1)network. Step 3: the catch — if the digests differ, you only learn that they differ somewhere, not where (the detection-vs-correction limit). Step 4: refine — partition rows into ranges (say 1,000 buckets), checksum each bucket, and compare the 1,000 small digests; only the mismatching buckets get drilled into. Generalise this and you get a Merkle tree — the anti-entropy / replica-reconciliation technique from the Dynamo paper, covered in Alex Xu Vol 1's "Design a Key-Value Store" chapter. It is a tree of hashes-of-hashes: comparing the two root hashes tells you "everything is equal" inO(1), and on a mismatch you descend only the divergent branches. Cost (state it precisely): for a single point of divergence that path costsO(log n); with d divergent leaves it isO(d log n)— the data exchanged is proportional to the differences, not the dataset size, which is the whole point. Pattern learned: checksums compose hierarchically — hashing the hashes turns "find the difference" from a full table scan into a targeted descent down only the branches that actually disagree.
✨ Added by the guide — work these before the full problem set.