Knowledge Guide
HomeSystem Design

Checksum

Step 17 in the System Design path · 2 concepts · 0 problems

0 / 2 complete

📘 Learn Checksum from zero

Imagine mailing a friend a 1,000-piece jigsaw puzzle. You worry a piece might fall out in transit. So before sealing the box you count the pieces and write "1000" on the lid. Your friend counts on arrival; if they get 999, they know something was lost — without re-examining every piece against the original. That number on the lid is a checksum: a small, fixed-size summary computed from a larger blob of data, used to detect whether the data changed.

In computing, when data is transmitted (over a network) or stored (on disk), bits can flip from noise, faulty hardware, or dropped packets. A checksum is a function that reads all the bytes and condenses them into a short digest. The sender computes it and ships it alongside the data. The receiver re-computes the digest from the bytes it actually received and compares. Match → almost certainly intact. Mismatch → treat the data as bad and re-request or discard (a mismatch could even mean the checksum itself was mangled, but the safe response is the same).

Worked example: send the message HI as ASCII bytes 72 and 73. Use a trivial checksum: sum the bytes modulo 256. (72 + 73) mod 256 = 145. You transmit [72, 73, 145]. Now a bit flips and the 72 arrives as 73. The receiver computes (73 + 73) mod 256 = 146, compares to the sent 145, sees 146 ≠ 145, and rejects the message. Real systems use stronger functions — a naive sum is blind to byte reordering (swap two bytes and the sum is unchanged), whereas CRC32 catches that and burst errors — but the mechanism is identical.

Key insight: a checksum is a cheap, one-way fingerprint that turns "is this data correct?" into a fast fixed-size comparison — it detects accidental corruption, but a plain checksum does not protect against a deliberate attacker.

✨ Added by the guide to build intuition — not from the source course.

Lessons in this topic

🎯 Guided practice

  1. Easy — Verify a downloaded file. You download a 5 GB Linux ISO; the site also publishes sha256: a1b2.... How do you know the download isn't corrupted?

    Step 1: recognise the trigger — "did this large blob arrive intact?" → checksum. Step 2: run the same function the publisher used over your local file: sha256sum ubuntu.iso, an O(n) single pass over the bytes. Step 3: compare your computed digest to the published one. Step 4: equal → accept; unequal → a byte got corrupted or you fetched the wrong file → re-download. Why it works: you never need the original 5 GB on hand — only the tiny 32-byte fingerprint, so comparison is O(32). Note: SHA-256 (cryptographic) is chosen over CRC32 here because it also makes it computationally infeasible for a malicious mirror to swap in a different file that still matches the published digest — though the integrity of the published digest itself still depends on the channel you got it from (HTTPS, a signed release).

  2. Medium — Detect drift between two replicas. A primary DB and a replica each hold a large table. You must confirm they're identical without shipping the whole table across the network.

    Step 1: reframe — "are these two copies the same?" is an integrity comparison → checksum. Step 2: naive approach: each side computes one checksum over all rows and exchanges the digest. O(n) compute, O(1) network. Step 3: the catch — if the digests differ, you only learn that they differ somewhere, not where (the detection-vs-correction limit). Step 4: refine — partition rows into ranges (say 1,000 buckets), checksum each bucket, and compare the 1,000 small digests; only the mismatching buckets get drilled into. Generalise this and you get a Merkle tree — the anti-entropy / replica-reconciliation technique from the Dynamo paper, covered in Alex Xu Vol 1's "Design a Key-Value Store" chapter. It is a tree of hashes-of-hashes: comparing the two root hashes tells you "everything is equal" in O(1), and on a mismatch you descend only the divergent branches. Cost (state it precisely): for a single point of divergence that path costs O(log n); with d divergent leaves it is O(d log n) — the data exchanged is proportional to the differences, not the dataset size, which is the whole point. Pattern learned: checksums compose hierarchically — hashing the hashes turns "find the difference" from a full table scan into a targeted descent down only the branches that actually disagree.

✨ Added by the guide — work these before the full problem set.