What Are Checksums and ETags, and How Do MD5/SHA Compare to CRC for Integrity?

Checksums are short digital fingerprints of data used to verify integrity, and ETags (entity tags) are unique identifiers often derived from checksums or hashes that web and storage systems use to track data versions and detect changes.

What Is a Checksum? (Data Integrity Basics)

A checksum is a small code calculated from a block of digital data to detect errors or alterations. By recomputing the checksum later and comparing it to the original value, you can verify whether the data has changed.

If the checksums match, the data is almost certainly unaltered.
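A minimal sketch of that record-then-recompute check, using Python's standard hashlib module. The helper name `checksum` and the sample payloads are illustrative assumptions, and SHA-256 is only one possible choice of algorithm.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Return a hex fingerprint of the data (SHA-256 chosen for illustration)."""
    return hashlib.sha256(data).hexdigest()


original = b"quarterly-report-v1"   # hypothetical payload
recorded = checksum(original)       # value stored alongside the data

# Later: recompute the checksum and compare it with the recorded value.
intact = checksum(original) == recorded
corrupted = checksum(b"quarterly-report-v2") == recorded
print(intact, corrupted)  # True False
```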

Checksums are widely used for data integrity in storage and transmission, for example, to ensure a file wasn’t corrupted during download or to detect disk errors. They don’t by themselves prove authenticity (i.e. who created the data), but they excel at catching accidental corruption.

Common checksum algorithms range from simple ones (like parity bits or summing bytes) to more complex hash functions.

A good checksum algorithm produces a very different output even when the input changes only slightly. As a result, even a one-bit error in the data yields a clearly different checksum value, alerting us to corruption.
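To illustrate why the choice of algorithm matters, the sketch below compares a naive byte sum with CRC-32 from Python's zlib module. The `byte_sum` helper and the sample messages are assumptions made for this example. Swapping two adjacent bytes leaves the byte sum unchanged, while the CRC changes.

```python
import zlib


def byte_sum(data: bytes) -> int:
    """Naive checksum: sum of all bytes modulo 256 (illustrative only)."""
    return sum(data) % 256


a = b"pay 12 to bob"
b = b"pay 21 to bob"   # same bytes, two adjacent digits swapped

same_sum = byte_sum(a) == byte_sum(b)       # the sum ignores byte order
same_crc = zlib.crc32(a) == zlib.crc32(b)   # the CRC is order-sensitive
print(same_sum, same_crc)  # True False
```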

In data storage systems, checksums are crucial: disk controllers, file systems, and backup software often compute checksums (or hashes) for each data block or file to detect bit rot or transmission errors over time.

What Is an ETag? (Entity Tags in Web and Storage)

An ETag (Entity Tag) is an identifier assigned by a web or storage server to a specific version of a resource.

In HTTP (the web protocol), the ETag is returned in a response header to help with caching: it’s essentially a unique tag for the content.

When a browser or client caches a resource, it also stores its ETag.

Later, it can make a conditional request by sending the stored ETag in an If-None-Match header: “give me the data only if it has changed, using this ETag to check.”

The server compares the provided ETag with the current version’s ETag. If they match, it responds with a 304 Not Modified status (no need to resend the content). This makes web caching more efficient and saves bandwidth.
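The exchange can be sketched without a network as a plain function that plays the server's role. The names `make_etag` and `handle_get` are hypothetical, and deriving the tag from a truncated SHA-256 is just one assumed strategy. A real server would also need to handle lists of tags and the `*` wildcard in If-None-Match, which this sketch omits.

```python
import hashlib


def make_etag(body: bytes) -> str:
    """Derive a quoted ETag from the content (one possible approach)."""
    return '"' + hashlib.sha256(body).hexdigest()[:16] + '"'


def handle_get(body: bytes, request_headers: dict) -> tuple:
    """Return (status, headers, payload) for a GET with optional If-None-Match."""
    etag = make_etag(body)
    if request_headers.get("If-None-Match") == etag:
        return 304, {"ETag": etag}, b""      # client copy is still current
    return 200, {"ETag": etag}, body         # send the full content


page = b"<h1>hello</h1>"
status1, headers1, _ = handle_get(page, {})                       # first visit
cached = {"If-None-Match": headers1["ETag"]}
status2, _, payload2 = handle_get(page, cached)                   # unchanged
status3, _, _ = handle_get(b"<h1>changed</h1>", cached)           # modified
print(status1, status2, status3)  # 200 304 200
```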

In practice, an ETag value is often generated by hashing the content or using a timestamp or version number.

(For example, many ETags look like strings of hex digits, which often come from an MD5 or SHA-1 hash of the content.)

ETags are basically fingerprints for data versions.

Whenever the resource changes, a new ETag is computed and assigned.

Besides caching, ETags can also be used for concurrency control, to prevent one client from overwriting another's changes: a client can say “update this resource only if its ETag still matches” (in HTTP, by sending the ETag in an If-Match header).
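A toy illustration of that optimistic-concurrency check, assuming an in-memory resource. The `Store` class and its `put` method are hypothetical. The status codes mirror HTTP's 204 No Content for a successful update and 412 Precondition Failed for a stale ETag.

```python
import hashlib


class Store:
    """Toy in-memory resource guarded by an ETag precondition."""

    def __init__(self, body: bytes):
        self.body = body

    @property
    def etag(self) -> str:
        # Assumption: the tag is derived from the current content.
        return hashlib.sha256(self.body).hexdigest()

    def put(self, new_body: bytes, if_match: str) -> int:
        """Apply the update only if the caller saw the current version."""
        if if_match != self.etag:
            return 412            # someone else changed the resource first
        self.body = new_body
        return 204


doc = Store(b"draft 1")
seen_by_alice = doc.etag
seen_by_bob = doc.etag            # both clients read the same version

alice_status = doc.put(b"draft 2 (alice)", if_match=seen_by_alice)
bob_status = doc.put(b"draft 2 (bob)", if_match=seen_by_bob)
print(alice_status, bob_status)  # 204 412
```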

In the storage industry, ETags are used in cloud object storage systems as integrity checks.

For instance, Amazon S3 provides an ETag for each stored object; for standard uploads, this ETag is effectively an MD5 checksum of the file (or a combination of MD5 checksums for multipart uploads).

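By comparing an upload’s ETag to a locally computed MD5, users can verify that the file arrived intact. This direct comparison only works for single-part uploads: a multipart ETag is not the MD5 of the whole object, and objects encrypted with SSE-KMS or SSE-C do not carry an MD5-based ETag.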
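The sketch below shows how such ETags could be reproduced locally. The multipart scheme used here (MD5 over the concatenated binary MD5 digests of the parts, followed by a dash and the part count) is the commonly described behavior, not something the text above specifies, so treat it as an assumption and check it against the storage provider's documentation. The function names and the part size are hypothetical, and the part size must match the one used for the actual upload.

```python
import hashlib


def single_part_etag(data: bytes) -> str:
    """MD5 hex digest, as expected for a plain single-part upload."""
    return hashlib.md5(data).hexdigest()


def multipart_etag(data: bytes, part_size: int) -> str:
    """Assumed multipart scheme: MD5 of the parts' MD5 digests, plus '-<parts>'."""
    parts = [data[i:i + part_size] for i in range(0, len(data), part_size)]
    digests = b"".join(hashlib.md5(part).digest() for part in parts)
    return f"{hashlib.md5(digests).hexdigest()}-{len(parts)}"


payload = b"x" * 25                       # tiny stand-in for a large object
etag = multipart_etag(payload, part_size=10)
print(etag.endswith("-3"))                # True: 25 bytes split into 3 parts
```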

In short, ETags tie together the concept of checksums with practical version tracking. They ensure you’re working with the exact data you expect, without having to re-transfer the data unnecessarily.

MD5, SHA, and CRC: Comparing Integrity Check Algorithms

When it comes to algorithms for generating these fingerprints, there are two broad categories: cryptographic hash functions like MD5 or SHA, and error-detecting codes like CRC.

All serve as checksums in the general sense, but they have different strengths and purposes.

MD5 and SHA: Cryptographic Hashes for Integrity

MD5 and SHA (SHA-1, SHA-256, etc.) are cryptographic hash functions that produce a fixed-size hash value from input data. They are designed so that it’s extremely hard to find two different inputs with the same hash (especially for newer hashes like SHA-256).

Even a tiny change in the input (like flipping one bit) will avalanche into a completely different output. These hashes have a large output (MD5 is 128-bit, SHA-1 is 160-bit, SHA-256 is 256-bit), which means there are astronomically many possible values.
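A small experiment makes the avalanche effect and the digest sizes concrete. It uses only Python's hashlib; the `bit_difference` helper and the sample message are assumptions for illustration. Flipping one input bit should change roughly half of the 256 output bits.

```python
import hashlib


def bit_difference(a: bytes, b: bytes) -> int:
    """Count the bit positions in which two equal-length byte strings differ."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


message = bytearray(b"integrity matters")
flipped = bytearray(message)
flipped[0] ^= 0b00000001          # flip a single bit of the input

d1 = hashlib.sha256(message).digest()
d2 = hashlib.sha256(flipped).digest()
changed_bits = bit_difference(d1, d2)   # expected to be near 128 of 256

# Output size in bits for each algorithm mentioned above.
digest_bits = {name: hashlib.new(name).digest_size * 8
               for name in ("md5", "sha1", "sha256")}
print(changed_bits, digest_bits)
```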

This makes the chance of an accidental collision (two different files having the same hash) vanishingly small.

For example, two specific different inputs share a SHA-256 value by chance with probability 2^-256. Even across a huge collection, a random collision is only expected after roughly 2^128 hashed items (about 3.4×10^38), since collision resistance is generally half the hash length in bits. For practical purposes, the chance of a random collision is effectively zero.

Originally, hashes like MD5/SHA were meant for security (e.g. digital signatures and message authentication).

For integrity checking, their strength means a very high assurance against both random errors and intentional tampering.

If you download a file and compare its published SHA-256 checksum to the one you compute, a match gives high confidence the file is exact.

If an attacker tried to maliciously alter the file, they would not easily produce a different file with the same SHA-256.

In fact, “all common cryptographic hash functions are better than CRC for integrity, even MD5, because they’re designed to detect malicious tampering whereas CRC is not”.

However, some cryptographic hashes have known weaknesses. MD5, for instance, is now considered cryptographically broken. Researchers have found ways to create deliberate collisions (two different inputs with the same MD5).

This is why MD5 is no longer used for security-critical applications.

But for non-malicious integrity checking (detecting random corruption), MD5 still drastically lowers the collision odds compared to a short checksum.

In modern practice, SHA-256 (part of the SHA-2 family) or newer hashes are preferred for rigorous integrity verification, since they have no known collisions and produce a longer digest.

SHA-256 in particular is widely used in storage systems and software for integrity. For example, the ZFS file system can use SHA-256 to checksum data blocks.

CRC: Cyclic Redundancy Check for Error Detection

CRC (Cyclic Redundancy Check) is a different kind of checksum algorithm commonly used in storage devices, networks, and other hardware systems.

A CRC (like CRC32, a 32-bit checksum) is very fast to compute (often implemented in hardware) and excellent at detecting common accidental errors in data transmission or storage.

For instance, network packets and disk drives use CRCs to catch single-bit or small burst errors caused by noise or hardware issues.

The algorithm treats the data as coefficients of a polynomial and divides by a fixed generator polynomial, using the remainder as the checksum.

This method is tuned to catch typical errors: if just a few bits flip or a short burst of bytes gets garbled, the CRC value will almost certainly change, alerting the system to corruption.
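The polynomial division can be written out bit by bit. The sketch below is a slow, illustrative version of CRC-32 in the reflected form used by zlib (constant 0xEDB88320, initial value and final XOR of 0xFFFFFFFF). The function name is an assumption, and production code would use a table-driven or hardware implementation instead.

```python
import zlib


def crc32_bitwise(data: bytes) -> int:
    """Bit-at-a-time CRC-32: repeatedly reduce by the generator polynomial."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 1:
                # Low bit set: shift and subtract (XOR) the polynomial.
                crc = (crc >> 1) ^ 0xEDB88320
            else:
                crc >>= 1
    return crc ^ 0xFFFFFFFF


check = crc32_bitwise(b"123456789")
print(hex(check), check == zlib.crc32(b"123456789"))  # 0xcbf43926 True
```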

The emphasis of CRC is on speed and efficacy for random error detection, not on cryptographic security.

CRCs are relatively short (32 bits in CRC32, etc.), so there are far fewer possible values than with MD5/SHA. This means there is a higher (though still small) chance that two different pieces of data could by coincidence share the same CRC.

In fact, a 32-bit CRC has about 4.3 billion (2^32) possible values, so the probability that a random corruption goes undetected is about 1 in 4.3 billion (assuming truly random changes).

For most practical cases of accidental errors, this is extremely good. CRCs will catch virtually all random single-bit flips or small errors.

However, if you handle enormous numbers of files or data blocks (say, billions of files), the odds of an accidental CRC collision aren’t zero.
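To put numbers on this, the sketch below uses the standard birthday approximation to estimate the chance that at least two items in a collection share a checksum. The function name and the sample sizes are assumptions. Note that this is a different question from whether one specific corrupted block slips past its own CRC, which stays at about 1 in 2^32.

```python
import math


def collision_probability(items: int, bits: int) -> float:
    """Approximate P(at least one shared value) among `items` random checksums."""
    pairs = items * (items - 1) / 2
    # 1 - exp(-pairs / 2^bits), computed with expm1 to stay accurate near zero.
    return -math.expm1(-pairs / 2.0 ** bits)


crc32_77k = collision_probability(77_163, 32)        # roughly a coin flip
crc32_billion = collision_probability(10**9, 32)     # essentially certain
sha256_huge = collision_probability(10**18, 256)     # negligible
print(round(crc32_77k, 3), crc32_billion, sha256_huge)
```

Under this approximation, a 32-bit checksum is a weak identifier for deduplicating large collections, even though it remains a strong per-block error check.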

And importantly, an attacker who wants to fool a CRC can do so easily by making compensating changes to the data (CRC is not designed to resist intentional manipulation).
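One way to see why CRC offers no protection against deliberate changes is its linear structure: for equal-length messages, the CRC of their XOR can be predicted from the individual CRCs without knowing any secret. The sketch below checks this property with zlib's CRC-32. The `xor_bytes` helper and the sample messages are assumptions, and this demonstrates the underlying weakness, not a complete forgery tool.

```python
import zlib


def xor_bytes(*chunks: bytes) -> bytes:
    """XOR equal-length byte strings together."""
    out = bytearray(len(chunks[0]))
    for chunk in chunks:
        for i, value in enumerate(chunk):
            out[i] ^= value
    return bytes(out)


a = b"transfer 100 usd"
b = b"transfer 900 usd"
c = b"xxxxxxxxxxxxxxxx"   # any third message of the same length

# Predict the CRC of the combined message from the three known CRCs.
predicted = zlib.crc32(a) ^ zlib.crc32(b) ^ zlib.crc32(c)
actual = zlib.crc32(xor_bytes(a, b, c))
print(predicted == actual)  # True
```

A cryptographic hash has no comparable algebraic shortcut, which is why an attacker cannot simply compute compensating changes.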

In storage industry usage, CRCs often guard low-level data integrity.

For example, disk controllers may append a CRC to each sector, so if a bad disk read occurs, the CRC mismatch signals the data is corrupted.

Similarly, RAID systems and data transmission protocols use CRCs to detect errors on the fly because they can be computed extremely fast (and even in parallel with data transfer). Some filesystems and storage solutions use both: a fast CRC for quick error checks and a stronger hash for end-to-end verification.
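Combining both checks can be done in a single pass over the data, as in this minimal sketch. The function name and the chunked input are assumptions; it relies on `zlib.crc32` accepting a running value and on hashlib's incremental `update`.

```python
import hashlib
import zlib


def crc_and_sha256(chunks) -> tuple:
    """Compute a running CRC-32 and a SHA-256 digest over the same chunks."""
    crc = 0
    sha = hashlib.sha256()
    for chunk in chunks:
        crc = zlib.crc32(chunk, crc)   # cheap check, updated incrementally
        sha.update(chunk)              # strong end-to-end fingerprint
    return crc, sha.hexdigest()


crc, sha_hex = crc_and_sha256([b"1234", b"56789"])
print(hex(crc))  # 0xcbf43926
```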

MD5/SHA vs. CRC: Which to Use for Integrity?

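Choosing between a cryptographic hash (like MD5/SHA) and a CRC comes down to context: a CRC suits fast detection of accidental errors in hardware, networks, and low-level storage, while a cryptographic hash such as SHA-256 suits end-to-end verification, very large collections of data, and any case where deliberate tampering is a concern.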

Practical Examples
