
What Are Checksums and ETags, and How Do MD5/SHA Compare to CRC for Integrity?

Checksums are short digital fingerprints of data used to verify integrity, and ETags (entity tags) are unique identifiers often derived from checksums or hashes that web and storage systems use to track data versions and detect changes.

What Is a Checksum? (Data Integrity Basics)

A checksum is a small code calculated from a block of digital data to detect errors or alterations. By recomputing the checksum later and comparing it to the original value, you can verify whether the data has changed.

If the checksums match, the data is almost certainly unaltered.
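The compute-then-compare cycle described above can be sketched in a few lines. This is a minimal illustration, assuming SHA-256 as the checksum; the `checksum` helper and the sample byte strings are made up for the example.

```python
import hashlib

def checksum(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hex string."""
    return hashlib.sha256(data).hexdigest()

original = b"quarterly-report-v1"
stored = checksum(original)            # recorded when the data was written

received_ok = b"quarterly-report-v1"
received_bad = b"quarterly-report-v2"  # one character changed

print(checksum(received_ok) == stored)   # True: data unchanged
print(checksum(received_bad) == stored)  # False: data was altered
```

The stored value only needs to be kept somewhere the data itself cannot silently overwrite, such as a separate metadata record.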

Checksums are widely used for data integrity in storage and transmission, for example, to ensure a file wasn’t corrupted during download or to detect disk errors. They don’t by themselves prove authenticity (i.e. who created the data), but they excel at catching accidental corruption.

Common checksum algorithms range from simple ones (like parity bits or summing bytes) to more complex hash functions.

Good checksum algorithms change significantly even if the input data changes only a little. This property means even a one-bit error in the data will result in a very different checksum value, alerting us to corruption.
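To see why the choice of algorithm matters, the sketch below compares a deliberately naive byte-sum checksum with CRC-32 on two messages that contain the same bytes in a different order. The `additive_checksum` function is a hypothetical illustration, not a standard algorithm.

```python
import zlib

def additive_checksum(data: bytes) -> int:
    """Sum all bytes modulo 256 (a deliberately simple checksum)."""
    return sum(data) % 256

a = b"pay 19 dollars"
b = b"pay 91 dollars"  # same bytes, two digits swapped

# The byte sum is order-independent, so the swap goes unnoticed.
print(additive_checksum(a) == additive_checksum(b))  # True

# CRC-32 depends on position as well as value, so it sees the change.
print(zlib.crc32(a) == zlib.crc32(b))                # False
```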

In data storage systems, checksums are crucial: disk controllers, file systems, and backup software often compute checksums (or hashes) for each data block or file to detect bit rot or transmission errors over time.

What Is an ETag? (Entity Tags in Web and Storage)

An ETag (Entity Tag) is an identifier assigned by a web or storage server to a specific version of a resource.

In HTTP (the web protocol), the ETag is returned in a response header to help with caching: it’s essentially a unique tag for the content.

When a browser or client caches a resource, it also stores its ETag.

Later, it can make a conditional request by sending that ETag back in an If-None-Match header: “give me the data only if it has changed since this version.”

The server compares the provided ETag with the current version’s ETag. If they match, it responds with a 304 Not Modified status (no need to resend the content). This makes web caching more efficient and saves bandwidth.
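The server-side comparison can be sketched as a small function. This is a simplified model, not a real HTTP server: `make_etag` and `handle_get` are hypothetical names, the ETag is assumed to be a truncated SHA-256 of the body, and details such as weak ETags or multiple values in If-None-Match are ignored.

```python
import hashlib
from typing import Dict, Optional, Tuple

def make_etag(body: bytes) -> str:
    """Derive an ETag from the content (one common choice, not the only one)."""
    return '"' + hashlib.sha256(body).hexdigest()[:16] + '"'

def handle_get(body: bytes,
               if_none_match: Optional[str] = None) -> Tuple[int, Dict[str, str], bytes]:
    """Return (status, headers, payload) for a simplified conditional GET."""
    etag = make_etag(body)
    if if_none_match == etag:
        # The client's cached copy is current: send headers only.
        return 304, {"ETag": etag}, b""
    return 200, {"ETag": etag}, body

# First request: full response, client stores the ETag.
status, headers, payload = handle_get(b"<html>v1</html>")

# Revalidation with the stored ETag: no body is resent.
status2, _, payload2 = handle_get(b"<html>v1</html>", headers["ETag"])
print(status, status2, len(payload2))  # 200 304 0
```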

In practice, an ETag value is often generated by hashing the content or using a timestamp or version number.

(For example, many ETags look like strings of hex digits, which often come from an MD5 or SHA-1 hash of the content.)

ETags are basically fingerprints for data versions.

Whenever the resource changes, a new ETag is computed and assigned.

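Besides caching, ETags can also be used for concurrency control, to prevent clients from overwriting each other's changes. A client can send the ETag it last saw in an If-Match header, meaning “update this resource only if its ETag still matches”; if the resource has changed in the meantime, the server rejects the update with 412 Precondition Failed.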
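A minimal sketch of ETag-based optimistic concurrency, assuming an in-memory resource and a content-derived ETag. The `Resource` class and its `put` method are hypothetical; a real server would read the expected tag from the request's If-Match header.

```python
import hashlib

class Resource:
    """A single in-memory resource guarded by its ETag."""

    def __init__(self, body: bytes):
        self.body = body

    @property
    def etag(self) -> str:
        return hashlib.sha256(self.body).hexdigest()[:16]

    def put(self, new_body: bytes, if_match: str) -> int:
        """Apply the update only if the caller saw the current version."""
        if if_match != self.etag:
            return 412  # Precondition Failed: someone else changed it first
        self.body = new_body
        return 200

doc = Resource(b"v1")
seen_by_both = doc.etag                             # Alice and Bob both read v1

print(doc.put(b"v2 from alice", seen_by_both))      # 200
print(doc.put(b"v2 from bob", seen_by_both))        # 412: Bob's copy is stale
```

Bob's write is rejected rather than silently overwriting Alice's, so he can re-read the resource and retry.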

In the storage industry, ETags are used in cloud object storage systems as integrity checks.

For instance, Amazon S3 provides an ETag for each stored object; for standard uploads, this ETag is effectively an MD5 checksum of the file (or a combination of MD5 checksums for multipart uploads).

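For a simple (single-part) upload, comparing the object’s ETag to a locally computed MD5 lets users verify that the file arrived intact. A multipart ETag is not the MD5 of the whole file, so it cannot be compared this way directly.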
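The sketch below shows how such a local comparison value might be computed. The single-part case is a plain MD5. The multipart layout shown (MD5 of the concatenated per-part MD5 digests, followed by a dash and the part count) is commonly observed behaviour and should be treated as an assumption here; it also only matches if `part_size` equals the part size the uploader actually used, and it does not apply to every encryption setting. Both function names are hypothetical.

```python
import hashlib

def simple_etag(data: bytes) -> str:
    """Expected ETag for a single-part upload: hex MD5 of the content."""
    return hashlib.md5(data).hexdigest()

def multipart_etag(data: bytes, part_size: int) -> str:
    """Assumed multipart layout: MD5 over the per-part MD5 digests, plus '-<parts>'."""
    parts = [data[i:i + part_size] for i in range(0, len(data), part_size)]
    digests = b"".join(hashlib.md5(p).digest() for p in parts)
    return f"{hashlib.md5(digests).hexdigest()}-{len(parts)}"

print(simple_etag(b"hello world"))  # 5eb63bbbe01eeed093cb22bb8f5acdc3
print(multipart_etag(b"a" * 10, part_size=4))
```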

In short, ETags tie together the concept of checksums with practical version tracking. They ensure you’re working with the exact data you expect, without having to re-transfer the data unnecessarily.

MD5, SHA, and CRC: Comparing Integrity Check Algorithms

When it comes to algorithms for generating these fingerprints, there are two broad categories: cryptographic hash functions like MD5 or SHA, and error-detecting codes like CRC.

All serve as checksums in the general sense, but they have different strengths and purposes.

MD5 and SHA: Cryptographic Hashes for Integrity

MD5 and SHA (SHA-1, SHA-256, etc.) are cryptographic hash functions that produce a fixed-size hash value from input data. They are designed so that it’s extremely hard to find two different inputs with the same hash (especially for newer hashes like SHA-256).

Even a tiny change in the input (like flipping one bit) will avalanche into a completely different output. These hashes have a large output (MD5 is 128-bit, SHA-1 is 160-bit, SHA-256 is 256-bit), which means there are astronomically many possible values.
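The avalanche effect is easy to observe directly. The following sketch flips a single input bit and counts how many of SHA-256's 256 output bits differ; the `bit_difference` helper and the sample message are illustrative. For a well-behaved hash, roughly half the output bits are expected to change.

```python
import hashlib

def bit_difference(a: bytes, b: bytes) -> int:
    """Count the bit positions in which two equal-length byte strings differ."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

msg = bytearray(b"The quick brown fox")
flipped = bytearray(msg)
flipped[0] ^= 0x01  # flip the lowest bit of the first byte

d1 = hashlib.sha256(msg).digest()
d2 = hashlib.sha256(flipped).digest()

changed = bit_difference(d1, d2)
print(f"{changed} of 256 output bits changed")
```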

This makes the chance of an accidental collision (two different files having the same hash) vanishingly small.

For example, finding a collision in SHA-256’s 256-bit output by brute force takes on the order of 2^128 (about 3.4×10^38) attempts, because the birthday bound puts collision resistance at roughly half the hash length in bits. For practical purposes, the chance of a random collision is effectively zero.

Originally, hashes like MD5/SHA were meant for security (e.g. verifying signatures, storing passwords).

For integrity checking, their strength means a very high assurance against both random errors and intentional tampering.

If you download a file and compare its published SHA-256 checksum to the one you compute, a match gives high confidence the file is exact.

If an attacker tried to maliciously alter the file, they would not easily produce a different file with the same SHA-256.

In fact, “all common cryptographic hash functions are better than CRC for integrity, even MD5, because they’re designed to detect malicious tampering whereas CRC is not”.

However, some cryptographic hashes have known weaknesses. MD5, for instance, is now considered cryptographically broken. Researchers have found ways to create deliberate collisions (two different inputs with the same MD5).

This is why MD5 is no longer used for security-critical applications.

But for non-malicious integrity checking (detecting random corruption), MD5 still drastically lowers the collision odds compared to a short checksum.

In modern practice, SHA-256 (part of the SHA-2 family) or newer hashes are preferred for rigorous integrity verification, since they have no known collisions and produce an even larger output.

SHA-256 in particular is widely used in storage systems and software for integrity; for example, the ZFS file system can use SHA-256 to checksum data blocks.

CRC: Cyclic Redundancy Check for Error Detection

CRC (Cyclic Redundancy Check) is a different kind of checksum algorithm commonly used in storage devices, networks, and other hardware systems.

A CRC (like CRC32, a 32-bit checksum) is very fast to compute (often implemented in hardware) and excellent at detecting common accidental errors in data transmission or storage.

For instance, network packets and disk drives use CRCs to catch single-bit or small burst errors caused by noise or hardware issues.

The algorithm treats the data as coefficients of a polynomial and divides by a fixed generator polynomial, using the remainder as the checksum.
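The polynomial division can be written out bit by bit. The sketch below assumes the widely used CRC-32 parameters (the ones `zlib.crc32` implements): generator 0x04C11DB7 in its bit-reflected form 0xEDB88320, an all-ones initial value, and a final inversion. Production code would use a table-driven or hardware implementation; this version only aims to make the shift-and-XOR division visible.

```python
import zlib

POLY = 0xEDB88320  # bit-reflected form of the CRC-32 generator 0x04C11DB7

def crc32_bitwise(data: bytes) -> int:
    """Compute CRC-32 one bit at a time (slow, for illustration only)."""
    crc = 0xFFFFFFFF                      # initial register value
    for byte in data:
        crc ^= byte                       # bring the next 8 message bits in
        for _ in range(8):
            if crc & 1:
                crc = (crc >> 1) ^ POLY   # "subtract" the generator (XOR)
            else:
                crc >>= 1                 # nothing to subtract, just shift
    return crc ^ 0xFFFFFFFF               # final inversion

print(hex(crc32_bitwise(b"123456789")))   # 0xcbf43926
print(crc32_bitwise(b"hello") == zlib.crc32(b"hello"))  # True
```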

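This method is tuned to catch typical errors: if just a few bits flip or a short burst of bytes gets garbled, the CRC value will almost certainly change, alerting the system to corruption. (A 32-bit CRC, for example, is guaranteed to detect any single error burst no longer than 32 bits.)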

The emphasis of CRC is on speed and efficacy for random error detection, not on cryptographic security.

CRCs are relatively short (32 bits in CRC32, etc.), so there are far fewer possible values than with MD5/SHA. This means there is a higher (though still small) chance that two different pieces of data could by coincidence share the same CRC.

In fact, a 32-bit CRC has 4.3 billion possible values, so the probability that a random corruption goes undetected is about 1 in 4.3 billion (assuming truly random changes).

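For most practical cases of accidental errors, this is extremely good. A CRC catches every single-bit flip and virtually all other small random errors.

However, that figure applies to a single comparison. If you handle enormous numbers of files or data blocks, the odds add up: across billions of checks an occasional undetected error becomes plausible, and by the birthday effect some pair of different items is likely to share the same CRC-32 once a collection reaches roughly a hundred thousand items.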
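The effect of collection size can be estimated with the standard birthday-bound approximation, p ≈ 1 - exp(-n(n-1) / 2^(b+1)) for n items and a b-bit checksum. The sketch assumes checksum values are uniformly distributed, which is an idealization; `collision_probability` is a hypothetical helper.

```python
import math

def collision_probability(n_items: int, bits: int) -> float:
    """Approximate chance that at least two of n_items share a bits-wide checksum."""
    exponent = -n_items * (n_items - 1) / 2 ** (bits + 1)
    return -math.expm1(exponent)  # 1 - exp(exponent), accurate for tiny values

for bits in (32, 64, 128, 256):
    p = collision_probability(1_000_000, bits)
    print(f"{bits:>3}-bit checksum, 1,000,000 items: p = {p:.3g}")
```

Under this model a 32-bit checksum is almost certain to produce some colliding pair among a million items, while a 128-bit or 256-bit hash stays negligibly close to zero.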

And importantly, an attacker who wants to fool a CRC can do so easily by making compensating changes to the data (CRC is not designed to resist intentional manipulation).
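The sketch below illustrates how little effort such a forgery takes. It relies on the fact that, for a fixed prefix, CRC-32 changes linearly (over XOR) with the bits of a 4-byte suffix, so a suffix that produces any chosen CRC can be found by solving a small linear system. The function name `forge_suffix` and the sample messages are hypothetical.

```python
import zlib

def forge_suffix(prefix: bytes, target_crc: int) -> bytes:
    """Find 4 bytes s such that crc32(prefix + s) == target_crc."""
    base = zlib.crc32(prefix + b"\x00" * 4)
    # Effect on the CRC of setting each single bit of the suffix.
    cols = [zlib.crc32(prefix + (1 << i).to_bytes(4, "little")) ^ base
            for i in range(32)]

    # Gaussian elimination over GF(2): basis[b] holds a vector whose highest
    # set bit is b, plus the mask of suffix bits that produces it.
    basis = [None] * 32
    for i, vec in enumerate(cols):
        combo = 1 << i
        while vec:
            b = vec.bit_length() - 1
            if basis[b] is None:
                basis[b] = (vec, combo)
                break
            vec ^= basis[b][0]
            combo ^= basis[b][1]

    want, solution = target_crc ^ base, 0
    while want:
        b = want.bit_length() - 1
        if basis[b] is None:
            raise ValueError("no suffix found")
        want ^= basis[b][0]
        solution ^= basis[b][1]
    return solution.to_bytes(4, "little")

original = b"pay alice 10 dollars"
tampered = b"pay mallory 9999 dollars"
forged = tampered + forge_suffix(tampered, zlib.crc32(original))

print(zlib.crc32(forged) == zlib.crc32(original))  # True: same CRC, different data
```

No comparable shortcut is known for SHA-256, which is why a CRC should never be the only check when tampering is a concern.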

In storage industry usage, CRCs often guard low-level data integrity.

For example, disk controllers may append a CRC to each sector, so if a bad disk read occurs, the CRC mismatch signals the data is corrupted.

Similarly, RAID systems and data transmission protocols use CRCs to detect errors on the fly because they can be computed extremely fast (and even in parallel with data transfer). Some filesystems and storage solutions use both: a fast CRC for quick error checks and a stronger hash for end-to-end verification.

MD5/SHA vs. CRC: Which to Use for Integrity?

Choosing between a cryptographic hash (like MD5/SHA) and a CRC comes down to context:

Speed and Performance: CRC algorithms are generally much faster and lightweight. They’re often available in hardware instructions, making them ideal for real-time error checking (e.g. streaming data, network packets). MD5 and SHA are computationally heavier, though MD5 is fairly fast in software and newer CPUs even have instructions for SHA. If you need to checksum terabytes of data quickly (and trust that errors will be random, not malicious), CRC might be the better fit for performance. For example, Google Cloud Storage recommends using CRC32C for integrity checks during data transfer for speed, though it also supports MD5 for those who prefer it.
Integrity Strength: Cryptographic hashes provide a much larger output space (128 bits for MD5, 256 bits for SHA-256, etc. vs 32 or 64 bits for typical CRCs). This drastically lowers collision probability. If you are guarding against even the slightest chance of undetected corruption, especially in large data sets or archives, a cryptographic hash is safer. For instance, among millions of files, some pair of different files is very likely to share the same 32-bit CRC, which matters if the checksum also serves as an identifier. Using a 128-bit or 256-bit hash makes that astronomically unlikely. One cryptography answer notes that even a 64-bit hash is usually enough for non-malicious integrity checking in most cases, but the bigger the hash, the lower the collision risk.
Security and Tamper Resistance: If there’s any possibility of an adversary or malicious actor, CRC is not appropriate. It’s trivial for someone to alter data and adjust a few bytes so that the CRC remains the same (because of its mathematical structure). MD5/SHA are designed to make finding two inputs with the same hash infeasible. While MD5 has known collisions, an accidental MD5 match is still far less likely than an accidental CRC match. SHA-1 has also had a practical collision demonstrated, whereas SHA-256 and above have no known collisions. Thus, for software updates, downloads, or backups where you want to ensure no one has tampered with the data, you’d use a cryptographic hash, with SHA-256 or stronger being preferred today. For example, Linux ISO downloads often provide SHA256 sums. If your downloaded file’s hash doesn’t match, you know the file is corrupted or fake.
Use Cases: In summary, use CRC for quick error detection in internal systems and transmissions where performance is key and errors are expected to be random (e.g., internal disk/storage checks, network error checking, RAID parity verification). Use MD5/SHA (or another hash) for end-to-end data integrity where you need a stronger guarantee that the data is exactly as intended, especially in the presence of potential malicious changes or when distributing files to users. It’s not uncommon to use both: for instance, a storage system might use CRC to detect low-level errors and also maintain a SHA-256 hash of each file for periodic integrity audits or client verification.
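The "use both" approach mentioned above can be sketched as a single streaming pass that feeds each chunk to a CRC-32 and a SHA-256 at the same time, so even very large files are read only once. The function name, the path argument, and the 1 MiB chunk size are assumptions made for the example.

```python
import hashlib
import zlib
from typing import Tuple

def crc32_and_sha256(path: str, chunk_size: int = 1 << 20) -> Tuple[int, str]:
    """Read a file once and return (CRC-32, SHA-256 hex digest)."""
    crc = 0
    sha = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            crc = zlib.crc32(chunk, crc)  # continue the running CRC
            sha.update(chunk)             # continue the running hash
    return crc, sha.hexdigest()

# Example usage (hypothetical file name):
# crc, digest = crc32_and_sha256("backup.tar")
```

The CRC can then serve quick, frequent checks, while the SHA-256 digest is kept for audits or for clients who want to verify the file themselves.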

Practical Examples

File Download Verification: Imagine you download a 1 GB software image. The provider offers a SHA-256 checksum on their website. After downloading, you run a tool to compute the SHA-256 of the file and compare it to the provided value. If even a single byte is off, the SHA-256 hash will be completely different, telling you the download is corrupted or has been tampered with. (Some providers still use MD5 here, which serves the same purpose but with less safety margin due to MD5’s weaknesses. SHA-256 is now the standard for robust verification.)
Data Storage Integrity: In a cloud storage scenario, say you upload a file to Amazon S3. The service computes an MD5 checksum of the file as it receives it. S3 uses that to generate the object’s ETag (for a simple upload). After upload, you can compare S3’s reported ETag with your file’s MD5 to ensure the file wasn’t corrupted in transit. Meanwhile, S3 (and other storage systems) also use internal checksums. Amazon S3 and Google Cloud Storage will validate checksums on data to guard against corruption at rest. Some advanced file systems like ZFS store a checksum (a Fletcher checksum by default, or optionally a SHA-256 hash) for every disk block; whenever data is read, ZFS recomputes the checksum, and if it doesn’t match, it knows the block is bad and can recover it from redundancy.
Network Packet Error Checking: Every Ethernet frame and many other network packets include a CRC (often 32-bit) in the trailer. As data arrives, the receiver recalculates the CRC and compares it. If it doesn’t match, the packet is assumed corrupted and discarded. This all happens extremely fast in hardware. Here, using a heavy hash like SHA-256 for each packet would be unnecessary overkill. A CRC is more than sufficient to catch the occasional flipped bit due to electrical noise. The same logic applies inside storage devices: for example, NVMe’s end-to-end data protection can attach a CRC guard to each block, with a 64-bit CRC (CRC64) among the supported formats.
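The packet-checking flow can be mimicked in software. This is a simplified model in the style of a frame check sequence, not the exact Ethernet wire format: the 4-byte little-endian trailer and the `build_frame` and `check_frame` helpers are assumptions made for illustration.

```python
import zlib

def build_frame(payload: bytes) -> bytes:
    """Sender side: append the payload's CRC-32 as a 4-byte trailer."""
    return payload + zlib.crc32(payload).to_bytes(4, "little")

def check_frame(frame: bytes) -> bool:
    """Receiver side: recompute the CRC and compare it with the trailer."""
    payload, trailer = frame[:-4], frame[-4:]
    return zlib.crc32(payload).to_bytes(4, "little") == trailer

frame = build_frame(b"hello over the wire")

damaged = bytearray(frame)
damaged[3] ^= 0x10  # simulate one bit flipped by line noise

print(check_frame(frame))           # True: accept the frame
print(check_frame(bytes(damaged)))  # False: discard the frame
```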
