Cache Coherence (MESI) & False Sharing
The mechanism
Every modern CPU core has its own private L1/L2 cache, so the same memory address can live in several caches at once — and hardware must guarantee that no two cores ever see different values for it. MESI is the classic protocol that enforces this: each cache line is tagged with one of four states (Modified, Exclusive, Shared, Invalid), and before any core writes a line, it forces every other copy of that line to Invalid. That single rule — "a write requires exclusive ownership" — is both what keeps caches consistent and what makes false sharing so expensive.
Caveat on mechanism: textbook MESI is usually explained as pure snooping — every core broadcasts its reads/writes on a shared bus and every other cache listens. That model is accurate for small, bus-connected systems and is the right mental model for learning the states. It is not how most modern multi-socket and many-core x86/ARM chips actually implement it at scale: those predominantly use directory-based or hybrid snoop-filter/probe-filter coherence, where a directory (or filter) tracks which cores hold a line so requests can be routed point-to-point instead of broadcast to everyone. The state machine below (M/E/S/I and the transitions) is still the right level to reason about correctness and false sharing — the broadcast-vs-directory distinction is a implementation detail of how the invalidation is delivered, not whether it happens.
Why it matters
Coherence is invisible when it works and catastrophic when you fight it. The unit of coherence is not a variable — it is a cache line, almost universally 64 bytes on x86-64 and ARM64. The hardware tracks ownership per line, not per byte. So two threads updating two completely independent variables that happen to sit within the same 64-byte line will ping the line back and forth between their cores' caches as if they were contending for one lock — except there is no lock, no correctness bug, and nothing in your source code that looks wrong. This is false sharing, and — as an illustrative, order-of-magnitude figure, not a guarantee — it routinely turns a naive counter into something ~5-10x slower than a version where each counter has its own cache line.
The four MESI states
Each line in each cache carries two state bits. The states form a strict hierarchy of ownership:
| State | Meaning | Other caches? | Memory in sync? | May write without a bus transaction? |
|---|---|---|---|---|
| Modified | This cache owns the only copy and it is dirty | No (all Invalid) | No — this cache is the truth | Yes |
| Exclusive | This cache has the only copy and it is clean | No | Yes | Yes — silently transitions to Modified |
| Shared | Clean copy, possibly present in other caches too | Maybe | Yes | No — must first invalidate the others |
| Invalid | Stale / empty — a read here is a miss | Irrelevant | — | No — must fetch first |
The two states people confuse are E and S. Exclusive is the optimization that makes read-then-write cheap: if you load a line nobody else has, you get it in E, and a subsequent write flips E → M with no bus traffic at all. Shared means someone else might hold it, so a write must first send an invalidation and wait — that round trip is the cost of coherence.
A traced example: two cores sharing one line
Line L holds address A. Core 0 and Core 1 each have private caches. Watch the state of L in each cache as the program runs. "RFO" = Read-For-Ownership, the invalidating read a core issues before it may write.
| # | Action | Bus transaction | Core 0 state | Core 1 state |
|---|---|---|---|---|
| 1 | C0 reads A (nobody has it) | BusRd, no responder | E | I |
| 2 | C0 writes A | none (silent E→M) | M | I |
| 3 | C1 reads A | BusRd; C0 flushes, downgrades | S | S |
| 4 | C1 writes A | BusRdX / Invalidate | I | M |
| 5 | C0 writes A | RFO; C1 flushes, invalidates | M | I |
Steps 4 and 5 are the killer. Each write is now a full coherence miss: the writing core must invalidate the other copy and pull the freshest data across the interconnect. On a modern server that is roughly ~40-80 ns if the line comes from another core on the same socket, and worse across sockets — versus ~1 ns for an L1 hit. If both cores loop writing to A, the line "ping-pongs" and every single write pays that penalty. This is exactly what false sharing does — except the two cores are writing to different addresses that merely live in the same line.
MESI vs. MOESI/MESIF: notice step 3 forces C0 to write the dirty line back to memory before C1 can read it in S — a real memory round trip on the critical path. AMD's MOESI adds an Owned state so the previous owner can hand the dirty data directly to the requester cache-to-cache and both end up sharing it, skipping the write-to-memory step. Intel's MESIF instead adds a Forward state: among several Shared copies, exactly one is designated the responder for future read requests, avoiding N-way duplicate responses. Both are strict superset optimizations of MESI aimed at cutting the same read-after-write-elsewhere latency this trace exposes — they change who answers a read miss and whether memory is touched, not the fundamental E/S/I contention story or the false-sharing failure mode, which is identical under all three protocols.
False sharing, concretely
Consider a striped counter: an array of per-thread longs where each thread updates only its own slot, no lock or CAS involved.
long[] counters = new long[N]; // 8 bytes each
// thread t does, in a tight loop:
counters[t]++; // only ever touches its own index
Logically there is zero sharing — thread 0 touches counters[0], thread 1 touches counters[1], and so on; this is plain unsynchronized disjoint memory access, not a lock-free algorithm (there's no shared mutable state or CAS to reason about — the point is precisely that no synchronization is needed). But a long is 8 bytes, so eight consecutive counters fit in one 64-byte line. Threads 0-7 are all hammering the same cache line. Every increment by thread 0 sends an RFO that invalidates the line in threads 1-7's caches; their next increment misses and drags the line back. The line ping-pongs on every write. Measured on a typical 8-core x86 box, the packed version runs on the order of ~5-10x slower than the padded one (same illustrative figure as above — exact ratio depends on core count, topology, and memory latency), and — the tell-tale sign — it gets slower as you add threads instead of faster.
The fix: padding and @Contended
The remedy is to guarantee each hot variable owns its own 64-byte line, so a write only ever transitions E→M locally and never invalidates a neighbour. Three levels:
1. Manual padding — pad the struct out to (at least) 64 bytes. In Java this is fragile because the JVM can reorder or eliminate unused fields, but the classic idiom is:
class PaddedLong {
// 7 longs of padding on each side keeps 'value' alone on its line
long p1, p2, p3, p4, p5, p6, p7;
volatile long value;
long q1, q2, q3, q4, q5, q6, q7;
}
2. @jdk.internal.vm.annotation.Contended (public @Contended pre-Java 9) — the JVM inserts padding for you, and unlike manual fields the compiler won't dead-strip it. Requires -XX:-RestrictContended to use outside the JDK. This is exactly how java.util.concurrent.atomic.LongAdder's internal Cell[] stripes avoid false sharing, and why LongAdder crushes a single contended AtomicLong under write-heavy load.
@jdk.internal.vm.annotation.Contended
static final class Cell { volatile long value; } // JDK's own layout
3. In C/C++/Go — align to the line: alignas(64) in C++, _ Padding [64]byte or golang.org/x/sys/cpu.CacheLinePad in Go. Go's runtime uses exactly this for its per-P sharded structures.
Padding is one remedy among several — compare the costs
Padding/@Contended is the cheapest fix but not the only one, and it's the wrong tool when the real problem is genuine contention rather than accidental line-sharing. Three remedies, ranked by what they actually attack:
| Remedy | Attacks | Cost | Use when |
|---|---|---|---|
Padding / @Contended | Accidental co-location on one line | ~56-120 wasted bytes per field; no code restructuring | Independent per-thread/per-object values that merely alias a line (the striped-counter example above) |
| Thread-local accumulation, merge at the end | Per-write coherence traffic entirely — no cross-thread write during the hot loop at all | Extra memory per thread for the life of the loop; a merge/reduce step at completion; results are only as fresh as the last merge (not visible to other threads mid-flight) | Batch or bounded-lifetime work (a request, a job, a benchmark loop) where nobody needs to observe partial totals from another thread |
Striping / sharding (LongAdder-style), aligned per stripe | True contention on one logical counter, updated continuously by many writers, with results needed live | Read path becomes O(#stripes) to sum; memory grows with contention (stripes can expand under load) | Many concurrent writers, infrequent reads, value needed at any time (metrics, request counters) |
| Restructuring data layout (e.g. struct-of-arrays → array-of-structs, or vice versa; separating hot/cold fields) | Unrelated hot and cold fields sharing a line, or a hot field sharing a line with a field owned by a different thread's object | Requires redesigning the type, can hurt locality for code paths that do want the fields together | The false sharing is structural — e.g. a per-connection object's hot counter sits next to another connection's hot counter in an array of structs |
Decide like this: if the sharing is accidental and per-thread state is genuinely independent, pad it — it's a one-line fix. If many threads must continuously update the same logical value, don't pad a single field forever fighting contention — shard it and pay the O(#stripes) cost at read time (this is why LongAdder beats a padded single AtomicLong under heavy concurrent writes). If the workload is a bounded batch where no one needs to see live partial results, skip synchronization entirely and accumulate thread-locally, merging once at the end — that's cheaper than either padding or striping because it removes cross-thread writes altogether. Reach for layout restructuring only when the sharing is baked into how objects are arranged in memory, not into a single hot field.
How to see it — measurement, not guessing
False sharing is invisible in source and in most profilers that attribute time to functions. You diagnose it with hardware counters:
perf stat -e cache-misses,l2_rqsts.all_rfo— a false-sharing hotspot shows a high rate of RFOs / L2 misses on a tiny working set that should fit in L1.- On Intel,
perf c2c(cache-to-cache) is the purpose-built tool: it pinpoints the exact cache line and the offending offsets, showing HITM (a load that hit a Modified line in another core) events — the fingerprint of ping-pong. - The behavioural tell without any tool: throughput that drops as you add threads, and a hot loop whose CPI (cycles per instruction) is far worse than the instruction mix predicts.
Pitfalls a working engineer hits
- Padding gets optimized away. Naive unused padding fields in Java can be eliminated or reordered by the JIT. Use
@Contended, not hand-rolled longs, on the JVM. - Adjacent-line prefetch. Some Intel CPUs fetch pairs of lines (128-byte effective unit). If you're chasing the last bit of contention, aligning to 128 bytes can matter more than 64.
- False sharing hides at the edges of arrays and objects. The first and last elements of a padded array can still share a line with an adjacent object on the heap. That's why
@Contendedpads both sides. - The read-only case is fine. Many threads reading a shared line all hold it in S concurrently — no invalidation, no penalty. False sharing only bites when at least one thread writes. Don't pad read-mostly data; you'll just waste cache.
- Over-padding blows the cache. Padding trades memory and cache capacity for coherence traffic. Pad the genuinely hot, write-contended fields only.
- Calling disjoint per-thread writes "lock-free" is a category error. There's no algorithm being contrasted with a lock here — it's just unsynchronized access to non-overlapping memory. Reserve "lock-free" for algorithms with a progress guarantee under contention (CAS loops, etc.); don't apply it to code that simply has no shared state to synchronize.
Takeaways
- Coherence is tracked per 64-byte line, not per variable — the line is the true unit of contention.
- MESI's rule "a write needs exclusive ownership" means every write to a line another core holds forces an invalidation and a ~40-80 ns coherence miss; E→M writes to an unshared line are essentially free. Real hardware often implements the invalidation via directories/snoop-filters rather than pure broadcast, and MOESI/MESIF shave the read-after-remote-write path further — but the false-sharing failure mode is the same under all of them.
- False sharing = independent variables on one line, so writes ping-pong the line between cores for no logical reason (this is unsynchronized access, not a lock-free algorithm); symptom is throughput dropping as threads increase.
- Fix accidental sharing with padding /
@Contended; fix continuous true contention by sharding (LongAdder); for bounded batch work, skip synchronization and merge thread-local results at the end — it's cheaper than either. Only pad write-contended data — never read-mostly data.
Recall
Two threads increment two different long fields in a loop and the program gets slower when pinned to two cores than to one. There is no shared variable and no lock. What is happening, which MESI transition dominates, and what's the one-line fix?
Answer: the two fields sit in the same 64-byte cache line (false sharing). Each store issues an RFO/invalidate, forcing the other core's copy to Invalid, so the line ping-pongs (S/M → I on every write) instead of staying in M. Fix: separate the fields onto distinct lines — @Contended or 64-byte alignment.
Sources: Ulrich Drepper, "What Every Programmer Should Know About Memory" (MESI, coherence traffic); Brendan Gregg, Systems Performance (PMC/perf methodology, c2c); Intel 64 & IA-32 Architectures Optimization Reference Manual (cache lines, adjacent-line prefetch, RFO); Culler & Singh, Parallel Computer Architecture (directory-based coherence, MOESI/MESIF); Doug Lea / OpenJDK LongAdder and @Contended source; Martin Thompson, "Mechanical Sympathy" (false-sharing benchmarks). Re-authored/Deepened for this guide.
🤖 Don't fully get this? Learn it with Claude
Stuck on Cache Coherence (MESI) & False Sharing? Open Claude, copy a block below, and it'll teach you this exact concept — visually and interactively.
Build the mental picture, not memorization.
I just read a lesson on **Cache Coherence (MESI) & False Sharing** (Concurrency) and want to truly understand it. Explain Cache Coherence (MESI) & False Sharing from first principles using ONE vivid real-world analogy and a visual mental model — draw it as ASCII art or a clear step-by-step diagram — with a concrete example using real numbers. Then ask me one question to check I got the mental picture, and wait for my reply. If you're unsure or a claim isn't standard, say so and reason from first principles instead of guessing.
Socratic — adapts to where you're stuck.
Teach me **Cache Coherence (MESI) & False Sharing** interactively. Ask me ONE guiding question at a time, wait for my answer, and adapt to my confusion — build the idea with me step by step instead of explaining it all at once. If you're unsure or a claim isn't standard, say so and reason from first principles instead of guessing.
Active recall exposes what you missed.
Quiz me on **Cache Coherence (MESI) & False Sharing** with 5 questions, easy to tricky, ONE at a time. Tell me if each answer is right; at the end, explain clearly what I got wrong and why. If you're unsure or a claim isn't standard, say so and reason from first principles instead of guessing.
Intuition + hook + flashcards for long-term memory.
Help me remember **Cache Coherence (MESI) & False Sharing** for the long term: give the one-sentence intuition, a memorable hook/mnemonic, a tiny worked example, and 3 active-recall flashcards (Q -> A). If you're unsure or a claim isn't standard, say so and reason from first principles instead of guessing.