Knowledge Guide
HomeConcurrencyLock-Free & Memory Ordering

Cache Coherence (MESI) & False Sharing

The mechanism

Every modern CPU core has its own private L1/L2 cache, so the same memory address can live in several caches at once — and hardware must guarantee that no two cores ever see different values for it. MESI is the classic protocol that enforces this: each cache line is tagged with one of four states (Modified, Exclusive, Shared, Invalid), and before any core writes a line, it forces every other copy of that line to Invalid. That single rule — "a write requires exclusive ownership" — is both what keeps caches consistent and what makes false sharing so expensive.

Caveat on mechanism: textbook MESI is usually explained as pure snooping — every core broadcasts its reads/writes on a shared bus and every other cache listens. That model is accurate for small, bus-connected systems and is the right mental model for learning the states. It is not how most modern multi-socket and many-core x86/ARM chips actually implement it at scale: those predominantly use directory-based or hybrid snoop-filter/probe-filter coherence, where a directory (or filter) tracks which cores hold a line so requests can be routed point-to-point instead of broadcast to everyone. The state machine below (M/E/S/I and the transitions) is still the right level to reason about correctness and false sharing — the broadcast-vs-directory distinction is a implementation detail of how the invalidation is delivered, not whether it happens.

Why it matters

Coherence is invisible when it works and catastrophic when you fight it. The unit of coherence is not a variable — it is a cache line, almost universally 64 bytes on x86-64 and ARM64. The hardware tracks ownership per line, not per byte. So two threads updating two completely independent variables that happen to sit within the same 64-byte line will ping the line back and forth between their cores' caches as if they were contending for one lock — except there is no lock, no correctness bug, and nothing in your source code that looks wrong. This is false sharing, and — as an illustrative, order-of-magnitude figure, not a guarantee — it routinely turns a naive counter into something ~5-10x slower than a version where each counter has its own cache line.

The four MESI states

Each line in each cache carries two state bits. The states form a strict hierarchy of ownership:

StateMeaningOther caches?Memory in sync?May write without a bus transaction?
ModifiedThis cache owns the only copy and it is dirtyNo (all Invalid)No — this cache is the truthYes
ExclusiveThis cache has the only copy and it is cleanNoYesYes — silently transitions to Modified
SharedClean copy, possibly present in other caches tooMaybeYesNo — must first invalidate the others
InvalidStale / empty — a read here is a missIrrelevantNo — must fetch first

The two states people confuse are E and S. Exclusive is the optimization that makes read-then-write cheap: if you load a line nobody else has, you get it in E, and a subsequent write flips E → M with no bus traffic at all. Shared means someone else might hold it, so a write must first send an invalidation and wait — that round trip is the cost of coherence.

A traced example: two cores sharing one line

Line L holds address A. Core 0 and Core 1 each have private caches. Watch the state of L in each cache as the program runs. "RFO" = Read-For-Ownership, the invalidating read a core issues before it may write.

#ActionBus transactionCore 0 stateCore 1 state
1C0 reads A (nobody has it)BusRd, no responderEI
2C0 writes Anone (silent E→M)MI
3C1 reads ABusRd; C0 flushes, downgradesSS
4C1 writes ABusRdX / InvalidateIM
5C0 writes ARFO; C1 flushes, invalidatesMI

Steps 4 and 5 are the killer. Each write is now a full coherence miss: the writing core must invalidate the other copy and pull the freshest data across the interconnect. On a modern server that is roughly ~40-80 ns if the line comes from another core on the same socket, and worse across sockets — versus ~1 ns for an L1 hit. If both cores loop writing to A, the line "ping-pongs" and every single write pays that penalty. This is exactly what false sharing does — except the two cores are writing to different addresses that merely live in the same line.

MESI vs. MOESI/MESIF: notice step 3 forces C0 to write the dirty line back to memory before C1 can read it in S — a real memory round trip on the critical path. AMD's MOESI adds an Owned state so the previous owner can hand the dirty data directly to the requester cache-to-cache and both end up sharing it, skipping the write-to-memory step. Intel's MESIF instead adds a Forward state: among several Shared copies, exactly one is designated the responder for future read requests, avoiding N-way duplicate responses. Both are strict superset optimizations of MESI aimed at cutting the same read-after-write-elsewhere latency this trace exposes — they change who answers a read miss and whether memory is touched, not the fundamental E/S/I contention story or the false-sharing failure mode, which is identical under all three protocols.

False sharing, concretely

Consider a striped counter: an array of per-thread longs where each thread updates only its own slot, no lock or CAS involved.

long[] counters = new long[N];   // 8 bytes each
// thread t does, in a tight loop:
counters[t]++;                    // only ever touches its own index

Logically there is zero sharing — thread 0 touches counters[0], thread 1 touches counters[1], and so on; this is plain unsynchronized disjoint memory access, not a lock-free algorithm (there's no shared mutable state or CAS to reason about — the point is precisely that no synchronization is needed). But a long is 8 bytes, so eight consecutive counters fit in one 64-byte line. Threads 0-7 are all hammering the same cache line. Every increment by thread 0 sends an RFO that invalidates the line in threads 1-7's caches; their next increment misses and drags the line back. The line ping-pongs on every write. Measured on a typical 8-core x86 box, the packed version runs on the order of ~5-10x slower than the padded one (same illustrative figure as above — exact ratio depends on core count, topology, and memory latency), and — the tell-tale sign — it gets slower as you add threads instead of faster.

The fix: padding and @Contended

The remedy is to guarantee each hot variable owns its own 64-byte line, so a write only ever transitions E→M locally and never invalidates a neighbour. Three levels:

1. Manual padding — pad the struct out to (at least) 64 bytes. In Java this is fragile because the JVM can reorder or eliminate unused fields, but the classic idiom is:

class PaddedLong {
    // 7 longs of padding on each side keeps 'value' alone on its line
    long p1, p2, p3, p4, p5, p6, p7;
    volatile long value;
    long q1, q2, q3, q4, q5, q6, q7;
}

2. @jdk.internal.vm.annotation.Contended (public @Contended pre-Java 9) — the JVM inserts padding for you, and unlike manual fields the compiler won't dead-strip it. Requires -XX:-RestrictContended to use outside the JDK. This is exactly how java.util.concurrent.atomic.LongAdder's internal Cell[] stripes avoid false sharing, and why LongAdder crushes a single contended AtomicLong under write-heavy load.

@jdk.internal.vm.annotation.Contended
static final class Cell { volatile long value; }   // JDK's own layout

3. In C/C++/Go — align to the line: alignas(64) in C++, _ Padding [64]byte or golang.org/x/sys/cpu.CacheLinePad in Go. Go's runtime uses exactly this for its per-P sharded structures.

Padding is one remedy among several — compare the costs

Padding/@Contended is the cheapest fix but not the only one, and it's the wrong tool when the real problem is genuine contention rather than accidental line-sharing. Three remedies, ranked by what they actually attack:

RemedyAttacksCostUse when
Padding / @ContendedAccidental co-location on one line~56-120 wasted bytes per field; no code restructuringIndependent per-thread/per-object values that merely alias a line (the striped-counter example above)
Thread-local accumulation, merge at the endPer-write coherence traffic entirely — no cross-thread write during the hot loop at allExtra memory per thread for the life of the loop; a merge/reduce step at completion; results are only as fresh as the last merge (not visible to other threads mid-flight)Batch or bounded-lifetime work (a request, a job, a benchmark loop) where nobody needs to observe partial totals from another thread
Striping / sharding (LongAdder-style), aligned per stripeTrue contention on one logical counter, updated continuously by many writers, with results needed liveRead path becomes O(#stripes) to sum; memory grows with contention (stripes can expand under load)Many concurrent writers, infrequent reads, value needed at any time (metrics, request counters)
Restructuring data layout (e.g. struct-of-arrays → array-of-structs, or vice versa; separating hot/cold fields)Unrelated hot and cold fields sharing a line, or a hot field sharing a line with a field owned by a different thread's objectRequires redesigning the type, can hurt locality for code paths that do want the fields togetherThe false sharing is structural — e.g. a per-connection object's hot counter sits next to another connection's hot counter in an array of structs

Decide like this: if the sharing is accidental and per-thread state is genuinely independent, pad it — it's a one-line fix. If many threads must continuously update the same logical value, don't pad a single field forever fighting contention — shard it and pay the O(#stripes) cost at read time (this is why LongAdder beats a padded single AtomicLong under heavy concurrent writes). If the workload is a bounded batch where no one needs to see live partial results, skip synchronization entirely and accumulate thread-locally, merging once at the end — that's cheaper than either padding or striping because it removes cross-thread writes altogether. Reach for layout restructuring only when the sharing is baked into how objects are arranged in memory, not into a single hot field.

How to see it — measurement, not guessing

False sharing is invisible in source and in most profilers that attribute time to functions. You diagnose it with hardware counters:

Pitfalls a working engineer hits

Takeaways

Recall

Two threads increment two different long fields in a loop and the program gets slower when pinned to two cores than to one. There is no shared variable and no lock. What is happening, which MESI transition dominates, and what's the one-line fix?

Answer: the two fields sit in the same 64-byte cache line (false sharing). Each store issues an RFO/invalidate, forcing the other core's copy to Invalid, so the line ping-pongs (S/M → I on every write) instead of staying in M. Fix: separate the fields onto distinct lines — @Contended or 64-byte alignment.


Sources: Ulrich Drepper, "What Every Programmer Should Know About Memory" (MESI, coherence traffic); Brendan Gregg, Systems Performance (PMC/perf methodology, c2c); Intel 64 & IA-32 Architectures Optimization Reference Manual (cache lines, adjacent-line prefetch, RFO); Culler & Singh, Parallel Computer Architecture (directory-based coherence, MOESI/MESIF); Doug Lea / OpenJDK LongAdder and @Contended source; Martin Thompson, "Mechanical Sympathy" (false-sharing benchmarks). Re-authored/Deepened for this guide.

🤖 Don't fully get this? Learn it with Claude

Stuck on Cache Coherence (MESI) & False Sharing? Open Claude, copy a block below, and it'll teach you this exact concept — visually and interactively.

🎨 Explain it visually

Build the mental picture, not memorization.

I just read a lesson on **Cache Coherence (MESI) & False Sharing** (Concurrency) and want to truly understand it. Explain Cache Coherence (MESI) & False Sharing from first principles using ONE vivid real-world analogy and a visual mental model — draw it as ASCII art or a clear step-by-step diagram — with a concrete example using real numbers. Then ask me one question to check I got the mental picture, and wait for my reply. If you're unsure or a claim isn't standard, say so and reason from first principles instead of guessing.
🤔 Walk me through it (interactive)

Socratic — adapts to where you're stuck.

Teach me **Cache Coherence (MESI) & False Sharing** interactively. Ask me ONE guiding question at a time, wait for my answer, and adapt to my confusion — build the idea with me step by step instead of explaining it all at once. If you're unsure or a claim isn't standard, say so and reason from first principles instead of guessing.
🧪 Quiz me & fix my gaps

Active recall exposes what you missed.

Quiz me on **Cache Coherence (MESI) & False Sharing** with 5 questions, easy to tricky, ONE at a time. Tell me if each answer is right; at the end, explain clearly what I got wrong and why. If you're unsure or a claim isn't standard, say so and reason from first principles instead of guessing.
🧠 Make it stick

Intuition + hook + flashcards for long-term memory.

Help me remember **Cache Coherence (MESI) & False Sharing** for the long term: give the one-sentence intuition, a memorable hook/mnemonic, a tiny worked example, and 3 active-recall flashcards (Q -> A). If you're unsure or a claim isn't standard, say so and reason from first principles instead of guessing.

📝 My notes