Memory Ordering, Barriers & Happens-Before

The order in which one CPU core's writes become visible to another core is not necessarily the order they appear in your source code: both the compiler and the hardware reorder memory operations to hide latency, and each core commits its stores through a private store buffer before they reach coherent cache. A memory ordering model is the contract that tells you which reorderings are legal, together with the fences and annotations you use to forbid the ones that would break your algorithm.

Why compilers and CPUs reorder at all

Two independent layers reorder, and you must reason about both:

Compiler. The optimizer hoists loads out of loops, sinks stores past branches, and keeps values in registers instead of re-reading memory. As far as a single thread's observable result goes (the as-if-serial rule), these are legal. Another thread never agreed to that rule.
CPU. A store first lands in the core's store buffer (a FIFO of pending writes), so the writing core can retire the instruction without waiting the tens to hundreds of cycles a cache-coherence transaction can take. Loads may be satisfied speculatively and out of order. On x86 (TSO, Total Store Order) the only reordering the hardware exposes is StoreLoad: a later load can pass an earlier store to a different address, because that store is still sitting in the buffer. On weaker ISAs (ARMv8, POWER, RISC-V) StoreStore, LoadLoad, and LoadStore reorder too.

Why it matters: the canonical failure is Dekker/Peterson-style mutual exclusion and lock-free flag signalling. Each thread writes its own flag then reads the other's; on the source page it looks impossible for both to read 0, yet both stores can still be buffered when both loads execute, so r1 == r2 == 0 is a real, reproducible outcome. Get the ordering wrong and your lock lets two threads into the critical section a few times per billion iterations — a bug that never shows up in a debugger.
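The store-then-load race described above can be probed with a small harness. This is a minimal sketch, not a rigorous litmus tool: the class name StoreBufferLitmus and its method are invented for this example, and because every run spawns fresh threads the race window is narrow, so the plain-field variant may report zero on a given machine even though the outcome is permitted. The volatile variant serves as the sequentially consistent baseline, for which the Java Memory Model forbids both loads returning 0.

```java
final class StoreBufferLitmus {
    // Plain fields: the JIT and the CPU may let each load pass the earlier store.
    static int plainX, plainY;
    // Volatile fields: accesses are totally ordered, so (0, 0) is forbidden.
    static volatile int volX, volY;

    /** Runs the store-then-load race and counts runs where both loads saw 0. */
    static int countBothZero(boolean useVolatile, int iterations) {
        int bothZero = 0;
        for (int i = 0; i < iterations; i++) {
            plainX = 0; plainY = 0; volX = 0; volY = 0;
            int[] r = new int[2]; // r[0] is thread A's load, r[1] is thread B's
            Thread a = new Thread(() -> {
                if (useVolatile) { volX = 1; r[0] = volY; }
                else { plainX = 1; r[0] = plainY; }
            });
            Thread b = new Thread(() -> {
                if (useVolatile) { volY = 1; r[1] = volX; }
                else { plainY = 1; r[1] = plainX; }
            });
            a.start();
            b.start();
            try {
                a.join(); // join() makes the writes to r[] visible to this thread
                b.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
            if (r[0] == 0 && r[1] == 0) bothZero++;
        }
        return bothZero;
    }
}
```

Calling countBothZero(false, n) counts how often the "impossible" result appeared with plain fields; countBothZero(true, n) should always return 0.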

Sequential consistency vs relaxed models

Sequential consistency (SC), Lamport's model, is the intuition every programmer starts with: the result is as if all threads' operations were interleaved into one global order, and each thread's ops appear in program order within it. It is the easiest to reason about and the most expensive to provide — enforcing it everywhere would mean a fence after essentially every shared store.

Real hardware and language memory models are relaxed: they guarantee SC only for data-race-free programs that use synchronization correctly (the DRF-SC theorem — if you have no data races, you get SC behavior). Otherwise you get the weaker native ordering. The spectrum, strongest to weakest:

Model	Reorderings allowed	Where
Sequential consistency	none	Java `volatile` accesses; C++ `seq_cst` atomics; any data-race-free program (DRF-SC)
TSO (Total Store Order)	StoreLoad only	x86-64, SPARC-TSO
Weak / relaxed	StoreStore, LoadLoad, LoadStore, StoreLoad	ARMv8, POWER, RISC-V

The practical consequence: the same lock-free code that is accidentally correct on your x86 laptop can fail on an ARM server, because ARM permits LoadLoad and StoreStore reordering that x86 does not. The memory model — not the CPU you happen to test on — is the contract.

Acquire / release: the cheap ordering that actually matters

Full SC is overkill for the most common pattern — publishing data through a flag. What you actually need is one-directional ordering, and that is exactly acquire/release:

Release store (the publish): every memory operation before it in program order is guaranteed to be visible to any thread that later acquire-reads the same variable. It acts as a one-way ceiling — nothing above sinks below it.
Acquire load (the consume): every memory operation after it stays after. A one-way floor — nothing below hoists above it.

Pair them and you get the guarantee: if the acquire-load sees the value written by the release-store, then everything the writer did before the release is visible to the reader after the acquire. Crucially, a release does not stop later stores from moving up into the critical region, and an acquire does not stop earlier loads from sinking in — which is why the pair is cheaper than SC (no StoreLoad fence needed on TSO), yet strong enough for locks, queues, and one-time publication.
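In Java 9 and later, the pairing can be written explicitly with the setRelease and getAcquire methods on the java.util.concurrent.atomic classes. The following is a minimal sketch under that assumption; ReleaseAcquireBox, publish, awaitPayload, and demo are hypothetical names chosen for illustration.

```java
import java.util.concurrent.atomic.AtomicBoolean;

final class ReleaseAcquireBox {
    private int payload;                                        // plain field
    private final AtomicBoolean ready = new AtomicBoolean(false);

    void publish(int value) {
        payload = value;          // ordinary write, ordered before the release
        ready.setRelease(true);   // release store: earlier writes cannot sink below it
    }

    int awaitPayload() {
        while (!ready.getAcquire()) { // acquire load: later reads cannot hoist above it
            Thread.onSpinWait();
        }
        return payload;           // sees the value written before the release
    }

    /** Publishes 42 from the calling thread and returns what a reader thread saw. */
    static int demo() {
        ReleaseAcquireBox box = new ReleaseAcquireBox();
        int[] seen = new int[1];
        Thread reader = new Thread(() -> seen[0] = box.awaitPayload());
        reader.start();
        box.publish(42);
        try {
            reader.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return seen[0];
    }
}
```

The payload field itself needs no annotation: the release/acquire edge on ready is what carries it across.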

Acquire/release vs. a plain mutex vs. a plain SC atomic — when relaxed pairing is unsafe

These three are not interchangeable; pick by how many variables must move together and whether readers ever need to agree on a single global order:

Acquire/release (a single flag/pointer publish). Correct when exactly one release-acquire edge needs to carry a batch of prior writes to one consumer path — the classic "build an object, then flip a ready flag" or "push a node, then update a lock-free head pointer." It says nothing about ordering relative to a second, unrelated acquire/release pair elsewhere in the program — those two edges can be observed in different orders by different threads, because acquire/release is only a partial order, not a global one.
Full mutex (lock/unlock). Needed the moment you have a multi-variable invariant that must be read and updated as one unit — e.g. "balance and pendingCount must always change together," or any code path with more than one write that must not be observed torn or interleaved with another thread's writes to the same set. A mutex additionally serializes all critical sections against each other (mutual exclusion), which acquire/release alone does not give you: two threads can both successfully "acquire-read" a release-published value at the same time — release/acquire has no exclusion, only ordering.
Plain SC atomic (seq_cst / Java volatile). Needed when multiple threads must agree on one interleaving of several independent atomics — Dekker/Peterson-style algorithms, or any protocol where thread A reasons about the relative order of thread B's write to X and thread C's write to Y. Acquire/release cannot give this because it only orders the two ends of one edge, not all edges against each other.

Rule of thumb: reach for acquire/release for one-shot or per-message publication (cheapest correct tool); reach for a mutex the moment more than one location must change atomically together; reach for SC only when the algorithm's correctness argument genuinely depends on a single total order across independently-published variables.
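To make the mutex case concrete, here is a minimal sketch of the "balance and pendingCount must always change together" invariant guarded by a Java monitor. The class GuardedAccount, the fixed deposit of 10, and the thread counts are all assumptions made for the example.

```java
final class GuardedAccount {
    private long balance;
    private int pendingCount;

    // Both fields change inside one critical section, so no reader sees one without the other.
    synchronized void deposit(long amount) {
        balance += amount;
        pendingCount++;
    }

    // Reads both fields under the same lock and returns them as a pair.
    synchronized long[] snapshot() {
        return new long[] { balance, pendingCount };
    }

    /** Returns how many snapshots violated balance == 10 * pendingCount. */
    static int demo() {
        GuardedAccount acct = new GuardedAccount();
        Thread[] writers = new Thread[4];
        for (int i = 0; i < writers.length; i++) {
            writers[i] = new Thread(() -> {
                for (int k = 0; k < 1000; k++) acct.deposit(10);
            });
            writers[i].start();
        }
        int violations = 0;
        for (int k = 0; k < 1000; k++) {      // read concurrently with the writers
            long[] s = acct.snapshot();
            if (s[0] != 10 * s[1]) violations++;
        }
        try {
            for (Thread t : writers) t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        long[] end = acct.snapshot();
        if (end[0] != 40_000 || end[1] != 4_000) violations++;
        return violations;
    }
}
```

Replacing synchronized with two independent release stores would order each field individually but would still let a reader observe a new balance next to an old pendingCount.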

Memory barriers / fences — the primitive underneath

Acquire/release and volatile are ultimately implemented either with fence instructions or with loads and stores that carry their own ordering, and in both cases the effect is to forbid specific reorderings. There are four fine-grained fence types, named by the pair they prevent from swapping:

Fence	Prevents	Role
`LoadLoad`	a later load passing an earlier load	part of acquire
`StoreStore`	a later store passing an earlier store	part of release
`LoadStore`	a later store passing an earlier load	acquire & release
`StoreLoad`	a later load passing an earlier store	the expensive one — full/SC

An acquire is LoadLoad + LoadStore; a release is LoadStore + StoreStore. Only StoreLoad requires draining the store buffer, which is why it is the only fence x86 actually needs to emit (via MFENCE or any LOCK-prefixed instruction) and why SC costs more than acquire/release. On ARMv8 these map to DMB ISH variants; a release store is STLR, an acquire load is LDAR. Compiler-only reordering is separately fenced by a compiler barrier (e.g. std::atomic_signal_fence, or an asm volatile("" ::: "memory")), which emits no CPU instruction but stops the optimizer.
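Java exposes standalone fences as static methods on java.lang.invoke.VarHandle (Java 9+): releaseFence() is documented as LoadStore + StoreStore, acquireFence() as LoadLoad + LoadStore, and fullFence() adds StoreLoad. The sketch below rebuilds a flag publish from those pieces. It rests on an assumption worth stating: that a release fence followed by an opaque store, and an opaque load followed by an acquire fence, behave like a release store and an acquire load. FencedPublisher and its methods are hypothetical names.

```java
import java.lang.invoke.VarHandle;
import java.util.concurrent.atomic.AtomicBoolean;

final class FencedPublisher {
    private int payload;                                     // plain field
    private final AtomicBoolean ready = new AtomicBoolean(); // only accessed in opaque mode

    void publish(int value) {
        payload = value;
        VarHandle.releaseFence();   // LoadStore + StoreStore: earlier accesses stay above
        ready.setOpaque(true);      // the flag store itself carries no ordering
    }

    /** Returns the payload if it has been published, otherwise -1. */
    int tryConsume() {
        if (!ready.getOpaque()) return -1;
        VarHandle.acquireFence();   // LoadLoad + LoadStore: later accesses stay below
        return payload;
    }

    /** Publishes 42 and returns the value a polling reader thread observed. */
    static int demo() {
        FencedPublisher p = new FencedPublisher();
        int[] seen = new int[1];
        Thread reader = new Thread(() -> {
            int v;
            while ((v = p.tryConsume()) == -1) Thread.onSpinWait();
            seen[0] = v;
        });
        reader.start();
        p.publish(42);
        try {
            reader.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return seen[0];
    }
}
```

In application code the setRelease/getAcquire accessors are usually the clearer choice; explicit fences are shown here only to connect the table to something runnable.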

Java: `volatile` and the happens-before model

The Java Memory Model (JSR-133, 2004) defines correctness in terms of the happens-before partial order. A read is allowed to return a write's value if the write happens-before the read (and no intervening write clobbers it). If two accesses to the same location are not ordered by happens-before and at least one is a write, that is a data race. Unlike C++, where a data race is undefined behavior, Java still bounds the outcome (no out-of-thin-air values), but the racy read may return a stale value, and a non-volatile long or double may even be read torn. The edges that establish happens-before:

Program order within a single thread.
Monitor lock: unlock of a monitor happens-before every subsequent lock of the same monitor.
volatile: a write to a volatile field happens-before every subsequent read of that same field. The JLS itself states this purely in happens-before terms, without naming hardware fences. The common implementation mapping (laid out in Doug Lea's JSR-133 cookbook, not the JLS text itself) treats a volatile write as a release, with StoreStore and LoadStore barriers before the store, and a volatile read as an acquire, with LoadLoad and LoadStore barriers after the load; it additionally inserts a StoreLoad fence after volatile stores. That extra StoreLoad is what gives volatile-to-volatile accesses a total, SC-like order among themselves on top of happens-before; it is a widely used compiler/JIT convention for satisfying the spec, not a sentence you will find verbatim in the JLS.
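Thread start/join: a call to Thread.start() happens-before every action in the started thread, and every action in a thread happens-before another thread's return from join() on it. Final fields get a related but separate guarantee: they are frozen at constructor exit, so any thread that sees the object reference also sees the final fields' initialized values, provided this did not escape during construction.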
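The start and join edges are the easiest ones to see in code. A minimal sketch follows, with StartJoinEdges and demo as hypothetical names: plain array slots are handed to a worker and read back without any volatile or lock, relying only on those two edges.

```java
final class StartJoinEdges {
    /** Passes data into and out of a worker thread using only the start/join edges. */
    static int demo() {
        int[] cell = new int[2];
        cell[0] = 20;                     // written before start(): visible to the worker
        Thread worker = new Thread(() -> cell[1] = cell[0] + 22);
        worker.start();                   // start() happens-before the worker's actions
        try {
            worker.join();                // the worker's actions happen-before join() returning
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return cell[1];                   // safe plain read after join()
    }
}
```

Reading cell[1] before join() returns would be a data race, because nothing would order the worker's write before the read.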

The killer property is piggybacking: a volatile write publishes not just the volatile variable but everything the writer did before it. So a single volatile flag safely publishes a whole object graph. Below is a complete, compilable example:

class Publisher {
    private int payload;              // plain field
    private volatile boolean ready;   // the release/acquire gate

    void publish() {
        payload = 42;                 // (1) plain write
        ready = true;                 // (2) volatile write == release
    }

    int consume() {
        if (ready)                    // (3) volatile read == acquire
            return payload;           // (4) guaranteed to see 42
        return -1;
    }
}

Because (2) happens-before (3) via the volatile rule whenever (3) reads the value written by (2), and (1) is before (2) and (4) is after (3) in program order, transitivity gives (1) happens-before (4): the reader that sees ready==true is guaranteed to see payload==42. Drop volatile and there is no edge: the reader may see ready==true but a stale payload==0, and a reader that polls ready in a loop may spin forever because the JIT hoisted the plain read out of the loop into a register.
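The polling case deserves its own illustration, since it is where a missing volatile most often shows up in practice. The sketch below uses hypothetical names (StopFlagWorker, stopsAfterSignal) and a 20 ms delay picked arbitrarily. With the volatile modifier the worker must re-read the flag on every iteration; removing it would permit the JIT to hoist the read, although whether a given JVM actually does so is not guaranteed either way.

```java
final class StopFlagWorker {
    // volatile forces a fresh read on each loop iteration and forbids hoisting.
    private volatile boolean running = true;

    /** Starts a spinning worker, clears the flag, and reports whether the worker exited. */
    boolean stopsAfterSignal() {
        Thread worker = new Thread(() -> {
            while (running) {          // volatile read on every pass
                Thread.onSpinWait();
            }
        });
        worker.setDaemon(true);        // never keeps the JVM alive if something goes wrong
        worker.start();
        try {
            Thread.sleep(20);          // let the loop run for a while
            running = false;           // volatile write: must become visible to the worker
            worker.join(5_000);        // generous timeout instead of waiting forever
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return !worker.isAlive();
    }
}
```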

C++: `std::memory_order`

C++11 exposes the ordering knob directly on each atomic operation, so you pay for exactly the ordering you need:

Value	Guarantee	Use for
`memory_order_relaxed`	atomicity only; no ordering with other vars	counters, stats
`memory_order_acquire`	on a load: floor for later ops	consuming a flag/lock take
`memory_order_release`	on a store: ceiling for earlier ops	publishing / lock release
`memory_order_acq_rel`	both, for read-modify-write	`fetch_add`, CAS on a lock
`memory_order_seq_cst`	acq/rel + single global total order	default; Dekker-style algorithms

The same publication pattern, spelled explicitly:

#include <atomic>

std::atomic<bool> ready{false};
int payload = 0;                     // plain

void publish() {
    payload = 42;
    ready.store(true, std::memory_order_release);   // ceiling
}

int consume() {
    while (!ready.load(std::memory_order_acquire)) {} // floor
    return payload;                  // sees 42
}

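Default operations use seq_cst, which is correct but makes every x86 store pay for a StoreLoad barrier (an XCHG, or a MOV followed by MFENCE) that you rarely need. Dropping to acquire/release for message passing turns those stores into plain MOVs on x86. On AArch64 the saving is smaller: seq_cst and acquire/release loads and stores both compile to LDAR/STLR, although cores with the ARMv8.3 RCpc extension can use the weaker LDAPR for acquire loads. (Avoid memory_order_consume: it was meant to be a cheaper acquire that orders only data-dependent operations, but mainstream compilers simply promote it to acquire, and the C++17 standard itself discourages its use. Treat it as deprecated.)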

The double-checked-locking bug

DCL tries to make lazy singleton init cheap: check the field without a lock, and only synchronize on the slow path when it's null. The naive version is broken, and it is broken by exactly the reordering we've been discussing:

// BROKEN
private static Singleton instance;   // NOT volatile
static Singleton get() {
    if (instance == null) {                 // 1st check (no lock)
        synchronized (Singleton.class) {
            if (instance == null)            // 2nd check
                instance = new Singleton();  // (!) not atomic
        }
    }
    return instance;
}

instance = new Singleton() is three steps: (a) allocate memory, (b) run the constructor, (c) publish the reference into instance. The JMM permits reordering to (a) → (c) → (b), so the field can be published before the object is initialized. A second thread on the fast path sees instance != null, skips the lock, and returns a reference to a half-constructed object whose fields still hold their default zero values. The bug is rare on x86, where the hardware keeps stores in order and only the JIT can reorder them, and is more likely to surface on weakly ordered hardware such as ARM.

The fix is one keyword — declare the field volatile. The volatile write is a release: it forbids (c) from floating above (b), and the fast-path volatile read is an acquire that pairs with it, so a non-null reference is guaranteed fully constructed.

// CORRECT
private static volatile Singleton instance;   // the fix
static Singleton get() {
    Singleton r = instance;              // read volatile ONCE
    if (r == null) {
        synchronized (Singleton.class) {
            r = instance;
            if (r == null) instance = r = new Singleton();
        }
    }
    return r;
}

The local r is the standard micro-optimization: it reads the volatile field once instead of twice on the hot path. Even simpler and preferred in Java: use the initialization-on-demand holder idiom (a static nested class), which gets lazy, thread-safe init from the class-loading lock with no volatile and no synchronization on the read path.
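A minimal sketch of the holder idiom follows. HolderSingleton and the CONSTRUCTIONS counter are hypothetical and exist only so the example can show that the constructor runs once; a real singleton would not need the counter.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class HolderSingleton {
    // Illustration only: counts how many times the constructor has run.
    static final AtomicInteger CONSTRUCTIONS = new AtomicInteger();

    private HolderSingleton() {
        CONSTRUCTIONS.incrementAndGet();
    }

    // The nested class is not initialized until get() first touches it.
    // Class initialization is serialized by the JVM, so INSTANCE is built exactly once.
    private static final class Holder {
        static final HolderSingleton INSTANCE = new HolderSingleton();
    }

    static HolderSingleton get() {
        return Holder.INSTANCE;   // no volatile read and no lock on the fast path
    }
}
```

Every caller of HolderSingleton.get() receives the same fully constructed instance, because the write to INSTANCE completes inside class initialization before any thread can read it.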
