Memory Ordering, Barriers & Happens-Before

The order in which one CPU core's writes become visible to another core is not the order they appear in your source code — because both the compiler and the hardware reorder memory operations to hide latency, and each core commits its stores through a private store buffer before they reach coherent cache. A memory ordering model is the contract that tells you which reorderings are legal, and the fences/annotations you use to forbid the ones that would break your algorithm.

Why compilers and CPUs reorder at all

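Two independent layers reorder, and you must reason about both: the compiler, which may hoist, sink, merge, or eliminate loads and stores as long as single-threaded behavior is unchanged, and the CPU, which executes out of order and holds completed stores in a per-core store buffer before other cores can see them.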

Why it matters: the canonical failure is Dekker/Peterson-style mutual exclusion and lock-free flag signalling. Each thread writes its own flag then reads the other's; on the source page it looks impossible for both to read 0, yet both stores can still be buffered when both loads execute, so r1 == r2 == 0 is a real, reproducible outcome. Get the ordering wrong and your lock lets two threads into the critical section a few times per billion iterations — a bug that never shows up in a debugger.
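The flag exchange described above is usually called the store-buffering litmus test. A minimal C++ sketch follows, assuming a C++11 toolchain with std::thread; the function names both_read_zero_once and count_both_zero are hypothetical and chosen only for illustration. It uses memory_order_seq_cst on all four accesses, which is the ordering that forbids the r1 == r2 == 0 outcome.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Store-buffering litmus test: each thread sets its own flag, then reads the other's.
// Returns true if both threads read 0 in this single trial.
bool both_read_zero_once() {
    std::atomic<int> x{0}, y{0};
    int r1 = -1, r2 = -1;

    std::thread t1([&] {
        x.store(1, std::memory_order_seq_cst);   // write my flag
        r1 = y.load(std::memory_order_seq_cst);  // read the other flag
    });
    std::thread t2([&] {
        y.store(1, std::memory_order_seq_cst);
        r2 = x.load(std::memory_order_seq_cst);
    });
    t1.join();
    t2.join();   // after join, r1 and r2 are safe to read from this thread
    return r1 == 0 && r2 == 0;
}

// Counts how many of `trials` runs ended with r1 == r2 == 0.
int count_both_zero(int trials) {
    assert(trials > 0);
    int hits = 0;
    for (int i = 0; i < trials; ++i)
        if (both_read_zero_once()) ++hits;
    return hits;
}
```

With seq_cst the count must stay at zero. Swapping the four orderings for memory_order_relaxed (or release/acquire) would make r1 == r2 == 0 a permitted outcome; how often it actually appears depends on the hardware and on timing.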

Sequential consistency vs relaxed models

Sequential consistency (SC), Lamport's model, is the intuition every programmer starts with: the result is as if all threads' operations were interleaved into one global order, and each thread's ops appear in program order within it. It is the easiest to reason about and the most expensive to provide — enforcing it everywhere would mean a fence after essentially every shared store.

Real hardware and language memory models are relaxed: they guarantee SC only for data-race-free programs that use synchronization correctly (the DRF-SC theorem — if you have no data races, you get SC behavior). Otherwise you get the weaker native ordering. The spectrum, strongest to weakest:

| Model | Reorderings allowed | Where |
| --- | --- | --- |
| Sequential consistency | none | Java volatile for SC-DRF; explicit seq_cst |
| TSO (Total Store Order) | StoreLoad only | x86-64, SPARC-TSO |
| Weak / relaxed | StoreStore, LoadLoad, LoadStore, StoreLoad | ARMv8, POWER, RISC-V |

The practical consequence: the same lock-free code that is accidentally correct on your x86 laptop can fail on an ARM server, because ARM permits LoadLoad and StoreStore reordering that x86 does not. The memory model — not the CPU you happen to test on — is the contract.

Acquire / release: the cheap ordering that actually matters

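Full SC is overkill for the most common pattern, publishing data through a flag. What you actually need is one-directional ordering, and that is exactly acquire/release: a release store is a ceiling for everything before it (no earlier read or write may move past it), and an acquire load is a floor for everything after it (no later read or write may move ahead of it).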

Pair them and you get the guarantee: if the acquire-load sees the value written by the release-store, then everything the writer did before the release is visible to the reader after the acquire. Crucially, a release does not stop later stores from moving up into the critical region, and an acquire does not stop earlier loads from sinking in — which is why the pair is cheaper than SC (no StoreLoad fence needed on TSO), yet strong enough for locks, queues, and one-time publication.
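Since the paragraph above names locks as one use of the pair, here is a minimal sketch of a test-and-set spinlock built only from an acquire exchange and a release store. The SpinLock class and the locked_count helper are hypothetical names; the lock has no backoff or fairness and is meant to show the ordering, not to replace std::mutex.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Illustrative spinlock: acquire on entry, release on exit.
class SpinLock {
    std::atomic<bool> locked_{false};
public:
    void lock() {
        // Acquire: nothing inside the critical section may move above this point.
        while (locked_.exchange(true, std::memory_order_acquire)) {
            std::this_thread::yield();   // be polite while spinning
        }
    }
    void unlock() {
        // Release: nothing inside the critical section may move below this point.
        locked_.store(false, std::memory_order_release);
    }
};

// Two threads each add `per_thread` to a plain (non-atomic) counter under the lock.
long locked_count(int per_thread) {
    assert(per_thread >= 0);
    SpinLock lock;
    long counter = 0;                    // plain variable, protected only by the lock
    auto work = [&] {
        for (int i = 0; i < per_thread; ++i) {
            lock.lock();
            ++counter;                   // critical section
            lock.unlock();
        }
    };
    std::thread a(work), b(work);
    a.join();
    b.join();
    return counter;
}
```

Each unlock is a release that pairs with the next successful lock, so every increment is visible to the thread that enters next, and no increment is lost.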

Acquire/release vs. a plain mutex vs. a plain SC atomic — when relaxed pairing is unsafe

These three are not interchangeable; pick by how many variables must move together and whether readers ever need to agree on a single global order.

Rule of thumb: reach for acquire/release for one-shot or per-message publication (cheapest correct tool); reach for a mutex the moment more than one location must change atomically together; reach for SC only when the algorithm's correctness argument genuinely depends on a single total order across independently-published variables.
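To make the "more than one location" case in the rule of thumb concrete, here is a small sketch in which two balances must always sum to the same total. The Accounts struct and count_broken_totals are hypothetical; the point is only that a single flag with acquire/release cannot make two writes appear as one step, while a mutex can.

```cpp
#include <cassert>
#include <mutex>
#include <thread>

// Two balances whose sum must stay constant: a multi-location invariant.
struct Accounts {
    std::mutex m;
    long a = 1000;
    long b = 0;

    void transfer(long amount) {
        std::lock_guard<std::mutex> g(m);   // both writes become one atomic step
        a -= amount;
        b += amount;
    }
    long total() {
        std::lock_guard<std::mutex> g(m);   // readers must take the same lock
        return a + b;
    }
};

// One thread moves money while another keeps checking the invariant.
// Returns how many times the checker observed a total other than 1000.
int count_broken_totals(int transfers) {
    assert(transfers >= 0);
    Accounts acc;
    int broken = 0;
    std::thread mover([&] {
        for (int i = 0; i < transfers; ++i) acc.transfer(1);
    });
    std::thread checker([&] {
        for (int i = 0; i < transfers; ++i)
            if (acc.total() != 1000) ++broken;
    });
    mover.join();
    checker.join();
    return broken;
}
```

If a and b were instead two independent atomics updated one after the other, the checker could read between the two writes and see a total of 999, whatever memory order was used.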

Memory barriers / fences — the primitive underneath

Acquire/release and volatile are ultimately implemented by hardware instructions that forbid specific reorderings: either standalone fences or loads and stores that carry the ordering themselves. There are four fine-grained fence types, named by the pair they prevent from swapping:

| Fence | Prevents | Role |
| --- | --- | --- |
| LoadLoad | a later load passing an earlier load | part of acquire |
| StoreStore | a later store passing an earlier store | part of release |
| LoadStore | a later store passing an earlier load | acquire & release |
| StoreLoad | a later load passing an earlier store | the expensive one — full/SC |

An acquire is LoadLoad + LoadStore; a release is LoadStore + StoreStore. Only StoreLoad requires draining the store buffer, which is why it is the only fence x86 actually needs to emit (via MFENCE or any LOCK-prefixed instruction) and why SC costs more than acquire/release. On ARMv8 these map to DMB ISH variants; a release store is STLR, an acquire load is LDAR. Compiler-only reordering is separately fenced by a compiler barrier (e.g. std::atomic_signal_fence, or an asm volatile("" ::: "memory")), which emits no CPU instruction but stops the optimizer.
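In C++ the standalone form of these barriers is std::atomic_thread_fence. The sketch below, with the hypothetical name fence_message_passing, publishes a plain value through a relaxed flag and relies on a release fence in the writer and an acquire fence in the reader for the ordering. It is one way to express the pattern, assuming C++11, not the only one.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Message passing with relaxed atomics plus standalone fences.
int fence_message_passing() {
    int payload = 0;                       // plain data
    std::atomic<bool> ready{false};
    int seen = -1;

    std::thread writer([&] {
        payload = 42;
        // Release fence: earlier loads and stores may not move below it.
        std::atomic_thread_fence(std::memory_order_release);
        ready.store(true, std::memory_order_relaxed);
    });
    std::thread reader([&] {
        while (!ready.load(std::memory_order_relaxed)) {
            std::this_thread::yield();
        }
        // Acquire fence: later loads and stores may not move above it.
        std::atomic_thread_fence(std::memory_order_acquire);
        seen = payload;
    });
    writer.join();
    reader.join();
    assert(seen == 42);
    return seen;
}
```

The fences carry the ordering here, so the flag itself can stay relaxed. A compiler-only barrier such as std::atomic_signal_fence would not be enough for this cross-thread case, because it constrains the optimizer but not the CPU.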

Java: volatile and the happens-before model

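The Java Memory Model (JSR-133, 2004) defines correctness in terms of the happens-before partial order. A read is allowed to return a write's value if the write happens-before the read (and no intervening write clobbers it). If two accesses to the same location are not ordered by happens-before and at least one is a write, that is a data race: the behavior is still defined (Java keeps type safety and forbids out-of-thin-air values, unlike C++, where a data race is undefined behavior), but the read may return a stale or otherwise surprising value. The edges that establish happens-before include program order within a thread, a monitor unlock before every later lock of the same monitor, a volatile write before every later read of the same field, Thread.start() before every action of the started thread, every action of a thread before another thread's successful return from join() on it, and transitivity across all of these.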

The killer property is piggybacking: a volatile write publishes not just the volatile variable but everything the writer did before it. So a single volatile flag safely publishes a whole object graph. Below is a complete, compilable example:

class Publisher {
    private int payload;              // plain field
    private volatile boolean ready;   // the release/acquire gate

    void publish() {
        payload = 42;                 // (1) plain write
        ready = true;                 // (2) volatile write == release
    }

    int consume() {
        if (ready)                    // (3) volatile read == acquire
            return payload;           // (4) guaranteed to see 42
        return -1;
    }
}

Because (2) happens-before (3) via the volatile rule, and (1) is before (2) and (4) is after (3) in program order, transitivity gives (1) happens-before (4): the reader that sees ready==true is guaranteed to see payload==42. Drop volatile and there is no edge: the reader may see ready==true but a stale payload==0, and a caller that polls a non-volatile ready in a loop may spin forever because the JIT hoisted the plain read out of the loop into a register.

C++: std::memory_order

C++11 exposes the ordering knob directly on each atomic operation, so you pay for exactly the ordering you need:

| Value | Guarantee | Use for |
| --- | --- | --- |
| memory_order_relaxed | atomicity only; no ordering with other vars | counters, stats |
| memory_order_acquire | on a load: floor for later ops | consuming a flag / lock take |
| memory_order_release | on a store: ceiling for earlier ops | publishing / lock release |
| memory_order_acq_rel | both, for read-modify-write | fetch_add, CAS on a lock |
| memory_order_seq_cst | acq/rel + single global total order | default; Dekker-style algorithms |
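As an illustration of the memory_order_relaxed row, here is a minimal sketch of a statistics counter where only the final total matters. The function name relaxed_count and its parameters are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Event counter: increments must not be lost, but their order relative to
// other memory does not matter, so relaxed is enough.
long relaxed_count(int threads, int per_thread) {
    assert(threads > 0 && per_thread >= 0);
    std::atomic<long> hits{0};
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&hits, per_thread] {
            for (int i = 0; i < per_thread; ++i)
                hits.fetch_add(1, std::memory_order_relaxed);  // atomic, unordered
        });
    }
    for (auto& th : pool) th.join();   // join orders the final read after every increment
    return hits.load(std::memory_order_relaxed);
}
```

Relaxed is safe here because no other data is published through the counter. The moment a reader uses the count to decide that some other variable is ready, the increment needs release and the read needs acquire.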

The same publication pattern, spelled explicitly:

#include <atomic>

std::atomic<bool> ready{false};
int payload = 0;                     // plain

void publish() {
    payload = 42;
    ready.store(true, std::memory_order_release);   // ceiling
}

int consume() {
    while (!ready.load(std::memory_order_acquire)) {} // floor
    return payload;                  // sees 42
}

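Default operations use seq_cst, which is correct but makes every seq_cst store pay for a StoreLoad barrier you rarely need (on x86, an XCHG or a MOV followed by MFENCE). Dropping to acquire/release for message passing removes that barrier on x86, where acquire loads and release stores compile to plain MOVs. On ARMv8 the saving is smaller, because acquire and release still use LDAR and STLR; the gain there comes from allowing a release store and a later acquire load of a different location to reorder, which cores with LDAPR (ARMv8.3 and later) can exploit. (Avoid memory_order_consume: it was meant to be a cheaper acquire that follows only data-dependency chains, but mainstream compilers do not implement it as specified and simply promote it to acquire. The C++17 standard itself recommends memory_order_acquire instead.)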

The double-checked-locking bug

DCL tries to make lazy singleton init cheap: check the field without a lock, and only synchronize on the slow path when it's null. The naive version is broken, and it is broken by exactly the reordering we've been discussing:

// BROKEN
private static Singleton instance;   // NOT volatile
static Singleton get() {
    if (instance == null) {                 // 1st check (no lock)
        synchronized (Singleton.class) {
            if (instance == null)            // 2nd check
                instance = new Singleton();  // (!) not atomic
        }
    }
    return instance;
}

instance = new Singleton() is three steps: (a) allocate memory, (b) run the constructor, (c) publish the reference into instance. The JMM permits reordering to (a) → (c) → (b) — the field is published before the object is initialized. A second thread on the fast path sees instance != null, skips the lock, and returns a reference to a half-constructed object — reading default-zero fields. The bug is invisible on x86 in most runs and surfaces under load on ARM.

The fix is one keyword — declare the field volatile. The volatile write is a release: it forbids (c) from floating above (b), and the fast-path volatile read is an acquire that pairs with it, so a non-null reference is guaranteed fully constructed.

// CORRECT
private static volatile Singleton instance;   // the fix
static Singleton get() {
    Singleton r = instance;              // read volatile ONCE
    if (r == null) {
        synchronized (Singleton.class) {
            r = instance;
            if (r == null) instance = r = new Singleton();
        }
    }
    return r;
}

The local r is the standard micro-optimization: it reads the volatile field once instead of twice on the hot path. Even simpler and preferred in Java: use the initialization-on-demand holder idiom (a static nested class), which gets lazy, thread-safe init from the class-loading lock with no volatile and no synchronization on the read path.
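For comparison with the C++ section, the corrected double-checked pattern can be sketched with an atomic pointer, where the acquire load plays the role of the volatile read and the release store plays the role of the volatile write. Config, g_instance, g_init_mutex, get_config, and same_instance_from_two_threads are all hypothetical names, and the instance is intentionally never freed.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>

struct Config {                 // hypothetical payload type
    int value = 42;
};

std::atomic<Config*> g_instance{nullptr};
std::mutex g_init_mutex;

Config* get_config() {
    Config* r = g_instance.load(std::memory_order_acquire);    // fast path, read once
    if (r == nullptr) {
        std::lock_guard<std::mutex> g(g_init_mutex);
        r = g_instance.load(std::memory_order_relaxed);        // re-check under the lock
        if (r == nullptr) {
            r = new Config();                                   // construct fully first
            g_instance.store(r, std::memory_order_release);     // then publish
        }
    }
    return r;
}

// Calls get_config() from two threads and checks both got the same, initialized object.
bool same_instance_from_two_threads() {
    Config* p1 = nullptr;
    Config* p2 = nullptr;
    std::thread t1([&] { p1 = get_config(); });
    std::thread t2([&] { p2 = get_config(); });
    t1.join();
    t2.join();
    assert(p1 != nullptr && p2 != nullptr);
    return p1 == p2 && p1->value == 42;
}
```

The second load can be relaxed because the mutex already orders it against any earlier initializer. In C++11 and later, a function-local static or std::call_once gives the same lazy, thread-safe initialization with less code, much as the holder idiom does in Java.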
