Home › Concurrency › Lock-Free & Memory Ordering

Lock-Free Stack & Queue (Treiber, Michael-Scott)

The core trick: a single atomic compare-and-swap replaces the lock

A lock-free data structure works because the hardware exposes one instruction — compareAndSet(expected, new), backed by x86 CMPXCHG (with a LOCK prefix), which atomically checks a memory word still holds the value you last read and, only if so, swaps in a new value, all as one indivisible bus transaction. ARM/POWER don't have a single CAS instruction; they build the same semantics from LDXR/STXR (load-linked / store-conditional): LDXR marks the cache line as watched, and STXR commits only if nothing touched that line since — otherwise it reports failure and software must retry the load-modify-store loop itself. LL/SC is strictly weaker than CAS: it can fail spuriously (any intervening event on the line, not just a value mismatch, can abort the store), so a correct CAS on ARM is really "LL/SC wrapped in a retry loop," never a single instruction.

If you can express a whole update as "flip one pointer from old to new," then instead of taking a mutex you optimistically build the new state, attempt the CAS, and retry the entire operation if someone beat you to the word. No thread ever holds a lock, so no thread can block another by being descheduled mid-update.

This matters because a mutex has a fatal failure mode under contention: the OS can preempt the thread holding the lock (page fault, quantum expiry, priority inversion), and now every other thread is stuck behind a sleeper — throughput collapses even though there is CPU to spare. A lock-free structure gives a stronger guarantee: at least one thread always makes progress system-wide, regardless of how the scheduler interleaves or stalls the others. That is the whole point — progress is a property of the algorithm, not of the scheduler's goodwill.

Treiber stack — CAS on a single `head` pointer

Treiber's stack (1986) is the canonical lock-free structure because a stack's entire mutable state is one pointer. Push means "make my node the new head, pointing at the old head"; pop means "make head's successor the new head." Each is a single-word swap, so a single CAS suffices.

import java.util.concurrent.atomic.AtomicReference;

public final class TreiberStack<E> {
    private static final class Node<E> {
        final E item;
        Node<E> next;
        Node(E item) { this.item = item; }
    }

    private final AtomicReference<Node<E>> head = new AtomicReference<>();

    public void push(E item) {
        Node<E> n = new Node<>(item);
        Node<E> old;
        do {
            old = head.get();       // read current top
            n.next = old;           // point new node at it
        } while (!head.compareAndSet(old, n));  // publish, or retry
    }

    public E pop() {
        Node<E> old, next;
        do {
            old = head.get();
            if (old == null) return null;   // empty
            next = old.next;
        } while (!head.compareAndSet(old, next));
        return old.item;
    }
}

Traced push race. Stack holds head → n20 → n17 → null. Thread T1 pushes 42, T2 pushes 99, and both are scheduled at the same instant:

t	T1 (push 42)	T2 (push 99)	head after
0	old = n20; n42.next = n20	old = n20; n99.next = n20	n20
1	`CAS(n20, n42)` → true	—	n42
2	—	`CAS(n20, n99)` → false (head is n42, not n20)	n42
3	returns	retry: old = n42; n99.next = n42	n42
4	—	`CAS(n42, n99)` → true	n99

No update was lost: the final stack is head → n99 → n42 → n20 → n17. T2 did wasted work (one failed CAS + a re-read) but was never blocked — that wasted retry is exactly the cost you trade a lock's blocking for.

Michael-Scott queue — two pointers, a dummy node, and "helping"

A FIFO queue is harder than a stack: enqueue mutates the tail, dequeue mutates the head, and a single CAS cannot atomically update two words. Michael & Scott's insight (1996, the algorithm behind Java's ConcurrentLinkedQueue) is to split enqueue into two independent single-word CAS steps and make every thread help finish a half-done operation it stumbles on. A permanent dummy node also keeps head and tail from ever being null, killing a swarm of empty/one-element edge cases.

Link: CAS the last node's next from null to the new node. This is the linearization point — the node is now in the queue.
Swing tail: CAS tail to point at the new node. This is advisory: if it fails or the thread stalls before doing it, tail lags one node behind — but the queue is still correct.

Because tail can lag, any thread that finds tail.next != null knows an enqueue is mid-flight and performs the swing itself (CAS(tail, t, next)) before proceeding. That is helping: it is what upgrades the queue from merely lock-free to robust — a thread that dies between step 1 and step 2 cannot wedge the structure, because the next thread through repairs it.

import java.util.concurrent.atomic.AtomicReference;

public final class MSQueue<E> {
    private static final class Node<E> {
        final E item;
        final AtomicReference<Node<E>> next = new AtomicReference<>();
        Node(E item) { this.item = item; }
    }
    private final AtomicReference<Node<E>> head, tail;

    public MSQueue() {
        Node<E> dummy = new Node<>(null);
        head = new AtomicReference<>(dummy);
        tail = new AtomicReference<>(dummy);
    }

    public void enqueue(E item) {
        Node<E> n = new Node<>(item);
        while (true) {
            Node<E> t = tail.get();
            Node<E> next = t.next.get();
            if (t == tail.get()) {                    // snapshot still valid?
                if (next == null) {                   // t is really last
                    if (t.next.compareAndSet(null, n)) {  // (1) LINK — linearizes
                        tail.compareAndSet(t, n);         // (2) swing (advisory)
                        return;
                    }
                } else {
                    tail.compareAndSet(t, next);      // HELP a lagging tail
                }
            }
        }
    }

    public E dequeue() {
        while (true) {
            Node<E> h = head.get();
            Node<E> t = tail.get();
            Node<E> next = h.next.get();
            if (h == head.get()) {
                if (h == t) {                         // queue empty or tail lags
                    if (next == null) return null;    // truly empty
                    tail.compareAndSet(t, next);      // HELP, then loop
                } else {
                    E item = next.item;               // read BEFORE the CAS
                    if (head.compareAndSet(h, next))  // linearizes dequeue
                        return item;
                }
            }
        }
    }
}

Linearizability: why these are correct, not just fast

The correctness bar for a concurrent object is linearizability: every operation must appear to take effect instantaneously at some single point between its call and its return, and that ordering must be consistent with a legal sequential history. For these structures the linearization point is exactly the successful CAS:

Treiber push/pop: the moment head.compareAndSet returns true. Before it, the operation is invisible; after it, it is fully committed and the node is present or gone. Failed CAS attempts are non-events.
MS enqueue: the successful t.next.compareAndSet(null, n) (step 1) — not the tail swing. The item is observable in the queue the instant its predecessor's next points to it, even while tail still lags.
MS dequeue: the successful head.compareAndSet(h, next). Note the code reads next.item before that CAS — read after, and a concurrent dequeuer could recycle the node and hand you garbage.

Because each operation has exactly one atomic commit point, concurrent executions are always equivalent to some sequential order — so callers can reason about the queue as if it were single-threaded, which is the entire contract they need.

Pitfalls a working engineer actually hits

1. The ABA problem. CAS checks the pointer value, not "was this word untouched." Suppose pop reads head=A, A.next=B, then stalls. Meanwhile another thread pops A, pops B, and pushes A back (a freed node reused at the same address). Head is A again — the stalled thread's CAS(A, B) succeeds because the bit pattern matches, but B may have been freed or mutated, and the stack's real structure is no longer what the stalled thread assumed. This is the number-one lock-free bug in C/C++, and Treiber's stack plus the Michael-Scott queue are the textbook examples used to teach it, because both hinge on "does this pointer still equal what I read" as their entire correctness argument.

Java's AtomicReference/CAS does NOT solve ABA by itself — nothing about compare-and-set detects that a value went A → B → A. Java gets away with it only as a side effect of garbage collection: the GC will not recycle A's memory while any thread still holds a reference to it, and every fresh push allocates a brand-new Node object, so the "same address, different meaning" scenario cannot arise for live nodes. Take away the GC and the same Java code is exposed to ABA exactly like C++.

Without a GC you must defeat ABA explicitly: a tagged/versioned pointer (a monotonically-incrementing counter packed alongside the pointer, updated atomically in the same double-width CAS — AtomicStampedReference in Java, CMPXCHG16B in C++), or a safe-reclamation scheme that prevents the address from being reused at all while it might still be observed.

2. Memory reclamation is the real hard problem. In a non-GC language, when can you free a popped or dequeued node? Never, if another thread might still be dereferencing it — and in the Michael-Scott queue specifically, a thread can be holding a reference to a node it read as head.next right as another thread dequeues and frees that same node out from under it. The industrial answers — hazard pointers (Michael 2004: each thread publishes the pointers it's currently dereferencing, and a reclaimer scans that list before freeing anything) and epoch-based reclamation (RCU-style: defer frees until every thread has passed a grace-period checkpoint) — are more code and more subtlety than the stack/queue itself. "Lock-free" in a managed runtime hides this cost entirely; in C/C++/Rust it dominates the design and is usually the harder half of implementing either structure correctly.

3. Livelock / retry storms under high contention. Lock-freedom guarantees someone progresses, not that everyone does, and not fairly. On a hot stack under dozens of cores, CAS failure rates spike, threads burn CPU on doomed retries, and the contended cache line ping-pongs between cores' L1 caches (each CAS forces an exclusive-state cache-line transfer). Throughput can end up worse than a good lock. Mitigations: exponential backoff on failure, or an elimination array (a push and a concurrent pop cancel out without touching head).

4. Missing memory ordering. You get correctness for free in Java only because AtomicReference operations carry the right happens-before edges (a successful CAS has volatile-write semantics). Roll your own with plain fields and the reader sees a stale or half-constructed node. In C++ you must choose the correct memory_order on every atomic — relaxed here is a silent data race.

Trade-offs: when lock-free beats a lock — and when it doesn't

The honest comparison is against a mutex-guarded structure (e.g. ArrayDeque behind a ReentrantLock or a synchronized block, or Java's LinkedBlockingQueue which uses two locks + condition variables).

Dimension	Lock-free (Treiber / MS)	Lock-based (ReentrantLock / synchronized / BlockingQueue)
Progress under preemption	Guaranteed — a stalled thread blocks no one	A preempted lock-holder stalls everyone waiting on it
Low / no contention	Fast (1 uncontended CAS)	Fast (an uncontended lock is cheap too, especially biased/thin locks)
High contention	Can degrade — retry storms, cache-line ping-pong	Degrades more gracefully; OS parks losers (no CPU burn)
Blocking semantics ("wait until non-empty")	Not natural — must spin or bolt on a semaphore	Built in via condition variables / `take()`
Reclamation (non-GC languages)	Hard — hazard pointers / epoch-based reclamation	Trivial — free inside the critical section, nobody else can be looking
Implementation & audit cost	High — ABA, memory ordering, pointer-level reasoning	Low — the lock linearizes everything for you
vs. wait-free algorithms	Only guarantees some thread progresses; a single thread can in theory retry forever under adversarial scheduling	N/A — locks make no forward-progress guarantee at all under preemption

Reach for lock-free when: the critical section is a single-pointer flip, you need hard progress guarantees (real-time code, an interrupt/signal handler, or code that must keep running while another thread might be paused mid-critical-section), contention is low-to-moderate, or you are in a GC language that erases the reclamation tax. Prefer a lock when: the operation touches multiple words or has no single natural CAS commit point, you need to block until a condition holds, contention is brutal (a good lock parks losers instead of spinning), or you are in a non-GC language where a correct lock is far cheaper to get right than correct reclamation. If you need a stronger guarantee than lock-free — that every thread finishes in a bounded number of steps regardless of contention — that's wait-free territory (e.g. wait-free queues via Kogan & Petrank's fast-path-slow-path technique); it costs more code and more per-op overhead than either a lock or a plain lock-free structure, and is rarely worth it outside hard real-time systems. As a default in Java, reach for java.util.concurrent first — ConcurrentLinkedQueue is the MS queue, and LinkedBlockingQueue is the lock-based alternative — before hand-rolling either.

Takeaways

Lock-free = express the update as one atomic CAS + retry. Progress becomes an algorithmic guarantee independent of the scheduler, not a promise the OS makes.
Treiber stack works because a stack is one pointer; MS queue needs two pointers and splits enqueue into link-then-swing so each step is still a single-word CAS — and any thread that finds a lagging tail helps finish it rather than blocking on it.
The linearization point is always the successful CAS. Everything before it is invisible, everything after it is fully committed — that single atomic instant is what lets callers reason about the structure as if it ran single-threaded.
ABA is the defining lock-free bug, and Java's CAS does not fix it on its own. compareAndSet only compares bit patterns; it cannot tell A → B → A from "untouched." Java is safe only because the GC keeps a live node's address from being reused — take away the GC (as in C/C++/Rust) and you need tagged/versioned pointers, hazard pointers, or epoch-based reclamation to get the same safety.
Memory reclamation, not the CAS loop itself, is usually the hard part in a non-GC language — freeing a node that another thread might still be dereferencing requires hazard pointers or epoch-based schemes, both more subtle than the stack or queue they protect.
Lock-free is not automatically faster. Under heavy contention, retry storms and cache-line ping-pong can make it slower than a well-tuned lock; back off or use an elimination array before assuming lock-free wins.
Choose deliberately: lock-free for single-pointer updates needing hard progress guarantees under moderate contention; a lock when you need blocking semantics, touch multiple words, or face brutal contention; wait-free only for hard real-time needs. In Java, default to java.util.concurrent's battle-tested implementations before hand-rolling either.

🤖 Don't fully get this? Learn it with Claude

Stuck on Lock-Free Stack & Queue (Treiber, Michael-Scott)? Open Claude, copy a block below, and it'll teach you this exact concept — visually and interactively.

🎨 Explain it visually

Build the mental picture, not memorization.

I just read a lesson on **Lock-Free Stack & Queue (Treiber, Michael-Scott)** (Concurrency) and want to truly understand it. Explain Lock-Free Stack & Queue (Treiber, Michael-Scott) from first principles using ONE vivid real-world analogy and a visual mental model — draw it as ASCII art or a clear step-by-step diagram — with a concrete example using real numbers. Then ask me one question to check I got the mental picture, and wait for my reply. If you're unsure or a claim isn't standard, say so and reason from first principles instead of guessing.

🤔 Walk me through it (interactive)

Socratic — adapts to where you're stuck.

Teach me **Lock-Free Stack & Queue (Treiber, Michael-Scott)** interactively. Ask me ONE guiding question at a time, wait for my answer, and adapt to my confusion — build the idea with me step by step instead of explaining it all at once. If you're unsure or a claim isn't standard, say so and reason from first principles instead of guessing.

🧪 Quiz me & fix my gaps

Active recall exposes what you missed.

Quiz me on **Lock-Free Stack & Queue (Treiber, Michael-Scott)** with 5 questions, easy to tricky, ONE at a time. Tell me if each answer is right; at the end, explain clearly what I got wrong and why. If you're unsure or a claim isn't standard, say so and reason from first principles instead of guessing.

🧠 Make it stick

Intuition + hook + flashcards for long-term memory.

Help me remember **Lock-Free Stack & Queue (Treiber, Michael-Scott)** for the long term: give the one-sentence intuition, a memorable hook/mnemonic, a tiny worked example, and 3 active-recall flashcards (Q -> A). If you're unsure or a claim isn't standard, say so and reason from first principles instead of guessing.

📝 My notes

← Compare-And-Swap (CAS) & the ABA P Safe Memory Reclamation (Hazard Po →