Lock-Free Stack & Queue (Treiber, Michael-Scott)
The core trick: a single atomic compare-and-swap replaces the lock
A lock-free data structure works because the hardware exposes one instruction — compareAndSet(expected, new), backed by x86 CMPXCHG (with a LOCK prefix), which atomically checks a memory word still holds the value you last read and, only if so, swaps in a new value, all as one indivisible bus transaction. ARM/POWER don't have a single CAS instruction; they build the same semantics from LDXR/STXR (load-linked / store-conditional): LDXR marks the cache line as watched, and STXR commits only if nothing touched that line since — otherwise it reports failure and software must retry the load-modify-store loop itself. LL/SC is strictly weaker than CAS: it can fail spuriously (any intervening event on the line, not just a value mismatch, can abort the store), so a correct CAS on ARM is really "LL/SC wrapped in a retry loop," never a single instruction.
If you can express a whole update as "flip one pointer from old to new," then instead of taking a mutex you optimistically build the new state, attempt the CAS, and retry the entire operation if someone beat you to the word. No thread ever holds a lock, so no thread can block another by being descheduled mid-update.
This matters because a mutex has a fatal failure mode under contention: the OS can preempt the thread holding the lock (page fault, quantum expiry, priority inversion), and now every other thread is stuck behind a sleeper — throughput collapses even though there is CPU to spare. A lock-free structure gives a stronger guarantee: at least one thread always makes progress system-wide, regardless of how the scheduler interleaves or stalls the others. That is the whole point — progress is a property of the algorithm, not of the scheduler's goodwill.
Treiber stack — CAS on a single head pointer
Treiber's stack (1986) is the canonical lock-free structure because a stack's entire mutable state is one pointer. Push means "make my node the new head, pointing at the old head"; pop means "make head's successor the new head." Each is a single-word swap, so a single CAS suffices.
import java.util.concurrent.atomic.AtomicReference;
public final class TreiberStack<E> {
private static final class Node<E> {
final E item;
Node<E> next;
Node(E item) { this.item = item; }
}
private final AtomicReference<Node<E>> head = new AtomicReference<>();
public void push(E item) {
Node<E> n = new Node<>(item);
Node<E> old;
do {
old = head.get(); // read current top
n.next = old; // point new node at it
} while (!head.compareAndSet(old, n)); // publish, or retry
}
public E pop() {
Node<E> old, next;
do {
old = head.get();
if (old == null) return null; // empty
next = old.next;
} while (!head.compareAndSet(old, next));
return old.item;
}
}Traced push race. Stack holds head → n20 → n17 → null. Thread T1 pushes 42, T2 pushes 99, and both are scheduled at the same instant:
| t | T1 (push 42) | T2 (push 99) | head after |
|---|---|---|---|
| 0 | old = n20; n42.next = n20 | old = n20; n99.next = n20 | n20 |
| 1 | CAS(n20, n42) → true | — | n42 |
| 2 | — | CAS(n20, n99) → false (head is n42, not n20) | n42 |
| 3 | returns | retry: old = n42; n99.next = n42 | n42 |
| 4 | — | CAS(n42, n99) → true | n99 |
No update was lost: the final stack is head → n99 → n42 → n20 → n17. T2 did wasted work (one failed CAS + a re-read) but was never blocked — that wasted retry is exactly the cost you trade a lock's blocking for.
Michael-Scott queue — two pointers, a dummy node, and "helping"
A FIFO queue is harder than a stack: enqueue mutates the tail, dequeue mutates the head, and a single CAS cannot atomically update two words. Michael & Scott's insight (1996, the algorithm behind Java's ConcurrentLinkedQueue) is to split enqueue into two independent single-word CAS steps and make every thread help finish a half-done operation it stumbles on. A permanent dummy node also keeps head and tail from ever being null, killing a swarm of empty/one-element edge cases.
- Link: CAS the last node's
nextfromnullto the new node. This is the linearization point — the node is now in the queue. - Swing tail: CAS
tailto point at the new node. This is advisory: if it fails or the thread stalls before doing it,taillags one node behind — but the queue is still correct.
Because tail can lag, any thread that finds tail.next != null knows an enqueue is mid-flight and performs the swing itself (CAS(tail, t, next)) before proceeding. That is helping: it is what upgrades the queue from merely lock-free to robust — a thread that dies between step 1 and step 2 cannot wedge the structure, because the next thread through repairs it.
import java.util.concurrent.atomic.AtomicReference;
public final class MSQueue<E> {
private static final class Node<E> {
final E item;
final AtomicReference<Node<E>> next = new AtomicReference<>();
Node(E item) { this.item = item; }
}
private final AtomicReference<Node<E>> head, tail;
public MSQueue() {
Node<E> dummy = new Node<>(null);
head = new AtomicReference<>(dummy);
tail = new AtomicReference<>(dummy);
}
public void enqueue(E item) {
Node<E> n = new Node<>(item);
while (true) {
Node<E> t = tail.get();
Node<E> next = t.next.get();
if (t == tail.get()) { // snapshot still valid?
if (next == null) { // t is really last
if (t.next.compareAndSet(null, n)) { // (1) LINK — linearizes
tail.compareAndSet(t, n); // (2) swing (advisory)
return;
}
} else {
tail.compareAndSet(t, next); // HELP a lagging tail
}
}
}
}
public E dequeue() {
while (true) {
Node<E> h = head.get();
Node<E> t = tail.get();
Node<E> next = h.next.get();
if (h == head.get()) {
if (h == t) { // queue empty or tail lags
if (next == null) return null; // truly empty
tail.compareAndSet(t, next); // HELP, then loop
} else {
E item = next.item; // read BEFORE the CAS
if (head.compareAndSet(h, next)) // linearizes dequeue
return item;
}
}
}
}
}Linearizability: why these are correct, not just fast
The correctness bar for a concurrent object is linearizability: every operation must appear to take effect instantaneously at some single point between its call and its return, and that ordering must be consistent with a legal sequential history. For these structures the linearization point is exactly the successful CAS:
- Treiber push/pop: the moment
head.compareAndSetreturns true. Before it, the operation is invisible; after it, it is fully committed and the node is present or gone. Failed CAS attempts are non-events. - MS enqueue: the successful
t.next.compareAndSet(null, n)(step 1) — not the tail swing. The item is observable in the queue the instant its predecessor'snextpoints to it, even whiletailstill lags. - MS dequeue: the successful
head.compareAndSet(h, next). Note the code readsnext.itembefore that CAS — read after, and a concurrent dequeuer could recycle the node and hand you garbage.
Because each operation has exactly one atomic commit point, concurrent executions are always equivalent to some sequential order — so callers can reason about the queue as if it were single-threaded, which is the entire contract they need.
Pitfalls a working engineer actually hits
1. The ABA problem. CAS checks the pointer value, not "was this word untouched." Suppose pop reads head=A, A.next=B, then stalls. Meanwhile another thread pops A, pops B, and pushes A back (a freed node reused at the same address). Head is A again — the stalled thread's CAS(A, B) succeeds because the bit pattern matches, but B may have been freed or mutated, and the stack's real structure is no longer what the stalled thread assumed. This is the number-one lock-free bug in C/C++, and Treiber's stack plus the Michael-Scott queue are the textbook examples used to teach it, because both hinge on "does this pointer still equal what I read" as their entire correctness argument.
AtomicReference/CAS does NOT solve ABA by itself — nothing about compare-and-set detects that a value went A → B → A. Java gets away with it only as a side effect of garbage collection: the GC will not recycle A's memory while any thread still holds a reference to it, and every fresh push allocates a brand-new Node object, so the "same address, different meaning" scenario cannot arise for live nodes. Take away the GC and the same Java code is exposed to ABA exactly like C++.AtomicStampedReference in Java, CMPXCHG16B in C++), or a safe-reclamation scheme that prevents the address from being reused at all while it might still be observed.2. Memory reclamation is the real hard problem. In a non-GC language, when can you free a popped or dequeued node? Never, if another thread might still be dereferencing it — and in the Michael-Scott queue specifically, a thread can be holding a reference to a node it read as head.next right as another thread dequeues and frees that same node out from under it. The industrial answers — hazard pointers (Michael 2004: each thread publishes the pointers it's currently dereferencing, and a reclaimer scans that list before freeing anything) and epoch-based reclamation (RCU-style: defer frees until every thread has passed a grace-period checkpoint) — are more code and more subtlety than the stack/queue itself. "Lock-free" in a managed runtime hides this cost entirely; in C/C++/Rust it dominates the design and is usually the harder half of implementing either structure correctly.
3. Livelock / retry storms under high contention. Lock-freedom guarantees someone progresses, not that everyone does, and not fairly. On a hot stack under dozens of cores, CAS failure rates spike, threads burn CPU on doomed retries, and the contended cache line ping-pongs between cores' L1 caches (each CAS forces an exclusive-state cache-line transfer). Throughput can end up worse than a good lock. Mitigations: exponential backoff on failure, or an elimination array (a push and a concurrent pop cancel out without touching head).
4. Missing memory ordering. You get correctness for free in Java only because AtomicReference operations carry the right happens-before edges (a successful CAS has volatile-write semantics). Roll your own with plain fields and the reader sees a stale or half-constructed node. In C++ you must choose the correct memory_order on every atomic — relaxed here is a silent data race.
Trade-offs: when lock-free beats a lock — and when it doesn't
The honest comparison is against a mutex-guarded structure (e.g. ArrayDeque behind a ReentrantLock or a synchronized block, or Java's LinkedBlockingQueue which uses two locks + condition variables).
| Dimension | Lock-free (Treiber / MS) | Lock-based (ReentrantLock / synchronized / BlockingQueue) |
|---|---|---|
| Progress under preemption | Guaranteed — a stalled thread blocks no one | A preempted lock-holder stalls everyone waiting on it |
| Low / no contention | Fast (1 uncontended CAS) | Fast (an uncontended lock is cheap too, especially biased/thin locks) |
| High contention | Can degrade — retry storms, cache-line ping-pong | Degrades more gracefully; OS parks losers (no CPU burn) |
| Blocking semantics ("wait until non-empty") | Not natural — must spin or bolt on a semaphore | Built in via condition variables / take() |
| Reclamation (non-GC languages) | Hard — hazard pointers / epoch-based reclamation | Trivial — free inside the critical section, nobody else can be looking |
| Implementation & audit cost | High — ABA, memory ordering, pointer-level reasoning | Low — the lock linearizes everything for you |
| vs. wait-free algorithms | Only guarantees some thread progresses; a single thread can in theory retry forever under adversarial scheduling | N/A — locks make no forward-progress guarantee at all under preemption |
Reach for lock-free when: the critical section is a single-pointer flip, you need hard progress guarantees (real-time code, an interrupt/signal handler, or code that must keep running while another thread might be paused mid-critical-section), contention is low-to-moderate, or you are in a GC language that erases the reclamation tax. Prefer a lock when: the operation touches multiple words or has no single natural CAS commit point, you need to block until a condition holds, contention is brutal (a good lock parks losers instead of spinning), or you are in a non-GC language where a correct lock is far cheaper to get right than correct reclamation. If you need a stronger guarantee than lock-free — that every thread finishes in a bounded number of steps regardless of contention — that's wait-free territory (e.g. wait-free queues via Kogan & Petrank's fast-path-slow-path technique); it costs more code and more per-op overhead than either a lock or a plain lock-free structure, and is rarely worth it outside hard real-time systems. As a default in Java, reach for java.util.concurrent first — ConcurrentLinkedQueue is the MS queue, and LinkedBlockingQueue is the lock-based alternative — before hand-rolling either.
Takeaways
- Lock-free = express the update as one atomic CAS + retry. Progress becomes an algorithmic guarantee independent of the scheduler, not a promise the OS makes.
- Treiber stack works because a stack is one pointer; MS queue needs two pointers and splits enqueue into link-then-swing so each step is still a single-word CAS — and any thread that finds a lagging tail helps finish it rather than blocking on it.
- The linearization point is always the successful CAS. Everything before it is invisible, everything after it is fully committed — that single atomic instant is what lets callers reason about the structure as if it ran single-threaded.
- ABA is the defining lock-free bug, and Java's CAS does not fix it on its own.
compareAndSetonly compares bit patterns; it cannot tell A → B → A from "untouched." Java is safe only because the GC keeps a live node's address from being reused — take away the GC (as in C/C++/Rust) and you need tagged/versioned pointers, hazard pointers, or epoch-based reclamation to get the same safety. - Memory reclamation, not the CAS loop itself, is usually the hard part in a non-GC language — freeing a node that another thread might still be dereferencing requires hazard pointers or epoch-based schemes, both more subtle than the stack or queue they protect.
- Lock-free is not automatically faster. Under heavy contention, retry storms and cache-line ping-pong can make it slower than a well-tuned lock; back off or use an elimination array before assuming lock-free wins.
- Choose deliberately: lock-free for single-pointer updates needing hard progress guarantees under moderate contention; a lock when you need blocking semantics, touch multiple words, or face brutal contention; wait-free only for hard real-time needs. In Java, default to
java.util.concurrent's battle-tested implementations before hand-rolling either.
🤖 Don't fully get this? Learn it with Claude
Stuck on Lock-Free Stack & Queue (Treiber, Michael-Scott)? Open Claude, copy a block below, and it'll teach you this exact concept — visually and interactively.
Build the mental picture, not memorization.
I just read a lesson on **Lock-Free Stack & Queue (Treiber, Michael-Scott)** (Concurrency) and want to truly understand it. Explain Lock-Free Stack & Queue (Treiber, Michael-Scott) from first principles using ONE vivid real-world analogy and a visual mental model — draw it as ASCII art or a clear step-by-step diagram — with a concrete example using real numbers. Then ask me one question to check I got the mental picture, and wait for my reply. If you're unsure or a claim isn't standard, say so and reason from first principles instead of guessing.
Socratic — adapts to where you're stuck.
Teach me **Lock-Free Stack & Queue (Treiber, Michael-Scott)** interactively. Ask me ONE guiding question at a time, wait for my answer, and adapt to my confusion — build the idea with me step by step instead of explaining it all at once. If you're unsure or a claim isn't standard, say so and reason from first principles instead of guessing.
Active recall exposes what you missed.
Quiz me on **Lock-Free Stack & Queue (Treiber, Michael-Scott)** with 5 questions, easy to tricky, ONE at a time. Tell me if each answer is right; at the end, explain clearly what I got wrong and why. If you're unsure or a claim isn't standard, say so and reason from first principles instead of guessing.
Intuition + hook + flashcards for long-term memory.
Help me remember **Lock-Free Stack & Queue (Treiber, Michael-Scott)** for the long term: give the one-sentence intuition, a memorable hook/mnemonic, a tiny worked example, and 3 active-recall flashcards (Q -> A). If you're unsure or a claim isn't standard, say so and reason from first principles instead of guessing.