The Surface Code and Willow: What Below-Threshold Actually Means
Google's Willow chip (December 2024) was the first demonstration of a quantum error-correcting code with errors that decrease as you add qubits — the 'below threshold' result the field had chased for 30 years. This tutorial explains what the surface code is, why the threshold theorem matters, and what Willow's numbers imply for the path to fault-tolerant quantum computing.
Prerequisites: Tutorial 18: Noise and Decoherence
In December 2024, Google Quantum AI published in Nature the first demonstration of a quantum error-correcting code whose logical error rate decreases exponentially as you add physical qubits. The “Willow” chip — 105 superconducting qubits arranged as three surface-code patches of increasing size — showed that the more physical qubits you throw at one logical qubit, the more protected it becomes. After three decades of theoretical expectation and repeated near-misses, this was the threshold theorem turned into experimental reality.
Understanding what that sentence means — and what it doesn’t mean — is the gateway to thinking seriously about fault-tolerant quantum computing. This tutorial builds the surface code from first principles, explains the threshold theorem, and gives you honest numbers for how far Willow is from a useful fault-tolerant machine.
Why error correction is necessary
Tutorial 18 gave the numbers: the best 2026 two-qubit gate errors are around 10⁻³ on Quantinuum H2 and a few times 10⁻³ on IBM Heron. A useful quantum algorithm like Shor's attack on RSA-2048 needs roughly 10⁹ Toffoli-equivalent gates. Without error correction:
The probability that all ~10⁹ gates succeed is about (1 − 10⁻³)^(10⁹) ≈ e^(−10⁶) — astronomically impossible. You need per-logical-gate error rates of about 10⁻¹⁰ to 10⁻¹². Physical hardware is nine to ten orders of magnitude short. Error correction is the only known path from where we are to where useful quantum computing lives.
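To make "astronomically impossible" concrete, here is a two-line sketch using the round numbers above (10⁻³ error per gate, 10⁹ gates — illustrative figures, not a hardware benchmark):

```python
import math

p_gate, n_gates = 1e-3, 1_000_000_000
# Probability that every gate in the circuit succeeds, with no error correction.
# Computed in log space to avoid looping; the exponent is about -10^6.
p_success = math.exp(n_gates * math.log(1 - p_gate))
print(p_success)  # 0.0 — underflows to zero; the true value is ~e^(-10^6)
```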
Classical error correction as warmup
Classical ECC: encode 1 bit as 3 bits. To send 0, send 000. If one bit flips in transit, majority vote recovers the original. If two bits flip, majority vote fails — but the probability of two or more flips is 3p²(1 − p) + p³ ≈ 3p² for flip probability p, much smaller than the probability ~p of a single flip when p is small.
The repetition code is the simplest error-correcting code, and it motivates every quantum ECC: use redundancy, measure a syndrome (the majority vote), decode.
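A quick numeric check of the majority-vote claim (plain Python, no dependencies):

```python
def majority_vote_failure(p: float) -> float:
    """Failure probability of the classical 3-bit repetition code:
    majority vote fails when 2 or 3 of the 3 bits flip."""
    return 3 * p**2 * (1 - p) + p**3

for p in [0.1, 0.01, 0.001]:
    print(f"p={p}: raw error {p}, encoded error {majority_vote_failure(p):.2e}")
```

At p = 0.01 the encoded error is ≈ 3 × 10⁻⁴ — already more than 30× better than the raw bit, and the gap widens quadratically as p shrinks.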
Two complications in the quantum case:
- No-cloning theorem: you can't literally copy |ψ⟩ = α|0⟩ + β|1⟩ to get |ψ⟩|ψ⟩|ψ⟩. But you can encode it into a carefully entangled state α|000⟩ + β|111⟩.
- Measurement disturbs the state: you can't look at the encoded qubit directly to check for errors. You need to measure only error-detecting observables, never the logical state itself.
The stabilizer formalism
A stabilizer code defines the logical subspace via a set of mutually commuting Pauli operators S₁, …, Sₖ called stabilizers. A codeword is any state |ψ⟩ satisfying Sᵢ|ψ⟩ = |ψ⟩ for all i — i.e., a +1 eigenstate of every stabilizer.
Example: the 3-qubit bit-flip repetition code. Stabilizers Z₀Z₁ and Z₁Z₂. Codewords: |000⟩ (both stabilizers give +1) and |111⟩ (both give +1). Superpositions α|000⟩ + β|111⟩ are the encoded qubit.
How error detection works:
- A single bit-flip error X₀ anti-commutes with Z₀Z₁ (since X and Z anti-commute on the same qubit) but commutes with Z₁Z₂. So after X₀, measuring Z₀Z₁ yields −1 and Z₁Z₂ yields +1. Syndrome (−1, +1) → decode as "bit flip on qubit 0" → apply X₀ to correct.
- X₁ → syndrome (−1, −1).
- X₂ → syndrome (+1, −1).
- No error → syndrome (+1, +1).
Four possible error patterns → four distinct syndromes. Single bit-flips are detectable and correctable.
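The syndrome table can be verified directly with 8×8 Pauli matrices — a brute-force NumPy sketch (real stabilizer simulations use the tableau formalism instead, but on three qubits dense matrices are fine):

```python
import numpy as np

I2 = np.eye(2)
X = np.array([[0., 1.], [1., 0.]])
Z = np.diag([1., -1.])

def kron(*ops):
    """Tensor product of a list of single-qubit operators (qubit 0 = leftmost)."""
    out = np.array([[1.0]])
    for op in ops:
        out = np.kron(out, op)
    return out

S1 = kron(Z, Z, I2)   # stabilizer Z0 Z1
S2 = kron(I2, Z, Z)   # stabilizer Z1 Z2
psi = np.zeros(8); psi[0] = 1.0  # codeword |000⟩

errors = {"none": np.eye(8), "X0": kron(X, I2, I2),
          "X1": kron(I2, X, I2), "X2": kron(I2, I2, X)}
syndromes = {}
for name, E in errors.items():
    v = E @ psi
    # Eigenvalue of each stabilizer on the corrupted state: ⟨v|S|v⟩
    syndromes[name] = (int(np.round(v @ S1 @ v)), int(np.round(v @ S2 @ v)))
print(syndromes)
# {'none': (1, 1), 'X0': (-1, 1), 'X1': (-1, -1), 'X2': (1, -1)}
```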
But the repetition code only handles bit flips. Phase-flip errors (Z₀, Z₁, Z₂) commute with both stabilizers and go undetected. The 3-qubit code protects against one kind of error, not arbitrary ones.
From repetition to the surface code
To protect against both bit flips and phase flips (equivalently, arbitrary single-qubit errors, since any single-qubit Pauli is a product of X and Z up to phase: Y = iXZ), you need a code with both X-type and Z-type stabilizers. The surface code is the most important example.
Layout: a d × d grid of data qubits with ancilla qubits at the faces and vertices of the grid. Two types of stabilizer:
- Z-stabilizers (plaquettes): product of Z on the four data qubits around a face.
- X-stabilizers (vertices): product of X on the four data qubits around a vertex.
```
d — Z — d — Z — d
|       |       |
X — d — X — d — X
|       |       |
d — Z — d — Z — d
|       |       |
X — d — X — d — X
|       |       |
d — Z — d — Z — d
```
d = data qubit, Z = Z-stabilizer plaquette, X = X-stabilizer vertex.
Distance d of the code = the minimum weight of an undetectable Pauli error (a logical operator). For a distance-d surface code, you need d² data qubits plus d² − 1 ancillas, about 2d² in total. Distance-3 surface code: 17 physical qubits. Distance-5: 49. Distance-7: 97. That's the scaling wall on qubit overhead.
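The qubit counts follow from a one-line formula (d² data + d² − 1 ancilla):

```python
def surface_code_qubits(d: int) -> int:
    """Physical qubits in one distance-d surface code patch: d^2 data + d^2 - 1 ancilla."""
    return 2 * d * d - 1

for d in [3, 5, 7, 27]:
    print(f"d={d}: {surface_code_qubits(d)} physical qubits")
# d=3: 17, d=5: 49, d=7: 97, d=27: 1457
```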
The threshold theorem
Here’s the deep result. Define:
- Physical error rate p: the probability of a Pauli error per gate on any one physical qubit.
- Logical error rate p_L: the probability of an error on the encoded logical qubit per logical operation.
The threshold theorem (Aharonov–Ben-Or / Knill / Kitaev, ~1996–1998) says: for any fault-tolerant code family (including surface codes), there exists a threshold p_th such that
- If p < p_th, p_L decreases exponentially in the code distance d.
- If p > p_th, adding more physical qubits makes things worse — the error-correcting machinery introduces more errors than it fixes.
For the surface code, the theoretical threshold is p_th ≈ 1% — a remarkably generous number. Hardware with physical error rate below 1% can, in principle, be error-corrected to any desired p_L with sufficient d.
Scaling formula. Below threshold,

p_L(d) ≈ C · Λ^(−(d+1)/2),  where Λ = p_th / p

Halving p (from 10⁻³ to 5 × 10⁻⁴, for example) doubles Λ at fixed d. Raising d by 2 suppresses p_L by another factor of Λ at fixed p — that is how the exponential suppression accumulates.
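A sketch of the scaling formula in code. It assumes the convention p_L ≈ p · Λ^(−(d+1)/2) — taking the prefactor to be ~p is a rough assumption for illustration; real projections fit the prefactor from data:

```python
import math

def required_distance(p: float, target: float, lam: float) -> int:
    """Smallest odd code distance d with p * lam**(-(d+1)/2) <= target.
    Prefactor ~p is an assumption, not a fitted value."""
    # Solve (d+1)/2 >= log(p/target) / log(lam), round up to an integer,
    # with a small epsilon to absorb floating-point noise.
    halves = math.ceil(math.log(p / target) / math.log(lam) - 1e-9)
    return 2 * halves - 1

print(required_distance(1e-3, 1e-12, 10.0))   # 17 at a generous Λ = 10
print(required_distance(1e-3, 1e-12, 2.14))   # 55 at Willow's measured Λ ≈ 2.14
```

The contrast between the two calls shows why improving Λ (i.e., pushing p further below p_th) matters as much as adding qubits.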
What Willow actually demonstrated
For 30 years the field tried to show experimentally that increasing code distance reduces logical error rate. Every attempt hit plateaus — or the logical error rate even got worse going from d = 3 to d = 5. Physical gate errors sat just above threshold, and the exponential suppression didn't kick in.
Willow changed this in December 2024. Key Willow numbers:
| Code distance | Physical qubits | Logical error per cycle |
|---|---|---|
| 3 | 17 | ~3.0 × 10⁻² |
| 5 | 49 | ~1.5 × 10⁻² |
| 7 | 97 | ~7.2 × 10⁻³ |
Ratio Λ = p_L(d) / p_L(d+2) ≈ 2.1 at each step (Google reports Λ = 2.14 ± 0.02). Raising d by 2 roughly halved the logical error rate — the below-threshold exponential scaling, finally demonstrated.
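The Λ figure can be read straight off the table (values as reported above):

```python
# Willow logical error per cycle, from the table above
p_L = {3: 3.0e-2, 5: 1.5e-2, 7: 7.2e-3}

lam_35 = p_L[3] / p_L[5]
lam_57 = p_L[5] / p_L[7]
print(f"Λ(3→5) = {lam_35:.2f}, Λ(5→7) = {lam_57:.2f}")  # both ≈ 2: below threshold
```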
What Willow doesn’t show
The honest caveats that temper the headline:
- One logical qubit, memory only. Willow preserved a single logical qubit against idle noise. It did not perform logical gates, entangle logical qubits, or do any computation. Logical gates (especially T gates) are much harder and more expensive.
- Error rate still ~10⁻³, not 10⁻¹². To factor RSA-2048 you need p_L ≈ 10⁻¹², which would take code distance ~27. That's ~1,500 physical qubits per logical qubit, and you need thousands of logical qubits. Willow is the proof of concept; the physical-qubit scale-up is the ~20-million-qubit engineering problem.
- T-gate cost dominates any real calculation. Clifford gates (H, S, CNOT) can be done cheaply and fault-tolerantly on the surface code (e.g., via lattice surgery). T gates require magic state distillation: preparing noisy approximate T-eigenstates and distilling them down to the target error rate. Each distillation stage has ~15× qubit overhead, and you need multiple stages. For RSA-2048, T-gate factories consume ~60% of the total physical qubits.
- 2.5 µs cycle time. Willow's syndrome-extraction cycle (the round of measurements the decoder consumes) takes 2.5 µs. RSA-2048 factoring requires ~10 billion cycles; at 2.5 µs per cycle that's ~25,000 s ≈ 7 hours of syndrome extraction alone — consistent with the ~8 hours Gidney–Ekerå estimated, but tight.
- Classical processing must keep up. Every cycle, the classical decoder must ingest ~d² syndrome bits per logical qubit, run approximate maximum-likelihood decoding (in practice, minimum-weight perfect matching), and output corrections — all within the cycle time to avoid backlog. Willow demonstrated this is tractable at d = 7; scaling to d ≈ 27 across thousands of logical qubits is a real systems-engineering challenge.
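The distillation-overhead arithmetic from the T-gate bullet, as a sketch. The 35p³ output error is the standard leading-order estimate for the 15-to-1 protocol; exact constants vary by implementation:

```python
def distill(p_in: float, stages: int) -> tuple[float, int]:
    """Cascaded 15-to-1 magic state distillation.
    Each stage: output error ~ 35 * p^3, input-state cost x15."""
    p, cost = p_in, 1
    for _ in range(stages):
        p, cost = 35 * p**3, cost * 15
    return p, cost

for s in [1, 2]:
    err, cost = distill(1e-3, s)
    print(f"{s} stage(s): error ≈ {err:.1e}, {cost} input T-states")
# 1 stage: 3.5e-08 error, 15 inputs; 2 stages: ~1.5e-21 error, 225 inputs
```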
Surface code vs alternatives
Surface code is dominant because:
- Planar layout fits 2D hardware (Google, IBM).
- High threshold (~1%).
- Well-studied decoders (minimum-weight perfect matching).
Alternatives that may catch up:
- qLDPC codes (quantum low-density parity check) — better qubit overhead (roughly 10× fewer physical qubits per logical qubit than the surface code at comparable protection). IBM's 2024–2025 results suggest qLDPC may make fault tolerance practical at ~10× lower qubit count than the surface code. Worse threshold, harder decoding.
- Color codes — surface-code cousin with transversal Clifford + T gates. Cool theoretically, harder experimentally.
- Floquet codes (Hastings-Haah) — dynamically defined stabilizers. Promising for hardware with connectivity limits.
2026 is still surface-code-dominated, but qLDPC is closing fast. IBM’s Starling roadmap targets qLDPC-based fault tolerance; Google and Quantinuum continue on surface code.
A minimal simulated surface code patch
Building an actual surface-code circuit is many pages of stabilizer algebra. For a quick taste, simulate the repetition code’s syndrome extraction — same idea, simpler layout.
```python
import numpy as np
from qiskit import QuantumCircuit, transpile
from qiskit_aer import AerSimulator

SIM = AerSimulator()

def rep_code_cycle(p: float, logical_bit: int) -> dict:
    """One round of the 3-qubit bit-flip code with bit-flip errors injected with probability p."""
    qc = QuantumCircuit(5, 2)  # 3 data + 2 ancilla qubits, 2 syndrome bits
    if logical_bit:
        qc.x([0, 1, 2])  # logical |1⟩ = |111⟩
    # Inject independent physical bit-flip errors
    for q in range(3):
        if np.random.rand() < p:
            qc.x(q)
    # Syndrome extraction: ancilla 3 measures Z0 Z1, ancilla 4 measures Z1 Z2
    qc.cx(0, 3); qc.cx(1, 3)
    qc.cx(1, 4); qc.cx(2, 4)
    qc.measure([3, 4], [0, 1])
    counts = SIM.run(transpile(qc, SIM), shots=1).result().get_counts()
    syndrome = list(counts)[0]  # Qiskit bit order: the string reads "c1 c0"
    # Decode via lookup table
    if syndrome == "00":   correction = "none"
    elif syndrome == "01": correction = "flip q0"
    elif syndrome == "10": correction = "flip q2"
    else:                  correction = "flip q1"
    return {"syndrome": syndrome, "correction": correction}

# Run many trials at different physical error rates, count logical errors
for p in [0.05, 0.1, 0.15, 0.2, 0.3]:
    logical_errors = 0
    trials = 5000
    for _ in range(trials):
        result = rep_code_cycle(p, logical_bit=0)
        # After correction, count as logical error if the detected flip was
        # actually a masked multi-flip event.
        # (Quick approximation — a full analysis tracks the data qubits directly.)
        if result["correction"] != "none" and np.random.rand() < p / (1 - p):
            logical_errors += 1
    p_L = logical_errors / trials
    print(f"p={p:.2f}: p_L ≈ {p_L:.3f} ({'below threshold' if p_L < p else 'above threshold'})")
```
On the 3-qubit code, the pseudo-threshold is around 50% — far from practical, but useful for seeing the shape of the below/above-threshold transition.
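The ~50% pseudo-threshold can be checked analytically: the encoded error 3p²(1 − p) + p³ equals the raw error p exactly at p = 1/2.

```python
def encoded_error(p: float) -> float:
    """3-bit repetition code: majority vote fails when >= 2 of 3 bits flip."""
    return 3 * p**2 * (1 - p) + p**3

print(encoded_error(0.5) - 0.5)   # 0.0: break-even at the pseudo-threshold
print(encoded_error(0.4) < 0.4)   # True: encoding still helps below threshold
print(encoded_error(0.6) < 0.6)   # False: encoding hurts above threshold
```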
Exercises
1. Threshold reasoning
If the physical error rate is p = 10⁻³ and you want logical error p_L = 10⁻¹², what code distance does the surface code need? Assume the scaling formula p_L ≈ p · Λ^(−(d+1)/2) with Λ = 10.
Show answer
p_L ≈ p · Λ^(−(d+1)/2) with Λ = 10. Solve 10⁻³ · 10^(−(d+1)/2) = 10⁻¹², so 10^(−(d+1)/2) = 10⁻⁹, (d+1)/2 = 9, d = 17. So a distance-17 code — about 2d² ≈ 578 physical qubits per logical qubit.
2. Stabilizer commutation
For the surface code Z-plaquette and X-vertex operators on a small lattice, verify that any two stabilizers commute: they either act on disjoint qubits or anti-commute qubit-wise an even number of times. What's the geometric interpretation?
Show answer
X and Z operators anti-commute on the same qubit but commute on different qubits. For two stabilizers A and B to commute, they must share an even number of qubits on which they apply different Paulis. Geometrically, an adjacent X-vertex and Z-plaquette share exactly two data qubits (two corners of the plaquette touch the vertex) — two anti-commutations, whose product is a commutation. Non-adjacent pairs share zero qubits — they also commute. That's why the construction works.
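A NumPy check of the even-overlap rule, on toy 4-qubit operators rather than a full surface-code lattice:

```python
import numpy as np

I2 = np.eye(2)
X = np.array([[0., 1.], [1., 0.]])
Z = np.diag([1., -1.])

def kron(*ops):
    """Tensor product of single-qubit operators (qubit 0 = leftmost)."""
    out = np.array([[1.0]])
    for op in ops:
        out = np.kron(out, op)
    return out

Sx  = kron(X, X, I2, I2)   # X-type check on qubits 0,1
Sz2 = kron(Z, Z, I2, I2)   # Z-type check sharing TWO qubits with Sx
Sz1 = kron(Z, I2, I2, I2)  # Z-type check sharing ONE qubit with Sx

print(np.allclose(Sx @ Sz2, Sz2 @ Sx))     # True: even overlap -> commute
print(np.allclose(Sx @ Sz1, -(Sz1 @ Sx)))  # True: odd overlap -> anti-commute
```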
3. Magic state distillation overhead
A 15-to-1 magic state distillation protocol takes 15 noisy T-states (each with error p) and produces 1 lower-error T-state (error ≈ 35p³). To reach ~10⁻¹² starting from p = 10⁻³, how many distillation stages do you need?
Show answer
One stage: 35p³ = 35 × (10⁻³)³ = 3.5 × 10⁻⁸ — not yet 10⁻¹². A second stage: 35 × (3.5 × 10⁻⁸)³ ≈ 1.5 × 10⁻²¹, far below target. So two stages suffice, with margin to spare. Each stage multiplies qubit cost by 15; two stages = 225× overhead per T-state.
4. When is error correction worth it?
For a 100-gate Clifford circuit with physical error p = 10⁻³ per gate, compare: raw-hardware expected fidelity vs surface-code-protected fidelity at p_L = 10⁻⁴ (d = 5, ~49 physical qubits per logical qubit). Is the surface code net-positive here?
Show answer
Raw: (1 − 10⁻³)¹⁰⁰ ≈ 0.905. With surface code: (1 − 10⁻⁴)¹⁰⁰ ≈ 0.990. The surface code is clearly net-positive in logical error, but it uses ~50× as many physical qubits and far more physical operations to do those 100 logical gates. So the question is whether the ~10× error reduction is worth the ~50× resource cost. For a single critical result, yes. For VQE-like sampling, no.
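The exercise arithmetic in two lines, assuming p = 10⁻³ raw and p_L = 10⁻⁴ protected (the illustrative figures used above):

```python
n_gates = 100
raw = (1 - 1e-3) ** n_gates        # no error correction
protected = (1 - 1e-4) ** n_gates  # d=5 surface code, p_L ~ 1e-4 assumed
print(f"raw fidelity ≈ {raw:.3f}, protected ≈ {protected:.3f}")
# raw ≈ 0.905, protected ≈ 0.990
```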
What you should take away
- Error correction encodes 1 logical qubit in ~2d² physical qubits (surface code) and provides exponential suppression of logical errors for p < p_th.
- Threshold theorem: below threshold, adding qubits helps; above, it hurts. Surface code threshold ~1%.
- Willow showed we're below threshold experimentally for the first time. Λ ≈ 2.1 per step of d → d + 2 is the key metric.
- T gates and magic state distillation are the dominant resource cost in fault-tolerant quantum computing — not the stabilizer measurements themselves.
- Path to RSA-2048: ~20M physical qubits at today’s error rates. Willow is the first proof of concept; scale-up is the next decade.
Next: hardware comparison — superconducting vs trapped-ion vs photonic vs neutral-atom. Honest scorecards, honest tradeoffs, honest numbers.