gates and circuits advanced · 16 min read · By LIPAI WANG · May 1, 2026

Controlled-Unitary Synthesis: How to Build C-U for Arbitrary U

Quantum phase estimation, amplitude amplification, and block encoding all rely on controlled-unitary operations C-U for arbitrary U. The naive construction is via phase kickback through the eigenbasis; practical constructions go through Barenco-style multi-control decomposition, lazy controlled-U for amplitude amplification, and qubitization-style controlled block-encodings. This tutorial covers the standard constructions and their gate-count costs.

Prerequisites: Tutorial 11: QFT and Phase Estimation, Tutorial 54: Toffoli Decomposition

The controlled-unitary $C\text{-}U$ — applying $U$ on the target register conditioned on a control qubit being $|1\rangle$ — is one of the most-used operations in quantum algorithms. Quantum phase estimation builds $C\text{-}U^{2^k}$ for various $k$ . Amplitude amplification uses controlled reflections. Block encoding constructions need controlled-block-encodings. The naive question — given an algorithm that requires $C\text{-}U$ for arbitrary $U$ , how do you build it from the available native gate set? — is more subtle than it first appears.

The structural fact: for any unitary $U$ , there is a controlled version $C\text{-}U$ , and it can be implemented at gate cost roughly proportional to $U$ ‘s own cost plus some overhead. The overhead matters: a generic decomposition via Barenco et al. 1995 adds factors of 2-4 to the gate count; specialized constructions (lazy controlled-U for amplitude amplification, qubitization-style block encodings) can do much better.

This tutorial covers the standard constructions: phase kickback through the eigenbasis, Barenco-style multi-control decomposition, the “lazy” controlled-U trick for reflection operations, and the modern block-encoding-friendly controlled constructions.

The problem

Given a unitary $U$ acting on $n$ qubits and a 1-qubit control register, the controlled version $C\text{-}U$ is defined as:

C\text{-}U \,|c\rangle |\psi\rangle \;=\; \begin{cases} |c\rangle |\psi\rangle & \text{if } c = 0, \\ |c\rangle U |\psi\rangle & \text{if } c = 1. \end{cases}

For specific gates, $C\text{-}U$ is a primitive: $C\text{-}X = \text{CNOT}$ , $C\text{-}Z$ is a phase gate, $C\text{-}H$ is a controlled-Hadamard, etc. For arbitrary $U$ given as a circuit, the construction is non-trivial.

Phase kickback: the fundamental mechanism

Suppose $U$ has spectral decomposition $U = \sum_k e^{i\phi_k} |\psi_k\rangle\langle\psi_k|$ . If the target register is in eigenstate $|\psi_k\rangle$ , applying $C\text{-}U$ gives

C\text{-}U \,|c\rangle |\psi_k\rangle \;=\; |c\rangle (e^{i\phi_k})^c |\psi_k\rangle \;=\; e^{i \phi_k c} |c\rangle |\psi_k\rangle.

The phase from $U$ on the eigenvector becomes a phase on the control qubit when the control is $|1\rangle$ . The eigenvalue of $U$ kicks back onto the control register. This is the phase-kickback trick that powers quantum phase estimation.

The implementation insight: $C\text{-}U$ does not need to “physically” condition each gate of $U$ on the control. It needs to produce the correct kickback phase. This unlocks several construction strategies.

The naive construction: control every gate

If $U = G_m \cdots G_2 G_1$ is a product of $m$ basic gates, the simplest construction of $C\text{-}U$ replaces each $G_i$ with $C\text{-}G_i$ :

C\text{-}U \;=\; C\text{-}G_m \cdots C\text{-}G_2 \, C\text{-}G_1.

The cost is $m$ controlled basic gates. If each $C\text{-}G_i$ is a controlled-Pauli or controlled-rotation, the cost is comparable to $U$ itself: factor-of-2-to-4 overhead from the Cliffords needed to construct controlled versions.

This works for any $U$ but produces non-optimal circuits. Modern constructions exploit structure to do better.

Barenco multi-controlled gates

The dual problem: given a multi-controlled $C^k\text{-}U$ (controlled by $k$ qubits, applying $U$ if all controls are $|1\rangle$ ), how do you decompose into 1- and 2-qubit gates?

The Barenco-Bennett-Cleve-DiVincenzo-Margolus-Shor-Sleator-Smolin-Weinfurter 1995 paper gave the canonical answer: a $C^k\text{-}U$ on $k$ controls and $n$ -qubit target uses:

$O(k)$ Toffolis if $k - 1$ “borrowed” auxiliary qubits are available (Barenco-style ladder).
$O(k^2)$ Toffolis if no auxiliary qubits are available.
$O(k \log k)$ in some intermediate regimes with parallel auxiliary structure.

For a $C^k\text{-}U$ with $U$ a single-qubit rotation, this gives total Clifford+T cost of roughly $4k$ T gates from the Toffoli ladder plus $\sim 4 \log(1/\epsilon)$ T gates from the Solovay-Kitaev synthesis of $U$ .

Key practical implication: multi-controlled gates are linear-in-controls in T-count, not exponential. The Barenco construction was the structural breakthrough that made multi-controlled operations affordable.

Lazy controlled-U for amplitude amplification

Tutorial 10 introduced amplitude amplification, which uses two controlled operations: the controlled oracle $C\text{-}U_f$ and the controlled diffuser $C\text{-}D$ . The naive construction would replace every gate in $U_f$ and $D$ with its controlled version — doubling or tripling the gate count.

The lazy controlled-U trick: for reflection operations (i.e., operations of the form $I - 2 |\psi\rangle\langle\psi|$ for some state $|\psi\rangle$ ), the controlled version equals the original operation conditioned on a phase factor. The full reflection only needs to be applied without the conditioning — a single ancilla qubit takes the phase via kickback.

Concretely, for a reflection $R = I - 2|\psi\rangle\langle\psi|$ :

C\text{-}R \;=\; (I \otimes R) \cdot e^{i \pi (|c=1\rangle\langle c=1| \otimes |\psi\rangle\langle\psi|)},

and the second factor decomposes into a single multi-controlled phase, which is the only “controlled” part of the construction. The reflection itself is uncontrolled.

The savings: a controlled reflection costs one extra multi-controlled phase rotation, rather than doubling the cost of the entire reflection. For amplitude amplification with many iterations, this is a large win.

Qubitization-style controlled block encodings

Modern block-encoding-based algorithms (tutorial 29) need controlled block encodings: $C\text{-}\text{BE}(A)$ where $\text{BE}(A)$ is the block encoding of matrix $A$ . The naive construction makes every gate of $\text{BE}(A)$ controlled, doubling the cost.

The qubitization technique (Low-Chuang 2017) uses a different structure: the block-encoding ancilla qubits encode a signal flag that automatically turns the action on or off. The controlled version costs only one additional qubit and one additional reflection — a very small overhead compared to the full block encoding.

Concretely, a qubitization step costs $O(1)$ block-encoding queries plus $O(1)$ Cliffords. The controlled version costs $O(1)$ block-encoding queries plus $O(1)$ Cliffords plus one extra reflection. The factor-of-2-to-4 overhead of generic controlled-U disappears for qubitization-style constructions. This is one of the structural reasons modern quantum algorithms (HHL via QSVT, quantum signal processing) achieve their gate counts.

A small controlled-U demonstration

Concrete code building a controlled-U for a parameterized rotation:

import pennylane as qml
import numpy as np

dev = qml.device("default.qubit", wires=3)


@qml.qnode(dev)
def controlled_rotation_demo(theta, control_state):
    """Apply controlled-RY(theta) on target, conditioned on control."""
    if control_state == 1:
        qml.PauliX(wires=0)  # set control to |1>
    # Initialize target in |+>
    qml.Hadamard(wires=1)
    # Controlled-RY: directly available in PennyLane
    qml.ctrl(qml.RY, control=0)(theta, wires=1)
    return qml.state()


# When control is |0>, target stays in |+>
print("Control |0>, theta=pi/2:")
print(controlled_rotation_demo(np.pi/2, 0))

# When control is |1>, target rotates by theta
print("Control |1>, theta=pi/2:")
print(controlled_rotation_demo(np.pi/2, 1))


# Manual decomposition of CRY(theta) into Cliffords + RY
@qml.qnode(dev)
def manual_cry(theta, control_state):
    """Manual CRY(theta) decomposition into Cliffords + RY/2 sequences.
    Standard identity: CRY(θ) = RY(θ/2) ⊗ I, CNOT, RY(-θ/2) ⊗ I, CNOT
    """
    if control_state == 1:
        qml.PauliX(wires=0)
    qml.Hadamard(wires=1)
    qml.RY(theta / 2, wires=1)
    qml.CNOT(wires=[0, 1])
    qml.RY(-theta / 2, wires=1)
    qml.CNOT(wires=[0, 1])
    return qml.state()


print("Manual CRY decomposition:")
print(manual_cry(np.pi/2, 1))

The manual decomposition of CRY uses 2 RY gates + 2 CNOTs. For an arbitrary controlled-U where U has $g$ gates, the analogous decomposition uses $\sim 2g$ gates. The overhead factor is roughly 2 — manageable for small $U$ , dominant for large $U$ where lazy or qubitization-style constructions are needed.

Common misconceptions

“Controlled-U doubles the cost of U.” Naive construction does. Specialized constructions (lazy controlled, qubitization) can have constant-factor overhead. Modern algorithms exploit these.

“Multi-controlled gates require exponential ancillas.” No, the Barenco construction shows linear-in-controls cost with linear-in-controls auxiliary qubits, or quadratic-in-controls cost with no auxiliaries. Both are polynomial.

“Phase kickback only works for eigenstates.” Phase kickback works for any state expanded in the $U$ -eigenbasis. The control register receives a superposition of phases corresponding to the superposition of eigenvalue components. This is the magic of phase estimation.

“Implementing $C^k\text{-}X$ is harder than implementing $C\text{-}U$ for arbitrary $U$ .” For large $k$ , multi-controlled $C^k\text{-}X$ is concretely harder because the Toffoli ladder is long. But for $k \leq 3$ , it’s smaller than most controlled-U constructions.

“Controlled-U is always implementable.” Mathematically yes. Practically, $U$ ‘s decomposition into native gates determines whether the controlled version is feasible — controlled versions of complex sub-circuits inherit all the complexity of the original.

Decision rule

For each $C\text{-}U$ in an algorithm:

Is $U$ small? If $U$ has fewer than 5-10 gates, naive controlled-each-gate is fine. The factor-of-2 overhead is acceptable.
Is $U$ a reflection? Use lazy controlled-U: one extra multi-controlled phase, no other overhead. Saves a factor of 2-4.
Is $U$ a block encoding (or qubitization step)? Use the qubitization-native controlled construction. Constant-factor overhead.
Is $U$ generic and large? Naive construction is the default; sometimes specialized constructions are available based on $U$ ‘s structure (e.g., $U$ is a Trotterization).
Is the control multi-qubit ( $C^k$ -U for $k > 1$ )? Use the Barenco decomposition — the auxiliary-qubit version is much cheaper than the no-auxiliary version.

For most quantum-algorithm work, the lazy-controlled trick (for reflections) and the qubitization-native trick (for block encodings) are the dominant savings. Plain phase-kickback handles the rest.

Exercises

1. The cost of generic controlled-U

A unitary $U$ has 1000 gates (mostly Cliffords + a few T gates). Estimate the gate count of $C\text{-}U$ via naive construction.

Show answer

Naive construction replaces each gate of $U$ with its controlled version. Each controlled-Clifford costs ~3 gates; each controlled-T costs ~5 gates (with phase kickback through an ancilla). If $U$ has 950 Cliffords and 50 T-gates, the controlled version has $\sim 2850 + 250 = 3100$ gates. Factor of 3 overhead for the naive construction. Lazy or qubitization-style constructions, where applicable, can reduce this to factor 1.5 or even constant overhead. Choosing the right construction matters when $C\text{-}U$ is called many times in an algorithm (e.g., in QPE).

2. Why lazy controlled-U works

The lazy trick replaces a fully-controlled reflection with a single multi-controlled phase rotation plus an uncontrolled reflection. Why does this give the same answer?

Show answer

A reflection $R = I - 2|\psi\rangle\langle\psi|$ acts on $|\psi\rangle$ as $R|\psi\rangle = -|\psi\rangle$ and on $|\psi^\perp\rangle$ as $R|\psi^\perp\rangle = |\psi^\perp\rangle$ . The controlled version $C\text{-}R$ should leave the target unchanged when control is $|0\rangle$ , and apply $R$ when control is $|1\rangle$ . This is equivalent to applying $R$ unconditionally and then applying a phase of $\pi$ on the $|c=0\rangle |\psi\rangle$ component to “undo” the reflection-action on the $|c=0\rangle$ branch. The phase correction is a single multi-controlled-phase gate — much cheaper than applying $R$ controlled. This trick generalizes to any operation that is “self-inverse with diagonal action on a known subspace” — reflections, Clifford-equivalent reflections, etc.

3. Multi-control overhead

Implement $C^4\text{-}X$ (4-controlled NOT) using the Barenco-with-ancillas construction. How many Toffolis are needed, and what is the T-count using Selinger’s 4-T Toffoli decomposition?

Show answer

The Barenco-with-ancillas construction uses a Toffoli ladder. For $C^4\text{-}X$ with 3 ancilla qubits: AND of controls 1,2 into ancilla A1 (1 Toffoli); AND of A1, control 3 into A2 (1 Toffoli); AND of A2, control 4 with target (1 Toffoli) — 3 Toffolis going forward. Then uncompute by reverse-Toffoli on the ancillas (3 more Toffolis). Total: 6 Toffolis. With Selinger’s 4-T decomposition: $6 \times 4 = 24$ T gates. The no-ancilla Barenco construction would be $O(k^2) = 16$ Toffolis = 64 T-gates — a clear win to use ancillas when available.

4. Why qubitization-controlled is cheaper

Qubitization-style block encoding has cost $O(1)$ controlled queries to the matrix oracle. Generic block encoding’s controlled version doubles the cost. Why does qubitization avoid this doubling?

Show answer

Qubitization’s block encoding uses a structure where the ancilla flag qubits naturally signal whether the matrix-action is “active.” Adding a control register to qubitization means adding one more flag qubit; the existing block-encoding query is unchanged because it would “noop” on the inactive sector anyway. The controlled version inherits the block encoding’s already-conditional structure. Generic block encoding’s controlled version requires conditioning every internal gate, doubling the cost. Qubitization’s design principle — building flag-conditional structure into the block encoding from the start — makes the controlled version essentially free. This is one of the key practical reasons qubitization-based algorithms (LCU, QSVT, HHL via QSVT) have such favorable resource estimates.

Where this goes next

Tutorial 56 covers ZX-calculus — a graphical rewriting language for quantum circuits that complements the gate-based decompositions of tutorials 53-55 with a different optimization framework. Together, the four tutorials in this batch (53-56) cover the gate-level compilation toolkit that takes algorithm-specified rotations and Toffolis through to fault-tolerant gate counts.

The problem

Phase kickback: the fundamental mechanism

The naive construction: control every gate

Barenco multi-controlled gates

Lazy controlled-U for amplitude amplification

Qubitization-style controlled block encodings

A small controlled-U demonstration

Common misconceptions

Decision rule

Exercises

1. The cost of generic controlled-U

2. Why lazy controlled-U works

3. Multi-control overhead

4. Why qubitization-controlled is cheaper

Where this goes next

Quantum, for people who already code.