Quantum Outpost
quantum ml · intermediate · 24 min read

Quantum ML Foundations: Encoding, Variational Circuits, and the Parameter-Shift Rule

Quantum machine learning trains parameterized quantum circuits as models for classical data. This tutorial covers the three classical-to-quantum encoding strategies, the parameter-shift rule that makes gradient-based training possible, and a complete PennyLane example training a variational classifier on a real dataset.

Prerequisites: Tutorial 14: QAOA for Combinatorial Optimization

Quantum machine learning generalizes the variational structure you met in VQE and QAOA to a new domain: fitting a parameterized quantum circuit to classical data. Instead of minimizing a Hamiltonian expectation value, you minimize a loss function over training data. Instead of a molecule’s energy, you get a classifier, regressor, or density estimator.

Before we tackle whether QML actually beats classical ML (the honest answer in tutorial 17: mostly no, but not everywhere), we need to understand how it works mechanically. This tutorial covers the three pieces you can’t skip: how classical data enters a quantum circuit, how parameterized circuits become trainable models, and the parameter-shift rule that provides exact analytical gradients on any quantum hardware.

The encoding problem

You have classical data — vectors in $\mathbb{R}^n$, images, categorical labels. You need quantum states. The map from data to quantum state is called encoding or embedding, and it’s the most consequential design choice in QML. A bad encoding throws away the data’s structure; a good one makes downstream computation natural.

Three standard strategies:

1. Basis encoding

Given a bit string $x \in \{0, 1\}^n$, encode as the computational-basis state $|x\rangle$. Trivial to prepare — just $X$ gates on the qubits where $x_i = 1$. But wildly wasteful: one classical bit per qubit. For an $n$-feature dataset you need $n$ qubits even if the features take only small values.

Use case: genuinely binary features (yes/no, sensor on/off). Almost never the right choice for continuous data.

2. Amplitude encoding

Given a real or complex vector $x \in \mathbb{R}^N$ (or $\mathbb{C}^N$) of dimension $N = 2^n$, encode as the amplitudes of an $n$-qubit state:

$$|\psi_x\rangle = \frac{1}{\|x\|}\sum_{i=0}^{N-1} x_i\,|i\rangle.$$

Exponentially compact: an $N$-dimensional vector fits in $\log_2 N$ qubits. A 1024-pixel image needs only 10 qubits of amplitude encoding.

The catch: state preparation is expensive. The general circuit for preparing an arbitrary amplitude-encoded state has gate depth $O(N) = O(2^n)$ — the exponential compactness is paid back in preparation cost. Efficient amplitude encoding exists only for structured data (sparse vectors, data with known Fourier structure).

import pennylane as qml
import numpy as np

def amplitude_encoding_example():
    wires = 2
    dev = qml.device("default.qubit", wires=wires)
    x = np.array([0.1, 0.3, 0.5, 0.8])

    @qml.qnode(dev)
    def circuit():
        # normalize=True rescales x to unit norm before embedding
        qml.AmplitudeEmbedding(x, wires=range(wires), normalize=True)
        return qml.state()

    print(circuit())
    # [0.10050..., 0.30151..., 0.50252..., 0.80403...]

amplitude_encoding_example()

3. Angle encoding

Encode each feature as a rotation angle: $x_i \to R_y(x_i)|0\rangle$ for one qubit per feature. Simple to prepare (one rotation per qubit) but needs $n$ qubits for $n$ features — linearly compact, not exponentially. Most practical QML uses angle encoding because it maps cleanly to near-term hardware.

import pennylane as qml
import numpy as np

def angle_encoding_example():
    n = 4
    dev = qml.device("default.qubit", wires=n)
    x = np.array([0.1, 0.3, 0.5, 0.8])

    @qml.qnode(dev)
    def circuit():
        qml.AngleEmbedding(x, wires=range(n), rotation="Y")
        return [qml.expval(qml.PauliZ(i)) for i in range(n)]

    print(circuit())
    # Each expectation value equals cos(x_i)
    # [0.99500..., 0.95534..., 0.87758..., 0.69671...]

angle_encoding_example()

Variants:

  • Rx encoding vs Ry vs Rz (basis choice; Ry is conventional).
  • Repeated encoding — apply the encoding twice (or $k$ times) to expand the frequency spectrum the circuit can express (Pérez-Salinas et al., “data re-uploading”).

Which to pick?

Honest defaults for indie QML work:

| Situation | Pick |
| --- | --- |
| Tabular data, $\leq 20$ features | Angle encoding |
| Very high-dim data ($\geq 2^{10}$) with structure | Amplitude encoding if prep is efficient; else PCA first |
| Image / time-series | Convolutional-inspired reuploading |
| Quantum-native data (molecular states, etc.) | Native encoding — don’t flatten to a classical vector first |

Parameterized quantum circuits

A PQC is a quantum circuit $U(x, \theta)$ with two kinds of inputs: classical data $x$ (encoded as above) and trainable parameters $\theta$ (rotation angles in an ansatz). The model’s prediction for an input $x$ is an expectation value of some measurement:

$$f(x, \theta) = \langle 0|\,U^\dagger(x, \theta)\, M\, U(x, \theta)\,|0\rangle$$

for some observable $M$ (typically a Pauli operator or a linear combination of them).

To train, define a loss $\mathcal{L}(\theta) = \sum_i L(f(x_i, \theta), y_i)$ over a dataset $\{(x_i, y_i)\}$ and minimize over $\theta$.

The parameter-shift rule

Classical neural networks use backpropagation: compute gradients by reverse-mode automatic differentiation through the computational graph. Quantum circuits running on real hardware don’t support this — you can’t peek at the internal quantum state mid-circuit without measuring.

The parameter-shift rule solves this. For any gate of the form $R_G(\theta) = e^{-i\theta G/2}$ with $G$ having exactly two distinct eigenvalues (like any Pauli), the derivative of the circuit’s expectation value is:

$$\frac{\partial \langle H\rangle}{\partial \theta} \;=\; \frac{1}{2}\left[\langle H\rangle_{\theta + \pi/2} - \langle H\rangle_{\theta - \pi/2}\right].$$

The derivative equals a finite difference of the circuit run with shifted parameters, with shift $\pi/2$. This is not an approximation — it’s exact. No second-order error, no finite-difference step-size tuning.

Proof sketch: for $G$ with eigenvalues $\pm 1$, the expectation value is a single-frequency trigonometric function of the parameter, $\langle H\rangle(\theta) = A + B\cos\theta + C\sin\theta$. Differentiating and applying the angle-addition identities shows that $\tfrac{1}{2}[f(\theta + \pi/2) - f(\theta - \pi/2)] = -B\sin\theta + C\cos\theta = f'(\theta)$ for any function of this form. That is why the $\pi/2$ shift gives an exact derivative rather than an approximation.
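The underlying identity can be checked numerically: for any function of the form $f(\theta) = A + B\cos\theta + C\sin\theta$ (which every two-eigenvalue-gate expectation value takes), the shifted difference reproduces the derivative exactly, with arbitrary coefficients:

```python
import numpy as np

rng = np.random.default_rng(1)
A, B, C = rng.normal(size=3)  # arbitrary coefficients

f = lambda t: A + B * np.cos(t) + C * np.sin(t)
fprime = lambda t: -B * np.sin(t) + C * np.cos(t)  # exact derivative

theta = 0.37
shift_grad = 0.5 * (f(theta + np.pi / 2) - f(theta - np.pi / 2))
print(np.isclose(shift_grad, fprime(theta)))  # True for any A, B, C, theta
```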

Why this is huge: it means every rotation in your circuit becomes differentiable using only standard forward circuit evaluations. You can train on real noisy hardware — the parameter-shift rule doesn’t care whether the expectation values are exact or noisy, just that you measure them.

import pennylane as qml
import pennylane.numpy as np

dev = qml.device("default.qubit", wires=1)

# Request parameter-shift explicitly: on a simulator like default.qubit,
# PennyLane would otherwise use backprop, which real hardware can't do
@qml.qnode(dev, diff_method="parameter-shift")
def model(theta):
    qml.RY(theta, wires=0)
    return qml.expval(qml.PauliZ(0))

# Analytical derivative of <Z> = cos(θ) is -sin(θ)
# Parameter-shift: (1/2)(cos(θ + π/2) - cos(θ - π/2)) = -sin(θ) ✓

theta = np.array(0.7, requires_grad=True)
auto_grad = qml.grad(model)(theta)
manual_shift = 0.5 * (model(theta + np.pi/2) - model(theta - np.pi/2))
analytical = -np.sin(theta)

print(f"Autograd:           {auto_grad:.6f}")
print(f"Manual shift:       {manual_shift:.6f}")
print(f"Analytical -sin(θ): {analytical:.6f}")
# All three equal to 6 decimal places: -0.644218

For gates whose generators have more than two distinct eigenvalues, generalized parameter-shift rules exist (Mitarai et al. 2018 and Schuld et al. 2019 introduced the two-term rule; Wierichs et al. 2022 treat the general case). PennyLane handles these automatically.

A working variational classifier

Build a classifier on the moons dataset — a classic 2D two-class toy problem that can’t be separated by a straight line but is easy for simple nonlinear models.

import pennylane as qml
import pennylane.numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split

# --- Data ---
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
y_bipolar = 2 * y - 1                          # relabel {0, 1} → {-1, +1}
X_train, X_test, y_train, y_test = train_test_split(X, y_bipolar, test_size=0.3, random_state=0)

# --- Model ---
n_qubits = 2
n_layers = 3
dev = qml.device("default.qubit", wires=n_qubits)

@qml.qnode(dev, interface="autograd")
def circuit(x, weights):
    # Angle encoding with data reuploading: re-apply the encoding before each layer
    for layer in range(n_layers):
        qml.AngleEmbedding(x, wires=range(n_qubits), rotation="Y")
        # weights[layer] has shape (1, n_qubits, 3): one entangling layer
        qml.StronglyEntanglingLayers(weights[layer], wires=range(n_qubits))
    return qml.expval(qml.PauliZ(0))

def predict(x, weights):
    return circuit(x, weights)

# --- Loss & training ---
def square_loss(pred, y):
    return (pred - y) ** 2

def cost(weights):
    preds = [predict(x, weights) for x in X_train]
    return np.mean([square_loss(p, y) for p, y in zip(preds, y_train)])

weights = 0.01 * np.random.randn(n_layers, 1, n_qubits, 3, requires_grad=True)
opt = qml.AdamOptimizer(stepsize=0.1)

for epoch in range(40):
    weights, loss = opt.step_and_cost(cost, weights)
    if epoch % 10 == 0:
        preds_test = [np.sign(predict(x, weights)) for x in X_test]
        acc = np.mean(np.array(preds_test) == y_test)
        print(f"epoch {epoch:3d}  loss {loss:.4f}  test acc {acc:.2%}")

# Epoch 0:   loss 1.001  test acc 45%
# Epoch 10:  loss 0.67   test acc 72%
# Epoch 20:  loss 0.41   test acc 85%
# Epoch 30:  loss 0.28   test acc 90%

On moons with 200 samples, a 2-qubit 3-layer classifier reaches ~90% test accuracy. Classical logistic regression gets ~87%. A two-layer MLP gets ~97%. The quantum classifier is competitive on this toy but doesn’t beat classical.
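The classical baselines are one scikit-learn call each. Exact accuracies will wobble a few points with the random seed and network size; the `(16, 16)` hidden layers below are an arbitrary choice for illustration:

```python
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Linear baseline: can only draw a straight decision boundary
logreg = LogisticRegression().fit(X_tr, y_tr)
# Small nonlinear baseline
mlp = MLPClassifier(hidden_layer_sizes=(16, 16), max_iter=2000,
                    random_state=0).fit(X_tr, y_tr)

print(f"logistic regression: {logreg.score(X_te, y_te):.2%}")
print(f"two-layer MLP:       {mlp.score(X_te, y_te):.2%}")
```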

Expressibility vs trainability

Two dials you can turn:

  • Expressibility — how much of the unitary space the ansatz can reach. Deeper and wider circuits are more expressible.
  • Trainability — how easy it is to optimize the parameters. Deeper circuits hit barren plateaus (Tutorial 13) and training stalls.

The tradeoff is sharp: maximally expressive circuits are maximally untrainable. Real QML practice picks ansätze that match the data’s structure — if you know your data has locality, use local entangling gates; if you know it has translation symmetry, use equivariant ansätze.

Common ansatz families

PennyLane templates you’ll meet constantly:

  • StronglyEntanglingLayers — generic hardware-efficient ansatz (HEA): single-qubit rotations + ring-of-CNOTs. Default for “I don’t know what to use.”
  • BasicEntanglerLayers — cheaper version; one rotation per qubit per layer.
  • SimplifiedTwoDesign — theoretically-motivated lightly-entangled ansatz.
  • TwoLocal (Qiskit) — layers of parameterized single-qubit rotations + fixed entanglement pattern.

None of these are domain-specific. For image data, papers propose quantum convolutional networks (QCNN, Cong-Choi-Lukin 2019); for graph data, equivariant quantum neural networks (Ragone et al. 2023).


Exercises

1. Encoding comparison

Given a 16-dimensional feature vector, how many qubits does each encoding use? What about preparation gate count?

Show answer

Amplitude encoding: 4 qubits, $O(16)$ gates for arbitrary amplitude prep. Angle encoding: 16 qubits, 16 gates (one $R_y$ each). Basis encoding: 16 qubits (assuming features are binarized), up to 16 $X$ gates. Angle is simplest; amplitude is most compact if prep is efficient.

2. Parameter-shift by hand

Compute the parameter-shift gradient of $\langle Z \rangle$ for the circuit $H \cdot R_z(\theta) \cdot H$ applied to $|0\rangle$, at $\theta = \pi/3$.

Show answer

With $R_z(\theta) = e^{-i\theta Z/2}$, working through the circuit gives $H R_z(\theta) H |0\rangle = \cos(\theta/2)|0\rangle - i\sin(\theta/2)|1\rangle$, so $\langle Z\rangle = \cos\theta$. Derivative: $-\sin\theta$. At $\theta = \pi/3$: $-\sin(\pi/3) = -\sqrt{3}/2 \approx -0.866$. Parameter-shift: $\tfrac{1}{2}(\cos(\pi/3 + \pi/2) - \cos(\pi/3 - \pi/2)) = \tfrac{1}{2}(-\sin(\pi/3) - \sin(\pi/3)) = -\sin(\pi/3)$. Match.

3. Classifier depth sweep

For the moons classifier above, measure test accuracy as a function of number of layers from 1 to 8. Where does performance saturate? Can you train at 10 layers?

Expected pattern

1-layer ~75%; 2-layer ~85%; 3-layer ~90%; 4-6 layers ~92%; 7-10 layers plateau or decline as optimization gets stuck. Barren-plateau effects become noticeable around 8+ layers on 2 qubits.

4. Build an angle-encoding that beats amplitude for a specific case

Find a synthetic 2-class dataset where angle encoding (with data reuploading) gives higher test accuracy than amplitude encoding at the same qubit count.

Hint

Datasets with sharp periodic features (e.g., $y = \operatorname{sign}(\sin(2\pi x_0))$) are easier for Fourier-series-expressing angle encoding + reuploading. Amplitude encoding lacks explicit period structure.

What you should take away

  • Encoding is a first-class design decision. Basis, amplitude, angle — each has distinct cost and bias.
  • PQCs + parameter-shift rule make quantum circuits differentiable and trainable on real hardware.
  • Parameter-shift is exact, not approximate. The $\pi/2$ shift isn’t a finite-difference step size.
  • Expressibility-trainability tradeoff dominates ansatz design. Structure matters more than depth.
  • Competitive with but rarely better than classical on toy problems as of 2026. Tutorial 17 digs into why.

Next: Quantum kernels and feature maps — a different route to QML that turns the encoding itself into the model, with an implicit kernel you can plug into a scikit-learn SVM.

