Gradient descent & backprop
We keep saying "nudge the parameters to shrink the loss." Gradient descent is how: it is the optimization engine under all of deep learning. Backpropagation is the trick that computes the nudges efficiently for networks with millions of parameters. Understand these two and you understand how nearly every deep learning model gets trained.
The gradient: which way is downhill
Picture the loss as a landscape: parameters are your position, loss is your altitude, and you want the valley floor. The gradient is the vector of partial derivatives — it points in the direction of steepest increase. So to go down, step in the opposite direction:
$$ \theta \leftarrow \theta - \eta \, \nabla_\theta L $$
where $\eta$ ("eta") is the learning rate — how big a step to take. That one line is gradient descent.
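As a quick worked instance of that update, take a made-up two-parameter loss $L(\theta) = \theta_1^2 + \theta_2^2$ with $\theta = (1, 2)$ and $\eta = 0.1$ (the function and the numbers are illustrative assumptions, not from the chapter):

$$ \nabla_\theta L = \left( \frac{\partial L}{\partial \theta_1}, \frac{\partial L}{\partial \theta_2} \right) = (2\theta_1, 2\theta_2) = (2, 4), \qquad \theta \leftarrow (1, 2) - 0.1 \cdot (2, 4) = (0.8, 1.6) $$

One step moves the loss from $1^2 + 2^2 = 5$ down to $0.8^2 + 1.6^2 = 3.2$.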
The simplest possible descent
Minimize $f(x) = (x-3)^2$, whose minimum is obviously at $x=3$. Its derivative is $f'(x) = 2(x-3)$. Start at 0 and walk downhill:
x, lr = 0.0, 0.1
for step in range(40):
    grad = 2 * (x - 3)   # the gradient (slope) at the current x
    x -= lr * grad       # step opposite the gradient
    if step in (0, 5, 20, 39):
        print(f"step {step:2d} x={x:.4f} f(x)={(x-3)**2:.4f}")
Output:
step 0 x=0.6000 f(x)=5.7600
step 5 x=2.2136 f(x)=0.6185
step 20 x=2.9723 f(x)=0.0008
step 39 x=2.9996 f(x)=0.0000
Starting clueless at 0, it slid down to $x \approx 3$ — exactly the loop from Chapter 1, but isolated. Real models do this in millions of dimensions at once, but the move is identical: compute gradient, step against it.
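To make the "same move in more dimensions" claim concrete, here is a small sketch on a two-parameter bowl. The function, its minimum at (3, -1), and the names `loss`, `grad`, and `theta` are assumptions chosen for illustration; the update line is the same one as in the 1-D loop.

```python
import numpy as np

def loss(theta):
    # Illustrative bowl with its minimum at (3, -1)
    return (theta[0] - 3) ** 2 + 2 * (theta[1] + 1) ** 2

def grad(theta):
    # Vector of partial derivatives of the loss, one entry per parameter
    return np.array([2 * (theta[0] - 3), 4 * (theta[1] + 1)])

theta = np.zeros(2)   # start at the origin
lr = 0.1
for step in range(100):
    theta = theta - lr * grad(theta)   # step against the gradient

print("theta:", np.round(theta, 4), " loss:", round(loss(theta), 6))
```

After 100 steps `theta` should sit very close to (3, -1). Nothing in the loop cares whether `theta` has 2 entries or 2 million.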
The learning rate: the one knob you must get right
Too small and training crawls. Too big and it diverges — overshooting the valley and bouncing to ever-worse losses:
x, lr = 0.0, 1.1  # too large for this problem
for step in range(6):
    grad = 2 * (x - 3)
    x -= lr * grad
    print(f"step {step} x={x:.3f} f={(x-3)**2:.3f}")
Output:
step 0 x=6.600 f=12.960
step 1 x=-1.320 f=18.662
step 2 x=8.184 f=26.874
step 3 x=-3.221 f=38.698
step 4 x=10.465 f=55.726
step 5 x=-5.958 f=80.245
The loss grows at every step, the classic signature of a learning rate that's too high. If your training loss explodes to NaN, lower the learning rate first. A badly chosen learning rate is one of the most common causes of failed training runs.
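For this particular function you can see exactly where the cliff is. Each update turns the error $x - 3$ into $(1 - 2\,\text{lr})(x - 3)$, so the error shrinks only when $|1 - 2\,\text{lr}| < 1$, that is, for $0 < \text{lr} < 1$. The sweep below is a sketch of that; the helper `run` and the chosen learning rates are illustrative, and the threshold of 1 is specific to this quadratic, not a general rule.

```python
def run(lr, steps=20, x=0.0):
    """Run gradient descent on f(x) = (x - 3)**2 and return the final loss."""
    for _ in range(steps):
        x -= lr * 2 * (x - 3)
    return (x - 3) ** 2

for lr in (0.01, 0.1, 0.5, 0.9, 1.1):
    factor = 1 - 2 * lr   # each step multiplies the error (x - 3) by this
    print(f"lr={lr:<4}  error factor={factor:+.2f}  final loss={run(lr):.3g}")
```

A tiny rate barely moves in 20 steps, a moderate one converges, 0.5 happens to land on the minimum in one step for this function, 0.9 converges while bouncing from side to side, and 1.1 blows up.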
The three flavors: batch, stochastic, mini-batch
The loss is an average over training examples, so its gradient is too. How many examples you average per step gives three variants:
| Variant | Examples per step | Trade-off |
|---|---|---|
| Batch GD | all of them | exact gradient, but slow and memory-hungry |
| Stochastic GD (SGD) | one | noisy, fast, the noise can help escape bad spots |
| Mini-batch | a small chunk (32–512) | the practical default — fast and stable |
Mini-batch SGD is what everyone actually uses. One full pass over the dataset is an epoch; you train for many epochs.
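The three variants differ only in the batch size, which a small sketch can show. Everything here is an assumption for illustration: the synthetic data (a noisy line $y = 2x + 1$), the function `train`, and the hyperparameters.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: y = 2x + 1 plus a little noise
X = rng.uniform(-1, 1, size=200)
y = 2 * X + 1 + 0.1 * rng.normal(size=200)

def train(batch_size, epochs=100, lr=0.1):
    w, b = 0.0, 0.0
    n = len(X)
    for _ in range(epochs):                    # one epoch = one full pass over the data
        order = rng.permutation(n)             # reshuffle every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            err = w * X[idx] + b - y[idx]      # prediction error on this batch
            # Gradient of the mean squared error, averaged over the batch
            w -= lr * 2 * np.mean(err * X[idx])
            b -= lr * 2 * np.mean(err)
    return w, b

results = {}
for name, bs in [("batch", 200), ("mini-batch", 32), ("stochastic", 1)]:
    results[name] = train(bs)
    w, b = results[name]
    print(f"{name:10s} batch_size={bs:3d}  w={w:.3f}  b={b:.3f}")
```

All three should land near w = 2 and b = 1. What changes is the number of updates per epoch (1, 7, and 200 here) and how noisy each update is.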
Better optimizers: momentum and Adam
Plain SGD can zig-zag and crawl. Two upgrades dominate practice:
- Momentum: accumulate a running average of past gradients, like a ball rolling downhill gaining speed. Smooths the zig-zag.
- Adam: momentum plus a per-parameter adaptive learning rate (it divides each step by the square root of a running average of that parameter's squared gradients). Adam is the default optimizer for deep learning: robust, fast, forgiving of learning-rate choice. (Variant: AdamW, the standard for training transformers.)
You rarely implement these — you write torch.optim.Adam(model.parameters(), lr=...) and move on (Chapter 12).
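Still, seeing the update rules once takes the mystery out of them. Below is a minimal sketch of both on the same toy function $f(x) = (x-3)^2$. The function names and the learning rate are illustrative choices; `beta1=0.9`, `beta2=0.999`, and `eps=1e-8` are the commonly used Adam defaults; and the momentum update shown is one common formulation among several.

```python
import math

def grad(x):
    return 2 * (x - 3)   # gradient of f(x) = (x - 3)**2

def sgd_momentum(x=0.0, lr=0.1, beta=0.9, steps=500):
    v = 0.0
    for _ in range(steps):
        v = beta * v + grad(x)   # velocity: decayed accumulation of past gradients
        x -= lr * v
    return x

def adam(x=0.0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    m = v = 0.0
    for t in range(1, steps + 1):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g        # running average of gradients
        v = beta2 * v + (1 - beta2) * g * g    # running average of squared gradients
        m_hat = m / (1 - beta1 ** t)           # bias correction for the zero start
        v_hat = v / (1 - beta2 ** t)
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)   # per-parameter scaled step
    return x

print(f"momentum: x={sgd_momentum():.4f}")
print(f"adam:     x={adam():.4f}")
```

Both should end up close to 3. With many parameters, `m` and `v` simply become arrays of the same shape as the parameters, which is where the "per-parameter" learning rate comes from.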
Backpropagation: the chain rule, organized
A neural net is a deep composition of functions. To update a weight buried in layer 1, you need $\partial L / \partial w$ through every layer above it. The chain rule from calculus says: multiply the local derivatives along the path. Backpropagation is just the bookkeeping that does this efficiently — one forward pass to compute outputs, then one backward pass that pushes the error from the loss back to every parameter, reusing intermediate results.
The key insight that makes it cheap: compute gradients layer by layer from the output backward, reusing the gradient already computed for the layer above. Naively recomputing the derivative along every path separately can take time exponential in the depth of the network; backprop's cost is linear in the size of the network, a small constant multiple of the forward pass. We hand-derive and hand-code the entire backward pass in Chapter 11.
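As a preview of that bookkeeping, here is a sketch on the smallest possible "deep" network: two scalar weights with a tanh in between. The architecture and all the numbers are assumptions picked for illustration. Notice how each backward line reuses the gradient computed on the line before it.

```python
import math

# Tiny two-layer network on scalars (all values are illustrative)
x, y = 1.5, 0.5          # one input and its target
w1, w2 = 0.8, -1.2       # one weight per layer

# Forward pass: keep every intermediate value for reuse
z = w1 * x               # layer 1 pre-activation
h = math.tanh(z)         # layer 1 output
p = w2 * h               # layer 2 output (the prediction)
L = (p - y) ** 2         # squared-error loss

# Backward pass: start at the loss and work toward the input
dL_dp = 2 * (p - y)              # derivative of the loss w.r.t. the prediction
dL_dw2 = dL_dp * h               # local derivative dp/dw2 = h
dL_dh = dL_dp * w2               # gradient handed down to layer 1
dL_dz = dL_dh * (1 - h ** 2)     # tanh'(z) = 1 - tanh(z)**2
dL_dw1 = dL_dz * x               # local derivative dz/dw1 = x

print(f"loss={L:.4f}  dL/dw1={dL_dw1:.4f}  dL/dw2={dL_dw2:.4f}")
```

Written out as one product, `dL_dw1` is exactly the chain rule: `2*(p - y) * w2 * (1 - h**2) * x`. The backward pass just evaluates that product from left to right and shares the prefix `dL_dp` between both weights.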
Don't be confused: backprop vs. gradient descent vs. autograd.
- Backprop computes the gradients (applies the chain rule efficiently).
- Gradient descent uses them to update parameters ($\theta \mathrel{-}= \eta \, \text{grad}$).
- Autograd is the software (Chapter 12) that does backprop automatically so you never write it by hand.
They're three different things people blur together. Backprop = compute, gradient descent = update, autograd = the tool.
Trust but verify: the numerical gradient check
How do you know a hand-derived gradient is correct? Compare it to a numerical estimate taken straight from the definition of a derivative, the central difference $\frac{f(x+\epsilon) - f(x-\epsilon)}{2\epsilon}$ for a small $\epsilon$. If the two agree to several decimal places, your analytic gradient is almost certainly right. This gradient check has saved countless from-scratch implementations:
import numpy as np

def f(v):
    return v[0]**2 + 3*v[0]*v[1]  # df/dx = 2x+3y, df/dy = 3x

v = np.array([2.0, 1.0])
analytic = np.array([2*v[0] + 3*v[1], 3*v[0]])
eps, num = 1e-6, np.zeros(2)
for i in range(2):
    d = np.zeros(2)
    d[i] = eps
    num[i] = (f(v + d) - f(v - d)) / (2 * eps)  # central difference
print("analytic:", analytic, " numeric:", np.round(num, 5))
Output:
analytic: [7. 6.] numeric: [7. 6.]
They match to many decimals, so the analytic gradient is correct. (This trick is too slow for training, since it needs two forward passes per parameter, $O(\text{params})$ in total for a single gradient, but it is perfect for checking a small implementation.)
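The check earns its keep when it fails. The sketch below wraps the same idea in a reusable helper and feeds it a deliberately wrong gradient. The names `numerical_grad` and `relative_error` are hypothetical helpers, not library functions.

```python
import numpy as np

def numerical_grad(f, v, eps=1e-6):
    """Central-difference estimate of the gradient of f at v."""
    g = np.zeros_like(v)
    for i in range(v.size):
        d = np.zeros_like(v)
        d[i] = eps
        g[i] = (f(v + d) - f(v - d)) / (2 * eps)
    return g

def relative_error(a, b):
    # Scale-free comparison; the tiny constant avoids division by zero
    return np.max(np.abs(a - b) / (np.abs(a) + np.abs(b) + 1e-12))

def f(v):
    return v[0] ** 2 + 3 * v[0] * v[1]

v = np.array([2.0, 1.0])
correct = np.array([2 * v[0] + 3 * v[1], 3 * v[0]])
buggy = np.array([2 * v[0] + 3 * v[1], 3 * v[1]])   # deliberate mistake: 3y instead of 3x

num = numerical_grad(f, v)
print("correct gradient, relative error:", relative_error(correct, num))
print("buggy gradient, relative error:  ", relative_error(buggy, num))
```

The correct gradient should give a relative error many orders of magnitude below 1, while the buggy one is off by a large fraction. Where exactly to draw the pass/fail line is a judgment call that depends on `eps` and floating-point precision.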
Local minima, saddle points, and why it works anyway
The loss landscape of a deep network is wildly non-convex — full of valleys and saddle points. In theory gradient descent could get stuck; in practice, in very high dimensions, good-enough minima are everywhere and the noise of mini-batch SGD helps slide past bad spots. You don't need the global optimum, just a low valley that generalizes. That empirical fact is why deep learning works at all.
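A toy version of "noise helps slide past bad spots" fits in a few lines. The function below is an assumption chosen because it has a saddle point at (0, 0) and two minima at (0, 1) and (0, -1). The Gaussian noise added to the gradient is only a stand-in for mini-batch noise, not a model of it.

```python
import numpy as np

def grad(p):
    # f(x, y) = x**2 + y**4 / 4 - y**2 / 2
    # Saddle at (0, 0) with f = 0; minima at (0, 1) and (0, -1) with f = -0.25
    x, y = p
    return np.array([2 * x, y ** 3 - y])

def descend(noise_std, steps=500, lr=0.1, seed=0):
    rng = np.random.default_rng(seed)
    p = np.array([1.0, 0.0])   # starts exactly on the line y = 0 that leads into the saddle
    for _ in range(steps):
        g = grad(p) + noise_std * rng.normal(size=2)   # noisy gradient estimate
        p = p - lr * g
    return p

print("no noise:  ", np.round(descend(0.0), 3))
print("with noise:", np.round(descend(0.01), 3))
```

Without noise the run marches straight into the saddle and stays there, because the gradient in y is exactly zero along its path. With a little noise, y gets knocked off zero and the run slides into one of the two minima. Starting exactly on that line is contrived, which is part of the point: getting truly stuck requires a very unlucky setup.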
The takeaway
Gradient descent walks parameters downhill: θ -= lr * gradient. The learning rate is the make-or-break knob (too big → diverge). Mini-batch SGD with Adam is the default. Backprop is the chain rule applied efficiently to get the gradients; autograd automates it; a numerical gradient check verifies it. We now have the full training engine. Time to confront its dark twin: overfitting. 👉