Gradient descent & backprop

We keep saying "nudge the parameters to shrink the loss." Gradient descent is how that nudging happens: it is the optimization engine under nearly all of deep learning. Backpropagation is the trick that computes the nudges efficiently for networks with millions of parameters. Understand these two and you understand how almost every modern neural network gets trained.

The gradient: which way is downhill

Picture the loss as a landscape: parameters are your position, loss is your altitude, and you want the valley floor. The gradient is the vector of partial derivatives — it points in the direction of steepest increase. So to go down, step in the opposite direction:

$$ \theta \leftarrow \theta - \eta \, \nabla_\theta L $$

where $\eta$ ("eta") is the learning rate — how big a step to take. That one line is gradient descent.

The simplest possible descent

Minimize $f(x) = (x-3)^2$, whose minimum is obviously at $x=3$. Its derivative is $f'(x) = 2(x-3)$. Start at 0 and walk downhill:

x, lr = 0.0, 0.1
for step in range(40):
    grad = 2 * (x - 3)        # the gradient (slope) at the current x
    x -= lr * grad            # step opposite the gradient
    if step in (0, 5, 20, 39):
        print(f"step {step:2d}  x={x:.4f}  f(x)={(x-3)**2:.4f}")

Output:

step  0  x=0.6000  f(x)=5.7600
step  5  x=2.2136  f(x)=0.6185
step 20  x=2.9723  f(x)=0.0008
step 39  x=2.9996  f(x)=0.0000

Starting clueless at 0, it slid down to $x \approx 3$ — exactly the loop from Chapter 1, but isolated. Real models do this in millions of dimensions at once, but the move is identical: compute gradient, step against it.
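The same loop carries over unchanged to more than one parameter. Here is a minimal sketch, assuming a hypothetical two-parameter bowl $f(x, y) = (x-3)^2 + 2(y+1)^2$ that was picked only for illustration; the function and the name `grad_f` are not part of the chapter's running example.

```python
# Hypothetical 2-D example: f(x, y) = (x - 3)^2 + 2 * (y + 1)^2, minimum at (3, -1).
def grad_f(params):
    x, y = params
    return [2 * (x - 3), 4 * (y + 1)]   # partial derivatives df/dx and df/dy

params, lr = [0.0, 0.0], 0.1
for step in range(100):
    grad = grad_f(params)
    # Same update as the 1-D case, applied to every coordinate at once.
    params = [p - lr * g for p, g in zip(params, grad)]

print([round(p, 4) for p in params])    # should sit very close to [3, -1]
```

Nothing about the update rule changed; only the gradient became a list of partial derivatives instead of a single slope.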

The learning rate: the one knob you must get right

Too small and training crawls. Too big and it diverges — overshooting the valley and bouncing to ever-worse losses:

x, lr = 0.0, 1.1            # too large for this problem
for step in range(6):
    grad = 2 * (x - 3)      # slope at the current x
    x -= lr * grad          # the oversized step overshoots the minimum
    print(f"step {step}  x={x:.3f}  f={(x-3)**2:.3f}")

Output:

step 0  x=6.600  f=12.960
step 1  x=-1.320  f=18.662
step 2  x=8.184  f=26.874
step 3  x=-3.221  f=38.698
step 4  x=10.465  f=55.726
step 5  x=-5.958  f=80.245

The loss grows at every step, the classic signature of a learning rate that's too high. If your training loss explodes to NaN, lower the learning rate first. A badly chosen learning rate is one of the most common causes of failed training runs.
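For this particular quadratic you can see exactly where the boundary lies. Substituting $f'(x) = 2(x-3)$ into the update gives $x - 3 \leftarrow (1 - 2\eta)(x - 3)$, so the distance to the minimum shrinks only when $|1 - 2\eta| < 1$, that is, $0 < \eta < 1$. With $\eta = 0.1$ the factor is $0.8$ (the smooth run earlier); with $\eta = 1.1$ it is $-1.2$ (the sign-flipping blow-up above). The sketch below sweeps a few rates; the helper name `final_error` and the specific rates are illustrative choices, and the threshold of 1 holds for this toy function only.

```python
def final_error(lr, steps=50, x=0.0):
    """Run gradient descent on f(x) = (x - 3)^2 and return |x - 3| at the end."""
    for _ in range(steps):
        x -= lr * 2 * (x - 3)
    return abs(x - 3)

# Illustrative learning rates: too small, good, ideal, oscillating, diverging.
for lr in (0.01, 0.1, 0.5, 0.9, 1.1):
    print(f"lr={lr:<4}  |x-3| after 50 steps = {final_error(lr):.3e}")
```

The smallest rate is still far from the minimum after 50 steps, the middle rates converge, and the rate above 1 ends up further away than where it started.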

The three flavors: batch, stochastic, mini-batch

The loss is an average over training examples, so its gradient is too. How many examples you average per step gives three variants:

| Variant | Examples per step | Trade-off |
|---|---|---|
| Batch GD | all of them | exact gradient, but slow and memory-hungry |
| Stochastic GD (SGD) | one | noisy and fast; the noise can help escape bad spots |
| Mini-batch | a small chunk (32–512) | the practical default: fast and stable |

Mini-batch SGD is what almost everyone uses in practice. One full pass over the dataset is an epoch; you train for many epochs.
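To make the epoch and mini-batch structure concrete, here is a minimal sketch that fits a line with mini-batch SGD in plain Python. The synthetic data ($y = 2x + 1$ plus noise), the batch size of 32, and the learning rate are all assumptions chosen for illustration.

```python
import random

random.seed(0)
# Synthetic data (made up for illustration): y = 2x + 1 plus a little noise.
xs = [random.uniform(-1, 1) for _ in range(200)]
data = [(x, 2 * x + 1 + random.gauss(0, 0.1)) for x in xs]

w, b = 0.0, 0.0
lr, batch_size = 0.1, 32

for epoch in range(50):                      # one epoch = one full pass over the data
    random.shuffle(data)                     # new batch composition every epoch
    for i in range(0, len(data), batch_size):
        batch = data[i:i + batch_size]
        # Gradient of the mean squared error, averaged over this mini-batch only.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / len(batch)
        grad_b = sum(2 * (w * x + b - y) for x, y in batch) / len(batch)
        w -= lr * grad_w
        b -= lr * grad_b

print(f"w={w:.2f}  b={b:.2f}")               # should land near the true values 2 and 1
```

Setting `batch_size = len(data)` turns this into batch gradient descent, and `batch_size = 1` turns it into pure stochastic gradient descent; the loop itself does not change.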

Better optimizers: momentum and Adam

Plain SGD can zig-zag and crawl. Two upgrades dominate practice:

  • Momentum: accumulate a decaying running average of past gradients, like a ball rolling downhill gaining speed. This smooths the zig-zag.
  • Adam: momentum plus a per-parameter adaptive learning rate (each step is divided by the square root of a running average of that parameter's squared gradients). Adam is a common default optimizer for deep learning: robust, fast, and relatively forgiving of learning-rate choice. (Variant: AdamW, which decouples weight decay from the gradient update and is widely used for training transformers.)

You rarely implement these — you write torch.optim.Adam(model.parameters(), lr=...) and move on (Chapter 12).
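Even so, it helps to see how little code the two update rules are. The sketch below applies both to the same toy function $f(x) = (x-3)^2$ in plain Python. The momentum form shown (velocity as a decaying sum of gradients) is one common convention, and the Adam hyperparameters `beta1=0.9`, `beta2=0.999`, `eps=1e-8` are the commonly cited defaults; the function names and step counts are illustrative assumptions, not a reference implementation.

```python
import math

def grad(x):
    return 2 * (x - 3)                          # gradient of f(x) = (x - 3)^2

def sgd_momentum(x=0.0, lr=0.1, beta=0.9, steps=200):
    velocity = 0.0
    for _ in range(steps):
        velocity = beta * velocity + grad(x)    # decaying sum of past gradients
        x -= lr * velocity
    return x

def adam(x=0.0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=300):
    m, v = 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g         # running mean of gradients (momentum)
        v = beta2 * v + (1 - beta2) * g * g     # running mean of squared gradients
        m_hat = m / (1 - beta1 ** t)            # bias correction for the zero start
        v_hat = v / (1 - beta2 ** t)
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)   # per-parameter scaled step
    return x

print(f"momentum: x={sgd_momentum():.3f}   adam: x={adam():.3f}")
```

On a one-parameter bowl neither method beats plain gradient descent; their benefits show up on badly scaled, high-dimensional losses, which this toy does not try to reproduce.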

Backpropagation: the chain rule, organized

A neural net is a deep composition of functions. To update a weight buried in layer 1, you need $\partial L / \partial w$ through every layer above it. The chain rule from calculus says: multiply the local derivatives along the path. Backpropagation is just the bookkeeping that does this efficiently — one forward pass to compute outputs, then one backward pass that pushes the error from the loss back to every parameter, reusing intermediate results.

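The key insight that makes it cheap: compute gradients layer by layer from the output backward, reusing the gradient already computed for the layer above. Naively expanding the chain rule along every path from a weight to the loss can take time exponential in depth, because the number of paths multiplies at each layer. Backprop visits each operation once, so its cost is linear in the size of the network, within a small constant factor of the forward pass. We hand-derive and hand-code the entire backward pass in Chapter 11.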
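As a small preview of that bookkeeping, here is a sketch of one forward and one backward pass through a deliberately tiny network: one weight, a tanh, a second weight, and a squared-error loss. The architecture, the names `forward` and `backward`, and the input values are assumptions made for illustration only.

```python
import math

def forward(w1, w2, x, y):
    h = w1 * x                   # layer 1 (linear)
    a = math.tanh(h)             # nonlinearity
    y_hat = w2 * a               # layer 2 (linear)
    loss = (y_hat - y) ** 2
    return loss, (a, y_hat)      # cache the intermediates the backward pass needs

def backward(w2, x, y, cache):
    a, y_hat = cache
    d_yhat = 2 * (y_hat - y)     # dL/dy_hat
    d_w2 = d_yhat * a            # dL/dw2
    d_a = d_yhat * w2            # dL/da, computed once and reused below
    d_h = d_a * (1 - a ** 2)     # tanh'(h) = 1 - tanh(h)^2
    d_w1 = d_h * x               # dL/dw1
    return d_w1, d_w2

w1, w2, x, y = 0.5, -1.5, 2.0, 1.0
loss, cache = forward(w1, w2, x, y)
d_w1, d_w2 = backward(w2, x, y, cache)
print(f"loss={loss:.4f}  dL/dw1={d_w1:.4f}  dL/dw2={d_w2:.4f}")
```

Each line of `backward` multiplies one local derivative onto a gradient that was already computed for the step after it, which is the reuse that keeps the cost linear.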

Don't be confused: backprop vs. gradient descent vs. autograd.

  • Backprop computes the gradients (applies the chain rule efficiently).
  • Gradient descent uses them to update parameters ($\theta \mathrel{-}= \eta\,\text{grad}$).
  • Autograd is the software (Chapter 12) that does backprop automatically so you never write it by hand.

They're three different things people blur together. Backprop = compute, gradient descent = update, autograd = the tool.
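The split is easier to see when all three appear in one short program. The sketch below is a toy scalar autograd, not how any real library is implemented: the `Value` class, its fields, and its support for only `+` and `*` are simplifications assumed for illustration.

```python
class Value:
    """A scalar that remembers how it was computed, so gradients can flow back."""
    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents           # Values this one was computed from
        self._local_grads = local_grads   # d(self)/d(parent) for each parent

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def backward(self):
        # Backprop: walk the graph from the output back, applying the chain rule.
        order, seen = [], set()
        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node._parents:
                    visit(parent)
                order.append(node)        # parents are listed before children
        visit(self)
        self.grad = 1.0
        for node in reversed(order):      # children first, so node.grad is complete
            for parent, local in zip(node._parents, node._local_grads):
                parent.grad += local * node.grad

w = Value(0.0)
for step in range(40):
    diff = w + Value(-3.0)
    loss = diff * diff                    # forward pass builds the graph for (w - 3)^2
    w.grad = 0.0                          # clear the gradient from the previous step
    loss.backward()                       # autograd: runs backprop for us
    w.data -= 0.1 * w.grad                # gradient descent: uses the result
print(round(w.data, 4))                   # 2.9996, same as the hand-written loop
```

The `backward` method is backprop, the line `w.data -= 0.1 * w.grad` is gradient descent, and the `Value` class as a whole plays the role of autograd.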

Trust but verify: the numerical gradient check

How do you know a hand-derived gradient is correct? Compare it to a numerical estimate, the central-difference approximation of the derivative, $\frac{f(x+\epsilon) - f(x-\epsilon)}{2\epsilon}$, for a small $\epsilon$. If the two agree to several decimal places, your analytic gradient is almost certainly right. This gradient check has saved countless from-scratch implementations:

import numpy as np
def f(v): return v[0]**2 + 3*v[0]*v[1]      # df/dx = 2x+3y, df/dy = 3x
v = np.array([2.0, 1.0])
analytic = np.array([2*v[0] + 3*v[1], 3*v[0]])

eps, num = 1e-6, np.zeros(2)
for i in range(2):
    d = np.zeros(2); d[i] = eps
    num[i] = (f(v + d) - f(v - d)) / (2 * eps)   # central difference
print("analytic:", analytic, " numeric:", np.round(num, 5))

Output:

analytic: [7. 6.]  numeric: [7. 6.]

They match to many decimals → the analytic gradient is correct. (This trick is too slow for training, since it needs two forward passes per parameter, i.e. $O(\text{params})$ forward passes per gradient, but it is perfect for checking a small implementation.)
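The check earns its keep when the analytic gradient is wrong. The sketch below reuses the same function, plants a deliberate bug in one partial derivative, and compares both versions against the numerical estimate using relative error. The helper names `numeric_grad` and `max_rel_error` are hypothetical, and any pass/fail threshold you pick (something near $10^{-6}$ is a common rule of thumb) is a judgment call rather than a fixed standard.

```python
def f(v):
    x, y = v
    return x ** 2 + 3 * x * y

def numeric_grad(f, v, eps=1e-6):
    grad = []
    for i in range(len(v)):
        up = v[:i] + [v[i] + eps] + v[i + 1:]      # nudge coordinate i up
        down = v[:i] + [v[i] - eps] + v[i + 1:]    # nudge coordinate i down
        grad.append((f(up) - f(down)) / (2 * eps)) # central difference
    return grad

def max_rel_error(a, b):
    # Relative error is scale-free; the small constant avoids dividing by zero.
    return max(abs(p - q) / max(abs(p), abs(q), 1e-12) for p, q in zip(a, b))

v = [2.0, 1.0]
correct = [2 * v[0] + 3 * v[1], 3 * v[0]]   # df/dx = 2x + 3y, df/dy = 3x
buggy = [2 * v[0] + 3 * v[1], 3 * v[1]]     # deliberate bug: 3y instead of 3x

num = numeric_grad(f, v)
print("correct:", max_rel_error(correct, num))   # tiny, limited by floating point
print("buggy:  ", max_rel_error(buggy, num))     # large, so the bug is caught
```

A relative error that is many orders of magnitude above floating-point noise points to a mistake in the derivation, not to numerical imprecision.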

Local minima, saddle points, and why it works anyway

The loss landscape of a deep network is wildly non-convex, full of local minima and saddle points. In theory gradient descent could get stuck; in practice, in very high dimensions, good-enough minima turn out to be plentiful, and the noise of mini-batch SGD helps slide past bad spots. You don't need the global optimum, just a low valley that generalizes. That empirical observation is a large part of why deep learning works at all.
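A two-parameter toy can show the "noise slides past bad spots" effect, with the caveat that it says nothing rigorous about real networks. The sketch assumes a hypothetical function $f(x, y) = x^2 + (y^2 - 1)^2$, which has a saddle point at $(0, 0)$ and minima at $(0, \pm 1)$, and uses Gaussian noise added to the gradient as a stand-in for mini-batch noise.

```python
import random

def grad(x, y):
    # f(x, y) = x^2 + (y^2 - 1)^2: saddle at (0, 0), minima at (0, 1) and (0, -1).
    return 2 * x, 4 * y * (y * y - 1)

def descend(noise, steps=500, lr=0.05, seed=0):
    rng = random.Random(seed)
    x, y = 1.0, 0.0                      # start exactly on the ridge y = 0
    for _ in range(steps):
        gx, gy = grad(x, y)
        # Gaussian noise stands in for mini-batch gradient noise (an assumption).
        x -= lr * (gx + rng.gauss(0, noise))
        y -= lr * (gy + rng.gauss(0, noise))
    return x, y, x ** 2 + (y ** 2 - 1) ** 2

for noise in (0.0, 0.01):
    x, y, loss = descend(noise)
    print(f"noise={noise}  x={x:.3f}  y={y:.3f}  loss={loss:.4f}")
```

Without noise the $y$-gradient is exactly zero on the ridge, so the run settles at the saddle with loss 1; with a little noise $y$ gets knocked off the ridge and the run falls into one of the two minima, where the loss is near 0.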

The takeaway

Gradient descent walks parameters downhill: θ -= lr * gradient. The learning rate is the make-or-break knob (too big → diverge). Mini-batch SGD with Adam is the default. Backprop is the chain rule applied efficiently to get the gradients; autograd automates it; a numerical gradient check verifies it. We now have the full training engine. Time to confront its dark twin: overfitting.