PyTorch: tensors, autograd & training
You will not hand-derive backward passes in real work — you'll use a framework. The two that dominate are PyTorch (research and, increasingly, production) and TensorFlow/Keras (still common in industry). This book uses PyTorch because it's the lingua franca of modern AI and reads like plain Python. This chapter is the practical core: the four pillars, then the exact same XOR network from Chapter 11 — but with the backward pass written for us.
Run code/pytorch_intro.py to reproduce every output here.
The four pillars of every PyTorch program
- Tensors: like NumPy arrays, but they can live on a GPU and track gradients.
- nn.Module: your model as a class, with learnable parameters registered automatically.
- autograd: loss.backward() computes every gradient by running backprop for you.
- optimizer: opt.step() applies the gradient-descent update.
Pillar 1: tensors (NumPy you can run on a GPU)
import torch
a = torch.tensor([[1., 2., 3.],
                  [4., 5., 6.]])
print("shape:", tuple(a.shape), "dtype:", a.dtype)
print("a.mean():", a.mean().item())
print("a @ a.T:\n", a @ a.T)
Output:
shape: (2, 3) dtype: torch.float32
a.mean(): 3.5
a @ a.T:
tensor([[14., 32.],
[32., 77.]])
If you know NumPy (Chapter 2), you already know 90% of
the tensor API — .shape, @, broadcasting, .mean(axis=...) (PyTorch calls the
arg dim). The two superpowers NumPy lacks: .to('cuda') moves a tensor to the
GPU, and requires_grad=True makes it track gradients.
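As a small illustration of the points above, the following sketch shows the dim argument, broadcasting, a device move, and gradient tracking. The device variable and the fallback to CPU are illustrative choices, not taken from code/pytorch_intro.py.

```python
import torch

a = torch.tensor([[1., 2., 3.],
                  [4., 5., 6.]])

# NumPy's axis argument is called dim in PyTorch.
col_means = a.mean(dim=0)      # average over rows: one value per column
row_means = a.mean(dim=1)      # average over columns: one value per row
print(col_means.tolist())      # [2.5, 3.5, 4.5]
print(row_means.tolist())      # [2.0, 5.0]

# Broadcasting works as in NumPy: (2, 3) minus (3,) subtracts per column.
centered = a - col_means
print(tuple(centered.shape))   # (2, 3)

# Use a GPU when one is available, otherwise stay on the CPU (assumed fallback).
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
a_dev = a.to(device)           # .to() returns a tensor on the target device
print(a_dev.device.type)

# requires_grad=True asks autograd to track operations on this tensor.
w = torch.ones(3, requires_grad=True)
tracked = a @ w                # results computed from w are tracked too
print(tracked.requires_grad)   # True
```

Note that .to(device) does not modify a in place; keep the returned tensor.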
Don't be confused: PyTorch defaults to float32, NumPy to float64. Bridging the two preserves dtype, which can surprise you:
import numpy as np
print(torch.from_numpy(np.array([1., 2., 3.])).dtype) # float64!
Output:
torch.float64
A float64 tensor fed to a float32 model raises a dtype error. Cast with .float() (→ float32) when crossing the NumPy↔PyTorch bridge.
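To make the mismatch concrete, here is a minimal sketch that feeds a float64 tensor to a float32 layer and then applies the .float() fix. The nn.Linear(3, 1) layer stands in for "a float32 model" and is an assumption for illustration only.

```python
import numpy as np
import torch
import torch.nn as nn

x_np = np.array([[1., 2., 3.]])   # NumPy default dtype: float64
x64 = torch.from_numpy(x_np)      # the bridge preserves dtype: torch.float64
layer = nn.Linear(3, 1)           # parameters are float32 by default

try:
    layer(x64)                    # float64 input against float32 weights
    mismatch_raised = False
except RuntimeError as err:
    mismatch_raised = True
    print("dtype mismatch:", err)

x32 = x64.float()                 # cast to float32 when crossing the bridge
out = layer(x32)
print(x32.dtype, tuple(out.shape))  # torch.float32 (1, 1)
```

Casting the data is usually preferable to calling .double() on the model, since float32 is the default that most PyTorch code assumes.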
Pillar 3: autograd (the reason frameworks exist)
This is the magic that replaces all of Chapter 11's hand-derived calculus. Mark a
tensor requires_grad=True, do some math, call .backward(), and PyTorch fills in
every gradient:
x = torch.tensor([2.0, 3.0], requires_grad=True)
y = (x ** 2).sum() # y = x0^2 + x1^2, so dy/dx = 2x
y.backward() # run backprop
print("y:", y.item())
print("x.grad:", x.grad) # should be [2*2, 2*3] = [4, 6]
Output:
y: 13.0
x.grad: tensor([4., 6.])
PyTorch recorded every operation into a computation graph as you went, then walked it backward to get $\partial y/\partial x = [4, 6]$ — exactly $2x$. You never wrote a derivative. Scale this to a 100-layer transformer and autograd still just works. This is what a framework buys you: you define only the forward pass; the backward pass is free.
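One way to build trust in autograd is to compare it against a derivative you can still do by hand. The sketch below uses a one-parameter-pair linear model with a mean squared error; the values of x, t, w, and b are made up for illustration.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0])
t = torch.tensor([2.0, 4.0, 6.0])            # targets
w = torch.tensor(0.5, requires_grad=True)
b = torch.tensor(0.0, requires_grad=True)

pred = w * x + b
loss = ((pred - t) ** 2).mean()
loss.backward()                              # autograd fills w.grad and b.grad

# Hand-derived gradients of the same loss:
#   dL/dw = mean(2 * (pred - t) * x),  dL/db = mean(2 * (pred - t))
err = (pred - t).detach()                    # detach: plain numbers, no graph
dw_manual = (2 * err * x).mean()
db_manual = (2 * err).mean()

print(w.grad.item(), dw_manual.item())       # -14.0 -14.0
print(b.grad.item(), db_manual.item())       # -6.0 -6.0
```

The two columns agree, which is the same check you would run when debugging a hand-written backward pass.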
Pillars 2 & 4: the model and the training loop
Here is the whole XOR network — model, loss, optimizer, loop — in idiomatic PyTorch:
import torch
import torch.nn as nn
class MLP(nn.Module):
    def __init__(self, n_in=2, hidden=8):
        super().__init__()
        self.fc1 = nn.Linear(n_in, hidden)  # a (w·x + b) layer; params auto-registered
        self.fc2 = nn.Linear(hidden, 1)
    def forward(self, x):
        x = torch.tanh(self.fc1(x))
        return self.fc2(x)  # raw logit; the loss applies sigmoid
model = MLP()
loss_fn = nn.BCEWithLogitsLoss()  # sigmoid + cross-entropy, fused & stable
opt = torch.optim.Adam(model.parameters(), lr=0.05)
# X (inputs) and y (float targets, same shape as the logits) are built in the full script
for epoch in range(401):
    opt.zero_grad()            # 1. clear last step's gradients
    logits = model(X)          # 2. forward pass
    loss = loss_fn(logits, y)
    loss.backward()            # 3. autograd computes every .grad
    opt.step()                 # 4. optimizer updates every parameter
The four numbered steps inside the loop are the canonical PyTorch training step. You'll write them, in this order, for every model you ever train:
zero_grad -> forward -> backward -> step
Running the full script:
X is a tensor of shape (400, 2), dtype torch.float32
device: cpu (CUDA available: False)
epoch 0 loss 0.6951 train-acc 0.498
epoch 100 loss 0.0200 train-acc 0.995
epoch 200 loss 0.0148 train-acc 0.998
epoch 300 loss 0.0116 train-acc 0.995
epoch 400 loss 0.0102 train-acc 0.995
model has 33 learnable parameters
Same XOR problem, same ~99.5% accuracy as the hand-coded version — but we wrote
zero backprop. The 33 parameters: layer 1 is 2×8 + 8 = 24, layer 2 is 8×1 + 1 = 9, total 33. (Counting parameters by hand like this is a great sanity check on
any architecture.)
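The hand count can be checked in code. This is a minimal sketch that rebuilds the same layer sizes with nn.Sequential for brevity (an assumption; the chapter's model is the MLP class) and sums numel() over the parameters.

```python
import torch.nn as nn

# Same layer sizes as the chapter's MLP: 2 -> 8 -> 1.
model = nn.Sequential(nn.Linear(2, 8), nn.Tanh(), nn.Linear(8, 1))

# Each nn.Linear holds a weight of shape (out, in) and a bias of shape (out,).
for name, p in model.named_parameters():
    print(name, tuple(p.shape), p.numel())

total = sum(p.numel() for p in model.parameters() if p.requires_grad)
print("total:", total)   # total: 33
```

The same one-line sum works for any nn.Module, including the MLP class above.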
The two gotchas that bite everyone
Don't be confused: zero_grad() is mandatory. PyTorch accumulates gradients: each backward() adds to .grad rather than replacing it. Forget opt.zero_grad() and gradients from every past step pile up, and training goes haywire. (The accumulation is a feature, since it lets you simulate big batches, but the default workflow needs the reset every step.)
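The accumulation is easy to see on a single scalar. In this sketch the gradient of 3*w with respect to w is always 3, so any other value in .grad comes from accumulation; w.grad.zero_() is used here as a stand-in for the reset that opt.zero_grad() performs on every parameter it manages.

```python
import torch

w = torch.tensor(1.0, requires_grad=True)

(3 * w).backward()
first = w.grad.item()
print(first)           # 3.0

(3 * w).backward()     # no reset: the new gradient is added to the old one
accumulated = w.grad.item()
print(accumulated)     # 6.0

w.grad.zero_()         # manual reset of this one tensor's gradient
(3 * w).backward()
after_reset = w.grad.item()
print(after_reset)     # 3.0
```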
Don't be confused: model.train() vs. model.eval(). Layers like dropout and batch norm behave differently in training vs. inference. Call model.eval() before validating/predicting and model.train() before training. Also wrap inference in with torch.no_grad(): to skip building the gradient graph; it's faster and uses less memory. Forgetting eval() is a top source of "my validation numbers are mysteriously bad."
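A tiny network with dropout shows both effects. This is a sketch under assumed sizes (a 4-to-4 linear layer followed by nn.Dropout with p=0.5), not the chapter's XOR model, which has no dropout.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(4, 4), nn.Dropout(p=0.5))
x = torch.ones(1, 4)

net.train()                          # dropout active: units are zeroed at random
train_a, train_b = net(x), net(x)    # two calls will usually differ

net.eval()                           # dropout disabled: the layer passes values through
with torch.no_grad():                # no graph is recorded inside this block
    eval_a, eval_b = net(x), net(x)

print(net.training)                  # False
print(torch.equal(eval_a, eval_b))   # True
print(train_a.requires_grad)         # True
print(eval_a.requires_grad)          # False
```

The two switches are independent: eval() changes layer behavior, while no_grad() only turns off gradient tracking.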
BCEWithLogitsLoss: logits, not probabilities
Notice the model returns a raw logit (Chapter 6), not a
sigmoid'd probability, and we used BCEWithLogitsLoss. That loss fuses the sigmoid
and the cross-entropy into one numerically stable operation (it uses the log-sum-exp
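trick from Chapter 17). Applying sigmoid yourself and
then BCELoss is mathematically identical but numerically fragile: for extreme logits the sigmoid saturates to exactly 0 or 1 in float32, so the log that follows loses information. Rule: feed logits to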
*WithLogitsLoss / CrossEntropyLoss; only apply sigmoid/softmax when you actually
need to read a probability.
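The difference between the two routes can be seen numerically. In this sketch the logits and targets are made-up values; the second half uses a deliberately extreme logit of -200 to push float32 past its limits.

```python
import torch
import torch.nn as nn

logits = torch.tensor([[0.3], [-1.2], [2.0]])
target = torch.tensor([[1.0], [0.0], [1.0]])

# For moderate logits the two routes agree.
fused = nn.BCEWithLogitsLoss()(logits, target)
manual = nn.BCELoss()(torch.sigmoid(logits), target)
print(torch.allclose(fused, manual))   # True

# An extreme, confidently wrong logit: sigmoid(-200) becomes exactly 0 in float32.
big = torch.tensor([[-200.0]])
one = torch.tensor([[1.0]])
fused_big = nn.BCEWithLogitsLoss()(big, one).item()
manual_big = nn.BCELoss()(torch.sigmoid(big), one).item()
print(fused_big)    # 200.0, the true loss
print(manual_big)   # 100.0, because BCELoss clamps its log outputs at -100
```

The fused loss keeps the true value, and with it a meaningful gradient, while the two-step route silently reports a capped number.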
The broader ecosystem (so the words don't surprise you)
- DataLoader / Dataset: batch and shuffle your data (Chapter 16).
- torchvision / torchaudio: datasets, pretrained models, transforms.
- Hugging Face transformers: thousands of pretrained models, PyTorch-native.
- model.state_dict() / torch.save: save and load learned weights.
- .to(device): the one line that moves model + data to GPU.
- JAX: a third framework (Google) you'll hear about; same autograd idea, function-oriented style.
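Three of these pieces (DataLoader, .to(device), and state_dict) can be sketched together in one short script. Everything specific here is an assumption for illustration: the synthetic XOR-style data, the make_model helper, the batch size of 64, the 20 epochs, and the file name mlp.pt. It is not the chapter's code/pytorch_intro.py.

```python
import os
import tempfile

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Assumed toy data: 400 random points labelled by the XOR of their signs.
X = torch.rand(400, 2) * 2 - 1
y = ((X[:, 0] > 0) ^ (X[:, 1] > 0)).float().unsqueeze(1)   # shape (400, 1)
loader = DataLoader(TensorDataset(X, y), batch_size=64, shuffle=True)

def make_model():
    # Hypothetical helper: same layer sizes as the chapter's MLP.
    return nn.Sequential(nn.Linear(2, 8), nn.Tanh(), nn.Linear(8, 1))

model = make_model().to(device)
loss_fn = nn.BCEWithLogitsLoss()
opt = torch.optim.Adam(model.parameters(), lr=0.05)

model.train()
epoch_losses = []
for epoch in range(20):
    running = 0.0
    for xb, yb in loader:                       # shuffled mini-batches
        xb, yb = xb.to(device), yb.to(device)   # data goes where the model is
        opt.zero_grad()
        loss = loss_fn(model(xb), yb)
        loss.backward()
        opt.step()
        running += loss.item() * len(xb)
    epoch_losses.append(running / len(X))
print(f"first epoch {epoch_losses[0]:.4f}, last epoch {epoch_losses[-1]:.4f}")

# Save only the learned weights, then load them into a fresh model.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "mlp.pt")
    torch.save(model.state_dict(), path)
    restored = make_model()
    restored.load_state_dict(torch.load(path, map_location="cpu"))

model.cpu().eval()
restored.eval()
with torch.no_grad():
    same = torch.allclose(restored(X), model(X))
print("restored model matches:", same)
```

Saving the state_dict rather than the whole model object keeps the file independent of how the class is defined, at the cost of having to rebuild the architecture before loading.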
The takeaway
PyTorch = tensors (GPU-capable NumPy) + autograd (free backward pass) + nn.Module
(your model) + optimizer (the update). The training loop is always zero_grad → forward → backward → step. Feed logits to the loss; remember eval() and
no_grad() for inference. You can now read essentially any model's training code.
Next, a tour of the architectures that go inside forward. 👉