PyTorch: tensors, autograd & training

You will not hand-derive backward passes in real work — you'll use a framework. The two that dominate are PyTorch (research and, increasingly, production) and TensorFlow/Keras (still common in industry). This book uses PyTorch because it's the lingua franca of modern AI and reads like plain Python. This chapter is the practical core: the four pillars, then the exact same XOR network from Chapter 11 — but with the backward pass written for us.

Run code/pytorch_intro.py to reproduce every output here.

The four pillars of every PyTorch program

Tensors — like NumPy arrays, but they can live on a GPU and track gradients.
nn.Module — your model as a class, with learnable parameters registered automatically.
autograd — loss.backward() computes every gradient by running backprop for you.
optimizer — opt.step() applies the gradient-descent update.

Pillar 1: tensors (NumPy you can run on a GPU)

import torch
a = torch.tensor([[1., 2., 3.],
                  [4., 5., 6.]])
print("shape:", tuple(a.shape), "dtype:", a.dtype)
print("a.mean():", a.mean().item())
print("a @ a.T:\n", a @ a.T)

Output:

shape: (2, 3) dtype: torch.float32
a.mean(): 3.5
a @ a.T:
 tensor([[14., 32.],
        [32., 77.]])

If you know NumPy (Chapter 2), you already know 90% of the tensor API — .shape, @, broadcasting, .mean(axis=...) (PyTorch calls the arg dim). The two superpowers NumPy lacks: .to('cuda') moves a tensor to the GPU, and requires_grad=True makes it track gradients.
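To make the NumPy parallels concrete, here is a minimal sketch of dim=, broadcasting, and the two extras mentioned above. The variable names (col_means, centered, a_dev, w) are illustrative only, and the device line assumes you want to fall back to the CPU when no GPU is present.

```python
import torch

a = torch.tensor([[1., 2., 3.],
                  [4., 5., 6.]])

# NumPy's axis= is spelled dim= in PyTorch.
col_means = a.mean(dim=0)   # average over rows: one value per column (2.5, 3.5, 4.5)
row_means = a.mean(dim=1)   # average over columns: one value per row (2.0, 5.0)
print(col_means)
print(row_means)

# Broadcasting: the (3,) vector is subtracted from every row of the (2, 3) matrix.
centered = a - col_means
print(centered)

# Extra 1: pick the GPU when one is present, otherwise stay on the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
a_dev = a.to(device)
print(a_dev.device)

# Extra 2: ask PyTorch to track gradients for this tensor.
w = torch.ones(3, requires_grad=True)
print(w.requires_grad)
```

Writing the device choice once and reusing it keeps the same script runnable on machines with and without a GPU.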

Don't be confused: PyTorch defaults to float32, NumPy to float64. Bridging the two preserves dtype, which can surprise you:
import numpy as np
print(torch.from_numpy(np.array([1., 2., 3.])).dtype)   # float64!
torch.float64
A float64 tensor fed to a float32 model raises a dtype error. Cast with .float() (→ float32) when crossing the NumPy↔PyTorch bridge.
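The mismatch is easy to reproduce. The sketch below assumes a single nn.Linear layer as a stand-in for "a float32 model"; the names arr, t64, t32, and layer are illustrative.

```python
import numpy as np
import torch

arr = np.array([[1., 2., 3.]])        # NumPy default dtype: float64
t64 = torch.from_numpy(arr)           # shares memory with arr and keeps float64
t32 = t64.float()                     # a float32 copy

layer = torch.nn.Linear(3, 1)         # parameters are float32 by default
out = layer(t32)                      # dtypes match, so this works
print(out.dtype)

raised = False
try:
    layer(t64)                        # float64 input against float32 weights
except RuntimeError as err:
    raised = True
    print("dtype mismatch:", err)
```

Casting once, right where the data enters PyTorch, is usually simpler than casting at every call site.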

Pillar 3: autograd (the reason frameworks exist)

This is the magic that replaces all of Chapter 11's hand-derived calculus. Mark a tensor requires_grad=True, do some math, call .backward(), and PyTorch fills in every gradient:

x = torch.tensor([2.0, 3.0], requires_grad=True)
y = (x ** 2).sum()        # y = x0^2 + x1^2,  so dy/dx = 2x
y.backward()              # run backprop
print("y:", y.item())
print("x.grad:", x.grad)  # should be [2*2, 2*3] = [4, 6]

Output:

y: 13.0
x.grad: tensor([4., 6.])

PyTorch recorded every operation into a computation graph as you went, then walked it backward to get $\partial y/\partial x = [4, 6]$ — exactly $2x$. You never wrote a derivative. Scale this to a 100-layer transformer and autograd still just works. This is what a framework buys you: you define only the forward pass; the backward pass is free.
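You can peek at the recorded graph through each tensor's grad_fn attribute, and check autograd against a derivative worked out by hand. The second function below, z = sum(sin(x) * x), is an illustrative choice that does not appear in the chapter.

```python
import torch

x = torch.tensor([2.0, 3.0], requires_grad=True)
h = x ** 2                 # intermediate result
y = h.sum()

print(h.grad_fn)           # the graph node that produced h
print(y.grad_fn)           # the graph node that produced y
print(x.grad_fn)           # None: x is a leaf created by the user

# A slightly longer chain, checked against the product rule.
z = (torch.sin(x) * x).sum()
z.backward()
manual = (torch.cos(x) * x + torch.sin(x)).detach()   # d/dx [x sin x] = x cos x + sin x
print(torch.allclose(x.grad, manual))
```

Comparing against a hand-derived gradient on a tiny input like this is a cheap way to build trust in a custom operation.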

Pillars 2 & 4: the model and the training loop

Here is the whole XOR network — model, loss, optimizer, loop — in idiomatic PyTorch:

import torch
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, n_in=2, hidden=8):
        super().__init__()
        self.fc1 = nn.Linear(n_in, hidden)   # a (w·x + b) layer; params auto-registered
        self.fc2 = nn.Linear(hidden, 1)
    def forward(self, x):
        x = torch.tanh(self.fc1(x))
        return self.fc2(x)                   # raw logit; the loss applies sigmoid

model  = MLP()
loss_fn = nn.BCEWithLogitsLoss()             # sigmoid + cross-entropy, fused & stable
opt     = torch.optim.Adam(model.parameters(), lr=0.05)

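# X and y come from the full script: float32 inputs of shape (400, 2) and
# float32 0/1 labels shaped like the logits, (400, 1).
for epoch in range(401):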
    opt.zero_grad()          # 1. clear last step's gradients
    logits = model(X)        # 2. forward pass
    loss   = loss_fn(logits, y)
    loss.backward()          # 3. autograd computes every .grad
    opt.step()               # 4. optimizer updates every parameter

The four numbered steps inside the loop are the canonical PyTorch training step. You'll write them, in this order, for every model you ever train:

zero_grad  ->  forward  ->  backward  ->  step
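The loop above depends on X and y from the full script. For a version you can paste and run on its own, here is a sketch under a stated assumption: the XOR-style dataset below (random points labelled by whether their coordinates have opposite signs) is a stand-in, and code/pytorch_intro.py may build its data differently. The nn.Sequential model is a compact equivalent of the MLP class with the same layer sizes.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# ASSUMPTION: stand-in XOR-style data, not necessarily what the book's script uses.
X = torch.rand(400, 2) * 2 - 1                        # points in [-1, 1]^2, float32
y = ((X[:, 0] * X[:, 1]) < 0).float().unsqueeze(1)    # 1.0 when signs differ; shape (400, 1)

model = nn.Sequential(nn.Linear(2, 8), nn.Tanh(), nn.Linear(8, 1))
loss_fn = nn.BCEWithLogitsLoss()
opt = torch.optim.Adam(model.parameters(), lr=0.05)

losses = []
for epoch in range(401):
    opt.zero_grad()              # 1. clear last step's gradients
    logits = model(X)            # 2. forward pass
    loss = loss_fn(logits, y)
    loss.backward()              # 3. autograd computes every .grad
    opt.step()                   # 4. optimizer updates every parameter
    losses.append(loss.item())
    if epoch % 100 == 0:
        # A logit above 0 corresponds to a sigmoid probability above 0.5.
        acc = ((logits > 0).float() == y).float().mean().item()
        print(f"epoch {epoch:4d}  loss {loss.item():.4f}  train-acc {acc:.3f}")
```

The exact numbers will differ from the output shown in this chapter, since the data and the initial weights are not the same.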

Running the full script:

X is a tensor of shape (400, 2), dtype torch.float32
device: cpu  (CUDA available: False)

epoch    0  loss 0.6951  train-acc 0.498
epoch  100  loss 0.0200  train-acc 0.995
epoch  200  loss 0.0148  train-acc 0.998
epoch  300  loss 0.0116  train-acc 0.995
epoch  400  loss 0.0102  train-acc 0.995

model has 33 learnable parameters

Same XOR problem, same ~99.5% accuracy as the hand-coded version — but we wrote zero backprop. The 33 parameters: layer 1 is 2×8 + 8 = 24, layer 2 is 8×1 + 1 = 9, total 33. (Counting parameters by hand like this is a great sanity check on any architecture.)
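The same count can be read off the model itself. This sketch uses an nn.Sequential stand-in with the same layer sizes as the MLP class, so the parameter names it prints (0.weight, 2.bias, and so on) differ from fc1 and fc2.

```python
import torch.nn as nn

model = nn.Sequential(nn.Linear(2, 8), nn.Tanh(), nn.Linear(8, 1))

# List every registered parameter with its shape and element count.
for name, p in model.named_parameters():
    print(name, tuple(p.shape), p.numel())

# Sum the element counts of everything the optimizer would update.
total = sum(p.numel() for p in model.parameters() if p.requires_grad)
print("learnable parameters:", total)   # 16 + 8 + 8 + 1 = 33
```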

The two gotchas that bite everyone

Don't be confused: zero_grad() is mandatory. PyTorch accumulates gradients — each backward() adds to .grad rather than replacing it. Forget opt.zero_grad() and gradients from every past step pile up, and training goes haywire. (The accumulation is a feature — it lets you simulate big batches — but the default workflow needs the reset every step.)
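Accumulation is easy to see on a single scalar weight. In this sketch, w.grad.zero_() plays the role of the reset that opt.zero_grad() performs for every parameter; the names g1, g2, and g3 are illustrative.

```python
import torch

w = torch.tensor([1.0], requires_grad=True)

(3 * w).sum().backward()       # d(3w)/dw = 3
g1 = w.grad.clone()
print(g1)                      # gradient is 3

(3 * w).sum().backward()       # no reset: the new 3 is added to the old 3
g2 = w.grad.clone()
print(g2)                      # gradient is now 6

w.grad.zero_()                 # manual reset
(3 * w).sum().backward()
g3 = w.grad.clone()
print(g3)                      # back to 3
```

Calling backward() several times before a single opt.step() is how the "simulate big batches" trick works: the gradients of several small batches are summed before the update.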

Don't be confused: model.train() vs. model.eval(). Layers like dropout and batch norm behave differently in training vs. inference. Call model.eval() before validating/predicting and model.train() before training. Also wrap inference in with torch.no_grad(): to skip building the gradient graph — it's faster and uses less memory. Forgetting eval() is a top source of "my validation numbers are mysteriously bad."
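A small sketch of the difference, assuming a toy network with one dropout layer (the XOR model in this chapter has none, so this network is purely illustrative):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(4, 4), nn.Dropout(p=0.5))
x = torch.ones(1, 4)

net.train()                        # training mode: dropout is active
out_a, out_b = net(x), net(x)      # each call draws a fresh random mask
print(out_a)
print(out_b)

net.eval()                         # inference mode: dropout is a no-op
with torch.no_grad():              # and no graph is recorded
    out_c, out_d = net(x), net(x)

print(torch.equal(out_c, out_d))   # repeated calls now agree
print(out_a.requires_grad, out_c.requires_grad)
print(net.training)
```

Note that eval() and no_grad() are independent switches: the first changes layer behavior, the second only turns off gradient tracking.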

`BCEWithLogitsLoss`: logits, not probabilities

Notice the model returns a raw logit (Chapter 6), not a sigmoid'd probability, and we used BCEWithLogitsLoss. That loss fuses the sigmoid and the cross-entropy into one numerically stable operation (it uses the log-sum-exp trick from Chapter 17). Applying sigmoid yourself and then BCELoss is mathematically identical but numerically fragile: for a large-magnitude logit the float32 sigmoid rounds to exactly 0 or 1, and the log that follows would be infinite (PyTorch's BCELoss clamps it at -100 instead, which silently distorts the loss). Rule: feed logits to *WithLogitsLoss / CrossEntropyLoss; only apply sigmoid/softmax when you actually need to read a probability.
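The difference shows up once a logit is confidently wrong. The values below are hand-picked for illustration: a logit of 30 with target 0 should cost about 30, but the float32 sigmoid of 30 rounds to exactly 1.0, so the two-step route loses that information.

```python
import torch
import torch.nn as nn

logits = torch.tensor([[-30.0], [0.0], [30.0]])
targets = torch.tensor([[1.0], [1.0], [0.0]])

# Fused: works directly on the logits.
fused = nn.BCEWithLogitsLoss(reduction="none")(logits, targets)

# Two-step: squash to probabilities first, then take the log inside BCELoss.
probs = torch.sigmoid(logits)
two_step = nn.BCELoss(reduction="none")(probs, targets)

print("sigmoid(30) == 1.0:", bool(probs[2] == 1.0))
print("fused:   ", fused.squeeze().tolist())
print("two-step:", two_step.squeeze().tolist())
```

For the moderate logit (0.0) both routes agree; for the saturated one only the fused loss stays close to the true value.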

The broader ecosystem (so the words don't surprise you)

DataLoader / Dataset — batch and shuffle your data (Chapter 16).
torchvision / torchaudio — datasets, pretrained models, transforms.
Hugging Face transformers — thousands of pretrained models, PyTorch-native.
model.state_dict() / torch.save — save and load learned weights.
.to(device) — the one line that moves model + data to GPU.
JAX — a third framework (Google) you'll hear about; same autograd idea, function-oriented style.
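Several of these pieces fit together in a few lines. The sketch below is an assumption-laden illustration, not the book's script: the dataset is a stand-in XOR-style sample, the batch size of 64 is arbitrary, and the weights are saved to an in-memory buffer so nothing is written to disk (a file path passed to torch.save and torch.load works the same way).

```python
import io
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# ASSUMPTION: stand-in XOR-style data.
X = torch.rand(400, 2) * 2 - 1
y = ((X[:, 0] * X[:, 1]) < 0).float().unsqueeze(1)
loader = DataLoader(TensorDataset(X, y), batch_size=64, shuffle=True)

model = nn.Sequential(nn.Linear(2, 8), nn.Tanh(), nn.Linear(8, 1)).to(device)
loss_fn = nn.BCEWithLogitsLoss()
opt = torch.optim.Adam(model.parameters(), lr=0.05)

for xb, yb in loader:                      # one epoch of shuffled mini-batches
    xb, yb = xb.to(device), yb.to(device)  # data must live where the model lives
    opt.zero_grad()
    loss_fn(model(xb), yb).backward()
    opt.step()

# Save only the learned weights, then load them into a fresh model.
buffer = io.BytesIO()
torch.save(model.state_dict(), buffer)
buffer.seek(0)
restored = nn.Sequential(nn.Linear(2, 8), nn.Tanh(), nn.Linear(8, 1))
restored.load_state_dict(torch.load(buffer, map_location="cpu"))
print("batches per epoch:", len(loader))
```

Saving the state_dict rather than the whole model object means the loading code must rebuild the same architecture first, which is why restored is constructed before load_state_dict is called.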

The takeaway

PyTorch = tensors (GPU-capable NumPy) + autograd (free backward pass) + nn.Module (your model) + optimizer (the update). The training loop is always zero_grad → forward → backward → step. Feed logits to the loss; remember eval() and no_grad() for inference. You can now read essentially any model's training code. Next, a tour of the architectures that go inside forward. 👉

AI Foundations in Depth