The data pipeline & training in practice

The algorithms get the glory, but in real projects most of your time goes into getting data in and out of the model correctly — batching, shuffling, splitting, seeding, moving tensors to the GPU. This chapter covers the unglamorous machinery that turns "I have a model" into "I trained it without fooling myself." Skipping this is how good models produce silently wrong results.

Why we feed data in batches

Chapter 8 introduced mini-batch SGD; here are the mechanics. You don't feed the whole dataset at once (it usually won't fit in memory) or one example at a time (too slow, and each gradient is too noisy). You feed batches of, say, 32–512 examples. One pass through all the batches is one epoch, and you train for many epochs. A from-scratch batch iterator that shuffles, then slices:

import numpy as np
def batches(X, y, batch_size, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))                  # shuffled order; a fixed seed repeats it, so vary the seed per epoch
    for start in range(0, len(X), batch_size):
        b = idx[start:start + batch_size]
        yield X[b], y[b]

X = np.arange(10).reshape(10, 1); y = np.arange(10)
for i, (xb, yb) in enumerate(batches(X, y, 4)):
    print(f"batch {i}: y={yb}")

Output:

batch 0: y=[4 6 2 7]
batch 1: y=[3 5 9 0]
batch 2: y=[8 1]

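Three things to notice: the data is shuffled, batches are fixed-size except the last one (here just 2 examples), and every example is covered exactly once. The order should change every epoch, or the model can pick up on the ordering. Because this iterator rebuilds its generator from seed on every call, the same seed replays the same order, so pass a different seed each epoch (for example seed=epoch). In PyTorch this is the DataLoader: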

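import torch
from torch.utils.data import DataLoader, TensorDataset
# TensorDataset expects tensors, so convert the NumPy arrays first
ds = TensorDataset(torch.as_tensor(X, dtype=torch.float32), torch.as_tensor(y))
loader = DataLoader(ds, batch_size=32, shuffle=True)   # reshuffles every epoch
for xb, yb in loader:        # same idea; set num_workers > 0 for parallel loading
    ...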
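The from-scratch iterator above replays the same order whenever it is called with the same seed. One way to get a fresh order each epoch is to create a single generator once and let its state advance between epochs. This is a sketch under that assumption: the rng parameter is a change to the original signature, made here for illustration.

```python
import numpy as np

def batches(X, y, batch_size, rng):
    """Yield shuffled mini-batches; rng is a shared np.random.Generator
    (an assumed variant of the iterator shown earlier)."""
    idx = rng.permutation(len(X))      # new order on each call: rng's state advances
    for start in range(0, len(X), batch_size):
        b = idx[start:start + batch_size]
        yield X[b], y[b]

X = np.arange(10).reshape(10, 1)
y = np.arange(10)
rng = np.random.default_rng(0)         # seeded once, so the whole run stays reproducible

orders = []
for epoch in range(3):
    # Collect the labels in the order this epoch visited them.
    seen = np.concatenate([yb for _, yb in batches(X, y, 4, rng)])
    orders.append(seen)
    print(f"epoch {epoch}: {seen}")
```

Each epoch still covers every example exactly once, but in a different order, and rerunning the script reproduces the same sequence of orders.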

Don't be confused: batch size vs. epochs vs. iterations.

Batch size: examples per gradient step (e.g. 32).

Iteration / step: one gradient update (one batch).

Epoch: one full pass over the dataset = ceil(dataset size / batch size) iterations; the last batch is smaller when the division isn't exact.

"Trained for 10 epochs with batch size 64 on 6,400 examples" = 100 iterations per epoch = 1,000 gradient steps total. Mixing these up makes training logs unreadable.

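The bookkeeping above is easy to get wrong when the dataset size is not a multiple of the batch size. A small sketch of the arithmetic follows; the helper names steps_per_epoch and total_steps are made up for this example, and drop_last mirrors the DataLoader option of the same name.

```python
import math

def steps_per_epoch(n_examples, batch_size, drop_last=False):
    """Gradient steps in one epoch."""
    if drop_last:
        # The final, smaller batch is discarded.
        return n_examples // batch_size
    # The final, smaller batch still counts as one step.
    return math.ceil(n_examples / batch_size)

def total_steps(n_examples, batch_size, epochs, drop_last=False):
    """Gradient steps over the whole training run."""
    return epochs * steps_per_epoch(n_examples, batch_size, drop_last)

print(steps_per_epoch(6400, 64))      # 100
print(total_steps(6400, 64, 10))      # 1000
print(steps_per_epoch(10, 4))         # 3  (batches of 4, 4 and 2)
print(steps_per_epoch(10, 4, True))   # 2  (the batch of 2 is dropped)
```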
Why shuffling matters (a real bug)

If your data is sorted by label (all the 0s, then all the 1s) and you don't shuffle, each batch contains only one class. The model swings wildly toward whichever class it's currently seeing and never learns the mix. Always shuffle the training set each epoch. (Never shuffle time series across the time boundary, though — see the leakage warning below.)
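To see the problem concretely, the sketch below counts how many distinct labels land in each batch, first on label-sorted data and then after one shuffle. The 16 toy labels and the helper name classes_per_batch are assumptions made for illustration.

```python
import numpy as np

def classes_per_batch(labels, batch_size):
    """Number of distinct labels in each consecutive batch."""
    return [len(np.unique(labels[i:i + batch_size]))
            for i in range(0, len(labels), batch_size)]

y_sorted = np.array([0] * 8 + [1] * 8)     # toy labels, sorted by class
print(classes_per_batch(y_sorted, 4))      # [1, 1, 1, 1]: every batch is single-class

rng = np.random.default_rng(0)
y_shuffled = y_sorted[rng.permutation(len(y_sorted))]
print(classes_per_batch(y_shuffled, 4))    # after shuffling, batches usually mix both classes
```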

Reproducibility: set your seeds

Randomness is everywhere — weight init, shuffling, dropout. To get the same result twice (essential for debugging and for trusting a comparison), fix the random seed:

print("run A:", np.random.default_rng(42).normal(size=3).round(3))
print("run B:", np.random.default_rng(42).normal(size=3).round(3))

Output:

run A: [ 0.305 -1.04   0.75 ]
run B: [ 0.305 -1.04   0.75 ]

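Identical, because the seed was identical. In a real run you'd set torch.manual_seed(42), np.random.seed(42) (which seeds NumPy's legacy global generator; a default_rng generator is seeded when you create it, as above), and Python's random.seed(42). Seeds alone may not make GPU runs bit-identical, because some CUDA kernels are nondeterministic. If you can't reproduce a result, you can't trust a comparison between two models: maybe model B "won" purely by a luckier random init.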
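The seeding calls above are often bundled into one helper that runs at the top of a script. This is a minimal sketch: seed_everything is a hypothetical name, not a PyTorch or NumPy function, and the torch import is guarded only so the snippet also runs where PyTorch is not installed.

```python
import random
import numpy as np

try:
    import torch
except ImportError:          # PyTorch is optional for this sketch
    torch = None

def seed_everything(seed):
    """Seed Python, NumPy's global generator, and PyTorch (if present)."""
    random.seed(seed)
    np.random.seed(seed)
    if torch is not None:
        torch.manual_seed(seed)

def draw():
    """One number from each of the Python and NumPy global generators."""
    return random.random(), float(np.random.rand())

seed_everything(42)
a = draw()
seed_everything(42)
b = draw()
print(a == b)                # True: same seed, same draws
```

Call the helper once, before building the model and the data loader, so that weight initialization and shuffling both start from the seeded state.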

The cardinal sin: data leakage

Data leakage is when information from the test set, or information that won't be available at prediction time, sneaks into training, making your offline numbers look amazing and your production results collapse. The most common forms:

Scaling before splitting. Computing the standardization mean/std over all data leaks test statistics into training. Fit the scaler on train only, then apply to val/test (Chapter 3).
Shuffling time series. For temporal data, a random split lets the model "see the future." Split by time: train on the past, test on the future.
Duplicate or near-duplicate rows straddling the split — the model effectively memorizes test answers.
Target leakage — a feature that secretly encodes the answer (e.g. "was_refunded" when predicting fraud). If a feature is "too good," suspect leakage.
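The first two items can be sketched in a few lines of NumPy. The data here is synthetic, and the rows are assumed to be already sorted by time, so a positional split is a time-based split.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))   # synthetic features, rows in time order (assumed)

# 1. Split first, by position: the earliest 80% is train, the latest 20% is test.
split = int(0.8 * len(X))
X_train, X_test = X[:split], X[split:]

# 2. Fit the scaler on the training rows only.
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)

# 3. Apply the same train statistics to both sets.
X_train_s = (X_train - mean) / std
X_test_s = (X_test - mean) / std

print(X_train_s.mean(axis=0).round(3))   # ~0 by construction
print(X_test_s.mean(axis=0).round(3))    # generally not 0, and that is expected
```

The test set ending up slightly off-center is the honest outcome: it reflects how new data will look to a model that has only ever seen training statistics.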

Don't be confused: a great validation score is not always good news. If it seems too good to be true, your first hypothesis should be leakage, not genius. The discipline of fit-on-train, apply-to-test and split-before-you-touch exists entirely to prevent this self-deception.
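When a score looks suspicious, one cheap first check is whether any test row also appears in the training set. The sketch below only catches exact duplicates; near-duplicates need a similarity measure suited to the data. The helper name and the toy rows are made up for this example.

```python
def rows_shared_with_train(train_rows, test_rows):
    """Indices of test rows that also appear, exactly, in the training set."""
    train_keys = {tuple(row) for row in train_rows}
    return [i for i, row in enumerate(test_rows) if tuple(row) in train_keys]

train = [(5.1, 3.5, "a"), (4.9, 3.0, "b"), (6.2, 2.9, "c")]
test = [(4.9, 3.0, "b"), (7.0, 3.2, "d")]

leaked = rows_shared_with_train(train, test)
print(leaked)    # [0]: the first test row is also a training row
```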

CPU, GPU, and moving tensors

Deep learning lives on the GPU because matrix multiplies parallelize massively — often 10–100× faster than CPU. In PyTorch, both the model and the data must be on the same device:

import torch
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
xb, yb = xb.to(device), yb.to(device)        # data must follow the model

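Don't be confused: the "tensors on different devices" error. One of PyTorch's most common runtime errors is a model on the GPU receiving CPU data (or vice versa). The fix is almost always to call .to(device) on both the model and every input batch. Note that model.to(device) moves the module in place, while tensor.to(device) returns a new tensor, so you must reassign it. To bring results back for printing or plotting, .item() turns a one-element tensor into a Python number, and .detach().cpu().numpy() copies a tensor into a NumPy array on the CPU.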
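Put together, a full round trip looks roughly like the sketch below. The tiny nn.Linear model and the random batch are placeholders chosen for illustration; the snippet falls back to the CPU when no GPU is available.

```python
import torch
from torch import nn

device = "cuda" if torch.cuda.is_available() else "cpu"

model = nn.Linear(4, 1).to(device)    # placeholder model; .to() moves a module in place
xb = torch.randn(8, 4)                # tensors are created on the CPU by default
xb = xb.to(device)                    # .to() returns a new tensor, so reassign

with torch.no_grad():                 # inference only, no gradients needed
    out = model(xb)

mean_value = out.mean().item()        # one-element tensor -> Python float
out_np = out.detach().cpu().numpy()   # tensor -> NumPy array on the CPU

print(device, mean_value, out_np.shape)
```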

A few production practices worth knowing

Mixed precision (float16/bfloat16): train in lower precision for up to roughly 2× speed and about half the memory, usually with negligible accuracy loss. float16 typically needs loss scaling to avoid gradient underflow; bfloat16 usually does not. Standard for large models.
Gradient accumulation: sum gradients over several mini-batches before step(), simulating a big batch when GPU memory is tight. This works because backward() adds into .grad until you call zero_grad(); divide each loss by the number of accumulated batches so the result matches the big-batch average.
Checkpointing: call torch.save(model.state_dict(), path) periodically (plus the optimizer state if you want to resume training), so a crash doesn't lose a week of training.
Learning-rate schedules: lower the learning rate over time (warmup then decay) for better final accuracy; the standard recipe for transformers.
Monitoring: log train and validation loss every epoch and watch the gap (Chapter 9). The loss curve is your primary diagnostic; learn to read it (diverging → LR too high; flat → LR too low or stuck; val rising → overfitting).
Experiment tracking: tools like MLflow or Weights & Biases record every run's hyperparameters and metrics so you can compare honestly (the recsys book's capstone uses MLflow).
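Gradient accumulation and checkpointing from the list above are the two practices that are easiest to get subtly wrong, so here is a small sketch of both. The linear model, random data, and temporary file path are placeholders chosen for illustration. The sketch assumes equal-sized micro-batches and a mean-reduced loss, which is what makes dividing by the number of accumulation steps reproduce the big-batch gradient.

```python
import os
import tempfile
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(3, 1)                       # placeholder model
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()                        # mean reduction
X, y = torch.randn(8, 3), torch.randn(8, 1)   # placeholder data

# Reference: the gradient from one big batch of 8.
opt.zero_grad()
loss_fn(model(X), y).backward()
full_grad = model.weight.grad.clone()

# Accumulation: 4 micro-batches of 2, one optimizer step at the end.
accum_steps = 4
opt.zero_grad()
for xb, yb in zip(X.chunk(accum_steps), y.chunk(accum_steps)):
    loss = loss_fn(model(xb), yb) / accum_steps   # scale so the sum equals the big-batch mean
    loss.backward()                               # adds into .grad; nothing is reset here
accum_grad = model.weight.grad.clone()
print(torch.allclose(full_grad, accum_grad, atol=1e-6))   # True

opt.step()                                    # a single update for the simulated big batch

# Checkpointing: save the weights, then restore them into a fresh model.
path = os.path.join(tempfile.mkdtemp(), "checkpoint.pt")
torch.save(model.state_dict(), path)
restored = nn.Linear(3, 1)
restored.load_state_dict(torch.load(path))
print(torch.equal(restored.weight, model.weight))          # True
```

In a long run the save call would sit inside the training loop, for example once per epoch, and would usually store the optimizer state and epoch number alongside the weights.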

The takeaway

Feed data in shuffled batches; one pass is an epoch; know batch size vs. iteration vs. epoch. Set seeds so results are reproducible and comparisons are trustworthy. Guard against data leakage with religious fit-on-train-only and split-before-you-touch discipline; a too-good score usually means leakage. Keep model and data on the same device. With the pipeline solid, the last thing standing between you and correct results is a set of numerical landmines. Let's defuse them.

AI Foundations in Depth