Overfitting & regularization

A model that scores 100% on its training data may be worthless on new data. This chapter is about the single most important idea separating "got a number" from "built something that works": generalization — performing well on data you've never seen. Everything here exists to fight one enemy: overfitting.

Memorizing vs. learning

Overfitting is when a model memorizes the noise and quirks of the training set instead of the underlying pattern. It looks brilliant on training data and fails on new data. Its opposite, underfitting, is when the model is too simple to capture the real pattern and fails on both.

Let's watch it happen. We fit polynomials of increasing degree to 12 noisy points from a near-linear truth, and measure error on both the training points and a held-out test set:

import numpy as np
# fit polynomials of various degree to noisy near-linear data
# (xtr, ytr, xte, yte and the helpers fit_poly and mse are assumed to be defined beforehand)
print("deg | train_mse | test_mse")
for deg in [1, 3, 9, 11]:
    w = fit_poly(xtr, ytr, deg)          # least-squares polynomial fit
    print(f"{deg:3d} | {mse(w, xtr, ytr):.5f}  | {mse(w, xte, yte):.5f}")

Output:

deg | train_mse | test_mse
  1 | 0.00452  | 0.00039
  3 | 0.00408  | 0.00085
  9 | 0.00001  | 0.00541
 11 | 0.00000  | 0.00871

Read the two columns against each other — this is the whole lesson:

  • Degree 1 (a line): small train error, smallest test error. It matches the true near-linear pattern.
  • Degree 11: train error is essentially zero because a degree-11 polynomial can pass through all 12 noisy points exactly, yet test error is the worst of all (about 22× the line's). It memorized the noise.

The tell of overfitting: training error keeps dropping while validation error starts rising. The gap between them is the overfitting. A model is only as good as its validation number, never its training number.
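For readers who want to rerun the experiment, here is a minimal self-contained sketch. The ground-truth function, noise level, seed, and test-set size are assumptions made for illustration (the chapter does not specify them), and fit_poly and mse are one plausible way to write the helpers, so the exact numbers will differ from the table above. The qualitative pattern to look for is the same: training error shrinks as the degree grows.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for repeatability

def truth(x):
    # Hypothetical near-linear ground truth (assumed, not from the chapter).
    return 0.5 * x + 0.05 * np.sin(3 * x)

# 12 noisy training points and a larger held-out test set on [-1, 1].
xtr = np.linspace(-1, 1, 12)
ytr = truth(xtr) + rng.normal(0, 0.05, size=xtr.shape)
xte = rng.uniform(-1, 1, size=200)
yte = truth(xte) + rng.normal(0, 0.05, size=xte.shape)

def fit_poly(x, y, deg):
    # Least-squares fit; columns of X are 1, x, x^2, ..., x^deg.
    X = np.vander(x, deg + 1, increasing=True)
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def mse(w, x, y):
    # Mean squared error of the polynomial with coefficients w.
    X = np.vander(x, len(w), increasing=True)
    return float(np.mean((X @ w - y) ** 2))

results = {}
print("deg | train_mse | test_mse")
for deg in [1, 3, 9, 11]:
    w = fit_poly(xtr, ytr, deg)
    results[deg] = (mse(w, xtr, ytr), mse(w, xte, yte))
    print(f"{deg:3d} | {results[deg][0]:.5f}  | {results[deg][1]:.5f}")
```

Because each higher-degree model contains the lower-degree ones, the training column can only go down. Only the test column tells you whether the extra capacity helped.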

The bias–variance tradeoff

This tug-of-war has a name. Total error decomposes into:

  • Bias — error from wrong assumptions / too simple a model (→ underfitting). The degree-1 fit on a wiggly truth would have high bias.
  • Variance — error from being too sensitive to the particular training sample (→ overfitting). The degree-11 fit has huge variance: a different 12 points would give a wildly different curve.

$$ \text{Test error} \approx \text{bias}^2 + \text{variance} + \text{irreducible noise} $$

You can't drive both to zero; you find the sweet spot. More model capacity ⇒ less bias, more variance. Everything below is about buying low variance without paying too much bias.
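The decomposition can be estimated by simulation: refit the same model on many independently drawn training sets, then compare the average prediction with the truth (bias) and measure how much individual predictions scatter around that average (variance). The sketch below does this under assumed settings; the ground truth, noise level, and number of trials are hypothetical choices, not values from the chapter.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed

def truth(x):
    # Hypothetical near-linear ground truth, assumed for illustration.
    return 0.5 * x + 0.05 * np.sin(3 * x)

xtr = np.linspace(-1, 1, 12)        # fixed training inputs
xgrid = np.linspace(-0.9, 0.9, 50)  # points where predictions are compared
noise_sd = 0.05
n_trials = 500

def bias_variance(deg):
    # Refit on many training sets that differ only in their noise draw.
    preds = np.empty((n_trials, xgrid.size))
    for t in range(n_trials):
        ytr = truth(xtr) + rng.normal(0, noise_sd, size=xtr.shape)
        preds[t] = np.polyval(np.polyfit(xtr, ytr, deg), xgrid)
    mean_pred = preds.mean(axis=0)
    bias_sq = float(np.mean((mean_pred - truth(xgrid)) ** 2))  # average squared bias
    variance = float(np.mean(preds.var(axis=0)))               # average prediction variance
    return bias_sq, variance

estimates = {deg: bias_variance(deg) for deg in [1, 3, 11]}
for deg, (b2, var) in estimates.items():
    print(f"deg={deg:2d}  bias^2={b2:.6f}  variance={var:.6f}")
```

The variance column is the one to watch: it measures how much the fitted curve changes when only the noise in the 12 points changes.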

The non-negotiable: train / validation / test split

You cannot judge a model on data it trained on. Split your data three ways:

  • Training set (~70%) — the model learns its parameters here.
  • Validation set (~15%) — you tune hyperparameters (degree, learning rate, regularization strength) by checking here.
  • Test set (~15%) — touched once, at the very end, for an honest final estimate. If you tune against it, it's no longer honest.

Don't be confused: validation vs. test set. Validation is for choosing (you look at it many times while tuning). Test is for reporting (you look once, at the end). Reusing the test set to make decisions silently leaks information about it into your model choices and inflates your reported number. This is one of the most common ways teams fool themselves.
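A three-way split can be as simple as shuffling the row indices once and slicing. The sketch below assumes the samples are independent, so random shuffling is appropriate; time-ordered or grouped data need a different scheme. The function name and its parameters are hypothetical.

```python
import numpy as np

def train_val_test_split(n_samples, val_frac=0.15, test_frac=0.15, seed=0):
    # Shuffle indices once, then cut them into three disjoint slices.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    n_test = int(round(test_frac * n_samples))
    n_val = int(round(val_frac * n_samples))
    test_idx = idx[:n_test]
    val_idx = idx[n_test:n_test + n_val]
    train_idx = idx[n_test + n_val:]
    return train_idx, val_idx, test_idx

train_idx, val_idx, test_idx = train_val_test_split(1000)
print(len(train_idx), len(val_idx), len(test_idx))  # 700 150 150
```

Fixing the seed keeps the test rows the same across runs, so the "touched once" rule still holds when you rerun experiments.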

Cross-validation squeezes more out of small datasets: split into $k$ folds, train on $k-1$ and validate on the held-out fold, rotate, average. k-fold CV is the gold standard when data is scarce.
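The rotate-and-average procedure can be sketched in a few lines of NumPy. Here it is used to score polynomial degrees; the helper names (kfold_indices, cv_mse) and the synthetic data are assumptions for illustration only.

```python
import numpy as np

def kfold_indices(n_samples, k=5, seed=0):
    # Shuffle once, split into k nearly equal folds, and rotate the held-out fold.
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

def cv_mse(x, y, deg, k=5):
    # Average validation MSE of a degree-`deg` polynomial over the k folds.
    scores = []
    for train_idx, val_idx in kfold_indices(len(x), k):
        w = np.polyfit(x[train_idx], y[train_idx], deg)
        scores.append(np.mean((np.polyval(w, x[val_idx]) - y[val_idx]) ** 2))
    return float(np.mean(scores))

rng = np.random.default_rng(2)
x = np.linspace(-1, 1, 30)
y = 0.5 * x + rng.normal(0, 0.05, size=x.shape)  # hypothetical noisy linear data
for deg in [1, 3, 9]:
    print(f"deg={deg}  cv_mse={cv_mse(x, y, deg):.5f}")
```

Every sample is used for validation exactly once, which is why the averaged score is less dependent on one lucky or unlucky split.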

Regularization: penalize complexity

The cure for overfitting is to discourage the model from using big, wiggly weights. Regularization adds a penalty on weight size to the loss:

$$ L_{\text{total}} = \underbrace{L_{\text{data}}}_{\text{fit the data}} + \lambda \underbrace{\lVert w\rVert}_{\text{stay simple}} $$

  • L2 / Ridge (penalize $\sum w_i^2$) — shrinks all weights smoothly toward zero. The default. (In deep learning this is called weight decay.)
  • L1 / Lasso (penalize $\sum |w_i|$) — drives many weights exactly to zero, doing automatic feature selection (a sparse model).

$\lambda$ (lambda) is the regularization strength — a hyperparameter you tune on validation. Watch L2 rescue the hopeless degree-11 model:

# same degree-11 polynomial, now with an L2 (ridge) penalty of strength lambda
for lam in [0.0, 1e-3, 1e-1]:
    w = fit_ridge(xtr, ytr, deg=11, lam=lam)
    print(f"lambda={lam:<6} train={mse(w,xtr,ytr):.5f} test={mse(w,xte,yte):.5f}")

Output:

lambda=0.0    train=0.00000 test=0.00871
lambda=0.001  train=0.00081 test=0.00342
lambda=0.1    train=0.00325 test=0.00190

As $\lambda$ rises, training error gets worse (good: we stop memorizing noise) and test error improves about 4.6×, from 0.00871 to 0.00190. That's regularization working: trade a little training fit for a lot of generalization.
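One way fit_ridge could be written is shown below. This is a sketch under stated assumptions, not necessarily the implementation behind the numbers above: it penalizes every coefficient including the intercept, and the data, seed, and noise level are made up, so its output will not match the table. It relies on the fact that minimizing $\lVert Xw - y\rVert^2 + \lambda \lVert w\rVert^2$ is an ordinary least-squares problem on an augmented system.

```python
import numpy as np

def fit_ridge(x, y, deg, lam):
    # Design matrix with columns 1, x, x^2, ..., x^deg.
    X = np.vander(x, deg + 1, increasing=True)
    # Append sqrt(lam) * I as extra rows with zero targets; least squares on this
    # stacked system minimizes ||X w - y||^2 + lam * ||w||^2.
    A = np.vstack([X, np.sqrt(lam) * np.eye(deg + 1)])
    b = np.concatenate([y, np.zeros(deg + 1)])
    w, *_ = np.linalg.lstsq(A, b, rcond=None)
    return w

def mse(w, x, y):
    X = np.vander(x, len(w), increasing=True)
    return float(np.mean((X @ w - y) ** 2))

rng = np.random.default_rng(0)  # arbitrary seed; data below is hypothetical
xtr = np.linspace(-1, 1, 12)
ytr = 0.5 * xtr + rng.normal(0, 0.05, size=xtr.shape)
xte = rng.uniform(-1, 1, size=200)
yte = 0.5 * xte + rng.normal(0, 0.05, size=xte.shape)

norms, train_errs = [], []
for lam in [0.0, 1e-3, 1e-1]:
    w = fit_ridge(xtr, ytr, deg=11, lam=lam)
    norms.append(float(np.linalg.norm(w)))
    train_errs.append(mse(w, xtr, ytr))
    print(f"lambda={lam:<6} |w|={norms[-1]:.3f} "
          f"train={train_errs[-1]:.5f} test={mse(w, xte, yte):.5f}")
```

The printed weight norm makes the mechanism visible: a larger $\lambda$ forces smaller coefficients, which is what stops the curve from swinging between points.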

Don't be confused: L1 vs. L2. L2 (ridge) shrinks weights toward zero but rarely to exactly zero — keeps all features, just small. L1 (lasso) zeroes weights out entirely — picks a subset of features. Want a sparse, interpretable model? L1. Want smooth shrinkage and stable training? L2. Elastic Net mixes both.
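The difference is easiest to see in a simplified setting. Assuming orthonormal features (so each weight can be solved for independently), both penalties have a closed-form effect on the unregularized weight: L2 rescales it, while L1 subtracts a fixed amount and clips at zero (soft-thresholding). The weights and $\lambda$ below are made-up numbers for illustration.

```python
import numpy as np

def ridge_shrink(w_ols, lam):
    # Per-coordinate minimizer of 0.5*(w - w_ols)^2 + 0.5*lam*w^2.
    return w_ols / (1.0 + lam)

def lasso_shrink(w_ols, lam):
    # Per-coordinate minimizer of 0.5*(w - w_ols)^2 + lam*|w| (soft-thresholding).
    # The trailing "+ 0.0" only turns any -0.0 into 0.0 for cleaner printing.
    return np.sign(w_ols) * np.maximum(np.abs(w_ols) - lam, 0.0) + 0.0

w_ols = np.array([3.0, -0.5, 0.2, -2.0, 0.05])  # hypothetical unregularized weights
lam = 0.6
w_ridge = ridge_shrink(w_ols, lam)
w_lasso = lasso_shrink(w_ols, lam)

print("ridge:", np.round(w_ridge, 3))
print("lasso:", np.round(w_lasso, 3))
print("nonzero weights:", np.count_nonzero(w_ridge), "vs", np.count_nonzero(w_lasso))  # 5 vs 2
```

Any weight whose magnitude is below $\lambda$ is set exactly to zero by the L1 rule, whereas the L2 rule only scales it down. Real features are rarely orthonormal, so treat this as intuition rather than a general solver.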

The deep-learning regularization toolkit

Neural nets have their own anti-overfitting arsenal — you'll set these as hyperparameters:

  • Dropout — randomly zero out a fraction of neurons each training step, forcing redundancy so no single neuron is load-bearing. Turned off at inference.
  • Early stopping — watch validation loss; stop the moment it starts rising. Simple and remarkably effective.
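  • Weight decay — L2 by another name under plain SGD. With adaptive optimizers such as Adam, the decoupled variant (AdamW) behaves differently from an L2 term added to the loss.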
  • Data augmentation — manufacture more training variety (flip/crop images, paraphrase text). More effective data ⇒ less overfitting.
  • Batch/Layer normalization — stabilizes training and mildly regularizes.
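Early stopping is simple enough to sketch without any framework. Validation loss is usually noisy, so a common variant waits a few epochs before giving up; that waiting period is an addition to the "stop when it rises" rule above, and the function name, the patience parameter, and the loss values below are all hypothetical.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (best_epoch, stop_epoch) for a sequence of per-epoch validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best: remember it (a real loop would also checkpoint the weights).
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs: stop training here.
            return best_epoch, epoch
    return best_epoch, len(val_losses) - 1

# Made-up validation curve: improves until epoch 3, then drifts upward.
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.51, 0.56, 0.60, 0.65]
best_epoch, stop_epoch = early_stopping_epoch(val_losses, patience=3)
print(best_epoch, stop_epoch)  # 3 6
```

Training halts at epoch 6, but the model you keep is the one from epoch 3, where validation loss was lowest.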

Don't be confused: dropout at train vs. test. Dropout is active during training (randomly drops neurons) and disabled during inference (uses the full network). This is exactly why you must call model.eval() in PyTorch before predicting — forgetting it leaves dropout/batchnorm in training mode and quietly corrupts your results (Chapter 12).
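The train/inference difference can be illustrated with a small NumPy sketch of inverted dropout, the scheme in which surviving activations are rescaled during training so that nothing needs to change at inference. The dropout function and its training flag here are hypothetical stand-ins for what a framework toggles when you switch modes.

```python
import numpy as np

def dropout(x, p, training, rng):
    # Inverted dropout: during training, zero each unit with probability p and
    # rescale the survivors by 1/(1-p) so the expected activation is unchanged.
    # At inference the input passes through untouched.
    if not training or p == 0.0:
        return x
    mask = rng.random(x.shape) >= p
    return x * mask / (1.0 - p)

rng = np.random.default_rng(0)  # arbitrary seed
x = np.ones(10_000)

train_out = dropout(x, p=0.5, training=True, rng=rng)
eval_out = dropout(x, p=0.5, training=False, rng=rng)

print("train: fraction zeroed    =", np.mean(train_out == 0.0))
print("train: mean activation    =", train_out.mean())
print("eval:  identical to input =", np.array_equal(eval_out, x))
```

In training mode the same input gives a different random output on every call, which is exactly the noise you do not want in predictions.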

The takeaway

A model is judged on data it never trained on. Overfitting = memorizing noise (low train error, high validation error); underfitting = too simple (high both). Balance them via the bias–variance tradeoff. Always split train/validation/test and never tune on test. Fight overfitting with regularization (L1/L2), dropout, early stopping, and more data. Now: how do we measure "good" precisely? Metrics. 👉