Introduction

No background required. This book assumes you know nothing about machine learning and only a little Python. Every concept, symbol, and line of code is explained, and every code example is followed by the exact output it produces. If a word looks scary, it's defined the first time it appears.

The problem this book solves

You can sit down with an ML engineer, follow 80% of what they say, and then hit a single word that stops you cold — logit, kernel, embedding, broadcasting, autograd, cross-entropy, feature, tensor — and suddenly the rest of the sentence is noise. None of these words are hard. They're just rarely defined from the ground up, because every book assumes the other books taught them to you.

This is the book that teaches them. It's the missing prerequisite — a hitchhiker's guide to the vocabulary, math, and tools that sit underneath every AI project, whether that project is a recommender, a search engine, a fraud model, or a large language model.

How this book is different

Three rules, on every page:

  1. Show the output. Every snippet is followed by a text block with exactly what it prints. You never have to wonder "what does this actually do?"
  2. Build it from scratch, then name it. We compute cosine similarity with np.dot before telling you it's called cosine similarity. Understanding first, jargon second.
  3. "Don't be confused" boxes. The field is full of near-synonyms and collisions — normalization vs. standardization, loss vs. metric, parameter vs. hyperparameter, the five different things called "kernel." These boxes pull them apart explicitly.

A map of the journey

  • Part I — The data & the vocabulary. What a model is; tensors, shapes and broadcasting; features and feature extraction; the similarity/distance recipes (cosine, dot, Euclidean, RBF…); and the word kernel in all five of its meanings.
  • Part II — How learning actually works. Linear and logistic models; loss functions; gradient descent and backpropagation; overfitting and regularization; evaluation metrics.
  • Part III — Neural networks & PyTorch. A neural net built by hand in NumPy, then the same network in PyTorch; the deep-learning zoo (CNN, RNN, Transformer); and embeddings.
  • Part IV — The modern AI stack. Tokens, transformers and LLMs; the training pipeline in practice; and a compendium of numerical gotchas.
  • Part V — The classical ML toolkit. The algorithms that rule tabular data and interviews — k-NN, Naive Bayes, decision trees, random forests, gradient boosting (XGBoost), SVMs — plus clustering and PCA.
  • Part VI — The math & stats interviews assume. Probability and statistics (Bayes, MLE, hypothesis testing), the linear algebra you actually need (eigen, SVD), and A/B testing.
  • Part VII — Shipping it & the landscape. The practitioner's toolkit (Python, SQL, Pandas, scikit-learn, tuning), cloud & MLOps, and the modern frontier (LoRA, RAG, agents, diffusion, quantization, vector DBs).
  • Part VIII — Interview success. A repeatable ML system-design framework and a full interview playbook (concept bank, coding drills, prep plan).
  • Part IX — Capstone projects. Five complete, runnable projects — train a GPT from scratch, fine-tune with LoRA, build an LLM agent, train a CNN, and a diffusion model — the modern AI stack, built end to end on your laptop.
  • Part X — Reference. A copy-paste recipe book and a glossary of the words interviewers assume you know.

The first four parts are the deep-learning spine, read in order. Parts V–VIII broaden you into a complete, hands-on, interview-ready engineer — they can be read any time after Part II, in any order.

What you'll be able to do by the end

Define and code — in a few lines of NumPy — every term in the glossary. Read a PyTorch training loop and say what each line does. Look at a similarity score and know whether it's a distance or a similarity, whether it's normalized, and what would change if you swapped the metric. Walk into any AI conversation without flinching.

What you need

Python 3, NumPy, and (for one chapter) PyTorch — the CPU build is plenty. Every runnable program lives in code/ alongside the book. If you've never written Python, read the 5-minute primer next; otherwise skip straight to What is a model, really? 👉

A 5-minute primer: Python, NumPy & the mental model

This page gives you just enough to read every example in the book. Skip it if you're already comfortable with NumPy arrays.

Reading the code boxes

Grey boxes contain Python; the box right after shows what it prints:

print("hello")
print(2 + 3)

Output:

hello
5

Variables, lists, functions

x = 10                  # x now refers to the integer 10
words = ["cat", "dog"]  # a list: an ordered collection

def add(a, b):          # define a reusable function
    return a + b        # "return" hands a result back

print(add(2, 3))

Output:

5

NumPy: the array is everything

NumPy is the library for fast number-crunching in Python; essentially all of AI's math runs on it (or on its GPU cousin, PyTorch). We nickname it np. Its one big idea is the array: a grid of numbers you operate on all at once, instead of looping.

import numpy as np

v = np.array([2.0, 0.5, 1.0])   # a 1-D array = a vector
print(v)
print("shape:", v.shape)        # how big it is, per dimension
print("v * 2:", v * 2)          # operations apply to every element

Output:

[2.  0.5 1. ]
shape: (3,)
v * 2: [4. 1. 2.]

That last line is the whole point: v * 2 multiplied every element without a for loop. This is called vectorization, and it's why NumPy is fast.

Vectors, matrices, and the word "shape"

  • A vector is a 1-D array — a single row of numbers, a point in space.
  • A matrix is a 2-D array — a grid with rows and columns.
  • .shape tells you the size along each dimension. (3,) is a length-3 vector; (2, 3) is a 2-row, 3-column matrix.

M = np.array([[1, 2, 3],
              [4, 5, 6]])
print("shape:", M.shape)   # (rows, columns)
print("M.T:\n", M.T)       # transpose: rows become columns

Output:

shape: (2, 3)
M.T:
 [[1 4]
 [2 5]
 [3 6]]

The two operations you'll see constantly

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])
print("dot product:", np.dot(a, b))   # 1*4 + 2*5 + 3*6 = 32
print("elementwise :", a * b)         # [4, 10, 18] — NOT a dot product

Output:

dot product: 32.0
elementwise : [ 4. 10. 18.]

Don't be confused: * vs @/np.dot. a * b multiplies element by element and keeps the same shape. a @ b (matrix multiply) / np.dot(a, b) sums those products into a single number (for vectors). Mixing these up is the #1 NumPy bug. The dot product is the engine of nearly every similarity in this book.
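The same distinction holds for matrices, where the two operators give visibly different results. A minimal sketch; the arrays A and B are made up for illustration:

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
B = np.array([[5.0, 6.0],
              [7.0, 8.0]])

elementwise = A * B   # multiplies matching positions; shape stays (2, 2)
matmul = A @ B        # each row of A dotted with each column of B

print(elementwise)
print(matmul)
```

Both results have shape (2, 2) here, so no error warns you when you pick the wrong operator; only the numbers differ.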

The mental model of "learning"

Here is the entire field in five words: adjust numbers to reduce error.

A model is a box of adjustable numbers (called parameters or weights). You show it examples, measure how wrong it is (the loss), and nudge the numbers in the direction that makes the loss smaller. Repeat millions of times. That's it: linear regression and a 100-billion-parameter language model differ in scale and architecture, not in this core loop. We make each step of that loop concrete in Chapter 1.

NumPy bits we use throughout

| You'll see | Meaning |
|---|---|
| np.array([...]) | make a vector / matrix |
| a * b | elementwise multiply (same shape out) |
| a @ b, np.dot(a, b) | dot / matrix product (contracts a dimension) |
| np.linalg.norm(v) | length of a vector |
| X.shape, X.T | size per dimension; transpose |
| X.mean(axis=0) | average down each column |
| np.exp, np.log | $e^x$ and natural log, elementwise |
| np.argsort(d) | indices that would sort d |
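Three of these entries have not appeared in an example yet. A short sketch of each, using made-up numbers:

```python
import numpy as np

v = np.array([3.0, 4.0])
length = np.linalg.norm(v)            # sqrt(3^2 + 4^2)

X = np.array([[1.0, 10.0],
              [3.0, 30.0]])
col_means = X.mean(axis=0)            # one average per column

d = np.array([0.7, 0.1, 0.4])
order = np.argsort(d)                 # position of the smallest value first

print(length, col_means, order)
```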

That's everything you need. Next: what a "model" actually is. 👉

What is a model, really?

Strip away the mystique and a model is a function with adjustable numbers inside it. You feed it an input, it produces an output, and the "adjustable numbers" are what get tuned during training so the outputs come out right.

$$ \text{output} = f(\text{input};\; \theta) $$

That $\theta$ ("theta") is the bag of adjustable numbers — the parameters. Training is the search for good values of $\theta$.

The five words that define the whole field

| Word | Plain meaning |
|---|---|
| Parameters (weights) | the numbers inside the model that training adjusts |
| Training (fitting) | the process of adjusting parameters to reduce error |
| Loss | a single number measuring how wrong the model currently is |
| Inference (prediction) | running a trained model on new input |
| Hyperparameters | knobs you set that aren't learned (e.g. learning rate) |

Learn these five and you can follow almost any ML conversation. Let's make them concrete with the smallest possible model.

The smallest model: a straight line

Suppose the true relationship is y = 2x + 1, but we don't know the 2 or the 1. Our model is pred = w*x + b, with two parameters: a slope w and an intercept b. We'll start them at zero and let training discover w=2, b=1.

import numpy as np

X = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * X + 1.0          # the truth we're trying to recover

w, b = 0.0, 0.0            # parameters: start knowing nothing
lr = 0.05                  # learning rate (a hyperparameter)

for step in range(1000):
    pred = w * X + b               # 1. forward pass: the model's guess
    err = pred - y
    loss = np.mean(err ** 2)       # 2. loss: mean squared error
    grad_w = np.mean(2 * err * X)  # 3. gradient: which way is "wronger"
    grad_b = np.mean(2 * err)
    w -= lr * grad_w               # 4. update: step downhill
    b -= lr * grad_b
    if step in (0, 200, 999):
        print(f"step {step:4d}  loss {loss:8.4f}  w {w:.4f}  b {b:.4f}")

print("learned:", round(w, 3), round(b, 3))

Output:

step    0  loss  33.0000  w 1.4000  b 0.5000
step  200  loss   0.0000  w 2.0002  b 0.9994
step  999  loss   0.0000  w 2.0000  b 1.0000
learned: 2.0 1.0

The model started clueless (w=0, b=0, loss 33) and discovered w=2, b=1 purely by repeatedly nudging the parameters to shrink the loss. That four-step loop — forward → loss → gradient → update — is the heartbeat of all of machine learning. A transformer does exactly this; it just has billions of parameters instead of two, and a fancier f.

Inference vs. training

Once w and b are learned, inference is trivial — no loss, no gradients, just the forward pass on new data:

x_new = 10.0
print("prediction:", round(w * x_new + b, 3))   # round away float noise left by training

Output:

prediction: 21.0

Training is the expensive part (millions of forward/loss/gradient/update steps); inference is cheap (one forward pass). This is why training a large model costs millions of dollars but using it costs a fraction of a cent.

Parameters vs. hyperparameters

Don't be confused. A parameter is learned by the model (w, b above). A hyperparameter is chosen by you and held fixed during training (the learning rate lr, the number of steps, how many layers). Rule of thumb: if gradient descent changes it, it's a parameter; if you change it, it's a hyperparameter. Picking good hyperparameters is "tuning."
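To see why tuning matters, here is a hedged sketch that reruns the line-fitting loop with two learning rates. The helper name fit_line and the value 0.2 are illustrative choices, not part of the book's code:

```python
import numpy as np

def fit_line(lr, steps=50):
    """Run the same gradient-descent loop as above; return the final loss."""
    X = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
    y = 2.0 * X + 1.0
    w, b = 0.0, 0.0                      # parameters: changed by the loop
    for _ in range(steps):
        err = w * X + b - y
        w -= lr * np.mean(2 * err * X)   # lr is the hyperparameter: fixed by us
        b -= lr * np.mean(2 * err)
    return float(np.mean((w * X + b - y) ** 2))

loss_small = fit_line(lr=0.05)   # the value used in the chapter
loss_large = fit_line(lr=0.2)    # hypothetical too-large value
print(loss_small < 1.0, loss_large > 1000.0)
```

With the larger step the updates overshoot and the loss grows instead of shrinking. Gradient descent never touches lr itself, which is exactly what makes it a hyperparameter.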

What "supervised" means

The example above is supervised learning: every input x came with the correct answer y, and the loss measured the gap. The three classic flavors:

  • Supervised — learn from (input, correct-answer) pairs. Spam/not-spam, house-price prediction, image labels. Most of applied ML.
  • Unsupervised — no labels; find structure. Clustering, dimensionality reduction, most of how embeddings are pre-trained.
  • Reinforcement — learn from rewards by trial and error. Game-playing, robotics, RLHF (the alignment step in modern chatbots).

Parametric vs. non-parametric (a useful distinction)

Our line is parametric: a fixed set of parameters (w, b) summarizes the data, and you can throw the data away afterward. A non-parametric model like k-nearest-neighbors keeps the data and answers by looking up similar examples at query time — no w to learn. Both are "models"; they just store what they learned differently. (The sister HNSW and IVF-PQ books are entirely about making that lookup fast.)
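For contrast, a minimal sketch of the non-parametric idea: a 1-nearest-neighbor predictor on the same five points. The function name predict_1nn is made up for this illustration:

```python
import numpy as np

# The "model" is just the stored training data; nothing is fitted.
X_train = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_train = 2.0 * X_train + 1.0

def predict_1nn(x_new):
    """Answer with the label of the single closest stored example."""
    nearest = np.argmin(np.abs(X_train - x_new))
    return float(y_train[nearest])

print(predict_1nn(2.2))    # closest stored x is 2.0
print(predict_1nn(10.0))   # closest stored x is 4.0
```

The fitted line from earlier extrapolated to 21.0 for x = 10; the nearest-neighbor lookup can only return an answer it has stored.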

The takeaway

A model is parameters + a forward function. Training minimizes a loss by gradient descent. Inference is just the forward pass. Everything else in this book — features, similarities, neural layers, transformers — is detail layered on top of this skeleton. Next we'll look hard at the objects these computations run on: tensors and their shapes. 👉

Tensors, shapes & broadcasting

Every AI computation is a flow of tensors through operations. If you're fluent in shapes and broadcasting, you can read any model's code and debug the error message that eats 90% of beginners' time: shapes (a,b) and (c,d) not aligned.

What is a tensor?

A tensor is just an n-dimensional array of numbers. The number of dimensions is its rank (NumPy calls it ndim). That's the whole definition — in deep learning, "tensor" is simply the word for the multi-dimensional arrays that flow through a model.

| Rank | Name | Example | Shape |
|---|---|---|---|
| 0 | scalar | 5.0 | () |
| 1 | vector | [1, 2, 3] | (3,) |
| 2 | matrix | a table / image channel | (rows, cols) |
| 3 | 3-tensor | an RGB image | (height, width, 3) |
| 4 | 4-tensor | a batch of images | (batch, H, W, 3) |

import numpy as np

s = np.array(5.0)
v = np.array([1., 2., 3.])
M = np.array([[1., 2.], [3., 4.]])
T = np.zeros((2, 3, 4))
print("ranks:", s.ndim, v.ndim, M.ndim, T.ndim)
print("shapes:", s.shape, v.shape, M.shape, T.shape)

Output:

ranks: 0 1 2 3
shapes: () (3,) (2, 2) (2, 3, 4)

Don't be confused: "tensor" (ML) vs "tensor" (physics/math). In physics a tensor is an object with strict transformation rules. In ML, "tensor" just means "n-dimensional array." When a PyTorch person says tensor, they mean the array. Don't overthink it.

Shape is the thing you reason about

Almost every bug is a shape bug. Two habits save you:

  1. Say the shape out loud at each line: "X is (batch, features)."
  2. Know what each axis means — by deep convention, axis 0 is the batch / sample axis (one row per example). X[i] is the i-th example.

Reshaping moves the same numbers into a new shape

a = np.arange(12)          # [0 1 2 ... 11], shape (12,)
print(a.reshape(3, 4))     # same 12 numbers, now 3 rows of 4
print(a.reshape(3, 4).reshape(-1).shape)   # -1 = "infer this axis" -> flat again

Output:

[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]
(12,)

-1 means "you figure out this dimension so the total count matches." You'll see x.reshape(batch, -1) constantly to flatten everything-but-the-batch.
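As a small illustration of that flatten pattern, assume a made-up batch of eight 28×28 grayscale images:

```python
import numpy as np

batch = np.zeros((8, 28, 28))               # hypothetical: 8 images, 28 x 28 pixels
flat = batch.reshape(batch.shape[0], -1)    # keep axis 0, let NumPy infer the rest
print(flat.shape)
```

Each image becomes one row of 28 × 28 = 784 numbers, which is the (batch, features) layout most models expect.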

Broadcasting: the rule that removes loops

Broadcasting lets NumPy combine arrays of different shapes by virtually stretching the smaller one. It's how you add a bias to every row, or scale every column, without a loop. The rule: compare shapes from the right; dimensions are compatible if they're equal or one of them is 1.

Add a scalar to everything

X = np.array([[1., 2., 3.],
              [4., 5., 6.]])
print(X + 10)             # the 10 is stretched to every element

Output:

[[11. 12. 13.]
 [14. 15. 16.]]

Add a per-column vector (shape (3,)) to every row

print(X + np.array([100., 200., 300.]))    # (2,3) + (3,) -> stretched down rows

Output:

[[101. 202. 303.]
 [104. 205. 306.]]

Add a per-row vector — you must make it a column (2,1)

print(X + np.array([[10.], [20.]]))         # (2,3) + (2,1) -> stretched across cols

Output:

[[11. 12. 13.]
 [24. 25. 26.]]

That (2,1) trick, turning a vector into an explicit column (written out as above, or with v[:, None] for an existing vector v), is how you control which axis broadcasts. The recipe book uses it to compute all-pairs distances.
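The right-to-left rule can be written out by hand. The sketch below is an illustration only: the helper broadcast_shape is hypothetical, and NumPy applies the same rule internally.

```python
import numpy as np

def broadcast_shape(shape_a, shape_b):
    """Apply the right-to-left rule by hand; raise if the shapes clash."""
    a, b = list(shape_a)[::-1], list(shape_b)[::-1]
    result = []
    for i in range(max(len(a), len(b))):
        da = a[i] if i < len(a) else 1   # a missing leading axis acts like size 1
        db = b[i] if i < len(b) else 1
        if da != db and da != 1 and db != 1:
            raise ValueError(f"incompatible sizes {da} and {db}")
        result.append(db if da == 1 else da)
    return tuple(result[::-1])

X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
row_offsets = np.array([10.0, 20.0])     # shape (2,): one value per row
column = row_offsets[:, None]            # shape (2, 1): now an explicit column

print(broadcast_shape(X.shape, column.shape))
print(X + column)
```

Calling broadcast_shape((2, 3), (2,)) raises, because comparing from the right pairs 3 with 2. That is why the per-row vector has to become a column first.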

axis: the other thing everyone trips on

Reductions like mean, sum, max take an axis. The mental model: axis=k is the axis that disappears.

X = np.array([[1., 2., 3.],
              [4., 5., 6.]])
print("axis=0 (down columns):", X.mean(axis=0))   # collapses the 2 rows -> 3 numbers
print("axis=1 (across rows)  :", X.mean(axis=1))   # collapses the 3 cols -> 2 numbers

Output:

axis=0 (down columns): [2.5 3.5 4.5]
axis=1 (across rows)  : [2. 5.]

Don't be confused: axis=0 ≠ "rows." axis=0 is the row axis, so reducing over it collapses the rows and leaves one number per column. "Per-column statistic" (like feature means for standardization) = axis=0. "Per-row statistic" (like normalizing each sample) = axis=1. Read it as "the axis I sum over is the axis that vanishes."
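The "axis that vanishes" reading extends past two dimensions. A quick sketch on a made-up rank-3 tensor:

```python
import numpy as np

T = np.arange(24).reshape(2, 3, 4)      # shape (2, 3, 4)
for k in range(T.ndim):
    print(f"axis={k}:", T.sum(axis=k).shape)

kept = T.sum(axis=1, keepdims=True)     # keepdims leaves a size-1 axis in place
print("keepdims:", kept.shape)
```

Passing keepdims=True is handy when the reduced result has to broadcast back against the original array.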

Matrix multiplication: the shape contract

A @ B requires the inner dimensions to match: (m, k) @ (k, n) -> (m, n). The shared k is summed away.

A = np.ones((2, 3))
B = np.ones((3, 5))
print((A @ B).shape)        # (2,3) @ (3,5) -> (2,5)

Output:

(2, 5)

A neural network layer is this: outputs = inputs @ weights + bias, where inputs is (batch, in_features) and weights is (in_features, out_features). When you see the dreaded "shapes not aligned" error, line up the inner dimensions and one of them is wrong.
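A minimal sketch of that layer formula with random numbers. The sizes 4, 3, and 2 are arbitrary choices for the illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
batch, in_features, out_features = 4, 3, 2        # made-up sizes

inputs = rng.normal(size=(batch, in_features))    # (4, 3)
weights = rng.normal(size=(in_features, out_features))   # (3, 2)
bias = np.zeros(out_features)                     # (2,), broadcast over every row

outputs = inputs @ weights + bias                 # (4, 3) @ (3, 2) -> (4, 2)
print(outputs.shape)

try:
    inputs @ weights.T        # (4, 3) @ (2, 3): inner dimensions 3 and 2 clash
except ValueError as exc:
    print("shape error:", type(exc).__name__)
```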

Dtype: the silent gotcha

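Tensors have a dtype (float32, float64, int64…). Deep learning runs in float32 by default (half the memory of float64, plenty of precision), and modern training often uses float16/bfloat16 for speed. Integer arrays are a frequent surprise: the // operator floor-divides, a float written into an integer array is silently truncated, and integer tensors can't hold gradients. Plain / is safe, because NumPy promotes the result to float: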

print(np.array([1, 2, 3]) / 2)         # NumPy promotes to float -> fine
print((np.array([1, 2, 3]) * 1.0).dtype)

Output:

[0.5 1.  1.5]
float64

Don't be confused: float32 vs float64. NumPy defaults to float64; PyTorch defaults to float32. If you move data between them and get a dtype error, cast explicitly with .astype(np.float32) or .float(). Models almost never need float64.
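A small sketch of the integer behavior and of the explicit cast; the array values are made up:

```python
import numpy as np

ints = np.array([1, 2, 3])       # integer dtype, inferred from the literals
print(ints // 2)                 # floor division keeps the integer dtype
ints[0] = 2.7                    # the fraction is dropped to fit the integer dtype
print(ints)

x64 = np.array([0.1, 0.2])       # NumPy default: float64
x32 = x64.astype(np.float32)     # explicit cast before handing data to a float32 model
print(x64.dtype, x32.dtype)
```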

The takeaway

Tensors are n-D arrays; rank counts the dimensions; shape is what you reason about; axis 0 is the batch; broadcasting stretches size-1 dimensions so you never loop; matmul contracts the shared inner dimension. With this, you can read model code. Next: how raw data becomes the tensors in the first place — features. 👉

Features & feature extraction

A model never sees "a customer," "a sentence," or "a photo." It sees a row of numbers. A feature is one of those numbers — one measurable property of the thing. The row is a feature vector, and the whole table is the feature matrix X of shape (n_samples, n_features).

Feature extraction is the craft of turning raw, messy reality into that clean numeric table. It is, in practice, where most of the real work — and most of the accuracy — lives. "Garbage in, garbage out" is never more true than here.

Don't be confused: the three "feature ___" terms.

  • Feature extraction: create numeric features from raw data (pixels → edges, text → TF-IDF, audio → spectrogram). Changes the representation.
  • Feature engineering: hand-craft new features from existing ones (price_per_sqft = price / area). Uses domain knowledge.
  • Feature selection: pick a subset of existing features and drop the rest (remove redundant or useless columns). Reduces dimensionality.

Deep learning's headline trick is that it does feature extraction for you — the hidden layers learn features instead of you crafting them. That's what "representation learning" means (Chapter 14).

All code below is in code/features.py and prints exactly what's shown.

1. Numeric features — and why you must scale them

Raw numbers come on wildly different scales: age ∈ [0, 100], income ∈ [0, 1,000,000]. Many models (anything using distances, gradients, or regularization) will let the big-scale feature dominate purely because its numbers are bigger. Standardization fixes this: per column, subtract the mean and divide by the standard deviation, giving every feature mean 0 and std 1.

$$ z = \frac{x - \mu}{\sigma} $$

import numpy as np

def standardize(X):
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)

age_income = np.array([[25, 30000.0],
                       [40, 60000.0],
                       [55, 90000.0]])
print(np.round(standardize(age_income), 3))   # round to 3 decimals for display

Output:

[[-1.225 -1.225]
 [ 0.     0.   ]
 [ 1.225  1.225]]

Now age and income are on the same footing — both range over the same small spread, so neither bullies the other. Note we used axis=0 (per column / per feature), exactly as Chapter 2 warned.

Don't be confused: normalization vs standardization. People use these loosely, but the usual meanings are:

  • Standardization → z-score: mean 0, std 1. Unbounded output. The default for tabular features.
  • Min-max normalization → rescale to a fixed range like [0, 1].
  • L2 normalization → scale each row to unit length (for cosine similarity; Chapter 4).

They answer different needs. When someone says "normalize," ask "to unit length, to [0,1], or to mean-0/std-1?"
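The three meanings side by side, as a rough sketch on the age/income table from above (written inline here rather than taken from code/features.py):

```python
import numpy as np

X = np.array([[25.0, 30000.0],
              [40.0, 60000.0],
              [55.0, 90000.0]])

# Per column: mean 0, std 1
standardized = (X - X.mean(axis=0)) / X.std(axis=0)
# Per column: squeezed into [0, 1]
minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
# Per row: every sample scaled to length 1
unit_rows = X / np.linalg.norm(X, axis=1, keepdims=True)

print(np.round(minmax, 3))
print(np.round(np.linalg.norm(unit_rows, axis=1), 3))
```

Note the axis: the first two work per column (axis=0), while L2 normalization works per row (axis=1).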

The cardinal rule: fit the scaler on the training set only, then apply those same numbers to validation/test. Computing the mean over all data leaks test information into training — a classic, silent mistake.
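One way to honor that rule is to split scaling into a fit step and an apply step. The helpers fit_scaler and apply_scaler below are hypothetical names for this sketch, and the test row is invented:

```python
import numpy as np

def fit_scaler(X_train):
    """Learn the per-column mean and std from training rows only."""
    return X_train.mean(axis=0), X_train.std(axis=0) + 1e-12

def apply_scaler(X, mean, std):
    """Reuse the stored training statistics on any split."""
    return (X - mean) / std

X_train = np.array([[25.0, 30000.0],
                    [40.0, 60000.0],
                    [55.0, 90000.0]])
X_test = np.array([[70.0, 120000.0]])        # hypothetical unseen row

mean, std = fit_scaler(X_train)              # the test row plays no part here
print(np.round(apply_scaler(X_test, mean, std), 3))
```

The test row lands outside the training range, which is fine: it is measured with the training set's ruler, not its own.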

2. Categorical features — one-hot encoding

A category like color ∈ {red, green, blue} is not a number. Encoding it as red=0, green=1, blue=2 is a trap: it tells the model blue > red and green is "between" them, which is nonsense. One-hot encoding instead gives each category its own 0/1 column:

import numpy as np

def one_hot_encode(values):
    vocab = sorted(set(values))
    index = {v: i for i, v in enumerate(vocab)}
    out = np.zeros((len(values), len(vocab)))
    for row, v in enumerate(values):
        out[row, index[v]] = 1.0
    return out, vocab

M, vocab = one_hot_encode(["red", "blue", "red", "green"])
print("vocab:", vocab)
print(M)

Output:

vocab: ['blue', 'green', 'red']
[[0. 0. 1.]
 [1. 0. 0.]
 [0. 0. 1.]
 [0. 1. 0.]]

Each row has a single 1 marking its category. No false ordering, no fake distances.

When not to one-hot: if a category has thousands of values (user IDs, product IDs), one-hot makes an absurdly wide, sparse matrix. There you use an embedding — a learned, dense vector per category — which is exactly the bridge to neural recommenders (Chapter 14). And ordinal categories with a real order (small < medium < large) can be integer-coded on purpose.
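Mechanically, an embedding is a row lookup in a table, and it gives the same result as multiplying the one-hot matrix by that table. A sketch under stated assumptions: the table is random here (training would tune it), and the sizes are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
n_categories, dim = 1000, 4                      # hypothetical: 1000 product IDs
table = rng.normal(size=(n_categories, dim))     # random stand-in for learned vectors

ids = np.array([3, 17, 3])                       # three rows, given as category indices
dense = table[ids]                               # embedding lookup: shape (3, 4)

one_hot = np.zeros((len(ids), n_categories))     # the wide, sparse alternative
one_hot[np.arange(len(ids)), ids] = 1.0
print(dense.shape, np.allclose(one_hot @ table, dense))
```

The lookup skips building the 1000-column matrix entirely, which is the whole point for high-cardinality features.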

3. Text features — bag-of-words and TF-IDF

Text is the classic feature-extraction problem: variable-length strings must become fixed-length vectors. The oldest, still-useful idea is the bag of words — count how often each vocabulary word appears, ignoring order.

docs = ["the cat sat",
        "the dog sat",
        "the cat and the dog"]
# vocab (sorted): ['and', 'cat', 'dog', 'sat', 'the']

The count matrix (rows = documents, columns = words):

[[0. 1. 0. 1. 1.]      # "the cat sat"
 [0. 0. 1. 1. 1.]      # "the dog sat"
 [1. 1. 1. 0. 2.]]     # "the cat and the dog"  -> "the" appears twice

The problem: the appears everywhere, so it dominates the counts while carrying no discriminating signal. TF-IDF (Term Frequency × Inverse Document Frequency) fixes this by down-weighting words that appear in many documents:

$$ \text{tfidf}(t, d) = \underbrace{\text{count}(t, d)}_{\text{term frequency}} \times \underbrace{\left(\log\frac{1 + N}{1 + \text{df}(t)} + 1\right)}_{\text{inverse doc. frequency}} $$

where $N$ is the number of documents and $\text{df}(t)$ is how many documents contain term $t$. Rows are then L2-normalized. The result:

[[0.    0.62  0.    0.62  0.481]
 [0.    0.    0.62  0.62  0.481]
 [0.531 0.404 0.404 0.    0.627]]

Compare the last column (the). In the first two documents its weight (0.481) falls below that of the distinctive words cat, dog, and sat (0.62), because the appears in every document and so has the smallest IDF. In the third document the occurs twice, yet its weight (0.627) is only about 1.55 times that of cat (0.404) rather than double. TF-IDF is the workhorse behind classic search ranking and a strong baseline for text classification, and it is a conceptual ancestor of the dense embeddings used today.
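The formula is short enough to implement directly. The following is a sketch written for this page (not necessarily the code in code/features.py), using whitespace tokenization as a simplifying assumption:

```python
import numpy as np

docs = ["the cat sat", "the dog sat", "the cat and the dog"]
vocab = sorted({word for doc in docs for word in doc.split()})
col = {word: j for j, word in enumerate(vocab)}

# Bag-of-words counts: rows = documents, columns = vocabulary words
counts = np.zeros((len(docs), len(vocab)))
for i, doc in enumerate(docs):
    for word in doc.split():
        counts[i, col[word]] += 1.0

n_docs = len(docs)
df = (counts > 0).sum(axis=0)                    # documents containing each term
idf = np.log((1 + n_docs) / (1 + df)) + 1.0      # the smoothed IDF from the formula
tfidf = counts * idf                             # idf broadcasts across rows
tfidf /= np.linalg.norm(tfidf, axis=1, keepdims=True)   # L2-normalize each row

print(vocab)
print(np.round(tfidf, 3))
```

Rounded to three decimals, this reproduces the matrix shown above.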

The feature-extraction checklist

For any new dataset, walk these questions:

  1. Per column, what type is it? Numeric, categorical, ordinal, text, date, image, ID?
  2. Numeric → scale it (standardize by default).
  3. Categorical → one-hot if low-cardinality, embed if high-cardinality.
  4. Text → TF-IDF for a baseline, sentence-embeddings for semantics.
  5. Dates → extract day_of_week, month, is_weekend, time-since-event.
  6. Missing values → decide: drop, fill with mean/median, or add an "is-missing" flag (often the missingness itself is signal).
  7. Fit transforms on train only, apply to test. Always.
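Items 5 and 6 are the only ones without code so far. A minimal sketch; the helper date_features and the sample values are invented for illustration:

```python
from datetime import date
import numpy as np

def date_features(d):
    """Turn one calendar date into numeric columns."""
    day_of_week = d.weekday()                 # Monday = 0 ... Sunday = 6
    return [day_of_week, d.month, int(day_of_week >= 5)]   # last entry: is_weekend

print(date_features(date(2024, 3, 16)))      # a Saturday

ages = np.array([25.0, np.nan, 55.0])         # hypothetical column with a gap
is_missing = np.isnan(ages).astype(float)     # keep the missingness as its own feature
filled = np.where(np.isnan(ages), np.nanmean(ages), ages)
print(is_missing, filled)
```

In a real pipeline the fill value would be computed on the training split only, in line with item 7.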

The takeaway

A feature is a number describing one property; feature extraction builds the numeric table the model trains on. Scale numerics, one-hot small categories, embed big ones, TF-IDF text — and never let test data leak into your transforms. Now that we have feature vectors, the next question is the one that powers search, recommendation, and clustering: how do we measure that two vectors are similar? 👉

Similarity & distance (the recipe chapter)

Once everything is a vector, "are these two things alike?" becomes "are these two vectors close?" This single question powers search, recommendation, clustering, deduplication, retrieval (RAG), face matching, and the attention inside transformers. This chapter is your recipe card for every common similarity and distance — with sample input, output, and the traps that bite people.

All recipes live in code/recipes.py and print exactly what's shown here.

Similarity vs. distance: opposite directions

Don't be confused. A similarity is bigger when things are more alike (cosine = 1 means identical direction). A distance is smaller when things are more alike (Euclidean = 0 means identical point). They point in opposite directions. You convert between them with tricks like distance = 1 - similarity (for cosine) or similarity = exp(-gamma * distance**2) (the RBF kernel). Always know which one a function returns before you sort by it. Sorting the wrong way silently returns the least similar items.

Our cast of three example vectors:

import numpy as np
a = np.array([1.0, 0.0])   # points right
b = np.array([2.0, 0.0])   # same direction, twice as long
c = np.array([0.0, 1.0])   # perpendicular to a

Recipe 1 — Dot product (linear similarity)

The raw inner product: multiply elementwise, sum. It grows with both alignment and magnitude.

$$ a \cdot b = \sum_i a_i b_i $$

print(np.dot(a, b))    # 1*2 + 0*0

Output:

2.0

Use it when magnitude is meaningful, or when your vectors are already normalized (then dot is cosine). It's the cheapest similarity, and the one transformers use internally (Q·K). Watch out: a longer vector scores higher just for being long, even if it's off-topic.

Recipe 2 — Cosine similarity (direction only)

Normalize away length, keep only the angle. Range $[-1, 1]$: 1 = same direction, 0 = perpendicular, −1 = opposite.

$$ \cos(a, b) = \frac{a \cdot b}{\lVert a\rVert \, \lVert b\rVert} $$

def cosine_sim(a, b, eps=1e-12):
    a = np.asarray(a, float); b = np.asarray(b, float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

print("cos(a, b):", round(cosine_sim(a, b), 3))   # same direction
print("cos(a, c):", round(cosine_sim(a, c), 3))   # perpendicular

Output:

cos(a, b): 1.0
cos(a, c): 0.0

a and b point the same way → cosine 1.0, even though b is twice as long. a and c are perpendicular → 0.0. Cosine is the default for text/semantic embeddings, because there a document's topic (direction) matters, not its length.

Don't be confused: cosine vs. dot product. They agree only when vectors are unit length. When they're not, raw length can swamp the dot product. Here a generic-but-long vector outscores a short one tenfold under dot purely because it is long, while cosine, which ignores length, scores the two almost equally (0.816 vs. 0.707):

doc   = np.array([1., 1., 0.])
short = np.array([1., 0., 0.])   # same topic, short
longd = np.array([5., 5., 5.])   # generic, long
print("dot:", doc @ short, doc @ longd)            # 1.0  10.0  -> long wins
print("cos:", round(cosine_sim(doc, short),3), round(cosine_sim(doc, longd),3))

Output:

dot: 1.0 10.0
cos: 0.707 0.816

The fix used everywhere: L2-normalize your vectors once up front, then a plain dot product equals cosine — fast and unambiguous. That's why vector databases store normalized vectors and call it "inner-product" search.
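A quick check of that claim, as a sketch: the helper l2_normalize is a name chosen for this example, and the vectors are the ones from the box above.

```python
import numpy as np

def l2_normalize(v, eps=1e-12):
    """Scale a vector to unit length (eps guards against a zero vector)."""
    v = np.asarray(v, float)
    return v / (np.linalg.norm(v) + eps)

doc = np.array([1.0, 1.0, 0.0])
longd = np.array([5.0, 5.0, 5.0])

cosine = np.dot(doc, longd) / (np.linalg.norm(doc) * np.linalg.norm(longd))
dot_of_units = np.dot(l2_normalize(doc), l2_normalize(longd))
print(round(float(cosine), 3), round(float(dot_of_units), 3))
```

The two numbers match, so after normalizing once you can drop the division from every later comparison.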

Recipe 3 — Euclidean (L2) distance

Ordinary straight-line distance. A distance (0 = identical).

$$ \lVert a - b\rVert_2 = \sqrt{\sum_i (a_i - b_i)^2} $$

print(np.sqrt(np.sum((a - b) ** 2)))

Output:

1.0

Use it for dense numeric features on a comparable scale (after standardizing!). It's the metric behind k-means and most geometric clustering.

Don't be confused: cosine vs. Euclidean. They can rank things differently. Two vectors pointing the same way but at different lengths are cosine-identical yet Euclidean-far:

u = np.array([1., 2.]); v = np.array([10., 20.])    # identical direction
print("cosine:", round(cosine_sim(u, v), 3))         # 1.0
print("euclid:", round(float(np.sqrt(np.sum((u-v)**2))), 3))  # 20.125

Output:

cosine: 1.0
euclid: 20.125

Choose by what "similar" means for your data: same proportions → cosine; same location → Euclidean. Fun fact: for unit-length vectors the two are equivalent — $\lVert a-b\rVert^2 = 2(1 - \cos(a,b))$ — so normalizing collapses the distinction.
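The identity follows from expanding $\lVert a-b\rVert^2 = \lVert a\rVert^2 + \lVert b\rVert^2 - 2\,a\cdot b$ and setting both norms to 1. A numeric spot-check on two random unit vectors (the seed and dimension are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=5)
b = rng.normal(size=5)
a /= np.linalg.norm(a)           # make both vectors unit length
b /= np.linalg.norm(b)

sq_dist = np.sum((a - b) ** 2)
cos = np.dot(a, b)
print(np.isclose(sq_dist, 2 * (1 - cos)))
```

Because squared distance is a decreasing function of cosine here, ranking by smallest Euclidean distance and ranking by largest cosine give the same order.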

Recipe 4 — Manhattan (L1) distance

Sum of absolute differences — "city-block" distance, as if you can only move along grid streets.

$$ \lVert a - b\rVert_1 = \sum_i |a_i - b_i| $$

print(np.sum(np.abs(a - c)))     # |1-0| + |0-1|

Output:

2.0

Use it when you want robustness to outliers (L1 punishes a single huge coordinate difference less than L2's squaring does), or in high dimensions where it sometimes behaves better.
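To make the outlier point concrete, compare a vector that differs a little everywhere with one that differs a lot in a single coordinate. The vectors are made up for this sketch:

```python
import numpy as np

origin = np.zeros(4)
spread = np.array([1.0, 1.0, 1.0, 1.0])    # small difference in every coordinate
spike = np.array([3.5, 0.0, 0.0, 0.0])     # one large difference, rest identical

def l1(a, b):
    return float(np.sum(np.abs(a - b)))

def l2(a, b):
    return float(np.sqrt(np.sum((a - b) ** 2)))

print("L1:", l1(origin, spread), l1(origin, spike))
print("L2:", l2(origin, spread), l2(origin, spike))
```

Under L1 the spike vector is the nearer one; under L2 the ranking flips, because squaring inflates the single large coordinate.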

Recipe 5 — RBF / Gaussian kernel (distance → similarity)

The Radial Basis Function kernel turns a distance into a similarity that decays smoothly: 1 when identical, fading toward 0 as points separate.

$$ k(a, b) = \exp\!\left(-\gamma \, \lVert a - b\rVert^2\right) $$

gamma controls how fast similarity fades — it's the kernel's "reach."

def rbf(a, b, gamma=1.0):
    d2 = np.sum((np.asarray(a, float) - np.asarray(b, float)) ** 2)
    return float(np.exp(-gamma * d2))

for g in [0.1, 1.0, 10.0]:
    near = rbf([0,0], [0.5, 0], gamma=g)   # squared dist 0.25
    far  = rbf([0,0], [2.0, 0], gamma=g)   # squared dist 4.0
    print(f"gamma={g:<4}  near={near:.4f}  far={far:.4f}")

Output:

gamma=0.1   near=0.9753  far=0.6703
gamma=1.0   near=0.7788  far=0.0183
gamma=10.0  near=0.0821  far=0.0000

Small gamma = wide, gentle reach (far points still count). Large gamma = narrow, picky reach (only near-identical points count). The RBF kernel is the heart of the RBF-SVM and a building block in many "kernel methods" — and gamma is the single most important knob to tune. We meet kernels properly in Chapter 5.

Recipe 6 — Jaccard (set overlap)

For sets or binary vectors (tags, shingles, sparse boolean features): the size of the intersection over the size of the union.

$$ J(A, B) = \frac{|A \cap B|}{|A \cup B|} $$

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 1.0

print(jaccard([1, 2, 3], [2, 3, 4]))    # overlap {2,3}, union {1,2,3,4}

Output:

0.5

Use it for deduplication, plagiarism/near-duplicate detection (with MinHash), and any "how much do these two tag sets overlap?" question.
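For text, the sets are usually shingles: overlapping character or word n-grams. A small sketch with 3-character shingles; the shingles helper and the sample sentences are invented, and MinHash (not shown) would approximate the same score at scale:

```python
def shingles(text, k=3):
    """Set of all k-character substrings of the text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 1.0

original = "the cat sat on the mat"
near_dup = "the cat sat on a mat"
unrelated = "gradient descent works"

print(round(jaccard(shingles(original), shingles(near_dup)), 3))
print(round(jaccard(shingles(original), shingles(unrelated)), 3))
```

The near-duplicate pair scores far higher than the unrelated pair, which is the signal a deduplication pass thresholds on.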

The all-pairs versions (what you actually ship)

In production you rarely compare two vectors — you compare a query against thousands. Do it in one matrix operation, not a Python loop:

def normalize_rows(X, eps=1e-12):
    return X / (np.linalg.norm(X, axis=1, keepdims=True) + eps)

def cosine_matrix(X):           # all-pairs cosine, shape (n, n)
    Xn = normalize_rows(X)
    return Xn @ Xn.T

X = np.array([[1., 0.], [0.9, 0.1], [0., 1.]])
print(np.round(cosine_matrix(X), 3))

Output:

[[1.    0.994 0.   ]
 [0.994 1.    0.11 ]
 [0.    0.11  1.   ]]

Row 0 vs row 1 = 0.994 (nearly identical), row 0 vs row 2 = 0.0 (perpendicular). Normalize once, multiply, done — this Xn @ Xn.T is the same operation that scores attention in a transformer. For squared Euclidean all-pairs, use the identity $\lVert a-b\rVert^2 = \lVert a\rVert^2 + \lVert b\rVert^2 - 2a\cdot b$ (the pairwise_sq_dists recipe) — never a double loop.
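One possible implementation of that identity is sketched below. The function name pairwise_sq_dists comes from the text, but this body is an assumption written for illustration, not a copy of code/recipes.py:

```python
import numpy as np

def pairwise_sq_dists(X, Y):
    """Squared Euclidean distance between every row of X and every row of Y."""
    x2 = np.sum(X ** 2, axis=1)[:, None]     # (n, 1) column of squared norms
    y2 = np.sum(Y ** 2, axis=1)[None, :]     # (1, m) row of squared norms
    d2 = x2 + y2 - 2.0 * (X @ Y.T)           # broadcasts to (n, m)
    return np.maximum(d2, 0.0)               # clip tiny negatives caused by rounding

X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
print(np.round(pairwise_sq_dists(X, X), 3))
```

The [:, None] and [None, :] indexing is the broadcasting column trick from Chapter 2, and the single X @ Y.T replaces the double loop.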

Quick chooser

| You want… | Use | Returns |
|---|---|---|
| Text / semantic similarity | cosine (or dot on normalized vectors) | similarity ↑ |
| Geometric closeness, clustering | Euclidean (after standardizing) | distance ↓ |
| Robustness to outliers | Manhattan (L1) | distance ↓ |
| Smooth, tunable similarity / SVM | RBF kernel | similarity ↑ |
| Overlap of tags / sets | Jaccard | similarity ↑ |
| Fast many-vs-many | normalize, then X @ Q.T | similarity ↑ |

The takeaway

Know whether you hold a similarity or a distance; cosine compares direction, dot adds magnitude, Euclidean compares position, RBF turns distance into tunable similarity, Jaccard compares sets. Normalize first and dot = cosine. These six recipes recur in every system in this series. The word "kernel" snuck in twice already — let's pin down all five things it means. 👉

Kernels: five meanings of one word

Few words cause more confusion than kernel, because it means five completely different things depending on who's talking. An ML researcher, a computer-vision engineer, a GPU programmer, an operating-systems person, and a statistician all say "kernel" and each mean something different. This chapter pins down all five so you're never lost, then goes deep on the two that matter most for AI.

Don't be confused — the five kernels at a glance:

| Context | "Kernel" means | One-line definition |
| --- | --- | --- |
| ML / SVM | a similarity function | $k(a,b)$ = a similarity that secretly equals a dot product in some richer space |
| CNN / image | a small filter | a tiny weight grid slid over an image to detect a pattern |
| GPU / CUDA | a function that runs on the GPU | code launched across thousands of parallel threads |
| Operating system | the core of the OS | the privileged code managing hardware (Linux kernel) |
| Stats / density | a weighting window | the bump function in kernel density estimation / RBF |

These are separate meanings of one overloaded word; only meanings 1 and 5 share any math. Which one is meant is almost always clear from context, once you know there are five.

Meaning 1 — The ML kernel (a similarity that's a hidden dot product)

This is the one that earns the deepest understanding. Start with a problem: the XOR data below can't be separated by any straight line.

import numpy as np
X = np.array([[0,0],[0,1],[1,0],[1,1]], float)
y = np.array([0, 1, 1, 0])     # XOR: 1 only when exactly one input is 1

No line splits the 1s from the 0s. But invent a new feature, the product $x_1 x_2$, and the picture changes:

print("new feature x1*x2:", X[:, 0] * X[:, 1])

Output:

new feature x1*x2: [0. 0. 0. 1.]

The new feature is 1 only for input [1,1]. On its own it does not split the XOR classes, but as a third coordinate alongside $x_1$ and $x_2$ it does: $x_1 + x_2 - 2x_1x_2$ equals the label exactly (0, 1, 1, 0), so the flat rule "is $x_1 + x_2 - 2x_1x_2 > 0.5$?" separates them. Mapping data into a higher-dimensional space can make a hard problem linear. Call that map $\phi$ (phi).
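A quick check of what the product feature buys: keep $x_1$ and $x_2$, append $x_1 x_2$ as a third coordinate, and test whether a single linear rule reproduces the XOR labels. The weights and bias here are hand-picked for illustration, not learned.

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], float)
y = np.array([0, 1, 1, 0])

# phi(x) = (x1, x2, x1*x2): the original inputs plus the product feature
Phi = np.column_stack([X[:, 0], X[:, 1], X[:, 0] * X[:, 1]])

# Hand-picked (not learned) weights for a linear rule in the 3-D space
w = np.array([1.0, 1.0, -2.0])
b = -0.5
scores = Phi @ w + b
pred = (scores > 0).astype(int)

print("scores:", scores)
print("pred  :", pred)
print("target:", y)
```

The predictions match the XOR labels on all four points, something no weights on $(x_1, x_2)$ alone can achieve.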

The trick: skip building $\phi$, compute the dot product directly

The catch: $\phi$ can be huge or infinite-dimensional, so building it is expensive. The kernel trick is the beautiful shortcut: for many useful $\phi$, the dot product in the mapped space equals a cheap formula on the original vectors — so you never construct $\phi$ at all.

$$ k(a, b) = \phi(a) \cdot \phi(b) \quad\text{but computed without ever forming } \phi $$

Concretely, for the degree-2 polynomial map $\phi(x) = (x_1^2,\ \sqrt{2}\,x_1 x_2,\ x_2^2)$:

def phi(x):
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2)*x1*x2, x2**2])

a = np.array([1., 2.]); b = np.array([3., 1.])
explicit = phi(a) @ phi(b)      # build the 3-D features, then dot
implicit = (a @ b) ** 2         # the kernel: just (a·b)^2
print("explicit (build phi, then dot):", round(explicit, 3))
print("kernel   (a·b)^2 directly      :", round(implicit, 3))

Output:

explicit (build phi, then dot): 25.0
kernel   (a·b)^2 directly      : 25.0

Identical — but the right side never built the feature vectors. That's the kernel trick: $(a\cdot b)^2$ silently computes a dot product in the 3-D quadratic space. Scale this up and a few cheap kernel evaluations stand in for dot products in infinite-dimensional spaces.

The kernel zoo (these are similarity functions)

| Kernel | Formula | Implied feature space |
| --- | --- | --- |
| Linear | $k(a,b) = a\cdot b$ | the original space (no map) |
| Polynomial | $k(a,b) = (a\cdot b + c)^d$ | all degree-$\le d$ monomials |
| RBF / Gaussian | $k(a,b)=\exp(-\gamma\lVert a-b\rVert^2)$ | infinite-dimensional |

You met linear, polynomial-ish, and RBF as similarities in Chapter 4. Now you know the deeper truth: a valid kernel is exactly a similarity function that corresponds to a dot product in some feature space. That's why the kernel matrix $K_{ij} = k(x_i, x_j)$ shows up everywhere — in SVMs, Gaussian processes, kernel PCA, and the Kernel Temporal Segmentation sister book, whose whole algorithm reads from this matrix.

Don't be confused: a kernel is a similarity, not a distance. For the RBF kernel, $k(a,a) = 1$ is the largest value it can take (a thing is maximally similar to itself), and every valid kernel matrix is symmetric and positive semi-definite. If you have a Euclidean distance d, exp(-gamma * d^2) turns it into a kernel.
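The two properties named above (symmetric, positive semi-definite) can be checked numerically on a small RBF kernel matrix. The helper name `rbf_kernel_matrix`, the random data, and `gamma=0.5` are assumptions made for this sketch.

```python
import numpy as np

def rbf_kernel_matrix(X, gamma=1.0):
    """K[i, j] = exp(-gamma * ||x_i - x_j||^2) for every pair of rows."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    return np.exp(-gamma * np.maximum(d2, 0.0))

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))              # 6 arbitrary points in 3-D
K = rbf_kernel_matrix(X, gamma=0.5)

eigvals = np.linalg.eigvalsh(K)          # eigenvalues of a symmetric matrix
print("symmetric      :", np.allclose(K, K.T))
print("diagonal all 1 :", np.allclose(np.diag(K), 1.0))
print("min eigenvalue :", eigvals.min())
```

A minimum eigenvalue that is non-negative (up to round-off) is what "positive semi-definite" means in practice.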

Meaning 2 — The convolution kernel (a filter)

In CNNs, a kernel (a.k.a. filter) is a small grid of weights — say 3×3 — slid across an image. At each position it multiplies-and-sums the overlapping pixels, detecting a local pattern (an edge, a corner, a texture). A famous hand-built example, the vertical-edge detector:

import numpy as np
kernel = np.array([[-1, 0, 1],
                   [-1, 0, 1],
                   [-1, 0, 1]])   # responds to left-to-right brightness changes

A CNN learns these kernels by gradient descent instead of hand-coding them. Early layers learn edge/color kernels; deeper layers learn eyes, wheels, faces. (More in Chapter 13.)
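To see the "multiply-and-sum at each position" step concretely, the sketch below slides the vertical-edge kernel over a toy image that is dark on the left and bright on the right. The `slide` helper and the 5×6 image are made up for illustration; like CNN layers, it does not flip the kernel.

```python
import numpy as np

kernel = np.array([[-1, 0, 1],
                   [-1, 0, 1],
                   [-1, 0, 1]], float)

# A 5x6 toy image: dark (0) in columns 0-2, bright (1) in columns 3-5
image = np.zeros((5, 6))
image[:, 3:] = 1.0

def slide(image, kernel):
    """Valid-mode sliding window: multiply-and-sum at every position."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

response = slide(image, kernel)
print(response)
```

Every output row reads 0, 3, 3, 0: the filter stays silent over flat regions and fires only where the window straddles the dark-to-bright edge.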

Don't be confused: ML kernel vs. convolution kernel. The SVM kernel is a function $k(a,b)$ returning a similarity; the CNN kernel is a small array of weights. Same word, no relationship. Context: "kernel trick / RBF kernel / kernel matrix" → meaning 1; "3×3 kernel / conv kernel / filter" → meaning 2.

Meaning 3 — The GPU/CUDA kernel (a function on the GPU)

In GPU programming, a kernel is a function compiled to run on the GPU across thousands of threads at once. "We wrote a custom CUDA kernel to make attention faster" means "we wrote a GPU function." FlashAttention is a famous example: a hand-optimized GPU implementation of attention. When an ML-systems engineer says "kernel," they usually mean this (a unit of GPU work), not the SVM concept.

Meaning 4 — The OS kernel (the core of the operating system)

The Linux kernel is the privileged core that manages memory, processes, and hardware. Relevant to AI only operationally (drivers, GPU access, containers). It shares nothing with the other four beyond the name "core/center."

Meaning 5 — The statistics kernel (a weighting window)

In kernel density estimation and smoothing, a kernel is a small bump function (often Gaussian) placed over each data point and summed to estimate a smooth distribution. This is related to meaning 1 — the RBF kernel is the same Gaussian bump — which is why the word overlaps there.
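A minimal one-dimensional sketch of that idea, assuming a Gaussian bump and a hand-picked bandwidth of 0.5 (the sample values and the function name are made up for illustration):

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth=0.5):
    """Average one Gaussian bump per sample, evaluated on `grid`."""
    diffs = (grid[:, None] - samples[None, :]) / bandwidth
    bumps = np.exp(-0.5 * diffs ** 2) / (bandwidth * np.sqrt(2 * np.pi))
    return bumps.mean(axis=1)

samples = np.array([1.0, 1.2, 1.9, 4.0, 4.3])
grid = np.linspace(-3.0, 8.0, 1101)
density = gaussian_kde_1d(samples, grid)

step = grid[1] - grid[0]
print("area under the estimate:", round(float(density.sum() * step), 3))
print("peak location          :", round(float(grid[density.argmax()]), 2))
```

Each bump is a normalized Gaussian, so the averaged curve integrates to about 1, as a density should. The exponent has the same shape as the RBF kernel with $\gamma = 1/(2\,\text{bandwidth}^2)$.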

How to tell which kernel is meant

  • "kernel trick," "RBF kernel," "kernel matrix," "kernel method," SVM → ML kernel (meaning 1).
  • "3×3 kernel," "conv kernel," "filter," CNN → convolution kernel (meaning 2).
  • "CUDA kernel," "GPU kernel," "fused kernel," "launch" → GPU kernel (meaning 3).
  • "Linux kernel," "kernel panic," "kernel space" → OS kernel (meaning 4).
  • "kernel density," "smoothing kernel" → stats kernel (meaning 5).

The takeaway

"Kernel" is five unrelated ideas wearing one name. The two that matter for AI: the ML kernel is a similarity function that's secretly a dot product in a richer space (the kernel trick lets you use that space without building it), and the convolution kernel is a small learnable filter in a CNN. Know the other three exist so you're never thrown. With features and similarities in hand, we can now build actual predictive models — starting with the simplest, the line. 👉

Linear & logistic models

Before neural networks, meet the two models that a large share of practical ML still runs on and that every neural net is built from: linear regression (predict a number) and logistic regression (predict a probability). A neural network is essentially a stack of these with non-linearities between them, so understand these two and the rest is repetition.

The linear model: a weighted sum

Every linear model computes the same thing — a weighted sum of the features plus a bias:

$$ z = w_1 x_1 + w_2 x_2 + \dots + w_d x_d + b = \mathbf{w}\cdot\mathbf{x} + b $$

Each weight $w_i$ says how much feature $i$ pushes the answer up or down; the bias $b$ is the baseline. That dot product is the entire model. What you do with $z$ is what separates regression from classification.

Linear regression — predict a number

Use $z$ directly as the prediction. We already did this in Chapter 1: recover y = 2x + 1. The loss is mean squared error, and there's even a closed-form solution (the normal equations), but gradient descent generalizes to everything, so we lead with it.
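For comparison with gradient descent, here is a sketch of the closed-form route on the same kind of problem. The noisy data is generated here as an assumption, and `np.linalg.lstsq` is used as a numerically stable way to solve the least-squares problem that the normal equations describe.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 5.0, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.1, size=x.shape)   # noisy y = 2x + 1

X = np.column_stack([np.ones_like(x), x])    # column of 1s carries the bias

# One call, no learning rate, no loop
w, *_ = np.linalg.lstsq(X, y, rcond=None)
mse = float(np.mean((X @ w - y) ** 2))

print("bias, slope:", np.round(w, 2))
print("MSE        :", round(mse, 4))
```

The recovered bias and slope land close to 1 and 2; the leftover MSE is roughly the variance of the injected noise.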

Use it for: house prices, demand forecasting, any continuous target — and as a baseline you must beat before reaching for anything fancier.

Logistic regression — predict a probability

For yes/no questions ("will this user click?"), a raw weighted sum is wrong: it ranges over all real numbers, but a probability must live in $[0, 1]$. So we squash $z$ through the sigmoid (logistic) function:

$$ \sigma(z) = \frac{1}{1 + e^{-z}} \qquad p = \sigma(\mathbf{w}\cdot\mathbf{x}+b) $$

import numpy as np
def sigmoid(z): return 1 / (1 + np.exp(-z))
print(np.round(sigmoid(np.array([-4., 0., 4.])), 3))

Output:

[0.018 0.5   0.982]

Sigmoid maps $-\infty\to0$, $0\to0.5$, $+\infty\to1$ — a smooth S-curve turning any score into a probability. Despite the name, logistic regression is a classification model.

A worked example: hours studied → pass/fail

import numpy as np
def sigmoid(z): return 1 / (1 + np.exp(-z))

hours = np.array([1.,2.,3.,4.,5.,6.,7.,8.]).reshape(-1, 1)
y     = np.array([0.,0.,0.,0.,1.,1.,1.,1.])      # passed only with >= 5 hours
X     = np.hstack([np.ones_like(hours), hours])  # column of 1s = the bias term

w = np.zeros(2)
for step in range(20000):                        # gradient descent
    p = sigmoid(X @ w)
    grad = X.T @ (p - y) / len(y)                # gradient of cross-entropy
    w -= 0.1 * grad

print("weights (bias, slope):", np.round(w, 3))
for h in [3.0, 4.5, 6.0]:
    print(f"hours={h} -> P(pass)={sigmoid(w[0] + w[1]*h):.3f}")
print("decision boundary at hours =", round(-w[0]/w[1], 3))

Output:

weights (bias, slope): [-18.223   4.077]
hours=3.0 -> P(pass)=0.002
hours=4.5 -> P(pass)=0.531
hours=6.0 -> P(pass)=0.998
decision boundary at hours = 4.469

The model learned a confident S-curve: 3 hours → basically fail, 6 hours → almost certain pass, and a decision boundary at ~4.47 hours (where $P=0.5$, i.e. $\mathbf{w}\cdot\mathbf{x}+b=0$). Notice the gradient X.T @ (p - y) — the same clean (prediction − target) form we'll see again in backprop.

Don't be confused: logits vs. probabilities. The raw score $z = \mathbf{w}\cdot\mathbf{x}+b$, before the sigmoid, is the logit. It lives in $(-\infty, +\infty)$. After the sigmoid it's a probability in $(0,1)$. Models output logits internally; loss functions like PyTorch's BCEWithLogitsLoss take logits (more numerically stable than applying sigmoid yourself). When someone says "the logits," they mean the pre-activation scores. The inverse of the sigmoid is the logit function $\log\frac{p}{1-p}$, hence the name.
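Two small checks make the box concrete: the logit function undoes the sigmoid, and binary cross-entropy can be computed straight from the logit without ever taking the log of a number near zero. The `bce_with_logits` function below is a NumPy sketch of that idea using the standard rearrangement $\max(z,0) - zy + \log(1+e^{-|z|})$; it is not PyTorch's implementation.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def logit(p):
    return np.log(p / (1 - p))

def bce_with_logits(z, y):
    """Binary cross-entropy from the logit z; stays finite for huge |z|."""
    return np.maximum(z, 0) - z * y + np.log1p(np.exp(-np.abs(z)))

z = np.array([-3.0, -0.5, 0.0, 2.0])
p = sigmoid(z)
print("probabilities:", np.round(p, 3))
print("logit(p)     :", np.round(logit(p), 3))     # recovers z

# Same loss two ways for a true label of 1
naive = -np.log(sigmoid(z))
stable = bce_with_logits(z, 1.0)
print("losses agree :", np.allclose(naive, stable))
print("extreme logit:", bce_with_logits(np.array([1000.0]), 0.0))
```

With a logit of 1000 and a true label of 0, the stable form returns a finite loss of 1000, whereas computing log(1 - sigmoid(z)) directly would hit log(0).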

More than two classes: softmax

For 3+ classes (digit 0–9, which-category), generalize the sigmoid to the softmax: produce one logit per class, then normalize them into a probability distribution that sums to 1.

$$ \text{softmax}(z)_i = \frac{e^{z_i}}{\sum_j e^{z_j}} $$

def softmax(z):
    z = z - np.max(z)          # numerically-stable shift (Chapter 17)
    e = np.exp(z)
    return e / e.sum()

print(np.round(softmax(np.array([2.0, 1.0, 0.1])), 3))

Output:

[0.659 0.242 0.099]

The biggest logit gets the biggest probability, everything sums to 1. Softmax is the final layer of essentially every classifier, including the output of a language model over its vocabulary. (Multi-class logistic regression is also called softmax regression or, confusingly, maximum-entropy classifier.)

Don't be confused: sigmoid vs. softmax. Sigmoid = one output, "yes/no" (binary, or independent multi-label). Softmax = several outputs that compete and sum to 1 (pick one of N mutually-exclusive classes). Two classes with softmax is mathematically the same as one sigmoid.
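The last claim in the box is easy to verify: with two classes, softmax depends only on the difference between the two logits, and that difference fed through a sigmoid gives the same probability. The logit values below are arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

z_yes, z_no = 1.7, -0.4                      # arbitrary example logits
p_softmax = softmax(np.array([z_yes, z_no]))[0]
p_sigmoid = sigmoid(z_yes - z_no)            # only the difference matters
print(round(float(p_softmax), 6), round(float(p_sigmoid), 6))
```

Algebraically, $\frac{e^{a}}{e^{a}+e^{b}} = \frac{1}{1+e^{-(a-b)}}$, which is the sigmoid of $a-b$.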

Why linear models matter even in the deep-learning era

  • They're the baseline. If logistic regression on good features already hits 92%, a neural net that hits 92.3% may not be worth the complexity.
  • They're interpretable. Each weight is a clear "this feature pushes the answer this much" — invaluable in regulated domains (credit, medicine).
  • They're the atom of deep learning. One neural-network "neuron" is activation(w·x + b); with a sigmoid activation that is exactly logistic regression. A layer is many of them; a deep net is layers stacked with non-linearities. You already understand the unit.

The takeaway

Linear models compute a weighted sum (a dot product). Use it raw for regression, through a sigmoid for binary probability, through softmax for multi-class. Logits are the pre-squash scores. These are both the strongest simple baseline and the building block of every neural network. But we keep saying "minimize the loss" — what exactly is a loss function? That's next. 👉

Loss functions

The loss is the single number that says how wrong the model is right now. Training is nothing but "make this number small." Choosing the right loss for your problem is one of the highest-leverage decisions in ML — the loss defines what "good" means, and the model will optimize exactly what you ask for, for better or worse.

Don't be confused: loss vs. metric. A loss is what the model optimizes during training, so it must be differentiable (at least almost everywhere) for gradients to flow. A metric is what you read to judge quality (accuracy, F1, revenue) and can be non-differentiable. You can't gradient-descend on accuracy (it's flat, then jumps), so you train on cross-entropy (the smooth stand-in) and report accuracy. They usually move together but aren't the same number.

Regression losses (predicting a number)

Mean Squared Error (MSE / L2)

Average of the squared errors. Squaring punishes big misses far more than small ones, and makes the loss smooth.

$$ \text{MSE} = \frac{1}{n}\sum_i (\hat{y}_i - y_i)^2 $$

import numpy as np
y    = np.array([3.0, -0.5, 2.0, 7.0])
yhat = np.array([2.5,  0.0, 2.0, 8.0])
print("MSE:", round(float(np.mean((yhat - y) ** 2)), 3))

Output:

MSE: 0.375

Mean Absolute Error (MAE / L1)

Average of the absolute errors. Treats all errors proportionally, so it's robust to outliers — one wildly wrong point doesn't dominate.

$$ \text{MAE} = \frac{1}{n}\sum_i |\hat{y}_i - y_i| $$

print("MAE:", round(float(np.mean(np.abs(yhat - y))), 3))

Output:

MAE: 0.5

Don't be confused: MSE vs. MAE. MSE squares, so a single huge error dominates the loss and the model bends to chase outliers; the constant prediction that minimizes MSE is the mean. The constant that minimizes MAE is the median, so it shrugs off outliers. Noisy data with outliers → lean MAE (or Huber loss, which is MSE near zero and MAE in the tails, a compromise between the two).
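The mean-versus-median claim can be checked by brute force: try many constant predictions on data with one outlier and see which constant each loss prefers. The data, the candidate grid, and the Huber threshold `delta=1.0` are assumptions chosen for illustration.

```python
import numpy as np

data = np.array([1.0, 2.0, 3.0, 4.0, 100.0])      # one outlier
candidates = np.linspace(0.0, 100.0, 10001)        # constant predictions to try

mse = ((candidates[:, None] - data[None, :]) ** 2).mean(axis=1)
mae = np.abs(candidates[:, None] - data[None, :]).mean(axis=1)

best_mse = candidates[mse.argmin()]
best_mae = candidates[mae.argmin()]
print("MSE-optimal constant:", round(float(best_mse), 2), "| mean  :", data.mean())
print("MAE-optimal constant:", round(float(best_mae), 2), "| median:", np.median(data))

def huber(err, delta=1.0):
    """Quadratic for |err| <= delta, linear beyond it."""
    a = np.abs(err)
    return np.where(a <= delta, 0.5 * a ** 2, delta * (a - 0.5 * delta))

print("Huber of errors [0.5, 10]:", huber(np.array([0.5, 10.0])))
```

The outlier drags the MSE-optimal constant up to the mean (22), while the MAE-optimal constant stays at the median (3). Huber charges the small error quadratically and the large one only linearly.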

Classification losses (predicting a category)

Binary cross-entropy (log loss)

For yes/no problems. It rewards confident-correct predictions and savagely punishes confident-wrong ones.

$$ \text{BCE} = -\big[\,y\log p + (1-y)\log(1-p)\,\big] $$

Watch how the loss for a true label of 1 explodes as the prediction gets worse:

import numpy as np
y = 1.0
for p in [0.9, 0.5, 0.1]:
    bce = -(y*np.log(p) + (1-y)*np.log(1-p))
    print(f"true=1  pred={p}  loss={bce:.3f}")

Output:

true=1  pred=0.9  loss=0.105
true=1  pred=0.5  loss=0.693
true=1  pred=0.1  loss=2.303

Predict 0.9 when the truth is 1 → tiny loss (0.105). Predict 0.1 → big loss (2.303). And 0.5 ("I have no idea") → exactly $\ln 2 = 0.693$, a number you'll recognize everywhere as "the loss of a coin flip." This asymmetry — confidence must be earned — is why cross-entropy beats squared error for classification.

Categorical cross-entropy

The multi-class version: take the model's softmax probability for the true class and penalize $-\log$ of it. Only the true class's probability matters.

$$ \text{CE} = -\log p_{\text{true class}} $$

probs_good = np.array([0.7, 0.2, 0.1])   # confident in the right class (0)
probs_bad  = np.array([0.1, 0.2, 0.7])   # confident in the WRONG class
print("CE (true class 0), good model:", round(float(-np.log(probs_good[0])), 3))
print("CE (true class 0), bad model :", round(float(-np.log(probs_bad[0])),  3))

Output:

CE (true class 0), good model: 0.357
CE (true class 0), bad model : 2.303

This is the loss behind virtually every classifier and every language model (which is just a classifier over the vocabulary at each position). Related term: perplexity = $e^{\text{CE}}$, the standard way to report language-model quality.
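A small sketch of the perplexity relation, using made-up probabilities for the correct token at four positions. The cross-entropy is averaged over positions (in nats) before exponentiating.

```python
import numpy as np

# Hypothetical probabilities a model assigned to the correct token at 4 positions
p_true = np.array([0.5, 0.25, 0.125, 0.5])

ce = -np.log(p_true).mean()        # average cross-entropy, in nats
perplexity = np.exp(ce)
print("mean CE   :", round(float(ce), 3))
print("perplexity:", round(float(perplexity), 3))

# Sanity check: uniform guessing over V options has perplexity V
V = 10
uniform_ce = -np.log(np.full(5, 1.0 / V)).mean()
print("uniform over 10:", round(float(np.exp(uniform_ce)), 3))
```

One way to read the number: a perplexity of k means the model is, on average, as uncertain as if it were choosing uniformly among k options.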

Don't be confused: "cross-entropy" vs. "log loss" vs. "negative log likelihood" (NLL). For classification these are the same loss under different names. Cross-entropy = NLL of a softmax/sigmoid model = log loss. Don't let the three names fool you into thinking they're three things.

Hinge loss (the SVM loss)

Used by support vector machines. It only cares that the correct class beats the others by a margin of at least 1; once you're safely correct, zero loss.

$$ \text{hinge} = \max(0,\; 1 - \text{margin}) $$

def hinge(margin): return max(0.0, 1.0 - margin)
for m in [2.0, 1.0, 0.5, -1.0]:
    print(f"margin={m:>4}  hinge={hinge(m)}")

Output:

margin= 2.0  hinge=0.0
margin= 1.0  hinge=0.0
margin= 0.5  hinge=0.5
margin=-1.0  hinge=2.0

Don't be confused: hinge vs. cross-entropy. Once a point is correct with margin ≥ 1, hinge loss is exactly 0 and the model stops caring about it. Cross-entropy is never exactly 0; it always nudges toward more confidence. Hinge → "be right by a margin, then relax"; cross-entropy → "always grow confidence." Hinge gives SVMs their sparse, support-vector behavior.

How to choose

| Problem | Default loss |
| --- | --- |
| Predict a number | MSE (→ MAE/Huber if outliers) |
| Yes/no | binary cross-entropy (on logits) |
| Pick 1 of N classes | categorical cross-entropy (softmax) |
| Multi-label (several can be true) | per-label binary cross-entropy |
| Max-margin classifier | hinge |
| Rankings / recommendations | pairwise (BPR, triplet); see the recsys book |

The takeaway

The loss is the scalar training minimizes, and it encodes your goal: MSE/MAE for numbers, cross-entropy for categories, hinge for margins. Keep loss (smooth, for optimizing) separate from metric (what you report). Now the central question: given a loss, how does the model actually find the parameters that minimize it? Enter gradient descent. 👉

Gradient descent & backprop

We keep saying "nudge the parameters to shrink the loss." Gradient descent is how: it is the optimization engine under all of deep learning. And backpropagation is the trick that computes the nudges efficiently for networks with millions of parameters. Understand these two and you understand how nearly every neural network gets trained.

The gradient: which way is downhill

Picture the loss as a landscape: parameters are your position, loss is your altitude, and you want the valley floor. The gradient is the vector of partial derivatives — it points in the direction of steepest increase. So to go down, step in the opposite direction:

$$ \theta \leftarrow \theta - \eta\,\nabla_\theta L $$

where $\eta$ ("eta") is the learning rate — how big a step to take. That one line is gradient descent.

The simplest possible descent

Minimize $f(x) = (x-3)^2$, whose minimum is obviously at $x=3$. Its derivative is $f'(x) = 2(x-3)$. Start at 0 and walk downhill:

x, lr = 0.0, 0.1
for step in range(40):
    grad = 2 * (x - 3)        # the gradient (slope) at the current x
    x -= lr * grad            # step opposite the gradient
    if step in (0, 5, 20, 39):
        print(f"step {step:2d}  x={x:.4f}  f(x)={(x-3)**2:.4f}")

Output:

step  0  x=0.6000  f(x)=5.7600
step  5  x=2.2136  f(x)=0.6185
step 20  x=2.9723  f(x)=0.0008
step 39  x=2.9996  f(x)=0.0000

Starting clueless at 0, it slid down to $x \approx 3$ — exactly the loop from Chapter 1, but isolated. Real models do this in millions of dimensions at once, but the move is identical: compute gradient, step against it.

The learning rate: the one knob you must get right

Too small and training crawls. Too big and it diverges — overshooting the valley and bouncing to ever-worse losses:

x, lr = 0.0, 1.1            # too large for this problem
for step in range(6):
    grad = 2 * (x - 3); x -= lr * grad
    print(f"step {step}  x={x:.3f}  f={(x-3)**2:.3f}")

Output:

step 0  x=6.600  f=12.960
step 1  x=-1.320  f=18.662
step 2  x=8.184  f=26.874
step 3  x=-3.221  f=38.698
step 4  x=10.465  f=55.726
step 5  x=-5.958  f=80.245

The loss grows: the classic signature of a learning rate that's too high. If your training loss explodes to NaN, lower the learning rate first. This single hyperparameter is one of the most common causes of failed training runs.
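For this particular $f$, each update multiplies the distance to the minimum by $(1 - 2\,\text{lr})$, so the walk converges only when $0 < \text{lr} < 1$. The sweep below (the `run` helper and the chosen rates are illustrative) shows the crawl, the sweet spot, the stuck oscillation at exactly 1.0, and the divergence beyond it.

```python
def run(lr, steps=50, x=0.0):
    """Run plain gradient descent on f(x) = (x - 3)^2 and return the final x."""
    for _ in range(steps):
        x -= lr * 2 * (x - 3)
    return x

for lr in [0.001, 0.1, 0.5, 0.99, 1.0, 1.1]:
    x = run(lr)
    print(f"lr={lr:<5}  final x={x:12.4f}  f(x)={(x - 3) ** 2:.4g}")
```

In real models the safe range depends on the curvature of the loss, which is why the learning rate has to be tuned per problem rather than copied from a formula.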

The three flavors: batch, stochastic, mini-batch

The gradient is an average over examples. How many you average per step gives three variants:

| Variant | Examples per step | Trade-off |
| --- | --- | --- |
| Batch GD | all of them | exact gradient, but slow and memory-hungry |
| Stochastic GD (SGD) | one | noisy, fast, the noise can help escape bad spots |
| Mini-batch | a small chunk (32–512) | the practical default: fast and stable |

Mini-batch SGD is what everyone actually uses. One full pass over the dataset is an epoch; you train for many epochs.
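A minimal sketch of the mini-batch loop on a linear-regression problem. The synthetic data, the batch size of 32, the learning rate, and the epoch count are all assumptions chosen so the example runs in a moment.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=200)
y = 2.0 * x + 1.0 + rng.normal(scale=0.05, size=x.shape)   # noisy y = 2x + 1
X = np.column_stack([np.ones_like(x), x])

w = np.zeros(2)
lr, batch_size, epochs = 0.1, 32, 50

for epoch in range(epochs):
    order = rng.permutation(len(y))                  # reshuffle every epoch
    for start in range(0, len(y), batch_size):
        idx = order[start:start + batch_size]        # one mini-batch
        err = X[idx] @ w - y[idx]
        grad = 2.0 * X[idx].T @ err / len(idx)       # gradient of the batch MSE
        w -= lr * grad

print("bias, slope:", np.round(w, 2))
```

Each inner iteration is one step; each pass of the outer loop is one epoch. Setting `batch_size` to `len(y)` turns this into batch GD, and setting it to 1 gives pure SGD.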

Better optimizers: momentum and Adam

Plain SGD can zig-zag and crawl. Two upgrades dominate practice:

  • Momentum — accumulate a running average of past gradients, like a ball rolling downhill gaining speed. Smooths the zig-zag.
  • Adam — momentum plus a per-parameter adaptive learning rate (it divides by a running estimate of each gradient's magnitude). Adam is the default optimizer for deep learning — robust, fast, forgiving of learning-rate choice. (Variant: AdamW, the standard for training transformers.)

You rarely implement these — you write torch.optim.Adam(model.parameters(), lr=...) and move on (Chapter 12).
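For intuition only, the two update rules can be written out by hand on the same one-dimensional problem, $f(x) = (x-3)^2$. This is a sketch of the standard formulas with commonly used default coefficients, not the code of any particular library; the step counts are arbitrary.

```python
import numpy as np

def grad(x):
    return 2.0 * (x - 3.0)            # gradient of f(x) = (x - 3)^2

# SGD with momentum
x, v = 0.0, 0.0
lr, beta = 0.1, 0.9
for _ in range(200):
    v = beta * v + grad(x)            # running accumulation of past gradients
    x -= lr * v
x_momentum = x

# Adam
x, m, s = 0.0, 0.0, 0.0
lr, b1, b2, eps = 0.1, 0.9, 0.999, 1e-8
for t in range(1, 501):
    g = grad(x)
    m = b1 * m + (1 - b1) * g         # first moment: smoothed gradient
    s = b2 * s + (1 - b2) * g * g     # second moment: smoothed squared gradient
    m_hat = m / (1 - b1 ** t)         # bias correction for the zero start
    s_hat = s / (1 - b2 ** t)
    x -= lr * m_hat / (np.sqrt(s_hat) + eps)
x_adam = x

print("momentum:", round(float(x_momentum), 4))
print("adam    :", round(float(x_adam), 4))
```

Both end up near $x = 3$. The division by `np.sqrt(s_hat)` is the per-parameter adaptive step described above: parameters with consistently large gradients take proportionally smaller steps.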

Backpropagation: the chain rule, organized

A neural net is a deep composition of functions. To update a weight buried in layer 1, you need $\partial L / \partial w$ through every layer above it. The chain rule from calculus says: multiply the local derivatives along the path. Backpropagation is just the bookkeeping that does this efficiently — one forward pass to compute outputs, then one backward pass that pushes the error from the loss back to every parameter, reusing intermediate results.

The key insight that makes it cheap: compute gradients layer by layer from the output backward, reusing the gradient already computed for the layer above. Naively recomputing each path would be exponential; backprop is linear. We hand-derive and hand-code the entire backward pass in Chapter 11.
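A tiny two-layer example shows the pattern before the full treatment: one forward pass that stores intermediates, then one backward pass that multiplies local derivatives and reuses the gradient from the layer above. The network (one tanh unit feeding one linear unit) and all values are made up for illustration.

```python
import numpy as np

def forward(w1, w2, x, y):
    z = w1 * x              # layer 1 pre-activation
    h = np.tanh(z)          # layer 1 output
    yhat = w2 * h           # layer 2 output
    loss = (yhat - y) ** 2
    return z, h, yhat, loss

x, y = 0.5, 1.0
w1, w2 = 0.8, -1.2

# Forward pass: keep the intermediates, the backward pass reuses them
z, h, yhat, loss = forward(w1, w2, x, y)

# Backward pass: start at the loss, multiply local derivatives going back
dL_dyhat = 2.0 * (yhat - y)
dL_dw2 = dL_dyhat * h                  # because yhat = w2 * h
dL_dh = dL_dyhat * w2                  # reused for everything below layer 2
dL_dz = dL_dh * (1.0 - h ** 2)         # d tanh(z)/dz = 1 - tanh(z)^2
dL_dw1 = dL_dz * x                     # because z = w1 * x

# Numerical check with central differences
eps = 1e-6
num_w1 = (forward(w1 + eps, w2, x, y)[3] - forward(w1 - eps, w2, x, y)[3]) / (2 * eps)
num_w2 = (forward(w1, w2 + eps, x, y)[3] - forward(w1, w2 - eps, x, y)[3]) / (2 * eps)
print("backprop:", dL_dw1, dL_dw2)
print("numeric :", num_w1, num_w2)
```

Note that `dL_dh` is computed once and then feeds the gradient for `w1`; in a deeper network that reuse happens at every layer, which is where the linear cost comes from.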

Don't be confused: backprop vs. gradient descent vs. autograd.

  • Backprop computes the gradients (applies the chain rule efficiently).
  • Gradient descent uses them to update parameters ($\theta \mathrel{-}= \eta\,\text{grad}$).
  • Autograd is the software (Chapter 12) that does backprop automatically so you never write it by hand.

They're three different things people blur together. Backprop = compute, gradient descent = update, autograd = the tool.

Trust but verify: the numerical gradient check

How do you know a hand-derived gradient is correct? Compare it to a numerical estimate from the definition of a derivative, $\frac{f(x+\epsilon) - f(x-\epsilon)}{2\epsilon}$. If they match, your analytic gradient is right. This gradient check has saved countless from-scratch implementations:

import numpy as np
def f(v): return v[0]**2 + 3*v[0]*v[1]      # df/dx = 2x+3y, df/dy = 3x
v = np.array([2.0, 1.0])
analytic = np.array([2*v[0] + 3*v[1], 3*v[0]])

eps, num = 1e-6, np.zeros(2)
for i in range(2):
    d = np.zeros(2); d[i] = eps
    num[i] = (f(v + d) - f(v - d)) / (2 * eps)   # central difference
print("analytic:", analytic, " numeric:", np.round(num, 5))

Output:

analytic: [7. 6.]  numeric: [7. 6.]

They match to many decimals → the analytic gradient is correct. (This trick is too slow for training — it's $O(\text{params})$ forward passes — but perfect for checking a small implementation.)

Local minima, saddle points, and why it works anyway

The loss landscape of a deep network is wildly non-convex — full of valleys and saddle points. In theory gradient descent could get stuck; in practice, in very high dimensions, good-enough minima are everywhere and the noise of mini-batch SGD helps slide past bad spots. You don't need the global optimum, just a low valley that generalizes. That empirical fact is why deep learning works at all.

The takeaway

Gradient descent walks parameters downhill: θ -= lr * gradient. The learning rate is the make-or-break knob (too big → diverge). Mini-batch SGD with Adam is the default. Backprop is the chain rule applied efficiently to get the gradients; autograd automates it; a numerical gradient check verifies it. We now have the full training engine — time to confront its dark twin: overfitting. 👉

Overfitting & regularization

A model that scores 100% on its training data may be worthless on new data. This chapter is about the single most important idea separating "got a number" from "built something that works": generalization — performing well on data you've never seen. Everything here exists to fight one enemy: overfitting.

Memorizing vs. learning

Overfitting is when a model memorizes the noise and quirks of the training set instead of the underlying pattern. It looks brilliant on training data and fails on new data. Its opposite, underfitting, is when the model is too simple to capture the real pattern and fails on both.

Let's watch it happen. We fit polynomials of increasing degree to 12 noisy points from a near-linear truth, and measure error on both the training points and a held-out test set:

import numpy as np
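# fit polynomials of various degree to noisy near-linear data
# xtr/ytr = the 12 training points, xte/yte = the held-out test set;
# fit_poly and mse are helper functions not shown here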
print("deg | train_mse | test_mse")
for deg in [1, 3, 9, 11]:
    w = fit_poly(xtr, ytr, deg)          # least-squares polynomial fit
    print(f"{deg:3d} | {mse(w, xtr, ytr):.5f}  | {mse(w, xte, yte):.5f}")

Output:

deg | train_mse | test_mse
  1 | 0.00452  | 0.00039
  3 | 0.00408  | 0.00085
  9 | 0.00001  | 0.00541
 11 | 0.00000  | 0.00871

Read the two columns against each other — this is the whole lesson:

  • Degree 1 (a line): the largest train error of the four, but the smallest test error. It matches the true near-linear pattern.
  • Degree 11: train error is essentially zero (it threads every noisy point perfectly), but test error is the worst of all (about 22× the line's). It memorized the noise.

The tell of overfitting: training error keeps dropping while validation error starts rising. The gap between them is the overfitting. A model is only as good as its validation number, never its training number.
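The polynomial experiment relies on helpers and data that are not listed. One plausible version is sketched below; the helper bodies, the underlying line, the noise level, and the random seed are all assumptions, so the printed numbers will differ from the table above even though the pattern should be similar.

```python
import numpy as np

def fit_poly(x, y, deg):
    """Least-squares polynomial fit; returns coefficients, highest power first."""
    A = np.vander(x, deg + 1)                 # columns: x^deg, ..., x, 1
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return w

def mse(w, x, y):
    return float(np.mean((np.polyval(w, x) - y) ** 2))

# Assumed data: 12 noisy training points from a near-linear truth
rng = np.random.default_rng(0)
truth = lambda x: 0.5 * x + 0.1
xtr = np.sort(rng.uniform(-1.0, 1.0, 12))
ytr = truth(xtr) + rng.normal(scale=0.05, size=xtr.shape)
xte = np.linspace(-1.0, 1.0, 200)
yte = truth(xte) + rng.normal(scale=0.05, size=xte.shape)

print("deg | train_mse | test_mse")
for deg in [1, 3, 9, 11]:
    w = fit_poly(xtr, ytr, deg)
    print(f"{deg:3d} | {mse(w, xtr, ytr):.5f}  | {mse(w, xte, yte):.5f}")
```

With 12 points, a degree-11 polynomial has exactly enough coefficients to pass through every training point, which is why its training error collapses toward zero.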

The bias–variance tradeoff

This tug-of-war has a name. Total error decomposes into:

  • Bias — error from wrong assumptions / too simple a model (→ underfitting). The degree-1 fit on a wiggly truth would have high bias.
  • Variance — error from being too sensitive to the particular training sample (→ overfitting). The degree-11 fit has huge variance: a different 12 points would give a wildly different curve.

$$ \text{Test error} \approx \text{bias}^2 + \text{variance} + \text{irreducible noise} $$

You can't drive both to zero; you find the sweet spot. More model capacity ⇒ less bias, more variance. Everything below is about buying low variance without paying too much bias.

The non-negotiable: train / validation / test split

You cannot judge a model on data it trained on. Split your data three ways:

  • Training set (~70%) — the model learns its parameters here.
  • Validation set (~15%) — you tune hyperparameters (degree, learning rate, regularization strength) by checking here.
  • Test set (~15%) — touched once, at the very end, for an honest final estimate. If you tune against it, it's no longer honest.

Don't be confused: validation vs. test set. Validation is for choosing (you look at it many times while tuning). Test is for reporting (you look once, at the end). Reusing the test set to make decisions silently leaks it into training and inflates your reported number — one of the most common ways teams fool themselves.

Cross-validation squeezes more out of small datasets: split into $k$ folds, train on $k-1$ and validate on the held-out fold, rotate, average. k-fold CV is the gold standard when data is scarce.
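The fold-rotate-average procedure can be sketched in a few lines. The helper names, the synthetic data, and the use of `np.polyfit` as the model being tuned are assumptions for illustration.

```python
import numpy as np

def k_fold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and split into k nearly equal folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n), k)

def cross_val_mse(x, y, deg, k=5):
    folds = k_fold_indices(len(x), k)
    scores = []
    for i, val_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        w = np.polyfit(x[train_idx], y[train_idx], deg)      # fit on k-1 folds
        pred = np.polyval(w, x[val_idx])                     # score on the held-out fold
        scores.append(np.mean((pred - y[val_idx]) ** 2))
    return float(np.mean(scores))

rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, 40)
y = 0.5 * x + 0.1 + rng.normal(scale=0.05, size=x.shape)

for deg in [1, 3, 9]:
    print(f"degree {deg}: CV mse = {cross_val_mse(x, y, deg):.5f}")
```

Every point is used for validation exactly once and for training k-1 times, so the averaged score is less sensitive to one lucky or unlucky split.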

Regularization: penalize complexity

The cure for overfitting is to discourage the model from using big, wiggly weights. Regularization adds a penalty on weight size to the loss:

$$ L_{\text{total}} = \underbrace{L_{\text{data}}}_{\text{fit the data}} + \lambda\,\underbrace{\lVert w\rVert}_{\text{stay simple}} $$

  • L2 / Ridge (penalize $\sum w_i^2$) — shrinks all weights smoothly toward zero. The default. (In deep learning this is called weight decay.)
  • L1 / Lasso (penalize $\sum |w_i|$) — drives many weights exactly to zero, doing automatic feature selection (a sparse model).

$\lambda$ (lambda) is the regularization strength — a hyperparameter you tune on validation. Watch L2 rescue the hopeless degree-11 model:

# same degree-11 polynomial, now with an L2 (ridge) penalty of strength lambda
# (fit_ridge is a helper not shown here; data and mse as in the previous snippet)
for lam in [0.0, 1e-3, 1e-1]:
    w = fit_ridge(xtr, ytr, deg=11, lam=lam)
    print(f"lambda={lam:<6} train={mse(w,xtr,ytr):.5f} test={mse(w,xte,yte):.5f}")

Output:

lambda=0.0    train=0.00000 test=0.00871
lambda=0.001  train=0.00081 test=0.00342
lambda=0.1    train=0.00325 test=0.00190

As $\lambda$ rises, training error gets worse (good — we stop memorizing noise) and test error improves 4.5×. That's regularization working: trade a little training fit for a lot of generalization.
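A possible shape for the ridge helper is sketched below, using the closed form $(A^\top A + \lambda I)^{-1} A^\top y$ with a squared-L2 penalty. The helper body, the data, and the choice to penalize the constant term along with the others are simplifying assumptions, so the numbers will not match the table above.

```python
import numpy as np

def fit_ridge(x, y, deg, lam):
    """Polynomial least squares with an L2 penalty lam * ||w||^2 on the weights."""
    A = np.vander(x, deg + 1)                      # columns: x^deg, ..., x, 1
    if lam == 0.0:
        w, *_ = np.linalg.lstsq(A, y, rcond=None)  # plain least squares
        return w
    I = np.eye(deg + 1)
    return np.linalg.solve(A.T @ A + lam * I, A.T @ y)   # ridge closed form

def mse(w, x, y):
    return float(np.mean((np.polyval(w, x) - y) ** 2))

rng = np.random.default_rng(0)
xtr = np.sort(rng.uniform(-1.0, 1.0, 12))
ytr = 0.5 * xtr + 0.1 + rng.normal(scale=0.05, size=xtr.shape)

for lam in [0.0, 1e-3, 1e-1]:
    w = fit_ridge(xtr, ytr, deg=11, lam=lam)
    print(f"lambda={lam:<6} train={mse(w, xtr, ytr):.5f}  ||w||={np.linalg.norm(w):.3f}")
```

The weight norm column shows the mechanism directly: a larger lambda forces smaller coefficients, which is what tames the wiggles.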

Don't be confused: L1 vs. L2. L2 (ridge) shrinks weights toward zero but rarely to exactly zero — keeps all features, just small. L1 (lasso) zeroes weights out entirely — picks a subset of features. Want a sparse, interpretable model? L1. Want smooth shrinkage and stable training? L2. Elastic Net mixes both.

The deep-learning regularization toolkit

Neural nets have their own anti-overfitting arsenal — you'll set these as hyperparameters:

  • Dropout — randomly zero out a fraction of neurons each training step, forcing redundancy so no single neuron is load-bearing. Turned off at inference.
  • Early stopping — watch validation loss; stop the moment it starts rising. Simple and remarkably effective.
  • Weight decay — L2, by another name.
  • Data augmentation — manufacture more training variety (flip/crop images, paraphrase text). More effective data ⇒ less overfitting.
  • Batch/Layer normalization — stabilizes training and mildly regularizes.
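Early stopping from the list above is simple enough to sketch in full. In practice validation loss is noisy, so a common refinement is to wait a few epochs (a "patience" setting) before stopping; the function name, the patience value, and the loss values below are made up for illustration.

```python
def early_stop_epoch(val_losses, patience=2):
    """Return (best_epoch, stop_epoch) for a sequence of validation losses."""
    best, best_epoch, waited = float("inf"), -1, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch, waited = loss, epoch, 0   # new best: reset the counter
        else:
            waited += 1
            if waited >= patience:                       # no improvement for too long
                return best_epoch, epoch
    return best_epoch, len(val_losses) - 1

val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.51, 0.56, 0.60]  # made-up numbers
best_epoch, stop_epoch = early_stop_epoch(val_losses, patience=2)
print("best epoch:", best_epoch, "| stopped at epoch:", stop_epoch)
```

Training halts at epoch 5, and the weights saved at epoch 3 (the best validation loss) are the ones you would keep.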

Don't be confused: dropout at train vs. test. Dropout is active during training (randomly drops neurons) and disabled during inference (uses the full network). This is exactly why you must call model.eval() in PyTorch before predicting — forgetting it leaves dropout/batchnorm in training mode and quietly corrupts your results (Chapter 12).
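The train/inference difference can be seen in a NumPy sketch of "inverted" dropout, the common variant that rescales the surviving units during training so nothing needs to change at inference. The function and its arguments are illustrative, not PyTorch's API.

```python
import numpy as np

def dropout(x, p_drop, training, rng):
    """Inverted dropout: zero units with probability p_drop, rescale survivors."""
    if not training or p_drop == 0.0:
        return x                                   # inference: identity
    keep = rng.random(x.shape) >= p_drop           # True where the unit survives
    return x * keep / (1.0 - p_drop)               # rescale so the mean is unchanged

rng = np.random.default_rng(0)
activations = np.ones(10000)

train_out = dropout(activations, p_drop=0.5, training=True, rng=rng)
eval_out = dropout(activations, p_drop=0.5, training=False, rng=rng)

print("train: fraction zeroed =", np.mean(train_out == 0.0))
print("train: mean activation =", round(float(train_out.mean()), 3))
print("eval : unchanged       =", np.array_equal(eval_out, activations))
```

The `training` flag plays the role that model.train() and model.eval() play in PyTorch: the same layer behaves differently depending on the mode.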

The takeaway

A model is judged on data it never trained on. Overfitting = memorizing noise (low train error, high validation error); underfitting = too simple (high both). Balance them via the bias–variance tradeoff. Always split train/validation/test and never tune on test. Fight overfitting with regularization (L1/L2), dropout, early stopping, and more data. Now: how do we measure "good" precisely? Metrics. 👉

Evaluation metrics

A model is only as trustworthy as the number you judge it by — and accuracy, the metric everyone reaches for first, is often a trap. This chapter gives you the metrics that actually tell the truth, when to use each, and the confusions that make people ship broken models thinking they're great.

Why accuracy lies

Accuracy = fraction of predictions that are correct. Sounds perfect. Now imagine a fraud detector where 5% of transactions are fraud. A model that predicts "not fraud" for everything — a model that does literally nothing — scores:

import numpy as np
y = np.array([1]*5 + [0]*95)         # 5 fraud, 95 legit
predict_all_negative = np.zeros(100, dtype=int)
print("accuracy:", (predict_all_negative == y).mean())

Output:

accuracy: 0.95

95% accuracy for a useless model. On imbalanced data — fraud, disease, click-through, anomalies — accuracy is actively misleading. You need metrics that look at which errors happen.

The confusion matrix: the source of all classification metrics

Every binary prediction falls into one of four buckets. The 2×2 grid of their counts is the confusion matrix, and every other metric is just arithmetic on these four numbers:

|                   | Predicted Positive  | Predicted Negative  |
|-------------------|---------------------|---------------------|
| Actually Positive | True Positive (TP)  | False Negative (FN) |
| Actually Negative | False Positive (FP) | True Negative (TN)  |

  • TP — caught a real positive ✅
  • FP — false alarm (predicted positive, was negative) — a Type I error
  • FN — missed it (predicted negative, was positive) — a Type II error
  • TN — correctly left a negative alone ✅

y_true = np.array([1,1,1,1, 0,0,0,0,0,0])
y_pred = np.array([1,1,0,0, 1,0,0,0,0,0])
TP = int(((y_pred==1) & (y_true==1)).sum())
FP = int(((y_pred==1) & (y_true==0)).sum())
FN = int(((y_pred==0) & (y_true==1)).sum())
TN = int(((y_pred==0) & (y_true==0)).sum())
print(f"TP={TP} FP={FP} FN={FN} TN={TN}")

Output:

TP=2 FP=1 FN=2 TN=5

Precision and recall: the two questions that matter

From those four numbers come the two metrics you'll quote constantly:

$$ \text{Precision} = \frac{TP}{TP+FP} \qquad \text{Recall} = \frac{TP}{TP+FN} $$

  • Precision — "Of everything I flagged, how much was right?" Punishes false alarms (FP).
  • Recall (sensitivity) — "Of everything I should have caught, how much did I?" Punishes misses (FN).

precision = TP / (TP + FP)
recall    = TP / (TP + FN)
f1        = 2 * precision * recall / (precision + recall)
print(f"precision={precision:.3f}  recall={recall:.3f}  f1={f1:.3f}")
print(f"accuracy={(y_pred==y_true).mean():.3f}")

Output:

precision=0.667  recall=0.500  f1=0.571
accuracy=0.700

Don't be confused: precision vs. recall — and which to optimize. They trade off. Spam filter: a false positive deletes a real email — you want high precision (be sure before flagging). Cancer screening: a false negative misses a sick patient — you want high recall (never miss one, tolerate false alarms). Decide which error is more expensive in your domain, then optimize that metric. You can always trade one for the other by moving the decision threshold.
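The threshold trade-off is easy to see on a small example. This is a sketch with made-up scores (the `scores` array and the three thresholds are assumptions chosen for illustration); the labels match the ten-example toy set used above.

```python
import numpy as np

y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
# Hypothetical model scores, one per example (higher = "more likely positive").
scores = np.array([0.9, 0.8, 0.45, 0.3, 0.6, 0.4, 0.35, 0.2, 0.1, 0.05])

def precision_recall(y_true, scores, threshold):
    """Turn scores into 0/1 predictions at `threshold`, then count TP/FP/FN."""
    y_pred = (scores >= threshold).astype(int)
    tp = int(((y_pred == 1) & (y_true == 1)).sum())
    fp = int(((y_pred == 1) & (y_true == 0)).sum())
    fn = int(((y_pred == 0) & (y_true == 1)).sum())
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

for t in (0.25, 0.5, 0.85):
    p, r = precision_recall(y_true, scores, t)
    print(f"threshold={t:.2f}  precision={p:.3f}  recall={r:.3f}")
```

With these scores, raising the threshold from 0.25 to 0.85 lifts precision from 0.571 to 1.000 while recall falls from 1.000 to 0.250. The model never changed; only the cutoff did.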

F1: one number when you want balance

The F1 score is the harmonic mean of precision and recall — high only when both are high (a model can't game it by maxing one).

$$ F_1 = 2\cdot\frac{\text{precision}\cdot\text{recall}}{\text{precision}+\text{recall}} $$

Why harmonic, not plain average? The harmonic mean is dragged down by the smaller value. Precision 1.0, recall 0.0 → plain average 0.5 (looks okay!), but F1 = 0.0 (correctly says "useless"). That's the whole point. For imbalanced data, F1 is a far better headline than accuracy.

ROC-AUC: quality across all thresholds

Precision/recall depend on where you set the decision threshold (default 0.5). The ROC curve plots true-positive rate vs. false-positive rate as you sweep the threshold from 0 to 1, and AUC (Area Under the Curve) summarizes it in one number:

  • AUC = 1.0 → perfect ranking. AUC = 0.5 → no better than random.
  • Interpretation: the probability that the model scores a random positive higher than a random negative.

Use AUC to compare models independent of threshold, especially on imbalanced data. (Related: PR-AUC, area under the precision–recall curve, which is even more informative when positives are very rare.)

Don't be confused: threshold-dependent vs. threshold-free metrics. Precision, recall, F1, and accuracy all depend on the decision cutoff (0.5 by default) — change the threshold and they change. AUC is threshold-free: it measures how well the model ranks positives above negatives, regardless of cutoff. Report AUC to judge the model; pick a threshold afterward to deploy it.
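The "random positive scores higher than a random negative" reading of AUC can be computed directly by comparing every positive with every negative. The sketch below does exactly that; the helper name `auc_by_pairs` and the scores are assumptions for illustration. It is quadratic in the number of examples, so treat it as a way to understand the number, not as a production implementation.

```python
import numpy as np

def auc_by_pairs(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]          # every positive vs. every negative
    return float(((diff > 0) + 0.5 * (diff == 0)).mean())

y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
scores = np.array([0.9, 0.8, 0.45, 0.3, 0.6, 0.4, 0.35, 0.2, 0.1, 0.05])  # made up

print("AUC            :", round(auc_by_pairs(y_true, scores), 3))
# Cubing the scores changes every value but not their order, so AUC is unchanged.
print("AUC (scores^3) :", round(auc_by_pairs(y_true, scores ** 3), 3))
```

Of the 24 positive/negative pairs, 20 are ordered correctly, so both lines print 0.833. The second line is the point: AUC only sees the ranking, which is why no threshold appears anywhere in the code.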

Regression metrics (predicting numbers)

For continuous targets, the standard trio:

yt = np.array([3.0, -0.5, 2.0, 7.0])
yp = np.array([2.5,  0.0, 2.0, 8.0])
mae  = np.mean(np.abs(yp - yt))
rmse = np.sqrt(np.mean((yp - yt) ** 2))
r2   = 1 - np.sum((yt-yp)**2) / np.sum((yt - yt.mean())**2)
print(f"MAE={mae:.3f}  RMSE={rmse:.3f}  R2={r2:.3f}")

Output:

MAE=0.500  RMSE=0.612  R2=0.949

  • MAE — average absolute error, in the target's own units. Robust, easy to explain ("off by $0.50 on average").
  • RMSE — root mean squared error; same units, but penalizes big misses more (it's the square root of MSE). Use when large errors are especially bad.
  • R² (coefficient of determination) — fraction of variance explained. It is at most 1 and can go negative for a model that does worse than always guessing the mean. R² = 0.949 means the model explains 95% of the variation. R² = 0 means "no better than always guessing the mean."

Don't be confused: RMSE vs. MAE (again, but for reporting). RMSE ≥ MAE always, and the gap between them signals the presence of large outlier errors. If RMSE ≫ MAE, a few predictions are badly wrong. Reporting both is more informative than either alone.
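A two-case sketch shows the gap in action. Both error vectors below are invented so that they share the same MAE; only the second contains one large miss.

```python
import numpy as np

def mae(err):
    return float(np.mean(np.abs(err)))

def rmse(err):
    return float(np.sqrt(np.mean(err ** 2)))

even = np.array([1.0, -1.0, 1.0, -1.0])      # every prediction off by 1
outlier = np.array([0.0, 0.0, 0.0, 4.0])     # three perfect, one off by 4

for name, err in [("even", even), ("outlier", outlier)]:
    print(f"{name:8s} MAE={mae(err):.3f}  RMSE={rmse(err):.3f}")
```

Both cases have MAE = 1.000, but RMSE is 1.000 for the even errors and 2.000 for the outlier case. Equal MAE with a larger RMSE is the signature of a few large errors.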

A metric-picking cheat sheet

| Situation | Use |
|---|---|
| Balanced classification | accuracy is fine |
| Imbalanced classification | F1, precision/recall, PR-AUC — not accuracy |
| Compare classifiers, any threshold | ROC-AUC |
| Cost of FP ≠ cost of FN | tune the threshold, optimize the costly metric |
| Regression, explainable | MAE |
| Regression, punish big errors | RMSE |
| Regression, "how much variance explained" | R² |
| Ranking / recommendations | Precision@k, Recall@k, NDCG, MAP (see recsys book) |

The one rule above all

Pick your metric before you train, and match it to the real-world cost of each error. The metric is a proxy for what you actually care about — choose it carelessly and you'll cheerfully optimize the wrong thing all the way to production.

The takeaway

Accuracy lies on imbalanced data. The confusion matrix (TP/FP/FN/TN) generates precision (trust your alarms), recall (catch everything), and F1 (balance both). AUC judges ranking across thresholds. For regression use MAE/RMSE/R². Choose the metric by which mistake hurts most. We've now covered all of classical ML's machinery — time to stack linear units into something far more powerful: a neural network, built by hand. 👉

A neural network from scratch

Now we assemble everything — linear units, non-linearities, a loss, and backpropagation — into an actual neural network, built in pure NumPy with the backward pass derived by hand. No framework hiding the magic. After this chapter, the word "neural network" will hold no mystery: it's stacked logistic regressions trained by gradient descent.

The full program is code/nn_from_scratch.py; it prints exactly the output below.

A neuron is something you already know

One neuron computes activation(w·x + b) — a weighted sum (from Chapter 6) pushed through a non-linear function. Stack many neurons side by side → a layer. Stack layers → a multilayer perceptron (MLP), the original neural network:

$$ \mathbf{h} = \text{activation}(\mathbf{x}W_1 + \mathbf{b}_1), \qquad \mathbf{output} = \mathbf{h}W_2 + \mathbf{b}_2 $$

The middle layers are hidden layers; their outputs are features the network invents for itself. That's the superpower: instead of you hand-crafting features (Chapter 3), the network learns them.

Why non-linearity is the whole point

Don't be confused: without an activation function, depth is an illusion. Stack two linear layers and you get $(xW_1)W_2 = x(W_1W_2)$ — still just one linear layer. A hundred linear layers collapse into one. The non-linear activation between layers is what lets a deep network represent curved, complex functions. No activation, no deep learning.
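The collapse is easy to verify numerically. In this sketch the shapes and random weights are arbitrary assumptions, and biases are left out to keep the algebra identical to the formula in the box.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4))        # 5 examples, 4 features
W1 = rng.normal(size=(4, 8))       # "layer 1"
W2 = rng.normal(size=(8, 3))       # "layer 2"

two_layers = (x @ W1) @ W2         # two stacked layers, no activation
one_layer = x @ (W1 @ W2)          # one layer with the merged weight matrix
print("same without activation:", np.allclose(two_layers, one_layer))

with_tanh = np.tanh(x @ W1) @ W2   # now put a non-linearity in between
print("same with tanh between :", np.allclose(with_tanh, one_layer))
```

The first line prints True and the second False: once tanh sits between the layers, no single weight matrix reproduces the result.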

The activation zoo

import numpy as np
z = np.array([-2., -0.5, 0., 0.5, 2.])
print("relu   :", np.maximum(0, z))
print("sigmoid:", 1/(1+np.exp(-z)))
print("tanh   :", np.tanh(z))
print("leaky  :", np.where(z>0, z, 0.01*z))

Output:

relu   : [0.  0.  0.  0.5 2. ]
sigmoid: [0.119 0.378 0.5   0.622 0.881]
tanh   : [-0.964 -0.462  0.     0.462  0.964]
leaky  : [-0.02  -0.005  0.     0.5    2.   ]

| Activation | Range | Use it for |
|---|---|---|
| ReLU, max(0, z) | $[0,\infty)$ | the default for hidden layers — cheap, and its gradient does not vanish for positive inputs |
| Leaky ReLU | $(-\infty,\infty)$ | ReLU that avoids "dead" neurons (small slope for $z<0$) |
| Sigmoid | $(0,1)$ | a single probability output (binary) |
| Tanh | $(-1,1)$ | zero-centered hidden activations (older nets, RNNs) |
| GELU/SiLU | smooth | the modern default inside transformers |

Don't be confused: ReLU for hidden layers, sigmoid/softmax for outputs. Beginners reach for sigmoid everywhere because it's famous. In practice: ReLU (or GELU) in the hidden layers, and sigmoid/softmax only at the output to produce probabilities. Sigmoid in deep hidden layers causes the vanishing gradient problem (its slope is near-zero in the tails, so gradients die on the way back) — the bug that stalled deep learning for years until ReLU.
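A rough back-of-the-envelope sketch of why sigmoid gradients die: backprop multiplies by the activation's slope at every layer, and sigmoid's slope never exceeds 0.25. The code below isolates just that factor. Ignoring the weight matrices is a simplifying assumption, and the depth of 10 is arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_slope(z):
    s = sigmoid(z)
    return s * (1.0 - s)               # derivative of sigmoid

print("slope at z=0:", sigmoid_slope(0.0))                    # the maximum possible
print("slope at z=5:", round(float(sigmoid_slope(5.0)), 4))   # out in the tail

layers = 10
print("best-case factor after 10 sigmoid layers:", 0.25 ** layers)
print("ReLU factor for positive inputs, 10 layers:", 1.0 ** layers)
```

Even in the best case (every pre-activation exactly 0), ten sigmoid layers scale the gradient by 0.25^10, which is below one in a million. ReLU's slope is 1 for positive inputs, so that factor does not shrink.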

The network we'll build

A tiny MLP for a problem no linear model can solve — the XOR layout, where each class lives in two opposite corners:

2 inputs  ->  8 hidden neurons (tanh)  ->  1 output neuron (sigmoid)

Two corners are class 0, the other two are class 1 — not separable by any line, so the hidden layer must invent useful features. Loss: binary cross-entropy. The forward pass is two matrix multiplies with a tanh between them; the backward pass is the chain rule, by hand:

# ---- forward pass ----
z1 = X @ W1 + b1          # (m, hidden)
a1 = np.tanh(z1)          # hidden activation
z2 = a1 @ W2 + b2         # (m, 1)
p  = sigmoid(z2)          # predicted probability

# ---- loss: binary cross-entropy ----
loss = -np.mean(y*np.log(p+eps) + (1-y)*np.log(1-p+eps))

# ---- backward pass (chain rule, layer by layer from the output) ----
dz2 = (p - y) / m              # dL/dz2 — the clean (pred - target) form again
dW2 = a1.T @ dz2
db2 = dz2.sum(0, keepdims=True)
da1 = dz2 @ W2.T               # push the error back through W2
dz1 = da1 * (1 - a1**2)        # times tanh'(z) = 1 - tanh(z)^2
dW1 = X.T @ dz1
db1 = dz1.sum(0, keepdims=True)

# ---- gradient-descent update ----
W1 -= lr*dW1; b1 -= lr*db1; W2 -= lr*dW2; b2 -= lr*db2

Notice three things you already know reappearing: (p - y) is the cross-entropy gradient from Chapter 7; dz2 @ W2.T is the backprop "push the error backward" from Chapter 8; the final four lines are plain gradient descent. A neural network is not a new idea — it's the old ideas, composed.
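A standard way to trust a hand-derived backward pass is a gradient check: compare the analytic gradient with a finite-difference estimate. The sketch below applies the same forward and backward formulas to a tiny random problem. It is not the book's nn_from_scratch.py; the sizes, the seed, and the helper name `loss_and_cache` are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_and_cache(X, y, W1, b1, W2, b2, eps=1e-12):
    """Forward pass plus binary cross-entropy, same formulas as above."""
    a1 = np.tanh(X @ W1 + b1)
    p = sigmoid(a1 @ W2 + b2)
    loss = -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    return loss, a1, p

rng = np.random.default_rng(0)
m, hidden = 6, 3                                   # tiny on purpose
X = rng.normal(size=(m, 2))
y = rng.integers(0, 2, size=(m, 1)).astype(float)
W1 = rng.normal(size=(2, hidden)) / np.sqrt(2)
b1 = np.zeros((1, hidden))
W2 = rng.normal(size=(hidden, 1)) / np.sqrt(hidden)
b2 = np.zeros((1, 1))

# Analytic gradient w.r.t. W1 via the chain rule.
_, a1, p = loss_and_cache(X, y, W1, b1, W2, b2)
dz2 = (p - y) / m
dz1 = (dz2 @ W2.T) * (1 - a1 ** 2)
dW1 = X.T @ dz1

# Numerical gradient: nudge one weight at a time and watch the loss move.
h = 1e-6
num_dW1 = np.zeros_like(W1)
for i in range(W1.shape[0]):
    for j in range(W1.shape[1]):
        step = np.zeros_like(W1)
        step[i, j] = h
        up = loss_and_cache(X, y, W1 + step, b1, W2, b2)[0]
        down = loss_and_cache(X, y, W1 - step, b1, W2, b2)[0]
        num_dW1[i, j] = (up - down) / (2 * h)

max_diff = float(np.abs(num_dW1 - dW1).max())
print("max |analytic - numerical| =", max_diff)
```

If the chain-rule lines are right, the two gradients agree to many decimal places. A sign error or a missing `1 - a1**2` factor shows up immediately as a large difference.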

Training it

$ python nn_from_scratch.py

Output:

data: 400 points, 2 features, 2 classes (XOR layout)

epoch    0  loss 0.6983  train-acc 0.505
epoch  400  loss 0.0377  train-acc 0.990
epoch  800  loss 0.0230  train-acc 0.998
epoch 1200  loss 0.0180  train-acc 0.998
epoch 1600  loss 0.0154  train-acc 0.998
epoch 1999  loss 0.0139  train-acc 0.998

final train accuracy: 0.998

sanity check on the four cluster centers:
  point [-1. -1.]  ->  P(class 1) = 0.000
  point [1. 1.]  ->  P(class 1) = 0.000
  point [-1.  1.]  ->  P(class 1) = 1.000
  point [ 1. -1.]  ->  P(class 1) = 1.000

It starts at 50.5%, which is pure chance because the weights are still random, and climbs to 99.8%. The sanity check confirms it learned the XOR pattern exactly: the two "0" corners get P≈0.000, the two "1" corners get P≈1.000. A plain logistic regression on this data is stuck near 50% forever; the hidden layer is what made the difference, by bending the input space until the classes became separable.

Initialization matters more than you'd think

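We initialized weights to small random values scaled by $1/\sqrt{n_{\text{in}}}$, the idea behind Xavier (Glorot) initialization; He initialization uses $\sqrt{2/n_{\text{in}}}$ instead and is the usual choice with ReLU. Two failure modes this avoids: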

  • All zeros → every neuron computes the same thing and gets the same gradient; they never differentiate. The network is effectively one neuron wide forever.
  • Too large → activations saturate, gradients vanish or explode, training stalls or NaNs.

Good initialization keeps the signal's variance stable as it flows through layers. Frameworks do this for you, but knowing why explains a whole class of "my network won't train" bugs.
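The effect of the weight scale can be measured by pushing random inputs through a deep stack of tanh layers and looking at what comes out the far end. This is a sketch under assumed settings (256 units, 20 layers, three hand-picked scales); the helper `last_layer_stats` is hypothetical.

```python
import numpy as np

def last_layer_stats(weight_std, n=256, depth=20, seed=0):
    """Return (std of final activations, fraction saturated at |a| > 0.99)."""
    rng = np.random.default_rng(seed)
    a = rng.normal(size=(512, n))                     # a batch of random inputs
    for _ in range(depth):
        W = rng.normal(scale=weight_std, size=(n, n))
        a = np.tanh(a @ W)
    return float(a.std()), float((np.abs(a) > 0.99).mean())

n = 256
for label, std in [("too small (0.01)", 0.01),
                   ("1/sqrt(n_in)", 1 / np.sqrt(n)),
                   ("too large (1.0)", 1.0)]:
    act_std, saturated = last_layer_stats(std, n=n)
    print(f"{label:18s} final std={act_std:.2e}  saturated={saturated:.2f}")
```

With the small scale the signal shrinks to practically zero by the last layer; with the large scale most activations are pinned near ±1, where tanh's slope is almost zero. The 1/sqrt(n_in) scale keeps the activations at a usable magnitude, although tanh still shrinks them slowly with depth.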

The takeaway

A neuron is activation(w·x + b); a layer is many neurons; an MLP is stacked layers with non-linearities between them — and that non-linearity is what makes depth meaningful. Forward pass = matrix multiplies + activations; backward pass = the chain rule applied layer by layer; update = gradient descent. We just did all of it by hand and solved a problem no line can. Now let's let a framework write that backward pass for us. 👉

PyTorch: tensors, autograd & training

You will not hand-derive backward passes in real work — you'll use a framework. The two that dominate are PyTorch (research and, increasingly, production) and TensorFlow/Keras (still common in industry). This book uses PyTorch because it's the lingua franca of modern AI and reads like plain Python. This chapter is the practical core: the four pillars, then the exact same XOR network from Chapter 11 — but with the backward pass written for us.

Run code/pytorch_intro.py to reproduce every output here.

The four pillars of every PyTorch program

  1. Tensors — like NumPy arrays, but they can live on a GPU and track gradients.
  2. nn.Module — your model as a class, with learnable parameters registered automatically.
  3. autograd — loss.backward() computes every gradient by running backprop for you.
  4. optimizer — opt.step() applies the gradient-descent update.

Pillar 1: tensors (NumPy you can run on a GPU)

import torch
a = torch.tensor([[1., 2., 3.],
                  [4., 5., 6.]])
print("shape:", tuple(a.shape), "dtype:", a.dtype)
print("a.mean():", a.mean().item())
print("a @ a.T:\n", a @ a.T)

Output:

shape: (2, 3) dtype: torch.float32
a.mean(): 3.5
a @ a.T:
 tensor([[14., 32.],
        [32., 77.]])

If you know NumPy (Chapter 2), you already know 90% of the tensor API — .shape, @, broadcasting, .mean(axis=...) (PyTorch calls the arg dim). The two superpowers NumPy lacks: .to('cuda') moves a tensor to the GPU, and requires_grad=True makes it track gradients.

Don't be confused: PyTorch defaults to float32, NumPy to float64. Bridging the two preserves dtype, which can surprise you:

import numpy as np
print(torch.from_numpy(np.array([1., 2., 3.])).dtype)   # float64!

Output:

torch.float64

A float64 tensor fed to a float32 model raises a dtype error. Cast with .float() (→ float32) when crossing the NumPy↔PyTorch bridge.

Pillar 3: autograd (the reason frameworks exist)

This is the magic that replaces all of Chapter 11's hand-derived calculus. Mark a tensor requires_grad=True, do some math, call .backward(), and PyTorch fills in every gradient:

x = torch.tensor([2.0, 3.0], requires_grad=True)
y = (x ** 2).sum()        # y = x0^2 + x1^2,  so dy/dx = 2x
y.backward()              # run backprop
print("y:", y.item())
print("x.grad:", x.grad)  # should be [2*2, 2*3] = [4, 6]

Output:

y: 13.0
x.grad: tensor([4., 6.])

PyTorch recorded every operation into a computation graph as you went, then walked it backward to get $\partial y/\partial x = [4, 6]$ — exactly $2x$. You never wrote a derivative. Scale this to a 100-layer transformer and autograd still just works. This is what a framework buys you: you define only the forward pass; the backward pass is free.

Pillars 2 & 4: the model and the training loop

Here is the whole XOR network — model, loss, optimizer, loop — in idiomatic PyTorch:

import torch
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, n_in=2, hidden=8):
        super().__init__()
        self.fc1 = nn.Linear(n_in, hidden)   # a (w·x + b) layer; params auto-registered
        self.fc2 = nn.Linear(hidden, 1)
    def forward(self, x):
        x = torch.tanh(self.fc1(x))
        return self.fc2(x)                   # raw logit; the loss applies sigmoid

model  = MLP()
loss_fn = nn.BCEWithLogitsLoss()             # sigmoid + cross-entropy, fused & stable
opt     = torch.optim.Adam(model.parameters(), lr=0.05)

for epoch in range(401):
    opt.zero_grad()          # 1. clear last step's gradients
    logits = model(X)        # 2. forward pass
    loss   = loss_fn(logits, y)
    loss.backward()          # 3. autograd computes every .grad
    opt.step()               # 4. optimizer updates every parameter

Those four numbered steps inside the loop are the canonical PyTorch training step — you'll write them, in this order, for every model you ever train:

zero_grad  ->  forward  ->  backward  ->  step

Running the full script:

X is a tensor of shape (400, 2), dtype torch.float32
device: cpu  (CUDA available: False)

epoch    0  loss 0.6951  train-acc 0.498
epoch  100  loss 0.0200  train-acc 0.995
epoch  200  loss 0.0148  train-acc 0.998
epoch  300  loss 0.0116  train-acc 0.995
epoch  400  loss 0.0102  train-acc 0.995

model has 33 learnable parameters

Same XOR problem, and essentially the same accuracy as the hand-coded version (99.5% here vs. 99.8% there), but we wrote zero backprop. The 33 parameters: layer 1 is 2×8 + 8 = 24, layer 2 is 8×1 + 1 = 9, total 33. (Counting parameters by hand like this is a great sanity check on any architecture.)
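The hand count generalizes to any stack of fully connected layers. Here is a small pure-Python sketch; the function name and the second example's layer sizes are assumptions for illustration.

```python
def count_mlp_parameters(layer_sizes):
    """Weights (n_in * n_out) plus biases (n_out) for each fully connected layer."""
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out + n_out
    return total

print(count_mlp_parameters([2, 8, 1]))        # the XOR network: 24 + 9
print(count_mlp_parameters([784, 128, 10]))   # a hypothetical larger MLP
```

The first call prints 33, matching the script's report; the second prints 101770. If a framework reports a different number for your model than your own count, something in the architecture is not what you think it is.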

The two gotchas that bite everyone

Don't be confused: zero_grad() is mandatory. PyTorch accumulates gradients — each backward() adds to .grad rather than replacing it. Forget opt.zero_grad() and gradients from every past step pile up, and training goes haywire. (The accumulation is a feature — it lets you simulate big batches — but the default workflow needs the reset every step.)
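The accumulation behavior can be seen on a two-element tensor. This is a minimal sketch, assuming only that PyTorch is installed; it resets the gradient on the tensor directly with `w.grad.zero_()`, which is what `opt.zero_grad()` does for every parameter it manages.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

def loss_of(w):
    return (w ** 2).sum()            # d(loss)/dw = 2w = [2, 4]

loss_of(w).backward()
first = w.grad.clone()               # gradient after one backward pass

loss_of(w).backward()                # second backward WITHOUT a reset
accumulated = w.grad.clone()         # the new gradient was added to the old one

w.grad.zero_()                       # the reset that opt.zero_grad() performs
loss_of(w).backward()
fresh = w.grad.clone()

print("first      :", first)
print("accumulated:", accumulated)
print("after reset:", fresh)
```

The first and last gradients are [2, 4]; the middle one is [4, 8], twice the true value, even though the weights never moved. That is the pile-up the box warns about.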

Don't be confused: model.train() vs. model.eval(). Layers like dropout and batch norm behave differently in training vs. inference. Call model.eval() before validating/predicting and model.train() before training. Also wrap inference in with torch.no_grad(): to skip building the gradient graph — it's faster and uses less memory. Forgetting eval() is a top source of "my validation numbers are mysteriously bad."

BCEWithLogitsLoss: logits, not probabilities

Notice the model returns a raw logit (Chapter 6), not a sigmoid'd probability, and we used BCEWithLogitsLoss. That loss fuses the sigmoid and the cross-entropy into one numerically stable operation (it uses the log-sum-exp trick from Chapter 17). Applying sigmoid yourself and then BCELoss is mathematically identical but numerically fragile: for a large-magnitude logit the sigmoid saturates to exactly 0 or 1 in floating point, and the log that follows loses all precision. Rule: feed logits to *WithLogitsLoss / CrossEntropyLoss; only apply sigmoid/softmax when you actually need to read a probability.
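The difference between "sigmoid, then log" and a fused formulation shows up as soon as a logit gets large. The NumPy sketch below uses one standard stable rearrangement of binary cross-entropy on logits; it illustrates the idea and is not a claim about PyTorch's exact internals. The function names and the two test logits are assumptions.

```python
import numpy as np

def bce_naive(z, y):
    p = 1.0 / (1.0 + np.exp(-z))                        # sigmoid first...
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))   # ...then the log

def bce_from_logits(z, y):
    # Algebraically the same loss, rearranged so it never evaluates log(0).
    return np.maximum(z, 0) - z * y + np.log1p(np.exp(-np.abs(z)))

z = np.array([2.0, 40.0])     # second logit: very confident "positive"
y = np.array([1.0, 0.0])      # ...but the true label is 0

with np.errstate(divide="ignore", invalid="ignore"):
    naive = bce_naive(z, y)
stable = bce_from_logits(z, y)

print("naive :", naive)
print("stable:", stable)
```

For the ordinary logit both versions agree (about 0.127). For the logit of 40, sigmoid rounds to exactly 1.0 in float64, so the naive version takes log(0) and returns inf, while the rearranged version returns 40, the correct loss.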

The broader ecosystem (so the words don't surprise you)

  • DataLoader / Dataset — batch and shuffle your data (Chapter 16).
  • torchvision / torchaudio — datasets, pretrained models, transforms.
  • Hugging Face transformers — thousands of pretrained models, PyTorch-native.
  • model.state_dict() / torch.save — save and load learned weights.
  • .to(device) — the one line that moves model + data to GPU.
  • JAX — a third framework (Google) you'll hear about; same autograd idea, function-oriented style.

The takeaway

PyTorch = tensors (GPU-capable NumPy) + autograd (free backward pass) + nn.Module (your model) + optimizer (the update). The training loop is always zero_grad → forward → backward → step. Feed logits to the loss; remember eval() and no_grad() for inference. You can now read essentially any model's training code. Next, a tour of the architectures that go inside forward. 👉

The deep-learning zoo (CNN, RNN, Transformer)

The MLP from Chapter 11 treats its input as a flat bag of numbers. But real data has structure — images have spatial locality, text and audio have sequence, graphs have connections. The major neural architectures are each a clever way of baking a data structure's prior into the network so it doesn't have to learn that structure from scratch. This chapter is the conceptual map: what each one is, the one idea that defines it, and what it's for.

The unifying idea: inductive bias

Every architecture encodes an inductive bias — a built-in assumption about the data that makes learning vastly more efficient:

| Architecture | Data structure | Built-in assumption |
|---|---|---|
| MLP | none (flat vector) | every input interacts with every output |
| CNN | grid (images) | patterns are local and position-independent |
| RNN/LSTM | sequence | the past influences the present, step by step |
| Transformer | sequence/set | everything attends to everything, in parallel |
| GNN | graph | nodes are defined by their neighbors |

CNN — convolutional neural network (images)

A CNN's core operation is the convolution from Chapter 5: slide a small learnable kernel/filter across the input, multiply-and-sum at each position. Here's the 1-D version detecting an edge — a jump in the signal:

import numpy as np
signal = np.array([0,0,0,1,1,1,0,0], float)
kernel = np.array([-1, 1], float)                 # responds to changes
conv = np.array([signal[i:i+2] @ kernel for i in range(len(signal)-1)])
print("signal:", signal)
print("conv  :", conv)

Output:

signal: [0. 0. 0. 1. 1. 1. 0. 0.]
conv  : [ 0.  0.  1.  0.  0. -1.  0.]

The output spikes (+1) where the signal rises and dips (−1) where it falls — the kernel found the edges and was silent on the flat regions. Two properties make this powerful and efficient:

  • Parameter sharing — the same small kernel is reused at every position, so a 3×3 kernel has 9 weights whether the image is 32×32 or 4K. Massively fewer parameters than an MLP.
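  • Translation equivariance — shift the input and the feature map shifts with it, because the same kernel slides everywhere. Combined with pooling, this gives the approximate translation invariance people usually mean: a cat detector works wherever the cat is.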

A CNN stacks many such layers (with pooling to shrink spatial size): early layers learn edges, middle layers textures, deep layers objects. Use CNNs for images, video, spectrograms — anything with grid/spatial structure. (Famous names: LeNet, AlexNet, ResNet, U-Net.)
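That the same kernel responds the same way wherever a pattern sits can be checked with the edge detector from above. In this sketch the helper `conv1d` is a hypothetical wrapper around the same multiply-and-sum loop, and `np.roll` moves the bump one step to the right.

```python
import numpy as np

def conv1d(signal, kernel):
    """Slide `kernel` across `signal`, multiply-and-sum at each position."""
    k = len(kernel)
    return np.array([signal[i:i + k] @ kernel for i in range(len(signal) - k + 1)])

kernel = np.array([-1.0, 1.0])
signal = np.array([0, 0, 0, 1, 1, 1, 0, 0], float)
shifted = np.roll(signal, 1)          # same bump, one position to the right

print("conv(signal) :", conv1d(signal, kernel))
print("conv(shifted):", conv1d(shifted, kernel))
```

The +1 and −1 spikes appear one position later in the second line and are otherwise identical. The kernel did not need to learn anything new to find the edge in its new location.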

RNN / LSTM — recurrent networks (sequences)

A recurrent neural network processes a sequence one element at a time, carrying a hidden state that summarizes everything seen so far:

$$ h_t = \text{activation}(W_x x_t + W_h h_{t-1} + b) $$

The same weights run at every timestep; the loop is the sequence prior. Plain RNNs struggle to remember across long gaps (the vanishing gradient again), so the LSTM (Long Short-Term Memory) and GRU add gates — small learned valves that decide what to keep, forget, and output — letting information survive across hundreds of steps.
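The recurrence is only a few lines of NumPy. The sketch below runs a plain (ungated) RNN cell over a random five-step sequence; all sizes, the weight scale, and the choice of tanh as the activation are assumptions for illustration, and nothing is trained.

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim, steps = 3, 4, 5          # toy sizes, chosen arbitrarily

W_x = rng.normal(scale=0.5, size=(hidden_dim, input_dim))
W_h = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))
b = np.zeros(hidden_dim)

xs = rng.normal(size=(steps, input_dim))        # a sequence of 5 input vectors
h = np.zeros(hidden_dim)                        # hidden state starts empty

for t, x_t in enumerate(xs):
    h = np.tanh(W_x @ x_t + W_h @ h + b)        # same weights at every timestep
    print(f"t={t}  h={np.round(h, 3)}")
```

Two things to notice: `W_x`, `W_h`, and `b` are created once and reused at every step, and step t cannot start until step t−1 has produced `h`. The second point is the sequential bottleneck that makes RNNs slow to train.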

Use RNNs/LSTMs for time series, and historically for text and speech — though for language they've largely been replaced by transformers, which train far faster.

Don't be confused: RNN vs. CNN vs. Transformer for sequences. All three can process sequences, but: RNNs are inherently sequential (step t needs step t−1) → slow to train, weak long-range memory. CNNs capture local patterns in parallel but have a limited window. Transformers see the whole sequence at once and connect any two positions directly → the reason they won.

Transformer — attention (the architecture behind modern AI)

The transformer (2017, "Attention Is All You Need") is the architecture behind every modern LLM, and most new vision and audio models too. Its one idea is self-attention: each element looks at every other element and decides how much to pay attention to each, then mixes their information accordingly.

The mechanism, in three letters — Query, Key, Value. Each token emits a query ("what am I looking for?"), a key ("what do I offer?"), and a value ("what I'll contribute"). Attention weights come from query·key similarity (a dot product — the similarity from Chapter 4!), softmaxed into a distribution, then used to take a weighted average of the values:

$$ \text{Attention}(Q,K,V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d}}\right)V $$

A stripped-down version (using one matrix as Q, K, and V to show the shape of it):

def softmax(z):
    z = z - z.max(-1, keepdims=True); e = np.exp(z)
    return e / e.sum(-1, keepdims=True)

X = np.array([[1.,0,1,0], [0,1,0,1], [1,1,0,0]])   # 3 tokens, dim 4
scores = (X @ X.T) / np.sqrt(X.shape[1])           # token-to-token similarity
A = softmax(scores)                                # attention weights (rows sum to 1)
print("attention weights:\n", np.round(A, 3))
print("output:\n", np.round(A @ X, 3))

Output:

attention weights:
 [[0.506 0.186 0.307]
 [0.186 0.506 0.307]
 [0.274 0.274 0.452]]
output:
 [[0.814 0.494 0.506 0.186]
 [0.494 0.814 0.186 0.506]
 [0.726 0.726 0.274 0.274]]

Each row of the weight matrix is a probability distribution (sums to 1) saying how much that token attends to each token — and each output row is the corresponding weighted blend. That softmax(QKᵀ/√d)·V is, at heart, the entire transformer. Stack many attention layers (with MLPs and normalization between them), run them in parallel across the whole sequence, and you get GPT, BERT, and friends. We go deeper in Chapter 15.
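In a real transformer, Q, K, and V are not the input itself but three separate learned projections of it. The sketch below adds those projections to the stripped-down version. The matrices `W_q`, `W_k`, `W_v` are random stand-ins for learned weights, and the head size of 2 is an arbitrary assumption.

```python
import numpy as np

def softmax(z):
    z = z - z.max(-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(-1, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head self-attention: project, score, softmax, blend."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])     # scale by sqrt of the key dimension
    weights = softmax(scores)                   # one distribution per token
    return weights @ V, weights

rng = np.random.default_rng(0)
X = np.array([[1., 0, 1, 0], [0, 1, 0, 1], [1, 1, 0, 0]])   # 3 tokens, dim 4
d_model, d_head = 4, 2
W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))

out, weights = self_attention(X, W_q, W_k, W_v)
print("weights shape:", weights.shape, " row sums:", np.round(weights.sum(-1), 6))
print("output shape :", out.shape)
```

The weight matrix is still 3×3 (token by token) with rows summing to 1, but the output now has the head dimension (3×2) instead of the input dimension. Training adjusts the three projection matrices so that useful tokens end up with matching queries and keys.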

Why it dominates: attention is parallel (no sequential bottleneck → trains fast on GPUs), has direct long-range access (any token reaches any other token in the context in one step), and scales beautifully with data and parameters.

The rest of the menagerie (names you'll hear)

  • Embedding layers — turn discrete tokens/IDs into dense vectors (Chapter 14).
  • Autoencoders — compress to a bottleneck and reconstruct; learn representations unsupervised.
  • GANs — a generator vs. a discriminator in a contest; once the kings of image generation.
  • Diffusion models — generate by iteratively denoising random noise; behind Stable Diffusion, DALL·E, Midjourney.
  • Graph Neural Networks (GNNs) — learn on graph-structured data (molecules, social networks) by aggregating neighbor information.
  • Mixture of Experts (MoE) — route each input to a few specialist sub-networks; how the biggest models stay affordable.

How to choose an architecture

| Your data | Start with |
|---|---|
| Tabular (rows & columns) | gradient-boosted trees or an MLP |
| Images / video | CNN (or a Vision Transformer) |
| Text / language | Transformer (use a pretrained one) |
| Time series | LSTM, temporal CNN, or Transformer |
| Generation (images) | Diffusion model |
| Graphs / networks | GNN |

You will almost never train these from scratch. You'll download a pretrained model and fine-tune it — which is the heart of Chapter 15.

The takeaway

Each architecture bakes in a prior about its data: CNNs assume local, position-independent patterns (images); RNNs/LSTMs assume step-by-step sequence (time series); transformers assume everything-attends-to-everything via softmax(QKᵀ/√d)·V (language and beyond). Pick the architecture whose built-in assumption matches your data's structure. One word kept recurring — embedding. It deserves its own chapter. 👉

Embeddings & representation learning

Embedding is one of the most-used and least-defined words in modern AI. It's the bridge between "raw discrete stuff" (words, users, products, images) and "vectors a model can do math on." Master this chapter and a huge swath of AI — search, RAG, recommendation, LLMs — clicks into place, because they all run on embeddings.

The one-sentence definition

An embedding is a learned, dense vector that represents something, arranged so that similar things land near each other in the vector space.

Contrast with the encodings from Chapter 3:

  • One-hot [0,0,1,0,...] — sparse, huge, and every word is equidistant from every other (no notion of similarity). "cat" is as far from "kitten" as from "spreadsheet."
  • Embedding [0.2, -0.7, 0.1, ...] — dense (say 300–1536 numbers), and "cat" sits near "kitten" because the geometry encodes meaning.

The embedding learns meaning; one-hot just assigns slots.
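The "every word is equidistant" property of one-hot vectors takes only a few lines to confirm. The three-word vocabulary here is a toy assumption.

```python
import numpy as np

words = ["cat", "kitten", "spreadsheet"]
one_hot = np.eye(len(words))               # row i is the one-hot vector for words[i]

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

for i in range(len(words)):
    for j in range(i + 1, len(words)):
        dist = float(np.linalg.norm(one_hot[i] - one_hot[j]))
        sim = cos(one_hot[i], one_hot[j])
        print(f"{words[i]:>6} vs {words[j]:<11}  cosine={sim:.1f}  distance={dist:.3f}")
```

Every pair prints cosine 0.0 and distance 1.414 (the square root of 2). As far as one-hot encoding is concerned, "cat" is exactly as unrelated to "kitten" as it is to "spreadsheet".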

Meaning becomes geometry

Here's the magic, with a toy 4-dimensional "meaning space" whose axes you can read as royalty, male, female, fruit:

import numpy as np
emb = {
    "king":  np.array([1., 1, 0, 0]),
    "queen": np.array([1., 0, 1, 0]),
    "man":   np.array([0., 1, 0, 0]),
    "woman": np.array([0., 0, 1, 0]),
    "apple": np.array([0., 0, 0, 1]),
}
def cos(a, b): return float(a @ b / (np.linalg.norm(a)*np.linalg.norm(b)))

print("cos(king, queen):", round(cos(emb['king'], emb['queen']), 3))
print("cos(king, apple):", round(cos(emb['king'], emb['apple']), 3))

Output:

cos(king, queen): 0.5
cos(king, apple): 0.0

king and queen are related (both royalty) → positive cosine; king and apple share nothing → cosine 0. Similarity in meaning became similarity in geometry — exactly what Chapter 4 measures.

The famous trick: analogies are vector arithmetic

Because directions in a good embedding space carry meaning, you can do algebra on concepts. The legendary example: king − man + woman ≈ queen.

v = emb['king'] - emb['man'] + emb['woman']     # remove "male", add "female"
print("king - man + woman =", v)
sims = {k: round(cos(v, emb[k]), 3) for k in emb if k not in ('king','man','woman')}
print("nearest words:", sims)

Output:

king - man + woman = [1. 0. 1. 0.]
nearest words: {'queen': 1.0, 'apple': 0.0}

Subtracting man and adding woman moved us exactly to queen. Real embeddings (word2vec, GloVe) learned this from raw text alone, with no one ever labeling "royalty" or "gender" — the axes emerged from the training objective. That emergence is the deep idea of this chapter.

Representation learning: features the model invents

Don't be confused: feature engineering vs. representation learning. In Chapter 3 you hand-built features (TF-IDF, ratios). Representation learning is when the model discovers the features — embeddings are learned, not designed. This is the headline difference between classical ML and deep learning: deep nets turn raw data into useful representations automatically. An embedding is literally the network's learned internal feature vector.

Where do embeddings come from? They're just a lookup table of vectors (an nn.Embedding layer) whose entries are parameters trained by gradient descent, same as any weight. The training objective shapes them:

  • word2vec / GloVe — word2vec predicts a word from its neighbors and GloVe factorizes word co-occurrence counts; either way, words used in similar contexts get similar vectors ("you shall know a word by the company it keeps").
  • Sentence/text embeddings (Sentence-BERT, OpenAI/Cohere embeddings) — map whole sentences to one vector so semantic search works on meaning, not keywords.
  • Recommender embeddings — one vector per user and per item, learned so that user · item predicts a rating (the matrix-factorization idea; see the recsys book).
  • Image embeddings (CLIP, ResNet features) — map images to vectors; CLIP even puts images and text in the same space, so you can search images by text.
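The "lookup table" description of an embedding layer can be taken literally. In this NumPy sketch the vocabulary, the dimension of 4, and the random table `E` are assumptions; in a real model the rows of `E` would be parameters updated by gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = {"cat": 0, "kitten": 1, "spreadsheet": 2}   # hypothetical vocabulary
dim = 4
E = rng.normal(size=(len(vocab), dim))              # the table: one row per token

ids = np.array([vocab["kitten"], vocab["cat"], vocab["kitten"]])
vectors = E[ids]                                    # lookup = plain row indexing

# The same result as multiplying one-hot vectors by E, just without the wasted zeros.
one_hot = np.eye(len(vocab))[ids]
print("lookup equals one_hot @ E:", np.allclose(vectors, one_hot @ E))
print("shape:", vectors.shape)
```

This prints True and (3, 4). An embedding layer is mathematically a linear layer applied to one-hot input, implemented as indexing because that is far cheaper.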

Why embeddings power half of modern AI

Once everything — words, documents, users, products, images — is a vector in a shared space, a single toolkit solves an enormous range of tasks:

  • Semantic search — embed the query, find nearest-neighbor document vectors.
  • RAG (retrieval-augmented generation) — embed your documents, retrieve the relevant ones at query time, feed them to an LLM. This is the dominant pattern for building on top of LLMs.
  • Recommendation — recommend items whose vectors are near the user's vector.
  • Clustering / deduplication / classification — all become geometry on vectors.
  • The input to every LLM — the first thing a transformer does is embed each token.

And finding those nearest neighbors fast, over millions of vectors, is exactly what the sister HNSW and IVF-PQ books build. This chapter is why those books exist.

Don't be confused: "embedding" the noun vs. the verb, and the dimension. "An embedding" = the resulting vector. "To embed" = the act of producing it. "Embedding dimension" = how long the vector is (e.g. 768, 1536) — a hyperparameter trading expressiveness for cost. Also: a token embedding (one word) and a sentence embedding (one whole sentence) are different granularities of the same idea.

A practical note: normalize before you compare

Embeddings are almost always compared with cosine similarity, so the universal convention is to L2-normalize them (Chapter 3) — then cosine is a plain dot product, and vector databases can use fast inner-product search. If your retrieval results look random, the first thing to check is whether both the query and the stored vectors are normalized the same way.
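A two-document sketch shows both halves of that advice: after L2 normalization a dot product is the cosine, and without it a long vector can win a raw inner-product search for the wrong reason. The vectors are invented for illustration.

```python
import numpy as np

def l2_normalize(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

query = np.array([1.0, 1.0])
docs = np.array([[10.0, 0.0],     # doc 0: long vector, 45 degrees off the query
                 [1.0, 1.0]])     # doc 1: short vector, same direction as the query

raw_dot = docs @ query
cosine = l2_normalize(docs) @ l2_normalize(query)

print("raw dot product:", raw_dot, "-> best doc:", int(np.argmax(raw_dot)))
print("normalized dot :", np.round(cosine, 3), "-> best doc:", int(np.argmax(cosine)))
```

The raw dot product picks doc 0 (score 10 vs. 2) purely because it is long. After normalizing both sides, doc 1 wins with a cosine of 1.0 against 0.707. Normalizing the stored vectors but not the query, or the reverse, gives scores that are neither.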

The takeaway

An embedding is a learned dense vector where geometric closeness = semantic similarity; directions carry meaning, so concept arithmetic works. Representation learning means the model invents these features instead of you engineering them. Embeddings turn words, users, items, and images into one common vector language — the substrate of search, RAG, recommendation, and LLMs. Speaking of which: how do those LLMs actually turn text into tokens and tokens into answers? 👉

Tokens, transformers & LLMs

Large language models are the reason most people care about AI today — and they are built entirely from the pieces in this book: tokens become embeddings (Chapter 14), attention mixes them (Chapter 13), a softmax over the vocabulary picks the next word (Chapter 6), and cross-entropy trains the whole thing (Chapter 7). This chapter assembles those parts into a working mental model of an LLM and decodes the vocabulary — tokens, context window, temperature, fine-tuning, RAG — so none of it is mysterious.

What an LLM actually does

Strip away the hype and a language model does one thing: predict the next token. Given some text, it outputs a probability distribution over its entire vocabulary for what comes next, samples one, appends it, and repeats. That's it. "Write me an email," "solve this bug," "summarize this" — all emerge from next-token prediction at scale. It's Chapter 6's softmax classifier, with a transformer computing the logits and a vocabulary of ~100k classes.

Step 1: Tokenization — text becomes integers

A model can't consume raw characters; it consumes integers. Tokenization splits text into tokens — usually subword chunks — and maps each to an ID. Subwords are the sweet spot between whole words (vocabulary too big, can't handle new words) and individual characters (sequences too long):

text = "unbelievable"
# a subword tokenizer might split it like this:
tokens = ["un", "believ", "able"]      # 3 tokens, not 12 characters, not 1 word

Each token becomes an ID, each ID looks up an embedding vector, and that sequence of vectors is what the transformer actually processes. Practical consequences you must internalize:

  • You pay per token, not per word. Roughly 1 token ≈ 0.75 English words, or ~4 characters. A 1,000-word email is ~1,300 tokens.
  • Token boundaries cause famous quirks. "Why can't the model count the r's in strawberry?" Because it sees tokens like straw + berry, not letters. Many "dumb" LLM failures are really tokenization artifacts.
  • Rare words and code fragment into many tokens, inflating cost and context use.
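A toy illustration of the text → tokens → IDs step and the budgeting rule of thumb. This is only a sketch: the five-entry vocabulary is hypothetical, and greedy longest-match is a simplification (real tokenizers learn their vocabulary and merge rules from data).

```python
# Hypothetical toy vocabulary; real tokenizers learn ~100k entries from data.
VOCAB = {"un": 0, "believ": 1, "able": 2, "straw": 3, "berry": 4}

def tokenize(text, vocab=VOCAB):
    """Greedy longest-match split; unknown characters become their own token."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):   # try the longest piece first
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])          # fallback: a single character
            i += 1
    return tokens

def estimate_tokens(n_words, words_per_token=0.75):
    """Rule-of-thumb budget from the text: 1 token is about 0.75 English words."""
    return round(n_words / words_per_token)

tokens = tokenize("unbelievable")
ids = [VOCAB[t] for t in tokens]
print(tokens, ids)               # ['un', 'believ', 'able'] [0, 1, 2]
print(tokenize("strawberry"))    # ['straw', 'berry']
print(estimate_tokens(1000))     # 1333
```

Notice that the model-side view of "strawberry" is two opaque IDs, which is why letter-counting questions are harder than they look.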

Don't be confused: tokens vs. words vs. characters. A token is the model's atomic unit, typically a subword. It is neither a word nor a character. Budgeting context, latency, and API cost all happen in tokens — learn to think in them.

Step 2: The context window

The context window is the maximum number of tokens the model can attend to at once — its working memory. Everything (your prompt, the documents you paste, the conversation history, and the reply being generated) must fit. Modern windows range from a few thousand to millions of tokens.

Don't be confused: context window vs. training data vs. "memory." The context window is what the model sees right now, this request. Training data is what it learned from, months ago, and is baked into the weights. An LLM has no memory between calls — each API request is stateless; a chatbot "remembers" only because the app resends the prior conversation inside the context window every time. This single fact explains most confusion about what LLMs can and can't do.
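The "memory is just resent context" idea can be sketched in a few lines. Both count_tokens and build_context are hypothetical helpers written for this illustration (no real chat API is called), and the token count uses the rough 4-characters-per-token estimate from earlier.

```python
def count_tokens(text):
    """Crude estimate from the ~4 characters per token rule of thumb."""
    return max(1, len(text) // 4)

def build_context(history, new_message, max_tokens):
    """Return the newest messages (new one included) that fit in the window."""
    messages = history + [new_message]
    kept, used = [], 0
    for msg in reversed(messages):          # walk from newest to oldest
        cost = count_tokens(msg)
        if used + cost > max_tokens:
            break                           # older turns fall out of "memory"
        kept.append(msg)
        used += cost
    return list(reversed(kept))

history = ["user: my name is Ada", "bot: nice to meet you, Ada"]
question = "user: what is my name?"

print(build_context(history, question, max_tokens=100))  # all three messages fit
print(build_context(history, question, max_tokens=8))    # only the question survives
```

With the small window the model would receive the question but not the turn that contained the name, so it could not answer, even though nothing about the model itself changed.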

Step 3: Generation — temperature, top-k, top-p

The model outputs logits over the vocabulary; a softmax turns them into probabilities; then a sampling strategy picks the next token. The key knob is temperature $T$, which divides the logits before the softmax:

$$ p_i = \text{softmax}(z_i / T) $$

import numpy as np
def softmax(z):
    z = z - z.max()          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()
logits = np.array([2.0, 1.0, 0.5, 0.0])     # 4 candidate next-tokens
for T in [0.5, 1.0, 2.0]:
    print(f"T={T}: {np.round(softmax(logits/T), 3)}")

Output:

T=0.5: [0.831 0.112 0.041 0.015]
T=1.0: [0.579 0.213 0.129 0.078]
T=2.0: [0.409 0.248 0.193 0.15 ]

  • Low temperature (0.5) → distribution sharpens; the top token dominates → focused, more predictable, sometimes repetitive. Use for code, math, extraction.
  • High temperature (2.0) → distribution flattens; long-shot tokens get a real chance → creative, varied, riskier. Use for brainstorming, fiction.
  • Temperature 0 → always take the argmax (greedy) → fully deterministic. (Implementations treat T = 0 as a special case, since the formula would otherwise divide by zero.)

Two companions trim the candidate pool before sampling: top-k keeps only the k most likely tokens; top-p (nucleus) keeps the smallest set whose probabilities sum to p. Both cut off the implausible tail so high temperature stays coherent.
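Here is one way top-k and top-p filtering could be written in NumPy, reusing the same four logits at T = 1. It is a sketch of the idea rather than any particular library's implementation; the function names are illustrative.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def top_k_filter(probs, k):
    """Keep the k most likely tokens, zero the rest, renormalize."""
    keep = np.argsort(probs)[::-1][:k]
    out = np.zeros_like(probs)
    out[keep] = probs[keep]
    return out / out.sum()

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose probabilities sum to at least p."""
    order = np.argsort(probs)[::-1]                      # most likely first
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, p)) + 1     # tokens needed to reach p
    keep = order[:cutoff]
    out = np.zeros_like(probs)
    out[keep] = probs[keep]
    return out / out.sum()

probs = softmax(np.array([2.0, 1.0, 0.5, 0.0]))
print(np.round(top_k_filter(probs, k=2), 3))     # only tokens 0 and 1 survive
print(np.round(top_p_filter(probs, p=0.9), 3))   # tokens 0, 1, 2 survive (sum ≈ 0.92)

rng = np.random.default_rng(0)
next_token = rng.choice(len(probs), p=top_p_filter(probs, p=0.9))
print(next_token)                                # never 3: the tail was cut off
```

Either filter runs after the temperature-scaled softmax and before sampling, so the three knobs compose.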

Don't be confused: temperature ≠ "how smart" the model is. It only controls randomness of selection, not capability. High temperature doesn't make a model more creative-in-a-good-way so much as more random; too high and it produces incoherent text. For most precise tasks, lower is better.

How an LLM is built: pretraining → fine-tuning → alignment

  1. Pretraining — train next-token prediction on a huge text corpus. This is the expensive part (millions of dollars, months of GPU time) and produces a base model with broad knowledge but no instinct to be helpful.
  2. Fine-tuning — continue training on narrower data to specialize. Instruction tuning teaches it to follow instructions; domain fine-tuning teaches it your jargon.
  3. Alignment (RLHF/DPO) — use human preference data to make it helpful, honest, and harmless. This is the Chapter 1 reinforcement-learning flavor at work.

Building with LLMs: prompting vs. RAG vs. fine-tuning

You rarely train an LLM; you build on one. Three escalating options — know which to reach for:

Approach | What it is | Use when
Prompting | craft the input text; few-shot examples | the model already knows enough
RAG | retrieve relevant docs, paste into the prompt | you need current or private facts
Fine-tuning | adjust the weights on your data | you need a consistent style/format or a narrow skill

Don't be confused: RAG vs. fine-tuning — they solve different problems. RAG adds knowledge (facts the model can read at query time, kept fresh in a vector database). Fine-tuning adds behavior (tone, format, a skill baked into the weights). To make a model know your latest docs, use RAG, not fine-tuning — fine-tuning teaches how to act, not what's true today. Most production systems use RAG first and fine-tune only when style consistency demands it.

RAG is built entirely from this book: embed your documents (Chapter 14), index them for fast nearest-neighbor search (the HNSW/IVF-PQ books), retrieve the top matches by cosine similarity (Chapter 4), and stuff them into the context window. You now understand every layer of it.
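The whole retrieve-then-prompt loop fits in a short sketch. Everything here is a stand-in: embed is a bag-of-words counter rather than a learned embedding model, the "index" is a brute-force matrix product rather than HNSW or IVF-PQ, and the documents and prompt wording are made up.

```python
import numpy as np

DOCS = [
    "Refunds are processed within 5 business days.",
    "Our office is closed on public holidays.",
    "Premium plans include priority support.",
]

def tokenize(text):
    return [w.strip(".,?!").lower() for w in text.split()]

VOCAB = sorted({w for d in DOCS for w in tokenize(d)})

def embed(text):
    """Stand-in embedding: L2-normalized word counts (NOT a learned model)."""
    words = tokenize(text)
    v = np.array([words.count(w) for w in VOCAB], dtype=float)
    n = np.linalg.norm(v)
    return v / n if n > 0 else v

INDEX = np.stack([embed(d) for d in DOCS])       # one row per document

def retrieve(query, k=1):
    scores = INDEX @ embed(query)                # cosine, since rows are unit length
    top = np.argsort(scores)[::-1][:k]
    return [DOCS[i] for i in top]

def build_prompt(query, k=1):
    context = "\n".join(retrieve(query, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

print(build_prompt("How long do refunds take?"))
```

The string returned by build_prompt is what would be sent to the LLM; swapping in a real embedding model and a vector index changes the quality and speed, not the shape of the pipeline.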

A few more terms you'll meet

  • Hallucination — the model states something false with confidence. It optimizes for a plausible next token, not for truth; RAG and verification mitigate it.
  • Parameters (e.g. "70B") — the count of learnable weights; loosely correlates with capability and cost.
  • Quantization — store weights in fewer bits (int8/int4) to run big models on small hardware (the same compression spirit as the IVF-PQ book).
  • Multimodal — models that take images/audio alongside text (GPT-4o, Claude), built by embedding all modalities into one space.
  • Agents / tool use — let the model call functions and APIs in a loop, turning a text predictor into something that acts.

The takeaway

An LLM is a next-token predictor: text → tokens → embeddings → transformer attention → softmax over the vocabulary → sample (temperature controls randomness). It's stateless between calls; "memory" is the resent context window. Build on it with prompting, RAG (adds knowledge), or fine-tuning (adds behavior) — and RAG is just the embedding + nearest-neighbor stack you already know. Next, the unglamorous machinery that makes any of this trainable: the data pipeline. 👉

The data pipeline & training in practice

The algorithms get the glory, but in real projects most of your time goes into getting data in and out of the model correctly — batching, shuffling, splitting, seeding, moving tensors to the GPU. This chapter covers the unglamorous machinery that turns "I have a model" into "I trained it without fooling myself." Skipping this is how good models produce silently wrong results.

Why we feed data in batches

Chapter 8 introduced mini-batch SGD; here's the mechanics. You don't feed the whole dataset at once (it won't fit in memory) or one example at a time (too slow, too noisy). You feed batches of, say, 32–512 examples. One pass through all the batches is one epoch, and you train for many epochs. A from-scratch batch iterator — shuffle, then slice:

import numpy as np
def batches(X, y, batch_size, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))                  # shuffled order; pass a new seed each epoch to reshuffle
    for start in range(0, len(X), batch_size):
        b = idx[start:start + batch_size]
        yield X[b], y[b]

X = np.arange(10).reshape(10, 1); y = np.arange(10)
for i, (xb, yb) in enumerate(batches(X, y, 4)):
    print(f"batch {i}: y={yb}")

Output:

batch 0: y=[4 6 2 7]
batch 1: y=[3 5 9 0]
batch 2: y=[8 1]

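Three things to notice: the data is shuffled (critical, or the model learns the order), batches are fixed-size except the last one (here just 2 examples), and we covered every example exactly once. One caveat: with the default seed=0 every call yields the same order, so in a real training loop pass a different seed each epoch (for example, seed=epoch) to get a fresh shuffle. In PyTorch this is the DataLoader: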

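import torch
from torch.utils.data import DataLoader, TensorDataset
# TensorDataset expects tensors, so convert the NumPy arrays first
dataset = TensorDataset(torch.as_tensor(X), torch.as_tensor(y))
loader = DataLoader(dataset, batch_size=32, shuffle=True)   # reshuffles every epoch
for xb, yb in loader:        # same idea, parallelized & GPU-friendly
    ...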

Don't be confused: batch size vs. epochs vs. iterations.

  • Batch size — examples per gradient step (e.g. 32).
  • Iteration / step — one gradient update (one batch).
  • Epoch — one full pass over the dataset = (dataset size / batch size) iterations, rounded up when the last batch is partial.

"Trained for 10 epochs with batch size 64 on 6,400 examples" = 100 iterations per epoch = 1,000 gradient steps total. Mixing these up makes training logs unreadable.

Why shuffling matters (a real bug)

If your data is sorted by label (all the 0s, then all the 1s) and you don't shuffle, each batch contains only one class. The model swings wildly toward whichever class it's currently seeing and never learns the mix. Always shuffle the training set each epoch. (Never shuffle time series across the time boundary, though — see the leakage warning below.)
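A tiny demonstration of the bug, using made-up labels: the fraction of class 1 in each batch, first with sorted data, then after shuffling.

```python
import numpy as np

y = np.array([0] * 8 + [1] * 8)        # labels sorted by class
batch_size = 4

def class_mix(labels):
    """Fraction of 1s in each consecutive batch."""
    return [float(labels[i:i + batch_size].mean())
            for i in range(0, len(labels), batch_size)]

sorted_mix = class_mix(y)
print("sorted  :", sorted_mix)         # [0.0, 0.0, 1.0, 1.0]: one class per batch

rng = np.random.default_rng(0)
shuffled_mix = class_mix(y[rng.permutation(len(y))])
print("shuffled:", shuffled_mix)       # typically a blend of both classes per batch
```

Every gradient step on the sorted data sees a single class, so the update pushes the model toward "always predict that class."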

Reproducibility: set your seeds

Randomness is everywhere — weight init, shuffling, dropout. To get the same result twice (essential for debugging and for trusting a comparison), fix the random seed:

print("run A:", np.random.default_rng(42).normal(size=3).round(3))
print("run B:", np.random.default_rng(42).normal(size=3).round(3))

Output:

run A: [ 0.305 -1.04   0.75 ]
run B: [ 0.305 -1.04   0.75 ]

Identical — because the seed was identical. In a real run you'd set torch.manual_seed(42), np.random.seed(42), and Python's random.seed(42). If you can't reproduce a result, you can't trust a comparison between two models — maybe model B "won" purely by a luckier random init.
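One convenient pattern is to wrap those calls in a single helper. The name set_seed is just a convention used here for illustration; the PyTorch call is attempted only if PyTorch is installed.

```python
import random
import numpy as np

def set_seed(seed):
    """Seed Python, NumPy and (if installed) PyTorch in one call."""
    random.seed(seed)
    np.random.seed(seed)
    try:
        import torch
        torch.manual_seed(seed)
    except ImportError:
        pass                      # PyTorch not installed; nothing else to seed

set_seed(42)
a = (random.random(), np.random.rand())
set_seed(42)
b = (random.random(), np.random.rand())
print(a == b)                     # True
```

Call it once at the top of every experiment script, and log the seed alongside the results.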

The cardinal sin: data leakage

Data leakage is when information from the test set sneaks into training, making your offline numbers look amazing and your production results collapse. The most common forms:

  • Scaling before splitting. Computing the standardization mean/std over all data leaks test statistics into training. Fit the scaler on train only, then apply to val/test (Chapter 3).
  • Shuffling time series. For temporal data, a random split lets the model "see the future." Split by time: train on the past, test on the future.
  • Duplicate or near-duplicate rows straddling the split — the model effectively memorizes test answers.
  • Target leakage — a feature that secretly encodes the answer (e.g. "was_refunded" when predicting fraud). If a feature is "too good," suspect leakage.

Don't be confused: a great validation score is not always good news. If it seems too good to be true, your first hypothesis should be leakage, not genius. The discipline of fit-on-train, apply-to-test and split-before-you-touch exists entirely to prevent this self-deception.
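The fit-on-train, apply-to-test discipline looks like this in NumPy. The data is synthetic and the 80/20 time cut is an arbitrary choice for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(100, 1))   # made-up feature
timestamps = np.arange(100)                           # pretend rows are time-ordered

# Split first (by time: past -> train, future -> test), before any statistics.
cut = 80
X_train, X_test = X[timestamps < cut], X[timestamps >= cut]

# Fit the scaler on train only...
mean, std = X_train.mean(axis=0), X_train.std(axis=0)
# ...then apply those same numbers to both splits.
X_train_s = (X_train - mean) / std
X_test_s = (X_test - mean) / std

print("train mean/std:", X_train_s.mean(), X_train_s.std())  # 0 and 1 up to rounding
print("test  mean/std:", X_test_s.mean(), X_test_s.std())    # near, not exactly, 0 and 1
```

The test split not being exactly mean-0/std-1 is the point: it was scaled with statistics it had no part in computing, just as production data will be.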

CPU, GPU, and moving tensors

Deep learning lives on the GPU because matrix multiplies parallelize massively — often 10–100× faster than CPU. In PyTorch, both the model and the data must be on the same device:

import torch
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)                     # model: your nn.Module
xb, yb = xb.to(device), yb.to(device)        # data must follow the model

Don't be confused: the "tensors on different devices" error. One of PyTorch's most common runtime errors is a model on the GPU receiving CPU data (or vice versa). The fix is always: .to(device) both the model and every input batch. Also, .item() / .cpu().numpy() pulls a result back to the CPU for printing or plotting (for .numpy(), call .detach() first if the tensor still tracks gradients).

A few production practices worth knowing

  • Mixed precision (float16/bfloat16) — train in lower precision for ~2× speed and half the memory, with negligible accuracy loss. Standard for large models.
  • Gradient accumulation — sum gradients over several mini-batches before step(), simulating a big batch when GPU memory is tight (this is why PyTorch accumulates gradients by default).
  • Checkpointing — call torch.save(model.state_dict(), path) periodically, so a crash doesn't lose a week of training.
  • Learning-rate schedules — lower the learning rate over time (warmup then decay) for better final accuracy; the standard recipe for transformers.
  • Monitoring — log train and validation loss every epoch and watch the gap (Chapter 9). The loss curve is your primary diagnostic; learn to read it (diverging → LR too high; flat → LR too low or stuck; val rising → overfitting).
  • Experiment tracking — tools like MLflow or Weights & Biases record every run's hyperparameters and metrics so you can compare honestly (the recsys book's capstone uses MLflow).
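As an illustration of the "warmup then decay" schedule mentioned above, here is a framework-free sketch. Linear warmup followed by cosine decay is one common choice, assumed here; the default values (base_lr, warmup_steps) are arbitrary.

```python
import math

def lr_at(step, total_steps, base_lr=1e-3, warmup_steps=100):
    """Linear warmup to base_lr, then cosine decay to 0 (one common recipe)."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

for s in [0, 50, 99, 100, 550, 1000]:
    print(s, round(lr_at(s, total_steps=1000), 6))
```

The rate climbs from 1e-5 to 1e-3 over the first 100 steps, sits at half its peak midway through the decay (step 550), and reaches 0 at the final step.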

The takeaway

Feed data in shuffled batches; one pass is an epoch; know batch size vs. iteration vs. epoch. Set seeds so results are reproducible and comparisons are trustworthy. Guard against data leakage with religious fit-on-train-only and split-before-touch discipline — a too-good score usually means leakage. Keep model and data on the same device. With the pipeline solid, the last thing standing between you and correct results is a set of numerical landmines — let's defuse them. 👉

Numerical gotchas & "don't be confused"

Computers don't do real arithmetic — they do finite-precision floating-point arithmetic, and the gap bites AI code constantly: a softmax returns NaN, a loss goes inf, two "equal" numbers aren't, a model trains fine then mysteriously explodes. This chapter is the field guide to those landmines, plus a consolidated "don't be confused" glossary of the near-synonyms that trip everyone up. Knowing these is the difference between "the model is broken" and "oh, that's just float32."

Floats are approximate — stop testing them with ==

print("0.1 + 0.2 == 0.3 :", 0.1 + 0.2 == 0.3)
print("0.1 + 0.2        =", repr(0.1 + 0.2))
print("np.isclose       :", np.isclose(0.1 + 0.2, 0.3))

Output:

0.1 + 0.2 == 0.3 : False
0.1 + 0.2        = 0.30000000000000004
np.isclose       : True

0.1, 0.2, 0.3 can't be represented exactly in binary, so they don't add up exactly. Never compare floats with ==. Use np.isclose / np.allclose (with a tolerance) for scalars and arrays. This is a classic reason "my test passes locally but fails in CI": different hardware or library versions can round the last bit differently.

Softmax overflow — and the trick that fixes it

exp overflows fast: exp(1000) is larger than any float can hold (→ inf), and inf / inf is NaN. A naive softmax detonates on large logits:

def naive_softmax(z):
    e = np.exp(z)                 # overflows to inf for large z
    return e / e.sum()

def stable_softmax(z):
    z = z - z.max()               # shift so the largest exponent is exp(0) = 1
    e = np.exp(z)
    return e / e.sum()

big = np.array([1000., 1001., 1002.])
print("naive :", naive_softmax(big))
print("stable:", np.round(stable_softmax(big), 3))

Output:

naive : [nan nan nan]
stable: [0.09  0.245 0.665]

The fix — subtract the max before exponentiating — is mathematically a no-op (softmax is shift-invariant: $\text{softmax}(z) = \text{softmax}(z - c)$) but numerically it's the whole game: now the largest exponent is exp(0) = 1, no overflow. Every library's softmax does this. It's why you hand logits to CrossEntropyLoss (Chapter 12) — the framework applies the stable version internally.

log(0) = -inf, and the + eps habit

Cross-entropy takes log(p). If the model ever predicts exactly p = 0 for the true class, log(p) is -inf, the loss becomes inf, and gradients become NaN:

print("log(0):", np.log(0.0))

Output:

log(0): -inf

The defensive habit you saw throughout this book — np.log(p + 1e-9) or clipping p to [eps, 1-eps] — exists precisely to dodge this. (Better still: use the fused BCEWithLogitsLoss / CrossEntropyLoss, which are built to be safe.)
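To see why the fused losses are preferable to the eps habit, compare three ways of computing binary cross-entropy. This is a NumPy sketch of the standard logits-based rearrangement, not PyTorch's actual source; the function names and inputs are illustrative.

```python
import numpy as np

def bce_naive(z, y):
    p = 1.0 / (1.0 + np.exp(-z))                   # sigmoid
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def bce_clipped(z, y, eps=1e-9):
    p = np.clip(1.0 / (1.0 + np.exp(-z)), eps, 1 - eps)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def bce_from_logits(z, y):
    """Algebraically equal to bce_naive, but never takes log(0)."""
    return np.maximum(z, 0) - z * y + np.log1p(np.exp(-np.abs(z)))

z = np.array([2.0, -1.0, 40.0])      # the last logit is confidently wrong
y = np.array([1.0, 0.0, 0.0])

with np.errstate(divide="ignore", invalid="ignore"):
    print("naive  :", bce_naive(z, y))       # last entry is inf
print("clipped:", bce_clipped(z, y))         # last entry capped near 20.7
print("fused  :", bce_from_logits(z, y))     # last entry is the true loss, 40.0
```

Clipping avoids the inf but silently reports the wrong loss for extreme predictions; working from the logits gives the exact value with no special cases.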

Log-sum-exp: the pattern behind all of the above

When you must compute $\log\sum_i e^{z_i}$ (it appears in softmax, cross-entropy, and probabilistic models), the same overflow lurks. The log-sum-exp trick factors out the max:

$$ \log\sum_i e^{z_i} = m + \log\sum_i e^{z_i - m}, \qquad m = \max_i z_i $$

def logsumexp(z):
    m = z.max()
    return m + np.log(np.exp(z - m).sum())

z = np.array([1000., 1001.])
print("logsumexp (stable):", round(logsumexp(z), 3))
print("naive log(sum(exp)):", np.log(np.exp(z).sum()))

Output:

logsumexp (stable): 1001.313
naive log(sum(exp)): inf

The stable version gives the right answer (1001.313); the naive one overflows to inf. Recognize this pattern — pull out the max, then exponentiate the differences — and you'll understand the internals of half the loss functions in PyTorch.
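As one example of the pattern at work, cross-entropy for a single example can be computed directly from logits as logsumexp(z) minus the true class's logit. This follows from -log softmax(z)[c] = -(z[c] - logsumexp(z)). The sketch below uses illustrative names and the same oversized logits as the softmax example.

```python
import numpy as np

def logsumexp(z):
    m = z.max()
    return m + np.log(np.exp(z - m).sum())

def cross_entropy_from_logits(z, true_class):
    """-log softmax(z)[true_class], computed without ever forming softmax."""
    return float(logsumexp(z) - z[true_class])

z = np.array([1000.0, 1001.0, 1002.0])
print(round(cross_entropy_from_logits(z, 2), 3))   # 0.408, i.e. -log(0.665)
print(round(cross_entropy_from_logits(z, 0), 3))   # 2.408, i.e. -log(0.090)
```

No probability is ever materialized, so there is nothing to overflow and no log(0) to guard against.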

NaN is contagious

A single NaN poisons everything it touches — one bad value silently turns your whole loss into NaN:

a = np.array([1., np.nan, 3.])
print("mean   :", np.mean(a))
print("nanmean:", np.nanmean(a))

Output:

mean   : nan
nanmean: 2.0

If your loss suddenly becomes NaN mid-training, the usual suspects are: learning rate too high (exploding gradients — Chapter 8), log(0) / divide-by-0, or a NaN already in your input data. Use np.isnan(x).any() to hunt it down, and nanmean/nansum when missing values are expected.

A few more silent traps

  • Integer division. np.array([1,2,3]) / 2 is fine (NumPy promotes to float), but integer tensors and // can silently truncate. Keep features in float32.
  • In-place ops. x += 1 modifies the array in place, so every other variable that shares it changes too. With autograd, in-place edits on tensors that need gradients can raise errors — make a copy.
  • Broadcasting surprises. Adding shape (n,) to (n,1) broadcasts to (n,n) — a silent memory blowup, not an error (Chapter 2).
  • float32 vs float64 mismatch at the NumPy↔PyTorch boundary (Chapter 12).
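The broadcasting trap is the easiest of these to reproduce. A minimal sketch, with made-up array names:

```python
import numpy as np

n = 4
pred = np.zeros(n)            # shape (n,)
target = np.ones((n, 1))      # shape (n, 1), e.g. a column read from a table

diff = pred - target
print(diff.shape)             # (4, 4): every prediction minus every target

# Make the shapes match explicitly before combining.
diff_ok = pred - target.ravel()
print(diff_ok.shape)          # (4,)
```

A loss computed on the (n, n) version still returns a number, which is what makes this bug silent; printing shapes (or asserting them) is the cheap defense.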

The consolidated "don't be confused" glossary

Every near-synonym pair this book untangled, in one place:

These sound alike but… | The distinction
parameter vs. hyperparameter | learned by the model vs. set by you
loss vs. metric | what you optimize (smooth) vs. what you report (any)
logit vs. probability | pre-sigmoid score $(-\infty,\infty)$ vs. post $(0,1)$
sigmoid vs. softmax | one independent prob vs. competing probs summing to 1
normalization vs. standardization | to unit length / [0,1] vs. mean-0/std-1
L1 vs. L2 | absolute (sparse, robust) vs. squared (smooth, outlier-sensitive)
precision vs. recall | trust of alarms vs. coverage of positives
validation vs. test set | for tuning (seen often) vs. for reporting (seen once)
backprop vs. gradient descent vs. autograd | compute grads vs. apply them vs. automate it
cosine vs. dot product | direction only vs. direction + magnitude
similarity vs. distance | bigger = closer vs. smaller = closer
RAG vs. fine-tuning | adds knowledge vs. adds behavior
token vs. word vs. character | model's subword unit vs. word vs. letter
* vs. @ | elementwise vs. dot / matrix product
axis=0 vs axis=1 | collapse rows (per-column) vs. collapse cols (per-row)
model.train() vs .eval() | dropout on, batchnorm uses batch statistics vs. dropout off, batchnorm uses running statistics
batch vs. epoch vs. iteration | examples/step vs. full pass vs. one update
overfitting vs. underfitting | memorized noise vs. too simple
bias (the term) two ways | the +b in w·x+b vs. bias in bias–variance
kernel (×5) | SVM similarity / conv filter / GPU fn / OS core / stats window
tensor (ML) vs. (physics) | "n-D array" vs. a strict transformation object
inference two ways | running a trained model vs. statistical inference

The meta-lesson: AI vocabulary is full of overloaded words and false twins because the field fused statistics, computer science, and engineering, each bringing its own dialect. When a term feels slippery, it's usually because it is — pin down which of two meanings is in play and the fog lifts.

The takeaway

Floats are approximate (isclose, never ==); exp overflows (subtract the max — the softmax/log-sum-exp trick); log(0) = -inf (add eps or use fused losses); one NaN poisons everything (suspect a high learning rate). And when two terms blur together, consult the glossary above — almost every confusion in AI is a pair of near-synonyms wearing one coat. That completes the deep-learning spine. What follows broadens you into a complete, interview-ready engineer: the classical algorithms (trees, boosting, k-NN), the math and statistics interviews assume, the tools/cloud/landscape you'll ship with, and a system-design and interview playbook. First up — the classical ML algorithm zoo that rules tabular data. 👉

The classical ML algorithm zoo

Here's a secret the deep-learning hype hides: for tabular data (the spreadsheets that run most companies), gradient-boosted trees often still match or beat neural networks, and they're what you'll be asked about in interviews far more than transformers. This chapter fills the biggest gap between Parts II and III: the classical algorithms every ML engineer must know cold (k-NN, Naive Bayes, decision trees, random forests, gradient boosting, and SVMs). We build the core of each from scratch so the intuition sticks.

Where this fits: read this any time after Chapter 10 (you need losses and metrics first). It's placed here as an "advanced part" so the deep-learning spine stays unbroken, but conceptually it sits right alongside linear models.

k-Nearest Neighbors (k-NN) — lazy, simple, surprisingly strong

The simplest classifier there is: to label a new point, find the k closest training points and take a majority vote. No training — it just memorizes the data (a non-parametric, instance-based method).

import numpy as np
Xtr = np.array([[1,1], [1.2,0.9], [3,3], [3.2,2.8]]); ytr = np.array([0,0,1,1])

def knn_predict(q, k=3):
    d = np.sqrt(((Xtr - q) ** 2).sum(1))      # distance to every training point
    nearest = d.argsort()[:k]                 # indices of the k closest
    return np.bincount(ytr[nearest]).argmax() # majority vote

print("near cluster 0:", knn_predict(np.array([1.1, 1.0])))
print("near cluster 1:", knn_predict(np.array([3.1, 3.0])))

Output:

near cluster 0: 0
near cluster 1: 1

  • Pros: zero training, no assumptions, naturally multi-class, a great baseline.
  • Cons: slow at prediction time (compares to all data — this is exactly what the HNSW and IVF-PQ sister books accelerate), needs feature scaling (Chapter 3), and degrades in high dimensions (the curse of dimensionality).
  • The one hyperparameter, k: small k → low bias, high variance (jagged boundary); large k → smoother, higher bias. Classic bias–variance.

Don't be confused: k-NN vs. k-means. Both have a "k" and both use distances, but k-NN is supervised classification (k = how many neighbors vote), while k-means (Chapter 21) is unsupervised clustering (k = how many clusters). Totally different algorithms, and a favorite interview trap.

Naive Bayes — probability with a bold assumption

Apply Bayes' theorem (Chapter 22) to classify, assuming every feature is independent given the class — the "naive" part. That assumption is usually false, yet it works astonishingly well, especially for text (spam filters were built on it):

$$ P(\text{class} \mid \text{features}) \propto P(\text{class}) \prod_i P(\text{feature}_i \mid \text{class}) $$

  • Pros: blazingly fast, needs little data, a strong text baseline.
  • Cons: the independence assumption hurts when features are correlated; probabilities are poorly calibrated (trust the ranking, not the exact number).
  • Variants: Multinomial (word counts), Bernoulli (binary features), Gaussian (continuous features).
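A from-scratch sketch of the Multinomial variant on a made-up four-document corpus. Two details go beyond the formula as written and are assumptions of this sketch: add-one (Laplace) smoothing, so unseen words don't zero out the product, and summing logs instead of multiplying probabilities, to avoid underflow.

```python
import numpy as np

# Made-up toy corpus: 1 = spam, 0 = ham.
docs = ["win money now", "free money win", "meeting at noon", "lunch at noon"]
labels = np.array([1, 1, 0, 0])

vocab = sorted({w for d in docs for w in d.split()})
word_index = {w: i for i, w in enumerate(vocab)}

def counts(text):
    v = np.zeros(len(vocab))
    for w in text.split():
        if w in word_index:          # ignore words never seen in training
            v[word_index[w]] += 1
    return v

X = np.stack([counts(d) for d in docs])

# log P(class) and log P(word | class), with add-one smoothing
classes = np.unique(labels)
log_prior = np.log(np.array([(labels == c).mean() for c in classes]))
word_totals = np.stack([X[labels == c].sum(0) for c in classes]) + 1.0
log_likelihood = np.log(word_totals / word_totals.sum(1, keepdims=True))

def predict(text):
    # Sum of logs = log of the product in the formula above.
    scores = log_prior + log_likelihood @ counts(text)
    return int(classes[scores.argmax()])

print(predict("free money"))      # 1 (spam)
print(predict("noon meeting"))    # 0 (ham)
```

"Training" is nothing more than counting words per class, which is why the method is so fast.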

Decision trees — the if/else machine

A decision tree asks a sequence of yes/no questions about features, splitting the data until each leaf is (nearly) one class. To pick each split, it searches for the question that most reduces impurity, measured by Gini impurity (or entropy):

$$ \text{Gini} = 1 - \sum_c p_c^2 \quad (\text{0 = pure, higher = mixed}) $$

def gini(labels):
    p = np.bincount(labels) / len(labels)
    return 1 - (p ** 2).sum()

parent = np.array([0,0,1,1,1,1])
print("impurity before split:", round(gini(parent), 3))
print("after a perfect split :", gini(np.array([0,0])), gini(np.array([1,1,1,1])))

Output:

impurity before split: 0.444
after a perfect split : 0.0 0.0

The split drove impurity from 0.444 to 0 — a clean separation. The tree greedily repeats this at every node.

  • Pros: interpretable ("why? because feature X > 5"), needs no scaling, handles non-linear boundaries and mixed feature types.
  • Cons: a single deep tree overfits badly (it can memorize the training set). The fix is to combine many trees — which gives us the two most important tabular algorithms in the field.

Don't be confused: Gini vs. entropy. Both measure node impurity and give nearly identical trees; Gini is slightly faster (no log). Don't agonize over the choice — interviewers want to know you understand impurity, not which formula you picked.
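The split search itself can be sketched for a single numeric feature: try each candidate threshold and keep the one with the lowest weighted child impurity. The feature values below are made up, and testing midpoints between sorted unique values is one common convention assumed here.

```python
import numpy as np

def gini(labels):
    p = np.bincount(labels) / len(labels)
    return 1 - (p ** 2).sum()

def best_split(x, y):
    """Try every midpoint between sorted unique values; return (threshold, impurity)."""
    best_t, best_score = None, float(gini(y))       # a split must beat the unsplit node
    values = np.unique(x)
    for t in (values[:-1] + values[1:]) / 2:
        left, right = y[x <= t], y[x > t]
        # weighted average of the children's impurity
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if score < best_score:
            best_t, best_score = float(t), float(score)
    return best_t, best_score

x = np.array([1.0, 2.0, 3.0, 6.0, 7.0, 8.0])    # one made-up feature
y = np.array([0, 0, 1, 1, 1, 1])                # same labels as the parent node above
print(best_split(x, y))                         # (2.5, 0.0)
```

A full tree applies this search to every feature at every node and recurses on the two children.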

Random forests — many trees, bagged (variance ↓)

A random forest trains hundreds of decision trees, each on a random bootstrap sample of the rows and a random subset of features at each split, then averages their votes. This technique — training models on resampled data and averaging — is called bagging (bootstrap aggregating), and it crushes the single tree's variance:

  • Pros: robust, hard to overfit, little tuning, gives feature importances, a fantastic default for tabular data.
  • Cons: larger and slower than one tree, less interpretable than a single tree.
  • Key idea: the trees' errors are decorrelated (different rows, different features), so averaging cancels them out. Bagging reduces variance.
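Two small simulations illustrate the ingredients. Note the idealization: the "trees" in the second part are just the true value plus independent noise, whereas real trees are only partly decorrelated, so real forests gain less than the 1/n_trees shown here.

```python
import numpy as np

rng = np.random.default_rng(0)

# 1) A bootstrap sample: n rows drawn with replacement.
n = 1000
sample = rng.integers(0, n, size=n)
unique_fraction = len(np.unique(sample)) / n
print("unique rows in one bootstrap sample:", unique_fraction)   # about 0.63

# 2) Averaging decorrelated errors (idealized: fully independent noise).
truth, n_trees, n_trials = 5.0, 100, 10_000
single = truth + rng.normal(0, 1, size=n_trials)
forest = truth + rng.normal(0, 1, size=(n_trials, n_trees)).mean(axis=1)
print("variance of one tree :", round(float(single.var()), 3))   # about 1
print("variance of the mean :", round(float(forest.var()), 3))   # about 1 / n_trees
```

Each bootstrap sample leaves out roughly a third of the rows, which is part of what makes the trees differ from one another.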

Gradient boosting — many trees, sequential (bias ↓)

The other ensemble, and the king of tabular ML: instead of averaging independent trees, gradient boosting builds trees sequentially, each one correcting the errors (the residuals) of the trees so far. Here's the entire idea — fit each new stump to what's left over, add a shrunken version of it:

x = np.linspace(0, 1, 20); y = np.where(x > 0.5, 1.0, 0.0)   # a step to learn
pred = np.zeros_like(y); lr = 0.3
for _ in range(50):
    residual = y - pred                       # what we still get wrong
    # fit the best decision stump (1 split) to the residual:
    best = (1e9, None, 0, 0)
    for t in x:
        l, r = residual[x <= t], residual[x > t]
        lm, rm = (l.mean() if len(l) else 0), (r.mean() if len(r) else 0)
        sse = ((l-lm)**2).sum() + ((r-rm)**2).sum()
        if sse < best[0]: best = (sse, t, lm, rm)
    _, t, lm, rm = best
    pred = pred + lr * np.where(x <= t, lm, rm)   # add the shrunken correction
print("MSE after 50 boosting rounds:", round(float(((y - pred)**2).mean()), 4))

Output:

MSE after 50 boosting rounds: 0.0

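Fifty weak stumps, each fixing the last's mistakes, combined into a near-perfect fit (the MSE rounds to 0). That's boosting: a sequence of weak learners summed into a strong one. It shares the "reduce the residual" spirit of gradient descent (Chapter 8): for squared-error loss, the residual y − pred is the negative gradient of the loss with respect to the prediction (up to a constant factor), so each round is a small step downhill, which is why it's called gradient boosting.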

You will not implement this in practice — you'll use a library:

Library | Why it's famous
XGBoost | the Kaggle-winning workhorse; fast, regularized, battle-tested
LightGBM | faster on big data (histogram splits, leaf-wise growth)
CatBoost | best-in-class handling of categorical features

Don't be confused: bagging (random forest) vs. boosting (XGBoost). Both are tree ensembles, but: bagging trains trees independently and in parallel on resampled data, then averages → reduces variance (fixes overfitting). Boosting trains trees sequentially, each fixing the last's errors → reduces bias (fixes underfitting) but can overfit if over-trained. Rule of thumb: random forest is the safe default; tuned gradient boosting is usually the winner. This is one of the most common ML interview questions — know it cold.

Support Vector Machines (SVM) — the max-margin classifier

An SVM finds the decision boundary with the widest margin — the largest gap between the two classes. Only the points on the edge of that gap (the support vectors) matter. Combined with the kernel trick from Chapter 5, an SVM draws curved boundaries by implicitly mapping data into a higher-dimensional space (the RBF kernel is the usual choice). Its loss is the hinge loss from Chapter 7.

  • Pros: strong on small/medium datasets, effective in high dimensions, elegant theory.
  • Cons: doesn't scale to huge datasets, sensitive to the C and gamma hyperparameters, no native probabilities. Largely superseded by gradient boosting for tabular data and neural nets for perception — but still interview canon.
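The two formulas behind an SVM are short enough to write out. This sketch shows only the hinge loss and the RBF kernel, not a trained SVM; the scores are made-up model outputs and gamma is the same hyperparameter named in the cons above.

```python
import numpy as np

def hinge_loss(scores, y):
    """y in {-1, +1}; zero loss once a point is beyond the margin (y * score >= 1)."""
    return np.maximum(0.0, 1.0 - y * scores)

def rbf_kernel(a, b, gamma=1.0):
    """Similarity that decays with squared distance: exp(-gamma * ||a - b||^2)."""
    return float(np.exp(-gamma * np.sum((a - b) ** 2)))

scores = np.array([2.5, 0.3, -1.0])      # made-up raw outputs w·x + b
y = np.array([1, 1, 1])
print(hinge_loss(scores, y))             # losses 0, 0.7 and 2

a, b = np.array([0.0, 0.0]), np.array([1.0, 1.0])
print(rbf_kernel(a, a), round(rbf_kernel(a, b, gamma=0.5), 3))   # 1.0 0.368
```

The first point is safely past the margin and costs nothing, the second is correct but inside the margin, and the third is misclassified; larger gamma makes the kernel's similarity fall off faster with distance.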

The decision guide

Your situation | Reach for
Tabular data, want the best score | gradient boosting (XGBoost/LightGBM)
Tabular data, want a robust default | random forest
Need interpretability | a single decision tree or logistic regression
Text classification, fast baseline | Naive Bayes
Small data, clear margin | SVM
Simplest possible baseline / similarity-based | k-NN
Images, audio, language | neural networks (Part III)

The takeaway

Classical ML is not obsolete — it rules tabular data and ML interviews. k-NN votes by distance; Naive Bayes multiplies independent feature probabilities; a decision tree splits to reduce impurity; random forests bag trees to cut variance; gradient boosting sequences trees to cut bias (and wins Kaggle); SVMs maximize the margin. Know bagging-vs-boosting and k-NN-vs-k-means cold. Next, the unsupervised side: finding structure with no labels at all. 👉

Unsupervised learning: clustering & dimensionality reduction

Everything so far has been supervised — every example came with a label. But most data in the world is unlabeled, and a huge part of practical ML is finding structure without answers: grouping similar things (clustering) and squeezing many features into a few (dimensionality reduction). These power customer segmentation, anomaly detection, recommendation, visualization, and the preprocessing behind embeddings.

Clustering: grouping without labels

k-means — the workhorse

k-means partitions data into k groups by repeating two steps until stable: assign each point to its nearest centroid, then move each centroid to the mean of its points. (This alternation is an instance of the expectation-maximization pattern.)

import numpy as np
def kmeans(X, k, iters=10, seed=0):
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), k, replace=False)]          # init: random points
    for _ in range(iters):
        d = ((X[:, None, :] - C[None, :, :]) ** 2).sum(2) # dist to each centroid
        labels = d.argmin(1)                              # assign
        C = np.array([X[labels == j].mean(0) for j in range(k)])  # move
    return C, labels

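# X is assumed to hold 40 points: two well-separated blobs of 20 around (0,0) and (3,3).
# Its construction is not shown here; a hypothetical stand-in would be
#   rng = np.random.default_rng(0)
#   X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(3, 0.3, (20, 2))])
# (with a stand-in, the exact numbers will differ from the output below)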
C, labels = kmeans(X, k=2)
print("centroids:\n", C.round(3))
print("cluster sizes:", np.bincount(labels))

Output:

centroids:
 [[ 3.121  3.06 ]
 [-0.081  0.045]]
cluster sizes: [20 20]

k-means recovered the two blobs exactly — centroids at ≈(3,3) and ≈(0,0), 20 points each. (This is the building block the IVF-PQ sister book uses to partition vector space for fast search.)

  • You must choose k. Use the elbow method (plot within-cluster variance vs. k, look for the bend) or the silhouette score.
  • It assumes round, equal-sized clusters and is sensitive to initialization — k-means++ seeding (spread initial centroids apart) is the standard fix.
  • Scale your features first (Chapter 3) — k-means uses Euclidean distance, so an unscaled big feature dominates.
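The k-means++ seeding mentioned above can be sketched in a few lines. This is a minimal illustration, not a library implementation: the function name kmeans_pp_init and the demo data X_demo are made up for this example. The idea is to pick the first centroid at random, then pick each later one with probability proportional to its squared distance from the nearest centroid already chosen.

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """k-means++ style seeding (sketch): spread the initial centroids apart."""
    rng = np.random.default_rng(seed)
    centroids = [X[rng.integers(len(X))]]           # first centroid: uniform at random
    for _ in range(k - 1):
        chosen = np.array(centroids)
        # squared distance from every point to its nearest chosen centroid
        d2 = ((X[:, None, :] - chosen[None, :, :]) ** 2).sum(2).min(1)
        probs = d2 / d2.sum()                       # far-away points are more likely
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)

# hypothetical demo data: two blobs around (0,0) and (5,5), 20 points each
rng_demo = np.random.default_rng(0)
X_demo = np.vstack([rng_demo.normal(0, 0.3, (20, 2)),
                    rng_demo.normal(5, 0.3, (20, 2))])
seeds = kmeans_pp_init(X_demo, k=2)
print("initial centroids:\n", seeds.round(2))
```

A point that is already a centroid has distance 0, so it can never be picked twice. To use this with the kmeans function above, you would replace its random-points init line with a call to this function.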

Don't be confused: k-means vs. k-NN (again). k-means = unsupervised clustering (k = number of groups, no labels). k-NN = supervised classification (k = neighbors that vote, needs labels). Same letter, opposite problems.

DBSCAN & hierarchical clustering

  • DBSCAN — groups points that are densely packed, marking sparse points as noise/outliers. Finds arbitrary shapes and doesn't need k (you set a radius and a min-points instead). Great for anomaly detection and non-spherical clusters.
  • Hierarchical (agglomerative) — repeatedly merge the two closest clusters, building a tree (dendrogram) you can cut at any level. No k up front; great for understanding nested structure.

Don't be confused: which clustering algorithm? k-means → fast, round clusters, you know k. DBSCAN → arbitrary shapes, outlier detection, roughly uniform density, you don't know k (clusters of very different densities are hard to capture with a single radius). Hierarchical → small data, you want the full merge tree.
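To make the "radius and min-points" idea concrete, here is a minimal DBSCAN sketch in plain NumPy. It is a teaching version under stated assumptions: it builds the full pairwise distance matrix (fine for small data only), counts a point as its own neighbor, and uses a made-up toy dataset (two tight 3×3 grids plus one far-away point). Production code would use a library implementation instead.

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Minimal DBSCAN sketch. Returns one label per point; -1 means noise."""
    n = len(X)
    dist = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(2))  # all pairwise distances
    neighbors = [np.flatnonzero(dist[i] <= eps) for i in range(n)]
    core = np.array([len(nb) >= min_pts for nb in neighbors])      # dense enough to grow a cluster
    labels = np.full(n, -1)
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or not core[i]:
            continue                        # already assigned, or cannot start a cluster
        labels[i] = cluster
        stack = [i]
        while stack:
            p = stack.pop()
            if not core[p]:
                continue                    # border point: joins the cluster but does not expand it
            for q in neighbors[p]:
                if labels[q] == -1:
                    labels[q] = cluster
                    stack.append(q)
        cluster += 1
    return labels

# toy data: two 3x3 grids (spacing 0.1) and one isolated point
grid = np.array([[i * 0.1, j * 0.1] for i in range(3) for j in range(3)])
X_demo = np.vstack([grid, grid + 3.0, [[10.0, 10.0]]])
labels = dbscan(X_demo, eps=0.15, min_pts=3)
print(labels)   # nine 0s, nine 1s, then -1 for the isolated point
```

Notice that k never appears: the number of clusters falls out of the density settings, and the isolated point is reported as noise rather than forced into a group.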

Dimensionality reduction: many features → few

High-dimensional data is hard to visualize, slow to process, and prone to overfitting (the curse of dimensionality). Dimensionality reduction compresses features while keeping the signal.

PCA — the linear workhorse

Principal Component Analysis finds the orthogonal directions of maximum variance and projects onto the top few. The most numerically stable way to compute it is the SVD (Chapter 23) of the centered data:

Xc = X - X.mean(0)                          # 1. center the data
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)   # 2. SVD
print("singular values     :", S.round(3))
print("PC1 direction        :", Vt[0].round(3))
print("variance explained   :", ((S**2) / (S**2).sum()).round(3))
print("project onto PC1 ->", (Xc @ Vt[0]).round(3)[:5], "...")

Output:

singular values     : [14.039  1.609]
PC1 direction        : [0.727 0.686]
variance explained   : [0.987 0.013]
project onto PC1 -> [-2.683 -2.45  -2.005 -2.668 -2.123] ...

The first component captures 98.7% of the variance — because our two blobs lie mostly along the (1,1) diagonal, exactly the direction [0.727, 0.686] PCA found. We could drop the second dimension and lose almost nothing. The variance explained ratio is how you decide how many components to keep (e.g. "enough to reach 95%").

  • Uses: compression, denoising, speeding up downstream models, and visualizing high-dim data in 2-D.
  • Limits: PCA is linear, so it can't unfold curved structure. Its components are also hard to interpret, because each one is a mixture of all the original features.
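The "keep enough components to reach 95%" rule can be sketched as a small helper. The function name components_for and the demo data are hypothetical: the data is built from 2 hidden directions spread over 5 features plus a little noise, so at most 2 components should be needed.

```python
import numpy as np

def components_for(X, target=0.95):
    """Smallest number of principal components whose cumulative
    explained-variance ratio reaches `target` (sketch)."""
    Xc = X - X.mean(0)                                  # center
    S = np.linalg.svd(Xc, compute_uv=False)             # singular values only
    ratio = S**2 / (S**2).sum()                         # variance explained per component
    cumulative = np.cumsum(ratio)
    n_keep = int(np.searchsorted(cumulative, target) + 1)
    return n_keep, ratio

# hypothetical data: 200 points, 5 features, 2 underlying directions + small noise
rng_pca = np.random.default_rng(0)
latent = rng_pca.normal(size=(200, 2))
mixing = rng_pca.normal(size=(2, 5))
X_demo = latent @ mixing + 0.01 * rng_pca.normal(size=(200, 5))

n_keep, ratio = components_for(X_demo)
print("variance explained:", ratio.round(4))
print("components needed for 95%:", n_keep)
```

Five features go in, but almost all of the variance sits in the first two components, which is exactly the situation where PCA compresses well.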

t-SNE and UMAP — visualization of non-linear structure

When you want to see clusters in high-dim data (e.g. embeddings from Chapter 14), t-SNE and UMAP are the go-to non-linear methods. They place similar points near each other in 2-D, revealing structure PCA misses.

Don't be confused: t-SNE is for looking, not for features. Use t-SNE/UMAP to visualize (2-D plots for humans), never as input to another model: the distances and cluster sizes in a t-SNE plot are distorted and not meaningful, and the result changes run to run. For actual dimensionality reduction feeding a model, use PCA (or an autoencoder). UMAP is faster than t-SNE and preserves more global structure.

Where unsupervised learning shows up

  • Customer/user segmentation — cluster behavior to target groups.
  • Anomaly/fraud detection — points far from any cluster are suspicious (DBSCAN, isolation forests).
  • Recommendation — cluster users/items; compress embeddings.
  • Preprocessing — PCA before a model to cut noise and dimensions.
  • Visualization — t-SNE/UMAP to inspect embedding spaces and debug models.
  • Vector search — k-means partitions space (the IVF index).

The takeaway

Unsupervised learning finds structure without labels. k-means partitions into k round clusters (choose k with the elbow/silhouette); DBSCAN finds dense arbitrary-shaped clusters and outliers without k; hierarchical builds a merge tree. PCA (via SVD) projects onto the directions of maximum variance for compression; t-SNE/UMAP are for visualization only. Scale features first, and never feed a t-SNE plot to a model. Next, the probability and statistics that underpin all of it, and half of every interview. 👉

Probability & statistics for ML

Machine learning is applied statistics wearing a hoodie. Loss functions are likelihoods, regularization is a prior, evaluation is hypothesis testing, and "data scientist" interviews are half probability puzzles. This chapter is the statistical backbone — distributions, Bayes' theorem, expectation, the Central Limit Theorem, MLE, and hypothesis testing — with the intuition and the code, not the proofs.

The vocabulary of uncertainty

  • Random variable — a quantity whose value is uncertain (a die roll, tomorrow's clicks).
  • Probability distribution — how likely each value is. Discrete (countable outcomes) uses a probability mass function; continuous uses a probability density function.
  • Expectation $E[X]$ — the long-run average. Variance $\text{Var}(X)$ — how spread out values are; standard deviation is its square root.

For a fair six-sided die, both come straight from the definitions:

import numpy as np
x = np.arange(1, 7); p = np.ones(6) / 6          # a fair die
print("E[X]  =", (x * p).sum())                  # mean
print("Var   =", round(((x - 3.5)**2 * p).sum(), 4))

Output:

E[X]  = 3.5
Var   = 2.9167

The distributions you must recognize

Distribution      | Models                             | Show up as
Bernoulli         | one yes/no trial                   | a single binary label
Binomial          | # successes in n trials            | conversion counts
Normal (Gaussian) | bell curve, sums of many effects   | noise, weights, the CLT
Uniform           | all values equally likely          | random init, baselines
Poisson           | counts of rare events per interval | arrivals, # of clicks
Exponential       | time between events                | wait times, survival

The Normal is the celebrity — defined by its mean $\mu$ and standard deviation $\sigma$ — because of the next idea.

The Central Limit Theorem — why Normal is everywhere

The CLT says: average enough independent samples from almost any distribution, and the distribution of that average is approximately Normal. Watch averages of uniform numbers become a tight bell curve:

rng = np.random.default_rng(0)
sample_means = [rng.uniform(0, 1, size=30).mean() for _ in range(10000)]
print("mean of sample-means: %.3f" % np.mean(sample_means))
print("std  of sample-means: %.3f" % np.std(sample_means))
print("CLT prediction 1/sqrt(12*30) = %.3f" % (1/np.sqrt(12*30)))

Output:

mean of sample-means: 0.499
std  of sample-means: 0.053
CLT prediction 1/sqrt(12*30) = 0.053

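The spread of the average matches the predicted value to three decimals. (Strictly, $1/\sqrt{12 \cdot 30}$ is the standard error of the mean; the CLT adds that the averages are approximately Normal around the true mean.) This is why we can put confidence intervals on metrics and run A/B tests (Chapter 24): the uncertainty of an average is predictable, and it shrinks like $1/\sqrt{n}$ (quadruple the data to halve the error).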
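As a rough sketch of what "a confidence interval on a metric" means in code, the helper below builds a Normal-approximation 95% interval for a mean and shows its width shrinking as n grows. The name mean_ci and the uniform demo data are assumptions for illustration, and it uses its own random generator so it doesn't disturb the rng used elsewhere in this chapter.

```python
import numpy as np

def mean_ci(x, z=1.96):
    """Normal-approximation 95% confidence interval for the mean of x (sketch)."""
    x = np.asarray(x, dtype=float)
    se = x.std(ddof=1) / np.sqrt(len(x))     # standard error of the mean
    return x.mean() - z * se, x.mean() + z * se

rng_ci = np.random.default_rng(1)            # separate generator, hypothetical data
widths = []
for n_samples in (100, 400, 1600):
    lo, hi = mean_ci(rng_ci.uniform(0, 1, size=n_samples))
    widths.append(hi - lo)
    print("n = %4d   95%% CI width = %.3f" % (n_samples, hi - lo))
```

Each time the sample is quadrupled, the interval should come out roughly half as wide, which is the $1/\sqrt{n}$ rule in action.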

Bayes' theorem — updating belief with evidence

The single most important formula in ML statistics. It flips a conditional probability:

$$ P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)} $$

The classic interview gut-punch: a disease affects 1% of people; a test is 99% accurate on the sick and has a 5% false-positive rate. You test positive — what's the chance you're actually sick?

pD, p_pos_given_D, p_pos_given_notD = 0.01, 0.99, 0.05
p_pos = p_pos_given_D * pD + p_pos_given_notD * (1 - pD)   # total prob of a positive
p_D_given_pos = p_pos_given_D * pD / p_pos                 # Bayes
print("P(sick | positive test) =", round(p_D_given_pos, 4))

Output:

P(sick | positive test) = 0.1667

Only 17% — wildly lower than the "99% accurate" framing suggests. Because the disease is rare, the false positives from the huge healthy population swamp the true positives. This base-rate fallacy is why a model with 99% accuracy on rare events can still be useless (Chapter 10) — and why interviewers love this question.

  • Prior $P(A)$ — belief before evidence (the 1%).
  • Likelihood $P(B\mid A)$ — how well evidence fits the hypothesis (the 99%).
  • Posterior $P(A\mid B)$ — updated belief (the 17%).
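The prior → posterior update is easy to wrap in a function, and doing so shows what "updating belief" really means: today's posterior becomes tomorrow's prior. The sketch below reuses the numbers from the disease example and adds one assumption that is not in the text: a second test whose result is independent of the first, given your true health status.

```python
def posterior(prior, p_evidence_given_h, p_evidence_given_not_h):
    """Bayes' theorem for a yes/no hypothesis H after one piece of evidence."""
    numerator = p_evidence_given_h * prior
    evidence = numerator + p_evidence_given_not_h * (1 - prior)   # total probability
    return numerator / evidence

after_one = posterior(0.01, 0.99, 0.05)        # the worked example above
after_two = posterior(after_one, 0.99, 0.05)   # the posterior is now the prior
print("after one positive :", round(after_one, 4))   # 0.1667
print("after two positives:", round(after_two, 4))   # 0.7984
```

One positive test moves you from 1% to about 17%; a second independent positive moves you to about 80%. The evidence is the same each time, but the base rate it is applied to has changed.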

MLE — where loss functions come from

Maximum Likelihood Estimation picks the parameters that make the observed data most probable. It's not an aside — it's the origin of the losses in Chapter 7:

  • Maximizing the likelihood of Gaussian-noised data ⟺ minimizing MSE.
  • Maximizing the likelihood of Bernoulli labels ⟺ minimizing cross-entropy.

So when you minimize cross-entropy, you are doing maximum likelihood. MAP (Maximum A Posteriori) adds a prior via Bayes — and a Gaussian prior on the weights is exactly L2 regularization (Chapter 9). The whole loss-and-regularization story is Bayesian statistics in disguise.
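A small numerical check of the Bernoulli claim, using only the standard library. The ten labels are made up for illustration, and a simple grid search stands in for a real optimizer. The value of p that minimizes cross-entropy should be the same as the maximum-likelihood estimate, which for a Bernoulli is just the fraction of ones.

```python
import math

labels = [1, 1, 0, 1, 0, 1, 1, 1, 0, 1]        # hypothetical data: 7 ones out of 10

def cross_entropy(p, ys):
    """Average negative log-likelihood of Bernoulli(p) on labels ys."""
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p) for y in ys) / len(ys)

grid = [i / 100 for i in range(1, 100)]         # candidate values 0.01 .. 0.99
best_p = min(grid, key=lambda p: cross_entropy(p, labels))
print("p minimizing cross-entropy:", best_p)                      # 0.7
print("fraction of ones (the MLE):", sum(labels) / len(labels))   # 0.7
```

Minimizing the loss and maximizing the likelihood land on the same number because cross-entropy is the negative log-likelihood, rescaled by the number of examples.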

Hypothesis testing & p-values

The framework for "is this difference real or just noise?" You assume a null hypothesis (no effect), then compute the p-value: the probability of seeing data this extreme or more if the null were true. Small p (conventionally < 0.05) → "data this extreme would be rare under pure chance" → reject the null. A clean way to compute one, with almost no distributional assumptions, is a permutation test: shuffle the labels and see how often chance reproduces your effect:

a = rng.normal(0.0, 1, 50); b = rng.normal(0.6, 1, 50)   # b really is higher
observed = b.mean() - a.mean()
pool = np.concatenate([a, b]); count = 0; N = 20000
for _ in range(N):                                       # shuffle group labels
    perm = rng.permutation(pool)
    if abs(perm[50:].mean() - perm[:50].mean()) >= abs(observed):
        count += 1
print("observed difference: %.3f" % observed)
print("permutation p-value: %.4f" % (count / N))

Output:

observed difference: 0.515
permutation p-value: 0.0141

p = 0.014 < 0.05 → the difference is statistically significant (unlikely to be chance). Classic tests (t-test, chi-squared, z-test) compute the same idea with formulas instead of shuffling.

Don't be confused: what a p-value is NOT. A p-value is not "the probability the null is true," and it says nothing about effect size. p = 0.001 on a trivial 0.01% improvement is significant but useless; a big effect with p = 0.06 may matter a lot. Always report the effect size and a confidence interval, not just "p < 0.05." Also: test many things and some will look significant by luck (the multiple-comparisons problem), so correct for it.

Don't be confused: statistical vs. practical significance. "Significant" means "probably not noise," not "big enough to care about." With enough data, any tiny difference becomes statistically significant. Engineers ship on practical significance (is the lift worth the complexity?), informed by statistical significance (is it even real?).
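One simple way to "correct for it" is the Bonferroni correction: divide the significance threshold by the number of tests. The sketch below is illustrative only. The five p-values are invented, and the false-positive calculation assumes the 20 tests are independent. Bonferroni is the most conservative of several standard corrections, chosen here because it fits in one line.

```python
def bonferroni(p_values, alpha=0.05):
    """Flag which tests stay significant after a Bonferroni correction."""
    threshold = alpha / len(p_values)           # split alpha across all tests
    return [p < threshold for p in p_values]

# chance of at least one false positive across 20 independent tests at alpha = 0.05
fwer = 1 - 0.95 ** 20
print("P(at least one false positive in 20 tests): %.2f" % fwer)   # 0.64

p_values = [0.003, 0.020, 0.041, 0.300, 0.700]  # hypothetical p-values from 5 metrics
print("naive (p < 0.05):", [p < 0.05 for p in p_values])
print("bonferroni      :", bonferroni(p_values))
```

Three of the five metrics look significant at the naive threshold, but only the first survives once the threshold is tightened to 0.05 / 5 = 0.01.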

Frequentist vs. Bayesian (the two schools)

  • Frequentist — parameters are fixed unknowns; probability is long-run frequency. p-values, confidence intervals. The default in A/B testing.
  • Bayesian — parameters have distributions representing belief; update with data via Bayes. Priors, posteriors, credible intervals. Natural for small data and online updating.

Both are correct tools; interviewers like to hear you know the difference and when each fits.

The takeaway

Probability is the language of ML uncertainty: distributions describe randomness, the CLT makes averages Normal (and uncertainty shrink like $1/\sqrt n$), Bayes updates belief with evidence (mind the base rate), MLE is where your loss functions come from, and hypothesis testing tells real effects from noise — but a p-value is not an effect size. This is half of any data-science interview. Next, the other mathematical pillar: the linear algebra that everything runs on. 👉

Linear algebra you actually need

Every model in this book is linear algebra underneath: features are vectors, layers are matrix multiplies, embeddings are vectors in a space, and PCA, recommenders, and transformers all reduce to a few matrix operations. You don't need a semester of proofs — you need a working grip on six ideas. This chapter centralizes them, building on the tensor mechanics of Chapter 2.

Vectors & matrices as transformations

A vector is a point/arrow in space (Chapter 0). The key mental shift: a matrix is a function that transforms vectors — rotating, scaling, projecting them. Matrix-times-vector is applying that transformation:

import numpy as np
R = np.array([[0., -1.],
              [1.,  0.]])               # a 90-degree rotation
print("rotate [1,0] ->", (R @ np.array([1., 0.])).round(3))

Output:

rotate [1,0] -> [0. 1.]

The matrix R turned the rightward arrow [1,0] into the upward arrow [0,1] — a 90° rotation. A neural-network layer xW is the same thing: a learned transformation of the input vector into a new space (Chapter 11).

The dot product, geometrically

a · b = Σ aᵢbᵢ measures alignment: it's $\lVert a\rVert\,\lVert b\rVert\cos\theta$. Positive when vectors point similarly, zero when perpendicular: the entire basis of cosine similarity and attention (Chapter 4).
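A quick numerical check of the geometric reading, with made-up vectors (the names u and w are chosen so they don't clash with variables used elsewhere). Rearranging the formula gives cos θ = (a · b) / (‖a‖ ‖b‖), so the sign of the dot product tells you whether two vectors agree, are unrelated, or oppose each other.

```python
import numpy as np

def cosine(a, b):
    """cos(theta) between a and b, from the dot product and the two lengths."""
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

u = np.array([1., 0.])
for w in (np.array([2., 0.]),     # same direction
          np.array([1., 1.]),     # 45 degrees apart
          np.array([0., 3.]),     # perpendicular
          np.array([-1., 0.])):   # opposite direction
    print(w, f"dot = {u @ w:+.1f}   cos = {cosine(u, w):+.3f}")
```

The cosines come out as +1, about +0.707, 0, and −1. Note that the first and third vectors have lengths 2 and 3, yet the cosine ignores length and reports only direction.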

Norms — measuring size

A norm is a vector's length. The two you'll meet:

v = np.array([3., 4.])
print("L2 (Euclidean):", np.linalg.norm(v))     # sqrt(9+16)
print("L1 (Manhattan):", np.linalg.norm(v, 1))  # |3|+|4|

Output:

L2 (Euclidean): 5.0
L1 (Manhattan): 7.0

These are the same L1/L2 from regularization (Chapter 9) and distance (Chapter 4). Now you see they're just norms: of the weight vector in regularization, and of the difference vector between two points in distance.

Rank — how much independent information

The rank of a matrix is the number of linearly independent rows (or columns) — how many genuinely distinct directions it spans. A matrix whose rows are combinations of each other is low-rank (redundant):

M = np.array([[1., 2, 3],
              [2., 4, 6],     # exactly 2x the first row -> redundant
              [1., 1, 1]])
print("rank:", np.linalg.matrix_rank(M))         # 3 rows, but only 2 independent

Output:

rank: 2

Low rank = compressible. This idea underlies dimensionality reduction, matrix factorization for recommenders (a huge user×item matrix approximated by low-rank user/item embeddings — see the recsys book), and LoRA, the low-rank fine-tuning trick for LLMs (Chapter 27).
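To see why low rank means compressible, here is a small sketch with invented sizes (100 users, 80 items, rank 2). A matrix built as the product of two thin factors looks full-sized, but it contains only as much information as the factors themselves.

```python
import numpy as np

rng_lr = np.random.default_rng(0)
users, items, r = 100, 80, 2                      # hypothetical sizes
P = rng_lr.normal(size=(users, r))                # thin user factors (100 x 2)
Q = rng_lr.normal(size=(r, items))                # thin item factors (2 x 80)
ratings = P @ Q                                   # looks like a full 100 x 80 matrix

print("rank of the full matrix:", np.linalg.matrix_rank(ratings))   # 2
print("numbers in full matrix :", ratings.size)                     # 8000
print("numbers in the factors :", P.size + Q.size)                  # 360
```

Storing the two factors takes 360 numbers instead of 8,000. Recommender factorization and LoRA both lean on this same trade: store or learn r·(m + n) numbers rather than m·n.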

Eigenvalues & eigenvectors — a matrix's natural axes

An eigenvector of a matrix is a special direction the matrix only stretches, never rotates; its eigenvalue is the stretch factor. Formally $Av = \lambda v$:

A = np.array([[2., 1.],
              [1., 2.]])
vals, vecs = np.linalg.eig(A)
print("eigenvalues:", np.sort(vals))
v = vecs[:, 0]
print("A v      =", (A @ v).round(3))
print("lambda v =", (vals[0] * v).round(3))      # equal -> v is an eigenvector

Output:

eigenvalues: [1. 3.]
A v      = [2.121 2.121]
lambda v = [2.121 2.121]

Av equals λv exactly — v is a direction the matrix purely scales. Eigenvectors are a matrix's "natural axes." They power PCA (the eigenvectors of the covariance matrix are the principal components), PageRank, and the analysis of dynamical systems.

SVD — the Swiss Army knife

Singular Value Decomposition factors any matrix into $M = U\Sigma V^\top$: a rotation, a scaling by the singular values $\Sigma$, and another rotation. It's the most useful decomposition in all of applied ML because the singular values rank directions by importance, so keeping the top few gives the best possible low-rank approximation:

U, S, Vt = np.linalg.svd(M)
print("singular values:", S.round(3))
rank1 = S[0] * np.outer(U[:, 0], Vt[0])           # keep only the top component
print("rank-1 approximation error:", round(np.linalg.norm(M - rank1), 3))

Output:

singular values: [8.52  0.643 0.   ]
rank-1 approximation error: 0.643

The third singular value is 0 (confirming rank 2), and a single component already reconstructs the matrix with an error of 0.643, which is exactly the second singular value and only about 7.5% of the matrix's overall size (its Frobenius norm, ≈8.54). SVD is the engine behind PCA (Chapter 21), recommender matrix factorization, latent semantic analysis, image compression, and noise reduction. If you remember one decomposition, remember SVD.

Don't be confused: eigendecomposition vs. SVD. Eigendecomposition applies only to square matrices and can produce complex values; SVD works on any matrix (any shape) and always gives real, non-negative singular values. For the symmetric positive semi-definite matrices common in ML (like covariance), the two coincide: the singular values equal the eigenvalues. When in doubt, reach for SVD: it's more general and more numerically stable.
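Both halves of that box can be checked numerically. In the sketch below the covariance matrix comes from made-up random data, and the second matrix is a 90° rotation. Assumption to note: eigvalsh is used for the symmetric case because it is the routine meant for symmetric matrices; it returns eigenvalues in ascending order, so they are flipped to match the SVD's descending order.

```python
import numpy as np

rng_cov = np.random.default_rng(0)
data = rng_cov.normal(size=(50, 3))                 # hypothetical data
cov = np.cov(data, rowvar=False)                    # symmetric, positive semi-definite

eigvals = np.linalg.eigvalsh(cov)[::-1]             # ascending -> descending
singvals = np.linalg.svd(cov, compute_uv=False)     # already descending
print("covariance: eigenvalues == singular values?", np.allclose(eigvals, singvals))

rotation = np.array([[0., -1.],
                     [1.,  0.]])                    # a 90-degree rotation
rot_eig = np.linalg.eigvals(rotation)
rot_sv = np.linalg.svd(rotation, compute_uv=False)
print("rotation eigenvalues    :", rot_eig)         # a complex pair, +i and -i
print("rotation singular values:", rot_sv)          # real: both equal to 1
```

A rotation leaves no real direction unrotated, so its eigenvalues are complex, while its singular values simply report that it stretches nothing.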

The mental cheat sheet

Object          | Geometric meaning           | Shows up in
matrix × vector | transform a point           | every NN layer
dot product     | alignment / similarity      | cosine, attention
norm            | vector length               | distance, regularization
rank            | independent directions      | compression, matrix factorization
eigenvector     | axis a matrix only scales   | PCA, PageRank
SVD             | best low-rank factorization | PCA, recommenders, LoRA

The takeaway

A matrix is a transformation; the dot product measures alignment; norms measure size (L1/L2 = your regularizers and distances); rank measures redundancy (the basis of compression and low-rank tricks); eigenvectors are a matrix's natural scaling axes; and SVD — the master tool — gives the best low-rank approximation of any matrix, powering PCA, recommenders, and LoRA. With probability and linear algebra in hand, we can tackle the applied-stats skill every product team needs: running a trustworthy experiment. 👉

Experimentation & A/B testing

A model that scores well offline is a hypothesis, not a win. The only proof it helps is a controlled experiment on real users — an A/B test. This is the bridge between ML and business impact, the skill product teams interview hardest on, and the place careers are made or embarrassed. It's also where the statistics of Chapter 22 become a daily tool.

The core idea

Randomly split users into two groups, change one thing for one of them, and measure the difference:

  • Control (A) — the current experience.
  • Treatment (B) — the new model/feature/UI.
  • Randomization — the magic. Random assignment makes the groups statistically identical on average, so any difference in outcome is caused by your change, not by confounds. This is what makes an A/B test a causal experiment, not just a correlation.

You pick one primary metric ahead of time (conversion, click-through, revenue per user) and ask: is B's metric reliably better than A's, or is the difference just noise?
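In practice the random split is often implemented by hashing rather than by calling a random number generator, so that the same user always lands in the same arm. The sketch below shows that idea. It is one common approach, not something this chapter prescribes, and the function name assign_arm and the experiment name "ranker_v2" are made up for illustration.

```python
import hashlib

def assign_arm(user_id, experiment="ranker_v2", treatment_share=0.5):
    """Deterministically map a user to 'A' or 'B' by hashing id + experiment name."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000     # roughly uniform number in [0, 1)
    return "B" if bucket < treatment_share else "A"

arms = [assign_arm(uid) for uid in range(10_000)]  # hypothetical user ids 0..9999
share_b = arms.count("B") / len(arms)
print("share in treatment:", share_b)
print("user 42 is stable :", assign_arm(42) == assign_arm(42))
```

Including the experiment name in the hash means a user's arm in one experiment tells you nothing about their arm in another, which keeps experiments from contaminating each other.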

A worked test — and a humbling result

Say control converts at 10%, your new model at 12%, with 2,000 users per arm:

import numpy as np
from math import erf, sqrt
rng = np.random.default_rng(0)
nA = nB = 2000
a = rng.random(nA) < 0.10          # control: 10% true conversion
b = rng.random(nB) < 0.12          # treatment: 12% true conversion
cA, cB = a.mean(), b.mean()
print("control=%.3f  treatment=%.3f  observed lift=%.1f%%" % (cA, cB, 100*(cB-cA)/cA))

# two-proportion z-test
p_pool = (a.sum() + b.sum()) / (nA + nB)
se = np.sqrt(p_pool * (1 - p_pool) * (1/nA + 1/nB))
z = (cB - cA) / se
p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
print("z=%.3f  p-value=%.4f" % (z, p_value))
lo, hi = (cB-cA) - 1.96*se, (cB-cA) + 1.96*se
print("95%% CI for the difference: [%.4f, %.4f]" % (lo, hi))

Output:

control=0.105  treatment=0.118  observed lift=12.4%
z=1.306  p-value=0.1915
95% CI for the difference: [-0.0065, 0.0325]

A 12.4% lift that you cannot trust. The p-value is 0.19 (well above 0.05) and the confidence interval includes zero — meaning "no effect" is still plausible. With only 2,000 users per arm, this test is underpowered: the real effect exists, but there isn't enough data to distinguish it from noise. Shipping on this would be a mistake — and recognizing it is exactly what separates a strong candidate from a weak one.

Power & sample size — decide before you run

The fix is to size the experiment up front. Statistical power is the probability of detecting a real effect of a given size. A back-of-envelope rule for the sample size per arm to detect an absolute change delta at baseline rate p (≈80% power, 5% significance):

def sample_size(p, delta):
    return int(16 * p * (1 - p) / delta ** 2)
print("users/arm to detect +2pp at 10% baseline:", sample_size(0.10, 0.02))

Output:

users/arm to detect +2pp at 10% baseline: 3600

We needed 3,600 per arm, not 2,000 — which is why the test above failed to reach significance. The lesson interviewers want: compute the required sample size and runtime before launching, never peek and stop early. Smaller effects need quadratically more users (halve delta → 4× the sample).
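Where does the 16 in the rule of thumb come from? It is a rounded version of the standard Normal-approximation formula for comparing two proportions. The sketch below spells that formula out using only the standard library. The function name sample_size_exact is hypothetical, and "exact" is relative: this is still an approximation, just a less rounded one.

```python
from math import ceil
from statistics import NormalDist

def sample_size_exact(p1, p2, alpha=0.05, power=0.80):
    """Per-arm n for a two-sided two-proportion z-test (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 for alpha = 5%
    z_beta = NormalDist().inv_cdf(power)            # about 0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)        # one variance term per arm
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

n_exact = sample_size_exact(0.10, 0.12)
print("users/arm, fuller formula:", n_exact)
print("the '16' is roughly 2 * (1.96 + 0.84)^2 = %.1f" % (2 * (1.96 + 0.84) ** 2))
```

The fuller formula lands a little above the rule-of-thumb answer (roughly 3,800 versus 3,600) because it uses each arm's own variance instead of the baseline's for both. Either way, 2,000 per arm was well short.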

The four numbers of every experiment

Quantity                  | Question it answers
Significance (α)          | how often will we falsely claim an effect? (usually 5%)
Power (1−β)               | how often will we catch a real effect? (aim 80%)
Minimum detectable effect | the smallest lift worth detecting
Sample size / duration    | how many users and how long, derived from the above

The classic mistakes (interview gold)

Don't be confused: these traps invalidate most amateur A/B tests.

  • Peeking / early stopping. Checking the p-value daily and stopping when it dips below 0.05 massively inflates false positives — you're running dozens of tests. Fix the duration in advance (or use sequential-testing methods built for peeking).
  • Underpowered tests. Too few users → you'll miss real effects and call winners losers (exactly the demo above).
  • Multiple metrics / comparisons. Test 20 metrics and one will look "significant" by chance. Pick one primary metric; correct the rest.
  • The novelty effect. Users react to anything new; run long enough for the bump to fade.
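  • Sample-ratio mismatch (SRM). If your 50/50 split arrives as 52/48 over thousands of users, the randomization is almost certainly broken (a chi-squared test on the counts tells you whether the gap is beyond chance). Trust nothing until you fix it.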
  • Simpson's paradox. A treatment can win overall yet lose in every subgroup (or vice versa) when group sizes are imbalanced. Segment your results.
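Of the traps above, sample-ratio mismatch is the easiest to check automatically. A minimal sketch, using only the standard library: a chi-squared goodness-of-fit test on the two arm counts. The function name srm_p_value and the example counts are made up. With one degree of freedom, the chi-squared tail probability can be written with erfc, which avoids needing a stats library.

```python
from math import erfc, sqrt

def srm_p_value(n_a, n_b, expected_share_a=0.5):
    """Chi-squared goodness-of-fit test (1 degree of freedom) for the traffic split."""
    total = n_a + n_b
    exp_a = total * expected_share_a
    exp_b = total * (1 - expected_share_a)
    chi2 = (n_a - exp_a) ** 2 / exp_a + (n_b - exp_b) ** 2 / exp_b
    return erfc(sqrt(chi2 / 2))          # tail probability of chi-squared with 1 dof

p_small = srm_p_value(52, 48)            # 100 users
p_large = srm_p_value(5200, 4800)        # 10,000 users, same 52/48 ratio
print("52 / 48     : p = %.3f" % p_small)
print("5200 / 4800 : p = %.6f" % p_large)
```

The same 52/48 ratio is unremarkable with 100 users and wildly improbable with 10,000, so an SRM alarm should be based on the test, not on the ratio alone. Teams often use a strict threshold (such as p < 0.001) for this check, since it runs on every experiment.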

Beyond the basic A/B test

  • A/A test — run control against itself to validate the pipeline; it should show no difference. A great sanity check.
  • Multi-armed bandits — adaptively send more traffic to the winning arm during the test, trading some statistical cleanliness for less regret. Good for many short-lived options.
  • CUPED / variance reduction — use pre-experiment data to shrink noise and reach significance faster.
  • Guardrail metrics — secondary metrics (latency, complaints, churn) that must not get worse even if the primary metric wins.
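The core of CUPED fits in two lines: regress the in-experiment metric on the same users' pre-experiment metric, and subtract the part the past already explains. The sketch below uses simulated data with invented numbers (a pre-period spend and a correlated in-experiment spend) and shows a single group only; in a real test the slope is usually estimated on both arms pooled together.

```python
import numpy as np

rng_cuped = np.random.default_rng(0)
n_users = 5000
pre = rng_cuped.normal(100, 20, size=n_users)                  # hypothetical pre-experiment spend
post = 0.8 * pre + rng_cuped.normal(0, 10, size=n_users)       # experiment-period spend, correlated

theta = np.cov(pre, post)[0, 1] / pre.var(ddof=1)              # regression slope of post on pre
post_cuped = post - theta * (pre - pre.mean())                 # same mean, less variance

print("mean before / after: %.2f / %.2f" % (post.mean(), post_cuped.mean()))
print("var  before / after: %.1f / %.1f" % (post.var(), post_cuped.var()))
```

The mean is untouched, because the subtracted term averages to zero, while the variance drops sharply. Lower variance means a smaller standard error, so the same experiment reaches significance with fewer users.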

Connecting it to ML

This is how a model earns its place in production. Offline metrics (Chapter 10) tell you a model is probably better; the A/B test proves it moves the business metric, and the two often disagree (a higher-AUC ranker can lower engagement). The online experiment is the final, authoritative judge. Senior ML engineers live in this loop: offline eval → ship behind a flag → A/B test → roll out or roll back.

The takeaway

A/B testing is the causal proof that an ML change helps real users — randomization removes confounds, and a primary metric plus a p-value and confidence interval tells you if the lift is real. Size the test for adequate power before running it (small effects cost quadratically more users), and avoid the killers: peeking, underpowering, multiple comparisons, novelty, SRM, and Simpson's paradox. Offline metrics propose; the experiment disposes. Now let's zoom out to the tools, languages, and platforms you'll use to do all of this for real. 👉

The practitioner's toolkit & ecosystem

Knowing the math is half the job; the other half is fluency with the tools real teams use every day. This chapter is the lay of the land — the languages, libraries, and workflow an ML engineer is expected to know, and the ones interviewers probe to tell a hobbyist from a professional. You don't need mastery of all of them, but you must know what each is for.

A note on the code here: unlike the rest of the book, the snippets below are idiomatic reference (the libraries aren't all installed in this build), so outputs are described, not captured. They're here so the syntax is familiar when you meet it.

The languages

Language   | Role in ML                                      | Must-know level
Python     | the lingua franca: models, glue, serving        | fluent
SQL        | pulling & shaping data from databases           | fluent (non-negotiable)
Bash       | scripts, servers, pipelines, Docker             | comfortable
R          | classical stats, research, biostatistics        | aware of
Scala/Java | big-data (Spark), JVM production systems        | aware of
C++/CUDA   | performance-critical kernels, inference engines | aware of

Don't be confused: "knowing Python" vs. knowing the data stack. Many beginners can write Python loops but freeze at a JOIN or a groupby. In industry, SQL and data wrangling are where the hours go. Interviewers for any data-adjacent role will test SQL — practice joins, window functions, and aggregations until they're reflex.

SQL — the most underrated ML skill

Before any model, you pull and shape data. A representative query you should read and write without thinking:

-- average order value per customer segment, last 30 days, top spenders first
SELECT  segment,
        COUNT(*)              AS orders,
        AVG(amount)           AS avg_order,
        SUM(amount)           AS revenue
FROM    orders
WHERE   created_at >= NOW() - INTERVAL '30 days'
GROUP BY segment
HAVING  COUNT(*) > 100          -- only meaningful segments
ORDER BY revenue DESC;

Know the difference between WHERE (filters rows before grouping) and HAVING (filters groups), the join types (INNER/LEFT/FULL), and window functions (ROW_NUMBER() OVER (PARTITION BY ...)) for ranking and deduplication. This is the #1 practical skill the math-heavy curricula skip.
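You can practice both ideas without a database server: Python's built-in sqlite3 module runs SQL in memory. The sketch below uses a tiny invented orders table. It assumes your Python ships SQLite 3.25 or newer, which is what window functions require, and SQLite's dialect differs slightly from the PostgreSQL-style query above. The first query uses ROW_NUMBER() to keep each customer's latest order (the classic deduplication pattern); the second contrasts WHERE with HAVING.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (customer TEXT, amount REAL, created_at TEXT)")
con.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    ("ann", 20.0, "2024-01-01"), ("ann", 35.0, "2024-01-09"),
    ("bob", 15.0, "2024-01-03"), ("bob", 50.0, "2024-01-02"),
    ("cat", 10.0, "2024-01-05"),
])

# window function: number each customer's orders newest-first, keep row 1
latest = con.execute("""
    SELECT customer, amount, created_at
    FROM (
        SELECT *, ROW_NUMBER() OVER (
                   PARTITION BY customer ORDER BY created_at DESC) AS rn
        FROM orders
    )
    WHERE rn = 1
    ORDER BY customer
""").fetchall()
print(latest)

# WHERE filters rows before grouping; HAVING filters the groups afterwards
repeat = con.execute("""
    SELECT customer, COUNT(*) AS n_orders
    FROM orders
    WHERE amount >= 15
    GROUP BY customer
    HAVING COUNT(*) >= 2
    ORDER BY customer
""").fetchall()
print(repeat)
```

In the second query, cat's single 10.0 order is removed by WHERE before any grouping happens, and HAVING then keeps only customers with at least two remaining orders (ann and bob).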

The Python data stack

Four libraries do 90% of classical-ML work:

NumPy — the array foundation

Everything you built in this book. Vectorized math, the substrate everything else sits on (Chapter 2).

Pandas — the data wrangler

The spreadsheet-in-code. Loading, cleaning, joining, grouping tabular data:

import numpy as np
import pandas as pd
df = pd.read_csv("orders.csv")
df = df.dropna(subset=["amount"])                       # drop missing
df["log_amount"] = np.log1p(df["amount"])              # feature engineering
summary = (df[df.amount > 0]
           .groupby("segment")["amount"]
           .agg(["mean", "count"])
           .sort_values("mean", ascending=False))       # split-apply-combine

groupby → agg (the "split-apply-combine" pattern) is the Pandas idiom you'll use most. Know merge (joins), pivot, apply, and how to avoid the slow row-by-row iterrows() in favor of vectorized operations.

Don't be confused: Pandas vs. NumPy. NumPy is a homogeneous numeric array (fast math). Pandas is a labeled, mixed-type table (columns with names, an index) built on top of NumPy — for data wrangling, not heavy math. Use Pandas to clean and shape, then hand NumPy/arrays to the model.

scikit-learn — classical ML, one consistent API

The home of everything in Chapters 20–21, with a beautifully uniform interface:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model = RandomForestClassifier(n_estimators=300).fit(X_tr, y_tr)   # .fit()
preds = model.predict(X_te)                                        # .predict()
scores = cross_val_score(model, X, y, cv=5)                        # built-in CV

The fit / predict / transform API is universal across scikit-learn — learn it once and every classical model, scaler, and encoder works the same way. Also know Pipeline (chain preprocessing + model so they're fit together and leak-free — Chapter 16).

Matplotlib / Seaborn — plotting

You must be able to plot a distribution, a loss curve, a confusion matrix, and a scatter. Visualization is how you debug models and communicate results.

Scaling up: when data won't fit in memory

  • Spark (PySpark) — distributed dataframes for terabyte-scale ETL and training across a cluster. The big-data standard.
  • Dask / Polars — faster/larger-than-memory dataframes; Polars is the fast, modern Pandas alternative gaining ground.
  • DuckDB — in-process analytical SQL on local files; "SQLite for analytics."

Gradient boosting libraries

For tabular problems, these are the model (Chapter 20): XGBoost, LightGBM, CatBoost. Know that they exist, that they usually beat neural nets on tables, and roughly when to pick each (LightGBM for speed/scale, CatBoost for many categoricals).

Hyperparameter tuning — finding the best knobs

Models have hyperparameters (Chapter 1) you must search over, always scored by cross-validation (Chapter 9):

Method                      | How it searches                          | When
Grid search                 | every combination on a grid              | few params, cheap models
Random search               | random combinations                      | more params (often beats grid per-cost)
Bayesian (Optuna, Hyperopt) | model where good values are, search smart | expensive models, many params

A typical Optuna search looks like this:

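import optuna
from sklearn.model_selection import cross_val_score

def objective(trial):
    lr = trial.suggest_float("lr", 1e-4, 1e-1, log=True)
    depth = trial.suggest_int("max_depth", 3, 10)
    # make_model is a placeholder: build whatever estimator you are tuning
    return cross_val_score(make_model(lr, depth), X, y, cv=5).mean()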
study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)        # Bayesian search over the space

Optuna is the modern default for serious tuning — far more efficient than grid search. Mention it in interviews and you sound current.

The version-control & collaboration baseline

  • Git — non-negotiable. Branch, commit, PR, resolve conflicts.
  • Jupyter notebooks — for exploration and analysis; but move code to .py modules for anything production (notebooks hide state and don't test/deploy well).
  • virtualenv / conda / uv / poetry — isolated environments so "works on my machine" becomes "works everywhere." Pin your dependencies.

Don't be confused: notebooks vs. production code. Notebooks are wonderful for exploring and communicating, and a trap for shipping — hidden execution order, no tests, hard to review. The professional move: prototype in a notebook, then refactor into tested modules. Interviewers notice candidates who know the difference.

The takeaway

The job is Python + SQL fluency on top of the data stack: NumPy (arrays), Pandas (wrangling, groupby), scikit-learn (the universal fit/predict API + pipelines), plotting, and gradient-boosting libraries for tables, scaling to Spark/Polars when data explodes, tuning with Optuna, all under Git. Master SQL and the Pandas split-apply-combine idiom; know when a notebook should become a module. Next: where all this code actually runs, the cloud and MLOps. 👉

Cloud & MLOps

A model in a notebook helps no one. MLOps — "DevOps for machine learning" — is the discipline of getting models into production and keeping them working: reproducible training, automated deployment, monitoring, and retraining. This chapter maps the cloud platforms and the MLOps lifecycle so that when an interviewer asks "how would you deploy and maintain this?", you have a real answer.

The companion book Production ML & AI Tools builds many of these tools hands-on (MLflow, FastAPI, Docker, Celery, monitoring). This chapter is the conceptual map.

Why "just deploy it" is hard

Software has code; ML has code + data + model, and all three drift. A deployed model silently rots as the world changes (yesterday's spam isn't today's). MLOps exists to manage three moving parts at once — which is why it's harder than ordinary DevOps and why it's a whole job title now.

The big three cloud providers

You'll work on at least one. They mirror each other:

Capability                    | AWS       | GCP             | Azure
Managed ML platform           | SageMaker | Vertex AI       | Azure ML
Object storage (data/models)  | S3        | GCS             | Blob Storage
Compute (VMs, GPUs)           | EC2       | Compute Engine  | VMs
Serverless functions          | Lambda    | Cloud Functions | Functions
Managed containers (K8s)      | EKS       | GKE             | AKS
Data warehouse (SQL at scale) | Redshift  | BigQuery        | Synapse

Don't memorize all of it. Know the categories — storage, compute, a managed ML platform, a data warehouse, container orchestration — and that every cloud has one of each. The managed platforms (SageMaker / Vertex / Azure ML) bundle training, tuning, registry, and serving so you don't assemble it yourself.

Compute: CPU, GPU, TPU

  • CPU — data prep, classical ML, light inference.
  • GPU — the workhorse for deep-learning training and heavy inference (parallel matrix multiplies — Chapter 16). NVIDIA dominates; "A100/H100" are the names to know.
  • TPU — Google's custom ML chips, big for large-scale training on GCP.

Don't be confused: training vs. inference hardware. Training is bursty and expensive — rent big GPUs, run, release. Inference is continuous — optimize for cost and latency (quantize the model, batch requests, often run on cheaper GPUs or CPUs). Treating them the same way wastes a fortune. This cost framing impresses interviewers.

Containers: Docker & Kubernetes

  • Docker — package your code + dependencies + model into an image that runs identically everywhere. The unit of deployment; kills "works on my machine."
  • Kubernetes (K8s) — orchestrates many containers across many machines: scaling, restarts, rolling updates, load balancing. The production standard for serving at scale.

You don't need to be a K8s expert, but you must know what Docker is and why (covered hands-on in the tools book), and that K8s is how containers run at scale.

The MLOps lifecycle

The loop a production model lives in:

   data  ─►  train  ─►  evaluate  ─►  register  ─►  deploy  ─►  monitor
     ▲                                                            │
     └──────────────  retrain when it drifts  ◄──────────────────┘

Each stage has a tooling category:

  1. Data versioning — track which data trained which model (DVC, lakeFS, Delta Lake). Reproducibility requires it.
  2. Experiment tracking — log params/metrics/artifacts per run (MLflow, Weights & Biases — Chapter 16).
  3. Model registry — versioned models with stages (staging → production), so you can roll back (MLflow Registry, SageMaker Model Registry).
  4. Pipeline orchestration — schedule and chain steps (Airflow, Prefect, Dagster, Kubeflow). "Retrain every Sunday on fresh data" lives here.
  5. CI/CD for ML — automated testing + deployment, plus model-specific gates (does the new model beat the old on a holdout before it ships?).
  6. Serving — expose predictions via an API (FastAPI, BentoML, TorchServe, Triton, or a managed endpoint).
  7. Monitoring — track latency, errors, and model quality/drift (see the drift section below, Chapter 16, and the tools book).
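
The model-specific gate in step 5 can be sketched in a few lines of plain Python. This is a minimal illustration, not the API of any CI tool: the function names, the `min_gain` margin, and the toy holdout labels are all hypothetical and chosen only for the example.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def should_promote(y_true, old_pred, new_pred, min_gain=0.01):
    """Gate: ship the candidate only if it beats production by min_gain on the holdout."""
    old_acc = accuracy(y_true, old_pred)
    new_acc = accuracy(y_true, new_pred)
    return new_acc >= old_acc + min_gain, old_acc, new_acc


# Toy holdout set (made-up labels and predictions, for illustration only).
holdout = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
production_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 1]   # 7 of 10 correct
candidate_pred = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]    # 9 of 10 correct

promote, old_acc, new_acc = should_promote(holdout, production_pred, candidate_pred)
print(f"production={old_acc:.2f} candidate={new_acc:.2f} promote={promote}")
```

In a real pipeline the same check would run as a test step, and a failing gate would stop the deployment rather than print a line.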

Serving patterns

  • Online / real-time — a REST/gRPC endpoint answers one request at a time (a recommendation as the page loads). Optimize latency.
  • Batch — score millions of rows on a schedule (nightly churn scores written to a table). Optimize throughput.
  • Streaming — score events as they flow (fraud on each transaction) via Kafka/Flink.

Don't be confused: online vs. batch serving. Online = low-latency, one-at-a-time, always-on infrastructure (and a feature store to fetch features fast). Batch = high-throughput, scheduled, cheaper, simpler. Many "we need real-time ML!" requirements are actually fine as batch — and saying so in a design interview shows maturity.

Feature stores

A feature store (Feast, Tecton, SageMaker Feature Store) is a central place to define, store, and serve features so that training and serving use the exact same computation. It solves training/serving skew — the silent killer where a feature is computed one way in your training notebook and a subtly different way in production, quietly wrecking accuracy. Know the term and the problem it solves.
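
To make training/serving skew concrete, here is a toy sketch in plain Python. It does not use Feast or any real feature store; the feature ("average of the last 3 purchases"), the function names, and the numbers are assumptions made up for the example. The point is only that one shared definition removes the mismatch.

```python
def avg_recent_spend(purchases, window=3):
    """The single source of truth: mean of the last `window` purchases."""
    recent = purchases[-window:]
    return sum(recent) / len(recent)


def serving_feature_buggy(purchases):
    """A re-implementation that drifted: averages ALL purchases, not the last 3."""
    return sum(purchases) / len(purchases)


history = [10.0, 10.0, 10.0, 40.0, 50.0, 60.0]

train_value = avg_recent_spend(history)          # what the model was trained on
skewed_value = serving_feature_buggy(history)    # what a drifted serving path would send
shared_value = avg_recent_spend(history)         # serving through the shared definition

print(train_value, skewed_value, shared_value)
```

The model never sees an error in the skewed case. It simply receives a number that means something different from what it learned on, which is why the problem is so quiet.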

Monitoring & drift — the part everyone forgets

A deployed model degrades silently. You must watch:

  • Operational metrics — latency, throughput, error rate, cost (standard ops).
  • Data drift — the input distribution shifts from training (new user behavior, a changed upstream feature).
  • Concept drift — the relationship between inputs and the target changes (fraud patterns evolve).
  • Prediction/label monitoring — are outputs and (eventually) real outcomes still in the expected range?

Tools: Evidently, WhyLabs, Arize, plus Prometheus/Grafana for ops. The trigger to retrain is drift, not the calendar — though scheduled retraining is a common simple proxy.
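
One simple way to put a number on data drift is the Population Stability Index (PSI), which compares how a feature falls into bins at training time versus in production. The sketch below is an illustration under assumptions: PSI is one of several possible drift scores (it is not named in this chapter), and the bin cut points and synthetic Gaussian data are made up for the example.

```python
import bisect
import math
import random


def bin_fractions(values, cuts, eps=1e-6):
    """Share of values in each bin defined by interior cut points; eps avoids log(0)."""
    counts = [0] * (len(cuts) + 1)
    for v in values:
        counts[bisect.bisect_right(cuts, v)] += 1
    return [max(c / len(values), eps) for c in counts]


def psi(expected, actual, cuts):
    """Population Stability Index between a reference sample and a live sample."""
    e = bin_fractions(expected, cuts)
    a = bin_fractions(actual, cuts)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


rng = random.Random(0)
train = [rng.gauss(0.0, 1.0) for _ in range(5000)]      # reference distribution
same = [rng.gauss(0.0, 1.0) for _ in range(5000)]       # live data, no drift
shifted = [rng.gauss(1.0, 1.0) for _ in range(5000)]    # live data, mean moved by 1

cuts = [-1.5, -0.5, 0.5, 1.5]
print(f"no drift: {psi(train, same, cuts):.3f}")
print(f"shifted:  {psi(train, shifted, cuts):.3f}")
```

The undrifted sample scores close to zero and the shifted one scores much higher. A monitoring job would compute a score like this per feature on a schedule and alert when it crosses a threshold the team has chosen.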

MLOps maturity (a useful framing)

  • Level 0 — manual: train in a notebook, hand the file to engineering. Fragile.
  • Level 1 — automated pipeline: retraining and deployment are scripted and reproducible.
  • Level 2 — full CI/CD: code, data, and models flow to production automatically with tests, monitoring, and automated retraining.

Most companies are at level 0–1 and trying to climb. Knowing this ladder lets you answer "how mature is your ML process?" credibly.

The takeaway

MLOps is DevOps for the code-plus-data-plus-model trinity: every cloud gives you storage, compute (CPU/GPU/TPU), containers (Docker/K8s), a data warehouse, and a managed ML platform (SageMaker/Vertex/Azure ML). The lifecycle — version data, track experiments, register models, orchestrate pipelines, CI/CD, serve (online vs. batch), and monitor for drift — is what keeps a model alive in production. Feature stores fight training/serving skew; drift, not the calendar, triggers retraining. Next, the fast-moving frontier: the modern AI landscape and the trends you'll be asked about. 👉

The modern AI landscape & trends

The fundamentals in this book are stable for decades; the frontier moves every few months. This chapter is the orientation map of the current landscape (as of the mid-2020s) — the techniques, terms, and tools you'll be expected to recognize and reason about in any AI conversation or interview, even if you've never used them. Each builds on a foundation you already have.

The shift: from training models to adapting foundation models

The biggest change in practice: most teams no longer train models from scratch. They take a foundation model — a huge model pretrained by a big lab — and adapt it. Your job is increasingly choosing, adapting, and orchestrating these, not building them. Three adaptation strategies, in rising cost/power (from Chapter 15):

prompting   ─►   RAG   ─►   fine-tuning
 (cheapest)              (most powerful)

Efficient fine-tuning: LoRA & PEFT

Full fine-tuning of a 70-billion-parameter model updates all 70B weights — enormously expensive. PEFT (Parameter-Efficient Fine-Tuning) updates only a tiny fraction instead. The dominant method, LoRA (Low-Rank Adaptation), freezes the original weights and learns small low-rank (Chapter 23) update matrices — often <1% of the parameters — getting ~full-fine-tuning quality for a fraction of the compute and memory. QLoRA adds quantization to fine-tune big models on a single GPU.

Why it matters: LoRA is the reason a small team can customize a giant model. If asked "how would you specialize an LLM on a budget?", the answer is "LoRA/QLoRA, not full fine-tuning." It's a direct application of the low-rank idea from the linear-algebra chapter.
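
The parameter savings are easy to check with NumPy. The sketch below applies the low-rank idea to a single weight matrix; the sizes (`d = 1024`, `r = 4`) and the variable names are illustrative assumptions, not values from any particular model.

```python
import numpy as np

d, r = 1024, 4                      # layer width and LoRA rank (illustrative sizes)
rng = np.random.default_rng(0)

W = rng.normal(size=(d, d))         # frozen pretrained weight: never updated
A = rng.normal(size=(r, d)) * 0.01  # trainable, small random init
B = np.zeros((d, r))                # trainable, zero init so the update starts at 0

x = rng.normal(size=(d,))
y = W @ x + B @ (A @ x)             # adapted forward pass: (W + B A) x

full_params = W.size                # what full fine-tuning would update
lora_params = A.size + B.size       # what LoRA updates
print(full_params, lora_params, f"{lora_params / full_params:.2%}")
```

Because `B` starts at zero, the adapted layer initially behaves exactly like the pretrained one. Training then only touches `A` and `B`, which here is under 1% of the matrix.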

Quantization & distillation: making big models small

  • Quantization — store weights in fewer bits (FP16 → INT8 → INT4). Each step down roughly halves the size and speeds up inference, with modest quality loss. It's how a model that needs a data center runs on your laptop (the same compression spirit as the IVF-PQ book).
  • Distillation — train a small "student" model to mimic a large "teacher," capturing much of its ability in a fraction of the size. (DistilBERT, and the small fast models you see everywhere.)
  • Pruning — remove weights that barely matter.

These inference-optimization techniques are increasingly where production value lives — a model you can't afford to serve is worthless.
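
Here is a minimal NumPy sketch of quantization, assuming the simplest scheme (symmetric, one scale for the whole tensor). Real libraries use more refined variants, and the random weights are made up. Note that this example goes from FP32 to INT8, so the size drops by 4x rather than the 2x of an FP16 → INT8 step.

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(scale=0.5, size=1000).astype(np.float32)

# Symmetric quantization: map [-max|w|, +max|w|] onto the integers [-127, 127].
scale = np.abs(weights).max() / 127.0
q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)

# Dequantize to see the values the model would actually compute with.
restored = q.astype(np.float32) * scale

print("bytes fp32:", weights.nbytes, " bytes int8:", q.nbytes)
print("max abs error:", float(np.abs(weights - restored).max()))
```

The rounding error per weight is at most half of `scale`, which is the "modest quality loss" in miniature.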

RAG, in depth

Retrieval-Augmented Generation (Chapter 15) is the default architecture for building on LLMs with private or current data. The full pipeline, end to end, is built entirely from this book:

docs ─► chunk ─► embed (Ch 14) ─► store in vector DB ─► [query] embed ─►
        nearest-neighbor search (HNSW/IVF-PQ books) ─► top-k chunks ─►
        stuff into prompt ─► LLM generates a grounded answer

The hard parts in practice — and the interview follow-ups — are chunking (how to split documents), retrieval quality (hybrid keyword + vector search, re-ranking the candidates), and evaluation (did the answer actually use the right sources?). Advanced variants: agentic RAG (the model decides what to retrieve), GraphRAG (retrieve over a knowledge graph).
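
The retrieval half of that pipeline fits in a short plain-Python sketch. Everything here is a stand-in chosen for illustration: word counts replace a learned embedding model, a Python list replaces the vector database, a brute-force sort replaces nearest-neighbor search, and the three "documents" are invented.

```python
import math
from collections import Counter


def embed(text):
    """Stand-in embedding: word counts (a real system uses a learned model)."""
    return Counter(text.lower().replace("?", "").replace(".", "").split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


chunks = [
    "Refunds are issued within 14 days of purchase.",
    "Our office is closed on public holidays.",
    "Shipping takes 3 to 5 business days.",
]
index = [(chunk, embed(chunk)) for chunk in chunks]   # the "vector store"


def retrieve(query, k=2):
    """Return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


question = "How many days do refunds take?"
context = retrieve(question, k=1)
prompt = "Answer using only this context:\n" + "\n".join(context) + "\nQuestion: " + question
print(prompt)
```

The final `prompt` string is what would be sent to the LLM. Swapping in better chunking, a real embedding model, or a re-ranker changes individual functions without changing this overall shape.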

Vector databases

RAG needs fast similarity search over millions of embeddings, which created a whole tool category: vector databases such as Pinecone, Weaviate, Qdrant, Milvus, Chroma, plus pgvector (Postgres) and Elasticsearch/OpenSearch. Under the hood they run the HNSW and IVF-PQ indexes the sister books build from scratch. Know that they exist, what they store (embeddings + metadata), and that they're the memory layer of modern AI apps.

Agents & tool use

An agent wraps an LLM in a loop where it can call tools (search, code execution, APIs, databases), observe results, and decide the next action — turning a text predictor into something that does things. Key terms: function/tool calling, ReAct (reason + act), MCP (Model Context Protocol, a standard for connecting tools), multi-agent systems (agents that delegate to each other). This is one of the hottest areas — and one of the least reliable, so know both the promise and the failure modes (error compounding over long loops, cost, latency).
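
The loop itself is small. In the sketch below a scripted function stands in for the LLM so the example runs offline; the tool, the function names, and the tuple-based action format are all hypothetical and do not follow any particular framework or protocol.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}


def calculator(expression):
    """Tool: evaluate a simple 'a op b' arithmetic expression."""
    a, op, b = expression.split()
    return OPS[op](float(a), float(b))


TOOLS = {"calculator": calculator}


def scripted_llm(history):
    """Stand-in for a real LLM: picks the next action from what it has seen so far."""
    observations = [step[1] for step in history if step[0] == "observation"]
    if not observations:
        return ("call", "calculator", "12 * 7")
    return ("final", f"12 times 7 is {observations[-1]:g}")


def run_agent(question, max_steps=5):
    history = [("question", question)]
    for _ in range(max_steps):                    # cap the loop: errors and cost compound
        action = scripted_llm(history)            # reason: decide what to do next
        if action[0] == "final":
            return action[1], history
        _, tool_name, tool_input = action
        result = TOOLS[tool_name](tool_input)     # act: call the chosen tool
        history.append(("observation", result))   # observe: feed the result back in
    return "gave up", history


answer, trace = run_agent("What is 12 times 7?")
print(answer)
```

The `max_steps` cap is the simplest guard against the failure modes named above: a confused model cannot loop, and spend, forever.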

Diffusion models & generative media

The architecture behind image/video/audio generation (Stable Diffusion, DALL·E, Midjourney, Sora). They generate by starting from pure noise and iteratively denoising it into a coherent image, guided by a text prompt. Different from transformers' next-token prediction, though modern systems combine both. Know the one-liner: "diffusion = learn to reverse noise into data."

Mixture of Experts (MoE)

How the largest models stay affordable: instead of running all parameters for every token, a router sends each token to a few specialist sub-networks ("experts"). The model has enormous total capacity but only activates a slice per token — high quality at lower inference cost. Many frontier models are MoE.
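
A toy version of the routing step in NumPy, under simplifying assumptions: each "expert" is a single random matrix, the router is one linear layer, and the sizes (4 experts, top 2) are made up. Real MoE layers add load balancing and batching that are left out here.

```python
import numpy as np

rng = np.random.default_rng(0)
n_experts, d, top_k = 4, 8, 2

W_router = rng.normal(size=(d, n_experts))                      # router: token -> expert scores
experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]   # each expert: a tiny linear layer


def moe_layer(x):
    scores = x @ W_router                                # one score per expert
    chosen = np.argsort(scores)[-top_k:]                 # indices of the top_k experts
    gate = np.exp(scores[chosen] - scores[chosen].max())
    gate /= gate.sum()                                   # softmax over the chosen experts only
    out = sum(g * (x @ experts[i]) for g, i in zip(gate, chosen))
    return out, chosen


token = rng.normal(size=d)
out, chosen = moe_layer(token)
print("experts used:", sorted(chosen.tolist()), "of", n_experts)
```

Only `top_k` of the expert matrices are multiplied for this token, which is where the inference savings come from: total parameters grow with `n_experts`, but compute per token grows with `top_k`.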

Multimodal models

Models that handle text + images + audio + video in one system (GPT-4o, Gemini, Claude with vision). The trick is the embedding idea from Chapter 14: map every modality into a shared vector space so the model reasons across them. "Describe this photo," "what's in this chart," voice assistants — all multimodal.

Alignment: RLHF & DPO

How a raw next-token predictor becomes a helpful, safe assistant. RLHF (Reinforcement Learning from Human Feedback) trains a reward model on human preferences, then optimizes the LLM against it. DPO (Direct Preference Optimization) achieves similar results more simply, without a separate reward model. These are the alignment step — the Chapter 1 reinforcement-learning flavor at frontier scale.

The open vs. closed landscape

  • Closed/API models — GPT (OpenAI), Claude (Anthropic), Gemini (Google). Strongest, easiest to use, you rent them by the token.
  • Open-weight models — Llama (Meta), Mistral, Qwen, DeepSeek, Gemma. You can run and fine-tune them yourself; vital for privacy, cost control, and customization.

The trade-off — capability and convenience (closed) vs. control, privacy, and cost (open) — is a real architectural decision teams make, and a great thing to reason about aloud in an interview.

How to stay current without drowning

The terms churn, the foundations don't. A new technique is almost always a recombination of ideas you now own: attention, embeddings, low-rank updates, nearest-neighbor search, gradient descent, quantization. When a new buzzword lands, ask: which fundamental is this built on? — and you'll place it in minutes. Follow a few primary sources (lab blogs, papers, a curated newsletter) over hype threads.

The takeaway

The modern stack is about adapting foundation models, not training from scratch: LoRA/PEFT fine-tune cheaply (low-rank!), quantization/distillation shrink models for serving, RAG + vector DBs add knowledge (the HNSW/IVF-PQ engines), agents add action, diffusion generates media, MoE scales capacity affordably, multimodal unifies senses in a shared embedding space, and RLHF/DPO align it all. Every one of these is a remix of foundations you already hold — which is exactly why this book front-loaded them. Last stop: turning all of this into interview success. 👉

ML system design

The senior-level interview isn't "implement quicksort" — it's "design the recommendation system for a video app" or "build fraud detection for payments." These open-ended ML system design questions test whether you can turn a vague business goal into a working, scalable ML system. There's a repeatable framework, and this chapter gives it to you, then walks one example end to end.

Why this question exists

It reveals everything a coding puzzle can't: do you ask about requirements before coding? Do you know how to frame an ML problem? Can you reason about data, features, models, serving, scale, and failure? Structure beats brilliance here — a calm, organized walk through the framework impresses far more than jumping to "I'd use a transformer."

The framework (memorize the seven steps)

1. Clarify     →  2. Frame as ML  →  3. Data & labels  →  4. Features
      →  5. Model  →  6. Evaluation (offline + online)  →  7. Serve & scale

1. Clarify requirements (never skip this)

Ask before you design. Interviewers plant ambiguity to see if you probe:

  • Goal & scale — what business metric? How many users/items/requests per second?
  • Latency — real-time (<100ms) or batch overnight?
  • Constraints — cold start? Privacy? Interpretability required? Existing systems?

Don't be confused: jumping to the model is the #1 failure. Spending two minutes on requirements is the single highest-signal thing you can do. Candidates who start with "I'd train a neural net" before knowing the scale or latency budget signal inexperience. Clarify first, always.

2. Frame it as an ML problem

Translate the business goal into a precise ML task:

  • What is being predicted (the label/target)?
  • Is it classification, regression, ranking, retrieval, generation?
  • What's the simplest version that delivers value? (Start there; iterate.)

3. Data & labels

  • Where does training data come from? What's the label, and how do you get it?
  • Implicit vs. explicit feedback — a click is implicit and abundant but noisy; a rating is explicit but rare.
  • Watch for leakage (Chapter 16) and label delay (you learn if a loan defaulted only months later).

4. Features

  • What signals predict the target? (User features, item features, context, interactions, history.)
  • Feature freshness — real-time (last click) vs. batch (30-day average) — and the feature store to serve them consistently (Chapter 26).

5. Model

  • Start with a baseline (popularity, logistic regression) — you must beat it.
  • Then the appropriate workhorse: gradient boosting for tabular, two-tower/embeddings for retrieval, a neural ranker for ordering.
  • Justify the choice by the requirements (latency, interpretability, data size), not by what's trendy.

6. Evaluation — offline and online

  • Offline — the right metric for the task (Chapter 10): AUC/F1 for classification, NDCG/Recall@k for ranking, RMSE for regression.
  • Online — the A/B test (Chapter 24) on the real business metric. Name your guardrail metrics.
  • State plainly that offline and online can disagree, and online wins.
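
Of the offline metrics above, NDCG is the one people most often cannot write down on the spot. Here is a minimal plain-Python sketch using the common log2 position discount with linear gain; the binary relevance lists are made up for illustration.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain: relevance at rank i is divided by log2(i + 1)."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevances[:k], start=1))


def ndcg_at_k(relevances, k):
    """DCG normalized by the best possible ordering of the same items."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# 1 = the user engaged with the item shown at that position, 0 = did not.
good_ranking = [1, 1, 0, 0, 0]
bad_ranking = [0, 0, 0, 1, 1]

print(f"{ndcg_at_k(good_ranking, 5):.3f}  {ndcg_at_k(bad_ranking, 5):.3f}")
```

Both rankings contain the same two relevant items. Only their positions differ, and the metric rewards the list that puts them at the top.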

7. Serve, scale & maintain

  • Online vs. batch serving (Chapter 26); the latency budget decides.
  • Scale — caching, approximate nearest-neighbor retrieval (HNSW/IVF-PQ), precomputation, the candidate-generation → ranking two-stage pattern.
  • Monitoring & retraining — drift detection, the retraining trigger, rollback.

A worked example: "Design a news feed ranker"

Watch the framework turn a one-line prompt into a system.

1. Clarify. "Rank articles for each user's home feed. ~10M users, ~100K fresh articles/day, must serve in <150ms, optimize for long-term engagement (not just clicks — avoid clickbait)."

2. Frame. For each (user, article) pair, predict P(meaningful engagement). Then sort each user's candidates by that score. It's ranking, fed by retrieval.

3. Data & labels. Implicit feedback from logs: impressions + clicks + dwell time. Label = "engaged" if click and dwell > N seconds (discourages clickbait). Beware position bias (top items get more clicks regardless of quality) and the feedback loop (we only see labels for what we showed).

4. Features. User: history, topic affinities, demographics, time of day. Article: topic, freshness, source quality, popularity. Interaction: embedding similarity between user and article (Chapter 14). Context: device, session.

5. Model — the two-stage pattern (the standard answer for feeds/recommenders):

100K articles ─► CANDIDATE GENERATION ─► ~500 candidates ─► RANKING ─► top 20
                 (cheap: embedding ANN          (expensive: gradient-boosted
                  retrieval, HNSW/IVF-PQ)         or neural ranker on rich features)

Stage 1 is fast and approximate (narrow the ~100K articles to a few hundred with vector search). Stage 2 is slow and accurate (score the few hundred precisely). This split is how you hit <150ms at scale — and naming it signals real seniority.
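
The two stages can be sketched in NumPy. The stand-ins are assumptions made for the example: random vectors replace learned embeddings, a brute-force dot product replaces the ANN index, and a hand-written weighted sum replaces the trained ranker.

```python
import numpy as np

rng = np.random.default_rng(0)
n_articles, d = 10_000, 16

article_emb = rng.normal(size=(n_articles, d))       # precomputed article embeddings
freshness = rng.uniform(size=n_articles)             # an extra "rich" feature for stage 2
user_emb = rng.normal(size=d)

# Stage 1: candidate generation. Cheap similarity over everything, keep 500.
# (Brute-force dot product here; production would use an ANN index such as HNSW.)
sims = article_emb @ user_emb
candidates = np.argpartition(sims, -500)[-500:]

# Stage 2: ranking. A costlier scorer, run only on the 500 candidates.
# (A hand-written formula stands in for a gradient-boosted or neural ranker.)
rank_score = 0.7 * sims[candidates] + 0.3 * freshness[candidates]
top20 = candidates[np.argsort(rank_score)[::-1][:20]]

print(len(candidates), "candidates ->", len(top20), "shown")
```

The expensive scorer runs 500 times instead of 10,000, and that ratio only gets better as the catalog grows.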

6. Evaluation. Offline: NDCG@20, Recall@k on held-out engagement (split by time, not randomly). Online: A/B test on long-term engagement and retention; guardrails on diversity, latency, and complaint rate.

7. Serve & scale. Precompute article embeddings hourly; cache user embeddings; ANN index for stage 1; the ranker behind a low-latency endpoint with a feature store. Monitor for drift; retrain daily on fresh engagement. Cold start: fall back to popularity/trending for new users and content-based features for new articles.

(The sister Recommendation Systems book builds exactly this, including the production capstone.)

The common system-design prompts

Prepare a sketch for each — they recur constantly:

  • Recommendation / feed ranking (the two-stage pattern above)
  • Search / retrieval (embeddings + ANN + re-ranking)
  • Fraud / anomaly detection (imbalanced data, precision/recall trade-off, real-time)
  • Ad click-through-rate prediction (massive scale, calibration matters)
  • Content moderation / classification (precision/recall costs, human-in-the-loop)
  • An LLM-powered feature (RAG, latency, cost, hallucination guardrails)

The takeaway

ML system design rewards structure: clarify requirements, frame the ML problem, reason through data/labels → features → model → offline+online evaluation → serving & scale. Always start with a baseline and the simplest thing that works; know the two-stage candidate-generation → ranking pattern cold; remember the A/B test is the final judge; and talk about drift, monitoring, and cold start unprompted. Now the final chapter: the concept questions, coding drills, and prep plan to bring it all home. 👉

The interview playbook

You now hold the whole foundation. This final chapter turns it into interview success: the question types you'll face, a rapid-fire concept bank with crisp answers, the coding drills that recur, and a prep plan. Treat it as the checklist before you walk in.

The five rounds of an ML interview

Round              What it tests                         Where this book prepared you
Coding (DS&A)      general programming                   (LeetCode — outside this book)
ML coding          implement an algorithm from scratch   Chapters 1, 8, 11, 18, 20–23
ML concepts        breadth & depth of fundamentals       the whole book
ML system design   end-to-end system thinking            Chapter 28
Behavioral         collaboration, impact, judgment       (your stories)

Most candidates over-prepare DS&A and under-prepare ML concepts and system design — the rounds that actually differentiate. Invest where this book points.

Rapid-fire concept bank

Practice saying each answer out loud in 30–60 seconds. If any feels shaky, reread the linked chapter.

Fundamentals

  • Bias–variance tradeoff? Underfitting (high bias) vs. overfitting (high variance); total error balances both. Ch 9
  • Overfitting — detect and fix? Train ≪ validation error; fix with more data, regularization, dropout, early stopping, simpler model. Ch 9
  • L1 vs. L2? L1 → sparse (feature selection); L2 → smooth shrinkage (weight decay). Ch 9
  • Generative vs. discriminative? Model P(x,y) vs. P(y|x). Naive Bayes vs. logistic regression.

Algorithms

  • Bagging vs. boosting? Parallel independent trees averaged (↓variance) vs. sequential error-correcting trees (↓bias). Ch 20
  • Why do gradient-boosted trees often beat neural nets on tabular data? They handle mixed feature types, need little tuning, capture feature interactions, and are robust to unscaled features. Ch 20
  • k-NN vs. k-means? Supervised classification (k voters) vs. unsupervised clustering (k groups). Ch 20, Ch 21
  • How does a decision tree choose splits? Maximize impurity reduction (Gini/entropy). Ch 20

Deep learning

  • Why activation functions? Without non-linearity, stacked layers collapse to one linear layer. Ch 11
  • Vanishing gradients — cause and fix? Sigmoid/tanh saturate; fix with ReLU, residual connections, normalization, good init. Ch 11
  • What is attention? softmax(QKᵀ/√d)V — each token weights every other by query–key similarity. Ch 13
  • Adam vs. SGD? Adam adapts a per-parameter learning rate + momentum; robust default. Ch 8
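
The attention formula above is worth being able to type from memory. Here is a minimal single-head NumPy sketch with no masking and no learned projections; the random `Q`, `K`, `V` matrices and the sizes are assumptions for illustration.

```python
import numpy as np


def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)     # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def attention(Q, K, V):
    d = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d))     # (T, T): row i = how token i weights every token
    return weights @ V, weights


rng = np.random.default_rng(0)
T, d = 4, 8                                     # 4 tokens, 8-dimensional vectors
Q, K, V = (rng.normal(size=(T, d)) for _ in range(3))

out, weights = attention(Q, K, V)
print(out.shape, weights.sum(axis=1))
```

Each row of `weights` sums to 1, so every output vector is a weighted average of the rows of `V`.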

Stats & evaluation

  • Explain a p-value. P(data at least this extreme | null true); not P(null true), and not an effect size. Ch 22
  • Precision vs. recall — which when? Costly false positives → precision (spam); costly false negatives → recall (cancer). Ch 10
  • Why is accuracy bad for imbalanced data? "Always predict majority" scores high yet is useless; use F1/AUC. Ch 10
  • What is ROC-AUC? Threshold-free ranking quality; P(score(pos) > score(neg)). Ch 10
  • The base-rate fallacy? A rare positive + a "99% accurate" test still yields mostly false positives. Ch 22
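
The base-rate answer lands better with numbers. Here is the Bayes arithmetic in plain Python, assuming a condition that affects 1 in 1,000 people and a test that is right 99% of the time in both directions (illustrative values, not from a real test).

```python
prevalence = 0.001        # 1 in 1,000 people has the condition
sensitivity = 0.99        # P(test positive | has it)
specificity = 0.99        # P(test negative | does not have it)

true_pos = prevalence * sensitivity                  # has it AND tests positive
false_pos = (1 - prevalence) * (1 - specificity)     # healthy AND tests positive
p_condition_given_positive = true_pos / (true_pos + false_pos)

print(f"P(condition | positive test) = {p_condition_given_positive:.3f}")
```

The result is about 0.09: roughly 9 in 10 positives are false alarms, because the healthy group is so much larger than the affected one.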

LLMs & modern

  • RAG vs. fine-tuning? RAG adds knowledge (retrieved at query time); fine-tuning adds behavior (baked into weights). Ch 15
  • What is LoRA? Low-rank weight updates — fine-tune <1% of parameters cheaply. Ch 27
  • What does temperature do? Scales randomness of token sampling; low = focused, high = creative. Ch 15
  • Why do LLMs hallucinate? They optimize plausible next tokens, not truth; mitigate with RAG and verification. Ch 15

Coding drills (implement from scratch, no libraries)

These come up in ML-coding rounds. You've already built most of them in this book — redo them on a blank page until fluent:

  • Gradient descent for linear/logistic regression. Ch 1, Ch 6
  • k-means clustering. Ch 21
  • k-NN classifier. Ch 20
  • Backprop for a 2-layer net. Ch 11
  • Softmax / sigmoid / cross-entropy (numerically stable). Ch 17, Ch 18
  • Precision/recall/F1/AUC from predictions. Ch 10
  • Cosine similarity / top-k retrieval. Ch 4, Ch 18
  • Train/test split & a CV loop. Ch 18

The recipe book is your cheat sheet — but practice writing them without it.
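
As a warm-up for the "numerically stable" drill, here is one way to write it with only the standard library. This is a sketch of the usual tricks (subtract the max before exponentiating, and branch on the sign for sigmoid), not a copy of the recipe book's version.

```python
import math


def log_softmax(logits):
    """log(softmax) via the log-sum-exp trick: subtract the max before exponentiating."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum for z in logits]


def cross_entropy(logits, target):
    """Negative log-probability of the correct class."""
    return -log_softmax(logits)[target]


def sigmoid(z):
    """Stable sigmoid: never exponentiates a large positive number."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


print(cross_entropy([1000.0, 1001.0, 1002.0], target=2))   # a naive exp(1000.0) would overflow
print(sigmoid(-1000.0), sigmoid(1000.0))
```

A quick sanity check to remember: with equal logits over n classes, the cross-entropy is ln(n), the "pure guess" loss.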

How to answer well (meta-skills)

  • Think out loud. Interviewers grade your reasoning, not just the answer. Narrate trade-offs.
  • Start simple, then iterate. Baseline first ("I'd start with logistic regression to establish a number"), then add complexity with justification.
  • Say "it depends" — then say on what. Almost every real answer is conditional; naming the conditions is the signal.
  • Admit unknowns gracefully. "I haven't used X, but it's like Y because…" beats bluffing. Reasoning from fundamentals is the whole point of this book.
  • Tie back to business impact. "This raises NDCG, which should lift engagement, which we'd confirm with an A/B test."

A 4-week prep plan

  1. Week 1 — Fundamentals. Re-read Parts I–II and the concept bank; explain each aloud. Redo the from-scratch gradient descent and metrics.
  2. Week 2 — Algorithms & math. Chapters 20–24; implement k-means, k-NN, a decision-tree split; drill probability and the p-value/Bayes questions.
  3. Week 3 — Deep learning & modern. Chapters 11–15, 27; be able to whiteboard a training loop and explain attention, RAG, and LoRA.
  4. Week 4 — System design & mocks. Chapter 28; practice 5–6 design prompts out loud under time; do mock interviews; prepare behavioral stories (impact, conflict, failure).

The takeaway

Interviews reward structured fundamentals over memorized trivia. The differentiating rounds are ML concepts and system design — exactly what this book built. Rehearse the concept bank aloud, re-implement the core algorithms on a blank page, walk the system-design framework, think out loud, start simple, and tie everything to impact. You can now define and code every term an ML engineer will throw at you — and reason from first principles when you meet a new one.

That was the goal of this whole book. Go get the job — and to prove you can do it (to yourself and to them), the final part is five complete projects you build end to end: a GPT, a LoRA fine-tune, an agent, a CNN, and a diffusion model. Let's go build. 👉

Capstone projects: build the real thing

You've learned the concepts, the math, the tooling, and the interview answers. This final part is where it becomes real: five complete, runnable projects that build the systems modern AI engineers are actually hired to build — each one trains or runs end to end on your laptop's CPU in seconds to minutes, with no GPU and no downloads.

These aren't toy snippets. Each is the real architecture in miniature — the same transformer that powers Claude, the same LoRA that fine-tunes 70-billion-parameter models, the same diffusion process behind Stable Diffusion — shrunk until it runs in a few seconds so you can read every line, run it, and modify it. Scale the numbers up and the code is what runs in production.

The five projects

#   Project                What you build                                   Why it matters
1   GPT from scratch       a decoder-only transformer that generates text   the architecture behind every LLM
2   Fine-tuning & LoRA     adapt a pretrained model efficiently             how you customize LLMs on a budget
3   An LLM agent           a ReAct loop that calls tools                    the pattern behind agentic AI
4   CNN image classifier   train a vision model end to end                  the workhorse of computer vision
5   Diffusion model        generate data by reversing noise                 how modern image generators work

Together they cover the four things you told an interviewer you could do (Chapter 29): train a model from scratch (1, 4, 5), work with transformers and LLMs (1, 2, 3), fine-tune (2), and build modern generative and agentic AI (3, 5).

How to run them

Every project lives in code/projects/ and needs only PyTorch (CPU build is fine) — except the agent, which is pure Python:

cd code/projects
python gpt.py            # train a transformer, generate text     (~12s)
python finetune_lora.py  # full fine-tune vs LoRA, compared        (~5s)
python agent.py          # a tool-using agent's reasoning trace    (instant)
python cnn.py            # train an image classifier to 100%       (~20s)
python diffusion.py      # generate points by reversing noise      (~15s)

Every output shown in these chapters is real — produced by running exactly this code. Your runs will match (seeds are fixed).

What "production-ready" means here

These projects teach the architecture and training correctly — the part that doesn't change with scale. But a model that runs in a notebook isn't deployed. Each chapter ends with a "Make it production" section that connects the project to the rest of this book and its sister volumes:

  • Serve it — wrap it in an API, containerize it, monitor it (the Production ML & AI Tools book).
  • Track it — log experiments and version the model (Ch 26, the MLflow chapters of the tools book).
  • Scale it — the same code, bigger model, more data, a GPU, mixed precision (Ch 16).
  • Retrieve for it — ground an LLM in your data with RAG (Ch 14, the HNSW/IVF-PQ books).

The gap between "I trained a tiny GPT" and "I shipped an LLM feature" is engineering, not understanding — and you now have both halves.

A note on scaling up

The only difference between these projects and the real systems is size: more parameters, more data, more compute, and a few production tricks (mixed precision, distributed training, learning-rate schedules). Nothing about the architecture changes. When you read that GPT-4 is "a transformer," you'll know precisely what that means — because you built one. When you read that a team "fine-tuned with LoRA," you'll know exactly what they did — because you did it too.

The takeaway

These five projects turn every concept in this book into working code you can run, read, and extend. Train a transformer, fine-tune it with LoRA, give it tools, train a vision model, and generate with diffusion — the actual modern AI stack, on your CPU, today. Pick any one and run it. Let's start with the project at the center of the AI universe: a GPT. 👉

Project 1 — Train a GPT from scratch

This is the project at the center of modern AI: a GPT — a decoder-only transformer that generates text by predicting the next token, over and over. It's the exact architecture behind GPT-4, Claude, and Llama. We build the whole thing — embeddings, masked self-attention, feed-forward layers, residuals, layer norm — train it on a CPU in 12 seconds, and watch it generate coherent text. Scale this up and it's an LLM.

Full code: code/projects/gpt.py (PyTorch, CPU-friendly).

What a GPT does

One thing, repeatedly: predict the next token given all previous tokens (Chapter 15). We work at the character level so the vocabulary is tiny and it trains instantly; real models use subword tokens, but the mechanics are identical.

input:  "to be or not to b"   ->  model predicts ->  "e"

Train it to do this well on a corpus, then feed its own output back in, and it generates.
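
Before any model is involved, "character-level" just means turning text into integer ids and slicing it into (context, next character) pairs. The sketch below shows that step in plain Python. The names `stoi`, `itos`, and `block`, and the tiny context length, are assumptions for illustration and may differ from what gpt.py uses.

```python
text = "to be or not to be"

chars = sorted(set(text))                       # the vocabulary: every distinct character
stoi = {ch: i for i, ch in enumerate(chars)}    # character -> integer id
itos = {i: ch for ch, i in stoi.items()}        # integer id -> character

ids = [stoi[ch] for ch in text]                 # the whole text as a list of ids

block = 8                                       # context length (a toy value)
pairs = [(ids[i:i + block], ids[i + block]) for i in range(len(ids) - block)]

context, target = pairs[0]
print(repr("".join(itos[i] for i in context)), "->", repr(itos[target]))
```

Every position in the text yields one training example, which is why even a single sentence gives the model something to learn from.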

The architecture, piece by piece

Each piece is a concept you already met, now assembled into the real thing:

class GPT(nn.Module):
    def __init__(self):
        super().__init__()                        # nn.Module setup; required before adding layers
        self.tok = nn.Embedding(VOCAB, N_EMBD)    # token → vector (Ch 14)
        self.pos = nn.Embedding(BLOCK, N_EMBD)    # position → vector
        self.blocks = nn.ModuleList(Block() for _ in range(N_LAYER))
        self.head = nn.Linear(N_EMBD, VOCAB)      # vector → next-token logits (Ch 6)

  • Token embedding — each character becomes a learned vector (Chapter 14).
  • Positional embedding — attention has no inherent sense of order, so we add a vector encoding each position. ("to be" ≠ "be to".)
  • Transformer blocks — the heart, below.
  • Head — projects the final vector to a logit per vocabulary item; softmax → next-character probabilities.

The transformer block: attention + MLP, each with a residual

class Block(nn.Module):
    def forward(self, x):
        T = x.size(1)
        mask = torch.triu(torch.ones(T, T), diagonal=1).bool()   # causal mask
        a, _ = self.attn(self.ln1(x), self.ln1(x), self.ln1(x), attn_mask=mask, ...)
        x = x + a                       # residual connection
        x = x + self.mlp(self.ln2(x))   # residual connection
        return x

Three ideas make this work, all from earlier chapters:

  • Masked (causal) self-attention — each position attends to all earlier positions but not future ones (the upper-triangular mask). That mask is what makes it a generator: predicting position $t$ can only use positions $< t$. The attention itself is softmax(QKᵀ/√d)·V from Chapter 13.
  • Residual connections (x = x + ...) — add each sublayer's output back to its input, so gradients flow cleanly through deep stacks (Chapter 11).
  • Layer norm — stabilizes training (Chapter 9).

That's the entire transformer. Stack a few blocks and you have GPT.
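
To see what the causal mask does to the attention weights, here is a small NumPy sketch (NumPy rather than the PyTorch used in gpt.py, so it runs without the model). All query-key scores are set equal on purpose, an assumption that makes the effect of the mask easy to read.

```python
import numpy as np

T = 4
scores = np.zeros((T, T))                           # pretend every query-key score is equal

mask = np.triu(np.ones((T, T), dtype=bool), k=1)    # True above the diagonal = "future"
scores = np.where(mask, -np.inf, scores)            # future positions get -infinity

weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)       # softmax per row

print(np.round(weights, 2))
```

Row 0 puts all its weight on position 0, row 1 splits evenly over positions 0 and 1, and so on. Nothing above the diagonal survives, so no position can look ahead.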

Training it

The training loop is the canonical one from Chapter 12: get a batch of (context, next-char) windows, compute cross-entropy loss (Chapter 7) between predictions and the actual next chars, backprop, step.

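logits = model(x)
loss = F.cross_entropy(logits.reshape(-1, VOCAB), y.reshape(-1))
opt.zero_grad()      # clear gradients left over from the previous step
loss.backward()      # backprop: compute gradients of the loss
opt.step()           # update the weights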

Run it:

$ python gpt.py

Output:

vocab=13  params=103821
step   0  loss 2.658
step 100  loss 0.063
step 200  loss 0.082
step 300  loss 0.051
step 400  loss 0.051
trained in 11.9s
sample: that is the question to be or not to be that is the question

A 104k-parameter transformer, trained in 12 seconds on a CPU, went from loss 2.66 (random — $\ln 13 \approx 2.56$ is pure-guess level) to 0.05, and now generates coherent text continuing the corpus. That sample line is the model writing, character by character, by repeatedly sampling its own next-token distribution.
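
The "pure-guess level" can be checked by hand. The sketch below computes the cross-entropy of a model that spreads its probability evenly over 13 characters, then of a model that is confident and right. The probability values in the second case are made up for illustration.

```python
import math

def cross_entropy(probs, target):
    """Loss (in nats) for one prediction: minus the log of the probability given to the truth."""
    return -math.log(probs[target])

V = 13                                         # vocabulary size from the run above
uniform = [1.0 / V] * V                        # a model that knows nothing
print(round(cross_entropy(uniform, 0), 3))     # 2.565

confident = [0.004] * V                        # hypothetical: almost all mass on the right char
confident[0] = 1.0 - 0.004 * (V - 1)
print(round(cross_entropy(confident, 0), 3))   # 0.049
```

A loss near 0.05 therefore means the model puts roughly 95% of its probability on the correct next character, averaged over the batch.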

Generation: sampling one token at a time

@torch.no_grad()
def generate(self, idx, n, temperature=0.8):
    for _ in range(n):
        logits = self(idx[:, -BLOCK:])                  # only the last BLOCK chars fit
        probs = F.softmax(logits[:, -1, :] / temperature, dim=-1)
        idx = torch.cat([idx, torch.multinomial(probs, 1)], dim=1)  # sample & append
    return idx

This is how ChatGPT and Claude generate — one token at a time, each conditioned on everything so far, with temperature controlling randomness (Chapter 15). The [:, -BLOCK:] is the context window in action: the model can only see its last BLOCK characters.
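
The effect of temperature is easy to see on a tiny example. This is a pure-Python sketch, not the project's code; the three logits are hypothetical and the helper name softmax_with_temperature is chosen here for illustration.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]                       # hypothetical next-token logits
for temp in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, temp)
    print(temp, [round(p, 3) for p in probs])

rng = random.Random(0)                         # seeded so the draw is repeatable
probs = softmax_with_temperature(logits, 0.8)
token = rng.choices(range(len(logits)), weights=probs, k=1)[0]
print("sampled token index:", token)
```

Low temperature sharpens the distribution toward the top logit; high temperature flattens it, so sampling becomes more varied.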

From this to a real LLM

The gap is scale and data, not architecture. A frontier model is essentially this same architecture with:

  • Subword tokenization instead of characters (a ~100k-token vocabulary).
  • Billions of parameters (more layers, bigger N_EMBD, more heads) instead of 104k.
  • Trillions of tokens of training data instead of one sentence.
  • Thousands of GPUs, mixed precision, and learning-rate schedules.
  • Then fine-tuning (next chapter) and alignment (RLHF/DPO, Chapter 27) to make it helpful.

Every one of those is an engineering scale-up of what you just ran. The famous nanoGPT is essentially this file, productionized.

Make it production

  • Serve it behind an API and stream tokens (the Production ML & AI Tools book's FastAPI + LLM-serving chapters).
  • Don't train your own foundation model — pretraining costs millions. In practice you take a pretrained model and adapt it, which is the next project.
  • Evaluate with perplexity ($e^{\text{loss}}$) and downstream task metrics (Chapter 10).
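
Perplexity is a one-line transformation of the loss. A small sketch, using the loss values printed by the training run above:

```python
import math

def perplexity(mean_loss_nats):
    """exp(loss): roughly, how many equally likely choices the model is torn between."""
    return math.exp(mean_loss_nats)

print(round(perplexity(math.log(13)), 2))   # 13.0 -> uniform guessing over 13 characters
print(round(perplexity(2.658), 2))          # step 0 of the run above
print(round(perplexity(0.051), 3))          # end of training
```

A perplexity close to 1 means the model is almost never surprised by the next character, which is expected here because the training corpus is a single repeated sentence.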

The takeaway

You built a GPT: token + positional embeddings, masked self-attention with residuals and layer norm, trained by next-token cross-entropy, generating one token at a time with temperature. It's 104k parameters and 12 seconds on a CPU — but it's the real architecture, and the only thing between it and GPT-4 is scale. Now let's adapt a pretrained model to a new task — efficiently — with LoRA. 👉

Project 2 — Fine-tuning & LoRA

You almost never train a model from scratch (last chapter showed why — it takes millions of dollars). Instead you take a pretrained model and fine-tune it: a little extra training to adapt its broad knowledge to your task. And the modern, efficient way to do that — the technique behind nearly every custom LLM today — is LoRA. This project builds LoRA from scratch and measures why it's a big deal.

Full code: code/projects/finetune_lora.py.

What fine-tuning is

A pretrained model already learned general features (language, edges, concepts) from a huge corpus. Fine-tuning continues training it on a smaller, task-specific dataset so it specializes — sentiment on your reviews, your company's writing style, a medical domain. It's transfer learning: reuse the expensive pretraining, pay only for the cheap adaptation.

The problem: full fine-tuning updates all the weights. For a 70-billion-parameter model that means storing and training 70 billion numbers — enormous memory and a full copy of the model per task. LoRA fixes exactly this.

The LoRA idea: freeze the giant, train a tiny add-on

LoRA (Low-Rank Adaptation) freezes the pretrained weights $W_0$ and adds a small trainable detour: a low-rank (Chapter 23) pair of matrices $A$ and $B$. The adapted layer is:

$$ y = x W_0 + \frac{\alpha}{r}\, x A B $$

where $A$ is $(d \times r)$ and $B$ is $(r \times d)$ with the rank $r$ tiny (4, 8, 16). You train only $A$ and $B$ — a handful of parameters — while the huge $W_0$ never moves. From the project:

class LoRALinear(nn.Module):
    def __init__(self, base, r=4, alpha=8):
        super().__init__()                          # required before registering submodules
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                 # FREEZE the pretrained weights
        self.A = nn.Parameter(torch.randn(base.in_features, r) * 0.01)
        self.B = nn.Parameter(torch.zeros(r, base.out_features))   # starts as a no-op
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A @ self.B)   # frozen + low-rank

Why it works: the change needed to adapt a model is usually low-rank — it lives in a small subspace. B starts at zero so training begins exactly at the pretrained model and only departs as needed. This is the same low-rank, "best approximation with few numbers" idea as SVD (Chapter 23).
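
Both claims (the zero-initialized B makes the adapter a no-op, and the learned update is low-rank) can be checked numerically. The NumPy sketch below uses toy sizes chosen for illustration; W0 is a random stand-in for a pretrained weight, not a real model.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 16, 4, 8                 # toy sizes, assumed for this example
scale = alpha / r

W0 = rng.normal(size=(d, d))           # stands in for the frozen pretrained weight
A = rng.normal(size=(d, r)) * 0.01     # small random init, as in LoRALinear
B = np.zeros((r, d))                   # zero init: the detour contributes nothing yet
x = rng.normal(size=(3, d))            # a batch of 3 inputs

y_base = x @ W0
y_lora = x @ W0 + scale * (x @ A @ B)
print(np.allclose(y_base, y_lora))     # True: training starts exactly at the base model

B = rng.normal(size=(r, d))            # pretend training has moved B away from zero
delta = scale * (A @ B)                # the learned update as a full (d, d) matrix
print(np.linalg.matrix_rank(delta))    # at most r, however large d is

W_merged = W0 + delta                  # fold the adapter into one ordinary weight matrix
print(np.allclose(x @ W_merged, x @ W0 + scale * (x @ A @ B)))   # True
```

The last line shows why adapters are cheap to serve: the low-rank update can be merged into the base weight, so the adapted layer costs the same as the original at inference time.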

Measuring the win

The project adapts a "pretrained" layer to a new classification task two ways — full fine-tuning vs. LoRA — and counts trainable parameters:

$ python finetune_lora.py

Output:

full fine-tune : trainable params = 1188  acc = 1.000
LoRA (r=4)     : trainable params =  388  acc = 1.000

LoRA trains 3.1x fewer parameters for comparable accuracy.

Same 100% accuracy, a third of the trainable parameters. On our toy layer the ratio is 3×; on a real LLM, where the frozen base is billions of weights and the adapter is millions, the ratio is 100–1000×. That's the difference between needing a data center and fine-tuning on a single GPU.
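
The ratio grows with layer width because a full matrix costs d × d numbers while the adapter costs only 2 × d × r. A small arithmetic sketch for a single square layer; the widths and ranks are illustrative assumptions, not measurements of any particular model.

```python
def full_params(d_in, d_out):
    return d_in * d_out                 # one dense weight matrix (bias ignored)

def lora_params(d_in, d_out, r):
    return d_in * r + r * d_out         # A is (d_in, r), B is (r, d_out)

for d, r in [(64, 4), (1024, 8), (4096, 8)]:
    full, lora = full_params(d, d), lora_params(d, d, r)
    print(f"d={d:5d} r={r}  full={full:>10,}  lora={lora:>7,}  ratio={full / lora:.0f}x")
```

For a square layer the ratio is d / (2r), so it climbs from 8× at width 64 to 256× at width 4096. Real fine-tunes usually adapt only some of the matrices in each block, which pushes the whole-model ratio higher still.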

Why LoRA changed the game

  • Cheap — train <1% of the parameters; fits on modest hardware.
  • Small artifacts — a LoRA adapter is a few megabytes, not gigabytes. Ship one base model + many tiny adapters (one per customer/task) instead of many full copies.
  • Little or no quality loss — typically close to full fine-tuning on many tasks.
  • Composable & reversible — swap adapters in and out; the base is untouched.

QLoRA adds quantization (Chapter 27): the frozen base is stored at low precision, which cuts memory enough to fine-tune a model with tens of billions of parameters on a single GPU. This is why a small team can now customize a giant model.

The real recipe (follow-along)

In practice you don't hand-roll LoRALinear — you use Hugging Face transformers + peft. The whole production fine-tune is a few lines:

# pip install transformers peft datasets
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
from peft import LoraConfig, get_peft_model

model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")
lora = LoraConfig(task_type="SEQ_CLS",       # keeps the new classification head trainable
                  r=8, lora_alpha=16, target_modules=["q_lin", "v_lin"])
model = get_peft_model(model, lora)          # wraps the model: freezes base, adds LoRA
model.print_trainable_parameters()           # prints trainable vs. total parameter counts

# ds: your tokenized, labeled training dataset (not shown here)
Trainer(model, TrainingArguments("out", num_train_epochs=3),
        train_dataset=ds).train()            # trains the adapter and the small head only
model.save_pretrained("my-adapter")          # a few MB, not GB

get_peft_model does exactly what our LoRALinear did — freeze the base, inject low-rank adapters into the attention layers — just for a real pretrained transformer.

Don't be confused: fine-tuning vs. RAG vs. prompting. Fine-tuning changes the model's behavior/skill by training (use it for tone, format, a narrow capability). RAG (Chapter 14) adds knowledge at query time without training. Prompting changes only the input; the model's weights stay exactly as they were. To make a model act a certain way → fine-tune; to make it know your latest docs → RAG. Reaching for fine-tuning when you needed RAG (or vice versa) is the most common modern-AI design mistake.

Make it production

  • Track every fine-tune with MLflow; version the adapter in a registry (tools book).
  • Evaluate before shipping — a held-out set and an eval gate (Chapter 10; the tools book's testing/eval chapters).
  • Serve the base model once and hot-swap adapters per task.

The takeaway

Fine-tuning adapts a pretrained model to your task; LoRA does it efficiently by freezing the base and training a tiny low-rank adapter. You measured 3× fewer trainable parameters here (100–1000× on real LLMs) with no loss of accuracy on this toy task. It's the same low-rank idea as SVD, and the reason custom LLMs are now within reach of small teams. Next, we give a model the ability to act: an agent with tools. 👉

Project 3 — Build an LLM agent

An agent is an LLM that can act: instead of just emitting text, it calls tools (a calculator, a search, an API, a database), observes the results, and decides what to do next — looping until it has an answer. This is the pattern behind "agentic AI," Claude Code, and every assistant that does things. We build the loop from scratch so the mechanics are crystal clear, then show the real Claude version.

Full code: code/projects/agent.py (pure Python — no dependencies).

Why agents exist

An LLM alone can't do arithmetic reliably, look up today's price, or query your database — that knowledge isn't in its weights, and it can't run code. Give it tools and a loop, and it can: reason about what it needs, call the right tool, read the result, and continue. That turns a text predictor into a problem-solver.

The ReAct loop: Thought → Action → Observation

The dominant pattern is ReAct (Reason + Act). The model alternates between thinking about what to do and acting (calling a tool), reading each tool's observation, until it can answer:

Question → Thought → Action(tool) → Observation → Thought → … → Answer

Tools are just functions

def calculator(expr):
    return str(eval(expr, {"__builtins__": {}}))   # restricted eval: demo only, not a real sandbox

def knowledge_lookup(query):
    return FACTS.get(...)                            # a stand-in for search / a DB

TOOLS = {"calculator": calculator, "knowledge_lookup": knowledge_lookup}
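
Emptying __builtins__ narrows what eval can reach, but it is not a true sandbox. One stricter alternative, sketched here as an assumption rather than as the project's code, is to parse the expression and evaluate only the node types you explicitly allow. The name safe_calculator is hypothetical.

```python
import ast
import operator

# Only these operators are allowed; anything else is rejected.
_BINARY = {ast.Add: operator.add, ast.Sub: operator.sub,
           ast.Mult: operator.mul, ast.Div: operator.truediv}
_UNARY = {ast.UAdd: operator.pos, ast.USub: operator.neg}

def safe_calculator(expr):
    """Evaluate +, -, *, / on numeric literals by walking the parsed syntax tree."""
    def ev(node):
        if isinstance(node, ast.Constant) and type(node.value) in (int, float):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY:
            return _BINARY[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY:
            return _UNARY[type(node.op)](ev(node.operand))
        raise ValueError("unsupported expression")      # names, calls, attributes, ...
    return str(ev(ast.parse(expr, mode="eval").body))

print(safe_calculator("23*19+7"))                       # 444
try:
    safe_calculator("__import__('os').getcwd()")
except ValueError as err:
    print("rejected:", err)                             # rejected: unsupported expression
```

Because function calls and names never appear in the allow-list, anything beyond plain arithmetic fails before it can run.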

The loop

def run(question, max_steps=5):
    scratchpad = []
    for _ in range(max_steps):
        action, arg = decide(question, scratchpad)   # the LLM picks the next move
        if action == "answer":
            return arg
        observation = TOOLS[action](arg)             # run the chosen tool
        scratchpad += [f"Action: {action}('{arg}')", f"Observation: {observation}"]
    return None                                      # step budget used up without an answer

decide() is the brain — given the question and everything observed so far, it returns the next action. In a real agent that's an LLM call; here it's a deterministic stub so the loop runs and is reproducible.
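
For a feel of what such a stub can look like, here is a self-contained sketch of the whole loop with a rule-based decide(). The routing rules, the FACTS contents, and the fallback string are all hypothetical; the project's actual stub may differ.

```python
import re

FACTS = {"speed of light": "299,792,458 m/s"}            # toy knowledge base (assumed)

def calculator(expr):
    return str(eval(expr, {"__builtins__": {}}))          # demo only, as in the text

def knowledge_lookup(query):
    return FACTS.get(query, "unknown")

TOOLS = {"calculator": calculator, "knowledge_lookup": knowledge_lookup}

def decide(question, scratchpad):
    """Hypothetical rule-based stand-in for the LLM call."""
    if scratchpad:                                        # a tool already ran: answer from it
        return "answer", scratchpad[-1][len("Observation: "):]
    found = re.search(r"\d[\d\s+\-*/().]*", question)     # looks like arithmetic?
    if found and any(op in found.group() for op in "+-*/"):
        return "calculator", found.group().replace(" ", "")
    topic = question.lower().rstrip("?")
    if topic.startswith("what is the "):
        topic = topic[len("what is the "):]
    return "knowledge_lookup", topic

def run(question, max_steps=5):
    scratchpad = []
    for _ in range(max_steps):
        action, arg = decide(question, scratchpad)
        if action == "answer":
            return arg
        observation = TOOLS[action](arg)
        scratchpad += [f"Action: {action}('{arg}')", f"Observation: {observation}"]
    return None                                           # step budget used up

print(run("What is 23 * 19 + 7?"))                        # 444
print(run("What is the speed of light?"))                 # 299,792,458 m/s
```

Swapping this decide() for a model call changes nothing else in run(), which is the point of keeping the loop and the brain separate.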

Running it

$ python agent.py

Output:

Question: What is 23 * 19 + 7?
Thought: I should use calculator('23*19+7')
Action: calculator('23*19+7')
Observation: 444
Answer: 444
----------------------------------------
Question: What is the speed of light?
Thought: I should use knowledge_lookup('speed of light')
Action: knowledge_lookup('speed of light')
Observation: 299,792,458 m/s
Answer: 299,792,458 m/s

The agent routed each question to the right tool — arithmetic to the calculator (getting 444 exactly, which an LLM might fumble), and a fact to the lookup tool — then answered from the observation. That routing-and-looping is agency.

The real thing: Claude with tool use

In production, decide() is a Claude API call, and the model itself chooses the tool. You declare your tools as JSON schemas; Claude responds with a tool_use request; you run the tool and feed the tool_result back; repeat until it stops. The loop is identical to ours:

import anthropic
client = anthropic.Anthropic()

tools = [{
    "name": "calculator",
    "description": "Evaluate an arithmetic expression",
    "input_schema": {"type": "object",
                     "properties": {"expr": {"type": "string"}},
                     "required": ["expr"]},
}]
messages = [{"role": "user", "content": "What is 23 * 19 + 7?"}]

while True:
    resp = client.messages.create(model="claude-opus-4-8", max_tokens=1024,
                                  tools=tools, messages=messages)
    if resp.stop_reason != "tool_use":
        break                                        # Claude has its final answer
    messages.append({"role": "assistant", "content": resp.content})
    results = []
    for block in resp.content:
        if block.type == "tool_use":                 # Claude asked to call a tool
            out = TOOLS[block.name](**block.input)    # you run it
            results.append({"type": "tool_result", "tool_use_id": block.id,
                            "content": str(out)})
    messages.append({"role": "user", "content": results})   # feed results back

stop_reason == "tool_use" is the model saying "run this tool and tell me the result" — the API-native version of our Action/Observation step. (The SDK's tool runner automates this loop entirely; the manual version above shows what it does.)

What makes agents hard

The loop is simple; making it reliable is the frontier:

  • Error compounding — a wrong step early derails everything after it. Long loops are fragile.
  • Cost & latency — every step is an LLM call; agents are slow and expensive.
  • Tool design — clear tool names, descriptions, and schemas dramatically change how well the model uses them.
  • Termination & guardrails — cap the steps (our max_steps), validate tool inputs, gate destructive actions behind confirmation.
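
The guardrail bullet can be sketched as a thin wrapper around each tool call. Everything here is illustrative: guarded_call, the 200-character limit, and the two toy tools are assumptions, not part of the project file.

```python
def guarded_call(tools, name, arg, destructive=(), confirm=lambda name, arg: False):
    """Run one tool call with basic checks; always returns an observation string."""
    if name not in tools:                                  # the model asked for an undeclared tool
        return f"error: unknown tool '{name}'"
    if not isinstance(arg, str) or len(arg) > 200:         # crude input validation
        return "error: invalid argument"
    if name in destructive and not confirm(name, arg):     # side effects need an explicit yes
        return f"error: '{name}' requires confirmation"
    try:
        return str(tools[name](arg))
    except Exception as err:                               # report failure instead of crashing
        return f"error: {type(err).__name__}"

# Hypothetical tools, for illustration only.
tools = {"echo": lambda s: s.upper(), "delete_file": lambda path: f"deleted {path}"}

print(guarded_call(tools, "echo", "hi"))                                   # HI
print(guarded_call(tools, "rm_rf", "/"))                                   # error: unknown tool 'rm_rf'
print(guarded_call(tools, "delete_file", "a.txt", destructive={"delete_file"}))
print(guarded_call(tools, "delete_file", "a.txt", destructive={"delete_file"},
                   confirm=lambda name, arg: True))                        # deleted a.txt
```

Returning errors as observations lets the model see what went wrong and try something else, rather than ending the run with a stack trace.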

Frameworks (LangChain, LlamaIndex, the Claude Agent SDK) package the loop, tool plumbing, and memory — but it's the loop you just built.

Don't be confused: an agent is a loop, not a model. The "agent" isn't a special kind of LLM — it's an ordinary LLM wrapped in your loop that lets it call tools and see results. The intelligence is the model; the agency is the harness around it.

Make it production

  • Trace every step (cost, latency, which tools fired) — agents are multi-step, so tracing is essential (the tools book's LLM-observability chapter).
  • Gate side effects — tools that send email, spend money, or delete data go behind confirmation.
  • Cap and budget — max steps and a token budget, or an agent can loop forever and run up a bill.

The takeaway

An agent is an LLM in a Thought→Action→Observation loop with tools — you built the loop, and the real Claude version is the same loop with tool_use/tool_result messages. It turns a predictor into something that acts. The hard parts are reliability, cost, and tool design, not the loop itself. Next, we leave language for vision: train a CNN image classifier from scratch. 👉

Project 4 — Train a CNN image classifier

Language isn't the only modality. Convolutional neural networks are the workhorse of computer vision — medical imaging, self-driving perception, quality inspection, face recognition. This project trains one from scratch, end to end, on synthetic images (no downloads), and watches it climb from coin-flip to 100% accuracy in a few seconds. You'll see a CNN learn the filters that detect visual patterns — exactly how real vision models learn edges and textures.

Full code: code/projects/cnn.py (PyTorch, CPU-friendly).

The task

Classify 12×12 grayscale images: class 0 has a horizontal bar, class 1 has a vertical bar, both buried in noise. Trivial for a human, but hard for a model with no built-in sense of spatial structure, which is exactly what a CNN provides.

H = W = 12                                     # image size, from the task description

def make_images(n):
    X = 0.1 * torch.randn(n, 1, H, W)          # noise
    y = torch.randint(0, 2, (n,))
    for i in range(n):
        if y[i] == 0:
            X[i, 0, torch.randint(0, H, (1,)).item(), :] += 1.0   # horizontal bar, random row
        else:
            X[i, 0, :, torch.randint(0, W, (1,)).item()] += 1.0   # vertical bar, random column
    return X, y

The bar is on a random row/column each time, so the model can't memorize a position — it must learn the concept of horizontal vs. vertical. That's the whole point of a CNN's translation invariance (Chapter 13): detect a pattern wherever it appears.

The architecture

A CNN stacks convolution → activation → pooling to build up from pixels to concepts (Chapter 13):

class CNN(nn.Module):
    def __init__(self):
        super().__init__()                                        # required before registering layers
        self.conv1 = nn.Conv2d(1, 8, kernel_size=3, padding=1)    # 8 learned 3×3 filters
        self.conv2 = nn.Conv2d(8, 16, kernel_size=3, padding=1)
        self.pool = nn.MaxPool2d(2)                               # downsample 12→6→3
        self.fc = nn.Linear(16 * 3 * 3, 2)                        # classify

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))    # detect features, then shrink
        x = self.pool(F.relu(self.conv2(x)))
        return self.fc(x.flatten(1))

  • Conv2d — the learnable kernels/filters from Chapter 5. The first layer's 8 filters learn to respond to simple patterns (horizontal vs. vertical edges); training discovers them — you don't hand-design them.
  • ReLU — the non-linearity (Chapter 11).
  • MaxPool — keeps the strongest response in each 2×2 region, shrinking the image and adding position-tolerance.
  • Linear head — flattens the final feature map and classifies (Chapter 6).
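
To see what one filter plus ReLU plus pooling does, the sketch below applies a hand-made horizontal-bar detector in NumPy. A trained network learns its filters; this one is written by hand purely for illustration, and the helper names are assumptions.

```python
import numpy as np

def conv2d_valid(img, kernel):
    """Slide one kernel over one 2-D image (no padding, stride 1)."""
    kh, kw = kernel.shape
    H, W = img.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

def bar_image(horizontal, index, size=12):
    img = np.zeros((size, size))
    if horizontal:
        img[index, :] = 1.0
    else:
        img[:, index] = 1.0
    return img

# Hand-made detector: rewards a bright middle row, penalizes the rows above and below.
h_filter = np.array([[-1., -1., -1.],
                     [ 2.,  2.,  2.],
                     [-1., -1., -1.]])

def response(img, kernel):
    fmap = np.maximum(conv2d_valid(img, kernel), 0.0)   # ReLU
    return fmap.max()                                   # max-pool over the whole map

for row in (2, 9):
    print("horizontal bar, row", row, "->", response(bar_image(True, row), h_filter))   # 6.0
print("vertical bar, col 5    ->", response(bar_image(False, 5), h_filter))             # 0.0
```

The response is the same whether the bar sits on row 2 or row 9, and it vanishes for a vertical bar: the filter detects the pattern, and pooling discards where it was found.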

Training it

Standard mini-batch training loop (Chapter 16) with model.train() / model.eval() (Chapter 12):

$ python cnn.py

Output:

params=1538
epoch 0  loss 0.671  test-acc 0.545
epoch 1  loss 0.616  test-acc 0.775
epoch 2  loss 0.520  test-acc 0.970
epoch 3  loss 0.381  test-acc 0.985
epoch 4  loss 0.233  test-acc 1.000
epoch 5  loss 0.127  test-acc 1.000

A 1,538-parameter CNN started at 54.5% (barely better than guessing) and reached 100% test accuracy by epoch 4 — having learned convolutional filters that detect horizontal vs. vertical structure, evaluated on held-out images it never saw (Chapter 9). That's a complete, honest computer-vision training run.
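
The 1,538 figure can be reproduced by hand from the layer definitions above: each convolution has in × out × k × k weights plus one bias per output channel, and the linear head has in × out weights plus one bias per output.

```python
def conv_params(c_in, c_out, k):
    return c_in * c_out * k * k + c_out       # weights + one bias per output channel

def linear_params(n_in, n_out):
    return n_in * n_out + n_out               # weights + one bias per output

conv1 = conv_params(1, 8, 3)                  # 80
conv2 = conv_params(8, 16, 3)                 # 1168
fc = linear_params(16 * 3 * 3, 2)             # 290
print(conv1, conv2, fc, conv1 + conv2 + fc)   # 80 1168 290 1538
```

Most of the parameters sit in the second convolution, because its cost scales with the product of input and output channels.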

Scaling to real images

The architecture is identical for real vision; you just grow it and feed it real data:

  • Real datasets — MNIST (digits), CIFAR-10 (objects), ImageNet (1000 classes). One line with torchvision.datasets.
  • Deeper networks — ResNet, EfficientNet, U-Net: the same conv/pool/activation pattern, dozens of layers, residual connections (Chapter 11).
  • Data augmentation — random flips/crops/rotations to fight overfitting (Chapter 9).

You usually don't train from scratch — transfer learning

For real vision tasks, you rarely start from random weights. You take a CNN pretrained on ImageNet (which already learned edges, textures, shapes) and fine-tune it on your task — transfer learning, the vision cousin of Chapter 32's fine-tuning:

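# pip install torchvision
import torch.nn as nn
from torchvision.models import resnet18, ResNet18_Weights

NUM_CLASSES = 5                                        # placeholder: set to your task's class count
model = resnet18(weights=ResNet18_Weights.DEFAULT)     # pretrained on ImageNet
for p in model.parameters():
    p.requires_grad = False                            # freeze the pretrained body
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)  # new head for your task (trainable by default)
# train only model.fc on your (small) dataset: fast and data-efficient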

This is how you build a strong image classifier with a few hundred examples instead of a million: stand on the shoulders of a model that already learned to see.

Don't be confused: CNN vs. Vision Transformer (ViT). CNNs were the default for a decade; Vision Transformers now match or beat them at scale by applying the transformer to image patches. CNNs remain more data-efficient and are still everywhere; ViTs win with huge data. Know both exist.

Make it production

The takeaway

A CNN learns convolutional filters that detect visual patterns wherever they appear; you trained one from 54% to 100% in seconds, then saw how transfer learning (a pretrained ResNet + a new head) builds strong classifiers from little data. Same conv/pool/activation pattern, scaled up, runs all of computer vision. One project left, and it's the most modern: generating new data with diffusion. 👉

Project 5 — A diffusion model from scratch

The image generators that took the world by storm — Stable Diffusion, DALL·E, Midjourney, Sora — are diffusion models. Their idea sounds like magic but is beautifully simple: learn to reverse noise. Add noise to real data step by step until it's pure static; train a network to undo one step of that; then generate by starting from static and denoising into something real. This project builds a complete diffusion model on 2-D points so it trains in 15 seconds and you can measure that it works.

Full code: code/projects/diffusion.py (PyTorch, CPU-friendly).

The idea: destroy, then learn to rebuild

Two processes:

  • Forward (fixed, no learning) — gradually add Gaussian noise to a real sample over $T$ steps until it's indistinguishable from random noise. This is just a recipe.
  • Reverse (learned) — train a network to predict the noise that was added, so you can subtract it and step back toward real data.

Generate by running the reverse process from pure noise:

forward:   data  → +noise → +noise → … → pure noise   (fixed)
reverse:   noise → denoise → denoise → … → new data    (learned)

The key trick: instead of learning to generate data directly (hard), the model learns the much easier task of predicting the noise in a noisy sample. Subtracting predicted noise is one denoising step. Do it repeatedly and noise becomes data.

The target & the model

Our "real data" is points on the unit circle — simple enough to verify numerically (generated points should land at radius ≈ 1), rich enough to be a real distribution.

T = 50                                   # number of noising steps
betas = torch.linspace(1e-4, 0.1, T)     # how much noise per step (the schedule)
abar = torch.cumprod(1 - betas, dim=0)   # cumulative noise level, ᾱ_t

class Denoiser(nn.Module):               # predicts the noise in x_t, given t
    def forward(self, x, t):
        return self.net(torch.cat([x, (t/T).unsqueeze(1)], dim=1))

The model takes a noisy point x and the timestep t (how noisy it is) and outputs its guess of the noise. It's a tiny MLP (Chapter 11) — the architecture can be anything; the training objective is what makes it a diffusion model.
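
It helps to look at what the schedule actually does to a sample. This NumPy sketch recomputes the same betas and ᾱ_t as above and prints how much of the original point survives at a few timesteps; the choice of timesteps to print is arbitrary.

```python
import numpy as np

T = 50
betas = np.linspace(1e-4, 0.1, T)       # same schedule as above
abar = np.cumprod(1.0 - betas)          # share of the signal's variance left after t steps

for t in (0, 10, 25, 49):
    signal, noise = np.sqrt(abar[t]), np.sqrt(1.0 - abar[t])
    print(f"t={t:2d}  signal weight={signal:.3f}  noise weight={noise:.3f}")

# The two squared weights always sum to 1, so the overall scale of x_t stays stable.
print(np.allclose(abar + (1.0 - abar), 1.0))   # True
```

The signal weight falls steadily while the noise weight rises, which is the "destroy" half of the process expressed as two numbers per timestep.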

Training: predict the noise

The whole training step, using the forward-process shortcut $x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$:

x0 = sample_data(bs)                 # a real point
t = torch.randint(0, T, (bs,))       # a random noise level
noise = torch.randn_like(x0)         # the noise we add
at = abar[t].unsqueeze(1)
x_t = at.sqrt()*x0 + (1-at).sqrt()*noise        # the noisy version
loss = ((model(x_t, t) - noise) ** 2).mean()    # predict that noise (MSE)

That's it — mean squared error between the model's predicted noise and the actual noise (Chapter 7). Simple regression, repeated over random samples and noise levels.

Generating: denoise from pure noise

Start from random noise and walk the reverse process step by step, subtracting predicted noise each time:

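x = torch.randn(n, 2)                            # pure noise
for t in reversed(range(T)):
    a, ab = 1 - betas[t], abar[t]                # this step's α_t and cumulative ᾱ_t
    tt = torch.full((n,), t)                     # the timestep as a batch tensor
    eps = model(x, tt)                           # predict the noise
    x = (x - (1-a)/(1-ab).sqrt() * eps) / a.sqrt()  # remove a step of it
    if t > 0:                                    # add a little randomness back
        x = x + betas[t].sqrt() * torch.randn_like(x)  # σ_t = √β_t, a common choice (assumed here)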

Running it

$ python diffusion.py

Output:

step    0  loss 1.1303
step  500  loss 0.3930
step 1000  loss 0.3358
step 1500  loss 0.3416

generated 500 points
mean radius = 1.003  (target 1.000)
std radius  = 0.114   (small = points really are on the circle)

We started generation from pure Gaussian noise and the model denoised it into points with mean radius 1.003 — it learned to generate the unit circle. The small std (0.114) confirms the points genuinely lie on the circle, not scattered randomly. The model created new data that matches the target distribution, from noise — that's generative modeling, and it's the exact process behind every diffusion image generator.

From circles to Stable Diffusion

The leap to generating photorealistic images is — once again — scale and a couple of additions, not a different idea:

  • Bigger denoiser — a U-Net (a CNN, Chapter 13) instead of our MLP, to handle images.
  • Latent diffusion — run the process in a compressed latent space (via an autoencoder) instead of raw pixels, for efficiency. This is the "Latent" in Latent Diffusion (Stable Diffusion).
  • Text conditioning — feed a text embedding (Chapter 14) into the denoiser so the prompt steers generation. "A cat on the moon" → the model denoises toward images matching that text.

The training objective — predict the noise — is identical to what you just ran.

Don't be confused: diffusion vs. GANs vs. autoregressive. Three ways to generate. Diffusion denoises (today's leader for images). GANs pit a generator against a discriminator (older image method). Autoregressive models generate one token at a time — that's your GPT (the leader for text). Different mechanisms, same goal: model and sample a data distribution.

Make it production

  • It's expensive to train — use a pretrained model (Stable Diffusion via Hugging Face diffusers) and fine-tune with LoRA (Chapter 32) for a custom style.
  • Generation is iterative (many denoising steps) → slow; techniques like distillation and fewer-step samplers speed it up for serving.

The takeaway — and the end of the book

A diffusion model learns to reverse noise: train it to predict the noise in a sample (simple MSE), then generate by denoising from pure noise. You built one that generates the unit circle from static — the exact process behind Stable Diffusion, which only adds a U-Net, a latent space, and text conditioning.

And that's the book. You started not knowing what a vector was; you've now built a transformer, fine-tuned it with LoRA, given it tools, trained a vision model, and generated data with diffusion — the complete modern AI stack, from first principles, on your own machine. You can define and code every term an ML engineer uses, reason from fundamentals when you meet a new one, and ship what you build. Go build something real. 🚀

The recipe book (copy-paste NumPy)

A reference card of the small, sharp building blocks you reach for in every AI project. Each recipe is self-contained, vectorized (no Python loops where NumPy can do it), and shown with exact input and output. The similarity/activation/encoding recipes also live, runnable, in code/recipes.py. Copy, paste, ship.

Convention: X is a feature matrix of shape (n_samples, n_features) — axis 0 = samples, axis 1 = features (Chapter 2).

Similarity & distance

import numpy as np

def cosine_sim(a, b, eps=1e-12):                 # direction only, [-1, 1]
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(a @ b / (np.linalg.norm(a)*np.linalg.norm(b) + eps))

def euclidean(a, b):                             # straight-line distance, >= 0
    return float(np.sqrt(np.sum((np.asarray(a,float)-np.asarray(b,float))**2)))

def rbf(a, b, gamma=1.0):                        # distance -> similarity in (0,1]
    d2 = np.sum((np.asarray(a,float)-np.asarray(b,float))**2)
    return float(np.exp(-gamma * d2))

print(round(cosine_sim([1,0,1], [1,0,0]), 3),
      round(euclidean([0,0], [3,4]), 3),
      round(rbf([0,0], [1,0], gamma=0.5), 3))

Output:

0.707 5.0 0.607

All-pairs (the matrix versions you actually ship)

def normalize_rows(X, eps=1e-12):                # each row -> unit L2 length
    return X / (np.linalg.norm(X, axis=1, keepdims=True) + eps)

def cosine_matrix(X):                            # (n,n) all-pairs cosine
    Xn = normalize_rows(X); return Xn @ Xn.T

def pairwise_sq_dists(X, Y=None):                # (n,m) squared Euclidean, no loops
    Y = X if Y is None else Y
    xx = np.sum(X**2, 1)[:, None]; yy = np.sum(Y**2, 1)[None, :]
    return np.maximum(xx + yy - 2.0*(X @ Y.T), 0.0)

X = np.array([[1.,0.], [0.9,0.1], [0.,1.]])
print(np.round(cosine_matrix(X), 3))

Output:

[[1.    0.994 0.   ]
 [0.994 1.    0.11 ]
 [0.    0.11  1.   ]]
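
pairwise_sq_dists is defined above but not exercised, so here is a short check. It repeats the recipe so the block runs on its own, compares the result with the obvious double loop, and turns the distances into all-pairs RBF similarities. The three points and gamma = 0.5 are arbitrary.

```python
import numpy as np

def pairwise_sq_dists(X, Y=None):                # same recipe as above
    Y = X if Y is None else Y
    xx = np.sum(X**2, 1)[:, None]; yy = np.sum(Y**2, 1)[None, :]
    return np.maximum(xx + yy - 2.0*(X @ Y.T), 0.0)

X = np.array([[0., 0.], [3., 4.], [1., 0.]])
D2 = pairwise_sq_dists(X)
print(D2)

# Cross-check against the slow but obvious double loop.
loop = np.array([[np.sum((a - b)**2) for b in X] for a in X])
print(np.allclose(D2, loop))                     # True

gamma = 0.5
print(np.round(np.exp(-gamma * D2), 3))          # all-pairs RBF similarity matrix
```

The np.maximum(..., 0.0) guard matters on real data: floating-point cancellation in xx + yy − 2·X·Yᵀ can yield tiny negative values, which would break a later square root.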

Ranking: top-k (the retrieval primitive)

def top_k(scores, k):                            # indices of k largest, best-first
    scores = np.asarray(scores); k = min(k, len(scores))
    part = np.argpartition(-scores, k-1)[:k]     # O(n) partial select
    return part[np.argsort(-scores[part])]       # then sort just those k

print(top_k([0.1, 0.9, 0.3, 0.7, 0.2], 3))

Output:

[1 3 2]

Activations

def sigmoid(x):  return 1.0 / (1.0 + np.exp(-np.asarray(x, float)))
def softmax(x):                                  # numerically stable (subtract max!)
    x = np.asarray(x, float); e = np.exp(x - x.max()); return e / e.sum()
def relu(x):     return np.maximum(0.0, np.asarray(x, float))

print("sigmoid:", np.round(sigmoid([-2, 0, 2]), 3))
print("softmax:", np.round(softmax([1, 2, 3]), 3), "sums to", round(softmax([1,2,3]).sum(),3))

Output:

sigmoid: [0.119 0.5   0.881]
softmax: [0.09  0.245 0.665] sums to 1.0

Encodings & scaling

def one_hot(labels, n=None):                     # ints -> one-hot rows
    labels = np.asarray(labels); n = n or labels.max()+1
    out = np.zeros((len(labels), n)); out[np.arange(len(labels)), labels] = 1.0
    return out

def standardize(X, eps=1e-12):                   # per column: mean 0, std 1
    X = np.asarray(X, float); return (X - X.mean(0)) / (X.std(0) + eps)

def min_max(X, eps=1e-12):                        # per column: rescale to [0,1]
    X = np.asarray(X, float); lo, hi = X.min(0), X.max(0); return (X-lo)/(hi-lo+eps)

print(one_hot([0, 2, 1]).tolist())
print(np.round(standardize([[1.,10],[2,20],[3,30]]), 3).tolist())

Output:

[[1.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
[[-1.225, -1.225], [0.0, 0.0], [1.225, 1.225]]

L2-normalize a single vector

v = np.array([3., 4.])
print("unit:", v / np.linalg.norm(v), " norm:", np.linalg.norm(v))

Output:

unit: [0.6 0.8]  norm: 5.0

Train / test split (shuffle, then slice)

def train_test_split(X, y, test=0.25, seed=0):
    rng = np.random.default_rng(seed); idx = rng.permutation(len(X))
    n = int(len(X) * test)
    te, tr = idx[:n], idx[n:]
    return X[tr], X[te], y[tr], y[te]

X = np.arange(8).reshape(8, 1); y = np.arange(8)
Xtr, Xte, ytr, yte = train_test_split(X, y, test=0.25)
print("train y:", ytr, " test y:", yte)

Output:

train y: [3 6 5 0 1 7]  test y: [2 4]

Mini-batch iterator

def batches(X, y, bs, seed=0):
    rng = np.random.default_rng(seed); idx = rng.permutation(len(X))
    for s in range(0, len(X), bs):
        b = idx[s:s+bs]; yield X[b], y[b]

for i, (xb, yb) in enumerate(batches(np.arange(10).reshape(10,1), np.arange(10), 4)):
    print(f"batch {i}: y={yb}")

Output:

batch 0: y=[4 6 2 7]
batch 1: y=[3 5 9 0]
batch 2: y=[8 1]

Classification metrics from scratch

def accuracy(probs, y_true):                     # probs shape (n, classes)
    return float((probs.argmax(1) == y_true).mean())

probs = np.array([[0.7,0.2,0.1], [0.1,0.8,0.1], [0.2,0.2,0.6]])
print("accuracy:", round(accuracy(probs, np.array([0, 1, 0])), 3))

def precision_recall(y_pred, y_true):            # binary 0/1 arrays
    TP = ((y_pred==1)&(y_true==1)).sum(); FP = ((y_pred==1)&(y_true==0)).sum()
    FN = ((y_pred==0)&(y_true==1)).sum()
    return TP/(TP+FP), TP/(TP+FN)

p, r = precision_recall(np.array([1,1,0,0,1]), np.array([1,0,0,1,1]))
print(f"precision={p:.3f} recall={r:.3f}")

Output:

accuracy: 0.667
precision=0.667 recall=0.667
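
Precision and recall are often folded into a single F1 score, their harmonic mean. A minimal sketch; the zero-division guard is a convention chosen here, and libraries may handle that case differently.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; defined as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(round(f1_score(2/3, 2/3), 3))   # 0.667 -> the example above
print(round(f1_score(1.0, 0.1), 3))   # 0.182 -> high precision cannot hide poor recall
```

The harmonic mean is dominated by the smaller of the two numbers, which is why F1 is preferred over a plain average when the classes are imbalanced.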

Smoothing a loss curve: exponential moving average

def ema(values, alpha=0.3):                      # EMA, for noisy training curves
    out = [values[0]]
    for v in values[1:]:
        out.append(alpha*v + (1-alpha)*out[-1])
    return np.array(out)

print(np.round(ema(np.array([1.,2,3,10,3,2,1]), alpha=0.3), 3))

Output:

[1.    1.3   1.81  4.267 3.887 3.321 2.625]

The spike at 10 is smoothed to 4.267 instead of dominating — exactly what you want when reading a jittery loss curve.

Numerical gradient check (verify any hand-derived gradient)

def grad_check(f, x, eps=1e-6):                  # central difference
    g = np.zeros_like(x, dtype=float)
    for i in range(x.size):
        d = np.zeros_like(x, dtype=float); d.flat[i] = eps
        g.flat[i] = (f(x+d) - f(x-d)) / (2*eps)
    return g

f = lambda v: v[0]**2 + 3*v[0]*v[1]              # analytic grad: [2x+3y, 3x]
print("numeric:", np.round(grad_check(f, np.array([2., 1.])), 5))

Output:

numeric: [7. 6.]

Two more one-liners worth memorizing

print("clip to [0,1]:", np.clip([-2., 0.5, 3.], 0, 1))       # clamp / saturate
print("running mean :", np.cumsum([2.,4,6,8]) / np.arange(1, 5))

Output:

clip to [0,1]: [0.  0.5 1. ]
running mean : [2. 3. 4. 5.]

The takeaway

These are the verbs of practical AI: normalize, similarity, top-k, softmax, one-hot, standardize, split, batch, score, smooth, grad-check. They show up in every project in this series. Keep this page open in a tab. One chapter left — the words an interviewer will assume you already know. 👉

Glossary: words an interviewer assumes you know

A fast lookup for every term this book defined, plus a few you'll meet in the wild. One line each, with a pointer to the chapter that explains it properly. If you can give the one-liner and sketch the code, you're fluent.

The learning loop

  • Model — a function with adjustable numbers (parameters) inside. Ch 1
  • Parameter / weight — a number the model learns via training. Ch 1
  • Hyperparameter — a knob you set and hold fixed (learning rate, #layers). Ch 1
  • Bias (the term +b) — the constant added in w·x + b; shifts the output. Ch 6
  • Training / fitting — adjusting parameters to minimize the loss. Ch 1
  • Inference / prediction — running a trained model on new input (no learning). Ch 1
  • Loss / cost / objective — the single number measuring current wrongness; training minimizes it. Ch 7
  • Metric — what you report (accuracy, F1); needn't be differentiable. Ch 10
  • Supervised / unsupervised / reinforcement — learn from labels / from structure / from rewards. Ch 1
  • Epoch / iteration / batch — full data pass / one update / examples per update. Ch 16

Data & features

  • Tensor — an n-dimensional array; rank = number of dimensions. Ch 2
  • Shape — the size along each dimension; (n_samples, n_features) by convention. Ch 2
  • Broadcasting — auto-stretching size-1 dimensions so arrays of different shapes combine without loops. Ch 2
  • Feature — one numeric property of a sample; the columns of X. Ch 3
  • Feature extraction / engineering / selection — turn raw data into features / hand-craft new ones / pick a subset. Ch 3
  • One-hot encoding — a 0/1 column per category, no false ordering. Ch 3
  • TF-IDF — text features weighting rare-but-distinctive words up, common words down. Ch 3
  • Standardization (z-score) — per-feature mean 0, std 1. Ch 3
  • Normalization — to unit length (L2) or to [0,1] (min-max) — ask which. Ch 3
  • Embedding — a learned dense vector where geometric closeness = semantic similarity. Ch 14
  • Representation learning — the model invents features instead of you crafting them. Ch 14
  • Data leakage — test information sneaking into training; inflates offline scores. Ch 16
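
Standardization and the two kinds of normalization are the entries in this group that get mixed up most often, so here they are side by side. This is a minimal sketch on a made-up 3×2 matrix; the variable names are illustrative. Note which axis each one works along: the first two rescale each feature (column), while L2 normalization rescales each sample (row).

```python
import numpy as np

X = np.array([[1., 10.],
              [2., 20.],
              [3., 30.]])          # 3 samples, 2 features (toy data)

# Standardization (z-score): per-feature (column) mean 0, std 1.
z = (X - X.mean(axis=0)) / X.std(axis=0)

# Min-max normalization: per-feature rescale into [0, 1].
mm = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# L2 normalization: per-sample (row) rescale to unit length.
l2 = X / np.linalg.norm(X, axis=1, keepdims=True)

print("z-score column std :", z.std(axis=0))
print("min-max column 0   :", mm[:, 0])
print("L2 row norms       :", np.linalg.norm(l2, axis=1))
```

Every subtraction and division here relies on broadcasting: a length-2 vector of column statistics, or a (3, 1) column of row norms, is stretched to match the (3, 2) matrix.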

Similarity & math

  • Dot product — Σ aᵢbᵢ; similarity that grows with alignment and magnitude. Ch 4
  • Cosine similarity — dot of normalized vectors; direction only, [-1, 1]. Ch 4
  • Euclidean / Manhattan distance — straight-line (L2) / city-block (L1) distance. Ch 4
  • RBF / Gaussian kernel — exp(-γ‖a-b‖²); turns distance into tunable similarity. Ch 4
  • Kernel (ML) — a similarity function that's secretly a dot product in a richer space; the kernel trick uses that space without building it. Ch 5
  • Kernel (other 4) — conv filter / GPU function / OS core / stats window. Ch 5
  • Norm — a vector's length; ‖v‖₂ = √Σvᵢ². Ch 4
  • Logit — a raw pre-sigmoid/softmax score in (-∞, ∞). Ch 6

Models & training

  • Linear regression — predict a number with w·x + b. Ch 6
  • Logistic regression — predict a probability via sigmoid(w·x + b); a classifier. Ch 6
  • Sigmoid / softmax — squash to one probability / to competing probs summing to 1. Ch 6
  • Cross-entropy (= log loss = NLL) — the classification loss; punishes confident mistakes. Ch 7
  • MSE / MAE / Huber — regression losses: squared / absolute / hybrid. Ch 7
  • Gradient — the vector of partial derivatives; points uphill in loss. Ch 8
  • Gradient descent — θ -= lr · gradient; the optimization engine. Ch 8
  • Learning rate — step size; too big diverges, too small crawls. Ch 8
  • SGD / mini-batch / momentum / Adam — the optimizer family; Adam is the usual default. Ch 8
  • Backpropagation — the chain rule applied efficiently to get all gradients. Ch 8
  • Autograd — software that does backprop automatically. Ch 12
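
Several of these entries fit into one loop. The sketch below fits w·x + b to four noise-free points with MSE and plain gradient descent. The data, the learning rate of 0.1, and the 2000 steps are illustrative assumptions, and the gradients are derived by hand rather than by autograd.

```python
import numpy as np

# Toy data generated by y = 2x + 1 (no noise), so the target weights are known.
X = np.array([0., 1., 2., 3.])
y = 2 * X + 1

w, b, lr = 0.0, 0.0, 0.1            # lr is a hyperparameter; 0.1 is an illustrative choice
for step in range(2000):
    pred = w * X + b                # linear model
    err = pred - y
    loss = (err ** 2).mean()        # MSE
    grad_w = 2 * (err * X).mean()   # d(loss)/dw, derived by hand
    grad_b = 2 * err.mean()         # d(loss)/db, derived by hand
    w -= lr * grad_w                # θ -= lr · gradient
    b -= lr * grad_b

print(f"w={w:.3f} b={b:.3f} loss={loss:.6f}")
```

Output:

w=2.000 b=1.000 loss=0.000000

Try raising lr to 0.3 and the same loop diverges, which is the "too big" half of the learning-rate entry in action.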

Generalization

  • Overfitting / underfitting — memorizing noise / too simple to capture the pattern. Ch 9
  • Bias–variance tradeoff — error from wrong assumptions vs. from sensitivity to the sample. Ch 9
  • Train / validation / test — learn params / tune hyperparams / report once. Ch 9
  • Cross-validation — rotate held-out folds to squeeze small datasets. Ch 9
  • Regularization (L1 / L2) — penalize weight size; L1 → sparse, L2 → smooth shrinkage (weight decay). Ch 9
  • Dropout / early stopping / data augmentation — neural-net anti-overfitting tools. Ch 9
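
"Rotate held-out folds" is easier to picture as index bookkeeping. A minimal sketch, assuming a single shuffle followed by k near-equal chunks; kfold_indices is a hypothetical helper name and the seed is arbitrary.

```python
import numpy as np

def kfold_indices(n, k, seed=0):
    """Yield (train_idx, val_idx) pairs; each sample is held out exactly once."""
    idx = np.random.default_rng(seed).permutation(n)   # shuffle once
    folds = np.array_split(idx, k)                      # k near-equal chunks
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, val

for train, val in kfold_indices(10, 5):
    print(len(train), "train /", len(val), "val")
```

Output:

8 train / 2 val
8 train / 2 val
8 train / 2 val
8 train / 2 val
8 train / 2 val

You would train and score one model per pair, then average the k validation scores.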

Metrics

  • Accuracy — fraction correct; misleading on imbalanced data. Ch 10
  • Confusion matrix — TP / FP / FN / TN; the source of all classification metrics. Ch 10
  • Precision / recall — trust of your alarms / coverage of true positives. Ch 10
  • F1 — harmonic mean of precision and recall; high only if both are. Ch 10
  • ROC-AUC — threshold-free ranking quality; 0.5 = random, 1.0 = perfect. Ch 10
  • RMSE / MAE / R² — regression error (squared / absolute) / variance explained. Ch 10

Neural networks & architectures

  • Neuron / layer / MLP — activation(w·x+b) / many neurons / stacked layers. Ch 11
  • Activation function — the non-linearity that makes depth meaningful (ReLU, GELU, tanh). Ch 11
  • Vanishing/exploding gradient — gradients shrinking to 0 / blowing up through deep nets. Ch 11
  • Weight initialization — small scaled random starts (Xavier/He) so training begins well. Ch 11
  • Inductive bias — a built-in assumption about the data an architecture encodes. Ch 13
  • CNN / convolution / filter — local, position-independent pattern detector for grids. Ch 13
  • RNN / LSTM / GRU — sequence models carrying a hidden state; gates for long memory. Ch 13
  • Transformer / self-attention / Q,K,V — softmax(QKᵀ/√d)·V; everything attends to everything. Ch 13
  • Diffusion / GAN / autoencoder / GNN / MoE — generative & specialized architectures. Ch 13
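
The attention formula softmax(QKᵀ/√d)·V is only a few lines of NumPy. This is a minimal single-head sketch with random, untrained weight matrices. It leaves out masking, multiple heads, and the output projection, and the shapes (4 tokens, 8 features) are arbitrary choices for illustration.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)     # stable: subtract the max first
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product attention; X has shape (tokens, d_model)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d))     # (tokens, tokens), each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                      # 4 tokens, 8 features each
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out, weights = self_attention(X, Wq, Wk, Wv)
print(out.shape, weights.shape, np.round(weights.sum(axis=1), 6))
```

Output:

(4, 8) (4, 4) [1. 1. 1. 1.]

Row i of weights says how much token i draws from every token, itself included, which is the "everything attends to everything" part.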

LLMs & the modern stack

  • Token / tokenization — the model's subword unit; text → integer IDs. Ch 15
  • Context window — max tokens the model attends to at once; its working memory. Ch 15
  • Temperature / top-k / top-p — sampling knobs trading focus for creativity. Ch 15
  • Pretraining / fine-tuning / RLHF — broad learning / specialization / alignment. Ch 15
  • RAG — retrieve relevant docs into the prompt; adds knowledge. Ch 15
  • Hallucination — confident falsehood; the model optimizes plausibility, not truth. Ch 15
  • Quantization — fewer bits per weight to shrink/speed models. Ch 15
  • Agent / tool use — an LLM that calls functions in a loop to act. Ch 15
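
Temperature and top-k both act on the logits before softmax. A minimal sketch under stated assumptions: the four logits are made up, sampling_probs is a hypothetical helper, and ties at the top-k cutoff are all kept. Top-p is not shown.

```python
import numpy as np

def sampling_probs(logits, temperature=1.0, top_k=None):
    """Turn logits into next-token probabilities with temperature and optional top-k."""
    z = np.asarray(logits, dtype=float) / temperature   # T < 1 sharpens, T > 1 flattens
    if top_k is not None:
        cutoff = np.sort(z)[-top_k]                      # k-th largest scaled logit
        z = np.where(z >= cutoff, z, -np.inf)            # drop everything below it
    z = z - z.max()                                      # stable softmax
    p = np.exp(z)
    return p / p.sum()

logits = [2.0, 1.0, 0.0, -1.0]
print("T=1.0    :", np.round(sampling_probs(logits), 3))
print("T=0.5    :", np.round(sampling_probs(logits, temperature=0.5), 3))
print("T=1, k=2 :", np.round(sampling_probs(logits, top_k=2), 3))
```

Lowering the temperature moves probability toward the top token, and top_k=2 zeroes out everything but the two best candidates before renormalizing. To draw a token from the result you could pass it as the p argument of np.random.default_rng().choice.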

Tools & engineering

  • NumPy — Python's array math library; the foundation of everything here. Ch 0
  • PyTorch — tensors + autograd + nn.Module + optimizers; the deep-learning framework. Ch 12
  • nn.Module / DataLoader — your model class / the batching+shuffling iterator. Ch 12, Ch 16
  • GPU / CUDA / .to(device) — parallel hardware / its programming model / moving tensors to it. Ch 16
  • model.train() / .eval() / no_grad() — dropout and batch norm in training mode / in inference mode / skip building the gradient graph. Ch 12
  • Mixed precision / gradient accumulation / checkpointing — speed, memory, and crash-safety tricks. Ch 16
  • Seed / reproducibility — fix randomness so runs and comparisons are trustworthy. Ch 16
  • MLflow / Weights & Biases — experiment tracking for honest comparisons. Ch 16

Numerical literacy

  • Floating point — approximate arithmetic; compare with isclose, never ==. Ch 17
  • Log-sum-exp / stable softmax — subtract the max before exp to avoid overflow. Ch 17
  • NaN / inf — poison values; usual cause is a too-high learning rate or log(0). Ch 17
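
All three entries can be reproduced in a few lines. A minimal sketch: the logits of about 1000 are chosen only because they are large enough to overflow float64, and the two softmax helpers are illustrative names.

```python
import numpy as np

# Floating point: 0.1 + 0.2 is not exactly 0.3, so compare with a tolerance.
print(0.1 + 0.2 == 0.3, np.isclose(0.1 + 0.2, 0.3))

def softmax_naive(z):
    e = np.exp(z)                       # overflows to inf for large z
    return e / e.sum()                  # inf / inf gives nan

def softmax_stable(z):
    e = np.exp(z - np.max(z))           # largest exponent is now exp(0) = 1
    return e / e.sum()

z = np.array([1000., 1001., 1002.])
with np.errstate(over="ignore", invalid="ignore"):   # silence the expected warnings
    print("naive :", softmax_naive(z))
print("stable:", np.round(softmax_stable(z), 3))
```

The first line prints False True. The naive version returns nan for every entry, while the stable version returns the same probabilities you would get for the logits [0, 1, 2], because softmax does not change when you subtract a constant from every logit.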

Classical ML algorithms

  • k-NN — classify by majority vote of the k nearest training points; no training. Ch 20
  • Naive Bayes — Bayes' rule assuming feature independence; fast text baseline. Ch 20
  • Decision tree — recursive if/else splits chosen to reduce impurity (Gini/entropy). Ch 20
  • Random forest / bagging — many independent trees averaged; reduces variance. Ch 20
  • Gradient boosting (XGBoost/LightGBM) — sequential error-correcting trees; reduces bias; king of tabular ML. Ch 20
  • SVM / support vectors / margin — max-margin classifier; kernels for non-linear boundaries. Ch 20
  • k-means — partition data into k round clusters (assign/move). Ch 21
  • DBSCAN — density-based clustering; finds arbitrary shapes & outliers, no k. Ch 21
  • PCA — project onto top variance directions (via SVD) to reduce dimensions. Ch 21
  • t-SNE / UMAP — non-linear methods for visualizing high-dim data (not as features). Ch 21
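
The k-NN entry is short enough to sketch in full, which also shows what "no training" means: the function below just stores nothing and looks at the data when asked. The six toy points, the helper name knn_predict, and breaking ties toward the smaller label are assumptions for illustration.

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    """Majority vote among the k nearest training points (hypothetical helper)."""
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distance to every training point
    nearest = np.argsort(dists)[:k]               # indices of the k closest
    votes = np.bincount(y_train[nearest])         # count labels among the neighbors
    return int(votes.argmax())                    # ties go to the smaller label

X_train = np.array([[0., 0.], [0., 1.], [1., 0.], [5., 5.], [5., 6.], [6., 5.]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.5, 0.5])),
      knn_predict(X_train, y_train, np.array([5.5, 5.5])))
```

Output:

0 1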

Math & statistics

  • Expectation / variance — long-run mean / spread of a random variable. Ch 22
  • Central Limit Theorem — averages become Normal; uncertainty shrinks like 1/√n. Ch 22
  • Bayes' theorem — update belief with evidence: posterior ∝ likelihood × prior. Ch 22
  • MLE / MAP — parameters that maximize data likelihood / posterior; the origin of losses & L2. Ch 22
  • p-value — P(data this extreme | null true); not P(null), not effect size. Ch 22
  • Eigenvector / eigenvalue — a direction a matrix only scales, and by how much. Ch 23
  • SVD — factor any matrix; gives the best low-rank approximation (PCA, LoRA). Ch 23
  • Rank — number of independent directions in a matrix; low rank = compressible. Ch 23
  • A/B test / power / guardrail — randomized online experiment; chance of catching a real effect; metrics that must not regress. Ch 24
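
Rank and SVD are easiest to trust once you have watched a low-rank matrix get rebuilt from a single singular triple. A minimal sketch on a matrix constructed as an outer product, so its rank is 1 by design; the numbers are arbitrary.

```python
import numpy as np

# A 4x3 matrix built as an outer product, so its rank is exactly 1.
A = np.outer([1., 2., 3., 4.], [1., 0., -1.])

U, s, Vt = np.linalg.svd(A, full_matrices=False)
print("rank:", np.linalg.matrix_rank(A))

# Keep only the top singular value and its two vectors: a rank-1 approximation.
A1 = s[0] * np.outer(U[:, 0], Vt[0])
print("rank-1 reconstruction matches A:", np.allclose(A, A1))
```

Output:

rank: 1
rank-1 reconstruction matches A: True

Twelve stored numbers collapse to 4 + 3 + 1, which is what "low rank = compressible" means in practice.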

Tools, cloud & the modern landscape

  • Pandas / SQL / scikit-learn — wrangling (groupby) / querying data / the fit/predict API. Ch 25
  • Hyperparameter search (grid/random/Optuna) — tune knobs by cross-validated search. Ch 25
  • MLOps / drift / feature store — productionizing models / input or concept shift / consistent train-serve features. Ch 26
  • Online vs. batch serving — low-latency one-at-a-time vs. high-throughput scheduled. Ch 26
  • LoRA / PEFT / quantization / distillation — cheap fine-tuning / fewer-bit weights / small student model. Ch 27
  • Vector database — store + nearest-neighbor search over embeddings; the memory of RAG. Ch 27
  • Agent / tool use / RLHF / DPO / MoE / diffusion — LLM that acts / alignment / scaled experts / denoising generation. Ch 27
  • Two-stage retrieval (candidate generation → ranking) — the standard scalable recommender/feed pattern. Ch 28

How to use this glossary

In an interview or a design review, you don't need the derivation of every term — you need the one-liner plus the intuition, and the confidence that you could code it. That's exactly what this book built. If any line above still feels hollow, follow its link, re-read the worked example, and run the code. Then you'll own the word, not just recognize it.

That's the whole foundation. Go build something. 👉

References

The sources behind this book, grouped by where they're most useful. Everything here is either a primary paper, a canonical textbook, or the official docs for a tool we used.

Foundational textbooks

  1. Ian Goodfellow, Yoshua Bengio, Aaron Courville. Deep Learning. MIT Press, 2016. — The standard reference for the math of neural networks, backprop, and regularization. Free at deeplearningbook.org.

  2. Christopher Bishop. Pattern Recognition and Machine Learning. Springer, 2006. — The classic on linear models, kernels, and the bias–variance tradeoff.

  3. Trevor Hastie, Robert Tibshirani, Jerome Friedman. The Elements of Statistical Learning. Springer, 2009. — Deep coverage of regularization (ridge, lasso), cross-validation, and classical ML. Free PDF online.

  4. Kevin Murphy. Probabilistic Machine Learning: An Introduction. MIT Press, 2022. — Modern, comprehensive, with excellent treatment of the numerical and probabilistic foundations.

Landmark papers

  1. Vaswani et al. Attention Is All You Need. NeurIPS, 2017. arXiv:1706.03762. — The transformer. The architecture behind every modern LLM (Ch 13, Ch 15).

  2. Mikolov et al. Efficient Estimation of Word Representations in Vector Space. 2013. arXiv:1301.3781. — word2vec; the king - man + woman ≈ queen embeddings (Ch 14).

  3. Kingma & Ba. Adam: A Method for Stochastic Optimization. ICLR, 2015. arXiv:1412.6980. — The default optimizer (Ch 8).

  4. Rumelhart, Hinton & Williams. Learning representations by back-propagating errors. Nature, 1986. — The paper that made backprop famous (Ch 8).

  5. Srivastava et al. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. JMLR, 2014. — Dropout as an anti-overfitting tool (Ch 9).

  6. He et al. Delving Deep into Rectifiers (He initialization). ICCV, 2015. arXiv:1502.01852. — Why initialization scale matters (Ch 11).

  7. Cortes & Vapnik. Support-Vector Networks. Machine Learning, 1995. — The SVM and the kernel trick (Ch 5).

  8. Lewis et al. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS, 2020. arXiv:2005.11401. — RAG (Ch 15).

  9. Leo Breiman. Random Forests. Machine Learning, 2001. — Bagging and the random forest (Ch 20).

  10. Chen & Guestrin. XGBoost: A Scalable Tree Boosting System. KDD, 2016. arXiv:1603.02754. — The gradient-boosting workhorse (Ch 20).

  11. Hu et al. LoRA: Low-Rank Adaptation of Large Language Models. ICLR, 2022. arXiv:2106.09685. — Efficient fine-tuning (Ch 27).

  12. Ho, Jain & Abbeel. Denoising Diffusion Probabilistic Models. NeurIPS, 2020. arXiv:2006.11239. — Diffusion generative models (Ch 27).

  13. Kohavi, Tang & Xu. Trustworthy Online Controlled Experiments. Cambridge, 2020. — The definitive A/B testing reference (Ch 24).

  14. Andrej Karpathy. nanoGPT and Let's build GPT from scratch. — The clearest reference for the GPT project (Ch 31).

  15. Yao et al. ReAct: Synergizing Reasoning and Acting in Language Models. ICLR, 2023. arXiv:2210.03629. — The agent loop (Ch 33).

Further reading by topic

  • Classical ML & stats — Hastie/Tibshirani/Friedman, The Elements of Statistical Learning (free PDF); James et al., An Introduction to Statistical Learning (gentler) (Ch 20–Ch 23).
  • Linear algebra — Gilbert Strang, Introduction to Linear Algebra / MIT 18.06 lectures (Ch 23).
  • MLOps — Huyen, Designing Machine Learning Systems; Google's MLOps white papers (Ch 26).
  • ML system design interviews — Aminian/Xu, Machine Learning System Design Interview (Ch 28).

Tools & documentation

  • NumPy — numpy.org/doc. The array library underpinning every chapter.
  • PyTorch — pytorch.org/docs. Tensors, autograd, nn, DataLoader (Ch 12, Ch 16).
  • scikit-learn — scikit-learn.org. Reference implementations of the classical models, metrics, and preprocessing we built by hand.
  • Hugging Face — huggingface.co/docs. Pretrained transformers and tokenizers (Ch 15).

Sister books in this series

These build directly on the foundations here:

  • HNSW from Scratch — fast approximate nearest-neighbor search; the engine behind embedding retrieval and RAG (Ch 14).
  • IVF & Product Quantization from Scratch — the memory-efficient, compressed side of vector search.
  • Kernel Temporal Segmentation from Scratch — uses the kernel matrix (Ch 5) to segment video.
  • Recommendation Systems from Scratch — embeddings, matrix factorization, and a production capstone (MLflow, FastAPI, RAG) that exercises this whole book.

This book's code

Capstone projects (code/projects/):

The Part I–VIII code depends only on NumPy; the capstone projects need PyTorch (the CPU build is fine), except agent.py, which is pure Python.