Glossary: words an interviewer assumes you know

A fast lookup for every term this book defined, plus a few you'll meet in the wild. One line each, with a pointer to the chapter that explains it properly. If you can give the one-liner and sketch the code, you're fluent.

The learning loop

Model — a function with adjustable numbers (parameters) inside. Ch 1
Parameter / weight — a number the model learns via training. Ch 1
Hyperparameter — a knob you set and hold fixed during training (learning rate, number of layers). Ch 1
Bias (the term +b) — the constant added in w·x + b; shifts the output. Ch 6
Training / fitting — adjusting parameters to minimize the loss. Ch 1
Inference / prediction — running a trained model on new input (no learning). Ch 1
Loss / cost / objective — the single number measuring current wrongness; training minimizes it. Ch 7
Metric — what you report (accuracy, F1); needn't be differentiable. Ch 10
Supervised / unsupervised / reinforcement — learn from labels / from structure / from rewards. Ch 1
Epoch / iteration / batch — full data pass / one update / examples per update. Ch 16
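
To make the loop vocabulary concrete, here is a minimal sketch of mini-batch training for a one-feature linear model. The toy data, the hyperparameter values, and every variable name are assumptions made up for illustration, not code from the chapters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (made up): y = 3x + 1 plus a little noise.
X = rng.uniform(-1, 1, size=(200, 1))
y = 3.0 * X[:, 0] + 1.0 + 0.01 * rng.normal(size=200)

# Hyperparameters: chosen by us and held fixed during training.
lr, n_epochs, batch_size = 0.1, 50, 20

# Parameters: the numbers training adjusts.
w, b = 0.0, 0.0

n_iterations = 0
for epoch in range(n_epochs):                   # one epoch = one full pass over the data
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]   # one batch of examples
        xb, yb = X[idx, 0], y[idx]
        err = (w * xb + b) - yb
        loss = np.mean(err ** 2)                # the single number being minimized (MSE)
        w -= lr * np.mean(2 * err * xb)         # gradient of the loss w.r.t. w
        b -= lr * np.mean(2 * err)              # gradient of the loss w.r.t. b
        n_iterations += 1                       # one iteration = one parameter update

# Inference: run the trained model on a new input, no learning involved.
prediction = w * 0.5 + b
print(n_iterations, round(w, 2), round(b, 2))
```

With 200 examples and a batch size of 20, each epoch is 10 iterations, so 50 epochs give 500 updates.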

Data & features

Tensor — an n-dimensional array; rank = number of dimensions. Ch 2
Shape — the size along each dimension; (n_samples, n_features) by convention. Ch 2
Broadcasting — auto-stretching size-1 dimensions so arrays of different shapes combine without loops. Ch 2
Feature — one numeric property of a sample; the columns of X. Ch 3
Feature extraction / engineering / selection — create features / hand-craft new ones / pick a subset. Ch 3
One-hot encoding — a 0/1 column per category, no false ordering. Ch 3
TF-IDF — text features weighting rare-but-distinctive words up, common words down. Ch 3
Standardization (z-score) — per-feature mean 0, std 1. Ch 3
Normalization — rescale to unit length (L2) or to [0,1] (min-max); always ask which one is meant. Ch 3
Embedding — a learned dense vector where geometric closeness = semantic similarity. Ch 14
Representation learning — the model invents features instead of you crafting them. Ch 14
Data leakage — test information sneaking into training; inflates offline scores. Ch 16
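
Several of the entries above fit in a line of NumPy each. The sketch below uses a made-up 3×2 matrix and a made-up color column; the variable names are illustrative only.

```python
import numpy as np

X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])      # shape (n_samples, n_features) = (3, 2)

# Standardization: broadcasting stretches the (2,) mean and std across all 3 rows.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Min-max normalization: each feature rescaled to [0, 1].
X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# L2 normalization: each row rescaled to unit length.
X_unit = X / np.linalg.norm(X, axis=1, keepdims=True)

# One-hot encoding: one 0/1 column per category, no implied ordering.
colors = np.array(["red", "green", "red", "blue"])
categories, codes = np.unique(colors, return_inverse=True)
one_hot = np.eye(len(categories))[codes]

print(categories)   # column order of the one-hot matrix
print(one_hot)
```

To avoid leakage in a real pipeline, the mean, std, min, and max would be computed on the training rows only and then reused on validation and test rows.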

Similarity & math

Dot product — Σ aᵢbᵢ; similarity that grows with alignment and magnitude. Ch 4
Cosine similarity — dot of normalized vectors; direction only, [-1, 1]. Ch 4
Euclidean / Manhattan distance — straight-line (L2) / city-block (L1) distance. Ch 4
RBF / Gaussian kernel — exp(-γ‖a-b‖²); turns distance into tunable similarity. Ch 4
Kernel (ML) — a similarity function that's secretly a dot product in a richer space; the kernel trick uses that space without building it. Ch 5
Kernel (four other meanings) — convolution filter / GPU function / OS core / statistical smoothing window. Ch 5
Norm — a vector's length; ‖v‖₂ = √Σvᵢ². Ch 4
Logit — a raw pre-sigmoid/softmax score in (-∞, ∞). Ch 6
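
The similarity measures above, computed on two small vectors chosen so the arithmetic is easy to check by hand. The vectors and the value of `gamma` are assumptions for illustration.

```python
import numpy as np

a = np.array([1.0, 2.0, 2.0])
b = np.array([2.0, 4.0, 4.0])    # same direction as a, twice the length

dot = a @ b                                              # 2 + 8 + 8 = 18
cosine = dot / (np.linalg.norm(a) * np.linalg.norm(b))   # 18 / (3 * 6) = 1
euclidean = np.linalg.norm(a - b)                        # length of (-1, -2, -2) = 3
manhattan = np.abs(a - b).sum()                          # 1 + 2 + 2 = 5

gamma = 0.5                                              # illustrative kernel width
rbf = np.exp(-gamma * euclidean ** 2)                    # exp(-4.5)

print(dot, cosine, euclidean, manhattan, rbf)
```

The dot product grows with magnitude (18), while cosine similarity sees only direction (1.0), which is why the two can rank the same pair of vectors differently.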

Models & training

Linear regression — predict a number with w·x + b. Ch 6
Logistic regression — predict a probability via sigmoid(w·x + b); a classifier. Ch 6
Sigmoid / softmax — squash to one probability / to competing probs summing to 1. Ch 6
Cross-entropy (= log loss = NLL) — the classification loss; punishes confident mistakes. Ch 7
MSE / MAE / Huber — regression losses: squared / absolute / hybrid. Ch 7
Gradient — the vector of partial derivatives; points uphill in loss. Ch 8
Gradient descent — θ -= lr · gradient; the optimization engine. Ch 8
Learning rate — step size; too big diverges, too small crawls. Ch 8
SGD / mini-batch / momentum / Adam — the optimizer family; Adam is the usual default. Ch 8
Backpropagation — the chain rule applied efficiently to get all gradients. Ch 8
Autograd — software that does backprop automatically. Ch 12
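
As a sketch of how these pieces fit together, the block below trains logistic regression with plain gradient descent, using a gradient derived by hand with the chain rule (the job autograd would otherwise do). The four data points, the learning rate, and the function names are assumptions chosen for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(w, b, X, y):
    # Cross-entropy for binary labels.
    p = sigmoid(X @ w + b)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def gradients(w, b, X, y):
    # Chain rule by hand: d(loss)/d(logit) = (p - y) / n,
    # then d(logit)/dw = x and d(logit)/db = 1.
    p = sigmoid(X @ w + b)
    d_logit = (p - y) / len(y)
    return X.T @ d_logit, d_logit.sum()

# Toy data (made up): the label equals the second feature.
X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]])
y = np.array([1.0, 0.0, 1.0, 0.0])

w, b, lr = np.zeros(2), 0.0, 1.0
initial_loss = log_loss(w, b, X, y)     # every p is 0.5 at the start, so loss = log(2)

for _ in range(500):
    grad_w, grad_b = gradients(w, b, X, y)
    w -= lr * grad_w                    # theta -= lr * gradient
    b -= lr * grad_b

final_loss = log_loss(w, b, X, y)
print(initial_loss, final_loss)
```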

Generalization

Overfitting / underfitting — memorizing noise / too simple to capture the pattern. Ch 9
Bias–variance tradeoff — error from wrong assumptions vs. from sensitivity to the sample. Ch 9
Train / validation / test — learn params / tune hyperparams / report once. Ch 9
Cross-validation — rotate which fold is held out so every example is used for both training and validation; gives steadier estimates on small datasets. Ch 9
Regularization (L1 / L2) — penalize weight size; L1 → sparse, L2 → smooth shrinkage (weight decay). Ch 9
Dropout / early stopping / data augmentation — neural-net anti-overfitting tools. Ch 9
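
A minimal sketch of two of these ideas together: L2 regularization in its closed-form (ridge) version, scored by k-fold cross-validation. The synthetic data, the candidate λ values, and the helper names `ridge_fit` and `kfold_mse` are all assumptions for illustration; the model has no intercept because the toy data is centered near zero.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
y = X[:, 0] + 0.1 * rng.normal(size=30)    # only feature 0 matters (made-up data)

def ridge_fit(X, y, lam):
    # L2-regularized least squares: solve (X^T X + lam * I) w = X^T y.
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

def kfold_mse(X, y, lam, k=5):
    # Rotate the held-out fold; train on the rest; average the validation error.
    folds = np.array_split(np.arange(len(X)), k)
    scores = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        w = ridge_fit(X[train], y[train], lam)
        scores.append(np.mean((X[val] @ w - y[val]) ** 2))
    return float(np.mean(scores))

cv_scores = {lam: kfold_mse(X, y, lam) for lam in (0.0, 1.0, 100.0)}
w_small = ridge_fit(X, y, 0.01)
w_large = ridge_fit(X, y, 100.0)

print(cv_scores)
print(np.linalg.norm(w_small), np.linalg.norm(w_large))
```

A larger λ always shrinks the weight vector; whether that helps or hurts validation error is exactly what the cross-validated scores are for.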

Metrics

Accuracy — fraction correct; misleading on imbalanced data. Ch 10
Confusion matrix — TP / FP / FN / TN; the source of all classification metrics. Ch 10
Precision / recall — trust of your alarms / coverage of true positives. Ch 10
F1 — harmonic mean of precision and recall; high only if both are. Ch 10
ROC-AUC — threshold-free ranking quality; 0.5 = random, 1.0 = perfect. Ch 10
RMSE / MAE / R² — regression error (squared / absolute) / variance explained. Ch 10
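
The classification metrics above all fall out of four counts. The labels and predictions below are made up so the arithmetic is easy to follow.

```python
import numpy as np

y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])   # 3 positives, 7 negatives
y_pred = np.array([1, 1, 0, 1, 1, 0, 0, 0, 0, 0])

# The four cells of the confusion matrix.
tp = int(np.sum((y_true == 1) & (y_pred == 1)))     # 2
fp = int(np.sum((y_true == 0) & (y_pred == 1)))     # 2
fn = int(np.sum((y_true == 1) & (y_pred == 0)))     # 1
tn = int(np.sum((y_true == 0) & (y_pred == 0)))     # 5

accuracy = (tp + tn) / len(y_true)                  # 7/10
precision = tp / (tp + fp)                          # 2/4: how far to trust an alarm
recall = tp / (tp + fn)                             # 2/3: share of true positives caught
f1 = 2 * precision * recall / (precision + recall)  # 4/7

# A model that always predicts 0 reaches the same accuracy with zero recall.
baseline_accuracy = float(np.mean(y_true == 0))     # 7/10

print(accuracy, precision, recall, f1, baseline_accuracy)
```

The always-negative baseline matching the model's accuracy is the imbalanced-data trap in miniature: only precision, recall, and F1 tell the two apart.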

Neural networks & architectures

Neuron / layer / MLP — activation(w·x+b) / many neurons / stacked layers. Ch 11
Activation function — the non-linearity that makes depth meaningful (ReLU, GELU, tanh). Ch 11
Vanishing/exploding gradient — gradients shrinking to 0 / blowing up through deep nets. Ch 11
Weight initialization — small scaled random starts (Xavier/He) so training begins well. Ch 11
Inductive bias — a built-in assumption about the data an architecture encodes. Ch 13
CNN / convolution / filter — local, position-independent pattern detector for grids. Ch 13
RNN / LSTM / GRU — sequence models carrying a hidden state; gates for long memory. Ch 13
Transformer / self-attention / Q,K,V — softmax(QKᵀ/√d)·V; everything attends to everything. Ch 13
Diffusion / GAN / autoencoder / GNN / MoE — generative & specialized architectures. Ch 13
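
The self-attention formula above, softmax(QKᵀ/√d)·V, is short enough to write out directly. This is a single-head sketch with no masking or batching; the random inputs, the sizes, and the weight-matrix names `Wq`, `Wk`, `Wv` are assumptions for illustration.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)    # subtract the max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # Project the same input three ways: queries, keys, values.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d))    # (seq_len, seq_len): who attends to whom
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8
X = rng.normal(size=(seq_len, d_model))        # one made-up sequence of 4 token vectors
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))

out, weights = self_attention(X, Wq, Wk, Wv)
print(out.shape, weights.shape)                # (4, 8) (4, 4)
print(weights.sum(axis=1))                     # each row of attention weights sums to 1
```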

LLMs & the modern stack

Token / tokenization — the model's subword unit; text → integer IDs. Ch 15
Context window — max tokens the model attends to at once; its working memory. Ch 15
Temperature / top-k / top-p — sampling knobs trading focus for creativity. Ch 15
Pretraining / fine-tuning / RLHF — broad learning / specialization / alignment. Ch 15
RAG — retrieve relevant docs into the prompt; adds knowledge. Ch 15
Hallucination — confident falsehood; the model optimizes plausibility, not truth. Ch 15
Quantization — fewer bits per weight to shrink/speed models. Ch 15
Agent / tool use — an LLM that calls functions in a loop to act. Ch 15
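
The three sampling knobs can be sketched as filters on the next-token distribution. This is one plausible way to combine them, not the behavior of any particular library; the function name `next_token_probs` and the four logits are made up.

```python
import numpy as np

def next_token_probs(logits, temperature=1.0, top_k=None, top_p=None):
    # Temperature rescales logits: below 1 sharpens, above 1 flattens.
    z = np.asarray(logits, dtype=float) / temperature
    e = np.exp(z - z.max())
    probs = e / e.sum()

    order = np.argsort(probs)[::-1]            # token ids, most likely first
    keep = np.ones(len(probs), dtype=bool)
    if top_k is not None:
        keep[order[top_k:]] = False            # keep only the k most likely tokens
    if top_p is not None:
        cumulative = np.cumsum(probs[order])
        # Keep the smallest prefix whose total probability reaches top_p.
        cutoff = int(np.searchsorted(cumulative, top_p)) + 1
        keep[order[cutoff:]] = False

    probs = np.where(keep, probs, 0.0)
    return probs / probs.sum()                 # renormalize over surviving tokens

logits = [2.0, 1.0, 0.0, -1.0]
base = next_token_probs(logits)
sharp = next_token_probs(logits, temperature=0.5)
k2 = next_token_probs(logits, top_k=2)
p80 = next_token_probs(logits, top_p=0.8)

rng = np.random.default_rng(0)
token = int(rng.choice(len(k2), p=k2))         # sample one token id
print(base.round(3), sharp.round(3), k2.round(3), p80.round(3), token)
```

On these logits, `top_k=2` and `top_p=0.8` happen to keep the same two tokens, because the top two already hold about 88% of the probability mass.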

Tools & engineering

NumPy — Python's array math library; the foundation of everything here. Ch 0
PyTorch — tensors + autograd + nn.Module + optimizers; the deep-learning framework. Ch 12
nn.Module / DataLoader — your model class / the batching+shuffling iterator. Ch 12, Ch 16
GPU / CUDA / .to(device) — parallel hardware / its programming model / moving tensors to it. Ch 16
model.train() / .eval() / no_grad() — dropout and batch norm in training mode / in inference mode / skip building the gradient graph. Ch 12
Mixed precision / gradient accumulation / checkpointing — speed, memory, and crash-safety tricks. Ch 16
Seed / reproducibility — fix randomness so runs and comparisons are trustworthy. Ch 16
MLflow / Weights & Biases — experiment tracking for honest comparisons. Ch 16
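
To show what "batching + shuffling" and "fix the seed" mean in practice, here is a NumPy stand-in for the job a DataLoader does. It is deliberately not PyTorch's API: `iterate_batches` and its `seed` argument are hypothetical names for illustration.

```python
import numpy as np

def iterate_batches(X, y, batch_size, seed):
    # Shuffle once with a seeded generator, then yield consecutive slices.
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        yield X[idx], y[idx]

X = np.arange(20, dtype=float).reshape(10, 2)   # 10 made-up samples, 2 features
y = np.arange(10)

run_a = [yb.tolist() for _, yb in iterate_batches(X, y, batch_size=4, seed=42)]
run_b = [yb.tolist() for _, yb in iterate_batches(X, y, batch_size=4, seed=42)]
run_c = [yb.tolist() for _, yb in iterate_batches(X, y, batch_size=4, seed=7)]

print(run_a == run_b)                 # same seed, same batches
print([len(batch) for batch in run_a])
print(run_c)                          # a different seed gives a different order
```

Ten samples with a batch size of 4 give batches of 4, 4, and 2; the short last batch is something real loaders let you keep or drop.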

Numerical literacy

Floating point — approximate arithmetic; compare with isclose, never ==. Ch 17
Log-sum-exp / stable softmax — subtract the max before exp to avoid overflow. Ch 17
NaN / inf — poison values; usual cause is a too-high learning rate or log(0). Ch 17
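
All three entries can be reproduced in a few lines. The helper names `logsumexp` and `stable_softmax` are written here for illustration, and the large logits are made up to force an overflow.

```python
import math
import numpy as np

# Floating point: 0.1 + 0.2 is not exactly 0.3.
floats_equal = (0.1 + 0.2 == 0.3)              # False
floats_close = math.isclose(0.1 + 0.2, 0.3)    # True

def logsumexp(z):
    # log(sum(exp(z))) computed after subtracting the max, so exp never overflows.
    m = np.max(z)
    return m + np.log(np.sum(np.exp(z - m)))

def stable_softmax(z):
    return np.exp(z - logsumexp(z))

big = np.array([1000.0, 1001.0, 1002.0])

# Naive softmax: exp(1000) overflows to inf, and inf / inf is NaN.
with np.errstate(over="ignore", invalid="ignore"):
    naive = np.exp(big) / np.exp(big).sum()

stable = stable_softmax(big)

print(floats_equal, floats_close)
print(naive)     # all NaN
print(stable)    # a valid probability distribution
```

Softmax depends only on differences between logits, so the stable result for 1000, 1001, 1002 is identical to the softmax of 0, 1, 2.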

Classical ML algorithms

k-NN — classify by majority vote of the k nearest training points; no training. Ch 20
Naive Bayes — Bayes' rule assuming feature independence; fast text baseline. Ch 20
Decision tree — recursive if/else splits chosen to reduce impurity (Gini / entropy). Ch 20
Random forest / bagging — many independent trees averaged; reduces variance. Ch 20
Gradient boosting (XGBoost/LightGBM) — sequential error-correcting trees; reduces bias; king of tabular ML. Ch 20
SVM / support vectors / margin — max-margin classifier; kernels for non-linear boundaries. Ch 20
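k-means — partition data into k round clusters by alternating two steps: assign each point to its nearest centroid, then move each centroid to the mean of its points. Ch 21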
DBSCAN — density-based clustering; finds arbitrary shapes & outliers, no k. Ch 21
PCA — project onto top variance directions (via SVD) to reduce dimensions. Ch 21
t-SNE / UMAP — non-linear methods for visualizing high-dim data (not as features). Ch 21
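
Two of these algorithms are small enough to sketch from scratch: k-NN (no training, just a vote) and k-means (assign, then move). The two-blob dataset, the initial centroids, and the function names are assumptions for illustration; the k-means sketch assumes no cluster ever ends up empty, which holds for this data.

```python
import numpy as np

# Two well-separated blobs (made-up data).
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0],
              [8.0, 8.0], [8.0, 9.0], [9.0, 8.0], [9.0, 9.0]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

def knn_predict(X_train, y_train, x, k=3):
    # Majority vote among the k nearest training points.
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]
    return int(np.argmax(np.bincount(y_train[nearest])))

def kmeans(X, init_centroids, n_steps=10):
    centroids = init_centroids.astype(float).copy()
    for _ in range(n_steps):
        # Assign: each point goes to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move: each centroid becomes the mean of its assigned points.
        centroids = np.array([X[labels == j].mean(axis=0)
                              for j in range(len(centroids))])
    return centroids, labels

pred_low = knn_predict(X, y, np.array([0.4, 0.6]))
pred_high = knn_predict(X, y, np.array([8.2, 9.1]))
centroids, labels = kmeans(X, init_centroids=X[[0, 4]])

print(pred_low, pred_high)
print(centroids)
print(labels)
```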

Math & statistics

Expectation / variance — long-run mean / spread of a random variable. Ch 22
Central Limit Theorem — averages of many independent samples become approximately Normal; uncertainty shrinks like 1/√n. Ch 22
Bayes' theorem — update belief with evidence: posterior ∝ likelihood × prior. Ch 22
MLE / MAP — parameters that maximize data likelihood / posterior; the origin of losses & L2. Ch 22
p-value — P(data at least this extreme | null true); not P(null), not effect size. Ch 22
Eigenvector / eigenvalue — a direction a matrix only scales, and by how much. Ch 23
SVD — factor any matrix; gives the best low-rank approximation (PCA, LoRA). Ch 23
Rank — number of independent directions in a matrix; low rank = compressible. Ch 23
A/B test / power / guardrail — randomized online experiment; chance of catching a real effect; metrics that must not regress. Ch 24
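
Two of the statistics entries are easy to check numerically: Bayes' theorem with concrete numbers, and the 1/√n shrinkage from the Central Limit Theorem. The test accuracy figures, the 1% prevalence, and the simulation sizes are assumptions invented for this sketch.

```python
import numpy as np

# Bayes' theorem: a made-up test with 95% sensitivity and a 5% false-positive
# rate, for a condition with 1% prevalence.
prior = 0.01
p_pos_given_sick = 0.95
p_pos_given_healthy = 0.05

evidence = p_pos_given_sick * prior + p_pos_given_healthy * (1 - prior)
posterior = p_pos_given_sick * prior / evidence     # P(sick | positive)

# Central Limit Theorem: the spread of a sample mean shrinks like 1/sqrt(n).
rng = np.random.default_rng(0)

def std_of_sample_means(n, trials=5000):
    samples = rng.uniform(0, 1, size=(trials, n))   # Uniform(0, 1) has std sqrt(1/12)
    return samples.mean(axis=1).std()

std_n4 = std_of_sample_means(4)
std_n100 = std_of_sample_means(100)
ratio = std_n4 / std_n100                           # theory says sqrt(100 / 4) = 5

print(round(posterior, 3))
print(std_n4, std_n100, ratio)
```

Even with a positive result the posterior is only about 16%, because the 1% prior is so small; the likelihood alone would suggest otherwise.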

Tools, cloud & the modern landscape

Pandas / SQL / scikit-learn — wrangling (groupby) / querying data / the fit/predict API. Ch 25
Hyperparameter search (grid/random/Optuna) — tune knobs by cross-validated search. Ch 25
MLOps / drift / feature store — productionizing models / input or concept shift / consistent train-serve features. Ch 26
Online vs. batch serving — low-latency one-at-a-time vs. high-throughput scheduled. Ch 26
LoRA / PEFT / quantization / distillation — cheap fine-tuning (LoRA is one PEFT method) / fewer-bit weights / small student model. Ch 27
Vector database — store + nearest-neighbor search over embeddings; the memory of RAG. Ch 27
Agent / tool use / RLHF / DPO / MoE / diffusion — LLM that acts by calling tools / preference-based alignment (RLHF, DPO) / sparsely routed experts / denoising generation. Ch 27
Two-stage retrieval (candidate generation → ranking) — the standard scalable recommender/feed pattern. Ch 28
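
The core operation behind a vector database is nearest-neighbor search over embeddings. The sketch below does it by brute force with cosine similarity, which is a stand-in: real systems use approximate indexes to scale. The documents, the 3-dimensional "embeddings", and the function name `top_k_cosine` are all made up for illustration.

```python
import numpy as np

def top_k_cosine(query, vectors, k):
    # Normalize, then a single matrix-vector product gives every cosine score.
    q = query / np.linalg.norm(query)
    V = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    scores = V @ q
    best = np.argsort(scores)[::-1][:k]       # indices of the k highest scores
    return best, scores[best]

docs = ["cats purr", "dogs bark", "stocks fell", "kittens meow"]
embeddings = np.array([[0.9, 0.1, 0.0],       # hand-written toy vectors,
                       [0.7, 0.6, 0.0],       # not output of a real model
                       [0.0, 0.1, 0.9],
                       [0.8, 0.2, 0.1]])

query = np.array([1.0, 0.1, 0.0])             # pretend embedding of a cat question
ids, scores = top_k_cosine(query, embeddings, k=2)

retrieved = [docs[i] for i in ids]
print(retrieved, scores.round(3))
```

In a RAG setup the retrieved texts would be pasted into the prompt; in a two-stage recommender this cheap search would be the candidate-generation step, with a heavier model ranking the survivors.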

How to use this glossary

In an interview or a design review, you don't need the derivation of every term. You need the one-liner plus the intuition, and the confidence that you could code it. That's exactly what this book built. If any line above still feels hollow, follow its chapter pointer, re-read the worked example, and run the code. Then you'll own the word, not just recognize it.

That's the whole foundation. Go build something. 👉

AI Foundations in Depth