Glossary: words an interviewer assumes you know
A fast lookup for every term this book defined, plus a few you'll meet in the wild. One line each, with a pointer to the chapter that explains it properly. If you can give the one-liner and sketch the code, you're fluent.
The learning loop
- Model — a function with adjustable numbers (parameters) inside. Ch 1
- Parameter / weight — a number the model learns via training. Ch 1
- Hyperparameter — a knob you set and hold fixed (learning rate, #layers). Ch 1
- Bias (the term +b) — the constant added in w·x + b; shifts the output. Ch 6
- Training / fitting — adjusting parameters to minimize the loss. Ch 1
- Inference / prediction — running a trained model on new input (no learning). Ch 1
- Loss / cost / objective — the single number measuring current wrongness; training minimizes it. Ch 7
- Metric — what you report (accuracy, F1); needn't be differentiable. Ch 10
- Supervised / unsupervised / reinforcement — learn from labels / from structure / from rewards. Ch 1
- Epoch / iteration / batch — full data pass / one update / examples per update. Ch 16
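The terms above fit together in one small loop. What follows is a minimal sketch in plain Python, assuming a one-feature linear model and mean squared error; the names `train`, `predict`, `lr`, and `batch_size` are illustrative and not taken from the book's chapters.

```python
import random

def train(xs, ys, lr=0.05, epochs=500, batch_size=2, seed=0):
    """Fit y ≈ w*x + b by mini-batch gradient descent on mean squared error."""
    rng = random.Random(seed)              # fixed seed so the run is repeatable
    w, b = 0.0, 0.0                        # parameters: learned during training
    idx = list(range(len(xs)))
    for _ in range(epochs):                # epoch: one full pass over the data
        rng.shuffle(idx)
        for start in range(0, len(idx), batch_size):
            batch = idx[start:start + batch_size]   # batch: examples per update
            # Gradients of the batch MSE with respect to w and b.
            gw = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            gb = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            w -= lr * gw                   # iteration: one parameter update
            b -= lr * gb
    return w, b

def predict(w, b, x):
    """Inference: apply the trained parameters to a new input, no learning."""
    return w * x + b

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]                  # generated by y = 2x + 1
w, b = train(xs, ys)
```

Here `lr`, `epochs`, and `batch_size` are the hyperparameters (set by you), while `w` and `b` are the parameters (set by training).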
Data & features
- Tensor — an n-dimensional array; rank = number of dimensions. Ch 2
- Shape — the size along each dimension; (n_samples, n_features) by convention. Ch 2
- Broadcasting — auto-stretching size-1 dimensions so arrays of different shapes combine without loops. Ch 2
- Feature — one numeric property of a sample; the columns of X. Ch 3
- Feature extraction / engineering / selection — create features / hand-craft new ones / pick a subset. Ch 3
- One-hot encoding — a 0/1 column per category, no false ordering. Ch 3
- TF-IDF — text features weighting rare-but-distinctive words up, common words down. Ch 3
- Standardization (z-score) — per-feature mean 0, std 1. Ch 3
- Normalization — rescale to unit length (L2) or to [0,1] (min-max); ask which one is meant. Ch 3
- Embedding — a learned dense vector where geometric closeness = semantic similarity. Ch 14
- Representation learning — the model invents features instead of you crafting them. Ch 14
- Data leakage — test information sneaking into training; inflates offline scores. Ch 16
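Three of the entries above (one-hot encoding, standardization, data leakage) can be sketched in a few lines. This is an illustrative plain-Python version, not the book's code; the helper names are hypothetical. The point to notice is that the mean and standard deviation come from the training column only and are then reused on the test column, which is one way to avoid leakage.

```python
import math

def one_hot(values):
    """Map each category to a 0/1 vector with one column per category."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return rows, categories

def fit_standardizer(column):
    """Compute mean and standard deviation on TRAINING data only."""
    mean = sum(column) / len(column)
    var = sum((x - mean) ** 2 for x in column) / len(column)
    return mean, math.sqrt(var)

def standardize(column, mean, std):
    """Apply a previously fitted z-score transform."""
    return [(x - mean) / std for x in column]

encoded, cats = one_hot(["red", "green", "red", "blue"])

train_col = [2.0, 4.0, 6.0, 8.0]
test_col = [10.0]
mean, std = fit_standardizer(train_col)        # statistics from train only
train_z = standardize(train_col, mean, std)
test_z = standardize(test_col, mean, std)      # reuse them on test
```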
Similarity & math
- Dot product — Σ aᵢbᵢ; similarity that grows with alignment and magnitude. Ch 4
- Cosine similarity — dot of normalized vectors; direction only, [-1, 1]. Ch 4
- Euclidean / Manhattan distance — straight-line (L2) / city-block (L1) distance. Ch 4
- RBF / Gaussian kernel — exp(-γ‖a-b‖²); turns distance into tunable similarity. Ch 4
- Kernel (ML) — a similarity function that's secretly a dot product in a richer space; the kernel trick uses that space without building it. Ch 5
- Kernel (four other meanings) — conv filter / GPU function / OS core / stats window. Ch 5
- Norm — a vector's length; ‖v‖₂ = √Σvᵢ². Ch 4
- Logit — a raw pre-sigmoid/softmax score in (-∞, ∞). Ch 6
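The similarity and distance formulas above translate almost symbol for symbol into code. A small sketch using only the standard library follows; the function names are chosen for illustration, and `gamma` stands in for γ.

```python
import math

def dot(a, b):
    """Σ aᵢbᵢ"""
    return sum(x * y for x, y in zip(a, b))

def norm(v):
    """L2 norm: sqrt(Σ vᵢ²)"""
    return math.sqrt(dot(v, v))

def cosine(a, b):
    """Dot product of the normalized vectors; depends on direction only."""
    return dot(a, b) / (norm(a) * norm(b))

def euclidean(a, b):
    """Straight-line (L2) distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """City-block (L1) distance."""
    return sum(abs(x - y) for x, y in zip(a, b))

def rbf(a, b, gamma=1.0):
    """exp(-γ‖a-b‖²): equals 1 for identical points and decays with distance."""
    return math.exp(-gamma * euclidean(a, b) ** 2)

a, b = [3.0, 4.0], [6.0, 8.0]   # same direction, different magnitude
```

With these two vectors the dot product grows with magnitude while the cosine similarity stays at its maximum, which is the distinction the two entries draw.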
Models & training
- Linear regression — predict a number with w·x + b. Ch 6
- Logistic regression — predict a probability via sigmoid(w·x + b); a classifier. Ch 6
- Sigmoid / softmax — squash to one probability / to competing probs summing to 1. Ch 6
- Cross-entropy (= log loss = NLL) — the classification loss; punishes confident mistakes. Ch 7
- MSE / MAE / Huber — regression losses: squared / absolute / hybrid. Ch 7
- Gradient — the vector of partial derivatives; points uphill in loss. Ch 8
- Gradient descent — θ -= lr · gradient; the optimization engine. Ch 8
- Learning rate — step size; too big diverges, too small crawls. Ch 8
- SGD / mini-batch / momentum / Adam — the optimizer family; Adam is a common default. Ch 8
- Backpropagation — the chain rule applied efficiently to get all gradients. Ch 8
- Autograd — software that does backprop automatically. Ch 12
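To connect logistic regression, cross-entropy, and the gradient, here is a hedged sketch for a single example with one feature. The analytic gradient comes from applying the chain rule by hand (what backpropagation automates), and a finite-difference estimate checks it. The function names and the sample values are assumptions made for illustration.

```python
import math

def sigmoid(z):
    """Squash a logit into a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def bce(w, b, x, y):
    """Cross-entropy for one example under logistic regression."""
    p = sigmoid(w * x + b)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def bce_grad(w, b, x, y):
    """Chain rule result: dL/dw = (p - y) * x and dL/db = (p - y)."""
    p = sigmoid(w * x + b)
    return (p - y) * x, (p - y)

w, b, x, y = 0.5, -0.2, 1.5, 1
gw, gb = bce_grad(w, b, x, y)

# Central finite difference as an independent check of dL/dw.
eps = 1e-6
gw_numeric = (bce(w + eps, b, x, y) - bce(w - eps, b, x, y)) / (2 * eps)

# One gradient descent step: θ -= lr · gradient.
lr = 0.1
w_new, b_new = w - lr * gw, b - lr * gb
```

The loss at `(w_new, b_new)` should be lower than at `(w, b)` for a step size this small, which is what "points uphill in loss" implies once you step the other way.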
Generalization
- Overfitting / underfitting — memorizing noise / too simple to capture the pattern. Ch 9
- Bias–variance tradeoff — error from wrong assumptions vs. from sensitivity to the sample. Ch 9
- Train / validation / test — learn params / tune hyperparams / report once. Ch 9
- Cross-validation — rotate held-out folds to get a steadier estimate from a small dataset. Ch 9
- Regularization (L1 / L2) — penalize weight size; L1 → sparse, L2 → smooth shrinkage (weight decay). Ch 9
- Dropout / early stopping / data augmentation — neural-net anti-overfitting tools. Ch 9
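The fold rotation behind cross-validation is easy to get subtly wrong, so a small sketch may help. This version only produces index splits and assumes the data has already been shuffled; `k_fold_indices` is a hypothetical helper, not a library function.

```python
def k_fold_indices(n, k):
    """Yield (train_idx, val_idx) pairs; every index is held out exactly once."""
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, val
        start += size

folds = list(k_fold_indices(10, 3))
```

In use, you would fit on each `train` split, score on the matching `val` split, and average the scores; the test set stays out of this loop entirely.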
Metrics
- Accuracy — fraction correct; misleading on imbalanced data. Ch 10
- Confusion matrix — TP / FP / FN / TN; the source of all classification metrics. Ch 10
- Precision / recall — how often your alarms are right / how many true positives you catch. Ch 10
- F1 — harmonic mean of precision and recall; high only if both are. Ch 10
- ROC-AUC — threshold-free ranking quality; 0.5 = random, 1.0 = perfect. Ch 10
- RMSE / MAE / R² — regression error (squared / absolute) / variance explained. Ch 10
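The classification metrics above all fall out of four counts. The sketch below assumes binary labels encoded as 0 and 1 and uses a small made-up, imbalanced example to show how accuracy can look fine while recall is poor.

```python
def confusion(y_true, y_pred):
    """Return (TP, FP, FN, TN) for binary labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn

def precision_recall_f1(tp, fp, fn):
    """Guard each ratio against an empty denominator."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Illustrative data: 3 positives out of 10, only 1 of them caught.
y_true = [1, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [1, 0, 0, 0, 0, 0, 0, 1, 0, 0]

tp, fp, fn, tn = confusion(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
```

Here accuracy is 0.7 even though two of the three positives were missed, which is why the glossary flags accuracy as misleading on imbalanced data.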
Neural networks & architectures
- Neuron / layer / MLP — activation(w·x+b) / many neurons / stacked layers. Ch 11
- Activation function — the non-linearity that makes depth meaningful (ReLU, GELU, tanh). Ch 11
- Vanishing/exploding gradient — gradients shrinking to 0 / blowing up through deep nets. Ch 11
- Weight initialization — small scaled random starts (Xavier/He) so training begins well. Ch 11
- Inductive bias — a built-in assumption about the data an architecture encodes. Ch 13
- CNN / convolution / filter — local, position-independent pattern detector for grids. Ch 13
- RNN / LSTM / GRU — sequence models carrying a hidden state; gates for long memory. Ch 13
- Transformer / self-attention / Q,K,V — softmax(QKᵀ/√d)·V; everything attends to everything. Ch 13
- Diffusion / GAN / autoencoder / GNN / MoE — generative & specialized architectures. Ch 13
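The attention formula softmax(QKᵀ/√d)·V can be written out with nested lists to make each step visible. This is a deliberately simplified sketch: it uses Q = K = V = X with no learned projection matrices, no masking, and a single head, all of which are assumptions made to keep the example short.

```python
import math

def matmul(A, B):
    """Plain matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(M):
    return [list(col) for col in zip(*M)]

def softmax_rows(M):
    """Row-wise softmax, subtracting each row's max for numerical stability."""
    out = []
    for row in M:
        m = max(row)
        exps = [math.exp(v - m) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QKᵀ/√d)·V."""
    d = len(Q[0])
    scores = matmul(Q, transpose(K))                      # token-to-token similarity
    scaled = [[s / math.sqrt(d) for s in row] for row in scores]
    weights = softmax_rows(scaled)                        # each row sums to 1
    return matmul(weights, V), weights

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]                  # 3 tokens, d = 2
out, weights = attention(X, X, X)
```

Each row of `weights` says how much one token attends to every token, and each row of `out` is the corresponding weighted mix of the value vectors.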
LLMs & the modern stack
- Token / tokenization — the model's subword unit; text → integer IDs. Ch 15
- Context window — max tokens the model attends to at once; its working memory. Ch 15
- Temperature / top-k / top-p — sampling knobs trading focus for creativity. Ch 15
- Pretraining / fine-tuning / RLHF — broad learning / specialization / alignment. Ch 15
- RAG — retrieve relevant docs into the prompt; adds knowledge. Ch 15
- Hallucination — confident falsehood; the model optimizes plausibility, not truth. Ch 15
- Quantization — fewer bits per weight to shrink/speed models. Ch 15
- Agent / tool use — an LLM that calls functions in a loop to act. Ch 15
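The three sampling knobs above can be sketched as a filter over next-token probabilities. This is one plausible ordering (temperature, then top-k, then top-p) under the assumption that logits arrive as a plain list indexed by token ID; real inference libraries differ in details, and the function names here are hypothetical.

```python
import math
import random

def next_token_probs(logits, temperature=1.0, top_k=None, top_p=None):
    """Return {token_id: prob} after temperature scaling and top-k / top-p filtering."""
    scaled = [z / temperature for z in logits]     # low temperature sharpens
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    # (probability, token_id) pairs, most likely first.
    probs = sorted(((e / total, i) for i, e in enumerate(exps)), reverse=True)
    if top_k is not None:
        probs = probs[:top_k]                      # keep the k most likely tokens
    if top_p is not None:
        kept, cum = [], 0.0
        for p, i in probs:                         # smallest prefix with mass >= top_p
            kept.append((p, i))
            cum += p
            if cum >= top_p:
                break
        probs = kept
    norm = sum(p for p, _ in probs)                # renormalize the survivors
    return {i: p / norm for p, i in probs}

def sample(probs, rng):
    """Draw one token ID according to the filtered distribution."""
    ids = list(probs)
    return rng.choices(ids, weights=[probs[i] for i in ids], k=1)[0]

logits = [2.0, 1.0, 0.1, -1.0]
sharp = next_token_probs(logits, temperature=0.1)
top2 = next_token_probs(logits, top_k=2)
nucleus = next_token_probs(logits, top_p=0.5)
token = sample(top2, random.Random(0))
```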
Tools & engineering
- NumPy — Python's array math library; the foundation of everything here. Ch 0
- PyTorch — tensors + autograd + nn.Module + optimizers; the deep-learning framework. Ch 12
- nn.Module / DataLoader — your model class / the batching+shuffling iterator. Ch 12, Ch 16
- GPU / CUDA / .to(device) — parallel hardware / its programming model / moving tensors to it. Ch 16
- model.train() / .eval() / no_grad() — dropout and batch norm in training mode / in inference mode / skip building the grad graph. Ch 12
- Mixed precision / gradient accumulation / checkpointing — speed, memory, and crash-safety tricks. Ch 16
- Seed / reproducibility — fix randomness so runs and comparisons are trustworthy. Ch 16
- MLflow / Weights & Biases — experiment tracking for honest comparisons. Ch 16
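Two of the ideas above, a batching and shuffling iterator and a fixed seed, can be shown without any framework. The generator below is a hypothetical stand-in written for illustration; it is not PyTorch's DataLoader and omits everything that class does beyond slicing shuffled indices.

```python
import random

def batches(data, batch_size, shuffle=True, seed=None):
    """Yield lists of up to batch_size items, optionally in shuffled order."""
    order = list(range(len(data)))
    if shuffle:
        # A private Random instance keeps the global random state untouched.
        random.Random(seed).shuffle(order)
    for start in range(0, len(order), batch_size):
        yield [data[i] for i in order[start:start + batch_size]]

data = list(range(10))
run_a = list(batches(data, batch_size=4, seed=42))
run_b = list(batches(data, batch_size=4, seed=42))   # same seed, same order
```

Because both runs use the same seed, they see identical batches, which is the property that makes two experiments comparable.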
Numerical literacy
- Floating point — approximate arithmetic; compare with isclose, never ==. Ch 17
- Log-sum-exp / stable softmax — subtract the max before exp to avoid overflow. Ch 17
- NaN / inf — poison values; usual cause is a too-high learning rate or log(0). Ch 17
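Both numerical habits above fit in a short sketch using only the `math` module. The logits are chosen to be large enough that a naive `math.exp` call would overflow; the function names are illustrative.

```python
import math

# Floating point is approximate, so exact equality is fragile.
exact_equal = (0.1 + 0.2 == 0.3)             # False
close_enough = math.isclose(0.1 + 0.2, 0.3)  # True

def logsumexp(xs):
    """log(Σ exp(xᵢ)) computed after subtracting the max, so exp never overflows."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def stable_softmax(xs):
    """Softmax via log-sum-exp; safe for large logits."""
    lse = logsumexp(xs)
    return [math.exp(x - lse) for x in xs]

big = [1000.0, 1001.0, 1002.0]   # math.exp(1000.0) alone raises OverflowError
probs = stable_softmax(big)
```

Subtracting the max does not change the result, because softmax is unaffected by adding the same constant to every logit.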
Classical ML algorithms
- k-NN — classify by majority vote of the k nearest training points; no training. Ch 20
- Naive Bayes — Bayes' rule assuming feature independence; fast text baseline. Ch 20
- Decision tree — recursive if/else splits chosen to reduce impurity (Gini / entropy). Ch 20
- Random forest / bagging — many independent trees averaged; reduces variance. Ch 20
- Gradient boosting (XGBoost/LightGBM) — sequential error-correcting trees; reduces bias; king of tabular ML. Ch 20
- SVM / support vectors / margin — max-margin classifier; kernels for non-linear boundaries. Ch 20
- k-means — partition data into k round clusters (assign/move). Ch 21
- DBSCAN — density-based clustering; finds arbitrary shapes & outliers, no k. Ch 21
- PCA — project onto top variance directions (via SVD) to reduce dimensions. Ch 21
- t-SNE / UMAP — non-linear methods for visualizing high-dim data (not as features). Ch 21
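Of the algorithms above, k-NN is the one that fits in a handful of lines, since it has no training step. The sketch below assumes Euclidean distance and a simple majority vote; the toy points and labels are invented for illustration.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Label the query by majority vote among its k nearest training points."""
    # Sort (distance, label) pairs so the closest points come first.
    dists = sorted((math.dist(x, query), label) for x, label in zip(train_X, train_y))
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

train_X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = ["a", "a", "a", "b", "b", "b"]

near_origin = knn_predict(train_X, train_y, (0.5, 0.5))
near_far_cluster = knn_predict(train_X, train_y, (5.5, 5.5))
```

All of the cost is paid at prediction time, when every training point is compared against the query.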
Math & statistics
- Expectation / variance — long-run mean / spread of a random variable. Ch 22
- Central Limit Theorem — averages of many independent samples become approximately Normal; uncertainty shrinks like $1/\sqrt{n}$. Ch 22
- Bayes' theorem — update belief with evidence: posterior ∝ likelihood × prior. Ch 22
- MLE / MAP — parameters that maximize data likelihood / posterior; the origin of losses & L2. Ch 22
- p-value — P(data at least this extreme | null true); not P(null), not effect size. Ch 22
- Eigenvector / eigenvalue — a direction a matrix only scales, and by how much. Ch 23
- SVD — factor any matrix; gives the best low-rank approximation (PCA, LoRA). Ch 23
- Rank — number of independent directions in a matrix; low rank = compressible. Ch 23
- A/B test / power / guardrail — randomized online experiment; chance of catching a real effect; metrics that must not regress. Ch 24
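Bayes' theorem is the entry here that most rewards working through a number. The sketch below handles a binary hypothesis; the 1% prior, 95% likelihood, and 5% false-positive rate are made-up illustrative values, and `posterior` is a hypothetical helper.

```python
def posterior(prior, likelihood, false_positive_rate):
    """P(H | E) for a binary hypothesis H given evidence E.

    prior               = P(H)
    likelihood          = P(E | H)
    false_positive_rate = P(E | not H)
    """
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence

# A rare condition with a fairly accurate test (illustrative numbers).
p = posterior(prior=0.01, likelihood=0.95, false_positive_rate=0.05)
```

With these inputs the posterior comes out near 16%, far below the 95% likelihood, because the low prior dominates.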
Tools, cloud & the modern landscape
- Pandas / SQL / scikit-learn — wrangling (groupby) / querying data / the fit/predict API. Ch 25
- Hyperparameter search (grid/random/Optuna) — tune knobs by cross-validated search. Ch 25
- MLOps / drift / feature store — productionizing models / input or concept shift / consistent train-serve features. Ch 26
- Online vs. batch serving — low-latency one-at-a-time vs. high-throughput scheduled. Ch 26
- LoRA / PEFT / quantization / distillation — cheap fine-tuning / fewer-bit weights / small student model. Ch 27
- Vector database — store + nearest-neighbor search over embeddings; the memory of RAG. Ch 27
- Agent / tool use / RLHF / DPO / MoE / diffusion — LLM that acts / alignment / scaled experts / denoising generation. Ch 27
- Two-stage retrieval (candidate generation → ranking) — the standard scalable recommender/feed pattern. Ch 28
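Quantization, listed above as "fewer-bit weights", can be illustrated with a symmetric 8-bit scheme. This is one simple variant chosen as an assumption (a single scale per list, integers in [-127, 127]); production libraries use more elaborate per-channel or per-group schemes, and the function names are hypothetical.

```python
def quantize_int8(weights):
    """Symmetric quantization: map floats to integers in [-127, 127] with one scale."""
    # Fall back to a scale of 1.0 if every weight is zero.
    scale = max(abs(w) for w in weights) / 127 or 1.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats from the stored integers."""
    return [v * scale for v in q]

weights = [0.5, -1.27, 0.03, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Each weight now needs one small integer plus a shared scale, and the rounding error per weight is bounded by half the scale.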
How to use this glossary
In an interview or a design review, you don't need the derivation of every term — you need the one-liner plus the intuition, and the confidence that you could code it. That's exactly what this book built. If any line above still feels hollow, follow its link, re-read the worked example, and run the code. Then you'll own the word, not just recognize it.
That's the whole foundation. Go build something.