Glossary: words an interviewer assumes you know

A fast lookup for every term this book defined, plus a few you'll meet in the wild. One line each, with a pointer to the chapter that explains it properly. If you can give the one-liner and sketch the code, you're fluent.

The learning loop

  • Model — a function with adjustable numbers (parameters) inside. Ch 1
  • Parameter / weight — a number the model learns via training. Ch 1
  • Hyperparameter — a knob you set and hold fixed (learning rate, #layers). Ch 1
  • Bias (the term +b) — the constant added in w·x + b; shifts the output. Ch 6
  • Training / fitting — adjusting parameters to minimize the loss. Ch 1
  • Inference / prediction — running a trained model on new input (no learning). Ch 1
  • Loss / cost / objective — the single number measuring current wrongness; training minimizes it. Ch 7
  • Metric — what you report (accuracy, F1); needn't be differentiable. Ch 10
  • Supervised / unsupervised / reinforcement — learn from labels / from structure / from rewards. Ch 1
  • Epoch / iteration / batch — full data pass / one update / examples per update. Ch 16

Data & features

  • Tensor — an n-dimensional array; rank = number of dimensions. Ch 2
  • Shape — the size along each dimension; (n_samples, n_features) by convention. Ch 2
  • Broadcasting — auto-stretching size-1 dimensions so arrays of different shapes combine without loops. Ch 2
  • Feature — one numeric property of a sample; the columns of X. Ch 3
  • Feature extraction / engineering / selection — create features / hand-craft new ones / pick a subset. Ch 3
  • One-hot encoding — a 0/1 column per category, no false ordering. Ch 3
  • TF-IDF — text features weighting rare-but-distinctive words up, common words down. Ch 3
  • Standardization (z-score) — per-feature mean 0, std 1. Ch 3
  • Normalization — to unit length (L2) or to [0,1] (min-max) — ask which. Ch 3
  • Embedding — a learned dense vector where geometric closeness = semantic similarity. Ch 14
  • Representation learning — the model invents features instead of you crafting them. Ch 14
  • Data leakage — test information sneaking into training; inflates offline scores. Ch 16

Similarity & math

  • Dot productΣ aᵢbᵢ; similarity that grows with alignment and magnitude. Ch 4
  • Cosine similarity — dot of normalized vectors; direction only, [-1, 1]. Ch 4
  • Euclidean / Manhattan distance — straight-line (L2) / city-block (L1) distance. Ch 4
  • RBF / Gaussian kernelexp(-γ‖a-b‖²); turns distance into tunable similarity. Ch 4
  • Kernel (ML) — a similarity function that's secretly a dot product in a richer space; the kernel trick uses that space without building it. Ch 5
  • Kernel (other 4) — conv filter / GPU function / OS core / stats window. Ch 5
  • Norm — a vector's length; ‖v‖₂ = √Σvᵢ². Ch 4
  • Logit — a raw pre-sigmoid/softmax score in (-∞, ∞). Ch 6

Models & training

  • Linear regression — predict a number with w·x + b. Ch 6
  • Logistic regression — predict a probability via sigmoid(w·x + b); a classifier. Ch 6
  • Sigmoid / softmax — squash to one probability / to competing probs summing to 1. Ch 6
  • Cross-entropy (= log loss = NLL) — the classification loss; punishes confident mistakes. Ch 7
  • MSE / MAE / Huber — regression losses: squared / absolute / hybrid. Ch 7
  • Gradient — the vector of partial derivatives; points uphill in loss. Ch 8
  • Gradient descentθ -= lr · gradient; the optimization engine. Ch 8
  • Learning rate — step size; too big diverges, too small crawls. Ch 8
  • SGD / mini-batch / momentum / Adam — the optimizer family; Adam is the default. Ch 8
  • Backpropagation — the chain rule applied efficiently to get all gradients. Ch 8
  • Autograd — software that does backprop automatically. Ch 12

Generalization

  • Overfitting / underfitting — memorizing noise / too simple to capture the pattern. Ch 9
  • Bias–variance tradeoff — error from wrong assumptions vs. from sensitivity to the sample. Ch 9
  • Train / validation / test — learn params / tune hyperparams / report once. Ch 9
  • Cross-validation — rotate held-out folds to squeeze small datasets. Ch 9
  • Regularization (L1 / L2) — penalize weight size; L1 → sparse, L2 → smooth shrinkage (weight decay). Ch 9
  • Dropout / early stopping / data augmentation — neural-net anti-overfitting tools. Ch 9

Metrics

  • Accuracy — fraction correct; misleading on imbalanced data. Ch 10
  • Confusion matrix — TP / FP / FN / TN; the source of all classification metrics. Ch 10
  • Precision / recall — trust of your alarms / coverage of true positives. Ch 10
  • F1 — harmonic mean of precision and recall; high only if both are. Ch 10
  • ROC-AUC — threshold-free ranking quality; 0.5 = random, 1.0 = perfect. Ch 10
  • RMSE / MAE / R² — regression error (squared / absolute) / variance explained. Ch 10

Neural networks & architectures

  • Neuron / layer / MLPactivation(w·x+b) / many neurons / stacked layers. Ch 11
  • Activation function — the non-linearity that makes depth meaningful (ReLU, GELU, tanh). Ch 11
  • Vanishing/exploding gradient — gradients shrinking to 0 / blowing up through deep nets. Ch 11
  • Weight initialization — small scaled random starts (Xavier/He) so training begins well. Ch 11
  • Inductive bias — a built-in assumption about the data an architecture encodes. Ch 13
  • CNN / convolution / filter — local, position-independent pattern detector for grids. Ch 13
  • RNN / LSTM / GRU — sequence models carrying a hidden state; gates for long memory. Ch 13
  • Transformer / self-attention / Q,K,Vsoftmax(QKᵀ/√d)·V; everything attends to everything. Ch 13
  • Diffusion / GAN / autoencoder / GNN / MoE — generative & specialized architectures. Ch 13

LLMs & the modern stack

  • Token / tokenization — the model's subword unit; text → integer IDs. Ch 15
  • Context window — max tokens the model attends to at once; its working memory. Ch 15
  • Temperature / top-k / top-p — sampling knobs trading focus for creativity. Ch 15
  • Pretraining / fine-tuning / RLHF — broad learning / specialization / alignment. Ch 15
  • RAG — retrieve relevant docs into the prompt; adds knowledge. Ch 15
  • Hallucination — confident falsehood; the model optimizes plausibility, not truth. Ch 15
  • Quantization — fewer bits per weight to shrink/speed models. Ch 15
  • Agent / tool use — an LLM that calls functions in a loop to act. Ch 15

Tools & engineering

  • NumPy — Python's array math library; the foundation of everything here. Ch 0
  • PyTorch — tensors + autograd + nn.Module + optimizers; the deep-learning framework. Ch 12
  • nn.Module / DataLoader — your model class / the batching+shuffling iterator. Ch 12, Ch 16
  • GPU / CUDA / .to(device) — parallel hardware / its programming model / moving tensors to it. Ch 16
  • model.train() / .eval() / no_grad() — dropout&bn on / off / skip the grad graph. Ch 12
  • Mixed precision / gradient accumulation / checkpointing — speed, memory, and crash-safety tricks. Ch 16
  • Seed / reproducibility — fix randomness so runs and comparisons are trustworthy. Ch 16
  • MLflow / Weights & Biases — experiment tracking for honest comparisons. Ch 16

Numerical literacy

  • Floating point — approximate arithmetic; compare with isclose, never ==. Ch 17
  • Log-sum-exp / stable softmax — subtract the max before exp to avoid overflow. Ch 17
  • NaN / inf — poison values; usual cause is a too-high learning rate or log(0). Ch 17

Classical ML algorithms

  • k-NN — classify by majority vote of the k nearest training points; no training. Ch 20
  • Naive Bayes — Bayes' rule assuming feature independence; fast text baseline. Ch 20
  • Decision tree — recursive if/else splits chosen to reduce impurity (Gini/ entropy). Ch 20
  • Random forest / bagging — many independent trees averaged; reduces variance. Ch 20
  • Gradient boosting (XGBoost/LightGBM) — sequential error-correcting trees; reduces bias; king of tabular ML. Ch 20
  • SVM / support vectors / margin — max-margin classifier; kernels for non-linear boundaries. Ch 20
  • k-means — partition data into k round clusters (assign/move). Ch 21
  • DBSCAN — density-based clustering; finds arbitrary shapes & outliers, no k. Ch 21
  • PCA — project onto top variance directions (via SVD) to reduce dimensions. Ch 21
  • t-SNE / UMAP — non-linear methods for visualizing high-dim data (not as features). Ch 21

Math & statistics

  • Expectation / variance — long-run mean / spread of a random variable. Ch 22
  • Central Limit Theorem — averages become Normal; uncertainty shrinks like $1/\sqrt n$. Ch 22
  • Bayes' theorem — update belief with evidence: posterior ∝ likelihood × prior. Ch 22
  • MLE / MAP — parameters that maximize data likelihood / posterior; the origin of losses & L2. Ch 22
  • p-value — P(data this extreme | null true); not P(null), not effect size. Ch 22
  • eigenvector / eigenvalue — a direction a matrix only scales, and by how much. Ch 23
  • SVD — factor any matrix; gives the best low-rank approximation (PCA, LoRA). Ch 23
  • rank — number of independent directions in a matrix; low rank = compressible. Ch 23
  • A/B test / power / guardrail — randomized online experiment; chance of catching a real effect; metrics that must not regress. Ch 24

Tools, cloud & the modern landscape

  • Pandas / SQL / scikit-learn — wrangling (groupby) / querying data / the fit/predict API. Ch 25
  • Hyperparameter search (grid/random/Optuna) — tune knobs by cross-validated search. Ch 25
  • MLOps / drift / feature store — productionizing models / input or concept shift / consistent train-serve features. Ch 26
  • Online vs. batch serving — low-latency one-at-a-time vs. high-throughput scheduled. Ch 26
  • LoRA / PEFT / quantization / distillation — cheap fine-tuning / fewer-bit weights / small student model. Ch 27
  • Vector database — store + nearest-neighbor search over embeddings; the memory of RAG. Ch 27
  • Agent / tool use / RLHF / DPO / MoE / diffusion — LLM that acts / alignment / scaled experts / denoising generation. Ch 27
  • Two-stage retrieval (candidate generation → ranking) — the standard scalable recommender/feed pattern. Ch 28

How to use this glossary

In an interview or a design review, you don't need the derivation of every term — you need the one-liner plus the intuition, and the confidence that you could code it. That's exactly what this book built. If any line above still feels hollow, follow its link, re-read the worked example, and run the code. Then you'll own the word, not just recognize it.

That's the whole foundation. Go build something. 👉