References
The sources behind this book, grouped by where they're most useful. Everything here is either a primary paper, a canonical textbook, or the official docs for a tool we used.
Foundational textbooks
- Ian Goodfellow, Yoshua Bengio, Aaron Courville. Deep Learning. MIT Press, 2016. — The standard reference for the math of neural networks, backprop, and regularization. Free at deeplearningbook.org.
- Christopher Bishop. Pattern Recognition and Machine Learning. Springer, 2006. — The classic on linear models, kernels, and the bias–variance tradeoff.
- Trevor Hastie, Robert Tibshirani, Jerome Friedman. The Elements of Statistical Learning. Springer, 2009. — Deep coverage of regularization (ridge, lasso), cross-validation, and classical ML. Free PDF online.
- Kevin Murphy. Probabilistic Machine Learning: An Introduction. MIT Press, 2022. — Modern, comprehensive, with excellent treatment of the numerical and probabilistic foundations.
Landmark papers
- Vaswani et al. Attention Is All You Need. NeurIPS, 2017. arXiv:1706.03762. — The transformer. The architecture behind every modern LLM (Ch 13, Ch 15).
- Mikolov et al. Efficient Estimation of Word Representations in Vector Space. 2013. arXiv:1301.3781. — word2vec; the king - man + woman ≈ queen embeddings (Ch 14).
- Kingma & Ba. Adam: A Method for Stochastic Optimization. ICLR, 2015. arXiv:1412.6980. — The default optimizer (Ch 8).
- Rumelhart, Hinton & Williams. Learning representations by back-propagating errors. Nature, 1986. — The paper that made backprop famous (Ch 8).
- Srivastava et al. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. JMLR, 2014. — Dropout regularization (Ch 9).
- He et al. Delving Deep into Rectifiers (He initialization). ICCV, 2015. arXiv:1502.01852. — Why initialization scale matters (Ch 11).
- Cortes & Vapnik. Support-Vector Networks. Machine Learning, 1995. — The SVM and the kernel trick (Ch 5).
- Lewis et al. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS, 2020. arXiv:2005.11401. — RAG (Ch 15).
- Breiman. Random Forests. Machine Learning, 2001. — Bagging and the random forest (Ch 20).
- Chen & Guestrin. XGBoost: A Scalable Tree Boosting System. KDD, 2016. arXiv:1603.02754. — The gradient-boosting workhorse (Ch 20).
- Hu et al. LoRA: Low-Rank Adaptation of Large Language Models. ICLR, 2022. arXiv:2106.09685. — Efficient fine-tuning (Ch 27).
- Ho, Jain & Abbeel. Denoising Diffusion Probabilistic Models. NeurIPS, 2020. arXiv:2006.11239. — Diffusion generative models (Ch 27).
- Kohavi, Tang & Xu. Trustworthy Online Controlled Experiments. Cambridge University Press, 2020. — The definitive A/B testing reference (Ch 24).
- Andrej Karpathy. nanoGPT and Let's build GPT from scratch. — The clearest reference for the GPT project (Ch 31).
- Yao et al. ReAct: Synergizing Reasoning and Acting in Language Models. ICLR, 2023. arXiv:2210.03629. — The agent loop (Ch 33).
Further reading by topic
- Classical ML & stats — Hastie/Tibshirani/Friedman, The Elements of Statistical Learning (free PDF); James et al., An Introduction to Statistical Learning (gentler) (Ch 20–Ch 23).
- Linear algebra — Gilbert Strang, Introduction to Linear Algebra / MIT 18.06 lectures (Ch 23).
- MLOps — Huyen, Designing Machine Learning Systems; Google's MLOps white papers (Ch 26).
- ML system design interviews — Aminian/Xu, Machine Learning System Design Interview (Ch 28).
Tools & documentation
- NumPy — numpy.org/doc. The array library underpinning every chapter.
- PyTorch — pytorch.org/docs. Tensors, autograd, nn, DataLoader (Ch 12, Ch 16).
- scikit-learn — scikit-learn.org. Reference implementations of the classical models, metrics, and preprocessing we built by hand.
- Hugging Face — huggingface.co/docs. Pretrained transformers and tokenizers (Ch 15).
Sister books in this series
These build directly on the foundations here:
- HNSW from Scratch — fast approximate nearest-neighbor search; the engine behind embedding retrieval and RAG (Ch 14).
- IVF & Product Quantization from Scratch — the memory-efficient, compressed side of vector search.
- Kernel Temporal Segmentation from Scratch — uses the kernel matrix (Ch 5) to segment video.
- Recommendation Systems from Scratch — embeddings, matrix factorization, and a production capstone (MLflow, FastAPI, RAG) that exercises this whole book.
This book's code
- code/recipes.py — similarity, softmax, normalization, top-k, encodings.
- code/features.py — one-hot, bag-of-words, TF-IDF, scaling.
- code/nn_from_scratch.py — a 2-layer MLP with hand-derived backprop.
- code/pytorch_intro.py — the same network in PyTorch.
Capstone projects (code/projects/):
- projects/gpt.py — a transformer language model, trained and sampled.
- projects/finetune_lora.py — full fine-tuning vs. LoRA, compared.
- projects/agent.py — a ReAct tool-using agent (pure Python).
- projects/cnn.py — a CNN image classifier trained end to end.
- projects/diffusion.py — a diffusion model that generates from noise.
The Part I–VIII code depends only on NumPy; the capstone projects need PyTorch (the CPU build is fine), except agent.py, which is pure Python.