References

The sources behind this book, grouped by where they're most useful. Everything here is either a primary paper, a canonical textbook, or the official docs for a tool we used.

Foundational textbooks

  1. Ian Goodfellow, Yoshua Bengio, Aaron Courville. Deep Learning. MIT Press, 2016. — The standard reference for the math of neural networks, backprop, and regularization. Free at deeplearningbook.org.

  2. Christopher Bishop. Pattern Recognition and Machine Learning. Springer, 2006. — The classic on linear models, kernels, and the bias–variance tradeoff.

  3. Trevor Hastie, Robert Tibshirani, Jerome Friedman. The Elements of Statistical Learning. Springer, 2009. — Deep coverage of regularization (ridge, lasso), cross-validation, and classical ML. Free PDF online.

  4. Kevin Murphy. Probabilistic Machine Learning: An Introduction. MIT Press, 2022. — Modern, comprehensive, with excellent treatment of the numerical and probabilistic foundations.

Landmark papers

  1. Vaswani et al. Attention Is All You Need. NeurIPS, 2017. arXiv:1706.03762. — The transformer. The architecture behind nearly every modern LLM (Ch 13, Ch 15).

  2. Mikolov et al. Efficient Estimation of Word Representations in Vector Space. 2013. arXiv:1301.3781. — word2vec; the king - man + woman ≈ queen embeddings (Ch 14).

  3. Kingma & Ba. Adam: A Method for Stochastic Optimization. ICLR, 2015. arXiv:1412.6980. — The default optimizer (Ch 8).

  4. Rumelhart, Hinton & Williams. Learning representations by back-propagating errors. Nature, 1986. — The paper that made backprop famous (Ch 8).

  5. Srivastava et al. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. JMLR, 2014. — Dropout as a regularizer (Ch 9).

  6. He et al. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. ICCV, 2015. arXiv:1502.01852. — He initialization; why initialization scale matters (Ch 11).

  7. Cortes & Vapnik. Support-Vector Networks. Machine Learning, 1995. — The SVM and the kernel trick (Ch 5).

  8. Lewis et al. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS, 2020. arXiv:2005.11401. — RAG (Ch 15).

  9. Leo Breiman. Random Forests. Machine Learning, 2001. — Bagging and the random forest (Ch 20).

  10. Chen & Guestrin. XGBoost: A Scalable Tree Boosting System. KDD, 2016. arXiv:1603.02754. — The gradient-boosting workhorse (Ch 20).

  11. Hu et al. LoRA: Low-Rank Adaptation of Large Language Models. ICLR, 2022. arXiv:2106.09685. — Efficient fine-tuning (Ch 27).

  12. Ho, Jain & Abbeel. Denoising Diffusion Probabilistic Models. NeurIPS, 2020. arXiv:2006.11239. — Diffusion generative models (Ch 27).

  13. Kohavi, Tang & Xu. Trustworthy Online Controlled Experiments. Cambridge University Press, 2020. — The definitive A/B testing reference (Ch 24).

  14. Andrej Karpathy. nanoGPT and Let's build GPT from scratch. — The clearest reference for the GPT project (Ch 31).

  15. Yao et al. ReAct: Synergizing Reasoning and Acting in Language Models. ICLR, 2023. arXiv:2210.03629. — The agent loop (Ch 33).

Further reading by topic

  • Classical ML & stats — Hastie/Tibshirani/Friedman, The Elements of Statistical Learning (free PDF); James et al., An Introduction to Statistical Learning (a gentler introduction) (Ch 20, Ch 23).
  • Linear algebra — Gilbert Strang, Introduction to Linear Algebra / MIT 18.06 lectures (Ch 23).
  • MLOps — Huyen, Designing Machine Learning Systems; Google's MLOps white papers (Ch 26).
  • ML system design interviews — Aminian/Xu, Machine Learning System Design Interview (Ch 28).

Tools & documentation

  • NumPy — numpy.org/doc. The array library underpinning every chapter.
  • PyTorch — pytorch.org/docs. Tensors, autograd, nn, DataLoader (Ch 12, Ch 16).
  • scikit-learn — scikit-learn.org. Reference implementations of the classical models, metrics, and preprocessing we built by hand.
  • Hugging Face — huggingface.co/docs. Pretrained transformers and tokenizers (Ch 15).

Sister books in this series

These build directly on the foundations here:

  • HNSW from Scratch — fast approximate nearest-neighbor search; the engine behind embedding retrieval and RAG (Ch 14).
  • IVF & Product Quantization from Scratch — the memory-efficient, compressed side of vector search.
  • Kernel Temporal Segmentation from Scratch — uses the kernel matrix (Ch 5) to segment video.
  • Recommendation Systems from Scratch — embeddings, matrix factorization, and a production capstone (MLflow, FastAPI, RAG) that exercises this whole book.

This book's code

Capstone projects (code/projects/):

The Part I–VIII code depends only on NumPy; the capstone projects need PyTorch (the CPU build is fine), except agent.py, which is pure Python.
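
As a quick check of the dependency split described above, the following minimal sketch reports which packages are importable in the current environment. The helper name missing_modules and the two groupings are illustrative assumptions, not part of the book's code; the only facts taken from the text are that Parts I–VIII need NumPy and the capstone projects need PyTorch (import name torch).

```python
import importlib.util


def missing_modules(names):
    """Return the module names from `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Hypothetical grouping that mirrors the dependency note above.
REQUIREMENTS = {
    "Parts I-VIII": ["numpy"],
    "Capstone projects (except agent.py)": ["torch"],
}

for label, modules in REQUIREMENTS.items():
    missing = missing_modules(modules)
    status = "ok" if not missing else "missing: " + ", ".join(missing)
    print(f"{label}: {status}")
```

The check only looks up each package without importing it, so it stays fast even when PyTorch is installed.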