References

The sources behind this book, grouped by type and annotated with the chapters where they're most useful. Every external source here is a primary paper, a canonical textbook, or the official docs for a tool we used.

Foundational textbooks

Ian Goodfellow, Yoshua Bengio, Aaron Courville. Deep Learning. MIT Press, 2016. — The standard reference for the math of neural networks, backprop, and regularization. Free at deeplearningbook.org.
Christopher Bishop. Pattern Recognition and Machine Learning. Springer, 2006. — The classic on linear models, kernels, and the bias–variance tradeoff.
Trevor Hastie, Robert Tibshirani, Jerome Friedman. The Elements of Statistical Learning, 2nd ed. Springer, 2009. — Deep coverage of regularization (ridge, lasso), cross-validation, and classical ML. Free PDF on the authors' website.
Kevin Murphy. Probabilistic Machine Learning: An Introduction. MIT Press, 2022. — Modern and comprehensive, with an excellent treatment of the numerical and probabilistic foundations.

Landmark papers

Vaswani et al. Attention Is All You Need. NeurIPS, 2017. arXiv:1706.03762. — The transformer, the architecture behind nearly every modern LLM (Ch 13, Ch 15).
Mikolov et al. Efficient Estimation of Word Representations in Vector Space. 2013. arXiv:1301.3781. — word2vec; the king - man + woman ≈ queen embeddings (Ch 14).
Kingma & Ba. Adam: A Method for Stochastic Optimization. ICLR, 2015. arXiv:1412.6980. — The Adam optimizer, the usual default for training neural networks (Ch 8).
Rumelhart, Hinton & Williams. Learning representations by back-propagating errors. Nature, 1986. — The paper that made backprop famous (Ch 8).
Srivastava et al. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. JMLR, 2014. — Dropout as a regularizer (Ch 9).
He et al. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. ICCV, 2015. arXiv:1502.01852. — He initialization; why initialization scale matters (Ch 11).
Cortes & Vapnik. Support-Vector Networks. Machine Learning, 1995. — The SVM and the kernel trick (Ch 5).
Lewis et al. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS, 2020. arXiv:2005.11401. — RAG (Ch 15).
Breiman. Random Forests. Machine Learning, 2001. — Bagging and the random forest (Ch 20).
Chen & Guestrin. XGBoost: A Scalable Tree Boosting System. KDD, 2016. arXiv:1603.02754. — The gradient-boosting workhorse (Ch 20).
Hu et al. LoRA: Low-Rank Adaptation of Large Language Models. ICLR, 2022. arXiv:2106.09685. — Efficient fine-tuning (Ch 27).
Ho, Jain & Abbeel. Denoising Diffusion Probabilistic Models. NeurIPS, 2020. arXiv:2006.11239. — Diffusion generative models (Ch 27).
Kohavi, Tang & Xu. Trustworthy Online Controlled Experiments. Cambridge University Press, 2020. — The definitive A/B testing reference (Ch 24).
Andrej Karpathy. nanoGPT (code repository) and Let's build GPT: from scratch, in code, spelled out (video lecture). — The clearest reference for the GPT project (Ch 31).
Yao et al. ReAct: Synergizing Reasoning and Acting in Language Models. ICLR, 2023. arXiv:2210.03629. — The agent loop (Ch 33).

Tools & documentation

NumPy — numpy.org/doc. The array library underpinning every chapter.
PyTorch — pytorch.org/docs. Tensors, autograd, nn, DataLoader (Ch 12, Ch 16).
scikit-learn — scikit-learn.org. Reference implementations of the classical models, metrics, and preprocessing we built by hand.
Hugging Face — huggingface.co/docs. Pretrained transformers and tokenizers (Ch 15).

Sister books in this series

These build directly on the foundations here:

HNSW from Scratch — fast approximate nearest-neighbor search; the engine behind embedding retrieval and RAG (Ch 14).
IVF & Product Quantization from Scratch — the memory-efficient, compressed side of vector search.
Kernel Temporal Segmentation from Scratch — uses the kernel matrix (Ch 5) to segment video.
Recommendation Systems from Scratch — embeddings, matrix factorization, and a production capstone (MLflow, FastAPI, RAG) that exercises this whole book.

This book's code

code/recipes.py — similarity, softmax, normalization, top-k, encodings.
code/features.py — one-hot, bag-of-words, TF-IDF, scaling.
code/nn_from_scratch.py — a 2-layer MLP with hand-derived backprop.
code/pytorch_intro.py — the same network in PyTorch.

Capstone projects (code/projects/):

projects/gpt.py — a transformer language model, trained and sampled.
projects/finetune_lora.py — full fine-tuning vs. LoRA, compared.
projects/agent.py — a ReAct tool-using agent (pure Python).
projects/cnn.py — a CNN image classifier trained end to end.
projects/diffusion.py — a diffusion model that generates from noise.

The code for Parts I–VIII depends only on NumPy (plus PyTorch for code/pytorch_intro.py). The capstone projects need PyTorch (the CPU build is fine), except projects/agent.py, which is pure Python.

AI Foundations in Depth