The modern AI landscape & trends
The fundamentals in this book are stable for decades; the frontier moves every few months. This chapter is the orientation map of the current landscape (as of the mid-2020s) — the techniques, terms, and tools you'll be expected to recognize and reason about in any AI conversation or interview, even if you've never used them. Each builds on a foundation you already have.
The shift: from training models to adapting foundation models
The biggest change in practice: most teams no longer train models from scratch. They take a foundation model — a huge model pretrained by a big lab — and adapt it. Your job is increasingly choosing, adapting, and orchestrating these, not building them. Three adaptation strategies, in rising cost/power (from Chapter 15):
prompting ─► RAG ─► fine-tuning
(cheapest) (most powerful)
Efficient fine-tuning: LoRA & PEFT
Full fine-tuning of a 70-billion-parameter model updates all 70B weights — enormously expensive. PEFT (Parameter-Efficient Fine-Tuning) updates only a tiny fraction instead. The dominant method, LoRA (Low-Rank Adaptation), freezes the original weights and learns small low-rank (Chapter 23) update matrices — often <1% of the parameters — getting ~full-fine-tuning quality for a fraction of the compute and memory. QLoRA adds quantization to fine-tune big models on a single GPU.
Why it matters: LoRA is the reason a small team can customize a giant model. If asked "how would you specialize an LLM on a budget?", the answer is "LoRA/QLoRA, not full fine-tuning." It's a direct application of the low-rank idea from the linear- algebra chapter.
Quantization & distillation: making big models small
- Quantization — store weights in fewer bits (FP16 → INT8 → INT4). Roughly halves size and speeds inference per step down, with modest quality loss. It's how a model that needs a data center runs on your laptop (the same compression spirit as the IVF-PQ book).
- Distillation — train a small "student" model to mimic a large "teacher," capturing much of its ability in a fraction of the size. (DistilBERT, and the small fast models you see everywhere.)
- Pruning — remove weights that barely matter.
These inference-optimization techniques are increasingly where production value lives — a model you can't afford to serve is worthless.
RAG, in depth
Retrieval-Augmented Generation (Chapter 15) is the default architecture for building on LLMs with private or current data. The full pipeline, end to end, is built entirely from this book:
docs ─► chunk ─► embed (Ch 14) ─► store in vector DB ─► [query] embed ─►
nearest-neighbor search (HNSW/IVF-PQ books) ─► top-k chunks ─►
stuff into prompt ─► LLM generates a grounded answer
The hard parts in practice — and the interview follow-ups — are chunking (how to split documents), retrieval quality (hybrid keyword + vector search, re-ranking the candidates), and evaluation (did the answer actually use the right sources?). Advanced variants: agentic RAG (the model decides what to retrieve), GraphRAG (retrieve over a knowledge graph).
Vector databases
RAG needs fast similarity search over millions of embeddings, which created a whole tool category: vector databases — Pinecone, Weaviate, Qdrant, Milvus, Chroma, plus pgvector (Postgres) and Elasticsearch/OpenSearch. Under the hood they run the HNSW and IVF-PQ indexes the sister books build from scratch. Know that they exist, what they store (embeddings + metadata), and that they're the memory layer of modern AI apps.
Agents & tool use
An agent wraps an LLM in a loop where it can call tools (search, code execution, APIs, databases), observe results, and decide the next action — turning a text predictor into something that does things. Key terms: function/tool calling, ReAct (reason + act), MCP (Model Context Protocol, a standard for connecting tools), multi-agent systems (agents that delegate to each other). This is one of the hottest areas — and one of the least reliable, so know both the promise and the failure modes (error compounding over long loops, cost, latency).
Diffusion models & generative media
The architecture behind image/video/audio generation (Stable Diffusion, DALL·E, Midjourney, Sora). They generate by starting from pure noise and iteratively denoising it into a coherent image, guided by a text prompt. Different from transformers' next-token prediction, though modern systems combine both. Know the one-liner: "diffusion = learn to reverse noise into data."
Mixture of Experts (MoE)
How the largest models stay affordable: instead of running all parameters for every token, a router sends each token to a few specialist sub-networks ("experts"). The model has enormous total capacity but only activates a slice per token — high quality at lower inference cost. Many frontier models are MoE.
Multimodal models
Models that handle text + images + audio + video in one system (GPT-4o, Gemini, Claude with vision). The trick is the embedding idea from Chapter 14: map every modality into a shared vector space so the model reasons across them. "Describe this photo," "what's in this chart," voice assistants — all multimodal.
Alignment: RLHF & DPO
How a raw next-token predictor becomes a helpful, safe assistant. RLHF (Reinforcement Learning from Human Feedback) trains a reward model on human preferences, then optimizes the LLM against it. DPO (Direct Preference Optimization) achieves similar results more simply, without a separate reward model. These are the alignment step — the Chapter 1 reinforcement-learning flavor at frontier scale.
The open vs. closed landscape
- Closed/API models — GPT (OpenAI), Claude (Anthropic), Gemini (Google). Strongest, easiest to use, you rent them by the token.
- Open-weight models — Llama (Meta), Mistral, Qwen, DeepSeek, Gemma. You can run and fine-tune them yourself; vital for privacy, cost control, and customization.
The trade-off — capability and convenience (closed) vs. control, privacy, and cost (open) — is a real architectural decision teams make, and a great thing to reason about aloud in an interview.
How to stay current without drowning
The terms churn, the foundations don't. A new technique is almost always a recombination of ideas you now own: attention, embeddings, low-rank updates, nearest-neighbor search, gradient descent, quantization. When a new buzzword lands, ask: which fundamental is this built on? — and you'll place it in minutes. Follow a few primary sources (lab blogs, papers, a curated newsletter) over hype threads.
The takeaway
The modern stack is about adapting foundation models, not training from scratch: LoRA/ PEFT fine-tune cheaply (low-rank!), quantization/distillation shrink models for serving, RAG + vector DBs add knowledge (the HNSW/IVF-PQ engines), agents add action, diffusion generates media, MoE scales capacity affordably, multimodal unifies senses in a shared embedding space, and RLHF/DPO align it all. Every one of these is a remix of foundations you already hold — which is exactly why this book front-loaded them. Last stop: turning all of this into interview success. 👉