Introduction

A large language model is a function with one input and one output. The input is a block of text called the context: the system prompt, the conversation so far, retrieved documents, tool definitions, tool results, and the user's latest message, all concatenated into one sequence. The output is the text the model writes back. Everything the model "knows" in the moment, beyond the frozen weights it was trained with, has to be in that input. There is nowhere else for knowledge to live during a single call.
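As a minimal sketch of that idea, the function below flattens the pieces into one sequence. The name `assemble_context`, the bracketed labels, and the sample strings are all hypothetical, chosen only for illustration; real chat APIs take these parts as structured messages rather than one flat string, and tool definitions are omitted here for brevity.

```python
def assemble_context(system_prompt, history, documents, tool_results, user_message):
    """Concatenate every piece the model should see into one sequence.

    Hypothetical layout for illustration only: the labels and ordering
    are assumptions, not any provider's actual wire format.
    """
    parts = [("system", system_prompt)]
    parts += [(role, text) for role, text in history]
    parts += [("document", doc) for doc in documents]
    parts += [("tool_result", result) for result in tool_results]
    parts.append(("user", user_message))
    # One line per part, each tagged with where it came from.
    return "\n".join(f"[{label}] {text}" for label, text in parts)


context = assemble_context(
    system_prompt="You are a helpful assistant.",
    history=[("user", "Hi, I'm Ada."), ("assistant", "Hello Ada!")],
    documents=["Ada's order shipped on Monday."],
    tool_results=[],
    user_message="Where is my order?",
)
print(context)
```

If the shipping document were left out of `documents`, nothing in the call would tell the model about the order; that is the sense in which the input is the only place for in-the-moment knowledge.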

That single fact is the reason this book exists. The context is a fixed-size, paid resource. It is fixed-size because every model has a maximum number of tokens it can read at once (its context window). It is paid because you are billed per token, both for what you send and, at a higher rate, for what the model writes. So two pressures sit on every serious LLM application at once: fit the right information into a window that is too small for everything, and do it without spending more tokens, latency, and money than the task is worth.
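The two pressures can be put into rough arithmetic. The sketch below uses placeholder prices and a placeholder window size, not the real figures for any model, and it assumes the reserved output budget shares the window with the input, which is common but worth checking for the model you use. The function names are hypothetical.

```python
def call_cost(input_tokens, output_tokens, input_price_per_mtok, output_price_per_mtok):
    """Cost of one call, in the same currency as the per-million-token prices."""
    return (input_tokens * input_price_per_mtok
            + output_tokens * output_price_per_mtok) / 1_000_000


def fits_window(input_tokens, max_output_tokens, context_window):
    """True if the input plus the reserved output budget fit in the window."""
    return input_tokens + max_output_tokens <= context_window


# Placeholder numbers, NOT real prices or limits for any model.
INPUT_PRICE = 3.0     # currency units per million input tokens (assumed)
OUTPUT_PRICE = 15.0   # currency units per million output tokens (assumed)
WINDOW = 200_000      # context window in tokens (assumed)

cost = call_cost(50_000, 1_000, INPUT_PRICE, OUTPUT_PRICE)
print(f"cost per call: {cost:.3f}")
print("50,000-token input fits:", fits_window(50_000, 1_000, WINDOW))
print("199,500-token input fits:", fits_window(199_500, 1_000, WINDOW))
```

Under these assumed numbers the input dominates the bill even though output is priced higher per token, because there is so much more of it.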

Context engineering is the craft of managing that resource. It is the set of techniques for deciding what goes into the context, what stays out, what gets compressed, what gets cached, what gets remembered across calls, and how the whole thing is assembled fresh on every turn. It is not prompt writing, although a good prompt is part of it. It is closer to memory management in an operating system, or to cache design in a CPU: you have a small fast resource (the window), a large slow world (everything the model might need to know), and your job is to keep the right things in the small resource at the right time.

Do not confuse this with prompt engineering. Prompt engineering is about wording: how you phrase an instruction so the model does what you want. Context engineering is about bookkeeping: which tokens are present at all, in what form, and at what cost. You can have a perfectly worded prompt that still fails because the document it needed was evicted three turns ago, or that costs ten times what it should because a stable 50,000-token preamble is re-sent and re-charged on every call. Prompt engineering tunes the words; context engineering tunes the tokens around them.

Why this matters now

For a one-shot question ("summarize this paragraph"), context engineering barely matters: the input is small, you call once, you are done. It becomes the dominant concern the moment any of three things is true, and modern systems make all three true at once:

Long inputs. Whole codebases, long PDFs, hours of transcripts. The window fills up, and the cost of attention grows faster than the input itself (we will see why in Chapter 14).
Many calls over the same context. An agent loop calls the model again and again, re-sending a growing history each time. A naive loop re-charges the entire prefix on every step, so the total tokens billed grow roughly with the square of the number of steps. Caching brings that quadratic bill back toward a linear one (Chapter 6).
State that must persist. A model is stateless between calls; it forgets everything the instant a call returns. Anything it should remember across turns or sessions (a user's name, a fact it learned, a correction you gave it) has to be stored outside the model and re-injected (Chapter 9).
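The arithmetic behind the second point can be sketched in a few lines. This is an idealized model, not a billing calculator: the function names and token counts are hypothetical, each step is assumed to add the same number of tokens to the history, and the cached case counts only genuinely new tokens, whereas providers typically still charge something for tokens read from a cache.

```python
def naive_tokens_charged(preamble, per_step, steps):
    """Total input tokens charged when every step re-sends the full history.

    At step k the input is the preamble plus k steps' worth of history.
    """
    return sum(preamble + per_step * k for k in range(1, steps + 1))


def new_tokens_only(preamble, per_step, steps):
    """Tokens that are genuinely new if the already-sent prefix is reused."""
    return preamble + per_step * steps


PREAMBLE = 2_000   # stable system prompt and tool definitions (assumed size)
PER_STEP = 1_000   # tokens each step adds to the history (assumed size)

for steps in (10, 20, 40):
    naive = naive_tokens_charged(PREAMBLE, PER_STEP, steps)
    fresh = new_tokens_only(PREAMBLE, PER_STEP, steps)
    print(f"steps={steps:>2}  naive={naive:>7,}  new-only={fresh:>6,}")
```

Doubling the number of steps more than triples the naive total here, while the count of new tokens less than doubles. That widening gap is what a cache is there to close.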

Get this right and an application is fast, cheap, and coherent over long horizons. Get it wrong and it is slow, expensive, and forgetful, often all three, and usually for reasons that never show up in the prompt you were staring at.

How the book is built

Every chapter follows the same contract, because the point is to understand each technique, not to import a library and hope:

The problem, plainly. What goes wrong without the technique, and why.
A from-scratch build. A small, real implementation in Python that you can run, using only the standard library and NumPy. No frameworks, no hidden magic.
Before and after, in numbers. The demo measures something (tokens, cache hits, memory, FLOPs) with the technique off, then on, so you see the size of the win rather than taking it on faith.
The real project. The production open-source system that does this for real, what it adds beyond the toy, and a use case or two. You should finish each chapter able to both explain the idea and pick the right tool.

Where a chapter touches the Anthropic API (prompt caching, token counting, output control), it uses the model id claude-opus-4-8 and shows the exact parameter in follow-along code you can copy. Those API snippets are labeled illustrative, because the build machine has no API key; the from-scratch NumPy demos, by contrast, are run for real and their output is pasted in verbatim.

The twelve levers

The middle of the book is organized around the twelve things you can actually do to a context, grouped into four families. You do not need them all on day one; you reach for each when its problem appears.

Family	Lever	Chapter
Compression	Prompt and context compression	3
Compression	Output token reduction	4
Compression	Code and structure-aware context	5
Caching	KV-cache and prefix caching	6
Caching	Semantic and response caching	7
Caching	KV-cache serving optimization	8
Memory	Agent memory and persistence	9
Memory	Temporal knowledge graphs	10
Memory	Context-window compaction	11
Memory	Failure and procedural learning	12
Architecture	Context orchestration	13
Architecture	Long-context attention efficiency	14

👉 Before any of the levers, we need a shared picture of the thing they all act on. The next chapter defines context engineering precisely and draws the map the rest of the book fills in.
