Introduction
A large language model is a function with one input and one output. The input is a block of text called the context: the system prompt, the conversation so far, retrieved documents, tool definitions, tool results, and the user's latest message, all concatenated into one sequence. The output is the text the model writes back. Everything the model "knows" in the moment, beyond the frozen weights it was trained with, has to be in that input. There is nowhere else for knowledge to live during a single call.
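As a minimal sketch of that concatenation, the snippet below joins the pieces into one string. The function name `assemble_context`, the ordering of the parts, and the blank-line separator are all assumptions made for illustration; real APIs usually accept structured messages rather than one flat string.

```python
def assemble_context(system_prompt, tool_definitions, documents, history, user_message):
    """Concatenate every part of the context into one text sequence.

    The ordering and the separator are illustrative assumptions,
    not a format that any particular model requires.
    """
    parts = [system_prompt]
    parts.extend(tool_definitions)                           # what the model may call
    parts.extend(documents)                                  # retrieved documents
    parts.extend(f"{role}: {text}" for role, text in history)  # conversation and tool results
    parts.append(f"user: {user_message}")                    # the latest message goes last
    return "\n\n".join(parts)


context = assemble_context(
    system_prompt="You are a helpful assistant.",
    tool_definitions=["[tool] lookup_order(order_id)"],
    documents=["[doc] The warranty lasts 12 months."],
    history=[("user", "Hi."), ("assistant", "Hello! How can I help?")],
    user_message="How long is the warranty?",
)
print(context)
```

Anything not passed to this function does not reach the model, which is the point of the paragraph above.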
That single fact is the reason this book exists. The context is a fixed-size, paid resource. It is fixed-size because every model has a maximum number of tokens it can read at once (its context window). It is paid because you are billed per token, both for what you send and, at a higher rate, for what the model writes. So two pressures sit on every serious LLM application at once: fit the right information into a window that is too small for everything, and do it without spending more tokens, latency, and money than the task is worth.
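Both pressures reduce to simple arithmetic. The sketch below checks whether a call fits and what it costs. The window size and both prices are hypothetical placeholders, not the figures for any real model, and the fit check assumes that input and output share the same window.

```python
# All three constants are hypothetical placeholders, not real model figures.
CONTEXT_WINDOW = 200_000        # maximum tokens, assumed to cover input plus output
INPUT_PRICE_PER_MTOK = 3.0      # dollars per million input tokens (assumed)
OUTPUT_PRICE_PER_MTOK = 15.0    # dollars per million output tokens (assumed, higher rate)


def fits_window(input_tokens, output_tokens, window=CONTEXT_WINDOW):
    """True if the call stays inside the window."""
    return input_tokens + output_tokens <= window


def call_cost(input_tokens, output_tokens):
    """Dollar cost of one call at the assumed per-token prices."""
    return (
        input_tokens * INPUT_PRICE_PER_MTOK
        + output_tokens * OUTPUT_PRICE_PER_MTOK
    ) / 1_000_000


input_tokens, output_tokens = 120_000, 2_000
print(fits_window(input_tokens, output_tokens))
print(round(call_cost(input_tokens, output_tokens), 2))
```

Under these assumed prices the 120,000 input tokens account for most of the bill, even though each output token costs five times as much.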
Context engineering is the craft of managing that resource. It is the set of techniques for deciding what goes into the context, what stays out, what gets compressed, what gets cached, what gets remembered across calls, and how the whole thing is assembled fresh on every turn. It is not prompt writing, although a good prompt is part of it. It is closer to memory management in an operating system, or to cache design in a CPU: you have a small fast resource (the window), a large slow world (everything the model might need to know), and your job is to keep the right things in the small resource at the right time.
Context engineering is not the same thing as prompt engineering. Prompt engineering is about wording: how you phrase an instruction so the model does what you want. Context engineering is about bookkeeping: which tokens are present at all, in what form, and at what cost. You can have a perfectly worded prompt that still fails because the document it needed was evicted three turns ago, or that costs ten times what it should because a stable 50,000-token preamble is re-sent and re-charged on every call. Prompt engineering tunes the words; context engineering tunes the tokens around them.
Why this matters now
For a one-shot question ("summarize this paragraph"), context engineering barely matters: the input is small, you call once, you are done. It becomes the dominant concern the moment any of three things is true, and modern systems make all three true at once:
- Long inputs. Whole codebases, long PDFs, hours of transcripts. The window fills up, and the cost of attention grows faster than the input itself (we will see why in Chapter 14).
- Many calls over the same context. An agent loop calls the model again and again, re-sending a growing history each time. A naive loop re-charges the entire prefix on every step, so the total bill grows quadratically with the number of steps. Caching brings that bill back close to linear (Chapter 6).
- State that must persist. A model is stateless between calls; it forgets everything the instant a call returns. Anything it should remember across turns or sessions (a user's name, a fact it learned, a correction you gave it) has to be stored outside the model and re-injected (Chapter 9).
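The agent-loop case above can be made concrete with a short count. This is a minimal sketch with a hypothetical function name and an idealized cache in which every token is paid for at full price exactly once; real prompt caches typically still charge a reduced rate for cached reads and expire entries, which this ignores.

```python
def tokens_processed(prefix_tokens, tokens_per_step, steps):
    """Return (naive_total, cached_total) input tokens paid for at full price.

    naive:  every step re-sends and re-pays for the whole history so far.
    cached: idealized cache, each token is paid for once when it first appears.
    """
    naive = 0
    history = prefix_tokens
    for _ in range(steps):
        history += tokens_per_step   # each step appends new tokens to the history
        naive += history             # the naive loop pays for all of it again
    cached = prefix_tokens + steps * tokens_per_step
    return naive, cached


for steps in (10, 20):
    naive, cached = tokens_processed(prefix_tokens=1_000, tokens_per_step=500, steps=steps)
    print(steps, naive, cached)
# 10 steps -> naive 37500, cached 6000
# 20 steps -> naive 125000, cached 11000
```

Doubling the number of steps roughly triples the naive total but does not even double the cached one, which is the difference between quadratic and linear growth.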
Get this right and an application is fast, cheap, and coherent over long horizons. Get it wrong and it is slow, expensive, and forgetful, often all three, and usually for reasons that never show up in the prompt you were staring at.
How the book is built
Every chapter follows the same contract, because the point is to understand each technique, not to import a library and hope:
- The problem, plainly. What goes wrong without the technique, and why.
- A from-scratch build. A small, real implementation that you can run, written in Python using only the standard library and NumPy. No frameworks, no hidden magic.
- Before and after, in numbers. The demo measures something (tokens, cache hits, memory, FLOPs) with the technique off, then on, so you see the size of the win rather than taking it on faith.
- The real project. The production open-source system that does this for real, what it adds beyond the toy, and a use case or two. You should finish each chapter able to both explain the idea and pick the right tool.
Where a chapter touches the Anthropic API (prompt caching, token counting, output control), it uses the model id claude-opus-4-8 and shows the exact parameter, written as a follow-along snippet you can copy. Those API snippets are labeled illustrative, because the build machine has no API key; the from-scratch NumPy demos, by contrast, are run for real and their output is pasted in verbatim.
The twelve levers
The middle of the book is organized around the twelve things you can actually do to a context, grouped into four families. You do not need them all on day one; you reach for each when its problem appears.
| Family | Lever | Chapter |
|---|---|---|
| Compression | Prompt and context compression | 3 |
| Compression | Output token reduction | 4 |
| Compression | Code and structure-aware context | 5 |
| Caching | KV-cache and prefix caching | 6 |
| Caching | Semantic and response caching | 7 |
| Caching | KV-cache serving optimization | 8 |
| Memory | Agent memory and persistence | 9 |
| Memory | Temporal knowledge graphs | 10 |
| Memory | Context-window compaction | 11 |
| Memory | Failure and procedural learning | 12 |
| Architecture | Context orchestration | 13 |
| Architecture | Long-context attention efficiency | 14 |
👉 Before any of the levers, we need a shared picture of the thing they all act on. The next chapter defines context engineering precisely and draws the map the rest of the book fills in.