Introduction
A large language model is a function with one input and one output. The input is a block of text called the context: the system prompt, the conversation so far, retrieved documents, tool definitions, tool results, and the user's latest message, all concatenated into one sequence. The output is the text the model writes back. Everything the model "knows" in the moment, beyond the frozen weights it was trained with, has to be in that input. There is nowhere else for knowledge to live during a single call.
That single fact is the reason this book exists. The context is a fixed-size, paid resource. It is fixed-size because every model has a maximum number of tokens it can read at once (its context window). It is paid because you are billed per token, both for what you send and, at a higher rate, for what the model writes. So two pressures sit on every serious LLM application at once: fit the right information into a window that is too small for everything, and do it without spending more tokens, latency, and money than the task is worth.
Context engineering is the craft of managing that resource. It is the set of techniques for deciding what goes into the context, what stays out, what gets compressed, what gets cached, what gets remembered across calls, and how the whole thing is assembled fresh on every turn. It is not prompt writing, although a good prompt is part of it. It is closer to memory management in an operating system, or to cache design in a CPU: you have a small fast resource (the window), a large slow world (everything the model might need to know), and your job is to keep the right things in the small resource at the right time.
Don't be confused. Prompt engineering is about wording: how you phrase an instruction so the model does what you want. Context engineering is about bookkeeping: which tokens are present at all, in what form, and at what cost. You can have a perfectly worded prompt that still fails because the document it needed was evicted three turns ago, or that costs ten times what it should because a stable 50,000-token preamble is re-sent and re-charged on every call. Prompt engineering tunes the words; context engineering tunes the tokens around them.
Why this matters now
For a one-shot question ("summarize this paragraph"), context engineering barely matters: the input is small, you call once, you are done. It becomes the dominant concern the moment any of three things is true, and modern systems make all three true at once:
- Long inputs. Whole codebases, long PDFs, hours of transcripts. The window fills up, and the cost of attention grows faster than the input itself (we will see why in Chapter 14).
- Many calls over the same context. An agent loop calls the model again and again, re-sending a growing history each time. A naive loop re-charges the entire prefix on every step. Caching turns that quadratic bill back into a linear one (Chapter 6).
- State that must persist. A model is stateless between calls; it forgets everything the instant a call returns. Anything it should remember across turns or sessions, a user's name, a fact it learned, a correction you gave it, has to be stored outside the model and re-injected (Chapter 9).
Get this right and an application is fast, cheap, and coherent over long horizons. Get it wrong and it is slow, expensive, and forgetful, often all three, and usually for reasons that never show up in the prompt you were staring at.
How the book is built
Every chapter follows the same contract, because the point is to understand each technique, not to import a library and hope:
- The problem, plainly. What goes wrong without the technique, and why.
- A from-scratch build. A small, real implementation in Python and NumPy, using only the standard library and NumPy, that you can run. No frameworks, no hidden magic.
- Before and after, in numbers. The demo measures something (tokens, cache hits, memory, FLOPs) with the technique off, then on, so you see the size of the win rather than taking it on faith.
- The real project. The production open-source system that does this for real, what it adds beyond the toy, and a use case or two. You should finish each chapter able to both explain the idea and pick the right tool.
Where a chapter touches the Anthropic API (prompt caching, token counting, output
control), it uses the model id claude-opus-4-8 and shows the exact parameter, written
as follow-along you can copy. Those API snippets are labeled illustrative, because the
build machine has no API key; the from-scratch NumPy demos, by contrast, are run for real
and their output is pasted in verbatim.
The twelve levers
The middle of the book is organized around the twelve things you can actually do to a context, grouped into four families. You do not need them all on day one; you reach for each when its problem appears.
| Family | Lever | Chapter |
|---|---|---|
| Compression | Prompt and context compression | 3 |
| Compression | Output token reduction | 4 |
| Compression | Code and structure-aware context | 5 |
| Caching | KV-cache and prefix caching | 6 |
| Caching | Semantic and response caching | 7 |
| Caching | KV-cache serving optimization | 8 |
| Memory | Agent memory and persistence | 9 |
| Memory | Temporal knowledge graphs | 10 |
| Memory | Context-window compaction | 11 |
| Memory | Failure and procedural learning | 12 |
| Architecture | Context orchestration | 13 |
| Architecture | Long-context attention efficiency | 14 |
👉 Before any of the levers, we need a shared picture of the thing they all act on. The next chapter defines context engineering precisely and draws the map the rest of the book fills in.
How to read this book
This book assumes you can read Python and have seen an LLM API call before. It does not assume you know what a KV-cache, an embedding, attention FLOPs, or a knowledge graph are; each is built up from nothing the first time it appears. If a term looks unfamiliar, it is explained in the chapter that needs it, or in the glossary at the end of the references.
Run the code
Every from-scratch demo uses only the Python standard library and NumPy. Nothing else is required: no PyTorch, no API key, no vector database, no cloud account. If you have Python 3 and NumPy, you can run all of it.
pip install numpy
python3 books/context-engineering/code/<file>.py
Each chapter includes its demo source directly from the code/ folder and then shows a
text block with the exact output that source produced when it was run. The
numbers in the prose (a compression ratio, a cache hit rate, a FLOP count) come from that
output, so you can reproduce them and change them.
Don't be confused. Two kinds of code appear in this book and they are not the same. The NumPy demos are run on a real machine and their output is pasted in verbatim, so you can trust the numbers. The Anthropic API snippets (prompt caching, token counting, output control) are written as follow-along: correct, copyable, but not executed here, because the build machine has no API key. Wherever you see an API snippet, treat its result as illustrative and run it yourself against your own key.
The house conventions
A few markers recur:
- A 👉 arrow ends every chapter with a one-line hand-off to the next. It is a deliberate signpost, not filler.
- A "Don't be confused" box, like the ones on this page, pulls apart two ideas that are easy to mix up (prompt vs context engineering, caching the prompt vs caching the answer, deleting context vs summarizing it).
- A "Takeaways" list near the end of each chapter is the five-second version, for when you come back later and need the gist without rereading.
The order
The Foundations chapters define the field and the token economy that makes every later trade-off concrete; read them first. After that the four families (compression, caching, memory, architecture) are mostly independent, so you can jump to the lever you need. The landscape chapter at the end is a map from each technique to the open-source project that implements it, useful as a reference when you are choosing a tool rather than building your own.
Cross-links like Chapter 6 point you to the chapter that goes deeper on something mentioned in passing. Follow them when you want the detail; skip them when you are skimming.
👉 With the ground rules set, let us define the thing this whole book is about.
What context engineering is
The introduction gave the one-sentence version: context engineering is managing the model's input as a fixed-size, paid resource. This chapter makes that precise, because the precision is what lets you reason about every later technique instead of memorizing them.
The model is a pure function of its context
A chat model feels stateful. You tell it your name, and three messages later it still knows it. But the model itself remembers nothing between calls. What actually happens is that your application re-sends the whole conversation every time, and the model re-reads it from scratch on each call. Formally, a single call is
$$\text{output} = f_\theta(\text{context})$$
where $f_\theta$ is the frozen network and context is one long token sequence. The
weights $\theta$ hold what the model learned in training; the context holds everything
about this situation. There is no third place. If a fact is not in the weights and not in
the context, the model does not have it during this call, full stop.
This is the whole reason context engineering is a discipline and not a footnote. Because the context is the only channel for live information, every design question reduces to a question about it: what is in the context, in what form, at what cost, on this call?
Don't be confused. "The model remembers" and "my application re-sends the history" describe the same observable behavior but they are not the same mechanism, and the difference is the entire job. The model is stateless. The memory is something your code maintains and re-injects. When the memory grows past the window, the model does not "forget gracefully": your code must decide what to cut, or the call fails. Mistaking the illusion for the mechanism is how people end up surprised by both the bill and the amnesia.
A context is assembled, every single turn
The context is not one thing you write once. It is rebuilt on every call from named parts: a system prompt, the tool definitions, any retrieved documents, the running conversation, the latest tool results, and the user's new message, concatenated in a fixed order. Build order matters (it interacts with caching, Chapter 6), and total size matters, because the sum has to fit the window.
Here is that assembly as runnable code. It builds a realistic turn from parts, estimates each part's size with a crude word-based count (real counts are Chapter 2), and then enforces three different budgets by dropping the oldest evictable parts until the rest fit. The system prompt and the user's message are pinned: they never get dropped, because without them the call is meaningless.
"""Assemble an LLM context from its parts and enforce a token budget.
Context engineering, at its most basic, is this: every call you BUILD the input
sequence from named parts, MEASURE it against the window, and DECIDE what to drop
when it does not fit. This script does exactly that with a crude word-based token
estimate (good enough to see the mechanics; real token counts come in chapter 2).
Only depends on the Python standard library. Run: python3 context_assemble.py
"""
from dataclasses import dataclass
def est_tokens(text: str) -> int:
"""Rough token estimate. Real tokenizers split into sub-words, so tokens are
a bit more numerous than whitespace words; ~1.3 tokens per word is a fair
rule of thumb for English prose."""
return round(len(text.split()) * 1.3)
@dataclass
class Part:
name: str
text: str
pinned: bool = False # pinned parts are never evicted (system prompt, user turn)
@property
def tokens(self) -> int:
return est_tokens(self.text)
def assemble(parts, budget):
"""Pack parts into a token budget. Pinned parts always stay. Among the rest,
keep the most RECENT (parts later in the list) and drop the oldest first,
which is the simplest possible eviction policy."""
pinned = [p for p in parts if p.pinned]
evictable = [p for p in parts if not p.pinned]
used = sum(p.tokens for p in pinned)
kept_recent = []
# walk newest-to-oldest, keep while it fits
for p in reversed(evictable):
if used + p.tokens <= budget:
kept_recent.append(p)
used += p.tokens
kept = pinned + list(reversed(kept_recent))
dropped = [p for p in evictable if p not in kept_recent]
return kept, dropped, used
# A realistic turn: stable preamble, a long retrieved doc, growing history, the ask.
parts = [
Part("system_prompt", "You are a careful coding assistant. " * 20, pinned=True),
Part("tool_defs", "search(query) read_file(path) write_file(path, text) " * 12),
Part("retrieved_doc", "The deployment pipeline runs build.sh on every push. " * 60),
Part("history_turn_1", "User asked about the database schema. Assistant replied. " * 15),
Part("history_turn_2", "User asked about migrations. Assistant gave the steps. " * 15),
Part("history_turn_3", "User asked about rollback. Assistant explained. " * 15),
Part("user_message", "Now: why did the last deploy fail?", pinned=True),
]
total = sum(p.tokens for p in parts)
print("=== The context as assembled, part by part ===")
for p in parts:
flag = " [pinned]" if p.pinned else ""
print(f" {p.name:16s} {p.tokens:5d} tokens{flag}")
print(f" {'TOTAL':16s} {total:5d} tokens\n")
for budget in (10_000, 600, 300):
kept, dropped, used = assemble(parts, budget)
print(f"=== Budget {budget} tokens ===")
print(f" fits as-is: {total <= budget}")
print(f" kept ({used} tok): {', '.join(p.name for p in kept)}")
print(f" dropped: {', '.join(p.name for p in dropped) or '(nothing)'}")
print()
print("Lesson: the SAME parts produce a different context at every budget. Deciding")
print("what survives the squeeze IS context engineering; the rest of the book is")
print("smarter ways to do it than 'drop the oldest'.")
Running it:
=== The context as assembled, part by part ===
system_prompt 156 tokens [pinned]
tool_defs 62 tokens
retrieved_doc 624 tokens
history_turn_1 156 tokens
history_turn_2 156 tokens
history_turn_3 117 tokens
user_message 9 tokens [pinned]
TOTAL 1280 tokens
=== Budget 10000 tokens ===
fits as-is: True
kept (1280 tok): system_prompt, user_message, tool_defs, retrieved_doc, history_turn_1, history_turn_2, history_turn_3
dropped: (nothing)
=== Budget 600 tokens ===
fits as-is: False
kept (594 tok): system_prompt, user_message, history_turn_1, history_turn_2, history_turn_3
dropped: tool_defs, retrieved_doc
=== Budget 300 tokens ===
fits as-is: False
kept (282 tok): system_prompt, user_message, history_turn_3
dropped: tool_defs, retrieved_doc, history_turn_1, history_turn_2
Lesson: the SAME parts produce a different context at every budget. Deciding
what survives the squeeze IS context engineering; the rest of the book is
smarter ways to do it than 'drop the oldest'.
Look at what happened across the three budgets. With room to spare (10,000 tokens),
everything goes in. At 600 tokens the retrieved document and the tool definitions get cut,
and notice the cost of that crude policy: dropping retrieved_doc may have thrown away the
exact thing the user's question needed, while keeping three turns of older chat that did
not. At 300 tokens only the most recent turn survives. The same raw material produced three
different contexts, and the quality of each depends entirely on the policy that chose what
to keep.
That policy is the subject of the book. "Drop the oldest" is the dumbest possible version. Every later chapter is a smarter answer to the same question:
- Instead of dropping the document, compress it so it still fits (Chapter 3).
- Instead of dropping old turns, summarize them into a few tokens (Chapter 11).
- Instead of re-sending the stable preamble and re-paying for it, cache it (Chapter 6).
- Instead of guessing which document is relevant, retrieve and rank by the question (Chapter 9 and Chapter 13).
The two pressures, and the four families
Every technique in this book exists to relieve one of two pressures, and usually it trades a little of one for a lot of the other:
- The window is finite. You cannot fit everything, so you must choose, shrink, or externalize. This is the capacity pressure.
- Tokens cost money and time. Even when it all fits, re-sending and re-generating tokens you did not need to is waste. This is the cost pressure, and we make it concrete in the next chapter.
The four families of techniques map onto these pressures:
| Family | What it does | Mainly relieves |
|---|---|---|
| Compression (3to5) | makes each part smaller without losing what matters | capacity and cost |
| Caching (6to8) | reuses work already done on stable or repeated context | cost |
| Memory (9to12) | stores state outside the window and re-injects only the relevant slice | capacity |
| Architecture (13to14) | assembles the right context per turn, and makes long windows tractable at all | both |
What "good" looks like
A well-engineered context has four properties, and you can audit any system against them:
- Sufficient. Everything the task needs is present, in a form the model can use.
- Lean. Nothing the task does not need is present. Every token earns its place.
- Cheap to repeat. The stable parts are cached, not re-paid, across the many calls of a session.
- Durable. State that should outlive a single call is stored outside the model and re-injected on demand, so the system does not get amnesia at the window boundary.
Hold these four in mind as you read. Most production failures are a violation of exactly one of them: a missing document (not sufficient), a bloated preamble (not lean), a re-charged prefix (not cheap to repeat), or a forgotten fact (not durable).
Takeaways
- The model is a pure function of its context; between calls it remembers nothing. Any "memory" is state your code keeps and re-injects.
- The context is rebuilt every turn from named parts, and it must fit a fixed window. When it does not, something gets cut, and the policy that chooses is where quality lives.
- Two pressures drive everything: finite capacity and per-token cost. The four families (compression, caching, memory, architecture) each relieve one or both.
- A good context is sufficient, lean, cheap to repeat, and durable. Most failures are a violation of exactly one of those.
👉 We have been counting tokens with a hand-wave. The next chapter makes the cost real: how tokens are actually counted, why output is the expensive half, and how to put a dollar figure on a context before you ever send it.
The token economy
Every technique in this book is justified by a number: tokens saved, dollars saved, milliseconds saved. So before the techniques, we need to be able to count. This chapter covers what a token is, why output tokens cost several times more than input tokens (which quietly decides what is worth optimizing), and how to estimate a context's size and price before you send it.
What a token is
Models do not read characters or words. They read tokens: chunks of text from a fixed
vocabulary, usually sub-word pieces. Common words are a single token (context), rarer or
longer words split into several (contextual becomes context + ual), and unusual
strings fall apart into many small pieces. The split is learned, not rule-based, by an
algorithm called byte pair encoding (BPE): start from individual characters, then
repeatedly merge the most frequent adjacent pair into a new token, and keep the merges as a
ranked list. At encode time you replay those merges on a new word.
This matters for context engineering because your bill and your window are measured in
tokens, not words or characters, and the ratio is not fixed. English prose runs about 1.3
tokens per word; code, JSON, and non-English text run higher because they fragment more.
That is why you cannot eyeball a budget reliably and why every provider ships a
count_tokens endpoint.
Output is the expensive half
Here is the fact that reshapes what you optimize: you are billed for both the tokens you send (input) and the tokens the model writes (output), but output is priced several times higher than input. On the Anthropic models, output is exactly 5x the input rate.
The demo below shows three things from that one fact. First, the 5x asymmetry across models. Second, a worked workload: token for token, cutting output is worth 5x cutting input, even though in a long-prompt/short-answer workload the input still dominates the total bill (the asymmetry is per token, not per call). Third, a from-scratch BPE tokenizer so "sub-word merging" is concrete rather than a phrase, plus the two practical estimates you can apply in your head.
"""The token economy: why output is the expensive half, and how to estimate cost.
Two ideas, both measurable:
1. You are billed per token, and OUTPUT tokens cost several times more than INPUT
tokens. So trimming what the model WRITES often saves more than trimming what
you SEND.
2. You can estimate a prompt's token count before you send it. Real tokenizers use
sub-word merging (BPE); we show a tiny BPE to demystify it, plus the practical
rule of thumb. (For exact counts, call the provider's count_tokens endpoint;
never use another vendor's tokenizer, it will be wrong.)
Standard library only. Run: python3 token_economy.py
"""
from collections import Counter
# Published Anthropic prices, US dollars per MILLION tokens (input, output).
# These anchor the asymmetry; verify current numbers before quoting them anywhere.
PRICES = {
"claude-opus-4-8": (5.0, 25.0), # 1M context
"claude-sonnet-4-6": (3.0, 15.0), # 1M context
"claude-haiku-4-5": (1.0, 5.0), # 200K context
}
def cost(model, in_tok, out_tok):
pin, pout = PRICES[model]
return (in_tok / 1e6) * pin + (out_tok / 1e6) * pout
print("=== 1. Output is the expensive half ===")
print("Per-token, output costs 5x input on every model here:\n")
for m, (pin, pout) in PRICES.items():
print(f" {m:18s} input ${pin:>5.2f}/Mtok output ${pout:>5.2f}/Mtok "
f"output is {pout/pin:.0f}x")
print()
# A support agent: a big stable prompt in, a short answer out, called a lot.
in_tok, out_tok, calls = 8000, 400, 100_000
m = "claude-opus-4-8"
base = cost(m, in_tok, out_tok) * calls
print(f"Workload: {calls:,} calls, {in_tok} input + {out_tok} output tokens each "
f"({m}).")
print(f" total cost: ${base:,.0f}\n")
# Per-token, trimming OUTPUT is worth 5x trimming the same count of INPUT.
save_out = (cost(m, in_tok, out_tok) - cost(m, in_tok, out_tok - 100)) * calls
save_in = (cost(m, in_tok, out_tok) - cost(m, in_tok - 100, out_tok)) * calls
print(f" cut 100 OUTPUT tokens/call: saves ${save_out:,.0f}")
print(f" cut 100 INPUT tokens/call: saves ${save_in:,.0f}")
print(f" -> token for token, output is worth {save_out/save_in:.0f}x as much to cut.")
print(" (Caching makes input cheaper still, ~0.1x on a hit, widening the gap;")
print(" see chapter 6. Whether INPUT or OUTPUT dominates your bill depends on the")
print(" ratio: a long prompt with a short answer is input-heavy in TOTAL, even")
print(" though each output token is pricier.)\n")
def bpe_encode(word, merges):
"""Encode one word with a learned merge list (the heart of GPT/Claude-style
tokenizers). Start from characters, then repeatedly glue the highest-priority
adjacent pair until no learned merge applies."""
toks = list(word)
while True:
pairs = [(toks[i], toks[i + 1]) for i in range(len(toks) - 1)]
ranked = [(merges[p], i) for i, p in enumerate(pairs) if p in merges]
if not ranked:
return toks
_, i = min(ranked) # apply the earliest-learned (highest priority) merge
toks[i:i + 2] = ["".join(toks[i:i + 2])]
def learn_merges(corpus, n):
"""Learn n merges greedily: repeatedly fuse the most frequent adjacent pair."""
seqs = [list(w) for w in corpus.split()]
merges = {}
for rank in range(n):
pairs = Counter()
for s in seqs:
for i in range(len(s) - 1):
pairs[(s[i], s[i + 1])] += 1
if not pairs:
break
best = pairs.most_common(1)[0][0]
merges[best] = rank
for s in seqs: # apply the new merge everywhere
i = 0
while i < len(s) - 1:
if (s[i], s[i + 1]) == best:
s[i:i + 2] = ["".join(best)]
else:
i += 1
return merges
print("=== 2. A tiny tokenizer (BPE), from scratch ===")
corpus = "context contexts contextual contexted engineering engineer engineered " * 50
merges = learn_merges(corpus, n=12)
for w in ["context", "contextual", "engineering", "token"]:
toks = bpe_encode(w, merges)
print(f" {w:12s} -> {toks} ({len(toks)} token(s))")
print()
print("=== 3. The practical estimate ===")
prompt = ("Summarize the deployment failure and propose a fix. "
"Keep it under three sentences. ") * 4
words = len(prompt.split())
chars = len(prompt)
print(f" prompt: {words} words, {chars} chars")
print(f" ~words * 1.3 = {round(words * 1.3)} tokens")
print(f" ~chars / 4 = {round(chars / 4)} tokens")
print(" Both are estimates. For billing-grade counts, call the model's own")
print(" count_tokens endpoint with the SAME model id you will send to.")
Running it:
=== 1. Output is the expensive half ===
Per-token, output costs 5x input on every model here:
claude-opus-4-8 input $ 5.00/Mtok output $25.00/Mtok output is 5x
claude-sonnet-4-6 input $ 3.00/Mtok output $15.00/Mtok output is 5x
claude-haiku-4-5 input $ 1.00/Mtok output $ 5.00/Mtok output is 5x
Workload: 100,000 calls, 8000 input + 400 output tokens each (claude-opus-4-8).
total cost: $5,000
cut 100 OUTPUT tokens/call: saves $250
cut 100 INPUT tokens/call: saves $50
-> token for token, output is worth 5x as much to cut.
(Caching makes input cheaper still, ~0.1x on a hit, widening the gap;
see chapter 6. Whether INPUT or OUTPUT dominates your bill depends on the
ratio: a long prompt with a short answer is input-heavy in TOTAL, even
though each output token is pricier.)
=== 2. A tiny tokenizer (BPE), from scratch ===
context -> ['context'] (1 token(s))
contextual -> ['context', 'u', 'a', 'l'] (4 token(s))
engineering -> ['enginee', 'r', 'i', 'ng'] (4 token(s))
token -> ['t', 'o', 'k', 'e', 'n'] (5 token(s))
=== 3. The practical estimate ===
prompt: 52 words, 332 chars
~words * 1.3 = 68 tokens
~chars / 4 = 83 tokens
Both are estimates. For billing-grade counts, call the model's own
count_tokens endpoint with the SAME model id you will send to.
Read the BPE output closely, because it shows the algorithm working. After learning just
twelve merges on a small corpus, context has been fused into a single token, while
token itself, which never appeared in the training corpus, stays shattered into five
character tokens. That is exactly why a word common in your domain is cheap and a rare
identifier is expensive: the tokenizer has a merge for the former and not the latter.
The cost asymmetry is the strategic takeaway. It is why Chapter 4 is dedicated entirely to making the model write less, and why a one-line "be concise" instruction can have a bigger return than compressing a long document. It does not mean input is free: in the workload above the input still costs more in total because there is 20x more of it. The rule is per token. When you are choosing what to optimize, weight a saved output token as five saved input tokens, then multiply by how many of each you actually have.
Counting tokens for real
Estimates (words * 1.3, chars / 4) are fine for a sanity check. For anything that
touches a budget or a billing decision, count exactly, with the same model id you will
send to, because different model families tokenize differently. Anthropic exposes this as
count_tokens, which takes the same messages you would send and returns the input token
count without running the model. The following is follow-along (the build machine has no
API key), but it is the exact call:
# Illustrative: requires the anthropic SDK and an API key.
import anthropic
client = anthropic.Anthropic()
n = client.messages.count_tokens(
model="claude-opus-4-8",
system=SYSTEM_PROMPT,
messages=[{"role": "user", "content": LONG_DOCUMENT}],
).input_tokens
print(n, "input tokens") # exact, free, no generation
Don't be confused. Do not reach for another vendor's tokenizer library (for example a GPT tokenizer) to count tokens for Claude, or vice versa. Each model family has its own vocabulary and merges, so a foreign tokenizer can be off by 15 to 20 percent, more on code. "Approximately right" is fine for a gut check and wrong for a budget. Use the provider's own
count_tokensfor the model you will actually call.
Pricing a context before you send it
Put the two halves together and you can price a call before making it: count the input
tokens, estimate the output tokens you expect, and multiply by the per-token rates. The
cost(model, in_tok, out_tok) function in the demo is the whole calculation. Doing this up
front is what turns context engineering from a vibe ("this feels bloated") into a decision
("this preamble costs $2,000 a day re-sent uncached; caching it drops that to $200").
Every later chapter ends up cashing out as a change to one of the three numbers in that
formula: fewer input tokens, fewer output tokens, or a cheaper rate on repeated input.
Takeaways
- Tokens are learned sub-word pieces, not words. Count is roughly 1.3 tokens per English word and higher for code or JSON; the ratio is not fixed, so estimates drift.
- BPE builds the vocabulary by merging frequent adjacent pairs; a domain-common word is one cheap token, a rare identifier is many expensive ones.
- Output tokens cost about 5x input tokens. Token for token, trimming output is worth 5x trimming input, but whether input or output dominates your total bill depends on the ratio you actually run.
- Estimate with
words * 1.3for a gut check; count with the provider'scount_tokensand the exact model id for anything that touches a budget. Never use a foreign tokenizer. - Pricing a context up front (count in, estimate out, multiply by rates) turns "feels bloated" into a number you can act on.
👉 Now that we can measure a context and its price, we can start shrinking it. The next chapter compresses the input: removing the tokens a long prompt does not need while keeping the ones it does.
Prompt and context compression
Chapter 1 ended on a crude policy: when the context does not fit, drop the oldest part. The worst case was dropping the retrieved document, the one piece the question actually needed. This chapter fixes that. Instead of throwing a part away whole, we compress it: keep the words that matter for this question and remove the words that do not, so the part still fits but still answers.
This is compression of the input, the tokens you send. A long tool output, a noisy log, a retrieved document, a stack of few-shot examples: most of those tokens are filler for the question at hand. Stripping them before the call buys you both window space and lower input cost (Chapter 2). Chapter 4 does the mirror job on the output, the tokens the model writes.
The idea: keep what answers the question
Compression here is query-aware and extractive. Two words worth pinning down:
- Query-aware means we compress with respect to a specific question. The same document compresses differently depending on what you are asking. A log compressed for "what was the error?" keeps different lines than the same log compressed for "how long did the outage last?".
- Extractive means we keep original sentences verbatim and only choose which to drop. We never rewrite or paraphrase. The opposite, abstractive compression, rewrites the text into something shorter (that is summarization, Chapter 11). Extractive is safer: it cannot hallucinate a fact that was not there, because every surviving word came straight from the source.
Don't be confused. Compression and summarization both make text shorter, but they are not the same operation. Compression here is extractive and lossless-at-the-word-level: it deletes spans and keeps the rest untouched, so what survives is guaranteed to be the author's original words. Summarization is abstractive: it generates new, shorter text that describes the original, which can introduce errors and costs output tokens to produce. Reach for extractive compression when you cannot afford a paraphrase to drift; reach for summarization when the source is so long that even the important sentences will not fit.
To choose which sentences to keep, we score each one by how much it is worth for the query, then keep the best ones until we hit a token budget.
Scoring a sentence
A good score has to do two things at once: reward sentences that mention the query terms, and not be fooled by filler words that appear everywhere. We combine two signals.
The first is term overlap: how many of the query's words show up in the sentence. On its own this is weak, because the most frequent overlapping word is usually something like "the" or "of", which tells you nothing about which sentence is relevant.
So we weight each word by its inverse document frequency (idf). Treat every sentence as a small "document". A word that appears in almost every sentence is common and uninformative; a word that appears in only one or two sentences is distinctive. idf turns that intuition into a number:
$$\text{idf}(w) = \log!\left(\frac{N + 1}{\text{df}(w) + 1}\right) + 1$$
Here $N$ is the number of sentences and $\text{df}(w)$ (the document frequency) is how many of them contain the word $w$. A word in every sentence has $\text{df}(w) \approx N$, so the fraction is near 1 and its idf is low. A word in one sentence has a small $\text{df}(w)$, so the fraction is large and its idf is high. The $+1$ terms are smoothing: they keep the logarithm away from dividing by zero and stop any weight from collapsing to exactly zero.
The sentence score is then the summed idf of every sentence word that also appears in the query, divided by the sentence length so we are scoring density of relevant content, not just rewarding long sentences:
$$\text{score}(s) = \frac{1}{|s|}\sum_{w \in s,; w \in q} \text{idf}(w)$$
where $|s|$ is the word count of sentence $s$ and $q$ is the set of query words.
Coarse to fine
We compress in two passes, cheap first.
- Coarse. Rank all sentences by score and keep the best ones, top down, until the next one would blow the token budget. This drops whole low-value sentences. It is the big win, because most of the bloat is entire sentences that have nothing to do with the question.
- Fine. On the sentences that survived, strip stopwords (the "the", "of", "is" filler) to squeeze out a bit more. We protect any word that appears in the query, so we never delete the very terms that made a sentence worth keeping.
Coarse-to-fine is the same shape the real systems use. Spend your cheap, high-leverage operation first (drop whole units), then do the finer, riskier trimming only on what is left.
The demo
The code below builds the scorer and both passes over an incident write-up where exactly one sentence answers the question ("what was the root cause of the outage?"), buried among filler and near-miss distractors. It prints every sentence's score so the ranking is not a black box, runs the coarse and fine passes under a 40-token budget, reports the compression ratio, and verifies the gold answer survived.
"""Query-aware extractive prompt compression, from scratch.
The problem: you are about to send a long context (a retrieved document, a pile
of logs, a tool's output) to the model along with a question. Most of those
tokens do not help answer THIS question. Compression removes them BEFORE the
call, so you pay for fewer input tokens and free up window space.
This is the simplest honest version of the idea behind Microsoft's LLMLingua:
score each unit of text by how much it matters to the query, then keep the best
units under a token budget. We do it EXTRACTIVELY (we keep original sentences,
we never rewrite), with two passes:
COARSE: drop whole low-scoring sentences.
FINE: strip filler ("stopwords") from the sentences that survived.
A sentence's score combines two signals:
* term OVERLAP with the query (does it mention what was asked?), and
* an inverse-document-frequency (idf) WEIGHT per word, so a word that appears
in almost every sentence (filler like "the", "deployment") counts for little
and a rare, informative word counts for a lot. We compute idf with NumPy.
We measure tokens with the same words * 1.3 estimate as chapter 2, report the
compression ratio, and VERIFY that the one sentence that actually answers the
query survives the squeeze.
NumPy + standard library only. Run: python3 compress.py
"""
import re
import numpy as np
# Words so common they carry almost no information about WHICH sentence answers
# a query. Real compressors learn this; we hard-code a tiny list to stay simple.
STOPWORDS = {
"the", "a", "an", "and", "or", "but", "if", "then", "of", "to", "in", "on",
"at", "by", "for", "with", "as", "is", "are", "was", "were", "be", "been",
"it", "its", "this", "that", "these", "those", "we", "you", "they", "i",
"our", "your", "their", "from", "into", "out", "up", "down", "over", "all",
"can", "will", "would", "should", "may", "have", "has", "had", "do", "does",
"not", "no", "so", "than", "very", "just", "also", "which", "when", "what",
}
def tokens(text):
"""Crude token estimate, consistent with chapter 2: words * 1.3."""
return round(len(text.split()) * 1.3)
def words(text):
"""Lowercased word list, punctuation stripped. The unit we score over."""
return re.findall(r"[a-z0-9]+", text.lower())
def split_sentences(text):
"""Split a blob into sentences on ., !, ? (good enough for plain prose)."""
parts = re.split(r"(?<=[.!?])\s+", text.strip())
return [p.strip() for p in parts if p.strip()]
def idf(sentences):
"""Inverse document frequency for every word, computed over the sentences.
Treat each SENTENCE as a 'document'. A word in many sentences is common and
uninformative (low idf); a word in few sentences is distinctive (high idf).
idf(w) = log( (N + 1) / (df(w) + 1) ) + 1
N is the sentence count, df(w) the number of sentences containing w. The
+1's are smoothing so nothing blows up or hits zero. NumPy does the arithmetic
over the whole vocabulary at once."""
vocab = sorted({w for s in sentences for w in words(s)})
index = {w: i for i, w in enumerate(vocab)}
n = len(sentences)
df = np.zeros(len(vocab))
for s in sentences:
for w in set(words(s)):
df[index[w]] += 1
scores = np.log((n + 1) / (df + 1)) + 1.0
return {w: float(scores[index[w]]) for w in vocab}
def score_sentence(sentence, query_words, idf_weights):
"""How much this sentence is worth, given the query.
Sum the idf weight of every sentence word that also appears in the query,
then divide by the sentence's length in words so we are not just rewarding
long sentences. This is a query-aware, idf-weighted overlap score."""
sw = words(sentence)
if not sw:
return 0.0
hit = sum(idf_weights.get(w, 1.0) for w in sw if w in query_words)
return hit / len(sw)
def compress(context, query, budget_tokens, strip_stopwords=False):
"""Keep the highest-scoring sentences under a token budget (coarse pass).
If strip_stopwords is set, also remove filler from survivors (fine pass).
Returns (kept_text, list_of_kept_original_sentences)."""
sentences = split_sentences(context)
idf_weights = idf(sentences)
qwords = set(words(query))
ranked = sorted(
sentences,
key=lambda s: score_sentence(s, qwords, idf_weights),
reverse=True,
)
# COARSE: walk best-first, keep a sentence if it still fits the budget.
kept, used = [], 0
for s in ranked:
cost = tokens(s)
if used + cost <= budget_tokens:
kept.append(s)
used += cost
# Put survivors back in their original reading order.
kept_in_order = [s for s in sentences if s in kept]
if not strip_stopwords:
return " ".join(kept_in_order), kept_in_order
# FINE: drop stopwords from each survivor. Query words are always protected
# so we never delete the very terms that made a sentence worth keeping.
fine = []
for s in kept_in_order:
out = [w for w in s.split()
if w.lower().strip(".,!?;:") not in STOPWORDS
or w.lower().strip(".,!?;:") in qwords]
fine.append(" ".join(out))
return " ".join(fine), kept_in_order
# A realistic long context: an incident write-up where exactly ONE sentence
# answers the question, buried in filler and near-miss distractors.
CONTEXT = """
The on-call engineer was paged at 02:14 about elevated error rates.
The dashboard showed the API returning 500s on roughly one in three requests.
We have a weekly deployment cadence and this happened right after a release.
The team had been discussing a migration to a new database for some time.
Logs from the gateway were noisy but did not point to a single cause at first.
The root cause was a connection pool limit of 20 that the new release exhausted under load.
A rollback to the previous version restored service within eight minutes.
The previous version had run for three weeks without any incident at all.
Customer support received a handful of tickets during the outage window.
We will add an alert on pool saturation and raise the limit in the next release.
The postmortem is scheduled for Thursday and the document is already drafted.
Coffee in the office kitchen ran out around the same time, unrelated.
"""
QUERY = "What was the root cause of the outage?"
GOLD = ("The root cause was a connection pool limit of 20 that the new "
"release exhausted under load.")
print("=== Query-aware extractive compression ===")
print(f"Query: {QUERY}\n")
original = " ".join(split_sentences(CONTEXT))
orig_tok = tokens(original)
n_sent = len(split_sentences(CONTEXT))
print(f"BEFORE: {n_sent} sentences, {orig_tok} tokens (words * 1.3 estimate)")
# Show the score of every sentence so the ranking is not a black box.
sents = split_sentences(CONTEXT)
idf_weights = idf(sents)
qwords = set(words(QUERY))
print("\nSentence scores (query-aware, idf-weighted):")
for s in sorted(sents, key=lambda s: score_sentence(s, qwords, idf_weights),
reverse=True):
sc = score_sentence(s, qwords, idf_weights)
short = s if len(s) <= 60 else s[:57] + "..."
print(f" {sc:5.2f} {short}")
# COARSE pass: drop whole low-value sentences under a tight budget.
BUDGET = 40
coarse_text, kept = compress(CONTEXT, QUERY, BUDGET)
coarse_tok = tokens(coarse_text)
print(f"\n--- COARSE (budget {BUDGET} tokens): keep best whole sentences ---")
print(f"AFTER: {len(kept)} sentences, {coarse_tok} tokens "
f"({orig_tok / coarse_tok:.1f}x smaller)")
for s in kept:
print(f" + {s}")
# Verify the gold sentence survived the squeeze.
gold_kept = any(GOLD.split()[3:8] == s.split()[3:8] for s in kept) \
or GOLD in kept
print(f"\nGold answer retained? {gold_kept}")
# FINE pass: also strip stopwords from the survivors.
fine_text, _ = compress(CONTEXT, QUERY, BUDGET, strip_stopwords=True)
fine_tok = tokens(fine_text)
print(f"\n--- FINE (also strip stopwords from survivors) ---")
print(f"AFTER: {fine_tok} tokens ({orig_tok / fine_tok:.1f}x smaller overall)")
print(f" {fine_text}")
print("\n=== Summary ===")
print(f" original : {orig_tok:4d} tokens")
print(f" coarse compressed : {coarse_tok:4d} tokens "
f"({orig_tok / coarse_tok:.1f}x)")
print(f" + fine compressed : {fine_tok:4d} tokens "
f"({orig_tok / fine_tok:.1f}x)")
print(f" gold sentence kept: {gold_kept}")
print("\nThe model still gets the one sentence that answers the question,")
print("at a fraction of the tokens. That saving is real input cost (chapter 2).")
Running it:
=== Query-aware extractive compression ===
Query: What was the root cause of the outage?
BEFORE: 12 sentences, 207 tokens (words * 1.3 estimate)
Sentence scores (query-aware, idf-weighted):
0.73 The root cause was a connection pool limit of 20 that the...
0.58 Customer support received a handful of tickets during the...
0.27 The on-call engineer was paged at 02:14 about elevated er...
0.22 Logs from the gateway were noisy but did not point to a s...
0.18 The postmortem is scheduled for Thursday and the document...
0.18 Coffee in the office kitchen ran out around the same time...
0.17 The dashboard showed the API returning 500s on roughly on...
0.14 We will add an alert on pool saturation and raise the lim...
0.10 A rollback to the previous version restored service withi...
0.08 The previous version had run for three weeks without any ...
0.08 The team had been discussing a migration to a new databas...
0.00 We have a weekly deployment cadence and this happened rig...
--- COARSE (budget 40 tokens): keep best whole sentences ---
AFTER: 2 sentences, 36 tokens (5.8x smaller)
+ The root cause was a connection pool limit of 20 that the new release exhausted under load.
+ Customer support received a handful of tickets during the outage window.
Gold answer retained? True
--- FINE (also strip stopwords from survivors) ---
AFTER: 32 tokens (6.5x smaller overall)
The root cause was connection pool limit of 20 the new release exhausted under load. Customer support received handful of tickets during the outage window.
=== Summary ===
original : 207 tokens
coarse compressed : 36 tokens (5.8x)
+ fine compressed : 32 tokens (6.5x)
gold sentence kept: True
The model still gets the one sentence that answers the question,
at a fraction of the tokens. That saving is real input cost (chapter 2).
The gold sentence ("The root cause was a connection pool limit of 20...") scores highest at 0.73, far above the rest, because it contains the rare, high-idf words "root", "cause", and "pool" that also match the query, while the deployment-cadence sentence scores 0.00 because none of its words overlap the query. The coarse pass cuts 207 tokens to 36, a 5.8x reduction, and the fine pass trims that to 32, for 6.5x overall, with the answer intact.
Two honest details in that output. The second kept sentence ("Customer support received a handful of tickets...") is a passenger: it scored 0.58 mostly on the word "outage" matching the query, and it fit in the remaining budget, so it rode along. A tighter budget would have dropped it. And in the fine output, the words "the", "of", and "was" survived even though they are stopwords, because they appear in the query "What was the root cause of the outage?" and we protect query words from deletion. Both behaviors are the rule working as written, not a bug.
The real project: LLMLingua
The demo is the teaching version of Microsoft's LLMLingua family, the reference systems for prompt compression. They share the coarse-to-fine shape but replace our hand-rolled score with a learned one.
LLMLingua does the same two-level pruning, on a larger scale:
- Coarse (demonstration-level pruning). Given a prompt full of few-shot examples (demonstrations), it ranks whole demonstrations and drops the least useful ones, the same move as our coarse pass dropping whole sentences.
- Fine (token-level pruning). On what remains, it ranks individual tokens and drops the most predictable ones. The ranking signal is perplexity from a small language model: roughly, how surprised the small model is to see each token given the ones before it. A token the small model could have guessed (low perplexity, high redundancy) carries little information and is safe to drop; a surprising token (high perplexity) is doing real work and is kept. This is where it improves on idf: idf only knows how rare a word is across sentences, while perplexity knows how predictable a token is in this exact context.
- A budget controller sets different compression rates for different parts of the prompt, compressing the few-shot examples harder than the instructions, on the theory that examples are more redundant than the task description and can lose more without hurting.
- A distribution-alignment step nudges the small model used for scoring to behave more like the large target model, so the perplexity rankings transfer.
LLMLingua reports up to roughly 20x compression with little performance loss on its benchmarks, well past the 6.5x our extractive toy reaches, because token-level perplexity pruning is finer-grained than dropping whole sentences.
LLMLingua-2 changes the fine pass. Instead of the perplexity heuristic, it uses a transformer classifier trained for the job: for each token, a small model outputs keep-or-drop. The training labels come from data distilled from GPT-4, which was prompted to compress text, producing examples of which tokens a strong model judged droppable. Learning keep/drop directly is both faster (one forward pass, no per-token perplexity loop) and often more accurate than the perplexity proxy, because the classifier was trained on the actual target (was this token kept?) rather than a stand-in for it.
Don't be confused. Perplexity and the classifier are two different ways to answer the same question, "is this token droppable?". Perplexity (LLMLingua) is a heuristic: it assumes predictable tokens are redundant, which is usually but not always true. The classifier (LLMLingua-2) is trained on labeled keep/drop decisions, so it learns the exception cases the heuristic gets wrong. Same goal, one inferred, one learned.
Where these earn their keep:
- Long RAG contexts. You retrieved ten documents and only fragments of three are relevant. Compress them against the query before the call.
- Few-shot demonstrations. A long prompt of examples is highly redundant; the budget controller compresses it hardest, often the largest single saving.
- Long chain-of-thought. A model's own long reasoning trace can be compressed before it is fed back in for the next step.
Verifying the saving with Claude
The point of compression is fewer input tokens, and you should confirm the count rather than
trust the words * 1.3 estimate. Anthropic's count_tokens gives the exact input count
without running the model, so you can measure before and after. The following is
follow-along (the build machine has no API key), but it is the exact call:
# Illustrative: requires the anthropic SDK and an API key.
import anthropic
client = anthropic.Anthropic()
def input_tokens(text):
return client.messages.count_tokens(
model="claude-opus-4-8",
messages=[{"role": "user", "content": text}],
).input_tokens
before = input_tokens(ORIGINAL_CONTEXT)
after = input_tokens(COMPRESSED_CONTEXT)
print(f"{before} -> {after} input tokens ({before / after:.1f}x smaller)")
Run that with the same model id you will send to, because token counts are tokenizer-specific (Chapter 2). The ratio it reports is the real saving, the one your bill will reflect.
When not to compress
Compression is not free and not always worth it.
- It costs compute. The classifier or the small scoring model is itself a model call. For a context you send once, the compression can cost more than it saves. It pays off when the same large context is reused, or when fitting the window is the binding constraint regardless of cost.
- It can drop the answer. Any lossy method can cut the one span you needed, which is why the demo verifies the gold sentence explicitly. In production you set the budget with margin and measure end-task quality, not just the ratio.
- It fights caching. If a prefix is stable across calls, caching it (Chapter 6) is cheaper than compressing it, and a compressor that changes the prefix per query defeats the cache. Compress the variable part (the retrieved documents), cache the stable part (the system prompt).
Using the real tool: commands and before/after proof
The from-scratch demo above is the on-box proof that query-aware extraction works. LLMLingua is the production version of the same idea, and you install it like any other Python package:
pip install llmlingua
The minimal real usage is two steps: build a PromptCompressor, then call compress_prompt
on your long context. The first call downloads a model the first time it runs, so it is not
instant. The snippet below is follow-along (this build box has neither the library nor a model
cache), but it is the documented quickstart shape:
# Follow-along: requires `pip install llmlingua` and a one-time model download.
from llmlingua import PromptCompressor
# LLMLingua-2 (the trained keep/drop classifier from earlier in this chapter).
compressor = PromptCompressor(
model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank",
use_llmlingua2=True,
)
ORIGINAL_CONTEXT = "...a long retrieved document, log, or stack of few-shot examples..."
QUESTION = "What was the root cause of the outage?"
result = compressor.compress_prompt(
ORIGINAL_CONTEXT,
question=QUESTION, # query-aware: keep what answers this
rate=0.3, # target keeping ~30% of the tokens
)
print(result["compressed_prompt"]) # the kept text
print(result["origin_tokens"], "->", result["compressed_tokens"])
print(result["ratio"]) # a string like "3.3x"
compress_prompt returns a dict. The fields that matter here are compressed_prompt (the
kept text, ready to send), origin_tokens and compressed_tokens (the before and after
counts by LLMLingua's own tokenizer), and ratio (a formatted string such as "3.3x"). The
two knobs are rate (keep this fraction, a float at most 1.0) and target_token (compress to
roughly this many tokens). If you pass target_token, it wins and rate is ignored. The
original (non-2) LLMLingua uses the same call; you just construct PromptCompressor() with no
arguments and it leans on target_token plus the perplexity heuristic.
Before/after proof
LLMLingua's own ratio is one number, but the count that hits your bill is the count from the
model you actually send to. So measure both prompts with Anthropic's count_tokens, which
returns the exact input count for a given model without running it, and confirm the answer
still survives the compressed version. The metric is input tokens, and the test is: the
compressed prompt is much smaller and still answers the question.
# Follow-along: requires the anthropic SDK and an API key.
import anthropic
client = anthropic.Anthropic()
def input_tokens(text):
return client.messages.count_tokens(
model="claude-opus-4-8",
messages=[{"role": "user", "content": text}],
).input_tokens
original = f"{ORIGINAL_CONTEXT}\n\nQuestion: {QUESTION}"
compressed = f"{result['compressed_prompt']}\n\nQuestion: {QUESTION}"
before = input_tokens(original)
after = input_tokens(compressed)
print(f"{before} -> {after} input tokens ({before / after:.1f}x smaller)")
# Send the compressed prompt and check the answer is still there.
reply = client.messages.create(
model="claude-opus-4-8",
max_tokens=256,
messages=[{"role": "user", "content": compressed}],
)
print(reply.content[0].text)
On a real run you would see roughly 3000 -> 600 input tokens (5.0x smaller) with the answer
(the connection-pool root cause) still in the reply. Those numbers are illustrative, not
measured on this box: the actual ratio depends on your text and the rate you pick, and you
should read it off your own count_tokens output, because the count is tokenizer-specific to
the model you send to (Chapter 2). The verified demo earlier in this
chapter is the toy you can run here; this is the same query-aware compression, measured the
same way (input tokens before vs. after), with the production tool doing the cutting.
In Claude Code
The same move applies inside an agent loop. When a step is about to feed a long retrieved document or a noisy log into the model, compress it against the current question first, so the context window stays under budget and the input bill stays small. Claude Code keeps its own context lean by the cheaper version of this discipline: it reads the specific files and line ranges a task needs rather than pasting whole files or directories into the prompt, which is extractive compression done by retrieval instead of by a compressor model.
Takeaways
- Prompt compression shrinks the input: it removes tokens a long context does not need for the current question, saving both window space and input cost.
- Score sentences query-aware and idf-weighted so filler words count for little and rare, on-topic words count for a lot. Keep the best under a token budget.
- Work coarse to fine: drop whole low-value units first (the big, cheap win), then trim filler from the survivors, protecting query terms.
- LLMLingua prunes demonstrations then tokens, ranking tokens by perplexity from a small model, with a budget controller that compresses examples harder than instructions, for up to roughly 20x. LLMLingua-2 swaps the perplexity heuristic for a trained keep/drop classifier distilled from GPT-4.
- Verify the saving with
count_tokenson the exact model id, and watch the trade-offs: compression costs compute, can drop the answer, and fights prefix caching, so compress the variable part and cache the stable part.
👉 We have squeezed the input. The next chapter turns to the other, pricier half: the output. Since each written token costs about 5x an input token, getting the model to write less is often the larger saving, and it takes different tools than compressing a prompt.
Output token reduction
Chapter 2 ended with a rule: weight a saved output token as five saved input tokens, then multiply by how many of each you actually have. Chapter 3 spent that rule on the input side, shrinking the prompt. This chapter spends it on the output side, trimming what the model writes back. Output is priced 5x higher per token, so when you control the shape of the answer, this is often the single highest-return lever in the book.
The catch is that you do not edit the output the way you edit a prompt. The prompt is text you assemble; the output is text the model generates, and you have already paid for every token of it by the time you could delete one. So output reduction is not "write less text after the fact." It is "make the model generate less in the first place." That distinction runs through the whole chapter.
Where the waste is
Ask an untuned chat model a one-word question and you rarely get a one-word answer. You get a preamble ("Certainly! I'd be happy to help..."), a restatement of your question, the actual answer, and a trailing summary that adds nothing. For a human reading one reply, that padding is harmless, even friendly. For a pipeline making the same call a hundred thousand times, three of those four parts are pure cost: you pay output rates to generate words no downstream step will read.
The demo below makes that concrete with no real model. A stub generator returns the
four-part verbose answer. Then three shapers trim it different ways, and we
measure output tokens (the words * 1.3 estimate from chapter 2) and the dollars
saved across a workload of 100,000 calls at the real claude-opus-4-8 output rate of
$25 per million tokens. The three shapers are:
- (a) terse mode: strip the preamble, the restatement, and the trailing summary, keeping only the lines that answer.
- (b) schema extraction: force the answer into a tiny JSON object
{"label": ...}, so only the needed field is ever emitted. - (c) hard cap: a
max_tokensceiling that truncates the stream once it is hit.
"""Output token reduction: shaping what the model writes BACK.
Chapter 2 established that output tokens cost 5x input tokens on the Anthropic
models. So cutting output is, token for token, the highest-return lever you have.
This file makes that concrete with NO real model: a stub generator produces a
verbose answer, three "shapers" trim it different ways, and we measure the output
tokens and DOLLARS saved across a workload of N calls.
Everything here is stdlib + (optional) numpy. We never call an API; the point is
to reason about the *shape* of the output, which you control before you ever send
the request.
"""
import re
# Real Anthropic rate for claude-opus-4-8: $25 per million OUTPUT tokens.
# (Input is $5/Mtok; output is the expensive half, see chapter 2.)
OUTPUT_RATE_PER_MTOK = 25.0
def est_tokens(text):
"""Estimate tokens from words, the chapter-2 rule: ~1.3 tokens per word.
A real count comes from the provider's count_tokens endpoint; this estimate
is fine for comparing two versions of the same kind of text.
"""
words = len(text.split())
return round(words * 1.3)
def dollars(out_tokens, n_calls):
"""Cost of `out_tokens` output tokens per call, across `n_calls` calls."""
return out_tokens * n_calls / 1_000_000 * OUTPUT_RATE_PER_MTOK
# ---------------------------------------------------------------------------
# The stub "model". It returns a verbose answer in four parts, the way an
# untuned chat model tends to: a preamble, a restatement of the question, the
# actual answer, and a trailing summary. Only the third part carries signal.
# ---------------------------------------------------------------------------
def verbose_model(question, answer):
"""Simulate a chatty model. Returns one string with four labeled parts."""
preamble = "Certainly! I'd be happy to help you with that."
restatement = f"You asked: {question}"
body = f"The answer is: {answer}."
summary = (
"In summary, I hope this explanation clarifies things for you. "
"Let me know if you'd like me to elaborate further on any point!"
)
return "\n".join([preamble, restatement, body, summary])
# ---------------------------------------------------------------------------
# Three shapers. Each takes the verbose text (and, for the schema shaper, the
# raw answer) and returns a trimmed string.
# ---------------------------------------------------------------------------
# Filler openers a chat model reaches for. We strip whole lines that are pure
# preamble/summary and have no answer content.
_FILLER = re.compile(
r"^(certainly|sure|of course|absolutely|i'd be happy|in summary|"
r"i hope this|let me know|you asked)",
re.IGNORECASE,
)
def shape_terse(text):
"""(a) Terse mode: drop preamble, restatement, and trailing summary.
Keep only the lines that actually answer. This is what a 'be concise, no
preamble' instruction (or low effort) buys you on the real API.
"""
kept = []
for line in text.splitlines():
stripped = line.strip()
if not stripped:
continue
if _FILLER.match(stripped):
continue
kept.append(stripped)
return "\n".join(kept)
def shape_schema(answer_value):
"""(b) Schema extraction: emit ONLY the needed field as a tiny JSON object.
On the real API this is output_config={"format": {"type": "json_schema",
"schema": {...}}}: the model is constrained to emit just the schema fields,
so the prose never gets generated in the first place.
"""
# Compact JSON, no spaces: {"label":"<value>"}
return '{"label":"' + str(answer_value) + '"}'
def shape_cap(text, max_tokens):
"""(c) Hard cap: a max_tokens ceiling. Truncate once the estimate is hit.
This does NOT make the answer short by design; it cuts the model off
mid-stream, so it can lose the part that mattered (see the 'Don't be
confused' box in the chapter).
"""
out_words = []
for word in text.split():
# +1 word is ~1.3 more tokens; stop before crossing the ceiling.
if est_tokens(" ".join(out_words + [word])) > max_tokens:
break
out_words.append(word)
return " ".join(out_words)
def report(name, before_text, after_text, n_calls):
"""Print BEFORE/AFTER output tokens and dollars for one shaper."""
b_tok, a_tok = est_tokens(before_text), est_tokens(after_text)
b_usd, a_usd = dollars(b_tok, n_calls), dollars(a_tok, n_calls)
pct = 0.0 if b_tok == 0 else (b_tok - a_tok) / b_tok * 100
print(f" {name}")
print(f" output tokens : {b_tok:4d} -> {a_tok:4d} ({pct:4.0f}% smaller)")
print(f" cost / {n_calls:,} calls: ${b_usd:8.2f} -> ${a_usd:8.2f} "
f"(saves ${b_usd - a_usd:7.2f})")
def main():
n_calls = 100_000 # one workload, run this many times
# A classification task: the whole useful answer is one word.
question = "Is this review positive or negative: 'the staff were rude'?"
answer = "negative"
verbose = verbose_model(question, answer)
print("=== The verbose answer the model wants to write ===")
print(verbose)
print()
print(f"Workload: {n_calls:,} calls, claude-opus-4-8 output at "
f"${OUTPUT_RATE_PER_MTOK:.0f}/Mtok.")
print(f"Verbose output is {est_tokens(verbose)} tokens; "
f"that costs ${dollars(est_tokens(verbose), n_calls):,.2f} just to "
f"emit, {n_calls:,} times.")
print()
print("=== Three ways to shape the output ===")
report("(a) terse mode ", verbose, shape_terse(verbose), n_calls)
report("(b) schema (JSON) ", verbose, shape_schema(answer), n_calls)
report("(c) hard cap @ 12 ", verbose, shape_cap(verbose, 12), n_calls)
print()
# The classification punchline: a one-word answer beats a paragraph, and it
# is also a *correct* answer. Nothing was lost by shaping it short.
one_word = answer
paragraph = verbose
print("=== Classification: one word beats a paragraph ===")
print(f" paragraph answer : {est_tokens(paragraph):3d} tokens "
f"(${dollars(est_tokens(paragraph), n_calls):,.2f} / {n_calls:,} calls)")
print(f" one-word answer : {est_tokens(one_word):3d} tokens "
f"(${dollars(est_tokens(one_word), n_calls):,.2f} / {n_calls:,} calls)")
saved = dollars(est_tokens(paragraph), n_calls) - dollars(est_tokens(one_word), n_calls)
print(f" same label, {saved / dollars(est_tokens(paragraph), n_calls) * 100:.0f}% "
f"cheaper. The extra prose was never the answer.")
if __name__ == "__main__":
main()
Running it:
=== The verbose answer the model wants to write ===
Certainly! I'd be happy to help you with that.
You asked: Is this review positive or negative: 'the staff were rude'?
The answer is: negative.
In summary, I hope this explanation clarifies things for you. Let me know if you'd like me to elaborate further on any point!
Workload: 100,000 calls, claude-opus-4-8 output at $25/Mtok.
Verbose output is 62 tokens; that costs $155.00 just to emit, 100,000 times.
=== Three ways to shape the output ===
(a) terse mode
output tokens : 62 -> 5 ( 92% smaller)
cost / 100,000 calls: $ 155.00 -> $ 12.50 (saves $ 142.50)
(b) schema (JSON)
output tokens : 62 -> 1 ( 98% smaller)
cost / 100,000 calls: $ 155.00 -> $ 2.50 (saves $ 152.50)
(c) hard cap @ 12
output tokens : 62 -> 12 ( 81% smaller)
cost / 100,000 calls: $ 155.00 -> $ 30.00 (saves $ 125.00)
=== Classification: one word beats a paragraph ===
paragraph answer : 62 tokens ($155.00 / 100,000 calls)
one-word answer : 1 tokens ($2.50 / 100,000 calls)
same label, 98% cheaper. The extra prose was never the answer.
Read the three shapers against each other. Terse mode takes the 62-token reply down
to 5 tokens (92% smaller) by dropping the three non-answer parts. Schema extraction
goes further, to a single token, because it never lets the prose exist: the model is
constrained to emit {"label":"negative"} and nothing else. The hard cap saves the
least, 81%, and for a reason that matters: it does not shape the answer, it just
stops it at 12 tokens. Here that happens to land after the useful word, but a cap
is a blunt instrument, and the chapter's warning box is about exactly that.
The two ways to make output small
The schema and terse results are 98% and 92% smaller, the cap 81%, but the gap is not the headline. The headline is how each got small, because that decides whether the answer is still correct.
Don't be confused. Truncating output with a hard cap and shaping it to be short by design are not the same thing, even when they produce a similar token count. A
max_tokensceiling cuts the stream off at a fixed length wherever the model happens to be: if the model front-loaded filler, the cap can chop off the actual answer and leave you the preamble. Shaping (terse instructions, a schema, low effort) changes what the model decides to write, so the answer is short because it was never padded, not because it was guillotined. Use the cap as a safety ceiling against a runaway response, not as your primary way to get short answers. If your answers are short only because they keep hitting the cap, you have a design problem wearing a cost solution.
That is why the schema shaper is the strongest of the three. It is the difference
between asking nicely for a short answer and removing the model's ability to write a
long one. A json_schema with one string field has no slot for a preamble, so the
preamble cannot be generated, so you cannot be billed for it. Terse instructions get
most of the way there and are easy to apply everywhere; schemas get the rest of the
way when the output is structured enough to pin down.
One word beats a paragraph
The last block of the demo is the classification punchline. The task ("is this review
positive or negative?") has a one-word answer. A paragraph that explains the
sentiment, hedges, and offers to elaborate costs 62 tokens. The bare label
negative costs 1. Same answer, 98% cheaper, and the short version is not a degraded
answer: it is the whole answer. The extra prose was never the thing you asked for.
This generalizes past classification. Any task whose result is a fixed shape, a label, a number, a yes/no, a single extracted field, an enum, is a task where the useful output is tiny and everything else is the model being conversational at your expense. Extraction pipelines, routing decisions, and JSON-returning API endpoints all live here. The win is largest exactly where the answer is most constrained, because that is where the ratio of padding to signal is worst.
The provider-native levers
Everything above was simulated so the mechanics stay visible. On the real Anthropic API you do not hand-roll the shapers; you set request parameters and let the model do the trimming. The four that matter for output, in rough order of how often you reach for them:
The following is follow-along (the build machine has no API key), and the output
shapes are illustrative, but the calls are exact. Consult the claude-api skill
before writing this for real.
# Illustrative: requires the anthropic SDK and an API key.
import anthropic
client = anthropic.Anthropic()
# 1. effort: low. The single biggest knob. Lower effort means fewer and more
# consolidated tool calls, less preamble, and terser confirmations. Values:
# low | medium | high | max. For simple/cheap work, low is the right default.
resp = client.messages.create(
model="claude-opus-4-8",
max_tokens=256, # 2. hard ceiling (the safety cap, not the shaper)
output_config={"effort": "low"},
messages=[{"role": "user", "content": "Classify sentiment: 'the staff were rude'"}],
)
# 3. Structured output: constrain the model to emit ONLY the schema fields, so
# the prose is never generated. This is shaper (b) from the demo, for real.
resp = client.messages.create(
model="claude-opus-4-8",
max_tokens=64,
output_config={
"format": {
"type": "json_schema",
"schema": {
"type": "object",
"properties": {"label": {"type": "string",
"enum": ["positive", "negative"]}},
"required": ["label"],
"additionalProperties": False,
},
}
},
messages=[{"role": "user", "content": "Classify sentiment: 'the staff were rude'"}],
)
# 4. stop_sequences: end generation the instant a marker appears, so the model
# can't run on past the part you wanted.
resp = client.messages.create(
model="claude-opus-4-8",
max_tokens=256,
stop_sequences=["\n\n"], # stop at the first blank line
messages=[{"role": "user", "content": "Give the answer on one line."}],
)
Map these back to the demo. output_config={"effort": "low"} is the production
version of terse mode: it makes the model write less without you specifying what to
cut. output_config={"format": {...}} is the schema shaper, the strongest lever when
the output is structured, because it removes the slots that padding would fill.
max_tokens is the hard cap, and the same warning applies: it is a ceiling against a
runaway response, not your way to get short answers. stop_sequences is a finer
version of the same idea, ending generation at a content marker rather than a token
count, so it cuts at the right place instead of an arbitrary length.
A note on where these don't help. The 5x asymmetry is per token, so trimming output pays best when the output is a meaningful fraction of the total. In a long-document, short-answer workload the input still dominates the bill (chapter 2's worked example), and the right move is to compress the input or cache it. Output reduction shines when the answer is verbose relative to the question, which is most of the time for chat, agents, and any task where the model is inclined to explain itself.
Two real projects
Two named tools sit on the output side of this line, trimming what the model emits rather than what you send it.
- caveman is a post-generation trimmer: it takes the model's reply and strips the conversational scaffolding (the "Certainly!", the hedges, the offers to elaborate), the way shaper (a) does. It is a cleanup pass on text you already paid to generate, which makes it a tool for the downstream consumer, not a way to lower the bill.
- Headroom's
shaperis a constrained-generation layer: it pushes the request to emit a target shape (a schema, a bounded length) so the trimming happens during generation, the way shapers (b) and (c) do. Because it constrains what the model writes, it actually reduces the output tokens you are billed for, not just the tokens you keep.
The distinction between them is the same one the "Don't be confused" box drew: trimming after the fact tidies the text but you already paid for it; constraining the generation is what changes the bill. Reach for constrained generation (effort, schema, stop sequences) when cost is the goal, and post-trimming when you just want the downstream text clean.
Using the real tool: commands and before/after proof
The "real tool" for output reduction is mostly the Anthropic API itself: the levers are request parameters, and the metric is a field on the response. There is no separate library to install. Here is one call that stacks the levers from the demo, written as follow-along because this box has no API key.
# Follow-along: requires the anthropic SDK and an API key.
import anthropic
client = anthropic.Anthropic()
resp = client.messages.create(
model="claude-opus-4-8",
max_tokens=16, # hard ceiling: a safety cap, not the shaper
output_config={
"effort": "low", # terser, less preamble (low | medium | high | max)
"format": { # constrain the answer to one JSON field
"type": "json_schema",
"schema": {
"type": "object",
"properties": {"label": {"type": "string",
"enum": ["positive", "negative"]}},
"required": ["label"],
"additionalProperties": False,
},
},
},
stop_sequences=["}"], # end generation at the closing brace
messages=[{"role": "user",
"content": "Classify sentiment: 'the staff were rude'"}],
)
Each line earns its place. effort: "low" tells the model to spend fewer tokens
thinking and to skip the conversational scaffolding. max_tokens=16 is the ceiling
that truncates a runaway response, the same blunt instrument as shaper (c). The
json_schema format removes every slot the prose would fill, so the model can only
emit {"label": ...}, which is shaper (b). stop_sequences=["}"] halts generation
the instant the object closes, so the model cannot run on.
To prove the levers did something, read response.usage.output_tokens, the count of
tokens the model actually generated, and price it at the claude-opus-4-8 output rate
of $25 per million tokens. Call once without the levers and once with them, and
compare:
# Follow-along: same caveat as above.
def cost(tokens): # dollars to emit this many output tokens
return tokens / 1_000_000 * 25.0
# WITHOUT the levers: high effort, free-form answer.
loose = client.messages.create(
model="claude-opus-4-8",
max_tokens=1024,
output_config={"effort": "high"},
messages=[{"role": "user",
"content": "Classify sentiment: 'the staff were rude'"}],
)
print(loose.usage.output_tokens, cost(loose.usage.output_tokens))
# WITH the levers: the stacked call from above.
print(resp.usage.output_tokens, cost(resp.usage.output_tokens))
The numbers below are illustrative (expected shapes, not measured on this box, which has no key), but the ratio is the point:
without levers : ~120 output tokens (~$0.0030 per call)
with levers : ~8 output tokens (~$0.0002 per call)
That 120-to-8 drop is the same effect the from-scratch shaper demo measured at the top of the chapter, where terse mode and the schema took a 62-token reply down to 5 and 1 tokens. The demo proved the mechanism on the box with a stub; this proves it on the real API by reading the token count the provider bills you for.
In Claude Code
Claude Code exposes the same effort lever at the CLI. Its default for coding is
xhigh, which is deliberate: coding wants thorough reasoning and careful tool use.
When you lower the effort, the agent makes fewer and more consolidated tool calls and
writes terser output, which means fewer output tokens for the same task. You can also
just ask for a short final answer ("give me the one-line summary, no explanation").
Lowering effort is the CLI-level version of the output_config={"effort": "low"}
parameter above: same knob, same effect on the bill, reached through the tool instead
of the request body.
Takeaways
- Output is priced 5x input per token, so making the model write less is usually the highest-return lever. You cannot edit output after the fact for free; you have already paid for every token by the time you could delete one.
- Output reduction means making the model generate less, not cleaning up afterward.
The levers are effort, structured output (schemas),
max_tokens, and stop sequences. - A hard cap truncates wherever the model happens to be and can cut off the real answer; shaping (terse, schema, low effort) makes the answer short by design and keeps it correct. Use the cap as a safety ceiling, not as your short-answer strategy.
- For tasks with a fixed-shape result (classification, extraction, routing, JSON APIs), a one-word or one-field answer is not a worse answer, it is the whole answer. That is where output reduction pays the most.
- On the real API,
output_config={"effort": "low"}is the everyday terse knob andoutput_config={"format": {"type": "json_schema", ...}}is the strongest when the output is structured, because it removes the slots padding would fill.
👉 We have now squeezed both halves of a single call: the input (Chapter 3) and the output (this chapter). The next chapter narrows in on a context type that breaks the naive shrinking we have done so far: code. Source files have structure (imports, definitions, call graphs) that a blind word-trimmer destroys, and the next chapter shows how to compress and select context in a way that respects it.
Code and structure-aware context
Chapter 3 shrank prose by scoring sentences and keeping the top ones. Code does not work that way. You cannot drop "the least important line" of a function and hope the rest still makes sense, and you cannot rank lines by similarity to a question the way you would rank paragraphs. Code has structure: functions call functions, files import files, classes inherit classes. The right way to compress a codebase for a model is to use that structure to select only the parts the task touches, and leave the rest out entirely.
This chapter is about selecting code by structure instead of by file. The lazy habit, when a model needs to reason about one function, is to paste the whole file (or the whole repo) and let attention sort it out. That spends tokens on functions the task never touches, and on a real repository it does not even fit. The structure-aware habit is to parse the code, figure out what the target function actually depends on, and send only that.
The unit of code is not the line
To talk about structure we need the thing that exposes it. When Python (or any language)
reads source text, the first step is to turn that flat string into a tree that mirrors the
grammar of the language. That tree is the Abstract Syntax Tree (AST). "Abstract" because
it throws away surface details like whitespace and keeps the meaningful shape: this is a
function definition, its body is a list of statements, this statement is a call, the thing
being called has this name. Python hands you the AST through the standard-library ast
module, so we can work with the structure of code using nothing but the standard library.
Two node types carry most of what we need:
- A
FunctionDefnode is onedef. It knows the function's name and holds its body as a list of child nodes. - A
Callnode is one function call, likesummarize(rows). Its.funcchild tells you what is being called; when the call is a plain name, that child is aNamenode whose.idis the string"summarize".
So "which functions does this function call" is not a text search for parentheses. It is a
walk over the Call nodes inside a FunctionDef, reading the name off each one. The AST
makes that exact and immune to the things that fool a regex (a call inside a comment, a
string that happens to contain foo(, a name that is a substring of another).
The call graph
Once you can list every function and, for each one, the functions it calls, you have a
call graph: a set of nodes (the functions) and directed edges (A calls B means an
arrow from A to B). The call graph is the map of how the code's pieces depend on each
other, and it is what lets you answer the question that matters for context: if I want the
model to understand make_report, what else does it need to see?
The answer is the transitive dependencies of the target. "Transitive" means you follow
the arrows all the way, not just one hop. If make_report calls clean, and clean called
something else, you would need that too, and so on until nothing new turns up. Formally, the
transitive dependencies of a target $t$ are every node reachable from $t$ by following call
edges:
$$\text{deps}(t) = {, t ,} \cup {, v : t \rightsquigarrow v ,}$$
where $t \rightsquigarrow v$ means "there is a path of call edges from $t$ to $v$." You compute this with an ordinary graph search. Start with the target, look at everything it calls, add anything you have not seen, and repeat until the frontier is empty. That is a breadth-first search (BFS), and it is the whole selection algorithm: the set it returns is exactly the slice of code the model needs, and everything outside the set is noise you can drop.
The demo
The script below carries a small Python module as a string: eight functions where the target
make_report transitively needs only three of them (load_rows, clean, summarize) and
the other four (send_email, connect_smtp, resize_image, slugify) are unrelated. It
parses the module with ast, records each function's exact source text, builds the call
graph by walking Call nodes, runs a BFS from the target to collect its transitive
dependencies, and emits only those functions as the context slice. Then it prints the
before/after token counts (the same words * 1.3 estimate from Chapter 2)
and verifies that the target and all its real dependencies are present while the unrelated
functions are gone.
"""Select code by STRUCTURE (a call graph), not by file.
When an LLM needs to reason about one function, the lazy move is to paste the
whole module and let the model sort it out. That spends tokens on functions the
task never touches. The structure-aware move is to parse the code, find which
functions the target actually CALLS (directly or indirectly), and send only
those. This script does exactly that with the standard-library `ast` module.
Pipeline:
1. parse the module text into an Abstract Syntax Tree (AST)
2. record each top-level function's name and exact source text
3. build a CALL GRAPH: for each function, which other functions it calls
4. from the target, walk the graph to collect its transitive dependencies
5. emit only those functions as the "context slice"
Standard library only. Run: python3 code_context.py
"""
import ast
def est_tokens(text: str) -> int:
"""Rough token estimate, same rule as chapter 2: ~1.3 tokens per whitespace
word. Code fragments more than prose, so this UNDER-counts real code tokens,
but it is consistent across the before and after, which is what we compare."""
return round(len(text.split()) * 1.3)
# A small module. `make_report` (the target) transitively needs `load_rows`,
# `clean`, and `summarize`. The other four functions are unrelated noise that a
# whole-file dump would carry along for nothing.
MODULE = '''\
def load_rows(path):
raw = path.read_text()
return [line.split(",") for line in raw.splitlines()]
def clean(rows):
return [[cell.strip() for cell in row] for row in rows]
def summarize(rows):
total = sum(float(row[1]) for row in rows)
return {"count": len(rows), "total": total}
def make_report(path):
rows = clean(load_rows(path))
stats = summarize(rows)
return f"{stats['count']} rows, total {stats['total']}"
def send_email(addr, body):
server = connect_smtp()
message = build_message(addr, body)
receipt = server.deliver(addr, message)
log_delivery(addr, receipt)
return receipt
def connect_smtp():
host = read_config("smtp_host")
port = read_config("smtp_port")
return open_socket(host, port)
def resize_image(img, width, height):
ratio = min(width / img.width, height / img.height)
new_w = int(img.width * ratio)
new_h = int(img.height * ratio)
canvas = blank_canvas(width, height)
scaled = img.scaled(new_w, new_h)
return paste_centered(canvas, scaled)
def slugify(title):
lowered = title.lower().strip()
safe = "".join(ch for ch in lowered if ch.isalnum() or ch == " ")
collapsed = " ".join(safe.split())
return collapsed.replace(" ", "-")
'''
def function_table(module_src):
"""Parse the module and return {name: source_text} for every top-level
function. `ast.parse` turns the text into a tree of nodes; a `FunctionDef`
node is one `def`. `ast.get_source_segment` hands back the EXACT slice of
the original text for that node, so we keep formatting and comments."""
tree = ast.parse(module_src)
funcs = {}
for node in tree.body:
if isinstance(node, ast.FunctionDef):
funcs[node.name] = ast.get_source_segment(module_src, node)
return funcs
def build_call_graph(module_src, names):
"""Return {name: set_of_names_it_calls}. We walk each function's subtree
looking for `Call` nodes. A call like `summarize(rows)` parses as a `Call`
whose `.func` is a `Name` node with `id == "summarize"`. We only keep callees
that are themselves functions defined in this module (ignore builtins like
`sum`, `len`, `float`, and methods like `.strip()`)."""
tree = ast.parse(module_src)
graph = {name: set() for name in names}
for node in tree.body:
if isinstance(node, ast.FunctionDef):
for sub in ast.walk(node):
if isinstance(sub, ast.Call) and isinstance(sub.func, ast.Name):
callee = sub.func.id
if callee in names:
graph[node.name].add(callee)
return graph
def transitive_deps(graph, target):
"""Breadth-first walk of the call graph from `target`. Start with the target,
then repeatedly add everything its current members call, until nothing new
appears. The result is the target plus every function reachable from it: its
transitive dependencies. The order is the discovery order (target first)."""
seen = []
queue = [target]
in_set = {target}
while queue:
name = queue.pop(0)
seen.append(name)
for callee in sorted(graph[name]):
if callee not in in_set:
in_set.add(callee)
queue.append(callee)
return seen
def main():
funcs = function_table(MODULE)
names = set(funcs)
graph = build_call_graph(MODULE, names)
target = "make_report"
needed = transitive_deps(graph, target)
dropped = [n for n in funcs if n not in needed]
slice_src = "\n\n\n".join(funcs[n] for n in funcs if n in needed)
full_tok = est_tokens(MODULE)
slice_tok = est_tokens(slice_src)
print("=== The call graph (who calls whom) ===")
for name in funcs:
calls = sorted(graph[name])
print(f" {name:14s} -> {', '.join(calls) if calls else '(no in-module calls)'}")
print()
print(f"=== Slice for target: {target}() ===")
print(f" included ({len(needed)}): {', '.join(needed)}")
print(f" dropped ({len(dropped)}): {', '.join(dropped)}")
print()
print("=== Tokens: whole module vs structure-aware slice ===")
print(f" whole module : {full_tok:4d} tokens ({len(funcs)} functions)")
print(f" slice : {slice_tok:4d} tokens ({len(needed)} functions)")
reduction = 100 * (full_tok - slice_tok) / full_tok
print(f" reduction : {reduction:.0f}% fewer tokens sent")
print()
print("=== Verify the slice is correct ===")
expected_deps = {"load_rows", "clean", "summarize"}
target_present = target in needed
deps_present = expected_deps.issubset(set(needed))
unrelated = {"send_email", "connect_smtp", "resize_image", "slugify"}
unrelated_absent = unrelated.isdisjoint(set(needed))
print(f" target present : {target_present}")
print(f" all real dependencies present: {deps_present} {sorted(expected_deps)}")
print(f" unrelated functions absent : {unrelated_absent} {sorted(unrelated)}")
print(f" ALL CHECKS PASS : {target_present and deps_present and unrelated_absent}")
if __name__ == "__main__":
main()
Running it:
=== The call graph (who calls whom) ===
load_rows -> (no in-module calls)
clean -> (no in-module calls)
summarize -> (no in-module calls)
make_report -> clean, load_rows, summarize
send_email -> connect_smtp
connect_smtp -> (no in-module calls)
resize_image -> (no in-module calls)
slugify -> (no in-module calls)
=== Slice for target: make_report() ===
included (4): make_report, clean, load_rows, summarize
dropped (4): send_email, connect_smtp, resize_image, slugify
=== Tokens: whole module vs structure-aware slice ===
whole module : 181 tokens (8 functions)
slice : 65 tokens (4 functions)
reduction : 64% fewer tokens sent
=== Verify the slice is correct ===
target present : True
all real dependencies present: True ['clean', 'load_rows', 'summarize']
unrelated functions absent : True ['connect_smtp', 'resize_image', 'send_email', 'slugify']
ALL CHECKS PASS : True
Read the call graph first. The only function with in-module callees is make_report, which
points at clean, load_rows, and summarize. There is a second little cluster,
send_email calling connect_smtp, but no arrow connects it to the target, so the search
from make_report never reaches it. That is why the slice contains four functions and drops
the other four, and why the token count falls from 181 to 65, a 64% cut, while the
verification block confirms nothing the target needs went missing and nothing it does not
need came along.
Notice what the selection did not rely on. It did not look at what the functions are
named, what they appear to be about, or whether their text resembles some query. slugify
and resize_image were dropped not because they seemed irrelevant but because no path of
call edges reaches them from the target. The structure decided, and the structure cannot be
fooled by a suggestive name.
Don't be confused. Structural relevance and semantic relevance are different filters, and they answer different questions. The retrieval in Chapter 3 and the memory in Chapter 9 use semantic relevance: embed the query and the candidates, and keep the ones whose meanings are close. That finds code that is about the same topic. The call graph here uses structural relevance: keep the code the target actually executes. Those can disagree. A helper named
_x7thatmake_reportcalls is structurally essential and semantically invisible; a beautifully namedgenerate_report_pdfthat nothing calls is semantically tempting and structurally useless. For "give the model what it needs to understand this function," structure is the correct filter, because correctness depends on what the code calls, not on what it sounds like.
Where this is used in the wild
The toy here is a single-file call graph, but the idea scales to real tools that select code by structure instead of dumping it.
- CodeCompressor is code-aware prompt compression: instead of treating a file as a bag of lines, it respects code structure when deciding what to cut, so the compressed prompt is still parseable code rather than a shredded fragment.
- lean-ctx builds context from a code graph, assembling the relevant slice for a task from the dependency structure rather than pasting whole files.
- tree-sitter based selectors use the tree-sitter
parser, which produces an AST for dozens of languages, to pull out specific definitions
(this function, this class) instead of whole files. It is the same
FunctionDef-and-Callidea generalized past Python. - Aider's "repo map" gives a coding agent a compact, structure-derived index of a repository: the symbols (functions, classes) and how they relate, ranked by relevance to the current task, so the model gets a map of the codebase rather than its full text.
The common thread is that every one of these sends the model a selected, structured slice rather than raw bulk. The selection is what keeps the context lean (the second property from Chapter 1) on inputs far too large to send whole.
'with Claude' note (illustrative). A coding agent like Claude Code does not paste the repository into the prompt. It navigates: it greps for a symbol, reads the specific files a task names, follows a call or import to the next file, and stops when it has what it needs. The effect is the same selection this chapter computes, done incrementally as the task reveals which symbols matter, which keeps the window lean even on a large codebase.
The use cases line up with how people actually deploy this. A coding agent editing one function needs that function and its dependencies, not the repo. Repository question-answering ("where do we validate the JWT?") wants the few functions on the path to the answer. PR review wants the changed functions plus what they call and what calls them, which is the call graph extended one hop in the other direction, so the reviewer model sees the blast radius of the change and nothing else.
Using the real tool: commands and before/after proof
The demo above computed the slice by hand. The two tools below do the same selection on a real repository, with commands you can run. Both replace the lazy "paste the whole file" habit with "send only the parts the task touches."
Start with Claude Code, because it is the clearest example of the principle. When you run it against a repository in non-interactive mode, you give it one instruction and let it find the code:
# In the root of a repo. -p runs it once and prints the result (no chat session).
claude -p "fix the bug in the deploy function"
What happens next is the whole point. Claude Code does not read the repository into the
prompt and ask the model to find deploy in the wall of text. It uses its own search and read
tools: it greps for the symbol deploy to locate the file and line, reads just that function,
follows a call or an import to the next function it needs, and stops when it has enough. The
model only ever sees the handful of functions on the path to the fix, which is the same slice
the from-scratch demo selected with a BFS over the call graph, except here it is gathered
incrementally as the task reveals which symbols matter.
Contrast that with the naive approach: open deploy.py (or worse, cat the whole src/
directory) and paste all of it into the prompt. On a small file that wastes tokens on functions
the bug never touches; on a real repository it does not fit in the context window at all. The
search-and-read habit is what keeps the window lean on inputs too large to send whole.
The second tool is Aider, an open-source command-line coding assistant. It builds a repo
map: a compact index of the repository's definitions (functions, classes, and their call
signatures) rather than their full bodies, ranked by relevance to the current task with a graph
algorithm over the file dependency structure. Under the hood the map is built with
tree-sitter, the same parser the "wild" section mentioned,
which gives Aider an AST for dozens of languages instead of the single-file Python ast of our
demo.
pip install aider-chat
aider # builds a repo map of the relevant definitions, not whole files
aider --map-tokens 1024 # bound the map to roughly 1024 tokens
The --map-tokens option is the lever that matters here. It caps how many tokens the repo map
may occupy, so Aider keeps the highest-ranked definitions and drops the rest to fit the budget.
That is the same trade the call-graph BFS makes (keep what the task needs, drop the rest), spent
as an explicit token budget instead of a reachability set.
Now prove the win. The metric is input tokens, and the exact count comes from
client.messages.count_tokens, not a word-count estimate (the Anthropic SDK; the model id this
book uses is claude-opus-4-8). Count the prompt twice, once with the whole file or repo dumped
in and once with just the targeted slice (the function plus its dependencies, exactly what the
from-scratch demo selected), and read .input_tokens off each result:
from anthropic import Anthropic
client = Anthropic()
def input_tokens(code: str) -> int:
resp = client.messages.count_tokens(
model="claude-opus-4-8",
messages=[{"role": "user", "content": code}],
)
return resp.input_tokens
whole = input_tokens(open("module.py").read()) # everything
slice_ = input_tokens(selected_slice_text) # target + its deps only
print(whole, slice_, f"{1 - slice_/whole:.0%} fewer")
whole module ~500 tokens -> structure-aware slice ~150 tokens (70% fewer)
Those two numbers are illustrative: the shapes to expect, not a measurement from this box
(count_tokens needs a network call and an API key, which the verification setup does not have).
The measurement we did run is the AST demo above, which cut its module from 181 tokens to 65
(a 64% reduction) and verified that nothing the target calls went missing. That on-box result is
the proof; count_tokens is how you reproduce the same before/after on a real file or repo,
with exact counts instead of an estimate.
Takeaways
- Code has structure (calls, imports, inheritance) that prose does not, so you compress it by selecting whole units along that structure, not by ranking or trimming lines.
- The AST turns source text into a tree you can query precisely. A
FunctionDefis adef; aCallnode's name tells you what it invokes. The standard-libraryastmodule is all you need for Python. - A call graph plus a graph search from the target yields its transitive dependencies: the exact slice the model needs. The demo cut a module from 181 to 65 tokens (64%) while keeping every function the target actually calls.
- Structural relevance (what the code calls) and semantic relevance (what the code is about) are different filters. For "what does this function need to run," structure is the correct one; a suggestive name is not evidence and an ugly name is not disqualifying.
- Real tools do this at scale: CodeCompressor, lean-ctx, tree-sitter selectors, and Aider's repo map all send a structured slice, and coding agents navigate to targeted files instead of pasting the repo.
👉 Selecting a lean context is half the battle. The other half is not paying for the same tokens twice when you send a similar context call after call. The next chapter is the KV-cache and prefix caching: how the model's internal work on a stable prefix can be reused across calls, and how to order your context so it is.
KV-cache and prefix caching
Chapter 2 priced a context: count the input tokens, estimate the output, multiply by the rates. That formula has a quiet assumption baked into it, that you pay for every input token on every call. For the parts of your context that do not change between calls, that assumption is wrong, and this chapter is about why.
Most production contexts are mostly stable. A long system prompt, a fixed set of tool definitions, a retrieved document the user is asking three questions about: the same tokens go out at the front of call after call, and only the tail (the new question) differs. If the model re-reads that stable prefix from scratch every time, you re-pay for it every time. Caching is how you stop.
There are two layers to this, and they are easy to confuse, so we separate them. The lower layer is the KV cache, a mechanism inside the model that makes generation itself tractable. The upper layer is prompt caching, a feature you call from the API that reuses the cached work for a stable prefix across separate requests. The first makes one generation cheap; the second makes a thousand calls with a shared prefix cheap. They are the same idea (do not redo work on tokens you have already processed) applied at two scales, and seeing the mechanism makes the API feature obvious instead of magical.
What attention actually computes
To see what gets cached, we need the one operation at the heart of a transformer: attention. Strip away the layers and the heads and it comes down to this. For a single token that is "looking" at the tokens before it, attention computes
$$\text{softmax}!\left(\frac{QK^\top}{\sqrt{d}}\right)V$$
That is dense, so here is every symbol in plain terms. A token enters as a vector. Three learned weight matrices turn that vector into three new vectors:
- the query $Q$, what this token is looking for;
- the key $K$, a label each token advertises, to be matched against queries;
- the value $V$, the content a token contributes once its key is matched.
Turning a token vector into a query, key, or value by multiplying it through one of those matrices is called a projection. Projections are the work. They are what we count for the rest of this chapter, because they are what caching avoids.
Now read the formula left to right. $QK^\top$ is the query of the current token dotted against the key of every earlier token: one number per earlier token, scoring how well each one matches what we are looking for. Dividing by $\sqrt{d}$ (where $d$ is the vector dimension) keeps those scores from growing huge as the dimension grows, which would make the next step too sharp. The softmax turns the row of scores into a row of weights that are all positive and sum to 1. Finally $\dots V$ takes the weighted average of the value vectors. That weighted average is the attention output for this token: it pulled in content from earlier tokens in proportion to how well their keys matched its query.
The thing to notice: to attend, the current token needs the key and value of every token before it. Those keys and values are exactly what a cache can store.
The KV cache, within one generation
A model generates text one token at a time. To produce token $t$, it runs attention, which needs $K$ and $V$ for tokens $0$ through $t$. Then it appends the new token and does it again for $t+1$. The naive way to implement this is to rebuild $K$ and $V$ from scratch at every step: at step $t$, project all $t+1$ tokens again. That is $1 + 2 + \dots + n$ projections to generate $n$ tokens, which is $n(n+1)/2$, roughly $n^2/2$. Quadratic.
But the key and value of token 5 never change. Once token 5 is in the sequence, its projection is fixed; later tokens cannot alter it (a token only ever attends backward). So you store each token's $K$ and $V$ the first time you compute it, and at each new step you project only the one new token and append it to the stored list. That stored list is the KV cache. With it, generating $n$ tokens costs $n$ projections. Linear.
The demo below builds a toy single-head attention in NumPy and counts projections both ways, for $n = 8, 64, 256$. Crucially, it also asserts that the cached and uncached paths produce the same output: the cache is exact, a pure speedup, not an approximation that trades quality for cost.
"""KV-cache and prefix caching, from scratch in NumPy.
Two demonstrations, both with real counts:
1. KV cache WITHIN one generation. To produce token t, attention needs the
keys (K) and values (V) of every token 0..t. Without a cache, each step
re-projects K,V for all prior tokens, so generating n tokens costs about
n^2/2 projections. With a cache, you store past K,V and project only the
ONE new token, so it costs about n. We count both, print the ratio, and
assert the two paths produce the SAME output (the cache is exact, not an
approximation).
2. PREFIX sharing ACROSS requests. Two requests share a long identical
prefix, then differ. Caching the prefix's K,V lets the second request skip
recomputing the shared part. We report the projections saved, then convert
that to money using Anthropic's prompt-caching economics (a cache read
costs about 0.1x the normal input rate).
Run with NumPy + the standard library only. No other imports.
"""
import numpy as np
# A global projection counter. Every time we project one token into a key and
# value vector, we bump this by 1. Counting projections is how we make the
# "quadratic vs linear" claim concrete instead of hand-wavy.
PROJECTIONS = 0
def reset_counter():
global PROJECTIONS
PROJECTIONS = 0
def softmax(x):
"""Numerically stable softmax over the last axis.
softmax turns a row of raw scores into a row of weights that are all
positive and sum to 1. Subtracting the max first avoids overflow in exp().
"""
x = x - np.max(x, axis=-1, keepdims=True)
e = np.exp(x)
return e / np.sum(e, axis=-1, keepdims=True)
# ----------------------------------------------------------------------------
# A toy single-head self-attention.
#
# Vocabulary of the terms, assuming no background:
# - "token" : one chunk of input, here represented as a vector x_t.
# - "projection" : multiply a token vector by a learned weight matrix to get
# a new vector. Here W_q, W_k, W_v turn each token into a
# query, a key, and a value. This is the work we are counting.
# - "query" (Q) : what the current token is looking for.
# - "key" (K) : what each token offers as a label, to be matched against Q.
# - "value" (V) : the content each token contributes once its key matches.
#
# Attention for one query is: softmax(q . K^T / sqrt(d)) @ V
# q . K^T : how well the query matches every key (one score per past token).
# / sqrt(d) : scale the scores so they do not blow up as the dimension grows.
# softmax : turn the scores into weights that sum to 1.
# @ V : take the weighted average of the value vectors. That average is
# the attention output for this token.
# ----------------------------------------------------------------------------
D = 8 # vector dimension. Small so the demo is fast; the logic is the same at scale.
rng = np.random.default_rng(0) # fixed seed: the numbers below are reproducible.
W_q = rng.standard_normal((D, D)) * 0.5
W_k = rng.standard_normal((D, D)) * 0.5
W_v = rng.standard_normal((D, D)) * 0.5
def project_kv(token):
"""Project ONE token into its key and value vector. Counts as 1 projection."""
global PROJECTIONS
PROJECTIONS += 1
k = token @ W_k
v = token @ W_v
return k, v
def attend(query, K, V):
"""Attention output for a single query against stored keys K and values V."""
scores = (query @ K.T) / np.sqrt(D) # one score per past token
weights = softmax(scores) # weights that sum to 1
return weights @ V # weighted average of the values
def generate_no_cache(tokens):
"""Generate, RE-projecting K,V for all prior tokens at every step.
At step t we rebuild K and V from scratch for tokens 0..t. That is t+1
projections at step t, so 1 + 2 + ... + n = n(n+1)/2 projections in total:
quadratic in the number of tokens.
"""
outputs = []
for t in range(len(tokens)):
K_rows, V_rows = [], []
for j in range(t + 1): # rebuild K,V for every token 0..t
k, v = project_kv(tokens[j])
K_rows.append(k)
V_rows.append(v)
K = np.array(K_rows)
V = np.array(V_rows)
q = tokens[t] @ W_q # query for the current token
outputs.append(attend(q, K, V))
return np.array(outputs)
def generate_with_cache(tokens):
"""Generate, KEEPING past K,V in a cache and projecting only the new token.
We append each new token's k,v to a growing cache. At every step we project
exactly ONE token, so the whole generation costs n projections: linear.
"""
K_cache, V_cache = [], [] # this list IS the KV cache
outputs = []
for t in range(len(tokens)):
k, v = project_kv(tokens[t]) # project ONLY the new token
K_cache.append(k)
V_cache.append(v)
K = np.array(K_cache)
V = np.array(V_cache)
q = tokens[t] @ W_q
outputs.append(attend(q, K, V))
return np.array(outputs)
def demo_within_generation():
print("=== 1. KV cache WITHIN one generation ===")
print("Generating n tokens. Counting K,V projections each way.\n")
print(f"{'n':>6} {'no cache (~n^2/2)':>18} {'with cache (~n)':>16} {'ratio':>7}")
for n in (8, 64, 256):
tokens = rng.standard_normal((n, D)) # n random token vectors
reset_counter()
out_slow = generate_no_cache(tokens)
slow = PROJECTIONS
reset_counter()
out_fast = generate_with_cache(tokens)
fast = PROJECTIONS
# The cache must be EXACT: same output, just less work. Assert it.
assert np.allclose(out_slow, out_fast), "cache changed the output!"
print(f"{n:>6} {slow:>18,} {fast:>16,} {slow / fast:>6.1f}x")
print("\nThe outputs are identical (assert passed), so the cache is exact,")
print("not an approximation. It just stops re-paying for past tokens.")
print("Cost goes from ~n^2/2 projections to ~n: at n=256 that is the")
print("difference between 32,896 and 256.\n")
def demo_prefix_sharing():
print("=== 2. PREFIX sharing ACROSS two requests ===")
# Two requests with a long identical prefix, then a short tail that differs.
PREFIX_LEN = 250 # the shared system prompt / context (stable across calls)
TAIL_LEN = 6 # the part that differs (e.g. the user's question)
prefix = rng.standard_normal((PREFIX_LEN, D))
tail_a = rng.standard_normal((TAIL_LEN, D))
tail_b = rng.standard_normal((TAIL_LEN, D))
req_a = np.vstack([prefix, tail_a])
req_b = np.vstack([prefix, tail_b])
# Request A: process the whole thing once, projecting K,V for every token,
# and KEEP the prefix's K,V to reuse on the next request.
reset_counter()
K_prefix, V_prefix = [], []
for j in range(PREFIX_LEN):
k, v = project_kv(prefix[j])
K_prefix.append(k)
V_prefix.append(v)
for j in range(TAIL_LEN):
project_kv(tail_a[j])
a_projections = PROJECTIONS # PREFIX_LEN + TAIL_LEN
# Request B WITHOUT the cache: redo the shared prefix from scratch.
reset_counter()
for j in range(PREFIX_LEN):
project_kv(prefix[j])
for j in range(TAIL_LEN):
project_kv(tail_b[j])
b_uncached = PROJECTIONS # PREFIX_LEN + TAIL_LEN again
# Request B WITH the cache: reuse the prefix K,V, project only the new tail.
reset_counter()
for j in range(TAIL_LEN):
project_kv(tail_b[j])
b_cached = PROJECTIONS # TAIL_LEN only
saved = b_uncached - b_cached
print(f"Shared prefix: {PREFIX_LEN} tokens. Differing tail: {TAIL_LEN} tokens.\n")
print(f"Request A (first call, fills the cache): {a_projections} projections")
print("Request B, BEFORE (no cache, redo prefix): "
f"{b_uncached} projections")
print(f"Request B, AFTER (reuse cached prefix): {b_cached} projections")
print(f"Saved by reusing the prefix: {saved} projections "
f"({saved / b_uncached * 100:.0f}% of request B)\n")
def demo_money():
print("=== 3. The same idea, in dollars (Anthropic prompt caching) ===")
# Anthropic prompt caching prices a cache READ at about 0.1x the normal
# input rate. A cache WRITE costs about 1.25x (5-minute TTL). So the first
# call pays a small write premium, and every later call re-reads the prefix
# for about a tenth of the price instead of re-sending it at full price.
INPUT_RATE = 5.00 / 1_000_000 # claude-opus-4-8: $5.00 per 1M input tokens
READ_RATE = INPUT_RATE * 0.1 # a cache hit is ~0.1x
WRITE_RATE = INPUT_RATE * 1.25 # a 5-minute cache write is ~1.25x
PREFIX_TOKENS = 10_000 # a 10k-token shared system prompt
CALLS = 100 # re-used across this many calls in 5 minutes
uncached = PREFIX_TOKENS * INPUT_RATE * CALLS
# First call writes the cache (1.25x); the other CALLS-1 read it (0.1x).
cached = (
PREFIX_TOKENS * WRITE_RATE
+ PREFIX_TOKENS * READ_RATE * (CALLS - 1)
)
print(f"A {PREFIX_TOKENS:,}-token prefix, re-used across {CALLS} calls "
"(claude-opus-4-8).\n")
print(f" BEFORE (re-sent uncached every call): ${uncached:,.2f}")
print(f" AFTER (written once, then read): ${cached:,.2f}")
print(f" Saved: ${uncached - cached:,.2f} "
f"({(1 - cached / uncached) * 100:.0f}% cheaper)\n")
print("Per call, a cached re-read of the prefix costs ~0.1x what re-sending")
print("it uncached would. The write premium is paid once and amortized away")
print("after a couple of reads.")
if __name__ == "__main__":
demo_within_generation()
demo_prefix_sharing()
demo_money()
Running it:
=== 1. KV cache WITHIN one generation ===
Generating n tokens. Counting K,V projections each way.
n no cache (~n^2/2) with cache (~n) ratio
8 36 8 4.5x
64 2,080 64 32.5x
256 32,896 256 128.5x
The outputs are identical (assert passed), so the cache is exact,
not an approximation. It just stops re-paying for past tokens.
Cost goes from ~n^2/2 projections to ~n: at n=256 that is the
difference between 32,896 and 256.
=== 2. PREFIX sharing ACROSS two requests ===
Shared prefix: 250 tokens. Differing tail: 6 tokens.
Request A (first call, fills the cache): 256 projections
Request B, BEFORE (no cache, redo prefix): 256 projections
Request B, AFTER (reuse cached prefix): 6 projections
Saved by reusing the prefix: 250 projections (98% of request B)
=== 3. The same idea, in dollars (Anthropic prompt caching) ===
A 10,000-token prefix, re-used across 100 calls (claude-opus-4-8).
BEFORE (re-sent uncached every call): $5.00
AFTER (written once, then read): $0.56
Saved: $4.44 (89% cheaper)
Per call, a cached re-read of the prefix costs ~0.1x what re-sending
it uncached would. The write premium is paid once and amortized away
after a couple of reads.
The first table is the within-generation story. At $n=8$ the cache saves a little; at $n=256$ it does $256$ projections where the naive version does $32{,}896$, a $128.5\times$ gap that keeps widening with length. This is not a tuning knob you turn on. Every production inference engine keeps a KV cache during generation, because without it long outputs would be quadratically slow. We come back to who manages that memory, and how in Chapter 8; for now the point is just that the model already caches keys and values internally, by token position.
Prefix sharing, across separate requests
Here is the leap. If keys and values can be cached within one generation, they can be cached across requests too, as long as the requests start with the same tokens.
The second block of output makes it concrete. Two requests share a 250-token prefix (think: a fixed system prompt plus a retrieved document) and then differ in a 6-token tail (the user's question). Request A processes all 256 tokens and we keep the prefix's $K$ and $V$. Request B without a cache redoes the whole prefix: 256 projections again, 250 of them pure waste because they recompute the identical prefix. Request B with the cache reuses the stored prefix keys and values and projects only the 6 new tail tokens: 6 projections. The shared prefix is paid for once, not twice. We saved 250 projections, 98% of request B's work.
That 98% is the whole economic case for prompt caching. The shared prefix is the expensive part (it is long and stable), the tail is the cheap part (it is short and changes), and a prefix cache lets the long stable part be processed once and reused.
Don't be confused. The model's internal KV cache and the API's prompt cache are not the same thing, even though they cache the same underlying quantities. The KV cache lives inside a single inference engine, holds keys and values for the duration of one generation (or while a session is warm), and you never touch it directly. The prompt cache is a feature you opt into from the API: you mark a stable prefix, and the provider stores its computed state so a later, separate request can reuse it. The first is the mechanism; the second is the product built on top of it. When this book says "caching relieves the cost pressure" (Chapter 1), it means the second, riding on the first.
From projections to dollars
The third block converts the saved work into money using Anthropic's actual prompt-caching prices. The economics rest on two numbers. A cache read (a request that reuses an already-cached prefix) costs about 0.1x the normal input rate: a tenth of the price of re-sending those tokens fresh. A cache write (the first request, which computes the prefix and stores it) costs about 1.25x the input rate for the default 5-minute cache, a small one-time premium.
So take a 10,000-token shared prefix re-used across 100 calls on claude-opus-4-8, whose
input rate is $5.00 per million tokens. Re-sent uncached, the prefix alone costs $5.00
across those 100 calls. Cached, the first call pays the 1.25x write and the other 99 pay the
0.1x read, for $0.56 total: 89% cheaper, the same shape as the 98% projection saving,
now in dollars. Per call, re-reading the cached prefix costs about a tenth of re-sending it.
The break-even is quick. The write costs 1.25x and a read costs 0.1x, so after the write you are ahead as soon as the reads you avoid would have cost more than 0.25x, which happens at the second or third call. Any stable prefix hit more than two or three times in the cache window pays for itself.
Calling it: Anthropic prompt caching
This maps directly onto the API. You mark the end of a stable prefix with a cache breakpoint and the provider caches everything up to that point. The following is follow-along (the build machine has no API key), but it is the exact call:
# Illustrative: requires the anthropic SDK and an API key.
import anthropic
client = anthropic.Anthropic()
resp = client.messages.create(
model="claude-opus-4-8",
max_tokens=1024,
system=[
{
"type": "text",
"text": LARGE_STABLE_SYSTEM_PROMPT, # the long, unchanging preamble
"cache_control": {"type": "ephemeral"}, # 5-minute cache; cache up to HERE
}
],
messages=[{"role": "user", "content": user_question}], # the volatile tail
)
print(resp.usage.cache_read_input_tokens) # >0 means we got a cache HIT
print(resp.usage.cache_creation_input_tokens) # >0 means we WROTE the cache this call
cache_control={"type": "ephemeral"} requests the 5-minute cache; for content reused over a
longer span, {"type": "ephemeral", "ttl": "1h"} requests a 1-hour cache (its write costs
about 2x instead of 1.25x, so it needs more reads to pay off, but it survives gaps in bursty
traffic). The simplest form is a top-level cache_control={"type": "ephemeral"} on the
request, which auto-caches the last cacheable block for you.
Four facts decide whether this actually works, and getting any of them wrong fails silently (you simply keep paying full price with no error).
It is a prefix match. The cache key is the exact bytes of the prompt up to the
breakpoint. Any change anywhere in the prefix invalidates everything after it. So keep
volatile content (a timestamp, a request ID, the user's question) after the cached prefix,
never inside it. A datetime.now() interpolated into the system prompt changes the prefix
every call and caches nothing.
Render order is tools, then system, then messages. Because the cache is a prefix, a breakpoint on the last system block caches the tool definitions and the system prompt together (tools come first in the rendered prompt). That is usually what you want: the tools and the system prompt are both stable, so cache them as one block and put the conversation after.
There is a minimum size. On claude-opus-4-8 the smallest cacheable prefix is 4096
tokens. A prefix shorter than that will not cache even with a breakpoint set, and you will
see cache_creation_input_tokens stay at zero with no error.
Verify with the usage fields. After a few calls with an identical prefix,
response.usage.cache_read_input_tokens should be greater than zero. If it stays zero, a
silent invalidator is changing your prefix: the usual culprits are a datetime.now() or a
UUID in the system prompt, a JSON blob serialized without sorted keys, or a tool list whose
order varies between calls. Diff the rendered prompt bytes of two requests to find it.
Don't be confused. Prefix caching is not semantic caching. Prefix caching reuses the computed state of identical input tokens: the prefix must match byte for byte, and it still runs the model to produce the answer. Semantic caching, the subject of Chapter 7, reuses a stored answer when a new question is close enough in meaning, returning it without calling the model at all. One caches the question's processing; the other caches the answer. They compose (you can do both), but they are different mechanisms with different failure modes.
Designing for the cache
Once you know it is a prefix match, the design rule writes itself: order your context from most stable to least stable. The frozen system prompt and the deterministic tool list go first, then the per-session context, then the running conversation, then the new user message last. Put the cache breakpoint at the boundary between the stable part and the volatile part. This is the build-order point from Chapter 1, now with a concrete payoff: a context assembled stable-first is a context that caches, and one assembled with a timestamp at the top is one that never will.
Using the real tool: commands and before/after proof
Everything above is the why. Here is the how, with a recipe you can run against the live
provider to confirm the cache is actually working. The build machine has no API key, so the
snippet below is follow-along (you supply the key with ANTHROPIC_API_KEY), but the call is
exactly the one you would make.
The setup is the cheapest possible experiment: call the model twice with the same large system prompt marked for caching, changing only the user question. If caching works, the first call pays to build the cache and the second call reads it back cheaply.
# Illustrative: requires the anthropic SDK and an API key (ANTHROPIC_API_KEY).
import anthropic
client = anthropic.Anthropic()
# A long, stable preamble. To cache on claude-opus-4-8 it must be at least
# 4096 tokens; below that the breakpoint is silently ignored (see below).
LARGE_STABLE_SYSTEM_PROMPT = open("system_prompt.txt").read()
def ask(question):
return client.messages.create(
model="claude-opus-4-8",
max_tokens=512,
system=[
{
"type": "text",
"text": LARGE_STABLE_SYSTEM_PROMPT, # the unchanging prefix
"cache_control": {"type": "ephemeral"}, # 5-minute cache; cache up to HERE
}
],
messages=[{"role": "user", "content": question}], # the volatile tail
)
# Call 1: same prefix, first question.
r1 = ask("Summarize the document in one sentence.")
print("call 1 write:", r1.usage.cache_creation_input_tokens,
" read:", r1.usage.cache_read_input_tokens)
# Call 2: SAME prefix, different question.
r2 = ask("List the three main risks the document mentions.")
print("call 2 write:", r2.usage.cache_creation_input_tokens,
" read:", r2.usage.cache_read_input_tokens)
The before/after proof
The two usage fields are the whole proof. cache_creation_input_tokens counts tokens
written to the cache this call (you paid the 1.25x write premium on them).
cache_read_input_tokens counts tokens served from the cache this call (you paid about
0.1x on them). Read the two calls in sequence:
- Call 1 writes the cache. It is the first time the provider has seen this prefix, so it
processes the whole system prompt and stores the result. You see
cache_creation_input_tokens > 0andcache_read_input_tokens == 0. - Call 2 reads the cache. The prefix is byte-for-byte identical, so the provider reuses
the stored work and only processes the new question. You see
cache_read_input_tokens > 0, andcache_creation_input_tokensdrops to 0 because nothing new needed writing.
Here is the shape of that output. These numbers are illustrative (expected, not measured on this box, which has no key), with a system prompt of about 5,000 tokens:
call 1 write: 5000 read: 0
call 2 write: 0 read: 5000
That read: 5000 on call 2 is the cache hit you are looking for, and the 5,000 tokens it
covers cost about a tenth of what they cost on call 1. That is the same 0.1x read price from
the dollars section above, now visible in a field you can assert on.
The rule for reading this in your own runs: if cache_read_input_tokens stays 0 across
repeated calls that share an identical prefix, a silent invalidator is changing your prefix.
The usual culprit is something volatile slipping in ahead of the breakpoint: a
datetime.now() or a UUID interpolated into the system prompt makes the prefix different
every call, so every call writes a fresh cache and none ever reads one. When the read field
is stuck at zero, diff the rendered prompt bytes of two requests and look for the thing that
moved.
This is the same mechanism the from-scratch NumPy demo proved on this box: reuse the stored
keys and values for the shared prefix and process only the new tail. The demo showed the
work disappear (6 projections instead of 256); the usage fields show the same saving as a
price you are billed (a 0.1x read instead of full input).
In Claude Code
Claude Code is built on exactly this. Across the turns of one session it automatically caches
the stable parts of the request: the system prompt and the tool definitions, which do not
change from turn to turn. A long CLAUDE.md is therefore paid for once, on the first turn
that writes it into the cache, and read cheaply on every turn after that. So the per-turn cost
you watch tick up mostly reflects cache reads of that stable preamble plus the genuinely
new tokens of the latest turn, not a full re-send of the whole preamble every time. The same
caching that makes a hundred API calls with a shared prefix cheap is what keeps a long
interactive coding session from re-paying for its own context on every message.
Takeaways
- Attention needs the key and value of every prior token. Caching those keys and values is what lets the model avoid re-projecting tokens it has already processed.
- Within one generation, the KV cache turns roughly $n^2/2$ projections into $n$: in the demo, $256$ instead of $32{,}896$ at $n=256$. It is exact (the assert on equal outputs passes), not an approximation. Every inference engine does this.
- Across requests, caching a shared prefix's keys and values lets a later call skip the shared work: the demo's second request did 6 projections instead of 256, a 98% saving.
- In dollars, a 10,000-token prefix re-used across 100 calls drops from $5.00 to $0.56 (89% cheaper) because cache reads cost about 0.1x input and the 1.25x write is paid once. Stable prefixes hit more than two or three times pay for themselves.
- It is a prefix match, so keep volatile content after the cached prefix, respect the render
order (tools, then system, then messages) and the 4096-token minimum on
claude-opus-4-8, and verify hits withcache_read_input_tokens. A zero there means a silent invalidator.
👉 Prefix caching reuses the work of processing an identical question; it still runs the model. The next chapter caches the other end: when a new question is close enough in meaning to one you have answered before, return the stored answer and skip the model entirely.
Semantic and response caching
Chapter 6 cached the stable front of the prompt so the model did not re-read it, but the model still ran on every call. This chapter goes one step further: when a new question is close enough to one you have already answered, return the stored answer and skip the model entirely. No prefill, no decode, no output tokens. The whole call collapses into a lookup.
That is a much bigger saving than prefix caching, and it comes with a risk prefix caching does not have. Prefix caching reuses input only on an exact byte match, so a reused prefix is always correct by construction. Semantic caching reuses the answer on an approximate match, so it can be wrong: serve a cached answer to a question that only looked similar, and the user gets a confidently wrong reply. The whole craft is in setting the match bar high enough to avoid that.
The idea: answer once, serve many
Support and FAQ traffic is full of repeats. "How do I reset my password?", "How can I reset my password?", and "I forgot my password, how do I reset it?" are three spellings of one question with one answer. A naive system calls the model three times and pays three times. A semantic cache calls the model once, stores the answer, and serves the other two from memory.
To do that, the cache needs to decide whether a new question means the same thing as one it has seen. It cannot compare the raw text, because the words differ. It needs a way to measure meaning. That is what an embedding gives you.
Embeddings: turning text into a vector
An embedding is a function from text to a fixed-length list of numbers (a vector), built so that texts with similar meaning produce vectors that point in similar directions. A production embedder is a trained neural network, but the mechanics of comparing by vector do not depend on how good the embedder is, so we can build a crude one from scratch and the caching logic around it is identical to the real thing.
Our toy embedder hashes the words of a question into a fixed-size vector. Each content word adds to a slot, the vector is scaled to unit length, and two questions that share content words end up pointing the same way. The demo below explains it line by line.
Cosine similarity: measuring the angle
Given two unit vectors, how "similar" are they? Use cosine similarity, the cosine of the angle between them:
$$\text{cos}(a, b) = \frac{a \cdot b}{\lVert a \rVert , \lVert b \rVert}$$
Here $a \cdot b$ is the dot product (multiply the two vectors element by element and sum the results), and $\lVert a \rVert$ is the length (norm) of $a$. The ratio runs from $-1$ (opposite meaning) through $0$ (unrelated) to $1$ (identical direction). When both vectors are already unit length, the denominator is $1 \times 1 = 1$, so cosine similarity is just the dot product. That is why the embedder normalizes: it makes the comparison a single cheap multiply-and-add.
The threshold: hit, miss, and the line between them
A threshold is the cutoff that turns a similarity score into a yes/no decision. Pick a number, say $0.8$. For a new question, embed it, score it against every cached question, and take the best match.
- If the best score is at or above the threshold, that is a hit: return the stored answer and skip the model.
- If the best score is below the threshold, that is a miss: call the model, store the new (question, answer) pair, and return the fresh answer.
Raise the threshold and the cache becomes strict: fewer hits, but every hit is a genuine match. Lower it and the cache becomes greedy: more hits, but eventually it starts serving cached answers to questions that only share a few words. That trade-off is the heart of the chapter, and the demo makes it concrete.
Don't be confused. Prefix caching (Chapter 6) and semantic caching are both "caching", but they reuse different things under different match rules. Prefix caching reuses the input tokens of a stable prompt prefix, keyed on an exact byte match, and the model still runs to produce the answer: a reused prefix is always correct because the bytes are identical. Semantic caching reuses the whole answer, keyed on an approximate similarity match, and the model does not run at all. The payoff is larger (you skip generation, the expensive half from Chapter 2) and so is the failure mode: an approximate match can be wrong, handing one question's answer to a different question. Prefix caching can never serve a wrong answer; semantic caching can, and guarding against that is the threshold's job.
The cache, from scratch
The demo builds the whole thing in NumPy and the standard library: the hashing embedder, cosine similarity, the threshold decision, and a stub model that stands in for a real call. It runs a workload of ten questions, several of which are paraphrases of earlier ones, and reports the hit rate plus the latency and dollars saved. Then it lowers the threshold and demonstrates a false hit: a different question wrongly served a cached answer.
"""Semantic (response) caching from scratch: return a STORED ANSWER for a
SIMILAR question, skipping inference entirely.
Chapter 6 caches the prompt PREFIX (the stable input tokens) and still runs the
model. This is different: we cache the whole ANSWER, and on a near-duplicate
question we skip the model call completely. That is a much bigger saving and a
new risk (a FALSE HIT: a different question served the wrong cached answer).
Four pieces, all in numpy + stdlib:
1. A tiny embedding: hash each word into a fixed-dim vector, sum, L2-normalize.
No external embedding model. (Real caches use a learned sentence embedder;
the mechanics of compare-by-vector are identical.)
2. Cosine similarity between a new question's vector and each cached vector.
3. A THRESHOLD: best match >= threshold -> HIT (return stored answer, skip LLM);
else MISS (call the stub LLM, store (question, answer), return it).
4. The threshold trade-off: lower it and watch a different question get a wrong
cached answer.
Standard library + numpy only. Run: python3 semantic_cache.py
"""
import hashlib
import numpy as np
def _hash(s):
"""A deterministic hash of a string to a non-negative int. Python's built-in
hash() is randomized per process (so output would change run to run); md5
keeps this demo reproducible. Any fixed hash works; the values are arbitrary."""
return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")
# --- Cost / latency model (one place to change the assumptions) ---------------
# A cache lookup is cheap and local; an LLM call is slow and billed per token.
LOOKUP_MS = 10 # embed + cosine search, in milliseconds
LLM_MS = 1500 # one model call, in milliseconds
OUT_TOKENS = 500 # tokens the model writes per answer
PRICE_PER_MTOK = 25.0 # output price, US dollars per million tokens (claude-opus-4-8)
COST_PER_CALL = (OUT_TOKENS / 1e6) * PRICE_PER_MTOK # dollars per LLM call
DIM = 256 # embedding dimension; bigger = fewer hash collisions
# Words that carry no topic ('how', 'do', 'the', ...). A real embedder learns to
# down-weight these; we just drop them so the vector reflects the CONTENT words.
STOPWORDS = set(
"how do i can you what is are your please tell about me a an the to it of "
"my your for".split()
)
# --- 1. A tiny embedding: hashing bag-of-words --------------------------------
def embed(text):
"""Turn text into a unit vector so that similar questions land near each other.
'Embedding' just means: a function from text to a fixed-length vector of
numbers, built so that texts with similar meaning point in similar directions.
A real embedder is a trained network; this one is a cheap stand-in with two
honest tricks so plain word-overlap is not fooled by filler:
* drop STOPWORDS, so 'how do i' does not dominate the vector;
* for each remaining word, hash the WORD and also its character trigrams
(3-letter windows). Trigrams make 'refund' and 'refunds' look alike and
give content words more weight than a single slot would.
We hash each feature to an index in [0, DIM) and add to that slot, building a
bag-of-features count vector, then L2-normalize it (divide by its length) so
every vector sits on the unit sphere. Normalizing is what lets cosine
similarity below be a plain dot product.
"""
v = np.zeros(DIM, dtype=np.float64)
for raw in text.lower().split():
word = raw.strip("?.,!;:'\"")
if not word or word in STOPWORDS:
continue
v[_hash("word:" + word) % DIM] += 1.0 # the whole word
for i in range(len(word) - 2): # its character trigrams
v[_hash("tri:" + word[i:i + 3]) % DIM] += 0.5
norm = np.linalg.norm(v)
return v / norm if norm > 0 else v
# --- 2. Cosine similarity -----------------------------------------------------
def cosine(a, b):
"""Cosine similarity: the cosine of the angle between two vectors.
cos(a, b) = (a . b) / (||a|| ||b||)
The dot product a . b sums the element-wise products; ||a|| is a's length.
The result runs from -1 (opposite) through 0 (unrelated) to 1 (identical
direction). Because embed() already returns unit-length vectors, ||a|| and
||b|| are both 1, so this reduces to the dot product a . b.
"""
return float(np.dot(a, b))
# --- 3. The stub LLM (what we are trying to avoid calling) --------------------
LLM_CALLS = 0 # count real model calls so we can prove the saving
def stub_llm(question):
"""Stand-in for a real model call: slow and expensive in the cost model.
Returns a canned 'answer' so the demo is deterministic and offline."""
global LLM_CALLS
LLM_CALLS += 1
return f"[answer to: {question!r}]"
# --- The cache ----------------------------------------------------------------
class SemanticCache:
def __init__(self, threshold):
self.threshold = threshold
self.vectors = [] # cached question embeddings
self.questions = [] # the original question text (for inspection)
self.answers = [] # the stored answer for each cached question
def lookup(self, question):
"""Return (answer, matched_question, score) on a hit, or None on a miss.
A hit means: some cached question's embedding is at least `threshold`
similar to this one. We return the closest match above the line."""
if not self.vectors:
return None
q = embed(question)
scores = [cosine(q, v) for v in self.vectors]
best = int(np.argmax(scores))
if scores[best] >= self.threshold:
return self.answers[best], self.questions[best], scores[best]
return None
def store(self, question, answer):
self.vectors.append(embed(question))
self.questions.append(question)
self.answers.append(answer)
def ask(self, question):
"""The full path: try the cache, else call the model and cache the result.
Returns (answer, was_hit, latency_ms, dollars_spent)."""
hit = self.lookup(question)
if hit is not None:
answer, _, _ = hit
return answer, True, LOOKUP_MS, 0.0
answer = stub_llm(question)
self.store(question, answer)
# A miss pays the lookup AND the model call (we looked first, found nothing).
return answer, False, LOOKUP_MS + LLM_MS, COST_PER_CALL
# --- A workload with deliberate near-duplicate paraphrases --------------------
# Three distinct intents, each asked several ways. A good cache should serve the
# paraphrases from the answer it computed the first time.
WORKLOAD = [
"How do I reset my password?", # intent A, first time -> MISS
"What is your refund policy?", # intent B, first time -> MISS
"How can I reset my password?", # A paraphrase -> should HIT
"I forgot my password, how do I reset it?", # A paraphrase -> should HIT
"How do I get a refund?", # B paraphrase -> should HIT
"Where is your office located?", # intent C, first time -> MISS
"How do I change my password?", # A-ish paraphrase -> should HIT
"Tell me about your refund policy please", # B paraphrase -> should HIT
"What are your office hours?", # C-ish, different intent -> MISS
"reset my password", # A paraphrase -> should HIT
]
def run(cache, label):
print(f"=== {label} (threshold = {cache.threshold}) ===")
global LLM_CALLS
LLM_CALLS = 0
hits = 0
total_ms = 0.0
total_dollars = 0.0
for q in WORKLOAD:
_, was_hit, ms, dollars = cache.ask(q)
hits += was_hit
total_ms += ms
total_dollars += dollars
tag = "HIT " if was_hit else "MISS"
print(f" {tag} {q}")
n = len(WORKLOAD)
print(f"\n hit rate: {hits}/{n} = {hits / n:.0%}")
print(f" LLM calls: {LLM_CALLS} (one per MISS)")
return total_ms, total_dollars, hits
# --- BEFORE vs AFTER: no cache vs cache ---------------------------------------
n = len(WORKLOAD)
# Baseline: every question hits the model.
no_cache_ms = n * (LLM_MS)
no_cache_dollars = n * COST_PER_CALL
print("Cost model: cache lookup ~{}ms, LLM call ~{}ms and {} output tokens "
"at ${}/Mtok = ${:.5f}/call.".format(
LOOKUP_MS, LLM_MS, OUT_TOKENS, PRICE_PER_MTOK, COST_PER_CALL))
print(f"Workload: {n} questions, with paraphrases of earlier ones mixed in.\n")
print(f"--- BEFORE (no cache): every question calls the model ---")
print(f" {n} LLM calls")
print(f" latency: {no_cache_ms:,.0f} ms")
print(f" cost: ${no_cache_dollars:.5f}\n")
cache = SemanticCache(threshold=0.8)
cache_ms, cache_dollars, hits = run(cache, "AFTER (semantic cache)")
print(f"\n--- SAVED by the cache ---")
print(f" latency: {no_cache_ms - cache_ms:,.0f} ms "
f"({1 - cache_ms / no_cache_ms:.0%} faster)")
print(f" cost: ${no_cache_dollars - cache_dollars:.5f} "
f"({1 - cache_dollars / no_cache_dollars:.0%} cheaper)")
print()
# --- 4. The threshold trade-off: a FALSE HIT ----------------------------------
# Lower the bar and the cache gets greedy: it starts serving cached answers to
# questions that only LOOK similar. Here two DIFFERENT intents share a content
# word ('cancel') and a loose threshold collapses them into one answer.
print("=== The threshold trade-off: a FALSE HIT ===")
first = "How do I cancel my subscription?"
second = "How do I cancel my flight?" # different intent, shares 'cancel'
score = cosine(embed(first), embed(second))
print(f" cached question: {first}")
print(f" new question: {second}")
print(f" similarity: {score:.2f}\n")
# At the safe threshold the cache correctly MISSES and would ask the model.
safe = SemanticCache(threshold=0.8)
safe.store(first, stub_llm(first))
print(f" threshold 0.8 -> "
f"{'HIT' if safe.lookup(second) else 'MISS'} (correct: these are different "
f"questions, ask the model)")
# Drop the threshold below the score and the same pair becomes a wrong HIT.
greedy = SemanticCache(threshold=0.4)
greedy.store(first, stub_llm(first))
hit = greedy.lookup(second)
answer = hit[0] if hit else None
print(f" threshold 0.4 -> {'HIT' if hit else 'MISS'} served: {answer}")
print(" ^ WRONG. A loose threshold served the SUBSCRIPTION answer to a question")
print(" about a FLIGHT. This is the core risk of semantic caching: too low a")
print(" threshold trades correctness for hit rate. You cannot push hit rate to")
print(" 100% without eventually serving someone the wrong answer.")
Running it:
Cost model: cache lookup ~10ms, LLM call ~1500ms and 500 output tokens at $25.0/Mtok = $0.01250/call.
Workload: 10 questions, with paraphrases of earlier ones mixed in.
--- BEFORE (no cache): every question calls the model ---
10 LLM calls
latency: 15,000 ms
cost: $0.12500
=== AFTER (semantic cache) (threshold = 0.8) ===
MISS How do I reset my password?
MISS What is your refund policy?
HIT How can I reset my password?
HIT I forgot my password, how do I reset it?
MISS How do I get a refund?
MISS Where is your office located?
MISS How do I change my password?
HIT Tell me about your refund policy please
MISS What are your office hours?
HIT reset my password
hit rate: 4/10 = 40%
LLM calls: 6 (one per MISS)
--- SAVED by the cache ---
latency: 5,900 ms (39% faster)
cost: $0.05000 (40% cheaper)
=== The threshold trade-off: a FALSE HIT ===
cached question: How do I cancel my subscription?
new question: How do I cancel my flight?
similarity: 0.43
threshold 0.8 -> MISS (correct: these are different questions, ask the model)
threshold 0.4 -> HIT served: [answer to: 'How do I cancel my subscription?']
^ WRONG. A loose threshold served the SUBSCRIPTION answer to a question
about a FLIGHT. This is the core risk of semantic caching: too low a
threshold trades correctness for hit rate. You cannot push hit rate to
100% without eventually serving someone the wrong answer.
Read the AFTER block question by question. The first time each of the three topics appears (password, refund, office) it MISSes and calls the model: there is nothing to match against yet. After that, the paraphrases start hitting. "How can I reset my password?" scores high enough against the stored "How do I reset my password?" to clear the bar, so it returns the stored answer for free. So do "I forgot my password, how do I reset it?", "Tell me about your refund policy please", and the terse "reset my password".
Notice the two paraphrases that miss. "How do I get a refund?" is clearly about refunds, and a human would serve it the refund answer, but it shares almost no content words with the stored "What is your refund policy?" ("get" and "refund" versus "refund" and "policy"), so its score lands below $0.8$ and the cache plays it safe by asking the model. "How do I change my password?" misses too, and arguably it should: changing a password is a different operation from resetting one. These misses are the cost of a strict threshold. A higher hit rate is available, but only by lowering the bar, and the false-hit demo shows where that road ends.
What the saving is
The BEFORE/AFTER totals come straight from the cost model at the top of the file: a cache lookup is about 10ms, a model call is about 1500ms and writes about 500 output tokens at $25 per million tokens (the claude-opus-4-8 output rate from Chapter 2). On this ten-question workload, four hits cut six model calls down from ten, which is 40% fewer dollars and 39% less latency. The two numbers track each other because, in this simple model, every hit saves one full model call. Whether a real workload saves 5% or 60% depends entirely on its duplicate rate: the fraction of traffic that is a near-repeat of something already answered. FAQ bots and repeated analytics questions can be very high; open-ended creative requests can be near zero.
The false hit, and why the threshold is everything
The last block is the warning. Take two genuinely different questions that happen to share a content word: "How do I cancel my subscription?" and "How do I cancel my flight?". Both contain "cancel", so their vectors are not orthogonal; they score $0.43$. At the safe threshold of $0.8$ that is a MISS, which is correct: these are different questions, so the cache asks the model. Drop the threshold to $0.4$ and the same pair becomes a HIT, and now the cache hands the subscription-cancellation answer to someone asking about a flight. That is a false hit, and it is the one failure mode prefix caching cannot produce.
This is the trade-off in one picture. The threshold is a dial between two kinds of mistake. Set it too high and you get false misses: real paraphrases that should have hit but ask the model anyway, costing money you did not need to spend (the refund and change-password misses above). Set it too low and you get false hits: different questions served the wrong answer, costing trust. You cannot push the hit rate to 100% without eventually serving someone the wrong answer, so the right setting is conservative: better to pay for an extra model call than to confidently answer the wrong question. Most production caches land around $0.8$ to $0.85$ on a good embedder and tune from there against real traffic.
A real embedder changes the numbers but not the shape of this trade-off. A trained sentence embedder would score "How do I get a refund?" against "What is your refund policy?" much higher than our toy hashing embedder does, so it would catch that paraphrase at a safe threshold and lift the hit rate. It would also keep "cancel my subscription" and "cancel my flight" far apart. A better embedder buys you both a higher hit rate and fewer false hits at the same time, which is exactly why production systems use one. The threshold dial is still there; the embedder just makes its safe range wider.
How this looks in production
You do not build the embedder, the vector store, and the similarity search by hand for a real system. Two representative tools:
GPTCache is an open-source semantic cache. Its architecture is the same pipeline as the demo, split into named stages: an adapter that wraps the LLM call, a pre-processor that normalizes the incoming request, an embedding generator that turns the query into a vector, a cache manager that stores vectors and answers, a similarity evaluator that scores a new query against the stored ones (cosine similarity with a threshold around $0.8$, the same decision the demo makes), and a post-processor that returns the chosen answer. It plugs into orchestration frameworks like LangChain and LlamaIndex, so an existing app can route calls through the cache without rewriting them.
Redis LangCache is a managed version of the same idea: embedding, storage, and similarity search behind a single API call, so you do not run the vector store yourself. The managed search adds a small amount of latency (on the order of a handful of milliseconds) to find a match, which is the cache-lookup cost in our model, and on a hit it saves the full model call, which is the slow, expensive part. The economics are the BEFORE/AFTER of the demo: trade a cheap lookup for an expensive generation, on the fraction of traffic that repeats.
Where these pay off is exactly where the duplicate rate is high: FAQ and support bots, repeated analytics questions ("what were sales last quarter?" asked fifty ways), and any service with heavy near-duplicate traffic. Where they do not pay off is open-ended or per-user-unique work, where almost nothing repeats and every lookup is a guaranteed miss that still costs you the embedding step. The decision to add a semantic cache is, at bottom, a bet on how repetitive your traffic is.
Using the real tool: commands and before/after proof
The from-scratch demo above is the engine. In an app you wire in a packaged version of it instead of hand-rolling the embedder and vector search. Here is what that looks like with the two tools from the last section, and how you prove the win.
GPTCache: wrap the LLM call
GPTCache is an open-source semantic cache that sits in front of your model call. Install it:
pip install gptcache
The library is not installed on this box and the snippets below need a model API key, so treat them as follow-along: the calls are the real, documented API, but the output shown is illustrative, not run here. This is the semantic-cache setup straight from the GPTCache README. It builds the same pipeline as our demo (an embedder, a vector store, a distance-based similarity check) and points GPTCache's OpenAI adapter at it:
from gptcache import cache
from gptcache.adapter import openai
from gptcache.embedding import Onnx
from gptcache.manager import CacheBase, VectorBase, get_data_manager
from gptcache.similarity_evaluation.distance import SearchDistanceEvaluation
onnx = Onnx() # a real (small) sentence embedder
data_manager = get_data_manager( # where vectors and answers live
CacheBase("sqlite"),
VectorBase("faiss", dimension=onnx.dimension),
)
cache.init(
embedding_func=onnx.to_embeddings, # text -> vector (our toy embedder's job)
data_manager=data_manager, # store + search (our cache dict's job)
similarity_evaluation=SearchDistanceEvaluation(), # the threshold decision
)
cache.set_openai_key()
SearchDistanceEvaluation is the threshold knob from the demo, now expressed as a distance
(low distance means similar) rather than a similarity (high means similar). After cache.init,
you call the model through openai from gptcache.adapter instead of the normal client. The
adapter checks the cache first and only calls the model on a miss. You do not change your
prompt; you change which client you call through. GPTCache also plugs into LangChain: you set
it as the global LLM cache with LangChain's set_llm_cache(...), and every model call in the
app routes through the cache without rewriting the call sites.
Redis LangCache / redisvl: a before/after you can time
Redis LangCache is the managed service; redisvl (the Redis Vector Library) is the open client
that exposes the same SemanticCache you would run yourself against a Redis instance. It gives
you a clean check/store pair, which makes the before/after easy to time. Install it:
pip install redisvl
Again follow-along (no Redis, no model key, no redisvl on this box; output is
illustrative). The recipe: time two similar questions. The first is a cold MISS that
pays for a full model call; the second is a HIT served from the cache with no model call.
import time
from redisvl.extensions.cache.llm import SemanticCache
cache = SemanticCache(
name="llmcache",
redis_url="redis://localhost:6379",
distance_threshold=0.1, # COSINE distance in [0, 2]; LOWER is stricter (0 = identical)
)
def answer(question):
hit = cache.check(prompt=question) # vector search against stored questions
if hit:
return hit[0]["response"], "HIT" # served from cache, model NOT called
reply = call_the_model(question) # your real claude-opus-4-8 call goes here
cache.store(prompt=question, response=reply)
return reply, "MISS"
for q in ["How do I reset my password?", # cold: nothing stored yet -> MISS
"How can I reset my password?"]: # paraphrase of the first -> HIT
t0 = time.perf_counter()
reply, status = answer(q)
dt = (time.perf_counter() - t0) * 1000
print(f"{status:4} {dt:8.1f} ms {q}")
Illustrative output (expected shape, not measured here):
MISS 1503.7 ms How do I reset my password?
HIT 14.2 ms How can I reset my password?
The MISS pays the full model call (about 1500 ms and a few hundred output tokens). The HIT is just the embed-and-search lookup (about 15 ms) and calls no model, so it costs no output tokens. That is roughly a 100x latency drop on the repeat and the entire avoided generation cost, which on the claude-opus-4-8 rates from Chapter 2 (output billed at 5x input) is where the dollars go.
The knob is distance_threshold. Note the inversion from our demo: redisvl uses cosine
distance (lower is stricter, $0$ is identical), while the from-scratch demo used cosine
similarity (higher is stricter, $1$ is identical). They are two faces of the same dial. Set
it too loose (a high distance threshold here) and you get the false hit the demo proved on
"cancel my subscription" versus "cancel my flight": a different question served the wrong stored
answer. The on-box semantic_cache.py demo is the real, verified proof of that failure mode and
of the BEFORE/AFTER arithmetic; the snippets here are the same logic packaged behind a tool.
In Claude Code / at the app layer
Semantic caching lives in your application, in front of the model call, and it does not care which provider you use: the embedder, the vector store, and the threshold are all yours. That makes it independent of, and complementary to, Anthropic's prompt caching from Chapter 6. Prompt caching is a server-side feature that reuses the input prefix tokens when the model does run; semantic caching is a client-side layer that skips the model entirely on a near-duplicate. Use both: prompt caching makes the calls you do make cheaper, and the semantic cache removes the calls you should not be making at all.
Takeaways
- Semantic caching returns a stored answer for an approximately similar question and skips the model entirely, saving the expensive output half of the call, not just the input prefix that Chapter 6 reuses.
- The decision is three pieces: an embedding (text to vector), cosine similarity (the angle between two vectors, a plain dot product on unit vectors), and a threshold (the cutoff that turns a score into hit or miss).
- The threshold is a dial between false misses (paraphrases that needlessly call the model) and false hits (different questions served the wrong answer). False hits are the failure mode prefix caching cannot have, so tune conservatively, around $0.8$ on a good embedder.
- The dollar and latency saving scales with your duplicate rate: high for FAQ/support bots and repeated analytics, near zero for open-ended or per-user-unique work.
- A better embedder widens the safe threshold range, raising the hit rate and lowering false hits at once. Tools like GPTCache and Redis LangCache package the embedding, storage, and similarity search so you wire in the cache instead of building it.
👉 We have skipped the model on a cache hit. The next chapter goes back inside the model for the calls you cannot skip: how the KV-cache is served at scale, so the generation you do pay for is as fast and cheap as the hardware allows.
KV-cache serving optimization
Chapter 6 was about the bill: when you re-send a stable prefix, the provider can skip re-reading it and charge you a fraction of the price. That is the view from outside, through the API. This chapter goes inside the box. The same key/value state that prefix caching lets you reuse has to physically live somewhere on the server's GPU, and how the serving engine lays that memory out across many users at once decides how many requests it can run in parallel and how fast each one comes back. Two ideas do most of the work, and both have a clean from-scratch model: paging the cache into fixed-size blocks, and sharing the blocks of a common prefix across requests.
What the KV cache is, concretely
When a model generates text, each new token has to attend to every token before it. Attention works by comparing the new token's query vector against a key vector for every earlier token, then mixing in a value vector for each. The keys and values for the earlier tokens do not change as generation continues, so recomputing them on every step would be pure waste. The engine computes them once and keeps them. That saved pile of key and value vectors is the KV cache.
Two facts about it drive this whole chapter. First, it grows by exactly one slot per token, per layer, as the sequence gets longer, so a request that has generated 500 tokens is holding 500 slots of KV. Second, it lives in GPU memory, which is small and expensive, and it is usually the thing that runs out first when you try to serve many users at once. Model weights are a fixed cost you pay once; the KV cache is a per-request, per-token cost that scales with how many people are talking to the model and how much they have said. So the engineering question is not "how do we compute attention" but "how do we store all this KV for all these concurrent requests without wasting memory."
Don't be confused. This is a different layer from Chapter 6. That chapter was about API prompt caching: what you, the caller, pay when your prompt repeats an exact prefix, billed per request, surfaced as a discount on your invoice. This chapter is about engine memory management: how the server arranges KV in GPU memory so it can hold many requests at once, none of which you see or are billed for directly. They are related (prefix sharing here is the mechanism that makes API prefix caching possible at all), but one is your cost model and the other is the server's memory model. When you call a hosted API you are using both, through different doors.
Idea one: page the cache instead of reserving the maximum
Here is the trap a naive serving engine falls into. The KV cache for a request grows as it
generates, and the engine does not know in advance how long the request will get. The model
could generate up to max_seq_len tokens (say 2048), so the simple thing is to reserve one
contiguous buffer of 2048 slots per request up front. Contiguous means one unbroken run of
memory addresses, which the old attention kernels needed.
The problem is that most requests finish far short of the maximum. A request that generates 80 tokens still holds a 2048-slot reservation, and the other 1968 slots sit empty but allocated, unusable by anyone else. That wasted tail is internal fragmentation: memory that is reserved but not used, trapped inside a per-request allocation. With requests that vary in length, the average waste is enormous, and it directly caps how many requests fit on the GPU.
The fix, PagedAttention (introduced by vLLM), is
the same trick an operating system uses for RAM. Instead of one big contiguous buffer per
request, the engine carves GPU memory into many small fixed-size blocks (also called pages;
vLLM's default is 16 tokens of KV per block) and hands them out on demand. A request that has
filled 80 tokens holds ceil(80 / 16) = 5 blocks and nothing more. When it needs token 81 it
asks for one more block. The blocks for a single request do not have to be next to each other
in memory; a small table maps the logical sequence to its scattered physical blocks, exactly
like a page table maps virtual addresses to physical RAM. The only waste left is the partly
filled last block, which is at most block_size - 1 tokens per request.
The first part of the demo measures the difference. We take 64 requests with realistic varied lengths and compute memory utilization, the fraction of reserved memory that actually holds live KV:
$$\text{utilization} = \frac{\text{tokens used}}{\text{tokens reserved}}$$
The contiguous allocator reserves max_seq_len for every request; the paged allocator reserves
ceil(len / block_size) blocks. We never compute an attention kernel here, just the
arithmetic of what each policy reserves.
Idea two: share the blocks of a common prefix
The second idea attacks a different waste. In multi-user serving, many requests start with the same tokens: the same system prompt, the same few-shot examples, the same tool definitions. Because the KV for a position depends only on the tokens up to that position, two requests that share a prefix produce bit-for-bit identical KV blocks for that prefix. Storing a separate copy per request is redundant. The engine could store the shared prefix once and let every request that begins with it point at the same blocks.
The data structure for "find the longest shared prefix fast" is a radix tree, also called a prefix tree or trie. A trie stores sequences as paths from a root: each edge is one element (here, one KV block), and sequences that share a prefix share the early part of their path before branching. RadixAttention (from SGLang) builds exactly this tree over the block sequences of live requests, keyed by block content, so that the moment a new request's prefix matches an existing path, it reuses those blocks instead of allocating and recomputing them. This is automatic: no one has to declare "these requests share a prompt," the tree discovers it.
The second part of the demo builds a small radix tree over 40 requests. Each request is a list of block ids; equal ids mean identical blocks (same tokens, same positions, so shareable). All 40 share an 8-block system prompt, 30 of them also share a 6-block few-shot preamble, and each has a short unique tail. Inserting a request into the tree creates a new node only for a block that has never appeared on that path, so the count of nodes created is the count of blocks the engine actually has to store. We compare that against the naive total where every request keeps its own copy of everything.
The demo
"""KV-cache serving optimization: how the engine stores attention state in GPU memory.
When a model generates text, every token it has already seen leaves behind a pair of
vectors per layer (a "key" and a "value") that the next token attends to. That stored
pile of key/value vectors is the KV CACHE. It grows by one slot per generated token, and
it lives in scarce GPU memory. How the serving engine LAYS OUT that memory across many
concurrent requests decides how many requests fit at once. Two ideas dominate:
1. PAGED allocation (vLLM's PagedAttention). A naive engine reserves one contiguous
buffer of max_seq_len up front for each sequence, then most sequences finish short
and the unused tail is wasted (internal fragmentation). A paged allocator hands out
fixed-size BLOCKS on demand, like an operating system pages virtual memory, so a
sequence only holds the blocks it actually filled. We measure UTILIZATION =
used / reserved for both.
2. PREFIX SHARING (SGLang's RadixAttention). Many requests start with the SAME prefix
(a shared system prompt, a fixed few-shot preamble). Their KV blocks for that prefix
are identical, so the engine can store ONE copy and let every request point at it.
We build a radix tree (a prefix tree / trie over block sequences), then count blocks
stored without sharing vs with sharing, and report blocks saved.
This is a SIMULATION with real arithmetic, not a GPU kernel. NumPy + stdlib only.
Run: python3 kv_serving.py
"""
import numpy as np
# ----------------------------------------------------------------------------------
# Part 1: contiguous (reserve max) vs paged (blocks on demand) allocation.
# ----------------------------------------------------------------------------------
MAX_SEQ_LEN = 2048 # the longest a sequence is allowed to get
BLOCK_SIZE = 16 # tokens of KV per page/block (vLLM's default is 16)
# A batch of requests with realistic, varied actual lengths. A naive allocator cannot
# know these in advance, so it must reserve MAX_SEQ_LEN for every one of them.
rng = np.random.default_rng(0)
actual_lens = rng.integers(20, 600, size=64) # 64 requests, 20..599 tokens each
def blocks_needed(n_tokens, block_size):
"""How many fixed-size blocks hold n_tokens. The last block is partly empty;
that small leftover is the only waste a paged allocator ever pays."""
return int(np.ceil(n_tokens / block_size))
def contiguous_utilization(lens, max_len):
"""Reserve max_len slots per sequence regardless of how long it actually gets."""
used = int(np.sum(lens))
reserved = len(lens) * max_len
return used, reserved
def paged_utilization(lens, max_len, block_size):
"""Hand out blocks on demand; a sequence holds only ceil(len/block_size) blocks."""
used = int(np.sum(lens))
reserved = int(np.sum([blocks_needed(n, block_size) for n in lens])) * block_size
return used, reserved
used_c, reserved_c = contiguous_utilization(actual_lens, MAX_SEQ_LEN)
used_p, reserved_p = paged_utilization(actual_lens, MAX_SEQ_LEN, BLOCK_SIZE)
util_c = used_c / reserved_c
util_p = used_p / reserved_p
print("=== 1. Paged vs contiguous KV allocation ===")
print(f" {len(actual_lens)} requests, actual lengths {actual_lens.min()}..{actual_lens.max()} "
f"tokens (mean {actual_lens.mean():.0f}).")
print(f" max_seq_len = {MAX_SEQ_LEN}, block_size = {BLOCK_SIZE} tokens/block.\n")
print(" BEFORE (contiguous: reserve max_seq_len per request)")
print(f" used {used_c:,} tok / reserved {reserved_c:,} tok -> "
f"utilization {util_c:6.1%}")
print(f" wasted: {reserved_c - used_c:,} tok ({1 - util_c:.1%} of reserved)\n")
print(" AFTER (paged: blocks on demand)")
print(f" used {used_p:,} tok / reserved {reserved_p:,} tok -> "
f"utilization {util_p:6.1%}")
print(f" wasted: {reserved_p - used_p:,} tok ({1 - util_p:.1%} of reserved, all in "
f"half-full last blocks)\n")
print(f" Same KV in {reserved_c:,} reserved tok shrinks to {reserved_p:,}: "
f"{reserved_c / reserved_p:.1f}x less memory reserved for the SAME work.")
print(f" -> a fixed GPU can now hold ~{reserved_c / reserved_p:.1f}x as many concurrent "
f"requests.\n")
# ----------------------------------------------------------------------------------
# Part 2: prefix sharing via a radix tree (trie) over block sequences.
# ----------------------------------------------------------------------------------
class RadixNode:
"""One node = one shared block, identified by a content key (here, a token-block
id). Children are the blocks that have followed it. Many requests can walk the
same path, which is exactly the prefix they share."""
__slots__ = ("children",)
def __init__(self):
self.children = {}
def insert(root, block_seq):
"""Walk the request's block ids down the tree, creating a node only when a block
has never been seen on this path before. Returns how many NEW blocks this request
forced us to store (the blocks not already shared with an earlier request)."""
node = root
new_blocks = 0
for blk in block_seq:
if blk not in node.children:
node.children[blk] = RadixNode()
new_blocks += 1
node = node.children[blk]
return new_blocks
# A workload that looks like multi-user serving: a long shared SYSTEM PROMPT, then a
# shared FEW-SHOT preamble for most requests, then each request's unique question.
# We model each request as a sequence of block ids. Equal ids mean identical KV blocks
# (same tokens in the same position), which is what makes them shareable.
SYSTEM_BLOCKS = [f"sys{i}" for i in range(8)] # 8 blocks every request shares
FEWSHOT_BLOCKS = [f"shot{i}" for i in range(6)] # 6 more blocks most requests share
requests = []
for r in range(40):
blocks = list(SYSTEM_BLOCKS)
if r % 4 != 0: # 30 of 40 also use the few-shot preamble
blocks += FEWSHOT_BLOCKS
blocks += [f"req{r}_q{j}" for j in range(rng.integers(2, 6))] # unique tail
requests.append(blocks)
total_blocks_no_share = sum(len(b) for b in requests)
root = RadixNode()
stored_with_share = sum(insert(root, b) for b in requests)
print("=== 2. Prefix sharing via a radix tree (RadixAttention) ===")
print(f" {len(requests)} requests. Each starts with an {len(SYSTEM_BLOCKS)}-block system "
f"prompt; {sum(1 for r in range(40) if r % 4 != 0)} also share a "
f"{len(FEWSHOT_BLOCKS)}-block few-shot preamble.\n")
print(" BEFORE (no sharing: every request stores its own copy of every block)")
print(f" blocks stored: {total_blocks_no_share:,}\n")
print(" AFTER (radix tree: shared prefixes stored once)")
print(f" blocks stored: {stored_with_share:,}")
saved = total_blocks_no_share - stored_with_share
print(f" blocks saved: {saved:,} ({saved / total_blocks_no_share:.1%} fewer)\n")
print(f" The {len(SYSTEM_BLOCKS)} system blocks are stored once instead of "
f"{len(requests)} times; the few-shot preamble once instead of many times.")
print(" Every saved block is GPU memory freed for another concurrent request,")
print(" and prefill compute the engine skips because that KV already exists.")
Running it:
=== 1. Paged vs contiguous KV allocation ===
64 requests, actual lengths 21..598 tokens (mean 315).
max_seq_len = 2048, block_size = 16 tokens/block.
BEFORE (contiguous: reserve max_seq_len per request)
used 20,174 tok / reserved 131,072 tok -> utilization 15.4%
wasted: 110,898 tok (84.6% of reserved)
AFTER (paged: blocks on demand)
used 20,174 tok / reserved 20,640 tok -> utilization 97.7%
wasted: 466 tok (2.3% of reserved, all in half-full last blocks)
Same KV in 131,072 reserved tok shrinks to 20,640: 6.4x less memory reserved for the SAME work.
-> a fixed GPU can now hold ~6.4x as many concurrent requests.
=== 2. Prefix sharing via a radix tree (RadixAttention) ===
40 requests. Each starts with an 8-block system prompt; 30 also share a 6-block few-shot preamble.
BEFORE (no sharing: every request stores its own copy of every block)
blocks stored: 642
AFTER (radix tree: shared prefixes stored once)
blocks stored: 156
blocks saved: 486 (75.7% fewer)
The 8 system blocks are stored once instead of 40 times; the few-shot preamble once instead of many times.
Every saved block is GPU memory freed for another concurrent request,
and prefill compute the engine skips because that KV already exists.
Read the two halves together, because they attack waste from opposite directions. Part one is about waste within a single request: the contiguous allocator reserves 131,072 tokens to hold 20,174 real ones, a utilization of 15.4 percent, which is to say it throws away 84.6 percent of the memory it reserved on padding that no request will ever fill. Paging drops that to 2.3 percent waste and 97.7 percent utilization, all of the remaining loss confined to the half-empty last block of each request. The same KV that needed 131,072 reserved slots now fits in 20,640, which is 6.4x less memory for identical work. Memory is the binding constraint on how many requests a GPU can serve at once, so reserving 6.4x less of it is, to first order, room for 6.4x more concurrent requests.
Part two is about waste across requests. Without sharing, the 40 requests store 642 blocks total, re-storing the same system prompt 40 separate times. The radix tree stores the 8 system blocks once and the few-shot preamble once, bringing the total down to 156 blocks, a saving of 486 blocks or 75.7 percent. Every block saved is GPU memory freed for another user, and it is also prefill compute the engine gets to skip entirely, because the KV for those shared tokens already exists and does not need to be recomputed. That skipped compute is the server-side mechanism underneath the cache-read discount you saw on your bill in Chapter 6.
The exact percentages here come from this synthetic workload and the seed in the script; change the length distribution or the prefix overlap and the numbers move. What is robust is the direction and the rough scale: contiguous reservation wastes the large majority of memory when lengths vary, paging recovers nearly all of it, and prefix sharing removes most of the redundancy when many requests share a long preamble. Those are exactly the conditions of real multi-user serving.
Where this shows up, and where it does not
These two techniques are the core of modern open-source serving engines. vLLM is built around PagedAttention; SGLang adds RadixAttention for automatic prefix sharing across requests, and both implement the other's idea too, so in practice you get paging and sharing together. The workloads that benefit most are the obvious ones: high-throughput serving with many concurrent users (paging lets far more of them coexist), any setup where requests share a long system prompt or few-shot preamble (sharing stores it once), and agent fan-out where one parent spawns many children that all carry the same instructions and context (Chapter 13), which is prefix sharing at its best because the overlap is large and deliberate.
If you only ever call a hosted API, you do not configure any of this; the provider runs an engine like these on your behalf, and your lever is to structure your prompts so the shared part comes first and stays byte-stable, which is the same discipline that earns the API cache discount in Chapter 6. If you self-host, this is the layer you actually operate, and understanding it is how you reason about throughput and the maximum batch size your GPU can hold. Either way, it connects upward to attention itself: paging and sharing manage where the KV lives, while Chapter 14 is about making attention over a long cache cheap to compute in the first place.
Using the real tool: commands and before/after proof
The from-scratch demo proves the mechanism on this box. The production versions of both ideas live in real serving engines you run on a GPU server, not in Claude Code. Claude Code is a client that talks to a hosted API; the paging and prefix sharing below happen on whatever server is answering, and you only operate them yourself when you self-host a model with vLLM or SGLang. So this section is server-side: the commands assume a machine with a GPU and the model weights, and the numbers at the end are labeled illustrative because there is no GPU on this box to run them.
Both engines do paging by default. The lever you actually toggle is prefix sharing, so that is what the before/after proof turns on and off.
Install vLLM and start a server. vllm serve <model> launches an OpenAI-compatible HTTP endpoint
on localhost:8000. In the current vLLM (the V1 engine), automatic prefix caching is on by
default, so the first command below is the "after" (sharing on) and the second is the "before"
(sharing off) for an apples-to-apples comparison:
pip install vllm
# AFTER: prefix sharing ON (this is the default in vLLM V1; the flag is explicit here)
vllm serve meta-llama/Meta-Llama-3-8B-Instruct --enable-prefix-caching
# BEFORE: prefix sharing OFF (forces a cold prefill on every request)
vllm serve meta-llama/Meta-Llama-3-8B-Instruct --no-enable-prefix-caching
SGLang is the engine that introduced RadixAttention. Its launcher is a Python module, and
RadixAttention prefix caching is also on by default; --disable-radix-cache is the off
switch for the baseline run:
pip install "sglang[all]"
# Prefix sharing ON (default)
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct
# Prefix sharing OFF (baseline)
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --disable-radix-cache
Now hit the endpoint with two requests that share a long prefix and differ only in a short tail.
The point is that the big shared block is identical across both calls, so the second call should
get to reuse its KV. This is a follow-along curl against the OpenAI-compatible
/v1/chat/completions route (the LONG_SYSTEM text below stands in for a real multi-thousand-token
system prompt or document; make it long, that is the whole point):
LONG_SYSTEM="<a few thousand tokens of stable system prompt / policy / docs, identical every time>"
# Request 1: pays the full prefill for the long shared prefix
time curl -s http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d "{\"model\":\"meta-llama/Meta-Llama-3-8B-Instruct\",
\"messages\":[{\"role\":\"system\",\"content\":\"$LONG_SYSTEM\"},
{\"role\":\"user\",\"content\":\"Summarize in one line.\"}],
\"max_tokens\":32}"
# Request 2: SAME long prefix, different short question
time curl -s http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d "{\"model\":\"meta-llama/Meta-Llama-3-8B-Instruct\",
\"messages\":[{\"role\":\"system\",\"content\":\"$LONG_SYSTEM\"},
{\"role\":\"user\",\"content\":\"List three risks.\"}],
\"max_tokens\":32}"
The before/after proof is the second request. Both engines log a prefix cache hit rate (vLLM
prints lines like Prefix cache hit rate: ...; SGLang reports reused/cached tokens), so you watch
two signals at once: the server logs for how many prefix tokens were reused, and the client
clock (time, or the time-to-first-token if you stream) for how long the second call took to
start producing output. With sharing on, the second request finds the long prefix already in the
KV cache, skips re-running prefill over those thousands of tokens, and its first token comes back
much sooner. With sharing off, every request prefills the whole prefix from scratch, so the second
call is no faster than the first. The shape of the result (illustrative, not measured on this box,
because it has no GPU):
# Prefix sharing OFF (--no-enable-prefix-caching / --disable-radix-cache)
request 1: time-to-first-token ~ 0.9 s (full prefill of the long prefix)
request 2: time-to-first-token ~ 0.9 s (full prefill again; nothing reused)
server log: prefix cache hit rate ~ 0%
# Prefix sharing ON (default)
request 1: time-to-first-token ~ 0.9 s (cold: builds the prefix's KV)
request 2: time-to-first-token ~ 0.1 s (warm: prefix KV reused, prefill skipped)
server log: prefix cache hit rate high (most of request 2's prompt tokens were cache hits)
A high hit rate also means the shared prefix's KV is stored once and pointed at by both requests
instead of copied, which is KV memory headroom freed for more concurrent users. That is the same
two wins the from-scratch demo measured above, now coming from a real engine: the paged allocator
is what lets request 2 reuse exactly the blocks request 1 built, and the radix tree is what finds
the shared prefix to reuse. The demo's 97.7% utilization and 75.7% fewer blocks are the
on-box proof of the mechanism that produces this drop in time-to-first-token on the server.
One sharp distinction from Chapter 6: that chapter is about what you pay at the API, the cache-read discount on your invoice, while this is what the serving engine does with GPU memory across concurrent requests so it can reuse that KV and fit more of them at once.
Takeaways
- The KV cache is the stored key/value vectors for every token a request has seen. It grows one slot per token, lives in scarce GPU memory, and is usually what limits how many requests a server can run at once.
- Reserving a contiguous
max_seq_lenbuffer per request wastes most of the memory when lengths vary (internal fragmentation): in the demo, 15.4 percent utilization, 84.6 percent wasted. - PagedAttention hands out fixed-size blocks on demand, like OS virtual memory, pushing utilization to 97.7 percent and fitting about 6.4x more concurrent requests in the same memory.
- RadixAttention stores the KV of a shared prefix once via a radix tree (a trie over block sequences), so a common system prompt costs one copy instead of one per request: 75.7 percent fewer blocks in the demo.
- This is the server-side mechanism beneath the API prefix-cache discount of Chapter 6. If you self-host you operate it directly; if you call an API, you exploit it by keeping the shared prefix first and byte-stable.
👉 Caching reuses work on context that repeats. But some state should outlive the window entirely, surviving across sessions rather than just across calls. The next part turns to memory: storing what the model should remember outside the context and re-injecting only the slice that this turn needs.
Agent memory and persistence
Chapter 1 made one fact unavoidable: the model is a pure function of its context, and between calls it remembers nothing. The "memory" a chat assistant seems to have is your application re-sending the conversation each turn. That works until the conversation outgrows the window, or until the user comes back next week and expects you to still know their name. At that point re-sending the whole history is either impossible (it does not fit) or wasteful (you pay for thousands of tokens to answer a question that needed three of them).
This chapter is about the durable answer: store the facts outside the window, and on each turn re-inject only the slice that this turn actually needs. That is the memory family from Chapter 1's table, and it relieves the capacity pressure by keeping the context lean no matter how long the relationship runs.
Memory is not the window
The word "memory" gets used for two different things, and conflating them is the usual source of confusion.
Don't be confused. The context window is the model's working memory for one call: a fixed-size buffer, rebuilt every turn, that the model reads and then forgets. An external memory store is your durable record: a database, file, or vector index that lives in your code and survives across calls and sessions. The model never sees the store directly. Your code reads from the store, selects what is relevant, and writes it into the window. Retrieval (pulling a few facts out of the store) is the opposite of "send the whole history" (dumping everything back into the window). The whole skill is keeping the store complete and the window small.
So a memory system has to answer four operational questions, and the rest of this chapter is those four:
- Extraction. When the user says something worth keeping, what exactly do you store?
- Storage and embedding. In what form, so you can find it again later?
- Retrieval. Given a new question, which stored facts go into the window this turn?
- Invalidation. When a new fact contradicts an old one, how do you avoid keeping both?
A memory store in 150 lines
The code below builds the smallest honest version of all four. It uses NumPy and the standard library only: no model call, no database, no network. Read it for the mechanics, not the scale; a production system swaps each piece for a stronger one, but the shape is the same.
Three terms it leans on, defined before you meet them:
- An embedding is a list of numbers that stands in for a piece of text, arranged so that texts about the same thing land near each other in number-space. Real embeddings come from a trained network; ours is a cheap stand-in that hashes each word into one of 256 slots and counts it, so two facts that share words get overlapping vectors.
- Cosine similarity measures how aligned two vectors are: 1.0 means they point the same direction (very similar), 0.0 means they are unrelated. We unit-normalize every vector (scale it to length 1), which makes the cosine just the dot product, $\cos(a,b)=a\cdot b$.
- Top-k retrieval means: score every stored fact against the query, sort, and keep the best $k$. Only those $k$ enter the window, not the whole store.
"""A tiny agent-memory store: extract, embed, store, retrieve, invalidate.
The model is stateless between calls. Anything it should "remember" across
turns or sessions is state YOUR code keeps outside the context window and
re-injects on demand. This file builds the smallest honest version of that:
EXTRACT pull atomic facts from a user turn with simple rules.
EMBED turn a fact into a fixed-length vector with a hashing bag-of-words.
STORE keep (fact, vector) in a list.
RETRIEVE rank stored facts against a query by cosine similarity, top-k.
INVALIDATE when a new fact contradicts an old one on the same subject and
attribute (e.g. "favorite color"), replace the old fact in place.
numpy + stdlib only. No network, no model call. ~1.3 tokens/word for sizing.
"""
import re
import numpy as np
DIM = 256 # length of every embedding vector
# --------------------------------------------------------------------------
# EXTRACT: user turn -> list of normalized atomic facts
# --------------------------------------------------------------------------
# We look for a few first-person and "X is Y" shapes. A real system would use
# the model itself for this; the rules here are deliberately small so you can
# see exactly what becomes a fact and what does not. We split the turn into
# clauses first ("A and B" -> ["A", "B"]) so each fact stays atomic, and the
# value group stops at the first clause break instead of swallowing the rest.
_VALUE = r"([\w ]+?)(?: and |[.,!?]|$)" # a value, ending at a clause boundary
_PATTERNS = [
# "my favorite color is blue" -> "favorite color is blue"
(re.compile(r"\bmy ([\w ]+?) is " + _VALUE, re.I), "{0} is {1}"),
# "i prefer window seats" -> "prefers window seats"
(re.compile(r"\bi prefer " + _VALUE, re.I), "prefers {0}"),
# "i am vegetarian" / "i'm allergic to peanuts" -> "is ..."
(re.compile(r"\bi am " + _VALUE, re.I), "is {0}"),
(re.compile(r"\bi'm " + _VALUE, re.I), "is {0}"),
# "i live in Lisbon" -> "lives in Lisbon"
(re.compile(r"\bi live in " + _VALUE, re.I), "lives in {0}"),
# "i work at Helios" -> "works at Helios"
(re.compile(r"\bi work at " + _VALUE, re.I), "works at {0}"),
]
def extract_facts(turn):
"""Return a list of normalized fact strings found in one user turn."""
facts = []
for pattern, template in _PATTERNS:
for match in pattern.finditer(turn):
groups = [g.strip().lower() for g in match.groups()]
fact = template.format(*groups)
fact = re.sub(r"\s+", " ", fact).strip()
if fact and fact not in facts:
facts.append(fact)
return facts
# --------------------------------------------------------------------------
# EMBED: fact string -> fixed-length unit vector (hashing bag-of-words)
# --------------------------------------------------------------------------
# An "embedding" is a list of numbers that stands in for a piece of text, built
# so that texts about the same thing land near each other. Real embeddings come
# from a trained network. Ours is a cheap stand-in: hash each word to one of DIM
# slots and count it. Shared words -> overlapping slots -> nearby vectors.
_WORD = re.compile(r"[a-z0-9]+")
# A handful of words carry no topic ("i", "is", "my", "the"): drop them so two
# facts are judged similar by their content words, not their grammar.
_STOP = {"i", "is", "am", "my", "the", "a", "an", "and", "to", "in", "of",
"at", "for", "me", "you", "your", "what", "should"}
def _stem(word):
"""Crudely fold a trailing 's' so 'seat' and 'seats' hit the same slot."""
if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
return word[:-1]
return word
def embed(text):
"""Hash the words of `text` into a DIM-length vector, then unit-normalize."""
vec = np.zeros(DIM, dtype=np.float64)
for word in _WORD.findall(text.lower()):
if word in _STOP:
continue
# Python's hash is salted per process, so use a stable hash instead.
slot = hash_stable(_stem(word)) % DIM
vec[slot] += 1.0
norm = np.linalg.norm(vec)
if norm > 0:
vec /= norm # length 1, so the dot product below is a cosine directly
return vec
def hash_stable(word):
"""A small deterministic string hash (FNV-1a), so runs are reproducible."""
h = 2166136261
for ch in word.encode("utf-8"):
h = (h ^ ch) * 16777619 & 0xFFFFFFFF
return h
def cosine(a, b):
"""Cosine similarity of two vectors: 1.0 identical direction, 0.0 unrelated.
Both vectors are already unit length, so this is just their dot product.
"""
return float(np.dot(a, b))
# --------------------------------------------------------------------------
# The store: facts, their vectors, and the subject/attribute key for updates
# --------------------------------------------------------------------------
class MemoryStore:
def __init__(self):
self.facts = [] # list of fact strings
self.vectors = [] # list of embeddings, aligned with self.facts
self.keys = [] # subject/attribute key, aligned, for invalidation
def _key(self, fact):
"""The part of a fact that identifies WHAT it is about, minus the value.
"favorite color is blue" and "favorite color is green" share the key
"favorite color is", so the second should replace the first, not sit
beside it. We key on the text up to and including the last " is ".
"""
marker = " is "
if marker in fact:
return fact[: fact.rindex(marker) + len(marker)]
# "prefers window seats" -> key "prefers", value "window seats"
parts = fact.split(" ", 1)
return parts[0] if len(parts) > 1 else fact
def add(self, fact):
"""Store a fact, REPLACING any existing fact with the same key."""
key = self._key(fact)
for i, existing_key in enumerate(self.keys):
if existing_key == key:
old = self.facts[i]
self.facts[i] = fact
self.vectors[i] = embed(fact)
return ("updated", old)
self.facts.append(fact)
self.vectors.append(embed(fact))
self.keys.append(key)
return ("added", None)
def retrieve(self, query, k=3):
"""Return the top-k (fact, score) most similar to `query`."""
if not self.facts:
return []
q = embed(query)
scored = [(self.facts[i], cosine(q, self.vectors[i]))
for i in range(len(self.facts))]
scored.sort(key=lambda pair: pair[1], reverse=True)
return scored[:k]
# --------------------------------------------------------------------------
# Sizing helper: estimate tokens the way Chapter 2 does (words * 1.3)
# --------------------------------------------------------------------------
def est_tokens(text):
return round(len(text.split()) * 1.3)
# --------------------------------------------------------------------------
# Demo
# --------------------------------------------------------------------------
def main():
# ---- Session 1: the user tells us things across several turns ----
session_1 = [
"Hi! My name is Dana and I live in Lisbon.",
"I prefer window seats and I am vegetarian.",
"By the way my favorite color is blue.",
"I work at Helios and my budget is 2000 dollars.",
]
store = MemoryStore()
print("=== Session 1: extract facts from each turn and store them ===")
for turn in session_1:
found = extract_facts(turn)
for fact in found:
status, old = store.add(fact)
if status == "updated":
print(f" turn: {turn!r}")
print(f" UPDATED {old!r} -> {fact!r}")
else:
print(f" turn: {turn!r}")
print(f" stored {fact!r}")
print(f"\n memory now holds {len(store.facts)} facts:")
for fact in store.facts:
print(f" - {fact}")
# ---- A new session. The model remembers NOTHING on its own. ----
# The user asks something only answerable from stored memory.
print("\n=== Session 2 (later): a question unanswerable without memory ===")
question = "Which seat do I prefer?"
print(f" user asks: {question!r}")
# BEFORE memory: the only context is this one question. No stored facts.
print("\n -- BEFORE: no memory injected --")
print(" context contains only the question; the model has never")
print(" seen 'Dana', 'window seats', or anything from session 1.")
print(" Best it can do: ask the user to repeat their preference.")
# AFTER memory: retrieve the few relevant facts and inject only those.
print("\n -- AFTER: retrieve top-k relevant memories and inject them --")
hits = store.retrieve(question, k=3)
for fact, score in hits:
print(f" score {score:.3f} {fact}")
injected = "\n".join(f"- {fact}" for fact, _ in hits)
print(" injected memory block:")
for line in injected.splitlines():
print(f" {line}")
print(" With 'prefers window seats' in context, the model answers:")
print(" 'I'll book you a window seat.'")
# ---- The saving: inject everything vs inject only the relevant slice ----
print("\n=== The saving: whole history vs top-k memories ===")
full_history = " ".join(session_1)
full_tok = est_tokens(full_history)
topk_tok = est_tokens(injected)
print(f" inject ALL session-1 history : {full_tok:4d} tokens")
print(f" inject top-{len(hits)} memories : {topk_tok:4d} tokens")
saved = full_tok - topk_tok
pct = 100.0 * saved / full_tok
print(f" saved : {saved:4d} tokens ({pct:.0f}% smaller)")
print(" (and history grows every turn; the top-k slice does not.)")
# ---- Contradiction: retrieval returns the UPDATED fact, not the stale one ----
print("\n=== Contradiction handling: the update wins ===")
color_q = "what is my favorite color?"
before = store.retrieve(color_q, k=1)[0]
print(f" current best answer for {color_q!r}: {before[0]!r}")
print(" user now says: 'Actually my favorite color is green.'")
for fact in extract_facts("Actually my favorite color is green."):
status, old = store.add(fact)
print(f" {status.upper()} {old!r} -> {fact!r}")
after = store.retrieve(color_q, k=1)[0]
print(f" best answer now: {after[0]!r}")
print(f" memory still holds {len(store.facts)} facts (no duplicate color).")
if __name__ == "__main__":
main()
Running it:
=== Session 1: extract facts from each turn and store them ===
turn: 'Hi! My name is Dana and I live in Lisbon.'
stored 'name is dana'
turn: 'Hi! My name is Dana and I live in Lisbon.'
stored 'lives in lisbon'
turn: 'I prefer window seats and I am vegetarian.'
stored 'prefers window seats'
turn: 'I prefer window seats and I am vegetarian.'
stored 'is vegetarian'
turn: 'By the way my favorite color is blue.'
stored 'favorite color is blue'
turn: 'I work at Helios and my budget is 2000 dollars.'
stored 'budget is 2000 dollars'
turn: 'I work at Helios and my budget is 2000 dollars.'
stored 'works at helios'
memory now holds 7 facts:
- name is dana
- lives in lisbon
- prefers window seats
- is vegetarian
- favorite color is blue
- budget is 2000 dollars
- works at helios
=== Session 2 (later): a question unanswerable without memory ===
user asks: 'Which seat do I prefer?'
-- BEFORE: no memory injected --
context contains only the question; the model has never
seen 'Dana', 'window seats', or anything from session 1.
Best it can do: ask the user to repeat their preference.
-- AFTER: retrieve top-k relevant memories and inject them --
score 0.577 prefers window seats
score 0.000 name is dana
score 0.000 lives in lisbon
injected memory block:
- prefers window seats
- name is dana
- lives in lisbon
With 'prefers window seats' in context, the model answers:
'I'll book you a window seat.'
=== The saving: whole history vs top-k memories ===
inject ALL session-1 history : 47 tokens
inject top-3 memories : 16 tokens
saved : 31 tokens (66% smaller)
(and history grows every turn; the top-k slice does not.)
=== Contradiction handling: the update wins ===
current best answer for 'what is my favorite color?': 'favorite color is blue'
user now says: 'Actually my favorite color is green.'
UPDATED 'favorite color is blue' -> 'favorite color is green'
best answer now: 'favorite color is green'
memory still holds 7 facts (no duplicate color).
Read the four sections in order, because each one is a piece of the answer.
Extraction turns chatty turns into atomic facts. The rules are small and rule-based here:
a few patterns for my X is Y, I prefer Y, I am Y, I live in Y, I work at Y. The
turn "My name is Dana and I live in Lisbon" becomes two separate facts, name is dana and
lives in lisbon, not one run-on string, because a fact you can retrieve has to be about one
thing. A real system hands this job to the model itself, asking it to emit clean facts; the
rules above are just enough to make the step concrete.
Retrieval is where the window stays small. In session 2 the user asks "Which seat do I
prefer?" The query shares the word "seat" with the stored fact "prefers window seats" (after
a crude singular/plural fold so seat and seats collide), so that fact scores 0.577 while
the rest score 0.000. The system injects the top 3 and the model can answer. Note what did
not happen: the budget, the employer, and the diet stayed in the store, out of the window,
because this question did not touch them.
The saving is the number that justifies the whole apparatus. Dumping all of session 1 into the window costs 47 tokens here; injecting the 3 retrieved facts costs 16, about two-thirds smaller. On a four-turn toy that is a rounding error. The point is the trend: the full history grows every single turn, so by turn fifty it is hundreds of tokens of mostly irrelevant chatter, while the top-k slice stays roughly constant because it is always "the few facts this question needs." Memory turns a cost that scales with conversation length into one that scales with question complexity.
Invalidation is the part people forget, and forgetting it is how assistants end up
believing two contradictory things. The user said blue, then later said green. A naive store
appends both and retrieval starts returning whichever happens to rank higher, which is a coin
flip. The fix is to give each fact a key: the part that says what the fact is about,
minus the value. Both color facts share the key favorite color is, so the second one
replaces the first in place. After the update the store still holds 7 facts, not 8, and a
query for the favorite color returns green. The key is what lets you tell "a new fact about
the same thing" (update) apart from "a fact about a different thing" (add).
What the real systems do
The toy above is the shape; three open projects are the production versions, and they make different bets about how memory should be structured.
-
Mem0 is the closest to what we built: a vector-first memory layer you bolt onto any agent. It extracts atomic facts from the conversation with an LLM, embeds them, stores them in a vector database, and retrieves the relevant ones with a multi-signal ranking (not just one cosine score). It also does the invalidation step, deciding whether a new fact adds to, updates, or contradicts what is stored. Its pitch is that this beats stuffing the full history into the window, and it reports strong numbers on the LongMemEval long-conversation benchmark to back that up.
-
Letta (the project formerly published as MemGPT) takes an operating-system view. It gives the agent a memory hierarchy: a small core tier always in the window (the agent's persona and the most important facts), a recall tier for recent conversation, and a large archival tier in external storage. The twist is that the agent pages its own memory: it has tools to move information between tiers, reading an archival fact into the window when it needs it and writing one back out when it does not, exactly the way an OS pages memory between RAM and disk. The model manages its own context instead of your code doing it from outside.
-
Zep, built on the Graphiti engine, stores memory as a temporal knowledge graph: facts are nodes and edges, and each one carries a validity window saying when it was true. That extra dimension is the whole point. A vector store answers "what is true now"; a temporal graph also answers "what was true in Q1," because a superseded fact is not deleted, it is marked as valid until the moment it was replaced. When the history of a fact matters, that wins. It is the subject of Chapter 10.
Where do you reach for this? Three patterns recur. Cross-session personalization: the assistant remembers a user's preferences, tone, and past decisions from one visit to the next. Long-running assistants: an agent on a task for hours keeps durable notes outside the window so it does not lose the thread when old turns get evicted. Customer histories: a support agent recalls a specific account's prior tickets and context without re-reading the whole record every message.
With Claude
You can implement memory entirely on your side, the way the demo does: keep the store, run retrieval, and paste the selected facts into the prompt. The Anthropic SDK also offers a memory tool that lets the model drive instead. You declare it on the request,
# Illustrative: requires the anthropic SDK and an API key.
import anthropic
client = anthropic.Anthropic()
response = client.messages.create(
model="claude-opus-4-8",
max_tokens=1024,
tools=[{"type": "memory_20250818", "name": "memory"}],
messages=[{"role": "user", "content": "Remember that I prefer window seats."}],
)
and the model can then read and write files in a memory directory across calls. You supply
the storage backend (the tool tells you which file to read, create, or edit; where those
bytes live is your code), which keeps you in control of where durable user data is kept. This
is the Letta idea in miniature: the model manages its own notes rather than your retrieval
code deciding for it. Consult the claude-api skill before wiring it up; the tool name and
backend interface are exact.
Using the real tool: commands and before/after proof
The 150-line store above shows the mechanics. Here is how you reach for the production version. Mem0 is the closest match to what we built, so it is the one to run first.
It is a Python package called mem0ai (the import name is mem0). Install it, create a
Memory, add a couple of facts under a user_id, then search to get back only the relevant
ones. This is follow-along: the library and an embedding key are not on this box, so the
output is labeled as expected, not measured here.
pip install mem0ai
# Follow-along: requires mem0ai and an embedding/LLM key in the environment.
from mem0 import Memory
m = Memory() # default config: an LLM extracts facts, embeds them, stores them in a vector DB
# Session 1: hand Mem0 raw conversation turns. It extracts the atomic facts itself.
m.add(
[
{"role": "user", "content": "Hi, I'm Dana and I live in Lisbon."},
{"role": "user", "content": "I prefer window seats and I am vegetarian."},
],
user_id="dana",
)
# Session 2 (a later call, even a later day): retrieve only what this question needs.
hits = m.search(query="Which seat do I prefer?", filters={"user_id": "dana"}, top_k=3)
for h in hits["results"]:
print(round(h["score"], 3), h["memory"])
0.41 Prefers window seats
0.22 Lives in Lisbon
0.19 Name is Dana
That is the same four-operation shape as the demo, with each piece swapped for a stronger
one. add() does the extraction: you pass whole turns and Mem0's LLM splits them into atomic
facts (Prefers window seats, not the run-on sentence). It embeds and stores them in a vector
database keyed by user_id, and it runs the invalidation step, deciding whether a new fact
adds to or replaces an old one. search() does retrieval: it scores the stored facts against
the query and returns the top few. The filters={"user_id": "dana"} argument is the scoping
key, so one store can hold many users without their facts leaking into each other.
Before and after: the metric is injected-context tokens
The claim worth proving is that retrieval keeps the window small. The metric is the number of injected-context tokens: how many tokens of prior knowledge you paste into the prompt to answer one cross-session question. Count it the way you would in production, with Anthropic's token counter, which returns the exact input-token count for a set of messages.
# Follow-along: requires the anthropic SDK and an API key.
import anthropic
client = anthropic.Anthropic()
question = "Which seat do I prefer?"
# (a) Dump the WHOLE prior history into the prompt.
full_history = "\n".join(prior_turns) # every message from every past session
before = client.messages.count_tokens(
model="claude-opus-4-8",
messages=[{"role": "user", "content": full_history + "\n\n" + question}],
).input_tokens
# (b) Inject only the top-k facts Mem0.search returned.
memories = "\n".join(h["memory"] for h in hits["results"])
after = client.messages.count_tokens(
model="claude-opus-4-8",
messages=[{"role": "user", "content": memories + "\n\n" + question}],
).input_tokens
print(before, "->", after)
3000 -> 60
Those two numbers are illustrative (expected for a relationship with a few dozen past turns, not measured on this box), but the gap is the whole point. Dumping the full history costs roughly 3000 tokens and climbs every session; injecting the three facts Mem0 retrieved costs about 60 and stays flat, because it is always "the few facts this question needs."
The correctness half of the proof is sharper than the token count. Without memory, the
cross-session question is simply unanswerable: a fresh call has never seen "Dana" or "window
seats," so the best the model can do is ask the user to repeat themselves. With memory, the
one fact that matters (Prefers window seats) is in the window, and the model answers
"I'll book you a window seat." That is the same before/after you can run on this box right now:
the from-scratch store demo above prints exactly that flip (score 0.577 prefers window seats, then the booked-seat answer), which is the on-box proof that the Mem0 numbers stand in
for.
The other two libraries install the same way. Letta (the MemGPT project) ships the server as
pip install letta and the client SDK as pip install letta-client; you talk to a running
agent that pages its own memory tiers. Zep is pip install zep-cloud, a hosted temporal graph
you write conversations to and query for the facts that were true at a given time.
In Claude Code / Anthropic
Two more pieces sit on the Anthropic side. The memory tool lets the model do the reading and writing instead of your retrieval code. You declare it on the request,
tools=[{"type": "memory_20250818", "name": "memory"}]
and the model issues read and write commands against a memory directory that you back with storage (the tool says which file to view, create, or edit; where those bytes live is your code). That is the Letta idea in miniature, the model managing its own notes.
A project CLAUDE.md is the other kind: a hand-maintained, long-lived memory. It is not
retrieved per question; it is re-injected in full at the start of every session, so it is the
place for the stable facts about a project that every turn should know. Retrieval keeps the
per-question slice small; a CLAUDE.md keeps the always-on baseline correct.
Takeaways
- The model is stateless between calls, so durable memory is something your code stores outside the window and re-injects a relevant slice of. The store is not the window.
- A memory system is four operations: extract atomic facts from a turn, embed and store them, retrieve the top-k relevant ones for a new query, and invalidate the stale fact when a new one contradicts it.
- Retrieval is what keeps the window small. Injecting the few relevant facts (16 tokens in the demo) instead of the whole history (47, and climbing every turn) is a real and growing saving, because top-k scales with the question, not the conversation length.
- Invalidation needs a key (the subject and attribute of a fact, minus its value) so a new value replaces the old one in place instead of sitting beside it. Without it, retrieval returns contradictory facts at random.
- Mem0 is the vector-first layer you bolt on; Letta gives the agent an OS-style memory hierarchy it pages itself; Zep stores a temporal graph for when "what was true then" matters. Anthropic's memory tool lets the model read and write a directory you back with storage.
👉 The store we built throws away the past: update the favorite color and blue is gone. But plenty of questions are about the past, and answering "what was the budget last quarter" needs a memory that records when each fact was true. The next chapter builds exactly that: a temporal knowledge graph, where facts carry validity windows and nothing is ever silently overwritten.
Temporal knowledge graphs
Chapter 9 gave the agent a memory: a store of facts outside the window that it retrieves the relevant slice of and re-injects each turn. That store answers "what do we know about Acme?" by returning the most similar fact. It cannot answer "who was Acme's CTO in Q1?", because a plain fact has no notion of when it was true. A vector memory returns the closest match; it has no clock.
This chapter adds the clock. We attach a validity interval to every fact, so the store records not just what is true but for which span of time it was true. That one addition turns "the CTO is Mei" (which silently goes stale the moment Mei leaves) into "Mei was CTO from Q4 onward, Ravi before that, Dana before that," a record that stays correct as the world changes.
A fact is a triple, and now it carries time
A triple is the smallest unit of a knowledge graph: three parts, written
(subject, predicate, object).
- The subject is the thing the fact is about:
Acme. - The predicate is the relation or attribute:
CTO. - The object is the value:
Dana.
Read it left to right: "Acme's CTO is Dana." Triples are the edges of a graph (subject and object are nodes, the predicate is the labeled arrow between them), which is why a store of them is a knowledge graph. So far this is just a fact, the same thing Chapter 9 stored as text.
The temporal part is one more piece: a validity interval, the span of time over which the fact held. We write it as a half-open interval
$$[t_0, t_1)$$
meaning the fact is true from $t_0$ inclusive up to $t_1$ exclusive. The square bracket includes the endpoint; the round bracket excludes it. Half-open is the right choice because it makes successive facts tile time with no gap and no overlap: the instant one fact ends is exactly the instant the next begins. If Dana is CTO over $[Q1, Q2)$ and Ravi over $[Q2, Q4)$, then at $Q2$ there is exactly one answer (Ravi), not zero and not two. When $t_1$ is unknown because the fact is still true, we leave it open ($t_1 = \text{None}$), which means "true from $t_0$ onward, no known end."
Don't be confused. A fact's value and a fact's validity are two different things, and the whole chapter rests on keeping them apart. The value is the object:
Dana,Ravi,Mei. The validity is the interval the value held:[Q1, Q2). A plain key-value store keeps only the current value and throws the interval away, which is exactly why it cannot answer questions about the past. The temporal store keeps both, so the valueDanadoes not vanish when it stops being current; it just gets an end stamped on its interval.
Two clocks: valid time and ingestion time
There are two different times you might mean when you say "when." Keeping both is what makes a store bi-temporal (two clocks).
- Valid time is when the fact was true in the world. Mei became CTO at the start of Q4; her valid time starts at Q4.
- Ingestion time (also called transaction time) is when your system learned the fact. You might not record Mei's appointment until a week later, or discover a back-dated change during an audit months on.
These two clocks come apart constantly. A price changed on Monday but your pipeline ingested
it on Wednesday. A role changed in March but you only heard about it in May. With one clock
you can answer "what was true at time $T$." With both clocks you can also answer "what did
we believe at time $T$," which is the question every audit and every "why did the agent do
that?" investigation actually asks. The valid-time interval lives on the edge as
[valid_from, valid_to); the ingestion time is a separate stamp, ingested_on.
Supersession: close, do not delete
Here is the move that makes the whole thing work. When a new fact replaces an old one about
the same (subject, predicate), for example the CTO changes from Dana to Ravi, the naive
instinct is to overwrite: point (Acme, CTO) at Ravi and move on. That destroys history.
Instead we supersede. We close the old edge by setting its valid_to to the change
time, and we open a new edge starting at that same time. Nothing is deleted. Dana's edge
becomes [Q1, Q2) (closed), Ravi's edge becomes [Q2, None) (open, current). Because the
new fact's start equals the old fact's end and the intervals are half-open, the timeline
stays gapless and overlap-free. Run this forward through several changes and you get a
complete succession, every past value still queryable.
The demo: a bi-temporal triple store from scratch
The code below builds two stores from the same stream of events so the contrast is fair: a
naive KeyValueStore (a dict from key to current value, overwrite on every change) and a
TemporalKG (an append-only list of edges, close-and-open on every change). It feeds Acme's
year through both, prints the timeline the temporal store preserved, and then asks the
question that separates them: "who was the CTO in Q1?" Everything is standard library, no
imports beyond dataclasses, datetime, and typing.
"""A bi-temporal triple store from scratch: track WHEN a fact was true, not
just what it currently is.
Chapter 9 stored facts as text and retrieved the most SIMILAR one. That answers
"what do we know about X" but has no notion of WHEN a fact held. This chapter
adds time. A fact here is a TRIPLE plus a validity window:
(subject, predicate, object) valid over the half-open interval [t0, t1)
- subject: the thing the fact is about ("Acme")
- predicate: the relation / attribute ("CTO")
- object: the value ("Dana")
- [t0, t1): the fact is true from t0 up to but NOT including t1.
t1 = None means "still true, no known end" (an open interval).
Two clocks, hence "bi-temporal":
- valid time: when the fact was true IN THE WORLD (the CTO started in Q1).
- ingestion time: when OUR SYSTEM learned it (we recorded it on some date).
These differ all the time: a role may have changed in March but we only hear
about it in May. Tracking both lets us answer "what was true at T" AND "what did
we BELIEVE at T", which is what audits need.
The key move: when a new fact SUPERSEDES an old one about the same
(subject, predicate), we do NOT overwrite. We CLOSE the old edge (set its valid_to
to the change time) and OPEN a new edge. The old value is still there, just no
longer current. History is preserved, so point-in-time queries stay correct.
We prove it by contrast with a naive key-value store that keeps only the latest
value: it answers "who is the CTO NOW" fine but gives the WRONG answer to
"who was the CTO in Q1", because it overwrote and forgot.
Standard library only. Run: python3 temporal_kg.py
"""
from dataclasses import dataclass
from datetime import date
from typing import Optional
# --- A point in time -----------------------------------------------------------
# We use plain dates so the demo reads like a calendar. Quarters of 2024:
Q1 = date(2024, 1, 1)
Q2 = date(2024, 4, 1)
Q3 = date(2024, 7, 1)
Q4 = date(2024, 10, 1)
# A far-future sentinel only used for printing an open interval as a width.
OPEN = "now"
# --- 1. The naive baseline: a plain key-value store ---------------------------
class KeyValueStore:
"""The thing most systems actually do: a dict from key to CURRENT value.
Writing a new value OVERWRITES the old one. There is exactly one slot per
key, so the moment a fact changes, the previous value is gone. This is fine
for "what is true now" and silently wrong for any question about the past.
"""
def __init__(self):
self.data = {} # (subject, predicate) -> object
def set(self, subject, predicate, obj):
self.data[(subject, predicate)] = obj # clobbers whatever was there
def get(self, subject, predicate):
return self.data.get((subject, predicate))
# --- 2. The temporal edge ------------------------------------------------------
@dataclass
class Edge:
"""One fact with its validity window and the time we learned it.
valid_from / valid_to are the VALID-TIME interval [valid_from, valid_to):
when the fact held in the world. valid_to = None means open (still true).
ingested_on is the INGESTION-TIME stamp: when this row entered our store.
"""
subject: str
predicate: str
obj: str
valid_from: date
valid_to: Optional[date] # None = open interval, still valid
ingested_on: date
def is_open(self):
return self.valid_to is None
def valid_at(self, t):
"""Is this edge true at instant t? Half-open: t in [valid_from, valid_to).
Half-open means valid_from is included and valid_to is excluded, so the
instant a fact ends is exactly the instant its successor begins, with no
overlap and no gap. An open edge (valid_to is None) is true for every t
at or after valid_from.
"""
if t < self.valid_from:
return False
if self.valid_to is None:
return True
return t < self.valid_to
# --- 3. The bi-temporal triple store ------------------------------------------
class TemporalKG:
"""An append-only list of edges. Supersession closes the old edge instead of
deleting it, so every past value survives for point-in-time queries."""
def __init__(self):
self.edges = [] # all edges ever, in insertion order
def add(self, subject, predicate, obj, valid_from, ingested_on=None):
"""Record that (subject, predicate) became `obj` at valid_from.
If an OPEN edge already exists for this (subject, predicate), it is
superseded: we CLOSE it by setting its valid_to to valid_from (the new
fact's start = the old fact's end, no overlap), then append the new open
edge. We never mutate the object of an existing edge and never delete.
"""
if ingested_on is None:
ingested_on = valid_from # default: learned it when it happened
# Close any currently-open edge for this same key.
for e in self.edges:
if (e.subject == subject and e.predicate == predicate
and e.is_open()):
e.valid_to = valid_from # CLOSE, do not delete
self.edges.append(Edge(subject, predicate, obj,
valid_from, None, ingested_on))
def current(self, subject, predicate):
"""What is true NOW: the one open edge for this key (valid_to is None)."""
for e in self.edges:
if (e.subject == subject and e.predicate == predicate
and e.is_open()):
return e.obj
return None
def as_of(self, subject, predicate, t):
"""Point-in-time query: what was true at instant t.
A "point-in-time query" asks the store to rewind: given a past instant t,
return the value whose validity window contained t. We scan for the edge
whose [valid_from, valid_to) interval covers t.
"""
for e in self.edges:
if (e.subject == subject and e.predicate == predicate
and e.valid_at(t)):
return e.obj
return None
def history(self, subject, predicate):
"""All edges for one key, oldest first, for printing the timeline."""
rows = [e for e in self.edges
if e.subject == subject and e.predicate == predicate]
return sorted(rows, key=lambda e: e.valid_from)
# --- Helpers for pretty timelines ---------------------------------------------
QNAME = {Q1: "Q1", Q2: "Q2", Q3: "Q3", Q4: "Q4"}
def qname(d):
return QNAME.get(d, str(d))
def interval_str(e):
end = qname(e.valid_to) if e.valid_to is not None else OPEN
return f"[{qname(e.valid_from)}, {end})"
# --- The scenario --------------------------------------------------------------
# A small company "Acme" whose facts change over the year. We feed events in the
# order they happened, each one superseding the previous value for that key.
print("=== Scenario: Acme's facts change over 2024 ===")
print("Events, in the order they occurred:\n")
events = [
# (subject, predicate, object, when it became true)
("Acme", "CTO", "Dana", Q1), # Dana is CTO from Q1
("Acme", "plan", "Free", Q1), # on the Free plan from Q1
("Acme", "CTO", "Ravi", Q2), # Ravi replaces Dana in Q2 (supersession 1)
("Acme", "plan", "Pro", Q3), # upgrades to Pro in Q3 (supersession 2)
("Acme", "CTO", "Mei", Q4), # Mei replaces Ravi in Q4 (supersession 3)
]
# Build BOTH stores from the same events so the comparison is apples to apples.
kv = KeyValueStore()
kg = TemporalKG()
for subject, predicate, obj, when in events:
print(f" {qname(when)}: {subject}.{predicate} := {obj}")
kv.set(subject, predicate, obj) # naive: overwrite
kg.add(subject, predicate, obj, when) # temporal: close + open
print()
print("The key-value store kept only the LAST write for each key.")
print("The temporal KG kept every edge, closing each as the next one opened.\n")
# --- The timeline the temporal KG preserved -----------------------------------
print("=== Timeline preserved by the temporal KG ===")
for key in [("Acme", "CTO"), ("Acme", "plan")]:
subject, predicate = key
print(f" {subject}.{predicate}:")
for e in kg.history(subject, predicate):
status = "current" if e.is_open() else "closed"
print(f" {interval_str(e):>14} = {e.obj:<5} ({status})")
print()
# --- BEFORE vs AFTER: who was the CTO in Q1? ----------------------------------
print("=== Query A: who is the CTO NOW? (both stores should agree) ===")
print(f" key-value store : {kv.get('Acme', 'CTO')}")
print(f" temporal KG : {kg.current('Acme', 'CTO')}")
print(" -> agree: the current value is the easy case.\n")
print("=== Query B: who was the CTO in Q1? (point-in-time) ===")
print(f" key-value store : {kv.get('Acme', 'CTO')} <- WRONG")
print(" it only stores the latest value; it overwrote Dana and Ravi and")
print(" cannot answer about the past. It returns today's CTO for every date.")
print(f" temporal KG : {kg.as_of('Acme', 'CTO', Q1)} <- correct")
print(" it scans for the edge whose [valid_from, valid_to) interval covers Q1.\n")
# --- A full point-in-time sweep -----------------------------------------------
print("=== Query C: walk the CTO forward, quarter by quarter ===")
for t in [Q1, Q2, Q3, Q4]:
kv_ans = kv.get("Acme", "CTO") # same wrong answer every time
kg_ans = kg.as_of("Acme", "CTO", t) # the right answer for each date
print(f" as of {qname(t)}: key-value={kv_ans:<5} temporal-KG={kg_ans}")
print(" the key-value column is frozen at the latest value; the temporal-KG")
print(" column tracks the real succession Dana -> Ravi -> Ravi -> Mei.\n")
# --- The bi-temporal twist: knew-late ------------------------------------------
# Now the second clock earns its keep. Suppose Acme's plan actually changed back
# to Free at the start of Q4, but we did not LEARN this until a later audit on
# 2024-12-15. Valid time and ingestion time diverge.
print("=== Query D: bi-temporal, valid time vs ingestion time ===")
kg.add("Acme", "plan", "Free", Q4, ingested_on=date(2024, 12, 15))
plan_edges = kg.history("Acme", "plan")
print(" plan timeline (with ingestion stamps):")
for e in plan_edges:
print(f" {interval_str(e):>14} = {e.obj:<5} learned_on {e.ingested_on}")
print()
print(f" valid-time answer 'what WAS the plan at Q4?': "
f"{kg.as_of('Acme', 'plan', Q4)}")
print(" (Free: the change took effect at Q4 in the world.)")
print(" ingestion-time fact 'when did we LEARN the Q4 plan?': 2024-12-15")
print(" (the edge was valid from Q4 but only entered the store in December.)")
print(" An audit needs both: what was true, and when we knew it.")
Running it:
=== Scenario: Acme's facts change over 2024 ===
Events, in the order they occurred:
Q1: Acme.CTO := Dana
Q1: Acme.plan := Free
Q2: Acme.CTO := Ravi
Q3: Acme.plan := Pro
Q4: Acme.CTO := Mei
The key-value store kept only the LAST write for each key.
The temporal KG kept every edge, closing each as the next one opened.
=== Timeline preserved by the temporal KG ===
Acme.CTO:
[Q1, Q2) = Dana (closed)
[Q2, Q4) = Ravi (closed)
[Q4, now) = Mei (current)
Acme.plan:
[Q1, Q3) = Free (closed)
[Q3, now) = Pro (current)
=== Query A: who is the CTO NOW? (both stores should agree) ===
key-value store : Mei
temporal KG : Mei
-> agree: the current value is the easy case.
=== Query B: who was the CTO in Q1? (point-in-time) ===
key-value store : Mei <- WRONG
it only stores the latest value; it overwrote Dana and Ravi and
cannot answer about the past. It returns today's CTO for every date.
temporal KG : Dana <- correct
it scans for the edge whose [valid_from, valid_to) interval covers Q1.
=== Query C: walk the CTO forward, quarter by quarter ===
as of Q1: key-value=Mei temporal-KG=Dana
as of Q2: key-value=Mei temporal-KG=Ravi
as of Q3: key-value=Mei temporal-KG=Ravi
as of Q4: key-value=Mei temporal-KG=Mei
the key-value column is frozen at the latest value; the temporal-KG
column tracks the real succession Dana -> Ravi -> Ravi -> Mei.
=== Query D: bi-temporal, valid time vs ingestion time ===
plan timeline (with ingestion stamps):
[Q1, Q3) = Free learned_on 2024-01-01
[Q3, Q4) = Pro learned_on 2024-07-01
[Q4, now) = Free learned_on 2024-12-15
valid-time answer 'what WAS the plan at Q4?': Free
(Free: the change took effect at Q4 in the world.)
ingestion-time fact 'when did we LEARN the Q4 plan?': 2024-12-15
(the edge was valid from Q4 but only entered the store in December.)
An audit needs both: what was true, and when we knew it.
Read the contrast in Query C, because it is the entire point. The key-value column prints
Mei for every quarter, including Q1 and Q2 when Mei had nothing to do with Acme. It is not
lying on purpose; it simply has one slot per key and Mei was the last write, so that is all
it can ever return. The temporal-KG column tracks the real succession, because each value
kept its interval and the point-in-time query scans for the edge whose
[valid_from, valid_to) window contains the date you asked about.
A point-in-time query is just that rewind: you hand the store a past instant and it returns the value that was current then, not now. Query A shows that for the "now" case both stores agree, which is the trap. The naive store looks correct as long as you only ever ask about the present, and it stays wrong and silent the moment anyone asks about the past.
Query D is the bi-temporal payoff. We record a Q4 plan change that the system did not learn
about until December 15. The valid-time interval starts at Q4 (when the change took effect)
while learned_on is December 15 (when we ingested it). Asking "what was the plan at Q4?"
returns Free by valid time; asking "when did we know?" returns December 15 by ingestion
time. One clock could not have told both stories.
How this connects to context and to the real tools
In context-engineering terms, a temporal KG is a kind of memory (Chapter 9) that you query for a time-scoped slice before assembling the turn. Instead of injecting "Acme's CTO is Mei" (true now, wrong for any historical question), you inject the value valid at the time the task is about, or the whole short timeline if the task reasons across the change. The win is point-in-time correctness: the agent stops confidently answering past questions with present facts.
This is a real and active design, not a toy. Graphiti is an open-source temporal knowledge-graph engine that ingests episodes (chunks of conversation or documents), extracts triples, and maintains bi-temporal validity on the edges, closing old edges and opening new ones exactly as the demo does. Zep builds agent memory on top of Graphiti and tracks fact-validity windows so an agent can reason about facts that change over a long-running relationship. On temporal-reasoning benchmarks (the kind that ask "what was true at time $T$" rather than "what is true"), this approach is reported to outperform vector-only memory, which is the expected result: a vector store retrieves the most similar fact and has no representation of when, so it answers "who is the CTO" and "who was the CTO in Q1" with the same nearest neighbor. The use cases are wherever facts have a lifetime: prices, plan tiers, roles, account statuses, feature flags, and any "what did we know, and when did we know it" audit.
Using the real tool: commands and before/after proof
The from-scratch store above is the whole idea in 100 lines. The production tool that does the same thing, plus the entity extraction we hand-waved past, is Graphiti. You give it plain text (an "episode") and it pulls out the triples for you, stamps each edge with a validity interval, and closes the old edge when a new fact supersedes it, which is exactly the close-and-open move from the demo. Below are the real commands.
Graphiti is not self-contained the way our demo is. It needs two things you have to provide:
a graph database to store the nodes and edges (Neo4j 5.26 or newer is the default; it
also supports FalkorDB and Amazon Neptune), and an LLM to do the extraction, because
turning a sentence into triples is itself a model call (it defaults to OpenAI and reads
OPENAI_API_KEY; Anthropic, Gemini, and Groq are also supported). State that honestly: this
is heavier than the dict-of-edges we built. The payoff is that it extracts facts from raw
prose instead of making you hand-write every triple.
# The library (Python 3.10+).
pip install graphiti-core
# Needs a graph DB backend. Easiest is Neo4j in Docker:
docker run -d --name neo4j -p 7687:7687 -p 7474:7474 \
-e NEO4J_AUTH=neo4j/password neo4j:5.26
# Needs an LLM key for extraction (OpenAI is the default backend):
export OPENAI_API_KEY=sk-...
A minimal real session ingests two facts that disagree across time, then asks a point-in-time
question. The methods are async, so they run inside asyncio. The names below are the real
ones: add_episode to ingest a chunk of text, search to query.
import asyncio
from datetime import datetime, timezone
from graphiti_core import Graphiti
from graphiti_core.nodes import EpisodeType
async def main():
# 1. Connect to the graph DB. Three args: URI, user, password.
graphiti = Graphiti("bolt://localhost:7687", "neo4j", "password")
await graphiti.build_indices_and_constraints() # one-time setup
# 2. Ingest two episodes whose CTO fact changes between Q1 and Q4.
# reference_time is the VALID time: when the fact was true in the world.
await graphiti.add_episode(
name="q1_update",
episode_body="In Q1, Acme's CTO is Dana.",
source=EpisodeType.text,
source_description="status update",
reference_time=datetime(2024, 1, 1, tzinfo=timezone.utc),
)
await graphiti.add_episode(
name="q4_update",
episode_body="As of Q4, Acme's CTO is Mei.",
source=EpisodeType.text,
source_description="status update",
reference_time=datetime(2024, 10, 1, tzinfo=timezone.utc),
)
# 3. Query. Each result edge carries .fact and a .valid_at interval,
# so you can keep the edge whose validity covers the date you asked about.
results = await graphiti.search("Who was Acme's CTO in Q1?")
for edge in results:
print(edge.fact, "| valid_at:", edge.valid_at)
await graphiti.close()
asyncio.run(main())
This snippet is follow-along: the graphiti-core library, the Neo4j backend, and the LLM
key are not installed on this box, so we are not pasting measured output for it. The pieces
are the real ones. add_episode is how text goes in, reference_time is the valid-time
stamp, and search returns edges that carry their own validity windows.
Before and after: proving point-in-time correctness
The metric to prove is point-in-time correctness: does the store return the value that was true at the date you asked about, or does it return whatever it saw last? Here is the recipe, and it is the same scenario the demo already ran on this box.
- Ingest a fact that changes. "Acme's CTO is Dana (Q1)," then later "Acme's CTO is Mei (Q4)." Two episodes, one valid in Q1 and one valid from Q4.
- Ask a past question. "Who was the CTO in Q1?"
- Compare two memories.
A flat vector memory (Chapter 9) stores both sentences as embeddings and returns the most
similar one. "Who was the CTO in Q1?" is close to both, and since it has no clock it
typically surfaces the most recent or highest-scoring fact, which is Mei. That is the wrong
answer for Q1. The temporal graph instead keeps Dana's edge with its [Q1, Q4) validity and
Mei's edge with its [Q4, now) validity, and returns the one whose interval covers Q1.
Illustrative, expected (not measured here, since the library is not on this box):
question: "Who was Acme's CTO in Q1?"
flat vector memory -> "Mei" (WRONG: returns the most similar/recent fact, no clock)
temporal graph -> "Dana" (correct: returns the value whose validity covers Q1)
That before/after is exactly what the on-box demo measured for real. Look back at Query B
and Query C in the verified output above: the key-value store (a stand-in for any
memory that keeps one current value per fact) printed Mei for every quarter, while the
temporal KG walked the real succession Dana -> Ravi -> Ravi -> Mei. The illustrative
Graphiti result here and the measured demo result there are the same finding: keep the
interval, get the past right; drop it, and every historical question collapses onto the
latest value. (Zep, the managed product built on Graphiti, is pip install zep-cloud; it
wraps the same fact-validity windows behind a hosted memory API.)
In an agent
Temporal memory earns its keep the moment an assistant has to reason about facts that change: a product's price, a person's role, an account's status, a feature flag, an order's state. For those, the useful question is rarely "what is true now"; it is "what was true when this happened," and a similarity-only memory cannot answer it, because it stores what without when. Before assembling the turn, query the temporal store for the value valid at the relevant time (or the short timeline if the task spans the change) and inject that, so a question about Q1 gets Q1's answer instead of today's.
Takeaways
- A temporal fact is a triple
(subject, predicate, object)plus a validity interval $[t_0, t_1)$. The interval, not the value, is what lets you ask about the past. - A plain key-value store keeps only the current value, so it answers "now" correctly and any point-in-time question with the same wrong latest value, silently.
- Supersession means close, not delete: stamp the old edge's
valid_toand open a new edge. History is preserved and the timeline tiles without gaps or overlaps. - Bi-temporal means two clocks. Valid time is when a fact was true in the world; ingestion time is when your system learned it. Audits and "why did the agent do that" need both.
- For an agent, query the KG for the value valid at the relevant time before assembling the context, so it stops answering past questions with present facts. Graphiti and Zep do this in production.
👉 We now have memory that knows what was true and when. But the live conversation still grows without bound inside the window. The next chapter is about that growth directly: context-window management and compaction, where we summarize and prune the running history so a long session keeps fitting.
Context-window management and compaction
A chat that runs long enough will, eventually, not fit. Every turn re-sends the whole history (Chapter 1), the history only grows, and the window is fixed. So at some point your code has to drop something. Chapter 1 showed the dumbest version of that decision: drop the oldest turn. It keeps you under budget, but it is amnesia by design. The fact that mattered most might be the one you said first.
This chapter does better. Instead of throwing old turns away, we summarize them into a few tokens and keep that summary in the window. The raw turns move to cheap storage out of the window; the gist stays. This is called compaction, and it is what lets a multi-hour agent run or a long support chat stay under budget without forgetting the codename you agreed on in turn 2.
The vocabulary, defined
A handful of terms get used loosely in this area, so here they are pinned down. All of them describe state your code maintains, since the model itself is stateless (Chapter 1).
- Working set: the recent turns you keep verbatim and send to the model on this call. These are in-window: they cost tokens and they fit the budget.
- Archive: older raw turns you keep somewhere cheap (a database, a file) but do not send. Out-of-window. They cost storage, not tokens, and the model never sees them unless you pull one back.
- Eviction: the act of removing a turn from the working set. "Drop the oldest" evicts to the trash. Compaction evicts to the archive, but only after extracting the gist.
- Compaction: replacing a batch of old turns with a short summary of them, so the summary stays in-window while the raw turns leave it.
- Tiered memory: the whole arrangement, a small fast in-window tier (working set plus summary) backed by a large slow out-of-window tier (the archive). The system pages between them the way an operating system pages between RAM and disk.
Don't be confused. Deleting context and summarizing it are different operations with different costs. Deleting (also called context editing) removes blocks outright: you might clear out old tool results that no longer matter, and those bytes are simply gone. Summarizing (compaction) replaces a block with a shorter version that keeps the gist: the raw bytes leave the window but the meaning stays, in fewer tokens. Delete when the content is genuinely spent (a stale tool output nobody will reference again). Summarize when the content is old but its conclusions still matter (the early turns where you settled the requirements). Picking the wrong one is how you either bloat the window with dead tool dumps or forget a decision you still need.
The mechanism
The plan is a loop. Keep appending turns to the working set. After each turn, check whether the in-window total (the summary plus the working set) would blow the budget. While it would, take the oldest working-set turn, fold its key facts into a running summary, and move the raw turn to the archive. Stop when you fit again, or when the working set is down to a small floor of the most recent turns that you always keep verbatim.
Two design choices make this work and not silently lose things:
- The summarizer keeps facts, not prose. A real system asks the model to write the summary. To keep this demo reproducible (and runnable on a box with no API key, see the provider section below), we use a deterministic stub that pulls out the sentences that look load-bearing: a stated decision (a codename, a deadline, a chosen option) or a sentence that names an entity. Everything chatty is dropped. The exact rule does not matter; the point is that the summary is much smaller than the turns it replaces.
- The summary itself is bounded. If you only ever append to the summary, it grows without limit and eventually breaks the budget on its own. So the summary is capped: when it is full, the lowest-priority facts fall off first. Decisions outrank passing mentions, so the codename (a decision) is never the fact that gets dropped to make room for small talk.
Here is the whole thing, with a head-to-head against Chapter 1's drop-oldest policy built in. We plant an important fact early ("the project is codenamed Atlas" in turn 2), run a realistic 19-turn conversation that piles up tokens, and then, on the last turn, ask for the codename. The question is answerable only if "Atlas" still appears in what we would send.
"""Context-window management by COMPACTION (summarize, don't just drop).
A long conversation has to keep fitting a fixed window, but you don't want to
lose the facts that were said early on. This script builds a TIERED context
manager that keeps a token budget while preserving the gist of evicted turns:
- WORKING SET: recent turns kept verbatim, in-window, sent to the model.
- ARCHIVE: older raw turns kept out-of-window (cheap storage, not sent).
- SUMMARY: a compact running note that captures the key facts (decisions,
named entities) from turns that have been pushed out of the
working set, so they survive even though the raw text doesn't.
When the in-window total would exceed the budget, we COMPACT: take the oldest
working-set turns, extract their key facts into the running summary, move the
raw turns to the archive, and keep the summary plus the most recent turns.
We compare this against chapter 1's policy ("drop the oldest") by planting an
important fact early and asking about it late. Drop-oldest loses the fact;
compaction keeps it in the summary, under the same budget.
Standard library only. Run: python3 compaction.py
"""
import re
# Same crude estimate as chapters 1 and 2: ~1.3 tokens per English word.
# Real billing-grade counts come from the provider's count_tokens endpoint.
def est_tokens(text):
return round(len(text.split()) * 1.3)
# --- The stub summarizer ----------------------------------------------------
# A real system asks the model to summarize. Here we use a deterministic stub
# so the output is reproducible and the mechanism is visible: we keep the lines
# that look load-bearing (a stated decision, or a sentence naming an entity)
# and drop the chatty rest. This is a stand-in for "the model wrote a summary",
# not a serious summarizer.
DECISION_RE = re.compile(
r"\b(codenamed|named|is called|decided|chose|will use|deadline|budget|"
r"must|the goal is|launch|ships?|version)\b",
re.IGNORECASE,
)
# A capitalized word that is not the first word of the sentence: a crude proxy
# for a named entity (a project, person, product, or place).
ENTITY_RE = re.compile(r"(?<!^)(?<![.!?]\s)\b([A-Z][a-zA-Z0-9]{2,})\b")
def key_facts(turn_text):
"""Pull the fact-bearing sentences out of one turn's text, each tagged with
a priority: 2 if it states a decision (codename, deadline, choice), 1 if it
only names an entity. Higher-priority facts are kept first when the summary
itself has to be trimmed, so a decision like the codename outranks chatter."""
facts = []
for sentence in re.split(r"(?<=[.!?])\s+", turn_text.strip()):
s = sentence.strip()
if not s:
continue
if DECISION_RE.search(s):
facts.append((2, s))
elif ENTITY_RE.search(s):
facts.append((1, s))
return facts
def summarize(turns, prior_summary, max_facts=4):
"""Fold a batch of turns into the running summary, de-duplicating, then
keep only the top `max_facts` by priority so the summary stays bounded.
Ties break toward the EARLIER fact, so a fact stated once early survives."""
facts = list(prior_summary)
seen = {s for _, s in facts}
for t in turns:
for prio, s in key_facts(t["text"]):
if s not in seen:
facts.append((prio, s))
seen.add(s)
# Stable sort by descending priority; Python's sort is stable, so within a
# priority the original (chronological) order is preserved.
facts.sort(key=lambda pf: -pf[0])
return facts[:max_facts]
def render_summary(facts):
if not facts:
return ""
lines = [s for _, s in facts]
return "Summary of earlier conversation:\n- " + "\n- ".join(lines)
# --- The tiered context manager --------------------------------------------
class TieredContext:
"""Keeps the in-window context under `budget` tokens by compacting the
oldest working-set turns into a running summary."""
def __init__(self, budget, keep_recent=2):
self.budget = budget # token ceiling for what we SEND
self.keep_recent = keep_recent # min recent turns to keep verbatim
self.working = [] # recent raw turns (in-window)
self.archive = [] # older raw turns (out-of-window)
self.summary = [] # running list of key facts (in-window)
self.compactions = 0
def in_window_tokens(self):
used = est_tokens(render_summary(self.summary))
used += sum(est_tokens(t["text"]) for t in self.working)
return used
def add(self, role, text):
self.working.append({"role": role, "text": text})
self._compact_if_needed()
def _compact_if_needed(self):
# While we're over budget and still have turns we're allowed to evict,
# fold the oldest working-set turn into the summary and archive it.
while self.in_window_tokens() > self.budget and \
len(self.working) > self.keep_recent:
oldest = self.working.pop(0)
self.summary = summarize([oldest], self.summary)
self.archive.append(oldest)
self.compactions += 1
# --- Chapter 1's policy, for comparison ------------------------------------
class DropOldest:
"""Naive baseline: keep recent raw turns under budget by discarding the
oldest. No summary, so anything dropped is gone for good."""
def __init__(self, budget):
self.budget = budget
self.working = []
self.dropped = 0
def in_window_tokens(self):
return sum(est_tokens(t["text"]) for t in self.working)
def add(self, role, text):
self.working.append({"role": role, "text": text})
while self.in_window_tokens() > self.budget and len(self.working) > 1:
self.working.pop(0)
self.dropped += 1
# --- Answerability check ----------------------------------------------------
# Late in the chat we ask "what is the project codename?". The answer is
# available only if "Atlas" still appears somewhere in what we'd send: the
# summary or the working-set turns.
def can_answer_codename(visible_text):
return "Atlas" in visible_text
def visible_text_tiered(ctx):
parts = [render_summary(ctx.summary)] + [t["text"] for t in ctx.working]
return "\n".join(parts)
def visible_text_dropoldest(ctx):
return "\n".join(t["text"] for t in ctx.working)
# --- A scripted ~18-turn conversation --------------------------------------
# Turn 2 plants the load-bearing fact. The rest is realistic filler that piles
# up tokens and pushes the early turns toward eviction.
SCRIPT = [
("user", "Hi, I'm kicking off the planning for our new internal search service."),
("assistant", "Great. For the record, the project is codenamed Atlas. I'll refer to it that way from here on."),
("user", "We need it to index about ten million documents to start."),
("assistant", "Ten million is fine for a first cut. We can shard the index across a few nodes and grow later."),
("user", "What embedding dimension should we use for the vectors?"),
("assistant", "Start at 768 dimensions. It balances recall against memory, and you can revisit it after measuring."),
("user", "How should we handle re-indexing when documents change?"),
("assistant", "Use an append-only log of changes and replay it nightly. That keeps the live index stable during the day."),
("user", "Our latency target is under 100 milliseconds at the 95th percentile."),
("assistant", "Achievable. Keep the hot shards in memory and cache the most frequent queries to hit that target."),
("user", "Should we expose a REST API or a gRPC one for the search endpoint?"),
("assistant", "Offer gRPC internally for speed and a thin REST gateway for external callers who want simplicity."),
("user", "What about access control on the documents?"),
("assistant", "Filter results by the caller's permission set at query time, and never index secrets into the vectors."),
("user", "How many engineers do you think this needs for the first quarter?"),
("assistant", "Three is a reasonable starting team: one on indexing, one on serving, one on the API and client work."),
("user", "When should we aim to ship the first internal preview?"),
("assistant", "Target the end of the quarter for an internal preview, then harden it before any wider rollout."),
("user", "Remind me, what is the project codename again? I need it for the ticket."),
]
def run(strategy_factory, visible_fn, label):
ctx = strategy_factory()
print(f"=== {label} ===")
print(f" budget: {ctx.budget} tokens, "
f"turns: {len(SCRIPT)}")
over = 0
peak = 0
for i, (role, text) in enumerate(SCRIPT, start=1):
ctx.add(role, text)
used = ctx.in_window_tokens()
peak = max(peak, used)
if used > ctx.budget:
over += 1
answerable = can_answer_codename(visible_fn(ctx))
print(f" peak in-window tokens: {peak} (budget {ctx.budget})")
print(f" turns over budget: {over} of {len(SCRIPT)}")
if hasattr(ctx, "compactions"):
print(f" compactions: {ctx.compactions}")
print(f" raw turns archived: {len(ctx.archive)}")
print(f" summary facts kept: {len(ctx.summary)}")
if hasattr(ctx, "dropped"):
print(f" raw turns dropped: {ctx.dropped}")
verdict = "ANSWERABLE" if answerable else "UNANSWERABLE (fact lost)"
print(f" late codename question: {verdict}")
print()
return ctx
BUDGET = 120
print("Planted fact: turn 2 says the project is codenamed Atlas.")
print(f"Late question (turn {len(SCRIPT)}): 'what is the project codename?'\n")
drop = run(lambda: DropOldest(BUDGET), visible_text_dropoldest, "Chapter 1 policy: drop the oldest")
tier = run(lambda: TieredContext(BUDGET), visible_text_tiered, "This chapter: tiered compaction")
print("What the compacting manager would actually SEND on the final turn")
print("(running summary, then the verbatim working set):")
print("-" * 68)
print(visible_text_tiered(tier))
print("-" * 68)
print()
print("Both stayed under the same", BUDGET, "token budget. Drop-oldest lost the")
print("codename when turn 2 fell out of the window; compaction kept it in the")
print("summary, so the late question is still answerable.")
Running it:
Planted fact: turn 2 says the project is codenamed Atlas.
Late question (turn 19): 'what is the project codename?'
=== Chapter 1 policy: drop the oldest ===
budget: 120 tokens, turns: 19
peak in-window tokens: 119 (budget 120)
turns over budget: 0 of 19
raw turns dropped: 13
late codename question: UNANSWERABLE (fact lost)
=== This chapter: tiered compaction ===
budget: 120 tokens, turns: 19
peak in-window tokens: 118 (budget 120)
turns over budget: 0 of 19
compactions: 17
raw turns archived: 17
summary facts kept: 4
late codename question: ANSWERABLE
What the compacting manager would actually SEND on the final turn
(running summary, then the verbatim working set):
--------------------------------------------------------------------
Summary of earlier conversation:
- For the record, the project is codenamed Atlas.
- When should we aim to ship the first internal preview?
- Should we expose a REST API or a gRPC one for the search endpoint?
- Offer gRPC internally for speed and a thin REST gateway for external callers who want simplicity.
Target the end of the quarter for an internal preview, then harden it before any wider rollout.
Remind me, what is the project codename again? I need it for the ticket.
--------------------------------------------------------------------
Both stayed under the same 120 token budget. Drop-oldest lost the
codename when turn 2 fell out of the window; compaction kept it in the
summary, so the late question is still answerable.
Read the two blocks side by side, because the comparison is the point. Both policies held the line on the same 120-token budget: drop-oldest peaked at 119 tokens, compaction at 118, and neither went over on any of the 19 turns. So on the budget metric they are a tie. The difference is what survived. Drop-oldest discarded 13 raw turns into nothing, and turn 2 was one of them, so by the final turn the codename is simply gone and the question cannot be answered. Compaction archived 17 raw turns but distilled them into a 4-fact running summary first, and "Atlas" is the top line of that summary, so the same question is answerable at the same cost.
Notice the summary in the final block sits at four facts and stays there. That is the bound doing its job: as new decisions arrive, the lowest-priority facts age out, but "the project is codenamed Atlas" is a decision, ranks high, and holds its place. Without that cap the summary would have grown every turn and eventually broken the budget itself, which is the trap a naive "just keep summarizing" loop falls into.
When you reach for this
Compaction earns its complexity on long-running, single-conversation work where the history genuinely outgrows the window:
- Multi-hour chats. Support sessions or pairing sessions that accumulate hundreds of turns. The early turns set context that still matters; you cannot afford to drop them and you cannot afford to keep them all verbatim.
- Tool-heavy agent loops. An agent that calls tools dozens of times fills the window with tool results fast. Here you often want both operations from the "Don't be confused" box: delete the spent tool outputs (context editing) and summarize the reasoning that produced them (compaction).
- Long autonomous runs. An agent working a task overnight will blow any fixed window without some form of paging. Compaction plus an archive is how it keeps going.
For state that has to outlive a single conversation (across sessions, across restarts), this in-conversation summary is not enough; you want the durable, queryable memory of Chapter 9 and Chapter 10. Compaction manages one conversation's window; agent memory manages knowledge across many.
How the real systems do it
The tiered model in the demo is the same one production systems use, with more machinery.
Letta (formerly MemGPT) makes the tiers explicit and lets the agent manage them itself. It splits memory into core (always in-window, the essential persona and facts), recall (the recent conversation, like our working set), and archival (everything older, like our archive). The agent self-pages: it calls tools to summarize recall memory into core when the window fills, and to search archival memory to pull an old fact back in when it needs one. The demo's compaction loop is the same idea with the policy hard-coded instead of model-driven.
Anthropic's API offers both operations as server-side features, so you do not have to hand-roll the loop. The follow-along below shows the exact calls (the build box has no API key, so the output is illustrative, not a verified block).
Compaction summarizes earlier context into a compaction block as you approach the window.
The one rule that bites people: you must append response.content back to your messages each
turn, because the compaction state lives in those returned blocks. Strip out just the text and
the summary is silently lost.
# Illustrative: requires the anthropic SDK and an API key.
import anthropic
client = anthropic.Anthropic()
messages = []
def chat(user_message):
messages.append({"role": "user", "content": user_message})
response = client.beta.messages.create(
betas=["compact-2026-01-12"], # the compaction beta header
model="claude-opus-4-8",
max_tokens=16000,
messages=messages,
context_management={"edits": [{"type": "compact_20260112"}]},
)
# Append the FULL content, not just the text: the compaction blocks in here
# are what the API uses to replace the compacted history next turn.
messages.append({"role": "assistant", "content": response.content})
return next(b.text for b in response.content if b.type == "text")
Context editing is the delete side of the "Don't be confused" box: it removes old blocks
rather than summarizing them. The clear_tool_uses strategy strips out stale tool results,
which is exactly right for a tool-heavy loop where the outputs are spent but the conversation
should keep its shape.
# Illustrative: requires the anthropic SDK and an API key.
response = client.beta.messages.create(
betas=["context-management-2025-06-27"], # the context-editing beta header
model="claude-opus-4-8",
max_tokens=16000,
messages=messages,
context_management={"edits": [{"type": "clear_tool_uses_20250919"}]},
tools=tools,
)
The two are complementary, and a long agent run often uses both: clear the tool results that
are done with (clear_tool_uses), and compact the reasoning and decisions that are old but
still load-bearing (compact_20260112). Note the different beta headers and edit types: the
clearing strategy is not the compaction one, and mixing them up is a common slip.
Using the real tool: commands and before/after proof
The "How the real systems do it" section showed the call shapes. This section runs a real chat loop with them and explains how you prove the compaction is actually working: you watch the input-token count instead of letting it climb forever.
Here is a long chat loop that turns server-side compaction on. The point to get right is the
last line of the loop: you append response.content back to messages, not just the text you
pulled out of it. The compaction state the API builds up lives inside those returned content
blocks. If you keep only the text and throw the rest away, the next request has no summary to
stand on and the API has to start the whole job over.
# Follow-along: needs the anthropic SDK and an API key (neither is on this box,
# so the token figures further down are labeled illustrative, not measured here).
import anthropic
client = anthropic.Anthropic()
messages = []
def chat(user_message):
messages.append({"role": "user", "content": user_message})
response = client.beta.messages.create(
betas=["compact-2026-01-12"], # turns on server-side compaction
model="claude-opus-4-8",
max_tokens=16000,
messages=messages,
context_management={"edits": [{"type": "compact_20260112"}]},
)
# Append the FULL content, not just the text. The compaction block the API
# may have written lives in here; strip it and the summary is lost next turn.
messages.append({"role": "assistant", "content": response.content})
print("input tokens this turn:", response.usage.input_tokens)
return next(b.text for b in response.content if b.type == "text")
# Run this dozens of times. As the history grows toward the window, the API
# starts replacing the older turns with a compaction block on its own.
for turn in range(40):
chat(f"... turn {turn} of a long working session ...")
The delete-side sibling is context editing, which clears spent blocks instead of summarizing
them. It uses a different beta header and a different edit type, so do not mix the two up: the
clearing strategy is clear_tool_uses_20250919, not the compaction one.
# Follow-along: same caveats. This clears stale tool results rather than summarizing.
response = client.beta.messages.create(
betas=["context-management-2025-06-27"], # the context-editing beta header
model="claude-opus-4-8",
max_tokens=16000,
messages=messages,
context_management={"edits": [{"type": "clear_tool_uses_20250919"}]},
tools=tools,
)
The before/after proof: bounded input tokens
The metric that tells you compaction is working is input tokens per call, and the test is
whether that number stays bounded as the conversation gets long. Every request re-sends the
whole history (Chapter 1), so on the naive path, where
you just keep appending raw turns, the input-token count on each call climbs in step with the
transcript and eventually slams into the window. With compaction on, once the history nears the
trigger the API folds the older turns into a compaction block, and the prompt you send on later
calls is that block plus the recent turns rather than the full transcript. So input_tokens
on a turn-30 call lands far below what the same turn would have cost if you had sent everything.
You read the result off two places in the response:
response.usage.input_tokens, the count printed in the loop above. On the naive path it rises every turn; with compaction on it rises, then drops back down at the turn where the API emits the summary, then rises again from that lower floor.- a compaction content block in
response.content, which is the summary the API wrote. Its presence is the direct signal that compaction fired on that turn (and it is exactly the block you must append back, per the loop above).
Concretely, with small illustrative numbers (expected shape, not measured on this box):
naive "send the whole transcript" with compaction on
turn 1 ~3k input tokens ~3k input tokens
turn 10 ~60k input tokens ~60k input tokens
turn 30 ~180k input tokens ~30k input tokens <- summary fired
The two paths track each other early, while the history still fits cheaply. They split once the history nears the window: the naive path keeps climbing toward turn-30's ~180k, while the compacted path drops back to ~30k when the API replaces the early turns with a summary block. Same conversation, a fraction of the per-call cost, and the conversation keeps going instead of hitting the wall.
We cannot run that against the live API on this box (no key), so the figures above are the expected shape, not a measured one. The on-box proof is the verified demo earlier in this chapter: it holds the in-window total under a fixed 120-token budget across all 19 turns while keeping the planted "Atlas" fact answerable. That is the same bounded-cost, kept-the-gist result the real tool produces, run end to end with output you can check.
In Claude Code
You see this without writing any of it when you use Claude Code. A long session there compacts
on its own as it approaches the window, so the conversation keeps going without you pruning old
turns by hand. You can also trigger it yourself with the /compact command when you want to fold
the history down at a chosen moment rather than waiting for the automatic point. Either way it is
the same mechanism: summarize the old turns, keep the gist in the window, drop the raw history to
cheap storage.
Takeaways
- A long conversation cannot keep fitting a fixed window, so something gets evicted. "Drop the oldest" stays under budget but is amnesia by design: the most important fact is often the earliest.
- Compaction evicts smarter. It summarizes the oldest working-set turns into a compact running summary, archives the raw turns out of the window, and keeps the gist in-window for a fraction of the tokens.
- Tiered memory is the arrangement: a small in-window tier (working set plus summary) backed by a large out-of-window archive, paged between like RAM and disk.
- Bound the summary, or it becomes the new leak. Cap it and let low-priority facts age out first, so decisions survive and small talk does not.
- Deleting context and summarizing it are different tools. Delete spent tool results (context editing); summarize old-but-still-relevant reasoning (compaction). Long agent runs use both.
- The providers offer both server-side: Anthropic's
compact_20260112summarizes andclear_tool_uses_20250919deletes; Letta exposes core/recall/archival tiers the agent pages itself. Same tiered idea, less hand-rolling.
👉 Compaction keeps the facts of a long run alive. The next chapter keeps the lessons alive: when an agent fails, how it records what went wrong and turns that into a procedure it follows next time, so the same mistake is not repeated. On to Chapter 12.
Failure and procedural learning
Chapter 9 gave the agent a memory for facts: the user lives in Lisbon, prefers window seats. That memory makes the agent better informed. It does not make the agent better behaved. An agent can know every fact about your project and still commit without being asked, ship a chapter with a broken link, or call an LLM with the wrong model id, again and again, because nothing it knows changes what it does.
This chapter is about the other kind of memory: rules of conduct. We will mine an agent's past sessions for the mistakes it keeps repeating, turn those mistakes into instructions, and write the instructions back into the agent's system prompt so the same mistake does not happen a fourth time. The jargon for this is procedural memory, and learning it from observed failures is procedural learning.
Two kinds of memory
A memory system can store two very different things, and the distinction is the whole point of this chapter.
- Semantic memory is facts the agent knows. "The user's budget is 2000 dollars." "This repo deploys to Cloudflare Pages." You retrieve a fact and put it in the context so the model can use it as information. That was Chapter 9.
- Procedural memory is how-to rules that change the agent's behavior. "Run the tests before reporting done." "Never commit unless asked." You do not retrieve these per question; they live in the system prompt and apply to every call, steering what the agent does rather than telling it a fact.
There is a third category you will see named in the libraries, episodic memory, which stores whole past episodes (a full transcript of a previous session) to recall as a worked example. We touched it in passing with few-shot retrieval; this chapter is specifically about procedural memory, because that is the one you learn from failure and write into instructions.
Don't be confused. Semantic memory and procedural memory are not two flavors of the same thing. Semantic memory answers "what does the agent know?" and is retrieved into the context as data: a fact sits there inertly until a question needs it. Procedural memory answers "how does the agent act?" and is injected as instruction: a rule changes behavior on every call whether or not anyone asks about it. Storing "never commit without asking" as a retrievable fact would be a mistake, because the agent would only see it when something reminded it to look, which is exactly when it is too late. A rule of conduct has to live in the system prompt, always on. Facts are recalled; rules are obeyed.
The instruction file is procedural memory you wrote by hand
You have already met procedural memory, you just did not call it that. The instruction file
at the root of a coding project (CLAUDE.md, or AGENTS.md for some tools) is exactly a
block of how-to rules that gets prepended to the agent's system prompt on every call. It is
not facts about the code; it is conduct: how to behave while working here.
This very repository's CLAUDE.md is a worked example of procedural memory. It says to
build all six books before pushing, because someone once pushed a broken SUMMARY.md link
and took down the whole deploy. It says no em dashes in prose, because that is a reliable AI
tell and reviewers kept flagging it. It says to humanize prose before committing, and to end
commit messages with specific trailers. Every one of those rules is a lesson learned from a
past failure, written down so the next session does not relearn it the hard way. That file
was assembled by hand, one painful mistake at a time.
Procedural learning automates that loop. Instead of a human noticing "the agent keeps
forgetting to run tests" and editing the instruction file, the system mines the agent's own
session history, finds the recurring mistakes, and proposes the rules itself. The hand-built
CLAUDE.md is the target shape; the question is how to grow one from observed behavior.
The pipeline: mine, synthesize, evaluate
The mechanism is small and has three steps.
- Mine. Collect past sessions, each labeled with an outcome (success or failure) and,
for failures, a tag naming what went wrong:
forgot_to_run_tests,committed_without_asking,used_wrong_model_id. Count the tags. A tag that shows up once is probably an accident; a tag that shows up repeatedly is a pattern worth a rule. A frequency threshold separates the two. - Synthesize. Turn each recurring tag into an imperative rule string, and append those rules to the instruction file. The "before" file is the short base; the "after" file is the base plus the newly learned rules.
- Evaluate. Take a held-out set of failures the rule-writer never saw, and ask: how many of these would the new rules have caught? A rule "catches" a failure whose tag it addresses. The drop from the before count to the after count is the improvement.
Here is the whole pipeline as runnable code. The session tags are written out explicitly so you can see exactly what becomes a rule and what does not; a production system would tag transcripts automatically (a classifier, or the model itself reading each transcript), but the logic downstream is identical.
"""Procedural learning: mine past failures, rewrite the agent's instructions.
An agent's INSTRUCTION FILE (CLAUDE.md / AGENTS.md) is part of the system prompt:
it is re-injected on every call and shapes how the agent behaves. SEMANTIC memory
stores FACTS the agent knows (chapter 9). PROCEDURAL memory stores HOW-TO RULES that
change what the agent does. This file builds the smallest honest pipeline that turns
observed mistakes into new procedural rules:
TRACES past sessions, each with an outcome (success/failure). A failure
carries a TAG naming the mistake, e.g. "forgot_to_run_tests".
MINE count the failure tags; a tag is RECURRING if it appears at least
THRESHOLD times. One-off failures are noise; repeats are a pattern.
SYNTHESIZE map each recurring tag to an imperative rule string, and append the
new rules to a short base instruction file. Show BEFORE and AFTER.
EVALUATE on a HELD-OUT set of failure traces the rules never saw, count how
many the new rules would have CAUGHT (a rule catches a failure whose
tag it addresses). Print BEFORE vs AFTER failure counts.
Python stdlib only. No network, no model call.
"""
from collections import Counter
# --------------------------------------------------------------------------
# A trace is one past session. We keep only what mining needs: the outcome,
# and (for failures) the tag naming WHAT went wrong. A real system would tag
# transcripts automatically (a classifier, or the model itself reading the
# transcript); here the tags are explicit so you can see the whole pipeline.
# --------------------------------------------------------------------------
def trace(outcome, tag=None):
return {"outcome": outcome, "tag": tag}
# The training traces: sessions we have already seen and labeled. The failures
# repeat a few distinct mistakes, plus one rare one-off (a flaky network).
TRAIN = [
trace("success"),
trace("failure", "forgot_to_run_tests"),
trace("success"),
trace("failure", "committed_without_asking"),
trace("failure", "forgot_to_run_tests"),
trace("failure", "used_wrong_model_id"),
trace("success"),
trace("failure", "forgot_to_run_tests"),
trace("failure", "skipped_link_check"),
trace("failure", "committed_without_asking"),
trace("failure", "used_wrong_model_id"),
trace("failure", "forgot_to_run_tests"),
trace("success"),
trace("failure", "flaky_network_once"),
trace("failure", "committed_without_asking"),
trace("failure", "skipped_link_check"),
]
# Held-out failures: sessions the rule-writer never saw. We use these to ask an
# honest question: of failures we did NOT learn from, how many would the new
# rules have prevented? (Evaluating on the training traces would flatter us.)
HELD_OUT = [
trace("failure", "forgot_to_run_tests"),
trace("failure", "committed_without_asking"),
trace("failure", "skipped_link_check"),
trace("failure", "used_wrong_model_id"),
trace("failure", "forgot_to_run_tests"),
trace("failure", "typo_in_filename_once"), # a one-off, not a learned rule
]
# The synthesis step: each mistake tag maps to one imperative rule. In a real
# system the model would draft the rule text from clusters of failing
# transcripts; the mapping below is the same idea, frozen so the demo is exact.
TAG_TO_RULE = {
"forgot_to_run_tests": "Always run the test suite and confirm it is green before reporting a task done.",
"committed_without_asking": "Never commit or push unless the user explicitly asks; stage the work and wait.",
"used_wrong_model_id": "Use the model id 'claude-opus-4-8'; check the model id before sending any LLM call.",
"skipped_link_check": "Validate every cross-link and include path builds before claiming a chapter is done.",
}
THRESHOLD = 2 # a tag must recur at least this many times to earn a rule
BASE_INSTRUCTIONS = """\
# Agent instructions (base)
- Explain concepts from first principles; assume no background.
- Prefer small, runnable examples over prose."""
# --------------------------------------------------------------------------
# MINE: count failure tags, keep the ones at or above THRESHOLD.
# --------------------------------------------------------------------------
def mine_failures(traces, threshold):
"""Return (Counter of all failure tags, sorted list of recurring tags)."""
tags = Counter(t["tag"] for t in traces if t["outcome"] == "failure")
# Sort recurring tags by frequency (desc), then name, so output is stable.
recurring = sorted(
(tag for tag, n in tags.items() if n >= threshold),
key=lambda tag: (-tags[tag], tag),
)
return tags, recurring
# --------------------------------------------------------------------------
# SYNTHESIZE: turn recurring tags into rule lines and append to the base file.
# --------------------------------------------------------------------------
def synthesize_rules(recurring):
"""Map recurring tags to imperative rule strings (skip tags we can't phrase)."""
return [TAG_TO_RULE[tag] for tag in recurring if tag in TAG_TO_RULE]
def rewrite_instructions(base, rules):
"""Append a learned-rules section to the base instruction file."""
if not rules:
return base
learned = "\n".join(f"- {rule}" for rule in rules)
return f"{base}\n\n# Learned rules (mined from past failures)\n{learned}"
# --------------------------------------------------------------------------
# EVALUATE: a rule "catches" a held-out failure whose tag it addresses.
# --------------------------------------------------------------------------
def covered_tags(rules):
"""The set of failure tags the learned rules address (reverse the mapping)."""
rule_to_tag = {rule: tag for tag, rule in TAG_TO_RULE.items()}
return {rule_to_tag[rule] for rule in rules}
def evaluate(held_out, rules):
"""Count held-out failures the rules would have prevented vs not."""
caught_tags = covered_tags(rules)
failures = [t for t in held_out if t["outcome"] == "failure"]
prevented = [t for t in failures if t["tag"] in caught_tags]
remaining = [t for t in failures if t["tag"] not in caught_tags]
return failures, prevented, remaining
# --------------------------------------------------------------------------
# Demo
# --------------------------------------------------------------------------
def main():
# ---- MINE ----
print("=== 1. Mine the past failures ===")
n_fail = sum(1 for t in TRAIN if t["outcome"] == "failure")
print(f" {len(TRAIN)} training sessions, {n_fail} of them failures.")
tags, recurring = mine_failures(TRAIN, THRESHOLD)
print(f" failure tags by frequency (threshold to act = {THRESHOLD}):")
for tag, n in tags.most_common():
mark = "RECURRING" if n >= THRESHOLD else "one-off "
print(f" {n:>2}x {mark} {tag}")
print(f" -> recurring tags worth a rule: {recurring}")
# ---- SYNTHESIZE ----
rules = synthesize_rules(recurring)
before_file = BASE_INSTRUCTIONS
after_file = rewrite_instructions(BASE_INSTRUCTIONS, rules)
print("\n=== 2. The instruction file BEFORE (base only) ===")
for line in before_file.splitlines():
print(f" | {line}")
print("\n=== 3. The instruction file AFTER (base + learned rules) ===")
for line in after_file.splitlines():
print(f" | {line}")
print(f" ({len(rules)} new rule(s) appended from mined failures.)")
# ---- EVALUATE ----
print("\n=== 4. Evaluate on HELD-OUT failures the rules never saw ===")
failures, prevented, remaining = evaluate(HELD_OUT, rules)
print(f" held-out failures: {len(failures)}")
print(" with the BEFORE instructions (no learned rules):")
print(f" failures prevented: 0 / {len(failures)}")
print(" with the AFTER instructions (learned rules in the system prompt):")
print(f" failures prevented: {len(prevented)} / {len(failures)}")
for t in prevented:
print(f" caught {t['tag']}")
for t in remaining:
print(f" missed {t['tag']} (no rule: one-off or below threshold)")
drop = len(prevented)
print(f"\n improvement: {len(failures)} repeat-class failures -> "
f"{len(failures) - drop} after learning "
f"({drop} prevented by the new rules).")
print(" The one-off that remains is correct to leave alone: procedural")
print(" learning codifies PATTERNS, not every single accident.")
if __name__ == "__main__":
main()
Running it:
=== 1. Mine the past failures ===
16 training sessions, 12 of them failures.
failure tags by frequency (threshold to act = 2):
4x RECURRING forgot_to_run_tests
3x RECURRING committed_without_asking
2x RECURRING used_wrong_model_id
2x RECURRING skipped_link_check
1x one-off flaky_network_once
-> recurring tags worth a rule: ['forgot_to_run_tests', 'committed_without_asking', 'skipped_link_check', 'used_wrong_model_id']
=== 2. The instruction file BEFORE (base only) ===
| # Agent instructions (base)
| - Explain concepts from first principles; assume no background.
| - Prefer small, runnable examples over prose.
=== 3. The instruction file AFTER (base + learned rules) ===
| # Agent instructions (base)
| - Explain concepts from first principles; assume no background.
| - Prefer small, runnable examples over prose.
|
| # Learned rules (mined from past failures)
| - Always run the test suite and confirm it is green before reporting a task done.
| - Never commit or push unless the user explicitly asks; stage the work and wait.
| - Validate every cross-link and include path builds before claiming a chapter is done.
| - Use the model id 'claude-opus-4-8'; check the model id before sending any LLM call.
(4 new rule(s) appended from mined failures.)
=== 4. Evaluate on HELD-OUT failures the rules never saw ===
held-out failures: 6
with the BEFORE instructions (no learned rules):
failures prevented: 0 / 6
with the AFTER instructions (learned rules in the system prompt):
failures prevented: 5 / 6
caught forgot_to_run_tests
caught committed_without_asking
caught skipped_link_check
caught used_wrong_model_id
caught forgot_to_run_tests
missed typo_in_filename_once (no rule: one-off or below threshold)
improvement: 6 repeat-class failures -> 1 after learning (5 prevented by the new rules).
The one-off that remains is correct to leave alone: procedural
learning codifies PATTERNS, not every single accident.
Read the four sections in order, because they show the loop closing. The mine step found
five distinct failure tags but only four are recurring: flaky_network_once happened a
single time and stays below the threshold, so it earns no rule. The synthesize step turned
the four recurring tags into four imperative lines and appended them under a "Learned rules"
heading, leaving the base instructions untouched. The evaluate step is the honest part: of
six failures the rule-writer never saw, the new rules would have caught five, taking the
held-out failure count from six to one.
The threshold is doing real work, and it is the same judgment a careful human applies. Of
the held-out failures, the one the rules miss (typo_in_filename_once) is exactly the kind
of thing you should not write a rule for. If every single accident became a permanent line
in the system prompt, the instruction file would bloat into hundreds of brittle rules, most
of them firing on situations that will never recur, each one spending tokens on every call
(Chapter 2) and crowding out the rules that matter. Procedural
learning codifies patterns, not history. The threshold is where you set how much evidence
a behavior change requires.
Why this is a context-engineering technique, not just a logging trick
It would be easy to read this as "keep a log of bugs," but the payload is a context change, and that is what puts it in this book. The learned rules are not stored in a database the agent queries; they are written into the system prompt, the most expensive and most privileged real estate in the context. Every rule you add is paid for on every single call for the life of the project, so the threshold is not just noise control, it is a budget decision: a rule has to prevent enough failures to justify its standing token cost.
That framing also tells you the failure mode. A procedural-learning loop with no threshold, or one that mines too aggressively, produces a system prompt that grows without bound, contradicts itself ("always commit" learned in one session, "never commit" in another), and slowly degrades the agent it was meant to improve. The discipline is the same one from Chapter 1: the system prompt must stay lean. A good procedural-learning system adds a rule only when the evidence clears the bar, phrases it once and crisply, and is willing to retire a rule whose failures have stopped happening.
Where you meet this in the wild
The named version of semantic versus episodic versus procedural memory comes from LangChain's
LangMem, which treats these as distinct stores and, for the procedural kind, can update
an agent's system prompt from accumulated feedback rather than just remembering facts. The
specific "mine past transcripts and rewrite the rules file" loop shows up in tools built
around coding agents (for example a learn command that reads prior sessions and proposes
edits to AGENTS.md or CLAUDE.md). Underneath the product names the shape is the one
above: failures in, recurring patterns out, instruction file rewritten.
The use cases are where this earns its keep. A self-improving coding agent that stops repeating the same review-comment-worthy mistake after it makes it twice. Codifying team conventions automatically: if three engineers keep getting the same nit in review, that nit is a rule the agent could learn and apply before the human ever sees the diff. Reducing repeated review comments in general, by turning the feedback an agent receives into rules it will not need the feedback for next time. In each case the win is the same: a behavior that used to require a human to catch every time is converted, once, into a standing instruction.
Using the real tool: commands and before/after proof
The from-scratch demo above is the whole mechanism. In practice you reach for two real tools: the coding agent that already keeps a procedural-memory file, and a library that updates one from feedback. Here is how each maps onto the pipeline you just ran.
Claude Code: the instruction file is the procedural memory, and you grow it
Claude Code (Anthropic's CLI agent, running model claude-opus-4-8) keeps its procedural
memory in a CLAUDE.md at the repo root, and it can write the first draft for you. The /init
slash command reads the repository and generates the file:
# Inside the repo, start Claude Code and run the init command:
claude
> /init
# Claude reads the project (build scripts, source layout, conventions it can infer)
# and writes a CLAUDE.md describing how to work here. You then edit it by hand.
What /init produces is exactly procedural memory: not facts the agent looks up, but rules
of conduct prepended to the system prompt on every call. This very repository's own
CLAUDE.md is the worked example we have been describing. Every line in it is a rule that was
learned the hard way and written down so the next session does not relearn it:
- Build all six books before pushing (one broken SUMMARY link takes down the whole deploy).
- No em dashes or en dashes in prose (a reliable AI tell reviewers kept flagging).
- Humanize prose before committing (run it through the humanizer checklist).
- End commit messages with the Co-Authored-By and Claude-Session trailers.
Those are the rules. Because the file is re-injected at the start of every session, the agent stops repeating those specific mistakes. The update step, the thing procedural learning automates, is just appending a newly learned rule to the same file:
# A new failure pattern showed up (the agent kept fabricating command output).
# Codify it once, and every future session inherits the rule:
cat >> CLAUDE.md <<'RULE'
- Never fabricate code output; run the snippet and paste the real text block.
RULE
# For agents that read AGENTS.md instead of CLAUDE.md, append to that file the same way.
That >> append is the synthesize step from the pipeline, done by hand. The mining can be
automated too: headroom learn (from the headroom-ai package) reads past session
transcripts, finds the failed tool calls, correlates them with what eventually succeeded, and
writes the learnings into CLAUDE.md or AGENTS.md for you. The shape is identical to the
demo: failures in, recurring patterns out, instruction file rewritten.
LangMem: optimize a procedural instruction from feedback
LangMem (from LangChain) is the library that names semantic, episodic, and procedural memory as distinct stores. For the procedural kind it gives you a prompt optimizer: feed it past runs plus feedback and it proposes an improved system prompt. Install it:
pip install -U langmem
The minimal procedural-memory use is create_prompt_optimizer. You give it a model, a set of
trajectories (each one a conversation paired with feedback about how it went), and the
current prompt; it returns an updated prompt with the lesson folded in:
from langmem import create_prompt_optimizer
# A trajectory is (conversation, feedback). Feedback can be None, a {"score", "comment"}
# dict, or a corrected response. Here the agent committed without being asked:
trajectories = [
(
[
{"role": "user", "content": "Fix the typo in chapter 3."},
{"role": "assistant", "content": "Fixed and pushed the commit."},
],
{"score": 0.0, "comment": "It committed and pushed without being asked."},
),
]
optimizer = create_prompt_optimizer(
"anthropic:claude-opus-4-8", # the model that does the rewriting
kind="prompt_memory", # also: "metaprompt", "gradient"
)
before = "You are a careful coding assistant."
after = optimizer.invoke({"trajectories": trajectories, "prompt": before})
# `after` is the updated prompt string, now carrying a rule like
# "do not commit or push unless the user explicitly asks."
This is the same loop as the demo, with the LLM doing the synthesize step: the feedback is the
mined failure, and the returned string is the instruction file's "after" version. (We do not
paste a captured run here because the anthropic SDK and a live key are not installed on this
box; treat the snippet as follow-along and run it where the key is set. The official quickstart
at langchain-ai.github.io/langmem shows the same call against a live model.)
The proof that matters: repeat-failure rate, before vs after
A learned rule is only worth its token cost if it actually stops the failure from recurring, so measure it. The metric is the repeat-failure rate: on a fixed set of tasks, how many hit a known failure mode. The recipe is the evaluate step you already ran, applied to the real tool:
- Build a held-out eval set: tasks that previously triggered a known failure mode (committed without asking, used the wrong model id, shipped a broken link).
- Run them with the before instructions and count how many repeat the failure.
- Add the learned rule to
CLAUDE.md/AGENTS.md(or apply LangMem's updated prompt). - Run the same set with the after instructions and count again.
- The drop from the before count to the after count is the prevented-failure number.
An illustrative result, the kind you would expect on a small set:
(illustrative / expected)
repeat-failure rate on 6 held-out tasks:
BEFORE rules: 6 / 6 repeated a known failure mode
AFTER rules: 1 / 6 (5 prevented; the 1 left is a true one-off, correct to leave alone)
That table is labeled illustrative because the numbers depend on your eval set and model. The
honest, on-box version of exactly this measurement is the from-scratch demo earlier in this
chapter: it ran the held-out evaluation for real and printed failures prevented: 5 / 6,
which is the same repeat-failure-rate proof against verified output rather than expected
output. The real tools change who writes the rule (a CLI command, an LLM optimizer) but not
how you prove it worked: you measure the repeat-failure rate before and after.
Takeaways
- Procedural memory is how-to rules that change behavior, stored in the system prompt and applied on every call. It is distinct from semantic memory (facts the agent knows, recalled on demand) and episodic memory (whole past sessions recalled as examples).
- A project's
CLAUDE.md/AGENTS.mdis procedural memory written by hand: each rule is a lesson from a past failure. Procedural learning automates accumulating those rules from observed mistakes. - The pipeline is mine, synthesize, evaluate: count failure tags, promote the recurring ones to imperative rules, append them to the instruction file, and measure prevented failures on a held-out set.
- A frequency threshold is the core control. One-off accidents stay out; only patterns earn a rule, because every rule costs tokens on every call and a bloated system prompt degrades the agent it was meant to fix.
- In the demo, four mined rules took a held-out failure count from six to one, leaving exactly the single one-off that no rule should cover.
👉 We now have an agent that knows facts, remembers conversations, and has learned its own rules of conduct. The next chapter steps up a level to context orchestration: deciding, per turn, which of these pieces (which facts, which memories, which rules, which tools) actually get assembled into the context the model sees.
Context orchestration
The last ten chapters each handed you one lever. Chapter 3 shrinks a part. Chapter 6 caches a stable prefix. Chapter 9 recalls a fact from outside the window. Each is a good answer to "how do I make this one part smaller or cheaper or durable?" None of them answers the question that sits above all of them: on this turn, for this query, which parts do I even need?
That is the job of orchestration. It is the conductor. It looks at the query that just arrived, decides which sources of context this particular query requires, and assembles only those, applying compression or caching or memory recall at the moments they actually help rather than always. A "hi there" should not drag in a 600-token tool catalog and a code index. A stack trace should. Orchestration is the layer that knows the difference and acts on it per turn.
The default is to include everything, and it is wasteful
The naive way to build a context is to gather every source you have (retrieved documents, long-term memory, the tool definitions, the relevant code) and concatenate them, every turn. Call this the kitchen sink. It is simple and it is never wrong in the sense of missing something, because everything is always present. But it is wrong in the sense that matters in this book: it is not lean. Most turns pay for sources they will not use. The model also has to read past all of them, which dilutes its attention and, past a point, degrades the answer.
The fix is to choose. Different queries need different context. A factual lookup needs the retrieved documents and maybe a memory of who is asking; it does not need the code index. A coding request needs the code and the tools; it does not need the refund policy. Small talk needs neither, and the model can answer from the system prompt alone. Once you accept that the right set of sources depends on the query, you need a small machine to make that decision on each turn. That machine is a router, and the cleanest way to build it is as a state graph.
A state graph: nodes, edges, and a decision in the middle
A state graph is a tiny program drawn as boxes and arrows. Each box is a node: a step that takes the current state, does one job, and passes the (possibly updated) state to the next node. Each arrow is an edge: it says which node runs next. A conditional edge is an arrow that branches: it reads the state and picks a different next node depending on what it finds. The state is just the bag of values flowing through (here: the query, its label, the chosen sources, the assembled context). "Durable state" means that bag can be saved to disk between turns and reloaded, so an agent survives a crash or a pause.
Our router is three nodes with one branch in the middle:
┌──────────┐ label ┌─────────┐ sources ┌──────────┐
query ─────► │ classify │ ───────────► │ route │ ──────────► │ assemble │ ──► context
└──────────┘ └────┬────┘ └──────────┘
│ conditional edge
┌────────────────────┼────────────────────┐
▼ ▼ ▼
coding: code, factual: retrieval, chitchat:
tools, memory memory (nothing)
- classify reads the query and labels it (
coding,factual, orchitchat). - route is the conditional edge. Given the label, it picks the set of sources to include. This single decision is the whole point of orchestration.
- assemble packs the chosen sources into the token budget, in priority order, skipping any that would overflow. This is the same packing from Chapter 1, but now operating on a selected set instead of on everything.
Read that diagram against the kitchen sink: the kitchen sink skips the branch entirely and always lands on "include all four sources." The router keeps the branch, and the branch is where the tokens get saved.
The router in code
Here is the whole thing. Four sources, each with a token cost. A keyword classifier (kept
deliberately simple and deterministic; a real system might use a small model or the main model
itself, but the routing logic is identical). A ROUTES table that is the routing policy in
four lines. Then two assemblers, kitchen_sink and routed, run over the same five queries
so you can read the difference directly.
"""A small state-graph context router.
Earlier chapters each gave you ONE lever: compress a part, cache a prefix,
recall a memory. Orchestration is the layer above them. On each turn it looks
at the query, decides WHICH sources of context this particular query needs, and
assembles only those under a token budget. The query "hey what's up" should not
drag in a 600-token tool catalog; the query "fix this stack trace" should.
This script builds that router as a tiny GRAPH of three nodes:
classify -> route -> assemble
- classify: read the query, label it (factual / coding / chitchat).
- route: given the label, pick the set of sources to include.
- assemble: pack the chosen sources into a token budget.
We then compare two assemblers over the same queries:
- KITCHEN SINK: always include every source (the naive default).
- ROUTED: include only what the query type asked for.
Only depends on the Python standard library. Run: python3 orchestrate.py
"""
from dataclasses import dataclass, field
def est_tokens(text: str) -> int:
"""Rough token estimate: ~1.3 tokens per English word (see chapter 2)."""
return round(len(text.split()) * 1.3)
# ---------------------------------------------------------------------------
# SOURCES: each is a named place context can come from, with a token cost.
# A "source" is just a producer of context: a retriever, a memory store, a
# tool catalog, a code indexer. In a real system each call is expensive (a DB
# query, an embedding lookup). Here each holds a fixed blob so the cost is
# visible. The point is not the text but its SIZE and whether a turn needs it.
# ---------------------------------------------------------------------------
@dataclass
class Source:
name: str
text: str
@property
def tokens(self) -> int:
return est_tokens(self.text)
SOURCES = {
"retrieval": Source(
"retrieval",
"Retrieved docs: The refund policy allows returns within 30 days. "
"Shipping is free over fifty dollars. Support hours are 9 to 5. " * 12,
),
"memory": Source(
"memory",
"Long-term memory: user prefers terse answers, is on the Pro plan, "
"lives in Berlin, asked about invoices twice before. " * 8,
),
"tools": Source(
"tools",
"Tool definitions: search(query) read_file(path) write_file(path,text) "
"run_tests() git_diff() open_pr(title,body) list_dir(path). " * 14,
),
"code": Source(
"code",
"Code context: def process(order): validate(order); charge(order.card); "
"return Receipt(order.id) # plus 200 lines of the surrounding module. " * 16,
),
}
# ---------------------------------------------------------------------------
# CLASSIFY node: label the query by simple keyword rules. A real system might
# use a tiny classifier or the model itself; keywords keep the demo readable
# and deterministic. The label is the only thing the router needs.
# ---------------------------------------------------------------------------
CODING_WORDS = {"bug", "stack", "trace", "function", "code", "test", "error",
"deploy", "refactor", "import", "exception", "fix"}
FACTUAL_WORDS = {"what", "when", "how", "policy", "refund", "hours", "price",
"where", "who", "does", "is"}
CHITCHAT_WORDS = {"hi", "hey", "hello", "thanks", "thank", "lol", "cool",
"nice", "ok", "okay", "bye"}
def classify(query: str) -> str:
"""Return one of: 'coding', 'factual', 'chitchat'. First match wins, in
priority order, so a coding question that also contains 'what' stays coding."""
words = {w.strip(".,!?").lower() for w in query.split()}
if words & CODING_WORDS:
return "coding"
if words & FACTUAL_WORDS:
return "factual"
if words & CHITCHAT_WORDS:
return "chitchat"
return "factual" # safe default: when unsure, allow a lookup
# ---------------------------------------------------------------------------
# ROUTE node: map a label to the set of sources that label needs. THIS is the
# orchestration decision. Chitchat needs nothing heavy. A factual lookup needs
# retrieval (and memory, to personalize). Coding needs the code and the tools,
# but not the refund policy. Routing is choosing this set, per turn.
# ---------------------------------------------------------------------------
ROUTES = {
"coding": ["code", "tools", "memory"],
"factual": ["retrieval", "memory"],
"chitchat": [], # answer from the system prompt alone; pull in nothing
}
def route(label: str) -> list:
"""Given a query label, return the ordered list of source names to assemble."""
return ROUTES[label]
# ---------------------------------------------------------------------------
# ASSEMBLE node: pack chosen sources into a token budget, in priority order,
# skipping any that would overflow. This is the same packing idea as chapter 1,
# but now operating on a SELECTED set rather than everything.
# ---------------------------------------------------------------------------
@dataclass
class Assembled:
label: str
included: list = field(default_factory=list)
skipped: list = field(default_factory=list)
tokens: int = 0
def assemble(source_names, budget) -> Assembled:
out = Assembled(label="")
for name in source_names:
src = SOURCES[name]
if out.tokens + src.tokens <= budget:
out.included.append(name)
out.tokens += src.tokens
else:
out.skipped.append(name)
return out
# ---------------------------------------------------------------------------
# The two assemblers under test.
# ---------------------------------------------------------------------------
def kitchen_sink(query, budget) -> Assembled:
"""Naive default: ignore the query, include every source every time."""
res = assemble(list(SOURCES.keys()), budget)
res.label = classify(query) # label only for reporting; not used to route
return res
def routed(query, budget) -> Assembled:
"""classify -> route -> assemble: include only what this query needs."""
label = classify(query)
chosen = route(label)
res = assemble(chosen, budget)
res.label = label
return res
QUERIES = [
"What is your refund policy?",
"Fix this stack trace in the deploy function",
"hey thanks, that was nice",
"When are your support hours?",
"There is a bug in the test, the import throws an exception",
]
BUDGET = 4000
def fmt(names):
return ", ".join(names) if names else "(nothing)"
print("=== Per-query: KITCHEN SINK vs ROUTED ===")
print(f"(budget {BUDGET} tokens per turn)\n")
ks_total = 0
rt_total = 0
for q in QUERIES:
ks = kitchen_sink(q, BUDGET)
rt = routed(q, BUDGET)
ks_total += ks.tokens
rt_total += rt.tokens
print(f"query: {q!r}")
print(f" classified as : {rt.label}")
print(f" kitchen sink : {ks.tokens:5d} tok [{fmt(ks.included)}]")
print(f" routed : {rt.tokens:5d} tok [{fmt(rt.included)}]")
print(f" saved : {ks.tokens - rt.tokens:5d} tok")
print()
print("=== Totals across all queries ===")
pct = 100 * (ks_total - rt_total) / ks_total
print(f" kitchen sink total : {ks_total:6d} tok")
print(f" routed total : {rt_total:6d} tok")
print(f" saved : {ks_total - rt_total:6d} tok ({pct:.0f}% smaller)")
print()
print("Each routed turn still carries the RIGHT source: the coding queries get")
print("'code' and 'tools', the factual queries get 'retrieval', and chitchat")
print("pulls in nothing heavy. Routing spends tokens where the query needs them.")
Running it:
=== Per-query: KITCHEN SINK vs ROUTED ===
(budget 4000 tokens per turn)
query: 'What is your refund policy?'
classified as : factual
kitchen sink : 1038 tok [retrieval, memory, tools, code]
routed : 541 tok [retrieval, memory]
saved : 497 tok
query: 'Fix this stack trace in the deploy function'
classified as : coding
kitchen sink : 1038 tok [retrieval, memory, tools, code]
routed : 695 tok [code, tools, memory]
saved : 343 tok
query: 'hey thanks, that was nice'
classified as : chitchat
kitchen sink : 1038 tok [retrieval, memory, tools, code]
routed : 0 tok [(nothing)]
saved : 1038 tok
query: 'When are your support hours?'
classified as : factual
kitchen sink : 1038 tok [retrieval, memory, tools, code]
routed : 541 tok [retrieval, memory]
saved : 497 tok
query: 'There is a bug in the test, the import throws an exception'
classified as : coding
kitchen sink : 1038 tok [retrieval, memory, tools, code]
routed : 695 tok [code, tools, memory]
saved : 343 tok
=== Totals across all queries ===
kitchen sink total : 5190 tok
routed total : 2472 tok
saved : 2718 tok (52% smaller)
Each routed turn still carries the RIGHT source: the coding queries get
'code' and 'tools', the factual queries get 'retrieval', and chitchat
pulls in nothing heavy. Routing spends tokens where the query needs them.
Look at what routing bought, and notice it is two things at once, not one. First, it is
smaller: 2,472 tokens against 5,190, a 52% cut over the five queries, and the chitchat turn
drops from 1,038 tokens to zero because none of the heavy sources earn their place in "hey
thanks." Second, and this is the part a blunt token cut would get wrong, every routed turn
still carries the right source. The two coding queries both pull in code and tools; the
two factual queries both pull in retrieval. The router did not save tokens by starving the
query. It saved tokens by not loading the sources this query never needed. That is the
distinction between orchestration and crude truncation: truncation cuts to hit a number,
orchestration cuts what is irrelevant and keeps what is not.
The conductor invokes the earlier levers, at the right moment
The router above only selects sources, but the same node-and-branch structure is where every earlier chapter plugs in. Orchestration is the conductor that decides when to use each lever, instead of using it always:
- Retrieve, or not? The cheapest retrieval is the one you skip. A chitchat turn routes to zero sources, so the router never runs the retriever at all. A "RAG router" is exactly this: a branch that decides whether the query even needs a document lookup before paying for one.
- Compress, but only when it would overflow. The
assemblenode can call the compressor from Chapter 3 on a source only when including it whole would blow the budget, rather than compressing on every turn. - Cache, on the stable branches. Sources that recur turn after turn (the tool catalog on a coding agent) are the prefix-cache candidates from Chapter 6. The router decides which branch is stable enough to cache.
- Recall memory, when the query is about the user. The factual and coding routes pull in
memory; the chitchat route does not. The router is what turns the memory store from Chapter 9 into a conditional read instead of an always-on tax.
So the four families are not a flat menu you apply uniformly. They are levers the conductor pulls per turn, on the branch where each one pays off.
Don't be confused. Three things sit near each other and are easy to merge. The prompt is the wording handed to the model: the system instruction, the phrasing of the task. Orchestration is not the prompt; it is the decision about which sources, tools, and state get assembled into the context this turn, before any wording is finalized. A linear chain is the third thing: a fixed pipeline of steps that always runs in the same order, with no branch and no saved state (retrieve, then stuff, then generate, every time, for every query). Orchestration is a chain that branches: it reads the query and routes to different context-gathering steps, and it can carry durable state across turns. You can have a great prompt inside a dumb linear chain that retrieves on a "hello," and you can have a smart orchestrator that picks the right sources and then hands them to a mediocre prompt. They are different layers, and a serious system needs all three to be good.
The real projects
Two pieces of the open-source landscape (Chapter 15) make this concrete.
LangGraph builds agents as exactly the graph in this chapter: you declare nodes (each a function over the shared state), wire them with edges, and use conditional edges to route to different context-gathering steps based on what the state holds. Its defining features are the explicit control flow (you can see and test every branch) and durable state (the state bag is checkpointed, so an agent can pause, survive a restart, and resume mid-graph). A multi-tool agent in LangGraph is a router node that picks which tool's context to load next; a RAG router is a conditional edge that decides whether to retrieve at all.
lean-ctx focuses on the assemble node: lean per-turn context assembly, building the
smallest sufficient context for the current query rather than the kitchen sink. It is the same
"select, then pack under budget" idea, treated as a first-class step.
The shared lesson is the framing of this whole chapter. Compression, caching, and memory are the instruments. Orchestration is the conductor that decides which to play, and when, for the query in front of it.
Using the real tool: commands and before/after proof
The router above is a from-scratch state graph so you can see every part. LangGraph is the
same shape with the bookkeeping handled for you: you declare nodes and edges, and it runs them
and threads the state through. Install it with pip (the package name is langgraph, all
lowercase):
pip install langgraph
Here is the router as a real LangGraph graph. The state is a typed dictionary (a TypedDict):
a plain dict whose keys and value types are spelled out, so each node knows what is in the bag
flowing through. classify and assemble are nodes (each takes the state and returns the
keys it changed). route is not a node: it is the function the conditional edge calls to pick
the next step. We wire START → classify, then add_conditional_edges from classify using
route plus a path_map that maps each label to the assemble node, and assemble → END. The
LangGraph parts are from langgraph.graph import StateGraph, START, END; the rest is ordinary
Python. (Follow-along: LangGraph is not installed on this box, so the output below is labeled
expected, not measured here.)
from typing import TypedDict
from langgraph.graph import StateGraph, START, END
# The state is the bag of values that flows through the graph.
class RouterState(TypedDict):
query: str
label: str
sources: list[str]
context: str
# The routing policy: which sources each query type needs.
ROUTES = {
"coding": ["code", "tools", "memory"],
"factual": ["retrieval", "memory"],
"chitchat": [],
}
def classify(state: RouterState) -> dict:
q = state["query"].lower()
if any(w in q for w in ("fix", "bug", "exception", "stack trace")):
label = "coding"
elif any(w in q for w in ("policy", "hours", "refund", "when")):
label = "factual"
else:
label = "chitchat"
return {"label": label}
def route(state: RouterState) -> str:
# The conditional edge calls this and uses the return value as the key.
return state["label"]
def assemble(state: RouterState) -> dict:
sources = ROUTES[state["label"]]
return {"sources": sources, "context": "\n".join(sources)}
builder = StateGraph(RouterState)
builder.add_node("classify", classify)
builder.add_node("assemble", assemble)
builder.add_edge(START, "classify")
builder.add_conditional_edges(
"classify",
route,
# path_map: route's return value -> the next node to run.
{"coding": "assemble", "factual": "assemble", "chitchat": "assemble"},
)
builder.add_edge("assemble", END)
graph = builder.compile()
result = graph.invoke({"query": "Fix this stack trace in the deploy function"})
print(result["label"], "->", result["sources"])
coding -> ['code', 'tools', 'memory']
That is the on-box from-scratch router (the verified demo above) wearing LangGraph's clothes:
the labels, the ROUTES table, and the branch are identical. What LangGraph adds is the graph
runtime and the durable state from the start of the chapter, so the same router can checkpoint
and resume.
The before/after proof: input tokens per turn
The metric that shows orchestration paying off is input tokens per turn: how many tokens
the model has to read on each request, before it writes a single token back. The kitchen sink
sends every source every turn; the router sends only what it selected. To prove the gap with
real numbers, count the tokens in each context with the Anthropic SDK rather than guessing.
client.messages.count_tokens(...) returns an object whose .input_tokens is the exact count
the API would charge for that prompt:
from anthropic import Anthropic
client = Anthropic()
def turn_tokens(context: str, query: str) -> int:
resp = client.messages.count_tokens(
model="claude-opus-4-8",
messages=[{"role": "user", "content": f"{context}\n\n{query}"}],
)
return resp.input_tokens
ALL_SOURCES = ["retrieval", "memory", "tools", "code"]
def build(sources: list[str]) -> str:
return "\n".join(f"<{s}>...</{s}>" for s in sources)
for query in ["Fix this stack trace", "What is your refund policy?", "thanks!"]:
label = classify({"query": query})["label"]
routed = turn_tokens(build(ROUTES[label]), query)
sink = turn_tokens(build(ALL_SOURCES), query)
print(f"{label:8} kitchen sink {sink:>5} routed {routed:>5}")
coding kitchen sink 1024 routed 695
factual kitchen sink 1024 routed 512
chitchat kitchen sink 1024 routed 8
(Token counts illustrative/expected, not measured here: the anthropic SDK is not installed
on this box and the source bodies above are stubs. Run it with a real key and full sources to
get the actual numbers.) The shape is the point. The kitchen sink reads about 1,000 tokens
every turn no matter what was asked. The routed turn reads only the sources the branch chose:
roughly half for a coding or factual query, and near zero for chitchat, because the model can
answer "thanks!" from the system prompt alone. Critically, the routed coding turn still carries
code and tools and the routed factual turn still carries retrieval: the cut came from
dropping the sources the query never needed, not from starving the query. That is the same
result the verified from-scratch demo proved on this box (52% smaller, right source kept); the
count_tokens recipe is how you put a real, billable number on it for your own sources.
In Claude Code
Claude Code is an orchestrator in this exact sense, not a single flat prompt. Two of its moves are orchestration in practice. First, it spawns subagents with the Task tool: when a job has an independent sub-part (search this part of the repo, run and read the tests), it hands that sub-part to a fresh subagent that gets only its own focused context and reports back a short result, instead of pouring every intermediate file into the main window. Second, it reads just the files a step needs: it greps and opens the specific lines relevant to the current edit rather than loading the whole codebase up front. Both are the router's lesson applied to a real agent: decide what this step needs, assemble only that, and keep the rest out of the window.
Takeaways
- Orchestration is the layer above the individual levers: per turn, it decides which sources, tools, and state to assemble, instead of including everything.
- The kitchen sink (always include every source) is never missing anything but is rarely lean. Routing makes the context smaller and keeps the source each query actually needs.
- Build the router as a state graph: a
classifynode, aroutenode that branches on the label (a conditional edge), and anassemblenode that packs the chosen sources under budget. - In the demo, routing cut total context by 52%, dropped a chitchat turn to zero heavy tokens, and still gave every coding query its code and tools and every factual query its documents.
- The conductor invokes the earlier levers at the right moment: retrieve only when the query needs a lookup, compress only on overflow, cache the stable branches, recall memory only when the query is about the user.
- Orchestration is not the prompt (the wording) and not a linear chain (a fixed pipeline with no branch or state). It is the branching, stateful layer that chooses what the prompt gets built from. LangGraph and lean-ctx are the real-world versions.
👉 Routing keeps each turn's context as small as it can be, but some tasks genuinely need a long window: a whole codebase, a long transcript, a book. Chapter 14 goes inside the model to ask why long contexts are expensive in the first place, and what makes attention tractable when the window is huge.
Long-context attention efficiency
Every technique so far has worked from outside the model: choosing what goes in the
context, compressing it, caching it, storing it elsewhere. This chapter goes inside the
model itself, because there is a hard limit those techniques cannot move on their own.
The reason a 1M-token context is expensive is not the disk it sits on or the bytes you
send over the wire. It is that the model's attention mechanism, the thing that lets
each token look at the others, costs work that grows with the square of the context
length, and the KV cache it leaves behind costs memory that grows with the length.
Those two costs are what model architects attack, and the 1M-token windows of models like
claude-opus-4-8 exist only because they found ways to attack them. This chapter explains
the two costs and the two families of fixes.
What attention is, and why it costs $n^2$
When a model processes a sequence, each token needs to look at the others to figure out what it means in context. The word "it" has to find its referent; a closing brace has to find its opening one. The mechanism that does this looking is attention, and it works with three vectors per token. For each token the model computes a query (what this token is looking for), a key (what this token offers to others), and a value (the information this token will pass along if attended to). A query from one token is compared against the keys of all the others; wherever a query and a key match well, that token's value gets mixed in. Written out, attention is
$$\text{attention}(Q, K, V) = \text{softmax}!\left(\frac{QK^\top}{\sqrt{d}}\right)V$$
and every symbol there earns its place. $Q$, $K$, $V$ are the stacks of query, key, and value vectors for all $n$ tokens, each vector of length $d$ (the head dimension, the size of one attention head's working space). $QK^\top$ is the score matrix: row $i$, column $j$ is the dot product of query $i$ with key $j$, a single number saying how much token $i$ should attend to token $j$. Dividing by $\sqrt{d}$ keeps those numbers from growing too large as $d$ grows. softmax turns each row into weights that are positive and sum to one, so each query spreads a total attention of exactly 1 across all keys. Multiplying those weights by $V$ produces the output: each token's new representation is a weighted blend of every token's value.
Now look at the shape of $QK^\top$. With $n$ tokens it is an $n \times n$ matrix: every token scored against every other token. Each of those $n^2$ entries is a dot product of length $d$, and then the weight-times-value step is another $n \times n$ pass. So the work is proportional to $2 n^2 d$, which we write as $O(n^2 d)$: it grows with the square of the sequence length. "$O(\cdot)$" is just shorthand for "grows in proportion to," ignoring constant factors. The consequence is brutal. Double the context and attention does four times the work. Go from a 1,000-token prompt to a 1,000,000-token one, a 1000x increase in length, and attention does a million times the work. That single fact is why long context is hard, and the first part of the demo makes the quadratic curve a concrete set of numbers.
Fix one, less compute: sparse and linear attention
The first family of fixes notices that the $n \times n$ score matrix is almost always wasteful. A query does not actually need a precise score against all $n$ keys; in real language most of a token's attention lands on a handful of others (its near neighbours, a few key earlier tokens) and the rest get a weight so close to zero they may as well not have been computed. So why compute them?
Sparse attention computes only a chosen subset of the score matrix. The simplest pattern is a sliding window: each query attends only to the $w$ most recent tokens instead of all $n$, where $w$ is a fixed window size much smaller than $n$. That turns the score matrix from $n \times n$ into roughly $n \times w$, so the cost drops from $O(n^2 d)$ to $O(n w d)$, linear in $n$ instead of quadratic. A richer version is top-$k$ selection: instead of a fixed local window, each query keeps only the $k$ keys it scores highest and ignores the rest, so it can still reach an important token far back while paying for only $k$ comparisons. DeepSeek Sparse Attention (DSA), used in DeepSeek-V3.2, does exactly this kind of learned selection: for each query it picks a sparse set of keys to attend to, so the cost stops growing with the full sequence length.
A different route is linear attention, which avoids the $n \times n$ matrix altogether. By dropping or reshaping the softmax, attention can be rewritten so that keys and values are summed into a fixed-size running state that each query reads from, making the cost grow linearly with $n$ and using a fixed amount of memory regardless of length. MiniMax's lightning attention (in MiniMax-01) is a production example: it replaces the quadratic softmax attention with a linear-cost mechanism for long sequences, which is what lets the model take very long inputs without the $n^2$ wall.
The second part of the demo shows the catch and why these methods work anyway. Cheaper operations are worthless if the answer comes out wrong. So it runs full attention and windowed attention on the same input and measures how far apart their outputs are. The input is built so that attention weight decays with distance (the way it does in real text, where a token leans mostly on what is near it), and the result is that almost all of every query's attention mass already falls inside the window. Clipping to the window then changes the output by a tiny amount: the approximation is reasonable, not random.
Fix two, less memory: MLA and KV compression
Sparse and linear attention attack compute. They do nothing for the other cost, which is memory. Recall the KV cache from Chapter 6 and Chapter 8: so the model does not recompute the past on every generated token, it stores the key and value vectors of every token it has seen. Each token contributes a key and a value, each of size $d$, so the cache holds on the order of $n \cdot d$ numbers per layer, and it lives in scarce GPU memory. For a 1M-token context this cache is enormous, and as Chapter 8 established, the KV cache is usually the thing that runs out first when serving long contexts. Sparse attention does not shrink it: even if a query only looks at a window, the model still has to keep the keys and values around in case a later query needs them.
Multi-head Latent Attention (MLA), introduced in DeepSeek-V2 and carried into DeepSeek-V3, shrinks the cache directly. The idea is low-rank compression. The full $d$-dimensional key and value vectors carry less independent information than their size suggests; they can be squeezed into a much smaller latent vector of dimension $r$, where $r$ is far smaller than $d$ (the "low-rank" part means a few dimensions capture most of what matters). MLA projects each token's key and value down to one shared latent of size $r$, stores only that latent in the cache, and projects it back up to full keys and values when attention actually needs them. The down-projection and up-projection are fixed weight matrices, paid once, not stored per token. So the per-token cache cost falls from about $d$ to about $r$, and the whole cache shrinks by roughly $d / r$.
The third part of the demo puts a number on it: the KV-cache bytes for a 1M-token context stored in full versus stored as MLA-style latents, and the compression ratio. It also runs a small sanity check that a rank-$r$ latent really can carry the information, by compressing genuinely low-rank vectors down and back up and confirming almost nothing is lost.
Don't be confused. Reducing attention compute and reducing KV-cache memory are two different problems with two different fixes, and it is easy to blur them because both wear the word "attention." Sparse and linear attention (DSA, MiniMax lightning) cut the number of score computations: they make the model do less arithmetic per token by not comparing every query against every key. MLA cuts the bytes per token in the cache: it makes each stored token smaller, but it does not change how many comparisons attention performs. One attacks the $n^2$ in the FLOPs, the other attacks the $n \cdot d$ in the memory. They are orthogonal, and the strongest long-context models combine them: sparse or linear attention running over a compressed MLA cache, paying less compute and less memory at once. When you read that a model "does efficient long context," ask which of the two it means, because a model can fix one and still be bottlenecked on the other.
The demo
"""Long-context attention efficiency: why long windows are expensive, and the two
architecture-level fixes.
A transformer's attention compares EVERY token against EVERY earlier token. For a
sequence of n tokens, that comparison forms an n x n SCORE MATRIX, so the compute grows
like n^2. Double the context and you quadruple the attention work. Separately, to avoid
recomputing the past on every step, the engine stores a KEY and a VALUE vector (each of
size d) per token: the KV CACHE, whose memory grows like n*d. These are two DIFFERENT
bottlenecks, and there are two different fixes:
1. FULL attention is O(n^2 * d). We count the multiply-add operations for the
score matrix (n*n*d) plus the value mixing (n*n*d) as n grows, so the quadratic
blow-up is visible as a number.
2. SPARSE / WINDOWED attention cuts COMPUTE. Each query attends only to a local
WINDOW of w nearby keys instead of all n, giving about O(n*w*d). We count its
ops vs full, and on a SMOOTH input show the windowed output stays close to full
attention (small max-abs difference), so the approximation is reasonable, not random.
3. MLA-style KV compression cuts MEMORY. Multi-head Latent Attention projects the
d-dim key/value down to a shared low-rank LATENT of size r << d, stores only that
latent per token, and projects back up when needed. We compute KV-cache BYTES for
full (n*d) vs latent (n*r) and the compression ratio for a long n.
Everything below is plain NumPy with float32. No GPU, no transformer library; just the
arithmetic that the real systems are built on. NumPy + stdlib only.
Run: python3 attention_efficiency.py
"""
import numpy as np
BYTES_PER_ELEM = 2 # KV caches are usually stored in fp16/bf16: 2 bytes/number
rng = np.random.default_rng(0)
# ----------------------------------------------------------------------------------
# Part 1: FULL attention is O(n^2 * d). Count the operations as n grows.
# ----------------------------------------------------------------------------------
# Attention is softmax(Q K^T / sqrt(d)) V. With n tokens and head dimension d:
# * Q K^T compares every query against every key: an n x n matrix, each entry a
# dot product of length d -> n * n * d multiply-adds.
# * Multiplying that n x n weight matrix by V mixes the values -> another n * n * d.
# So the cost is proportional to 2 * n^2 * d, the famous "quadratic in sequence length".
def full_attention_ops(n, d):
"""Multiply-add count for full attention over n tokens, head dim d."""
score_ops = n * n * d # Q K^T : the n x n score matrix
mix_ops = n * n * d # weights @ V : mixing the values
return score_ops + mix_ops
def windowed_attention_ops(n, d, w):
"""Each query attends to at most w keys (a local window) instead of all n.
The score matrix becomes n x w, so the cost is about 2 * n * w * d."""
eff = min(w, n)
return n * eff * d + n * eff * d
D = 64 # head dimension: the size of each query/key/value vector
W = 64 # local window: how many nearby keys a query may attend to
print("=== 1. Full attention is O(n^2): cost as the context grows ===")
print(f" head dim d = {D}, window w = {W}. Ops = multiply-adds for one attention head.\n")
print(f" {'n':>6} {'FULL ops (~2 n^2 d)':>22} {'WINDOWED ops (~2 n w d)':>24} {'full/windowed':>14}")
prev_full = None
for n in (64, 256, 1024, 4096):
full = full_attention_ops(n, D)
win = windowed_attention_ops(n, D, W)
grow = "" if prev_full is None else f" (n x4 -> ops x{full / prev_full:.0f})"
print(f" {n:>6} {full:>22,} {win:>24,} {full / win:>13.1f}x{grow}")
prev_full = full
print("\n Each 4x in n makes FULL attention ~16x more work (quadratic), while WINDOWED")
print(" grows only ~4x (linear in n). That gap is why naive attention cannot reach 1M tokens.\n")
# ----------------------------------------------------------------------------------
# Part 2: windowed attention APPROXIMATES full attention on smooth input.
# ----------------------------------------------------------------------------------
# Cheaper ops are useless if the answer is wrong. On structured input (here a smooth
# signal where nearby tokens are similar), almost all of a query's attention mass lands
# on nearby keys anyway, so restricting to a local window changes the output very little.
# We run BOTH on the same input and report the largest per-element difference.
def softmax(x, axis=-1):
x = x - np.max(x, axis=axis, keepdims=True) # subtract max for numerical safety
e = np.exp(x)
return e / np.sum(e, axis=axis, keepdims=True)
def _distance_bias(n, slope):
"""A score penalty that grows with how far apart two tokens are: position j gets
-slope * |i - j| added to query i's score. This is the ALiBi recipe, and it bakes in
the locality real language has: a token leans on what is near it. The further away a
key is, the lower its score, so its softmax weight decays toward zero with distance."""
idx = np.arange(n)
return -slope * np.abs(idx[:, None] - idx[None, :])
def full_attention(Q, K, V, slope):
"""The real thing: softmax(Q K^T / sqrt(d) + distance_bias) V, causal (a query may
only see itself and earlier tokens), every query over every visible key."""
n, d = Q.shape
idx = np.arange(n)
scores = Q @ K.T / np.sqrt(d) + _distance_bias(n, slope)
future = idx[None, :] > idx[:, None] # a token cannot see the future
scores = np.where(future, -np.inf, scores)
return softmax(scores, axis=-1) @ V
def windowed_attention(Q, K, V, w, slope):
"""Each query i attends only to keys in [i-w, i]. Positions outside the window get
-inf score, so softmax assigns them exactly zero weight. This is the local/sliding-
window sparsity pattern long-context models use; it never even forms the far scores."""
n, d = Q.shape
idx = np.arange(n)
scores = Q @ K.T / np.sqrt(d) + _distance_bias(n, slope)
# mask True where a key is OUTSIDE query i's window (too far back, or in the future)
too_far = (idx[None, :] < idx[:, None] - w) | (idx[None, :] > idx[:, None])
scores = np.where(too_far, -np.inf, scores)
return softmax(scores, axis=-1) @ V
N = 256
SLOPE = 0.05
# A realistic input: random token embeddings PLUS the distance bias above, so attention
# weight decays with distance the way it does in real language (nearby tokens matter most,
# far ones fade out). Under that decay, almost all of every query's attention mass already
# lands inside a local window, so clipping to the window changes the output only slightly.
X = rng.normal(0, 1, size=(N, D))
Q, K, V = X.copy(), X.copy(), X.copy() # tie Q=K=V to keep the demo simple
out_full = full_attention(Q, K, V, SLOPE)
out_win = windowed_attention(Q, K, V, W, SLOPE)
max_abs_diff = float(np.max(np.abs(out_full - out_win)))
rel = max_abs_diff / float(np.max(np.abs(out_full)))
# How much of each query's full-attention weight already lands inside the window:
idx = np.arange(N)
full_scores = Q @ K.T / np.sqrt(D) + _distance_bias(N, SLOPE)
full_w = softmax(np.where(idx[None, :] > idx[:, None], -np.inf, full_scores), axis=-1)
in_window = (idx[None, :] >= idx[:, None] - W) & (idx[None, :] <= idx[:, None])
avg_window_mass = float((full_w * in_window).sum(axis=1).mean())
print("=== 2. Windowed attention stays close to full attention (locality of language) ===")
print(f" n = {N} tokens, window w = {W}. Same input through both.")
print(f" full attention ops: {full_attention_ops(N, D):>14,}")
print(f" windowed attention ops: {windowed_attention_ops(N, D, W):>14,} "
f"({full_attention_ops(N, D) / windowed_attention_ops(N, D, W):.1f}x cheaper)")
print(f" avg attention mass already inside the window: {avg_window_mass:.1%}")
print(f" max abs difference in output: {max_abs_diff:.4f} "
f"({rel:.2%} of the largest output value)")
print(" -> a fraction of the compute, and the output barely moves. The window is a good")
print(" approximation because nearby tokens already carry almost all the attention mass.\n")
# ----------------------------------------------------------------------------------
# Part 3: MLA-style KV compression cuts MEMORY (not compute).
# ----------------------------------------------------------------------------------
# The KV cache stores, per token, a key and a value vector of size d. For n tokens that
# is on the order of n*d numbers, and it lives in scarce GPU memory (see chapters 6, 8).
# Multi-head Latent Attention (MLA, from DeepSeek-V2/V3) does not store the full d-dim
# K and V. It PROJECTS them down to a shared LOW-RANK LATENT of size r << d and stores
# only that latent per token; when attention needs the full K and V, it projects the
# latent back UP with fixed matrices. So the per-token cache shrinks from ~d to ~r.
def kv_cache_bytes_full(n, d, bytes_per_elem=BYTES_PER_ELEM):
"""Full cache: one key vector + one value vector of size d per token."""
return n * (2 * d) * bytes_per_elem
def kv_cache_bytes_latent(n, r, bytes_per_elem=BYTES_PER_ELEM):
"""MLA-style cache: one shared latent vector of size r per token (the up-projection
matrices are fixed weights, paid once, not per token)."""
return n * r * bytes_per_elem
def latent_roundtrip_error(d, r, n_probe=256, seed=1):
"""Sanity check that a rank-r latent can carry most of a d-dim vector when the data
itself is low-rank: build vectors that live in an r-dim subspace, compress to the
latent and back, and report the worst reconstruction error."""
g = np.random.default_rng(seed)
basis = g.normal(size=(r, d)) # an r-dimensional subspace of R^d
coeffs = g.normal(size=(n_probe, r))
kv = coeffs @ basis # genuinely rank-r data
down = np.linalg.pinv(basis) # d -> r (compress)
latent = kv @ down # store this (size r per token)
recon = latent @ basis # r -> d (project back up)
return float(np.max(np.abs(kv - recon)))
N_LONG = 1_000_000 # a 1M-token context, like claude-opus-4-8's long window
D_MODEL = 128 # per-head key/value dimension
R_LATENT = 16 # MLA latent dimension, r << d
full_bytes = kv_cache_bytes_full(N_LONG, D_MODEL)
latent_bytes = kv_cache_bytes_latent(N_LONG, R_LATENT)
err = latent_roundtrip_error(D_MODEL, R_LATENT)
print("=== 3. MLA-style KV compression cuts cache MEMORY ===")
print(f" n = {N_LONG:,} tokens, per-head d = {D_MODEL}, latent r = {R_LATENT} "
f"(r << d), {BYTES_PER_ELEM} bytes/number.\n")
print(" BEFORE (full cache: store K and V of size d per token)")
print(f" {full_bytes:,} bytes = {full_bytes / 1e9:.2f} GB (per layer, per head)\n")
print(" AFTER (MLA: store one shared latent of size r per token)")
print(f" {latent_bytes:,} bytes = {latent_bytes / 1e9:.2f} GB (per layer, per head)\n")
print(f" compression ratio: {full_bytes / latent_bytes:.1f}x smaller KV cache")
print(f" (low-rank round-trip on genuinely rank-{R_LATENT} data loses at most "
f"{err:.1e} per element: the latent carries the information.)")
print(" -> same context, a fraction of the cache. Memory, not compute, is what MLA buys.\n")
# ----------------------------------------------------------------------------------
# Summary: the two bottlenecks are independent and combine.
# ----------------------------------------------------------------------------------
print("=== Summary: two different bottlenecks, two different fixes ===")
print(" COMPUTE (score matrix is n x n): sparse/windowed/linear attention -> fewer ops.")
print(" MEMORY (KV cache is n x d): MLA low-rank latent -> fewer bytes.")
print(" They are orthogonal: a model can use BOTH (sparse attention over an MLA cache)")
print(" to serve 1M-token contexts cheaply. That is what makes whole-repo and whole-book")
print(" inputs affordable at all.")
Running it:
=== 1. Full attention is O(n^2): cost as the context grows ===
head dim d = 64, window w = 64. Ops = multiply-adds for one attention head.
n FULL ops (~2 n^2 d) WINDOWED ops (~2 n w d) full/windowed
64 524,288 524,288 1.0x
256 8,388,608 2,097,152 4.0x (n x4 -> ops x16)
1024 134,217,728 8,388,608 16.0x (n x4 -> ops x16)
4096 2,147,483,648 33,554,432 64.0x (n x4 -> ops x16)
Each 4x in n makes FULL attention ~16x more work (quadratic), while WINDOWED
grows only ~4x (linear in n). That gap is why naive attention cannot reach 1M tokens.
=== 2. Windowed attention stays close to full attention (locality of language) ===
n = 256 tokens, window w = 64. Same input through both.
full attention ops: 8,388,608
windowed attention ops: 2,097,152 (4.0x cheaper)
avg attention mass already inside the window: 99.9%
max abs difference in output: 0.0134 (0.34% of the largest output value)
-> a fraction of the compute, and the output barely moves. The window is a good
approximation because nearby tokens already carry almost all the attention mass.
=== 3. MLA-style KV compression cuts cache MEMORY ===
n = 1,000,000 tokens, per-head d = 128, latent r = 16 (r << d), 2 bytes/number.
BEFORE (full cache: store K and V of size d per token)
512,000,000 bytes = 0.51 GB (per layer, per head)
AFTER (MLA: store one shared latent of size r per token)
32,000,000 bytes = 0.03 GB (per layer, per head)
compression ratio: 16.0x smaller KV cache
(low-rank round-trip on genuinely rank-16 data loses at most 2.7e-14 per element: the latent carries the information.)
-> same context, a fraction of the cache. Memory, not compute, is what MLA buys.
=== Summary: two different bottlenecks, two different fixes ===
COMPUTE (score matrix is n x n): sparse/windowed/linear attention -> fewer ops.
MEMORY (KV cache is n x d): MLA low-rank latent -> fewer bytes.
They are orthogonal: a model can use BOTH (sparse attention over an MLA cache)
to serve 1M-token contexts cheaply. That is what makes whole-repo and whole-book
inputs affordable at all.
Read the three parts in order, because they trace the problem and both fixes. Part one is the problem. At $n = 64$ full and windowed attention cost the same, because the window is as wide as the sequence. From there they diverge fast: every time the length quadruples, full attention's op count goes up 16x (that is the $n^2$, since $4^2 = 16$), while windowed attention goes up only 4x (linear in $n$, since the window stays fixed). By $n = 4096$ full attention is doing 64x the work of the window for the same sequence, and that ratio keeps widening without limit. Extend the table to a million tokens and the full-attention column becomes a number no accelerator will run interactively. That is the wall.
Part two is the first fix and its justification. The window does a quarter of the operations at $n = 256$, and the gap only grows with $n$. The question is whether the answer survives, and the demo shows it does: 99.9 percent of every query's attention weight already lands inside the window, so dropping the keys outside it changes the output by 0.0134, which is 0.34 percent of the largest output value. The reason is the locality the input was built to have, and that real language has: a token leans mostly on what is near it. This is the property DSA's learned selection and a sliding window both exploit, and it is why sparse attention is a good approximation and not a gamble.
Part three is the second fix, on the other axis entirely. A 1M-token context stored as full keys and values takes 0.51 GB per layer per head; the same context stored as MLA latents of dimension 16 takes 0.03 GB, a 16x reduction, which is exactly $d / r = 128 / 16$. The round-trip sanity check confirms the latent is not throwing the information away: on genuinely rank-16 data the compress-and-restore loses at most $2.7 \times 10^{-14}$ per number, which is floating-point noise. Multiply that 16x by every layer and head in a real model and the cache for a long context goes from "does not fit" to "fits comfortably," which is the difference between offering a 1M-token window and not.
The exact numbers here come from the toy dimensions and the seed in the script; change $d$, $r$, the window, or the input and they move. What is robust is the structure: full attention is quadratic in length and falls over at scale, local or learned-sparse attention is linear and stays accurate when attention is local, and low-rank KV compression shrinks the cache by $d / r$ with almost no loss when the keys and values are genuinely low-rank.
Why this is the floor under everything else
This chapter is the architecture layer the rest of the book stands on. The capacity pressure from Chapter 1, the finite window, only relaxes to a million tokens because attention was made sub-quadratic and the cache was made small. The KV-cache serving work of Chapter 8, paging and prefix sharing, manages where the cache lives and how it is shared; MLA changes how big each token's entry is in the first place, and the two compound. And the cost economics of Chapter 2 reach their limit here: cheap long-context serving, the kind that makes feeding a whole repository or a whole book into one call affordable, is downstream of these architectural choices. When you can hand a model an entire codebase and ask one question across all of it, you are using a model whose architects paid down both the $n^2$ compute and the $n \cdot d$ memory so that the call is tractable at all.
It is worth being clear about what you, as someone engineering context, control here and what you do not. You do not choose a model's attention pattern or whether it uses MLA; those are fixed when the model is trained. What this chapter gives you is the why behind the prices and limits you do see. It explains why a provider can offer a 1M-token window at a workable price, why long-context calls are still more expensive per token than short ones even with these fixes, and why the discipline of the earlier chapters (sending less, caching the stable prefix, storing state outside the window) still pays off: a smaller, leaner context is cheaper on any architecture, efficient attention or not. The fixes here lower the ceiling on what is possible; the rest of the book lowers your bill underneath it.
Using the real tool: commands and before/after proof
The demo above proves the mechanism on this box. This section is about the two ways you actually meet efficient attention as a practitioner: you serve an open-weights model that has it built in, or you call a long-context model over an API that has it built in. Either way the thing you can measure and show a colleague is the same: how cost and memory scale with the context length $n$.
Serving an efficient-attention open-weights model
The open-weights models that ship the fixes from this chapter (DeepSeek with MLA and DeepSeek Sparse Attention, MiniMax with lightning attention) run under vLLM, a serving engine that downloads the weights from Hugging Face and exposes an HTTP endpoint. You point it at a model id and it serves it:
pip install vllm
# DeepSeek: Multi-head Latent Attention (small KV cache) + DeepSeek Sparse
# Attention (sub-quadratic compute). Use the current DeepSeek model id on
# Hugging Face, e.g. a DeepSeek-V3 / V3.2 / V4 family checkpoint.
vllm serve deepseek-ai/<current-deepseek-model>
# MiniMax: lightning (linear) attention for long sequences.
# Use the current MiniMax model id on Hugging Face.
vllm serve MiniMaxAI/<current-minimax-model>
Replace <current-deepseek-model> and <current-minimax-model> with the model id you find on
the model's Hugging Face page; the exact names move as new versions ship, so check the page
rather than trusting a name from memory. vLLM reads the architecture from the checkpoint and
runs the right attention kernel: MLA stores each token's key and value as the small latent from
the third part of the demo, so the KV cache for a long context is a fraction of a vanilla
transformer's at the same length, and the sparse or linear path means the per-token compute
stops growing with the full sequence. You did not configure any of that. It is in the weights,
and serving the model is how you use it. The point of the chapter is to know why that one
command can hold a context a vanilla transformer could not: both costs were paid down in the
architecture.
Calling a long-context model over an API
The other way you use efficient attention is to not run a model at all. When you call a
long-context model like claude-opus-4-8, with its 1M-token context window, the fact that
you can put a whole repository or a whole book in a single prompt at all is downstream of this
chapter. A 1M-token window is only offered because the attention is sub-quadratic and the KV
cache is small enough to serve; a vanilla transformer at a million tokens would be doing the
million-times-more work from part one of the demo and holding the half-gigabyte-per-head cache
from part three. You send the long prompt and read the answer; the architecture is what makes
the call tractable on the provider's side.
The before/after proof: cost versus context length
Here is a recipe you can run yourself to see the thing efficient attention controls. The metric
is cost, and the variable is context length. You do not need a GPU for it. Count the
tokens of inputs of increasing size with the Anthropic SDK, then multiply by the input rate
(claude-opus-4-8 is $5 per million input tokens, $25 per million output):
import anthropic
client = anthropic.Anthropic()
RATE_PER_TOKEN = 5.0 / 1_000_000 # $5 per million input tokens
# Feed progressively larger inputs (a file, then a directory, then a whole repo).
for label, text in [("one file", small), ("a package", medium), ("the whole repo", large)]:
n = client.messages.count_tokens(
model="claude-opus-4-8",
messages=[{"role": "user", "content": text}],
).input_tokens
print(f"{label:>16}: {n:>9,} tokens -> ${n * RATE_PER_TOKEN:.2f} per call")
count_tokens is the exact tokenizer the model uses, so the counts are real, not estimates.
Run it across inputs from a few thousand tokens up to the edge of the window and you watch the
per-call cost climb in a straight line with the length:
one file: 8,000 tokens -> $0.04 per call (illustrative / expected)
a package: 120,000 tokens -> $0.60 per call (illustrative / expected)
the whole repo: 900,000 tokens -> $4.50 per call (illustrative / expected)
(The numbers above are illustrative: the token counts depend on your actual files, and the SDK call needs network and an API key. The structure is the load-bearing part.) Notice that the price tracks the length linearly: ten times the context is about ten times the input bill, not a hundred times. That linearity is the whole point. The provider's per-token serving cost is roughly flat as $n$ grows, so they can charge you per token; if their attention were the naive $O(n^2)$ from the start of the chapter, the work behind each token would itself grow with the context, the half-gigabyte-per-head cache would not fit, and a 1M-token call would be priced out of reach (or not offered). The from-scratch demo is the on-box proof of that mechanism: part one showed the full quadratic compute curve that efficient attention flattens, and part three showed the MLA KV-memory drop directly, the $16\times$ smaller cache that lets the long context be held at all. The cost-versus-length table here is the same story measured from the outside, at the level you actually pay for.
Takeaways
- Attention compares every token against every other, so its score matrix is $n \times n$ and its compute is $O(n^2 d)$: quadruple the context, quadruple the work. That quadratic cost is the reason long context is hard.
- Sparse attention (a sliding window, or top-$k$/learned selection as in DeepSeek Sparse Attention) and linear attention (MiniMax lightning) cut the compute by scoring only the keys that matter, dropping the cost from $O(n^2 d)$ toward $O(n w d)$.
- A local window is a good approximation, not a gamble, because attention in real language is local: in the demo 99.9 percent of each query's weight is already inside the window, so the output barely moves.
- MLA (Multi-head Latent Attention, from DeepSeek-V2/V3) cuts the memory by storing each token's key and value as a small low-rank latent of size $r \ll d$ instead of the full $d$, shrinking the KV cache by about $d / r$ (16x in the demo) with negligible loss.
- Compute and memory are different bottlenecks with different fixes; the strongest
long-context models combine sparse/linear attention with an MLA cache. The 1M-token windows
of models like
claude-opus-4-8exist because of this layer.
👉 That closes the techniques. The final chapter steps back to the ecosystem: the open-source libraries and serving stacks that implement everything in this book, so you know what to reach for instead of building it yourself.
The open-source landscape
You have now built a small version of every lever. This chapter is the map from each lever to the real project that implements it at production scale, so that when you have a problem you can reach for the right tool instead of reinventing it. The aim is to leave you able to do two things: name the technique a project belongs to, and say when you would choose it.
A note on links, in the spirit of the rest of this site: the field moves fast and URLs rot, so most entries name the project and, where the address is one I am confident about, give a plain GitHub org or root domain. For the newer or more niche tools, the name is enough to search. Nothing here is an endorsement; pick by fit and by the quote you get today.
The map, by family
Compression: shrink what is sent and written
| Lever | Project(s) | What it does | Reach for it when |
|---|---|---|---|
| Prompt and context compression (Ch 3) | LLMLingua and LLMLingua-2 (Microsoft, github.com/microsoft/LLMLingua); plus newer tools like Headroom and RTK | Prune low-information tokens from a long prompt before sending, coarse-to-fine, keeping the meaning | A long RAG context or heavy few-shot block is blowing the budget |
| Output token reduction (Ch 4) | Provider params (Anthropic effort, structured outputs, stop sequences); output shapers like caveman and Headroom's shaper | Make the model write less: terser answers, schema-only fields, hard caps | Output volume dominates cost, or you only need a label or a JSON field |
| Code and structure-aware context (Ch 5) | CodeCompressor; lean-ctx; tree-sitter; Aider's repo map | Select code by AST or call graph instead of dumping whole files | A coding agent or repo QA system is pasting far more code than the task needs |
Caching: stop paying twice
| Lever | Project(s) | What it does | Reach for it when |
|---|---|---|---|
| KV-cache and prefix caching (Ch 6) | Provider-native prompt caching (Anthropic and others); CacheAligner-style tooling | Reuse the cached computation and charge for a stable prompt prefix across calls | A large system prompt or document is re-sent on every call in a session |
| Semantic and response caching (Ch 7) | GPTCache (github.com/zilliztech/GPTCache); Redis LangCache (redis.io) | Return a stored answer for an approximately-similar query, skipping inference | Traffic has many near-duplicate questions (FAQ, support, repeated analytics) |
| KV-cache serving optimization (Ch 8) | vLLM PagedAttention (github.com/vllm-project/vllm); SGLang RadixAttention (github.com/sgl-project/sglang) | Page KV memory and share prefix blocks across concurrent requests in the engine | You run your own inference server and need throughput and memory efficiency |
Memory and state
| Lever | Project(s) | What it does | Reach for it when |
|---|---|---|---|
| Agent memory and persistence (Ch 9) | Mem0 (github.com/mem0ai/mem0); Letta (formerly MemGPT, github.com/letta-ai/letta); Zep (github.com/getzep/zep) | Extract, store, retrieve, and invalidate facts across turns and sessions | The assistant must remember users and prior context across sessions |
| Temporal knowledge graphs (Ch 10) | Graphiti (github.com/getzep/graphiti); Zep | Store facts with validity windows so you can ask what was true at a past time | Facts change over time and point-in-time correctness matters |
| Context-window compaction (Ch 11) | Letta tiered memory; provider compaction and context editing (Anthropic) | Summarize or clear old context when the window fills, keeping the gist | Long agent runs or multi-hour chats that overflow the window |
| Failure and procedural learning (Ch 12) | LangMem (github.com/langchain-ai/langmem); headroom learn | Mine past sessions to rewrite the agent's instructions (CLAUDE.md, AGENTS.md) | The agent keeps repeating the same avoidable mistakes |
Orchestration and architecture
| Lever | Project(s) | What it does | Reach for it when |
|---|---|---|---|
| Context orchestration (Ch 13) | LangGraph (github.com/langchain-ai/langgraph); lean-ctx | Assemble the right context, tools, and state per turn via a graph with branching and state | A multi-tool agent needs to decide which sources to pull on each turn |
| Long-context attention efficiency (Ch 14) | DeepSeek sparse attention (DSA) and MLA (github.com/deepseek-ai); MiniMax lightning attention (github.com/MiniMax-AI) | Cut attention compute (sparse/linear) or KV memory (low-rank latent) so long windows are feasible | You are choosing or serving a model for very long context, or want to understand why 1M windows are possible |
Don't be confused. The rows are not alternatives to each other; they are complements. A serious system uses several at once: prefix caching (Ch 6) on the stable preamble, compression (Ch 3) on the retrieved docs, a memory store (Ch 9) for cross-session facts, compaction (Ch 11) when the chat runs long, and an orchestrator (Ch 13) deciding which to apply this turn. The question is rarely "which one"; it is "which combination, in what order".
A decision guide
When a context problem shows up, name the symptom first, then the lever:
- "It does not fit." Capacity problem. Compress the biggest part (Ch 3, Ch 5), or move state out to memory and retrieve only the relevant slice (Ch 9), or summarize the old turns (Ch 11).
- "It is too expensive." Cost problem. If the input is large and stable, cache the prefix (Ch 6). If the answer repeats, cache the answer (Ch 7). If the output is verbose, shape it (Ch 4). Remember output is 5x input (Ch 2).
- "It forgets." Durability problem. Add a memory store (Ch 9); if the facts change over time, make it temporal (Ch 10).
- "It repeats the same mistake." Learning problem. Mine the failures and rewrite the instructions (Ch 12).
- "It is slow at high load." Serving problem. Use a paged, prefix-sharing engine (Ch 8), and pick a model whose attention is efficient at your context length (Ch 14).
- "It pulls the wrong things." Routing problem. Put an orchestrator in front that decides per turn what to assemble (Ch 13).
Build or buy
The from-scratch versions in this book are for understanding, not for production. A real semantic cache needs a vector index, eviction, and persistence; a real memory layer needs durability, concurrency, and access control; a real compressor needs a tuned scorer. The projects above have solved those parts. Build the toy to know what the tool is doing, then use the tool. The one place where "build" often wins is the orchestrator (Ch 13): the routing policy is specific to your application, and a hundred lines of your own control flow is frequently clearer than bending a framework to fit.
Takeaways
- Every lever in this book has a production project behind it; the map above is the technique-to-tool lookup.
- The families are complements, not alternatives. Real systems stack caching, compression, memory, and orchestration together.
- Diagnose by symptom: does not fit (compress, externalize, summarize), too expensive (cache prefix or answer, shape output), forgets (memory, temporal), repeats mistakes (procedural learning), slow at load (paged serving, efficient attention), pulls the wrong things (orchestrate).
- Build the toy to understand the tool, then use the tool. The orchestrator is the part most worth writing yourself.
👉 The final page collects the projects, the papers behind them, and a glossary of the terms used throughout, as a reference you can return to.
References and further reading
How to use this page: you do not need any of it to have understood the book. Treat it as a shelf to reach for when a specific question comes up. It collects the projects named throughout, the papers behind the ideas, a glossary of the terms the chapters built up, and the runnable code so you can find a demo again.
A note on links: the field moves quickly and URLs rot, so most entries name the thing and, where I am confident of the address, give a plain GitHub org or root domain. Everything else is a name you can type into a search engine. Nothing here is fabricated; where I was unsure of an exact address I left the name and let you search.
Projects, by lever
- Prompt and context compression (Ch 3): LLMLingua and
LLMLingua-2 (
github.com/microsoft/LLMLingua). Also named in this space: Headroom, RTK. - Output token reduction (Ch 4): the provider's own controls
(Anthropic
effort, structured outputs, stop sequences); output shapers such as caveman and Headroom's shaper. - Code and structure-aware context (Ch 5): CodeCompressor,
lean-ctx, tree-sitter (
tree-sitter.github.io), Aider's repo map. - KV-cache and prefix caching (Ch 6): provider-native prompt caching (Anthropic and other vendors); CacheAligner-style tooling.
- Semantic and response caching (Ch 7): GPTCache
(
github.com/zilliztech/GPTCache), Redis LangCache (redis.io). - KV-cache serving optimization (Ch 8): vLLM
(
github.com/vllm-project/vllm, PagedAttention), SGLang (github.com/sgl-project/sglang, RadixAttention). - Agent memory and persistence (Ch 9): Mem0
(
github.com/mem0ai/mem0), Letta (github.com/letta-ai/letta, formerly MemGPT), Zep (github.com/getzep/zep). - Temporal knowledge graphs (Ch 10): Graphiti
(
github.com/getzep/graphiti), Zep. - Context-window compaction (Ch 11): Letta tiered memory; the provider's own compaction and context-editing features (Anthropic).
- Failure and procedural learning (Ch 12): LangMem
(
github.com/langchain-ai/langmem), headroom learn. - Context orchestration (Ch 13): LangGraph
(
github.com/langchain-ai/langgraph), lean-ctx. - Long-context attention efficiency (Ch 14): DeepSeek
(
github.com/deepseek-ai, sparse attention DSA and Multi-head Latent Attention), MiniMax (github.com/MiniMax-AI, lightning/linear attention).
Papers behind the ideas
Named so you can find the current version on arxiv.org or the project page; the exact
identifiers change as papers revise, so search the title.
- LLMLingua and LLMLingua-2 (Microsoft Research): prompt compression by perplexity (v1) and by a learned token classifier distilled from a strong model (v2).
- MemGPT (now Letta): an operating-system metaphor for LLM memory, with core, recall, and archival tiers the model pages between.
- PagedAttention ("Efficient Memory Management for Large Language Model Serving"): the paper behind vLLM, treating the KV cache like OS virtual memory.
- RadixAttention (SGLang): automatic KV prefix sharing across requests via a radix tree.
- DeepSeek-V2 / V3 technical reports: Multi-head Latent Attention (MLA) for KV compression, and the sparse attention used in later versions.
- MiniMax-01 technical report: lightning (linear) attention for long sequences.
- The provider documentation for prompt caching, token counting, compaction, and context editing is the authority for the exact parameters; consult the claude-api reference for the model you call.
Glossary
Terms the chapters built up, in one place.
- Context. The full token sequence sent to the model on one call: system prompt, tools, retrieved documents, history, and the user's message. See Chapter 1.
- Token. A sub-word unit from the model's vocabulary; the thing you are billed and windowed in. See Chapter 2.
- Context window. The maximum number of tokens a model can read at once.
- BPE (byte pair encoding). The algorithm that builds the tokenizer by merging frequent adjacent pairs. See Chapter 2.
- Compression ratio. Original tokens divided by compressed tokens. See Chapter 3.
- Effort / structured output / stop sequence. Provider controls that make the model write fewer output tokens. See Chapter 4.
- AST (abstract syntax tree). A program's structure as a tree; the basis for selecting code by call graph. See Chapter 5.
- Query, key, value. The three projections inside attention. See Chapter 6.
- KV cache. Stored keys and values for past tokens so they are not recomputed each step. See Chapter 6.
- Prefix caching / prompt caching. Reusing the cached work and charge for a stable prompt prefix across calls. See Chapter 6.
- Semantic cache. Returning a stored answer for an approximately-similar query. See Chapter 7.
- Embedding / cosine similarity. A vector for a piece of text, and the angle-based measure of how close two vectors are. See Chapter 7 and Chapter 9.
- PagedAttention / RadixAttention. Engine techniques: KV memory in fixed-size pages, and KV prefix sharing across requests. See Chapter 8.
- Extraction / retrieval / invalidation. The three memory operations: pull facts out, fetch the relevant ones, replace stale ones. See Chapter 9.
- Bi-temporal / validity interval. Tracking when a fact was true (event time) and when the system learned it (ingestion time). See Chapter 10.
- Compaction. Summarizing old context when the window fills, as opposed to deleting it. See Chapter 11.
- Procedural memory. Learned how-to rules that change the agent's behavior, stored in its instructions, as opposed to facts. See Chapter 12.
- Orchestration. Deciding which context, tools, and state to assemble per turn. See Chapter 13.
- Sparse / linear attention, MLA. Architecture-level ways to cut attention compute or KV memory so long windows are feasible. See Chapter 14.
This book's code
Every demo is in the code/ folder of this book's repository, runnable with python3 and
NumPy, no API key. By chapter:
context_assemble.py,token_economy.py: foundations.compress.py,output_shape.py,code_context.py: compression.kv_cache.py,semantic_cache.py,kv_serving.py: caching.agent_memory.py,temporal_kg.py,compaction.py,procedural_learning.py: memory.orchestrate.py,attention_efficiency.py: architecture.
Read them, change a number, and watch the trade-off move. That is the fastest way to make the ideas your own.
One last thing
Context engineering is bookkeeping with stakes. The model is only ever as good as the tokens you put in front of it and only ever as cheap as the tokens you avoid re-paying for. You now have the levers and the map. When a system is slow, expensive, or forgetful, you can name which of the four pressures it is failing and reach for the matching tool. That is the whole job.