What context engineering is
The introduction gave the one-sentence version: context engineering is managing the model's input as a fixed-size, paid resource. This chapter makes that precise, because the precision is what lets you reason about every later technique instead of memorizing them.
The model is a pure function of its context
A chat model feels stateful. You tell it your name, and three messages later it still knows it. But the model itself remembers nothing between calls. What actually happens is that your application re-sends the whole conversation every time, and the model re-reads it from scratch on each call. Formally, a single call is
$$\text{output} = f_\theta(\text{context})$$
where $f_\theta$ is the frozen network and context is one long token sequence. The
weights $\theta$ hold what the model learned in training; the context holds everything
about this situation. There is no third place. If a fact is not in the weights and not in
the context, the model does not have it during this call, full stop.
This is the whole reason context engineering is a discipline and not a footnote. Because the context is the only channel for live information, every design question reduces to a question about it: what is in the context, in what form, at what cost, on this call?
Do not confuse the illusion with the mechanism. "The model remembers" and "my application re-sends the history" describe the same observable behavior, but they are not the same mechanism, and the difference is the entire job. The model is stateless. The memory is something your code maintains and re-injects. When the memory grows past the window, the model does not "forget gracefully": your code must decide what to cut, or the call fails. Mistaking the illusion for the mechanism is how people end up surprised by both the bill and the amnesia.
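The distinction can be made concrete with a toy. The sketch below is not a real model call: `fake_model` and `ChatSession` are hypothetical names invented for this illustration, and the "model" is just a pure function that can only see the list of lines it is handed. It shows that the apparent memory disappears as soon as the application stops re-sending the history.

```python
"""Sketch: the 'memory' lives in the application, not in the model."""

NAME_PREFIX = "user: my name is "


def fake_model(context):
    """Hypothetical stand-in for an LLM call: a pure function of its input.
    It can 'know' the user's name only if the name appears in the context."""
    for line in context:
        if line.startswith(NAME_PREFIX):
            return f"Hello again, {line[len(NAME_PREFIX):]}."
    return "I don't know your name."


class ChatSession:
    """The application side: it keeps the history and re-sends all of it."""

    def __init__(self):
        self.history = []

    def send(self, user_text):
        self.history.append(f"user: {user_text}")
        reply = fake_model(self.history)  # the whole history goes in, every call
        self.history.append(f"assistant: {reply}")
        return reply


session = ChatSession()
session.send("my name is Ada")
with_history = session.send("what is my name?")

# Same function, same question, but nobody re-sent the earlier turn.
without_history = fake_model(["user: what is my name?"])

print(with_history)     # Hello again, Ada.
print(without_history)  # I don't know your name.
```

Nothing about `fake_model` changed between the two calls; only its input did. That is the sense in which the state belongs to your code.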
A context is assembled, every single turn
The context is not one thing you write once. It is rebuilt on every call from named parts: a system prompt, the tool definitions, any retrieved documents, the running conversation, the latest tool results, and the user's new message, concatenated in a fixed order. Build order matters (it interacts with caching, Chapter 6), and total size matters, because the sum has to fit the window.
Here is that assembly as runnable code. It builds a realistic turn from parts, estimates each part's size with a crude word-based count (real counts are Chapter 2), and then enforces three different budgets by dropping the oldest evictable parts until the rest fit. The system prompt and the user's message are pinned: they never get dropped, because without them the call is meaningless.

```python
"""Assemble an LLM context from its parts and enforce a token budget.

Context engineering, at its most basic, is this: every call you BUILD the input
sequence from named parts, MEASURE it against the window, and DECIDE what to drop
when it does not fit. This script does exactly that with a crude word-based token
estimate (good enough to see the mechanics; real token counts come in chapter 2).

Only depends on the Python standard library. Run: python3 context_assemble.py
"""
from dataclasses import dataclass


def est_tokens(text: str) -> int:
    """Rough token estimate. Real tokenizers split into sub-words, so tokens are
    a bit more numerous than whitespace words; ~1.3 tokens per word is a fair
    rule of thumb for English prose."""
    return round(len(text.split()) * 1.3)


@dataclass
class Part:
    name: str
    text: str
    pinned: bool = False  # pinned parts are never evicted (system prompt, user turn)

    @property
    def tokens(self) -> int:
        return est_tokens(self.text)


def assemble(parts, budget):
    """Pack parts into a token budget. Pinned parts always stay. Among the rest,
    keep the most RECENT (parts later in the list) and drop the oldest first,
    which is the simplest possible eviction policy."""
    pinned = [p for p in parts if p.pinned]
    evictable = [p for p in parts if not p.pinned]
    used = sum(p.tokens for p in pinned)
    kept_recent = []
    # Walk newest-to-oldest and keep each part that still fits. A part that
    # does not fit is skipped; the walk continues to the older parts.
    for p in reversed(evictable):
        if used + p.tokens <= budget:
            kept_recent.append(p)
            used += p.tokens
    # Pinned parts are listed first, then the survivors in their original order.
    kept = pinned + list(reversed(kept_recent))
    dropped = [p for p in evictable if p not in kept_recent]
    return kept, dropped, used


# A realistic turn: stable preamble, a long retrieved doc, growing history, the ask.
parts = [
    Part("system_prompt", "You are a careful coding assistant. " * 20, pinned=True),
    Part("tool_defs", "search(query) read_file(path) write_file(path, text) " * 12),
    Part("retrieved_doc", "The deployment pipeline runs build.sh on every push. " * 60),
    Part("history_turn_1", "User asked about the database schema. Assistant replied. " * 15),
    Part("history_turn_2", "User asked about migrations. Assistant gave the steps. " * 15),
    Part("history_turn_3", "User asked about rollback. Assistant explained. " * 15),
    Part("user_message", "Now: why did the last deploy fail?", pinned=True),
]

total = sum(p.tokens for p in parts)
print("=== The context as assembled, part by part ===")
for p in parts:
    flag = " [pinned]" if p.pinned else ""
    print(f" {p.name:16s} {p.tokens:5d} tokens{flag}")
print(f" {'TOTAL':16s} {total:5d} tokens\n")

for budget in (10_000, 600, 300):
    kept, dropped, used = assemble(parts, budget)
    print(f"=== Budget {budget} tokens ===")
    print(f" fits as-is: {total <= budget}")
    print(f" kept ({used} tok): {', '.join(p.name for p in kept)}")
    print(f" dropped: {', '.join(p.name for p in dropped) or '(nothing)'}")
    print()

print("Lesson: the SAME parts produce a different context at every budget. Deciding")
print("what survives the squeeze IS context engineering; the rest of the book is")
print("smarter ways to do it than 'drop the oldest'.")
```
Running it:
```text
=== The context as assembled, part by part ===
 system_prompt      156 tokens [pinned]
 tool_defs           62 tokens
 retrieved_doc      624 tokens
 history_turn_1     156 tokens
 history_turn_2     156 tokens
 history_turn_3     117 tokens
 user_message         9 tokens [pinned]
 TOTAL             1280 tokens

=== Budget 10000 tokens ===
 fits as-is: True
 kept (1280 tok): system_prompt, user_message, tool_defs, retrieved_doc, history_turn_1, history_turn_2, history_turn_3
 dropped: (nothing)

=== Budget 600 tokens ===
 fits as-is: False
 kept (594 tok): system_prompt, user_message, history_turn_1, history_turn_2, history_turn_3
 dropped: tool_defs, retrieved_doc

=== Budget 300 tokens ===
 fits as-is: False
 kept (282 tok): system_prompt, user_message, history_turn_3
 dropped: tool_defs, retrieved_doc, history_turn_1, history_turn_2

Lesson: the SAME parts produce a different context at every budget. Deciding
what survives the squeeze IS context engineering; the rest of the book is
smarter ways to do it than 'drop the oldest'.
```
Look at what happened across the three budgets. With room to spare (10,000 tokens),
everything goes in. At 600 tokens the retrieved document and the tool definitions get cut,
and notice the cost of that crude policy: dropping retrieved_doc may have thrown away the
exact thing the user's question needed, while keeping three turns of older chat that did
not. At 300 tokens only the most recent turn survives. The same raw material produced three
different contexts, and the quality of each depends entirely on the policy that chose what
to keep.
That policy is the subject of the book. "Drop the oldest" is the dumbest possible version. Every later chapter is a smarter answer to the same question:
- Instead of dropping the document, compress it so it still fits (Chapter 3).
- Instead of dropping old turns, summarize them into a few tokens (Chapter 11).
- Instead of re-sending the stable preamble and re-paying for it, cache it (Chapter 6).
- Instead of guessing which document is relevant, retrieve and rank by the question (Chapter 9 and Chapter 13).
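To give a feel for what "smarter" can mean, here is a minimal sketch of the last idea in that list: choosing what survives by relevance to the question instead of by age. It is an illustration under loud assumptions, not the method of the later chapters. The scoring rule (count of shared words), the function name `assemble_by_relevance`, and the sample parts are all invented for this example; the token estimate is the same crude 1.3-per-word rule used earlier in the chapter.

```python
from dataclasses import dataclass


@dataclass
class Part:
    name: str
    text: str
    pinned: bool = False

    @property
    def tokens(self) -> int:
        # Same crude estimate as the chapter's script: ~1.3 tokens per word.
        return round(len(self.text.split()) * 1.3)


def words(text):
    """Lower-cased words with surrounding punctuation stripped."""
    return {w.strip(".,:;?!()").lower() for w in text.split()} - {""}


def assemble_by_relevance(parts, question, budget):
    """Keep pinned parts, then add evictable parts in order of word overlap
    with the question (highest first), keeping each one that still fits."""
    q = words(question)
    used = sum(p.tokens for p in parts if p.pinned)
    ranked = sorted(
        (p for p in parts if not p.pinned),
        key=lambda p: len(q & words(p.text)),
        reverse=True,
    )
    chosen = set()
    for p in ranked:
        if used + p.tokens <= budget:
            chosen.add(p.name)
            used += p.tokens
    # Re-emit in the original order so the layout of the context is unchanged.
    kept = [p for p in parts if p.pinned or p.name in chosen]
    return kept, used


question = "Why did the last deploy fail?"
parts = [
    Part("system_prompt", "You are a careful coding assistant.", pinned=True),
    Part("deploy_log", "The last deploy ran build.sh and the build step "
                       "did fail on a missing env var."),
    Part("history_turn_1", "User asked about the database schema. Assistant replied."),
    Part("history_turn_2", "User asked about rollback. Assistant explained."),
    Part("user_message", question, pinned=True),
]

kept, used = assemble_by_relevance(parts, question, budget=40)
print([p.name for p in kept], used)
```

With a budget of 40 estimated tokens this keeps `system_prompt`, `deploy_log`, and `user_message` (37 tokens). A drop-the-oldest policy on the same parts would have kept the two history turns and discarded the log, which is the part the question is actually about. Word overlap is a deliberately naive relevance signal; it stands in for whatever ranking a real system uses.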
The two pressures, and the four families
Every technique in this book exists to relieve one of two pressures, and usually it trades a little of one for a lot of the other:
- The window is finite. You cannot fit everything, so you must choose, shrink, or externalize. This is the capacity pressure.
- Tokens cost money and time. Even when it all fits, re-sending and re-generating tokens you did not need is waste. This is the cost pressure, and we make it concrete in the next chapter.
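The cost pressure follows directly from statelessness. If every call re-sends the whole history, the input tokens sent over a session grow roughly with the square of the number of turns, not linearly. The sketch below is a back-of-the-envelope model under stated assumptions: each turn adds the same number of tokens (the figure of 200 is purely illustrative), nothing is cached, and nothing is ever evicted. The function name is hypothetical.

```python
def cumulative_input_tokens(tokens_per_turn, turns):
    """Total input tokens sent over a session when every call re-sends the
    full history and each turn adds `tokens_per_turn` new tokens."""
    total = 0
    context = 0
    for _ in range(turns):
        context += tokens_per_turn  # the context grows by one turn
        total += context            # and the whole context is sent again
    return total


for n in (1, 10, 100):
    print(f"{n:4d} turns -> {cumulative_input_tokens(200, n):9d} input tokens sent")
```

The loop is the sum 1 + 2 + ... + n scaled by the turn size, so the total is `tokens_per_turn * n * (n + 1) / 2`. Under these assumptions, a session ten times longer sends on the order of a hundred times more input tokens.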
The four families of techniques map onto these pressures:
| Family | What it does | Mainly relieves |
|---|---|---|
| Compression (Chapters 3 to 5) | makes each part smaller without losing what matters | capacity and cost |
| Caching (Chapters 6 to 8) | reuses work already done on stable or repeated context | cost |
| Memory (Chapters 9 to 12) | stores state outside the window and re-injects only the relevant slice | capacity |
| Architecture (Chapters 13 and 14) | assembles the right context per turn, and makes long windows tractable at all | capacity and cost |
What "good" looks like
A well-engineered context has four properties, and you can audit any system against them:
- Sufficient. Everything the task needs is present, in a form the model can use.
- Lean. Nothing the task does not need is present. Every token earns its place.
- Cheap to repeat. The stable parts are cached, not re-paid, across the many calls of a session.
- Durable. State that should outlive a single call is stored outside the model and re-injected on demand, so the system does not get amnesia at the window boundary.
Hold these four in mind as you read. Most production failures are a violation of exactly one of them: a missing document (not sufficient), a bloated preamble (not lean), a re-charged prefix (not cheap to repeat), or a forgotten fact (not durable).
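Some of these properties can be checked mechanically. As one example, here is a hedged sketch of an audit for "cheap to repeat". It assumes that reuse depends on consecutive calls sharing an identical leading run of parts, which is how this chapter describes the link between build order and caching; the real mechanics are the subject of Chapter 6. The function name, the sample parts, and the `TIME:` header are hypothetical.

```python
def shared_prefix_parts(previous, current):
    """Number of leading parts that are identical in two consecutive contexts."""
    count = 0
    for old, new in zip(previous, current):
        if old != new:
            break
        count += 1
    return count


stable = [
    "SYSTEM: You are a careful coding assistant.",
    "TOOLS: search, read_file",
]
turn_1 = stable + ["user: why did the deploy fail?"]
turn_2 = turn_1 + ["assistant: build.sh exited 1", "user: how do I fix it?"]

# Same content, but a value that changes on every call is placed first.
bad_1 = ["TIME: 10:00:01"] + turn_1
bad_2 = ["TIME: 10:00:07"] + turn_2

good_layout = shared_prefix_parts(turn_1, turn_2)
bad_layout = shared_prefix_parts(bad_1, bad_2)
print(good_layout)  # 3
print(bad_layout)   # 0
```

In the first layout every part of the earlier call is still an unchanged prefix of the later one. In the second, a single changing line at the front leaves no shared prefix at all, even though everything after it is identical.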
Takeaways
- The model is a pure function of its context; between calls it remembers nothing. Any "memory" is state your code keeps and re-injects.
- The context is rebuilt every turn from named parts, and it must fit a fixed window. When it does not, something gets cut, and the policy that chooses is where quality lives.
- Two pressures drive everything: finite capacity and per-token cost. The four families (compression, caching, memory, architecture) each relieve one or both.
- A good context is sufficient, lean, cheap to repeat, and durable. Most failures are a violation of exactly one of those.
👉 We have been counting tokens with a hand-wave. The next chapter makes the cost real: how tokens are actually counted, why output is the expensive half, and how to put a dollar figure on a context before you ever send it.