Context orchestration
The last ten chapters each handed you one lever. Chapter 3 shrinks a part. Chapter 6 caches a stable prefix. Chapter 9 recalls a fact from outside the window. Each is a good answer to "how do I make this one part smaller or cheaper or durable?" None of them answers the question that sits above all of them: on this turn, for this query, which parts do I even need?
That is the job of orchestration. It is the conductor. It looks at the query that just arrived, decides which sources of context this particular query requires, and assembles only those, applying compression or caching or memory recall at the moments they actually help rather than always. A "hi there" should not drag in a 600-token tool catalog and a code index. A stack trace should. Orchestration is the layer that knows the difference and acts on it per turn.
The default is to include everything, and it is wasteful
The naive way to build a context is to gather every source you have (retrieved documents, long-term memory, the tool definitions, the relevant code) and concatenate them, every turn. Call this the kitchen sink. It is simple and, as long as everything fits the budget, it is never wrong in the sense of missing something, because everything is always present. But it is wrong in the sense that matters in this book: it is not lean. Most turns pay for sources they will not use. The model also has to read past all of them, which dilutes its attention and, past a point, degrades the answer.
The fix is to choose. Different queries need different context. A factual lookup needs the retrieved documents and maybe a memory of who is asking; it does not need the code index. A coding request needs the code and the tools; it does not need the refund policy. Small talk needs neither, and the model can answer from the system prompt alone. Once you accept that the right set of sources depends on the query, you need a small machine to make that decision on each turn. That machine is a router, and the cleanest way to build it is as a state graph.
A state graph: nodes, edges, and a decision in the middle
A state graph is a tiny program drawn as boxes and arrows. Each box is a node: a step that takes the current state, does one job, and passes the (possibly updated) state to the next node. Each arrow is an edge: it says which node runs next. A conditional edge is an arrow that branches: it reads the state and picks a different next node depending on what it finds. The state is just the bag of values flowing through (here: the query, its label, the chosen sources, the assembled context). "Durable state" means that bag can be saved to disk between turns and reloaded, so an agent survives a crash or a pause.
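To make "durable state" concrete, here is a minimal sketch using only the Python standard library. The field names mirror the bag described above. The save_state and load_state helpers and the file name router_state.json are hypothetical, chosen for illustration; a graph framework would normally handle this checkpointing for you.

```python
import json
import tempfile
from pathlib import Path
from typing import TypedDict


class RouterState(TypedDict):
    """The bag of values that flows through the graph."""
    query: str
    label: str
    sources: list[str]
    context: str


def save_state(state: RouterState, path: Path) -> None:
    """Write the state bag to disk as JSON (hypothetical helper)."""
    path.write_text(json.dumps(state), encoding="utf-8")


def load_state(path: Path) -> RouterState:
    """Reload a previously saved state bag (hypothetical helper)."""
    return json.loads(path.read_text(encoding="utf-8"))


# State as it might look after the classify and route steps, before assemble.
state: RouterState = {
    "query": "Fix this stack trace in the deploy function",
    "label": "coding",
    "sources": ["code", "tools", "memory"],
    "context": "",
}

checkpoint = Path(tempfile.gettempdir()) / "router_state.json"
save_state(state, checkpoint)

# Simulate a restart: nothing survives in memory, only the file on disk.
restored = load_state(checkpoint)
print(restored == state)  # True
```

Because the bag holds only strings and lists of strings, plain JSON is enough here. State that carries richer objects would need a real serializer.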
Our router is three nodes with one branch in the middle:
             ┌──────────┐    label     ┌─────────┐   sources   ┌──────────┐
query ─────► │ classify │ ───────────► │  route  │ ──────────► │ assemble │ ──► context
             └──────────┘              └────┬────┘             └──────────┘
                                            │ conditional edge
                       ┌────────────────────┼────────────────────┐
                       ▼                    ▼                    ▼
                 coding: code,     factual: retrieval,       chitchat:
                 tools, memory           memory              (nothing)
- classify reads the query and labels it (coding, factual, or chitchat).
- route is the conditional edge. Given the label, it picks the set of sources to include. This single decision is the whole point of orchestration.
- assemble packs the chosen sources into the token budget, in priority order, skipping any that would overflow. This is the same packing from Chapter 1, but now operating on a selected set instead of on everything.
Read that diagram against the kitchen sink: the kitchen sink skips the branch entirely and always lands on "include all four sources." The router keeps the branch, and the branch is where the tokens get saved.
The router in code
Here is the whole thing. Four sources, each with a token cost. A keyword classifier (kept
deliberately simple and deterministic; a real system might use a small model or the main model
itself, but the routing logic is identical). A ROUTES table that states the whole routing policy
in three entries. Then two assemblers, kitchen_sink and routed, run over the same five queries
so you can read the difference directly.
"""A small state-graph context router.
Earlier chapters each gave you ONE lever: compress a part, cache a prefix,
recall a memory. Orchestration is the layer above them. On each turn it looks
at the query, decides WHICH sources of context this particular query needs, and
assembles only those under a token budget. The query "hey what's up" should not
drag in a 600-token tool catalog; the query "fix this stack trace" should.
This script builds that router as a tiny GRAPH of three nodes:
classify -> route -> assemble
- classify: read the query, label it (factual / coding / chitchat).
- route: given the label, pick the set of sources to include.
- assemble: pack the chosen sources into a token budget.
We then compare two assemblers over the same queries:
- KITCHEN SINK: always include every source (the naive default).
- ROUTED: include only what the query type asked for.
Only depends on the Python standard library. Run: python3 orchestrate.py
"""
from dataclasses import dataclass, field


def est_tokens(text: str) -> int:
    """Rough token estimate: ~1.3 tokens per English word (see chapter 2)."""
    return round(len(text.split()) * 1.3)


# ---------------------------------------------------------------------------
# SOURCES: each is a named place context can come from, with a token cost.
# A "source" is just a producer of context: a retriever, a memory store, a
# tool catalog, a code indexer. In a real system each call is expensive (a DB
# query, an embedding lookup). Here each holds a fixed blob so the cost is
# visible. The point is not the text but its SIZE and whether a turn needs it.
# ---------------------------------------------------------------------------
@dataclass
class Source:
    name: str
    text: str

    @property
    def tokens(self) -> int:
        return est_tokens(self.text)


SOURCES = {
    "retrieval": Source(
        "retrieval",
        "Retrieved docs: The refund policy allows returns within 30 days. "
        "Shipping is free over fifty dollars. Support hours are 9 to 5. " * 12,
    ),
    "memory": Source(
        "memory",
        "Long-term memory: user prefers terse answers, is on the Pro plan, "
        "lives in Berlin, asked about invoices twice before. " * 8,
    ),
    "tools": Source(
        "tools",
        "Tool definitions: search(query) read_file(path) write_file(path,text) "
        "run_tests() git_diff() open_pr(title,body) list_dir(path). " * 14,
    ),
    "code": Source(
        "code",
        "Code context: def process(order): validate(order); charge(order.card); "
        "return Receipt(order.id)  # plus 200 lines of the surrounding module. " * 16,
    ),
}

# ---------------------------------------------------------------------------
# CLASSIFY node: label the query by simple keyword rules. A real system might
# use a tiny classifier or the model itself; keywords keep the demo readable
# and deterministic. The label is the only thing the router needs.
# ---------------------------------------------------------------------------
CODING_WORDS = {"bug", "stack", "trace", "function", "code", "test", "error",
                "deploy", "refactor", "import", "exception", "fix"}
FACTUAL_WORDS = {"what", "when", "how", "policy", "refund", "hours", "price",
                 "where", "who", "does", "is"}
CHITCHAT_WORDS = {"hi", "hey", "hello", "thanks", "thank", "lol", "cool",
                  "nice", "ok", "okay", "bye"}


def classify(query: str) -> str:
    """Return one of: 'coding', 'factual', 'chitchat'. First match wins, in
    priority order, so a coding question that also contains 'what' stays coding."""
    words = {w.strip(".,!?").lower() for w in query.split()}
    if words & CODING_WORDS:
        return "coding"
    if words & FACTUAL_WORDS:
        return "factual"
    if words & CHITCHAT_WORDS:
        return "chitchat"
    return "factual"  # safe default: when unsure, allow a lookup


# ---------------------------------------------------------------------------
# ROUTE node: map a label to the set of sources that label needs. THIS is the
# orchestration decision. Chitchat needs nothing heavy. A factual lookup needs
# retrieval (and memory, to personalize). Coding needs the code and the tools,
# but not the refund policy. Routing is choosing this set, per turn.
# ---------------------------------------------------------------------------
ROUTES = {
    "coding": ["code", "tools", "memory"],
    "factual": ["retrieval", "memory"],
    "chitchat": [],  # answer from the system prompt alone; pull in nothing
}


def route(label: str) -> list:
    """Given a query label, return the ordered list of source names to assemble."""
    return ROUTES[label]


# ---------------------------------------------------------------------------
# ASSEMBLE node: pack chosen sources into a token budget, in priority order,
# skipping any that would overflow. This is the same packing idea as chapter 1,
# but now operating on a SELECTED set rather than everything.
# ---------------------------------------------------------------------------
@dataclass
class Assembled:
    label: str
    included: list = field(default_factory=list)
    skipped: list = field(default_factory=list)
    tokens: int = 0


def assemble(source_names, budget) -> Assembled:
    out = Assembled(label="")
    for name in source_names:
        src = SOURCES[name]
        if out.tokens + src.tokens <= budget:
            out.included.append(name)
            out.tokens += src.tokens
        else:
            out.skipped.append(name)
    return out


# ---------------------------------------------------------------------------
# The two assemblers under test.
# ---------------------------------------------------------------------------
def kitchen_sink(query, budget) -> Assembled:
    """Naive default: ignore the query, include every source every time."""
    res = assemble(list(SOURCES.keys()), budget)
    res.label = classify(query)  # label only for reporting; not used to route
    return res


def routed(query, budget) -> Assembled:
    """classify -> route -> assemble: include only what this query needs."""
    label = classify(query)
    chosen = route(label)
    res = assemble(chosen, budget)
    res.label = label
    return res


QUERIES = [
    "What is your refund policy?",
    "Fix this stack trace in the deploy function",
    "hey thanks, that was nice",
    "When are your support hours?",
    "There is a bug in the test, the import throws an exception",
]
BUDGET = 4000


def fmt(names):
    return ", ".join(names) if names else "(nothing)"


print("=== Per-query: KITCHEN SINK vs ROUTED ===")
print(f"(budget {BUDGET} tokens per turn)\n")
ks_total = 0
rt_total = 0
for q in QUERIES:
    ks = kitchen_sink(q, BUDGET)
    rt = routed(q, BUDGET)
    ks_total += ks.tokens
    rt_total += rt.tokens
    print(f"query: {q!r}")
    print(f"  classified as : {rt.label}")
    print(f"  kitchen sink  : {ks.tokens:5d} tok [{fmt(ks.included)}]")
    print(f"  routed        : {rt.tokens:5d} tok [{fmt(rt.included)}]")
    print(f"  saved         : {ks.tokens - rt.tokens:5d} tok")
    print()

print("=== Totals across all queries ===")
pct = 100 * (ks_total - rt_total) / ks_total
print(f"  kitchen sink total : {ks_total:6d} tok")
print(f"  routed total       : {rt_total:6d} tok")
print(f"  saved              : {ks_total - rt_total:6d} tok ({pct:.0f}% smaller)")
print()
print("Each routed turn still carries the RIGHT source: the coding queries get")
print("'code' and 'tools', the factual queries get 'retrieval', and chitchat")
print("pulls in nothing heavy. Routing spends tokens where the query needs them.")

Running it:
=== Per-query: KITCHEN SINK vs ROUTED ===
(budget 4000 tokens per turn)

query: 'What is your refund policy?'
  classified as : factual
  kitchen sink  :  1038 tok [retrieval, memory, tools, code]
  routed        :   541 tok [retrieval, memory]
  saved         :   497 tok

query: 'Fix this stack trace in the deploy function'
  classified as : coding
  kitchen sink  :  1038 tok [retrieval, memory, tools, code]
  routed        :   695 tok [code, tools, memory]
  saved         :   343 tok

query: 'hey thanks, that was nice'
  classified as : chitchat
  kitchen sink  :  1038 tok [retrieval, memory, tools, code]
  routed        :     0 tok [(nothing)]
  saved         :  1038 tok

query: 'When are your support hours?'
  classified as : factual
  kitchen sink  :  1038 tok [retrieval, memory, tools, code]
  routed        :   541 tok [retrieval, memory]
  saved         :   497 tok

query: 'There is a bug in the test, the import throws an exception'
  classified as : coding
  kitchen sink  :  1038 tok [retrieval, memory, tools, code]
  routed        :   695 tok [code, tools, memory]
  saved         :   343 tok

=== Totals across all queries ===
  kitchen sink total :   5190 tok
  routed total       :   2472 tok
  saved              :   2718 tok (52% smaller)

Each routed turn still carries the RIGHT source: the coding queries get
'code' and 'tools', the factual queries get 'retrieval', and chitchat
pulls in nothing heavy. Routing spends tokens where the query needs them.

Look at what routing bought, and notice it is two things at once, not one. First, it is
smaller: 2,472 tokens against 5,190, a 52% cut over the five queries, and the chitchat turn
drops from 1,038 tokens to zero because none of the heavy sources earn their place in "hey
thanks." Second, and this is the part a blunt token cut would get wrong, every routed turn
still carries the right source. The two coding queries both pull in code and tools; the
two factual queries both pull in retrieval. The router did not save tokens by starving the
query. It saved tokens by not loading the sources this query never needed. That is the
distinction between orchestration and crude truncation: truncation cuts to hit a number,
orchestration cuts what is irrelevant and keeps what is not.
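The demo's 4,000-token budget never forces a skip, so it hides one more cost of the kitchen sink: when the budget is tight, greedy packing keeps whatever comes first in the list, whether or not this query needs it. The sketch below reuses the per-source costs that est_tokens yields for the demo's four blobs (343, 198, 164, and 333 tokens, which sum to the 1,038 shown above). The 600-token budget is an assumption picked to force an overflow, and pack is a stripped-down stand-in for the demo's assemble.

```python
# Per-source costs as the demo's est_tokens() computes them for its four blobs.
COSTS = {"retrieval": 343, "memory": 198, "tools": 164, "code": 333}

ROUTES = {
    "coding": ["code", "tools", "memory"],
    "factual": ["retrieval", "memory"],
    "chitchat": [],
}


def pack(names, budget):
    """Greedy packing in list order, skipping any source that would overflow."""
    included, skipped, used = [], [], 0
    for name in names:
        if used + COSTS[name] <= budget:
            included.append(name)
            used += COSTS[name]
        else:
            skipped.append(name)
    return included, skipped, used


TIGHT_BUDGET = 600  # ASSUMPTION for illustration; the demo itself uses 4000

# The same coding query, packed two ways under the tight budget.
sink = pack(list(COSTS), TIGHT_BUDGET)          # kitchen sink: every source, dict order
routed = pack(ROUTES["coding"], TIGHT_BUDGET)   # routed: only the coding sources

print("kitchen sink:", sink)
# kitchen sink: (['retrieval', 'memory'], ['tools', 'code'], 541)
print("routed      :", routed)
# routed      : (['code', 'tools'], ['memory'], 497)
```

Under this budget the kitchen sink spends its tokens on the refund policy and drops the code and tools a coding query needs, while the routed turn keeps both and sheds only memory. Selecting before packing decides which sources compete for the budget in the first place.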
The conductor invokes the earlier levers, at the right moment
The router above only selects sources, but the same node-and-branch structure is where every earlier chapter plugs in. Orchestration is the conductor that decides when to use each lever, instead of using it always:
- Retrieve, or not? The cheapest retrieval is the one you skip. A chitchat turn routes to zero sources, so the router never runs the retriever at all. A "RAG router" is exactly this: a branch that decides whether the query even needs a document lookup before paying for one.
- Compress, but only when it would overflow. The assemble node can call the compressor from Chapter 3 on a source only when including it whole would blow the budget, rather than compressing on every turn.
- Cache, on the stable branches. Sources that recur turn after turn (the tool catalog on a coding agent) are the prefix-cache candidates from Chapter 6. The router decides which branch is stable enough to cache.
- Recall memory, when the query is about the user. The factual and coding routes pull in memory; the chitchat route does not. The router is what turns the memory store from Chapter 9 into a conditional read instead of an always-on tax.
So the four families are not a flat menu you apply uniformly. They are levers the conductor pulls per turn, on the branch where each one pays off.
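Two of those levers can be sketched in a few lines: fetching a source only when the branch asks for it, and compressing only when a source would overflow. Everything here is illustrative. The producers are fakes that record when they run, the word counts are arbitrary, and truncate_compressor is a crude stand-in for a real compressor, not the Chapter 3 method.

```python
from typing import Callable


def est_tokens(text: str) -> int:
    """Same rough estimate as the demo: ~1.3 tokens per word."""
    return round(len(text.split()) * 1.3)


def truncate_compressor(text: str, max_tokens: int) -> str:
    """Stand-in compressor: keep the leading words that fit the token limit.
    A real system would summarize or prune instead."""
    keep = int(max_tokens / 1.3)
    return " ".join(text.split()[:keep])


calls: list[str] = []  # records which producers actually ran


def make_producer(name: str, words: int) -> Callable[[], str]:
    """Build a fake source whose fetch is recorded when it runs."""
    def produce() -> str:
        calls.append(name)
        return " ".join([name] * words)
    return produce


PRODUCERS = {
    "retrieval": make_producer("retrieval", 300),
    "memory": make_producer("memory", 100),
    "code": make_producer("code", 400),
}
ROUTES = {
    "factual": ["retrieval", "memory"],
    "coding": ["code", "memory"],
    "chitchat": [],
}


def assemble(label: str, budget: int) -> dict[str, int]:
    """Fetch only the routed sources; compress one only if it would overflow."""
    used, report = 0, {}
    for name in ROUTES[label]:
        text = PRODUCERS[name]()          # lazy: unrouted producers never run
        remaining = budget - used
        if est_tokens(text) > remaining:  # compress on overflow only
            text = truncate_compressor(text, remaining)
        report[name] = est_tokens(text)
        used += report[name]
    return report


print(assemble("chitchat", 450), calls)
# {} []
print(assemble("factual", 450), calls)
# {'retrieval': 390, 'memory': 60} ['retrieval', 'memory']
```

The chitchat turn never touches a producer, so the retriever's cost is skipped entirely. On the factual turn, retrieval fits whole and only memory, the source that would have overflowed, is compressed.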
Don't be confused. Three things sit near each other and are easy to merge. The prompt is the wording handed to the model: the system instruction, the phrasing of the task. Orchestration is not the prompt; it is the decision about which sources, tools, and state get assembled into the context this turn, before any wording is finalized. A linear chain is the third thing: a fixed pipeline of steps that always runs in the same order, with no branch and no saved state (retrieve, then stuff, then generate, every time, for every query). Orchestration is a chain that branches: it reads the query and routes to different context-gathering steps, and it can carry durable state across turns. You can have a great prompt inside a dumb linear chain that retrieves on a "hello," and you can have a smart orchestrator that picks the right sources and then hands them to a mediocre prompt. They are different layers, and a serious system needs all three to be good.
The real projects
Two pieces of the open-source landscape (Chapter 15) make this concrete.
LangGraph builds agents as exactly the graph in this chapter: you declare nodes (each a function over the shared state), wire them with edges, and use conditional edges to route to different context-gathering steps based on what the state holds. Its defining features are the explicit control flow (you can see and test every branch) and durable state (the state bag is checkpointed, so an agent can pause, survive a restart, and resume mid-graph). A multi-tool agent in LangGraph is a router node that picks which tool's context to load next; a RAG router is a conditional edge that decides whether to retrieve at all.
lean-ctx focuses on the assemble node: lean per-turn context assembly, building the
smallest sufficient context for the current query rather than the kitchen sink. It is the same
"select, then pack under budget" idea, treated as a first-class step.
The shared lesson is the framing of this whole chapter. Compression, caching, and memory are the instruments. Orchestration is the conductor that decides which to play, and when, for the query in front of it.
Using the real tool: commands and before/after proof
The router above is a from-scratch state graph so you can see every part. LangGraph is the
same shape with the bookkeeping handled for you: you declare nodes and edges, and it runs them
and threads the state through. Install it with pip (the package name is langgraph, all
lowercase):
pip install langgraph
Here is the router as a real LangGraph graph. The state is a typed dictionary (a TypedDict):
a plain dict whose keys and value types are spelled out, so each node knows what is in the bag
flowing through. classify and assemble are nodes (each takes the state and returns the
keys it changed). route is not a node: it is the function the conditional edge calls to pick
the next step. We wire START → classify, then add_conditional_edges from classify using
route plus a path_map that maps each label to the assemble node, and assemble → END. The
LangGraph parts are from langgraph.graph import StateGraph, START, END; the rest is ordinary
Python. (Follow-along: LangGraph is not installed on this box, so the output below is labeled
expected, not measured here.)
from typing import TypedDict

from langgraph.graph import StateGraph, START, END


# The state is the bag of values that flows through the graph.
class RouterState(TypedDict):
    query: str
    label: str
    sources: list[str]
    context: str


# The routing policy: which sources each query type needs.
ROUTES = {
    "coding": ["code", "tools", "memory"],
    "factual": ["retrieval", "memory"],
    "chitchat": [],
}


def classify(state: RouterState) -> dict:
    q = state["query"].lower()
    if any(w in q for w in ("fix", "bug", "exception", "stack trace")):
        label = "coding"
    elif any(w in q for w in ("policy", "hours", "refund", "when")):
        label = "factual"
    else:
        label = "chitchat"
    return {"label": label}


def route(state: RouterState) -> str:
    # The conditional edge calls this and uses the return value as the key.
    return state["label"]


def assemble(state: RouterState) -> dict:
    sources = ROUTES[state["label"]]
    return {"sources": sources, "context": "\n".join(sources)}


builder = StateGraph(RouterState)
builder.add_node("classify", classify)
builder.add_node("assemble", assemble)
builder.add_edge(START, "classify")
builder.add_conditional_edges(
    "classify",
    route,
    # path_map: route's return value -> the next node to run.
    {"coding": "assemble", "factual": "assemble", "chitchat": "assemble"},
)
builder.add_edge("assemble", END)
graph = builder.compile()

result = graph.invoke({"query": "Fix this stack trace in the deploy function"})
print(result["label"], "->", result["sources"])

Expected output (not measured here):
coding -> ['code', 'tools', 'memory']
That is the on-box from-scratch router (the verified demo above) wearing LangGraph's clothes:
the labels, the ROUTES table, and the branch are identical. What LangGraph adds is the graph
runtime and the durable state from the start of the chapter, so the same router can checkpoint
and resume.
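One caveat about the LangGraph classify above: it matches substrings rather than whole words, so "fix" also fires inside "prefix". The from-scratch demo avoids this by comparing whole words. The sketch below shows the difference and one possible tightening with word boundaries that still supports the two-word phrase "stack trace". The word_match function and the second query are hypothetical, added only for illustration.

```python
import re

CODING_TERMS = ("fix", "bug", "exception", "stack trace")


def substring_match(query: str) -> bool:
    """The matching style used in the LangGraph classify above."""
    q = query.lower()
    return any(term in q for term in CODING_TERMS)


# One pattern with \b word boundaries; re.escape keeps each term literal.
CODING_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(t) for t in CODING_TERMS) + r")\b"
)


def word_match(query: str) -> bool:
    """Hypothetical stricter variant: terms must appear as whole words."""
    return CODING_PATTERN.search(query.lower()) is not None


for q in ("Fix this stack trace", "What prefix do invoice numbers use?"):
    print(f"{q!r}: substring={substring_match(q)} word={word_match(q)}")
# 'Fix this stack trace': substring=True word=True
# 'What prefix do invoice numbers use?': substring=True word=False
```

A misrouted query costs tokens in one direction (a factual question dragging in code and tools) and quality in the other (a coding question losing them), so the classifier deserves its own tests whichever matching rule you choose.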
The before/after proof: input tokens per turn
The metric that shows orchestration paying off is input tokens per turn: how many tokens
the model has to read on each request, before it writes a single token back. The kitchen sink
sends every source every turn; the router sends only what it selected. To prove the gap with
real numbers, count the tokens in each context with the Anthropic SDK rather than guessing.
client.messages.count_tokens(...) returns an object whose .input_tokens is the API's own count
for that prompt. Treat it as a close estimate of what the request will be billed for, not a
guarantee, since the count on the actual request can differ by a small amount:
from anthropic import Anthropic

# Anthropic() reads the API key from the ANTHROPIC_API_KEY environment variable.
client = Anthropic()


def turn_tokens(context: str, query: str) -> int:
    """Count the input tokens for one turn: the assembled context plus the query."""
    resp = client.messages.count_tokens(
        model="claude-opus-4-8",
        messages=[{"role": "user", "content": f"{context}\n\n{query}"}],
    )
    return resp.input_tokens


ALL_SOURCES = ["retrieval", "memory", "tools", "code"]


def build(sources: list[str]) -> str:
    """Stub context: one placeholder tag pair per source. Swap in the real source text."""
    return "\n".join(f"<{s}>...</{s}>" for s in sources)


# classify and ROUTES are the ones defined in the LangGraph listing above.
for query in ["Fix this stack trace", "What is your refund policy?", "thanks!"]:
    label = classify({"query": query})["label"]
    routed = turn_tokens(build(ROUTES[label]), query)
    sink = turn_tokens(build(ALL_SOURCES), query)
    print(f"{label:8} kitchen sink {sink:>5} routed {routed:>5}")

coding   kitchen sink  1024 routed   695
factual  kitchen sink  1024 routed   512
chitchat kitchen sink  1024 routed     8

(Token counts illustrative/expected, not measured here: the anthropic SDK is not installed
on this box and the source bodies above are stubs. Run it with a real key and full sources to
get the actual numbers.) The shape is the point. The kitchen sink reads about 1,000 tokens
every turn no matter what was asked. The routed turn reads only the sources the branch chose:
about two-thirds of that for a coding query, half for a factual one, and near zero for
chitchat, because the model can answer "thanks!" from the system prompt alone. Critically, the
routed coding turn still carries code and tools and the routed factual turn still carries
retrieval: the cut came from dropping the sources the query never needed, not from starving
the query. That is the same result the verified from-scratch demo proved on this box (52%
smaller, right source kept); the count_tokens recipe is how you put a measured number on it
for your own sources.
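How much routing saves overall depends on the mix of queries you actually receive, not only on the per-turn gap. A minimal sketch of that arithmetic follows, using the illustrative counts from the table above. The traffic mix is purely an assumption for the example; substitute the shares from your own logs.

```python
# Illustrative per-turn counts (kitchen sink, routed), copied from the table above.
COUNTS = {
    "coding": (1024, 695),
    "factual": (1024, 512),
    "chitchat": (1024, 8),
}

# ASSUMPTION: share of traffic per query type. Replace with your own logs.
MIX = {"coding": 0.5, "factual": 0.3, "chitchat": 0.2}


def pct_saved(sink: float, routed: float) -> float:
    """Percentage of input tokens the routed turn avoids."""
    return 100 * (sink - routed) / sink


for label, (sink, routed) in COUNTS.items():
    print(f"{label:8} saved {pct_saved(sink, routed):5.1f}%")

# Expected tokens per turn, weighting each query type by its share of traffic.
avg_sink = sum(MIX[k] * COUNTS[k][0] for k in COUNTS)
avg_routed = sum(MIX[k] * COUNTS[k][1] for k in COUNTS)
print(f"blended  saved {pct_saved(avg_sink, avg_routed):5.1f}%")
# coding   saved  32.1%
# factual  saved  50.0%
# chitchat saved  99.2%
# blended  saved  50.9%
```

A workload dominated by coding turns would land nearer the 32% end, and one dominated by small talk nearer 99%, so measure the mix before quoting a single savings figure.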
In Claude Code
Claude Code is an orchestrator in this exact sense, not a single flat prompt. Two of its moves are orchestration in practice. First, it spawns subagents with the Task tool: when a job has an independent sub-part (search this part of the repo, run and read the tests), it hands that sub-part to a fresh subagent that gets only its own focused context and reports back a short result, instead of pouring every intermediate file into the main window. Second, it reads just the files a step needs: it greps and opens the specific lines relevant to the current edit rather than loading the whole codebase up front. Both are the router's lesson applied to a real agent: decide what this step needs, assemble only that, and keep the rest out of the window.
Takeaways
- Orchestration is the layer above the individual levers: per turn, it decides which sources, tools, and state to assemble, instead of including everything.
- The kitchen sink (always include every source) is never missing anything but is rarely lean. Routing makes the context smaller and keeps the source each query actually needs.
- Build the router as a state graph: a classify node, a route node that branches on the label (a conditional edge), and an assemble node that packs the chosen sources under budget.
- In the demo, routing cut total context by 52%, dropped a chitchat turn to zero heavy tokens, and still gave every coding query its code and tools and every factual query its documents.
- The conductor invokes the earlier levers at the right moment: retrieve only when the query needs a lookup, compress only on overflow, cache the stable branches, recall memory only when the query is about the user.
- Orchestration is not the prompt (the wording) and not a linear chain (a fixed pipeline with no branch or state). It is the branching, stateful layer that chooses what the prompt gets built from. LangGraph and lean-ctx are the real-world versions.
👉 Routing keeps each turn's context as small as it can be, but some tasks genuinely need a long window: a whole codebase, a long transcript, a book. Chapter 14 goes inside the model to ask why long contexts are expensive in the first place, and what makes attention tractable when the window is huge.