The open-source landscape
You have now built a small version of every lever. This chapter is the map from each lever to the real project that implements it at production scale, so that when you have a problem you can reach for the right tool instead of reinventing it. The aim is to leave you able to do two things: name the technique a project belongs to, and say when you would choose it.
A note on links, in the spirit of the rest of this site: the field moves fast and URLs rot, so most entries name the project and, where the address is one I am confident about, give a plain GitHub org or root domain. For the newer or more niche tools, the name is enough to search. Nothing here is an endorsement; pick by fit and by the quote you get today.
The map, by family
Compression: shrink what is sent and written
| Lever | Project(s) | What it does | Reach for it when |
|---|---|---|---|
| Prompt and context compression (Ch 3) | LLMLingua and LLMLingua-2 (Microsoft, github.com/microsoft/LLMLingua); plus newer tools like Headroom and RTK | Prune low-information tokens from a long prompt before sending, coarse-to-fine, keeping the meaning | A long RAG context or heavy few-shot block is blowing the budget |
| Output token reduction (Ch 4) | Provider params (Anthropic effort, structured outputs, stop sequences); output shapers like caveman and Headroom's shaper | Make the model write less: terser answers, schema-only fields, hard caps | Output volume dominates cost, or you only need a label or a JSON field |
| Code and structure-aware context (Ch 5) | CodeCompressor; lean-ctx; tree-sitter; Aider's repo map | Select code by AST or call graph instead of dumping whole files | A coding agent or repo QA system is pasting far more code than the task needs |
Caching: stop paying twice
| Lever | Project(s) | What it does | Reach for it when |
|---|---|---|---|
| KV-cache and prefix caching (Ch 6) | Provider-native prompt caching (Anthropic and others); CacheAligner-style tooling | Reuse the cached computation for a stable prompt prefix across calls, billed at a reduced rate | A large system prompt or document is re-sent on every call in a session |
| Semantic and response caching (Ch 7) | GPTCache (github.com/zilliztech/GPTCache); Redis LangCache (redis.io) | Return a stored answer for an approximately-similar query, skipping inference | Traffic has many near-duplicate questions (FAQ, support, repeated analytics) |
| KV-cache serving optimization (Ch 8) | vLLM PagedAttention (github.com/vllm-project/vllm); SGLang RadixAttention (github.com/sgl-project/sglang) | Page KV memory and share prefix blocks across concurrent requests in the engine | You run your own inference server and need throughput and memory efficiency |
Memory and state
| Lever | Project(s) | What it does | Reach for it when |
|---|---|---|---|
| Agent memory and persistence (Ch 9) | Mem0 (github.com/mem0ai/mem0); Letta (formerly MemGPT, github.com/letta-ai/letta); Zep (github.com/getzep/zep) | Extract, store, retrieve, and invalidate facts across turns and sessions | The assistant must remember users and prior context across sessions |
| Temporal knowledge graphs (Ch 10) | Graphiti (github.com/getzep/graphiti); Zep | Store facts with validity windows so you can ask what was true at a past time | Facts change over time and point-in-time correctness matters |
| Context-window compaction (Ch 11) | Letta tiered memory; provider compaction and context editing (Anthropic) | Summarize or clear old context when the window fills, keeping the gist | Long agent runs or multi-hour chats that overflow the window |
| Failure and procedural learning (Ch 12) | LangMem (github.com/langchain-ai/langmem); headroom learn | Mine past sessions to rewrite the agent's instructions (CLAUDE.md, AGENTS.md) | The agent keeps repeating the same avoidable mistakes |
Orchestration and architecture
| Lever | Project(s) | What it does | Reach for it when |
|---|---|---|---|
| Context orchestration (Ch 13) | LangGraph (github.com/langchain-ai/langgraph); lean-ctx | Assemble the right context, tools, and state per turn via a graph with branching and state | A multi-tool agent needs to decide which sources to pull on each turn |
| Long-context attention efficiency (Ch 14) | DeepSeek sparse attention (DSA) and MLA (github.com/deepseek-ai); MiniMax lightning attention (github.com/MiniMax-AI) | Cut attention compute (sparse/linear) or KV memory (low-rank latent) so long windows are feasible | You are choosing or serving a model for very long context, or want to understand why 1M-token windows are possible |
Don't be confused. The rows are not alternatives to each other; they are complements. A serious system uses several at once: prefix caching (Ch 6) on the stable preamble, compression (Ch 3) on the retrieved docs, a memory store (Ch 9) for cross-session facts, compaction (Ch 11) when the chat runs long, and an orchestrator (Ch 13) deciding which to apply this turn. The question is rarely "which one"; it is "which combination, in what order".
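To see why the levers stack rather than compete, it helps to price one call by its parts. The sketch below is a toy cost model, not any provider's billing formula: the dollar prices, the 10% cached-read rate, and the names `Call` and `cost` are all assumptions made up for illustration. Only the 5x output-to-input ratio comes from this book. It also ignores any surcharge for writing to the cache.

```python
from dataclasses import dataclass

# Hypothetical prices in dollars per million tokens. Only the 5x ratio
# between output and input is taken from the text; the rest is assumed.
INPUT_PRICE = 3.00
OUTPUT_PRICE = 5 * INPUT_PRICE
CACHED_READ_RATE = 0.10  # assumed: a cached prefix is billed at 10% of input price


@dataclass
class Call:
    preamble_tokens: int   # stable system prompt or document (Ch 6 acts here)
    retrieved_tokens: int  # per-turn retrieved context (Ch 3 acts here)
    output_tokens: int     # what the model writes (Ch 4 acts here)


def cost(call, cache_prefix=False, compression_ratio=1.0, output_ratio=1.0):
    """Dollar cost of one call; each lever touches a different term."""
    preamble_rate = INPUT_PRICE * (CACHED_READ_RATE if cache_prefix else 1.0)
    preamble = call.preamble_tokens * preamble_rate
    retrieved = call.retrieved_tokens * compression_ratio * INPUT_PRICE
    output = call.output_tokens * output_ratio * OUTPUT_PRICE
    return (preamble + retrieved + output) / 1_000_000


call = Call(preamble_tokens=10_000, retrieved_tokens=20_000, output_tokens=1_000)
print(f"{cost(call):.4f}")                     # 0.1050 with no levers
print(f"{cost(call, cache_prefix=True):.4f}")  # 0.0780 with prefix caching only
print(f"{cost(call, cache_prefix=True, compression_ratio=0.5, output_ratio=0.5):.4f}")  # 0.0405 stacked
```

Each lever shrinks a different term of the sum, which is why applying one does not make the others redundant.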
A decision guide
When a context problem shows up, name the symptom first, then the lever:
- "It does not fit." Capacity problem. Compress the biggest part (Ch 3, Ch 5), or move state out to memory and retrieve only the relevant slice (Ch 9), or summarize the old turns (Ch 11).
- "It is too expensive." Cost problem. If the input is large and stable, cache the prefix (Ch 6). If the answer repeats, cache the answer (Ch 7). If the output is verbose, shape it (Ch 4). Remember that an output token costs 5x an input token (Ch 2).
- "It forgets." Durability problem. Add a memory store (Ch 9); if the facts change over time, make it temporal (Ch 10).
- "It repeats the same mistake." Learning problem. Mine the failures and rewrite the instructions (Ch 12).
- "It is slow at high load." Serving problem. Use a paged, prefix-sharing engine (Ch 8), and pick a model whose attention is efficient at your context length (Ch 14).
- "It pulls the wrong things." Routing problem. Put an orchestrator in front that decides per turn what to assemble (Ch 13).
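The guide above is small enough to keep as a lookup table next to your runbook. Here is a minimal sketch of that idea; the symptom keys and the names `Diagnosis`, `GUIDE`, and `diagnose` are my own labels for illustration, and the chapter numbers are copied from the list.

```python
from typing import NamedTuple


class Diagnosis(NamedTuple):
    problem: str     # the kind of problem the symptom points to
    chapters: tuple  # chapters whose levers apply, in the order listed above


# Symptom keys are hypothetical short names for the six bullets.
GUIDE = {
    "does_not_fit": Diagnosis("capacity", (3, 5, 9, 11)),
    "too_expensive": Diagnosis("cost", (6, 7, 4)),
    "forgets": Diagnosis("durability", (9, 10)),
    "repeats_mistake": Diagnosis("learning", (12,)),
    "slow_at_load": Diagnosis("serving", (8, 14)),
    "pulls_wrong_things": Diagnosis("routing", (13,)),
}


def diagnose(symptom):
    """Return the problem type and candidate chapters for a named symptom."""
    try:
        return GUIDE[symptom]
    except KeyError:
        raise ValueError(
            f"unknown symptom {symptom!r}; expected one of {sorted(GUIDE)}"
        ) from None


print(diagnose("too_expensive"))  # Diagnosis(problem='cost', chapters=(6, 7, 4))
```

The table only names candidates. Which of them to apply first still depends on where the tokens actually go in your traffic.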
Build or buy
The from-scratch versions in this book are for understanding, not for production. A real semantic cache needs a vector index, eviction, and persistence; a real memory layer needs durability, concurrency, and access control; a real compressor needs a tuned scorer. The projects above have solved those parts. Build the toy to know what the tool is doing, then use the tool. The one place where "build" often wins is the orchestrator (Ch 13): the routing policy is specific to your application, and a hundred lines of your own control flow is frequently clearer than bending a framework to fit.
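As a rough picture of what "your own control flow" can look like, here is a sketch of a per-turn routing policy written as plain branching. Everything in it is hypothetical: the `Turn` fields, the step names, the 100,000-token window, and the 80% compaction threshold are placeholders for whatever your application actually tracks, and each step name stands in for a call to one of the tools in the map.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    query: str
    history_tokens: int     # size of the conversation so far
    needs_user_facts: bool  # does this turn depend on cross-session memory?
    needs_documents: bool   # does this turn need retrieved documents?


def plan_turn(turn, window_limit=100_000, compaction_threshold=0.8):
    """Decide which context sources to assemble for this turn, in order."""
    # The stable preamble goes first so the prefix stays cacheable (Ch 6).
    steps = ["stable_preamble"]

    # Compact old turns only when the history nears the window (Ch 11).
    if turn.history_tokens > compaction_threshold * window_limit:
        steps.append("compact_history")
    else:
        steps.append("full_history")

    # Pull cross-session facts only when the turn needs them (Ch 9).
    if turn.needs_user_facts:
        steps.append("memory_lookup")

    # Retrieve documents and compress them before sending (Ch 3).
    if turn.needs_documents:
        steps.append("retrieve_and_compress")

    steps.append("user_query")
    return steps


turn = Turn("What did we decide last week?", 90_000, True, False)
print(plan_turn(turn))
# ['stable_preamble', 'compact_history', 'memory_lookup', 'user_query']
```

The policy is a handful of conditions you can read top to bottom, which is the argument for writing it yourself: the conditions are where your application's knowledge lives.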
Takeaways
- Every lever in this book has a production project behind it; the map above is the technique-to-tool lookup.
- The families are complements, not alternatives. Real systems stack caching, compression, memory, and orchestration together.
- Diagnose by symptom: does not fit (compress, externalize, summarize), too expensive (cache prefix or answer, shape output), forgets (memory, temporal), repeats mistakes (procedural learning), slow at load (paged serving, efficient attention), pulls the wrong things (orchestrate).
- Build the toy to understand the tool, then use the tool. The orchestrator is the part most worth writing yourself.
👉 The final page collects the projects, the papers behind them, and a glossary of the terms used throughout, as a reference you can return to.