The open-source landscape

You have now built a small version of every lever. This chapter is the map from each lever to the real project that implements it at production scale, so that when you have a problem you can reach for the right tool instead of reinventing it. The aim is to leave you able to do two things: name the technique a project belongs to, and say when you would choose it.

A note on links, in the spirit of the rest of this site: the field moves fast and URLs rot, so most entries name the project and, where the address is one I am confident about, give a plain GitHub org or root domain. For the newer or more niche tools, the name is enough to search. Nothing here is an endorsement; pick by fit and by the quote you get today.

The map, by family

Compression: shrink what is sent and written

| Lever | Project(s) | What it does | Reach for it when |
| --- | --- | --- | --- |
| Prompt and context compression (Ch 3) | LLMLingua and LLMLingua-2 (Microsoft, github.com/microsoft/LLMLingua); plus newer tools like Headroom and RTK | Prune low-information tokens from a long prompt before sending, coarse-to-fine, keeping the meaning | A long RAG context or heavy few-shot block is blowing the budget |
| Output token reduction (Ch 4) | Provider params (Anthropic effort, structured outputs, stop sequences); output shapers like caveman and Headroom's shaper | Make the model write less: terser answers, schema-only fields, hard caps | Output volume dominates cost, or you only need a label or a JSON field |
| Code and structure-aware context (Ch 5) | CodeCompressor; lean-ctx; tree-sitter; Aider's repo map | Select code by AST or call graph instead of dumping whole files | A coding agent or repo QA system is pasting far more code than the task needs |

Caching: stop paying twice

| Lever | Project(s) | What it does | Reach for it when |
| --- | --- | --- | --- |
| KV-cache and prefix caching (Ch 6) | Provider-native prompt caching (Anthropic and others); CacheAligner-style tooling | Reuse the cached computation and charge for a stable prompt prefix across calls | A large system prompt or document is re-sent on every call in a session |
| Semantic and response caching (Ch 7) | GPTCache (github.com/zilliztech/GPTCache); Redis LangCache (redis.io) | Return a stored answer for an approximately-similar query, skipping inference | Traffic has many near-duplicate questions (FAQ, support, repeated analytics) |
| KV-cache serving optimization (Ch 8) | vLLM PagedAttention (github.com/vllm-project/vllm); SGLang RadixAttention (github.com/sgl-project/sglang) | Page KV memory and share prefix blocks across concurrent requests in the engine | You run your own inference server and need throughput and memory efficiency |

Memory and state

| Lever | Project(s) | What it does | Reach for it when |
| --- | --- | --- | --- |
| Agent memory and persistence (Ch 9) | Mem0 (github.com/mem0ai/mem0); Letta (formerly MemGPT, github.com/letta-ai/letta); Zep (github.com/getzep/zep) | Extract, store, retrieve, and invalidate facts across turns and sessions | The assistant must remember users and prior context across sessions |
| Temporal knowledge graphs (Ch 10) | Graphiti (github.com/getzep/graphiti); Zep | Store facts with validity windows so you can ask what was true at a past time | Facts change over time and point-in-time correctness matters |
| Context-window compaction (Ch 11) | Letta tiered memory; provider compaction and context editing (Anthropic) | Summarize or clear old context when the window fills, keeping the gist | Long agent runs or multi-hour chats that overflow the window |
| Failure and procedural learning (Ch 12) | LangMem (github.com/langchain-ai/langmem); headroom learn | Mine past sessions to rewrite the agent's instructions (CLAUDE.md, AGENTS.md) | The agent keeps repeating the same avoidable mistakes |

Orchestration and architecture

| Lever | Project(s) | What it does | Reach for it when |
| --- | --- | --- | --- |
| Context orchestration (Ch 13) | LangGraph (github.com/langchain-ai/langgraph); lean-ctx | Assemble the right context, tools, and state per turn via a graph with branching and state | A multi-tool agent needs to decide which sources to pull on each turn |
| Long-context attention efficiency (Ch 14) | DeepSeek sparse attention (DSA) and MLA (github.com/deepseek-ai); MiniMax lightning attention (github.com/MiniMax-AI) | Cut attention compute (sparse/linear) or KV memory (low-rank latent) so long windows are feasible | You are choosing or serving a model for very long context, or want to understand why 1M windows are possible |

Don't be confused. The rows are not alternatives to each other; they are complements. A serious system uses several at once: prefix caching (Ch 6) on the stable preamble, compression (Ch 3) on the retrieved docs, a memory store (Ch 9) for cross-session facts, compaction (Ch 11) when the chat runs long, and an orchestrator (Ch 13) deciding which to apply this turn. The question is rarely "which one"; it is "which combination, in what order".
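To make the stacking concrete, here is a toy sketch of one turn that applies several levers in sequence. Every name in it (`Turn`, `assemble`, `compress`, `compact`) and the word-count token estimate are hypothetical stand-ins invented for this illustration, not the API of any project in the tables, and the ordering is one plausible choice rather than a prescription.

```python
from dataclasses import dataclass


def count_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def compress(doc: str, keep_ratio: float = 0.5) -> str:
    """Toy compressor (Ch 3 stand-in): keep the first keep_ratio of the words."""
    words = doc.split()
    return " ".join(words[: max(1, int(len(words) * keep_ratio))])


def compact(history: list[str], keep_last: int = 2) -> list[str]:
    """Toy compaction (Ch 11 stand-in): replace old turns with a one-line marker."""
    if len(history) <= keep_last:
        return history
    dropped = len(history) - keep_last
    return [f"[summary of {dropped} earlier turns]"] + history[-keep_last:]


@dataclass
class Turn:
    preamble: str            # stable prefix, the candidate for prefix caching (Ch 6)
    memory_facts: list[str]  # cross-session facts from a memory store (Ch 9)
    docs: list[str]          # retrieved documents, compressed before use (Ch 3)
    history: list[str]       # prior chat turns, compacted only if needed (Ch 11)
    question: str


def assemble(turn: Turn, budget: int) -> list[str]:
    """Build the prompt parts for one turn, stable content first."""
    docs = [compress(d) for d in turn.docs]
    history = turn.history
    parts = [turn.preamble, *turn.memory_facts, *docs, *history, turn.question]
    # Compact the chat history only when the assembled prompt exceeds the budget.
    if sum(count_tokens(p) for p in parts) > budget:
        history = compact(history)
        parts = [turn.preamble, *turn.memory_facts, *docs, *history, turn.question]
    return parts
```

The preamble goes first and is left untouched so that it stays byte-identical across calls, which is what a prefix cache needs; everything that varies per turn comes after it.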

A decision guide

When a context problem shows up, name the symptom first, then the lever:

  • "It does not fit." Capacity problem. Compress the biggest part (Ch 3, Ch 5), or move state out to memory and retrieve only the relevant slice (Ch 9), or summarize the old turns (Ch 11).
  • "It is too expensive." Cost problem. If the input is large and stable, cache the prefix (Ch 6). If the answer repeats, cache the answer (Ch 7). If the output is verbose, shape it (Ch 4). Remember that an output token costs 5x an input token (Ch 2).
  • "It forgets." Durability problem. Add a memory store (Ch 9); if the facts change over time, make it temporal (Ch 10).
  • "It repeats the same mistake." Learning problem. Mine the failures and rewrite the instructions (Ch 12).
  • "It is slow at high load." Serving problem. Use a paged, prefix-sharing engine (Ch 8), and pick a model whose attention is efficient at your context length (Ch 14).
  • "It pulls the wrong things." Routing problem. Put an orchestrator in front that decides per turn what to assemble (Ch 13).
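The list above is small enough to keep as a lookup table. The sketch below transcribes it into a Python dictionary; the names `DECISION_GUIDE` and `diagnose` and the exact symptom keys are assumptions made for this example, and the chapter numbers are copied from the bullets.

```python
# Symptom -> (problem type, [(lever, chapters)]), transcribed from the list above.
# The key strings and the function name are illustrative choices, not a standard.
DECISION_GUIDE: dict[str, tuple[str, list[tuple[str, list[int]]]]] = {
    "does not fit": ("capacity", [
        ("compress the biggest part", [3, 5]),
        ("move state out to memory and retrieve the relevant slice", [9]),
        ("summarize the old turns", [11]),
    ]),
    "too expensive": ("cost", [
        ("cache the stable prefix", [6]),
        ("cache the repeated answer", [7]),
        ("shape verbose output", [4]),
    ]),
    "forgets": ("durability", [
        ("add a memory store", [9]),
        ("make it temporal if facts change", [10]),
    ]),
    "repeats the same mistake": ("learning", [
        ("mine failures and rewrite the instructions", [12]),
    ]),
    "slow at high load": ("serving", [
        ("use a paged, prefix-sharing engine", [8]),
        ("pick a model with efficient attention", [14]),
    ]),
    "pulls the wrong things": ("routing", [
        ("put an orchestrator in front", [13]),
    ]),
}


def diagnose(symptom: str) -> str:
    """Return the problem type and candidate levers for a named symptom."""
    problem, levers = DECISION_GUIDE[symptom]
    options = []
    for name, chapters in levers:
        chapter_text = ", ".join(f"Ch {c}" for c in chapters)
        options.append(f"{name} ({chapter_text})")
    return f"{problem} problem: " + "; ".join(options)


print(diagnose("forgets"))
```

An unknown symptom raises `KeyError` here, which is deliberate for a sketch: if the symptom has no name yet, the diagnosis step is not finished.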

Build or buy

The from-scratch versions in this book are for understanding, not for production. A real semantic cache needs a vector index, eviction, and persistence; a real memory layer needs durability, concurrency, and access control; a real compressor needs a tuned scorer. The projects above have solved those parts. Build the toy to know what the tool is doing, then use the tool. The one place where "build" often wins is the orchestrator (Ch 13): the routing policy is specific to your application, and a hundred lines of your own control flow is frequently clearer than bending a framework to fit.
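As an illustration of how little code a hand-written routing policy can need, here is a minimal sketch in plain control flow. The source names (`code_index`, `memory_store`, `doc_search`) and the keyword lists are hypothetical, chosen only for this example; a real policy would encode rules specific to your application.

```python
import re

# Hypothetical trigger words; a real application would tune or replace these.
CODE_WORDS = {"error", "traceback", "function", "class"}
MEMORY_WORDS = {"my", "remember", "previously", "again"}


def route(query: str) -> list[str]:
    """Decide which context sources to pull for this turn."""
    words = set(re.findall(r"[a-z']+", query.lower()))
    sources = []
    if words & CODE_WORDS:
        sources.append("code_index")    # structure-aware code context (Ch 5)
    if words & MEMORY_WORDS:
        sources.append("memory_store")  # cross-session facts (Ch 9)
    if not sources:
        sources.append("doc_search")    # fallback: general retrieval
    return sources


print(route("Why does this function raise an error?"))
print(route("What is the refund policy?"))
```

The whole policy is visible in one screen and can be unit-tested directly, which is the clarity argument for writing this part yourself.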

Takeaways

  • Every lever in this book has a production project behind it; the map above is the technique-to-tool lookup.
  • The families are complements, not alternatives. Real systems stack caching, compression, memory, and orchestration together.
  • Diagnose by symptom: does not fit (compress, externalize, summarize), too expensive (cache prefix or answer, shape output), forgets (memory, temporal), repeats mistakes (procedural learning), slow at load (paged serving, efficient attention), pulls the wrong things (orchestrate).
  • Build the toy to understand the tool, then use the tool. The orchestrator is the part most worth writing yourself.

👉 The final page collects the projects, the papers behind them, and a glossary of the terms used throughout, as a reference you can return to.