References and further reading
How to use this page: you do not need any of it to have understood the book. Treat it as a shelf to reach for when a specific question comes up. It collects the projects named throughout, the papers behind the ideas, a glossary of the terms the chapters built up, and the runnable code so you can find a demo again.
A note on links: the field moves quickly and URLs rot, so most entries name the thing and, where I am confident of the address, give a plain GitHub org or root domain. Everything else is a name you can type into a search engine. Nothing here is fabricated; where I was unsure of an exact address I left the name and let you search.
Projects, by lever
- Prompt and context compression (Ch 3): LLMLingua and LLMLingua-2 (github.com/microsoft/LLMLingua). Also named in this space: Headroom, RTK.
- Output token reduction (Ch 4): the provider's own controls (Anthropic effort, structured outputs, stop sequences); output shapers such as caveman and Headroom's shaper.
- Code and structure-aware context (Ch 5): CodeCompressor, lean-ctx, tree-sitter (tree-sitter.github.io), Aider's repo map.
- KV-cache and prefix caching (Ch 6): provider-native prompt caching (Anthropic and other vendors); CacheAligner-style tooling.
- Semantic and response caching (Ch 7): GPTCache (github.com/zilliztech/GPTCache), Redis LangCache (redis.io).
- KV-cache serving optimization (Ch 8): vLLM (github.com/vllm-project/vllm, PagedAttention), SGLang (github.com/sgl-project/sglang, RadixAttention).
- Agent memory and persistence (Ch 9): Mem0 (github.com/mem0ai/mem0), Letta (github.com/letta-ai/letta, formerly MemGPT), Zep (github.com/getzep/zep).
- Temporal knowledge graphs (Ch 10): Graphiti (github.com/getzep/graphiti), Zep.
- Context-window compaction (Ch 11): Letta tiered memory; the provider's own compaction and context-editing features (Anthropic).
- Failure and procedural learning (Ch 12): LangMem (github.com/langchain-ai/langmem), headroom learn.
- Context orchestration (Ch 13): LangGraph (github.com/langchain-ai/langgraph), lean-ctx.
- Long-context attention efficiency (Ch 14): DeepSeek (github.com/deepseek-ai, sparse attention DSA and Multi-head Latent Attention), MiniMax (github.com/MiniMax-AI, lightning/linear attention).
Papers behind the ideas
Named so you can find the current version on arxiv.org or the project page. Papers are revised and their version suffixes change, so search by title.
- LLMLingua and LLMLingua-2 (Microsoft Research): prompt compression by perplexity (v1) and by a learned token classifier distilled from a strong model (v2).
- MemGPT (now Letta): an operating-system metaphor for LLM memory, with core, recall, and archival tiers the model pages between.
- PagedAttention ("Efficient Memory Management for Large Language Model Serving"): the paper behind vLLM, treating the KV cache like OS virtual memory.
- RadixAttention (SGLang): automatic KV prefix sharing across requests via a radix tree.
- DeepSeek-V2 / V3 technical reports: Multi-head Latent Attention (MLA) for KV compression, and the sparse attention used in later versions.
- MiniMax-01 technical report: lightning (linear) attention for long sequences.
- The provider documentation for prompt caching, token counting, compaction, and context editing is the authority for the exact parameters; consult the claude-api reference for the model you call.
Glossary
Terms the chapters built up, in one place.
- Context. The full token sequence sent to the model on one call: system prompt, tools, retrieved documents, history, and the user's message. See Chapter 1.
- Token. A sub-word unit from the model's vocabulary; the thing you are billed and windowed in. See Chapter 2.
- Context window. The maximum number of tokens a model can read at once.
- BPE (byte pair encoding). The algorithm that builds the tokenizer by merging frequent adjacent pairs. See Chapter 2.
- Compression ratio. Original tokens divided by compressed tokens. See Chapter 3.
- Effort / structured output / stop sequence. Provider controls that make the model write fewer output tokens. See Chapter 4.
- AST (abstract syntax tree). A program's structure as a tree; the basis for selecting code by call graph. See Chapter 5.
- Query, key, value. The three projections inside attention. See Chapter 6.
- KV cache. Stored keys and values for past tokens so they are not recomputed each step. See Chapter 6.
- Prefix caching / prompt caching. Reusing the computation already done for a stable prompt prefix across calls, so the prefix is neither recomputed nor billed again at the full rate. See Chapter 6.
- Semantic cache. Returning a stored answer for a query that is close in meaning to an earlier one, as opposed to an exact match. See Chapter 7.
- Embedding / cosine similarity. A vector for a piece of text, and the angle-based measure of how close two vectors are. See Chapter 7 and Chapter 9.
- PagedAttention / RadixAttention. Engine techniques: KV memory in fixed-size pages, and KV prefix sharing across requests. See Chapter 8.
- Extraction / retrieval / invalidation. The three memory operations: pull facts out, fetch the relevant ones, replace stale ones. See Chapter 9.
- Bi-temporal / validity interval. Tracking when a fact was true (event time) and when the system learned it (ingestion time). See Chapter 10.
- Compaction. Summarizing old context when the window fills, as opposed to deleting it. See Chapter 11.
- Procedural memory. Learned how-to rules that change the agent's behavior, stored in its instructions, as opposed to facts. See Chapter 12.
- Orchestration. Deciding which context, tools, and state to assemble per turn. See Chapter 13.
- Sparse / linear attention, MLA. Architecture-level ways to cut attention compute or KV memory so long windows are feasible. See Chapter 14.
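Two of the terms above reduce to short formulas, and it can help to see them written out. The following is a minimal sketch in plain Python. The function names are illustrative assumptions and are not taken from the book's code/ folder.

```python
import math


def compression_ratio(original_tokens: int, compressed_tokens: int) -> float:
    """Original tokens divided by compressed tokens, as in the glossary.

    A ratio of 4.0 means the compressed prompt is a quarter of the original.
    """
    if compressed_tokens <= 0:
        raise ValueError("compressed_tokens must be positive")
    return original_tokens / compressed_tokens


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors.

    1.0 means same direction, 0.0 means orthogonal, -1.0 means opposite.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    print(compression_ratio(1200, 300))                # 4.0
    print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # 0.0
```

In a semantic cache, a score like this is typically compared against a similarity threshold to decide whether a stored answer is close enough to reuse; the right threshold depends on the embedding model and the workload.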
This book's code
Every demo is in the code/ folder of this book's repository, runnable with python3 and NumPy, no API key. By chapter:
- context_assemble.py, token_economy.py: foundations.
- compress.py, output_shape.py, code_context.py: compression.
- kv_cache.py, semantic_cache.py, kv_serving.py: caching.
- agent_memory.py, temporal_kg.py, compaction.py, procedural_learning.py: memory.
- orchestrate.py, attention_efficiency.py: architecture.
Read them, change a number, and watch the trade-off move. That is the fastest way to make the ideas your own.
One last thing
Context engineering is bookkeeping with stakes. The model is only ever as good as the tokens you put in front of it and only ever as cheap as the tokens you avoid re-paying for. You now have the levers and the map. When a system is slow, expensive, or forgetful, you can name which of the four pressures it is failing and reach for the matching tool. That is the whole job.