Prompt and context compression
Chapter 1 ended on a crude policy: when the context does not fit, drop the oldest part. The worst case was dropping the retrieved document, the one piece the question actually needed. This chapter fixes that. Instead of throwing a part away whole, we compress it: keep the words that matter for this question and remove the words that do not, so the part still fits but still answers.
This is compression of the input, the tokens you send. A long tool output, a noisy log, a retrieved document, a stack of few-shot examples: most of those tokens are filler for the question at hand. Stripping them before the call buys you both window space and lower input cost (Chapter 2). Chapter 4 does the mirror job on the output, the tokens the model writes.
The idea: keep what answers the question
Compression here is query-aware and extractive. Two words worth pinning down:
- Query-aware means we compress with respect to a specific question. The same document compresses differently depending on what you are asking. A log compressed for "what was the error?" keeps different lines than the same log compressed for "how long did the outage last?".
- Extractive means we keep original sentences verbatim and only choose which to drop. We never rewrite or paraphrase. The opposite, abstractive compression, rewrites the text into something shorter (that is summarization, Chapter 11). Extractive is safer: it cannot hallucinate a fact that was not there, because every surviving word came straight from the source.
Don't be confused. Compression and summarization both make text shorter, but they are not the same operation. Compression here is extractive and lossless-at-the-word-level: it deletes spans and keeps the rest untouched, so what survives is guaranteed to be the author's original words. Summarization is abstractive: it generates new, shorter text that describes the original, which can introduce errors and costs output tokens to produce. Reach for extractive compression when you cannot afford a paraphrase to drift; reach for summarization when the source is so long that even the important sentences will not fit.
To choose which sentences to keep, we score each one by how much it is worth for the query, then keep the best ones until we hit a token budget.
Scoring a sentence
A good score has to do two things at once: reward sentences that mention the query terms, and not be fooled by filler words that appear everywhere. We combine two signals.
The first is term overlap: how many of the query's words show up in the sentence. On its own this is weak, because the most frequent overlapping word is usually something like "the" or "of", which tells you nothing about which sentence is relevant.
So we weight each word by its inverse document frequency (idf). Treat every sentence as a small "document". A word that appears in almost every sentence is common and uninformative; a word that appears in only one or two sentences is distinctive. idf turns that intuition into a number:
$$\text{idf}(w) = \log!\left(\frac{N + 1}{\text{df}(w) + 1}\right) + 1$$
Here $N$ is the number of sentences and $\text{df}(w)$ (the document frequency) is how many of them contain the word $w$. A word in every sentence has $\text{df}(w) \approx N$, so the fraction is near 1 and its idf is low. A word in one sentence has a small $\text{df}(w)$, so the fraction is large and its idf is high. The $+1$ terms are smoothing: they keep the logarithm away from dividing by zero and stop any weight from collapsing to exactly zero.
The sentence score is then the summed idf of every sentence word that also appears in the query, divided by the sentence length so we are scoring density of relevant content, not just rewarding long sentences:
$$\text{score}(s) = \frac{1}{|s|}\sum_{w \in s,; w \in q} \text{idf}(w)$$
where $|s|$ is the word count of sentence $s$ and $q$ is the set of query words.
Coarse to fine
We compress in two passes, cheap first.
- Coarse. Rank all sentences by score and keep the best ones, top down, until the next one would blow the token budget. This drops whole low-value sentences. It is the big win, because most of the bloat is entire sentences that have nothing to do with the question.
- Fine. On the sentences that survived, strip stopwords (the "the", "of", "is" filler) to squeeze out a bit more. We protect any word that appears in the query, so we never delete the very terms that made a sentence worth keeping.
Coarse-to-fine is the same shape the real systems use. Spend your cheap, high-leverage operation first (drop whole units), then do the finer, riskier trimming only on what is left.
The demo
The code below builds the scorer and both passes over an incident write-up where exactly one sentence answers the question ("what was the root cause of the outage?"), buried among filler and near-miss distractors. It prints every sentence's score so the ranking is not a black box, runs the coarse and fine passes under a 40-token budget, reports the compression ratio, and verifies the gold answer survived.
"""Query-aware extractive prompt compression, from scratch.
The problem: you are about to send a long context (a retrieved document, a pile
of logs, a tool's output) to the model along with a question. Most of those
tokens do not help answer THIS question. Compression removes them BEFORE the
call, so you pay for fewer input tokens and free up window space.
This is the simplest honest version of the idea behind Microsoft's LLMLingua:
score each unit of text by how much it matters to the query, then keep the best
units under a token budget. We do it EXTRACTIVELY (we keep original sentences,
we never rewrite), with two passes:
COARSE: drop whole low-scoring sentences.
FINE: strip filler ("stopwords") from the sentences that survived.
A sentence's score combines two signals:
* term OVERLAP with the query (does it mention what was asked?), and
* an inverse-document-frequency (idf) WEIGHT per word, so a word that appears
in almost every sentence (filler like "the", "deployment") counts for little
and a rare, informative word counts for a lot. We compute idf with NumPy.
We measure tokens with the same words * 1.3 estimate as chapter 2, report the
compression ratio, and VERIFY that the one sentence that actually answers the
query survives the squeeze.
NumPy + standard library only. Run: python3 compress.py
"""
import re
import numpy as np
# Words so common they carry almost no information about WHICH sentence answers
# a query. Real compressors learn this; we hard-code a tiny list to stay simple.
STOPWORDS = {
"the", "a", "an", "and", "or", "but", "if", "then", "of", "to", "in", "on",
"at", "by", "for", "with", "as", "is", "are", "was", "were", "be", "been",
"it", "its", "this", "that", "these", "those", "we", "you", "they", "i",
"our", "your", "their", "from", "into", "out", "up", "down", "over", "all",
"can", "will", "would", "should", "may", "have", "has", "had", "do", "does",
"not", "no", "so", "than", "very", "just", "also", "which", "when", "what",
}
def tokens(text):
"""Crude token estimate, consistent with chapter 2: words * 1.3."""
return round(len(text.split()) * 1.3)
def words(text):
"""Lowercased word list, punctuation stripped. The unit we score over."""
return re.findall(r"[a-z0-9]+", text.lower())
def split_sentences(text):
"""Split a blob into sentences on ., !, ? (good enough for plain prose)."""
parts = re.split(r"(?<=[.!?])\s+", text.strip())
return [p.strip() for p in parts if p.strip()]
def idf(sentences):
"""Inverse document frequency for every word, computed over the sentences.
Treat each SENTENCE as a 'document'. A word in many sentences is common and
uninformative (low idf); a word in few sentences is distinctive (high idf).
idf(w) = log( (N + 1) / (df(w) + 1) ) + 1
N is the sentence count, df(w) the number of sentences containing w. The
+1's are smoothing so nothing blows up or hits zero. NumPy does the arithmetic
over the whole vocabulary at once."""
vocab = sorted({w for s in sentences for w in words(s)})
index = {w: i for i, w in enumerate(vocab)}
n = len(sentences)
df = np.zeros(len(vocab))
for s in sentences:
for w in set(words(s)):
df[index[w]] += 1
scores = np.log((n + 1) / (df + 1)) + 1.0
return {w: float(scores[index[w]]) for w in vocab}
def score_sentence(sentence, query_words, idf_weights):
"""How much this sentence is worth, given the query.
Sum the idf weight of every sentence word that also appears in the query,
then divide by the sentence's length in words so we are not just rewarding
long sentences. This is a query-aware, idf-weighted overlap score."""
sw = words(sentence)
if not sw:
return 0.0
hit = sum(idf_weights.get(w, 1.0) for w in sw if w in query_words)
return hit / len(sw)
def compress(context, query, budget_tokens, strip_stopwords=False):
"""Keep the highest-scoring sentences under a token budget (coarse pass).
If strip_stopwords is set, also remove filler from survivors (fine pass).
Returns (kept_text, list_of_kept_original_sentences)."""
sentences = split_sentences(context)
idf_weights = idf(sentences)
qwords = set(words(query))
ranked = sorted(
sentences,
key=lambda s: score_sentence(s, qwords, idf_weights),
reverse=True,
)
# COARSE: walk best-first, keep a sentence if it still fits the budget.
kept, used = [], 0
for s in ranked:
cost = tokens(s)
if used + cost <= budget_tokens:
kept.append(s)
used += cost
# Put survivors back in their original reading order.
kept_in_order = [s for s in sentences if s in kept]
if not strip_stopwords:
return " ".join(kept_in_order), kept_in_order
# FINE: drop stopwords from each survivor. Query words are always protected
# so we never delete the very terms that made a sentence worth keeping.
fine = []
for s in kept_in_order:
out = [w for w in s.split()
if w.lower().strip(".,!?;:") not in STOPWORDS
or w.lower().strip(".,!?;:") in qwords]
fine.append(" ".join(out))
return " ".join(fine), kept_in_order
# A realistic long context: an incident write-up where exactly ONE sentence
# answers the question, buried in filler and near-miss distractors.
CONTEXT = """
The on-call engineer was paged at 02:14 about elevated error rates.
The dashboard showed the API returning 500s on roughly one in three requests.
We have a weekly deployment cadence and this happened right after a release.
The team had been discussing a migration to a new database for some time.
Logs from the gateway were noisy but did not point to a single cause at first.
The root cause was a connection pool limit of 20 that the new release exhausted under load.
A rollback to the previous version restored service within eight minutes.
The previous version had run for three weeks without any incident at all.
Customer support received a handful of tickets during the outage window.
We will add an alert on pool saturation and raise the limit in the next release.
The postmortem is scheduled for Thursday and the document is already drafted.
Coffee in the office kitchen ran out around the same time, unrelated.
"""
QUERY = "What was the root cause of the outage?"
GOLD = ("The root cause was a connection pool limit of 20 that the new "
"release exhausted under load.")
print("=== Query-aware extractive compression ===")
print(f"Query: {QUERY}\n")
original = " ".join(split_sentences(CONTEXT))
orig_tok = tokens(original)
n_sent = len(split_sentences(CONTEXT))
print(f"BEFORE: {n_sent} sentences, {orig_tok} tokens (words * 1.3 estimate)")
# Show the score of every sentence so the ranking is not a black box.
sents = split_sentences(CONTEXT)
idf_weights = idf(sents)
qwords = set(words(QUERY))
print("\nSentence scores (query-aware, idf-weighted):")
for s in sorted(sents, key=lambda s: score_sentence(s, qwords, idf_weights),
reverse=True):
sc = score_sentence(s, qwords, idf_weights)
short = s if len(s) <= 60 else s[:57] + "..."
print(f" {sc:5.2f} {short}")
# COARSE pass: drop whole low-value sentences under a tight budget.
BUDGET = 40
coarse_text, kept = compress(CONTEXT, QUERY, BUDGET)
coarse_tok = tokens(coarse_text)
print(f"\n--- COARSE (budget {BUDGET} tokens): keep best whole sentences ---")
print(f"AFTER: {len(kept)} sentences, {coarse_tok} tokens "
f"({orig_tok / coarse_tok:.1f}x smaller)")
for s in kept:
print(f" + {s}")
# Verify the gold sentence survived the squeeze.
gold_kept = any(GOLD.split()[3:8] == s.split()[3:8] for s in kept) \
or GOLD in kept
print(f"\nGold answer retained? {gold_kept}")
# FINE pass: also strip stopwords from the survivors.
fine_text, _ = compress(CONTEXT, QUERY, BUDGET, strip_stopwords=True)
fine_tok = tokens(fine_text)
print(f"\n--- FINE (also strip stopwords from survivors) ---")
print(f"AFTER: {fine_tok} tokens ({orig_tok / fine_tok:.1f}x smaller overall)")
print(f" {fine_text}")
print("\n=== Summary ===")
print(f" original : {orig_tok:4d} tokens")
print(f" coarse compressed : {coarse_tok:4d} tokens "
f"({orig_tok / coarse_tok:.1f}x)")
print(f" + fine compressed : {fine_tok:4d} tokens "
f"({orig_tok / fine_tok:.1f}x)")
print(f" gold sentence kept: {gold_kept}")
print("\nThe model still gets the one sentence that answers the question,")
print("at a fraction of the tokens. That saving is real input cost (chapter 2).")
Running it:
=== Query-aware extractive compression ===
Query: What was the root cause of the outage?
BEFORE: 12 sentences, 207 tokens (words * 1.3 estimate)
Sentence scores (query-aware, idf-weighted):
0.73 The root cause was a connection pool limit of 20 that the...
0.58 Customer support received a handful of tickets during the...
0.27 The on-call engineer was paged at 02:14 about elevated er...
0.22 Logs from the gateway were noisy but did not point to a s...
0.18 The postmortem is scheduled for Thursday and the document...
0.18 Coffee in the office kitchen ran out around the same time...
0.17 The dashboard showed the API returning 500s on roughly on...
0.14 We will add an alert on pool saturation and raise the lim...
0.10 A rollback to the previous version restored service withi...
0.08 The previous version had run for three weeks without any ...
0.08 The team had been discussing a migration to a new databas...
0.00 We have a weekly deployment cadence and this happened rig...
--- COARSE (budget 40 tokens): keep best whole sentences ---
AFTER: 2 sentences, 36 tokens (5.8x smaller)
+ The root cause was a connection pool limit of 20 that the new release exhausted under load.
+ Customer support received a handful of tickets during the outage window.
Gold answer retained? True
--- FINE (also strip stopwords from survivors) ---
AFTER: 32 tokens (6.5x smaller overall)
The root cause was connection pool limit of 20 the new release exhausted under load. Customer support received handful of tickets during the outage window.
=== Summary ===
original : 207 tokens
coarse compressed : 36 tokens (5.8x)
+ fine compressed : 32 tokens (6.5x)
gold sentence kept: True
The model still gets the one sentence that answers the question,
at a fraction of the tokens. That saving is real input cost (chapter 2).
The gold sentence ("The root cause was a connection pool limit of 20...") scores highest at 0.73, far above the rest, because it contains the rare, high-idf words "root", "cause", and "pool" that also match the query, while the deployment-cadence sentence scores 0.00 because none of its words overlap the query. The coarse pass cuts 207 tokens to 36, a 5.8x reduction, and the fine pass trims that to 32, for 6.5x overall, with the answer intact.
Two honest details in that output. The second kept sentence ("Customer support received a handful of tickets...") is a passenger: it scored 0.58 mostly on the word "outage" matching the query, and it fit in the remaining budget, so it rode along. A tighter budget would have dropped it. And in the fine output, the words "the", "of", and "was" survived even though they are stopwords, because they appear in the query "What was the root cause of the outage?" and we protect query words from deletion. Both behaviors are the rule working as written, not a bug.
The real project: LLMLingua
The demo is the teaching version of Microsoft's LLMLingua family, the reference systems for prompt compression. They share the coarse-to-fine shape but replace our hand-rolled score with a learned one.
LLMLingua does the same two-level pruning, on a larger scale:
- Coarse (demonstration-level pruning). Given a prompt full of few-shot examples (demonstrations), it ranks whole demonstrations and drops the least useful ones, the same move as our coarse pass dropping whole sentences.
- Fine (token-level pruning). On what remains, it ranks individual tokens and drops the most predictable ones. The ranking signal is perplexity from a small language model: roughly, how surprised the small model is to see each token given the ones before it. A token the small model could have guessed (low perplexity, high redundancy) carries little information and is safe to drop; a surprising token (high perplexity) is doing real work and is kept. This is where it improves on idf: idf only knows how rare a word is across sentences, while perplexity knows how predictable a token is in this exact context.
- A budget controller sets different compression rates for different parts of the prompt, compressing the few-shot examples harder than the instructions, on the theory that examples are more redundant than the task description and can lose more without hurting.
- A distribution-alignment step nudges the small model used for scoring to behave more like the large target model, so the perplexity rankings transfer.
LLMLingua reports up to roughly 20x compression with little performance loss on its benchmarks, well past the 6.5x our extractive toy reaches, because token-level perplexity pruning is finer-grained than dropping whole sentences.
LLMLingua-2 changes the fine pass. Instead of the perplexity heuristic, it uses a transformer classifier trained for the job: for each token, a small model outputs keep-or-drop. The training labels come from data distilled from GPT-4, which was prompted to compress text, producing examples of which tokens a strong model judged droppable. Learning keep/drop directly is both faster (one forward pass, no per-token perplexity loop) and often more accurate than the perplexity proxy, because the classifier was trained on the actual target (was this token kept?) rather than a stand-in for it.
Don't be confused. Perplexity and the classifier are two different ways to answer the same question, "is this token droppable?". Perplexity (LLMLingua) is a heuristic: it assumes predictable tokens are redundant, which is usually but not always true. The classifier (LLMLingua-2) is trained on labeled keep/drop decisions, so it learns the exception cases the heuristic gets wrong. Same goal, one inferred, one learned.
Where these earn their keep:
- Long RAG contexts. You retrieved ten documents and only fragments of three are relevant. Compress them against the query before the call.
- Few-shot demonstrations. A long prompt of examples is highly redundant; the budget controller compresses it hardest, often the largest single saving.
- Long chain-of-thought. A model's own long reasoning trace can be compressed before it is fed back in for the next step.
Verifying the saving with Claude
The point of compression is fewer input tokens, and you should confirm the count rather than
trust the words * 1.3 estimate. Anthropic's count_tokens gives the exact input count
without running the model, so you can measure before and after. The following is
follow-along (the build machine has no API key), but it is the exact call:
# Illustrative: requires the anthropic SDK and an API key.
import anthropic
client = anthropic.Anthropic()
def input_tokens(text):
return client.messages.count_tokens(
model="claude-opus-4-8",
messages=[{"role": "user", "content": text}],
).input_tokens
before = input_tokens(ORIGINAL_CONTEXT)
after = input_tokens(COMPRESSED_CONTEXT)
print(f"{before} -> {after} input tokens ({before / after:.1f}x smaller)")
Run that with the same model id you will send to, because token counts are tokenizer-specific (Chapter 2). The ratio it reports is the real saving, the one your bill will reflect.
When not to compress
Compression is not free and not always worth it.
- It costs compute. The classifier or the small scoring model is itself a model call. For a context you send once, the compression can cost more than it saves. It pays off when the same large context is reused, or when fitting the window is the binding constraint regardless of cost.
- It can drop the answer. Any lossy method can cut the one span you needed, which is why the demo verifies the gold sentence explicitly. In production you set the budget with margin and measure end-task quality, not just the ratio.
- It fights caching. If a prefix is stable across calls, caching it (Chapter 6) is cheaper than compressing it, and a compressor that changes the prefix per query defeats the cache. Compress the variable part (the retrieved documents), cache the stable part (the system prompt).
Using the real tool: commands and before/after proof
The from-scratch demo above is the on-box proof that query-aware extraction works. LLMLingua is the production version of the same idea, and you install it like any other Python package:
pip install llmlingua
The minimal real usage is two steps: build a PromptCompressor, then call compress_prompt
on your long context. The first call downloads a model the first time it runs, so it is not
instant. The snippet below is follow-along (this build box has neither the library nor a model
cache), but it is the documented quickstart shape:
# Follow-along: requires `pip install llmlingua` and a one-time model download.
from llmlingua import PromptCompressor
# LLMLingua-2 (the trained keep/drop classifier from earlier in this chapter).
compressor = PromptCompressor(
model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank",
use_llmlingua2=True,
)
ORIGINAL_CONTEXT = "...a long retrieved document, log, or stack of few-shot examples..."
QUESTION = "What was the root cause of the outage?"
result = compressor.compress_prompt(
ORIGINAL_CONTEXT,
question=QUESTION, # query-aware: keep what answers this
rate=0.3, # target keeping ~30% of the tokens
)
print(result["compressed_prompt"]) # the kept text
print(result["origin_tokens"], "->", result["compressed_tokens"])
print(result["ratio"]) # a string like "3.3x"
compress_prompt returns a dict. The fields that matter here are compressed_prompt (the
kept text, ready to send), origin_tokens and compressed_tokens (the before and after
counts by LLMLingua's own tokenizer), and ratio (a formatted string such as "3.3x"). The
two knobs are rate (keep this fraction, a float at most 1.0) and target_token (compress to
roughly this many tokens). If you pass target_token, it wins and rate is ignored. The
original (non-2) LLMLingua uses the same call; you just construct PromptCompressor() with no
arguments and it leans on target_token plus the perplexity heuristic.
Before/after proof
LLMLingua's own ratio is one number, but the count that hits your bill is the count from the
model you actually send to. So measure both prompts with Anthropic's count_tokens, which
returns the exact input count for a given model without running it, and confirm the answer
still survives the compressed version. The metric is input tokens, and the test is: the
compressed prompt is much smaller and still answers the question.
# Follow-along: requires the anthropic SDK and an API key.
import anthropic
client = anthropic.Anthropic()
def input_tokens(text):
return client.messages.count_tokens(
model="claude-opus-4-8",
messages=[{"role": "user", "content": text}],
).input_tokens
original = f"{ORIGINAL_CONTEXT}\n\nQuestion: {QUESTION}"
compressed = f"{result['compressed_prompt']}\n\nQuestion: {QUESTION}"
before = input_tokens(original)
after = input_tokens(compressed)
print(f"{before} -> {after} input tokens ({before / after:.1f}x smaller)")
# Send the compressed prompt and check the answer is still there.
reply = client.messages.create(
model="claude-opus-4-8",
max_tokens=256,
messages=[{"role": "user", "content": compressed}],
)
print(reply.content[0].text)
On a real run you would see roughly 3000 -> 600 input tokens (5.0x smaller) with the answer
(the connection-pool root cause) still in the reply. Those numbers are illustrative, not
measured on this box: the actual ratio depends on your text and the rate you pick, and you
should read it off your own count_tokens output, because the count is tokenizer-specific to
the model you send to (Chapter 2). The verified demo earlier in this
chapter is the toy you can run here; this is the same query-aware compression, measured the
same way (input tokens before vs. after), with the production tool doing the cutting.
In Claude Code
The same move applies inside an agent loop. When a step is about to feed a long retrieved document or a noisy log into the model, compress it against the current question first, so the context window stays under budget and the input bill stays small. Claude Code keeps its own context lean by the cheaper version of this discipline: it reads the specific files and line ranges a task needs rather than pasting whole files or directories into the prompt, which is extractive compression done by retrieval instead of by a compressor model.
Takeaways
- Prompt compression shrinks the input: it removes tokens a long context does not need for the current question, saving both window space and input cost.
- Score sentences query-aware and idf-weighted so filler words count for little and rare, on-topic words count for a lot. Keep the best under a token budget.
- Work coarse to fine: drop whole low-value units first (the big, cheap win), then trim filler from the survivors, protecting query terms.
- LLMLingua prunes demonstrations then tokens, ranking tokens by perplexity from a small model, with a budget controller that compresses examples harder than instructions, for up to roughly 20x. LLMLingua-2 swaps the perplexity heuristic for a trained keep/drop classifier distilled from GPT-4.
- Verify the saving with
count_tokenson the exact model id, and watch the trade-offs: compression costs compute, can drop the answer, and fights prefix caching, so compress the variable part and cache the stable part.
👉 We have squeezed the input. The next chapter turns to the other, pricier half: the output. Since each written token costs about 5x an input token, getting the model to write less is often the larger saving, and it takes different tools than compressing a prompt.