Context-window management and compaction

A chat that runs long enough will, eventually, not fit. Every turn re-sends the whole history (Chapter 1), the history only grows, and the window is fixed. So at some point your code has to drop something. Chapter 1 showed the dumbest version of that decision: drop the oldest turn. It keeps you under budget, but it is amnesia by design. The fact that mattered most might be the one you said first.

This chapter does better. Instead of throwing old turns away, we summarize them into a few tokens and keep that summary in the window. The raw turns move to cheap storage out of the window; the gist stays. This is called compaction, and it is what lets a multi-hour agent run or a long support chat stay under budget without forgetting the codename you agreed on in turn 2.

The vocabulary, defined

A handful of terms get used loosely in this area, so here they are pinned down. All of them describe state your code maintains, since the model itself is stateless (Chapter 1).

  • Working set: the recent turns you keep verbatim and send to the model on this call. These are in-window: they cost tokens and they fit the budget.
  • Archive: older raw turns you keep somewhere cheap (a database, a file) but do not send. Out-of-window. They cost storage, not tokens, and the model never sees them unless you pull one back.
  • Eviction: the act of removing a turn from the working set. "Drop the oldest" evicts to the trash. Compaction evicts to the archive, but only after extracting the gist.
  • Compaction: replacing a batch of old turns with a short summary of them, so the summary stays in-window while the raw turns leave it.
  • Tiered memory: the whole arrangement, a small fast in-window tier (working set plus summary) backed by a large slow out-of-window tier (the archive). The system pages between them the way an operating system pages between RAM and disk.
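The archive only earns its keep if a turn can be paged back in when it is needed. Here is a minimal sketch of that step, assuming the archive is a plain list of {"role", "text"} dicts like the one in this chapter's demo. The recall helper and its word-overlap scoring are illustrative assumptions, standing in for whatever search (a keyword index, embeddings) a real system would use.

```python
import re

def words(text):
    """Lowercased set of words, punctuation stripped."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def recall(archive, query, k=1):
    """Return up to k archived turns that share the most words with the query.
    Hypothetical helper: crude word overlap, not a real retriever."""
    q = words(query)
    scored = [(len(q & words(t["text"])), i, t) for i, t in enumerate(archive)]
    # Highest overlap first; ties go to the earlier turn.
    scored.sort(key=lambda item: (-item[0], item[1]))
    # Turns with no overlap at all are never returned.
    return [t for score, _, t in scored[:k] if score > 0]

archive = [
    {"role": "assistant", "text": "For the record, the project is codenamed Atlas."},
    {"role": "user", "text": "We need it to index about ten million documents to start."},
]
hits = recall(archive, "what is the project codename?")
for t in hits:
    print(t["role"], "->", t["text"])
```

A turn pulled back this way rejoins the in-window tier, so it counts against the budget again like any other working-set turn.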

Don't be confused. Deleting context and summarizing it are different operations with different costs. Deleting (also called context editing) removes blocks outright: you might clear out old tool results that no longer matter, and those bytes are simply gone. Summarizing (compaction) replaces a block with a shorter version that keeps the gist: the raw bytes leave the window but the meaning stays, in fewer tokens. Delete when the content is genuinely spent (a stale tool output nobody will reference again). Summarize when the content is old but its conclusions still matter (the early turns where you settled the requirements). Picking the wrong one is how you either bloat the window with dead tool dumps or forget a decision you still need.
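The two operations look like this side by side. It is a sketch under stated assumptions: turns are plain dicts, the "tool" and "summary" role labels are made up for the illustration (they are not API roles), and first_sentences is a throwaway stub standing in for a real summarizer.

```python
def delete_spent_tool_results(turns):
    """Context editing: remove tool-result turns outright. Nothing is kept."""
    return [t for t in turns if t["role"] != "tool"]

def summarize_old_turns(turns, keep_recent, summarizer):
    """Compaction: replace all but the last `keep_recent` turns with one
    summary turn, so the gist stays while the raw text leaves."""
    split = max(len(turns) - keep_recent, 0)
    if split == 0:
        return list(turns)
    old, recent = turns[:split], turns[split:]
    return [{"role": "summary", "text": summarizer(old)}] + recent

def first_sentences(old_turns):
    """Stub summarizer: keep only the first sentence of each old turn."""
    return " ".join(t["text"].split(". ")[0].rstrip(".") + "." for t in old_turns)

turns = [
    {"role": "user", "text": "Let's call the project Atlas."},
    {"role": "tool", "text": "ls output: 400 lines of file names"},
    {"role": "assistant", "text": "Noted. Listing done, nothing unusual."},
    {"role": "user", "text": "What's next?"},
]

edited = delete_spent_tool_results(turns)             # the tool dump is gone
compacted = summarize_old_turns(edited, 1, first_sentences)
for t in compacted:
    print(t["role"], "->", t["text"])
```

The tool output leaves no trace after the delete, while the early turns leave a one-line gist after the summarize step. That asymmetry is the whole distinction.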

The mechanism

The plan is a loop. Keep appending turns to the working set. After each turn, check whether the in-window total (the summary plus the working set) would blow the budget. While it would, take the oldest working-set turn, fold its key facts into a running summary, and move the raw turn to the archive. Stop when you fit again, or when the working set is down to a small floor of the most recent turns that you always keep verbatim.

Two design choices make this work and not silently lose things:

  1. The summarizer keeps facts, not prose. A real system asks the model to write the summary. To keep this demo reproducible (and runnable on a box with no API key, see the provider section below), we use a deterministic stub that pulls out the sentences that look load-bearing: a stated decision (a codename, a deadline, a chosen option) or a sentence that names an entity. Everything chatty is dropped. The exact rule does not matter; the point is that the summary is much smaller than the turns it replaces.
  2. The summary itself is bounded. If you only ever append to the summary, it grows without limit and eventually breaks the budget on its own. So the summary is capped: when it is full, the lowest-priority facts fall off first. Decisions outrank passing mentions, so the codename (a decision) is never the fact that gets dropped to make room for small talk.
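For contrast with the stub, here is a rough sketch of the model-written path mentioned in the first point. Everything in it is an assumption for illustration: the prompt wording, the build_summary_prompt and model_summarize names, and the call_model parameter, which is any callable that takes a prompt string and returns text. Injecting the call keeps the sketch provider-neutral and lets it run with a fake model.

```python
def build_summary_prompt(turns, prior_summary=""):
    """Assemble the text we would send to a model to update the summary."""
    transcript = "\n".join(f'{t["role"]}: {t["text"]}' for t in turns)
    parts = [
        "Update the running summary of this conversation.",
        "Keep decisions, names, numbers, and deadlines. Drop small talk.",
        "Answer with at most 5 short bullet points.",
    ]
    if prior_summary:
        parts.append("Current summary:\n" + prior_summary)
    parts.append("New turns to fold in:\n" + transcript)
    return "\n\n".join(parts)

def model_summarize(turns, prior_summary, call_model):
    """`call_model` is a hypothetical str -> str callable (your provider call)."""
    return call_model(build_summary_prompt(turns, prior_summary)).strip()

def fake_model(prompt):
    """Dry-run stand-in: echo transcript lines that state a codename."""
    kept = [line.split(": ", 1)[1] for line in prompt.splitlines()
            if line.startswith(("user: ", "assistant: ")) and "codenamed" in line]
    return "\n".join("- " + k for k in kept)

turns = [
    {"role": "user", "text": "Hi, I'm kicking off the planning."},
    {"role": "assistant", "text": "For the record, the project is codenamed Atlas."},
]
summary = model_summarize(turns, "", fake_model)
print(summary)
```

Asking for "at most 5 short bullet points" is the prompt-level version of the second point: the bound has to be stated somewhere, or the summary grows.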

Here is the whole thing, with a head-to-head against Chapter 1's drop-oldest policy built in. We plant an important fact early ("the project is codenamed Atlas" in turn 2), run a realistic 19-turn conversation that piles up tokens, and then, on the last turn, ask for the codename. The question is answerable only if "Atlas" still appears in what we would send.

"""Context-window management by COMPACTION (summarize, don't just drop).

A long conversation has to keep fitting a fixed window, but you don't want to
lose the facts that were said early on. This script builds a TIERED context
manager that keeps a token budget while preserving the gist of evicted turns:

  - WORKING SET: recent turns kept verbatim, in-window, sent to the model.
  - ARCHIVE:     older raw turns kept out-of-window (cheap storage, not sent).
  - SUMMARY:     a compact running note that captures the key facts (decisions,
                 named entities) from turns that have been pushed out of the
                 working set, so they survive even though the raw text doesn't.

When the in-window total would exceed the budget, we COMPACT: take the oldest
working-set turns, extract their key facts into the running summary, move the
raw turns to the archive, and keep the summary plus the most recent turns.

We compare this against chapter 1's policy ("drop the oldest") by planting an
important fact early and asking about it late. Drop-oldest loses the fact;
compaction keeps it in the summary, under the same budget.

Standard library only. Run:  python3 compaction.py
"""

import re

# Same crude estimate as chapters 1 and 2: ~1.3 tokens per English word.
# Real billing-grade counts come from the provider's count_tokens endpoint.
def est_tokens(text):
    return round(len(text.split()) * 1.3)


# --- The stub summarizer ----------------------------------------------------
# A real system asks the model to summarize. Here we use a deterministic stub
# so the output is reproducible and the mechanism is visible: we keep the lines
# that look load-bearing (a stated decision, or a sentence naming an entity)
# and drop the chatty rest. This is a stand-in for "the model wrote a summary",
# not a serious summarizer.

DECISION_RE = re.compile(
    r"\b(codenamed|named|is called|decided|chose|will use|deadline|budget|"
    r"must|the goal is|launch|ships?|version)\b",
    re.IGNORECASE,
)
# A capitalized word of three or more characters that is not the first word of
# the sentence: a crude proxy for a named entity (a project, person, product, or
# place). Acronyms such as REST and API match too.
ENTITY_RE = re.compile(r"(?<!^)(?<![.!?]\s)\b([A-Z][a-zA-Z0-9]{2,})\b")


def key_facts(turn_text):
    """Pull the fact-bearing sentences out of one turn's text, each tagged with
    a priority: 2 if it states a decision (codename, deadline, choice), 1 if it
    only names an entity. Higher-priority facts are kept first when the summary
    itself has to be trimmed, so a decision like the codename outranks chatter."""
    facts = []
    for sentence in re.split(r"(?<=[.!?])\s+", turn_text.strip()):
        s = sentence.strip()
        if not s:
            continue
        if DECISION_RE.search(s):
            facts.append((2, s))
        elif ENTITY_RE.search(s):
            facts.append((1, s))
    return facts


def summarize(turns, prior_summary, max_facts=4):
    """Fold a batch of turns into the running summary, de-duplicating, then
    keep only the top `max_facts` by priority so the summary stays bounded.
    Ties break toward the EARLIER fact, so a fact stated once early survives."""
    facts = list(prior_summary)
    seen = {s for _, s in facts}
    for t in turns:
        for prio, s in key_facts(t["text"]):
            if s not in seen:
                facts.append((prio, s))
                seen.add(s)
    # Stable sort by descending priority; Python's sort is stable, so within a
    # priority the original (chronological) order is preserved.
    facts.sort(key=lambda pf: -pf[0])
    return facts[:max_facts]


def render_summary(facts):
    if not facts:
        return ""
    lines = [s for _, s in facts]
    return "Summary of earlier conversation:\n- " + "\n- ".join(lines)


# --- The tiered context manager --------------------------------------------

class TieredContext:
    """Keeps the in-window context under `budget` tokens by compacting the
    oldest working-set turns into a running summary."""

    def __init__(self, budget, keep_recent=2):
        self.budget = budget          # token ceiling for what we SEND
        self.keep_recent = keep_recent  # min recent turns to keep verbatim
        self.working = []             # recent raw turns (in-window)
        self.archive = []             # older raw turns (out-of-window)
        self.summary = []             # running list of key facts (in-window)
        self.compactions = 0

    def in_window_tokens(self):
        used = est_tokens(render_summary(self.summary))
        used += sum(est_tokens(t["text"]) for t in self.working)
        return used

    def add(self, role, text):
        self.working.append({"role": role, "text": text})
        self._compact_if_needed()

    def _compact_if_needed(self):
        # While we're over budget and still have turns we're allowed to evict,
        # fold the oldest working-set turn into the summary and archive it.
        while (self.in_window_tokens() > self.budget
               and len(self.working) > self.keep_recent):
            oldest = self.working.pop(0)
            self.summary = summarize([oldest], self.summary)
            self.archive.append(oldest)
            self.compactions += 1


# --- Chapter 1's policy, for comparison ------------------------------------

class DropOldest:
    """Naive baseline: keep recent raw turns under budget by discarding the
    oldest. No summary, so anything dropped is gone for good."""

    def __init__(self, budget):
        self.budget = budget
        self.working = []
        self.dropped = 0

    def in_window_tokens(self):
        return sum(est_tokens(t["text"]) for t in self.working)

    def add(self, role, text):
        self.working.append({"role": role, "text": text})
        while self.in_window_tokens() > self.budget and len(self.working) > 1:
            self.working.pop(0)
            self.dropped += 1


# --- Answerability check ----------------------------------------------------
# Late in the chat we ask "what is the project codename?". The answer is
# available only if "Atlas" still appears somewhere in what we'd send: the
# summary or the working-set turns.

def can_answer_codename(visible_text):
    return "Atlas" in visible_text


def visible_text_tiered(ctx):
    parts = [render_summary(ctx.summary)] + [t["text"] for t in ctx.working]
    return "\n".join(parts)


def visible_text_dropoldest(ctx):
    return "\n".join(t["text"] for t in ctx.working)


# --- A scripted 19-turn conversation ---------------------------------------
# Turn 2 plants the load-bearing fact. The rest is realistic filler that piles
# up tokens and pushes the early turns toward eviction.

SCRIPT = [
    ("user",      "Hi, I'm kicking off the planning for our new internal search service."),
    ("assistant", "Great. For the record, the project is codenamed Atlas. I'll refer to it that way from here on."),
    ("user",      "We need it to index about ten million documents to start."),
    ("assistant", "Ten million is fine for a first cut. We can shard the index across a few nodes and grow later."),
    ("user",      "What embedding dimension should we use for the vectors?"),
    ("assistant", "Start at 768 dimensions. It balances recall against memory, and you can revisit it after measuring."),
    ("user",      "How should we handle re-indexing when documents change?"),
    ("assistant", "Use an append-only log of changes and replay it nightly. That keeps the live index stable during the day."),
    ("user",      "Our latency target is under 100 milliseconds at the 95th percentile."),
    ("assistant", "Achievable. Keep the hot shards in memory and cache the most frequent queries to hit that target."),
    ("user",      "Should we expose a REST API or a gRPC one for the search endpoint?"),
    ("assistant", "Offer gRPC internally for speed and a thin REST gateway for external callers who want simplicity."),
    ("user",      "What about access control on the documents?"),
    ("assistant", "Filter results by the caller's permission set at query time, and never index secrets into the vectors."),
    ("user",      "How many engineers do you think this needs for the first quarter?"),
    ("assistant", "Three is a reasonable starting team: one on indexing, one on serving, one on the API and client work."),
    ("user",      "When should we aim to ship the first internal preview?"),
    ("assistant", "Target the end of the quarter for an internal preview, then harden it before any wider rollout."),
    ("user",      "Remind me, what is the project codename again? I need it for the ticket."),
]


def run(strategy_factory, visible_fn, label):
    ctx = strategy_factory()
    print(f"=== {label} ===")
    print(f"  budget: {ctx.budget} tokens, "
          f"turns: {len(SCRIPT)}")
    over = 0
    peak = 0
    for i, (role, text) in enumerate(SCRIPT, start=1):
        ctx.add(role, text)
        used = ctx.in_window_tokens()
        peak = max(peak, used)
        if used > ctx.budget:
            over += 1
    answerable = can_answer_codename(visible_fn(ctx))
    print(f"  peak in-window tokens: {peak}  (budget {ctx.budget})")
    print(f"  turns over budget:     {over} of {len(SCRIPT)}")
    if hasattr(ctx, "compactions"):
        print(f"  compactions:           {ctx.compactions}")
        print(f"  raw turns archived:    {len(ctx.archive)}")
        print(f"  summary facts kept:    {len(ctx.summary)}")
    if hasattr(ctx, "dropped"):
        print(f"  raw turns dropped:     {ctx.dropped}")
    verdict = "ANSWERABLE" if answerable else "UNANSWERABLE (fact lost)"
    print(f"  late codename question: {verdict}")
    print()
    return ctx


BUDGET = 120

print("Planted fact: turn 2 says the project is codenamed Atlas.")
print(f"Late question (turn {len(SCRIPT)}): 'what is the project codename?'\n")

drop = run(lambda: DropOldest(BUDGET), visible_text_dropoldest, "Chapter 1 policy: drop the oldest")
tier = run(lambda: TieredContext(BUDGET), visible_text_tiered, "This chapter: tiered compaction")

print("What the compacting manager would actually SEND on the final turn")
print("(running summary, then the verbatim working set):")
print("-" * 68)
print(visible_text_tiered(tier))
print("-" * 68)
print()
print("Both stayed under the same", BUDGET, "token budget. Drop-oldest lost the")
print("codename when turn 2 fell out of the window; compaction kept it in the")
print("summary, so the late question is still answerable.")

Running it:

Planted fact: turn 2 says the project is codenamed Atlas.
Late question (turn 19): 'what is the project codename?'

=== Chapter 1 policy: drop the oldest ===
  budget: 120 tokens, turns: 19
  peak in-window tokens: 119  (budget 120)
  turns over budget:     0 of 19
  raw turns dropped:     13
  late codename question: UNANSWERABLE (fact lost)

=== This chapter: tiered compaction ===
  budget: 120 tokens, turns: 19
  peak in-window tokens: 118  (budget 120)
  turns over budget:     0 of 19
  compactions:           17
  raw turns archived:    17
  summary facts kept:    4
  late codename question: ANSWERABLE

What the compacting manager would actually SEND on the final turn
(running summary, then the verbatim working set):
--------------------------------------------------------------------
Summary of earlier conversation:
- For the record, the project is codenamed Atlas.
- When should we aim to ship the first internal preview?
- Should we expose a REST API or a gRPC one for the search endpoint?
- Offer gRPC internally for speed and a thin REST gateway for external callers who want simplicity.
Target the end of the quarter for an internal preview, then harden it before any wider rollout.
Remind me, what is the project codename again? I need it for the ticket.
--------------------------------------------------------------------

Both stayed under the same 120 token budget. Drop-oldest lost the
codename when turn 2 fell out of the window; compaction kept it in the
summary, so the late question is still answerable.

Read the two blocks side by side, because the comparison is the point. Both policies held the line on the same 120-token budget: drop-oldest peaked at 119 tokens, compaction at 118, and neither went over on any of the 19 turns. So on the budget metric they are a tie. The difference is what survived. Drop-oldest discarded 13 raw turns into nothing, and turn 2 was one of them, so by the final turn the codename is simply gone and the question cannot be answered. Compaction archived 17 raw turns but distilled them into a 4-fact running summary first, and "Atlas" is the top line of that summary, so the same question is answerable at the same cost.

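Notice the summary in the final block sits at four facts and stays there. That is the bound doing its job (max_facts=4 in summarize): as new decisions arrive, the lowest-priority facts age out, but "the project is codenamed Atlas" is a decision, ranks high, and holds its place. Without that cap the summary would have grown every turn and eventually broken the budget itself, which is the trap a naive "just keep summarizing" loop falls into. The second line of the summary also shows the stub's limits: "When should we aim to ship the first internal preview?" is a question, not a decision, but it contains the word "ship", so the regex ranks it as one. A model-written summary should do better; the stub only has to make the mechanism visible.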
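The demo caps the summary by fact count, which keeps it bounded but not tightly: four long sentences cost more than four short ones. One possible variation, not part of the verified demo, is to cap by estimated tokens instead. The sketch below assumes the same (priority, sentence) tuples and the same crude est_tokens; trim_to_tokens is a hypothetical name.

```python
def est_tokens(text):
    # Same crude estimate as the demo: ~1.3 tokens per English word.
    return round(len(text.split()) * 1.3)

def trim_to_tokens(facts, max_tokens):
    """Greedily keep the highest-priority facts whose combined estimate fits
    within max_tokens. A fact that does not fit is skipped, and smaller facts
    after it are still considered."""
    # Stable sort: within a priority, the original (chronological) order holds.
    ranked = sorted(facts, key=lambda pf: -pf[0])
    kept, used = [], 0
    for prio, sentence in ranked:
        cost = est_tokens(sentence)
        if used + cost <= max_tokens:
            kept.append((prio, sentence))
            used += cost
    return kept

facts = [
    (2, "The project is codenamed Atlas."),
    (1, "Offer gRPC internally and a thin REST gateway for external callers."),
    (1, "The API team owns the client."),
    (2, "The deadline is March 1."),
]
kept = trim_to_tokens(facts, max_tokens=24)
for prio, sentence in kept:
    print(prio, sentence)
```

With this cap the two decisions go in first, the long gRPC sentence is skipped because it would overshoot, and the shorter API sentence still fits. The trade-off is that a token cap makes the number of surviving facts vary with how wordy they are.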

When you reach for this

Compaction earns its complexity on long-running, single-conversation work where the history genuinely outgrows the window:

  • Multi-hour chats. Support sessions or pairing sessions that accumulate hundreds of turns. The early turns set context that still matters; you cannot afford to drop them and you cannot afford to keep them all verbatim.
  • Tool-heavy agent loops. An agent that calls tools dozens of times fills the window with tool results fast. Here you often want both operations from the "Don't be confused" box: delete the spent tool outputs (context editing) and summarize the reasoning that produced them (compaction).
  • Long autonomous runs. An agent working a task overnight will blow any fixed window without some form of paging. Compaction plus an archive is how it keeps going.

For state that has to outlive a single conversation (across sessions, across restarts), this in-conversation summary is not enough; you want the durable, queryable memory of Chapter 9 and Chapter 10. Compaction manages one conversation's window; agent memory manages knowledge across many.

How the real systems do it

The tiered model in the demo is the same one production systems use, with more machinery.

Letta (formerly MemGPT) makes the tiers explicit and lets the agent manage them itself. It splits memory into core (always in-window: the essential persona and facts), recall (the stored conversation history, which the agent can search even after messages have left the window), and archival (a long-term store for anything else worth keeping, the closest match to our archive). The agent self-pages: it calls tools to edit core memory and to search recall or archival memory when it needs an old fact back, and when the window fills, older messages are summarized and evicted from it. The demo's compaction loop is the same idea with the policy hard-coded instead of model-driven.
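To make "the agent pages its own memory" concrete, here is a toy sketch of the pattern. It is not Letta's API: the class, the tool names, and the call format are all invented for illustration. The point is only the shape, in which memory operations are exposed as tools and the paging decisions arrive as tool calls rather than being hard-coded in a loop.

```python
class ToyPagedMemory:
    """Toy model of agent-managed tiers. Every name here is made up."""

    def __init__(self, core_limit=2):
        self.core = []            # always in-window notes
        self.archival = []        # out-of-window store
        self.core_limit = core_limit

    def core_append(self, note):
        """Tool: add a note to core; page the oldest notes out if core is full."""
        self.core.append(note)
        while len(self.core) > self.core_limit:
            self.archival.append(self.core.pop(0))
        return "ok"

    def archival_search(self, keyword):
        """Tool: case-insensitive substring search over the archival store."""
        return [n for n in self.archival if keyword.lower() in n.lower()]

def run_tool_call(memory, call):
    """Dispatch one tool call of the form {"name": ..., "args": {...}}."""
    tools = {
        "core_append": memory.core_append,
        "archival_search": memory.archival_search,
    }
    return tools[call["name"]](**call["args"])

mem = ToyPagedMemory(core_limit=2)
for note in ["Project codename: Atlas",
             "Latency target: 100 ms p95",
             "Team size: three engineers"]:
    run_tool_call(mem, {"name": "core_append", "args": {"note": note}})

# Later, the "agent" decides it needs the codename and asks for it.
found = run_tool_call(mem, {"name": "archival_search",
                            "args": {"keyword": "codename"}})
print(found)
```

In a model-driven system the dicts passed to run_tool_call would come from the model's output, which is what makes the policy the agent's own.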

Anthropic's API offers both operations as server-side features, so you do not have to hand-roll the loop. The follow-along below shows the exact calls (the build box has no API key, so the output is illustrative, not a verified block).

Compaction summarizes earlier context into a compaction block as you approach the window. The one rule that bites people: you must append response.content back to your messages each turn, because the compaction state lives in those returned blocks. Strip out just the text and the summary is silently lost.

# Illustrative: requires the anthropic SDK and an API key.
import anthropic

client = anthropic.Anthropic()
messages = []

def chat(user_message):
    messages.append({"role": "user", "content": user_message})
    response = client.beta.messages.create(
        betas=["compact-2026-01-12"],            # the compaction beta header
        model="claude-opus-4-8",
        max_tokens=16000,
        messages=messages,
        context_management={"edits": [{"type": "compact_20260112"}]},
    )
    # Append the FULL content, not just the text: the compaction blocks in here
    # are what the API uses to replace the compacted history next turn.
    messages.append({"role": "assistant", "content": response.content})
    # Default to "" so a reply with no text block does not raise StopIteration.
    return next((b.text for b in response.content if b.type == "text"), "")

Context editing is the delete side of the "Don't be confused" box: it removes old blocks rather than summarizing them. The clear_tool_uses strategy strips out stale tool results, which is exactly right for a tool-heavy loop where the outputs are spent but the conversation should keep its shape.

# Illustrative: requires the anthropic SDK and an API key.
response = client.beta.messages.create(
    betas=["context-management-2025-06-27"],     # the context-editing beta header
    model="claude-opus-4-8",
    max_tokens=16000,
    messages=messages,
    context_management={"edits": [{"type": "clear_tool_uses_20250919"}]},
    tools=tools,                                 # your tool definitions, defined elsewhere
)

The two are complementary, and a long agent run often uses both: clear the tool results that are done with (clear_tool_uses), and compact the reasoning and decisions that are old but still load-bearing (compact_20260112). Note the different beta headers and edit types: the clearing strategy is not the compaction one, and mixing them up is a common slip.

Using the real tool: commands and before/after proof

The "How the real systems do it" section showed the call shapes. This section runs a real chat loop with them and explains how you prove the compaction is actually working: you watch the input-token count instead of letting it climb forever.

Here is a long chat loop that turns server-side compaction on. The point to get right is the append near the end of chat(): you append response.content back to messages, not just the text you pulled out of it. The compaction state the API builds up lives inside those returned content blocks. If you keep only the text and throw the rest away, the next request has no summary to stand on and the API has to start the whole job over.

# Follow-along: needs the anthropic SDK and an API key (neither is on this box,
# so the token figures further down are labeled illustrative, not measured here).
import anthropic

client = anthropic.Anthropic()
messages = []

def chat(user_message):
    messages.append({"role": "user", "content": user_message})
    response = client.beta.messages.create(
        betas=["compact-2026-01-12"],                       # turns on server-side compaction
        model="claude-opus-4-8",
        max_tokens=16000,
        messages=messages,
        context_management={"edits": [{"type": "compact_20260112"}]},
    )
    # Append the FULL content, not just the text. The compaction block the API
    # may have written lives in here; strip it and the summary is lost next turn.
    messages.append({"role": "assistant", "content": response.content})
    print("input tokens this turn:", response.usage.input_tokens)
    # Default to "" so a reply with no text block does not raise StopIteration.
    return next((b.text for b in response.content if b.type == "text"), "")

# Run this dozens of times. As the history grows toward the window, the API
# starts replacing the older turns with a compaction block on its own.
for turn in range(40):
    chat(f"... turn {turn} of a long working session ...")
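
The append rule can be seen without an API key in a toy model of the message list. The block shapes below are assumptions for illustration: plain dicts stand in for the SDK's content objects, and the "compaction" type name is a placeholder for whatever block type the API actually returns.

```python
# Stand-in for response.content: a list of typed blocks (assumed shape).
fake_content = [
    {"type": "compaction", "content": "Summary: project is codenamed Atlas."},
    {"type": "text", "text": "Sure, noted."},
]

def append_text_only(messages, content):
    """The mistake: keep just the text and discard every other block."""
    text = next((b["text"] for b in content if b["type"] == "text"), "")
    messages.append({"role": "assistant", "content": text})

def append_full(messages, content):
    """The rule: keep the whole block list, summary block included."""
    messages.append({"role": "assistant", "content": content})

def carries_summary(messages):
    """True if any message still holds a non-text block."""
    return any(
        isinstance(m["content"], list)
        and any(b["type"] != "text" for b in m["content"])
        for m in messages
    )

wrong, right = [], []
append_text_only(wrong, fake_content)
append_full(right, fake_content)
print("text-only keeps the summary:", carries_summary(wrong))
print("full content keeps the summary:", carries_summary(right))
```

Both lists read the same to a human skimming the assistant's reply; only one of them still carries the block the next request depends on.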

The delete-side sibling is context editing, which clears spent blocks instead of summarizing them. It uses a different beta header and a different edit type, so do not mix the two up: the clearing strategy is clear_tool_uses_20250919, not the compaction one.

# Follow-along: same caveats. This clears stale tool results rather than summarizing.
response = client.beta.messages.create(
    betas=["context-management-2025-06-27"],                # the context-editing beta header
    model="claude-opus-4-8",
    max_tokens=16000,
    messages=messages,
    context_management={"edits": [{"type": "clear_tool_uses_20250919"}]},
    tools=tools,                                            # your tool definitions, defined elsewhere
)

The before/after proof: bounded input tokens

The metric that tells you compaction is working is input tokens per call, and the test is whether that number stays bounded as the conversation gets long. Every request re-sends the whole history (Chapter 1), so on the naive path, where you just keep appending raw turns, the input-token count on each call climbs in step with the transcript and eventually slams into the window. With compaction on, once the history nears the trigger the API folds the older turns into a compaction block, and the prompt you send on later calls is that block plus the recent turns rather than the full transcript. So input_tokens on a turn-30 call lands far below what the same turn would have cost if you had sent everything.

You read the result off two places in the response:

  • response.usage.input_tokens, the count printed in the loop above. On the naive path it rises every turn; with compaction on it rises, then drops back down at the turn where the API emits the summary, then rises again from that lower floor.
  • a compaction content block in response.content, which is the summary the API wrote. Its presence is the direct signal that compaction fired on that turn (and it is exactly the block you must append back, per the loop above).

Concretely, with small illustrative numbers (expected shape, not measured on this box):

          naive "send the whole transcript"   with compaction on
turn 1    ~3k input tokens                    ~3k input tokens
turn 10   ~60k input tokens                   ~60k input tokens
turn 30   ~180k input tokens                  ~30k input tokens   <- summary fired

The two paths track each other early, while the history still fits cheaply. They split once the history nears the window: the naive path keeps climbing toward turn-30's ~180k, while the compacted path drops back to ~30k when the API replaces the early turns with a summary block. Same conversation, a fraction of the per-call cost, and the conversation keeps going instead of hitting the wall.
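The rise, drop, rise-again shape can be reproduced with a toy simulation that needs no API. This is a model of the shape only, not a measurement and not the API's actual trigger logic: the per-turn size, the trigger, and the floor are made-up numbers, and simulate and drops are hypothetical helpers.

```python
def simulate(turns, tokens_per_turn, trigger=None, floor=0):
    """Per-call input-token counts for a chat that adds `tokens_per_turn` of
    history each turn. If `trigger` is set, history that exceeds it is folded
    down to `floor` tokens of summary plus the newest turn."""
    history, per_call = 0, []
    for _ in range(turns):
        history += tokens_per_turn
        if trigger is not None and history > trigger:
            history = floor + tokens_per_turn
        per_call.append(history)
    return per_call

def drops(per_call):
    """1-based turn numbers where input tokens fell versus the previous call."""
    return [i + 1 for i in range(1, len(per_call))
            if per_call[i] < per_call[i - 1]]

naive = simulate(30, 6_000)
compacted = simulate(30, 6_000, trigger=150_000, floor=20_000)

print("naive turn 30:    ", naive[-1])
print("compacted turn 30:", compacted[-1])
print("compaction fired on turns:", drops(compacted))
print("total input tokens, naive vs compacted:", sum(naive), sum(compacted))
```

The drops helper is the same test you would apply to the real response.usage.input_tokens values collected from the loop: a turn where the count falls is a turn where a summary replaced raw history.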

We cannot run that against the live API on this box (no key), so the figures above are the expected shape, not a measured one. The on-box proof is the verified demo earlier in this chapter: it holds the in-window total under a fixed 120-token budget across all 19 turns while keeping the planted "Atlas" fact answerable. That is the same bounded-cost, kept-the-gist result the real tool produces, run end to end with output you can check.

In Claude Code

You see this without writing any of it when you use Claude Code. A long session there compacts on its own as it approaches the window, so the conversation keeps going without you pruning old turns by hand. You can also trigger it yourself with the /compact command when you want to fold the history down at a chosen moment rather than waiting for the automatic point. Either way it is the same mechanism: summarize the old turns, keep the gist in the window, drop the raw history to cheap storage.

Takeaways

  • A long conversation cannot keep fitting a fixed window, so something gets evicted. "Drop the oldest" stays under budget but is amnesia by design: the most important fact is often the earliest.
  • Compaction evicts smarter. It summarizes the oldest working-set turns into a compact running summary, archives the raw turns out of the window, and keeps the gist in-window for a fraction of the tokens.
  • Tiered memory is the arrangement: a small in-window tier (working set plus summary) backed by a large out-of-window archive, paged between like RAM and disk.
  • Bound the summary, or it becomes the new leak. Cap it and let low-priority facts age out first, so decisions survive and small talk does not.
  • Deleting context and summarizing it are different tools. Delete spent tool results (context editing); summarize old-but-still-relevant reasoning (compaction). Long agent runs use both.
  • The providers offer both server-side: Anthropic's compact_20260112 summarizes and clear_tool_uses_20250919 deletes; Letta exposes core/recall/archival tiers the agent pages itself. Same tiered idea, less hand-rolling.

👉 Compaction keeps the facts of a long run alive. The next chapter keeps the lessons alive: when an agent fails, how it records what went wrong and turns that into a procedure it follows next time, so the same mistake is not repeated. On to Chapter 12.