Output token reduction

Chapter 2 ended with a rule: weight a saved output token as five saved input tokens, then multiply by how many of each you actually have. Chapter 3 spent that rule on the input side, shrinking the prompt. This chapter spends it on the output side, trimming what the model writes back. Output is priced at 5x the input rate per token, so when you control the shape of the answer, this is often the single highest-return lever in the book.

The catch is that you do not edit the output the way you edit a prompt. The prompt is text you assemble; the output is text the model generates, and you have already paid for every token of it by the time you could delete one. So output reduction is not "write less text after the fact." It is "make the model generate less in the first place." That distinction runs through the whole chapter.

Where the waste is

Ask an untuned chat model a one-word question and you rarely get a one-word answer. You get a preamble ("Certainly! I'd be happy to help..."), a restatement of your question, the actual answer, and a trailing summary that adds nothing. For a human reading one reply, that padding is harmless, even friendly. For a pipeline making the same call a hundred thousand times, three of those four parts are pure cost: you pay output rates to generate words no downstream step will read.

The demo below makes that concrete with no real model. A stub generator returns the four-part verbose answer. Then three shapers trim it different ways, and we measure output tokens (the words * 1.3 estimate from chapter 2) and the dollars saved across a workload of 100,000 calls at the real claude-opus-4-8 output rate of $25 per million tokens. The three shapers are:

(a) terse mode: strip the preamble, the restatement, and the trailing summary, keeping only the lines that answer.
(b) schema extraction: force the answer into a tiny JSON object {"label": ...}, so only the needed field is ever emitted.
(c) hard cap: a max_tokens ceiling that truncates the stream once it is hit.

"""Output token reduction: shaping what the model writes BACK.

Chapter 2 established that output tokens cost 5x input tokens on the Anthropic
models. So cutting output is, token for token, the highest-return lever you have.
This file makes that concrete with NO real model: a stub generator produces a
verbose answer, three "shapers" trim it different ways, and we measure the output
tokens and DOLLARS saved across a workload of N calls.

Everything here is stdlib only (just `re`). We never call an API; the point is
to reason about the *shape* of the output, which you control before you ever send
the request.
"""

import re

# Real Anthropic rate for claude-opus-4-8: $25 per million OUTPUT tokens.
# (Input is $5/Mtok; output is the expensive half, see chapter 2.)
OUTPUT_RATE_PER_MTOK = 25.0


def est_tokens(text):
    """Estimate tokens from words, the chapter-2 rule: ~1.3 tokens per word.

    A real count comes from the provider's count_tokens endpoint; this estimate
    is fine for comparing two versions of the same kind of text.
    """
    words = len(text.split())
    return round(words * 1.3)


def dollars(out_tokens, n_calls):
    """Cost of `out_tokens` output tokens per call, across `n_calls` calls."""
    return out_tokens * n_calls / 1_000_000 * OUTPUT_RATE_PER_MTOK


# ---------------------------------------------------------------------------
# The stub "model". It returns a verbose answer in four parts, the way an
# untuned chat model tends to: a preamble, a restatement of the question, the
# actual answer, and a trailing summary. Only the third part carries signal.
# ---------------------------------------------------------------------------

def verbose_model(question, answer):
    """Simulate a chatty model. Returns one string with four labeled parts."""
    preamble = "Certainly! I'd be happy to help you with that."
    restatement = f"You asked: {question}"
    body = f"The answer is: {answer}."
    summary = (
        "In summary, I hope this explanation clarifies things for you. "
        "Let me know if you'd like me to elaborate further on any point!"
    )
    return "\n".join([preamble, restatement, body, summary])


# ---------------------------------------------------------------------------
# Three shapers. Each takes the verbose text (and, for the schema shaper, the
# raw answer) and returns a trimmed string.
# ---------------------------------------------------------------------------

# Filler openers a chat model reaches for. We strip whole lines that are pure
# preamble/summary and have no answer content.
_FILLER = re.compile(
    r"^(certainly|sure|of course|absolutely|i'd be happy|in summary|"
    r"i hope this|let me know|you asked)",
    re.IGNORECASE,
)


def shape_terse(text):
    """(a) Terse mode: drop preamble, restatement, and trailing summary.

    Keep only the lines that actually answer. This is what a 'be concise, no
    preamble' instruction (or low effort) buys you on the real API.
    """
    kept = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if _FILLER.match(stripped):
            continue
        kept.append(stripped)
    return "\n".join(kept)


def shape_schema(answer_value):
    """(b) Schema extraction: emit ONLY the needed field as a tiny JSON object.

    On the real API this is output_config={"format": {"type": "json_schema",
    "schema": {...}}}: the model is constrained to emit just the schema fields,
    so the prose never gets generated in the first place.
    """
    # Compact JSON, no spaces: {"label":"<value>"}
    return '{"label":"' + str(answer_value) + '"}'


def shape_cap(text, max_tokens):
    """(c) Hard cap: a max_tokens ceiling. Truncate once the estimate is hit.

    This does NOT make the answer short by design; it cuts the model off
    mid-stream, so it can lose the part that mattered (see the 'Don't be
    confused' box in the chapter).
    """
    out_words = []
    for word in text.split():
        # +1 word is ~1.3 more tokens; stop before crossing the ceiling.
        if est_tokens(" ".join(out_words + [word])) > max_tokens:
            break
        out_words.append(word)
    return " ".join(out_words)


def report(name, before_text, after_text, n_calls):
    """Print BEFORE/AFTER output tokens and dollars for one shaper."""
    b_tok, a_tok = est_tokens(before_text), est_tokens(after_text)
    b_usd, a_usd = dollars(b_tok, n_calls), dollars(a_tok, n_calls)
    pct = 0.0 if b_tok == 0 else (b_tok - a_tok) / b_tok * 100
    print(f"  {name}")
    print(f"    output tokens : {b_tok:4d}  ->  {a_tok:4d}   ({pct:4.0f}% smaller)")
    print(f"    cost / {n_calls:,} calls: ${b_usd:8.2f}  ->  ${a_usd:8.2f}   "
          f"(saves ${b_usd - a_usd:7.2f})")


def main():
    n_calls = 100_000  # one workload, run this many times

    # A classification task: the whole useful answer is one word.
    question = "Is this review positive or negative: 'the staff were rude'?"
    answer = "negative"

    verbose = verbose_model(question, answer)

    print("=== The verbose answer the model wants to write ===")
    print(verbose)
    print()
    print(f"Workload: {n_calls:,} calls, claude-opus-4-8 output at "
          f"${OUTPUT_RATE_PER_MTOK:.0f}/Mtok.")
    print(f"Verbose output is {est_tokens(verbose)} tokens; "
          f"that costs ${dollars(est_tokens(verbose), n_calls):,.2f} just to "
          f"emit, {n_calls:,} times.")
    print()

    print("=== Three ways to shape the output ===")
    report("(a) terse mode    ", verbose, shape_terse(verbose), n_calls)
    report("(b) schema (JSON) ", verbose, shape_schema(answer), n_calls)
    report("(c) hard cap @ 12 ", verbose, shape_cap(verbose, 12), n_calls)
    print()

    # The classification punchline: a one-word answer beats a paragraph, and it
    # is also a *correct* answer. Nothing was lost by shaping it short.
    one_word = answer
    paragraph = verbose
    print("=== Classification: one word beats a paragraph ===")
    print(f"  paragraph answer : {est_tokens(paragraph):3d} tokens  "
          f"(${dollars(est_tokens(paragraph), n_calls):,.2f} / {n_calls:,} calls)")
    print(f"  one-word answer  : {est_tokens(one_word):3d} tokens  "
          f"(${dollars(est_tokens(one_word), n_calls):,.2f} / {n_calls:,} calls)")
    saved = dollars(est_tokens(paragraph), n_calls) - dollars(est_tokens(one_word), n_calls)
    print(f"  same label, {saved / dollars(est_tokens(paragraph), n_calls) * 100:.0f}% "
          f"cheaper. The extra prose was never the answer.")


if __name__ == "__main__":
    main()

Running it:

=== The verbose answer the model wants to write ===
Certainly! I'd be happy to help you with that.
You asked: Is this review positive or negative: 'the staff were rude'?
The answer is: negative.
In summary, I hope this explanation clarifies things for you. Let me know if you'd like me to elaborate further on any point!

Workload: 100,000 calls, claude-opus-4-8 output at $25/Mtok.
Verbose output is 62 tokens; that costs $155.00 just to emit, 100,000 times.

=== Three ways to shape the output ===
  (a) terse mode    
    output tokens :   62  ->     5   (  92% smaller)
    cost / 100,000 calls: $  155.00  ->  $   12.50   (saves $ 142.50)
  (b) schema (JSON) 
    output tokens :   62  ->     1   (  98% smaller)
    cost / 100,000 calls: $  155.00  ->  $    2.50   (saves $ 152.50)
  (c) hard cap @ 12 
    output tokens :   62  ->    12   (  81% smaller)
    cost / 100,000 calls: $  155.00  ->  $   30.00   (saves $ 125.00)

=== Classification: one word beats a paragraph ===
  paragraph answer :  62 tokens  ($155.00 / 100,000 calls)
  one-word answer  :   1 tokens  ($2.50 / 100,000 calls)
  same label, 98% cheaper. The extra prose was never the answer.

Read the three shapers against each other. Terse mode takes the 62-token reply down to 5 tokens (92% smaller) by dropping the three non-answer parts. Schema extraction goes further, to a single estimated token, because it never lets the prose exist: the model is constrained to emit {"label":"negative"} and nothing else. The hard cap saves the least, 81%, and for a reason that matters: it does not shape the answer, it just stops it at 12 tokens. Here the 12 tokens that survive are the preamble, "Certainly! I'd be happy to help you with that.", and the label itself is cut off. A cap is a blunt instrument, and the chapter's warning box is about exactly that.
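The single-token figure for the schema output deserves a second look. The words * 1.3 estimate counts whitespace-separated words, so compact JSON with no spaces registers as one word. The sketch below compares it with a character-based estimate; the figure of about 4 characters per token is a common rule of thumb and an assumption of this sketch, not something taken from chapter 2, and neither number is a real tokenizer count.

```python
def est_tokens_words(text):
    """The chapter's estimate: ~1.3 tokens per whitespace-separated word."""
    return round(len(text.split()) * 1.3)


def est_tokens_chars(text, chars_per_token=4):
    """Hypothetical cross-check: ~4 characters per token (rule of thumb)."""
    return round(len(text) / chars_per_token)


compact_json = '{"label":"negative"}'   # no spaces, so it counts as one "word"
prose = "The answer is: negative."

for sample in (compact_json, prose):
    print(f"{sample!r}: words-based={est_tokens_words(sample)}, "
          f"chars-based={est_tokens_chars(sample)}")
```

Either way the schema output stays far below the 62-token verbose reply; only the exact percentage moves. For a number you intend to bill against, use the provider's token count rather than either estimate.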

The two ways to make output small

The schema and terse results are 98% and 92% smaller, the cap 81%, but the gap is not the headline. The headline is how each got small, because that decides whether the answer is still correct.

Don't be confused. Truncating output with a hard cap and shaping it to be short by design are not the same thing, even when they produce a similar token count. A max_tokens ceiling cuts the stream off at a fixed length wherever the model happens to be: if the model front-loaded filler, the cap can chop off the actual answer and leave you the preamble. Shaping (terse instructions, a schema, low effort) changes what the model decides to write, so the answer is short because it was never padded, not because it was guillotined. Use the cap as a safety ceiling against a runaway response, not as your primary way to get short answers. If your answers are short only because they keep hitting the cap, you have a design problem wearing a cost solution.
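A small sketch of the front-loading problem, using the same words * 1.3 estimate as the demo. The two replies and the cap of 8 tokens are hypothetical, chosen only to show that an identical ceiling keeps or loses the label depending on where the model put it.

```python
def est_tokens(text):
    """~1.3 tokens per whitespace-separated word (the chapter's estimate)."""
    return round(len(text.split()) * 1.3)


def cap(text, max_tokens):
    """Keep words until adding one more would cross the ceiling."""
    kept = []
    for word in text.split():
        if est_tokens(" ".join(kept + [word])) > max_tokens:
            break
        kept.append(word)
    return " ".join(kept)


# Hypothetical replies: same label, different placement.
filler_first = "Certainly! I'd be happy to help you with that. The answer is: negative."
answer_first = "negative. Certainly, I'd be happy to explain why if that would help."

for reply in (filler_first, answer_first):
    capped = cap(reply, 8)
    print(f"{capped!r} | label survived: {'negative' in capped}")
```

The cap spends the same number of tokens in both cases. Only the shaped, answer-first reply still carries the label afterward.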

That is why the schema shaper is the strongest of the three. It is the difference between asking nicely for a short answer and removing the model's ability to write a long one. A json_schema with one string field has no slot for a preamble, so the preamble cannot be generated, so you cannot be billed for it. Terse instructions get most of the way there and are easy to apply everywhere; schemas get the rest of the way when the output is structured enough to pin down.

One word beats a paragraph

The last block of the demo is the classification punchline. The task ("is this review positive or negative?") has a one-word answer. A paragraph that explains the sentiment, hedges, and offers to elaborate costs 62 tokens. The bare label negative costs 1. Same answer, 98% cheaper, and the short version is not a degraded answer: it is the whole answer. The extra prose was never the thing you asked for.

This generalizes past classification. Any task whose result is a fixed shape (a label, a number, a yes/no, a single extracted field, an enum) is a task where the useful output is tiny and everything else is the model being conversational at your expense. Extraction pipelines, routing decisions, and JSON-returning API endpoints all live here. The win is largest exactly where the answer is most constrained, because that is where the ratio of padding to signal is worst.
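A fixed-shape result is also cheap to check on the consuming side. The sketch below is one way a downstream step might accept a one-field reply and reject everything else. The function name parse_label and the ALLOWED_LABELS set are hypothetical, introduced here for illustration only.

```python
import json

ALLOWED_LABELS = {"positive", "negative"}  # hypothetical enum for this sketch


def parse_label(raw):
    """Return the label from a one-field JSON reply, or raise ValueError."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {raw!r}") from exc
    # Exactly one key, named "label": no slot for anything else.
    if not isinstance(obj, dict) or set(obj) != {"label"}:
        raise ValueError(f"expected exactly one 'label' field: {raw!r}")
    label = obj["label"]
    if not isinstance(label, str) or label not in ALLOWED_LABELS:
        raise ValueError(f"unknown label: {label!r}")
    return label


print(parse_label('{"label":"negative"}'))
```

A reply that carries prose, extra fields, or an unexpected label fails loudly instead of flowing into the next step, which is the practical payoff of keeping the output shape this small.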

The provider-native levers

Everything above was simulated so the mechanics stay visible. On the real Anthropic API you do not hand-roll the shapers; you set request parameters and let the model do the trimming. The following is follow-along (the build machine has no API key), and the output shapes are illustrative, but the calls are exact. Consult the claude-api skill before writing this for real.

The four that matter for output are effort, max_tokens, structured output, and stop_sequences, numbered in the code in rough order of how often you reach for them:

# Illustrative: requires the anthropic SDK and an API key.
import anthropic
client = anthropic.Anthropic()

# 1. effort: low. The single biggest knob. Lower effort means fewer and more
#    consolidated tool calls, less preamble, and terser confirmations. Values:
#    low | medium | high | max. For simple/cheap work, low is the right default.
resp = client.messages.create(
    model="claude-opus-4-8",
    max_tokens=256,                       # 2. hard ceiling (the safety cap, not the shaper)
    output_config={"effort": "low"},
    messages=[{"role": "user", "content": "Classify sentiment: 'the staff were rude'"}],
)

# 3. Structured output: constrain the model to emit ONLY the schema fields, so
#    the prose is never generated. This is shaper (b) from the demo, for real.
resp = client.messages.create(
    model="claude-opus-4-8",
    max_tokens=64,
    output_config={
        "format": {
            "type": "json_schema",
            "schema": {
                "type": "object",
                "properties": {"label": {"type": "string",
                                         "enum": ["positive", "negative"]}},
                "required": ["label"],
                "additionalProperties": False,
            },
        }
    },
    messages=[{"role": "user", "content": "Classify sentiment: 'the staff were rude'"}],
)

# 4. stop_sequences: end generation the instant a marker appears, so the model
#    can't run on past the part you wanted.
resp = client.messages.create(
    model="claude-opus-4-8",
    max_tokens=256,
    stop_sequences=["\n\n"],              # stop at the first blank line
    messages=[{"role": "user", "content": "Give the answer on one line."}],
)

Map these back to the demo. output_config={"effort": "low"} is the production version of terse mode: it makes the model write less without you specifying what to cut. output_config={"format": {...}} is the schema shaper, the strongest lever when the output is structured, because it removes the slots that padding would fill. max_tokens is the hard cap, and the same warning applies: it is a ceiling against a runaway response, not your way to get short answers. stop_sequences is a finer version of the same idea, ending generation at a content marker rather than a token count, so it cuts at the right place instead of an arbitrary length.
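One way to tell whether max_tokens is acting as a ceiling or as an accidental shaper is to track how often responses end on it. The sketch below assumes you have collected each response's stop_reason string, which the Anthropic Messages API sets to "max_tokens" when the ceiling ended the reply. The helper name cap_hit_rate and the sample values are hypothetical.

```python
from collections import Counter


def cap_hit_rate(stop_reasons):
    """Fraction of responses that ended because they hit max_tokens."""
    if not stop_reasons:
        return 0.0
    counts = Counter(stop_reasons)
    return counts["max_tokens"] / len(stop_reasons)


# Hypothetical sample: stop_reason values collected from 10 responses.
sample = ["end_turn"] * 7 + ["max_tokens"] * 2 + ["stop_sequence"]
rate = cap_hit_rate(sample)
print(f"cap hit rate: {rate:.0%}")
```

A rate near zero means the cap is doing its job as a safety net. A high rate suggests replies are being truncated rather than shaped, and the fix belongs in effort, instructions, or a schema.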

A note on where these don't help. The 5x asymmetry is per token, so trimming output pays best when the output is a meaningful fraction of the total. In a long-document, short-answer workload the input still dominates the bill (chapter 2's worked example), and the right move is to compress the input or cache it. Output reduction shines when the answer is verbose relative to the question, which is most of the time for chat, agents, and any task where the model is inclined to explain itself.
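The point about fractions can be checked with a few lines of arithmetic. This sketch uses the chapter's rates ($5 and $25 per million tokens); the two workloads are hypothetical, picked only to show the extremes.

```python
INPUT_RATE_PER_MTOK = 5.0    # $ per million input tokens (chapter's rate)
OUTPUT_RATE_PER_MTOK = 25.0  # $ per million output tokens (chapter's rate)


def output_share(input_tokens, output_tokens):
    """Fraction of one call's cost that comes from output tokens."""
    in_cost = input_tokens * INPUT_RATE_PER_MTOK
    out_cost = output_tokens * OUTPUT_RATE_PER_MTOK
    total = in_cost + out_cost
    return 0.0 if total == 0 else out_cost / total


# Hypothetical workloads, chosen only to show the two extremes.
workloads = {
    "long document, short answer": (20_000, 50),
    "short question, chatty answer": (200, 400),
}
for name, (tok_in, tok_out) in workloads.items():
    print(f"{name}: output is {output_share(tok_in, tok_out):.0%} of the bill")
```

When the share is a percent or two, even eliminating the output entirely barely moves the total; when it is most of the bill, the levers in this chapter are where the savings are.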

Two real projects

Two named tools sit on the output side of this line, trimming what the model emits rather than what you send it.

caveman is a post-generation trimmer: it takes the model's reply and strips the conversational scaffolding (the "Certainly!", the hedges, the offers to elaborate), the way shaper (a) does. It is a cleanup pass on text you already paid to generate, which makes it a tool for the downstream consumer, not a way to lower the bill.
Headroom's shaper is a constrained-generation layer: it pushes the request to emit a target shape (a schema, a bounded length) so the trimming happens during generation, the way shapers (b) and (c) do. Because it constrains what the model writes, it actually reduces the output tokens you are billed for, not just the tokens you keep.

The distinction between them is the same one the "Don't be confused" box drew: trimming after the fact tidies the text but you already paid for it; constraining the generation is what changes the bill. Reach for constrained generation (effort, schema, stop sequences) when cost is the goal, and post-trimming when you just want the downstream text clean.
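The billing difference can be sketched with the chapter's estimator. This is not how either tool is implemented; it is a toy comparison under the assumption that you are billed for what the model generated, with hypothetical reply strings.

```python
OUTPUT_RATE_PER_MTOK = 25.0  # $ per million output tokens (chapter's rate)


def est_tokens(text):
    """~1.3 tokens per whitespace-separated word (the chapter's estimate)."""
    return round(len(text.split()) * 1.3)


def bill(generated_text, n_calls):
    """Dollars billed for output: depends only on what was generated."""
    return est_tokens(generated_text) * n_calls / 1_000_000 * OUTPUT_RATE_PER_MTOK


# Hypothetical replies for the same classification request.
verbose = "Certainly! I'd be happy to help you with that.\nThe answer is: negative."
constrained = '{"label":"negative"}'

# Post-trim: the model still generates `verbose`; we keep only the last line.
post_trimmed = verbose.splitlines()[-1]

n_calls = 100_000
print(f"post-trim   : keep {est_tokens(post_trimmed)} tokens, "
      f"billed ${bill(verbose, n_calls):.2f}")
print(f"constrained : keep {est_tokens(constrained)} tokens, "
      f"billed ${bill(constrained, n_calls):.2f}")
```

In the post-trim row the kept text is short but the bill is computed from the full generated reply. In the constrained row the two are the same text, so the bill shrinks with it.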

Using the real tool: commands and before/after proof

The "real tool" for output reduction is mostly the Anthropic API itself: the levers are request parameters, and the metric is a field on the response. There is no separate library to install. Here is one call that stacks the levers from the demo, written as follow-along because this box has no API key.

# Follow-along: requires the anthropic SDK and an API key.
import anthropic
client = anthropic.Anthropic()

resp = client.messages.create(
    model="claude-opus-4-8",
    max_tokens=16,                     # hard ceiling: a safety cap, not the shaper
    output_config={
        "effort": "low",              # terser, less preamble (low | medium | high | max)
        "format": {                   # constrain the answer to one JSON field
            "type": "json_schema",
            "schema": {
                "type": "object",
                "properties": {"label": {"type": "string",
                                         "enum": ["positive", "negative"]}},
                "required": ["label"],
                "additionalProperties": False,
            },
        },
    },
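    # No stop_sequences here: the schema ends generation when the object is
    # complete, and a "}" stop sequence would be stripped from the returned
    # text, leaving JSON without its closing brace.
    messages=[{"role": "user",
               "content": "Classify sentiment: 'the staff were rude'"}],
)

Each line earns its place. effort: "low" tells the model to spend fewer tokens thinking and to skip the conversational scaffolding. max_tokens=16 is the ceiling that truncates a runaway response, the same blunt instrument as shaper (c). The json_schema format removes every slot the prose would fill, so the model can only emit {"label": ...}, which is shaper (b). There is deliberately no stop_sequences entry in this call: a matched stop sequence is not included in the returned text, so stopping on "}" would hand back the object without its closing brace, and the schema already leaves nothing valid to emit once the object is complete. Keep stop sequences for free-form text, as in the earlier one-line example.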

To prove the levers did something, read response.usage.output_tokens, the count of tokens the model actually generated, and price it at the claude-opus-4-8 output rate of $25 per million tokens. Call once without the levers and once with them, and compare:

# Follow-along: same caveat as above.
def cost(tokens):                     # dollars to emit this many output tokens
    return tokens / 1_000_000 * 25.0

# WITHOUT the levers: high effort, free-form answer.
loose = client.messages.create(
    model="claude-opus-4-8",
    max_tokens=1024,
    output_config={"effort": "high"},
    messages=[{"role": "user",
               "content": "Classify sentiment: 'the staff were rude'"}],
)
print(loose.usage.output_tokens, cost(loose.usage.output_tokens))

# WITH the levers: the stacked call from above.
print(resp.usage.output_tokens, cost(resp.usage.output_tokens))

The numbers below are illustrative (expected shapes, not measured on this box, which has no key), but the ratio is the point:

without levers : ~120 output tokens   (~$0.0030 per call)
with levers    :   ~8 output tokens   (~$0.0002 per call)

That 120-to-8 drop is the same effect the from-scratch shaper demo measured at the top of the chapter, where terse mode and the schema took a 62-token reply down to 5 and 1 tokens. The demo proved the mechanism on the box with a stub; this proves it on the real API by reading the token count the provider bills you for.

In Claude Code

Claude Code exposes the same effort lever at the CLI. Its default for coding is xhigh, which is deliberate: coding wants thorough reasoning and careful tool use. When you lower the effort, the agent makes fewer and more consolidated tool calls and writes terser output, which means fewer output tokens for the same task. You can also just ask for a short final answer ("give me the one-line summary, no explanation"). Lowering effort is the CLI-level version of the output_config={"effort": "low"} parameter above: same knob, same effect on the bill, reached through the tool instead of the request body.

Takeaways

Output is priced 5x input per token, so making the model write less is usually the highest-return lever. You cannot edit output after the fact for free; you have already paid for every token by the time you could delete one.
Output reduction means making the model generate less, not cleaning up afterward. The levers are effort, structured output (schemas), max_tokens, and stop sequences.
A hard cap truncates wherever the model happens to be and can cut off the real answer; shaping (terse, schema, low effort) makes the answer short by design and keeps it correct. Use the cap as a safety ceiling, not as your short-answer strategy.
For tasks with a fixed-shape result (classification, extraction, routing, JSON APIs), a one-word or one-field answer is not a worse answer, it is the whole answer. That is where output reduction pays the most.
On the real API, output_config={"effort": "low"} is the everyday terse knob and output_config={"format": {"type": "json_schema", ...}} is the strongest when the output is structured, because it removes the slots padding would fill.

👉 We have now squeezed both halves of a single call: the input (Chapter 3) and the output (this chapter). The next chapter narrows in on a context type that breaks the naive shrinking we have done so far: code. Source files have structure (imports, definitions, call graphs) that a blind word-trimmer destroys, and the next chapter shows how to compress and select context in a way that respects it.

Context Engineering in Depth