The token economy

Every technique in this book is justified by a number: tokens saved, dollars saved, milliseconds saved. So before the techniques, we need to be able to count. This chapter covers what a token is, how much more output tokens cost than input tokens (an asymmetry that quietly decides what is worth optimizing), and how to estimate a context's size and price before you send it.

What a token is

Models do not read characters or words. They read tokens: chunks of text from a fixed vocabulary, usually sub-word pieces. Common words are a single token (context), rarer or longer words split into several (contextual might become context + ual), and unusual strings fall apart into many small pieces. The split is learned rather than rule-based. The usual algorithm is byte pair encoding (BPE): during training, start from individual characters, repeatedly merge the most frequent adjacent pair into a new token, and keep the merges as a ranked list. At encode time you replay those merges, in rank order, on each new word.

This matters for context engineering because your bill and your window are measured in tokens, not words or characters, and the ratio is not fixed. English prose runs about 1.3 tokens per word; code, JSON, and non-English text run higher because they fragment more. That is why you cannot reliably eyeball a budget, and why providers offer exact counting tools such as Anthropic's count_tokens endpoint.
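One rough way to see why the ratio moves is to count how many word runs and punctuation marks each whitespace-separated word breaks into. The sketch below is only a proxy, not a real tokenizer: pieces_per_word and the two sample strings are made up for illustration, and a real BPE vocabulary will split text differently. It still shows the direction of the effect, with punctuation-dense JSON breaking into far more pieces per word than prose.

```python
import re


def pieces_per_word(text):
    """Crude fragmentation proxy (NOT a real tokenizer): count runs of word
    characters plus individual punctuation marks, per whitespace-separated word."""
    words = text.split()
    pieces = re.findall(r"\w+|[^\w\s]", text)
    return len(pieces) / len(words)


# Hypothetical samples carrying roughly the same information.
prose = "The deployment failed because the database migration timed out."
json_text = '{"status": "failed", "step": "db_migration", "timeout_s": 30}'

for label, text in [("prose", prose), ("json", json_text)]:
    print(f"{label:5s} {pieces_per_word(text):.2f} pieces per word")
```

Every quote, colon, and brace is a separate piece here, which is roughly what happens to structural characters in a real tokenizer too. For actual budgets, the exact count still has to come from the provider.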

Output is the expensive half

Here is the fact that reshapes what you optimize: you are billed for both the tokens you send (input) and the tokens the model writes (output), but output is priced several times higher than input. On the three Anthropic models priced in the demo below, output is exactly 5x the input rate.

The demo below shows three things. First, the 5x asymmetry across models. Second, a worked workload: token for token, cutting output is worth 5x cutting input, even though in a long-prompt/short-answer workload the input still dominates the total bill (the asymmetry is per token, not per call). Third, a from-scratch BPE tokenizer, so that "sub-word merging" is concrete rather than a phrase, plus the two practical estimates you can apply in your head.

"""The token economy: why output is the expensive half, and how to estimate cost.

Two ideas, both measurable:
  1. You are billed per token, and OUTPUT tokens cost several times more than INPUT
     tokens. So trimming what the model WRITES often saves more than trimming what
     you SEND.
  2. You can estimate a prompt's token count before you send it. Real tokenizers use
     sub-word merging (BPE); we show a tiny BPE to demystify it, plus the practical
     rule of thumb. (For exact counts, call the provider's count_tokens endpoint;
     never use another vendor's tokenizer, it will be wrong.)

Standard library only. Run:  python3 token_economy.py
"""

from collections import Counter

# Published Anthropic prices, US dollars per MILLION tokens (input, output).
# These anchor the asymmetry; verify current numbers before quoting them anywhere.
PRICES = {
    "claude-opus-4-8":   (5.0, 25.0),   # 1M context
    "claude-sonnet-4-6": (3.0, 15.0),   # 1M context
    "claude-haiku-4-5":  (1.0,  5.0),   # 200K context
}


def cost(model, in_tok, out_tok):
    """Dollar cost of one call: each token count times its per-million-token rate."""
    pin, pout = PRICES[model]
    return (in_tok / 1e6) * pin + (out_tok / 1e6) * pout


print("=== 1. Output is the expensive half ===")
print("Per-token, output costs 5x input on every model here:\n")
for m, (pin, pout) in PRICES.items():
    print(f"  {m:18s} input ${pin:>5.2f}/Mtok   output ${pout:>5.2f}/Mtok   "
          f"output is {pout/pin:.0f}x")
print()

# A support agent: a big stable prompt in, a short answer out, called a lot.
in_tok, out_tok, calls = 8000, 400, 100_000
m = "claude-opus-4-8"
base = cost(m, in_tok, out_tok) * calls
print(f"Workload: {calls:,} calls, {in_tok} input + {out_tok} output tokens each "
      f"({m}).")
print(f"  total cost: ${base:,.0f}\n")
# Per-token, trimming OUTPUT is worth 5x trimming the same count of INPUT.
save_out = (cost(m, in_tok, out_tok) - cost(m, in_tok, out_tok - 100)) * calls
save_in = (cost(m, in_tok, out_tok) - cost(m, in_tok - 100, out_tok)) * calls
print(f"  cut 100 OUTPUT tokens/call:  saves ${save_out:,.0f}")
print(f"  cut 100 INPUT  tokens/call:  saves ${save_in:,.0f}")
print(f"  -> token for token, output is worth {save_out/save_in:.0f}x as much to cut.")
print("  (Caching makes input cheaper still, ~0.1x on a hit, widening the gap;")
print("   see chapter 6. Whether INPUT or OUTPUT dominates your bill depends on the")
print("   ratio: a long prompt with a short answer is input-heavy in TOTAL, even")
print("   though each output token is pricier.)\n")


def bpe_encode(word, merges):
    """Encode one word with a learned merge list (the heart of GPT/Claude-style
    tokenizers). Start from characters, then repeatedly glue the highest-priority
    adjacent pair until no learned merge applies."""
    toks = list(word)
    while True:
        pairs = [(toks[i], toks[i + 1]) for i in range(len(toks) - 1)]
        ranked = [(merges[p], i) for i, p in enumerate(pairs) if p in merges]
        if not ranked:
            return toks
        _, i = min(ranked)  # apply the earliest-learned (highest priority) merge
        toks[i:i + 2] = ["".join(toks[i:i + 2])]


def learn_merges(corpus, n):
    """Learn n merges greedily: repeatedly fuse the most frequent adjacent pair."""
    seqs = [list(w) for w in corpus.split()]
    merges = {}
    for rank in range(n):
        pairs = Counter()
        for s in seqs:
            for i in range(len(s) - 1):
                pairs[(s[i], s[i + 1])] += 1
        if not pairs:
            break
        best = pairs.most_common(1)[0][0]
        merges[best] = rank
        for s in seqs:  # apply the new merge everywhere
            i = 0
            while i < len(s) - 1:
                if (s[i], s[i + 1]) == best:
                    s[i:i + 2] = ["".join(best)]
                else:
                    i += 1
    return merges


print("=== 2. A tiny tokenizer (BPE), from scratch ===")
corpus = "context contexts contextual contexted engineering engineer engineered " * 50
merges = learn_merges(corpus, n=12)
for w in ["context", "contextual", "engineering", "token"]:
    toks = bpe_encode(w, merges)
    print(f"  {w:12s} -> {toks}   ({len(toks)} token(s))")
print()

print("=== 3. The practical estimate ===")
prompt = ("Summarize the deployment failure and propose a fix. "
          "Keep it under three sentences. ") * 4
words = len(prompt.split())
chars = len(prompt)
print(f"  prompt: {words} words, {chars} chars")
print(f"  ~words * 1.3 = {round(words * 1.3)} tokens")
print(f"  ~chars / 4   = {round(chars / 4)} tokens")
print("  Both are estimates. For billing-grade counts, call the model's own")
print("  count_tokens endpoint with the SAME model id you will send to.")

Running it:

=== 1. Output is the expensive half ===
Per-token, output costs 5x input on every model here:

  claude-opus-4-8    input $ 5.00/Mtok   output $25.00/Mtok   output is 5x
  claude-sonnet-4-6  input $ 3.00/Mtok   output $15.00/Mtok   output is 5x
  claude-haiku-4-5   input $ 1.00/Mtok   output $ 5.00/Mtok   output is 5x

Workload: 100,000 calls, 8000 input + 400 output tokens each (claude-opus-4-8).
  total cost: $5,000

  cut 100 OUTPUT tokens/call:  saves $250
  cut 100 INPUT  tokens/call:  saves $50
  -> token for token, output is worth 5x as much to cut.
  (Caching makes input cheaper still, ~0.1x on a hit, widening the gap;
   see chapter 6. Whether INPUT or OUTPUT dominates your bill depends on the
   ratio: a long prompt with a short answer is input-heavy in TOTAL, even
   though each output token is pricier.)

=== 2. A tiny tokenizer (BPE), from scratch ===
  context      -> ['context']   (1 token(s))
  contextual   -> ['context', 'u', 'a', 'l']   (4 token(s))
  engineering  -> ['enginee', 'r', 'i', 'ng']   (4 token(s))
  token        -> ['t', 'o', 'k', 'e', 'n']   (5 token(s))

=== 3. The practical estimate ===
  prompt: 52 words, 332 chars
  ~words * 1.3 = 68 tokens
  ~chars / 4   = 83 tokens
  Both are estimates. For billing-grade counts, call the model's own
  count_tokens endpoint with the SAME model id you will send to.

Read the BPE output closely, because it shows the algorithm working. After learning just twelve merges on a small corpus, context has been fused into a single token, while token itself, which never appeared in the training corpus, stays shattered into five single-character tokens. That is exactly why a word that was common in the tokenizer's training data is cheap and a rare identifier is expensive: the tokenizer has merges for the former and not for the latter. The frequency that counts is the training corpus's, not yours: a term that appears everywhere in your domain can still split into many tokens.

The cost asymmetry is the strategic takeaway. It is why Chapter 4 is dedicated entirely to making the model write less, and why a one-line "be concise" instruction can have a bigger return than compressing a long document. It does not mean input is free: in the workload above the input still costs more in total because there is 20x more of it. The rule is per token. When you are choosing what to optimize, weight a saved output token as five saved input tokens, then multiply by how many of each you actually have.
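That weighting can be written down directly. The sketch below assumes the Opus rates from the demo (5 and 25 dollars per million tokens); the helper name saving_per_call and the two candidate cuts are hypothetical. It compares a large input trim against a small output trim on the same workload.

```python
# Assumed rates in dollars per million tokens, copied from the demo's PRICES
# table; verify current pricing before relying on them.
INPUT_RATE, OUTPUT_RATE = 5.0, 25.0


def saving_per_call(in_cut, out_cut):
    """Dollars saved on one call by removing in_cut input and out_cut output tokens."""
    return (in_cut * INPUT_RATE + out_cut * OUTPUT_RATE) / 1e6


# Two hypothetical optimizations for the 8000-in / 400-out workload.
trim_prompt = saving_per_call(in_cut=2000, out_cut=0)   # drop a quarter of the prompt
terser_reply = saving_per_call(in_cut=0, out_cut=150)   # shorten each answer

calls = 100_000
print(f"trim prompt:  ${trim_prompt * calls:,.0f}")
print(f"terser reply: ${terser_reply * calls:,.0f}")
# 150 output tokens are weighted like 750 input tokens, which is still
# fewer than the 2000 input tokens the prompt trim removes.
```

Over 100,000 calls the prompt trim saves about $1,000 and the terser reply about $375. The output cut is worth more per token, but the input cut wins here because it removes many more tokens.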

Counting tokens for real

Estimates (words * 1.3, chars / 4) are fine for a sanity check. For anything that touches a budget or a billing decision, count exactly, using the same model id you will send to, because different model families tokenize differently. Anthropic exposes this as count_tokens, which takes the same messages you would send and returns the input token count without running the model. The build machine has no API key, so the following snippet was not executed here, but it is the exact call:

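# Illustrative: requires the anthropic SDK and an API key (ANTHROPIC_API_KEY).
import anthropic

# Placeholder inputs; substitute the system prompt and document you will send.
SYSTEM_PROMPT = "You are a support agent."
LONG_DOCUMENT = "...the document you want to price..."

client = anthropic.Anthropic()  # reads the API key from the environment

n = client.messages.count_tokens(
    model="claude-opus-4-8",
    system=SYSTEM_PROMPT,
    messages=[{"role": "user", "content": LONG_DOCUMENT}],
).input_tokens
print(n, "input tokens")  # exact, free, no generation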

Don't be confused. Do not reach for another vendor's tokenizer library (for example a GPT tokenizer) to count tokens for Claude, or vice versa. Each model family has its own vocabulary and merges, so a foreign tokenizer can be off by 15 to 20 percent, more on code. "Approximately right" is fine for a gut check and wrong for a budget. Use the provider's own count_tokens for the model you will actually call.

Pricing a context before you send it

Put the two halves together and you can price a call before making it: count the input tokens, estimate the output tokens you expect, and multiply each by its per-token rate. The cost(model, in_tok, out_tok) function in the demo is the whole calculation. Doing this up front is what turns context engineering from a vibe ("this feels bloated") into a decision ("this preamble costs $2,000 a day re-sent uncached; caching it drops that to $200"). Every later chapter ends up cashing out as a change to one of three levers in that formula: fewer input tokens, fewer output tokens, or a cheaper rate on repeated input.
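As a sketch of that calculation with caching included: the function below assumes the Opus rates from the demo and the roughly 0.1x cache-hit rate mentioned there, and it ignores any surcharge for writing the cache. The name price_call, the 4,000-token preamble, and the 100,000 calls per day are hypothetical, chosen so the result reproduces the $2,000 versus $200 comparison.

```python
# Assumptions: demo rates in dollars per million tokens, and cached input billed
# at about 0.1x the normal input rate on a hit. Cache-write cost is ignored.
INPUT_RATE, OUTPUT_RATE, CACHE_HIT_FACTOR = 5.0, 25.0, 0.1


def price_call(in_tok, out_tok, cached_tok=0):
    """Estimated dollars for one call; cached_tok is the part of in_tok served from cache."""
    fresh = in_tok - cached_tok
    return (fresh * INPUT_RATE
            + cached_tok * INPUT_RATE * CACHE_HIT_FACTOR
            + out_tok * OUTPUT_RATE) / 1e6


preamble, calls_per_day = 4_000, 100_000  # hypothetical workload
uncached = price_call(preamble, 0) * calls_per_day
cached = price_call(preamble, 0, cached_tok=preamble) * calls_per_day
print(f"preamble re-sent uncached:  ${uncached:,.0f}/day")
print(f"preamble served from cache: ${cached:,.0f}/day")
```

The output count is set to zero here to isolate the preamble; in a real estimate you would pass the expected answer length as out_tok and the exact input count from count_tokens as in_tok.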

Takeaways

  • Tokens are learned sub-word pieces, not words. The count is roughly 1.3 tokens per English word and higher for code or JSON; the ratio is not fixed, so estimates drift.
  • BPE builds the vocabulary by merging frequent adjacent pairs; a word that was common in the tokenizer's training data is one cheap token, while a rare identifier splits into many tokens and costs correspondingly more.
  • Output tokens cost about 5x input tokens. Token for token, trimming output is worth 5x trimming input, but whether input or output dominates your total bill depends on the ratio you actually run.
  • Estimate with words * 1.3 for a gut check; count with the provider's count_tokens and the exact model id for anything that touches a budget. Never use a foreign tokenizer.
  • Pricing a context up front (count in, estimate out, multiply by rates) turns "feels bloated" into a number you can act on.

👉 Now that we can measure a context and its price, we can start shrinking it. The next chapter compresses the input: removing the tokens a long prompt does not need while keeping the ones it does.