KV-cache and prefix caching

Chapter 2 priced a context: count the input tokens, estimate the output, multiply by the rates. That formula has a quiet assumption baked into it, that you pay for every input token on every call. For the parts of your context that do not change between calls, that assumption is wrong, and this chapter is about why.

Most production contexts are mostly stable. A long system prompt, a fixed set of tool definitions, a retrieved document the user is asking three questions about: the same tokens go out at the front of call after call, and only the tail (the new question) differs. If the model re-reads that stable prefix from scratch every time, you re-pay for it every time. Caching is how you stop.

There are two layers to this, and they are easy to confuse, so we separate them. The lower layer is the KV cache, a mechanism inside the model that makes generation itself tractable. The upper layer is prompt caching, a feature you call from the API that reuses the cached work for a stable prefix across separate requests. The first makes one generation cheap; the second makes a thousand calls with a shared prefix cheap. They are the same idea (do not redo work on tokens you have already processed) applied at two scales, and seeing the mechanism makes the API feature obvious instead of magical.

What attention actually computes

To see what gets cached, we need the one operation at the heart of a transformer: attention. Strip away the layers and the heads and it comes down to this. For a single token that is "looking" at the tokens before it, attention computes

$$\text{softmax}\!\left(\frac{QK^\top}{\sqrt{d}}\right)V$$

That is dense, so here is every symbol in plain terms. A token enters as a vector. Three learned weight matrices turn that vector into three new vectors:

the query $Q$, what this token is looking for;
the key $K$, a label each token advertises, to be matched against queries;
the value $V$, the content a token contributes once its key is matched.

Turning a token vector into a query, key, or value by multiplying it through one of those matrices is called a projection. Projections are the work. They are what we count for the rest of this chapter, because they are what caching avoids.

Now read the formula left to right. $QK^\top$ is the query of the current token dotted against the key of every earlier token: one number per earlier token, scoring how well each one matches what we are looking for. Dividing by $\sqrt{d}$ (where $d$ is the vector dimension) keeps those scores from growing huge as the dimension grows, which would make the next step too sharp. The softmax turns the row of scores into a row of weights that are all positive and sum to 1. Finally $\dots V$ takes the weighted average of the value vectors. That weighted average is the attention output for this token: it pulled in content from earlier tokens in proportion to how well their keys matched its query.
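To make the formula concrete, here is a minimal sketch of attention for a single query. The vectors are made-up numbers chosen for illustration (they do not come from any model), and the helper name `attend_one` is hypothetical.

```python
import numpy as np

def attend_one(q, K, V):
    """softmax(q K^T / sqrt(d)) V for a single query vector q."""
    d = q.shape[-1]
    scores = K @ q / np.sqrt(d)        # one score per token in K
    scores = scores - scores.max()     # subtract the max so exp() cannot overflow
    weights = np.exp(scores)
    weights = weights / weights.sum()  # all positive, sums to 1
    return weights, weights @ V        # weighted average of the value rows

# Three tokens, dimension 2. Illustrative values only.
K = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
V = np.array([[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]])
q = np.array([4.0, 0.0])               # a query that "looks for" the first axis

weights, out = attend_one(q, K, V)
print("weights:", weights.round(3))
print("output: ", out.round(3))
```

The second token's key has nothing along the first axis, so it gets the smallest weight and its value contributes least to the output. The other two keys match the query equally and share the rest of the weight.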

The thing to notice: to attend, the current token needs the key and value of every token before it (plus its own). Those keys and values are exactly what a cache can store.

The KV cache, within one generation

A model generates text one token at a time. To produce token $t$, it runs attention, which needs $K$ and $V$ for tokens $0$ through $t$. Then it appends the new token and does it again for $t+1$. The naive way to implement this is to rebuild $K$ and $V$ from scratch at every step: at step $t$, project all $t+1$ tokens again. That is $1 + 2 + \dots + n$ projections to generate $n$ tokens, which is $n(n+1)/2$, roughly $n^2/2$. Quadratic.

But the key and value of token 5 never change. Once token 5 is in the sequence, its projection is fixed; later tokens cannot alter it (a token only ever attends backward). So you store each token's $K$ and $V$ the first time you compute it, and at each new step you project only the one new token and append it to the stored list. That stored list is the KV cache. With it, generating $n$ tokens costs $n$ projections. Linear.

The demo below builds a toy single-head attention in NumPy and counts projections both ways, for $n = 8, 64, 256$. Crucially, it also asserts that the cached and uncached paths produce the same output: the cache is exact, a pure speedup, not an approximation that trades quality for cost.

"""KV-cache and prefix caching, from scratch in NumPy.

Two demonstrations, both with real counts:

  1. KV cache WITHIN one generation. To produce token t, attention needs the
     keys (K) and values (V) of every token 0..t. Without a cache, each step
     re-projects K,V for all prior tokens, so generating n tokens costs about
     n^2/2 projections. With a cache, you store past K,V and project only the
     ONE new token, so it costs about n. We count both, print the ratio, and
     assert the two paths produce the SAME output (the cache is exact, not an
     approximation).

  2. PREFIX sharing ACROSS requests. Two requests share a long identical
     prefix, then differ. Caching the prefix's K,V lets the second request skip
     recomputing the shared part. We report the projections saved, then convert
     that to money using Anthropic's prompt-caching economics (a cache read
     costs about 0.1x the normal input rate).

Run with NumPy + the standard library only. No other imports.
"""

import numpy as np

# A global projection counter. Every time we project one token into a key and
# value vector, we bump this by 1. Counting projections is how we make the
# "quadratic vs linear" claim concrete instead of hand-wavy.
PROJECTIONS = 0


def reset_counter():
    global PROJECTIONS
    PROJECTIONS = 0


def softmax(x):
    """Numerically stable softmax over the last axis.

    softmax turns a row of raw scores into a row of weights that are all
    positive and sum to 1. Subtracting the max first avoids overflow in exp().
    """
    x = x - np.max(x, axis=-1, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=-1, keepdims=True)


# ----------------------------------------------------------------------------
# A toy single-head self-attention.
#
# Vocabulary of the terms, assuming no background:
#   - "token"      : one chunk of input, here represented as a vector x_t.
#   - "projection" : multiply a token vector by a learned weight matrix to get
#                    a new vector. Here W_q, W_k, W_v turn each token into a
#                    query, a key, and a value. This is the work we are counting.
#   - "query" (Q)  : what the current token is looking for.
#   - "key"   (K)  : what each token offers as a label, to be matched against Q.
#   - "value" (V)  : the content each token contributes once its key matches.
#
# Attention for one query is:  softmax(q . K^T / sqrt(d)) @ V
#   q . K^T   : how well the query matches every key (one score per past token).
#   / sqrt(d) : scale the scores so they do not blow up as the dimension grows.
#   softmax   : turn the scores into weights that sum to 1.
#   @ V       : take the weighted average of the value vectors. That average is
#               the attention output for this token.
# ----------------------------------------------------------------------------

D = 8  # vector dimension. Small so the demo is fast; the logic is the same at scale.
rng = np.random.default_rng(0)  # fixed seed: the numbers below are reproducible.
W_q = rng.standard_normal((D, D)) * 0.5
W_k = rng.standard_normal((D, D)) * 0.5
W_v = rng.standard_normal((D, D)) * 0.5


def project_kv(token):
    """Project ONE token into its key and value vector. Counts as 1 projection."""
    global PROJECTIONS
    PROJECTIONS += 1
    k = token @ W_k
    v = token @ W_v
    return k, v


def attend(query, K, V):
    """Attention output for a single query against stored keys K and values V."""
    scores = (query @ K.T) / np.sqrt(D)  # one score per past token
    weights = softmax(scores)            # weights that sum to 1
    return weights @ V                   # weighted average of the values


def generate_no_cache(tokens):
    """Generate, RE-projecting K,V for all prior tokens at every step.

    At step t we rebuild K and V from scratch for tokens 0..t. That is t+1
    projections at step t, so 1 + 2 + ... + n = n(n+1)/2 projections in total:
    quadratic in the number of tokens.
    """
    outputs = []
    for t in range(len(tokens)):
        K_rows, V_rows = [], []
        for j in range(t + 1):                 # rebuild K,V for every token 0..t
            k, v = project_kv(tokens[j])
            K_rows.append(k)
            V_rows.append(v)
        K = np.array(K_rows)
        V = np.array(V_rows)
        q = tokens[t] @ W_q                    # query for the current token
        outputs.append(attend(q, K, V))
    return np.array(outputs)


def generate_with_cache(tokens):
    """Generate, KEEPING past K,V in a cache and projecting only the new token.

    We append each new token's k,v to a growing cache. At every step we project
    exactly ONE token, so the whole generation costs n projections: linear.
    """
    K_cache, V_cache = [], []                  # this list IS the KV cache
    outputs = []
    for t in range(len(tokens)):
        k, v = project_kv(tokens[t])           # project ONLY the new token
        K_cache.append(k)
        V_cache.append(v)
        K = np.array(K_cache)
        V = np.array(V_cache)
        q = tokens[t] @ W_q
        outputs.append(attend(q, K, V))
    return np.array(outputs)


def demo_within_generation():
    print("=== 1. KV cache WITHIN one generation ===")
    print("Generating n tokens. Counting K,V projections each way.\n")
    print(f"{'n':>6}  {'no cache (~n^2/2)':>18}  {'with cache (~n)':>16}  {'ratio':>7}")
    for n in (8, 64, 256):
        tokens = rng.standard_normal((n, D))   # n random token vectors

        reset_counter()
        out_slow = generate_no_cache(tokens)
        slow = PROJECTIONS

        reset_counter()
        out_fast = generate_with_cache(tokens)
        fast = PROJECTIONS

        # The cache must be EXACT: same output, just less work. Assert it.
        assert np.allclose(out_slow, out_fast), "cache changed the output!"
        print(f"{n:>6}  {slow:>18,}  {fast:>16,}  {slow / fast:>6.1f}x")

    print("\nThe outputs are identical (assert passed), so the cache is exact,")
    print("not an approximation. It just stops re-paying for past tokens.")
    print("Cost goes from ~n^2/2 projections to ~n: at n=256 that is the")
    print("difference between 32,896 and 256.\n")


def demo_prefix_sharing():
    print("=== 2. PREFIX sharing ACROSS two requests ===")
    # Two requests with a long identical prefix, then a short tail that differs.
    PREFIX_LEN = 250   # the shared system prompt / context (stable across calls)
    TAIL_LEN = 6       # the part that differs (e.g. the user's question)

    prefix = rng.standard_normal((PREFIX_LEN, D))
    tail_a = rng.standard_normal((TAIL_LEN, D))
    tail_b = rng.standard_normal((TAIL_LEN, D))

    req_a = np.vstack([prefix, tail_a])
    req_b = np.vstack([prefix, tail_b])

    # Request A: process the whole thing once, projecting K,V for every token,
    # and KEEP the prefix's K,V to reuse on the next request.
    reset_counter()
    K_prefix, V_prefix = [], []
    for j in range(PREFIX_LEN):
        k, v = project_kv(prefix[j])
        K_prefix.append(k)
        V_prefix.append(v)
    for j in range(TAIL_LEN):
        project_kv(tail_a[j])
    a_projections = PROJECTIONS  # PREFIX_LEN + TAIL_LEN

    # Request B WITHOUT the cache: redo the shared prefix from scratch.
    reset_counter()
    for j in range(PREFIX_LEN):
        project_kv(prefix[j])
    for j in range(TAIL_LEN):
        project_kv(tail_b[j])
    b_uncached = PROJECTIONS  # PREFIX_LEN + TAIL_LEN again

    # Request B WITH the cache: reuse the prefix K,V, project only the new tail.
    reset_counter()
    for j in range(TAIL_LEN):
        project_kv(tail_b[j])
    b_cached = PROJECTIONS  # TAIL_LEN only

    saved = b_uncached - b_cached
    print(f"Shared prefix: {PREFIX_LEN} tokens.  Differing tail: {TAIL_LEN} tokens.\n")
    print(f"Request A (first call, fills the cache):  {a_projections} projections")
    print("Request B, BEFORE (no cache, redo prefix): "
          f"{b_uncached} projections")
    print(f"Request B, AFTER  (reuse cached prefix):   {b_cached} projections")
    print(f"Saved by reusing the prefix:               {saved} projections "
          f"({saved / b_uncached * 100:.0f}% of request B)\n")


def demo_money():
    print("=== 3. The same idea, in dollars (Anthropic prompt caching) ===")
    # Anthropic prompt caching prices a cache READ at about 0.1x the normal
    # input rate. A cache WRITE costs about 1.25x (5-minute TTL). So the first
    # call pays a small write premium, and every later call re-reads the prefix
    # for about a tenth of the price instead of re-sending it at full price.
    INPUT_RATE = 5.00 / 1_000_000   # claude-opus-4-8: $5.00 per 1M input tokens
    READ_RATE = INPUT_RATE * 0.1    # a cache hit is ~0.1x
    WRITE_RATE = INPUT_RATE * 1.25  # a 5-minute cache write is ~1.25x

    PREFIX_TOKENS = 10_000          # a 10k-token shared system prompt
    CALLS = 100                     # re-used across this many calls in 5 minutes

    uncached = PREFIX_TOKENS * INPUT_RATE * CALLS
    # First call writes the cache (1.25x); the other CALLS-1 read it (0.1x).
    cached = (
        PREFIX_TOKENS * WRITE_RATE
        + PREFIX_TOKENS * READ_RATE * (CALLS - 1)
    )
    print(f"A {PREFIX_TOKENS:,}-token prefix, re-used across {CALLS} calls "
          "(claude-opus-4-8).\n")
    print(f"  BEFORE (re-sent uncached every call): ${uncached:,.2f}")
    print(f"  AFTER  (written once, then read):     ${cached:,.2f}")
    print(f"  Saved:                                ${uncached - cached:,.2f} "
          f"({(1 - cached / uncached) * 100:.0f}% cheaper)\n")
    print("Per call, a cached re-read of the prefix costs ~0.1x what re-sending")
    print("it uncached would. The write premium is paid once and amortized away")
    print("after a couple of reads.")


if __name__ == "__main__":
    demo_within_generation()
    demo_prefix_sharing()
    demo_money()

Running it:

=== 1. KV cache WITHIN one generation ===
Generating n tokens. Counting K,V projections each way.

     n   no cache (~n^2/2)   with cache (~n)    ratio
     8                  36                 8     4.5x
    64               2,080                64    32.5x
   256              32,896               256   128.5x

The outputs are identical (assert passed), so the cache is exact,
not an approximation. It just stops re-paying for past tokens.
Cost goes from ~n^2/2 projections to ~n: at n=256 that is the
difference between 32,896 and 256.

=== 2. PREFIX sharing ACROSS two requests ===
Shared prefix: 250 tokens.  Differing tail: 6 tokens.

Request A (first call, fills the cache):  256 projections
Request B, BEFORE (no cache, redo prefix): 256 projections
Request B, AFTER  (reuse cached prefix):   6 projections
Saved by reusing the prefix:               250 projections (98% of request B)

=== 3. The same idea, in dollars (Anthropic prompt caching) ===
A 10,000-token prefix, re-used across 100 calls (claude-opus-4-8).

  BEFORE (re-sent uncached every call): $5.00
  AFTER  (written once, then read):     $0.56
  Saved:                                $4.44 (89% cheaper)

Per call, a cached re-read of the prefix costs ~0.1x what re-sending
it uncached would. The write premium is paid once and amortized away
after a couple of reads.

The first table is the within-generation story. At $n=8$ the cache saves a little; at $n=256$ it does $256$ projections where the naive version does $32{,}896$, a $128.5\times$ gap that keeps widening with length. This is not a tuning knob you turn on. Every production inference engine keeps a KV cache during generation, because without it long outputs would be quadratically slow. We come back to who manages that memory, and how, in Chapter 8; for now the point is just that the model already caches keys and values internally, by token position.
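The ratio column follows from the counts themselves: $n(n+1)/2$ divided by $n$ is $(n+1)/2$, so the gap grows linearly with length. A small sketch that checks the closed form against the table (the function names are illustrative, not part of the chapter's script):

```python
def projections_no_cache(n: int) -> int:
    """1 + 2 + ... + n: every step re-projects all tokens so far."""
    return n * (n + 1) // 2

def projections_with_cache(n: int) -> int:
    """One projection per generated token."""
    return n

for n in (8, 64, 256):
    slow = projections_no_cache(n)
    fast = projections_with_cache(n)
    # The measured ratio and the closed form (n + 1) / 2 should agree.
    print(f"n={n:>3}  no cache={slow:>6,}  cached={fast:>3}  "
          f"ratio={slow / fast:.1f}x  (n+1)/2={(n + 1) / 2:.1f}")
```

Doubling the output length roughly doubles the advantage of the cache, which is why no engine generates long outputs without one.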

Here is the leap. If keys and values can be cached within one generation, they can be cached across requests too, as long as the requests start with the same tokens.

The second block of output makes it concrete. Two requests share a 250-token prefix (think: a fixed system prompt plus a retrieved document) and then differ in a 6-token tail (the user's question). Request A processes all 256 tokens and we keep the prefix's $K$ and $V$. Request B without a cache redoes the whole prefix: 256 projections again, 250 of them pure waste because they recompute the identical prefix. Request B with the cache reuses the stored prefix keys and values and projects only the 6 new tail tokens: 6 projections. The shared prefix is paid for once, not twice. We saved 250 projections, 98% of request B's work.

That 98% is the whole economic case for prompt caching. The shared prefix is the expensive part (it is long and stable), the tail is the cheap part (it is short and changes), and a prefix cache lets the long stable part be processed once and reused.
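The prefix demo counts projections but never runs attention over the reused keys and values. As a sanity check, the sketch below does: it is a separate toy (single layer, single head, its own random weights and seed, a hypothetical `run` helper) that processes request B both ways and asserts the tail outputs match.

```python
import numpy as np

D = 8
rng = np.random.default_rng(1)
W_q, W_k, W_v = (rng.standard_normal((D, D)) * 0.5 for _ in range(3))

def softmax(x):
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / np.sum(e, axis=-1, keepdims=True)

def run(tokens, K_past=None, V_past=None):
    """Causal attention over `tokens`, optionally continuing from cached K,V.

    Returns (outputs for `tokens`, full K, full V, projections performed).
    """
    K = np.empty((0, D)) if K_past is None else K_past
    V = np.empty((0, D)) if V_past is None else V_past
    outputs = []
    for x in tokens:
        K = np.vstack([K, x @ W_k])    # project only the new token, then append
        V = np.vstack([V, x @ W_v])
        q = x @ W_q
        outputs.append(softmax(q @ K.T / np.sqrt(D)) @ V)
    return np.array(outputs), K, V, len(tokens)

prefix = rng.standard_normal((250, D))   # shared by both requests
tail_b = rng.standard_normal((6, D))     # request B's own tail

# Request A processed the prefix; keep its keys and values.
_, K_prefix, V_prefix, _ = run(prefix)

# Request B from scratch: prefix and tail together.
out_full, _, _, n_full = run(np.vstack([prefix, tail_b]))

# Request B continuing from the cached prefix: only the tail is projected.
out_tail, _, _, n_tail = run(tail_b, K_prefix, V_prefix)

assert np.allclose(out_full[-6:], out_tail), "prefix reuse changed the output!"
print(f"projections: {n_full} uncached vs {n_tail} with the cached prefix")
```

In this toy, reusing the prefix across requests is as exact as the cache within one generation: the tail tokens see the same keys and values either way.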

Don't be confused. The model's internal KV cache and the API's prompt cache are not the same thing, even though they cache the same underlying quantities. The KV cache lives inside a single inference engine, holds keys and values for the duration of one generation (or while a session is warm), and you never touch it directly. The prompt cache is a feature you opt into from the API: you mark a stable prefix, and the provider stores its computed state so a later, separate request can reuse it. The first is the mechanism; the second is the product built on top of it. When this book says "caching relieves the cost pressure" (Chapter 1), it means the second, riding on the first.

From projections to dollars

The third block converts the saved work into money using Anthropic's actual prompt-caching prices. The economics rest on two numbers. A cache read (a request that reuses an already-cached prefix) costs about 0.1x the normal input rate: a tenth of the price of re-sending those tokens fresh. A cache write (the first request, which computes the prefix and stores it) costs about 1.25x the input rate for the default 5-minute cache, a small one-time premium.

So take a 10,000-token shared prefix re-used across 100 calls on claude-opus-4-8, whose input rate is $5.00 per million tokens. Re-sent uncached, the prefix alone costs $5.00 across those 100 calls. Cached, the first call pays the 1.25x write and the other 99 pay the 0.1x read, for $0.56 total: 89% cheaper, the same shape as the 98% projection saving, now in dollars. Per call, re-reading the cached prefix costs about a tenth of re-sending it.

The break-even is quick. The write costs 1.25x, a premium of 0.25x over sending the prefix uncached, and each later read costs 0.1x instead of 1x, a saving of 0.9x. The first read therefore more than repays the premium: after two calls you have paid 1.35x cached against 2x uncached. Any stable prefix that is hit even once within the cache window pays for itself, and every further hit is nearly pure saving.
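The break-even arithmetic can be checked directly. The sketch below expresses the prefix cost in units of one uncached send, using the multipliers quoted in this chapter (1.25x write, 0.1x read, and the 2x write of the longer-lived cache described later). The function names are illustrative.

```python
def cached_cost(calls: int, write_mult: float = 1.25, read_mult: float = 0.1) -> float:
    """Prefix cost over `calls` requests: one cache write, then reads.

    Measured in units of one uncached send of the prefix.
    """
    return write_mult + read_mult * (calls - 1)

def break_even_calls(write_mult: float = 1.25, read_mult: float = 0.1) -> int:
    """Smallest number of calls at which caching is strictly cheaper."""
    if read_mult >= 1:
        raise ValueError("a read must be cheaper than an uncached send")
    calls = 1
    while cached_cost(calls, write_mult, read_mult) >= calls:  # uncached cost is `calls`
        calls += 1
    return calls

print(break_even_calls())       # 1.25x write -> 2
print(break_even_calls(2.0))    # 2x write    -> 3

# The chapter's example: 10,000 tokens at $5.00 per million, 100 calls.
one_send = 10_000 * 5.00 / 1_000_000
print(f"${one_send * cached_cost(100):.2f} cached vs ${one_send * 100:.2f} uncached")
```

This assumes every call after the first arrives while the cache entry is still alive. Calls that arrive after the entry expires pay the write again.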

Calling it: Anthropic prompt caching

This maps directly onto the API. You mark the end of a stable prefix with a cache breakpoint and the provider caches everything up to that point. The following is follow-along (the build machine has no API key), but it is the exact call:

# Illustrative: requires the anthropic SDK and an API key.
# LARGE_STABLE_SYSTEM_PROMPT and user_question are assumed to be defined elsewhere.
import anthropic
client = anthropic.Anthropic()

resp = client.messages.create(
    model="claude-opus-4-8",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LARGE_STABLE_SYSTEM_PROMPT,   # the long, unchanging preamble
            "cache_control": {"type": "ephemeral"},  # 5-minute cache; cache up to HERE
        }
    ],
    messages=[{"role": "user", "content": user_question}],  # the volatile tail
)

print(resp.usage.cache_read_input_tokens)   # >0 means we got a cache HIT
print(resp.usage.cache_creation_input_tokens)  # >0 means we WROTE the cache this call

cache_control={"type": "ephemeral"} requests the 5-minute cache; for content reused over a longer span, {"type": "ephemeral", "ttl": "1h"} requests a 1-hour cache (its write costs about 2x instead of 1.25x, so it needs more reads to pay off, but it survives gaps in bursty traffic). The simplest form is a top-level cache_control={"type": "ephemeral"} on the request, which auto-caches the last cacheable block for you.

Four facts decide whether this actually works, and getting any of them wrong fails silently (you simply keep paying full price with no error).

It is a prefix match. The cache key is the exact bytes of the prompt up to the breakpoint. Any change anywhere in the prefix invalidates everything after it. So keep volatile content (a timestamp, a request ID, the user's question) after the cached prefix, never inside it. A datetime.now() interpolated into the system prompt changes the prefix every call and caches nothing.

Render order is tools, then system, then messages. Because the cache is a prefix, a breakpoint on the last system block caches the tool definitions and the system prompt together (tools come first in the rendered prompt). That is usually what you want: the tools and the system prompt are both stable, so cache them as one block and put the conversation after.

There is a minimum size. On claude-opus-4-8 the smallest cacheable prefix is 4096 tokens. A prefix shorter than that will not cache even with a breakpoint set, and you will see cache_creation_input_tokens stay at zero with no error.

Verify with the usage fields. After a few calls with an identical prefix, response.usage.cache_read_input_tokens should be greater than zero. If it stays zero, a silent invalidator is changing your prefix: the usual culprits are a datetime.now() or a UUID in the system prompt, a JSON blob serialized without sorted keys, or a tool list whose order varies between calls. Diff the rendered prompt bytes of two requests to find it.
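One way to hunt for a silent invalidator before spending any tokens is to fingerprint the text you are about to send and compare it across requests. The sketch below is a local diagnostic only. It does not reproduce how the provider computes its cache key, and the prompt text and helper names are hypothetical.

```python
import hashlib
import json
import uuid

def fingerprint(text: str) -> str:
    """Short hash of the exact bytes you are about to send as the prefix."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]

STABLE = "You are a support assistant for ExampleCo."   # hypothetical prompt

def system_with_request_id() -> str:
    # Anti-pattern: a fresh UUID inside the prefix changes it on every call.
    return f"{STABLE}\nRequest ID: {uuid.uuid4()}"

def system_with_config(config: dict, sort: bool) -> str:
    # Embedded JSON keeps a fixed key order only if you ask for one.
    return f"{STABLE}\nConfig: {json.dumps(config, sort_keys=sort)}"

cfg_a = {"region": "eu", "tier": "pro"}
cfg_b = {"tier": "pro", "region": "eu"}   # same content, different insertion order

print("stable prompt matches:  ", fingerprint(STABLE) == fingerprint(STABLE))
print("with UUID matches:      ",
      fingerprint(system_with_request_id()) == fingerprint(system_with_request_id()))
print("unsorted JSON matches:  ",
      fingerprint(system_with_config(cfg_a, False)) == fingerprint(system_with_config(cfg_b, False)))
print("sorted JSON matches:    ",
      fingerprint(system_with_config(cfg_a, True)) == fingerprint(system_with_config(cfg_b, True)))
```

Logging the fingerprint next to cache_read_input_tokens makes a drifting prefix easy to spot: the hash changes on exactly the calls where the read count is zero.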

Don't be confused. Prefix caching is not semantic caching. Prefix caching reuses the computed state of identical input tokens: the prefix must match byte for byte, and it still runs the model to produce the answer. Semantic caching, the subject of Chapter 7, reuses a stored answer when a new question is close enough in meaning, returning it without calling the model at all. One caches the question's processing; the other caches the answer. They compose (you can do both), but they are different mechanisms with different failure modes.

Designing for the cache

Once you know it is a prefix match, the design rule writes itself: order your context from most stable to least stable. The frozen system prompt and the deterministic tool list go first, then the per-session context, then the running conversation, then the new user message last. Put the cache breakpoint at the boundary between the stable part and the volatile part. This is the build-order point from Chapter 1, now with a concrete payoff: a context assembled stable-first is a context that caches, and one assembled with a timestamp at the top is one that never will.
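The ordering rule can be captured in a small request builder. This is a sketch under assumptions: `build_request` and its parameters are hypothetical names, sorting tools by name is just one way to make their order deterministic, and the function only assembles keyword arguments without calling the API.

```python
def build_request(stable_system, tools, session_context, history, user_message,
                  model="claude-opus-4-8", max_tokens=1024):
    """Assemble keyword arguments for messages.create, most stable content first."""
    system = [
        {
            "type": "text",
            "text": stable_system,
            "cache_control": {"type": "ephemeral"},  # breakpoint: end of the stable part
        }
    ]
    if session_context:
        # Per-session material goes AFTER the breakpoint so it cannot invalidate it.
        system.append({"type": "text", "text": session_context})

    request = {
        "model": model,
        "max_tokens": max_tokens,
        "system": system,
        # Running conversation next, the new user message last.
        "messages": [*history, {"role": "user", "content": user_message}],
    }
    if tools:
        # A fixed order keeps the rendered tool definitions identical across calls.
        request["tools"] = sorted(tools, key=lambda t: t["name"])
    return request

req = build_request(
    stable_system="You are a careful code reviewer.",          # hypothetical prompt
    tools=[
        {"name": "search", "description": "Search the docs.",
         "input_schema": {"type": "object", "properties": {}}},
        {"name": "fetch", "description": "Fetch a URL.",
         "input_schema": {"type": "object", "properties": {}}},
    ],
    session_context="Repository: example/repo",                 # hypothetical
    history=[],
    user_message="Review the latest diff.",
)
print([t["name"] for t in req["tools"]], len(req["system"]))
# client.messages.create(**req) would send it; that needs the SDK and an API key.
```

Because tools render before the system prompt, the single breakpoint on the stable system block covers both. Only the session context and the conversation fall outside the cached prefix.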

Using the real tool: commands and before/after proof

Everything above is the why. Here is the how, with a recipe you can run against the live provider to confirm the cache is actually working. The build machine has no API key, so the snippet below is follow-along (you supply the key with ANTHROPIC_API_KEY), but the call is exactly the one you would make.

The setup is the cheapest possible experiment: call the model twice with the same large system prompt marked for caching, changing only the user question. If caching works, the first call pays to build the cache and the second call reads it back cheaply.

# Illustrative: requires the anthropic SDK and an API key (ANTHROPIC_API_KEY).
from pathlib import Path

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# A long, stable preamble. To cache on claude-opus-4-8 it must be at least
# 4096 tokens; below that the breakpoint is silently ignored (see the
# minimum-size note above).
LARGE_STABLE_SYSTEM_PROMPT = Path("system_prompt.txt").read_text(encoding="utf-8")

def ask(question):
    return client.messages.create(
        model="claude-opus-4-8",
        max_tokens=512,
        system=[
            {
                "type": "text",
                "text": LARGE_STABLE_SYSTEM_PROMPT,    # the unchanging prefix
                "cache_control": {"type": "ephemeral"},  # 5-minute cache; cache up to HERE
            }
        ],
        messages=[{"role": "user", "content": question}],  # the volatile tail
    )

# Call 1: same prefix, first question.
r1 = ask("Summarize the document in one sentence.")
print("call 1  write:", r1.usage.cache_creation_input_tokens,
      " read:", r1.usage.cache_read_input_tokens)

# Call 2: SAME prefix, different question.
r2 = ask("List the three main risks the document mentions.")
print("call 2  write:", r2.usage.cache_creation_input_tokens,
      " read:", r2.usage.cache_read_input_tokens)

The before/after proof

The two usage fields are the whole proof. cache_creation_input_tokens counts tokens written to the cache this call (you paid the 1.25x write premium on them). cache_read_input_tokens counts tokens served from the cache this call (you paid about 0.1x on them). Read the two calls in sequence:

Call 1 writes the cache. It is the first time the provider has seen this prefix, so it processes the whole system prompt and stores the result. You see cache_creation_input_tokens > 0 and cache_read_input_tokens == 0.
Call 2 reads the cache. The prefix is byte-for-byte identical, so the provider reuses the stored work and only processes the new question. You see cache_read_input_tokens > 0, and cache_creation_input_tokens drops to 0 because nothing new needed writing.

Here is the shape of that output. These numbers are illustrative (expected, not measured on this box, which has no key), with a system prompt of about 5,000 tokens:

call 1  write: 5000  read: 0
call 2  write: 0  read: 5000

That read: 5000 on call 2 is the cache hit you are looking for, and the 5,000 tokens it covers cost about a tenth of what they cost on call 1. That is the same 0.1x read price from the dollars section above, now visible in a field you can assert on.

The rule for reading this in your own runs: if cache_read_input_tokens stays 0 across repeated calls that share an identical prefix, a silent invalidator is changing your prefix. The usual culprit is something volatile slipping in ahead of the breakpoint: a datetime.now() or a UUID interpolated into the system prompt makes the prefix different every call, so every call writes a fresh cache and none ever reads one. When the read field is stuck at zero, diff the rendered prompt bytes of two requests and look for the thing that moved.
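That rule is easy to turn into a check you can run in a test or a log line. The sketch below reads only the two usage fields named above. The helper `cache_status` and its labels are hypothetical, and the stand-in usage objects mirror the illustrative numbers rather than a real response.

```python
from types import SimpleNamespace

def cache_status(usage) -> str:
    """Classify one response by its cache usage fields.

    "hit"   : some prefix tokens were served from the cache
    "write" : nothing was read, but a cache entry was created
    "miss"  : neither, e.g. no breakpoint or a prefix below the minimum size
    """
    read = getattr(usage, "cache_read_input_tokens", 0) or 0
    written = getattr(usage, "cache_creation_input_tokens", 0) or 0
    if read > 0:
        return "hit"      # a call can also write new content; the read is what matters
    if written > 0:
        return "write"
    return "miss"

# Stand-ins shaped like response.usage, using the illustrative numbers above.
call_1 = SimpleNamespace(cache_creation_input_tokens=5000, cache_read_input_tokens=0)
call_2 = SimpleNamespace(cache_creation_input_tokens=0, cache_read_input_tokens=5000)

print(cache_status(call_1), cache_status(call_2))

# A healthy pair of calls is a write followed by a hit. Two writes in a row
# with an "identical" prefix point to a silent invalidator.
statuses = [cache_status(call_1), cache_status(call_2)]
assert statuses == ["write", "hit"]
```

With real responses you would pass `r1.usage` and `r2.usage` instead of the stand-ins.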

This is the same mechanism the from-scratch NumPy demo proved on this box: reuse the stored keys and values for the shared prefix and process only the new tail. The demo showed the work disappear (6 projections instead of 256); the usage fields show the same saving as a price you are billed (a 0.1x read instead of full input).

In Claude Code

Claude Code is built on exactly this. Across the turns of one session it automatically caches the stable parts of the request: the system prompt and the tool definitions, which do not change from turn to turn. A long CLAUDE.md is therefore paid for once, on the first turn that writes it into the cache, and read cheaply on every turn after that. So the per-turn cost you watch tick up mostly reflects cache reads of that stable preamble plus the genuinely new tokens of the latest turn, not a full re-send of the whole preamble every time. The same caching that makes a hundred API calls with a shared prefix cheap is what keeps a long interactive coding session from re-paying for its own context on every message.

Takeaways

Attention needs the key and value of every prior token. Caching those keys and values is what lets the model avoid re-projecting tokens it has already processed.
Within one generation, the KV cache turns roughly $n^2/2$ projections into $n$: in the demo, $256$ instead of $32{,}896$ at $n=256$. It is exact (the assert on equal outputs passes), not an approximation. Every inference engine does this.
Across requests, caching a shared prefix's keys and values lets a later call skip the shared work: the demo's second request did 6 projections instead of 256, a 98% saving.
In dollars, a 10,000-token prefix re-used across 100 calls drops from $5.00 to $0.56 (89% cheaper) because cache reads cost about 0.1x input and the 1.25x write is paid once. The 0.25x write premium is recovered by the first cache hit, so a stable prefix pays for itself from the second call on.
It is a prefix match, so keep volatile content after the cached prefix, respect the render order (tools, then system, then messages) and the 4096-token minimum on claude-opus-4-8, and verify hits with cache_read_input_tokens. A zero there means a silent invalidator.

👉 Prefix caching reuses the work of processing an identical question; it still runs the model. The next chapter caches the other end: when a new question is close enough in meaning to one you have answered before, return the stored answer and skip the model entirely.

Context Engineering in Depth