KV-cache serving optimization

Chapter 6 was about the bill: when you re-send a stable prefix, the provider can skip re-reading it and charge you a fraction of the price. That is the view from outside, through the API. This chapter goes inside the box. The same key/value state that prefix caching lets you reuse has to physically live somewhere on the server's GPU, and how the serving engine lays that memory out across many users at once decides how many requests it can run in parallel and how fast each one comes back. Two ideas do most of the work, and both have a clean from-scratch model: paging the cache into fixed-size blocks, and sharing the blocks of a common prefix across requests.

What the KV cache is, concretely

When a model generates text, each new token has to attend to every token before it. Attention works by comparing the new token's query vector against a key vector for every earlier token, then mixing in a value vector for each. The keys and values for the earlier tokens do not change as generation continues, so recomputing them on every step would be pure waste. The engine computes them once and keeps them. That saved pile of key and value vectors is the KV cache.

Two facts about it drive this whole chapter. First, it grows by exactly one slot per token, per layer, as the sequence gets longer, so a request that has generated 500 tokens is holding 500 slots of KV. Second, it lives in GPU memory, which is small and expensive, and it is usually the thing that runs out first when you try to serve many users at once. Model weights are a fixed cost you pay once; the KV cache is a per-request, per-token cost that scales with how many people are talking to the model and how much they have said. So the engineering question is not "how do we compute attention" but "how do we store all this KV for all these concurrent requests without wasting memory."

Don't be confused. This is a different layer from Chapter 6. That chapter was about API prompt caching: what you, the caller, pay when your prompt repeats an exact prefix, billed per request, surfaced as a discount on your invoice. This chapter is about engine memory management: how the server arranges KV in GPU memory so it can hold many requests at once, none of which you see or are billed for directly. They are related (prefix sharing here is the mechanism that makes API prefix caching possible at all), but one is your cost model and the other is the server's memory model. When you call a hosted API you are using both, through different doors.

Idea one: page the cache instead of reserving the maximum

Here is the trap a naive serving engine falls into. The KV cache for a request grows as it generates, and the engine does not know in advance how long the request will get. The model could generate up to max_seq_len tokens (say 2048), so the simple thing is to reserve one contiguous buffer of 2048 slots per request up front. Contiguous means one unbroken run of memory addresses, which the old attention kernels needed.

The problem is that most requests finish far short of the maximum. A request that generates 80 tokens still holds a 2048-slot reservation, and the other 1968 slots sit empty but allocated, unusable by anyone else. That wasted tail is internal fragmentation: memory that is reserved but not used, trapped inside a per-request allocation. With requests that vary in length, the average waste is enormous, and it directly caps how many requests fit on the GPU.

The fix, PagedAttention (introduced by vLLM), is the same trick an operating system uses for RAM. Instead of one big contiguous buffer per request, the engine carves GPU memory into many small fixed-size blocks (also called pages; vLLM's default is 16 tokens of KV per block) and hands them out on demand. A request that has filled 80 tokens holds ceil(80 / 16) = 5 blocks and nothing more. When it needs token 81 it asks for one more block. The blocks for a single request do not have to be next to each other in memory; a small table maps the logical sequence to its scattered physical blocks, exactly like a page table maps virtual addresses to physical RAM. The only waste left is the partly filled last block, which is at most block_size - 1 tokens per request.

The first part of the demo measures the difference. We take 64 requests with realistic varied lengths and compute memory utilization, the fraction of reserved memory that actually holds live KV:

$$\text{utilization} = \frac{\text{tokens used}}{\text{tokens reserved}}$$

The contiguous allocator reserves max_seq_len for every request; the paged allocator reserves ceil(len / block_size) blocks. We never compute an attention kernel here, just the arithmetic of what each policy reserves.

The second idea attacks a different waste. In multi-user serving, many requests start with the same tokens: the same system prompt, the same few-shot examples, the same tool definitions. Because the KV for a position depends only on the tokens up to that position, two requests that share a prefix produce bit-for-bit identical KV blocks for that prefix. Storing a separate copy per request is redundant. The engine could store the shared prefix once and let every request that begins with it point at the same blocks.

The data structure for "find the longest shared prefix fast" is a radix tree, also called a prefix tree or trie. A trie stores sequences as paths from a root: each edge is one element (here, one KV block), and sequences that share a prefix share the early part of their path before branching. RadixAttention (from SGLang) builds exactly this tree over the block sequences of live requests, keyed by block content, so that the moment a new request's prefix matches an existing path, it reuses those blocks instead of allocating and recomputing them. This is automatic: no one has to declare "these requests share a prompt," the tree discovers it.

The second part of the demo builds a small radix tree over 40 requests. Each request is a list of block ids; equal ids mean identical blocks (same tokens, same positions, so shareable). All 40 share an 8-block system prompt, 30 of them also share a 6-block few-shot preamble, and each has a short unique tail. Inserting a request into the tree creates a new node only for a block that has never appeared on that path, so the count of nodes created is the count of blocks the engine actually has to store. We compare that against the naive total where every request keeps its own copy of everything.

The demo

"""KV-cache serving optimization: how the engine stores attention state in GPU memory.

When a model generates text, every token it has already seen leaves behind a pair of
vectors per layer (a "key" and a "value") that the next token attends to. That stored
pile of key/value vectors is the KV CACHE. It grows by one slot per generated token, and
it lives in scarce GPU memory. How the serving engine LAYS OUT that memory across many
concurrent requests decides how many requests fit at once. Two ideas dominate:

  1. PAGED allocation (vLLM's PagedAttention). A naive engine reserves one contiguous
     buffer of max_seq_len up front for each sequence, then most sequences finish short
     and the unused tail is wasted (internal fragmentation). A paged allocator hands out
     fixed-size BLOCKS on demand, like an operating system pages virtual memory, so a
     sequence only holds the blocks it actually filled. We measure UTILIZATION =
     used / reserved for both.

  2. PREFIX SHARING (SGLang's RadixAttention). Many requests start with the SAME prefix
     (a shared system prompt, a fixed few-shot preamble). Their KV blocks for that prefix
     are identical, so the engine can store ONE copy and let every request point at it.
     We build a radix tree (a prefix tree / trie over block sequences), then count blocks
     stored without sharing vs with sharing, and report blocks saved.

This is a SIMULATION with real arithmetic, not a GPU kernel. NumPy + stdlib only.
Run:  python3 kv_serving.py
"""

import numpy as np

# ----------------------------------------------------------------------------------
# Part 1: contiguous (reserve max) vs paged (blocks on demand) allocation.
# ----------------------------------------------------------------------------------

MAX_SEQ_LEN = 2048      # the longest a sequence is allowed to get
BLOCK_SIZE = 16         # tokens of KV per page/block (vLLM's default is 16)

# A batch of requests with realistic, varied actual lengths. A naive allocator cannot
# know these in advance, so it must reserve MAX_SEQ_LEN for every one of them.
rng = np.random.default_rng(0)
actual_lens = rng.integers(20, 600, size=64)   # 64 requests, 20..599 tokens each


def blocks_needed(n_tokens, block_size):
    """How many fixed-size blocks hold n_tokens. The last block is partly empty;
    that small leftover is the only waste a paged allocator ever pays."""
    return int(np.ceil(n_tokens / block_size))


def contiguous_utilization(lens, max_len):
    """Reserve max_len slots per sequence regardless of how long it actually gets."""
    used = int(np.sum(lens))
    reserved = len(lens) * max_len
    return used, reserved


def paged_utilization(lens, max_len, block_size):
    """Hand out blocks on demand; a sequence holds only ceil(len/block_size) blocks."""
    used = int(np.sum(lens))
    reserved = int(np.sum([blocks_needed(n, block_size) for n in lens])) * block_size
    return used, reserved


used_c, reserved_c = contiguous_utilization(actual_lens, MAX_SEQ_LEN)
used_p, reserved_p = paged_utilization(actual_lens, MAX_SEQ_LEN, BLOCK_SIZE)

util_c = used_c / reserved_c
util_p = used_p / reserved_p

print("=== 1. Paged vs contiguous KV allocation ===")
print(f"  {len(actual_lens)} requests, actual lengths {actual_lens.min()}..{actual_lens.max()} "
      f"tokens (mean {actual_lens.mean():.0f}).")
print(f"  max_seq_len = {MAX_SEQ_LEN}, block_size = {BLOCK_SIZE} tokens/block.\n")

print("  BEFORE (contiguous: reserve max_seq_len per request)")
print(f"    used {used_c:,} tok / reserved {reserved_c:,} tok  ->  "
      f"utilization {util_c:6.1%}")
print(f"    wasted: {reserved_c - used_c:,} tok ({1 - util_c:.1%} of reserved)\n")

print("  AFTER (paged: blocks on demand)")
print(f"    used {used_p:,} tok / reserved {reserved_p:,} tok  ->  "
      f"utilization {util_p:6.1%}")
print(f"    wasted: {reserved_p - used_p:,} tok ({1 - util_p:.1%} of reserved, all in "
      f"half-full last blocks)\n")

print(f"  Same KV in {reserved_c:,} reserved tok shrinks to {reserved_p:,}: "
      f"{reserved_c / reserved_p:.1f}x less memory reserved for the SAME work.")
print(f"  -> a fixed GPU can now hold ~{reserved_c / reserved_p:.1f}x as many concurrent "
      f"requests.\n")


# ----------------------------------------------------------------------------------
# Part 2: prefix sharing via a radix tree (trie) over block sequences.
# ----------------------------------------------------------------------------------

class RadixNode:
    """One node = one shared block, identified by a content key (here, a token-block
    id). Children are the blocks that have followed it. Many requests can walk the
    same path, which is exactly the prefix they share."""
    __slots__ = ("children",)

    def __init__(self):
        self.children = {}


def insert(root, block_seq):
    """Walk the request's block ids down the tree, creating a node only when a block
    has never been seen on this path before. Returns how many NEW blocks this request
    forced us to store (the blocks not already shared with an earlier request)."""
    node = root
    new_blocks = 0
    for blk in block_seq:
        if blk not in node.children:
            node.children[blk] = RadixNode()
            new_blocks += 1
        node = node.children[blk]
    return new_blocks


# A workload that looks like multi-user serving: a long shared SYSTEM PROMPT, then a
# shared FEW-SHOT preamble for most requests, then each request's unique question.
# We model each request as a sequence of block ids. Equal ids mean identical KV blocks
# (same tokens in the same position), which is what makes them shareable.
SYSTEM_BLOCKS = [f"sys{i}" for i in range(8)]      # 8 blocks every request shares
FEWSHOT_BLOCKS = [f"shot{i}" for i in range(6)]    # 6 more blocks most requests share

requests = []
for r in range(40):
    blocks = list(SYSTEM_BLOCKS)
    if r % 4 != 0:                                 # 30 of 40 also use the few-shot preamble
        blocks += FEWSHOT_BLOCKS
    blocks += [f"req{r}_q{j}" for j in range(rng.integers(2, 6))]  # unique tail
    requests.append(blocks)

total_blocks_no_share = sum(len(b) for b in requests)

root = RadixNode()
stored_with_share = sum(insert(root, b) for b in requests)

print("=== 2. Prefix sharing via a radix tree (RadixAttention) ===")
print(f"  {len(requests)} requests. Each starts with an {len(SYSTEM_BLOCKS)}-block system "
      f"prompt; {sum(1 for r in range(40) if r % 4 != 0)} also share a "
      f"{len(FEWSHOT_BLOCKS)}-block few-shot preamble.\n")
print("  BEFORE (no sharing: every request stores its own copy of every block)")
print(f"    blocks stored: {total_blocks_no_share:,}\n")
print("  AFTER (radix tree: shared prefixes stored once)")
print(f"    blocks stored: {stored_with_share:,}")
saved = total_blocks_no_share - stored_with_share
print(f"    blocks saved:  {saved:,}  ({saved / total_blocks_no_share:.1%} fewer)\n")
print(f"  The {len(SYSTEM_BLOCKS)} system blocks are stored once instead of "
      f"{len(requests)} times; the few-shot preamble once instead of many times.")
print("  Every saved block is GPU memory freed for another concurrent request,")
print("  and prefill compute the engine skips because that KV already exists.")

Running it:

=== 1. Paged vs contiguous KV allocation ===
  64 requests, actual lengths 21..598 tokens (mean 315).
  max_seq_len = 2048, block_size = 16 tokens/block.

  BEFORE (contiguous: reserve max_seq_len per request)
    used 20,174 tok / reserved 131,072 tok  ->  utilization  15.4%
    wasted: 110,898 tok (84.6% of reserved)

  AFTER (paged: blocks on demand)
    used 20,174 tok / reserved 20,640 tok  ->  utilization  97.7%
    wasted: 466 tok (2.3% of reserved, all in half-full last blocks)

  Same KV in 131,072 reserved tok shrinks to 20,640: 6.4x less memory reserved for the SAME work.
  -> a fixed GPU can now hold ~6.4x as many concurrent requests.

=== 2. Prefix sharing via a radix tree (RadixAttention) ===
  40 requests. Each starts with an 8-block system prompt; 30 also share a 6-block few-shot preamble.

  BEFORE (no sharing: every request stores its own copy of every block)
    blocks stored: 642

  AFTER (radix tree: shared prefixes stored once)
    blocks stored: 156
    blocks saved:  486  (75.7% fewer)

  The 8 system blocks are stored once instead of 40 times; the few-shot preamble once instead of many times.
  Every saved block is GPU memory freed for another concurrent request,
  and prefill compute the engine skips because that KV already exists.

Read the two halves together, because they attack waste from opposite directions. Part one is about waste within a single request: the contiguous allocator reserves 131,072 tokens to hold 20,174 real ones, a utilization of 15.4 percent, which is to say it throws away 84.6 percent of the memory it reserved on padding that no request will ever fill. Paging drops that to 2.3 percent waste and 97.7 percent utilization, all of the remaining loss confined to the half-empty last block of each request. The same KV that needed 131,072 reserved slots now fits in 20,640, which is 6.4x less memory for identical work. Memory is the binding constraint on how many requests a GPU can serve at once, so reserving 6.4x less of it is, to first order, room for 6.4x more concurrent requests.

Part two is about waste across requests. Without sharing, the 40 requests store 642 blocks total, re-storing the same system prompt 40 separate times. The radix tree stores the 8 system blocks once and the few-shot preamble once, bringing the total down to 156 blocks, a saving of 486 blocks or 75.7 percent. Every block saved is GPU memory freed for another user, and it is also prefill compute the engine gets to skip entirely, because the KV for those shared tokens already exists and does not need to be recomputed. That skipped compute is the server-side mechanism underneath the cache-read discount you saw on your bill in Chapter 6.

The exact percentages here come from this synthetic workload and the seed in the script; change the length distribution or the prefix overlap and the numbers move. What is robust is the direction and the rough scale: contiguous reservation wastes the large majority of memory when lengths vary, paging recovers nearly all of it, and prefix sharing removes most of the redundancy when many requests share a long preamble. Those are exactly the conditions of real multi-user serving.

Where this shows up, and where it does not

These two techniques are the core of modern open-source serving engines. vLLM is built around PagedAttention; SGLang adds RadixAttention for automatic prefix sharing across requests, and both implement the other's idea too, so in practice you get paging and sharing together. The workloads that benefit most are the obvious ones: high-throughput serving with many concurrent users (paging lets far more of them coexist), any setup where requests share a long system prompt or few-shot preamble (sharing stores it once), and agent fan-out where one parent spawns many children that all carry the same instructions and context (Chapter 13), which is prefix sharing at its best because the overlap is large and deliberate.

If you only ever call a hosted API, you do not configure any of this; the provider runs an engine like these on your behalf, and your lever is to structure your prompts so the shared part comes first and stays byte-stable, which is the same discipline that earns the API cache discount in Chapter 6. If you self-host, this is the layer you actually operate, and understanding it is how you reason about throughput and the maximum batch size your GPU can hold. Either way, it connects upward to attention itself: paging and sharing manage where the KV lives, while Chapter 14 is about making attention over a long cache cheap to compute in the first place.

Using the real tool: commands and before/after proof

The from-scratch demo proves the mechanism on this box. The production versions of both ideas live in real serving engines you run on a GPU server, not in Claude Code. Claude Code is a client that talks to a hosted API; the paging and prefix sharing below happen on whatever server is answering, and you only operate them yourself when you self-host a model with vLLM or SGLang. So this section is server-side: the commands assume a machine with a GPU and the model weights, and the numbers at the end are labeled illustrative because there is no GPU on this box to run them.

Both engines do paging by default. The lever you actually toggle is prefix sharing, so that is what the before/after proof turns on and off.

Install vLLM and start a server. vllm serve <model> launches an OpenAI-compatible HTTP endpoint on localhost:8000. In the current vLLM (the V1 engine), automatic prefix caching is on by default, so the first command below is the "after" (sharing on) and the second is the "before" (sharing off) for an apples-to-apples comparison:

pip install vllm

# AFTER: prefix sharing ON (this is the default in vLLM V1; the flag is explicit here)
vllm serve meta-llama/Meta-Llama-3-8B-Instruct --enable-prefix-caching

# BEFORE: prefix sharing OFF (forces a cold prefill on every request)
vllm serve meta-llama/Meta-Llama-3-8B-Instruct --no-enable-prefix-caching

SGLang is the engine that introduced RadixAttention. Its launcher is a Python module, and RadixAttention prefix caching is also on by default; --disable-radix-cache is the off switch for the baseline run:

pip install "sglang[all]"

# Prefix sharing ON (default)
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct

# Prefix sharing OFF (baseline)
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --disable-radix-cache

Now hit the endpoint with two requests that share a long prefix and differ only in a short tail. The point is that the big shared block is identical across both calls, so the second call should get to reuse its KV. This is a follow-along curl against the OpenAI-compatible /v1/chat/completions route (the LONG_SYSTEM text below stands in for a real multi-thousand-token system prompt or document; make it long, that is the whole point):

LONG_SYSTEM="<a few thousand tokens of stable system prompt / policy / docs, identical every time>"

# Request 1: pays the full prefill for the long shared prefix
time curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "{\"model\":\"meta-llama/Meta-Llama-3-8B-Instruct\",
       \"messages\":[{\"role\":\"system\",\"content\":\"$LONG_SYSTEM\"},
                     {\"role\":\"user\",\"content\":\"Summarize in one line.\"}],
       \"max_tokens\":32}"

# Request 2: SAME long prefix, different short question
time curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "{\"model\":\"meta-llama/Meta-Llama-3-8B-Instruct\",
       \"messages\":[{\"role\":\"system\",\"content\":\"$LONG_SYSTEM\"},
                     {\"role\":\"user\",\"content\":\"List three risks.\"}],
       \"max_tokens\":32}"

The before/after proof is the second request. Both engines log a prefix cache hit rate (vLLM prints lines like Prefix cache hit rate: ...; SGLang reports reused/cached tokens), so you watch two signals at once: the server logs for how many prefix tokens were reused, and the client clock (time, or the time-to-first-token if you stream) for how long the second call took to start producing output. With sharing on, the second request finds the long prefix already in the KV cache, skips re-running prefill over those thousands of tokens, and its first token comes back much sooner. With sharing off, every request prefills the whole prefix from scratch, so the second call is no faster than the first. The shape of the result (illustrative, not measured on this box, because it has no GPU):

# Prefix sharing OFF (--no-enable-prefix-caching / --disable-radix-cache)
request 1: time-to-first-token ~ 0.9 s    (full prefill of the long prefix)
request 2: time-to-first-token ~ 0.9 s    (full prefill again; nothing reused)
server log: prefix cache hit rate ~ 0%

# Prefix sharing ON (default)
request 1: time-to-first-token ~ 0.9 s    (cold: builds the prefix's KV)
request 2: time-to-first-token ~ 0.1 s    (warm: prefix KV reused, prefill skipped)
server log: prefix cache hit rate high (most of request 2's prompt tokens were cache hits)

A high hit rate also means the shared prefix's KV is stored once and pointed at by both requests instead of copied, which is KV memory headroom freed for more concurrent users. That is the same two wins the from-scratch demo measured above, now coming from a real engine: the paged allocator is what lets request 2 reuse exactly the blocks request 1 built, and the radix tree is what finds the shared prefix to reuse. The demo's 97.7% utilization and 75.7% fewer blocks are the on-box proof of the mechanism that produces this drop in time-to-first-token on the server.

One sharp distinction from Chapter 6: that chapter is about what you pay at the API, the cache-read discount on your invoice, while this is what the serving engine does with GPU memory across concurrent requests so it can reuse that KV and fit more of them at once.

Takeaways

The KV cache is the stored key/value vectors for every token a request has seen. It grows one slot per token, lives in scarce GPU memory, and is usually what limits how many requests a server can run at once.
Reserving a contiguous max_seq_len buffer per request wastes most of the memory when lengths vary (internal fragmentation): in the demo, 15.4 percent utilization, 84.6 percent wasted.
PagedAttention hands out fixed-size blocks on demand, like OS virtual memory, pushing utilization to 97.7 percent and fitting about 6.4x more concurrent requests in the same memory.
RadixAttention stores the KV of a shared prefix once via a radix tree (a trie over block sequences), so a common system prompt costs one copy instead of one per request: 75.7 percent fewer blocks in the demo.
This is the server-side mechanism beneath the API prefix-cache discount of Chapter 6. If you self-host you operate it directly; if you call an API, you exploit it by keeping the shared prefix first and byte-stable.

👉 Caching reuses work on context that repeats. But some state should outlive the window entirely, surviving across sessions rather than just across calls. The next part turns to memory: storing what the model should remember outside the context and re-injecting only the slice that this turn needs.

Context Engineering in Depth