Long-context attention efficiency
Every technique so far has worked from outside the model: choosing what goes in the
context, compressing it, caching it, storing it elsewhere. This chapter goes inside the
model itself, because there is a hard limit those techniques cannot move on their own.
The reason a 1M-token context is expensive is not the disk it sits on or the bytes you
send over the wire. It is that the model's attention mechanism, the thing that lets
each token look at the others, costs work that grows with the square of the context
length, and the KV cache it leaves behind costs memory that grows with the length.
Those two costs are what model architects attack, and the 1M-token windows of models like
claude-opus-4-8 exist only because they found ways to attack them. This chapter explains
the two costs and the two families of fixes.
What attention is, and why it costs $n^2$
When a model processes a sequence, each token needs to look at the others to figure out what it means in context. The word "it" has to find its referent; a closing brace has to find its opening one. The mechanism that does this looking is attention, and it works with three vectors per token. For each token the model computes a query (what this token is looking for), a key (what this token offers to others), and a value (the information this token will pass along if attended to). A query from one token is compared against the keys of all the others; wherever a query and a key match well, that token's value gets mixed in. Written out, attention is
$$\text{attention}(Q, K, V) = \text{softmax}!\left(\frac{QK^\top}{\sqrt{d}}\right)V$$
and every symbol there earns its place. $Q$, $K$, $V$ are the stacks of query, key, and value vectors for all $n$ tokens, each vector of length $d$ (the head dimension, the size of one attention head's working space). $QK^\top$ is the score matrix: row $i$, column $j$ is the dot product of query $i$ with key $j$, a single number saying how much token $i$ should attend to token $j$. Dividing by $\sqrt{d}$ keeps those numbers from growing too large as $d$ grows. softmax turns each row into weights that are positive and sum to one, so each query spreads a total attention of exactly 1 across all keys. Multiplying those weights by $V$ produces the output: each token's new representation is a weighted blend of every token's value.
Now look at the shape of $QK^\top$. With $n$ tokens it is an $n \times n$ matrix: every token scored against every other token. Each of those $n^2$ entries is a dot product of length $d$, and then the weight-times-value step is another $n \times n$ pass. So the work is proportional to $2 n^2 d$, which we write as $O(n^2 d)$: it grows with the square of the sequence length. "$O(\cdot)$" is just shorthand for "grows in proportion to," ignoring constant factors. The consequence is brutal. Double the context and attention does four times the work. Go from a 1,000-token prompt to a 1,000,000-token one, a 1000x increase in length, and attention does a million times the work. That single fact is why long context is hard, and the first part of the demo makes the quadratic curve a concrete set of numbers.
Fix one, less compute: sparse and linear attention
The first family of fixes notices that the $n \times n$ score matrix is almost always wasteful. A query does not actually need a precise score against all $n$ keys; in real language most of a token's attention lands on a handful of others (its near neighbours, a few key earlier tokens) and the rest get a weight so close to zero they may as well not have been computed. So why compute them?
Sparse attention computes only a chosen subset of the score matrix. The simplest pattern is a sliding window: each query attends only to the $w$ most recent tokens instead of all $n$, where $w$ is a fixed window size much smaller than $n$. That turns the score matrix from $n \times n$ into roughly $n \times w$, so the cost drops from $O(n^2 d)$ to $O(n w d)$, linear in $n$ instead of quadratic. A richer version is top-$k$ selection: instead of a fixed local window, each query keeps only the $k$ keys it scores highest and ignores the rest, so it can still reach an important token far back while paying for only $k$ comparisons. DeepSeek Sparse Attention (DSA), used in DeepSeek-V3.2, does exactly this kind of learned selection: for each query it picks a sparse set of keys to attend to, so the cost stops growing with the full sequence length.
A different route is linear attention, which avoids the $n \times n$ matrix altogether. By dropping or reshaping the softmax, attention can be rewritten so that keys and values are summed into a fixed-size running state that each query reads from, making the cost grow linearly with $n$ and using a fixed amount of memory regardless of length. MiniMax's lightning attention (in MiniMax-01) is a production example: it replaces the quadratic softmax attention with a linear-cost mechanism for long sequences, which is what lets the model take very long inputs without the $n^2$ wall.
The second part of the demo shows the catch and why these methods work anyway. Cheaper operations are worthless if the answer comes out wrong. So it runs full attention and windowed attention on the same input and measures how far apart their outputs are. The input is built so that attention weight decays with distance (the way it does in real text, where a token leans mostly on what is near it), and the result is that almost all of every query's attention mass already falls inside the window. Clipping to the window then changes the output by a tiny amount: the approximation is reasonable, not random.
Fix two, less memory: MLA and KV compression
Sparse and linear attention attack compute. They do nothing for the other cost, which is memory. Recall the KV cache from Chapter 6 and Chapter 8: so the model does not recompute the past on every generated token, it stores the key and value vectors of every token it has seen. Each token contributes a key and a value, each of size $d$, so the cache holds on the order of $n \cdot d$ numbers per layer, and it lives in scarce GPU memory. For a 1M-token context this cache is enormous, and as Chapter 8 established, the KV cache is usually the thing that runs out first when serving long contexts. Sparse attention does not shrink it: even if a query only looks at a window, the model still has to keep the keys and values around in case a later query needs them.
Multi-head Latent Attention (MLA), introduced in DeepSeek-V2 and carried into DeepSeek-V3, shrinks the cache directly. The idea is low-rank compression. The full $d$-dimensional key and value vectors carry less independent information than their size suggests; they can be squeezed into a much smaller latent vector of dimension $r$, where $r$ is far smaller than $d$ (the "low-rank" part means a few dimensions capture most of what matters). MLA projects each token's key and value down to one shared latent of size $r$, stores only that latent in the cache, and projects it back up to full keys and values when attention actually needs them. The down-projection and up-projection are fixed weight matrices, paid once, not stored per token. So the per-token cache cost falls from about $d$ to about $r$, and the whole cache shrinks by roughly $d / r$.
The third part of the demo puts a number on it: the KV-cache bytes for a 1M-token context stored in full versus stored as MLA-style latents, and the compression ratio. It also runs a small sanity check that a rank-$r$ latent really can carry the information, by compressing genuinely low-rank vectors down and back up and confirming almost nothing is lost.
Don't be confused. Reducing attention compute and reducing KV-cache memory are two different problems with two different fixes, and it is easy to blur them because both wear the word "attention." Sparse and linear attention (DSA, MiniMax lightning) cut the number of score computations: they make the model do less arithmetic per token by not comparing every query against every key. MLA cuts the bytes per token in the cache: it makes each stored token smaller, but it does not change how many comparisons attention performs. One attacks the $n^2$ in the FLOPs, the other attacks the $n \cdot d$ in the memory. They are orthogonal, and the strongest long-context models combine them: sparse or linear attention running over a compressed MLA cache, paying less compute and less memory at once. When you read that a model "does efficient long context," ask which of the two it means, because a model can fix one and still be bottlenecked on the other.
The demo
"""Long-context attention efficiency: why long windows are expensive, and the two
architecture-level fixes.
A transformer's attention compares EVERY token against EVERY earlier token. For a
sequence of n tokens, that comparison forms an n x n SCORE MATRIX, so the compute grows
like n^2. Double the context and you quadruple the attention work. Separately, to avoid
recomputing the past on every step, the engine stores a KEY and a VALUE vector (each of
size d) per token: the KV CACHE, whose memory grows like n*d. These are two DIFFERENT
bottlenecks, and there are two different fixes:
1. FULL attention is O(n^2 * d). We count the multiply-add operations for the
score matrix (n*n*d) plus the value mixing (n*n*d) as n grows, so the quadratic
blow-up is visible as a number.
2. SPARSE / WINDOWED attention cuts COMPUTE. Each query attends only to a local
WINDOW of w nearby keys instead of all n, giving about O(n*w*d). We count its
ops vs full, and on a SMOOTH input show the windowed output stays close to full
attention (small max-abs difference), so the approximation is reasonable, not random.
3. MLA-style KV compression cuts MEMORY. Multi-head Latent Attention projects the
d-dim key/value down to a shared low-rank LATENT of size r << d, stores only that
latent per token, and projects back up when needed. We compute KV-cache BYTES for
full (n*d) vs latent (n*r) and the compression ratio for a long n.
Everything below is plain NumPy with float32. No GPU, no transformer library; just the
arithmetic that the real systems are built on. NumPy + stdlib only.
Run: python3 attention_efficiency.py
"""
import numpy as np
BYTES_PER_ELEM = 2 # KV caches are usually stored in fp16/bf16: 2 bytes/number
rng = np.random.default_rng(0)
# ----------------------------------------------------------------------------------
# Part 1: FULL attention is O(n^2 * d). Count the operations as n grows.
# ----------------------------------------------------------------------------------
# Attention is softmax(Q K^T / sqrt(d)) V. With n tokens and head dimension d:
# * Q K^T compares every query against every key: an n x n matrix, each entry a
# dot product of length d -> n * n * d multiply-adds.
# * Multiplying that n x n weight matrix by V mixes the values -> another n * n * d.
# So the cost is proportional to 2 * n^2 * d, the famous "quadratic in sequence length".
def full_attention_ops(n, d):
"""Multiply-add count for full attention over n tokens, head dim d."""
score_ops = n * n * d # Q K^T : the n x n score matrix
mix_ops = n * n * d # weights @ V : mixing the values
return score_ops + mix_ops
def windowed_attention_ops(n, d, w):
"""Each query attends to at most w keys (a local window) instead of all n.
The score matrix becomes n x w, so the cost is about 2 * n * w * d."""
eff = min(w, n)
return n * eff * d + n * eff * d
D = 64 # head dimension: the size of each query/key/value vector
W = 64 # local window: how many nearby keys a query may attend to
print("=== 1. Full attention is O(n^2): cost as the context grows ===")
print(f" head dim d = {D}, window w = {W}. Ops = multiply-adds for one attention head.\n")
print(f" {'n':>6} {'FULL ops (~2 n^2 d)':>22} {'WINDOWED ops (~2 n w d)':>24} {'full/windowed':>14}")
prev_full = None
for n in (64, 256, 1024, 4096):
full = full_attention_ops(n, D)
win = windowed_attention_ops(n, D, W)
grow = "" if prev_full is None else f" (n x4 -> ops x{full / prev_full:.0f})"
print(f" {n:>6} {full:>22,} {win:>24,} {full / win:>13.1f}x{grow}")
prev_full = full
print("\n Each 4x in n makes FULL attention ~16x more work (quadratic), while WINDOWED")
print(" grows only ~4x (linear in n). That gap is why naive attention cannot reach 1M tokens.\n")
# ----------------------------------------------------------------------------------
# Part 2: windowed attention APPROXIMATES full attention on smooth input.
# ----------------------------------------------------------------------------------
# Cheaper ops are useless if the answer is wrong. On structured input (here a smooth
# signal where nearby tokens are similar), almost all of a query's attention mass lands
# on nearby keys anyway, so restricting to a local window changes the output very little.
# We run BOTH on the same input and report the largest per-element difference.
def softmax(x, axis=-1):
x = x - np.max(x, axis=axis, keepdims=True) # subtract max for numerical safety
e = np.exp(x)
return e / np.sum(e, axis=axis, keepdims=True)
def _distance_bias(n, slope):
"""A score penalty that grows with how far apart two tokens are: position j gets
-slope * |i - j| added to query i's score. This is the ALiBi recipe, and it bakes in
the locality real language has: a token leans on what is near it. The further away a
key is, the lower its score, so its softmax weight decays toward zero with distance."""
idx = np.arange(n)
return -slope * np.abs(idx[:, None] - idx[None, :])
def full_attention(Q, K, V, slope):
"""The real thing: softmax(Q K^T / sqrt(d) + distance_bias) V, causal (a query may
only see itself and earlier tokens), every query over every visible key."""
n, d = Q.shape
idx = np.arange(n)
scores = Q @ K.T / np.sqrt(d) + _distance_bias(n, slope)
future = idx[None, :] > idx[:, None] # a token cannot see the future
scores = np.where(future, -np.inf, scores)
return softmax(scores, axis=-1) @ V
def windowed_attention(Q, K, V, w, slope):
"""Each query i attends only to keys in [i-w, i]. Positions outside the window get
-inf score, so softmax assigns them exactly zero weight. This is the local/sliding-
window sparsity pattern long-context models use; it never even forms the far scores."""
n, d = Q.shape
idx = np.arange(n)
scores = Q @ K.T / np.sqrt(d) + _distance_bias(n, slope)
# mask True where a key is OUTSIDE query i's window (too far back, or in the future)
too_far = (idx[None, :] < idx[:, None] - w) | (idx[None, :] > idx[:, None])
scores = np.where(too_far, -np.inf, scores)
return softmax(scores, axis=-1) @ V
N = 256
SLOPE = 0.05
# A realistic input: random token embeddings PLUS the distance bias above, so attention
# weight decays with distance the way it does in real language (nearby tokens matter most,
# far ones fade out). Under that decay, almost all of every query's attention mass already
# lands inside a local window, so clipping to the window changes the output only slightly.
X = rng.normal(0, 1, size=(N, D))
Q, K, V = X.copy(), X.copy(), X.copy() # tie Q=K=V to keep the demo simple
out_full = full_attention(Q, K, V, SLOPE)
out_win = windowed_attention(Q, K, V, W, SLOPE)
max_abs_diff = float(np.max(np.abs(out_full - out_win)))
rel = max_abs_diff / float(np.max(np.abs(out_full)))
# How much of each query's full-attention weight already lands inside the window:
idx = np.arange(N)
full_scores = Q @ K.T / np.sqrt(D) + _distance_bias(N, SLOPE)
full_w = softmax(np.where(idx[None, :] > idx[:, None], -np.inf, full_scores), axis=-1)
in_window = (idx[None, :] >= idx[:, None] - W) & (idx[None, :] <= idx[:, None])
avg_window_mass = float((full_w * in_window).sum(axis=1).mean())
print("=== 2. Windowed attention stays close to full attention (locality of language) ===")
print(f" n = {N} tokens, window w = {W}. Same input through both.")
print(f" full attention ops: {full_attention_ops(N, D):>14,}")
print(f" windowed attention ops: {windowed_attention_ops(N, D, W):>14,} "
f"({full_attention_ops(N, D) / windowed_attention_ops(N, D, W):.1f}x cheaper)")
print(f" avg attention mass already inside the window: {avg_window_mass:.1%}")
print(f" max abs difference in output: {max_abs_diff:.4f} "
f"({rel:.2%} of the largest output value)")
print(" -> a fraction of the compute, and the output barely moves. The window is a good")
print(" approximation because nearby tokens already carry almost all the attention mass.\n")
# ----------------------------------------------------------------------------------
# Part 3: MLA-style KV compression cuts MEMORY (not compute).
# ----------------------------------------------------------------------------------
# The KV cache stores, per token, a key and a value vector of size d. For n tokens that
# is on the order of n*d numbers, and it lives in scarce GPU memory (see chapters 6, 8).
# Multi-head Latent Attention (MLA, from DeepSeek-V2/V3) does not store the full d-dim
# K and V. It PROJECTS them down to a shared LOW-RANK LATENT of size r << d and stores
# only that latent per token; when attention needs the full K and V, it projects the
# latent back UP with fixed matrices. So the per-token cache shrinks from ~d to ~r.
def kv_cache_bytes_full(n, d, bytes_per_elem=BYTES_PER_ELEM):
"""Full cache: one key vector + one value vector of size d per token."""
return n * (2 * d) * bytes_per_elem
def kv_cache_bytes_latent(n, r, bytes_per_elem=BYTES_PER_ELEM):
"""MLA-style cache: one shared latent vector of size r per token (the up-projection
matrices are fixed weights, paid once, not per token)."""
return n * r * bytes_per_elem
def latent_roundtrip_error(d, r, n_probe=256, seed=1):
"""Sanity check that a rank-r latent can carry most of a d-dim vector when the data
itself is low-rank: build vectors that live in an r-dim subspace, compress to the
latent and back, and report the worst reconstruction error."""
g = np.random.default_rng(seed)
basis = g.normal(size=(r, d)) # an r-dimensional subspace of R^d
coeffs = g.normal(size=(n_probe, r))
kv = coeffs @ basis # genuinely rank-r data
down = np.linalg.pinv(basis) # d -> r (compress)
latent = kv @ down # store this (size r per token)
recon = latent @ basis # r -> d (project back up)
return float(np.max(np.abs(kv - recon)))
N_LONG = 1_000_000 # a 1M-token context, like claude-opus-4-8's long window
D_MODEL = 128 # per-head key/value dimension
R_LATENT = 16 # MLA latent dimension, r << d
full_bytes = kv_cache_bytes_full(N_LONG, D_MODEL)
latent_bytes = kv_cache_bytes_latent(N_LONG, R_LATENT)
err = latent_roundtrip_error(D_MODEL, R_LATENT)
print("=== 3. MLA-style KV compression cuts cache MEMORY ===")
print(f" n = {N_LONG:,} tokens, per-head d = {D_MODEL}, latent r = {R_LATENT} "
f"(r << d), {BYTES_PER_ELEM} bytes/number.\n")
print(" BEFORE (full cache: store K and V of size d per token)")
print(f" {full_bytes:,} bytes = {full_bytes / 1e9:.2f} GB (per layer, per head)\n")
print(" AFTER (MLA: store one shared latent of size r per token)")
print(f" {latent_bytes:,} bytes = {latent_bytes / 1e9:.2f} GB (per layer, per head)\n")
print(f" compression ratio: {full_bytes / latent_bytes:.1f}x smaller KV cache")
print(f" (low-rank round-trip on genuinely rank-{R_LATENT} data loses at most "
f"{err:.1e} per element: the latent carries the information.)")
print(" -> same context, a fraction of the cache. Memory, not compute, is what MLA buys.\n")
# ----------------------------------------------------------------------------------
# Summary: the two bottlenecks are independent and combine.
# ----------------------------------------------------------------------------------
print("=== Summary: two different bottlenecks, two different fixes ===")
print(" COMPUTE (score matrix is n x n): sparse/windowed/linear attention -> fewer ops.")
print(" MEMORY (KV cache is n x d): MLA low-rank latent -> fewer bytes.")
print(" They are orthogonal: a model can use BOTH (sparse attention over an MLA cache)")
print(" to serve 1M-token contexts cheaply. That is what makes whole-repo and whole-book")
print(" inputs affordable at all.")
Running it:
=== 1. Full attention is O(n^2): cost as the context grows ===
head dim d = 64, window w = 64. Ops = multiply-adds for one attention head.
n FULL ops (~2 n^2 d) WINDOWED ops (~2 n w d) full/windowed
64 524,288 524,288 1.0x
256 8,388,608 2,097,152 4.0x (n x4 -> ops x16)
1024 134,217,728 8,388,608 16.0x (n x4 -> ops x16)
4096 2,147,483,648 33,554,432 64.0x (n x4 -> ops x16)
Each 4x in n makes FULL attention ~16x more work (quadratic), while WINDOWED
grows only ~4x (linear in n). That gap is why naive attention cannot reach 1M tokens.
=== 2. Windowed attention stays close to full attention (locality of language) ===
n = 256 tokens, window w = 64. Same input through both.
full attention ops: 8,388,608
windowed attention ops: 2,097,152 (4.0x cheaper)
avg attention mass already inside the window: 99.9%
max abs difference in output: 0.0134 (0.34% of the largest output value)
-> a fraction of the compute, and the output barely moves. The window is a good
approximation because nearby tokens already carry almost all the attention mass.
=== 3. MLA-style KV compression cuts cache MEMORY ===
n = 1,000,000 tokens, per-head d = 128, latent r = 16 (r << d), 2 bytes/number.
BEFORE (full cache: store K and V of size d per token)
512,000,000 bytes = 0.51 GB (per layer, per head)
AFTER (MLA: store one shared latent of size r per token)
32,000,000 bytes = 0.03 GB (per layer, per head)
compression ratio: 16.0x smaller KV cache
(low-rank round-trip on genuinely rank-16 data loses at most 2.7e-14 per element: the latent carries the information.)
-> same context, a fraction of the cache. Memory, not compute, is what MLA buys.
=== Summary: two different bottlenecks, two different fixes ===
COMPUTE (score matrix is n x n): sparse/windowed/linear attention -> fewer ops.
MEMORY (KV cache is n x d): MLA low-rank latent -> fewer bytes.
They are orthogonal: a model can use BOTH (sparse attention over an MLA cache)
to serve 1M-token contexts cheaply. That is what makes whole-repo and whole-book
inputs affordable at all.
Read the three parts in order, because they trace the problem and both fixes. Part one is the problem. At $n = 64$ full and windowed attention cost the same, because the window is as wide as the sequence. From there they diverge fast: every time the length quadruples, full attention's op count goes up 16x (that is the $n^2$, since $4^2 = 16$), while windowed attention goes up only 4x (linear in $n$, since the window stays fixed). By $n = 4096$ full attention is doing 64x the work of the window for the same sequence, and that ratio keeps widening without limit. Extend the table to a million tokens and the full-attention column becomes a number no accelerator will run interactively. That is the wall.
Part two is the first fix and its justification. The window does a quarter of the operations at $n = 256$, and the gap only grows with $n$. The question is whether the answer survives, and the demo shows it does: 99.9 percent of every query's attention weight already lands inside the window, so dropping the keys outside it changes the output by 0.0134, which is 0.34 percent of the largest output value. The reason is the locality the input was built to have, and that real language has: a token leans mostly on what is near it. This is the property DSA's learned selection and a sliding window both exploit, and it is why sparse attention is a good approximation and not a gamble.
Part three is the second fix, on the other axis entirely. A 1M-token context stored as full keys and values takes 0.51 GB per layer per head; the same context stored as MLA latents of dimension 16 takes 0.03 GB, a 16x reduction, which is exactly $d / r = 128 / 16$. The round-trip sanity check confirms the latent is not throwing the information away: on genuinely rank-16 data the compress-and-restore loses at most $2.7 \times 10^{-14}$ per number, which is floating-point noise. Multiply that 16x by every layer and head in a real model and the cache for a long context goes from "does not fit" to "fits comfortably," which is the difference between offering a 1M-token window and not.
The exact numbers here come from the toy dimensions and the seed in the script; change $d$, $r$, the window, or the input and they move. What is robust is the structure: full attention is quadratic in length and falls over at scale, local or learned-sparse attention is linear and stays accurate when attention is local, and low-rank KV compression shrinks the cache by $d / r$ with almost no loss when the keys and values are genuinely low-rank.
Why this is the floor under everything else
This chapter is the architecture layer the rest of the book stands on. The capacity pressure from Chapter 1, the finite window, only relaxes to a million tokens because attention was made sub-quadratic and the cache was made small. The KV-cache serving work of Chapter 8, paging and prefix sharing, manages where the cache lives and how it is shared; MLA changes how big each token's entry is in the first place, and the two compound. And the cost economics of Chapter 2 reach their limit here: cheap long-context serving, the kind that makes feeding a whole repository or a whole book into one call affordable, is downstream of these architectural choices. When you can hand a model an entire codebase and ask one question across all of it, you are using a model whose architects paid down both the $n^2$ compute and the $n \cdot d$ memory so that the call is tractable at all.
It is worth being clear about what you, as someone engineering context, control here and what you do not. You do not choose a model's attention pattern or whether it uses MLA; those are fixed when the model is trained. What this chapter gives you is the why behind the prices and limits you do see. It explains why a provider can offer a 1M-token window at a workable price, why long-context calls are still more expensive per token than short ones even with these fixes, and why the discipline of the earlier chapters (sending less, caching the stable prefix, storing state outside the window) still pays off: a smaller, leaner context is cheaper on any architecture, efficient attention or not. The fixes here lower the ceiling on what is possible; the rest of the book lowers your bill underneath it.
Using the real tool: commands and before/after proof
The demo above proves the mechanism on this box. This section is about the two ways you actually meet efficient attention as a practitioner: you serve an open-weights model that has it built in, or you call a long-context model over an API that has it built in. Either way the thing you can measure and show a colleague is the same: how cost and memory scale with the context length $n$.
Serving an efficient-attention open-weights model
The open-weights models that ship the fixes from this chapter (DeepSeek with MLA and DeepSeek Sparse Attention, MiniMax with lightning attention) run under vLLM, a serving engine that downloads the weights from Hugging Face and exposes an HTTP endpoint. You point it at a model id and it serves it:
pip install vllm
# DeepSeek: Multi-head Latent Attention (small KV cache) + DeepSeek Sparse
# Attention (sub-quadratic compute). Use the current DeepSeek model id on
# Hugging Face, e.g. a DeepSeek-V3 / V3.2 / V4 family checkpoint.
vllm serve deepseek-ai/<current-deepseek-model>
# MiniMax: lightning (linear) attention for long sequences.
# Use the current MiniMax model id on Hugging Face.
vllm serve MiniMaxAI/<current-minimax-model>
Replace <current-deepseek-model> and <current-minimax-model> with the model id you find on
the model's Hugging Face page; the exact names move as new versions ship, so check the page
rather than trusting a name from memory. vLLM reads the architecture from the checkpoint and
runs the right attention kernel: MLA stores each token's key and value as the small latent from
the third part of the demo, so the KV cache for a long context is a fraction of a vanilla
transformer's at the same length, and the sparse or linear path means the per-token compute
stops growing with the full sequence. You did not configure any of that. It is in the weights,
and serving the model is how you use it. The point of the chapter is to know why that one
command can hold a context a vanilla transformer could not: both costs were paid down in the
architecture.
Calling a long-context model over an API
The other way you use efficient attention is to not run a model at all. When you call a
long-context model like claude-opus-4-8, with its 1M-token context window, the fact that
you can put a whole repository or a whole book in a single prompt at all is downstream of this
chapter. A 1M-token window is only offered because the attention is sub-quadratic and the KV
cache is small enough to serve; a vanilla transformer at a million tokens would be doing the
million-times-more work from part one of the demo and holding the half-gigabyte-per-head cache
from part three. You send the long prompt and read the answer; the architecture is what makes
the call tractable on the provider's side.
The before/after proof: cost versus context length
Here is a recipe you can run yourself to see the thing efficient attention controls. The metric
is cost, and the variable is context length. You do not need a GPU for it. Count the
tokens of inputs of increasing size with the Anthropic SDK, then multiply by the input rate
(claude-opus-4-8 is $5 per million input tokens, $25 per million output):
import anthropic
client = anthropic.Anthropic()
RATE_PER_TOKEN = 5.0 / 1_000_000 # $5 per million input tokens
# Feed progressively larger inputs (a file, then a directory, then a whole repo).
for label, text in [("one file", small), ("a package", medium), ("the whole repo", large)]:
n = client.messages.count_tokens(
model="claude-opus-4-8",
messages=[{"role": "user", "content": text}],
).input_tokens
print(f"{label:>16}: {n:>9,} tokens -> ${n * RATE_PER_TOKEN:.2f} per call")
count_tokens is the exact tokenizer the model uses, so the counts are real, not estimates.
Run it across inputs from a few thousand tokens up to the edge of the window and you watch the
per-call cost climb in a straight line with the length:
one file: 8,000 tokens -> $0.04 per call (illustrative / expected)
a package: 120,000 tokens -> $0.60 per call (illustrative / expected)
the whole repo: 900,000 tokens -> $4.50 per call (illustrative / expected)
(The numbers above are illustrative: the token counts depend on your actual files, and the SDK call needs network and an API key. The structure is the load-bearing part.) Notice that the price tracks the length linearly: ten times the context is about ten times the input bill, not a hundred times. That linearity is the whole point. The provider's per-token serving cost is roughly flat as $n$ grows, so they can charge you per token; if their attention were the naive $O(n^2)$ from the start of the chapter, the work behind each token would itself grow with the context, the half-gigabyte-per-head cache would not fit, and a 1M-token call would be priced out of reach (or not offered). The from-scratch demo is the on-box proof of that mechanism: part one showed the full quadratic compute curve that efficient attention flattens, and part three showed the MLA KV-memory drop directly, the $16\times$ smaller cache that lets the long context be held at all. The cost-versus-length table here is the same story measured from the outside, at the level you actually pay for.
Takeaways
- Attention compares every token against every other, so its score matrix is $n \times n$ and its compute is $O(n^2 d)$: quadruple the context, quadruple the work. That quadratic cost is the reason long context is hard.
- Sparse attention (a sliding window, or top-$k$/learned selection as in DeepSeek Sparse Attention) and linear attention (MiniMax lightning) cut the compute by scoring only the keys that matter, dropping the cost from $O(n^2 d)$ toward $O(n w d)$.
- A local window is a good approximation, not a gamble, because attention in real language is local: in the demo 99.9 percent of each query's weight is already inside the window, so the output barely moves.
- MLA (Multi-head Latent Attention, from DeepSeek-V2/V3) cuts the memory by storing each token's key and value as a small low-rank latent of size $r \ll d$ instead of the full $d$, shrinking the KV cache by about $d / r$ (16x in the demo) with negligible loss.
- Compute and memory are different bottlenecks with different fixes; the strongest
long-context models combine sparse/linear attention with an MLA cache. The 1M-token windows
of models like
claude-opus-4-8exist because of this layer.
👉 That closes the techniques. The final chapter steps back to the ecosystem: the open-source libraries and serving stacks that implement everything in this book, so you know what to reach for instead of building it yourself.