The problem: memory & scale
The HNSW book framed the speed problem. This book adds the memory problem — the one that decides whether your vectors fit on the machine at all.
Exact search: simple, correct, and huge
The baseline ("flat" index) stores every vector and scans them all:
import numpy as np
def flat_knn(X, q, k=10):
d = ((X - q) ** 2).sum(1) # distance to every vector
idx = np.argsort(d)[:k]
return idx, d[idx]
It's exact and trivial. Its problem is size.
The memory wall
A flat index stores $n \times d$ numbers as 32-bit floats — $n \cdot d \cdot 4$ bytes. Plug in real numbers:
| Vectors $n$ | Dim $d$ | Float32 size |
|---|---|---|
| 1,000,000 | 768 | ~3 GB |
| 100,000,000 | 768 | ~300 GB |
| 1,000,000,000 | 768 | ~3 TB |
Three terabytes won't fit in the RAM of a normal server, and even 300 GB is expensive. RAM matters because vector search must be fast, and reading from disk per query is far too slow. So the question becomes:
How do we shrink each vector enough that a billion of them fit in memory, while still answering "what's nearest?" accurately?
That is the quantization problem, and it's the heart of this book.
Two levers
There are two independent things we can attack:
- Don't look at everything (time). Partition the vectors into regions and search only the regions near the query. → IVF (Chapter 3).
- Don't store everything (memory). Replace each vector with a compact code. → Scalar and Product quantization (Chapters 4–5).
Combine both and you get IVFPQ (Chapter 6): search a few regions, each holding heavily compressed vectors.
The trade-off we're buying
Every technique here trades a little accuracy (recall < 1.0) for big wins in memory and/or speed. The skill is choosing operating points:
- IVF: more regions probed → higher recall, more work.
- Quantization: more bytes per code → higher recall, more memory.
- Re-ranking (a key trick we'll meet): use the cheap approximate index to shortlist candidates, then re-score the shortlist exactly — recovering most of the lost recall for little cost.
Relationship to HNSW
| HNSW | IVF | Quantization (PQ) | |
|---|---|---|---|
| Attacks | search time | search time | memory |
| Idea | navigable graph | partition + probe | compress to bytes |
| Memory | more than flat (stores a graph too) | same as flat | much less than flat |
| Composes with | quantization | quantization | IVF, HNSW |
HNSW makes search fast but uses extra memory for the graph. Quantization makes storage tiny. Real systems mix them. First, the algorithm both IVF and PQ are built on: k-means. 👉