The problem: memory & scale

The HNSW book framed the speed problem. This book adds the memory problem — the one that decides whether your vectors fit on the machine at all.

Exact search: simple, correct, and huge

The baseline ("flat" index) stores every vector and scans them all:

import numpy as np

def flat_knn(X, q, k=10):
    d = ((X - q) ** 2).sum(1)        # distance to every vector
    idx = np.argsort(d)[:k]
    return idx, d[idx]

It's exact and trivial. Its problem is size.

The memory wall

A flat index stores $n \times d$ numbers as 32-bit floats — $n \cdot d \cdot 4$ bytes. Plug in real numbers:

Vectors $n$	Dim $d$	Float32 size
1,000,000	768	~3 GB
100,000,000	768	~300 GB
1,000,000,000	768	~3 TB

Three terabytes won't fit in the RAM of a normal server, and even 300 GB is expensive. RAM matters because vector search must be fast, and reading from disk per query is far too slow. So the question becomes:

How do we shrink each vector enough that a billion of them fit in memory, while still answering "what's nearest?" accurately?

That is the quantization problem, and it's the heart of this book.

Two levers

There are two independent things we can attack:

Don't look at everything (time). Partition the vectors into regions and search only the regions near the query. → IVF (Chapter 3).
Don't store everything (memory). Replace each vector with a compact code. → Scalar and Product quantization (Chapters 4–5).

Combine both and you get IVFPQ (Chapter 6): search a few regions, each holding heavily compressed vectors.

The trade-off we're buying

Every technique here trades a little accuracy (recall < 1.0) for big wins in memory and/or speed. The skill is choosing operating points:

IVF: more regions probed → higher recall, more work.
Quantization: more bytes per code → higher recall, more memory.
Re-ranking (a key trick we'll meet): use the cheap approximate index to shortlist candidates, then re-score the shortlist exactly — recovering most of the lost recall for little cost.

Relationship to HNSW

	HNSW	IVF	Quantization (PQ)
Attacks	search time	search time	memory
Idea	navigable graph	partition + probe	compress to bytes
Memory	more than flat (stores a graph too)	same as flat	much less than flat
Composes with	quantization	quantization	IVF, HNSW

HNSW makes search fast but uses extra memory for the graph. Quantization makes storage tiny. Real systems mix them. First, the algorithm both IVF and PQ are built on: k-means. 👉