The papers: understand & reproduce

Like the other books, we built these methods as if from thin air. They come from a line of research papers. This chapter tells that story: what the key paper says, its lineage, how to read it, and an honest reflection on what we reproduced versus what FAISS adds.

The key paper

Hervé Jégou, Matthijs Douze, Cordelia Schmid. Product Quantization for Nearest Neighbor Search. IEEE TPAMI, 2011.

This paper introduced both halves of what we built:

Product Quantization (Section 3) — split vectors, quantize each sub-space, store the codes, and compute distances by table lookup (ADC).
IVFADC (Section 4) — combine an inverted file (coarse quantizer) with PQ on residuals, which is exactly our IVFPQ. The "ADC" is the asymmetric distance computation we implemented.

The four moving parts of the paper map onto our code:

Paper concept	Our code
Coarse quantizer (k-means)	`kmeans`, `IVF.train`
Inverted lists	`IVF.add` / `IVFPQ.add`
Product quantizer + codebooks	`PQ.train`, `PQ.encode`
Asymmetric Distance Computation	`PQ.distance_table`, `PQ.adc_distances`
Residual encoding (IVFADC)	`IVFPQ.train` / `IVFPQ.search`

Lineage and descendants

Vector quantization comes from classic signal compression (Lloyd's algorithm, 1957/1982 — the same k-means we use).
Inverted files come from text retrieval, repurposed for vectors.
Descendants: Optimized PQ (OPQ, Ge et al. 2013) learns a rotation before PQ; ScaNN (Google, 2020) adds anisotropic quantization tuned for inner product; FAISS packages all of these.

Recognizing this lineage helps: when the paper is terse about k-means or distance algebra, the older quantization literature fills the gap.

How to read a paper like this

The same three-pass approach as the other books:

Skim (5 min). Abstract, Figure 2 (the PQ split), and the IVFADC system diagram. Decide what you need: here, PQ + IVFADC.
Method, carefully. Read Sections 3–4. Write down, in your own words: how a code is formed, how ADC computes distance from codes, why residuals are encoded, and the parameters (m, k* = ksub, nlist/coarse k', w = nprobe).
Reproduce. Implement and measure recall vs. brute force — the demo is the proof you understood it.

Tip: the paper reports recall@{1,10,100} and memory per vector. Reproducing those axes (not necessarily the exact numbers, which depend on the dataset) is the real test — and it surfaces the re-ranking insight, since raw recall@10 looks alarming until you shortlist-and-rescore.

From paper to code: the process

Build the primitive first. k-means underlies everything, so we wrote and tested it before IVF or PQ.
Translate the data structures. Codebooks → a (m, ksub, dsub) array; inverted lists → np.where(assign == c); codes → a uint8 matrix.
Get ADC right. The table-lookup distance (adc_distances) is the subtle part; a wrong axis or sum and recall collapses.
Add residual encoding for IVFPQ. Encoding x - centroid instead of x is a one-line change with a big recall effect — exactly as the paper claims.
Validate against ground truth, and discover the re-ranking technique when raw recall looks low.

Reflection: what we built vs. production

Our implementation is algorithmically faithful — PQ, ADC, IVFADC, residuals, correct recall behavior. What we left out is performance engineering, which is most of FAISS:

Aspect	This book	FAISS / ScaNN
Language	pure Python + NumPy	C++/CUDA, SIMD, `PQFastScan`
Training	full data, simple Lloyd	sampled, k-means++, many iters
Distance	NumPy per cell	register-resident table lookups, GPU
Extras	—	OPQ rotation, on-disk, sharding, deletes
Re-rank	manual top-100 rescoring	`IndexRefineFlat`

Same lesson as the KTS and HNSW books: the algorithm is compact and learnable; the performance is careful systems work layered on top. A faithful from-scratch version is the right way to understand it — and a baseline to validate the fast version against.

See the References for the papers and their roots.