The papers: understand & reproduce

Like the other books, we built these methods as if from thin air. They come from a line of research papers. This chapter tells that story: what the key paper says, its lineage, how to read it, and an honest reflection on what we reproduced versus what FAISS adds.

The key paper

Hervé Jégou, Matthijs Douze, Cordelia Schmid. Product Quantization for Nearest Neighbor Search. IEEE TPAMI, 2011.

This paper introduced both halves of what we built:

  • Product Quantization (Section 3) — split vectors, quantize each sub-space, store the codes, and compute distances by table lookup (ADC).
  • IVFADC (Section 4) — combine an inverted file (coarse quantizer) with PQ on residuals, which is exactly our IVFPQ. The "ADC" is the asymmetric distance computation we implemented.

The four moving parts of the paper map onto our code:

Paper conceptOur code
Coarse quantizer (k-means)kmeans, IVF.train
Inverted listsIVF.add / IVFPQ.add
Product quantizer + codebooksPQ.train, PQ.encode
Asymmetric Distance ComputationPQ.distance_table, PQ.adc_distances
Residual encoding (IVFADC)IVFPQ.train / IVFPQ.search

Lineage and descendants

  • Vector quantization comes from classic signal compression (Lloyd's algorithm, 1957/1982 — the same k-means we use).
  • Inverted files come from text retrieval, repurposed for vectors.
  • Descendants: Optimized PQ (OPQ, Ge et al. 2013) learns a rotation before PQ; ScaNN (Google, 2020) adds anisotropic quantization tuned for inner product; FAISS packages all of these.

Recognizing this lineage helps: when the paper is terse about k-means or distance algebra, the older quantization literature fills the gap.

How to read a paper like this

The same three-pass approach as the other books:

  1. Skim (5 min). Abstract, Figure 2 (the PQ split), and the IVFADC system diagram. Decide what you need: here, PQ + IVFADC.
  2. Method, carefully. Read Sections 3–4. Write down, in your own words: how a code is formed, how ADC computes distance from codes, why residuals are encoded, and the parameters (m, k* = ksub, nlist/coarse k', w = nprobe).
  3. Reproduce. Implement and measure recall vs. brute force — the demo is the proof you understood it.

Tip: the paper reports recall@{1,10,100} and memory per vector. Reproducing those axes (not necessarily the exact numbers, which depend on the dataset) is the real test — and it surfaces the re-ranking insight, since raw recall@10 looks alarming until you shortlist-and-rescore.

From paper to code: the process

  1. Build the primitive first. k-means underlies everything, so we wrote and tested it before IVF or PQ.
  2. Translate the data structures. Codebooks → a (m, ksub, dsub) array; inverted lists → np.where(assign == c); codes → a uint8 matrix.
  3. Get ADC right. The table-lookup distance (adc_distances) is the subtle part; a wrong axis or sum and recall collapses.
  4. Add residual encoding for IVFPQ. Encoding x - centroid instead of x is a one-line change with a big recall effect — exactly as the paper claims.
  5. Validate against ground truth, and discover the re-ranking technique when raw recall looks low.

Reflection: what we built vs. production

Our implementation is algorithmically faithful — PQ, ADC, IVFADC, residuals, correct recall behavior. What we left out is performance engineering, which is most of FAISS:

AspectThis bookFAISS / ScaNN
Languagepure Python + NumPyC++/CUDA, SIMD, PQFastScan
Trainingfull data, simple Lloydsampled, k-means++, many iters
DistanceNumPy per cellregister-resident table lookups, GPU
ExtrasOPQ rotation, on-disk, sharding, deletes
Re-rankmanual top-100 rescoringIndexRefineFlat

Same lesson as the KTS and HNSW books: the algorithm is compact and learnable; the performance is careful systems work layered on top. A faithful from-scratch version is the right way to understand it — and a baseline to validate the fast version against.

See the References for the papers and their roots.