Introduction
No background required. Like its companion books, this one assumes you know nothing about programming, AI, or advanced math. Every concept, line of code, and symbol is explained, and every code example is followed by the exact output it produces. New to Python? Read the 5-minute primer first.
What this book is about
The HNSW book covered one way to search billions of vectors quickly: build a clever graph and navigate it. This book covers the other half of the story — how to make vector search fit in memory and partition the work — using two classic techniques:
- IVF (Inverted File index) — chop the vector space into regions and only search the few regions near the query.
- Quantization — compress each vector from hundreds of floats down to a handful of bytes, so a billion vectors fit in RAM.
Together (as IVFPQ) they are the workhorse behind FAISS and most billion-scale vector databases.
The two problems with exact search
Storing $n$ vectors of $d$ dimensions as 32-bit floats and scanning them all per query has two costs:
- Time — $n$ distance computations per query (the problem HNSW attacks).
- Memory — $n \times d \times 4$ bytes. One billion 768-dimensional embeddings is about 3 terabytes. That won't fit in RAM on any normal machine.
This book is mostly about problem 2 (and partly problem 1 via IVF). Quantization can shrink that 3 TB by 16–64×, down to something that fits on a single server — the difference between "impossible" and "routine".
Two everyday analogies
IVF — the library. A library doesn't make you read every book to find one about cooking. Books are shelved by topic; you go straight to the cooking section and look only there. IVF shelves vectors into regions (by similarity) and searches only the regions near your query.
Quantization — paint-by-numbers. Instead of storing the exact color of every pixel, paint-by-numbers stores a small palette and, per pixel, just the number of the nearest palette color. You lose a little fidelity but store far less. Quantization does exactly this with vectors: keep a small set of representative vectors (a "codebook") and store, per vector, just the id of the nearest one.
What you'll build
A complete, working compressed index in one readable file,
ivfpq.py, containing k-means, IVF, PQ, and IVFPQ —
plus:
- a demo measuring recall vs. memory vs. work, and
- a CLI tool that estimates the trade-off for your own vectors.
How it relates to HNSW
These aren't competitors — they compose. A common production setup is "HNSW on top of quantized vectors", or "IVF to partition, PQ to compress". HNSW answers which graph node next?; IVF/PQ answer how do we store a billion vectors and only look at a few? We'll map out exactly when to use which in the use-cases chapter.
Everything rests on one humble algorithm — k-means clustering — so after the problem statement, that's where we start. 👉
A 5-minute primer: vectors, clusters & recall
Just enough to read every example. Skip if you're comfortable with Python, NumPy, and nearest-neighbor basics (the HNSW primer covers the same ground in more detail).
Code boxes show their output
import numpy as np
print(np.array([1.0, 2.0, 3.0]).mean())
Output:
2.0
Vectors and distance
A vector is a list of numbers — a point in space. The squared Euclidean distance between two vectors is the sum of squared differences (we skip the square root because it doesn't change which point is nearest):
import numpy as np
a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])
print(np.sum((a - b) ** 2)) # 9 + 16
Output:
25.0
In this book vectors are usually embeddings — numeric fingerprints of text, images, products, or users, arranged so similar items sit close together. We just assume we're handed them.
Clusters and centroids
A cluster is a group of nearby vectors; its centroid is the group's average — a single vector that "represents" the group. The whole book leans on one idea: replace many vectors by a few representative centroids.
- IVF uses centroids to define regions (search only the nearby region).
- Quantization uses centroids as a palette (store the nearest centroid's id instead of the full vector).
The algorithm that finds good centroids is k-means, built from scratch in Chapter 2.
Recall: measuring an approximate answer
These methods are approximate — fast and small, but they occasionally miss a true neighbor. We measure quality with recall:
recall@k = (how many of the true top-
kneighbors we returned) ÷k.
Recall 1.0 = perfect; 0.9 = we found 9 of the true 10. Throughout, we compare against exact brute-force search to compute recall honestly.
true_top10 = {3, 8, 12, 19, 25, 31, 44, 50, 61, 77}
approx_top10 = {3, 8, 12, 19, 25, 31, 44, 50, 61, 99} # missed 77, added 99
recall = len(true_top10 & approx_top10) / 10
print(recall)
Output:
0.9
NumPy bits we use
| You'll see | Meaning |
|---|---|
np.argsort(d)[:k] | indices of the k smallest values in d |
(X - q) ** 2).sum(1) | squared distance from q to every row of X |
X @ C.T | all pairwise dot products between rows of X and C |
np.where(a == c)[0] | positions where assignment array a equals cell c |
.astype(np.uint8) | store small integers (0–255) in one byte each |
That last one is the heart of quantization: turning floats into single bytes. On to why exact search doesn't fit in memory. 👉
The problem: memory & scale
The HNSW book framed the speed problem. This book adds the memory problem — the one that decides whether your vectors fit on the machine at all.
Exact search: simple, correct, and huge
The baseline ("flat" index) stores every vector and scans them all:
import numpy as np
def flat_knn(X, q, k=10):
d = ((X - q) ** 2).sum(1) # distance to every vector
idx = np.argsort(d)[:k]
return idx, d[idx]
It's exact and trivial. Its problem is size.
The memory wall
A flat index stores $n \times d$ numbers as 32-bit floats — $n \cdot d \cdot 4$ bytes. Plug in real numbers:
| Vectors $n$ | Dim $d$ | Float32 size |
|---|---|---|
| 1,000,000 | 768 | ~3 GB |
| 100,000,000 | 768 | ~300 GB |
| 1,000,000,000 | 768 | ~3 TB |
Three terabytes won't fit in the RAM of a normal server, and even 300 GB is expensive. RAM matters because vector search must be fast, and reading from disk per query is far too slow. So the question becomes:
How do we shrink each vector enough that a billion of them fit in memory, while still answering "what's nearest?" accurately?
That is the quantization problem, and it's the heart of this book.
Two levers
There are two independent things we can attack:
- Don't look at everything (time). Partition the vectors into regions and search only the regions near the query. → IVF (Chapter 3).
- Don't store everything (memory). Replace each vector with a compact code. → Scalar and Product quantization (Chapters 4–5).
Combine both and you get IVFPQ (Chapter 6): search a few regions, each holding heavily compressed vectors.
The trade-off we're buying
Every technique here trades a little accuracy (recall < 1.0) for big wins in memory and/or speed. The skill is choosing operating points:
- IVF: more regions probed → higher recall, more work.
- Quantization: more bytes per code → higher recall, more memory.
- Re-ranking (a key trick we'll meet): use the cheap approximate index to shortlist candidates, then re-score the shortlist exactly — recovering most of the lost recall for little cost.
Relationship to HNSW
| HNSW | IVF | Quantization (PQ) | |
|---|---|---|---|
| Attacks | search time | search time | memory |
| Idea | navigable graph | partition + probe | compress to bytes |
| Memory | more than flat (stores a graph too) | same as flat | much less than flat |
| Composes with | quantization | quantization | IVF, HNSW |
HNSW makes search fast but uses extra memory for the graph. Quantization makes storage tiny. Real systems mix them. First, the algorithm both IVF and PQ are built on: k-means. 👉
k-means: the building block
Both IVF and quantization rest on one algorithm: k-means clustering. It finds
k representative points (centroids) that summarize a cloud of vectors. Get
this and the rest of the book is just two clever ways of using centroids.
The goal
Given many vectors, find k centroids so that every vector is close to its
nearest centroid. Those centroids become:
- the region centers in IVF (each vector belongs to its nearest region), and
- the palette in quantization (each vector is replaced by its nearest centroid's id).
Lloyd's algorithm (the standard k-means)
It alternates two steps until things stop moving:
- Assign — put each vector in the group of its nearest centroid.
- Update — move each centroid to the average of its group.
Repeat. Each round can only reduce the total distance, so it converges.
start: random centroids assign points move centroids to means
x x ● x x | ● x x ●
x x ● ─► x x | ● ─► x ● ●
x x x x x x | (nearest) x x x
...repeat until stable
The code
import numpy as np
def _assign(X, C):
"""Index of the nearest centroid for each row of X (squared L2).
Uses ||x-c||^2 = ||x||^2 - 2 x.c + ||c||^2 to compute all distances at once."""
d = (X ** 2).sum(1)[:, None] - 2.0 * X @ C.T + (C ** 2).sum(1)[None, :]
return d.argmin(1)
def kmeans(X, k, iters=25, seed=0):
"""Lloyd's k-means. Returns (centroids, assignments)."""
X = np.asarray(X, dtype=np.float64)
n = X.shape[0]
k = min(k, n)
rng = np.random.default_rng(seed)
C = X[rng.choice(n, size=k, replace=False)].copy() # init: random points
for _ in range(iters):
a = _assign(X, C) # 1. assign
newC = C.copy()
for j in range(k): # 2. update
members = X[a == j]
if len(members):
newC[j] = members.mean(0)
if np.allclose(newC, C): # converged?
C = newC
break
C = newC
return C, _assign(X, C)
Reading it
_assigncomputes, for every vector, the distance to every centroid in one vectorized expression, then picks the closest. This is the workhorse — IVF and PQ both call it.kmeansstarts fromkrandom data points, then loops assign→update until the centroids stop changing (oritersis hit).
See it work
Six points forming two obvious clusters — one near the origin, one near $(5,5)$:
import numpy as np
from ivfpq import kmeans
X = np.array([[0., 0.], [0.2, 0.1], [0.1, 0.2], # cluster A
[5., 5.], [5.1, 4.9], [4.9, 5.1]]) # cluster B
C, a = kmeans(X, k=2, seed=0)
print("centroids:\n", np.round(C, 2))
print("assignments:", a.tolist())
Output:
centroids:
[[5. 5. ]
[0.1 0.1]]
assignments: [1, 1, 1, 0, 0, 0]
k-means found exactly the two cluster centers — $(5,5)$ and $(0.1,0.1)$ — and
assigned the first three points to centroid 1 and the last three to centroid
0. 🎯 Those two centroids now summarize all six points.
How centroids power everything next
- IVF: treat each centroid as a bucket. Every vector is filed under its nearest centroid; at query time we only open the buckets near the query.
- Quantization: treat the centroids as a codebook. Store, per vector, just the id of its nearest centroid (one small integer) instead of the full vector.
Same algorithm, two superpowers. Let's build IVF first. 👉
IVF — the inverted file index
IVF (Inverted File index) attacks the time problem: instead of scanning every vector, partition them into regions and scan only the regions near the query. It's the "go straight to the cooking section of the library" idea.
How it works
- Train: run k-means to get
nlistcentroids. These definenlistregions (mathematically, Voronoi cells: each point belongs to its nearest centroid). - Add: assign every database vector to its nearest centroid, and store its id in that centroid's list. These lists are the "inverted lists".
- Search: find the
nprobecentroids nearest the query, then do an exact search only over the vectors in those few lists.
The name "inverted file" comes from text search: instead of "document → words", you keep "region → vectors in it", and look up by region.
centroid 0 ─► [v3, v9, v17, ...]
query ──► centroid 1 ─► [v1, v5, v22, ...] ◄─ nprobe=2: scan
centroid 2 ─► [v8, v11, ...] ◄─ these two lists only
centroid 3 ─► [v2, v14, ...]
The code
import numpy as np
from ivfpq import kmeans, _assign # _assign: nearest-centroid helper
class IVF:
def __init__(self, nlist=64, nprobe=8, seed=0):
self.nlist = nlist # number of regions
self.nprobe = nprobe # regions scanned per query
self.seed = seed
def train(self, X):
self.centroids, _ = kmeans(X, self.nlist, seed=self.seed)
return self
def add(self, X):
self.X = np.asarray(X, dtype=np.float64)
a = _assign(self.X, self.centroids) # which region each vector
self.lists = [np.where(a == c)[0] for c in range(len(self.centroids))]
return self
def search(self, q, k=10, nprobe=None):
nprobe = nprobe or self.nprobe
dc = ((self.centroids - q) ** 2).sum(1) # distance to each region
cells = np.argsort(dc)[:nprobe] # the nprobe nearest regions
cand = np.concatenate([self.lists[c] for c in cells])
d = ((self.X[cand] - q) ** 2).sum(1) # EXACT within those regions
order = np.argsort(d)[:k]
return cand[order], d[order]
Reading it
trainlearns the region centers;addfiles each vector into its region's list (np.where(a == c)[0]collects the ids in regionc).searchranks regions by distance to the query, takes the closestnprobe, gathers their vectors, and does an ordinary exact search over just those. Note the result is exact within the probed regions — IVF's only error is when a true neighbor sits in a region we didn't probe.
The regions really do partition the data
Building an 8-region IVF over 1000 vectors files them into balanced lists:
import numpy as np
from ivfpq import IVF
data = np.random.default_rng(1).normal(size=(1000, 16))
ivf = IVF(nlist=8, seed=0).train(data); ivf.add(data)
print("cell sizes:", [len(l) for l in ivf.lists])
Output:
cell sizes: [126, 144, 99, 109, 139, 139, 119, 125]
Every vector landed in exactly one of the 8 lists, ~125 each. A query now scans a couple of lists (~250 vectors) instead of all 1000.
The nprobe dial
nprobe trades recall for speed, without rebuilding:
nprobe = 1— fastest, but misses neighbors that fall just across a region border.- larger
nprobe— scans more regions, higher recall, more work.
In the demo you'll see IVF hit recall 1.0 while scanning ~6% of
the data (nprobe=8 of nlist=128). That's the whole point: near-exact answers
at a fraction of the cost.
What IVF does not do
IVF keeps the full vectors — it saves time, not memory. A billion vectors still need terabytes. To shrink them, we need quantization, which is next. 👉
Scalar quantization
Before the powerful Product Quantization, here's the simplest compression of all — scalar quantization (SQ) — which already gives a 4× memory cut and introduces the core idea: store small integers instead of floats.
The idea
A 32-bit float takes 4 bytes. But embedding values usually live in a modest range (say roughly $-1$ to $1$). If we map that range onto the 256 integers $0..255$, we can store each number in a single byte — a 4× reduction — losing only a little precision.
For each dimension we find its min and max across the dataset, then:
$$ \text{code} = \mathrm{round}\!\left( 255 \cdot \frac{x - \min}{\max - \min} \right), \qquad x \approx \min + \frac{\text{code}}{255}\,(\max - \min). $$
The first formula compresses a float to a byte; the second reconstructs an approximate float from the byte.
The code
import numpy as np
class ScalarQuantizer:
def train(self, X):
self.lo = X.min(axis=0) # per-dimension min
self.hi = X.max(axis=0) # per-dimension max
self.scale = np.maximum(self.hi - self.lo, 1e-9)
return self
def encode(self, X): # floats -> bytes
u = (X - self.lo) / self.scale # to 0..1
return np.clip(np.round(u * 255), 0, 255).astype(np.uint8)
def decode(self, codes): # bytes -> approx floats
return self.lo + (codes.astype(np.float64) / 255.0) * self.scale
See it work
import numpy as np
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(1000, 8))
sq = ScalarQuantizer().train(X)
codes = sq.encode(X)
recon = sq.decode(codes)
print("original :", np.round(X[0], 3).tolist())
print("codes(byte):", codes[0].tolist())
print("decoded :", np.round(recon[0], 3).tolist())
print("bytes/vec : float32", X.shape[1] * 4, "-> SQ", codes.shape[1],
f"({X.shape[1] * 4 // codes.shape[1]}x smaller)")
print("avg abs error:", round(float(np.abs(X - recon).mean()), 4))
Output:
original : [0.274, -0.46, -0.918, -0.967, 0.627, 0.826, 0.213, 0.459]
codes(byte): [162, 69, 10, 4, 207, 233, 155, 186]
decoded : [0.273, -0.459, -0.917, -0.968, 0.624, 0.823, 0.215, 0.457]
bytes/vec : float32 32 -> SQ 8 (4x smaller)
avg abs error: 0.0019
Each float became one byte (4× smaller), and the reconstruction is off by only
~0.002 — usually negligible for nearest-neighbor ranking. (This is FAISS's SQ8.)
Why we need more than SQ
SQ is great and widely used (FAISS calls it SQ8), but it tops out at ~4×
because it still keeps one byte per dimension. For a 768-dim vector that's
768 bytes — better than 3072, but a billion of them is still ~768 GB.
To go further — 16× to 64× — we must stop treating dimensions independently and compress groups of them together. That's Product Quantization, the heart of this book. 👉
Product quantization
Product Quantization (PQ) is the key idea of this book. It compresses a vector from hundreds of floats to a handful of bytes (16–64× smaller) and — cleverly — lets you compute approximate distances directly from those bytes, fast.
The problem with one big codebook
Quantization means "replace a vector by the id of its nearest codebook entry". To quantize whole 128-dim vectors finely, you'd need an astronomically large codebook (you can't cover 128-dim space with 256 points). Storing and searching such a codebook is impossible.
The trick: split, then quantize each piece
PQ splits each vector into m shorter sub-vectors and quantizes each piece
separately with its own small codebook of 256 centroids. The full vector's code
is just the m sub-vector ids — m bytes total.
vector (D=8): [ a1 a2 | b1 b2 | c1 c2 | d1 d2 ] split into m=4 pieces
│ │ │ │
codebook codebook codebook codebook each: 256 centroids
│ │ │ │
code (m=4 bytes): [ 37 | 201 | 8 | 142 ] one byte per piece
Why this is powerful: m codebooks of 256 entries together represent
$256^m$ distinct vectors — e.g. with m=8 that's $256^8 \approx 1.8 \times 10^{19}$
possible reconstructions — astronomically expressive, yet each vector costs only
m bytes and each codebook is tiny.
The code
import numpy as np
from ivfpq import kmeans, _assign
class PQ:
def __init__(self, m=8, ksub=256, seed=0):
self.m = m # number of sub-vectors (= bytes per code)
self.ksub = ksub # centroids per sub-quantizer (256 -> 1 byte)
self.seed = seed
def train(self, X):
n, D = X.shape
self.dsub = D // self.m # length of each sub-vector
self.codebooks = np.zeros((self.m, self.ksub, self.dsub))
for j in range(self.m): # one codebook per sub-space
sub = X[:, j * self.dsub:(j + 1) * self.dsub]
C, _ = kmeans(sub, self.ksub, seed=self.seed + j)
self.codebooks[j, :len(C)] = C
return self
def encode(self, X): # vectors -> m-byte codes
codes = np.zeros((X.shape[0], self.m), dtype=np.uint8)
for j in range(self.m):
sub = X[:, j * self.dsub:(j + 1) * self.dsub]
codes[:, j] = _assign(sub, self.codebooks[j])
return codes
train learns m independent codebooks (k-means on each slice of the
dimensions). encode replaces each sub-vector with the id of its nearest
centroid — m bytes per vector.
See it compress
Four-dimensional vectors in two clear groups; m=2 sub-vectors of 2 dims each,
ksub=2 centroids per piece:
import numpy as np
from ivfpq import PQ
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(loc=0, scale=0.3, size=(100, 4)),
rng.normal(loc=9, scale=0.3, size=(100, 4))])
pq = PQ(m=2, ksub=2, seed=1).train(data)
v = np.array([8.9, 9.1, 9.0, 8.8])
code = pq.encode(v[None])[0]
recon = np.concatenate([pq.codebooks[j, code[j]] for j in range(2)])
print("original :", v.tolist())
print("PQ code :", code.tolist(), "(2 bytes instead of 4 floats)")
print("reconstructed:", np.round(recon, 2).tolist())
Output:
original : [8.9, 9.1, 9.0, 8.8]
PQ code : [1, 1] (2 bytes instead of 4 floats)
reconstructed: [9.01, 9.0, 8.96, 9.01]
The vector became just two bytes [1, 1], and reconstructs back to almost the
original. Lossy, but close — and tiny.
ADC: distance without decompressing
Here's the elegant part. To search, we don't decompress every stored vector. Instead we use Asymmetric Distance Computation (ADC):
- Split the query into the same
mpieces. - For each piece, precompute the distance from the query's piece to all 256
centroids of that codebook — a small distance table of shape
(m, 256). - For any stored code, its approximate distance to the query is just
mtable lookups added up — no multiplications overDdimensions.
$$ \text{dist}(q, x) \approx \sum_{j=1}^{m} \text{table}\big[j,\ \text{code}_x[j]\big]. $$
"Asymmetric" = the query stays full-precision while the database is compressed, which is more accurate than compressing both. The table is built once per query; scoring a million codes is then a million cheap lookups.
def distance_table(self, q): # shape (m, ksub)
table = np.zeros((self.m, self.ksub))
for j in range(self.m):
qs = q[j * self.dsub:(j + 1) * self.dsub]
table[j] = ((self.codebooks[j] - qs) ** 2).sum(1)
return table
def adc_distances(self, codes, q): # approx distance for each code
table = self.distance_table(q)
d = np.zeros(len(codes))
for j in range(self.m):
d += table[j, codes[:, j]]
return d
The accuracy cost — and re-ranking
Compression is lossy, so raw PQ recall is modest. From the demo:
PQ m= 8 (8B/vec, 64x smaller): recall@10 raw=0.292 +rerank(top100)=0.843
PQ m=16 (16B/vec, 32x smaller): recall@10 raw=0.386 +rerank(top100)=0.938
Raw recall@10 of ~0.3 sounds bad — but it's not how PQ is used. The standard, crucial trick is re-ranking:
Use the cheap PQ distances to shortlist a larger candidate set (say the top 100), then compute exact distances on just those 100 and keep the top 10.
Re-ranking lifts recall@10 from 0.29 to 0.84 (m=8) and 0.39 to 0.94
(m=16) — near-exact accuracy, at 32–64× less memory, because we only pay full
distance cost on a tiny shortlist. (Re-ranking does need access to the full
vectors for that shortlist; FAISS's IndexRefineFlat keeps them for exactly this.)
PQ shrinks memory; IVF cuts search time. Next we combine them. 👉
IVFPQ — partition + compress
IVF saves time (scan few regions); PQ saves memory (bytes per vector).
IVFPQ does both at once, and it's the index that made billion-scale vector
search practical. This is the FAISS workhorse (IndexIVFPQ).
The one new idea: encode residuals
Combining them naively (PQ-compress everything, then bucket by IVF) wastes accuracy. IVFPQ adds a refinement: instead of PQ-encoding each vector directly, it encodes the vector's residual — what's left after subtracting its region's centroid.
$$ \text{residual} = x - \text{centroid}(x). $$
Why this helps: within a region, all vectors are close to the centroid, so their
residuals are small and similar. PQ quantizes small, concentrated residuals far
more accurately than the spread-out raw vectors — better recall for the same
bytes. The full reconstruction is centroid + decoded_residual.
build: x ─► nearest centroid c ─► residual (x - c) ─► PQ code (m bytes)
store (id, code) in cell c's inverted list
search: q ─► nearest nprobe cells
for each cell c: build PQ table for residual (q - c)
score that cell's codes by ADC table lookups
merge cells, keep top-k (then optionally re-rank exactly)
The code
import numpy as np
from ivfpq import kmeans, _assign, PQ
class IVFPQ:
def __init__(self, nlist=64, nprobe=8, m=8, ksub=256, seed=0):
self.nlist, self.nprobe = nlist, nprobe
self.m, self.ksub, self.seed = m, ksub, seed
def train(self, X):
self.centroids, a = kmeans(X, self.nlist, seed=self.seed)
residuals = X - self.centroids[a] # encode residuals, not X
self.pq = PQ(self.m, self.ksub, seed=self.seed + 1).train(residuals)
return self
def add(self, X):
a = _assign(X, self.centroids)
residuals = X - self.centroids[a]
self.codes = self.pq.encode(residuals) # m bytes per vector
self.lists = [np.where(a == c)[0] for c in range(len(self.centroids))]
return self
def search(self, q, k=10, nprobe=None):
nprobe = nprobe or self.nprobe
dc = ((self.centroids - q) ** 2).sum(1)
cells = np.argsort(dc)[:nprobe]
all_d, all_id = [], []
for c in cells:
ids = self.lists[c]
if ids.size == 0:
continue
res_q = q - self.centroids[c] # query residual for THIS cell
d = self.pq.adc_distances(self.codes[ids], res_q)
all_d.append(d); all_id.append(ids)
d = np.concatenate(all_d); ids = np.concatenate(all_id)
order = np.argsort(d)[:k]
return ids[order], d[order]
Reading it
trainlearns the IVF centroids and trains PQ on the residuals.addfiles each vector into its cell and stores only itsm-byte residual code.searchprobes thenprobenearest cells; for each, it forms the query's residual relative to that cell's centroid, builds the PQ distance table, and scores the cell's codes by ADC. Results from all probed cells are merged.
How well it works
From the demo (10,000 vectors, m=16 → 16 bytes/vector):
IVFPQ (IVF + PQ on residuals — partitioned and compressed):
nprobe= 8: recall@10 raw=0.741 +rerank(top100)=1.000
nprobe=16: recall@10 raw=0.741 +rerank(top100)=1.000
So IVFPQ is partitioned (scans ~6% of cells) and compressed (16–32× smaller), and with re-ranking reaches perfect recall on this data. That combination — small memory, fast search, high recall — is why it scales to billions of vectors.
The knobs, all together
| Knob | Controls | Bigger means |
|---|---|---|
nlist | number of IVF cells | finer partition; needs more nprobe for recall |
nprobe | cells probed per query | higher recall, more time (runtime dial) |
m | bytes per PQ code | higher recall, more memory |
ksub | centroids per sub-quantizer | usually 256 (1 byte); rarely changed |
| re-rank size | exact-rescoring shortlist | higher recall, slight extra time |
A typical recipe: pick nlist ≈ √n, train, then tune nprobe and the re-rank
size at query time until recall is high enough — all without rebuilding the codes.
Next, the consolidated implementation. 👉
The full implementation
Here is the consolidated module — k-means, exact search, IVF, PQ, and IVFPQ — in
one file, code/ivfpq.py, reproduced in full.
"""
IVF and Product Quantization for approximate nearest-neighbor search,
from scratch in NumPy.
References:
Jégou, Douze, Schmid. "Product Quantization for Nearest Neighbor Search."
IEEE TPAMI, 2011. (introduces PQ and the IVFADC system)
This module implements, in order of dependency:
kmeans(X, k) -> the building block for every codebook here
flat_knn(X, q, k) -> exact search (ground truth for recall)
IVF -> Inverted File index (partition + probe)
PQ -> Product Quantizer (compress vectors to bytes)
IVFPQ -> IVF + PQ on residuals (the FAISS workhorse)
Only NumPy is used.
"""
from __future__ import annotations
import numpy as np
# ===========================================================================
# k-means and exact search
# ===========================================================================
def _assign(X, C):
"""Nearest-centroid index for each row of X (squared L2)."""
d = (X ** 2).sum(1)[:, None] - 2.0 * X @ C.T + (C ** 2).sum(1)[None, :]
return d.argmin(1)
def kmeans(X, k, iters=25, seed=0):
"""Lloyd's k-means. Returns (centroids, assignments)."""
X = np.asarray(X, dtype=np.float64)
n = X.shape[0]
k = min(k, n)
rng = np.random.default_rng(seed)
C = X[rng.choice(n, size=k, replace=False)].copy()
for _ in range(iters):
a = _assign(X, C)
newC = C.copy()
for j in range(k):
members = X[a == j]
if len(members):
newC[j] = members.mean(0)
if np.allclose(newC, C):
C = newC
break
C = newC
return C, _assign(X, C)
def flat_knn(X, q, k=10):
"""Exact k nearest neighbors by brute force. Returns (ids, distances)."""
d = ((X - q) ** 2).sum(1)
idx = np.argsort(d)[:k]
return idx, d[idx]
# ===========================================================================
# IVF — Inverted File index (partition space, search a few partitions)
# ===========================================================================
class IVF:
def __init__(self, nlist=64, nprobe=8, seed=0):
self.nlist = nlist # number of partitions (Voronoi cells)
self.nprobe = nprobe # how many cells to scan per query
self.seed = seed
def train(self, X):
"""Learn the partition centroids (the coarse quantizer)."""
self.centroids, _ = kmeans(X, self.nlist, seed=self.seed)
return self
def add(self, X):
"""Assign every vector to its nearest cell and store the inverted lists."""
self.X = np.asarray(X, dtype=np.float64)
a = _assign(self.X, self.centroids)
self.lists = [np.where(a == c)[0] for c in range(len(self.centroids))]
return self
def search(self, q, k=10, nprobe=None):
nprobe = nprobe or self.nprobe
q = np.asarray(q, dtype=np.float64)
dc = ((self.centroids - q) ** 2).sum(1)
cells = np.argsort(dc)[:nprobe]
cand = np.concatenate([self.lists[c] for c in cells]) if len(cells) else np.array([], int)
if cand.size == 0:
return np.array([], int), np.array([])
d = ((self.X[cand] - q) ** 2).sum(1) # exact distance within probed cells
order = np.argsort(d)[:k]
return cand[order], d[order]
# ===========================================================================
# PQ — Product Quantization (compress each vector to `m` bytes)
# ===========================================================================
class PQ:
def __init__(self, m=8, ksub=256, seed=0):
self.m = m # number of sub-vectors
self.ksub = ksub # centroids per sub-quantizer (256 => 1 byte)
self.seed = seed
def train(self, X):
X = np.asarray(X, dtype=np.float64)
n, D = X.shape
if D % self.m != 0:
raise ValueError(f"dim {D} not divisible by m={self.m}")
self.dsub = D // self.m
self.codebooks = np.zeros((self.m, self.ksub, self.dsub))
for j in range(self.m):
sub = X[:, j * self.dsub:(j + 1) * self.dsub]
C, _ = kmeans(sub, self.ksub, seed=self.seed + j)
self.codebooks[j, :len(C)] = C
return self
def encode(self, X):
"""Map each vector to `m` centroid ids (uint8 codes)."""
X = np.asarray(X, dtype=np.float64)
codes = np.zeros((X.shape[0], self.m), dtype=np.uint8)
for j in range(self.m):
sub = X[:, j * self.dsub:(j + 1) * self.dsub]
codes[:, j] = _assign(sub, self.codebooks[j])
return codes
def distance_table(self, q):
"""Squared distance from q's sub-vectors to every centroid: shape (m, ksub)."""
q = np.asarray(q, dtype=np.float64)
table = np.zeros((self.m, self.ksub))
for j in range(self.m):
qs = q[j * self.dsub:(j + 1) * self.dsub]
table[j] = ((self.codebooks[j] - qs) ** 2).sum(1)
return table
def adc_distances(self, codes, q):
"""Asymmetric distance: approx ||q - x|| using only the codes + table."""
table = self.distance_table(q)
d = np.zeros(len(codes))
for j in range(self.m):
d += table[j, codes[:, j]]
return d
# ===========================================================================
# IVFPQ — IVF + PQ on residuals (compressed AND partitioned)
# ===========================================================================
class IVFPQ:
def __init__(self, nlist=64, nprobe=8, m=8, ksub=256, seed=0):
self.nlist = nlist
self.nprobe = nprobe
self.m = m
self.ksub = ksub
self.seed = seed
def train(self, X):
X = np.asarray(X, dtype=np.float64)
self.centroids, a = kmeans(X, self.nlist, seed=self.seed)
residuals = X - self.centroids[a] # PQ encodes the residual
self.pq = PQ(self.m, self.ksub, seed=self.seed + 1).train(residuals)
return self
def add(self, X):
X = np.asarray(X, dtype=np.float64)
a = _assign(X, self.centroids)
residuals = X - self.centroids[a]
self.codes = self.pq.encode(residuals)
self.lists = [np.where(a == c)[0] for c in range(len(self.centroids))]
return self
def search(self, q, k=10, nprobe=None):
nprobe = nprobe or self.nprobe
q = np.asarray(q, dtype=np.float64)
dc = ((self.centroids - q) ** 2).sum(1)
cells = np.argsort(dc)[:nprobe]
all_d, all_id = [], []
for c in cells:
ids = self.lists[c]
if ids.size == 0:
continue
res_q = q - self.centroids[c] # residual relative to this cell
d = self.pq.adc_distances(self.codes[ids], res_q)
all_d.append(d)
all_id.append(ids)
if not all_id:
return np.array([], int), np.array([])
d = np.concatenate(all_d)
ids = np.concatenate(all_id)
order = np.argsort(d)[:k]
return ids[order], d[order]
__all__ = ["kmeans", "flat_knn", "IVF", "PQ", "IVFPQ"]
The
{{#include}}directive splices in the realcode/ivfpq.pyat build time, so the documentation can never drift from the actual implementation.
API at a glance
| Call | Purpose |
|---|---|
kmeans(X, k) | Learn k centroids + assignments (used by everything). |
flat_knn(X, q, k) | Exact search — the ground truth for recall. |
IVF(nlist, nprobe).train(X).add(X) | Partitioned exact search. |
PQ(m, ksub).train(X) | Compress vectors to m-byte codes; .encode, .adc_distances. |
IVFPQ(nlist, m).train(X).add(X) | Partitioned and compressed search. |
Typical usage
import numpy as np
from ivfpq import IVFPQ
vectors = np.load("embeddings.npy") # shape (n, d), d divisible by m
index = IVFPQ(nlist=256, m=16, ksub=256).train(vectors)
index.add(vectors)
q = vectors[0]
ids, approx_d = index.search(q, k=100, nprobe=16) # shortlist via compressed codes
# re-rank the shortlist exactly for high recall:
exact_d = ((vectors[ids] - q) ** 2).sum(1)
top10 = ids[np.argsort(exact_d)[:10]]
print(top10)
What this is and isn't
This implementation is written for clarity — it shows exactly how IVF, PQ, and
IVFPQ work and produces correct, measurable recall. Production libraries
(FAISS, ScaNN) add the speed and scale: SIMD/AVX distance kernels, optimized
PQ table lookups (PQFastScan), GPU support, multi-threaded training on samples,
on-disk indexes, and combinations like HNSW-as-coarse-quantizer or
OPQ (a learned rotation before PQ that boosts recall). They share these exact
algorithms; they just run them far faster — see the
use-cases chapter, which maps each class here to its FAISS
equivalent.
Next: measure the whole thing. 👉
A worked demo: recall vs. memory
This demo puts all four index types side by side on the same data and measures the
three quantities that matter: recall (accuracy), memory, and work
(fraction of data scanned). The full script is code/demo.py.
The setup
10,000 vectors in 64 dimensions, drawn from 40 clusters (so the data has real
structure, like embeddings do). 100 queries. We compute exact answers with
flat_knn as ground truth, then score each method's recall against them.
import numpy as np
from ivfpq import flat_knn, IVF, PQ, IVFPQ
# ... build clustered data, pick 100 queries, compute exact top-10 ...
The result
Running python demo.py (takes ~40s — pure-Python k-means training) prints:
dataset: 10000 vectors x 64 dims; 100 queries; k=10
IVF (partition into 128 cells, scan a few):
nprobe= 1: recall@10=0.657 ~78 vectors scanned (0.8% of data)
nprobe= 4: recall@10=0.994 ~312 vectors scanned (3.1% of data)
nprobe= 8: recall@10=1.000 ~625 vectors scanned (6.2% of data)
nprobe=16: recall@10=1.000 ~1250 vectors scanned (12.5% of data)
PQ (compress each vector to m bytes):
m= 8 (8B/vec, 64x smaller): recall@10 raw=0.292 +rerank(top100)=0.843
m=16 (16B/vec, 32x smaller): recall@10 raw=0.386 +rerank(top100)=0.938
IVFPQ (IVF + PQ on residuals — partitioned and compressed):
nprobe= 8: recall@10 raw=0.741 +rerank(top100)=1.000
nprobe=16: recall@10 raw=0.741 +rerank(top100)=1.000
flat (exact) memory: 5.1 MB for 10000 vectors
IVF keeps full vectors (no compression); PQ/IVFPQ shrink them 16-32x.
How to read it
IVF — near-exact, a fraction of the work. Scanning just 6% of the data
(nprobe=8) gives perfect recall. IVF's only error is a true neighbor sitting
in an un-probed cell; raising nprobe fixes it. IVF saves time but keeps full
vectors (no memory savings).
PQ — huge memory savings, recovered by re-ranking. Raw PQ recall looks low
(0.29–0.39) because the codes are lossy — but that's not how PQ is used. Shortlist
the top-100 by cheap PQ distance, re-rank those exactly, and recall jumps to
0.84–0.94 at 32–64× less memory. The m knob trades bytes for accuracy.
IVFPQ — the best of both. Partitioned (scans ~6% of cells) and compressed (16–32× smaller), and with re-ranking it reaches recall 1.0. This is why IVFPQ scales to billions of vectors on a single machine.
A note on the compression factor. The demo's arrays are float64 (8 bytes), so it reports 32–64×. Real systems store float32 (4 bytes), so the same
m=8code is 32× smaller andm=16is 16× — still enormous.
The mental model
saves TIME? saves MEMORY? recall (with re-rank)
flat no no 1.000 (exact)
IVF YES no ~1.0
PQ no YES (16-64x) high
IVFPQ YES YES (16-64x) high
Things to try
- Raise
nprobeand watch IVF/IVFPQ recall climb toward 1.0. - Increase
m(8 → 16 → 32) for higher PQ recall at the cost of more bytes. - Increase the re-rank shortlist (100 → 200) to squeeze out more recall.
- Grow
Nto 100,000 to feel how IVF's "% scanned" advantage compounds.
Now, where all this is actually used — including FAISS. 👉
Use cases & how to choose
You've now built every piece that powers industrial vector search. This chapter connects them to the real world — especially FAISS, the library where these algorithms live in production — and gives you a decision guide.
FAISS: where these algorithms actually run
FAISS (Facebook AI Similarity Search) is the most widely used library for this, and it is essentially a highly optimized, composable version of exactly what this book built. Every class here maps directly to a FAISS index:
| This book | FAISS index | What it is |
|---|---|---|
flat_knn | IndexFlatL2 / IndexFlatIP | exact brute force |
ScalarQuantizer | IndexScalarQuantizer (SQ8) | 1 byte/dim, 4× smaller |
IVF | IndexIVFFlat | partition + probe, full vectors |
PQ | IndexPQ | product-quantized codes |
IVFPQ | IndexIVFPQ | partition + PQ residuals (the workhorse) |
| (HNSW book) | IndexHNSWFlat | graph navigation |
The FAISS "index factory"
FAISS lets you describe an index with a short string, and it assembles the pieces — the same Lego blocks from this book:
"Flat" exact search
"IVF4096,Flat" IVF with 4096 cells, full vectors (our IVF)
"PQ16" product quantization, 16 bytes/vector (our PQ)
"IVF4096,PQ16" IVF + PQ residuals (our IVFPQ)
"OPQ16,IVF4096,PQ16" add a learned rotation before PQ (higher recall)
"HNSW32" HNSW graph, 32 links/node (the HNSW book)
"IVF4096,PQ16,RFlat" IVFPQ + exact re-ranking (RFlat = refine) (our re-rank trick)
Reading those strings is now easy: each comma-separated piece is a stage you
understand. OPQ is one we didn't build — a learned rotation of the data before
PQ that spreads information evenly across sub-vectors, measurably improving recall;
it's a drop-in upgrade to the PQ stage.
Why FAISS is fast (and our version isn't)
Same algorithms, serious engineering: SIMD/AVX distance kernels, a special
PQFastScan that does PQ table lookups in registers, GPU implementations,
multi-threaded training on a sample of the data, and memory-mapped on-disk
indexes for sets too big for RAM. Our from-scratch version is for understanding;
FAISS is for shipping — and validating your understanding against.
The common pattern: embed → index → search
As in the HNSW book, almost every application is:
- Embed items (text/images/users) into vectors with a model.
- Index them — and at scale, that means IVFPQ (or HNSW) to fit memory and stay fast.
- Search with a query embedding; optionally re-rank the shortlist exactly.
IVF/PQ is the part that makes step 2 survive billions of items.
Where it's used
- Large-scale semantic search & RAG. When a corpus has tens of millions of chunks, storing full float vectors is too costly; IVFPQ compresses them and keeps retrieval fast. The retrieved shortlist is then re-ranked (and fed to an LLM, for RAG).
- Recommendations. Item embeddings from matrix factorization or two-tower models (see the HNSW use-cases) are indexed with IVFPQ; "similar items" and per-user candidate generation become compressed nearest-neighbor lookups over hundreds of millions of items.
- Image / video / audio search at web scale — billions of media embeddings compressed to bytes.
- De-duplication & clustering of massive corpora, where keeping full vectors in RAM is infeasible.
How to choose: a decision guide
| Situation | Use |
|---|---|
| Small data (≤ ~10k–100k) | Flat (exact). Simple, and vectorized brute force is plenty fast. |
| Need speed, RAM is fine | HNSW or IVFFlat. Full vectors, near-exact, fast. |
| RAM-constrained, big data | PQ / IVFPQ. Compress 16–64×; re-rank for recall. |
| Billions of vectors | IVFPQ (often OPQ,IVF…,PQ…), the standard at scale. |
| Maximum recall, RAM available | HNSW (optionally on SQ8) + exact re-rank. |
| Need exact guarantees | Flat only. |
Rules of thumb:
- Start exact. If brute force is fast enough, don't add complexity.
- Hit a speed wall? Add IVF (or HNSW).
- Hit a memory wall? Add PQ, and always pair it with re-ranking.
- Both walls (web scale)? IVFPQ, tuned via
nprobeand re-rank size.
How IVF/PQ and HNSW relate
They're complementary layers, often combined:
- HNSW as a coarse quantizer. FAISS can replace IVF's flat list of centroids
with an HNSW graph over the centroids (
IndexHNSWFlatas the quantizer), making the "which cells?" step itself fast whennlistis huge. - HNSW on compressed vectors. Run the graph over PQ/SQ-compressed vectors to cut the graph's memory.
- Division of labor. HNSW = fast navigation; IVF = partitioning; PQ = compression. Real systems mix the three to hit a target recall, latency, and memory budget simultaneously.
Next, a look at the papers that introduced these ideas. 👉
The papers: understand & reproduce
Like the other books, we built these methods as if from thin air. They come from a line of research papers. This chapter tells that story: what the key paper says, its lineage, how to read it, and an honest reflection on what we reproduced versus what FAISS adds.
The key paper
Hervé Jégou, Matthijs Douze, Cordelia Schmid. Product Quantization for Nearest Neighbor Search. IEEE TPAMI, 2011.
This paper introduced both halves of what we built:
- Product Quantization (Section 3) — split vectors, quantize each sub-space, store the codes, and compute distances by table lookup (ADC).
- IVFADC (Section 4) — combine an inverted file (coarse quantizer) with PQ
on residuals, which is exactly our
IVFPQ. The "ADC" is the asymmetric distance computation we implemented.
The four moving parts of the paper map onto our code:
| Paper concept | Our code |
|---|---|
| Coarse quantizer (k-means) | kmeans, IVF.train |
| Inverted lists | IVF.add / IVFPQ.add |
| Product quantizer + codebooks | PQ.train, PQ.encode |
| Asymmetric Distance Computation | PQ.distance_table, PQ.adc_distances |
| Residual encoding (IVFADC) | IVFPQ.train / IVFPQ.search |
Lineage and descendants
- Vector quantization comes from classic signal compression (Lloyd's algorithm, 1957/1982 — the same k-means we use).
- Inverted files come from text retrieval, repurposed for vectors.
- Descendants: Optimized PQ (OPQ, Ge et al. 2013) learns a rotation before PQ; ScaNN (Google, 2020) adds anisotropic quantization tuned for inner product; FAISS packages all of these.
Recognizing this lineage helps: when the paper is terse about k-means or distance algebra, the older quantization literature fills the gap.
How to read a paper like this
The same three-pass approach as the other books:
- Skim (5 min). Abstract, Figure 2 (the PQ split), and the IVFADC system diagram. Decide what you need: here, PQ + IVFADC.
- Method, carefully. Read Sections 3–4. Write down, in your own words: how a
code is formed, how ADC computes distance from codes, why residuals are
encoded, and the parameters (
m,k*=ksub,nlist/coarsek',w=nprobe). - Reproduce. Implement and measure recall vs. brute force — the demo is the proof you understood it.
Tip: the paper reports recall@{1,10,100} and memory per vector. Reproducing those axes (not necessarily the exact numbers, which depend on the dataset) is the real test — and it surfaces the re-ranking insight, since raw recall@10 looks alarming until you shortlist-and-rescore.
From paper to code: the process
- Build the primitive first. k-means underlies everything, so we wrote and tested it before IVF or PQ.
- Translate the data structures. Codebooks → a
(m, ksub, dsub)array; inverted lists →np.where(assign == c); codes → auint8matrix. - Get ADC right. The table-lookup distance (
adc_distances) is the subtle part; a wrong axis or sum and recall collapses. - Add residual encoding for IVFPQ. Encoding
x - centroidinstead ofxis a one-line change with a big recall effect — exactly as the paper claims. - Validate against ground truth, and discover the re-ranking technique when raw recall looks low.
Reflection: what we built vs. production
Our implementation is algorithmically faithful — PQ, ADC, IVFADC, residuals, correct recall behavior. What we left out is performance engineering, which is most of FAISS:
| Aspect | This book | FAISS / ScaNN |
|---|---|---|
| Language | pure Python + NumPy | C++/CUDA, SIMD, PQFastScan |
| Training | full data, simple Lloyd | sampled, k-means++, many iters |
| Distance | NumPy per cell | register-resident table lookups, GPU |
| Extras | — | OPQ rotation, on-disk, sharding, deletes |
| Re-rank | manual top-100 rescoring | IndexRefineFlat |
Same lesson as the KTS and HNSW books: the algorithm is compact and learnable; the performance is careful systems work layered on top. A faithful from-scratch version is the right way to understand it — and a baseline to validate the fast version against.
See the References for the papers and their roots.
One-file CLI: the trade-off estimator
Everything in this book, distilled into one tool that answers the practical question every engineer asks before compressing vectors:
If I compress my embeddings with these settings, how much memory do I save, and what recall do I keep?
pqtool.py builds an IVFPQ index over your vectors (or synthetic data), measures
recall against exact search, and reports the memory savings — so you can pick
parameters with evidence instead of guesswork.
The steps
- Load vectors from an
.npyfile (shape[n, d]), or generate synthetic clustered data with--synthetic. - Hold out some vectors as queries and compute their exact neighbors (ground truth).
- Build an IVFPQ index with your chosen
nlist,m,nprobe. - Measure recall@k — both raw and with exact re-ranking.
- Report recall, memory (float32 vs PQ codes), and the fraction of cells scanned.
Install & run
pip install numpy # that's all it needs
# try it on synthetic data (no files required)
python pqtool.py --synthetic --m 16 --nlist 128 --nprobe 8
# estimate for your own embeddings
python pqtool.py --npy embeddings.npy --m 16 --nlist 256 --nprobe 16 --rerank 100
It works — real output
$ python pqtool.py --synthetic --m 16 --nlist 128 --nprobe 8
synthetic: 10000 vectors x 64 dims
building IVFPQ (nlist=128, m=16) ...
--- results ---
recall@10 (raw) : 0.711
recall@10 (+rerank 100): 1.000
memory (float32 vectors) : 2.6 MB
memory (PQ codes) : 0.16 MB (16x smaller)
per query, IVF scans ~6.2% of cells
The read-out is the whole decision in one place: 16× less memory, recall
1.0 after re-ranking, scanning only 6% of cells. Change --m to trade
memory for raw recall, --nprobe to trade speed for recall, and --rerank to
trade a little query time for accuracy.
The complete script
#!/usr/bin/env python3
"""
pqtool.py — estimate the memory / recall / speed trade-off of compressing your
vectors with IVF + Product Quantization, using the from-scratch index in ivfpq.py.
Answers the practical question: "If I compress my embeddings with these settings,
how much memory do I save and what recall do I keep?"
Usage:
# try it on synthetic clustered data (no files needed)
python pqtool.py --synthetic
# estimate for your own embeddings (an .npy array of shape [n, d])
python pqtool.py --npy embeddings.npy --m 16 --nlist 256 --nprobe 16
Requirements: numpy (and the local ivfpq.py).
"""
from __future__ import annotations
import argparse
import numpy as np
from ivfpq import flat_knn, IVFPQ
def make_synthetic(n=10000, d=64, seed=0):
rng = np.random.default_rng(seed)
centers = rng.normal(scale=5, size=(max(n // 250, 2), d))
data = np.vstack([centers[i % len(centers)] + rng.normal(size=(1, d))
for i in range(n)])
return data
def main(argv=None):
p = argparse.ArgumentParser(description="Estimate IVF-PQ memory/recall trade-off.")
p.add_argument("--npy", help="path to an .npy array of shape [n, d]")
p.add_argument("--synthetic", action="store_true",
help="use generated clustered data instead of a file")
p.add_argument("--n", type=int, default=10000, help="synthetic dataset size")
p.add_argument("--d", type=int, default=64, help="synthetic dimensionality")
p.add_argument("--m", type=int, default=16, help="PQ sub-vectors / bytes per code")
p.add_argument("--nlist", type=int, default=128, help="IVF partitions")
p.add_argument("--nprobe", type=int, default=8, help="IVF cells probed per query")
p.add_argument("--rerank", type=int, default=100,
help="candidates to re-rank exactly (0 = no re-rank)")
p.add_argument("-k", type=int, default=10, help="neighbors to return")
p.add_argument("--queries", type=int, default=100, help="evaluation queries")
args = p.parse_args(argv)
if args.npy:
data = np.load(args.npy).astype(np.float64)
print(f"loaded {data.shape[0]} vectors x {data.shape[1]} dims from {args.npy}")
else:
data = make_synthetic(args.n, args.d)
print(f"synthetic: {data.shape[0]} vectors x {data.shape[1]} dims")
N, D = data.shape
if D % args.m != 0:
raise SystemExit(f"error: dim {D} not divisible by --m {args.m}; "
f"pick m in {[x for x in (4,8,16,32) if D % x == 0]}")
rng = np.random.default_rng(123)
qidx = rng.choice(N, size=min(args.queries, N), replace=False)
queries = data[qidx] + rng.normal(scale=0.5, size=(len(qidx), D))
exact = [set(flat_knn(data, q, args.k)[0].tolist()) for q in queries]
print(f"building IVFPQ (nlist={args.nlist}, m={args.m}) ...")
index = IVFPQ(nlist=args.nlist, m=args.m, ksub=256, seed=1).train(data)
index.add(data)
def recall(rerank):
hits = 0
for q, ex in zip(queries, exact):
want = args.rerank if (rerank and args.rerank) else args.k
ids, _ = index.search(q, want, nprobe=args.nprobe)
if rerank and args.rerank and len(ids):
d = ((data[ids] - q) ** 2).sum(1)
ids = ids[np.argsort(d)[:args.k]]
else:
ids = ids[:args.k]
hits += len(set(np.asarray(ids).tolist()) & ex)
return hits / (len(queries) * args.k)
flat_bytes = N * D * 4 # realistic float32 baseline
code_bytes = N * args.m # PQ codes (1 byte each)
raw = recall(False)
rr = recall(True) if args.rerank else None
print("\n--- results ---")
print(f" recall@{args.k} (raw) : {raw:.3f}")
if rr is not None:
print(f" recall@{args.k} (+rerank {args.rerank}): {rr:.3f}")
print(f" memory (float32 vectors) : {flat_bytes / 1e6:.1f} MB")
print(f" memory (PQ codes) : {code_bytes / 1e6:.2f} MB "
f"({flat_bytes // code_bytes}x smaller)")
print(f" per query, IVF scans ~{args.nprobe / args.nlist * 100:.1f}% of cells")
if __name__ == "__main__":
main()
Point it at your real embeddings and you have an evidence-based way to choose IVFPQ parameters — the same decision FAISS users make every day, now with a tool you fully understand.
References
-
Hervé Jégou, Matthijs Douze, Cordelia Schmid. Product Quantization for Nearest Neighbor Search. IEEE TPAMI, 2011. — Introduces PQ, ADC, and the IVFADC system this book reproduces.
-
Tiezheng Ge, Kaiming He, Qifa Ke, Jian Sun. Optimized Product Quantization. CVPR 2013. — OPQ: a learned rotation before PQ that improves recall (
OPQin FAISS). -
Stuart P. Lloyd. Least squares quantization in PCM. IEEE Trans. Information Theory, 1982 (work from 1957). — The k-means algorithm at the core of every codebook here.
-
Jeff Johnson, Matthijs Douze, Hervé Jégou. Billion-scale similarity search with GPUs. IEEE Big Data, 2019. — The FAISS library and its GPU implementations.
-
Ruiqi Guo et al. Accelerating Large-Scale Inference with Anisotropic Vector Quantization (ScaNN). ICML 2020. — Quantization tuned for maximum inner-product search.
Tools & systems
- FAISS —
IndexFlat,IndexIVFFlat,IndexPQ,IndexIVFPQ,IndexScalarQuantizer,IndexHNSWFlat, and the index factory. - ScaNN (Google) — anisotropic quantization for MIPS.
- Vector databases — Qdrant, Weaviate, Milvus, Pinecone (IVF/PQ/HNSW indexes).
Related books here
- HNSW from Scratch — the graph-navigation side of vector search.
- KTS from Scratch — kernel temporal segmentation.
This book's code
code/ivfpq.py— k-means, IVF, PQ, IVFPQ.code/demo.py— recall vs. memory vs. work comparison.code/pqtool.py— the trade-off estimator CLI.
All depend only on NumPy and the standard library.