Product quantization
Product Quantization (PQ) is the key idea of this book. It compresses a vector from hundreds of floats to a handful of bytes (16–64× smaller) and — cleverly — lets you compute approximate distances directly from those bytes, fast.
The problem with one big codebook
Quantization means "replace a vector by the id of its nearest codebook entry". To quantize whole 128-dim vectors finely, you'd need an astronomically large codebook (you can't cover 128-dim space with 256 points). Storing and searching such a codebook is impossible.
The trick: split, then quantize each piece
PQ splits each vector into m shorter sub-vectors and quantizes each piece
separately with its own small codebook of 256 centroids. The full vector's code
is just the m sub-vector ids — m bytes total.
vector (D=8): [ a1 a2 | b1 b2 | c1 c2 | d1 d2 ] split into m=4 pieces
│ │ │ │
codebook codebook codebook codebook each: 256 centroids
│ │ │ │
code (m=4 bytes): [ 37 | 201 | 8 | 142 ] one byte per piece
Why this is powerful: m codebooks of 256 entries together represent
$256^m$ distinct vectors — e.g. with m=8 that's $256^8 \approx 1.8 \times 10^{19}$
possible reconstructions — astronomically expressive, yet each vector costs only
m bytes and each codebook is tiny.
The code
import numpy as np
from ivfpq import kmeans, _assign
class PQ:
def __init__(self, m=8, ksub=256, seed=0):
self.m = m # number of sub-vectors (= bytes per code)
self.ksub = ksub # centroids per sub-quantizer (256 -> 1 byte)
self.seed = seed
def train(self, X):
n, D = X.shape
self.dsub = D // self.m # length of each sub-vector
self.codebooks = np.zeros((self.m, self.ksub, self.dsub))
for j in range(self.m): # one codebook per sub-space
sub = X[:, j * self.dsub:(j + 1) * self.dsub]
C, _ = kmeans(sub, self.ksub, seed=self.seed + j)
self.codebooks[j, :len(C)] = C
return self
def encode(self, X): # vectors -> m-byte codes
codes = np.zeros((X.shape[0], self.m), dtype=np.uint8)
for j in range(self.m):
sub = X[:, j * self.dsub:(j + 1) * self.dsub]
codes[:, j] = _assign(sub, self.codebooks[j])
return codes
train learns m independent codebooks (k-means on each slice of the
dimensions). encode replaces each sub-vector with the id of its nearest
centroid — m bytes per vector.
See it compress
Four-dimensional vectors in two clear groups; m=2 sub-vectors of 2 dims each,
ksub=2 centroids per piece:
import numpy as np
from ivfpq import PQ
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(loc=0, scale=0.3, size=(100, 4)),
rng.normal(loc=9, scale=0.3, size=(100, 4))])
pq = PQ(m=2, ksub=2, seed=1).train(data)
v = np.array([8.9, 9.1, 9.0, 8.8])
code = pq.encode(v[None])[0]
recon = np.concatenate([pq.codebooks[j, code[j]] for j in range(2)])
print("original :", v.tolist())
print("PQ code :", code.tolist(), "(2 bytes instead of 4 floats)")
print("reconstructed:", np.round(recon, 2).tolist())
Output:
original : [8.9, 9.1, 9.0, 8.8]
PQ code : [1, 1] (2 bytes instead of 4 floats)
reconstructed: [9.01, 9.0, 8.96, 9.01]
The vector became just two bytes [1, 1], and reconstructs back to almost the
original. Lossy, but close — and tiny.
ADC: distance without decompressing
Here's the elegant part. To search, we don't decompress every stored vector. Instead we use Asymmetric Distance Computation (ADC):
- Split the query into the same
mpieces. - For each piece, precompute the distance from the query's piece to all 256
centroids of that codebook — a small distance table of shape
(m, 256). - For any stored code, its approximate distance to the query is just
mtable lookups added up — no multiplications overDdimensions.
$$ \text{dist}(q, x) \approx \sum_{j=1}^{m} \text{table}\big[j,\ \text{code}_x[j]\big]. $$
"Asymmetric" = the query stays full-precision while the database is compressed, which is more accurate than compressing both. The table is built once per query; scoring a million codes is then a million cheap lookups.
def distance_table(self, q): # shape (m, ksub)
table = np.zeros((self.m, self.ksub))
for j in range(self.m):
qs = q[j * self.dsub:(j + 1) * self.dsub]
table[j] = ((self.codebooks[j] - qs) ** 2).sum(1)
return table
def adc_distances(self, codes, q): # approx distance for each code
table = self.distance_table(q)
d = np.zeros(len(codes))
for j in range(self.m):
d += table[j, codes[:, j]]
return d
The accuracy cost — and re-ranking
Compression is lossy, so raw PQ recall is modest. From the demo:
PQ m= 8 (8B/vec, 64x smaller): recall@10 raw=0.292 +rerank(top100)=0.843
PQ m=16 (16B/vec, 32x smaller): recall@10 raw=0.386 +rerank(top100)=0.938
Raw recall@10 of ~0.3 sounds bad — but it's not how PQ is used. The standard, crucial trick is re-ranking:
Use the cheap PQ distances to shortlist a larger candidate set (say the top 100), then compute exact distances on just those 100 and keep the top 10.
Re-ranking lifts recall@10 from 0.29 to 0.84 (m=8) and 0.39 to 0.94
(m=16) — near-exact accuracy, at 32–64× less memory, because we only pay full
distance cost on a tiny shortlist. (Re-ranking does need access to the full
vectors for that shortlist; FAISS's IndexRefineFlat keeps them for exactly this.)
PQ shrinks memory; IVF cuts search time. Next we combine them. 👉