Product quantization

Product Quantization (PQ) is the key idea of this book. It compresses a vector from hundreds of floats to a handful of bytes (16–64× smaller) and — cleverly — lets you compute approximate distances directly from those bytes, fast.

The problem with one big codebook

Quantization means "replace a vector by the id of its nearest codebook entry". To quantize whole 128-dim vectors finely, you'd need an astronomically large codebook (you can't cover 128-dim space with 256 points). Storing and searching such a codebook is impossible.

The trick: split, then quantize each piece

PQ splits each vector into m shorter sub-vectors and quantizes each piece separately with its own small codebook of 256 centroids. The full vector's code is just the m sub-vector ids — m bytes total.

vector (D=8):     [ a1 a2 | b1 b2 | c1 c2 | d1 d2 ]      split into m=4 pieces
                     │        │        │        │
                  codebook  codebook codebook codebook   each: 256 centroids
                     │        │        │        │
code (m=4 bytes): [   37   |  201  |   8   |  142  ]      one byte per piece

Why this is powerful: m codebooks of 256 entries together represent $256^m$ distinct vectors — e.g. with m=8 that's $256^8 \approx 1.8 \times 10^{19}$ possible reconstructions — astronomically expressive, yet each vector costs only m bytes and each codebook is tiny.

The code

import numpy as np
from ivfpq import kmeans, _assign


class PQ:
    def __init__(self, m=8, ksub=256, seed=0):
        self.m = m              # number of sub-vectors (= bytes per code)
        self.ksub = ksub        # centroids per sub-quantizer (256 -> 1 byte)
        self.seed = seed

    def train(self, X):
        n, D = X.shape
        self.dsub = D // self.m                      # length of each sub-vector
        self.codebooks = np.zeros((self.m, self.ksub, self.dsub))
        for j in range(self.m):                      # one codebook per sub-space
            sub = X[:, j * self.dsub:(j + 1) * self.dsub]
            C, _ = kmeans(sub, self.ksub, seed=self.seed + j)
            self.codebooks[j, :len(C)] = C
        return self

    def encode(self, X):                             # vectors -> m-byte codes
        codes = np.zeros((X.shape[0], self.m), dtype=np.uint8)
        for j in range(self.m):
            sub = X[:, j * self.dsub:(j + 1) * self.dsub]
            codes[:, j] = _assign(sub, self.codebooks[j])
        return codes

train learns m independent codebooks (k-means on each slice of the dimensions). encode replaces each sub-vector with the id of its nearest centroid — m bytes per vector.

See it compress

Four-dimensional vectors in two clear groups; m=2 sub-vectors of 2 dims each, ksub=2 centroids per piece:

import numpy as np
from ivfpq import PQ

rng = np.random.default_rng(0)
data = np.vstack([rng.normal(loc=0, scale=0.3, size=(100, 4)),
                  rng.normal(loc=9, scale=0.3, size=(100, 4))])
pq = PQ(m=2, ksub=2, seed=1).train(data)

v = np.array([8.9, 9.1, 9.0, 8.8])
code = pq.encode(v[None])[0]
recon = np.concatenate([pq.codebooks[j, code[j]] for j in range(2)])
print("original     :", v.tolist())
print("PQ code      :", code.tolist(), "(2 bytes instead of 4 floats)")
print("reconstructed:", np.round(recon, 2).tolist())

Output:

original     : [8.9, 9.1, 9.0, 8.8]
PQ code      : [1, 1] (2 bytes instead of 4 floats)
reconstructed: [9.01, 9.0, 8.96, 9.01]

The vector became just two bytes [1, 1], and reconstructs back to almost the original. Lossy, but close — and tiny.

ADC: distance without decompressing

Here's the elegant part. To search, we don't decompress every stored vector. Instead we use Asymmetric Distance Computation (ADC):

  1. Split the query into the same m pieces.
  2. For each piece, precompute the distance from the query's piece to all 256 centroids of that codebook — a small distance table of shape (m, 256).
  3. For any stored code, its approximate distance to the query is just m table lookups added up — no multiplications over D dimensions.

$$ \text{dist}(q, x) \approx \sum_{j=1}^{m} \text{table}\big[j,\ \text{code}_x[j]\big]. $$

"Asymmetric" = the query stays full-precision while the database is compressed, which is more accurate than compressing both. The table is built once per query; scoring a million codes is then a million cheap lookups.

def distance_table(self, q):                 # shape (m, ksub)
    table = np.zeros((self.m, self.ksub))
    for j in range(self.m):
        qs = q[j * self.dsub:(j + 1) * self.dsub]
        table[j] = ((self.codebooks[j] - qs) ** 2).sum(1)
    return table

def adc_distances(self, codes, q):           # approx distance for each code
    table = self.distance_table(q)
    d = np.zeros(len(codes))
    for j in range(self.m):
        d += table[j, codes[:, j]]
    return d

The accuracy cost — and re-ranking

Compression is lossy, so raw PQ recall is modest. From the demo:

PQ m= 8 (8B/vec, 64x smaller): recall@10 raw=0.292  +rerank(top100)=0.843
PQ m=16 (16B/vec, 32x smaller): recall@10 raw=0.386  +rerank(top100)=0.938

Raw recall@10 of ~0.3 sounds bad — but it's not how PQ is used. The standard, crucial trick is re-ranking:

Use the cheap PQ distances to shortlist a larger candidate set (say the top 100), then compute exact distances on just those 100 and keep the top 10.

Re-ranking lifts recall@10 from 0.29 to 0.84 (m=8) and 0.39 to 0.94 (m=16) — near-exact accuracy, at 32–64× less memory, because we only pay full distance cost on a tiny shortlist. (Re-ranking does need access to the full vectors for that shortlist; FAISS's IndexRefineFlat keeps them for exactly this.)

PQ shrinks memory; IVF cuts search time. Next we combine them. 👉