Introduction

No background required. Like its companion books, this one assumes you know nothing about programming, AI, or advanced math. Every concept, line of code, and symbol is explained, and every code example is followed by the exact output it produces. New to Python? Read the 5-minute primer first.

What this book is about

The HNSW book covered one way to search billions of vectors quickly: build a clever graph and navigate it. This book covers the other half of the story — how to make vector search fit in memory and partition the work — using two classic techniques:

  • IVF (Inverted File index) — chop the vector space into regions and only search the few regions near the query.
  • Quantizationcompress each vector from hundreds of floats down to a handful of bytes, so a billion vectors fit in RAM.

Together (as IVFPQ) they are the workhorse behind FAISS and most billion-scale vector databases.

Storing $n$ vectors of $d$ dimensions as 32-bit floats and scanning them all per query has two costs:

  1. Time — $n$ distance computations per query (the problem HNSW attacks).
  2. Memory — $n \times d \times 4$ bytes. One billion 768-dimensional embeddings is about 3 terabytes. That won't fit in RAM on any normal machine.

This book is mostly about problem 2 (and partly problem 1 via IVF). Quantization can shrink that 3 TB by 16–64×, down to something that fits on a single server — the difference between "impossible" and "routine".

Two everyday analogies

IVF — the library. A library doesn't make you read every book to find one about cooking. Books are shelved by topic; you go straight to the cooking section and look only there. IVF shelves vectors into regions (by similarity) and searches only the regions near your query.

Quantization — paint-by-numbers. Instead of storing the exact color of every pixel, paint-by-numbers stores a small palette and, per pixel, just the number of the nearest palette color. You lose a little fidelity but store far less. Quantization does exactly this with vectors: keep a small set of representative vectors (a "codebook") and store, per vector, just the id of the nearest one.

What you'll build

A complete, working compressed index in one readable file, ivfpq.py, containing k-means, IVF, PQ, and IVFPQ — plus:

  • a demo measuring recall vs. memory vs. work, and
  • a CLI tool that estimates the trade-off for your own vectors.

How it relates to HNSW

These aren't competitors — they compose. A common production setup is "HNSW on top of quantized vectors", or "IVF to partition, PQ to compress". HNSW answers which graph node next?; IVF/PQ answer how do we store a billion vectors and only look at a few? We'll map out exactly when to use which in the use-cases chapter.

Everything rests on one humble algorithm — k-means clustering — so after the problem statement, that's where we start. 👉

A 5-minute primer: vectors, clusters & recall

Just enough to read every example. Skip if you're comfortable with Python, NumPy, and nearest-neighbor basics (the HNSW primer covers the same ground in more detail).

Code boxes show their output

import numpy as np
print(np.array([1.0, 2.0, 3.0]).mean())

Output:

2.0

Vectors and distance

A vector is a list of numbers — a point in space. The squared Euclidean distance between two vectors is the sum of squared differences (we skip the square root because it doesn't change which point is nearest):

import numpy as np
a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])
print(np.sum((a - b) ** 2))     # 9 + 16

Output:

25.0

In this book vectors are usually embeddings — numeric fingerprints of text, images, products, or users, arranged so similar items sit close together. We just assume we're handed them.

Clusters and centroids

A cluster is a group of nearby vectors; its centroid is the group's average — a single vector that "represents" the group. The whole book leans on one idea: replace many vectors by a few representative centroids.

  • IVF uses centroids to define regions (search only the nearby region).
  • Quantization uses centroids as a palette (store the nearest centroid's id instead of the full vector).

The algorithm that finds good centroids is k-means, built from scratch in Chapter 2.

Recall: measuring an approximate answer

These methods are approximate — fast and small, but they occasionally miss a true neighbor. We measure quality with recall:

recall@k = (how many of the true top-k neighbors we returned) ÷ k.

Recall 1.0 = perfect; 0.9 = we found 9 of the true 10. Throughout, we compare against exact brute-force search to compute recall honestly.

true_top10  = {3, 8, 12, 19, 25, 31, 44, 50, 61, 77}
approx_top10 = {3, 8, 12, 19, 25, 31, 44, 50, 61, 99}   # missed 77, added 99
recall = len(true_top10 & approx_top10) / 10
print(recall)

Output:

0.9

NumPy bits we use

You'll seeMeaning
np.argsort(d)[:k]indices of the k smallest values in d
(X - q) ** 2).sum(1)squared distance from q to every row of X
X @ C.Tall pairwise dot products between rows of X and C
np.where(a == c)[0]positions where assignment array a equals cell c
.astype(np.uint8)store small integers (0–255) in one byte each

That last one is the heart of quantization: turning floats into single bytes. On to why exact search doesn't fit in memory. 👉

The problem: memory & scale

The HNSW book framed the speed problem. This book adds the memory problem — the one that decides whether your vectors fit on the machine at all.

Exact search: simple, correct, and huge

The baseline ("flat" index) stores every vector and scans them all:

import numpy as np

def flat_knn(X, q, k=10):
    d = ((X - q) ** 2).sum(1)        # distance to every vector
    idx = np.argsort(d)[:k]
    return idx, d[idx]

It's exact and trivial. Its problem is size.

The memory wall

A flat index stores $n \times d$ numbers as 32-bit floats — $n \cdot d \cdot 4$ bytes. Plug in real numbers:

Vectors $n$Dim $d$Float32 size
1,000,000768~3 GB
100,000,000768~300 GB
1,000,000,000768~3 TB

Three terabytes won't fit in the RAM of a normal server, and even 300 GB is expensive. RAM matters because vector search must be fast, and reading from disk per query is far too slow. So the question becomes:

How do we shrink each vector enough that a billion of them fit in memory, while still answering "what's nearest?" accurately?

That is the quantization problem, and it's the heart of this book.

Two levers

There are two independent things we can attack:

  1. Don't look at everything (time). Partition the vectors into regions and search only the regions near the query. → IVF (Chapter 3).
  2. Don't store everything (memory). Replace each vector with a compact code. → Scalar and Product quantization (Chapters 45).

Combine both and you get IVFPQ (Chapter 6): search a few regions, each holding heavily compressed vectors.

The trade-off we're buying

Every technique here trades a little accuracy (recall < 1.0) for big wins in memory and/or speed. The skill is choosing operating points:

  • IVF: more regions probed → higher recall, more work.
  • Quantization: more bytes per code → higher recall, more memory.
  • Re-ranking (a key trick we'll meet): use the cheap approximate index to shortlist candidates, then re-score the shortlist exactly — recovering most of the lost recall for little cost.

Relationship to HNSW

HNSWIVFQuantization (PQ)
Attackssearch timesearch timememory
Ideanavigable graphpartition + probecompress to bytes
Memorymore than flat (stores a graph too)same as flatmuch less than flat
Composes withquantizationquantizationIVF, HNSW

HNSW makes search fast but uses extra memory for the graph. Quantization makes storage tiny. Real systems mix them. First, the algorithm both IVF and PQ are built on: k-means. 👉

k-means: the building block

Both IVF and quantization rest on one algorithm: k-means clustering. It finds k representative points (centroids) that summarize a cloud of vectors. Get this and the rest of the book is just two clever ways of using centroids.

The goal

Given many vectors, find k centroids so that every vector is close to its nearest centroid. Those centroids become:

  • the region centers in IVF (each vector belongs to its nearest region), and
  • the palette in quantization (each vector is replaced by its nearest centroid's id).

Lloyd's algorithm (the standard k-means)

It alternates two steps until things stop moving:

  1. Assign — put each vector in the group of its nearest centroid.
  2. Update — move each centroid to the average of its group.

Repeat. Each round can only reduce the total distance, so it converges.

start: random centroids        assign points        move centroids to means
   x   x      ●                 x   x  |  ●           x  x        ●
 x   x          ●      ─►     x   x    |    ●   ─►   x  ●            ●
   x    x   x                   x  x x | (nearest)     x  x  x
                                                    ...repeat until stable

The code

import numpy as np


def _assign(X, C):
    """Index of the nearest centroid for each row of X (squared L2).
    Uses ||x-c||^2 = ||x||^2 - 2 x.c + ||c||^2 to compute all distances at once."""
    d = (X ** 2).sum(1)[:, None] - 2.0 * X @ C.T + (C ** 2).sum(1)[None, :]
    return d.argmin(1)


def kmeans(X, k, iters=25, seed=0):
    """Lloyd's k-means. Returns (centroids, assignments)."""
    X = np.asarray(X, dtype=np.float64)
    n = X.shape[0]
    k = min(k, n)
    rng = np.random.default_rng(seed)
    C = X[rng.choice(n, size=k, replace=False)].copy()   # init: random points
    for _ in range(iters):
        a = _assign(X, C)                                # 1. assign
        newC = C.copy()
        for j in range(k):                              # 2. update
            members = X[a == j]
            if len(members):
                newC[j] = members.mean(0)
        if np.allclose(newC, C):                        # converged?
            C = newC
            break
        C = newC
    return C, _assign(X, C)

Reading it

  • _assign computes, for every vector, the distance to every centroid in one vectorized expression, then picks the closest. This is the workhorse — IVF and PQ both call it.
  • kmeans starts from k random data points, then loops assign→update until the centroids stop changing (or iters is hit).

See it work

Six points forming two obvious clusters — one near the origin, one near $(5,5)$:

import numpy as np
from ivfpq import kmeans

X = np.array([[0., 0.], [0.2, 0.1], [0.1, 0.2],     # cluster A
              [5., 5.], [5.1, 4.9], [4.9, 5.1]])     # cluster B
C, a = kmeans(X, k=2, seed=0)
print("centroids:\n", np.round(C, 2))
print("assignments:", a.tolist())

Output:

centroids:
 [[5.  5. ]
 [0.1 0.1]]
assignments: [1, 1, 1, 0, 0, 0]

k-means found exactly the two cluster centers — $(5,5)$ and $(0.1,0.1)$ — and assigned the first three points to centroid 1 and the last three to centroid 0. 🎯 Those two centroids now summarize all six points.

How centroids power everything next

  • IVF: treat each centroid as a bucket. Every vector is filed under its nearest centroid; at query time we only open the buckets near the query.
  • Quantization: treat the centroids as a codebook. Store, per vector, just the id of its nearest centroid (one small integer) instead of the full vector.

Same algorithm, two superpowers. Let's build IVF first. 👉

IVF — the inverted file index

IVF (Inverted File index) attacks the time problem: instead of scanning every vector, partition them into regions and scan only the regions near the query. It's the "go straight to the cooking section of the library" idea.

How it works

  1. Train: run k-means to get nlist centroids. These define nlist regions (mathematically, Voronoi cells: each point belongs to its nearest centroid).
  2. Add: assign every database vector to its nearest centroid, and store its id in that centroid's list. These lists are the "inverted lists".
  3. Search: find the nprobe centroids nearest the query, then do an exact search only over the vectors in those few lists.

The name "inverted file" comes from text search: instead of "document → words", you keep "region → vectors in it", and look up by region.

                centroid 0 ─► [v3, v9, v17, ...]
   query  ──►   centroid 1 ─► [v1, v5, v22, ...]   ◄─ nprobe=2: scan
                centroid 2 ─► [v8, v11, ...]        ◄─ these two lists only
                centroid 3 ─► [v2, v14, ...]

The code

import numpy as np
from ivfpq import kmeans, _assign       # _assign: nearest-centroid helper


class IVF:
    def __init__(self, nlist=64, nprobe=8, seed=0):
        self.nlist = nlist      # number of regions
        self.nprobe = nprobe    # regions scanned per query
        self.seed = seed

    def train(self, X):
        self.centroids, _ = kmeans(X, self.nlist, seed=self.seed)
        return self

    def add(self, X):
        self.X = np.asarray(X, dtype=np.float64)
        a = _assign(self.X, self.centroids)               # which region each vector
        self.lists = [np.where(a == c)[0] for c in range(len(self.centroids))]
        return self

    def search(self, q, k=10, nprobe=None):
        nprobe = nprobe or self.nprobe
        dc = ((self.centroids - q) ** 2).sum(1)           # distance to each region
        cells = np.argsort(dc)[:nprobe]                   # the nprobe nearest regions
        cand = np.concatenate([self.lists[c] for c in cells])
        d = ((self.X[cand] - q) ** 2).sum(1)              # EXACT within those regions
        order = np.argsort(d)[:k]
        return cand[order], d[order]

Reading it

  • train learns the region centers; add files each vector into its region's list (np.where(a == c)[0] collects the ids in region c).
  • search ranks regions by distance to the query, takes the closest nprobe, gathers their vectors, and does an ordinary exact search over just those. Note the result is exact within the probed regions — IVF's only error is when a true neighbor sits in a region we didn't probe.

The regions really do partition the data

Building an 8-region IVF over 1000 vectors files them into balanced lists:

import numpy as np
from ivfpq import IVF

data = np.random.default_rng(1).normal(size=(1000, 16))
ivf = IVF(nlist=8, seed=0).train(data); ivf.add(data)
print("cell sizes:", [len(l) for l in ivf.lists])

Output:

cell sizes: [126, 144, 99, 109, 139, 139, 119, 125]

Every vector landed in exactly one of the 8 lists, ~125 each. A query now scans a couple of lists (~250 vectors) instead of all 1000.

The nprobe dial

nprobe trades recall for speed, without rebuilding:

  • nprobe = 1 — fastest, but misses neighbors that fall just across a region border.
  • larger nprobe — scans more regions, higher recall, more work.

In the demo you'll see IVF hit recall 1.0 while scanning ~6% of the data (nprobe=8 of nlist=128). That's the whole point: near-exact answers at a fraction of the cost.

What IVF does not do

IVF keeps the full vectors — it saves time, not memory. A billion vectors still need terabytes. To shrink them, we need quantization, which is next. 👉

Scalar quantization

Before the powerful Product Quantization, here's the simplest compression of all — scalar quantization (SQ) — which already gives a 4× memory cut and introduces the core idea: store small integers instead of floats.

The idea

A 32-bit float takes 4 bytes. But embedding values usually live in a modest range (say roughly $-1$ to $1$). If we map that range onto the 256 integers $0..255$, we can store each number in a single byte — a 4× reduction — losing only a little precision.

For each dimension we find its min and max across the dataset, then:

$$ \text{code} = \mathrm{round}\!\left( 255 \cdot \frac{x - \min}{\max - \min} \right), \qquad x \approx \min + \frac{\text{code}}{255}\,(\max - \min). $$

The first formula compresses a float to a byte; the second reconstructs an approximate float from the byte.

The code

import numpy as np


class ScalarQuantizer:
    def train(self, X):
        self.lo = X.min(axis=0)                 # per-dimension min
        self.hi = X.max(axis=0)                 # per-dimension max
        self.scale = np.maximum(self.hi - self.lo, 1e-9)
        return self

    def encode(self, X):                        # floats -> bytes
        u = (X - self.lo) / self.scale          # to 0..1
        return np.clip(np.round(u * 255), 0, 255).astype(np.uint8)

    def decode(self, codes):                    # bytes -> approx floats
        return self.lo + (codes.astype(np.float64) / 255.0) * self.scale

See it work

import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(1000, 8))
sq = ScalarQuantizer().train(X)
codes = sq.encode(X)
recon = sq.decode(codes)

print("original  :", np.round(X[0], 3).tolist())
print("codes(byte):", codes[0].tolist())
print("decoded   :", np.round(recon[0], 3).tolist())
print("bytes/vec : float32", X.shape[1] * 4, "-> SQ", codes.shape[1],
      f"({X.shape[1] * 4 // codes.shape[1]}x smaller)")
print("avg abs error:", round(float(np.abs(X - recon).mean()), 4))

Output:

original  : [0.274, -0.46, -0.918, -0.967, 0.627, 0.826, 0.213, 0.459]
codes(byte): [162, 69, 10, 4, 207, 233, 155, 186]
decoded   : [0.273, -0.459, -0.917, -0.968, 0.624, 0.823, 0.215, 0.457]
bytes/vec : float32 32 -> SQ 8 (4x smaller)
avg abs error: 0.0019

Each float became one byte (4× smaller), and the reconstruction is off by only ~0.002 — usually negligible for nearest-neighbor ranking. (This is FAISS's SQ8.)

Why we need more than SQ

SQ is great and widely used (FAISS calls it SQ8), but it tops out at ~4× because it still keeps one byte per dimension. For a 768-dim vector that's 768 bytes — better than 3072, but a billion of them is still ~768 GB.

To go further — 16× to 64× — we must stop treating dimensions independently and compress groups of them together. That's Product Quantization, the heart of this book. 👉

Product quantization

Product Quantization (PQ) is the key idea of this book. It compresses a vector from hundreds of floats to a handful of bytes (16–64× smaller) and — cleverly — lets you compute approximate distances directly from those bytes, fast.

The problem with one big codebook

Quantization means "replace a vector by the id of its nearest codebook entry". To quantize whole 128-dim vectors finely, you'd need an astronomically large codebook (you can't cover 128-dim space with 256 points). Storing and searching such a codebook is impossible.

The trick: split, then quantize each piece

PQ splits each vector into m shorter sub-vectors and quantizes each piece separately with its own small codebook of 256 centroids. The full vector's code is just the m sub-vector ids — m bytes total.

vector (D=8):     [ a1 a2 | b1 b2 | c1 c2 | d1 d2 ]      split into m=4 pieces
                     │        │        │        │
                  codebook  codebook codebook codebook   each: 256 centroids
                     │        │        │        │
code (m=4 bytes): [   37   |  201  |   8   |  142  ]      one byte per piece

Why this is powerful: m codebooks of 256 entries together represent $256^m$ distinct vectors — e.g. with m=8 that's $256^8 \approx 1.8 \times 10^{19}$ possible reconstructions — astronomically expressive, yet each vector costs only m bytes and each codebook is tiny.

The code

import numpy as np
from ivfpq import kmeans, _assign


class PQ:
    def __init__(self, m=8, ksub=256, seed=0):
        self.m = m              # number of sub-vectors (= bytes per code)
        self.ksub = ksub        # centroids per sub-quantizer (256 -> 1 byte)
        self.seed = seed

    def train(self, X):
        n, D = X.shape
        self.dsub = D // self.m                      # length of each sub-vector
        self.codebooks = np.zeros((self.m, self.ksub, self.dsub))
        for j in range(self.m):                      # one codebook per sub-space
            sub = X[:, j * self.dsub:(j + 1) * self.dsub]
            C, _ = kmeans(sub, self.ksub, seed=self.seed + j)
            self.codebooks[j, :len(C)] = C
        return self

    def encode(self, X):                             # vectors -> m-byte codes
        codes = np.zeros((X.shape[0], self.m), dtype=np.uint8)
        for j in range(self.m):
            sub = X[:, j * self.dsub:(j + 1) * self.dsub]
            codes[:, j] = _assign(sub, self.codebooks[j])
        return codes

train learns m independent codebooks (k-means on each slice of the dimensions). encode replaces each sub-vector with the id of its nearest centroid — m bytes per vector.

See it compress

Four-dimensional vectors in two clear groups; m=2 sub-vectors of 2 dims each, ksub=2 centroids per piece:

import numpy as np
from ivfpq import PQ

rng = np.random.default_rng(0)
data = np.vstack([rng.normal(loc=0, scale=0.3, size=(100, 4)),
                  rng.normal(loc=9, scale=0.3, size=(100, 4))])
pq = PQ(m=2, ksub=2, seed=1).train(data)

v = np.array([8.9, 9.1, 9.0, 8.8])
code = pq.encode(v[None])[0]
recon = np.concatenate([pq.codebooks[j, code[j]] for j in range(2)])
print("original     :", v.tolist())
print("PQ code      :", code.tolist(), "(2 bytes instead of 4 floats)")
print("reconstructed:", np.round(recon, 2).tolist())

Output:

original     : [8.9, 9.1, 9.0, 8.8]
PQ code      : [1, 1] (2 bytes instead of 4 floats)
reconstructed: [9.01, 9.0, 8.96, 9.01]

The vector became just two bytes [1, 1], and reconstructs back to almost the original. Lossy, but close — and tiny.

ADC: distance without decompressing

Here's the elegant part. To search, we don't decompress every stored vector. Instead we use Asymmetric Distance Computation (ADC):

  1. Split the query into the same m pieces.
  2. For each piece, precompute the distance from the query's piece to all 256 centroids of that codebook — a small distance table of shape (m, 256).
  3. For any stored code, its approximate distance to the query is just m table lookups added up — no multiplications over D dimensions.

$$ \text{dist}(q, x) \approx \sum_{j=1}^{m} \text{table}\big[j,\ \text{code}_x[j]\big]. $$

"Asymmetric" = the query stays full-precision while the database is compressed, which is more accurate than compressing both. The table is built once per query; scoring a million codes is then a million cheap lookups.

def distance_table(self, q):                 # shape (m, ksub)
    table = np.zeros((self.m, self.ksub))
    for j in range(self.m):
        qs = q[j * self.dsub:(j + 1) * self.dsub]
        table[j] = ((self.codebooks[j] - qs) ** 2).sum(1)
    return table

def adc_distances(self, codes, q):           # approx distance for each code
    table = self.distance_table(q)
    d = np.zeros(len(codes))
    for j in range(self.m):
        d += table[j, codes[:, j]]
    return d

The accuracy cost — and re-ranking

Compression is lossy, so raw PQ recall is modest. From the demo:

PQ m= 8 (8B/vec, 64x smaller): recall@10 raw=0.292  +rerank(top100)=0.843
PQ m=16 (16B/vec, 32x smaller): recall@10 raw=0.386  +rerank(top100)=0.938

Raw recall@10 of ~0.3 sounds bad — but it's not how PQ is used. The standard, crucial trick is re-ranking:

Use the cheap PQ distances to shortlist a larger candidate set (say the top 100), then compute exact distances on just those 100 and keep the top 10.

Re-ranking lifts recall@10 from 0.29 to 0.84 (m=8) and 0.39 to 0.94 (m=16) — near-exact accuracy, at 32–64× less memory, because we only pay full distance cost on a tiny shortlist. (Re-ranking does need access to the full vectors for that shortlist; FAISS's IndexRefineFlat keeps them for exactly this.)

PQ shrinks memory; IVF cuts search time. Next we combine them. 👉

IVFPQ — partition + compress

IVF saves time (scan few regions); PQ saves memory (bytes per vector). IVFPQ does both at once, and it's the index that made billion-scale vector search practical. This is the FAISS workhorse (IndexIVFPQ).

The one new idea: encode residuals

Combining them naively (PQ-compress everything, then bucket by IVF) wastes accuracy. IVFPQ adds a refinement: instead of PQ-encoding each vector directly, it encodes the vector's residual — what's left after subtracting its region's centroid.

$$ \text{residual} = x - \text{centroid}(x). $$

Why this helps: within a region, all vectors are close to the centroid, so their residuals are small and similar. PQ quantizes small, concentrated residuals far more accurately than the spread-out raw vectors — better recall for the same bytes. The full reconstruction is centroid + decoded_residual.

build:   x ─► nearest centroid c ─► residual (x - c) ─► PQ code (m bytes)
         store (id, code) in cell c's inverted list

search:  q ─► nearest nprobe cells
         for each cell c:  build PQ table for residual (q - c)
                           score that cell's codes by ADC table lookups
         merge cells, keep top-k   (then optionally re-rank exactly)

The code

import numpy as np
from ivfpq import kmeans, _assign, PQ


class IVFPQ:
    def __init__(self, nlist=64, nprobe=8, m=8, ksub=256, seed=0):
        self.nlist, self.nprobe = nlist, nprobe
        self.m, self.ksub, self.seed = m, ksub, seed

    def train(self, X):
        self.centroids, a = kmeans(X, self.nlist, seed=self.seed)
        residuals = X - self.centroids[a]                  # encode residuals, not X
        self.pq = PQ(self.m, self.ksub, seed=self.seed + 1).train(residuals)
        return self

    def add(self, X):
        a = _assign(X, self.centroids)
        residuals = X - self.centroids[a]
        self.codes = self.pq.encode(residuals)             # m bytes per vector
        self.lists = [np.where(a == c)[0] for c in range(len(self.centroids))]
        return self

    def search(self, q, k=10, nprobe=None):
        nprobe = nprobe or self.nprobe
        dc = ((self.centroids - q) ** 2).sum(1)
        cells = np.argsort(dc)[:nprobe]
        all_d, all_id = [], []
        for c in cells:
            ids = self.lists[c]
            if ids.size == 0:
                continue
            res_q = q - self.centroids[c]                  # query residual for THIS cell
            d = self.pq.adc_distances(self.codes[ids], res_q)
            all_d.append(d); all_id.append(ids)
        d = np.concatenate(all_d); ids = np.concatenate(all_id)
        order = np.argsort(d)[:k]
        return ids[order], d[order]

Reading it

  • train learns the IVF centroids and trains PQ on the residuals.
  • add files each vector into its cell and stores only its m-byte residual code.
  • search probes the nprobe nearest cells; for each, it forms the query's residual relative to that cell's centroid, builds the PQ distance table, and scores the cell's codes by ADC. Results from all probed cells are merged.

How well it works

From the demo (10,000 vectors, m=16 → 16 bytes/vector):

IVFPQ  (IVF + PQ on residuals — partitioned and compressed):
   nprobe= 8: recall@10 raw=0.741  +rerank(top100)=1.000
   nprobe=16: recall@10 raw=0.741  +rerank(top100)=1.000

So IVFPQ is partitioned (scans ~6% of cells) and compressed (16–32× smaller), and with re-ranking reaches perfect recall on this data. That combination — small memory, fast search, high recall — is why it scales to billions of vectors.

The knobs, all together

KnobControlsBigger means
nlistnumber of IVF cellsfiner partition; needs more nprobe for recall
nprobecells probed per queryhigher recall, more time (runtime dial)
mbytes per PQ codehigher recall, more memory
ksubcentroids per sub-quantizerusually 256 (1 byte); rarely changed
re-rank sizeexact-rescoring shortlisthigher recall, slight extra time

A typical recipe: pick nlist ≈ √n, train, then tune nprobe and the re-rank size at query time until recall is high enough — all without rebuilding the codes.

Next, the consolidated implementation. 👉

The full implementation

Here is the consolidated module — k-means, exact search, IVF, PQ, and IVFPQ — in one file, code/ivfpq.py, reproduced in full.

"""
IVF and Product Quantization for approximate nearest-neighbor search,
from scratch in NumPy.

References:
    Jégou, Douze, Schmid. "Product Quantization for Nearest Neighbor Search."
    IEEE TPAMI, 2011.  (introduces PQ and the IVFADC system)

This module implements, in order of dependency:
    kmeans(X, k)        -> the building block for every codebook here
    flat_knn(X, q, k)   -> exact search (ground truth for recall)
    IVF                 -> Inverted File index (partition + probe)
    PQ                  -> Product Quantizer (compress vectors to bytes)
    IVFPQ               -> IVF + PQ on residuals (the FAISS workhorse)

Only NumPy is used.
"""

from __future__ import annotations

import numpy as np


# ===========================================================================
# k-means and exact search
# ===========================================================================
def _assign(X, C):
    """Nearest-centroid index for each row of X (squared L2)."""
    d = (X ** 2).sum(1)[:, None] - 2.0 * X @ C.T + (C ** 2).sum(1)[None, :]
    return d.argmin(1)


def kmeans(X, k, iters=25, seed=0):
    """Lloyd's k-means. Returns (centroids, assignments)."""
    X = np.asarray(X, dtype=np.float64)
    n = X.shape[0]
    k = min(k, n)
    rng = np.random.default_rng(seed)
    C = X[rng.choice(n, size=k, replace=False)].copy()
    for _ in range(iters):
        a = _assign(X, C)
        newC = C.copy()
        for j in range(k):
            members = X[a == j]
            if len(members):
                newC[j] = members.mean(0)
        if np.allclose(newC, C):
            C = newC
            break
        C = newC
    return C, _assign(X, C)


def flat_knn(X, q, k=10):
    """Exact k nearest neighbors by brute force. Returns (ids, distances)."""
    d = ((X - q) ** 2).sum(1)
    idx = np.argsort(d)[:k]
    return idx, d[idx]


# ===========================================================================
# IVF — Inverted File index (partition space, search a few partitions)
# ===========================================================================
class IVF:
    def __init__(self, nlist=64, nprobe=8, seed=0):
        self.nlist = nlist          # number of partitions (Voronoi cells)
        self.nprobe = nprobe        # how many cells to scan per query
        self.seed = seed

    def train(self, X):
        """Learn the partition centroids (the coarse quantizer)."""
        self.centroids, _ = kmeans(X, self.nlist, seed=self.seed)
        return self

    def add(self, X):
        """Assign every vector to its nearest cell and store the inverted lists."""
        self.X = np.asarray(X, dtype=np.float64)
        a = _assign(self.X, self.centroids)
        self.lists = [np.where(a == c)[0] for c in range(len(self.centroids))]
        return self

    def search(self, q, k=10, nprobe=None):
        nprobe = nprobe or self.nprobe
        q = np.asarray(q, dtype=np.float64)
        dc = ((self.centroids - q) ** 2).sum(1)
        cells = np.argsort(dc)[:nprobe]
        cand = np.concatenate([self.lists[c] for c in cells]) if len(cells) else np.array([], int)
        if cand.size == 0:
            return np.array([], int), np.array([])
        d = ((self.X[cand] - q) ** 2).sum(1)     # exact distance within probed cells
        order = np.argsort(d)[:k]
        return cand[order], d[order]


# ===========================================================================
# PQ — Product Quantization (compress each vector to `m` bytes)
# ===========================================================================
class PQ:
    def __init__(self, m=8, ksub=256, seed=0):
        self.m = m                  # number of sub-vectors
        self.ksub = ksub            # centroids per sub-quantizer (256 => 1 byte)
        self.seed = seed

    def train(self, X):
        X = np.asarray(X, dtype=np.float64)
        n, D = X.shape
        if D % self.m != 0:
            raise ValueError(f"dim {D} not divisible by m={self.m}")
        self.dsub = D // self.m
        self.codebooks = np.zeros((self.m, self.ksub, self.dsub))
        for j in range(self.m):
            sub = X[:, j * self.dsub:(j + 1) * self.dsub]
            C, _ = kmeans(sub, self.ksub, seed=self.seed + j)
            self.codebooks[j, :len(C)] = C
        return self

    def encode(self, X):
        """Map each vector to `m` centroid ids (uint8 codes)."""
        X = np.asarray(X, dtype=np.float64)
        codes = np.zeros((X.shape[0], self.m), dtype=np.uint8)
        for j in range(self.m):
            sub = X[:, j * self.dsub:(j + 1) * self.dsub]
            codes[:, j] = _assign(sub, self.codebooks[j])
        return codes

    def distance_table(self, q):
        """Squared distance from q's sub-vectors to every centroid: shape (m, ksub)."""
        q = np.asarray(q, dtype=np.float64)
        table = np.zeros((self.m, self.ksub))
        for j in range(self.m):
            qs = q[j * self.dsub:(j + 1) * self.dsub]
            table[j] = ((self.codebooks[j] - qs) ** 2).sum(1)
        return table

    def adc_distances(self, codes, q):
        """Asymmetric distance: approx ||q - x|| using only the codes + table."""
        table = self.distance_table(q)
        d = np.zeros(len(codes))
        for j in range(self.m):
            d += table[j, codes[:, j]]
        return d


# ===========================================================================
# IVFPQ — IVF + PQ on residuals (compressed AND partitioned)
# ===========================================================================
class IVFPQ:
    def __init__(self, nlist=64, nprobe=8, m=8, ksub=256, seed=0):
        self.nlist = nlist
        self.nprobe = nprobe
        self.m = m
        self.ksub = ksub
        self.seed = seed

    def train(self, X):
        X = np.asarray(X, dtype=np.float64)
        self.centroids, a = kmeans(X, self.nlist, seed=self.seed)
        residuals = X - self.centroids[a]          # PQ encodes the residual
        self.pq = PQ(self.m, self.ksub, seed=self.seed + 1).train(residuals)
        return self

    def add(self, X):
        X = np.asarray(X, dtype=np.float64)
        a = _assign(X, self.centroids)
        residuals = X - self.centroids[a]
        self.codes = self.pq.encode(residuals)
        self.lists = [np.where(a == c)[0] for c in range(len(self.centroids))]
        return self

    def search(self, q, k=10, nprobe=None):
        nprobe = nprobe or self.nprobe
        q = np.asarray(q, dtype=np.float64)
        dc = ((self.centroids - q) ** 2).sum(1)
        cells = np.argsort(dc)[:nprobe]
        all_d, all_id = [], []
        for c in cells:
            ids = self.lists[c]
            if ids.size == 0:
                continue
            res_q = q - self.centroids[c]          # residual relative to this cell
            d = self.pq.adc_distances(self.codes[ids], res_q)
            all_d.append(d)
            all_id.append(ids)
        if not all_id:
            return np.array([], int), np.array([])
        d = np.concatenate(all_d)
        ids = np.concatenate(all_id)
        order = np.argsort(d)[:k]
        return ids[order], d[order]


__all__ = ["kmeans", "flat_knn", "IVF", "PQ", "IVFPQ"]

The {{#include}} directive splices in the real code/ivfpq.py at build time, so the documentation can never drift from the actual implementation.

API at a glance

CallPurpose
kmeans(X, k)Learn k centroids + assignments (used by everything).
flat_knn(X, q, k)Exact search — the ground truth for recall.
IVF(nlist, nprobe).train(X).add(X)Partitioned exact search.
PQ(m, ksub).train(X)Compress vectors to m-byte codes; .encode, .adc_distances.
IVFPQ(nlist, m).train(X).add(X)Partitioned and compressed search.

Typical usage

import numpy as np
from ivfpq import IVFPQ

vectors = np.load("embeddings.npy")        # shape (n, d), d divisible by m

index = IVFPQ(nlist=256, m=16, ksub=256).train(vectors)
index.add(vectors)

q = vectors[0]
ids, approx_d = index.search(q, k=100, nprobe=16)   # shortlist via compressed codes
# re-rank the shortlist exactly for high recall:
exact_d = ((vectors[ids] - q) ** 2).sum(1)
top10 = ids[np.argsort(exact_d)[:10]]
print(top10)

What this is and isn't

This implementation is written for clarity — it shows exactly how IVF, PQ, and IVFPQ work and produces correct, measurable recall. Production libraries (FAISS, ScaNN) add the speed and scale: SIMD/AVX distance kernels, optimized PQ table lookups (PQFastScan), GPU support, multi-threaded training on samples, on-disk indexes, and combinations like HNSW-as-coarse-quantizer or OPQ (a learned rotation before PQ that boosts recall). They share these exact algorithms; they just run them far faster — see the use-cases chapter, which maps each class here to its FAISS equivalent.

Next: measure the whole thing. 👉

A worked demo: recall vs. memory

This demo puts all four index types side by side on the same data and measures the three quantities that matter: recall (accuracy), memory, and work (fraction of data scanned). The full script is code/demo.py.

The setup

10,000 vectors in 64 dimensions, drawn from 40 clusters (so the data has real structure, like embeddings do). 100 queries. We compute exact answers with flat_knn as ground truth, then score each method's recall against them.

import numpy as np
from ivfpq import flat_knn, IVF, PQ, IVFPQ
# ... build clustered data, pick 100 queries, compute exact top-10 ...

The result

Running python demo.py (takes ~40s — pure-Python k-means training) prints:

dataset: 10000 vectors x 64 dims; 100 queries; k=10

IVF  (partition into 128 cells, scan a few):
   nprobe= 1: recall@10=0.657   ~78 vectors scanned (0.8% of data)
   nprobe= 4: recall@10=0.994   ~312 vectors scanned (3.1% of data)
   nprobe= 8: recall@10=1.000   ~625 vectors scanned (6.2% of data)
   nprobe=16: recall@10=1.000   ~1250 vectors scanned (12.5% of data)

PQ  (compress each vector to m bytes):
   m= 8 (8B/vec, 64x smaller): recall@10 raw=0.292  +rerank(top100)=0.843
   m=16 (16B/vec, 32x smaller): recall@10 raw=0.386  +rerank(top100)=0.938

IVFPQ  (IVF + PQ on residuals — partitioned and compressed):
   nprobe= 8: recall@10 raw=0.741  +rerank(top100)=1.000
   nprobe=16: recall@10 raw=0.741  +rerank(top100)=1.000

flat (exact) memory: 5.1 MB for 10000 vectors
IVF keeps full vectors (no compression); PQ/IVFPQ shrink them 16-32x.

How to read it

IVF — near-exact, a fraction of the work. Scanning just 6% of the data (nprobe=8) gives perfect recall. IVF's only error is a true neighbor sitting in an un-probed cell; raising nprobe fixes it. IVF saves time but keeps full vectors (no memory savings).

PQ — huge memory savings, recovered by re-ranking. Raw PQ recall looks low (0.29–0.39) because the codes are lossy — but that's not how PQ is used. Shortlist the top-100 by cheap PQ distance, re-rank those exactly, and recall jumps to 0.84–0.94 at 32–64× less memory. The m knob trades bytes for accuracy.

IVFPQ — the best of both. Partitioned (scans ~6% of cells) and compressed (16–32× smaller), and with re-ranking it reaches recall 1.0. This is why IVFPQ scales to billions of vectors on a single machine.

A note on the compression factor. The demo's arrays are float64 (8 bytes), so it reports 32–64×. Real systems store float32 (4 bytes), so the same m=8 code is 32× smaller and m=16 is 16× — still enormous.

The mental model

            saves TIME?     saves MEMORY?     recall (with re-rank)
  flat          no              no                 1.000 (exact)
  IVF          YES              no                 ~1.0
  PQ            no             YES (16-64x)        high
  IVFPQ        YES             YES (16-64x)        high

Things to try

  • Raise nprobe and watch IVF/IVFPQ recall climb toward 1.0.
  • Increase m (8 → 16 → 32) for higher PQ recall at the cost of more bytes.
  • Increase the re-rank shortlist (100 → 200) to squeeze out more recall.
  • Grow N to 100,000 to feel how IVF's "% scanned" advantage compounds.

Now, where all this is actually used — including FAISS. 👉

Use cases & how to choose

You've now built every piece that powers industrial vector search. This chapter connects them to the real world — especially FAISS, the library where these algorithms live in production — and gives you a decision guide.

FAISS: where these algorithms actually run

FAISS (Facebook AI Similarity Search) is the most widely used library for this, and it is essentially a highly optimized, composable version of exactly what this book built. Every class here maps directly to a FAISS index:

This bookFAISS indexWhat it is
flat_knnIndexFlatL2 / IndexFlatIPexact brute force
ScalarQuantizerIndexScalarQuantizer (SQ8)1 byte/dim, 4× smaller
IVFIndexIVFFlatpartition + probe, full vectors
PQIndexPQproduct-quantized codes
IVFPQIndexIVFPQpartition + PQ residuals (the workhorse)
(HNSW book)IndexHNSWFlatgraph navigation

The FAISS "index factory"

FAISS lets you describe an index with a short string, and it assembles the pieces — the same Lego blocks from this book:

"Flat"                exact search
"IVF4096,Flat"        IVF with 4096 cells, full vectors        (our IVF)
"PQ16"                product quantization, 16 bytes/vector     (our PQ)
"IVF4096,PQ16"        IVF + PQ residuals                        (our IVFPQ)
"OPQ16,IVF4096,PQ16"  add a learned rotation before PQ (higher recall)
"HNSW32"              HNSW graph, 32 links/node                 (the HNSW book)
"IVF4096,PQ16,RFlat"  IVFPQ + exact re-ranking (RFlat = refine) (our re-rank trick)

Reading those strings is now easy: each comma-separated piece is a stage you understand. OPQ is one we didn't build — a learned rotation of the data before PQ that spreads information evenly across sub-vectors, measurably improving recall; it's a drop-in upgrade to the PQ stage.

Why FAISS is fast (and our version isn't)

Same algorithms, serious engineering: SIMD/AVX distance kernels, a special PQFastScan that does PQ table lookups in registers, GPU implementations, multi-threaded training on a sample of the data, and memory-mapped on-disk indexes for sets too big for RAM. Our from-scratch version is for understanding; FAISS is for shipping — and validating your understanding against.

As in the HNSW book, almost every application is:

  1. Embed items (text/images/users) into vectors with a model.
  2. Index them — and at scale, that means IVFPQ (or HNSW) to fit memory and stay fast.
  3. Search with a query embedding; optionally re-rank the shortlist exactly.

IVF/PQ is the part that makes step 2 survive billions of items.

Where it's used

  • Large-scale semantic search & RAG. When a corpus has tens of millions of chunks, storing full float vectors is too costly; IVFPQ compresses them and keeps retrieval fast. The retrieved shortlist is then re-ranked (and fed to an LLM, for RAG).
  • Recommendations. Item embeddings from matrix factorization or two-tower models (see the HNSW use-cases) are indexed with IVFPQ; "similar items" and per-user candidate generation become compressed nearest-neighbor lookups over hundreds of millions of items.
  • Image / video / audio search at web scale — billions of media embeddings compressed to bytes.
  • De-duplication & clustering of massive corpora, where keeping full vectors in RAM is infeasible.

How to choose: a decision guide

SituationUse
Small data (≤ ~10k–100k)Flat (exact). Simple, and vectorized brute force is plenty fast.
Need speed, RAM is fineHNSW or IVFFlat. Full vectors, near-exact, fast.
RAM-constrained, big dataPQ / IVFPQ. Compress 16–64×; re-rank for recall.
Billions of vectorsIVFPQ (often OPQ,IVF…,PQ…), the standard at scale.
Maximum recall, RAM availableHNSW (optionally on SQ8) + exact re-rank.
Need exact guaranteesFlat only.

Rules of thumb:

  • Start exact. If brute force is fast enough, don't add complexity.
  • Hit a speed wall? Add IVF (or HNSW).
  • Hit a memory wall? Add PQ, and always pair it with re-ranking.
  • Both walls (web scale)? IVFPQ, tuned via nprobe and re-rank size.

How IVF/PQ and HNSW relate

They're complementary layers, often combined:

  • HNSW as a coarse quantizer. FAISS can replace IVF's flat list of centroids with an HNSW graph over the centroids (IndexHNSWFlat as the quantizer), making the "which cells?" step itself fast when nlist is huge.
  • HNSW on compressed vectors. Run the graph over PQ/SQ-compressed vectors to cut the graph's memory.
  • Division of labor. HNSW = fast navigation; IVF = partitioning; PQ = compression. Real systems mix the three to hit a target recall, latency, and memory budget simultaneously.

Next, a look at the papers that introduced these ideas. 👉

The papers: understand & reproduce

Like the other books, we built these methods as if from thin air. They come from a line of research papers. This chapter tells that story: what the key paper says, its lineage, how to read it, and an honest reflection on what we reproduced versus what FAISS adds.

The key paper

Hervé Jégou, Matthijs Douze, Cordelia Schmid. Product Quantization for Nearest Neighbor Search. IEEE TPAMI, 2011.

This paper introduced both halves of what we built:

  • Product Quantization (Section 3) — split vectors, quantize each sub-space, store the codes, and compute distances by table lookup (ADC).
  • IVFADC (Section 4) — combine an inverted file (coarse quantizer) with PQ on residuals, which is exactly our IVFPQ. The "ADC" is the asymmetric distance computation we implemented.

The four moving parts of the paper map onto our code:

Paper conceptOur code
Coarse quantizer (k-means)kmeans, IVF.train
Inverted listsIVF.add / IVFPQ.add
Product quantizer + codebooksPQ.train, PQ.encode
Asymmetric Distance ComputationPQ.distance_table, PQ.adc_distances
Residual encoding (IVFADC)IVFPQ.train / IVFPQ.search

Lineage and descendants

  • Vector quantization comes from classic signal compression (Lloyd's algorithm, 1957/1982 — the same k-means we use).
  • Inverted files come from text retrieval, repurposed for vectors.
  • Descendants: Optimized PQ (OPQ, Ge et al. 2013) learns a rotation before PQ; ScaNN (Google, 2020) adds anisotropic quantization tuned for inner product; FAISS packages all of these.

Recognizing this lineage helps: when the paper is terse about k-means or distance algebra, the older quantization literature fills the gap.

How to read a paper like this

The same three-pass approach as the other books:

  1. Skim (5 min). Abstract, Figure 2 (the PQ split), and the IVFADC system diagram. Decide what you need: here, PQ + IVFADC.
  2. Method, carefully. Read Sections 3–4. Write down, in your own words: how a code is formed, how ADC computes distance from codes, why residuals are encoded, and the parameters (m, k* = ksub, nlist/coarse k', w = nprobe).
  3. Reproduce. Implement and measure recall vs. brute force — the demo is the proof you understood it.

Tip: the paper reports recall@{1,10,100} and memory per vector. Reproducing those axes (not necessarily the exact numbers, which depend on the dataset) is the real test — and it surfaces the re-ranking insight, since raw recall@10 looks alarming until you shortlist-and-rescore.

From paper to code: the process

  1. Build the primitive first. k-means underlies everything, so we wrote and tested it before IVF or PQ.
  2. Translate the data structures. Codebooks → a (m, ksub, dsub) array; inverted lists → np.where(assign == c); codes → a uint8 matrix.
  3. Get ADC right. The table-lookup distance (adc_distances) is the subtle part; a wrong axis or sum and recall collapses.
  4. Add residual encoding for IVFPQ. Encoding x - centroid instead of x is a one-line change with a big recall effect — exactly as the paper claims.
  5. Validate against ground truth, and discover the re-ranking technique when raw recall looks low.

Reflection: what we built vs. production

Our implementation is algorithmically faithful — PQ, ADC, IVFADC, residuals, correct recall behavior. What we left out is performance engineering, which is most of FAISS:

AspectThis bookFAISS / ScaNN
Languagepure Python + NumPyC++/CUDA, SIMD, PQFastScan
Trainingfull data, simple Lloydsampled, k-means++, many iters
DistanceNumPy per cellregister-resident table lookups, GPU
ExtrasOPQ rotation, on-disk, sharding, deletes
Re-rankmanual top-100 rescoringIndexRefineFlat

Same lesson as the KTS and HNSW books: the algorithm is compact and learnable; the performance is careful systems work layered on top. A faithful from-scratch version is the right way to understand it — and a baseline to validate the fast version against.

See the References for the papers and their roots.

One-file CLI: the trade-off estimator

Everything in this book, distilled into one tool that answers the practical question every engineer asks before compressing vectors:

If I compress my embeddings with these settings, how much memory do I save, and what recall do I keep?

pqtool.py builds an IVFPQ index over your vectors (or synthetic data), measures recall against exact search, and reports the memory savings — so you can pick parameters with evidence instead of guesswork.

The steps

  1. Load vectors from an .npy file (shape [n, d]), or generate synthetic clustered data with --synthetic.
  2. Hold out some vectors as queries and compute their exact neighbors (ground truth).
  3. Build an IVFPQ index with your chosen nlist, m, nprobe.
  4. Measure recall@k — both raw and with exact re-ranking.
  5. Report recall, memory (float32 vs PQ codes), and the fraction of cells scanned.

Install & run

pip install numpy        # that's all it needs

# try it on synthetic data (no files required)
python pqtool.py --synthetic --m 16 --nlist 128 --nprobe 8

# estimate for your own embeddings
python pqtool.py --npy embeddings.npy --m 16 --nlist 256 --nprobe 16 --rerank 100

It works — real output

$ python pqtool.py --synthetic --m 16 --nlist 128 --nprobe 8
synthetic: 10000 vectors x 64 dims
building IVFPQ (nlist=128, m=16) ...

--- results ---
  recall@10 (raw)      : 0.711
  recall@10 (+rerank 100): 1.000
  memory (float32 vectors) : 2.6 MB
  memory (PQ codes)        : 0.16 MB  (16x smaller)
  per query, IVF scans ~6.2% of cells

The read-out is the whole decision in one place: 16× less memory, recall 1.0 after re-ranking, scanning only 6% of cells. Change --m to trade memory for raw recall, --nprobe to trade speed for recall, and --rerank to trade a little query time for accuracy.

The complete script

#!/usr/bin/env python3
"""
pqtool.py — estimate the memory / recall / speed trade-off of compressing your
vectors with IVF + Product Quantization, using the from-scratch index in ivfpq.py.

Answers the practical question: "If I compress my embeddings with these settings,
how much memory do I save and what recall do I keep?"

Usage:
    # try it on synthetic clustered data (no files needed)
    python pqtool.py --synthetic

    # estimate for your own embeddings (an .npy array of shape [n, d])
    python pqtool.py --npy embeddings.npy --m 16 --nlist 256 --nprobe 16

Requirements: numpy (and the local ivfpq.py).
"""

from __future__ import annotations

import argparse

import numpy as np

from ivfpq import flat_knn, IVFPQ


def make_synthetic(n=10000, d=64, seed=0):
    rng = np.random.default_rng(seed)
    centers = rng.normal(scale=5, size=(max(n // 250, 2), d))
    data = np.vstack([centers[i % len(centers)] + rng.normal(size=(1, d))
                      for i in range(n)])
    return data


def main(argv=None):
    p = argparse.ArgumentParser(description="Estimate IVF-PQ memory/recall trade-off.")
    p.add_argument("--npy", help="path to an .npy array of shape [n, d]")
    p.add_argument("--synthetic", action="store_true",
                   help="use generated clustered data instead of a file")
    p.add_argument("--n", type=int, default=10000, help="synthetic dataset size")
    p.add_argument("--d", type=int, default=64, help="synthetic dimensionality")
    p.add_argument("--m", type=int, default=16, help="PQ sub-vectors / bytes per code")
    p.add_argument("--nlist", type=int, default=128, help="IVF partitions")
    p.add_argument("--nprobe", type=int, default=8, help="IVF cells probed per query")
    p.add_argument("--rerank", type=int, default=100,
                   help="candidates to re-rank exactly (0 = no re-rank)")
    p.add_argument("-k", type=int, default=10, help="neighbors to return")
    p.add_argument("--queries", type=int, default=100, help="evaluation queries")
    args = p.parse_args(argv)

    if args.npy:
        data = np.load(args.npy).astype(np.float64)
        print(f"loaded {data.shape[0]} vectors x {data.shape[1]} dims from {args.npy}")
    else:
        data = make_synthetic(args.n, args.d)
        print(f"synthetic: {data.shape[0]} vectors x {data.shape[1]} dims")

    N, D = data.shape
    if D % args.m != 0:
        raise SystemExit(f"error: dim {D} not divisible by --m {args.m}; "
                         f"pick m in {[x for x in (4,8,16,32) if D % x == 0]}")

    rng = np.random.default_rng(123)
    qidx = rng.choice(N, size=min(args.queries, N), replace=False)
    queries = data[qidx] + rng.normal(scale=0.5, size=(len(qidx), D))
    exact = [set(flat_knn(data, q, args.k)[0].tolist()) for q in queries]

    print(f"building IVFPQ (nlist={args.nlist}, m={args.m}) ...")
    index = IVFPQ(nlist=args.nlist, m=args.m, ksub=256, seed=1).train(data)
    index.add(data)

    def recall(rerank):
        hits = 0
        for q, ex in zip(queries, exact):
            want = args.rerank if (rerank and args.rerank) else args.k
            ids, _ = index.search(q, want, nprobe=args.nprobe)
            if rerank and args.rerank and len(ids):
                d = ((data[ids] - q) ** 2).sum(1)
                ids = ids[np.argsort(d)[:args.k]]
            else:
                ids = ids[:args.k]
            hits += len(set(np.asarray(ids).tolist()) & ex)
        return hits / (len(queries) * args.k)

    flat_bytes = N * D * 4                     # realistic float32 baseline
    code_bytes = N * args.m                    # PQ codes (1 byte each)
    raw = recall(False)
    rr = recall(True) if args.rerank else None

    print("\n--- results ---")
    print(f"  recall@{args.k} (raw)      : {raw:.3f}")
    if rr is not None:
        print(f"  recall@{args.k} (+rerank {args.rerank}): {rr:.3f}")
    print(f"  memory (float32 vectors) : {flat_bytes / 1e6:.1f} MB")
    print(f"  memory (PQ codes)        : {code_bytes / 1e6:.2f} MB  "
          f"({flat_bytes // code_bytes}x smaller)")
    print(f"  per query, IVF scans ~{args.nprobe / args.nlist * 100:.1f}% of cells")


if __name__ == "__main__":
    main()

Point it at your real embeddings and you have an evidence-based way to choose IVFPQ parameters — the same decision FAISS users make every day, now with a tool you fully understand.

References

  1. Hervé Jégou, Matthijs Douze, Cordelia Schmid. Product Quantization for Nearest Neighbor Search. IEEE TPAMI, 2011. — Introduces PQ, ADC, and the IVFADC system this book reproduces.

  2. Tiezheng Ge, Kaiming He, Qifa Ke, Jian Sun. Optimized Product Quantization. CVPR 2013. — OPQ: a learned rotation before PQ that improves recall (OPQ in FAISS).

  3. Stuart P. Lloyd. Least squares quantization in PCM. IEEE Trans. Information Theory, 1982 (work from 1957). — The k-means algorithm at the core of every codebook here.

  4. Jeff Johnson, Matthijs Douze, Hervé Jégou. Billion-scale similarity search with GPUs. IEEE Big Data, 2019. — The FAISS library and its GPU implementations.

  5. Ruiqi Guo et al. Accelerating Large-Scale Inference with Anisotropic Vector Quantization (ScaNN). ICML 2020. — Quantization tuned for maximum inner-product search.

Tools & systems

  • FAISSIndexFlat, IndexIVFFlat, IndexPQ, IndexIVFPQ, IndexScalarQuantizer, IndexHNSWFlat, and the index factory.
  • ScaNN (Google) — anisotropic quantization for MIPS.
  • Vector databases — Qdrant, Weaviate, Milvus, Pinecone (IVF/PQ/HNSW indexes).

This book's code

All depend only on NumPy and the standard library.