k-means: the building block
Both IVF and quantization rest on one algorithm: k-means clustering. It finds
k representative points (centroids) that summarize a cloud of vectors. Get
this and the rest of the book is just two clever ways of using centroids.
The goal
Given many vectors, find k centroids so that every vector is close to its
nearest centroid. Those centroids become:
- the region centers in IVF (each vector belongs to its nearest region), and
- the palette in quantization (each vector is replaced by its nearest centroid's id).
Lloyd's algorithm (the standard k-means)
It alternates two steps until things stop moving:
- Assign — put each vector in the group of its nearest centroid.
- Update — move each centroid to the average of its group.
Repeat. Each round can only reduce the total distance, so it converges.
start: random centroids assign points move centroids to means
x x ● x x | ● x x ●
x x ● ─► x x | ● ─► x ● ●
x x x x x x | (nearest) x x x
...repeat until stable
The code
import numpy as np
def _assign(X, C):
"""Index of the nearest centroid for each row of X (squared L2).
Uses ||x-c||^2 = ||x||^2 - 2 x.c + ||c||^2 to compute all distances at once."""
d = (X ** 2).sum(1)[:, None] - 2.0 * X @ C.T + (C ** 2).sum(1)[None, :]
return d.argmin(1)
def kmeans(X, k, iters=25, seed=0):
"""Lloyd's k-means. Returns (centroids, assignments)."""
X = np.asarray(X, dtype=np.float64)
n = X.shape[0]
k = min(k, n)
rng = np.random.default_rng(seed)
C = X[rng.choice(n, size=k, replace=False)].copy() # init: random points
for _ in range(iters):
a = _assign(X, C) # 1. assign
newC = C.copy()
for j in range(k): # 2. update
members = X[a == j]
if len(members):
newC[j] = members.mean(0)
if np.allclose(newC, C): # converged?
C = newC
break
C = newC
return C, _assign(X, C)
Reading it
_assigncomputes, for every vector, the distance to every centroid in one vectorized expression, then picks the closest. This is the workhorse — IVF and PQ both call it.kmeansstarts fromkrandom data points, then loops assign→update until the centroids stop changing (oritersis hit).
See it work
Six points forming two obvious clusters — one near the origin, one near $(5,5)$:
import numpy as np
from ivfpq import kmeans
X = np.array([[0., 0.], [0.2, 0.1], [0.1, 0.2], # cluster A
[5., 5.], [5.1, 4.9], [4.9, 5.1]]) # cluster B
C, a = kmeans(X, k=2, seed=0)
print("centroids:\n", np.round(C, 2))
print("assignments:", a.tolist())
Output:
centroids:
[[5. 5. ]
[0.1 0.1]]
assignments: [1, 1, 1, 0, 0, 0]
k-means found exactly the two cluster centers — $(5,5)$ and $(0.1,0.1)$ — and
assigned the first three points to centroid 1 and the last three to centroid
0. 🎯 Those two centroids now summarize all six points.
How centroids power everything next
- IVF: treat each centroid as a bucket. Every vector is filed under its nearest centroid; at query time we only open the buckets near the query.
- Quantization: treat the centroids as a codebook. Store, per vector, just the id of its nearest centroid (one small integer) instead of the full vector.
Same algorithm, two superpowers. Let's build IVF first. 👉