k-means: the building block

Both IVF and quantization rest on one algorithm: k-means clustering. It finds k representative points (centroids) that summarize a cloud of vectors. Get this and the rest of the book is just two clever ways of using centroids.

The goal

Given many vectors, find k centroids so that every vector is close to its nearest centroid. Those centroids become:

  • the region centers in IVF (each vector belongs to its nearest region), and
  • the palette in quantization (each vector is replaced by its nearest centroid's id).

Lloyd's algorithm (the standard k-means)

It alternates two steps until things stop moving:

  1. Assign — put each vector in the group of its nearest centroid.
  2. Update — move each centroid to the average of its group.

Repeat. Each round can only reduce the total distance, so it converges.

start: random centroids        assign points        move centroids to means
   x   x      ●                 x   x  |  ●           x  x        ●
 x   x          ●      ─►     x   x    |    ●   ─►   x  ●            ●
   x    x   x                   x  x x | (nearest)     x  x  x
                                                    ...repeat until stable

The code

import numpy as np


def _assign(X, C):
    """Index of the nearest centroid for each row of X (squared L2).
    Uses ||x-c||^2 = ||x||^2 - 2 x.c + ||c||^2 to compute all distances at once."""
    d = (X ** 2).sum(1)[:, None] - 2.0 * X @ C.T + (C ** 2).sum(1)[None, :]
    return d.argmin(1)


def kmeans(X, k, iters=25, seed=0):
    """Lloyd's k-means. Returns (centroids, assignments)."""
    X = np.asarray(X, dtype=np.float64)
    n = X.shape[0]
    k = min(k, n)
    rng = np.random.default_rng(seed)
    C = X[rng.choice(n, size=k, replace=False)].copy()   # init: random points
    for _ in range(iters):
        a = _assign(X, C)                                # 1. assign
        newC = C.copy()
        for j in range(k):                              # 2. update
            members = X[a == j]
            if len(members):
                newC[j] = members.mean(0)
        if np.allclose(newC, C):                        # converged?
            C = newC
            break
        C = newC
    return C, _assign(X, C)

Reading it

  • _assign computes, for every vector, the distance to every centroid in one vectorized expression, then picks the closest. This is the workhorse — IVF and PQ both call it.
  • kmeans starts from k random data points, then loops assign→update until the centroids stop changing (or iters is hit).

See it work

Six points forming two obvious clusters — one near the origin, one near $(5,5)$:

import numpy as np
from ivfpq import kmeans

X = np.array([[0., 0.], [0.2, 0.1], [0.1, 0.2],     # cluster A
              [5., 5.], [5.1, 4.9], [4.9, 5.1]])     # cluster B
C, a = kmeans(X, k=2, seed=0)
print("centroids:\n", np.round(C, 2))
print("assignments:", a.tolist())

Output:

centroids:
 [[5.  5. ]
 [0.1 0.1]]
assignments: [1, 1, 1, 0, 0, 0]

k-means found exactly the two cluster centers — $(5,5)$ and $(0.1,0.1)$ — and assigned the first three points to centroid 1 and the last three to centroid 0. 🎯 Those two centroids now summarize all six points.

How centroids power everything next

  • IVF: treat each centroid as a bucket. Every vector is filed under its nearest centroid; at query time we only open the buckets near the query.
  • Quantization: treat the centroids as a codebook. Store, per vector, just the id of its nearest centroid (one small integer) instead of the full vector.

Same algorithm, two superpowers. Let's build IVF first. 👉