A 5-minute primer: vectors, clusters & recall
Just enough to read every example. Skip if you're comfortable with Python, NumPy, and nearest-neighbor basics (the HNSW primer covers the same ground in more detail).
Code boxes show their output
import numpy as np
print(np.array([1.0, 2.0, 3.0]).mean())
Output:
2.0
Vectors and distance
A vector is a list of numbers — a point in space. The squared Euclidean distance between two vectors is the sum of squared differences (we skip the square root because it doesn't change which point is nearest):
import numpy as np
a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])
print(np.sum((a - b) ** 2)) # 9 + 16
Output:
25.0
In this book vectors are usually embeddings — numeric fingerprints of text, images, products, or users, arranged so similar items sit close together. We just assume we're handed them.
Clusters and centroids
A cluster is a group of nearby vectors; its centroid is the group's average — a single vector that "represents" the group. The whole book leans on one idea: replace many vectors by a few representative centroids.
- IVF uses centroids to define regions (search only the nearby region).
- Quantization uses centroids as a palette (store the nearest centroid's id instead of the full vector).
The algorithm that finds good centroids is k-means, built from scratch in Chapter 2.
Recall: measuring an approximate answer
These methods are approximate — fast and small, but they occasionally miss a true neighbor. We measure quality with recall:
recall@k = (how many of the true top-
kneighbors we returned) ÷k.
Recall 1.0 = perfect; 0.9 = we found 9 of the true 10. Throughout, we compare against exact brute-force search to compute recall honestly.
true_top10 = {3, 8, 12, 19, 25, 31, 44, 50, 61, 77}
approx_top10 = {3, 8, 12, 19, 25, 31, 44, 50, 61, 99} # missed 77, added 99
recall = len(true_top10 & approx_top10) / 10
print(recall)
Output:
0.9
NumPy bits we use
| You'll see | Meaning |
|---|---|
np.argsort(d)[:k] | indices of the k smallest values in d |
(X - q) ** 2).sum(1) | squared distance from q to every row of X |
X @ C.T | all pairwise dot products between rows of X and C |
np.where(a == c)[0] | positions where assignment array a equals cell c |
.astype(np.uint8) | store small integers (0–255) in one byte each |
That last one is the heart of quantization: turning floats into single bytes. On to why exact search doesn't fit in memory. 👉