Similarity & distance (the recipe chapter)
Once everything is a vector, "are these two things alike?" becomes "are these two vectors close?" This single question powers search, recommendation, clustering, deduplication, retrieval (RAG), face matching, and the attention inside transformers. This chapter is your recipe card for every common similarity and distance — with sample input, output, and the traps that bite people.
All recipes live in code/recipes.py and print exactly
what's shown here.
Similarity vs. distance: opposite directions
Don't be confused. A similarity is bigger when things are more alike (cosine = 1 means identical direction). A distance is smaller when things are more alike (Euclidean = 0 means identical point). They point in opposite directions. You convert between them with tricks like distance = 1 - similarity (for cosine) or similarity = exp(-gamma * distance^2) (the RBF kernel). Always know which one a function returns before you sort by it: sorting the wrong way silently returns the least similar items.
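A minimal sketch of the sorting trap, assuming a toy `query` and three hypothetical `candidates` (these names are illustrative and not taken from code/recipes.py). It ranks the same candidates by cosine similarity, by cosine distance, and by an ascending sort on the similarity, which is the bug.

```python
import numpy as np

query = np.array([1.0, 0.0])
candidates = np.array([
    [2.0, 0.1],    # index 0: almost the same direction as the query
    [0.0, 1.0],    # index 1: perpendicular
    [-1.0, 0.0],   # index 2: opposite direction
])

# Cosine similarity: bigger means more alike.
sims = candidates @ query / (
    np.linalg.norm(candidates, axis=1) * np.linalg.norm(query)
)
# Cosine distance: smaller means more alike.
dists = 1.0 - sims

best_by_sim = np.argsort(-sims)    # similarity: sort descending
best_by_dist = np.argsort(dists)   # distance: sort ascending
wrong_way = np.argsort(sims)       # bug: ascending sort on a similarity

print("by similarity:", best_by_sim)
print("by distance:  ", best_by_dist)
print("wrong way:    ", wrong_way)
```

Both correct orderings put index 0 first. The wrong-way sort puts index 2, the vector pointing in the opposite direction, at the top, and nothing raises an error.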
Our cast of three example vectors:
import numpy as np
a = np.array([1.0, 0.0]) # points right
b = np.array([2.0, 0.0]) # same direction, twice as long
c = np.array([0.0, 1.0]) # perpendicular to a
Recipe 1 — Dot product (linear similarity)
The raw inner product: multiply elementwise, sum. It grows with both alignment and magnitude.
$$ a \cdot b = \sum_i a_i b_i $$
print(np.dot(a, b)) # 1*2 + 0*0
Output:
2.0
Use it when magnitude is meaningful, or when your vectors are already
normalized (then dot is cosine). It's the cheapest similarity, and the one
transformers use internally (Q·K). Watch out: a longer vector scores higher
just for being long, even if it's off-topic.
Recipe 2 — Cosine similarity (direction only)
Normalize away length, keep only the angle. Range $[-1, 1]$: 1 = same direction, 0 = perpendicular, −1 = opposite.
$$ \cos(a, b) = \frac{a \cdot b}{\lVert a\rVert \, \lVert b\rVert} $$
def cosine_sim(a, b, eps=1e-12):
    # eps keeps the division finite when either vector is all zeros
    a = np.asarray(a, float)
    b = np.asarray(b, float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

print("cos(a, b):", round(cosine_sim(a, b), 3)) # same direction
print("cos(a, c):", round(cosine_sim(a, c), 3)) # perpendicular
Output:
cos(a, b): 1.0
cos(a, c): 0.0
a and b point the same way → cosine 1.0, even though b is twice as long.
a and c are perpendicular → 0.0. Cosine is the default for text/semantic
embeddings, because there a document's topic (direction) matters, not its
length.
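Don't be confused: cosine vs. dot product. They agree only when vectors are unit length. When they're not, they can disagree badly. Here a generic-but-long vector outscores a short, on-topic one tenfold under dot purely because of its length. Cosine divides the length out, so the two scores land much closer together (0.707 vs. 0.816) and reflect direction alone: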
doc = np.array([1., 1., 0.])
short = np.array([1., 0., 0.])   # same topic, short
longd = np.array([5., 5., 5.])   # generic, long
print("dot:", doc @ short, doc @ longd)   # 1.0 10.0 -> long wins
print("cos:", round(cosine_sim(doc, short), 3), round(cosine_sim(doc, longd), 3))
Output:
dot: 1.0 10.0
cos: 0.707 0.816
The fix used everywhere: L2-normalize your vectors once up front, then a plain dot product equals cosine, fast and unambiguous. That's why vector databases store normalized vectors and call it "inner-product" search.
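The normalize-first fix can be checked in a few lines. This is a sketch, assuming a hypothetical helper `l2_normalize` (not a function from code/recipes.py), that compares cosine computed from the raw vectors with a plain dot product of the pre-normalized ones.

```python
import numpy as np

def l2_normalize(v, eps=1e-12):
    """Scale v to unit length; eps guards against division by zero."""
    v = np.asarray(v, dtype=float)
    return v / (np.linalg.norm(v) + eps)

doc = np.array([1.0, 1.0, 0.0])
longd = np.array([5.0, 5.0, 5.0])

# Cosine computed the long way, from the raw vectors.
cos_raw = float(doc @ longd / (np.linalg.norm(doc) * np.linalg.norm(longd)))
# Plain dot product of the pre-normalized vectors.
dot_normalized = float(l2_normalize(doc) @ l2_normalize(longd))

print(round(cos_raw, 3), round(dot_normalized, 3))   # 0.816 0.816
```

The normalization cost is paid once per stored vector, after which every comparison is a single dot product.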
Recipe 3 — Euclidean (L2) distance
Ordinary straight-line distance. A distance (0 = identical).
$$ \lVert a - b\rVert_2 = \sqrt{\sum_i (a_i - b_i)^2} $$
print(np.sqrt(np.sum((a - b) ** 2)))
Output:
1.0
Use it for dense numeric features on a comparable scale (after standardizing!). It's the metric behind k-means and most geometric clustering.
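To see why standardizing matters, here is a small sketch on made-up data: the two feature columns (height in metres, income in dollars) and the helper `euclid` are assumptions for illustration only. It compares Euclidean distances before and after scaling each column to zero mean and unit variance.

```python
import numpy as np

# Hypothetical features: column 0 = height in metres, column 1 = income in dollars.
X = np.array([
    [1.60, 50_000.0],
    [1.95, 50_500.0],   # very different height, similar income
    [1.61, 58_000.0],   # similar height, different income
])

def euclid(p, q):
    return float(np.sqrt(np.sum((p - q) ** 2)))

# Raw scale: income differences swamp height differences.
raw_01, raw_02 = euclid(X[0], X[1]), euclid(X[0], X[2])

# Standardize each column to zero mean and unit variance.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
std_01, std_02 = euclid(Z[0], Z[1]), euclid(Z[0], Z[2])

print(f"raw:          d(0,1)={raw_01:.1f}  d(0,2)={raw_02:.1f}")
print(f"standardized: d(0,1)={std_01:.2f}  d(0,2)={std_02:.2f}")
```

On the raw scale row 1 looks far closer to row 0 than row 2 does, only because income is measured in large units. After standardizing, the two distances come out roughly equal, since each row differs from row 0 by a comparable amount in one feature.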
Don't be confused: cosine vs. Euclidean. They can rank things differently. Two vectors pointing the same way but at different lengths are cosine-identical yet Euclidean-far:
u = np.array([1., 2.]); v = np.array([10., 20.])   # identical direction
print("cosine:", round(cosine_sim(u, v), 3))   # 1.0
print("euclid:", round(float(np.sqrt(np.sum((u-v)**2))), 3))   # 20.125
Output:
cosine: 1.0
euclid: 20.125
Choose by what "similar" means for your data: same proportions → cosine; same location → Euclidean. Fun fact: for unit-length vectors the two are equivalent, $\lVert a-b\rVert^2 = 2(1 - \cos(a,b))$, so normalizing collapses the distinction.
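The unit-length identity follows from expanding $\lVert a-b\rVert^2 = \lVert a\rVert^2 + \lVert b\rVert^2 - 2\,a\cdot b$ with both norms equal to 1. A quick numerical check, using two arbitrary random vectors (`p` and `q` are illustrative names, not from the recipes file):

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.normal(size=8)
q = rng.normal(size=8)

# Project both vectors onto the unit sphere.
p /= np.linalg.norm(p)
q /= np.linalg.norm(q)

sq_dist = float(np.sum((p - q) ** 2))
cos_pq = float(p @ q)          # dot equals cosine for unit vectors

print(np.isclose(sq_dist, 2.0 * (1.0 - cos_pq)))   # True
```

Because the squared distance is a decreasing function of the cosine, ranking unit vectors by smallest Euclidean distance or by largest cosine gives the same order.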
Recipe 4 — Manhattan (L1) distance
Sum of absolute differences — "city-block" distance, as if you can only move along grid streets.
$$ \lVert a - b\rVert_1 = \sum_i |a_i - b_i| $$
print(np.sum(np.abs(a - c))) # |1-0| + |0-1|
Output:
2.0
Use it when you want robustness to outliers (L1 punishes a single huge coordinate difference less than L2's squaring does), or in high dimensions where it sometimes behaves better.
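The outlier claim can be made concrete with a sketch on invented vectors: `spread` and `spike` have the same total absolute difference from `base`, but `spike` concentrates it in one coordinate. The helper names `l1` and `l2` are illustrative.

```python
import numpy as np

base = np.zeros(4)
spread = np.array([3.0, 3.0, 3.0, 3.0])    # moderate difference in every coordinate
spike = np.array([12.0, 0.0, 0.0, 0.0])    # same total difference, all in one coordinate

def l1(p, q):
    return float(np.sum(np.abs(p - q)))

def l2(p, q):
    return float(np.sqrt(np.sum((p - q) ** 2)))

print("L1:", l1(base, spread), l1(base, spike))   # 12.0 12.0
print("L2:", l2(base, spread), l2(base, spike))   # 6.0 12.0
```

L1 treats the two as equally far, while L2 rates the single large deviation as twice as far. If that one coordinate is a measurement glitch, L2 lets it dominate the comparison.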
Recipe 5 — RBF / Gaussian kernel (distance → similarity)
The Radial Basis Function kernel turns a distance into a similarity that decays smoothly: 1 when identical, fading toward 0 as points separate.
$$ k(a, b) = \exp\!\left(-\gamma \, \lVert a - b\rVert^2\right) $$
gamma controls how fast similarity fades — it's the kernel's "reach."
def rbf(a, b, gamma=1.0):
    # squared Euclidean distance, then exponential decay
    d2 = np.sum((np.asarray(a, float) - np.asarray(b, float)) ** 2)
    return float(np.exp(-gamma * d2))

for g in [0.1, 1.0, 10.0]:
    near = rbf([0,0], [0.5, 0], gamma=g) # squared dist 0.25
    far = rbf([0,0], [2.0, 0], gamma=g) # squared dist 4.0
    print(f"gamma={g:<4} near={near:.4f} far={far:.4f}")
Output:
gamma=0.1  near=0.9753 far=0.6703
gamma=1.0  near=0.7788 far=0.0183
gamma=10.0 near=0.0821 far=0.0000
Small gamma = wide, gentle reach (far points still count). Large gamma =
narrow, picky reach (only near-identical points count). The RBF kernel is the
heart of the RBF-SVM and a building block in many "kernel methods" — and gamma
is the single most important knob to tune. We meet kernels properly in
Chapter 5.
Recipe 6 — Jaccard (set overlap)
For sets or binary vectors (tags, shingles, sparse boolean features): the size of the intersection over the size of the union.
$$ J(A, B) = \frac{|A \cap B|}{|A \cup B|} $$
def jaccard(a, b):
    a, b = set(a), set(b)
    # two empty sets are treated as identical
    return len(a & b) / len(a | b) if (a | b) else 1.0

print(jaccard([1, 2, 3], [2, 3, 4])) # overlap {2,3}, union {1,2,3,4}
Output:
0.5
Use it for deduplication, plagiarism/near-duplicate detection (with MinHash), and any "how much do these two tag sets overlap?" question.
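The recipe above takes sets. For the binary-vector case mentioned at the start of this recipe, a possible variant is sketched below. The function name `jaccard_binary` is hypothetical, and the empty-input convention is copied from the set version.

```python
import numpy as np

def jaccard_binary(x, y):
    """Jaccard similarity for two equal-length 0/1 (or boolean) vectors."""
    x = np.asarray(x, dtype=bool)
    y = np.asarray(y, dtype=bool)
    union = np.count_nonzero(x | y)
    if union == 0:
        return 1.0  # both empty: treated as identical, as in the set version
    return np.count_nonzero(x & y) / union

# Items 1..4 encoded as positions 0..3: {1,2,3} vs {2,3,4}
print(jaccard_binary([1, 1, 1, 0], [0, 1, 1, 1]))   # 0.5
```

Positions where both vectors are 0 are ignored, which is what separates Jaccard from a simple "fraction of matching positions" score on sparse data.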
The all-pairs versions (what you actually ship)
In production you rarely compare two vectors — you compare a query against thousands. Do it in one matrix operation, not a Python loop:
def normalize_rows(X, eps=1e-12):
    # divide each row by its L2 norm
    return X / (np.linalg.norm(X, axis=1, keepdims=True) + eps)

def cosine_matrix(X): # all-pairs cosine, shape (n, n)
    Xn = normalize_rows(X)
    return Xn @ Xn.T

X = np.array([[1., 0.], [0.9, 0.1], [0., 1.]])
print(np.round(cosine_matrix(X), 3))
Output:
[[1.    0.994 0.   ]
 [0.994 1.    0.11 ]
 [0.    0.11  1.   ]]
Row 0 vs row 1 = 0.994 (nearly identical), row 0 vs row 2 = 0.0
(perpendicular). Normalize once, multiply, done. This Xn @ Xn.T is the same
kind of matrix product that scores attention in a transformer (there it is
Q @ K.T, without the normalization step). For squared Euclidean all-pairs, use
the identity $\lVert a-b\rVert^2 = \lVert a\rVert^2 + \lVert b\rVert^2 - 2a\cdot b$
(the pairwise_sq_dists recipe), never a double loop.
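The name pairwise_sq_dists comes from the text. The body below is only a sketch of how such a function might apply the identity, not the contents of code/recipes.py, and the two-argument signature is an assumption.

```python
import numpy as np

def pairwise_sq_dists(X, Y):
    """Squared Euclidean distance between every row of X and every row of Y.

    Returns an array of shape (len(X), len(Y)).
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    x2 = np.sum(X ** 2, axis=1)[:, None]   # ||a||^2 as a column
    y2 = np.sum(Y ** 2, axis=1)[None, :]   # ||b||^2 as a row
    d2 = x2 + y2 - 2.0 * (X @ Y.T)         # broadcast to (len(X), len(Y))
    return np.maximum(d2, 0.0)             # clip tiny negatives from rounding

X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
D2 = pairwise_sq_dists(X, X)

# Cross-check one entry against the direct definition.
direct = np.sum((X[0] - X[2]) ** 2)
print(np.isclose(D2[0, 2], direct))   # True
```

The clip at zero matters in practice: floating-point cancellation can leave very small negative values where the true distance is zero, and taking a square root of those would give NaN.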
Quick chooser
| You want… | Use | Returns |
|---|---|---|
| Text / semantic similarity | cosine (or dot on normalized vectors) | similarity ↑ |
| Geometric closeness, clustering | Euclidean (after standardizing) | distance ↓ |
| Robustness to outliers | Manhattan (L1) | distance ↓ |
| Smooth, tunable similarity / SVM | RBF kernel | similarity ↑ |
| Overlap of tags / sets | Jaccard | similarity ↑ |
| Fast many-vs-many | normalize, then X @ Q.T | similarity ↑ |
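The last row of the chooser, scoring many queries against many items, might look like the sketch below. `top_k_cosine` and the toy arrays are hypothetical, and `normalize_rows` is repeated so the block runs on its own.

```python
import numpy as np

def normalize_rows(X, eps=1e-12):
    return X / (np.linalg.norm(X, axis=1, keepdims=True) + eps)

def top_k_cosine(X, Q, k):
    """For each query row in Q, return indices and scores of the k most similar rows of X."""
    sims = normalize_rows(X) @ normalize_rows(Q).T     # shape (n_items, n_queries)
    # Similarity: higher is better, so sort descending along the item axis.
    order = np.argsort(-sims, axis=0)[:k]              # shape (k, n_queries)
    top_scores = np.take_along_axis(sims, order, axis=0)
    return order.T, top_scores.T                       # each of shape (n_queries, k)

X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]])
Q = np.array([[2.0, 0.0], [0.0, 3.0]])

idx, scores = top_k_cosine(X, Q, k=2)
print(idx)
```

The first query (pointing right) retrieves items 0 and 1, and the second (pointing up) retrieves items 2 and 1. A full argsort is fine at this size. For large collections, np.argpartition can select the top k without sorting everything.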
The takeaway
Know whether you hold a similarity or a distance; cosine compares direction, dot adds magnitude, Euclidean compares position, RBF turns distance into tunable similarity, Jaccard compares sets. Normalize first and dot = cosine. These six recipes recur in every system in this series. The word "kernel" snuck in twice already — let's pin down all five things it means. 👉