A 5-minute primer: vectors, embeddings & neighbors
This book assumes no prior experience. This page gives you just enough Python, NumPy, and vocabulary to read every example. Skip it if you're already comfortable with all three.
Reading the code boxes
Grey boxes contain Python code; the box right after shows what it prints:
print("hello")
print(2 + 3)
Output:
hello
5
Variables, lists, functions
x = 10 # x now refers to the number 10
words = ["cat", "dog"] # a list: an ordered collection
def add(a, b):  # define a reusable function
    return a + b  # "return" hands a result back
print(add(2, 3))
Output:
5
NumPy and vectors
NumPy is a library for fast number-crunching. We nickname it np:
import numpy as np
v = np.array([0.2, 0.9, 0.1]) # a "vector" = a list of numbers
print(v, "has", v.shape[0], "dimensions")
Output:
[0.2 0.9 0.1] has 3 dimensions
A vector is just a point in space. [0.2, 0.9, 0.1] is a point in
3-dimensional space. Real embeddings have hundreds or thousands of dimensions —
we can't picture that, but the math is identical to 2-D and 3-D.
What is an embedding?
An embedding is a vector that represents something complex — a word, a sentence, an image, a user — as numbers, arranged so that similar things get nearby vectors. For example, a good text model places the sentences "I love dogs" and "puppies are great" close together, and "tax law" far away. How embeddings are produced is a separate topic (neural networks); for HNSW we just assume we're handed vectors and need to find the closest ones.
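To make "nearby" concrete, here is a toy sketch. The three vectors below are invented by hand for illustration: they do not come from any real model, and real embeddings have far more than 3 dimensions. The variable names and the `distance` helper are ours too. The only point is that the two dog sentences sit much closer to each other than either does to "tax law".

```python
import numpy as np

# Hand-made 3-D "embeddings" (invented for illustration, NOT model output)
dogs = np.array([0.9, 0.8, 0.1])      # stands in for "I love dogs"
puppies = np.array([0.8, 0.9, 0.2])   # stands in for "puppies are great"
tax = np.array([0.1, 0.0, 0.9])       # stands in for "tax law"

def distance(p, q):
    # straight-line distance between two points
    return np.linalg.norm(p - q)

print("dogs vs puppies:", round(float(distance(dogs, puppies)), 2))
print("dogs vs tax law:", round(float(distance(dogs, tax)), 2))
```

With these made-up numbers the first distance comes out around 0.17 and the second around 1.39, which is the "similar things get nearby vectors" idea in miniature.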
Distance: how "close" two vectors are
Closeness is measured with a distance. Two common ones:
- Euclidean distance: ordinary straight-line distance. In 2-D, the distance between $(x_1,y_1)$ and $(x_2,y_2)$ is $\sqrt{(x_1-x_2)^2 + (y_1-y_2)^2}$.
- Cosine similarity: the cosine of the angle between two vectors, ignoring their lengths. It's the go-to for text embeddings. Cosine distance is 1 - cosine similarity (0 = identical direction, larger = more different).
import numpy as np
a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])
print("euclidean:", np.sqrt(np.sum((a - b) ** 2)))
print("cosine sim:", np.dot(a, b)) # both length 1 here, so dot = cosine
Output:
euclidean: 1.4142135623730951
cosine sim: 0.0
(a and b are perpendicular, at a right angle to each other, so their cosine similarity is 0. Vectors pointing in exactly opposite directions would score -1.)
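The dot-product shortcut in the box above only works because both vectors have length 1. As a rough sketch of the general case, the helpers below divide by the two lengths first. The function names are ours, not from any library, and the code assumes neither vector is all zeros (that would divide by zero).

```python
import numpy as np

def euclidean_distance(p, q):
    # straight-line distance; works for any number of dimensions
    return np.sqrt(np.sum((p - q) ** 2))

def cosine_similarity(p, q):
    # dot product divided by both lengths, so length no longer matters
    # (assumes neither vector is all zeros)
    return np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q))

def cosine_distance(p, q):
    return 1.0 - cosine_similarity(p, q)

a = np.array([3.0, 4.0])
b = np.array([6.0, 8.0])   # same direction as a, but twice as long

print("euclidean:", euclidean_distance(a, b))   # 5.0
print("cosine sim:", cosine_similarity(a, b))   # 1.0
print("cosine dist:", cosine_distance(a, b))    # 0.0
```

Here the two points are 5 units apart in straight-line terms, yet their cosine distance is 0 because they point the same way. That is what "ignoring length" means in practice.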
The nearest-neighbor problem
Given a big collection of vectors and a query vector, the nearest-neighbor problem is: which stored vectors are closest to the query?
- k-nearest-neighbors (k-NN): return the closest k of them.
- Exact search checks every vector (slow but perfect).
- Approximate search (ANN) — what HNSW does — is allowed to occasionally miss one, in exchange for being dramatically faster. We'll measure exactly how often it misses (its recall) in the demo.
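Here is a minimal sketch of the slow-but-perfect baseline, plus one common way to score an approximate answer against it. This is not HNSW. The names `exact_knn` and `recall` are hypothetical helpers of ours, and the "approximate" result is typed in by hand just to show how the score is computed.

```python
import numpy as np

def exact_knn(vectors, query, k):
    # distance from the query to every stored vector (one per row)
    dists = np.linalg.norm(vectors - query, axis=1)
    # indices of the k smallest distances, closest first
    return np.argsort(dists)[:k]

def recall(approx_ids, exact_ids):
    # fraction of the true neighbors that the approximate result found
    found = set(map(int, approx_ids)) & set(map(int, exact_ids))
    return len(found) / len(exact_ids)

vectors = np.array([[0.0, 0.0],
                    [1.0, 0.0],
                    [0.0, 2.0],
                    [5.0, 5.0],
                    [1.0, 1.0]])
query = np.array([0.0, 0.0])

truth = exact_knn(vectors, query, k=3)
print(truth)                      # [0 1 4]

pretend_ann = [0, 1, 2]           # hand-written stand-in for an approximate result
print(recall(pretend_ann, truth)) # 2 of the 3 true neighbors were found
```

The exact search touches all five rows to answer one query, which is why it gets slow as the collection grows. The pretend approximate answer missed one true neighbor, so its recall is two thirds.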
NumPy bits we use
| You'll see | Meaning |
|---|---|
| np.array([...]) | make a vector / matrix |
| np.dot(a, b) | dot product (basis of cosine similarity) |
| np.linalg.norm(v) | length of a vector |
| np.argsort(d) | the indices that would sort d (used to find the smallest distances) |
| a @ b | matrix multiplication |
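If the table feels abstract, this small sketch runs the last three entries on tiny inputs. The variable names (`v`, `d`, `M`, `q`) and the numbers are made up for illustration.

```python
import numpy as np

# np.linalg.norm: length of a vector
v = np.array([3.0, 4.0])
print(np.linalg.norm(v))   # 5.0

# np.argsort: positions of the values, smallest first
d = np.array([0.7, 0.1, 0.4])
order = np.argsort(d)
print(order)               # [1 2 0]
print(d[order])            # [0.1 0.4 0.7]

# @ : multiply a matrix by a vector, giving one dot product per row
M = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [3.0, 4.0]])
q = np.array([1.0, 2.0])
scores = M @ q
print(scores)              # the three dot products: 1, 2 and 11
```

The last step is the one to remember: with stored vectors as the rows of a matrix, a single `@` compares the query against all of them at once.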
That's all you need. Next: why exact search doesn't scale, and what "approximate" buys us. 👉