A 5-minute primer: vectors, embeddings & neighbors

This book assumes no prior experience. This page gives you just enough Python, NumPy, and vocabulary to read every example. Skip it if you're already comfortable with all three.

Reading the code boxes

Grey boxes contain Python code; the box right after shows what it prints:

print("hello")
print(2 + 3)

Output:

hello
5

Variables, lists, functions

x = 10                 # x now refers to the number 10
words = ["cat", "dog"] # a list: an ordered collection

def add(a, b):         # define a reusable function
    return a + b       # "return" hands a result back

print(add(2, 3))

Output:

5

NumPy and vectors

NumPy is a library for fast number-crunching. We nickname it np:

import numpy as np

v = np.array([0.2, 0.9, 0.1])   # a "vector" = a list of numbers
print(v, "has", v.shape[0], "dimensions")

Output:

[0.2 0.9 0.1] has 3 dimensions

A vector is just a point in space. [0.2, 0.9, 0.1] is a point in 3-dimensional space. Real embeddings have hundreds or thousands of dimensions — we can't picture that, but the math is identical to 2-D and 3-D.

What is an embedding?

An embedding is a vector that represents something complex — a word, a sentence, an image, a user — as numbers, arranged so that similar things get nearby vectors. For example, a good text model places the sentences "I love dogs" and "puppies are great" close together, and "tax law" far away. How embeddings are produced is a separate topic (neural networks); for HNSW we just assume we're handed vectors and need to find the closest ones.

Distance: how "close" two vectors are

Closeness is measured with a distance. Two common ones:

  • Euclidean distance — ordinary straight-line distance. In 2-D, the distance between $(x_1,y_1)$ and $(x_2,y_2)$ is $\sqrt{(x_1-x_2)^2 + (y_1-y_2)^2}$.
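  • Cosine similarity — measures the angle between vectors, ignoring length. It ranges from 1 (same direction) through 0 (perpendicular) to -1 (opposite direction), and it's the go-to for text embeddings. Cosine distance is 1 - cosine similarity (0 = identical direction, larger = more different, up to 2 for opposite directions).

import numpy as np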
a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])
print("euclidean:", np.sqrt(np.sum((a - b) ** 2)))
print("cosine sim:", np.dot(a, b))     # both length 1 here, so dot = cosine

Output:

euclidean: 1.4142135623730951
cosine sim: 0.0

(a and b are perpendicular, so cosine similarity is 0. Vectors pointing in exactly opposite directions would give -1.)
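The example above only works because a and b both have length 1. For vectors of any length, divide the dot product by both lengths first. Here is a minimal sketch; the helper names cosine_similarity and cosine_distance are our own for illustration, not NumPy functions:

```python
import numpy as np

def cosine_similarity(a, b):
    # Dot product divided by both lengths, so only direction matters.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def cosine_distance(a, b):
    # 0 = identical direction, 2 = exactly opposite.
    return 1.0 - cosine_similarity(a, b)

a = np.array([3.0, 4.0])
b = np.array([6.0, 8.0])     # same direction as a, twice as long
c = np.array([-3.0, -4.0])   # exactly opposite direction to a

print("same direction:", cosine_similarity(a, b))
print("opposite:", cosine_similarity(a, c))
print("euclidean a-b:", np.linalg.norm(a - b))
```

Notice that a and b are 5 units apart by Euclidean distance, yet their cosine similarity is 1 because they point the same way. That is what "ignoring length" means.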

The nearest-neighbor problem

Given a big collection of vectors and a query vector, the nearest-neighbor problem is: which stored vectors are closest to the query?

  • k-nearest-neighbors (k-NN): return the closest k of them.
  • Exact search checks every vector (slow but perfect).
  • Approximate search (ANN), which is what HNSW does, is allowed to occasionally miss a true neighbor in exchange for being dramatically faster. We'll measure its recall (the fraction of the true nearest neighbors it actually returns) in the demo.
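To make "exact search" and "recall" concrete, here is a rough sketch of both. The function names exact_knn and recall, the random data, and the pretend approximate result are all made up for illustration; no real ANN index is involved:

```python
import numpy as np

def exact_knn(data, query, k):
    """Exact search: measure the distance to every stored vector."""
    dists = np.linalg.norm(data - query, axis=1)  # one distance per row
    return np.argsort(dists)[:k]                  # indices of the k smallest

def recall(approx_ids, exact_ids):
    """Fraction of the true neighbors that the approximate result found."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)

rng = np.random.default_rng(0)     # fixed seed so runs are repeatable
data = rng.random((1000, 8))       # 1000 random 8-dimensional vectors
query = rng.random(8)

truth = exact_knn(data, query, k=5)

# Pretend an approximate index found 4 of the 5 true neighbors
# and wrongly returned the farthest vector as the fifth.
farthest = int(np.argmax(np.linalg.norm(data - query, axis=1)))
pretend_ann = list(truth[:4]) + [farthest]

print("recall:", recall(pretend_ann, truth))
```

Output:

recall: 0.8

Finding 4 of the 5 true neighbors gives a recall of 0.8. Exact search always scores 1.0, but it has to touch all 1000 vectors to do so.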

NumPy bits we use

You'll see           Meaning
np.array([...])      make a vector / matrix
np.dot(a, b)         dot product (basis of cosine similarity)
np.linalg.norm(v)    length of a vector
np.argsort(d)        the indices that would sort d (used to find the smallest distances)
a @ b                matrix multiplication
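Here are those pieces working together on a tiny example. The values are arbitrary, chosen so the arithmetic is easy to check by hand:

```python
import numpy as np

data = np.array([[1.0, 0.0],
                 [0.0, 2.0],
                 [3.0, 4.0]])   # a matrix: 3 vectors, one per row
q = np.array([1.0, 0.0])        # a query vector of length 1

dots = data @ q                          # dot product of q with every row at once
lengths = np.linalg.norm(data, axis=1)   # length of each row
order = np.argsort(-dots / lengths)      # rows sorted by cosine similarity, highest first

print(dots)
print(lengths)
print(order)
```

Output:

[1. 0. 3.]
[1. 2. 5.]
[0 2 1]

Row 0 points the same way as q, so it ranks first. Row 2 has the largest dot product, but it is also the longest vector, so after dividing by its length it lands in second place.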

That's all you need. Next: why exact search doesn't scale, and what "approximate" buys us. 👉