Step 1 — Distances & the data structure
Two foundations before any searching: how we measure closeness, and how we store the graph.
Distances
HNSW needs one thing from your data: a way to score how close two vectors are. We support the two most common choices.
import numpy as np
def euclidean(a, b):
"""Squared Euclidean distance. We skip the square root because it doesn't
change the *order* of distances and saves time."""
d = a - b
return float(np.dot(d, d))
def cosine(a, b):
"""Cosine distance for UNIT-LENGTH vectors: 1 - (a . b).
0 means identical direction; larger means more different."""
return 1.0 - float(np.dot(a, b))
Two practical notes:
- Squared vs. true distance. For Euclidean we use the squared distance. Since $\sqrt{\cdot}$ is increasing, the nearest neighbor is the same either way — and skipping the square root is faster.
- Cosine needs unit vectors. Cosine distance equals
1 - dotonly when both vectors have length 1. So when using cosine, we normalize every vector (divide by its length) as it goes in, and normalize queries too. Then a plain dot product gives cosine similarity.
def normalize(v):
n = np.linalg.norm(v)
return v / n if n > 0 else v
print(np.dot(normalize(np.array([3.0, 4.0])),
normalize(np.array([3.0, 4.0])))) # a vector with itself -> 1.0
Output:
1.0
The data structure
An HNSW index needs to remember three things:
- The vectors themselves, so we can compute distances. We keep them in a
list; a vector's position in the list is its id (
0, 1, 2, …). - The graph: for each layer, and each node in that layer, the list of its
neighbor ids. We store this as a list of dictionaries —
graph[level][node]is a list of neighbor ids. - The entry point (the id of a node in the top layer) and the current top layer number.
class HNSW:
def __init__(self, dim, M=16, ef_construction=200, ef=50,
distance="euclidean", seed=42):
self.dim = dim
self.M = M # target neighbors per node
self.Mmax = M # cap above layer 0
self.Mmax0 = 2 * M # cap at layer 0 (the base layer is denser)
self.ef_construction = ef_construction
self.ef = ef
self.distance_name = distance
self.data = [] # data[id] = vector
self.graph = [] # graph[level][id] = list of neighbor ids
self.entry_point = None # id of the top-layer entry node
self.max_level = -1
Why two caps (Mmax and Mmax0)? The bottom layer holds every node and does
the precise final search, so it benefits from more links per node (we use
2*M). Upper layers stay leaner to keep the "express lanes" fast.
A picture of the stored graph
For a tiny 4-node, 2-layer index, the dictionaries might look like:
graph[0] = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1], 3: [1]} # layer 0: all nodes
graph[1] = {1: [3], 3: [1]} # layer 1: a subset
entry_point = 3, max_level = 1
Node 3 lives in both layers and is the entry point; nodes 0 and 2 only exist
at the bottom.
Why a graph (not a NumPy matrix)?
Unlike the KTS book — where everything was dense matrices — HNSW is inherently sparse and irregular: each node has a different handful of neighbors, and we follow links one at a time. That's a natural fit for Python dictionaries and lists, with NumPy used only for the per-vector distance math.
Now that we can store the graph and measure distance, let's write the routine that does the actual searching. 👉