Step 1 — Distances & the data structure

Two foundations before any searching: how we measure closeness, and how we store the graph.

Distances

HNSW needs one thing from your data: a way to score how close two vectors are. We support the two most common choices.

import numpy as np


def euclidean(a, b):
    """Squared Euclidean distance. We skip the square root because it doesn't
    change the *order* of distances and saves time."""
    d = a - b
    return float(np.dot(d, d))


def cosine(a, b):
    """Cosine distance for UNIT-LENGTH vectors: 1 - (a . b).
    0 means identical direction; larger means more different."""
    return 1.0 - float(np.dot(a, b))

Two practical notes:

  • Squared vs. true distance. For Euclidean we use the squared distance. Since $\sqrt{\cdot}$ is increasing, the nearest neighbor is the same either way — and skipping the square root is faster.
  • Cosine needs unit vectors. Cosine distance equals 1 - dot only when both vectors have length 1. So when using cosine, we normalize every vector (divide by its length) as it goes in, and normalize queries too. Then a plain dot product gives cosine similarity.
def normalize(v):
    n = np.linalg.norm(v)
    return v / n if n > 0 else v

print(np.dot(normalize(np.array([3.0, 4.0])),
             normalize(np.array([3.0, 4.0]))))   # a vector with itself -> 1.0

Output:

1.0

The data structure

An HNSW index needs to remember three things:

  1. The vectors themselves, so we can compute distances. We keep them in a list; a vector's position in the list is its id (0, 1, 2, …).
  2. The graph: for each layer, and each node in that layer, the list of its neighbor ids. We store this as a list of dictionaries — graph[level][node] is a list of neighbor ids.
  3. The entry point (the id of a node in the top layer) and the current top layer number.
class HNSW:
    def __init__(self, dim, M=16, ef_construction=200, ef=50,
                 distance="euclidean", seed=42):
        self.dim = dim
        self.M = M                    # target neighbors per node
        self.Mmax = M                 # cap above layer 0
        self.Mmax0 = 2 * M            # cap at layer 0 (the base layer is denser)
        self.ef_construction = ef_construction
        self.ef = ef
        self.distance_name = distance

        self.data = []                # data[id] = vector
        self.graph = []               # graph[level][id] = list of neighbor ids
        self.entry_point = None       # id of the top-layer entry node
        self.max_level = -1

Why two caps (Mmax and Mmax0)? The bottom layer holds every node and does the precise final search, so it benefits from more links per node (we use 2*M). Upper layers stay leaner to keep the "express lanes" fast.

A picture of the stored graph

For a tiny 4-node, 2-layer index, the dictionaries might look like:

graph[0] = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1], 3: [1]}   # layer 0: all nodes
graph[1] = {1: [3], 3: [1]}                                # layer 1: a subset
entry_point = 3,  max_level = 1

Node 3 lives in both layers and is the entry point; nodes 0 and 2 only exist at the bottom.

Why a graph (not a NumPy matrix)?

Unlike the KTS book — where everything was dense matrices — HNSW is inherently sparse and irregular: each node has a different handful of neighbors, and we follow links one at a time. That's a natural fit for Python dictionaries and lists, with NumPy used only for the per-vector distance math.

Now that we can store the graph and measure distance, let's write the routine that does the actual searching. 👉