Introduction

No background required. This book assumes you know nothing about programming, AI, or advanced math. Every concept, line of code, and symbol is explained, and every code example is followed by the exact output it produces. If you've never written Python, read the 5-minute primer first.

The one-sentence idea

HNSW (Hierarchical Navigable Small World) is a clever data structure that finds the items most similar to a query — out of millions — in a tiny fraction of the time it would take to compare against every item.

Why you've already used it

Every time you see "similar products", "related articles", semantic search, a chatbot that "remembers" documents (RAG), or face/image search, there's a good chance HNSW (or something like it) is doing the heavy lifting behind the scenes. It's the default index in FAISS, OpenSearch / Elasticsearch, Lucene, and most vector databases (Pinecone, Weaviate, Qdrant, Milvus…).

An everyday analogy

Imagine finding the house nearest to you in a huge city. The slow way: measure the distance to every house. The HNSW way is like using a layered map:

  • A highway map (top layer) has only a few major cities — you hop between them to get roughly close, fast.
  • A regional map (middle layer) has more towns — you refine your position.
  • A street map (bottom layer) has every house — you do the final fine search only in the small area you've zoomed into.

By zooming from coarse to fine, you reach the right neighborhood in a few big jumps, then look carefully only nearby — instead of checking the whole city. That layered zoom is exactly what HNSW does, and the layers are what the "Hierarchical" in its name refers to.

What "similar" means: vectors & embeddings

Computers compare things by turning them into lists of numbers called vectors (or, for text/images/audio, embeddings). Similar items get nearby vectors. So "find similar items" becomes a geometry problem: find the vectors closest to the query vector. That's the nearest-neighbor problem, and HNSW solves it approximately but extremely fast. (The primer explains vectors and embeddings gently.)

What you'll build

A complete, working HNSW index in one readable file, hnsw.py, plus:

The two ingredients

HNSW combines two older ideas, which are the heart of this book:

  1. Small-world graphs — connect each item to a handful of neighbors so that any item is reachable from any other in just a few hops (like "six degrees of separation"). You can then walk the graph greedily toward a query.
  2. A hierarchy of layers (inspired by skip lists) — sparse at the top for big jumps, dense at the bottom for precision.

We'll meet both in Chapter 2. First, let's understand the problem they solve. 👉

A 5-minute primer: vectors, embeddings & neighbors

This book assumes no prior experience. This page gives you just enough Python, NumPy, and vocabulary to read every example. Skip it if you're comfortable.

Reading the code boxes

Grey boxes contain Python code; the box right after shows what it prints:

print("hello")
print(2 + 3)

Output:

hello
5

Variables, lists, functions

x = 10                 # x now refers to the number 10
words = ["cat", "dog"] # a list: an ordered collection

def add(a, b):         # define a reusable function
    return a + b       # "return" hands a result back

print(add(2, 3))

Output:

5

NumPy and vectors

NumPy is a library for fast number-crunching. We nickname it np:

import numpy as np

v = np.array([0.2, 0.9, 0.1])   # a "vector" = a list of numbers
print(v, "has", v.shape[0], "dimensions")

Output:

[0.2 0.9 0.1] has 3 dimensions

A vector is just a point in space. [0.2, 0.9, 0.1] is a point in 3-dimensional space. Real embeddings have hundreds or thousands of dimensions — we can't picture that, but the math is identical to 2-D and 3-D.

What is an embedding?

An embedding is a vector that represents something complex — a word, a sentence, an image, a user — as numbers, arranged so that similar things get nearby vectors. For example, a good text model places the sentences "I love dogs" and "puppies are great" close together, and "tax law" far away. How embeddings are produced is a separate topic (neural networks); for HNSW we just assume we're handed vectors and need to find the closest ones.

Distance: how "close" two vectors are

Closeness is measured with a distance. Two common ones:

  • Euclidean distance — ordinary straight-line distance. In 2-D, the distance between $(x_1,y_1)$ and $(x_2,y_2)$ is $\sqrt{(x_1-x_2)^2 + (y_1-y_2)^2}$.
  • Cosine similarity — measures the angle between vectors, ignoring length. It's the go-to for text embeddings. Cosine distance is 1 - cosine similarity (0 = identical direction, larger = more different).
import numpy as np
a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])
print("euclidean:", np.sqrt(np.sum((a - b) ** 2)))
print("cosine sim:", np.dot(a, b))     # both length 1 here, so dot = cosine

Output:

euclidean: 1.4142135623730951
cosine sim: 0.0

(a and b point in completely different directions, so cosine similarity is 0.)

The nearest-neighbor problem

Given a big collection of vectors and a query vector, the nearest-neighbor problem is: which stored vectors are closest to the query?

  • k-nearest-neighbors (k-NN): return the closest k of them.
  • Exact search checks every vector (slow but perfect).
  • Approximate search (ANN) — what HNSW does — is allowed to occasionally miss one, in exchange for being dramatically faster. We'll measure exactly how often it misses (its recall) in the demo.

NumPy bits we use

You'll seeMeaning
np.array([...])make a vector / matrix
np.dot(a, b)dot product (basis of cosine similarity)
np.linalg.norm(v)length of a vector
np.argsort(d)the indices that would sort d (used to find the smallest distances)
a @ bmatrix multiplication

That's all you need. Next: why exact search doesn't scale, and what "approximate" buys us. 👉

The problem: nearest-neighbor search

Before the clever data structure, let's be precise about the problem it solves and why the obvious solution isn't good enough.

The setup

We have $n$ stored vectors (the "database"), each with $d$ numbers (dimensions), and a query vector $q$. We want the $k$ stored vectors closest to $q$ under some distance (Euclidean or cosine).

import numpy as np

data = np.random.randn(1000, 16)   # 1000 vectors, 16 dimensions each
query = np.random.randn(16)        # one query vector

The obvious solution: brute force

Compute the distance from the query to every stored vector, then take the smallest few:

def brute_force_knn(data, query, k=5):
    d = np.sum((data - query) ** 2, axis=1)   # squared distance to each vector
    idx = np.argsort(d)[:k]                    # indices of the k smallest
    return idx, d[idx]

idx, dist = brute_force_knn(data, query, k=3)
print("nearest 3 ids:", idx.tolist())

Output (ids depend on the random data):

nearest 3 ids: [742, 88, 915]

Brute force is exact and simple. So what's wrong with it?

Why brute force doesn't scale

It does $n$ distance computations per query. That's fine for 1,000 vectors, but real systems have very different numbers:

Vectors $n$Distances per queryAt 1M queries/day
1,0001,000manageable
1,000,0001,000,000painful
100,000,000100,000,000impossible in real time

Cost grows linearly with $n$ — written $O(n)$. Double the database, double the work, forever. A web-scale search or recommendation system can't afford to scan hundreds of millions of vectors for every single query.

We want something closer to $O(\log n)$: when the database grows 1000×, the work should grow by a small additive amount, not 1000×.

"Just build a tree" — and the curse of dimensionality

In low dimensions (2-D, 3-D) classic structures like k-d trees give that logarithmic speedup by recursively splitting space. The trouble is the curse of dimensionality: as $d$ grows into the hundreds (where embeddings live), space becomes so vast and counterintuitive that these partitioning trees degrade until they're no faster than brute force. Distances between points become eerily similar, and the "split space in half" idea stops pruning anything.

Intuition for the curse. In high dimensions, almost all the volume of a region sits near its boundary, and nearly every pair of random points is about the same distance apart. Methods that rely on cleanly carving space apart lose their grip.

The escape hatch: accept "approximate"

Here's the key insight that makes web-scale search possible:

We almost never need the mathematically perfect nearest neighbor. Returning the true top-10 about 95–99% of the time is more than enough for search, recommendations, and RAG — and allowing that tiny slack unlocks enormous speedups.

This is Approximate Nearest Neighbor (ANN) search. We trade a sliver of accuracy for orders-of-magnitude speed, and we measure the trade-off with two numbers:

  • Recall — of the true nearest neighbors, what fraction did we return? (1.0 = perfect.)
  • Speed / work — how many distance computations (or milliseconds) per query?

A good ANN method lets you dial the balance: spend more effort for higher recall, or less for more speed. HNSW does this with a single knob (ef), as we'll see.

Where HNSW fits

HNSW is the most popular ANN method today because it offers an excellent recall-vs-speed curve, handles high-dimensional embeddings well, and supports adding vectors incrementally (no full rebuild). It gets there by combining two ideas — small-world graphs and a layered hierarchy — which we cover next. 👉

Background: skip lists & small-world graphs

HNSW = skip lists (the hierarchy) + small-world graphs (the navigation). Understanding these two ideas separately makes HNSW obvious when we combine them.

Small-world graphs: navigation by greedy hops

A graph is a set of items (nodes) joined by links (edges). A small-world graph has a magic property: even with only a handful of links per node, any node can reach any other in a small number of hops. This is the "six degrees of separation" idea — you're surprisingly few handshakes from anyone on Earth.

Why does that help search? Because of greedy routing. To find the node nearest a query:

  1. Start at some node.
  2. Look at its neighbors; move to whichever neighbor is closest to the query.
  3. Repeat. Stop when no neighbor is closer than where you are.

Each hop gets you strictly closer, and small-world connectivity means you home in after only a few hops. A tiny ASCII picture — greedy walk from A toward a query near G:

A —— B —— C
|    |    |          start at A, hop to the neighbor closest to the query,
D —— E —— F          keep hopping "downhill": A -> B -> E -> F -> G
     |    |
     G —— H          (G's neighbors are all farther from the query -> stop)

The danger: greedy routing can get stuck in a local minimum — a node that's closer than all its neighbors but not the true nearest. Two things fix this:

  • adding a few long-range links so you can take big shortcuts, and
  • searching with a beam (keeping several candidates at once, not just the single best) — the ef parameter we'll meet later.

Skip lists: a hierarchy for big jumps

A skip list speeds up search in an ordered list by stacking layers of "express lanes". The bottom layer has every element; each layer above keeps only a random subset, so it's sparser. You search from the top (few elements, big jumps), drop down a layer when you'd overshoot, and refine — reaching any element in about $O(\log n)$ steps.

layer 2:  1 ───────────────► 9            (express: big jumps)
layer 1:  1 ─────► 5 ──────► 9            (regional)
layer 0:  1 → 3 → 5 → 7 → 9 → ...         (every element)

The trick that makes the layer sizes work: each element is promoted to the next layer up with some fixed probability, so layers shrink geometrically — most elements live only at the bottom, a few reach the top.

The combination = HNSW

HNSW takes the navigability of small-world graphs and the multi-layer zoom of skip lists and fuses them:

  • Instead of an ordered list, each layer is a small-world graph.
  • The top layers are sparse (few nodes, long links) for fast, coarse approach — the "highways".
  • The bottom layer (layer 0) contains every vector with denser links for precise final search — the "streets".
  • Search starts at the top, greedily routes to get close, drops down a layer, refines, and repeats until it does a careful search at the bottom.

That's the whole idea. The next chapter shows it working end-to-end before we build each piece. 👉

How HNSW works (the big picture)

Before building each piece, let's see the whole thing at a glance. HNSW is a stack of graph layers; you search top-down and insert top-down.

The structure

   layer 2     (•)                         <- a few nodes, long-range links
                |  \
   layer 1   (•)–(•)–––––(•)               <- more nodes
              |   |       |
   layer 0  (•)-(•)-(•)-(•)-(•)-(•)-(•)     <- EVERY node, dense links
  • Every vector lives in layer 0. Each vector is also present in some number of higher layers, chosen randomly at insertion time.
  • Higher layers are exponentially sparser, so they act as express lanes.
  • There is a single entry point: a node in the top layer where every search begins.

The layers really are a pyramid

Each node is assigned a maximum layer drawn so that the chance of reaching layer $\ell$ falls off geometrically (probability $\sim 1/M$ per extra layer, where $M$ is the links-per-node setting). Building an index of 2000 random vectors with M=16 gives:

layer 0: 2000 nodes
layer 1: 136 nodes
layer 2: 9 nodes
entry point id: 13   top layer: 2

Layer 0 has everything; layer 1 has ~1/16 as many; layer 2 just a handful. That steep pyramid is what keeps the top-down search short.

Searching (find the nearest neighbors)

1. Start at the entry point in the TOP layer.
2. Greedily hop toward the query within the current layer (single best path).
3. Drop down one layer, using where you landed as the new start.
4. Repeat until layer 0.
5. At layer 0, do a WIDER search (keep the best `ef` candidates), and
   return the closest `k`.

The top layers get you into the right neighborhood with a few big jumps; the wide search at the bottom nails the precise answer.

Inserting a new vector

1. Pick a random maximum layer L for the new node (mostly 0; rarely high).
2. From the global entry point, greedily descend the layers ABOVE L
   (just like searching) to find a good entry near the new node.
3. For each layer from L down to 0:
     a. Search that layer to collect nearby candidates.
     b. Choose M of them as neighbors (using a diversity heuristic).
     c. Add links both ways (new node <-> neighbors).
     d. If a neighbor now has too many links, prune it back.
4. If L is higher than the current top, the new node becomes the entry point.

Insertion reuses the search routine — that's why we build search first.

The knobs

Three parameters control the whole structure; we'll meet each in context:

ParameterMeaningEffect
Mlinks kept per node per layerbigger = better recall, more memory
efConstructionsearch breadth while buildingbigger = better graph, slower build
efsearch breadth at query timebigger = higher recall, slower query

ef is the runtime dial: turn it up for accuracy, down for speed — without rebuilding the index.

The plan

We'll build HNSW bottom-up over the next four chapters:

  1. Distances & the data structure — what we store.
  2. Greedy search in one layer — the core routine.
  3. The hierarchy — searching across layers.
  4. Insertion — building the graph, with the neighbor heuristic.

Let's start with what an HNSW index actually holds in memory. 👉

Step 1 — Distances & the data structure

Two foundations before any searching: how we measure closeness, and how we store the graph.

Distances

HNSW needs one thing from your data: a way to score how close two vectors are. We support the two most common choices.

import numpy as np


def euclidean(a, b):
    """Squared Euclidean distance. We skip the square root because it doesn't
    change the *order* of distances and saves time."""
    d = a - b
    return float(np.dot(d, d))


def cosine(a, b):
    """Cosine distance for UNIT-LENGTH vectors: 1 - (a . b).
    0 means identical direction; larger means more different."""
    return 1.0 - float(np.dot(a, b))

Two practical notes:

  • Squared vs. true distance. For Euclidean we use the squared distance. Since $\sqrt{\cdot}$ is increasing, the nearest neighbor is the same either way — and skipping the square root is faster.
  • Cosine needs unit vectors. Cosine distance equals 1 - dot only when both vectors have length 1. So when using cosine, we normalize every vector (divide by its length) as it goes in, and normalize queries too. Then a plain dot product gives cosine similarity.
def normalize(v):
    n = np.linalg.norm(v)
    return v / n if n > 0 else v

print(np.dot(normalize(np.array([3.0, 4.0])),
             normalize(np.array([3.0, 4.0]))))   # a vector with itself -> 1.0

Output:

1.0

The data structure

An HNSW index needs to remember three things:

  1. The vectors themselves, so we can compute distances. We keep them in a list; a vector's position in the list is its id (0, 1, 2, …).
  2. The graph: for each layer, and each node in that layer, the list of its neighbor ids. We store this as a list of dictionaries — graph[level][node] is a list of neighbor ids.
  3. The entry point (the id of a node in the top layer) and the current top layer number.
class HNSW:
    def __init__(self, dim, M=16, ef_construction=200, ef=50,
                 distance="euclidean", seed=42):
        self.dim = dim
        self.M = M                    # target neighbors per node
        self.Mmax = M                 # cap above layer 0
        self.Mmax0 = 2 * M            # cap at layer 0 (the base layer is denser)
        self.ef_construction = ef_construction
        self.ef = ef
        self.distance_name = distance

        self.data = []                # data[id] = vector
        self.graph = []               # graph[level][id] = list of neighbor ids
        self.entry_point = None       # id of the top-layer entry node
        self.max_level = -1

Why two caps (Mmax and Mmax0)? The bottom layer holds every node and does the precise final search, so it benefits from more links per node (we use 2*M). Upper layers stay leaner to keep the "express lanes" fast.

A picture of the stored graph

For a tiny 4-node, 2-layer index, the dictionaries might look like:

graph[0] = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1], 3: [1]}   # layer 0: all nodes
graph[1] = {1: [3], 3: [1]}                                # layer 1: a subset
entry_point = 3,  max_level = 1

Node 3 lives in both layers and is the entry point; nodes 0 and 2 only exist at the bottom.

Why a graph (not a NumPy matrix)?

Unlike the KTS book — where everything was dense matrices — HNSW is inherently sparse and irregular: each node has a different handful of neighbors, and we follow links one at a time. That's a natural fit for Python dictionaries and lists, with NumPy used only for the per-vector distance math.

Now that we can store the graph and measure distance, let's write the routine that does the actual searching. 👉

Step 2 — Greedy search in one layer

This is the engine of HNSW. Everything else — multi-layer search and insertion — calls this one routine. It answers: within a single layer, starting from some entry node(s), what are the closest nodes to the query?

The idea: a guided "beam" walk

Pure greedy walking (always step to the single best neighbor) is fast but can get stuck before reaching the true nearest. The fix is to explore with a small beam: keep a working set of the best ef candidates found so far, and keep expanding the most promising unexplored one until none can improve the set.

We track two collections:

  • candidates — nodes we've found but not yet explored, ordered so we always expand the closest one next (a min-heap).
  • results — the best ef nodes found so far, ordered so we can quickly see the worst one (a max-heap), since that's the one a newcomer must beat.

A heap is a structure that always gives you the smallest (or largest) item instantly. Python's heapq is a min-heap; to get a max-heap we store negative distances.

The stopping rule

We stop when the closest unexplored candidate is farther than the worst node in our results. At that point nothing left to explore could improve the answer, so we're done — that's what makes it fast (we don't visit the whole layer).

The code

import heapq


def _search_layer(self, q, entry_ids, ef, level):
    """Greedy best-first search within one layer.
    Returns up to `ef` nearest nodes as a list of (distance, id)."""
    visited = set(entry_ids)
    candidates = []          # min-heap by distance: explore nearest first
    results = []             # max-heap by -distance: the best `ef` so far

    for e in entry_ids:
        d = self._dist(q, self.data[e])
        heapq.heappush(candidates, (d, e))
        heapq.heappush(results, (-d, e))

    while candidates:
        d_c, c = heapq.heappop(candidates)         # closest unexplored
        if d_c > -results[0][0]:                    # farther than our worst kept?
            break                                   # ...then nothing can improve
        for nb in self.graph[level].get(c, []):     # look at c's neighbors
            if nb in visited:
                continue
            visited.add(nb)
            d_nb = self._dist(q, self.data[nb])
            if d_nb < -results[0][0] or len(results) < ef:
                heapq.heappush(candidates, (d_nb, nb))
                heapq.heappush(results, (-d_nb, nb))
                if len(results) > ef:               # keep only the best ef
                    heapq.heappop(results)
    return [(-d, n) for d, n in results]

Reading it

  • We seed both heaps with the entry node(s).
  • Each loop: take the closest unexplored candidate c. If it's already worse than our worst kept result, stop (the key efficiency rule).
  • Otherwise, look at each unvisited neighbor of c, compute its distance, and if it's good enough (better than our current worst, or we don't yet have ef results), add it to both heaps. Trim results back to ef.
  • visited ensures we never process the same node twice.

The ef parameter is the beam width: larger ef explores more, finds better answers, and costs more — the recall/speed dial.

See it work

Eight points in 2-D: a cluster near $(5,5)$, a cluster near the origin, and two outliers. We query near $(5.2, 5.2)$ and ask for the 3 nearest:

import numpy as np
from hnsw import HNSW, brute_force_knn

pts = np.array([[0., 0.], [1., 0.], [0., 1.],      # cluster near origin
                [5., 5.], [6., 5.], [5., 6.],      # cluster near (5,5)
                [10., 0.], [0., 10.]])             # two outliers
idx = HNSW(dim=2, M=4, ef_construction=20, distance="euclidean", seed=3)
for p in pts:
    idx.add(p)

q = np.array([5.2, 5.2])
print("HNSW  top3:", [(i, round(d, 2)) for i, d in idx.search(q, k=3, ef=10)])
print("exact top3:", [(i, round(d, 2)) for i, d in brute_force_knn(pts, q, k=3)])

Output:

HNSW  top3: [(3, 0.08), (4, 0.68), (5, 0.68)]
exact top3: [(3, 0.08), (4, 0.68), (5, 0.68)]

HNSW returns exactly the three points of the $(5,5)$ cluster — ids 3, 4, 5 — identical to the exact brute-force answer. 🎯 The greedy walk hopped straight into the right cluster and the beam picked up its members.

So far this searches one layer. Next we stack layers to make it scale. 👉

Step 3 — The hierarchy

One-layer greedy search is the engine. The hierarchy is the gearbox: it lets a few coarse layers carry the query most of the way before the fine layer does the precise work.

Why layers help

In a single flat graph, greedy search from a random start can take many hops to cross the whole dataset. With layers:

  • Top layers are sparse, so their links span long distances — each hop covers a lot of ground. A few hops get you roughly to the query's region.
  • Lower layers are denser, so hops are short and precise — perfect for refining once you're already close.

It's the highway → regional road → street progression from the introduction. The result is roughly logarithmic search cost instead of linear.

Search across layers

The multi-layer search is short, because it just calls _search_layer repeatedly:

def search(self, query, k=5, ef=None):
    """Approximate k nearest neighbors as [(id, distance), ...]."""
    if self.entry_point is None:
        return []
    ef = ef or max(self.ef, k)
    q = self._prepare(query)                 # normalize if cosine

    ep = [self.entry_point]
    for l in range(self.max_level, 0, -1):   # top layers: narrow, fast descent
        W = self._search_layer(q, ep, 1, l)  # ef=1: just the single best path
        ep = [min(W)[1]]                     # carry the closest node downward
    W = self._search_layer(q, ep, ef, 0)     # base layer: WIDE search (ef)
    W.sort()
    return [(n, d) for d, n in W[:k]]

The two phases

  1. Descend (layers max_level … 1). Here we search with ef = 1 — pure greedy, single best path. We don't need breadth yet; we just want to arrive near the query quickly. After each layer we take the closest node found and use it as the entry point for the layer below.
  2. Refine (layer 0). Now we search with the full ef beam. The base layer has every vector and dense links, so this wide search produces the accurate top-k. We sort and return the closest k.

_prepare simply normalizes the query to unit length when using cosine (so the dot-product distance is correct).

A mental trace

top layer:   start at entry point ── greedy ──► node A          (1 big hop)
                                         │ drop down
mid layer:   start at A ── greedy ──► node B                    (a few hops)
                                         │ drop down
layer 0:     start at B ── WIDE beam search (ef) ──► best k near the query

By the time we reach layer 0 we're already in the right neighborhood, so the expensive wide search only explores a small local region — not the whole dataset.

Why descend with ef = 1?

It's a deliberate efficiency choice. The upper layers exist only to approach the target; precision there is wasted effort because the real answers all live at layer 0 anyway. Spending the beam budget (ef) only at the bottom is what keeps HNSW fast while staying accurate.

We can now search a fully built hierarchy. The last missing piece is how that hierarchy gets built — insertion. 👉

Step 4 — Insertion & neighbor selection

Building the index means inserting vectors one at a time. Insertion reuses the search routine, then wires the new node into each layer using a clever neighbor-selection heuristic. This is the subtlest part of HNSW — we'll take it slowly.

Choosing the new node's top layer

Each new node gets a random maximum layer. The distribution is designed so layers thin out geometrically (most nodes only at layer 0, very few up high). The formula:

$$ \ell = \big\lfloor -\ln(U)\cdot m_L \big\rfloor, \qquad m_L = \frac{1}{\ln M}. $$

Here $U$ is a uniform random number in $(0,1]$, $\ln$ is the natural logarithm, and $\lfloor\cdot\rfloor$ rounds down. The $-\ln(U)$ part produces mostly small numbers and occasionally large ones; multiplying by $m_L = 1/\ln M$ tunes how fast the layers thin. With this rule, the probability of reaching each higher layer drops by a factor of about $M$.

import math, random

def random_level(M, rng):
    mL = 1.0 / math.log(M)
    return int(-math.log(rng.random()) * mL)

Running this many times for M=16 gives mostly 0, sometimes 1, rarely 2+ — exactly the pyramid we saw earlier (2000 → 136 → 9 nodes).

The insertion algorithm

def add(self, vector):
    v = self._prepare(vector)                       # normalize if cosine
    node = len(self.data)
    self.data.append(v)

    level = int(-math.log(self.rng.random()) * self.mL)   # the new node's top layer
    while len(self.graph) <= level:                 # grow the layer stack if needed
        self.graph.append({})
    for l in range(level + 1):
        self.graph[l].setdefault(node, [])

    if self.entry_point is None:                    # the very first node
        self.entry_point = node
        self.max_level = level
        return node

    ep = [self.entry_point]
    L = self.max_level

    # Phase 1: greedily descend the layers ABOVE the new node's level (ef=1),
    # exactly like a search, to arrive near where the node belongs.
    for l in range(L, level, -1):
        W = self._search_layer(v, ep, 1, l)
        ep = [min(W)[1]]

    # Phase 2: from the node's level down to 0, connect it into each layer.
    for l in range(min(L, level), -1, -1):
        W = self._search_layer(v, ep, self.ef_construction, l)   # gather candidates
        neighbors = self._select_neighbors(v, W, self.M)         # pick diverse M
        self.graph[l][node] = list(neighbors)

        Mmax = self.Mmax0 if l == 0 else self.Mmax
        for nb in neighbors:                        # links go both ways
            self.graph[l].setdefault(nb, []).append(node)
            if len(self.graph[l][nb]) > Mmax:       # neighbor now over capacity?
                nb_vec = self.data[nb]
                cand = [(self._dist(nb_vec, self.data[x]), x)
                        for x in self.graph[l][nb]]
                self.graph[l][nb] = self._select_neighbors(nb_vec, cand, Mmax)
        ep = [n for _, n in W]                       # carry candidates downward

    if level > self.max_level:                       # new tallest node -> new entry
        self.max_level = level
        self.entry_point = node
    return node

Walking through it

  • Pick a level and make sure the graph has that many layers.
  • First node? It just becomes the entry point — nothing to link to.
  • Phase 1 (approach): descend the layers above the new node's top level with ef=1, identical to searching, to find a good starting point near the new node.
  • Phase 2 (connect): for each layer the node belongs to, from its top level down to 0:
    • search that layer with breadth efConstruction to collect candidate neighbors,
    • choose M of them with the heuristic (below),
    • add bidirectional links,
    • if a neighbor now exceeds its cap (Mmax, or Mmax0=2M at layer 0), prune it back down by re-running the heuristic on its links.
  • New top? If the node's level beats the current maximum, it becomes the new global entry point.

efConstruction is the build-time beam width: larger means we consider more candidates per insertion, producing a higher-quality graph (better recall later) at the cost of slower building.

The neighbor-selection heuristic

The obvious choice — "just keep the M closest candidates" — builds a worse graph. It tends to link a node only to others in the same tight cluster, leaving few bridges between regions, so greedy search gets stuck.

HNSW uses a diversity heuristic (Algorithm 4 in the paper): keep a candidate only if it's closer to the new node than to any neighbor already chosen. This spreads links out in different directions and keeps the graph well-connected.

def _select_neighbors(self, base_vec, candidates, M):
    """Keep diverse neighbors, not just the closest ones."""
    result = []
    for d_e, e in sorted(candidates):              # consider nearest first
        keep = True
        for r in result:
            # is e closer to an already-picked neighbor than to base? -> redundant
            if self._dist(self.data[e], self.data[r]) < d_e:
                keep = False
                break
        if keep:
            result.append(e)
        if len(result) >= M:
            break
    return result

Why diversity matters (a picture)

keep-closest:                 diversity heuristic:
  new                            new
  /|\                           / | \
 o o o   (all in one clump)    o  |  o   (links fan out in different directions)
                                  o

With links fanning out, a greedy walk arriving from any direction can find its way through the node — that's the connectivity that makes search reliable.

The build-time knobs, recapped

KnobWhenBigger means
Mstructuremore links/node → better recall, more memory
efConstructionbuildingbetter-quality graph, slower build

That completes all four pieces. Next we assemble them into the full, runnable implementation. 👉

The full implementation

Here is the consolidated module that ties Steps 1–4 together — distances, the graph, single-layer greedy search, multi-layer search, and insertion with the neighbor heuristic. It also includes brute_force_knn so we can measure accuracy.

The runnable file lives at code/hnsw.py in this book's folder, reproduced below in full.

"""
HNSW (Hierarchical Navigable Small World) — from scratch in Python + NumPy.

Reference:
    Yu. A. Malkov, D. A. Yashunin.
    "Efficient and robust approximate nearest neighbor search using
     Hierarchical Navigable Small World graphs." IEEE TPAMI, 2018.
    (arXiv:1603.09320)

This is the consolidated, readable version of the step-by-step derivation in the
accompanying mdBook. It implements:

    HNSW.add(vector)            -> insert a point, returns its id
    HNSW.search(query, k)       -> approximate k nearest neighbors

Distances supported: "euclidean" (squared) and "cosine" (vectors auto-normalized).
Only NumPy + the standard library are used.
"""

from __future__ import annotations

import heapq
import math
import random

import numpy as np


class HNSW:
    def __init__(self, dim, M=16, ef_construction=200, ef=50,
                 distance="euclidean", seed=42):
        self.dim = dim
        self.M = M                       # target neighbors per node
        self.Mmax = M                    # cap per node above layer 0
        self.Mmax0 = 2 * M               # cap per node at layer 0 (denser)
        self.ef_construction = ef_construction
        self.ef = ef                     # default search breadth
        self.mL = 1.0 / math.log(M)      # level-generation normalizer
        self.rng = random.Random(seed)
        self.distance_name = distance

        self.data = []                   # data[id] = vector
        self.graph = []                  # graph[level][id] = list of neighbor ids
        self.entry_point = None          # id of the current top-layer entry
        self.max_level = -1
        self.dist_count = 0              # distance evaluations (for analysis)

    # ----------------------------------------------------------------- distance
    def _dist(self, a, b):
        self.dist_count += 1
        if self.distance_name == "cosine":
            return 1.0 - float(np.dot(a, b))     # vectors are pre-normalized
        d = a - b
        return float(np.dot(d, d))               # squared Euclidean (monotonic)

    def _prepare(self, v):
        v = np.asarray(v, dtype=np.float64)
        if self.distance_name == "cosine":
            n = np.linalg.norm(v)
            if n > 0:
                v = v / n
        return v

    # ------------------------------------------------------------ search a layer
    def _search_layer(self, q, entry_ids, ef, level):
        """Greedy best-first search within a single layer.

        Returns up to `ef` nearest found, as a list of (distance, id).
        """
        visited = set(entry_ids)
        candidates = []                  # min-heap by distance (explore nearest)
        results = []                     # max-heap by distance (-d) of the best ef
        for e in entry_ids:
            d = self._dist(q, self.data[e])
            heapq.heappush(candidates, (d, e))
            heapq.heappush(results, (-d, e))

        while candidates:
            d_c, c = heapq.heappop(candidates)
            if d_c > -results[0][0]:      # nearest candidate worse than worst result
                break
            for nb in self.graph[level].get(c, []):
                if nb in visited:
                    continue
                visited.add(nb)
                d_nb = self._dist(q, self.data[nb])
                if d_nb < -results[0][0] or len(results) < ef:
                    heapq.heappush(candidates, (d_nb, nb))
                    heapq.heappush(results, (-d_nb, nb))
                    if len(results) > ef:
                        heapq.heappop(results)
        return [(-d, n) for d, n in results]

    # -------------------------------------------------- neighbor selection heuristic
    def _select_neighbors(self, base_vec, candidates, M):
        """Algorithm 4 (heuristic): pick diverse neighbors, not just the closest.

        Keep candidate e only if it is closer to the base point than to any
        neighbor already chosen — this avoids clustering all links in one
        direction and keeps the graph well-connected.
        """
        result = []
        for d_e, e in sorted(candidates):            # nearest first
            keep = True
            for r in result:
                if self._dist(self.data[e], self.data[r]) < d_e:
                    keep = False
                    break
            if keep:
                result.append(e)
            if len(result) >= M:
                break
        return result

    # ----------------------------------------------------------------- insertion
    def add(self, vector):
        v = self._prepare(vector)
        node = len(self.data)
        self.data.append(v)

        # random level: geometric-like, so few nodes reach the high layers
        level = int(-math.log(self.rng.random()) * self.mL)
        while len(self.graph) <= level:
            self.graph.append({})
        for l in range(level + 1):
            self.graph[l].setdefault(node, [])

        if self.entry_point is None:                 # first ever point
            self.entry_point = node
            self.max_level = level
            return node

        ep = [self.entry_point]
        L = self.max_level

        # Phase 1: descend from the top to just above the new node's level (ef=1).
        for l in range(L, level, -1):
            W = self._search_layer(v, ep, 1, l)
            ep = [min(W)[1]]

        # Phase 2: from the new node's level down to 0, connect it in.
        for l in range(min(L, level), -1, -1):
            W = self._search_layer(v, ep, self.ef_construction, l)
            neighbors = self._select_neighbors(v, W, self.M)
            self.graph[l][node] = list(neighbors)

            Mmax = self.Mmax0 if l == 0 else self.Mmax
            for nb in neighbors:                     # links are bidirectional
                self.graph[l].setdefault(nb, []).append(node)
                if len(self.graph[l][nb]) > Mmax:    # prune over-full neighbor
                    nb_vec = self.data[nb]
                    cand = [(self._dist(nb_vec, self.data[x]), x)
                            for x in self.graph[l][nb]]
                    self.graph[l][nb] = self._select_neighbors(nb_vec, cand, Mmax)
            ep = [n for _, n in W]

        if level > self.max_level:                   # new top of the hierarchy
            self.max_level = level
            self.entry_point = node
        return node

    # -------------------------------------------------------------------- search
    def search(self, query, k=5, ef=None):
        """Return the approximate k nearest neighbors as [(id, distance), ...]."""
        if self.entry_point is None:
            return []
        ef = ef or max(self.ef, k)
        q = self._prepare(query)

        ep = [self.entry_point]
        for l in range(self.max_level, 0, -1):       # greedily descend the hierarchy
            W = self._search_layer(q, ep, 1, l)
            ep = [min(W)[1]]
        W = self._search_layer(q, ep, ef, 0)         # wide search at the base layer
        W.sort()
        return [(n, d) for d, n in W[:k]]

    def __len__(self):
        return len(self.data)


# ------------------------------------------------------------------ brute force
def brute_force_knn(data, query, k=5, distance="euclidean"):
    """Exact nearest neighbors, for measuring recall against HNSW."""
    X = np.asarray(data, dtype=np.float64)
    q = np.asarray(query, dtype=np.float64)
    if distance == "cosine":
        X = X / np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1e-12)
        q = q / max(np.linalg.norm(q), 1e-12)
        d = 1.0 - X @ q
    else:
        d = np.sum((X - q) ** 2, axis=1)
    idx = np.argsort(d)[:k]
    return [(int(i), float(d[i])) for i in idx]


__all__ = ["HNSW", "brute_force_knn"]

The {{#include}} directive splices in the real code/hnsw.py at build time, so the documentation can never drift from the actual implementation.

API at a glance

CallPurpose
HNSW(dim, M, ef_construction, ef, distance, seed)Create an empty index.
index.add(vector)Insert one vector; returns its integer id.
index.search(query, k, ef)Approximate k nearest neighbors as [(id, distance), …].
brute_force_knn(data, query, k, distance)Exact answer, for measuring recall.

Typical usage

import numpy as np
from hnsw import HNSW

vectors = np.load("embeddings.npy")      # shape (n, d)

index = HNSW(dim=vectors.shape[1], M=16, ef_construction=200, distance="cosine")
for v in vectors:
    index.add(v)

query = vectors[0]
for rank, (i, dist) in enumerate(index.search(query, k=5, ef=64), 1):
    print(rank, i, round(dist, 4))

A note on dist_count

The class increments self.dist_count on every distance computation. Reset it before a query (index.dist_count = 0) to measure how much work a search did — we use this in the demo to compare HNSW's effort against brute force in a way that isn't skewed by NumPy's vectorization.

What this is and isn't

This implementation is written for clarity: it shows exactly how HNSW works and produces correct, high-recall results. Production libraries (FAISS, hnswlib, Lucene) add heavy optimization — C/C++ inner loops, SIMD distance kernels, memory-packed graphs, multi-threaded builds, deletions, and filtered search. They share this exact algorithm; they just run it far faster. (See the use-cases chapter.)

Next: let's measure how well it actually works. 👉

A worked demo: recall vs. speed

Let's measure the two things that matter for an approximate method: how accurate is it (recall), and how much work does it save (distance computations). The full script is code/demo.py.

The setup

We build an index over 2000 random 32-dimensional vectors, then run 200 queries and compare HNSW's results to the exact answers from brute force.

import time
import numpy as np
from hnsw import HNSW, brute_force_knn

rng = np.random.default_rng(0)
N, D, Q, K = 2000, 32, 200, 10
data = rng.normal(size=(N, D))
queries = rng.normal(size=(Q, D))

index = HNSW(dim=D, M=16, ef_construction=200, distance="euclidean", seed=1)
for v in data:
    index.add(v)
  • Recall@10 = of the true 10 nearest neighbors, what fraction did HNSW return?
  • Distances/query = how many distance computations HNSW used. Brute force always uses N = 2000 (it checks everything).

Measuring the trade-off

We sweep the search-breadth knob ef and, for each value, record recall and the number of distance computations (using index.dist_count):

exact = [set(i for i, _ in brute_force_knn(data, q, k=K)) for q in queries]

for ef in [10, 20, 50, 100, 200]:
    hits, total_dists = 0, 0
    for q, ex in zip(queries, exact):
        index.dist_count = 0
        approx = set(i for i, _ in index.search(q, k=K, ef=ef))
        total_dists += index.dist_count
        hits += len(approx & ex)
    print(ef, hits / (Q * K), total_dists / Q)

The result

Running python demo.py produces:

dataset: 2000 points in 32 dimensions; 200 queries; k=10

built HNSW index in 8.37s (top layer = 2)

    ef   recall@10   dists/query   vs brute
  --------------------------------------------
    10       0.758           278      7.2x
    20       0.898           418      4.8x
    50       0.986           756      2.6x
   100       0.999          1129      1.8x
   200       1.000          1533      1.3x

Brute force examines all 2000 points per query.

How to read this

  • ef is the dial. Turning it up trades effort for accuracy: at ef=10 we examine ~278 points (7.2× fewer than brute force) for 76% recall; at ef=50 we hit 98.6% recall while still examining ~⅓ of the data; at ef=200 we get perfect recall. You choose the point on this curve your application needs — without rebuilding the index.
  • High recall is cheap. ~99% recall (ef=100) is the sweet spot for most search/RAG systems, and it's reached well before examining everything.

"But the speedup looks small!"

At only 2000 points the vs brute column is modest — and a wall-clock timer would even show this pure-Python HNSW losing to NumPy's vectorized brute force. That's expected and important to understand:

  1. NumPy brute force is vectorized C; our HNSW is a pure-Python graph walk with per-node interpreter overhead. At small N, that overhead dominates.
  2. The win is algorithmic and grows with N. Brute force is $O(N)$ — its work doubles when the data doubles, forever. HNSW's search cost grows roughly $O(\log N)$. The "distances/query" gap you see at N=2000 becomes a chasm at N=1,000,000: brute force does a million distance computations per query while HNSW still does only a few thousand.

That's why production systems with millions or billions of vectors use HNSW: it's the difference between a query taking milliseconds versus seconds. Our distance-count comparison shows the real algorithmic advantage that wall-clock timing on tiny data would hide.

Things to try

  • Raise N to 20,000 and watch the vs brute column widen (the build is pure Python, so be patient).
  • Increase M (e.g. 32): higher recall at the same ef, more memory.
  • Increase efConstruction: a better graph (higher recall) but slower build.
  • Switch to distance="cosine" for embedding-style data.

Now that it works, let's see where HNSW is actually used. 👉

Use cases: NLP, search & recommendation

HNSW solves one generic problem — find the nearest vectors, fast — and that problem sits at the center of modern search, NLP, and recommendation systems. This chapter connects the algorithm to how it's actually used.

The common pattern: embed → index → query

Almost every application follows the same three steps:

  1. Embed. Turn each item (document, product, image, user) into a vector with a model, so that similar items land near each other.
  2. Index. Insert all those vectors into an HNSW index (once, offline).
  3. Query. Embed the incoming query the same way and ask HNSW for the nearest vectors — those are your search results / recommendations / retrieved context.

HNSW is step 2–3. The quality of step 1 (the embeddings) decides what "similar" means; HNSW just finds the neighbors quickly.

Traditional keyword search matches words literally — a search for "car" misses a document that only says "automobile". Semantic search fixes this by comparing meaning: a text-embedding model maps sentences with similar meaning to nearby vectors, regardless of exact wording.

  • Embeddings: models like sentence-transformers or hosted embedding APIs turn text into vectors (often 384–1536 dimensions). Similar meaning → small cosine distance.
  • Index: all document embeddings go into an HNSW index (cosine distance).
  • Query: embed the user's question, HNSW returns the most semantically similar documents.

Our CLI tool demonstrates exactly this — with a simple from-scratch TF-IDF embedding so it needs no ML libraries. It correctly retrieves "Neural networks … speech recognition" for the query "deep learning for recognizing speech", even though they share almost no words.

Retrieval-Augmented Generation (RAG)

RAG is why HNSW is everywhere in 2024+. An LLM doesn't know your private documents, so:

1. Split your documents into chunks; embed each chunk; index with HNSW.   (offline)
2. User asks a question -> embed it -> HNSW finds the most relevant chunks.
3. Paste those chunks into the LLM prompt as context -> grounded answer.

Step 2 is an HNSW nearest-neighbor query. Every "chat with your docs" product, and the memory of many AI agents, is built on this loop. At scale (millions of chunks) the retrieval must be approximate — brute force would make each chat turn far too slow.

Search engines: OpenSearch, Elasticsearch, Lucene

Vector search is now a first-class feature of mainstream search engines:

  • OpenSearch and Elasticsearch offer a knn_vector field type; under the hood they use HNSW (via the nmslib/Lucene or FAISS engines) to index embeddings.
  • Apache Lucene has a native HNSW implementation, so anything built on Lucene inherits it.

A typical OpenSearch query asks for the k nearest vectors to a query embedding — that call runs an HNSW search across the shard's graph. The big practical wins:

  • Hybrid search: combine classic keyword (BM25) scores with vector similarity for the best of both literal and semantic matching.
  • Incremental updates: HNSW supports adding vectors without a full rebuild, which suits search indices that change constantly.
  • Filtering: restrict the neighbor search to documents matching metadata (e.g. "only in-stock products"), an active area of engineering on top of HNSW.

Recommendation systems & matrix factorization

Recommendations are a nearest-neighbor problem in disguise, and this is where the connection to matrix factorization comes in.

Where the vectors come from

Start with the interaction matrix $R$: rows are users, columns are items, and each entry is a rating or a click (mostly empty). Matrix factorization approximates this huge sparse matrix as the product of two thin matrices:

$$ R \approx U \, V^\top $$

  • $U$ has one row per user — that row is the user's embedding.
  • $V$ has one row per item — that row is the item's embedding.
  • Both live in the same $f$-dimensional "taste space" (e.g. $f = 64$).

The model is trained so that the dot product of a user vector and an item vector predicts how much that user will like that item:

$$ \hat{r}_{ui} = u_u \cdot v_i. $$

This is the classic approach behind the Netflix Prize and most collaborative filtering. (Modern "two-tower" neural recommenders do the same thing with deeper encoders — they still end up producing a user vector and item vectors compared by dot product.)

Why HNSW is the serving engine

At serving time, recommending for a user means: find the items whose vectors give the highest dot product with the user's vector — i.e. the user's nearest neighbors among item vectors. With millions of items you cannot score them all per request, so:

offline:  factorize R -> user vectors U, item vectors V
          build an HNSW index over the ITEM vectors V        (once)
online:   look up the user's vector u
          HNSW-search the item index with u -> top-N items   (milliseconds)

Maximizing dot product vs. nearest distance. Recommendation ranks by largest dot product (called MIPS — Maximum Inner Product Search), while HNSW as written finds smallest distance. The two line up exactly when vectors are normalized (cosine). When item magnitudes matter, a standard trick appends an extra coordinate to each vector so that ordinary nearest-neighbor distance reproduces the inner-product ranking — letting the same HNSW index serve recommendations.

The same pattern powers "similar items" ("customers also viewed"): index item vectors, then for a given item return its nearest item neighbors.

Other domains, same shape

  • Image / video / audio search — embed media with a vision/audio model, index, query by example ("find pictures like this one").
  • De-duplication & entity resolution — near-duplicate detection via nearest neighbors in embedding space.
  • Anomaly detection — points with no close neighbors are outliers.
  • Clustering at scale — nearest-neighbor graphs feed graph-based clustering.

When HNSW is (and isn't) the right tool

Great fit when:

  • you have many high-dimensional vectors and need fast top-k similarity,
  • a tiny recall loss is acceptable for big speedups,
  • the dataset grows incrementally (HNSW adds vectors without a rebuild).

Consider alternatives when:

  • the dataset is tiny (a few thousand) — brute force is simpler and, with vectorized math, often faster (as our demo shows);
  • you need exact guarantees;
  • memory is extremely tight — HNSW stores the graph on top of the vectors (quantization-based indexes like IVF-PQ trade recall for much smaller memory);
  • you do heavy deletions — HNSW handles updates well but deletions are awkward (usually soft-deleted and periodically rebuilt).

Next, a look back at the paper that introduced all this. 👉

The original paper: understand & reproduce

Like the KTS book, we built HNSW as if from thin air. It actually comes from a research paper, and this chapter tells that story: what the paper says, its lineage, how to read a paper like it, and an honest reflection on what we reproduced versus what production systems add.

The paper

Yu. A. Malkov, D. A. Yashunin. Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs. IEEE TPAMI, 2018 (arXiv:1603.09320).

The paper's contribution is the hierarchy. Malkov and co-authors had earlier proposed flat Navigable Small World (NSW) graphs (2014); those worked but could get trapped and scaled worse on hard datasets. HNSW adds the skip-list-style layered structure and a neighbor-selection heuristic, which together give state-of-the-art recall/speed and robustness. The four algorithms in the paper map directly onto our code:

PaperOur code
Algorithm 1 — INSERTHNSW.add
Algorithm 2 — SEARCH-LAYERHNSW._search_layer
Algorithm 3 — K-NN-SEARCHHNSW.search
Algorithm 4 — SELECT-NEIGHBORS-HEURISTICHNSW._select_neighbors

Lineage

  • Small-world / navigable graphs — Kleinberg's work on navigable small-world networks, and the NSW papers (Malkov et al., 2012–2014).
  • Skip lists — Pugh, 1990: the probabilistic layered structure HNSW borrows for its hierarchy.
  • Probabilistic routing — the idea that greedy walks on the right graph reach targets in $O(\log n)$ hops.

Recognizing this lineage helps: when the HNSW paper is terse (e.g. on why the level distribution is exponential), the skip-list literature explains it.

How to read a paper like this

The same three-pass approach as the KTS book:

  1. Skim (5 min). Abstract, figures (Figure 1 — the layered graph — is the whole idea), algorithm headers. Decide what you actually need.
  2. Method, carefully. Read the four algorithms and the parameter discussion. Write down, in your own words: the data structure, the search routine, the insert routine, the heuristic, and the parameters (M, efConstruction, ef, mL).
  3. Reproduce. Implement the algorithms and measure recall vs. a brute-force baseline — the only true test of understanding (our demo).

Tip: pseudocode in papers hides real decisions — heap orderings, the visited-set, tie-breaking, when to use ef=1 vs ef. Implementing forces you to resolve every one, which is exactly why reproduction teaches more than reading.

From paper to code: the process

  1. Build the primitive first. SEARCH-LAYER (Algorithm 2) is called by everything else, so we implemented and tested it first.
  2. Translate the data structures. "A graph per layer" → a list of dicts graph[level][node] = [neighbors]; the priority queues → Python heapq (with negated distances for the max-heap).
  3. Handle the details the pseudocode glosses over — the visited set, the stopping rule (nearest candidate worse than the worst kept result), descending upper layers with ef=1 but the base layer with full ef.
  4. Get the heuristic right. The keep-diverse rule (Algorithm 4) is what makes recall good; a naive "keep closest" version measurably underperforms.
  5. Validate against ground truth. Recall vs. brute force across ef — and sanity that recall → 1.0 as ef grows confirms correctness.

Reflection: what we built vs. production HNSW

Our implementation is algorithmically faithful — same four routines, same parameters, correct recall curve. What we deliberately left out is engineering for speed and scale, which is most of what a production library is:

AspectThis bookProduction (FAISS, hnswlib, Lucene)
Languagepure PythonC/C++ with SIMD distance kernels
Distanceone pair at a timebatched, hardware-accelerated
Graph storagedicts of listspacked contiguous arrays
Buildsingle-threadedmulti-threaded
Memoryfull float vectorsoptional quantization (PQ/SQ)
Deletes / filtersnot handledsoft-delete, filtered search

This is the same lesson as the KTS book: the algorithm is compact and learnable; the performance comes from careful systems work layered on top. A faithful from-scratch version is the right way to understand it, and a great baseline to validate an optimized version against.

A reflection on the parameters

The paper's biggest practical gift is which knobs exist and what they trade:

  • M — graph degree. Higher = better recall, more memory and slower build. Typical: 12–48.
  • efConstruction — build-time breadth. Higher = better graph, slower build. Typical: 100–500.
  • ef — query-time breadth. The runtime dial for recall vs. speed, tunable per query without rebuilding. Typical: 50–400.
  • mL — level normalizer, set to $1/\ln M$ in the paper; rarely changed.

Knowing these turns "HNSW is a black box" into "HNSW is three dials I understand".

See the References for the paper and its roots.

One-file CLI: semantic search

Everything in this book, distilled into one self-contained tool that does a genuinely useful job: semantic search over a text file. It builds vector embeddings, indexes them with our from-scratch HNSW, and answers queries by meaning — no machine-learning libraries required.

The steps

  1. Load documents (one per line) from a text file.
  2. Embed each document as a TF-IDF vector, built from scratch (term frequency × inverse document frequency — a classic, model-free text embedding).
  3. Index the vectors with HNSW (cosine distance).
  4. Query: embed the query the same way and HNSW-search for the nearest documents.
  5. Report the best matches with similarity scores.

There's also a VECTORS mode (--vectors embeddings.npy) to plug in real embeddings from a sentence model — the same index, better semantics.

Install & run

pip install numpy        # that's all TEXT mode needs

# one-shot query
python hnsw_search.py docs.txt --query "deep learning for recognizing speech"

# interactive
python hnsw_search.py docs.txt

# use real precomputed embeddings instead of TF-IDF
python hnsw_search.py docs.txt --vectors embeddings.npy --query "..."

It works — real output

With a small docs.txt of ten unrelated sentences:

$ python hnsw_search.py docs.txt --query "deep learning for recognizing speech" -k 3
loaded 10 documents from docs.txt
built TF-IDF index (vocab=81 terms)

query: 'deep learning for recognizing speech'
  1. (sim=0.312)  Neural networks are used for image and speech recognition tasks.
  2. (sim=0.217)  Machine learning models can recognize patterns in large datasets.
  3. (sim=0.096)  Vector embeddings turn words and sentences into numbers for search.
$ python hnsw_search.py docs.txt --query "how do search engines use vectors" -k 3
query: 'how do search engines use vectors'
  1. (sim=0.312)  Approximate nearest neighbor search powers modern vector databases.
  2. (sim=0.293)  Vector embeddings turn words and sentences into numbers for search.
  3. (sim=0.000)  The cat sat on the warm windowsill in the sunshine.
$ python hnsw_search.py docs.txt --query "baking homemade bread" -k 3
query: 'baking homemade bread'
  1. (sim=0.334)  Bake the bread at 220 degrees for about thirty minutes.

Each query surfaces the meaning-related document at the top — even when the words differ ("deep learning" → "neural networks"; "baking" → "bake"). That's HNSW serving semantic search.

The complete script

#!/usr/bin/env python3
"""
hnsw_search.py — a self-contained semantic search CLI built on HNSW.

Two modes:

  TEXT mode (default, no extra deps): build TF-IDF vectors from a text file
  (one document per line) entirely from scratch, index them with HNSW, and
  answer nearest-neighbor queries.

      python hnsw_search.py docs.txt --query "machine learning for search"
      python hnsw_search.py docs.txt          # interactive prompt

  VECTORS mode: use real precomputed embeddings (e.g. from a sentence model),
  one row per line in docs.txt.

      python hnsw_search.py docs.txt --vectors embeddings.npy --query "..."

The HNSW index is the from-scratch implementation in hnsw.py.

Requirements: numpy (and the local hnsw.py). No ML libraries needed for TEXT mode.
"""

from __future__ import annotations

import argparse
import re
import sys

import numpy as np

from hnsw import HNSW


# --------------------------------------------------------------------------- #
# From-scratch TF-IDF (so TEXT mode needs no ML libraries)
# --------------------------------------------------------------------------- #
def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


class Tfidf:
    """Minimal TF-IDF vectorizer: learns a vocabulary + IDF, builds vectors."""

    def fit(self, docs):
        self.vocab = {}
        df = {}
        for doc in docs:
            seen = set(tokenize(doc))
            for w in seen:
                df[w] = df.get(w, 0) + 1
        for w in sorted(df):
            self.vocab[w] = len(self.vocab)
        n = max(len(docs), 1)
        self.idf = np.zeros(len(self.vocab))
        for w, i in self.vocab.items():
            self.idf[i] = np.log((1 + n) / (1 + df[w])) + 1.0   # smoothed idf
        return self

    def transform(self, docs):
        rows = np.zeros((len(docs), len(self.vocab)))
        for r, doc in enumerate(docs):
            for w in tokenize(doc):
                j = self.vocab.get(w)
                if j is not None:
                    rows[r, j] += 1.0                          # term frequency
            rows[r] *= self.idf                                # * idf
        return rows


# --------------------------------------------------------------------------- #
# Index building
# --------------------------------------------------------------------------- #
def build_text_index(docs, M, ef_construction):
    vec = Tfidf().fit(docs)
    X = vec.transform(docs)
    index = HNSW(dim=X.shape[1], M=M, ef_construction=ef_construction,
                 distance="cosine", seed=1)
    for row in X:
        index.add(row)
    return index, vec


def build_vector_index(vectors, M, ef_construction):
    X = np.asarray(vectors, dtype=np.float64)
    index = HNSW(dim=X.shape[1], M=M, ef_construction=ef_construction,
                 distance="cosine", seed=1)
    for row in X:
        index.add(row)
    return index


# --------------------------------------------------------------------------- #
# CLI
# --------------------------------------------------------------------------- #
def main(argv=None):
    p = argparse.ArgumentParser(description="Semantic search over a text file "
                                            "using a from-scratch HNSW index.")
    p.add_argument("docs", help="text file, one document per line")
    p.add_argument("--query", help="query string (omit for interactive mode)")
    p.add_argument("-k", type=int, default=5, help="results to return (default 5)")
    p.add_argument("--ef", type=int, default=50, help="search breadth (default 50)")
    p.add_argument("--M", type=int, default=16, help="graph degree (default 16)")
    p.add_argument("--ef-construction", type=int, default=200,
                   help="build breadth (default 200)")
    p.add_argument("--vectors", help="optional .npy of precomputed embeddings")
    args = p.parse_args(argv)

    with open(args.docs, encoding="utf-8") as f:
        docs = [line.rstrip("\n") for line in f if line.strip()]
    print(f"loaded {len(docs)} documents from {args.docs}")

    if args.vectors:
        vectors = np.load(args.vectors)
        if len(vectors) != len(docs):
            sys.exit("error: number of vectors != number of documents")
        index = build_vector_index(vectors, args.M, args.ef_construction)
        vec = None
        print(f"indexed precomputed embeddings (dim={vectors.shape[1]})")
    else:
        index, vec = build_text_index(docs, args.M, args.ef_construction)
        print(f"built TF-IDF index (vocab={len(vec.vocab)} terms)")

    def run_query(text):
        if vec is not None:
            q = vec.transform([text])[0]
        else:
            sys.exit("VECTORS mode needs precomputed query embeddings")
        results = index.search(q, k=args.k, ef=args.ef)
        print(f"\nquery: {text!r}")
        for rank, (i, dist) in enumerate(results, 1):
            print(f"  {rank}. (sim={1 - dist:.3f})  {docs[i]}")

    if args.query:
        run_query(args.query)
    else:
        print("\ntype a query (empty line to quit):")
        try:
            while True:
                text = input("> ").strip()
                if not text:
                    break
                run_query(text)
        except (EOFError, KeyboardInterrupt):
            pass


if __name__ == "__main__":
    main()

That's the whole tool: load → TF-IDF embed → HNSW index → search, in one file you can read top to bottom. Swap TF-IDF for a real embedding model and you have the core of a production semantic-search or RAG retriever.

References

  1. Yu. A. Malkov, D. A. Yashunin. Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs. IEEE TPAMI, 2018. arXiv:1603.09320. — The HNSW paper.

  2. Yu. Malkov, A. Ponomarenko, A. Logvinov, V. Krylov. Approximate nearest neighbor algorithm based on navigable small world graphs. Information Systems, 2014. — The flat NSW predecessor HNSW builds on.

  3. William Pugh. Skip Lists: A Probabilistic Alternative to Balanced Trees. CACM, 1990. — The layered probabilistic structure HNSW borrows for its hierarchy.

  4. Jon Kleinberg. Navigation in a small world. Nature, 2000. — Why greedy routing succeeds on the right graphs.

  5. Y. Koren, R. Bell, C. Volinsky. Matrix Factorization Techniques for Recommender Systems. IEEE Computer, 2009. — The user/item embedding model behind the recommendation use case.

  • FAISS (Meta) — IndexHNSWFlat and many other ANN indexes.
  • hnswlib — the authors' compact C++/Python HNSW library.
  • Apache Lucene / OpenSearch / Elasticsearch — native HNSW vector search.
  • Vector databases — Qdrant, Weaviate, Milvus, Pinecone (HNSW-based indexes).
  • MIPS (Maximum Inner Product Search) — the recommendation ranking problem; reduces to nearest-neighbor search via normalization or an extra coordinate.
  • IVF-PQ / product quantization — memory-efficient ANN, often combined with graph indexes.

This book's code

All depend only on NumPy and the standard library.