The original paper: understand & reproduce
Like the KTS book, we built HNSW as if from thin air. It actually comes from a research paper, and this chapter tells that story: what the paper says, its lineage, how to read a paper like it, and an honest reflection on what we reproduced versus what production systems add.
The paper
Yu. A. Malkov, D. A. Yashunin. Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs. IEEE TPAMI, 2018 (arXiv:1603.09320).
The paper's contribution is the hierarchy. Malkov and co-authors had earlier proposed flat Navigable Small World (NSW) graphs (2014); those worked but could get trapped and scaled worse on hard datasets. HNSW adds the skip-list-style layered structure and a neighbor-selection heuristic, which together give state-of-the-art recall/speed and robustness. The four algorithms in the paper map directly onto our code:
| Paper | Our code |
|---|---|
Algorithm 1 — INSERT | HNSW.add |
Algorithm 2 — SEARCH-LAYER | HNSW._search_layer |
Algorithm 3 — K-NN-SEARCH | HNSW.search |
Algorithm 4 — SELECT-NEIGHBORS-HEURISTIC | HNSW._select_neighbors |
Lineage
- Small-world / navigable graphs — Kleinberg's work on navigable small-world networks, and the NSW papers (Malkov et al., 2012–2014).
- Skip lists — Pugh, 1990: the probabilistic layered structure HNSW borrows for its hierarchy.
- Probabilistic routing — the idea that greedy walks on the right graph reach targets in $O(\log n)$ hops.
Recognizing this lineage helps: when the HNSW paper is terse (e.g. on why the level distribution is exponential), the skip-list literature explains it.
How to read a paper like this
The same three-pass approach as the KTS book:
- Skim (5 min). Abstract, figures (Figure 1 — the layered graph — is the whole idea), algorithm headers. Decide what you actually need.
- Method, carefully. Read the four algorithms and the parameter discussion.
Write down, in your own words: the data structure, the search routine, the
insert routine, the heuristic, and the parameters (
M,efConstruction,ef,mL). - Reproduce. Implement the algorithms and measure recall vs. a brute-force baseline — the only true test of understanding (our demo).
Tip: pseudocode in papers hides real decisions — heap orderings, the visited-set, tie-breaking, when to use
ef=1vsef. Implementing forces you to resolve every one, which is exactly why reproduction teaches more than reading.
From paper to code: the process
- Build the primitive first.
SEARCH-LAYER(Algorithm 2) is called by everything else, so we implemented and tested it first. - Translate the data structures. "A graph per layer" → a list of dicts
graph[level][node] = [neighbors]; the priority queues → Pythonheapq(with negated distances for the max-heap). - Handle the details the pseudocode glosses over — the
visitedset, the stopping rule (nearest candidate worse than the worst kept result), descending upper layers withef=1but the base layer with fullef. - Get the heuristic right. The keep-diverse rule (Algorithm 4) is what makes recall good; a naive "keep closest" version measurably underperforms.
- Validate against ground truth. Recall vs. brute force across
ef— and sanity that recall → 1.0 asefgrows confirms correctness.
Reflection: what we built vs. production HNSW
Our implementation is algorithmically faithful — same four routines, same parameters, correct recall curve. What we deliberately left out is engineering for speed and scale, which is most of what a production library is:
| Aspect | This book | Production (FAISS, hnswlib, Lucene) |
|---|---|---|
| Language | pure Python | C/C++ with SIMD distance kernels |
| Distance | one pair at a time | batched, hardware-accelerated |
| Graph storage | dicts of lists | packed contiguous arrays |
| Build | single-threaded | multi-threaded |
| Memory | full float vectors | optional quantization (PQ/SQ) |
| Deletes / filters | not handled | soft-delete, filtered search |
This is the same lesson as the KTS book: the algorithm is compact and learnable; the performance comes from careful systems work layered on top. A faithful from-scratch version is the right way to understand it, and a great baseline to validate an optimized version against.
A reflection on the parameters
The paper's biggest practical gift is which knobs exist and what they trade:
M— graph degree. Higher = better recall, more memory and slower build. Typical: 12–48.efConstruction— build-time breadth. Higher = better graph, slower build. Typical: 100–500.ef— query-time breadth. The runtime dial for recall vs. speed, tunable per query without rebuilding. Typical: 50–400.mL— level normalizer, set to $1/\ln M$ in the paper; rarely changed.
Knowing these turns "HNSW is a black box" into "HNSW is three dials I understand".
See the References for the paper and its roots.