A worked demo: recall vs. speed

Let's measure the two things that matter for an approximate method: how accurate is it (recall), and how much work does it save (distance computations). The full script is code/demo.py.

The setup

We build an index over 2000 random 32-dimensional vectors, then run 200 queries and compare HNSW's results to the exact answers from brute force.

import time
import numpy as np
from hnsw import HNSW, brute_force_knn

rng = np.random.default_rng(0)
N, D, Q, K = 2000, 32, 200, 10
data = rng.normal(size=(N, D))
queries = rng.normal(size=(Q, D))

index = HNSW(dim=D, M=16, ef_construction=200, distance="euclidean", seed=1)
for v in data:
    index.add(v)
  • Recall@10 = of the true 10 nearest neighbors, what fraction did HNSW return?
  • Distances/query = how many distance computations HNSW used. Brute force always uses N = 2000 (it checks everything).

Measuring the trade-off

We sweep the search-breadth knob ef and, for each value, record recall and the number of distance computations (using index.dist_count):

exact = [set(i for i, _ in brute_force_knn(data, q, k=K)) for q in queries]

for ef in [10, 20, 50, 100, 200]:
    hits, total_dists = 0, 0
    for q, ex in zip(queries, exact):
        index.dist_count = 0
        approx = set(i for i, _ in index.search(q, k=K, ef=ef))
        total_dists += index.dist_count
        hits += len(approx & ex)
    print(ef, hits / (Q * K), total_dists / Q)

The result

Running python demo.py produces:

dataset: 2000 points in 32 dimensions; 200 queries; k=10

built HNSW index in 8.37s (top layer = 2)

    ef   recall@10   dists/query   vs brute
  --------------------------------------------
    10       0.758           278      7.2x
    20       0.898           418      4.8x
    50       0.986           756      2.6x
   100       0.999          1129      1.8x
   200       1.000          1533      1.3x

Brute force examines all 2000 points per query.

How to read this

  • ef is the dial. Turning it up trades effort for accuracy: at ef=10 we examine ~278 points (7.2× fewer than brute force) for 76% recall; at ef=50 we hit 98.6% recall while still examining ~⅓ of the data; at ef=200 we get perfect recall. You choose the point on this curve your application needs — without rebuilding the index.
  • High recall is cheap. ~99% recall (ef=100) is the sweet spot for most search/RAG systems, and it's reached well before examining everything.

"But the speedup looks small!"

At only 2000 points the vs brute column is modest — and a wall-clock timer would even show this pure-Python HNSW losing to NumPy's vectorized brute force. That's expected and important to understand:

  1. NumPy brute force is vectorized C; our HNSW is a pure-Python graph walk with per-node interpreter overhead. At small N, that overhead dominates.
  2. The win is algorithmic and grows with N. Brute force is $O(N)$ — its work doubles when the data doubles, forever. HNSW's search cost grows roughly $O(\log N)$. The "distances/query" gap you see at N=2000 becomes a chasm at N=1,000,000: brute force does a million distance computations per query while HNSW still does only a few thousand.

That's why production systems with millions or billions of vectors use HNSW: it's the difference between a query taking milliseconds versus seconds. Our distance-count comparison shows the real algorithmic advantage that wall-clock timing on tiny data would hide.

Things to try

  • Raise N to 20,000 and watch the vs brute column widen (the build is pure Python, so be patient).
  • Increase M (e.g. 32): higher recall at the same ef, more memory.
  • Increase efConstruction: a better graph (higher recall) but slower build.
  • Switch to distance="cosine" for embedding-style data.

Now that it works, let's see where HNSW is actually used. 👉