Use cases: NLP, search & recommendation
HNSW solves one generic problem — find the nearest vectors, fast — and that problem sits at the center of modern search, NLP, and recommendation systems. This chapter connects the algorithm to how it's actually used.
The common pattern: embed → index → query
Almost every application follows the same three steps:
- Embed. Turn each item (document, product, image, user) into a vector with a model, so that similar items land near each other.
- Index. Insert all those vectors into an HNSW index (once, offline).
- Query. Embed the incoming query the same way and ask HNSW for the nearest vectors — those are your search results / recommendations / retrieved context.
HNSW is step 2–3. The quality of step 1 (the embeddings) decides what "similar" means; HNSW just finds the neighbors quickly.
NLP and semantic search
Traditional keyword search matches words literally — a search for "car" misses a document that only says "automobile". Semantic search fixes this by comparing meaning: a text-embedding model maps sentences with similar meaning to nearby vectors, regardless of exact wording.
- Embeddings: models like sentence-transformers or hosted embedding APIs turn text into vectors (often 384–1536 dimensions). Similar meaning → small cosine distance.
- Index: all document embeddings go into an HNSW index (cosine distance).
- Query: embed the user's question, HNSW returns the most semantically similar documents.
Our CLI tool demonstrates exactly this — with a simple from-scratch TF-IDF embedding so it needs no ML libraries. It correctly retrieves "Neural networks … speech recognition" for the query "deep learning for recognizing speech", even though they share almost no words.
Retrieval-Augmented Generation (RAG)
RAG is why HNSW is everywhere in 2024+. An LLM doesn't know your private documents, so:
1. Split your documents into chunks; embed each chunk; index with HNSW. (offline)
2. User asks a question -> embed it -> HNSW finds the most relevant chunks.
3. Paste those chunks into the LLM prompt as context -> grounded answer.
Step 2 is an HNSW nearest-neighbor query. Every "chat with your docs" product, and the memory of many AI agents, is built on this loop. At scale (millions of chunks) the retrieval must be approximate — brute force would make each chat turn far too slow.
Search engines: OpenSearch, Elasticsearch, Lucene
Vector search is now a first-class feature of mainstream search engines:
- OpenSearch and Elasticsearch offer a
knn_vectorfield type; under the hood they use HNSW (via the nmslib/Lucene or FAISS engines) to index embeddings. - Apache Lucene has a native HNSW implementation, so anything built on Lucene inherits it.
A typical OpenSearch query asks for the k nearest vectors to a query embedding —
that call runs an HNSW search across the shard's graph. The big practical wins:
- Hybrid search: combine classic keyword (BM25) scores with vector similarity for the best of both literal and semantic matching.
- Incremental updates: HNSW supports adding vectors without a full rebuild, which suits search indices that change constantly.
- Filtering: restrict the neighbor search to documents matching metadata (e.g. "only in-stock products"), an active area of engineering on top of HNSW.
Recommendation systems & matrix factorization
Recommendations are a nearest-neighbor problem in disguise, and this is where the connection to matrix factorization comes in.
Where the vectors come from
Start with the interaction matrix $R$: rows are users, columns are items, and each entry is a rating or a click (mostly empty). Matrix factorization approximates this huge sparse matrix as the product of two thin matrices:
$$ R \approx U \, V^\top $$
- $U$ has one row per user — that row is the user's embedding.
- $V$ has one row per item — that row is the item's embedding.
- Both live in the same $f$-dimensional "taste space" (e.g. $f = 64$).
The model is trained so that the dot product of a user vector and an item vector predicts how much that user will like that item:
$$ \hat{r}_{ui} = u_u \cdot v_i. $$
This is the classic approach behind the Netflix Prize and most collaborative filtering. (Modern "two-tower" neural recommenders do the same thing with deeper encoders — they still end up producing a user vector and item vectors compared by dot product.)
Why HNSW is the serving engine
At serving time, recommending for a user means: find the items whose vectors give the highest dot product with the user's vector — i.e. the user's nearest neighbors among item vectors. With millions of items you cannot score them all per request, so:
offline: factorize R -> user vectors U, item vectors V
build an HNSW index over the ITEM vectors V (once)
online: look up the user's vector u
HNSW-search the item index with u -> top-N items (milliseconds)
Maximizing dot product vs. nearest distance. Recommendation ranks by largest dot product (called MIPS — Maximum Inner Product Search), while HNSW as written finds smallest distance. The two line up exactly when vectors are normalized (cosine). When item magnitudes matter, a standard trick appends an extra coordinate to each vector so that ordinary nearest-neighbor distance reproduces the inner-product ranking — letting the same HNSW index serve recommendations.
The same pattern powers "similar items" ("customers also viewed"): index item vectors, then for a given item return its nearest item neighbors.
Other domains, same shape
- Image / video / audio search — embed media with a vision/audio model, index, query by example ("find pictures like this one").
- De-duplication & entity resolution — near-duplicate detection via nearest neighbors in embedding space.
- Anomaly detection — points with no close neighbors are outliers.
- Clustering at scale — nearest-neighbor graphs feed graph-based clustering.
When HNSW is (and isn't) the right tool
Great fit when:
- you have many high-dimensional vectors and need fast top-
ksimilarity, - a tiny recall loss is acceptable for big speedups,
- the dataset grows incrementally (HNSW adds vectors without a rebuild).
Consider alternatives when:
- the dataset is tiny (a few thousand) — brute force is simpler and, with vectorized math, often faster (as our demo shows);
- you need exact guarantees;
- memory is extremely tight — HNSW stores the graph on top of the vectors (quantization-based indexes like IVF-PQ trade recall for much smaller memory);
- you do heavy deletions — HNSW handles updates well but deletions are awkward (usually soft-deleted and periodically rebuilt).
Next, a look back at the paper that introduced all this. 👉