Use cases: NLP, search & recommendation

HNSW solves one generic problem — find the nearest vectors, fast — and that problem sits at the center of modern search, NLP, and recommendation systems. This chapter connects the algorithm to how it's actually used.

The common pattern: embed → index → query

Almost every application follows the same three steps:

  1. Embed. Turn each item (document, product, image, user) into a vector with a model, so that similar items land near each other.
  2. Index. Insert all those vectors into an HNSW index (once, offline).
  3. Query. Embed the incoming query the same way and ask HNSW for the nearest vectors — those are your search results / recommendations / retrieved context.

HNSW is step 2–3. The quality of step 1 (the embeddings) decides what "similar" means; HNSW just finds the neighbors quickly.

Traditional keyword search matches words literally — a search for "car" misses a document that only says "automobile". Semantic search fixes this by comparing meaning: a text-embedding model maps sentences with similar meaning to nearby vectors, regardless of exact wording.

  • Embeddings: models like sentence-transformers or hosted embedding APIs turn text into vectors (often 384–1536 dimensions). Similar meaning → small cosine distance.
  • Index: all document embeddings go into an HNSW index (cosine distance).
  • Query: embed the user's question, HNSW returns the most semantically similar documents.

Our CLI tool demonstrates exactly this — with a simple from-scratch TF-IDF embedding so it needs no ML libraries. It correctly retrieves "Neural networks … speech recognition" for the query "deep learning for recognizing speech", even though they share almost no words.

Retrieval-Augmented Generation (RAG)

RAG is why HNSW is everywhere in 2024+. An LLM doesn't know your private documents, so:

1. Split your documents into chunks; embed each chunk; index with HNSW.   (offline)
2. User asks a question -> embed it -> HNSW finds the most relevant chunks.
3. Paste those chunks into the LLM prompt as context -> grounded answer.

Step 2 is an HNSW nearest-neighbor query. Every "chat with your docs" product, and the memory of many AI agents, is built on this loop. At scale (millions of chunks) the retrieval must be approximate — brute force would make each chat turn far too slow.

Search engines: OpenSearch, Elasticsearch, Lucene

Vector search is now a first-class feature of mainstream search engines:

  • OpenSearch and Elasticsearch offer a knn_vector field type; under the hood they use HNSW (via the nmslib/Lucene or FAISS engines) to index embeddings.
  • Apache Lucene has a native HNSW implementation, so anything built on Lucene inherits it.

A typical OpenSearch query asks for the k nearest vectors to a query embedding — that call runs an HNSW search across the shard's graph. The big practical wins:

  • Hybrid search: combine classic keyword (BM25) scores with vector similarity for the best of both literal and semantic matching.
  • Incremental updates: HNSW supports adding vectors without a full rebuild, which suits search indices that change constantly.
  • Filtering: restrict the neighbor search to documents matching metadata (e.g. "only in-stock products"), an active area of engineering on top of HNSW.

Recommendation systems & matrix factorization

Recommendations are a nearest-neighbor problem in disguise, and this is where the connection to matrix factorization comes in.

Where the vectors come from

Start with the interaction matrix $R$: rows are users, columns are items, and each entry is a rating or a click (mostly empty). Matrix factorization approximates this huge sparse matrix as the product of two thin matrices:

$$ R \approx U \, V^\top $$

  • $U$ has one row per user — that row is the user's embedding.
  • $V$ has one row per item — that row is the item's embedding.
  • Both live in the same $f$-dimensional "taste space" (e.g. $f = 64$).

The model is trained so that the dot product of a user vector and an item vector predicts how much that user will like that item:

$$ \hat{r}_{ui} = u_u \cdot v_i. $$

This is the classic approach behind the Netflix Prize and most collaborative filtering. (Modern "two-tower" neural recommenders do the same thing with deeper encoders — they still end up producing a user vector and item vectors compared by dot product.)

Why HNSW is the serving engine

At serving time, recommending for a user means: find the items whose vectors give the highest dot product with the user's vector — i.e. the user's nearest neighbors among item vectors. With millions of items you cannot score them all per request, so:

offline:  factorize R -> user vectors U, item vectors V
          build an HNSW index over the ITEM vectors V        (once)
online:   look up the user's vector u
          HNSW-search the item index with u -> top-N items   (milliseconds)

Maximizing dot product vs. nearest distance. Recommendation ranks by largest dot product (called MIPS — Maximum Inner Product Search), while HNSW as written finds smallest distance. The two line up exactly when vectors are normalized (cosine). When item magnitudes matter, a standard trick appends an extra coordinate to each vector so that ordinary nearest-neighbor distance reproduces the inner-product ranking — letting the same HNSW index serve recommendations.

The same pattern powers "similar items" ("customers also viewed"): index item vectors, then for a given item return its nearest item neighbors.

Other domains, same shape

  • Image / video / audio search — embed media with a vision/audio model, index, query by example ("find pictures like this one").
  • De-duplication & entity resolution — near-duplicate detection via nearest neighbors in embedding space.
  • Anomaly detection — points with no close neighbors are outliers.
  • Clustering at scale — nearest-neighbor graphs feed graph-based clustering.

When HNSW is (and isn't) the right tool

Great fit when:

  • you have many high-dimensional vectors and need fast top-k similarity,
  • a tiny recall loss is acceptable for big speedups,
  • the dataset grows incrementally (HNSW adds vectors without a rebuild).

Consider alternatives when:

  • the dataset is tiny (a few thousand) — brute force is simpler and, with vectorized math, often faster (as our demo shows);
  • you need exact guarantees;
  • memory is extremely tight — HNSW stores the graph on top of the vectors (quantization-based indexes like IVF-PQ trade recall for much smaller memory);
  • you do heavy deletions — HNSW handles updates well but deletions are awkward (usually soft-deleted and periodically rebuilt).

Next, a look back at the paper that introduced all this. 👉