Use cases & how to choose

You've now built every piece that powers industrial vector search. This chapter connects them to the real world — especially FAISS, the library where these algorithms live in production — and gives you a decision guide.

FAISS: where these algorithms actually run

FAISS (Facebook AI Similarity Search) is the most widely used library for vector similarity search, and it is essentially a highly optimized, composable version of what this book built. Each piece from the book maps directly to a FAISS index:

This book         FAISS index                   What it is
flat_knn          IndexFlatL2 / IndexFlatIP     exact brute force
ScalarQuantizer   IndexScalarQuantizer (SQ8)    1 byte/dim, 4× smaller
IVF               IndexIVFFlat                  partition + probe, full vectors
PQ                IndexPQ                       product-quantized codes
IVFPQ             IndexIVFPQ                    partition + PQ residuals (the workhorse)
(HNSW book)       IndexHNSWFlat                 graph navigation

The FAISS "index factory"

FAISS lets you describe an index with a short string, and it assembles the pieces — the same Lego blocks from this book:

"Flat"                exact search
"IVF4096,Flat"        IVF with 4096 cells, full vectors        (our IVF)
"PQ16"                product quantization, 16 bytes/vector     (our PQ)
"IVF4096,PQ16"        IVF + PQ residuals                        (our IVFPQ)
"OPQ16,IVF4096,PQ16"  add a learned rotation before PQ (higher recall)
"HNSW32"              HNSW graph, 32 links/node                 (the HNSW book)
"IVF4096,PQ16,RFlat"  IVFPQ + exact re-ranking (RFlat = refine) (our re-rank trick)

Reading those strings is now easy: each comma-separated piece is a stage you understand. OPQ is one we didn't build — a learned rotation of the data before PQ that spreads information evenly across sub-vectors, measurably improving recall; it's a drop-in upgrade to the PQ stage.
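To make the "each comma-separated piece is a stage" reading concrete, here is a minimal sketch in plain Python. The describe_factory helper and its PATTERNS table are hypothetical, written only for this illustration; they are not part of FAISS and cover just the tokens listed above, not the full factory grammar.

```python
import re

# Hypothetical lookup table: one regex per token used in this chapter.
# The number captured from the token is substituted into the description.
PATTERNS = [
    (r"Flat", "store full vectors and compare exactly"),
    (r"RFlat", "re-rank the shortlist with exact distances"),
    (r"IVF(\d+)", "partition into {0} cells and probe a few per query"),
    (r"OPQ(\d+)", "apply a learned rotation before PQ ({0} sub-vectors)"),
    (r"PQ(\d+)", "product-quantize into {0} sub-vector codes"),
    (r"HNSW(\d+)", "navigate an HNSW graph with {0} links per node"),
]


def describe_factory(spec):
    """Return one plain-English description per comma-separated stage."""
    stages = []
    for token in spec.split(","):
        for pattern, template in PATTERNS:
            # fullmatch keeps "RFlat" from matching "Flat" and "OPQ16" from matching "PQ16".
            match = re.fullmatch(pattern, token)
            if match:
                stages.append(template.format(*match.groups()))
                break
        else:
            raise ValueError(f"unknown factory token: {token!r}")
    return stages


for stage in describe_factory("OPQ16,IVF4096,PQ16"):
    print("-", stage)
```

In FAISS itself, the same string would be passed to faiss.index_factory together with the vector dimension; the sketch above only mirrors how a reader decodes it.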

Why FAISS is fast (and our version isn't)

Same algorithms, serious engineering: SIMD/AVX distance kernels, a FastScan variant of PQ that does its table lookups in SIMD registers, GPU implementations, multi-threaded training on a sample of the data, and memory-mapped on-disk indexes for datasets too big for RAM. Our from-scratch version is for understanding; FAISS is for shipping, and it doubles as a reference to validate your understanding against.

As in the HNSW book, almost every application follows the same three steps:

  1. Embed items (text/images/users) into vectors with a model.
  2. Index them — and at scale, that means IVFPQ (or HNSW) to fit memory and stay fast.
  3. Search with a query embedding; optionally re-rank the shortlist exactly.

IVF/PQ is the part that makes step 2 survive billions of items.
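The three steps above can be sketched end to end in a few lines of plain Python. Everything here is a stand-in chosen for illustration: embed is a hypothetical letter-frequency "model" rather than a real embedding network, and the index is a plain list (the Flat case), which is exactly the part IVFPQ or HNSW would replace at scale.

```python
import math


def embed(text):
    # Step 1 (stand-in): a 26-dim letter-frequency vector, L2-normalized.
    # A real system would call an embedding model here.
    counts = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            counts[ord(ch) - ord("a")] += 1.0
    norm = math.sqrt(sum(c * c for c in counts)) or 1.0
    return [c / norm for c in counts]


def build_index(items):
    # Step 2: store (item, vector) pairs. This is brute-force storage;
    # at scale this is where a compressed index would go instead.
    return [(item, embed(item)) for item in items]


def search(index, query, k=2):
    # Step 3: score every item by inner product (cosine, since vectors
    # are normalized) and return the k best items.
    q = embed(query)
    scored = [(sum(a * b for a, b in zip(q, vec)), item) for item, vec in index]
    scored.sort(reverse=True)
    return [item for _, item in scored[:k]]


docs = ["vector search", "product quantization", "graph navigation"]
index = build_index(docs)
print(search(index, "searching vectors", k=2))
```

Because the index here is exact, no separate re-ranking pass is needed; with a compressed index, the shortlist returned by search would be re-scored against the full vectors.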

Where it's used

  • Large-scale semantic search & RAG. When a corpus has tens of millions of chunks, storing full float vectors is too costly; IVFPQ compresses them and keeps retrieval fast. The retrieved shortlist is then re-ranked (and fed to an LLM, for RAG).
  • Recommendations. Item embeddings from matrix factorization or two-tower models (see the HNSW use-cases) are indexed with IVFPQ; "similar items" and per-user candidate generation become compressed nearest-neighbor lookups over hundreds of millions of items.
  • Image / video / audio search at web scale — billions of media embeddings compressed to bytes.
  • De-duplication & clustering of massive corpora, where keeping full vectors in RAM is infeasible.
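The memory argument behind these use cases is simple arithmetic, sketched below. The helper names, the corpus size (100 million vectors), and the dimension (128) are assumptions picked for illustration, and the estimate counts only the stored codes: it ignores IDs, centroids, codebooks, and any graph links.

```python
def bytes_per_vector(dim, scheme, pq_m=16):
    """Storage for one vector under each scheme discussed in this book."""
    if scheme == "flat":
        return dim * 4      # float32: 4 bytes per dimension
    if scheme == "sq8":
        return dim          # scalar quantizer: 1 byte per dimension
    if scheme == "pq":
        return pq_m         # PQ: 1 byte per sub-vector at 8 bits per code
    raise ValueError(f"unknown scheme: {scheme!r}")


def footprint_gb(n_vectors, dim, scheme, pq_m=16):
    """Total code storage in gigabytes (1 GB = 1e9 bytes)."""
    return n_vectors * bytes_per_vector(dim, scheme, pq_m) / 1e9


n, dim = 100_000_000, 128   # assumed corpus size and dimension
for scheme in ("flat", "sq8", "pq"):
    print(f"{scheme:5s} {footprint_gb(n, dim, scheme):8.1f} GB")
```

Under these assumptions, full float vectors need 51.2 GB, SQ8 needs 12.8 GB, and PQ with 16 sub-vectors needs 1.6 GB, a 32× reduction over float32.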

How to choose: a decision guide

Situation                        Use
Small data (≤ ~10k–100k)         Flat (exact). Simple, and vectorized brute force is plenty fast.
Need speed, RAM is fine          HNSW or IVFFlat. Full vectors, near-exact, fast.
RAM-constrained, big data        PQ / IVFPQ. Compress 16–64×; re-rank for recall.
Billions of vectors              IVFPQ (often OPQ,IVF…,PQ…), the standard at scale.
Maximum recall, RAM available    HNSW (optionally on SQ8) + exact re-rank.
Need exact guarantees            Flat only.

Rules of thumb:

  • Start exact. If brute force is fast enough, don't add complexity.
  • Hit a speed wall? Add IVF (or HNSW).
  • Hit a memory wall? Add PQ, and always pair it with re-ranking.
  • Both walls (web scale)? IVFPQ, tuned via nprobe and re-rank size.
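The rules of thumb above amount to a small decision function. The sketch below is one way to write them down; choose_index and its three boolean inputs are hypothetical names for this illustration, and the returned strings are informal labels rather than FAISS factory strings.

```python
def choose_index(need_exact, too_slow, too_big):
    """Encode the rules of thumb.

    need_exact: results must be guaranteed exact
    too_slow:   brute-force search misses the latency target (speed wall)
    too_big:    full vectors do not fit the memory budget (memory wall)
    """
    if need_exact or not (too_slow or too_big):
        return "Flat"                      # start exact; stay there if it works
    if too_slow and too_big:
        return "IVFPQ + re-rank"           # both walls: tune nprobe and re-rank size
    if too_big:
        return "PQ + re-rank"              # memory wall: compress, then re-rank
    return "IVF or HNSW"                   # speed wall only


print(choose_index(need_exact=False, too_slow=True, too_big=True))
```

Treat the output as a starting point: the actual choice still depends on measured recall, latency, and memory for the dataset at hand.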

How IVF/PQ and HNSW relate

They're complementary layers, often combined:

  • HNSW as a coarse quantizer. FAISS can replace IVF's flat list of centroids with an HNSW graph over the centroids (IndexHNSWFlat as the quantizer), making the "which cells?" step itself fast when nlist is huge.
  • HNSW on compressed vectors. Run the graph over PQ/SQ-compressed vectors to cut the graph's memory.
  • Division of labor. HNSW = fast navigation; IVF = partitioning; PQ = compression. Real systems mix the three to hit a target recall, latency, and memory budget simultaneously.

Next, a look at the papers that introduced these ideas.