Use cases & how to choose
You've now built every piece that powers industrial vector search. This chapter connects them to the real world — especially FAISS, the library where these algorithms live in production — and gives you a decision guide.
FAISS: where these algorithms actually run
FAISS (Facebook AI Similarity Search) is one of the most widely used libraries for vector similarity search, and it is essentially a highly optimized, composable version of what this book built. Every class here maps directly to a FAISS index:
| This book | FAISS index | What it is |
|---|---|---|
| flat_knn | IndexFlatL2 / IndexFlatIP | exact brute force |
| ScalarQuantizer | IndexScalarQuantizer (SQ8) | 1 byte/dim, 4× smaller |
| IVF | IndexIVFFlat | partition + probe, full vectors |
| PQ | IndexPQ | product-quantized codes |
| IVFPQ | IndexIVFPQ | partition + PQ residuals (the workhorse) |
| (HNSW book) | IndexHNSWFlat | graph navigation |
The FAISS "index factory"
FAISS lets you describe an index with a short string, and it assembles the pieces, the same Lego blocks from this book:
| Factory string | What it builds |
|---|---|
| "Flat" | exact search |
| "IVF4096,Flat" | IVF with 4096 cells, full vectors (our IVF) |
| "PQ16" | product quantization, 16 bytes/vector (our PQ) |
| "IVF4096,PQ16" | IVF + PQ residuals (our IVFPQ) |
| "OPQ16,IVF4096,PQ16" | add a learned rotation before PQ (higher recall) |
| "HNSW32" | HNSW graph, 32 links/node (the HNSW book) |
| "IVF4096,PQ16,RFlat" | IVFPQ + exact re-ranking (RFlat = refine; our re-rank trick) |
Reading those strings is now easy: each comma-separated piece is a stage you
understand. OPQ is one we didn't build — a learned rotation of the data before
PQ that spreads information evenly across sub-vectors, measurably improving recall;
it's a drop-in upgrade to the PQ stage.
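To make the "each comma-separated piece is a stage" reading concrete, here is a minimal sketch that splits a factory string and describes each stage in words. The `describe_factory` helper and its `STAGES` table are hypothetical names made up for illustration; they cover only the strings listed above and are not how FAISS parses factory strings internally.

```python
import re

# Hypothetical stage table covering only the factory strings listed above.
# FAISS's own parser accepts many more forms; this is not its implementation.
STAGES = [
    (r"Flat", "full vectors, compared exactly"),
    (r"IVF(\d+)", "IVF partition with {} cells"),
    (r"OPQ\d+", "learned rotation before PQ"),
    (r"PQ(\d+)", "product quantization, {} bytes/vector"),
    (r"HNSW(\d+)", "HNSW graph, {} links/node"),
    (r"RFlat", "exact re-ranking of the shortlist"),
]


def describe_factory(spec):
    """Return one plain-English description per comma-separated stage."""
    descriptions = []
    for piece in spec.split(","):
        for pattern, text in STAGES:
            match = re.fullmatch(pattern, piece)
            if match:
                # Fill in the captured number (cells, bytes, links), if any.
                descriptions.append(text.format(*match.groups()))
                break
        else:
            raise ValueError(f"unrecognized stage: {piece!r}")
    return descriptions


for spec in ("IVF4096,PQ16", "OPQ16,IVF4096,PQ16", "IVF4096,PQ16,RFlat"):
    print(spec, "->", describe_factory(spec))
```

The stages are read left to right, which is also the order in which a vector passes through them: optional rotation, then partitioning, then compression, then optional refinement.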
Why FAISS is fast (and our version isn't)
Same algorithms, serious engineering: SIMD/AVX distance kernels, a
PQFastScan variant that keeps PQ lookup tables in SIMD registers, GPU
implementations, multi-threaded training on a sample of the data, and
memory-mapped on-disk indexes for datasets too big for RAM. Our from-scratch
version is for understanding; FAISS is for shipping, and for checking your
understanding against.
The common pattern: embed → index → search
As in the HNSW book, almost every application is:
1. Embed items (text/images/users) into vectors with a model.
2. Index them. At scale, that means IVFPQ (or HNSW) to fit memory and stay fast.
3. Search with a query embedding; optionally re-rank the shortlist exactly.
IVF/PQ is the part that makes step 2 survive billions of items.
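The three steps can be sketched end to end in plain Python. Everything here is a toy stand-in chosen for illustration: `embed` counts words over a tiny vocabulary instead of calling a real model, and the "index" is just a rounded copy of each vector standing in for a compressed representation such as PQ codes. The shape of the pipeline is the point, not the components.

```python
import math

DOCS = [
    "cats purr and sleep",
    "dogs bark and fetch",
    "stocks rose on strong earnings",
]
VOCAB = sorted({word for doc in DOCS for word in doc.split()})


def embed(text):
    """Toy stand-in for an embedding model: L2-normalized word counts."""
    words = text.lower().split()
    vec = [float(words.count(word)) for word in VOCAB]
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Step 1, embed: every item becomes a vector.
vectors = [embed(doc) for doc in DOCS]

# Step 2, index: a lossy copy rounded to one decimal (assumption: this
# stands in for real compression; it is not product quantization).
index = [[round(x, 1) for x in vec] for vec in vectors]


def search(query, k=1, shortlist=2):
    q = embed(query)
    # Step 3a: score against the lossy index to get a shortlist.
    coarse = sorted(range(len(index)), key=lambda i: -dot(q, index[i]))
    # Step 3b: re-rank the shortlist exactly with the full vectors.
    exact = sorted(coarse[:shortlist], key=lambda i: -dot(q, vectors[i]))
    return [DOCS[i] for i in exact[:k]]


print(search("do cats sleep"))
```

Swapping the rounded copy for IVFPQ changes only step 2 and step 3a; the embed and re-rank steps stay the same.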
Where it's used
- Large-scale semantic search & RAG. When a corpus has tens of millions of chunks, storing full float vectors is too costly; IVFPQ compresses them and keeps retrieval fast. The retrieved shortlist is then re-ranked (and fed to an LLM, for RAG).
- Recommendations. Item embeddings from matrix factorization or two-tower models (see the HNSW use-cases) are indexed with IVFPQ; "similar items" and per-user candidate generation become compressed nearest-neighbor lookups over hundreds of millions of items.
- Image / video / audio search at web scale — billions of media embeddings compressed to bytes.
- De-duplication & clustering of massive corpora, where keeping full vectors in RAM is infeasible.
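A quick back-of-the-envelope calculation shows why full vectors stop fitting. The numbers below are assumptions picked for illustration (one billion vectors, 128 dimensions, 16-byte PQ codes), and the helper `bytes_per_vector` is a hypothetical name. It counts only the stored codes and ignores index overhead such as IDs, centroids, and graph links.

```python
def bytes_per_vector(dim, kind, pq_m=16):
    """Raw storage per vector, ignoring IDs, centroids, and other overhead."""
    if kind == "flat":  # float32: 4 bytes per dimension
        return 4 * dim
    if kind == "sq8":   # scalar quantization: 1 byte per dimension
        return dim
    if kind == "pq":    # pq_m sub-quantizers, one 8-bit code each
        return pq_m
    raise ValueError(f"unknown kind: {kind!r}")


n_vectors, dim = 1_000_000_000, 128  # assumed corpus size and dimension
flat_size = bytes_per_vector(dim, "flat")

for kind in ("flat", "sq8", "pq"):
    size = bytes_per_vector(dim, kind)
    gigabytes = n_vectors * size / 1e9
    ratio = flat_size / size
    print(f"{kind:5s} {size:4d} B/vector {gigabytes:6.0f} GB {ratio:3.0f}x smaller")
```

Under these assumptions the flat index needs 512 GB, SQ8 needs 128 GB, and 16-byte PQ codes need 16 GB, a 32× reduction. Halving or doubling the code size moves that ratio to 64× or 16×.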
How to choose: a decision guide
| Situation | Use |
|---|---|
| Small data (≤ ~10k–100k) | Flat (exact). Simple, and vectorized brute force is plenty fast. |
| Need speed, RAM is fine | HNSW or IVFFlat. Full vectors, near-exact, fast. |
| RAM-constrained, big data | PQ / IVFPQ. Compress 16–64×; re-rank for recall. |
| Billions of vectors | IVFPQ (often OPQ,IVF…,PQ…), the standard at scale. |
| Maximum recall, RAM available | HNSW (optionally on SQ8) + exact re-rank. |
| Need exact guarantees | Flat only. |
Rules of thumb:
- Start exact. If brute force is fast enough, don't add complexity.
- Hit a speed wall? Add IVF (or HNSW).
- Hit a memory wall? Add PQ, and always pair it with re-ranking.
- Both walls (web scale)? IVFPQ, tuned via `nprobe` and re-rank size.
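The table and rules of thumb can be condensed into a small function. This is only a sketch: `choose_index` and its parameters are hypothetical, and the cutoffs (100k for "small", one billion for "web scale") are assumptions read off the guide above rather than hard limits. Measure on your own data before committing.

```python
def choose_index(n_vectors, need_exact=False, fits_in_ram=True, max_recall=False):
    """Map a situation from the decision guide to a suggested index family."""
    # Exact guarantees, or data small enough for brute force: stay flat.
    if need_exact or n_vectors <= 100_000:
        return "Flat"
    # Web scale: both the speed wall and the memory wall.
    if n_vectors >= 1_000_000_000:
        return "IVFPQ"
    # Memory wall only: compress, and pair compression with re-ranking.
    if not fits_in_ram:
        return "IVFPQ + re-rank"
    # RAM is available and recall matters most.
    if max_recall:
        return "HNSW + exact re-rank"
    # Speed wall only: full vectors with a faster search structure.
    return "HNSW or IVFFlat"


print(choose_index(50_000))                          # small data
print(choose_index(20_000_000))                      # speed wall
print(choose_index(200_000_000, fits_in_ram=False))  # memory wall
print(choose_index(5_000_000_000))                   # both walls
```

The order of the checks mirrors the rules of thumb: start exact, and add partitioning or compression only when a specific wall is hit.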
How IVF/PQ and HNSW relate
They're complementary layers, often combined:
- HNSW as a coarse quantizer. FAISS can replace IVF's flat list of centroids with an HNSW graph over the centroids (`IndexHNSWFlat` as the quantizer), making the "which cells?" step itself fast when `nlist` is huge.
- HNSW on compressed vectors. Run the graph over PQ/SQ-compressed vectors to cut the graph's memory.
- Division of labor. HNSW = fast navigation; IVF = partitioning; PQ = compression. Real systems mix the three to hit a target recall, latency, and memory budget simultaneously.
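To see why the "which cells?" step becomes the bottleneck when `nlist` is huge, it helps to count per-query work. The sketch below assumes a flat coarse quantizer that compares the query against every centroid, and cells that are evenly filled. The function name `per_query_work` and the sizes used are illustrative assumptions.

```python
def per_query_work(n_vectors, nlist, nprobe):
    """Return (centroid comparisons, codes scanned) for one query."""
    coarse = nlist                      # flat quantizer: visit every centroid
    scan = nprobe * n_vectors // nlist  # assumes evenly filled cells
    return coarse, scan


n_vectors = 1_000_000_000  # assumed dataset size
for nlist in (4096, 1_000_000):
    coarse, scan = per_query_work(n_vectors, nlist, nprobe=32)
    print(f"nlist={nlist:>9,}  centroid comparisons={coarse:>9,}  codes scanned={scan:>9,}")
```

With 4096 cells the list scan dominates (about 7.8 million codes against 4096 centroid comparisons). With a million cells the balance flips: 32,000 codes scanned against a million centroid comparisons. That flipped regime is where a graph over the centroids pays off.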
Next, a look at the papers that introduced these ideas. 👉