Introduction

No background required. Like its companion books, this one assumes you know nothing about programming, AI, or advanced math. Every concept, line of code, and symbol is explained, and every code example is followed by the exact output it produces. New to Python? Read the 5-minute primer first.

What this book is about

The HNSW book covered one way to search billions of vectors quickly: build a clever graph and navigate it. This book covers the other half of the story — how to make vector search fit in memory and partition the work — using two classic techniques:

  • IVF (Inverted File index) — chop the vector space into regions and only search the few regions near the query.
  • Quantizationcompress each vector from hundreds of floats down to a handful of bytes, so a billion vectors fit in RAM.

Together (as IVFPQ) they are the workhorse behind FAISS and most billion-scale vector databases.

Storing $n$ vectors of $d$ dimensions as 32-bit floats and scanning them all per query has two costs:

  1. Time — $n$ distance computations per query (the problem HNSW attacks).
  2. Memory — $n \times d \times 4$ bytes. One billion 768-dimensional embeddings is about 3 terabytes. That won't fit in RAM on any normal machine.

This book is mostly about problem 2 (and partly problem 1 via IVF). Quantization can shrink that 3 TB by 16–64×, down to something that fits on a single server — the difference between "impossible" and "routine".

Two everyday analogies

IVF — the library. A library doesn't make you read every book to find one about cooking. Books are shelved by topic; you go straight to the cooking section and look only there. IVF shelves vectors into regions (by similarity) and searches only the regions near your query.

Quantization — paint-by-numbers. Instead of storing the exact color of every pixel, paint-by-numbers stores a small palette and, per pixel, just the number of the nearest palette color. You lose a little fidelity but store far less. Quantization does exactly this with vectors: keep a small set of representative vectors (a "codebook") and store, per vector, just the id of the nearest one.

What you'll build

A complete, working compressed index in one readable file, ivfpq.py, containing k-means, IVF, PQ, and IVFPQ — plus:

  • a demo measuring recall vs. memory vs. work, and
  • a CLI tool that estimates the trade-off for your own vectors.

How it relates to HNSW

These aren't competitors — they compose. A common production setup is "HNSW on top of quantized vectors", or "IVF to partition, PQ to compress". HNSW answers which graph node next?; IVF/PQ answer how do we store a billion vectors and only look at a few? We'll map out exactly when to use which in the use-cases chapter.

Everything rests on one humble algorithm — k-means clustering — so after the problem statement, that's where we start. 👉