Serving at scale

We've built models that produce user and item embeddings scored by a dot product. This chapter is about turning that into a system that answers millions of requests per second over millions of items — and it's where this book plugs directly into the HNSW and IVF-PQ books.

The retrieval problem, restated

Candidate generation is: given the user's vector, find the item vectors with the highest dot product. That's nearest-neighbor search — exactly what the ANN books solve. Doing it exactly (scoring every item) is too slow at catalog scale, so we use Approximate Nearest Neighbor (ANN) indexes:

HNSW — the navigable-graph index; great recall and latency, used when RAM is available.
IVF / PQ — partition + compression; used when the vector set is huge and must be shrunk to fit memory.
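
For reference, the exact baseline that these indexes approximate can be sketched in a few lines of NumPy. This is an illustrative sketch, not code from the companion books: the function name exact_top_k and the random data are assumptions made for the example. It scores every item, which is the O(N * d) work per request that becomes too slow at catalog scale.

```python
import numpy as np

def exact_top_k(item_vecs: np.ndarray, user_vec: np.ndarray, k: int) -> np.ndarray:
    """Return the indices of the k items with the highest dot product, best first."""
    scores = item_vecs @ user_vec                # one dot product per item: O(N * d)
    k = min(k, scores.shape[0])
    top = np.argpartition(-scores, k - 1)[:k]    # unordered top-k, linear time
    return top[np.argsort(-scores[top])]         # sort only the k winners

# Toy data (assumed): 10,000 items with 32-dimensional embeddings.
rng = np.random.default_rng(0)
items = rng.normal(size=(10_000, 32)).astype(np.float32)
user = rng.normal(size=32).astype(np.float32)

top10 = exact_top_k(items, user, k=10)
```

A baseline like this is also what you would typically compare an ANN index against when measuring recall.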

Maximum inner product, not distance. Recommenders rank by largest dot product (MIPS: Maximum Inner Product Search), while ANN indexes are usually framed as finding the smallest distance. The two coincide when vectors are normalized to unit length, because then dot product, cosine similarity, and Euclidean distance all produce the same ranking. When item magnitudes matter, a standard transform appends one extra coordinate to each vector so that nearest-distance search reproduces the largest-dot-product ranking. Either way, the same ANN index serves recommendations. (This is detailed in the IVF-PQ book's use-cases.)
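
One common form of the extra-coordinate transform can be sketched as follows. The helper names augment_items and augment_query are assumptions for this example, and the IVF-PQ book may present a different variant. Each item x gets an extra coordinate sqrt(M^2 - ||x||^2), where M is the largest item norm, and the query gets an extra 0. The squared distance then works out to M^2 + ||q||^2 - 2 x·q, so the smallest distance corresponds to the largest dot product.

```python
import numpy as np

def augment_items(items: np.ndarray) -> np.ndarray:
    """Append sqrt(M^2 - ||x||^2) to every item, where M is the largest item norm."""
    norms_sq = np.sum(items ** 2, axis=1)
    extra = np.sqrt(np.maximum(norms_sq.max() - norms_sq, 0.0))
    return np.hstack([items, extra[:, None]])

def augment_query(query: np.ndarray) -> np.ndarray:
    """Append a zero so the extra item coordinate never affects the dot product."""
    return np.append(query, 0.0)

# Toy data (assumed): items with deliberately different magnitudes.
rng = np.random.default_rng(0)
items = rng.normal(size=(1000, 16)) * rng.uniform(0.5, 2.0, size=(1000, 1))
query = rng.normal(size=16)

best_by_dot = int(np.argmax(items @ query))
dists = np.linalg.norm(augment_items(items) - augment_query(query), axis=1)
best_by_dist = int(np.argmin(dists))
```

On this toy data, best_by_dot and best_by_dist point at the same item, which is the property the transform is meant to guarantee.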

The systems you'll actually use

These are the engines that store embeddings and run ANN search in production:

FAISS (library) — the reference ANN toolkit; IndexHNSWFlat, IndexIVFPQ, etc. You embed offline and query in-process.
OpenSearch / Elasticsearch — vector field types backed by HNSW: knn_vector in OpenSearch (with Lucene, nmslib, or FAISS as the engine) and dense_vector in Elasticsearch (Lucene). Bonus: hybrid search, which blends vector similarity with classic keyword/metadata filters in one query (e.g. "nearest items that are in stock and in this category") and is invaluable for recommendation business rules.
Milvus / Qdrant / Weaviate / Pinecone — purpose-built vector databases with IVF/PQ/HNSW indexes, metadata filtering, and horizontal scaling.

The workflow with any of them is the same: build the index offline from item embeddings, then at request time compute the user embedding and issue a top-k ANN query (optionally with filters).
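
The shape of that workflow can be sketched without committing to any one engine. The class below is a hypothetical stand-in, not the API of FAISS or any vector database: it performs an exact search internally, and the in_stock and category fields are assumed metadata. The point is the split between an offline build step and an online filtered top-k query.

```python
import numpy as np

class BruteForceIndex:
    """Hypothetical stand-in for an ANN index: same build/query shape, exact search inside."""

    def __init__(self, item_vecs, in_stock, category):
        # OFFLINE: store item embeddings plus the metadata that filters will use.
        self.item_vecs = np.asarray(item_vecs, dtype=np.float32)
        self.in_stock = np.asarray(in_stock, dtype=bool)
        self.category = np.asarray(category)

    def search(self, user_vec, k, category=None, only_in_stock=False):
        # ONLINE: restrict to allowed items, then take the top-k by dot product.
        allowed = np.ones(len(self.item_vecs), dtype=bool)
        if only_in_stock:
            allowed &= self.in_stock
        if category is not None:
            allowed &= self.category == category
        ids = np.flatnonzero(allowed)
        scores = self.item_vecs[ids] @ user_vec
        order = np.argsort(-scores)[:k]
        return ids[order], scores[order]

# Offline build on toy data (assumed), then one online request.
rng = np.random.default_rng(1)
n = 5000
index = BruteForceIndex(
    item_vecs=rng.normal(size=(n, 16)),
    in_stock=rng.random(n) < 0.8,
    category=rng.integers(0, 5, size=n),
)
user_vec = rng.normal(size=16).astype(np.float32)
ids, scores = index.search(user_vec, k=20, category=3, only_in_stock=True)
```

In a real deployment the body of search would be replaced by the engine's own filtered ANN query; the calling code around it stays much the same.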

Offline vs. online: the two-loop architecture

Real systems run two loops at very different speeds:

  OFFLINE (hours/daily)                  ONLINE (milliseconds, per request)
  ─────────────────────                  ──────────────────────────────────
  • train models                         • build/look up the user vector
  • compute ALL item embeddings          • ANN query item index -> candidates
  • build the ANN index                  • rank candidates with a fast model
  • compute item-item neighbor lists     • apply business rules / re-rank
  • precompute popularity / trending     • return top-k

Item embeddings change slowly, so they're computed in big batch jobs and loaded into the ANN index periodically.
The user side is real-time: a returning user's vector may be refreshed from their latest clicks within the request, so recommendations react immediately to what they just did.
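
How the user vector is refreshed depends on the model. As one illustrative possibility (an assumption, not the method this book prescribes), the stored vector can be blended with a recency-weighted average of the embeddings of items clicked in the current session. The function name and the decay and blend parameters are hypothetical.

```python
import numpy as np

def session_user_vector(base_vec, clicked_item_vecs, decay=0.7, blend=0.5):
    """Blend the stored user vector with a recency-weighted mean of this session's clicks."""
    base = np.asarray(base_vec, dtype=np.float64)
    if len(clicked_item_vecs) == 0:
        return base                      # no fresh signal: serve the stored vector
    clicks = np.asarray(clicked_item_vecs, dtype=np.float64)
    # Clicks are ordered oldest to newest; the newest gets weight 1,
    # and each step back in time multiplies the weight by `decay`.
    weights = decay ** np.arange(len(clicks) - 1, -1, -1)
    session_vec = (weights[:, None] * clicks).sum(axis=0) / weights.sum()
    return (1.0 - blend) * base + blend * session_vec

# A user whose stored taste points along axis 0 just clicked two axis-1 items.
refreshed = session_user_vector([1.0, 0.0], [[0.0, 1.0], [0.0, 1.0]])
```

Because this is a handful of vector operations, it fits comfortably inside a per-request budget.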

The feature store

The ranking model needs the same features offline (for training) and online (for serving) — user history aggregates, item stats, counts, context. A feature store is the system that computes, stores, and serves these consistently, and prevents train/serve skew (the classic, painful bug where a feature is computed one way in training and a slightly different way in production, quietly wrecking quality). It typically pairs an offline store (for training data) with a low-latency online store (for serving).
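
The core idea behind avoiding skew, one feature definition shared by both paths, can be sketched in plain Python. This is a toy illustration, not the API of any real feature store: the Click record, the clicks_7d_log feature, and the dict used as an online store are all assumptions made for the example.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class Click:
    user_id: str
    ts: float  # Unix seconds

WINDOW_S = 7 * 24 * 3600  # 7 days

def clicks_7d_log(clicks, user_id, as_of):
    """The single feature definition: log1p of the user's clicks in the 7 days before `as_of`."""
    n = sum(1 for c in clicks
            if c.user_id == user_id and as_of - WINDOW_S <= c.ts < as_of)
    return math.log1p(n)

def build_training_row(clicks, user_id, label_ts):
    # OFFLINE: value as of the labelled event, so no future clicks leak in.
    return {"clicks_7d_log": clicks_7d_log(clicks, user_id, as_of=label_ts)}

def materialize_online(clicks, user_ids, now):
    # ONLINE: a plain dict standing in for a low-latency key-value store.
    return {u: {"clicks_7d_log": clicks_7d_log(clicks, u, as_of=now)} for u in user_ids}

clicks = [Click("u1", 100.0), Click("u1", 200.0), Click("u2", 150.0), Click("u1", 900_000.0)]
train_row = build_training_row(clicks, "u1", label_ts=300.0)
online = materialize_online(clicks, ["u1", "u2"], now=300.0)
```

Skew typically creeps in when the training pipeline and the serving code each carry their own copy of a function like clicks_7d_log and the copies drift apart, for example when one applies the log transform and the other does not.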

Keeping it fresh & the feedback loop

Freshness. News/short-video catalogs change by the minute; you re-embed and re-index new items continuously, and lean on content/two-tower embeddings + trending so new items are recommendable instantly.
The feedback loop (and its danger). The system's recommendations shape what users click, which becomes tomorrow's training data, which shapes tomorrow's recommendations. Left unchecked, this amplifies popularity bias and narrows what users ever see. Deliberate exploration (see cold start) and diversity constraints (see best practices) are how you keep the loop healthy.
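
A minimal sketch of deliberate exploration, assuming an epsilon-greedy style policy (one option among many, and not necessarily the one used elsewhere in this book): each slot in the ranked list is swapped, with probability epsilon, for an item drawn from an exploration pool such as new or rarely shown items. The function name and parameters are hypothetical.

```python
import random

def mix_in_exploration(ranked, explore_pool, epsilon=0.1, seed=None):
    """With probability epsilon per slot, show an exploration item instead of the ranked one."""
    rng = random.Random(seed)
    already_ranked = set(ranked)
    # Only explore items the ranker did not already pick, so nothing appears twice.
    pool = [item for item in explore_pool if item not in already_ranked]
    rng.shuffle(pool)
    result = []
    for item in ranked:
        if pool and rng.random() < epsilon:
            result.append(pool.pop())   # exploration slot
        else:
            result.append(item)         # exploitation slot
    return result

ranked = [f"top{i}" for i in range(10)]
pool = [f"new{i}" for i in range(50)]
slate = mix_in_exploration(ranked, pool, epsilon=0.2, seed=42)
```

Clicks on the exploration slots give the next training run data about items the model would otherwise never have shown, which is what weakens the loop.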

Latency budget — a rough picture

A typical request has a total budget of tens of milliseconds. One way to spend it:

  build user vector                 ~1-3 ms
  ANN retrieval (top few hundred)   ~1-10 ms   <-- HNSW / IVF-PQ
  feature fetch + ranking           ~5-20 ms
  re-rank / business rules          ~1-5 ms

The ANN step is fast because of the algorithms in the companion books — without them, retrieval over millions of items couldn't fit in the budget.
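
As a quick sanity check on the rough figures above, the stage ranges can be summed. This assumes the stages run one after another; a system that overlaps them would come in lower.

```python
# Stage budgets in milliseconds as (low, high), copied from the rough picture above.
stages = {
    "build user vector": (1, 3),
    "ANN retrieval": (1, 10),
    "feature fetch + ranking": (5, 20),
    "re-rank / business rules": (1, 5),
}

total_low = sum(low for low, _ in stages.values())
total_high = sum(high for _, high in stages.values())
print(f"end-to-end: ~{total_low}-{total_high} ms")  # end-to-end: ~8-38 ms
```

The total of roughly 8 to 38 ms is consistent with a budget of tens of milliseconds.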

Next, the consolidated implementation of every algorithm we built. 👉

Recommendation Systems from Scratch