Neural recommenders & two-stage retrieval
Every model so far ended the same way: a user vector, item vectors, and a dot product. Neural recommenders generalize that pattern, and the two-stage architecture is how all of it runs at industrial scale. This chapter is conceptual (no new from-scratch code) but it's where the whole book clicks together.
From matrix factorization to two-tower networks
MF learns user/item vectors as raw lookup tables. Its limits: it can't use side features (a user's age, an item's text, the time of day), and it can't embed a brand-new user or item (no row in the table → cold start).
The two-tower model fixes both. Replace the lookup tables with two neural networks ("towers"):
       user features                               item features
    (history, age, ctx)                       (text, category, price)
             │                                           │
        ┌────▼────┐                                 ┌────▼────┐
        │  USER   │                                 │  ITEM   │
        │  tower  │ (neural net)                    │  tower  │ (neural net)
        └────┬────┘                                 └────┬────┘
             │ user vector ── dot product ── item vector │
             └─────────────────► score ◄─────────────────┘
- Each tower maps raw features → an embedding.
- Score is still the dot product of the two embeddings (so MF is the special case where each tower is a lookup table).
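- Trained with a ranking loss and negative sampling: the same idea as BPR from the last chapter, scaled up. (Large systems often swap BPR's pairwise loss for a sampled softmax with in-batch negatives, but the principle is the same.)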
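To make the three bullets concrete, here is a minimal NumPy sketch of a two-tower forward pass with a BPR-style pairwise loss. Everything in it is an assumption made for illustration: the one-hidden-layer towers, the feature sizes, the random weights, and the helper names `init_tower`, `tower`, and `bpr_loss`. A real system would train the weights with an autodiff framework; this only shows the shape of the computation.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_tower(in_dim, hidden_dim, out_dim):
    """Random weights for a one-hidden-layer MLP tower (illustrative only)."""
    return {
        "W1": rng.normal(0, 0.1, (in_dim, hidden_dim)),
        "W2": rng.normal(0, 0.1, (hidden_dim, out_dim)),
    }

def tower(params, x):
    """Map feature rows x of shape (n, in_dim) to embeddings (n, out_dim)."""
    h = np.maximum(x @ params["W1"], 0.0)  # ReLU hidden layer
    return h @ params["W2"]

def bpr_loss(u, pos, neg):
    """Pairwise ranking loss: push score(u, pos) above score(u, neg)."""
    s_pos = np.sum(u * pos, axis=1)  # row-wise dot products
    s_neg = np.sum(u * neg, axis=1)
    # -log(sigmoid(x)) written stably as log(1 + exp(-x))
    return np.mean(np.logaddexp(0.0, -(s_pos - s_neg)))

user_tower = init_tower(in_dim=6, hidden_dim=16, out_dim=8)
item_tower = init_tower(in_dim=5, hidden_dim=16, out_dim=8)

user_x = rng.normal(size=(4, 6))  # features for 4 users
pos_x = rng.normal(size=(4, 5))   # an item each user interacted with
neg_x = rng.normal(size=(4, 5))   # one sampled negative per user

u = tower(user_tower, user_x)
loss = bpr_loss(u, tower(item_tower, pos_x), tower(item_tower, neg_x))

# MF as the special case: a one-hot ID times a weight matrix is a row lookup.
E = rng.normal(size=(3, 8))   # embedding table for 3 items
one_hot = np.eye(3)[[2]]      # ID feature for item 2
assert np.allclose(one_hot @ E, E[[2]])
```

The final assertion is the "MF is a special case" claim in code: feed a tower nothing but a one-hot ID, drop the hidden layer, and the tower collapses to an embedding table.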
The payoff: because the item tower turns features into a vector, you can embed a new item from its content alone (helping cold start), and the user tower can fold in rich context. This design underpins retrieval in YouTube's recommender, large-scale ad systems, and modern retrieval in general. (Other neural variants exist, including Neural Collaborative Filtering, sequence models such as GRU4Rec and SASRec that model the order of interactions, and graph neural networks, but two-tower is the dominant retrieval design.)
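A small sketch of the cold-start payoff, under stated assumptions: the category list, the `item_features` helper, and the single-layer `item_tower` with random weights are all hypothetical stand-ins for a trained tower. The point is only that a brand-new item needs features, not an ID row.

```python
import numpy as np

rng = np.random.default_rng(1)
CATEGORIES = ["book", "film", "game"]  # hypothetical catalog categories

def item_features(category, price):
    """Content-only features: one-hot category plus a log-scaled price."""
    one_hot = [1.0 if category == c else 0.0 for c in CATEGORIES]
    return np.array(one_hot + [np.log1p(price)])

# Stand-in for trained item-tower weights (random here, learned in practice).
W_item = rng.normal(0, 0.1, (len(CATEGORIES) + 1, 8))

def item_tower(x):
    """Single-layer tower: features in, 8-dimensional embedding out."""
    return np.tanh(x @ W_item)

# An item added today: no interactions, no row in any ID table.
new_item_vec = item_tower(item_features("game", 59.0))

user_vec = rng.normal(size=8)            # would come from the user tower
score = float(user_vec @ new_item_vec)   # same dot-product score as any item
```

An MF model has nothing to look up for this item; the tower produces a usable vector on day one, though how good that vector is depends on how informative the content features are.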
Why the dot product keeps mattering
Keeping the score a dot product of independent user/item vectors is a deliberate, load-bearing choice:
- Item vectors can be computed offline and stored in an ANN index.
- At request time you compute just the user vector, then do one nearest-neighbor search over item vectors.
If the score instead mixed user and item features together (a "cross" network), you couldn't precompute item vectors and you'd be back to scoring millions of items per request. The dot-product constraint is what makes fast retrieval possible — and it's why the HNSW and IVF-PQ books are the serving engine for everything in this book.
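The offline/online split can be sketched in a few lines. This is an illustration, not a serving implementation: random vectors stand in for item-tower outputs, the `retrieve` helper is a made-up name, and the search is exact brute force. An ANN index such as HNSW or IVF-PQ would replace the brute-force step with an approximate one that avoids touching every item.

```python
import numpy as np

rng = np.random.default_rng(2)
n_items, dim, k = 10_000, 32, 5

# Offline: run the item tower over the catalog once and store the vectors.
# Random vectors stand in for real item-tower outputs.
item_vecs = rng.normal(size=(n_items, dim)).astype(np.float32)

def retrieve(user_vec, item_vecs, k):
    """Exact top-k by dot product; an ANN index would approximate this step."""
    scores = item_vecs @ user_vec
    top = np.argpartition(-scores, k - 1)[:k]  # unordered top-k
    return top[np.argsort(-scores[top])]       # best first

# Online: one user-tower forward pass, then one search.
user_vec = rng.normal(size=dim).astype(np.float32)
top_items = retrieve(user_vec, item_vecs, k)
```

Note that `item_vecs` never depends on the user. That independence is exactly what a cross network would break, since its item representation would have to be recomputed for every request.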
The two-stage architecture
No single model can be both fast enough for millions of items and rich enough for accurate ranking. So production systems split the work into stages, as introduced in Chapter 1:
Stage 1 — Candidate generation (retrieval)
- Job: cut millions of items down to a few hundred plausible ones.
- Optimize for recall (don't drop good items) and speed.
- How: embed the user, run ANN nearest-neighbor search over item embeddings (from MF / two-tower / content models). Often several retrievers run in parallel — a CF retriever, a content retriever, a trending retriever — and their candidates are pooled.
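Pooling candidates from parallel retrievers can be as simple as a de-duplicating merge. The sketch below interleaves the lists round-robin so that no single retriever dominates the pool. That policy, the `pool_candidates` name, and the item IDs are assumptions chosen for illustration; real systems use a range of merging strategies.

```python
def pool_candidates(*candidate_lists, limit=300):
    """Merge retriever outputs round-robin, dropping duplicates, up to `limit`."""
    pooled, seen = [], set()
    longest = max((len(c) for c in candidate_lists), default=0)
    for rank in range(longest):
        for candidates in candidate_lists:
            if rank < len(candidates) and candidates[rank] not in seen:
                seen.add(candidates[rank])
                pooled.append(candidates[rank])
                if len(pooled) == limit:
                    return pooled
    return pooled

cf_hits = [12, 7, 99, 3]     # hypothetical item IDs from a CF retriever
content_hits = [7, 41, 12]   # from a content retriever
trending_hits = [500, 99]    # from a trending retriever

candidates = pool_candidates(cf_hits, content_hits, trending_hits)
print(candidates)  # [12, 7, 500, 41, 99, 3]
```

The retrievers' own scores are discarded here on purpose: they come from different models and are not comparable, which is one reason a separate ranking stage exists.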
Stage 2 — Ranking
- Job: order those few hundred candidates precisely.
- Optimize for precision / NDCG.
- How: a heavy model (gradient-boosted trees or a deep network) scores each candidate using many features — user history, item attributes, context, cross-features, even predicted watch-time. Affordable because it only scores hundreds of items, not millions.
catalog (10^7) ──[retrieval: ANN over embeddings]──► candidates (10^2)
──[ranking: heavy model + features]──► ordered top-k ──► user
Many systems add a stage 3 — re-ranking for business rules: diversity (don't show ten near-duplicates), freshness, de-duplication, fairness, and hard constraints ("in stock", "not already purchased").
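Putting the stages together, here is a toy end-to-end sketch. All data is synthetic and every name (`retrieve`, `rank`, `rerank`, `freshness`, `in_stock`, `category`) is hypothetical. The "heavy model" in stage 2 is a placeholder formula so the example stays runnable; in practice it would be a trained gradient-boosted or deep model over many features.

```python
import numpy as np

rng = np.random.default_rng(3)
n_items, dim = 5_000, 16
item_vecs = rng.normal(size=(n_items, dim))   # precomputed item embeddings
freshness = rng.random(n_items)               # hypothetical ranking feature
in_stock = rng.random(n_items) > 0.1          # hypothetical business flag
category = rng.integers(0, 20, size=n_items)  # hypothetical item category

def retrieve(user_vec, n_candidates=200):
    """Stage 1: cheap dot-product search (exact here, ANN in production)."""
    scores = item_vecs @ user_vec
    return np.argsort(-scores)[:n_candidates]

def rank(user_vec, candidates):
    """Stage 2: placeholder for a heavy model scoring each candidate."""
    heavy_score = item_vecs[candidates] @ user_vec + 0.5 * freshness[candidates]
    return candidates[np.argsort(-heavy_score)]

def rerank(ranked, purchased, k=10, max_per_category=2):
    """Stage 3: hard constraints plus a simple per-category diversity cap."""
    out, per_cat = [], {}
    for item in ranked:
        item, cat = int(item), int(category[item])
        if not in_stock[item] or item in purchased:
            continue  # hard constraints
        if per_cat.get(cat, 0) >= max_per_category:
            continue  # diversity cap
        per_cat[cat] = per_cat.get(cat, 0) + 1
        out.append(item)
        if len(out) == k:
            break
    return out

user_vec = rng.normal(size=dim)
purchased = {3, 14, 159}
top_k = rerank(rank(user_vec, retrieve(user_vec)), purchased)
```

Only `retrieve` touches the full catalog. `rank` and `rerank` see at most 200 items, which is what makes an expensive scorer affordable in stage 2.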
Where each chapter fits
| Chapter | Role in the architecture |
|---|---|
| Popularity / Trending (4) | a candidate source + cold-start fallback |
| Content-based (5) | a candidate source (item-embedding retrieval) |
| Item-item CF (6) | a candidate source |
| MF / ALS (7) | embeddings for retrieval |
| BPR / two-tower (8, here) | embeddings for retrieval; the ranking loss |
| ANN serving (11) | how retrieval runs fast |
The big remaining gap is users and items the models have never seen. That's cold start.