Introduction

No background required. Like its companion books, this one assumes you know nothing about programming, AI, or recommendation systems. Every concept, line of code, and symbol is explained, and every code example is followed by the exact output it produces. New to Python? Read the 5-minute primer first. At the same time, the later chapters go deep enough — the math and the production architecture — to be useful to seasoned engineers.

What is a recommendation system?

A recommendation system predicts what a person will want next and shows it to them. The "Recommended for you" row on Netflix, "Customers also bought" on Amazon, your personalized feed, "Up next" on YouTube, suggested songs, suggested connections — all of it is recommendation systems deciding, out of millions of items, the handful to put in front of you.

Done well, it's one of the highest-leverage pieces of software a company can build: it drives a large share of engagement and revenue at the biggest platforms on earth.

What this book covers

We build the whole discipline from the ground up:

  • The design principles — how to think about the problem, the data, and what "good" even means (how we measure a recommender).
  • Every major algorithm, implemented from scratch and explained under the hood — not just names and one-liners, but the actual math and code: popularity/trending baselines, content-based filtering, neighborhood collaborative filtering, matrix factorization, and learning-to-rank.
  • The hard real-world problems — especially cold start (what do you recommend to a brand-new user with no history?) and serving at scale.
  • The production architecture — the two-stage "candidate generation + ranking" design that essentially every large system uses, and how it connects to the nearest-neighbor search from the HNSW and IVF-PQ books.

A concrete example to anchor on

Here's a complete, real recommender — the kind this book teaches you to build and reason about — described in plain English:

  1. Turn every article into an embedding (a vector capturing its meaning).
  2. For a given user, take the articles they've read, and average their embeddings — but weight recent reads more than old ones (a time decay), because interests drift. That weighted average is the user's taste vector.
  3. Run a nearest-neighbor (kNN) search over all the other article embeddings (in OpenSearch, FAISS, Milvus, …) to find the articles closest to that taste vector. Those are the recommendations.
  4. If the user is brand new with no reading history, you can't build a taste vector — so fall back to trending items until you learn something about them.

That single example touches content embeddings, time decay, kNN serving, and cold start. We'll build exactly this (Chapters 5 and 10) — and then go well beyond it.

What you'll build

How to read this book

Read it front to back the first time — each chapter builds on the last (data → metrics → baselines → algorithms → architecture). If you're experienced, you can jump to a specific algorithm chapter; each is self-contained with its own code and output.

Let's start with the vocabulary and the smallest bit of Python you'll need. 👉

A 5-minute primer: users, items & vectors

Just enough Python, NumPy, and vocabulary to read every example. Skip if you're comfortable (the HNSW primer covers vectors in more depth).

The core vocabulary

TermMeaning
userthe person we recommend to
itemthe thing we recommend (article, product, movie, song)
interactiona user doing something with an item (click, view, buy, rate)
catalogall the items we could recommend
embedding / vectora list of numbers representing an item or user
top-kthe k items we actually show (e.g. top-10)

The whole field is about one question: given who the user is and what they've done, which items should be in their top-k?

Code boxes show their output

print("recommended:", [42, 7, 13])

Output:

recommended: [42, 7, 13]

A tiny bit of Python

history = [5, 12, 3]              # a list: item ids the user interacted with
scores = {7: 0.9, 2: 0.4}        # a dict: item id -> predicted score

def top_k(scores, k):            # a function
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(top_k({7: 0.9, 2: 0.4, 5: 0.7}, 2))

Output:

[7, 5]

sorted(..., reverse=True)[:k] is the heart of every recommender: score the items, sort high-to-low, keep the top k.

NumPy and vectors

NumPy (nicknamed np) does fast math on arrays of numbers. A vector is a list of numbers — a point in space. Similar items get nearby vectors.

import numpy as np
a = np.array([1.0, 0.0, 0.0])    # an "action movie" vector
b = np.array([0.9, 0.1, 0.0])    # similar
c = np.array([0.0, 0.0, 1.0])    # a "documentary" vector
print("a·b =", np.dot(a, b))     # high -> similar
print("a·c =", np.dot(a, c))     # low  -> different

Output:

a·b = 0.9
a·c = 0.0

The dot product measures similarity. Cosine similarity is the dot product after scaling both vectors to length 1 (so it measures direction, not size) — it is the workhorse similarity in this book.

The user-item matrix

Most recommenders start from a big table R: one row per user, one column per item, with a mark where they interacted. It's mostly empty (a user touches a tiny fraction of the catalog) — we call that sparse.

          item0  item1  item2  item3
 user0      1      .      1      .
 user1      .      1      .      1
 user2      1      1      .      .

A 1 = "interacted"; . = "no data" (not "disliked" — an important distinction we'll return to). Recommending = filling in the blanks: predicting which empty cells the user would light up next.

NumPy bits we use

You'll seeMeaning
np.argsort(-scores)item ids ordered by highest score first
X @ Y.Tall pairwise dot products between rows of X and Y
np.linalg.norm(v)length of a vector (for cosine)
np.exp(-x)the decay curve used for recency weighting
np.linalg.solve(A, b)solve A x = b (used in matrix factorization)

That's the toolkit. Now, the problem itself and the landscape of solutions. 👉

The problem & the landscape

Before any algorithm, let's frame what we're solving and map the families of solutions so every later chapter has a place on the map.

The problem, precisely

We have users, items, and a log of interactions between them. For a given user, produce a short ranked list (top-k) of items they haven't seen yet and are most likely to want next.

Two things make this hard:

  1. Scale. Millions of users × millions of items. We can't score every item for every user in real time (that's where the ANN books come in).
  2. Sparsity & ambiguity. Each user has touched a tiny fraction of the catalog, and a non-interaction usually means "never saw it", not "disliked it". We're predicting from very incomplete, noisy signals.

The two big families

Almost every recommender is one of these two ideas, or a blend (a hybrid):

1. Content-based filtering

"Recommend items similar to what this user already liked, based on the items' content."

It uses features of the items (text, genre, embeddings). If you read three articles about space, it recommends more space articles. It needs item features but not other users' data — so it works for a brand-new item, and even for a user with a short history. Covered in Chapter 5.

2. Collaborative filtering (CF)

"Recommend items that people similar to this user liked — using the interaction patterns alone, ignoring item content."

It learns from the crowd: "users who interacted with X also interacted with Y." It needs no item features and often gives better, more surprising results — but it struggles with brand-new users/items (no interactions yet = the cold-start problem). CF splits into:

  • Neighborhood methods — direct similarity between users or items (Chapter 6).
  • Model-based methods — learn compact latent vectors that explain the interactions: matrix factorization (Chapter 7), learning-to-rank (Chapter 8), and neural models (Chapter 9).

Comparison

Content-basedCollaborative filtering
Usesitem featuresinteraction patterns
New item with no interactions✅ works❌ cold start
New user with no history⚠️ needs a little history❌ cold start
Surprising / cross-topic recs❌ stays "on topic"✅ finds hidden links
Needs item features✅ yes❌ no

Neither dominates — production systems combine them (and add popularity for cold start). That blend is a hybrid.

The modern architecture: two stages

At scale you can't run a fancy model over millions of items per request. So real systems split the job in two (Chapter 9):

   millions of items
          │
   ┌──────▼───────┐   CANDIDATE GENERATION (retrieval)
   │  cheap & fast │   cut millions -> a few hundred candidates
   │  e.g. ANN kNN │   (content/CF embeddings + nearest-neighbor search)
   └──────┬───────┘
          │  ~hundreds
   ┌──────▼───────┐   RANKING
   │ rich & slow   │   score the few hundred with a heavy model
   │ many features │   using lots of features, pick the top-k
   └──────┬───────┘
          │
        top-k  ──►  shown to the user
  • Candidate generation is about recall — don't miss good items — and must be fast. Embeddings + nearest-neighbor search (the HNSW/IVF-PQ books) live here.
  • Ranking is about precision — order the survivors well — and can afford a heavy model because it only scores a few hundred items.

Keep this picture in mind: most chapters are building a candidate generator or a ranker.

How we'll proceed

Data → how to measure success → simple baselines → the algorithms (simplest to most powerful) → cold start → serving → best practices. We start with the data, because the kind of feedback you have changes everything. 👉

The data: feedback & the user-item matrix

Recommenders are only as good as the signal they learn from. This short chapter covers the two kinds of feedback and the data structure everything is built on — including a subtle point that trips up newcomers and experts alike.

Explicit vs. implicit feedback

Explicit feedback is when the user deliberately rates something: 5-star reviews, thumbs up/down, likes. It's clear but rare — most people never rate anything.

Implicit feedback is the user's behavior: clicks, views, watch time, purchases, dwell time. It's abundant (every interaction is a signal) but noisy and one-sided:

  • A click isn't a guarantee of "like" (maybe a misleading title).
  • No click is not a dislike. The user probably just never saw the item.

This asymmetry is the single most important fact about recommender data. With explicit ratings, a blank means "unknown". With implicit data, a 1 means "some positive signal" and a blank means "no information" — not a negative. Algorithms must treat "missing" as "unknown", not "disliked". It's why implicit models use ideas like confidence weighting (Chapter 7) and negative sampling (Chapter 8) instead of pretending blanks are zeros-meaning-dislike.

Most real systems run on implicit feedback, so this book focuses there.

The user-item matrix

We organize interactions into a matrix R: rows = users, columns = items.

import numpy as np

def build_matrix(interactions, n_users, n_items):
    """interactions: list of (user, item, time). Returns a binary matrix R."""
    R = np.zeros((n_users, n_items))
    for u, i, t in interactions:
        R[u, i] = 1.0          # 1 = interacted; 0 = no data (NOT 'disliked')
    return R

R = build_matrix([(0, 0, 0), (0, 2, 1), (1, 1, 2), (2, 0, 3), (2, 1, 4)], 3, 4)
print(R)

Output:

[[1. 0. 1. 0.]
 [0. 1. 0. 1.]
 [1. 1. 0. 0.]]

User 0 interacted with items 0 and 2; user 1 with items 1 and 3; etc.

Sparsity

Real matrices are enormous and almost entirely empty. A site with 1M users and 1M items where each user touches 100 items has a matrix that is $100 / 1{,}000{,}000 = 0.01%$ full. We never store it as a dense grid (that would be a trillion cells) — we store only the interactions (the (user, item) pairs) and use sparse math. This sparsity is also why recommendation is hard: we must generalize from a vanishingly small set of observed cells.

Time matters

Interactions have timestamps, and they're gold:

  • Recency: what you watched yesterday predicts tomorrow better than what you watched a year ago. We'll weight recent interactions more with time decay (Chapters 4 and 5).
  • Evaluation: to imitate reality, we train on the past and test on the future. Throughout this book we hold out each user's chronologically last interaction as the test target (called leave-last-out) — never a random one, which would let the model "peek" at the future.

Other signals (briefly)

Beyond the interaction matrix, production systems fold in side features: item metadata (category, price, author), user attributes, and context (time of day, device, location). These are crucial for the ranking stage and for cold start, and we'll use item features directly in content-based filtering next.

With the data understood, the next question is how we judge whether a recommender is any good. 👉

Measuring success: offline metrics

You can't improve what you can't measure. Before building recommenders we need a fair way to score them — otherwise we're just guessing. This chapter defines the metrics used in every later chapter's leaderboard.

The evaluation setup

We use leave-last-out: for each user, hide their chronologically last interaction, train on the rest, then ask the model for a top-k list and check whether the hidden item shows up. Using the last (not a random) interaction mimics reality — predict the future from the past, never peek ahead.

The held-out item is the ground truth; the model's ranked list is the prediction. All metrics compare the two.

The core metrics

Recall@k (a.k.a. Hit Rate here)

Of the users, what fraction had their hidden item appear in their top-k?

With one hidden item per user, "recall@k" = "hit rate": 1 if the item is in the top-k, else 0, averaged over users. Higher is better. It answers "did we find it at all (within k)?" but ignores where in the list.

NDCG@k — rank-aware

Showing the right item at position 1 is better than at position 10. NDCG (Normalized Discounted Cumulative Gain) rewards higher placement with a logarithmic discount. For a single relevant item at 0-based rank $r$:

$$ \text{NDCG} = \frac{1}{\log_2(r + 2)}. $$

Rank 0 → $1/\log_2 2 = 1.0$ (perfect); rank 1 → $1/\log_2 3 \approx 0.63$; and so on. Averaged over users. NDCG is the most-reported offline metric because it captures ordering quality, which is what users feel.

MAP@k — precision-of-ranking

Mean Average Precision also rewards ranking the hit early; with one relevant item it reduces to $1/(\text{rank}+1)$ averaged over users (rank 1-based here).

The code

import numpy as np

def recall_at_k(recs, truth, k):
    hits = sum(1 for u, item in truth.items() if item in recs.get(u, [])[:k])
    return hits / len(truth)

def ndcg_at_k(recs, truth, k):
    total = 0.0
    for u, item in truth.items():
        lst = recs.get(u, [])[:k]
        if item in lst:
            total += 1.0 / np.log2(lst.index(item) + 2)   # 0-based rank
    return total / len(truth)

def map_at_k(recs, truth, k):
    total = 0.0
    for u, item in truth.items():
        lst = recs.get(u, [])[:k]
        if item in lst:
            total += 1.0 / (lst.index(item) + 1)          # 1-based rank
    return total / len(truth)

See them on a tiny example

Three users; one hits at rank 1, one at rank 3, one misses:

truth = {0: 7, 1: 3, 2: 9}
recs  = {0: [7, 1, 5],     # hit at position 1
         1: [4, 8, 3],     # hit at position 3
         2: [2, 6, 0]}     # miss
print("recall@3:", round(recall_at_k(recs, truth, 3), 3))
print("ndcg@3  :", round(ndcg_at_k(recs, truth, 3), 3))
print("map@3   :", round(map_at_k(recs, truth, 3), 3))

Output:

recall@3: 0.667
ndcg@3  : 0.5
map@3   : 0.444

Two of three users got a hit → recall 0.667. NDCG and MAP are lower because the hits aren't all at the top (one was at rank 3, which is discounted).

Beyond accuracy

Accuracy isn't everything. Production teams also track:

  • Coverage — what fraction of the catalog ever gets recommended? (A recommender that only ever shows the 10 most popular items has tiny coverage.)
  • Diversity — are the k items varied, or ten near-duplicates?
  • Novelty / serendipity — does it surface things the user wouldn't have found alone, or only the obvious?
  • Popularity bias — is it just re-recommending blockbusters and ignoring the long tail?

These often trade off against raw accuracy, and tuning that balance is a core design decision (see Best practices).

The offline–online gap

Crucial caveat: good offline numbers don't guarantee a better product. Offline metrics score the model on past logged behavior, which was itself shaped by the old recommender (a feedback loop), and they can't measure whether a new recommendation would have been clicked. The real verdict comes from online A/B tests — showing the new system to a slice of live traffic and comparing business metrics (engagement, retention, revenue). Offline metrics are a fast filter for which ideas to A/B test, not the final word.

With a scoreboard in hand, let's build the simplest possible recommenders: baselines. 👉

Baselines: popularity & trending

Always build the dumb version first. Popularity and trending are non-personalized recommenders — they show everyone (almost) the same thing — yet they're surprisingly strong, they're the standard cold-start fallback, and they're the bar every fancy model must clear to justify its complexity.

Count how many times each item was interacted with; recommend the most-counted items the user hasn't already seen.

import numpy as np

class Popularity:
    def fit(self, train, n_users, n_items):
        self.counts = np.zeros(n_items)
        for u, i, t in train:
            self.counts[i] += 1               # how many interactions per item
        self.seen = user_seen(train, n_users) # to skip items the user already has
        return self

    def recommend(self, u, k=10):
        return _topk_excluding(self.counts, self.seen.get(u, set()), k)

It's trivial, but it encodes real signal: popular items are popular because many people liked them, so a random user probably will too. It's also the thing to show a brand-new user (more in Cold start).

Plain popularity treats a click from a year ago the same as one from this morning. Trending fixes that by weighting recent interactions more, using exponential time decay.

Exponential decay, explained

We want each interaction's weight to shrink as it ages. The exponential decay weight for an interaction that happened at time $t$, evaluated at "now", is

$$ w = e^{-\lambda (\text{now} - t)}, \qquad \lambda = \frac{\ln 2}{H}. $$

  • $(\text{now} - t)$ is the interaction's age.
  • $\lambda$ (lambda) is the decay rate.
  • $H$ is the half-life: the age at which weight drops to exactly 0.5. Setting $\lambda = \ln 2 / H$ guarantees that. Two half-lives → 0.25, three → 0.125, and so on.

Half-life is the intuitive knob: "an interaction from one half-life ago counts half as much." A short half-life = very recency-biased (fast-moving news); a long half-life ≈ plain popularity.

class Trending:
    def __init__(self, half_life=2000.0):
        self.half_life = half_life

    def fit(self, train, n_users, n_items):
        lam = np.log(2) / self.half_life
        now = max(t for u, i, t in train)
        self.scores = np.zeros(n_items)
        for u, i, t in train:
            self.scores[i] += np.exp(-lam * (now - t))   # recent counts more
        self.seen = user_seen(train, n_users)
        return self

    def recommend(self, u, k=10):
        return _topk_excluding(self.scores, self.seen.get(u, set()), k)

The same idea runs the internet

The "hot" rankings on Hacker News and Reddit are exactly this: a score that combines votes with an age penalty so fresh items bubble up and old ones sink. Hacker News uses roughly

$$ \text{score} = \frac{(\text{votes} - 1)^{0.8}}{(\text{age}_{\text{hours}} + 2)^{1.8}}, $$

a power-law decay rather than exponential, but the principle is identical: popularity discounted by age. Time decay is one of the most reused tricks in all of recommendations.

How good are baselines?

From the leaderboard (recall@10, higher is better):

  Random             0.033
  Popularity         0.073
  Trending           0.073

Popularity doubles random — a real signal from a one-line idea. (Trending matches it here because our synthetic data has no strong time-of-day trend; on genuinely fast-moving catalogs trending pulls ahead.) Personalized models will beat these, but not by as much as you'd expect — which is exactly why popularity is the baseline you must always measure against.

When baselines are the right answer

  • Cold start — a new user with zero history: you have nothing personal to go on, so trending is your best guess.
  • Cold catalogs — brand-new items with no interactions yet.
  • A sanity floor — if your deep model can't beat popularity, something is wrong.

Next we make recommendations personal using item content. 👉

Content-based filtering (decay + kNN)

This is the recommender from the introduction — and a great first personalized method. The idea: describe items by their content (embeddings), summarize a user by the content they engage with, and recommend the nearest items. It needs no other users' data, so it works from a user's very first interactions.

The recipe

  1. Embed every item. Turn each item into a vector capturing its content (genre features, or a text/image model's embedding). Similar items → nearby vectors.
  2. Build the user's taste vector. Average the embeddings of the items the user interacted with — weighting recent interactions more (time decay), because interests drift.
  3. Find nearest items. Rank all unseen items by cosine similarity to the taste vector (a nearest-neighbor / kNN search) and recommend the top-k.
  read articles        their embeddings        decayed average      kNN over catalog
  (recent first)   ──►  •  •  •           ──►   ★ taste vector  ──►  nearest unseen items
                       (older = lighter)

Why the time decay matters

A user who spent last year on cooking but this week on travel should get travel recs. A plain average buries the recent shift under a year of cooking. Weighting by recency (the same $w = e^{-\lambda\,\text{age}}$, half-life $H$ from Chapter 4) lets the profile follow the user.

$$ \text{taste}u = \frac{\sum{(i,t)\in \text{history}} e^{-\lambda(\text{now}-t)}\; \text{emb}i}{\sum{(i,t)\in \text{history}} e^{-\lambda(\text{now}-t)}}. $$

It's a weighted average of item embeddings, recent items weighing most.

The code

import numpy as np

class ContentBased:
    def __init__(self, half_life=2000.0):
        self.half_life = half_life

    def fit(self, train, item_feats, n_users, n_items):
        self.item_feats = item_feats
        norm = np.linalg.norm(item_feats, axis=1, keepdims=True)
        self.feat_unit = item_feats / np.maximum(norm, 1e-9)   # unit vectors -> cosine
        self.hist = {u: [] for u in range(n_users)}
        for u, i, t in train:
            self.hist[u].append((t, i))
        self.now = max(t for u, i, t in train)
        self.seen = user_seen(train, n_users)
        return self

    def profile(self, u):
        h = self.hist.get(u, [])
        if not h:
            return None                                        # no history -> cold start
        lam = np.log(2) / self.half_life
        items = [i for t, i in h]
        w = np.array([np.exp(-lam * (self.now - t)) for t, i in h])
        return (w[:, None] * self.item_feats[items]).sum(0) / w.sum()   # decayed average

    def recommend(self, u, k=10):
        p = self.profile(u)
        if p is None:
            return []                                          # caller falls back to trending
        pu = p / max(np.linalg.norm(p), 1e-9)
        scores = self.feat_unit @ pu                           # cosine to every item
        return _topk_excluding(scores, self.seen.get(u, set()), k)

feat_unit @ pu is the kNN step done with one matrix-vector product: it computes the cosine similarity between the taste vector and every item at once, then we take the top-k unseen. At catalog scale you'd replace this exact scan with an approximate nearest-neighbor index (HNSW or IVF-PQ) — that's the direct bridge to the other books, and the serving chapter.

See it work

The article-recommender CLI is exactly this method on text (embeddings via TF-IDF). A user who read two tech articles, asking for 3 recs:

$ python recommend_cli.py articles.txt --history 0,1 -k 3
history (oldest->newest): [0, 1]
  most recent read: 'Google releases smartphone software update improving camera battery and performance'

recommended for you:
  1. (score=0.453)  [3] Samsung reveals smartphone with upgraded camera processor and faster software
  2. (score=0.337)  [2] New laptop launches with a powerful processor faster software and better battery
  3. (score=0.311)  [4] Chipmaker announces faster processor boosting laptop and smartphone performance

All three recommendations are tech articles — the taste vector landed in the "tech" region of embedding space and kNN returned its neighbors. A user with a sports history gets sports back instead.

Where content-based shines — and where it doesn't

Strengths

  • No cold-start for items. A brand-new article can be recommended the instant it's embedded — no interactions needed. (Great for news, where items are born and die daily.)
  • Works from a short history and is explainable ("because you read X").
  • No dependence on other users.

Weaknesses

  • Over-specialization / filter bubble. It keeps recommending the same topic; it can't surface a great item outside your past interests (no serendipity).
  • Only as good as the features. Weak embeddings → weak recs. (Modern systems use strong text/image embedding models here.)
  • Still needs some user history to build a profile — a brand-new user with zero reads gets nothing, which is why the CLI falls back to representative/ trending items (Cold start).

Content-based looks only at items. The next family looks at the crowd — often finding connections content can't. 👉

Neighborhood collaborative filtering

Collaborative filtering (CF) ignores item content and learns purely from the interaction patterns of the crowd: "people who liked what you liked also liked this." The simplest, most intuitive form is the neighborhood method, and it remains a strong, explainable baseline.

Two flavors

  • User-user: find users similar to you, recommend what they liked.
  • Item-item: find items similar to the ones you liked, recommend those.

Item-item is the workhorse (it's what Amazon famously deployed) because item similarities are more stable than user similarities (an item's audience changes slowly; a user's tastes and the user base shift fast) and can be precomputed. We'll build item-item.

The key idea: items are similar if the same people interact with them

Forget content. Two items are "similar" if they tend to be touched by the same users. Represent each item by its column in the user-item matrix R (which users interacted with it), and measure similarity between those columns with cosine similarity.

              item A column   item B column   item C column
   user0          1               1               0
   user1          1               1               0
   user2          0               0               1
   user3          1               0               1

   A and B share users 0,1  -> HIGH similarity
   A and C share only user 3 -> LOW similarity

The code

import numpy as np

class ItemItemCF:
    def fit(self, train, n_users, n_items):
        R = build_matrix(train, n_users, n_items)        # users x items
        self.R = R
        norm = np.linalg.norm(R, axis=0, keepdims=True)  # length of each item column
        Rn = R / np.maximum(norm, 1e-9)                  # normalize columns
        self.S = Rn.T @ Rn                               # item-item cosine similarity
        np.fill_diagonal(self.S, 0.0)                    # an item isn't its own neighbor
        self.seen = user_seen(train, n_users)
        return self

    def recommend(self, u, k=10):
        scores = self.R[u] @ self.S                      # spread from items the user has
        return _topk_excluding(scores, self.seen.get(u, set()), k)

How the math works

  • Rn.T @ Rn computes, in one matrix multiply, the cosine similarity between every pair of items (rows/cols of the resulting S are items). Normalizing the columns first turns the dot product into a cosine.
  • self.R[u] @ self.S is the recommendation step: R[u] is a 0/1 vector of the items user u has. Multiplying by S spreads a vote from each of those items to its similar items and sums the votes. Items similar to many of the user's items score highest.

That's the whole "users who interacted with X also interacted with Y" intuition, expressed as two matrix multiplications.

How it does

From the leaderboard:

  Popularity         0.073
  ContentBased       0.080
  ItemItemCF         0.093

Item-item CF beats both popularity and content-based here, and it does so without any item features — purely from co-interaction patterns. It also tends to be more explainable ("recommended because you interacted with X").

Strengths and limits

Strengths

  • Simple, intuitive, explainable.
  • No item features required.
  • Item-item similarities can be precomputed offline and served fast.

Limits

  • Cold start: a new item with no interactions has an all-zero column → no similarities → it can never be recommended. Same for a new user with no history.
  • Sparsity: when the matrix is extremely sparse, co-interactions are rare and similarities get noisy.
  • Scale & memory: the item-item matrix is items × items. For millions of items that's far too big to store densely — you keep only each item's top-N neighbors, or move to the compact latent-vector models in the next chapter.

Those limits motivate matrix factorization: instead of an items × items similarity table, learn a small vector per user and per item that explains all the interactions. 👉

Matrix factorization (SGD & implicit ALS)

Matrix factorization (MF) is the idea that won the Netflix Prize and still underpins huge production systems. Instead of a giant similarity table, it learns a short latent vector for every user and every item, such that their dot product predicts interaction. This chapter explains it from the ground up, with the real math.

The core idea: hidden factors

Imagine every movie can be described by a few hidden dials — how much action, how much romance, how "indie" it is — and every user by how much they like each dial. You'd predict a user's interest in a movie by matching their preferences to the movie's traits.

MF learns these dials automatically. Pick a small number f of latent factors (say 32). Then:

  • each user $u$ gets a vector $x_u \in \mathbb{R}^f$,
  • each item $i$ gets a vector $y_i \in \mathbb{R}^f$,
  • the predicted affinity is their dot product $\hat{r}_{ui} = x_u \cdot y_i$.

Stacking these into matrices $X$ (users × f) and $Y$ (items × f), we're approximating the whole interaction matrix as a product:

$$ R \approx X\,Y^\top. $$

We compress an enormous, sparse R into two thin, dense matrices. The factors are "latent" because we never name them — the model discovers whatever dimensions best explain the data.

Version 1: explicit ratings via SGD

If we have explicit ratings, we want $x_u \cdot y_i$ to match the known rating $r_{ui}$. Minimize the squared error over observed ratings, plus a regularization term ($\lambda$) that keeps vectors small to avoid overfitting:

$$ \min_{X,Y} \sum_{(u,i)\,\text{observed}} \big(r_{ui} - x_u \cdot y_i\big)^2 + \lambda\big(\lVert x_u\rVert^2 + \lVert y_i\rVert^2\big). $$

Stochastic Gradient Descent (SGD) optimizes this by nudging the vectors toward lower error, one observed rating at a time. For the error $e_{ui} = r_{ui} - x_u\cdot y_i$:

for (u, i, r) in observed_ratings:
    e = r - X[u] @ Y[i]
    X[u] += lr * (e * Y[i] - reg * X[u])     # step toward lower error
    Y[i] += lr * (e * X[u] - reg * Y[i])

Each update moves $x_u$ and $y_i$ so their dot product gets closer to the true rating. Repeat over all ratings for several epochs. Simple and effective — but it assumes we have explicit ratings, and treats missing entries as "ignore", which isn't right for implicit data.

Version 2: implicit feedback via ALS (the real-world case)

Most data is implicit (clicks, views), and here the "missing means unknown, not disliked" problem (from Chapter 2) bites. The classic solution is Implicit ALS (Hu, Koren & Volinsky, 2008), and it's worth understanding deeply because it's everywhere.

Two new concepts: preference and confidence

For each user-item cell, define:

  • Preference $p_{ui} = 1$ if the user interacted with $i$, else $0$.
  • Confidence $c_{ui} = 1 + \alpha\, r_{ui}$, where $r_{ui}$ is the interaction count (or strength). We're more sure about a "1" the more the user engaged, and only weakly sure a "0" means disinterest.

The objective sums over every cell (observed and not), weighted by confidence:

$$ \min_{X,Y}\ \sum_{u,i} c_{ui}\big(p_{ui} - x_u\cdot y_i\big)^2 + \lambda\big(\lVert X\rVert^2 + \lVert Y\rVert^2\big). $$

The genius: every blank is included (as a low-confidence 0), so the model learns "probably not interested" softly, while strong interactions pull hard toward 1.

Why ALS instead of SGD here?

That sum is over all user×item cells — far too many to loop over with SGD. But notice: if we fix $Y$, the problem becomes an ordinary (weighted) least-squares problem in $X$ — solvable exactly, one user at a time. Then fix $X$ and solve for $Y$. Alternate until convergence. That's Alternating Least Squares.

For a single user, the optimal vector has a closed form:

$$ x_u = \big(Y^\top C^u Y + \lambda I\big)^{-1}\, Y^\top C^u p_u, $$

where $C^u$ is the diagonal matrix of that user's confidences and $p_u$ their preference vector. The matching formula for items swaps roles. (The Hu et al. speedup: $Y^\top C^u Y = Y^\top Y + Y^\top (C^u - I) Y$, and $C^u - I$ is zero except on the user's few interactions — so each solve is cheap.)

The code

import numpy as np

class ImplicitALS:
    def __init__(self, factors=32, reg=0.1, alpha=40.0, iters=15, seed=0):
        self.f, self.reg, self.alpha, self.iters, self.seed = factors, reg, alpha, iters, seed

    def fit(self, train, n_users, n_items):
        R = build_matrix(train, n_users, n_items)
        rng = np.random.default_rng(self.seed)
        X = rng.normal(scale=0.01, size=(n_users, self.f))
        Y = rng.normal(scale=0.01, size=(n_items, self.f))
        C = 1.0 + self.alpha * R          # confidence
        P = (R > 0).astype(float)         # preference (0/1)
        I = self.reg * np.eye(self.f)

        for _ in range(self.iters):
            YtY = Y.T @ Y                                  # shared across users
            for u in range(n_users):
                cu = C[u]
                A = YtY + Y.T @ (Y * (cu - 1.0)[:, None]) + I   # Y^T C^u Y + reg I
                b = (Y * (cu * P[u])[:, None]).sum(0)           # Y^T C^u p_u
                X[u] = np.linalg.solve(A, b)               # closed-form solve
            XtX = X.T @ X
            for i in range(n_items):
                ci = C[:, i]
                A = XtX + X.T @ (X * (ci - 1.0)[:, None]) + I
                b = (X * (ci * P[:, i])[:, None]).sum(0)
                Y[i] = np.linalg.solve(A, b)
        self.X, self.Y = X, Y
        self.seen = user_seen(train, n_users)
        return self

    def recommend(self, u, k=10):
        scores = self.Y @ self.X[u]        # dot product with every item vector
        return _topk_excluding(scores, self.seen.get(u, set()), k)

np.linalg.solve(A, b) is the exact least-squares solution for one user's vector — that's the "LS" in ALS. The outer loop alternates user and item updates.

How it does

From the leaderboard:

  ItemItemCF         0.093
  ImplicitALS        0.087
  BPR                0.110

ALS is competitive with item-item here and far more scalable and compact: it stores only f numbers per user/item (not an items×items table), and recommending is a dot product — which at scale becomes an ANN search over the item vectors (HNSW / IVF-PQ), the bridge to the serving chapter.

Why MF is so important

  • Generalization. It discovers latent structure, finding links that neighborhood methods miss (it can connect items that share no direct co-interactions but occupy the same latent region).
  • Compactness & speed. User/item vectors are tiny; scoring is a dot product → perfect for ANN serving.
  • The blueprint for deep models. "Learn a user vector and item vectors, score by dot product" is exactly what modern two-tower neural networks do — MF is their linear ancestor (next chapters).

But squared error optimizes rating prediction, not ranking. The next chapter optimizes the ranking directly. 👉

Learning to rank: BPR

Matrix factorization minimized prediction error — how close $x_u\cdot y_i$ is to 0/1. But we never show users a number; we show a ranked list. What we actually want is for items they'll like to be ranked above items they won't. BPR (Bayesian Personalized Ranking, Rendle et al. 2009) optimizes that directly, and it introduces the pivotal idea of negative sampling.

The shift: pairs, not points

"Pointwise" methods (like ALS) score each item in isolation. Pairwise methods look at pairs: for a user, a positive item (one they interacted with) should score higher than a negative item (one they didn't). We don't care about the absolute scores, only the order.

BPR's assumption: for user $u$, an observed item $i$ is preferred over an unobserved item $j$. Train the model so $\hat{x}{ui} > \hat{x}{uj}$ for as many such pairs as possible.

This sidesteps the "missing = unknown" trap elegantly: we never claim $j$ is disliked, only that $i$ is preferred to $j$.

Negative sampling

There are astronomically many $(u, i, j)$ triples. BPR samples them: for each observed $(u, i)$, draw a random item $j$ the user hasn't interacted with as the negative. This negative sampling is what makes training over implicit data tractable, and it reappears in virtually every modern neural recommender and embedding model.

The objective

For a triple $(u, i, j)$, let the score gap be $\hat{x}{uij} = x_u\cdot y_i - x_u\cdot y_j$. BPR pushes this gap to be large and positive by maximizing $\ln \sigma(\hat{x}{uij})$, where $\sigma(z) = 1/(1+e^{-z})$ is the logistic (sigmoid) function. Maximizing it makes "positive ranked above negative" more probable.

The gradient gives a clean update. With $s = \sigma(-\hat{x}{uij}) = 1 - \sigma(\hat{x}{uij})$ (how wrong the current order is — big when the pair is mis-ranked):

x_u += lr * ( s * (y_i - y_j) - reg * x_u )
y_i += lr * ( s *  x_u        - reg * y_i )
y_j += lr * ( s * (-x_u)      - reg * y_j )

Read it: if the model already ranks $i$ above $j$ ($s \approx 0$), barely move; if it's wrong ($s \approx 1$), take a big step pulling $y_i$ toward the user and pushing $y_j$ away.

The code

import numpy as np

class BPR:
    def __init__(self, factors=32, lr=0.05, reg=0.01, epochs=30, seed=0):
        self.f, self.lr, self.reg, self.epochs, self.seed = factors, lr, reg, epochs, seed

    def fit(self, train, n_users, n_items):
        rng = np.random.default_rng(self.seed)
        X = rng.normal(scale=0.1, size=(n_users, self.f))
        Y = rng.normal(scale=0.1, size=(n_items, self.f))
        ui = {u: set() for u in range(n_users)}
        for u, i, t in train:
            ui[u].add(i)
        pos = [(u, i) for u in ui for i in ui[u]]      # all observed (user, item)

        for _ in range(self.epochs):
            rng.shuffle(pos)
            for u, i in pos:
                j = int(rng.integers(n_items))         # sample a negative
                while j in ui[u]:
                    j = int(rng.integers(n_items))
                diff = X[u] @ (Y[i] - Y[j])
                s = 1.0 / (1.0 + np.exp(diff))         # = sigma(-diff): "how wrong"
                xu = X[u].copy()
                X[u] += self.lr * (s * (Y[i] - Y[j]) - self.reg * X[u])
                Y[i] += self.lr * (s * xu - self.reg * Y[i])
                Y[j] += self.lr * (-s * xu - self.reg * Y[j])
        self.X, self.Y = X, Y
        self.seen = user_seen(train, n_users)
        return self

    def recommend(self, u, k=10):
        scores = self.Y @ self.X[u]
        return _topk_excluding(scores, self.seen.get(u, set()), k)

Same X/Y vectors and dot-product scoring as MF — only the training objective changed, from "predict the value" to "get the order right".

How it does

From the leaderboard:

  ItemItemCF         0.093
  ImplicitALS        0.087
  BPR                0.110

BPR tops the recall@10 board here — unsurprising, since it optimizes ranking, which is exactly what recall@k measures. (On other splits/metrics ALS or item-item can edge ahead; always measure on your data.)

The bigger picture

  • WARP (Weighted Approximate-Rank Pairwise) is a popular cousin that samples harder negatives (keep drawing until you find a mis-ranked one) and weights updates by the estimated rank — often better top-k accuracy.
  • Negative sampling + a ranking loss is the template for modern retrieval/embedding training (including two-tower models and many text-embedding models). BPR is the clearest place to understand it.

So far, every model produces embeddings scored by a dot product. The next chapter generalizes that to neural networks and shows the architecture that ties the whole book together. 👉

Neural recommenders & two-stage retrieval

Every model so far ended the same way: a user vector, item vectors, and a dot product. Neural recommenders generalize that pattern, and the two-stage architecture is how all of it runs at industrial scale. This chapter is conceptual (no new from-scratch code) but it's where the whole book clicks together.

From matrix factorization to two-tower networks

MF learns user/item vectors as raw lookup tables. Its limits: it can't use side features (a user's age, an item's text, the time of day), and it can't embed a brand-new user or item (no row in the table → cold start).

The two-tower model fixes both. Replace the lookup tables with two neural networks ("towers"):

   user features                     item features
  (history, age, ctx)              (text, category, price)
        │                                  │
   ┌────▼────┐                        ┌────▼────┐
   │ USER     │                       │ ITEM     │
   │ tower    │  (neural net)         │ tower    │  (neural net)
   └────┬────┘                        └────┬────┘
        │  user vector  ── dot product ──  item vector
        └──────────────►  score  ◄──────────────┘
  • Each tower maps raw features → an embedding.
  • Score is still the dot product of the two embeddings (so MF is the special case where each tower is a lookup table).
  • Trained with a ranking loss and negative sampling — exactly the BPR idea from the last chapter, scaled up.

The payoff: because the item tower turns features into a vector, you can embed a new item from its content alone (helping cold start), and the user tower can fold in rich context. This is the backbone of YouTube, large-scale ad systems, and modern retrieval. (Other neural variants exist — Neural Collaborative Filtering, sequence models like GRU4Rec/SASRec that model the order of interactions, and graph neural networks — but two-tower is the dominant retrieval design.)

Why the dot product keeps mattering

Keeping the score a dot product of independent user/item vectors is a deliberate, load-bearing choice:

  • Item vectors can be computed offline and stored in an ANN index.
  • At request time you compute just the user vector, then do one nearest-neighbor search over item vectors.

If the score instead mixed user and item features together (a "cross" network), you couldn't precompute item vectors and you'd be back to scoring millions of items per request. The dot-product constraint is what makes fast retrieval possible — and it's why the HNSW and IVF-PQ books are the serving engine for everything in this book.

The two-stage architecture

No single model can be both fast enough for millions of items and rich enough for accurate ranking. So production systems split the work (introduced in Chapter 1):

Stage 1 — Candidate generation (retrieval)

  • Job: cut millions of items down to a few hundred plausible ones.
  • Optimize for recall (don't drop good items) and speed.
  • How: embed the user, run ANN nearest-neighbor search over item embeddings (from MF / two-tower / content models). Often several retrievers run in parallel — a CF retriever, a content retriever, a trending retriever — and their candidates are pooled.

Stage 2 — Ranking

  • Job: order those few hundred candidates precisely.
  • Optimize for precision / NDCG.
  • How: a heavy model (gradient-boosted trees or a deep network) scores each candidate using many features — user history, item attributes, context, cross-features, even predicted watch-time. Affordable because it only scores hundreds of items, not millions.
  catalog (10^7) ──[retrieval: ANN over embeddings]──► candidates (10^2)
                ──[ranking: heavy model + features]──► ordered top-k ──► user

Many systems add a stage 3 — re-ranking for business rules: diversity (don't show ten near-duplicates), freshness, de-duplication, fairness, and hard constraints ("in stock", "not already purchased").

Where each chapter fits

ChapterRole in the architecture
Popularity / Trending (4)a candidate source + cold-start fallback
Content-based (5)a candidate source (item-embedding retrieval)
Item-item CF (6)a candidate source
MF / ALS (7)embeddings for retrieval
BPR / two-tower (8, here)embeddings for retrieval; the ranking loss
ANN serving (11)how retrieval runs fast

The big remaining gap is users and items the models have never seen. That's cold start. 👉

Cold start

The hardest, most practical problem in recommendations: what do you recommend when you have no data? A brand-new user with no history, or a brand-new item with no interactions, breaks every collaborative-filtering model (no row/column to learn from). Handling cold start well is what separates a toy from a product.

Three flavors of cold start

TypeSituationWhy CF fails
New userjust signed up, zero historyno user vector / no neighbors
New itemjust added to the catalogno interactions → all-zero column
New systembrand-new product, almost no data at allnothing to learn from

Each needs a different mix of fixes.

New-user cold start

You can't personalize from nothing, so degrade gracefully along this ladder, adding personalization as signal arrives:

  1. Show trending / popular items. The single best zero-knowledge guess — it's why every app's first screen is "Popular now". (This is the Trending baseline, and exactly the fallback in our demo below.)
  2. Ask a little (onboarding). "Pick a few topics/creators you like." Three taps convert a cold user into a warm one — these picks seed a content-based profile immediately.
  3. Use what context you have. Sign-up country, language, device, referrer, time of day — even a coarse popularity-by-segment ("popular with users in your region") beats global popularity.
  4. Switch to personalization fast. After even one or two interactions, content-based filtering can build a (rough) taste vector; after more, CF/MF kicks in. The art is blending these as confidence grows.

See the fallback in code

Our demo asks a personalized model and a fallback model to recommend for a brand-new user id with no history:

cold start — a new user with no interactions:
  ContentBased.recommend(new_user) -> []  (empty: no history)
  Trending fallback                -> [127, 105, 87, 78, 13]  (works without history)

The personalized model correctly returns nothing (it has no profile to work from), and the system falls back to trending — which needs no per-user data. That graceful hand-off is the core cold-start pattern, in two lines.

New-item cold start

A new item has no interactions, so CF can never surface it. Fixes:

  1. Content-based filtering. This is content-based's superpower: embed the new item from its features/text and it's immediately recommendable to users whose taste vector is nearby — no interactions required (Chapter 5).
  2. Two-tower models. The item tower turns features into an embedding, so a new item gets a vector the moment it's added (Chapter 9).
  3. Deliberate exploration. Intentionally show the new item to some users to gather the interactions CF needs — which leads to the key idea below.

Exploration vs. exploitation (bandits)

There's a fundamental tension:

  • Exploit: show what you already know works (safe, good short-term metrics).
  • Explore: show uncertain/new items to learn whether they're good (risky short-term, essential long-term).

Pure exploitation creates a trap: new items never get shown, so they never gather data, so they're never shown — the feedback loop that buries the long tail. Multi-armed bandit algorithms balance the two:

  • ε-greedy: exploit the best item with probability $1-\varepsilon$; with probability $\varepsilon$, show a random/new item to learn. Dead simple, works.
  • Upper Confidence Bound (UCB): rank by optimistic estimate — value plus an uncertainty bonus that's large for rarely-shown items — so under-explored items get a chance precisely because we're unsure about them.
  • Thompson sampling: keep a probability distribution over each item's true value and sample from it; naturally shows promising-but-uncertain items more often.

Bandits are how systems give new items a fair shot without tanking metrics, and contextual bandits (which condition the choice on user/context features) are widely used for cold-start and exploration in real recommenders.

A practical cold-start policy

A real system stitches these together by how much it knows about the user:

   interactions known about the user
   0            ──►  trending / popular (+ context, + onboarding picks)
   1 .. a few   ──►  content-based taste vector (works from a short history)
   many         ──►  collaborative filtering / MF / two-tower (full personalization)
   throughout   ──►  a little exploration (bandit) to keep learning & surface new items

This blend — popularity for the cold, content-based for the lukewarm, CF for the warm, plus constant exploration — is the backbone of robust real-world systems.

Now, how all this runs fast at scale. 👉

Serving at scale

We've built models that produce user and item embeddings scored by a dot product. This chapter is about turning that into a system that answers millions of requests per second over millions of items — and it's where this book plugs directly into the HNSW and IVF-PQ books.

The retrieval problem, restated

Candidate generation is: given the user's vector, find the item vectors with the highest dot product. That's nearest-neighbor search — exactly what the ANN books solve. Doing it exactly (scoring every item) is too slow at catalog scale, so we use Approximate Nearest Neighbor (ANN) indexes:

  • HNSW — the navigable-graph index; great recall and latency, used when RAM is available.
  • IVF / PQ — partition + compression; used when the vector set is huge and must be shrunk to fit memory.

Maximum inner product, not distance. Recommenders rank by largest dot product (MIPS — Maximum Inner Product Search), while ANN indexes find smallest distance. They coincide when vectors are normalized (cosine); when item magnitudes matter, a standard transform adds one extra coordinate so nearest-distance reproduces largest-dot-product. Either way, the same ANN index serves recommendations. (This is detailed in the IVF-PQ book's use-cases.)

The systems you'll actually use

These are the engines that store embeddings and run ANN search in production:

  • FAISS (library) — the reference ANN toolkit; IndexHNSWFlat, IndexIVFPQ, etc. You embed offline and query in-process.
  • OpenSearch / Elasticsearch — a knn_vector field type backed by HNSW (via Lucene/nmslib/FAISS). Bonus: hybrid search — blend vector similarity with classic keyword/metadata filters in one query (e.g. "nearest items that are in stock and in this category"), which is invaluable for recommendation business rules.
  • Milvus / Qdrant / Weaviate / Pinecone — purpose-built vector databases with IVF/PQ/HNSW indexes, metadata filtering, and horizontal scaling.

The workflow with any of them is the same: build the index offline from item embeddings, then at request time compute the user embedding and issue a top-k ANN query (optionally with filters).

Offline vs. online: the two-loop architecture

Real systems run two loops at very different speeds:

  OFFLINE (hours/daily)                  ONLINE (milliseconds, per request)
  ─────────────────────                  ──────────────────────────────────
  • train models                         • build/look up the user vector
  • compute ALL item embeddings          • ANN query item index -> candidates
  • build the ANN index                  • rank candidates with a fast model
  • compute item-item neighbor lists     • apply business rules / re-rank
  • precompute popularity / trending     • return top-k
  • Item embeddings change slowly, so they're computed in big batch jobs and loaded into the ANN index periodically.
  • The user side is real-time: a returning user's vector may be refreshed from their latest clicks within the request, so recommendations react immediately to what they just did.

The feature store

The ranking model needs the same features offline (for training) and online (for serving) — user history aggregates, item stats, counts, context. A feature store is the system that computes, stores, and serves these consistently, and prevents train/serve skew (the classic, painful bug where a feature is computed one way in training and a slightly different way in production, quietly wrecking quality). It typically pairs an offline store (for training data) with a low-latency online store (for serving).

Keeping it fresh & the feedback loop

  • Freshness. News/short-video catalogs change by the minute; you re-embed and re-index new items continuously, and lean on content/two-tower embeddings + trending so new items are recommendable instantly.
  • The feedback loop (and its danger). The system's recommendations shape what users click, which becomes tomorrow's training data, which shapes tomorrow's recommendations. Left unchecked this amplifies popularity bias and narrows what users ever see. Deliberate exploration (cold start) and diversity constraints (best practices) are how you keep the loop healthy.

Latency budget — a rough picture

A typical request has tens of milliseconds total. Spending it well:

  build user vector        ~1-3 ms
  ANN retrieval (top few hundred)   ~1-10 ms   <-- HNSW / IVF-PQ
  feature fetch + ranking  ~5-20 ms
  re-rank / business rules ~1-5 ms

The ANN step is fast because of the algorithms in the companion books — without them, retrieval over millions of items couldn't fit in the budget.

Next, the consolidated implementation of every algorithm we built. 👉

The full implementation

Here is the consolidated module — synthetic data, metrics, and every recommender (Random, Popularity, Trending, ContentBased, ItemItemCF, ImplicitALS, BPR) — in one file, code/recsys.py, reproduced in full.

"""
Recommendation algorithms, from scratch in NumPy.

Implements, from simplest to most powerful:
    metrics: recall@k, precision@k, ndcg@k, map@k
    Random / Popularity / Trending (time-decayed popularity)   -- baselines
    ContentBased (decayed user profile + cosine kNN)           -- the "article" method
    ItemItemCF (neighborhood collaborative filtering)
    ImplicitALS (matrix factorization for implicit feedback)
    BPR (Bayesian Personalized Ranking — pairwise learning-to-rank)

Plus make_synthetic() to generate an evaluable interaction dataset, and
train_test_split_last() for leave-last-out evaluation.

Only NumPy is used.
"""

from __future__ import annotations

import numpy as np


# ===========================================================================
# Data: a synthetic but realistic interaction log
# ===========================================================================
def make_synthetic(n_users=300, n_items=150, n_genres=8, n_inter=18000,
                   pop_weight=0.8, taste_pow=2.5, seed=0):
    """
    Returns:
        interactions : list of (user, item, time) tuples, time increasing
        item_feats   : (n_items, n_genres) content vectors (the "embeddings")
        n_users, n_items
    Each item belongs to a genre; each user has a peaky genre taste; item
    popularity is skewed. An interaction's probability mixes the user's taste with
    global popularity (pop_weight controls the blend), so popularity is a real
    baseline and personalized methods can still win by matching taste.
    """
    rng = np.random.default_rng(seed)
    item_genre = rng.integers(0, n_genres, size=n_items)
    item_feats = np.full((n_items, n_genres), 0.05)
    item_feats[np.arange(n_items), item_genre] = 1.0          # mostly one genre
    item_feats += rng.random((n_items, n_genres)) * 0.05

    pop = rng.power(0.5, size=n_items)                        # skewed popularity
    pop = pop / pop.sum()

    taste = rng.random((n_users, n_genres)) ** taste_pow      # peaky tastes
    taste /= taste.sum(1, keepdims=True)

    interactions = []
    for t in range(n_inter):
        u = int(rng.integers(n_users))
        score = taste[u, item_genre] * (pop_weight * pop + (1 - pop_weight) / n_items)
        score /= score.sum()
        i = int(rng.choice(n_items, p=score))
        interactions.append((u, i, t))
    return interactions, item_feats, n_users, n_items


def train_test_split_last(interactions, n_users):
    """Leave-last-out: each user's chronologically last interaction -> test."""
    by_user = {}
    for u, i, t in interactions:
        by_user.setdefault(u, []).append((t, i))
    train, test = [], {}
    for u, lst in by_user.items():
        lst.sort()
        if len(lst) >= 2:
            *rest, last = lst
            test[u] = last[1]
            for t, i in rest:
                train.append((u, i, t))
        else:
            for t, i in lst:
                train.append((u, i, t))
    return train, test


# ===========================================================================
# Metrics
# ===========================================================================
def recall_at_k(recs, truth, k):
    """Fraction of users whose held-out item appears in their top-k (hit rate)."""
    hits = sum(1 for u, item in truth.items() if item in recs.get(u, [])[:k])
    return hits / len(truth)


def precision_at_k(recs, truth, k):
    """With one held-out item per user, this is recall/k — reported for completeness."""
    return recall_at_k(recs, truth, k) / k


def ndcg_at_k(recs, truth, k):
    """Normalized Discounted Cumulative Gain: rewards ranking the hit higher."""
    total = 0.0
    for u, item in truth.items():
        lst = recs.get(u, [])[:k]
        if item in lst:
            rank = lst.index(item)              # 0-based
            total += 1.0 / np.log2(rank + 2)    # ideal DCG is 1 here
    return total / len(truth)


def map_at_k(recs, truth, k):
    """Mean Average Precision (single relevant item => 1/rank)."""
    total = 0.0
    for u, item in truth.items():
        lst = recs.get(u, [])[:k]
        if item in lst:
            total += 1.0 / (lst.index(item) + 1)
    return total / len(truth)


# ===========================================================================
# Helpers
# ===========================================================================
def build_matrix(train, n_users, n_items):
    """Binary user-item interaction matrix R."""
    R = np.zeros((n_users, n_items))
    for u, i, t in train:
        R[u, i] = 1.0
    return R


def _topk_excluding(scores, seen, k):
    """Top-k item ids by score, skipping items the user already has."""
    order = np.argsort(-scores)
    out = []
    for i in order:
        if i not in seen:
            out.append(int(i))
            if len(out) == k:
                break
    return out


def user_seen(train, n_users):
    seen = {u: set() for u in range(n_users)}
    for u, i, t in train:
        seen[u].add(i)
    return seen


# ===========================================================================
# Baselines
# ===========================================================================
class Random:
    """Recommend random unseen items — the sanity-check floor."""
    def __init__(self, seed=0):
        self.seed = seed

    def fit(self, train, n_users, n_items):
        self.n_items = n_items
        self.rng = np.random.default_rng(self.seed)
        self.seen = user_seen(train, n_users)
        return self

    def recommend(self, u, k=10):
        return _topk_excluding(self.rng.random(self.n_items), self.seen.get(u, set()), k)


class Popularity:
    """Recommend the globally most-interacted items to everyone."""
    def fit(self, train, n_users, n_items):
        self.counts = np.zeros(n_items)
        for u, i, t in train:
            self.counts[i] += 1
        self.seen = user_seen(train, n_users)
        return self

    def recommend(self, u, k=10):
        return _topk_excluding(self.counts, self.seen.get(u, set()), k)


class Trending:
    """Time-decayed popularity: recent interactions count more (a 'trending' list)."""
    def __init__(self, half_life=2000.0):
        self.half_life = half_life

    def fit(self, train, n_users, n_items):
        lam = np.log(2) / self.half_life
        now = max((t for u, i, t in train), default=0)
        self.scores = np.zeros(n_items)
        for u, i, t in train:
            self.scores[i] += np.exp(-lam * (now - t))
        self.seen = user_seen(train, n_users)
        return self

    def recommend(self, u, k=10):
        return _topk_excluding(self.scores, self.seen.get(u, set()), k)


# ===========================================================================
# Content-based: decayed user profile + cosine kNN (the "article" method)
# ===========================================================================
class ContentBased:
    """
    Build each user's profile as a TIME-DECAYED average of the content vectors of
    items they interacted with, then score all items by cosine similarity to it.
    This is the 'average article embedding -> kNN over the catalog' approach.
    """
    def __init__(self, half_life=2000.0):
        self.half_life = half_life

    def fit(self, train, item_feats, n_users, n_items):
        self.item_feats = item_feats
        norm = np.linalg.norm(item_feats, axis=1, keepdims=True)
        self.feat_unit = item_feats / np.maximum(norm, 1e-9)   # for cosine
        self.hist = {u: [] for u in range(n_users)}
        for u, i, t in train:
            self.hist[u].append((t, i))
        self.now = max((t for u, i, t in train), default=0)
        self.seen = user_seen(train, n_users)
        return self

    def profile(self, u):
        h = self.hist.get(u, [])
        if not h:
            return None
        lam = np.log(2) / self.half_life
        items = [i for t, i in h]
        w = np.array([np.exp(-lam * (self.now - t)) for t, i in h])
        return (w[:, None] * self.item_feats[items]).sum(0) / w.sum()

    def recommend(self, u, k=10):
        p = self.profile(u)
        if p is None:
            return []
        pu = p / max(np.linalg.norm(p), 1e-9)
        scores = self.feat_unit @ pu                  # cosine similarity to every item
        return _topk_excluding(scores, self.seen.get(u, set()), k)


# ===========================================================================
# Neighborhood collaborative filtering: item-item
# ===========================================================================
class ItemItemCF:
    """
    'Users who interacted with X also interacted with Y.' Similarity between items
    is the cosine of their interaction columns; a user's score for an item is the
    summed similarity to items they already have.
    """
    def fit(self, train, n_users, n_items):
        R = build_matrix(train, n_users, n_items)
        self.R = R
        norm = np.linalg.norm(R, axis=0, keepdims=True)
        Rn = R / np.maximum(norm, 1e-9)
        self.S = Rn.T @ Rn                            # item-item cosine similarity
        np.fill_diagonal(self.S, 0.0)
        self.seen = user_seen(train, n_users)
        return self

    def recommend(self, u, k=10):
        scores = self.R[u] @ self.S                   # spread from items the user has
        return _topk_excluding(scores, self.seen.get(u, set()), k)


# ===========================================================================
# Matrix factorization for implicit feedback (Hu, Koren, Volinsky 2008)
# ===========================================================================
class ImplicitALS:
    """
    Factor R ~ X Y^T with confidence-weighted Alternating Least Squares.
    Confidence c_ui = 1 + alpha * R_ui; preference p_ui = 1 if interacted.
    """
    def __init__(self, factors=32, reg=0.1, alpha=40.0, iters=15, seed=0):
        self.f, self.reg, self.alpha, self.iters, self.seed = factors, reg, alpha, iters, seed

    def fit(self, train, n_users, n_items):
        R = build_matrix(train, n_users, n_items)
        rng = np.random.default_rng(self.seed)
        X = rng.normal(scale=0.01, size=(n_users, self.f))
        Y = rng.normal(scale=0.01, size=(n_items, self.f))
        C = 1.0 + self.alpha * R                       # confidence
        P = (R > 0).astype(float)                      # preference
        I = self.reg * np.eye(self.f)

        for _ in range(self.iters):
            YtY = Y.T @ Y
            for u in range(n_users):
                cu = C[u]
                Yw = Y * (cu - 1.0)[:, None]
                A = YtY + Y.T @ Yw + I
                b = (Y * (cu * P[u])[:, None]).sum(0)
                X[u] = np.linalg.solve(A, b)
            XtX = X.T @ X
            for i in range(n_items):
                ci = C[:, i]
                Xw = X * (ci - 1.0)[:, None]
                A = XtX + X.T @ Xw + I
                b = (X * (ci * P[:, i])[:, None]).sum(0)
                Y[i] = np.linalg.solve(A, b)
        self.X, self.Y = X, Y
        self.seen = user_seen(train, n_users)
        return self

    def recommend(self, u, k=10):
        scores = self.Y @ self.X[u]
        return _topk_excluding(scores, self.seen.get(u, set()), k)


# ===========================================================================
# Bayesian Personalized Ranking (pairwise learning-to-rank)
# ===========================================================================
class BPR:
    """
    Optimize the RANKING directly: for each (user, positive item), sample a
    negative item and push the positive's score above the negative's.
    """
    def __init__(self, factors=32, lr=0.05, reg=0.01, epochs=30, seed=0):
        self.f, self.lr, self.reg, self.epochs, self.seed = factors, lr, reg, epochs, seed

    def fit(self, train, n_users, n_items):
        rng = np.random.default_rng(self.seed)
        X = rng.normal(scale=0.1, size=(n_users, self.f))
        Y = rng.normal(scale=0.1, size=(n_items, self.f))
        ui = {u: set() for u in range(n_users)}
        for u, i, t in train:
            ui[u].add(i)
        pos = [(u, i) for u in ui for i in ui[u]]

        for _ in range(self.epochs):
            rng.shuffle(pos)
            for u, i in pos:
                j = int(rng.integers(n_items))
                while j in ui[u]:
                    j = int(rng.integers(n_items))
                diff = X[u] @ (Y[i] - Y[j])
                sig = 1.0 / (1.0 + np.exp(diff))        # gradient weight
                xu = X[u].copy()
                X[u] += self.lr * (sig * (Y[i] - Y[j]) - self.reg * X[u])
                Y[i] += self.lr * (sig * xu - self.reg * Y[i])
                Y[j] += self.lr * (-sig * xu - self.reg * Y[j])
        self.X, self.Y = X, Y
        self.seen = user_seen(train, n_users)
        return self

    def recommend(self, u, k=10):
        scores = self.Y @ self.X[u]
        return _topk_excluding(scores, self.seen.get(u, set()), k)


__all__ = [
    "make_synthetic", "train_test_split_last",
    "recall_at_k", "precision_at_k", "ndcg_at_k", "map_at_k",
    "build_matrix", "user_seen",
    "Random", "Popularity", "Trending", "ContentBased", "ItemItemCF",
    "ImplicitALS", "BPR",
]

The {{#include}} directive splices in the real code/recsys.py at build time, so the documentation can never drift from the actual implementation.

API at a glance

Every model follows the same tiny interface — .fit(...) then .recommend(u, k) — so they're interchangeable in the leaderboard:

CallPurpose
make_synthetic(...)generate an evaluable interaction log + item features
train_test_split_last(...)leave-last-out split (test = each user's last item)
recall_at_k / ndcg_at_k / map_at_koffline metrics
Popularity().fit(train, nu, ni)popularity baseline
Trending(half_life=…).fit(…)time-decayed popularity
ContentBased(half_life=…).fit(train, feats, nu, ni)embeddings + decayed profile + kNN
ItemItemCF().fit(…)neighborhood collaborative filtering
ImplicitALS(factors=…).fit(…)matrix factorization (implicit ALS)
BPR(factors=…).fit(…)pairwise learning-to-rank

Typical usage

from recsys import make_synthetic, train_test_split_last, ImplicitALS, recall_at_k

inter, feats, nu, ni = make_synthetic(seed=0)
train, test = train_test_split_last(inter, nu)

model = ImplicitALS(factors=32, iters=15).fit(train, nu, ni)
recs = {u: model.recommend(u, k=10) for u in test}
print("recall@10:", recall_at_k(recs, test, 10))

What this is and isn't

These implementations are written for clarity — they show exactly how each algorithm works and produce correct, measurable rankings on small data. Production systems differ in scale, not in concept:

  • Sparse math (we used dense matrices; real systems use sparse formats and only store interactions).
  • Optimized librariesimplicit (ALS/BPR in Cython), LightFM (hybrid WARP/BPR), TensorFlow/PyTorch for two-tower models, gradient-boosted trees for ranking.
  • ANN serving.recommend here scans all items; at scale that dot-product scan becomes an HNSW/IVF-PQ query.
  • Feature stores, batch/stream pipelines, A/B testing around the model (serving).

Same algorithms; the rest is engineering. Let's measure them head-to-head. 👉

A worked evaluation: the leaderboard

Now we put every algorithm on the same data and the same scoreboard. This is how you'd actually decide what to ship: train them all, measure offline, and read the ranking. The full script is code/demo.py.

The setup

make_synthetic builds an interaction log with real structure: each item has a genre, each user has a peaky genre taste, and item popularity is skewed — so popularity is a genuine baseline and personalization can still win. We hold out each user's last interaction (leave-last-out) and score the top-10.

from recsys import (make_synthetic, train_test_split_last,
                    recall_at_k, ndcg_at_k, map_at_k,
                    Random, Popularity, Trending, ContentBased,
                    ItemItemCF, ImplicitALS, BPR)

inter, feats, nu, ni = make_synthetic(seed=0)
train, test = train_test_split_last(inter, nu)

The result

Running python demo.py (~10s) prints:

users=300  items=150  interactions=18000  test users=300

leaderboard (k=10, higher is better):

  model           recall@k   ndcg@k   map@k
  -----------------------------------------
  Random             0.033    0.014   0.008
  Popularity         0.073    0.032   0.020
  Trending           0.073    0.031   0.019
  ContentBased       0.080    0.035   0.021
  ItemItemCF         0.093    0.051   0.038
  ImplicitALS        0.087    0.041   0.028
  BPR                0.110    0.045   0.026

cold start — a new user with no interactions:
  ContentBased.recommend(new_user) -> []  (empty: no history)
  Trending fallback                -> [127, 105, 87, 78, 13]  (works without history)

How to read it

The ranking tells the whole story of the book, bottom to top:

  • Random (0.033) — the floor. Any real method must beat this.
  • Popularity / Trending (0.073) — non-personalized, yet 2× random. Always measure against this; it's deceptively strong and it's your cold-start fallback.
  • ContentBased (0.080) — the first personalization, from item features + decayed profile. Modest here (genre features are coarse), but it's the method that handles new items and short histories.
  • ItemItemCF (0.093) — learning from the crowd beats content, with no item features at all. Note its higher ndcg/map: it ranks the hit earlier, not just within top-10.
  • ImplicitALS (0.087) — competitive and compact/scalable (tiny vectors, ANN- servable) — the production-friendly choice.
  • BPR (0.110) — best recall@10, because it optimizes ranking directly, which is what recall@k rewards.

And the cold-start lines show the graceful fallback: the personalized model returns nothing for a brand-new user, so the system serves trending instead.

The honest caveats

  • Absolute numbers are dataset-dependent. Offline recall@10 on sparse implicit data is often in this range; what matters is the ordering and the gap over random/popularity, not the raw value.
  • No single winner. BPR leads recall@10; item-item leads ndcg/map. The "best" model depends on your metric, data, and constraints — which is why you run a leaderboard instead of trusting a reputation.
  • Offline ≠ online. This leaderboard filters candidates worth A/B testing; the live test decides (metrics chapter).

Try it yourself

  • Change seed in make_synthetic and re-run — the ordering is stable, the exact numbers wobble.
  • Raise pop_weight (more popularity-driven data) and watch Popularity climb.
  • Tune factors/iters (ALS) or epochs/lr (BPR) and watch them move.
  • Increase k to 20 and see recall rise (more chances to include the hit).

Finally, the wisdom that doesn't fit in a metric: best practices and pitfalls. 👉

Best practices & pitfalls

The algorithms are the easy part. What separates a good recommender from a harmful or useless one is everything around them. This chapter is the hard-won wisdom — the traps that sink real systems and the practices that keep them healthy.

Evaluation traps

  • Don't test on the past you trained on. Always split by time (leave-last-out or a time cutoff), never randomly — a random split lets the model "see the future", inflating offline scores that collapse in production.
  • Don't trust offline metrics alone. They score the model on behavior the old recommender produced and can't tell whether a new recommendation would've been clicked. Use offline metrics to filter ideas, then A/B test the survivors on live traffic and judge by business metrics.
  • Always include dumb baselines. If you can't beat popularity, your fancy model is broken or pointless. Random and popularity belong on every leaderboard.
  • Beware leakage. A feature that secretly encodes the answer (e.g. "number of interactions with this item" computed including the test interaction) gives fantastic offline numbers and fails live.

Bias and feedback loops

  • Popularity bias. Models trained on logged clicks learn to recommend what's already popular, which gets more clicks, which reinforces it — the rich get richer and the long tail starves. Counter it with exploration (bandits), diversity constraints, and sometimes popularity de-biasing in the loss.
  • Position bias. Users click top-ranked items partly because they're on top, not because they're best. Training naively on clicks teaches "whatever we already ranked high is good." Mitigations: model the position, randomize positions slightly, or use inverse-propensity weighting.
  • The feedback loop. Recommendations shape data shapes recommendations. Without deliberate exploration, the system narrows over time and you stop learning about anything you don't already show.
  • Filter bubbles. Over-personalization traps users in a narrow slice of the catalog. Inject diversity and serendipity on purpose — both for user experience and for catalog health.

Beyond accuracy: what to optimize

Maximizing recall@k alone produces a boring, narrow, popularity-heavy product. Balance it with:

  • Diversity — don't show ten near-duplicates; enforce variety in the top-k (e.g. with Maximal Marginal Relevance, which trades relevance against similarity to already-picked items).
  • Novelty / serendipity — surface things the user wouldn't have found alone.
  • Coverage — make sure the catalog's long tail gets shown, not just the hits.
  • Freshness — new items reach users quickly.
  • Business rules — in-stock, licensing, fairness, "don't recommend what they just bought", content-safety.

These usually live in the re-rank stage as constraints on top of the model's scores.

Practical engineering wisdom

  • Start simple. Popularity → content-based → item-item → MF. Ship the simplest thing that beats the baseline; add complexity only when it earns its keep in an A/B test.
  • Implicit ≠ explicit. Treat missing data as unknown, not negative (use confidence weighting / negative sampling). This is the most common modeling mistake.
  • Prevent train/serve skew. Compute features the same way offline and online — use a feature store. Subtle mismatches quietly destroy quality.
  • Log everything, including what you showed. You need impressions (not just clicks) to model position bias and to debug. And log the context of each recommendation.
  • Tune the time decay. Half-life is a real product lever: short for fast-moving catalogs (news, short video), long for stable ones (movies, books).
  • Plan for cold start from day one — it's the first thing real users hit, not an edge case.

Ethical considerations

Recommenders shape attention at enormous scale, so design choices have real consequences:

  • Engagement ≠ well-being. Optimizing pure engagement can amplify outrage, misinformation, or addictive patterns. Pick objectives deliberately and include guardrails.
  • Fairness. Both to users (don't entrench bias) and to item providers (give new or minority creators exposure — see exploration).
  • Transparency & control. Explain recommendations where you can ("because you watched X") and give users ways to steer or reset them.
  • Privacy. Interaction histories are sensitive; handle them accordingly.

The one-paragraph summary

Build the simplest model that beats popularity; treat implicit feedback honestly; split by time and validate with A/B tests; serve embeddings through an ANN index; handle cold start with trending + content + exploration; and deliberately balance accuracy against diversity, novelty, freshness, and fairness — because the metric you optimize is the product you get.

Last, a runnable tool that puts the content-based method in your hands. 👉

One-file CLI: an article recommender

Everything from the content-based chapter, distilled into one runnable tool — the exact pipeline from the introduction: embed articles, build a time-decayed user profile, nearest-neighbor over the catalog, and fall back to representative/trending items on cold start. It needs no machine-learning libraries (embeddings via from-scratch TF-IDF).

The steps

  1. Load a catalog (one article per line).
  2. Embed each article as a TF-IDF vector (built from scratch).
  3. Profile the user: a time-decayed average of the articles they've read (most recent weighted most).
  4. Recommend: cosine nearest-neighbor between the profile and every unread article.
  5. Cold start: with no history, return the most representative articles (a content-only stand-in for trending).

Install & run

pip install numpy        # that's all it needs

# personalized: the user read articles 0 and 1 (1 most recent)
python recommend_cli.py articles.txt --history 0,1 -k 3

# cold start: a brand-new user
python recommend_cli.py articles.txt -k 3

A sample articles.txt (20 short articles across tech / health / sports / space) ships alongside the script.

It works — real output

Personalized — a user who read two tech articles gets tech back:

$ python recommend_cli.py articles.txt --history 0,1 -k 3
catalog: 20 articles
history (oldest->newest): [0, 1]
  most recent read: 'Google releases smartphone software update improving camera battery and performance'

recommended for you:
  1. (score=0.453)  [3] Samsung reveals smartphone with upgraded camera processor and faster software
  2. (score=0.337)  [2] New laptop launches with a powerful processor faster software and better battery
  3. (score=0.311)  [4] Chipmaker announces faster processor boosting laptop and smartphone performance

Switch the history to sports articles and you get sports back instead — the taste vector moves to a different region of embedding space and kNN follows.

Cold start — no history, so fall back:

$ python recommend_cli.py articles.txt -k 3
catalog: 20 articles
no history -> COLD START: showing representative/popular-by-content

cold-start picks:
  1. (score=0.133)  [5] Daily exercise and a healthy diet reduce heart disease risk a new study finds
  2. (score=0.109)  [7] Study shows healthy diet and exercise improve heart health and reduce disease
  3. (score=0.107)  [12] Football team celebrates a championship victory after a dramatic final match

The complete script

#!/usr/bin/env python3
"""
recommend_cli.py — a content-based article recommender, from scratch.

This is the canonical "embed articles -> build a time-decayed user profile ->
nearest-neighbor over the catalog" pipeline, runnable on a plain text file.

  - Catalog: a text file, one article (title/summary) per line.
  - History: the indices of articles the user has read, oldest-first
    (most recent counts most, via time decay).
  - Recommend: cosine k-NN between the user's profile and every unread article.
  - Cold start: with no history, fall back to the most representative articles.

Usage:
    # personalized: user read articles 0, 5, 9 (9 most recent)
    python recommend_cli.py articles.txt --history 0,5,9 -k 5

    # cold start: a brand-new user
    python recommend_cli.py articles.txt -k 5

Requirements: numpy. (TF-IDF is built from scratch; no ML libraries.)
"""

from __future__ import annotations

import argparse
import re

import numpy as np


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf(docs):
    """Return an (n_docs, vocab) TF-IDF matrix (rows L2-normalized)."""
    df = {}
    for d in docs:
        for w in set(tokenize(d)):
            df[w] = df.get(w, 0) + 1
    vocab = {w: j for j, w in enumerate(sorted(df))}
    n = len(docs)
    idf = np.zeros(len(vocab))
    for w, j in vocab.items():
        idf[j] = np.log((1 + n) / (1 + df[w])) + 1.0
    X = np.zeros((n, len(vocab)))
    for r, d in enumerate(docs):
        for w in tokenize(d):
            X[r, vocab[w]] += 1.0
        X[r] *= idf
    X /= np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1e-9)   # unit rows
    return X


def decayed_profile(X, history, half_life):
    """Time-decayed average of read-article vectors. history: oldest-first ids."""
    lam = np.log(2) / half_life
    n = len(history)
    # age 0 = most recent (last in list) -> highest weight
    weights = np.array([np.exp(-lam * (n - 1 - pos)) for pos in range(n)])
    profile = (weights[:, None] * X[history]).sum(0) / weights.sum()
    return profile / max(np.linalg.norm(profile), 1e-9)


def main(argv=None):
    p = argparse.ArgumentParser(description="Content-based article recommender.")
    p.add_argument("articles", help="text file, one article per line")
    p.add_argument("--history", default="",
                   help="comma-separated article indices the user read, oldest first")
    p.add_argument("-k", type=int, default=5, help="recommendations to return")
    p.add_argument("--half-life", type=float, default=3.0,
                   help="recency half-life in #articles (smaller = more recency-biased)")
    args = p.parse_args(argv)

    with open(args.articles, encoding="utf-8") as f:
        docs = [ln.rstrip("\n") for ln in f if ln.strip()]
    X = tfidf(docs)
    print(f"catalog: {len(docs)} articles")

    history = [int(x) for x in args.history.split(",") if x.strip() != ""]

    if history:
        prof = decayed_profile(X, history, args.half_life)
        scores = X @ prof
        scores[history] = -np.inf                       # don't re-recommend read items
        print(f"history (oldest->newest): {history}")
        print(f"  most recent read: {docs[history[-1]]!r}\n")
        label = "recommended for you"
    else:
        # cold start: most "representative" articles (highest average similarity).
        sim = X @ X.T
        np.fill_diagonal(sim, 0.0)
        scores = sim.mean(1)
        print("no history -> COLD START: showing representative/popular-by-content\n")
        label = "cold-start picks"

    top = np.argsort(-scores)[:args.k]
    print(f"{label}:")
    for rank, i in enumerate(top, 1):
        print(f"  {rank}. (score={scores[i]:.3f})  [{i}] {docs[i]}")


if __name__ == "__main__":
    main()

From toy to production

This is a real, if small, content-based recommender. To productionize it you'd swap two pieces and keep the structure:

  • Better embeddings. Replace TF-IDF with a neural text-embedding model — the profile-and-kNN logic is unchanged, the relevance jumps.
  • ANN serving. Replace the exhaustive X @ profile scan with an HNSW or IVF-PQ index so it scales to millions of articles in milliseconds.
  • Real trending for cold start. Swap the content-only "representative" fallback for true trending from your interaction logs.

That's the whole book in one tool: embeddings → time decay → nearest-neighbor → cold-start fallback — the backbone of real recommendation and retrieval systems. 🎉

Capstone: a real news recommender

The rest of this book taught the ideas. This capstone assembles them into a single, production-style application you can run, evaluate, track, serve, and put a UI on — the kind of end-to-end project you'd build on the job (or show in a portfolio).

The whole project lives in code/capstone/. Every snippet in these chapters is {{#include}}'d from that real, runnable code, and every output shown was produced by running it.

What we build: a news recommender

A recommender for news articles — heavy on FIFA/soccer plus general categories (politics, tech, health, finance, …). Given a user's reading history it recommends articles; given a question it answers from the news (RAG); and it handles brand-new users (cold start) gracefully.

The architecture

                         ┌────────────────────── offline ──────────────────────┐
   data (MIND schema)    │  embed articles → build ANN index                    │
   news.tsv + behaviors  │  compute trending (time decay)                       │
        │                │  train stage-2 ranker on click logs                  │
        └───────────────►│  evaluate (recall@k, NDCG, AUC) → log to MLflow      │
                         └──────────────────────────┬───────────────────────────┘
                                                    │  model artifact
                         ┌──────────────────────────▼──────── online ───────────┐
   React UI  ──HTTP──►   │  FastAPI:                                             │
   (recs, search, ask)   │   /recommend  = decayed profile → ANN candidates →   │
                         │                 logistic ranker → top-k              │
                         │   /search     = vector search over articles          │
                         │   /ask        = RAG: retrieve + Claude (or offline)  │
                         │   /feedback   = online history update                │
                         └──────────────────────────────────────────────────────┘

How it ties the whole series together

ComponentBuilt on
Article embeddingscontent-based filtering
Time-decayed user profilebaselines + content-based
ANN candidate generationthe HNSW & IVF-PQ books
Two-stage candidate gen + rankingneural & two-stage
Cold start → trending fallbackcold start
Evaluation (recall@k, NDCG, AUC)metrics
RAG assistantretrieval = vector search (HNSW/IVF-PQ books)

The tech stack

  • Python + NumPy — the recommender, ranker, and retrieval (from scratch).
  • MLflow — experiment tracking + model artifact (Chapter 19).
  • FastAPI — the serving API (Chapter 21).
  • React (Vite) — the UI (Chapter 22).
  • Claude (Anthropic API) — RAG generation, with an offline fallback (Chapter 20).
  • Docker Compose — one command to run everything (Chapter 23).

Run it in five phases

You can stop after any phase — each works on its own:

Phase 0  setup            pip install -r requirements.txt; python scripts/make_sample_data.py
Phase 1  train + track    python -m newsreco.train        (+ mlflow ui)
Phase 2  serve API        uvicorn newsreco.api:app --port 8000
Phase 3  React UI         cd frontend && npm install && npm run dev
Phase 4  real RAG         export ANTHROPIC_API_KEY=...    (else offline mode)
Phase 5  all-in-one       docker compose up --build

A note on the dataset

We build on the Microsoft MIND news-recommendation dataset's schema (the standard real benchmark, which naturally includes sports/soccer). To keep the project runnable anywhere, it ships a realistic soccer-heavy sample in the same format; swapping in the full MIND download needs no code changes. Details in the next chapter. 👉

The dataset (MIND schema)

A real recommender starts with real data. We build on the schema of MIND (Microsoft News Dataset) — the standard public benchmark for news recommendation, with millions of impressions across categories including sports/soccer. To keep the capstone runnable anywhere, we ship a soccer-heavy sample in the exact MIND format; swapping in full MIND is a file drop.

The MIND format

Two tab-separated files:

news.tsv — one row per article:

news_id  category  subcategory  title  abstract  url  title_entities  abstract_entities

behaviors.tsv — one row per impression (a session where articles were shown):

impression_id  user_id  time  history  impressions
  • history — space-separated ids the user clicked before this session.
  • impressions — what was shown, each tagged with a click label, e.g. N255-0 N64-1 N299-0 (N64 was clicked, the others weren't).

A real sample row from our generated news.tsv:

N0  sports  soccer  France reach the World Cup quarter-final after dramatic win  Portugal advanced to the quarter-final ...

and behaviors.tsv:

0  U106  06/01/2026 08:00:00 AM  N31 N3 N16 N52 ...  N255-0 N64-1 N299-0 N21-0

This format carries everything a recommender needs: content (title/abstract), categories, who clicked what, and when (for time decay and time-based evaluation).

The bundled sample

scripts/make_sample_data.py generates 300 articles (≈40% soccer/FIFA, the rest spread across politics, world, AI, gadgets, health, finance, movies, travel) and 4,000 impressions for 400 users with realistic, topical vocabulary so content-based methods actually cluster. Users have category affinities (soccer over-represented), and clicks reflect those affinities — so the recommenders have genuine structure to learn.

$ python scripts/make_sample_data.py
wrote 300 articles to data/news.tsv
wrote 4000 impressions to data/behaviors.tsv
categories: 10 subcategories, soccer-heavy

The loader

data.py reads both files into clean structures and, crucially, parses the click labels into interactions (for training) and per-user history (for the decayed profile). The same loader handles the sample and the full MIND download.

"""Load the MIND-schema news + behaviors files into usable structures.

Works for both the bundled sample and the real Microsoft MIND dataset (same
TSV format).
"""

from __future__ import annotations

from dataclasses import dataclass, field
from datetime import datetime


def _parse_time(s: str) -> float:
    """MIND uses '11/11/2019 9:05:58 AM'. Fall back to ISO. Return epoch seconds."""
    s = s.strip()
    for fmt in ("%m/%d/%Y %I:%M:%S %p", "%m/%d/%Y %H:%M:%S"):
        try:
            return datetime.strptime(s, fmt).timestamp()
        except ValueError:
            pass
    try:
        return datetime.fromisoformat(s).timestamp()
    except ValueError:
        return 0.0


@dataclass
class Article:
    news_id: str
    category: str
    subcategory: str
    title: str
    abstract: str

    @property
    def text(self) -> str:
        return f"{self.title}. {self.abstract}"


@dataclass
class NewsData:
    articles: dict = field(default_factory=dict)          # news_id -> Article
    interactions: list = field(default_factory=list)      # (user, news_id, time) clicks
    user_history: dict = field(default_factory=dict)      # user -> [(news_id, time), ...]
    impressions: list = field(default_factory=list)       # raw impressions for ranker

    @property
    def article_ids(self):
        return list(self.articles.keys())


def load_news(path: str) -> dict:
    """Return {news_id: Article}."""
    articles = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            if len(parts) < 5:
                continue
            nid, cat, sub, title, abstract = parts[0], parts[1], parts[2], parts[3], parts[4]
            articles[nid] = Article(nid, cat, sub, title, abstract)
    return articles


def load_behaviors(path: str, articles: dict):
    """
    Parse behaviors.tsv. Returns (interactions, user_history, impressions):
      interactions : list of (user, news_id, time) for CLICKS
      user_history : user -> [(news_id, time)] from the history column + clicks
      impressions  : list of dicts {user, time, candidates:[(news_id,label)]}
    Unknown news ids (not in `articles`) are skipped.
    """
    interactions, impressions = [], []
    user_history = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            if len(parts) < 5:
                continue
            _, user, time_s, history_s, imp_s = parts[0], parts[1], parts[2], parts[3], parts[4]
            t = _parse_time(time_s)
            hist = [h for h in history_s.split() if h in articles]
            user_history.setdefault(user, [])
            for h in hist:
                user_history[user].append((h, t))
            cands = []
            for tok in imp_s.split():
                if "-" not in tok:
                    continue
                nid, lab = tok.rsplit("-", 1)
                if nid in articles:
                    cands.append((nid, int(lab)))
                    if lab == "1":
                        interactions.append((user, nid, t))
                        user_history[user].append((nid, t))
            if cands:
                impressions.append({"user": user, "time": t, "candidates": cands})
    return interactions, user_history, impressions


def load_all(cfg) -> NewsData:
    articles = load_news(cfg.news_path)
    interactions, user_history, impressions = load_behaviors(cfg.behaviors_path, articles)
    return NewsData(articles=articles, interactions=interactions,
                    user_history=user_history, impressions=impressions)

Loading the sample:

articles=300 interactions=4000 users=400 impressions=4000
  • interactions(user, news_id, time) for every click; the training signal.
  • user_history — each user's clicked items with timestamps; powers the time-decayed profile.
  • impressions — the shown-and-labeled candidates; the training data for the stage-2 ranker.

Using the real MIND dataset

  1. Download MIND (small or large) from https://msnews.github.io/.
  2. Drop its news.tsv and behaviors.tsv into data/.
  3. Re-run training — no code changes, because we coded to the MIND schema from the start.

Why a sample at all? Full MIND is gigabytes — too big to bundle and slow to iterate on. Developing against a faithful small sample, then scaling to the real dataset, is exactly how you'd work in practice.

With data in hand, let's build the recommender and ranker. 👉

The recommender & two-stage ranker

This is the heart of the capstone: the two-stage design from Chapter 9, built for real. Stage 1 generates candidates fast with embeddings + a time-decayed profile + ANN search. Stage 2 re-ranks them with a learned click model.

Stage 1 — candidate generation

recommender.py embeds every article (title + abstract), builds a vector index, and computes a time-decayed user profile, then retrieves the nearest unseen articles. It also computes trending (time-decayed popularity) for cold-start users, and dedupes by title so the feed isn't ten near-identical stories.

"""The recommender: content embeddings + time-decayed user profile + ANN
candidate generation, with a trending fallback for cold-start users.

This is stage 1 (candidate generation). The LogisticRanker (ranker.py) is the
optional stage 2 that re-scores candidates with extra features.
"""

from __future__ import annotations

import math

import numpy as np

from .ann import VectorIndex
from .embeddings import make_embedder


class NewsRecommender:
    def __init__(self, embedder="tfidf", half_life_hours=72.0):
        self.embedder = make_embedder(embedder) if isinstance(embedder, str) else embedder
        self.half_life = half_life_hours * 3600.0      # to seconds (timestamps are seconds)
        self.index = VectorIndex()

    # ---------------------------------------------------------------- fitting
    def fit(self, data):
        self.data = data
        self.ids = data.article_ids
        self.id_to_row = {nid: r for r, nid in enumerate(self.ids)}
        texts = [data.articles[nid].text for nid in self.ids]
        self.embedder.fit(texts)
        self.emb = self.embedder.transform(texts)      # L2-normalized rows
        self.index.build(self.emb, self.ids)

        # popularity + time-decayed "trending" from clicks
        self.now = max((t for _, _, t in data.interactions), default=0.0)
        lam = math.log(2) / self.half_life
        self.pop = np.zeros(len(self.ids))
        self.trend = np.zeros(len(self.ids))
        for _, nid, t in data.interactions:
            r = self.id_to_row.get(nid)
            if r is not None:
                self.pop[r] += 1.0
                self.trend[r] += math.exp(-lam * (self.now - t))

        # per-user click history (item, time)
        self.history = {u: list(h) for u, h in data.user_history.items()}
        # per-user preferred categories (for the ranker's category-match feature)
        self.user_cats = {}
        for u, hist in self.history.items():
            cats = {}
            for nid, _ in hist:
                a = data.articles.get(nid)
                if a:
                    cats[a.subcategory] = cats.get(a.subcategory, 0) + 1
            self.user_cats[u] = cats
        return self

    # ---------------------------------------------------------------- profile
    def profile(self, user):
        hist = self.history.get(user)
        if not hist:
            return None
        lam = math.log(2) / self.half_life
        rows, weights = [], []
        for nid, t in hist:
            r = self.id_to_row.get(nid)
            if r is not None:
                rows.append(r)
                weights.append(math.exp(-lam * (self.now - t)))
        if not rows:
            return None
        w = np.asarray(weights)
        p = (w[:, None] * self.emb[rows]).sum(0) / w.sum()
        n = np.linalg.norm(p)
        return p / n if n > 0 else None

    def seen(self, user):
        return {nid for nid, _ in self.history.get(user, [])}

    # ------------------------------------------------------- recommendation
    def trending(self, k=10, exclude=()):
        order = np.argsort(-self.trend)
        out = [self.ids[r] for r in order if self.ids[r] not in exclude]
        return out[:k]

    def candidates(self, user, k=100, dedup=True):
        """Stage-1 candidate generation. Returns [(news_id, content_score)].

        Dedupes near-duplicate articles by title (a basic diversity guard —
        production systems do this so the feed isn't ten copies of one story).
        """
        p = self.profile(user)
        if p is None:
            return [(nid, 0.0) for nid in self.trending(k, self.seen(user))]
        hits = self.index.search(p, (k + len(self.seen(user))) * 3)
        seen = self.seen(user)
        out, titles = [], set()
        for nid, s in hits:
            if nid in seen:
                continue
            title = self.data.articles[nid].title.lower() if dedup else nid
            if title in titles:
                continue
            titles.add(title)
            out.append((nid, s))
            if len(out) >= k:
                break
        return out

    def recommend(self, user, k=10, ranker=None):
        """Full pipeline: candidate generation, then optional ranking."""
        cands = self.candidates(user, k=max(k * 10, 50))
        if not cands:
            return self.trending(k, self.seen(user))
        if ranker is None:
            return [nid for nid, _ in cands[:k]]
        feats = np.array([self.features(user, nid, content) for nid, content in cands])
        scores = ranker.predict(feats)
        order = np.argsort(-scores)
        return [cands[i][0] for i in order[:k]]

    # --------------------------------------------------- ranker features
    def features(self, user, news_id, content_score=None):
        """Feature vector for a (user, candidate) pair (used by the ranker)."""
        r = self.id_to_row.get(news_id)
        if content_score is None:
            p = self.profile(user)
            content_score = float(self.emb[r] @ p) if (p is not None and r is not None) else 0.0
        pop = math.log1p(self.pop[r]) if r is not None else 0.0
        trend = self.trend[r] if r is not None else 0.0
        a = self.data.articles.get(news_id)
        cat_match = float(self.user_cats.get(user, {}).get(a.subcategory, 0) > 0) if a else 0.0
        return [content_score, pop, trend, cat_match, 1.0]   # last = bias term

    n_features = 5

Key pieces:

  • fit embeds articles via the pluggable embedder (TF-IDF by default, sentence-transformers in production) and builds the ANN index — the VectorIndex here is exact cosine, but its interface matches HNSW/FAISS for a drop-in production swap.
  • profile is the time-decayed average of the user's clicked-article embeddings — recent reads weigh most (the half-life knob).
  • candidates runs the ANN search from the profile, filters seen items, and dedupes by title.
  • recommend returns candidates directly, or — if given a ranker — re-scores them (stage 2).
  • features builds the (user, candidate) feature vector the ranker consumes.

Stage 2 — the learned ranker

Stage 1 optimizes recall (don't miss good items); stage 2 optimizes precision (order them well). ranker.py is a from-scratch logistic-regression click model trained on the impression labels — the click-through-rate model real systems use, kept small and transparent.

"""Stage-2 ranker: a from-scratch logistic-regression click model.

It re-scores the candidates from stage 1 using features (content similarity,
popularity, trending, category match). Trained on the impression labels in
behaviors.tsv — exactly the click-through-rate model real systems use, just
small and transparent.
"""

from __future__ import annotations

import numpy as np


class LogisticRanker:
    def __init__(self, lr=0.1, reg=1e-4, epochs=200, seed=0):
        self.lr, self.reg, self.epochs, self.seed = lr, reg, epochs, seed

    def fit(self, X, y):
        X = np.asarray(X, dtype=np.float64)
        y = np.asarray(y, dtype=np.float64)
        # standardize features (except the bias column, last) for stable training
        self.mean = X.mean(0); self.mean[-1] = 0.0
        self.std = X.std(0); self.std[self.std < 1e-9] = 1.0; self.std[-1] = 1.0
        Xs = (X - self.mean) / self.std
        rng = np.random.default_rng(self.seed)
        self.w = rng.normal(scale=0.01, size=Xs.shape[1])
        n = len(y)
        for _ in range(self.epochs):
            z = Xs @ self.w
            p = 1.0 / (1.0 + np.exp(-z))
            grad = Xs.T @ (p - y) / n + self.reg * self.w
            self.w -= self.lr * grad
        return self

    def predict(self, X):
        Xs = (np.asarray(X, dtype=np.float64) - self.mean) / self.std
        return 1.0 / (1.0 + np.exp(-(Xs @ self.w)))


def build_training_samples(recommender, impressions, max_impressions=None):
    """
    Turn impression logs into (X, y) for the click model: each shown candidate
    becomes one row, labelled 1 if clicked.
    """
    X, y = [], []
    imps = impressions if max_impressions is None else impressions[:max_impressions]
    for imp in imps:
        u = imp["user"]
        for nid, label in imp["candidates"]:
            X.append(recommender.features(u, nid))
            y.append(label)
    return np.array(X), np.array(y)

Its features per (user, candidate): content similarity, popularity, trending, category match, and a bias term. It learns weights that predict clicks.

See it run

Recommending for a soccer-history user, then training the ranker:

user U106 top-5 (candidate gen only):
  soccer | Bayern Munich and Chelsea play out thrilling 2-2 draw
  soccer | Barcelona and Inter Milan play out thrilling 4-4 draw
  soccer | Bukayo Saka wins Ballon d'Or after stellar football season
  soccer | Manchester City and Paris Saint-Germain play out thrilling 4-4 draw
  soccer | Kylian Mbappe wins Ballon d'Or after stellar football season

ranker trained on 22009 samples; click-AUC = 0.925
ranker weights [content, pop, trend, cat_match, bias]: [0.93, 0.118, 0.021, 0.68, -1.92]

Two things worth noting:

  • The candidates are all soccer (the user's taste) and deduped (distinct titles) — exactly what we want.
  • The ranker reaches AUC 0.925 at predicting clicks, and its learned weights are interpretable: content similarity (0.93) and category match (0.68) drive clicks most; the negative bias reflects that most shown items aren't clicked. That's a sensible, debuggable model — not a black box.

Why two stages (recap)

Running the logistic ranker over all 300 articles per request would be wasteful, and over millions it'd be impossible. Stage 1's ANN search cheaply narrows millions → a few hundred; stage 2 spends its effort only there. This is the architecture behind essentially every large recommender — and here it is in ~150 lines you can read end to end.

Next: track training runs so you can compare models and ship the best one. 👉

Experiment tracking with MLflow

The moment you tune a recommender — different embedders, half-lives, ranker settings — you need to know which run was best and why. MLflow records every run's parameters, metrics, and artifacts so experiments are reproducible and comparable. This is the difference between "I think the new model is better" and "run #14 improved NDCG@10 from 0.044 to 0.048; here's the proof."

What MLflow gives you

  • Tracking — log params, metrics, and artifacts (the model file) per run.
  • UI — a web dashboard to compare runs side by side.
  • Model registry — promote a run's model through StagingProduction with versioning.

The training pipeline

train.py does a proper offline evaluation and logs everything. It splits clicks leave-last-out (train on the past, test on each user's last click — never peeking ahead, per Chapter 3), fits the recommender, trains the ranker on an 80/20 impression split, evaluates, saves the model, and logs to MLflow. MLflow is optional — if it isn't installed, everything still runs and just skips logging.

"""Training + offline evaluation pipeline, tracked with MLflow.

Steps:
  1. load data, split clicks leave-last-out (train on past, test on each user's
     last click);
  2. fit the recommender (embeddings + index + trending) on the train split;
  3. train the stage-2 logistic ranker on impression labels (80/20), report AUC;
  4. evaluate recall@k / ndcg@k on the held-out clicks;
  5. log params, metrics, and the model artifact to MLflow.

MLflow is optional: if it isn't installed, everything still runs and prints to
the console (logging is a no-op). Run:

    python -m newsreco.train
"""

from __future__ import annotations

import os
import pickle

from .config import Config
from .data import load_all, NewsData
from .recommender import NewsRecommender
from .ranker import LogisticRanker, build_training_samples
from . import metrics

try:                                   # MLflow is optional
    import mlflow
    _HAS_MLFLOW = True
except Exception:
    _HAS_MLFLOW = False


def leave_last_out(data: NewsData):
    """Split clicks: each user's chronologically last click -> test target."""
    by_user = {}
    for u, nid, t in data.interactions:
        by_user.setdefault(u, []).append((t, nid))
    train_inter, test = [], {}
    train_hist = {}
    for u, lst in by_user.items():
        lst.sort()
        if len(lst) >= 2:
            *rest, last = lst
            test[u] = last[1]
            for t, nid in rest:
                train_inter.append((u, nid, t))
                train_hist.setdefault(u, []).append((nid, t))
        else:
            for t, nid in lst:
                train_inter.append((u, nid, t))
                train_hist.setdefault(u, []).append((nid, t))
    train = NewsData(articles=data.articles, interactions=train_inter,
                     user_history=train_hist, impressions=data.impressions)
    return train, test


def run(cfg: Config = None):
    cfg = cfg or Config()
    data = load_all(cfg)
    train, test = leave_last_out(data)
    print(f"articles={len(data.articles)} clicks={len(data.interactions)} "
          f"test_users={len(test)}")

    rec = NewsRecommender(embedder=cfg.embedder, half_life_hours=cfg.half_life_hours)
    rec.fit(train)

    # stage-2 ranker (80/20 split of impressions)
    imps = train.impressions
    cut = int(len(imps) * 0.8)
    Xtr, ytr = build_training_samples(rec, imps[:cut])
    Xte, yte = build_training_samples(rec, imps[cut:])
    ranker = LogisticRanker().fit(Xtr, ytr)
    train_auc = metrics.auc(ytr, ranker.predict(Xtr))
    test_auc = metrics.auc(yte, ranker.predict(Xte))

    # recall/ndcg on held-out clicks, candidate-gen vs. full two-stage
    k = cfg.top_k
    recs_cg = {u: rec.recommend(u, k) for u in test}
    recs_rk = {u: rec.recommend(u, k, ranker=ranker) for u in test}
    results = {
        "ranker_train_auc": round(train_auc, 4),
        "ranker_test_auc": round(test_auc, 4),
        f"recall@{k}_candgen": round(metrics.recall_at_k(recs_cg, test, k), 4),
        f"ndcg@{k}_candgen": round(metrics.ndcg_at_k(recs_cg, test, k), 4),
        f"recall@{k}_ranked": round(metrics.recall_at_k(recs_rk, test, k), 4),
        f"ndcg@{k}_ranked": round(metrics.ndcg_at_k(recs_rk, test, k), 4),
    }
    print("\nmetrics:")
    for kk, vv in results.items():
        print(f"  {kk:<22} {vv}")

    # persist the model
    os.makedirs("models", exist_ok=True)
    model_path = os.path.join("models", "newsreco.pkl")
    with open(model_path, "wb") as f:
        pickle.dump({"recommender": rec, "ranker": ranker}, f)
    print(f"\nsaved model -> {model_path}")

    # MLflow tracking
    if _HAS_MLFLOW:
        mlflow.set_tracking_uri(cfg.mlflow_uri)
        mlflow.set_experiment(cfg.experiment)
        with mlflow.start_run():
            mlflow.log_params({
                "embedder": cfg.embedder,
                "half_life_hours": cfg.half_life_hours,
                "top_k": k,
                "n_articles": len(data.articles),
                "n_clicks": len(data.interactions),
            })
            mlflow.log_metrics({kk.replace("@", "_at_"): float(vv)
                                for kk, vv in results.items()})
            mlflow.log_artifact(model_path)
        print(f"logged run to MLflow at {cfg.mlflow_uri} (experiment={cfg.experiment})")
    else:
        print("MLflow not installed — skipped tracking (pip install mlflow to enable)")

    return results


if __name__ == "__main__":
    run()

Running it

$ python -m newsreco.train
articles=300 clicks=4000 test_users=399

metrics:
  ranker_train_auc       0.9217
  ranker_test_auc        0.947
  recall@10_candgen      0.0827
  ndcg@10_candgen        0.0436
  recall@10_ranked       0.0827
  ndcg@10_ranked         0.0477

saved model -> models/newsreco.pkl
logged run to MLflow at file:./mlruns (experiment=news-recommender)

Reading the results:

  • ranker_test_auc 0.947 — the click model generalizes well to held-out impressions (AUC 0.5 = random, 1.0 = perfect).
  • recall@10 / ndcg@10 — held-out-click accuracy. The ranker doesn't change which items are in the candidate set (so recall is unchanged) but it orders them better, nudging NDCG up (0.0436 → 0.0477). On larger/real data the gap is bigger; the point is the pipeline measures it honestly.

Viewing and comparing runs

mlflow ui --backend-store-uri ./mlruns --port 5000      # http://localhost:5000

Each python -m newsreco.train (with different NEWSRECO_HALFLIFE, NEWSRECO_EMBEDDER, etc.) creates a new run; the UI plots them so you can pick the winner. Try:

NEWSRECO_HALFLIFE=24  python -m newsreco.train     # more recency-biased
NEWSRECO_EMBEDDER=sbert python -m newsreco.train   # semantic embeddings

The model registry (promoting to production)

Once a run looks good, register and stage its model so the serving layer always loads "the current Production model":

import mlflow
# from a chosen run:
mlflow.register_model("runs:/<run_id>/newsreco.pkl", "news-recommender")
# then in the MLflow UI (or API) transition that version to "Production".

The API (Chapter 21) loads models/newsreco.pkl; in a fuller setup you'd have it pull the current Production version from the registry, so deploying a better model is a registry transition, not a code change.

Why this matters in production

  • Reproducibility — every result is tied to its exact params and code.
  • Comparability — no more guessing whether a change helped.
  • Auditability & rollback — the registry records what's live and lets you revert instantly.

Next: the RAG news assistant. 👉

The RAG news assistant

A recommender shows articles; a RAG (Retrieval-Augmented Generation) assistant answers questions about them. It's the natural companion feature for a news app — "What happened in the Champions League?" — and it reuses the exact same embeddings and vector index the recommender already built.

What RAG is

An LLM doesn't know your private/fresh news corpus, and asking it to recall facts invites hallucination. RAG fixes both:

question ──► embed ──► vector search over articles ──► top-k articles (context)
                                                              │
                            "Answer using ONLY these articles" + question
                                                              │
                                                          LLM (Claude)
                                                              │
                                                    grounded, cited answer

The LLM only reasons over retrieved facts, so answers stay grounded and you can cite sources. The retrieval step is precisely the nearest-neighbor search from the HNSW/IVF-PQ books — RAG is a recommender for context.

Pluggable generation: Claude + offline fallback

rag.py retrieves with the recommender's index, then generates:

  • if ANTHROPIC_API_KEY is set (and the anthropic package is installed), it calls Claude with the retrieved articles as context;
  • otherwise it returns a transparent extractive answer (the lead sentences of the top articles), so the assistant always works — no key required to demo.
"""RAG news assistant: answer a question grounded in retrieved articles.

Retrieval reuses the recommender's embedder + vector index (the same embeddings
that power recommendations). Generation is pluggable:

  * if ANTHROPIC_API_KEY is set and the `anthropic` package is installed, it
    calls Claude with the retrieved articles as context;
  * otherwise it falls back to a transparent extractive answer (the lead
    sentences of the top articles) so the system ALWAYS returns something.

This mirrors production RAG: retrieve over your vector store, then generate.
"""

from __future__ import annotations

import os

import numpy as np


SYSTEM_PROMPT = (
    "You are a news assistant. Answer the user's question using ONLY the provided "
    "articles. Cite article numbers like [1], [2]. If the articles don't contain "
    "the answer, say so."
)


class NewsAssistant:
    def __init__(self, recommender, api_key=None, model="claude-opus-4-8"):
        self.rec = recommender
        self.api_key = api_key if api_key is not None else os.environ.get("ANTHROPIC_API_KEY", "")
        self.model = model

    # ---------------------------------------------------------------- retrieve
    def retrieve(self, query, k=5):
        """Top-k articles most relevant to the query (vector search)."""
        q = self.rec.embedder.transform([query])[0]
        hits = self.rec.index.search(q, k)
        return [(nid, score, self.rec.data.articles[nid]) for nid, score in hits]

    # ----------------------------------------------------------------- answer
    def ask(self, query, k=5):
        """Return {'answer': str, 'sources': [...], 'mode': 'claude'|'extractive'}."""
        retrieved = self.retrieve(query, k)
        context = "\n".join(
            f"[{i+1}] {art.title}. {art.abstract}" for i, (_, _, art) in enumerate(retrieved))
        sources = [{"id": nid, "title": art.title, "score": round(score, 3)}
                   for nid, score, art in retrieved]

        if self.api_key:
            try:
                answer = self._claude(query, context)
                return {"answer": answer, "sources": sources, "mode": "claude"}
            except Exception as e:                       # never fail the request
                answer = self._extractive(query, retrieved)
                return {"answer": answer, "sources": sources,
                        "mode": f"extractive (claude error: {e})"}
        return {"answer": self._extractive(query, retrieved),
                "sources": sources, "mode": "extractive"}

    # ------------------------------------------------------------ generators
    def _claude(self, query, context):
        import anthropic  # lazy import; only needed when a key is set
        client = anthropic.Anthropic(api_key=self.api_key)
        msg = client.messages.create(
            model=self.model,
            max_tokens=400,
            system=SYSTEM_PROMPT,
            messages=[{"role": "user",
                       "content": f"Articles:\n{context}\n\nQuestion: {query}"}],
        )
        return "".join(block.text for block in msg.content if block.type == "text")

    def _extractive(self, query, retrieved):
        """Offline fallback: stitch the lead sentence of each top article + cite."""
        if not retrieved:
            return "No relevant articles found."
        lines = [f"Based on {len(retrieved)} related articles:"]
        for i, (_, _, art) in enumerate(retrieved, 1):
            lead = art.abstract.split(". ")[0].strip()
            lines.append(f"  [{i}] {lead}. ({art.title})")
        return "\n".join(lines)

See it work (offline mode)

>>> NewsAssistant(rec, api_key="").ask("Who won the World Cup match?", k=3)
mode = extractive
Based on 3 related articles:
  [1] Netherlands advanced to the final of the World Cup, with Vinicius Junior inspiring a memorable victory ... (Croatia reach the World Cup final after dramatic win)
  [2] Portugal advanced to the final of the World Cup, with Kylian Mbappe inspiring a memorable victory ... (Morocco reach the World Cup quarter-final after dramatic win)
  [3] Netherlands advanced to the final of the World Cup ... (Portugal reach the World Cup quarter-final after dramatic win)

Retrieval found the right (World Cup) articles by meaning, and the assistant answered with citations — with zero external dependencies.

Turning on Claude

export ANTHROPIC_API_KEY=sk-ant-...

Now ask() sends the retrieved articles + the question to Claude with a strict system prompt ("answer using ONLY the provided articles, cite by number"), and returns a fluent, grounded answer with the same sources list. The code is the canonical Anthropic call:

client = anthropic.Anthropic(api_key=...)
msg = client.messages.create(
    model="claude-opus-4-8", max_tokens=400, system=SYSTEM_PROMPT,
    messages=[{"role": "user", "content": f"Articles:\n{context}\n\nQuestion: {query}"}],
)

Robustness: if the Claude call fails (network, quota), the assistant catches the error and falls back to the extractive answer rather than erroring out — a small but important production habit. The response's mode field tells you which path ran.

Production notes

  • Chunking. Long articles should be split into passages and each embedded, so retrieval returns the relevant passage, not a whole document.
  • Grounding & citations. Always instruct the model to use only retrieved context and cite it; surface the sources in the UI (our React app does).
  • Same index, two products. Recommendations and RAG share one embedding + vector-search backend — build it once, serve both.

Next: wrap all of this in an API. 👉

The FastAPI service

Models are useless until something can call them. api.py wraps the recommender and RAG assistant in a FastAPI service — the online half of the system.

The endpoints

Method & pathPurpose
GET /healthliveness + which model/embedder/RAG mode is loaded
GET /recommend/{user_id}?k=personalized recommendations (two-stage)
GET /search?q=&k=content/vector search over the catalog
POST /ask {query, k}the RAG news assistant
POST /feedback {user_id, news_id, event}log an interaction (online update)

It loads models/newsreco.pkl if present (from training) or trains a fresh model on startup, so it always comes up ready.

"""FastAPI service exposing the recommender + RAG assistant.

Endpoints:
  GET  /health
  GET  /recommend/{user_id}?k=10        -> personalized recommendations
  GET  /search?q=...&k=10               -> content search over the catalog
  POST /ask         {query, k}          -> RAG news assistant (Claude or offline)
  POST /feedback    {user_id, news_id, event}  -> log an interaction (online update)

Loads a trained model from models/newsreco.pkl if present; otherwise it trains a
fresh one from the data on startup. Run:

    uvicorn newsreco.api:app --reload --port 8000
"""

from __future__ import annotations

import os
import pickle

from .config import Config
from .data import load_all
from .recommender import NewsRecommender
from .rag import NewsAssistant

try:
    from fastapi import FastAPI, HTTPException
    from pydantic import BaseModel
except Exception as e:                  # pragma: no cover
    raise SystemExit("FastAPI not installed. pip install fastapi uvicorn") from e


cfg = Config()


def _load_or_train():
    # 1. If asked, load the current 'production' model from the MLflow registry.
    if os.environ.get("NEWSRECO_USE_REGISTRY") == "1":
        try:
            from .registry import load_production
            bundle = load_production()
            if bundle:
                return bundle["recommender"], bundle.get("ranker")
        except Exception:
            pass  # fall through to local artifact / fresh train
    # 2. Otherwise load the locally trained artifact.
    path = os.path.join("models", "newsreco.pkl")
    if os.path.exists(path):
        with open(path, "rb") as f:
            bundle = pickle.load(f)
        return bundle["recommender"], bundle.get("ranker")
    # 3. Last resort: train a fresh model on startup.
    rec = NewsRecommender(embedder=cfg.embedder, half_life_hours=cfg.half_life_hours)
    rec.fit(load_all(cfg))
    return rec, None


app = FastAPI(title="News Recommender", version="1.0")
rec, ranker = _load_or_train()
assistant = NewsAssistant(rec, api_key=cfg.anthropic_api_key, model=cfg.llm_model)


def _article_dict(nid, score=None):
    a = rec.data.articles[nid]
    d = {"id": nid, "title": a.title, "abstract": a.abstract,
         "category": a.category, "subcategory": a.subcategory}
    if score is not None:
        d["score"] = round(float(score), 4)
    return d


class AskRequest(BaseModel):
    query: str
    k: int = 5


class FeedbackRequest(BaseModel):
    user_id: str
    news_id: str
    event: str = "click"


@app.get("/health")
def health():
    return {"status": "ok", "articles": len(rec.data.articles),
            "embedder": rec.embedder.name, "has_ranker": ranker is not None,
            "rag_mode": "claude" if cfg.anthropic_api_key else "extractive"}


@app.get("/recommend/{user_id}")
def recommend(user_id: str, k: int = 10):
    ids = rec.recommend(user_id, k=k, ranker=ranker)
    cold = rec.profile(user_id) is None
    return {"user_id": user_id, "cold_start": cold,
            "recommendations": [_article_dict(nid) for nid in ids]}


@app.get("/search")
def search(q: str, k: int = 10):
    qv = rec.embedder.transform([q])[0]
    hits = rec.index.search(qv, k)
    return {"query": q, "results": [_article_dict(nid, s) for nid, s in hits]}


@app.post("/ask")
def ask(req: AskRequest):
    return assistant.ask(req.query, k=req.k)


@app.post("/feedback")
def feedback(req: FeedbackRequest):
    if req.news_id not in rec.data.articles:
        raise HTTPException(status_code=404, detail="unknown news_id")
    # online update: append to the user's history so the next request reflects it.
    rec.history.setdefault(req.user_id, []).append((req.news_id, rec.now))
    return {"status": "recorded", "user_id": req.user_id, "news_id": req.news_id}

Verified responses

These are real responses from the running app (via FastAPI's TestClient):

GET /health
{'status': 'ok', 'articles': 300, 'embedder': 'tfidf',
 'has_ranker': True, 'rag_mode': 'extractive'}

GET /recommend/U106?k=3   (cold_start=False)
  soccer | Erling Haaland wins Ballon d'Or after stellar football season
  soccer | Jude Bellingham wins Ballon d'Or after stellar football season
  soccer | Bukayo Saka wins Ballon d'Or after stellar football season

GET /search?q=world cup final winner
  0.542 | Croatia reach the World Cup final after dramatic win
  0.535 | Morocco reach the World Cup quarter-final after dramatic win
  0.535 | Portugal reach the World Cup quarter-final after dramatic win

POST /ask {"query":"Who scored in the Champions League?","k":2}   (mode=extractive)
  sources: ['Chelsea beat Juventus 2-0 in the Europa League',
            'Liverpool beat Barcelona 3-0 in the Premier League']

POST /feedback {"user_id":"U106","news_id":"N2"}
  {'status': 'recorded', 'user_id': 'U106', 'news_id': 'N2'}

GET /recommend/UNEW   ->  cold_start = True   (new user → trending fallback)

Everything works as designed: personalized soccer recs for a soccer user, semantic search, grounded RAG answers, feedback logging, and the cold-start flag flips to True for an unknown user.

Running it

uvicorn newsreco.api:app --reload --port 8000

Then explore the auto-generated API docs at http://localhost:8000/docs (FastAPI builds an interactive Swagger UI for free), or curl it:

curl localhost:8000/health
curl "localhost:8000/recommend/U106?k=5"
curl -X POST localhost:8000/ask -H 'content-type: application/json' \
     -d '{"query":"Who won the Champions League?","k":3}'

The online-feedback loop

POST /feedback appends the click to the user's history in memory, so the next /recommend call reflects it immediately — a minimal version of real-time personalization. In production you'd write feedback to a stream/store and refresh the user vector from it, but the principle is the same: the system learns from behavior as it happens.

Production hardening (checklist)

The code is clean and correct; before real traffic you'd add:

  • CORS middleware for the browser, auth on write endpoints, and request validation/limits.
  • Async + batching for the model calls; load the model once per worker.
  • Caching of per-user candidate lists; rate limiting.
  • Observability — structured logs, latency/error metrics, tracing.
  • Load the Production model from the MLflow registry rather than a local pickle.

Next, the user interface. 👉

The React frontend

A recommender needs a face. The capstone ships a small React (Vite) app that talks to the FastAPI backend and gives you three panels: recommendations, search, and the RAG assistant.

What it does

  • Recommendations — enter a user id (e.g. U106), see their top articles; a badge shows when a user is cold start (falling back to trending). Clicking a card sends /feedback and refreshes, so you can watch recommendations adapt in real time.
  • Search — type a query, get vector-search results with similarity scores.
  • Ask — ask a question, get a RAG answer with cited sources and the mode (claude or extractive).

The API client

A thin wrapper around fetch. In dev, Vite proxies /api/* to the backend on :8000 (configured in vite.config.js), so there are no CORS issues.

// Thin client for the FastAPI backend. In dev, vite proxies /api -> :8000.
const BASE = import.meta.env.VITE_API_BASE || '/api'

async function get(path) {
  const r = await fetch(`${BASE}${path}`)
  if (!r.ok) throw new Error(`${r.status} ${r.statusText}`)
  return r.json()
}

async function post(path, body) {
  const r = await fetch(`${BASE}${path}`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(body),
  })
  if (!r.ok) throw new Error(`${r.status} ${r.statusText}`)
  return r.json()
}

export const api = {
  health: () => get('/health'),
  recommend: (userId, k = 10) => get(`/recommend/${encodeURIComponent(userId)}?k=${k}`),
  search: (q, k = 10) => get(`/search?q=${encodeURIComponent(q)}&k=${k}`),
  ask: (query, k = 5) => post('/ask', { query, k }),
  feedback: (userId, newsId) => post('/feedback', { user_id: userId, news_id: newsId, event: 'click' }),
}

The app

One component, three sections, plain React hooks — easy to read and extend.

import React, { useEffect, useState } from 'react'
import { api } from './api.js'

function Card({ a, onClick }) {
  return (
    <div className="card" onClick={onClick}>
      <span className={`tag tag-${a.subcategory}`}>{a.subcategory}</span>
      <h4>{a.title}</h4>
      <p>{a.abstract}</p>
      {a.score !== undefined && <small>score {a.score}</small>}
    </div>
  )
}

export default function App() {
  const [health, setHealth] = useState(null)
  const [userId, setUserId] = useState('U106')
  const [recs, setRecs] = useState([])
  const [cold, setCold] = useState(false)
  const [query, setQuery] = useState('world cup final')
  const [results, setResults] = useState([])
  const [question, setQuestion] = useState('Who won the Champions League?')
  const [answer, setAnswer] = useState(null)
  const [err, setErr] = useState(null)

  useEffect(() => { api.health().then(setHealth).catch(e => setErr(String(e))) }, [])

  async function loadRecs() {
    try { const r = await api.recommend(userId, 8); setRecs(r.recommendations); setCold(r.cold_start) }
    catch (e) { setErr(String(e)) }
  }
  useEffect(() => { loadRecs() }, [])

  async function doSearch() {
    try { setResults((await api.search(query, 6)).results) } catch (e) { setErr(String(e)) }
  }
  async function doAsk() {
    try { setAnswer(await api.ask(question, 4)) } catch (e) { setErr(String(e)) }
  }
  async function click(newsId) {
    await api.feedback(userId, newsId); loadRecs()   // online update: recs react
  }

  return (
    <div className="app">
      <header>
        <h1>📰 News Recommender</h1>
        {health && <span className="status">
          {health.articles} articles · {health.embedder} · RAG: {health.rag_mode}
        </span>}
      </header>
      {err && <div className="error">Backend error: {err} — is the API running on :8000?</div>}

      <section>
        <h2>Recommendations</h2>
        <div className="row">
          <input value={userId} onChange={e => setUserId(e.target.value)} placeholder="user id (e.g. U106)" />
          <button onClick={loadRecs}>Load</button>
          {cold && <span className="cold">cold start → trending</span>}
        </div>
        <div className="grid">
          {recs.map(a => <Card key={a.id} a={a} onClick={() => click(a.id)} />)}
        </div>
        <small>Tip: clicking a card sends feedback and refreshes — recommendations adapt.</small>
      </section>

      <section>
        <h2>Search</h2>
        <div className="row">
          <input value={query} onChange={e => setQuery(e.target.value)} />
          <button onClick={doSearch}>Search</button>
        </div>
        <div className="grid">{results.map(a => <Card key={a.id} a={a} />)}</div>
      </section>

      <section>
        <h2>Ask the news assistant (RAG)</h2>
        <div className="row">
          <input value={question} onChange={e => setQuestion(e.target.value)} />
          <button onClick={doAsk}>Ask</button>
        </div>
        {answer && (
          <div className="answer">
            <pre>{answer.answer}</pre>
            <div className="sources">
              {answer.sources.map((s, i) => <span key={s.id}>[{i + 1}] {s.title}</span>)}
            </div>
            <small>mode: {answer.mode}</small>
          </div>
        )}
      </section>
    </div>
  )
}

Running it

The frontend needs the backend from Chapter 21 running on :8000, then:

cd frontend
npm install
npm run dev          # opens http://localhost:5173

That's it — a working recommender UI: browse personalized news, search by meaning, and chat with the news assistant.

Project layout

frontend/
├── index.html
├── package.json          # react + vite
├── vite.config.js        # dev proxy /api -> :8000
└── src/
    ├── main.jsx          # React entry
    ├── api.js            # fetch wrapper for the backend
    ├── App.jsx           # the 3-panel UI
    └── styles.css        # dark theme

Production build & deploy

npm run build            # outputs static files to frontend/dist/

Serve dist/ from any static host (S3 + CloudFront, Nginx, Vercel, Netlify…), point VITE_API_BASE at your deployed API, and you have a deployable UI. The Docker Compose setup runs the dev server alongside the backend so you can try the whole stack with one command.

Note. This box can't run a Node build, so the React app is provided as complete, standard code (verified by review, not executed here). npm install && npm run dev builds and runs it on your machine.

Finally: packaging and deploying the whole stack. 👉

Deployment & production checklist

The pieces are built; this chapter packages them to run together and lays out what it takes to run this for real.

One command: Docker Compose

The whole stack — API, MLflow UI, and the React dev server — comes up together:

docker compose up --build
#   backend  -> http://localhost:8000   (FastAPI + /docs)
#   mlflow   -> http://localhost:5000   (experiment dashboard)
#   frontend -> http://localhost:5173   (the UI)
# Full stack: API backend, MLflow tracking UI, and the React frontend.
# Usage:  docker compose up --build
services:
  backend:
    build: .
    ports: ["8000:8000"]
    environment:
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY:-}
      - MLFLOW_TRACKING_URI=http://mlflow:5000
      - NEWSRECO_EMBEDDER=${NEWSRECO_EMBEDDER:-tfidf}
    depends_on: [mlflow]

  mlflow:
    image: ghcr.io/mlflow/mlflow:v2.16.2
    command: mlflow server --host 0.0.0.0 --port 5000
             --backend-store-uri /mlruns --default-artifact-root /mlruns
    ports: ["5000:5000"]
    volumes: ["./mlruns:/mlruns"]

  frontend:
    image: node:20-slim
    working_dir: /app
    command: sh -c "npm install && npm run dev -- --host 0.0.0.0"
    environment:
      - VITE_API_BASE=http://localhost:8000
    ports: ["5173:5173"]
    volumes: ["./frontend:/app"]
    depends_on: [backend]

The backend image trains a model at build time so it's self-contained:

# Backend image: trains the model at build time, then serves the API.
FROM python:3.11-slim

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

# Generate the sample data and train a model so the image is self-contained.
RUN python scripts/make_sample_data.py && python -m newsreco.train

EXPOSE 8000
CMD ["uvicorn", "newsreco.api:app", "--host", "0.0.0.0", "--port", "8000"]

Set a Claude key (optional) before up to enable generative RAG:

export ANTHROPIC_API_KEY=sk-ant-...

The five phases (recap)

You can run any subset, no Docker required:

Phase 0  pip install -r requirements.txt && python scripts/make_sample_data.py
Phase 1  python -m newsreco.train          # + mlflow ui --backend-store-uri ./mlruns
Phase 2  uvicorn newsreco.api:app --port 8000
Phase 3  cd frontend && npm install && npm run dev
Phase 4  export ANTHROPIC_API_KEY=...       # real RAG (else offline)
Phase 5  docker compose up --build          # everything at once

What was verified vs. what you run

Honest accounting, since this was built on a small ARM server:

ComponentStatus
Data generator + loader✅ run & verified
Recommender + two-stage ranker (AUC 0.925)✅ run & verified
Training pipeline + MLflow logging✅ run & verified (./mlruns populated)
FastAPI endpoints✅ verified via TestClient
Test suite (pytest)✅ 6/6 passing
RAG offline mode✅ run & verified
RAG with live Claude⚙️ code complete; needs your ANTHROPIC_API_KEY
React UI⚙️ complete code; npm install && npm run dev on your machine
Docker Compose⚙️ complete; docker compose up on a Docker host

Scaling to production

The capstone is correct and complete; scaling it is about swapping components, not rewriting logic:

  • Embeddings. TF-IDF → a neural model (NEWSRECO_EMBEDDER=sbert, or a hosted embedding API). Same interface.
  • ANN index. The exact VectorIndexhnswlib/FAISS for millions of articles, or a vector DB (OpenSearch/Milvus/Qdrant). The interface in ann.py already matches.
  • Ranker. Logistic regression → gradient-boosted trees / a neural ranker, with many more features from a feature store. Same (features → click) pipeline.
  • Serving. Multiple uvicorn workers behind a load balancer; cache per-user candidates; load the Production model from the MLflow registry.
  • Data pipeline. Batch jobs to (re)embed new articles and rebuild the index; stream clicks into the feature store; retrain on a schedule.

Operating it well (the recsys lessons, applied)

From Best practices, the things that keep it healthy:

  • Evaluate by time + A/B test. Offline metrics (logged in MLflow) filter ideas; live A/B tests decide. Watch for train/serve skew via the feature store.
  • Cold start from day one — trending fallback is wired in; add onboarding and a little exploration (bandits) to surface new articles.
  • Watch the feedback loop. Inject diversity (we already dedupe by title) and exploration so the system doesn't collapse onto a few popular stories.
  • Monitor. Latency, error rate, recommendation coverage/diversity, and click-through — not just accuracy.
  • Ground & cite RAG, and keep the offline fallback so the assistant degrades gracefully.

Where to go next

  • Drop in the real MIND dataset (Chapter 17) and re-run the pipeline.
  • Swap TF-IDF for neural embeddings and the exact index for HNSW — then compare runs in MLflow.
  • Add a sequence model (what the user read in order) for next-article prediction.

That's a complete, production-shaped recommendation system — data, models, tracking, serving, UI, and an LLM assistant — built from the ideas in this book. 🎓

Using the real MIND dataset

Why this chapter matters: the capstone shipped with a small, generated sample so it runs anywhere. To build something real (or a portfolio piece), you want a real dataset. This chapter shows how to plug in MIND — the standard real-world news-recommendation dataset — with zero code changes, because we coded to its format from the start.

What MIND is

MIND (Microsoft News Dataset) is a large public benchmark for news recommendation: real news articles and real anonymized user click logs collected from Microsoft News. It comes in two sizes:

  • MINDsmall — ~50,000 users, a manageable few-hundred-MB download. Great for a laptop / this capstone.
  • MINDlarge — ~1,000,000 users, the full benchmark used in research papers.

It naturally includes sports (and soccer/football) alongside many other categories — exactly the themed mix our sample imitates.

Getting it

  1. Download from https://msnews.github.io/ (MINDsmall is the easy starting point). You'll get news.tsv and behaviors.tsv for train and dev splits.
  2. Drop news.tsv and behaviors.tsv into the capstone's data/ directory (replacing the sample).
  3. Re-run the pipeline:
    python -m newsreco.train          # train + evaluate + log to MLflow
    uvicorn newsreco.api:app --port 8000
    

That's it. No code changes — the loader already speaks MIND.

Why no code changes are needed

Our data loader was written against the exact MIND schema, including its real-world quirks. Verified on a real-format fixture:

articles: 3 (note one has an EMPTY abstract)
empty-abstract ok: ''
interactions (clicks): [('U13740', 'N55528', 1573463158.0), ('U99', 'N61837', ...)]
user U99 empty-history ok: [('N61837', ...)]
times parsed (epoch>0): True

The loader correctly handles:

  • Empty abstracts — some MIND articles have a title but no abstract.
  • MIND's timestamp format11/11/2019 9:05:58 AM (note the non-padded hour), parsed to a sortable epoch for time decay and time-based splits.
  • Empty history — brand-new users have a blank history column (a cold-start case the loader and recommender handle).
  • The entity columns — MIND includes title_entities / abstract_entities (named entities); we read the first five columns and ignore the rest, but those entities are a great feature to add later.

What changes at MIND scale (and how the design absorbs it)

MINDsmall runs as-is. As you scale toward MINDlarge, two components need their production swap — both already designed for it:

Concern at scaleSwap (no logic change)
~100k+ articles → exact cosine scan too slowVectorIndexhnswlib/FAISS (same interface, see ann.py)
TF-IDF vocabulary explodesNEWSRECO_EMBEDDER=sbert → fixed-size neural embeddings
Dense user-item math too bigthe recommender already uses per-item embeddings + ANN, not a dense matrix
k-means / ranker training slowtrain on a sample of impressions (standard practice)

Everything else — the decayed profile, the two-stage ranker, MLflow tracking, the API, the UI, RAG — is unchanged.

A realistic workflow

This is exactly how you'd work in practice, and a good habit to internalize:

  1. Develop on the sample (or MINDsmall) — fast iteration, cheap experiments.
  2. Track every run in MLflow (next chapter) so you can compare embedders, half-lives, and ranker settings.
  3. Scale to the full dataset once the pipeline is right, swapping in the ANN backend and neural embeddings.
  4. Validate online with an A/B test before trusting the offline numbers (best practices).

Next: managing trained models with the MLflow registry. 👉

The MLflow model registry

Why this chapter matters: Chapter 19 used MLflow to track experiments. But once you've found a good model, how do you get it into production safely — and roll back if it misbehaves? That's the job of the model registry: a versioned catalog of models with a pointer to "the one that's live."

The problem it solves

Without a registry, "deploy a new model" means editing code or copying files onto servers — error-prone and hard to undo. With a registry:

  • every trained model becomes a numbered version,
  • one version is marked production (via an alias),
  • the serving code always asks for "the production version,"

so deploying a better model is a one-line registry operation, not a code change — and rolling back is just pointing the alias at the previous version.

   train run ──► register ──► version 1 ┐
   train run ──► register ──► version 2 ├─ alias "production" ──► API loads this
   train run ──► register ──► version 3 ┘   (move the alias to deploy/rollback)

A prerequisite: a database backend

The registry needs a database-backed tracking store — the plain file store (file:./mlruns) can't do it. Use SQLite locally (or Postgres/MySQL in production):

export MLFLOW_TRACKING_URI=sqlite:///mlflow.db

The helper

registry.py wraps the three operations — register, promote, load — using MLflow 3 aliases (production):

"""MLflow Model Registry helpers: register a trained model, mark a version as
'production', and load whatever the current production version is.

The registry is how you decouple *deploying a better model* from *changing code*:
the API always asks for the production version; promoting a new model is a registry
operation, not a redeploy.

NOTE: the registry needs a database-backed tracking store (e.g. sqlite or a
tracking server) — the plain file store does not support it. Set:
    export MLFLOW_TRACKING_URI=sqlite:///mlflow.db
This module uses MLflow 3 *aliases* ('production'); on MLflow 2 use stages
(transition_model_version_stage(..., stage='Production')).
"""

from __future__ import annotations

import pickle

import mlflow
from mlflow.tracking import MlflowClient

REGISTERED_NAME = "news-recommender"
ALIAS = "production"


def register(run_id, artifact_file="newsreco.pkl", name=REGISTERED_NAME):
    """Register a run's logged model artifact as a new model version."""
    client = MlflowClient()
    try:
        client.create_registered_model(name)
    except Exception:
        pass  # already exists
    mv = client.create_model_version(
        name=name, source=f"runs:/{run_id}/{artifact_file}", run_id=run_id)
    return int(mv.version)


def promote(version, name=REGISTERED_NAME, alias=ALIAS):
    """Point the 'production' alias at a specific version."""
    MlflowClient().set_registered_model_alias(name, alias, version)


def load_production(name=REGISTERED_NAME, alias=ALIAS):
    """Download + unpickle whatever version currently holds the 'production' alias.
    Returns the {'recommender', 'ranker'} bundle, or None if nothing is promoted."""
    client = MlflowClient()
    try:
        mv = client.get_model_version_by_alias(name, alias)
    except Exception:
        return None
    local = mlflow.artifacts.download_artifacts(mv.source)
    with open(local, "rb") as f:
        return pickle.load(f)

The full flow, verified

Train → register → promote → load back the production model:

from newsreco import registry
from newsreco.train import run
from newsreco.config import Config
import mlflow

cfg = Config()                              # MLFLOW_TRACKING_URI=sqlite:///mlflow.db
run(cfg)                                    # trains + logs a run

client = mlflow.tracking.MlflowClient()
exp = client.get_experiment_by_name(cfg.experiment)
run_id = client.search_runs([exp.experiment_id],
                            order_by=["attributes.start_time DESC"])[0].info.run_id

version = registry.register(run_id)         # -> new model version
registry.promote(version)                   # point 'production' alias at it
bundle = registry.load_production()         # load whatever is in production

Output:

latest run: 0e3f01f13906473c90733c96ca30e343
registered version: 1
promoted version 1 to alias 'production'
loaded production model -> 300 articles, ranker: True
sample rec for U106: ['N106', 'N98', 'N100']

The model was registered, promoted, then loaded straight back from the registry and used to recommend — the exact loop a deployment pipeline runs.

Wiring it into the API

The API will prefer the registry's production model when you ask it to (otherwise it uses the local artifact, then a fresh train):

# newsreco/api.py — _load_or_train()
if os.environ.get("NEWSRECO_USE_REGISTRY") == "1":
    from .registry import load_production
    bundle = load_production()
    if bundle:
        return bundle["recommender"], bundle.get("ranker")
# ... else local pickle ... else train fresh

So deploying a newly trained, better model is:

# 1. train a candidate (logs a run, saves artifact)
python -m newsreco.train
# 2. register + promote it (after checking its MLflow metrics look good)
python -c "from newsreco import registry, ...; v=registry.register(run_id); registry.promote(v)"
# 3. restart the API with NEWSRECO_USE_REGISTRY=1  -> it serves the new model

No code change, and rollback is registry.promote(previous_version).

MLflow versions note

This uses MLflow 3's aliases (set_registered_model_alias / get_model_version_by_alias). On MLflow 2.x the equivalent is stages: client.transition_model_version_stage(name, version, stage="Production") and client.get_latest_versions(name, stages=["Production"]). Same idea, older API.

Production registry hygiene

  • Gate promotion on metrics — only promote if the new run beats production on your offline metrics (and ideally an A/B test).
  • Keep history — never delete old versions; they're your rollback path.
  • Tag versions — record the dataset snapshot, code commit, and owner.
  • Automate — a CI job that trains, evaluates, registers, and (if it clears a bar) promotes is the backbone of continuous delivery for ML.

Next, a more advanced model: predicting the next article from reading order. 👉

Sequence-aware recommendation

Why this chapter matters: everything so far treated a user's history as an unordered bag of items ("you like soccer"). But order carries information: what you read just now predicts what you'll read next better than your average taste. Sequence-aware models exploit that, and they're behind "Up next" autoplay and session-based feeds. This chapter builds the simplest one from scratch and shows it wins at next-article prediction.

The shift: from "taste" to "what's next"

  • Static (earlier chapters): summarize all of a user's history into one taste vector, recommend similar items. Great for "more like what you generally like."
  • Sequential (here): model transitions — given the last item (or the recent sequence), predict the next. Great for "you just read X, so here's Y."

A news reader who just opened a Champions League final article is, right now, most likely to open another match report — not a random article from their all-time-favorite category. Order captures that intent.

The simplest sequence model: a first-order Markov chain

Count how often item B is read right after item A across all users. That gives a transition table; to recommend, look at the user's last item and return the items most likely to follow it.

build:   for each user's chronological history a -> b -> c ...
         T[a, b] += 1 ;  T[b, c] += 1 ; ...
recommend(user): take their last item L, rank items by T[L, *]

This is a first-order Markov model (only the last item matters). It's the transparent ancestor of neural session models like GRU4Rec (a recurrent net over the sequence) and SASRec (self-attention over the sequence), which learn richer, longer-range patterns — but the core idea is this table.

The code

"""Sequence-aware recommendation: predict the NEXT article from the ORDER of
what a user read, not just an unordered taste profile.

This is a from-scratch first-order Markov model over item transitions ("people
who read A next read B"), optionally blended with content similarity. It's the
transparent ancestor of session models like GRU4Rec / SASRec.
"""

from __future__ import annotations

import numpy as np


class SequenceRecommender:
    def __init__(self, blend=0.0):
        # blend in [0,1]: 0 = pure transitions, 1 = pure content similarity
        self.blend = blend

    def fit(self, data, content_rec=None):
        self.data = data
        self.ids = data.article_ids
        self.row = {nid: r for r, nid in enumerate(self.ids)}
        n = len(self.ids)
        self.T = np.zeros((n, n))                     # T[a, b] = count(a -> b)
        self.last = {}                                # user -> last clicked item

        for u, hist in data.user_history.items():
            seq = [nid for nid, _ in sorted(hist, key=lambda x: x[1])]
            for a, b in zip(seq, seq[1:]):
                ra, rb = self.row.get(a), self.row.get(b)
                if ra is not None and rb is not None:
                    self.T[ra, rb] += 1.0
            if seq:
                self.last[u] = seq[-1]

        # row-normalize transitions into probabilities
        rowsum = self.T.sum(1, keepdims=True)
        self.P = self.T / np.maximum(rowsum, 1e-9)
        self.content = content_rec                    # optional content blend
        return self

    def _scores_from_item(self, item_id):
        r = self.row.get(item_id)
        if r is None:
            return np.zeros(len(self.ids))
        scores = self.P[r].copy()
        if self.blend > 0 and self.content is not None:
            csim = self.content.emb @ self.content.emb[r]      # content similarity
            scores = (1 - self.blend) * scores + self.blend * csim
        return scores

    def predict_next(self, last_item, k=10, exclude=()):
        scores = self._scores_from_item(last_item)
        order = np.argsort(-scores)
        out = [self.ids[i] for i in order if self.ids[i] not in exclude and scores[i] > 0]
        return out[:k]

    def recommend(self, user, k=10):
        last = self.last.get(user)
        if last is None:
            return []
        seen = {nid for nid, _ in self.data.user_history.get(user, [])}
        return self.predict_next(last, k, exclude=seen)

Notes:

  • T[a, b] counts transitions; row-normalizing gives P[a, b] = probability of reading b next after a.
  • blend mixes the transition score with content similarity — pure transitions are sparse (many item pairs are never observed), so leaning on content fills the gaps. This hybrid is the practical sweet spot.

Does order actually help? (verified)

We evaluate next-article prediction: train on everything except each user's last click, then predict that held-out next click (the fair leave-last-out split from Chapter 3).

next-article prediction over 399 users:
  ContentBased (profile)   recall@10=0.083  ndcg@10=0.044
  Sequence (Markov)        recall@10=0.105  ndcg@10=0.054
  Sequence + content 0.3   recall@10=0.281  ndcg@10=0.116

The story is clear:

  • The pure Markov model already beats the static content profile (0.105 vs 0.083) — order carries signal.
  • Blending order with content is dramatically better (0.281, more than 3× the content-only baseline). Transitions say "what tends to come next"; content fills in unseen pairs with "what's similar." Together they're far stronger than either alone.

A concrete prediction — the user just read a Ballon d'Or piece, and the model suggests more soccer:

user U106: last read -> "Bukayo Saka wins Ballon d'Or after stellar football season"
   next -> soccer | Erling Haaland completes 60 million transfer to Bayern Munich
   next -> soccer | Juventus beat Arsenal 2-2 in La Liga
   next -> soccer | Manchester City and Paris Saint-Germain play out thrilling 4-4 draw

How to use it in the capstone

SequenceRecommender follows the same .fit(...) / .recommend(user, k) interface as the other models, so it slots into the API like any other — or, better, becomes another candidate generator in the two-stage architecture (Chapter 9): retrieve some candidates by taste (content/CF) and some by what's next (sequence), then let the ranker combine them. Mixing complementary retrievers is standard in production.

Going deeper

  • Higher-order / session models — condition on the last few items, not just one (GRU4Rec, SASRec, BERT4Rec). They capture longer patterns at the cost of a trained neural net.
  • Time-aware sequences — weight recent transitions more (the same time decay from Chapter 5).
  • Session boundaries — reset the sequence when a user starts a new visit; intent within a session is the strongest signal of all.

That rounds out the capstone: from baselines and classic CF, through matrix factorization and learning-to-rank, to a production stack with MLflow, an API, a UI, RAG, and now sequence models — the real toolkit of a modern recommender engineer. 🎓

References

  1. Yehuda Koren, Robert Bell, Chris Volinsky. Matrix Factorization Techniques for Recommender Systems. IEEE Computer, 2009. — The Netflix-Prize-era reference for MF (our SGD version).

  2. Yifan Hu, Yehuda Koren, Chris Volinsky. Collaborative Filtering for Implicit Feedback Datasets. ICDM 2008. — Implicit ALS: preference + confidence (our ImplicitALS).

  3. Steffen Rendle et al. BPR: Bayesian Personalized Ranking from Implicit Feedback. UAI 2009. — Pairwise learning-to-rank with negative sampling (our BPR).

  4. Greg Linden, Brent Smith, Jeremy York. Amazon.com Recommendations: Item-to-Item Collaborative Filtering. IEEE Internet Computing, 2003. — The item-item neighborhood method at scale.

  5. Paul Covington, Jay Adams, Emre Sargin. Deep Neural Networks for YouTube Recommendations. RecSys 2016. — The canonical two-stage (candidate generation + ranking) deep architecture.

  6. Xiangnan He et al. Neural Collaborative Filtering. WWW 2017. — Neural generalization of matrix factorization.

  7. Maurizio Ferrari Dacrema, Paolo Cremonesi, Dietmar Jannach. Are We Really Making Much Progress? A Worrying Analysis of Recent Neural Recommendation Approaches. RecSys 2019. — Why strong, well-tuned baselines matter (and often win).

Tools & libraries

  • implicit — fast ALS / BPR (Cython).
  • LightFM — hybrid content + collaborative (WARP/BPR).
  • FAISS / HNSW / ScaNN — ANN serving for candidate generation.
  • OpenSearch / Milvus / Qdrant / Pinecone — vector databases with kNN search.
  • TensorFlow Recommenders / TorchRec — two-tower and deep ranking models.

Companion books here

This book's code

All depend only on NumPy and the standard library.