Content-based filtering (decay + kNN)

This is the recommender from the introduction — and a great first personalized method. The idea: describe items by their content (embeddings), summarize a user by the content they engage with, and recommend the nearest items. It needs no other users' data, so it works from a user's very first interactions.

The recipe

  1. Embed every item. Turn each item into a vector capturing its content (genre features, or a text/image model's embedding). Similar items → nearby vectors.
  2. Build the user's taste vector. Average the embeddings of the items the user interacted with — weighting recent interactions more (time decay), because interests drift.
  3. Find nearest items. Rank all unseen items by cosine similarity to the taste vector (a nearest-neighbor / kNN search) and recommend the top-k.

  read articles        their embeddings        decayed average      kNN over catalog
  (recent first)   ──►  •  •  •           ──►   ★ taste vector  ──►  nearest unseen items
                       (older = lighter)
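
The three steps can be sketched on a toy catalog. Everything here is made up for illustration: the item names, the hand-written genre features, and the two recency weights (the decay formula that would normally produce them is covered next).

```python
import numpy as np

# Step 1: hypothetical catalog. Each row is an item; columns are [tech, sports, cooking].
item_names = ["phone review", "laptop review", "match report", "pasta recipe", "chip news"]
item_feats = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.0, 0.1],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.8, 0.1, 0.0],
])

# Step 2: taste vector = weighted average of the items the user read.
history = [3, 0]                  # oldest -> newest: pasta recipe, then phone review
weights = np.array([0.2, 1.0])    # assumed recency weights (newer = heavier)
taste = (weights[:, None] * item_feats[history]).sum(axis=0) / weights.sum()

# Step 3: cosine similarity to every item, then top-k among unseen items.
unit = item_feats / np.linalg.norm(item_feats, axis=1, keepdims=True)
scores = unit @ (taste / np.linalg.norm(taste))
scores[history] = -np.inf         # never recommend what the user already saw
top2 = np.argsort(-scores)[:2]
recs = [item_names[i] for i in top2]
print(recs)                       # ['laptop review', 'chip news']
```

The older cooking read still nudges the profile slightly, which is why the laptop review (with a little cooking weight) edges out the chip news.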

Why the time decay matters

A user who spent last year on cooking but this week on travel should get travel recs. A plain average buries the recent shift under a year of cooking. Weighting by recency (the same $w = e^{-\lambda\,\text{age}}$ from Chapter 4, with $\lambda = \ln 2 / H$ for half-life $H$) lets the profile follow the user.

$$ \text{taste}_u = \frac{\sum_{(i,t)\in \text{history}} e^{-\lambda(\text{now}-t)}\; \text{emb}_i}{\sum_{(i,t)\in \text{history}} e^{-\lambda(\text{now}-t)}}. $$

It's a weighted average of item embeddings, with the most recent items weighted most heavily.
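
A small numeric sketch of the cooking-versus-travel case makes the effect concrete. The 30-day half-life, the 2-D embeddings, and the read counts and ages are all assumptions chosen for illustration, not values from the book's dataset.

```python
import numpy as np

half_life = 30.0                       # assumed half-life H, in days
lam = np.log(2) / half_life            # lambda = ln 2 / H

# Hypothetical 2-D embeddings: axis 0 = cooking, axis 1 = travel.
cooking = np.array([1.0, 0.0])
travel = np.array([0.0, 1.0])

# 20 cooking reads from 200 to 390 days ago, then 3 travel reads this week.
ages = np.concatenate([np.arange(200, 400, 10), [1, 2, 3]]).astype(float)
embs = np.vstack([np.tile(cooking, (20, 1)), np.tile(travel, (3, 1))])

plain = embs.mean(axis=0)              # every read counts equally
w = np.exp(-lam * ages)                # w = e^{-lambda * age}
decayed = (w[:, None] * embs).sum(axis=0) / w.sum()

print("plain  :", plain.round(3))
print("decayed:", decayed.round(3))
```

The plain average is 20/23 cooking (about 0.87), so it would keep recommending recipes. With the decay, the travel component is well over 0.9, because each cooking read is more than six half-lives old and contributes almost nothing.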

The code

import numpy as np

class ContentBased:
    def __init__(self, half_life=2000.0):                      # half-life in the same units as the timestamps t
        self.half_life = half_life

    def fit(self, train, item_feats, n_users, n_items):
        self.item_feats = item_feats
        norm = np.linalg.norm(item_feats, axis=1, keepdims=True)
        self.feat_unit = item_feats / np.maximum(norm, 1e-9)   # unit vectors -> cosine
        self.hist = {u: [] for u in range(n_users)}
        for u, i, t in train:
            self.hist[u].append((t, i))
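        self.now = max(t for u, i, t in train)                 # "now" = latest timestamp in the training data
        self.seen = user_seen(train, n_users)                  # helper defined elsewhere; used below as user -> seen items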
        return self

    def profile(self, u):
        h = self.hist.get(u, [])
        if not h:
            return None                                        # no history -> cold start
        lam = np.log(2) / self.half_life
        items = [i for t, i in h]
        w = np.array([np.exp(-lam * (self.now - t)) for t, i in h])
        return (w[:, None] * self.item_feats[items]).sum(0) / w.sum()   # decayed average

    def recommend(self, u, k=10):
        p = self.profile(u)
        if p is None:
            return []                                          # caller falls back to trending
        pu = p / max(np.linalg.norm(p), 1e-9)
        scores = self.feat_unit @ pu                           # cosine to every item
        return _topk_excluding(scores, self.seen.get(u, set()), k)

feat_unit @ pu is the kNN step done with one matrix-vector product. Because the rows of feat_unit and the vector pu all have unit length, each dot product is the cosine similarity between the taste vector and one item, so a single product scores the whole catalog; we then take the top-k unseen items. At catalog scale you'd replace this exact scan with an approximate nearest-neighbor index (HNSW or IVF-PQ); that's the direct bridge to the other books and to the serving chapter.
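
To check that the single product really is a cosine scan, the sketch below compares it with an explicit per-item loop. The catalog size, embedding dimension, and random seed are arbitrary choices for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
item_feats = rng.normal(size=(1000, 16))     # made-up catalog: 1000 items, 16-d embeddings
taste = rng.normal(size=16)                  # made-up taste vector

# Fast path: normalize once, then one matrix-vector product.
feat_unit = item_feats / np.linalg.norm(item_feats, axis=1, keepdims=True)
pu = taste / np.linalg.norm(taste)
fast = feat_unit @ pu

def cosine(a, b):
    """Textbook cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Slow path: cosine computed item by item.
slow = np.array([cosine(row, taste) for row in item_feats])

print(np.allclose(fast, slow))               # the two scans agree up to rounding
print(int(np.argmax(fast)), int(np.argmax(slow)))
```

The exact scan costs one pass over every item per query, which is the cost an approximate index trades a little accuracy to avoid.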

See it work

The article-recommender CLI is exactly this method on text (embeddings via TF-IDF). A user who read two tech articles, asking for 3 recs:

$ python recommend_cli.py articles.txt --history 0,1 -k 3
history (oldest->newest): [0, 1]
  most recent read: 'Google releases smartphone software update improving camera battery and performance'

recommended for you:
  1. (score=0.453)  [3] Samsung reveals smartphone with upgraded camera processor and faster software
  2. (score=0.337)  [2] New laptop launches with a powerful processor faster software and better battery
  3. (score=0.311)  [4] Chipmaker announces faster processor boosting laptop and smartphone performance

All three recommendations are tech articles — the taste vector landed in the "tech" region of embedding space and kNN returned its neighbors. A user with a sports history gets sports back instead.
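
The same behavior can be reproduced in a few lines with a hand-rolled TF-IDF. This is a sketch, not the CLI's code: it uses the four headlines visible in the output above plus two made-up sports headlines, a whitespace tokenizer, and one common smoothed-IDF variant, so its list positions and scores will not match the CLI's ids or numbers.

```python
import math
from collections import Counter
import numpy as np

docs = [
    "Google releases smartphone software update improving camera battery and performance",
    "New laptop launches with a powerful processor faster software and better battery",
    "Samsung reveals smartphone with upgraded camera processor and faster software",
    "Chipmaker announces faster processor boosting laptop and smartphone performance",
    "Striker scores twice as home team wins the cup final",       # hypothetical
    "Coach praises defense after the team wins a tense match",    # hypothetical
]

tokens = [d.lower().split() for d in docs]
vocab = sorted({w for doc in tokens for w in doc})
col = {w: j for j, w in enumerate(vocab)}

# Document frequency and a smoothed IDF (assumed variant; always positive).
df = Counter(w for doc in tokens for w in set(doc))
n_docs = len(docs)
idf = np.array([math.log((1 + n_docs) / (1 + df[w])) + 1.0 for w in vocab])

# TF-IDF matrix, one row per headline, then L2-normalize the rows.
tfidf = np.zeros((n_docs, len(vocab)))
for r, doc in enumerate(tokens):
    for w, count in Counter(doc).items():
        tfidf[r, col[w]] = count * idf[col[w]]
unit = tfidf / np.linalg.norm(tfidf, axis=1, keepdims=True)

history = [0]                                 # the user read the Google headline
taste = unit[history].mean(axis=0)
scores = unit @ (taste / np.linalg.norm(taste))
scores[history] = -np.inf                     # exclude already-read items
recs = [int(i) for i in np.argsort(-scores)[:3]]
for i in recs:
    print(f"{scores[i]:.3f}  {docs[i]}")
```

The sports headlines share no words with the history, so their cosine is zero and the three tech headlines fill the list.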

Where content-based shines — and where it doesn't

Strengths

  • No cold-start for items. A brand-new article can be recommended the instant it's embedded — no interactions needed. (Great for news, where items are born and die daily.)
  • Works from a short history and is explainable ("because you read X").
  • No dependence on other users.

Weaknesses

  • Over-specialization / filter bubble. It keeps recommending the same topic; it can't surface a great item outside your past interests (no serendipity).
  • Only as good as the features. Weak embeddings → weak recs. (Modern systems use strong text/image embedding models here.)
  • Still needs some user history to build a profile. A brand-new user with zero reads gets nothing, which is why the CLI falls back to representative/trending items (see Cold start).
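
The fallback for a user with no history can be sketched as a thin wrapper. The names trending and recommend_with_fallback are hypothetical, "trending" is approximated here by raw interaction counts, and the stub model stands in for any recommender whose recommend method returns an empty list on cold start.

```python
from collections import Counter

def trending(train, k=10):
    """Most-interacted items overall; a simple stand-in for a real trending model."""
    counts = Counter(i for u, i, t in train)
    return [i for i, _ in counts.most_common(k)]

def recommend_with_fallback(model, u, train, k=10):
    """Return the personalized list if there is one, otherwise the trending items."""
    recs = model.recommend(u, k)
    return recs if recs else trending(train, k)

class EmptyModel:
    """Stub for the demo: behaves like a recommender facing a brand-new user."""
    def recommend(self, u, k=10):
        return []

# (user, item, timestamp) triples: item 5 has 3 reads, item 7 has 2, item 9 has 1.
train = [(0, 5, 1), (1, 5, 2), (2, 5, 3), (0, 7, 4), (1, 7, 5), (2, 9, 6)]
fallback = recommend_with_fallback(EmptyModel(), u=42, train=train, k=2)
print(fallback)   # [5, 7]
```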

Content-based looks only at items. The next family looks at the crowd, often finding connections that content can't.