Content-based filtering (decay + kNN)
This is the recommender from the introduction — and a great first personalized method. The idea: describe items by their content (embeddings), summarize a user by the content they engage with, and recommend the nearest items. It needs no other users' data, so it works from a user's very first interactions.
The recipe
- Embed every item. Turn each item into a vector capturing its content (genre features, or a text/image model's embedding). Similar items → nearby vectors.
- Build the user's taste vector. Average the embeddings of the items the user interacted with — weighting recent interactions more (time decay), because interests drift.
- Find nearest items. Rank all unseen items by cosine similarity to the taste vector (a nearest-neighbor / kNN search) and recommend the top-k.
read articles         their embeddings        decayed average        kNN over catalog
(recent first)  ──►        • • •        ──►   ★ taste vector   ──►   nearest unseen items
                      (older = lighter)
Why the time decay matters
A user who spent last year on cooking but this week on travel should get travel recs. A plain average buries the recent shift under a year of cooking. Weighting by recency, with the same $w = e^{-\lambda\,\text{age}}$ and $\lambda = \ln 2 / H$ for half-life $H$ as in Chapter 4, lets the profile follow the user.
$$ \text{taste}_u = \frac{\sum_{(i,t)\in \text{history}} e^{-\lambda(\text{now}-t)}\; \text{emb}_i}{\sum_{(i,t)\in \text{history}} e^{-\lambda(\text{now}-t)}}. $$
It's a weighted average of item embeddings, recent items weighing most.
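To make the cooking-versus-travel example concrete, here is a minimal sketch with made-up 2-D embeddings (axis 0 = cooking, axis 1 = travel) and an assumed half-life of 30 days. The function name `decayed_average` and every number below are illustrative, not taken from the book's data.

```python
import numpy as np

def decayed_average(embs, ages, half_life):
    """Weighted mean of item embeddings with weights w = exp(-lam * age)."""
    lam = np.log(2) / half_life               # weight halves every `half_life`
    w = np.exp(-lam * np.asarray(ages, dtype=float))
    embs = np.asarray(embs, dtype=float)
    return (w[:, None] * embs).sum(axis=0) / w.sum()

# Hypothetical history: 8 cooking reads a year ago, 2 travel reads this week.
cooking, travel = [1.0, 0.0], [0.0, 1.0]
embs = [cooking] * 8 + [travel] * 2
ages = [365.0] * 8 + [2.0] * 2                # age in days

plain = np.mean(embs, axis=0)                 # [0.8, 0.2]: cooking dominates
decayed = decayed_average(embs, ages, half_life=30.0)
print(plain, decayed)                         # decayed travel component is above 0.99
```

With these toy numbers the plain average still points mostly at cooking, while the decayed average points almost entirely at travel, because each year-old read carries roughly 2^(-12) of the weight of a fresh one.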
The code
import numpy as np

# user_seen and _topk_excluding are helpers that are not defined in this section.

class ContentBased:
    def __init__(self, half_life=2000.0):
        self.half_life = half_life  # in the same units as the timestamps t

    def fit(self, train, item_feats, n_users, n_items):
        self.item_feats = item_feats
        norm = np.linalg.norm(item_feats, axis=1, keepdims=True)
        self.feat_unit = item_feats / np.maximum(norm, 1e-9)  # unit vectors -> cosine
        self.hist = {u: [] for u in range(n_users)}
        for u, i, t in train:
            self.hist[u].append((t, i))
        self.now = max(t for u, i, t in train)  # "now" = latest training timestamp
        self.seen = user_seen(train, n_users)
        return self

    def profile(self, u):
        h = self.hist.get(u, [])
        if not h:
            return None  # no history -> cold start
        lam = np.log(2) / self.half_life  # decay rate from the half-life
        items = [i for t, i in h]
        w = np.array([np.exp(-lam * (self.now - t)) for t, i in h])
        return (w[:, None] * self.item_feats[items]).sum(0) / w.sum()  # decayed average

    def recommend(self, u, k=10):
        p = self.profile(u)
        if p is None:
            return []  # caller falls back to trending
        pu = p / max(np.linalg.norm(p), 1e-9)
        scores = self.feat_unit @ pu  # cosine to every item
        return _topk_excluding(scores, self.seen.get(u, set()), k)
feat_unit @ pu is the kNN step done with one matrix-vector product: it computes
the cosine similarity between the taste vector and every item at once, then we
take the top-k unseen. At catalog scale you'd replace this exact scan with an
approximate nearest-neighbor index (HNSW or IVF-PQ) — that's the direct bridge
to the other books, and the serving chapter.
See it work
The article-recommender CLI is exactly this method on text (embeddings via TF-IDF). A user who read two tech articles, asking for 3 recs:
$ python recommend_cli.py articles.txt --history 0,1 -k 3
history (oldest->newest): [0, 1]
most recent read: 'Google releases smartphone software update improving camera battery and performance'
recommended for you:
1. (score=0.453) [3] Samsung reveals smartphone with upgraded camera processor and faster software
2. (score=0.337) [2] New laptop launches with a powerful processor faster software and better battery
3. (score=0.311) [4] Chipmaker announces faster processor boosting laptop and smartphone performance
All three recommendations are tech articles — the taste vector landed in the "tech" region of embedding space and kNN returned its neighbors. A user with a sports history gets sports back instead.
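The CLI's source is not shown here, so the following is only a rough sketch of the same idea: a hand-rolled TF-IDF (one common variant, tf × log(N / df)) over a toy corpus, followed by the cosine ranking. The four tech headlines are copied from the output above; the two sports headlines and the function name `tfidf_unit` are invented for illustration, and the scores will not match the CLI's.

```python
import numpy as np

docs = [
    "Google releases smartphone software update improving camera battery and performance",
    "New laptop launches with a powerful processor faster software and better battery",
    "Samsung reveals smartphone with upgraded camera processor and faster software",
    "Chipmaker announces faster processor boosting laptop and smartphone performance",
    "Striker scores twice as home team wins the cup final",        # invented
    "Coach praises team defense after the league match win",       # invented
]

def tfidf_unit(docs):
    """Rows are L2-normalized TF-IDF vectors, so a dot product is a cosine."""
    tokens = [d.lower().split() for d in docs]
    vocab = {w: j for j, w in enumerate(sorted({w for t in tokens for w in t}))}
    tf = np.zeros((len(docs), len(vocab)))
    for r, toks in enumerate(tokens):
        for w in toks:
            tf[r, vocab[w]] += 1.0
    df = (tf > 0).sum(axis=0)                  # number of docs containing each word
    x = tf * np.log(len(docs) / df)            # a word in every doc gets weight 0
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), 1e-9)

X = tfidf_unit(docs)
history = [0]                                  # the user read the Google article
taste = X[history].mean(axis=0)                # one read, so decay changes nothing
scores = X @ taste                             # cosine-style score for every article
scores[history] = -np.inf                      # never recommend what was already read
ranked = np.argsort(-scores)
print([docs[i] for i in ranked[:3]])
```

On this toy corpus the three remaining tech headlines outrank the sports ones, which share no words with the history and therefore score 0.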
Where content-based shines — and where it doesn't
Strengths
- No cold-start for items. A brand-new article can be recommended the instant it's embedded — no interactions needed. (Great for news, where items are born and die daily.)
- Works from a short history and is explainable ("because you read X").
- No dependence on other users.
Weaknesses
- Over-specialization / filter bubble. It keeps recommending the same topic; it can't surface a great item outside your past interests (no serendipity).
- Only as good as the features. Weak embeddings → weak recs. (Modern systems use strong text/image embedding models here.)
- Still needs some user history to build a profile. A brand-new user with zero reads gets nothing, which is why the CLI falls back to representative/trending items (see Cold start).
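One way a caller might implement the fallback from the last bullet, assuming "trending" simply means the most-interacted items in the training data. The helper names `trending_items` and `recommend_with_fallback` are hypothetical, and the CLI's own fallback may work differently.

```python
from collections import Counter

def trending_items(train, k=10):
    """Most-interacted item ids in `train`, a list of (user, item, time) tuples."""
    counts = Counter(i for _u, i, _t in train)
    return [i for i, _n in counts.most_common(k)]

def recommend_with_fallback(model, u, train, k=10):
    """Use the personalized list when there is one, otherwise fall back to trending."""
    recs = model.recommend(u, k)   # returns [] for a user with no history
    return recs if recs else trending_items(train, k)
```

A real system would probably restrict the count to a recent time window so that "trending" does not collapse into all-time popularity.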
Content-based looks only at items. The next family looks at the crowd — often finding connections content can't. 👉