A worked evaluation: the leaderboard
Now we put every algorithm on the same data and the same scoreboard. This is how
you'd actually decide what to ship: train them all, measure offline, and read the
ranking. The full script is code/demo.py.
The setup
make_synthetic builds an interaction log with real structure: each item has a
genre, each user has a peaky genre taste, and item popularity is skewed — so
popularity is a genuine baseline and personalization can still win. We hold out
each user's last interaction (leave-last-out) and score the top-10.
from recsys import (make_synthetic, train_test_split_last,
                    recall_at_k, ndcg_at_k, map_at_k,
                    Random, Popularity, Trending, ContentBased,
                    ItemItemCF, ImplicitALS, BPR)

inter, feats, nu, ni = make_synthetic(seed=0)
train, test = train_test_split_last(inter, nu)
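For readers who want to see what a leave-last-out split does, here is a minimal standalone sketch. It is not the recsys implementation of train_test_split_last: the function name split_last and the (user, item, timestamp) tuple layout are assumptions made for illustration only.

```python
from collections import defaultdict

def split_last(interactions):
    """Hold out each user's most recent interaction.

    `interactions` is assumed to be a list of (user, item, timestamp) tuples.
    Returns (train, test): train is a list of tuples, test maps user -> held-out item.
    """
    by_user = defaultdict(list)
    for user, item, ts in interactions:
        by_user[user].append((ts, item))

    train, test = [], {}
    for user, events in by_user.items():
        events.sort()  # oldest first; timestamp ties are broken by item id
        if len(events) < 2:
            # A single interaction cannot be both trained on and tested.
            train.extend((user, item, ts) for ts, item in events)
            continue
        *history, (last_ts, last_item) = events
        train.extend((user, item, ts) for ts, item in history)
        test[user] = last_item
    return train, test

log = [(0, 10, 1), (0, 11, 2), (0, 12, 3), (1, 10, 5), (1, 13, 4), (2, 11, 1)]
train, test = split_last(log)
print(test)  # {0: 12, 1: 10}
```

User 2 has only one interaction, so it stays in the training set and is not evaluated. Whether the real helper treats such users the same way is not stated in this chapter.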
The result
Running python demo.py (~10s) prints:
users=300 items=150 interactions=18000 test users=300
leaderboard (k=10, higher is better):
model          recall@k   ndcg@k    map@k
-----------------------------------------
Random            0.033    0.014    0.008
Popularity        0.073    0.032    0.020
Trending          0.073    0.031    0.019
ContentBased      0.080    0.035    0.021
ItemItemCF        0.093    0.051    0.038
ImplicitALS       0.087    0.041    0.028
BPR               0.110    0.045    0.026
cold start — a new user with no interactions:
ContentBased.recommend(new_user) -> [] (empty: no history)
Trending fallback -> [127, 105, 87, 78, 13] (works without history)
How to read it
The ranking tells the whole story of the book, bottom to top:
- Random (0.033) — the floor. Any real method must beat this.
- Popularity / Trending (0.073) — non-personalized, yet 2× random. Always measure against this; it's deceptively strong and it's your cold-start fallback.
- ContentBased (0.080) — the first personalization, from item features + decayed profile. Modest here (genre features are coarse), but it's the method that handles new items and short histories.
- ItemItemCF (0.093) — learning from the crowd beats content, with no item features at all. Note its higher ndcg/map: it ranks the hit earlier, not just within top-10.
- ImplicitALS (0.087) — competitive and compact/scalable (tiny vectors, ANN-servable) — the production-friendly choice.
- BPR (0.110) — best recall@10, because it optimizes ranking directly, which is what recall@k rewards.
And the cold-start lines show the graceful fallback: the personalized model returns nothing for a brand-new user, so the system serves trending instead.
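The fallback pattern can be sketched in a few lines. This is an illustration under assumptions, not the code in demo.py: the helper names top_popular and recommend_with_fallback are hypothetical, plain popularity counts stand in for the Trending model, and toy_personalized is a stand-in for any model that returns an empty list when a user has no history.

```python
from collections import Counter

def top_popular(train, k, exclude=()):
    """Most-interacted items in `train`, a list of (user, item) pairs."""
    counts = Counter(item for _, item in train)
    ranked = [item for item, _ in counts.most_common() if item not in exclude]
    return ranked[:k]

def recommend_with_fallback(personalized, user, train, k=5):
    """Serve personalized results, padding with popular items when they run short."""
    recs = list(personalized(user))[:k]
    if len(recs) < k:
        filler = top_popular(train, k, exclude=set(recs))
        recs += filler[: k - len(recs)]
    return recs

train = [(0, 7), (1, 7), (2, 7), (0, 3), (1, 3), (2, 9)]
histories = {0: [3, 7]}

def toy_personalized(user):
    # Stand-in for a personalized model: nothing to say about unknown users.
    return [9] if user in histories else []

print(recommend_with_fallback(toy_personalized, 0, train, k=2))   # [9, 7]
print(recommend_with_fallback(toy_personalized, 99, train, k=2))  # [7, 3]
```

The same wrapper also covers the partial case, where a personalized model returns fewer than k items. A production version would typically also filter out items the user has already interacted with, which this sketch skips.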
The honest caveats
- Absolute numbers are dataset-dependent. Offline recall@10 on sparse implicit data is often in this range; what matters is the ordering and the gap over random/popularity, not the raw value.
- No single winner. BPR leads recall@10; item-item leads ndcg/map. The "best" model depends on your metric, data, and constraints — which is why you run a leaderboard instead of trusting a reputation.
- Offline ≠ online. This leaderboard filters candidates worth A/B testing; the live test decides (metrics chapter).
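To see how recall can disagree with ndcg and map, it helps to write the metrics out for the leave-last-out case, where each user has exactly one held-out item. The sketch below assumes the common definitions (hit or miss for recall, 1/log2(rank+1) for ndcg, 1/rank for average precision). The function metrics_at_k is hypothetical and may differ in detail from recall_at_k, ndcg_at_k, and map_at_k in recsys.

```python
import math

def metrics_at_k(ranked, held_out, k=10):
    """recall@k, ndcg@k, and AP@k for one user with a single held-out item."""
    top = list(ranked)[:k]
    if held_out not in top:
        return 0.0, 0.0, 0.0
    rank = top.index(held_out) + 1      # 1-based position of the hit
    recall = 1.0                        # the only relevant item was found
    ndcg = 1.0 / math.log2(rank + 1)    # ideal DCG is 1 with one relevant item
    ap = 1.0 / rank                     # precision at the hit position
    return recall, ndcg, ap

def mean(rows):
    """Average each metric column over users."""
    return tuple(sum(col) / len(rows) for col in zip(*rows))

# Model A finds the held-out item for both users, but late in the list.
model_a = [metrics_at_k([5, 6, 7, 1], 1), metrics_at_k([8, 9, 2], 2)]
# Model B finds it for only one user, but at rank 1.
model_b = [metrics_at_k([1, 5, 6, 7], 1), metrics_at_k([8, 9, 3], 2)]

print("A:", mean(model_a))
print("B:", mean(model_b))
```

In this toy case model A wins on recall while model B wins on ndcg and map, the same kind of split the leaderboard shows between BPR and ItemItemCF.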
Try it yourself
- Change seed in make_synthetic and re-run: the ordering is stable, the exact numbers wobble.
- Raise pop_weight (more popularity-driven data) and watch Popularity climb.
- Tune factors/iters (ALS) or epochs/lr (BPR) and watch them move.
- Increase k to 20 and see recall rise (more chances to include the hit).
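The effect of raising k can be checked without retraining anything. If you record the rank at which each user's held-out item appears, recall@k is the fraction of users whose rank is at most k, so it can only stay flat or rise as k grows. The helper recall_curve and the rank values below are made up for illustration and are not taken from demo.py.

```python
def recall_curve(hit_ranks, ks):
    """Fraction of users whose held-out item appears within the top k, for each k.

    `hit_ranks` holds the 1-based rank of each user's held-out item,
    or None when the model never ranked it.
    """
    n = len(hit_ranks)
    return {k: sum(r is not None and r <= k for r in hit_ranks) / n for k in ks}

ranks = [1, 4, 12, None, 18, 25, 7, None]
curve = recall_curve(ranks, ks=[5, 10, 20])
print(curve)  # {5: 0.25, 10: 0.375, 20: 0.625}
```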
Finally, the wisdom that doesn't fit in a metric: best practices and pitfalls.