A worked evaluation: the leaderboard

Now we put every algorithm on the same data and the same scoreboard. This is how you'd actually decide what to ship: train them all, measure offline, and read the ranking. The full script is code/demo.py.

The setup

make_synthetic builds an interaction log with real structure: each item has a genre, each user has a peaky genre taste, and item popularity is skewed — so popularity is a genuine baseline and personalization can still win. We hold out each user's last interaction (leave-last-out) and score the top-10.

from recsys import (make_synthetic, train_test_split_last,
                    recall_at_k, ndcg_at_k, map_at_k,
                    Random, Popularity, Trending, ContentBased,
                    ItemItemCF, ImplicitALS, BPR)

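# interaction log, item features, number of users, number of items
inter, feats, nu, ni = make_synthetic(seed=0)
# hold out each user's last interaction as the test set
train, test = train_test_split_last(inter, nu)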
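
To make the leave-last-out split concrete, here is a minimal sketch of the idea. It is not the recsys implementation of train_test_split_last: the function name leave_last_out and the (user, item, timestamp) tuple layout are assumptions made purely for illustration.

```python
from collections import defaultdict

def leave_last_out(interactions):
    """Split (user, item, timestamp) tuples: each user's latest event goes to test.

    Hypothetical helper for illustration; the real split may differ in detail.
    """
    by_user = defaultdict(list)
    for user, item, ts in interactions:
        by_user[user].append((ts, item))

    train, test = [], {}
    for user, events in by_user.items():
        events.sort()                       # oldest first, by timestamp
        *history, (_, last_item) = events   # peel off the most recent event
        train.extend((user, item) for _, item in history)
        test[user] = last_item              # exactly one held-out item per user
    return train, test

# Toy log: user 0 has three events, user 1 has two (given out of order).
log = [(0, 10, 1), (0, 11, 2), (0, 12, 3), (1, 20, 5), (1, 21, 4)]
train, test = leave_last_out(log)
print(test)   # {0: 12, 1: 20}
```

Because every user contributes exactly one test item, each per-user metric is decided by where that single item lands in the top-10.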

The result

Running python demo.py from the code/ directory (~10s) prints:

users=300  items=150  interactions=18000  test users=300

leaderboard (k=10, higher is better):

  model           recall@k   ndcg@k   map@k
  -----------------------------------------
  Random             0.033    0.014   0.008
  Popularity         0.073    0.032   0.020
  Trending           0.073    0.031   0.019
  ContentBased       0.080    0.035   0.021
  ItemItemCF         0.093    0.051   0.038
  ImplicitALS        0.087    0.041   0.028
  BPR                0.110    0.045   0.026

cold start — a new user with no interactions:
  ContentBased.recommend(new_user) -> []  (empty: no history)
  Trending fallback                -> [127, 105, 87, 78, 13]  (works without history)

How to read it

Read from the floor up, the rows tell the whole story of the book:

Random (0.033): the floor. Any real method must beat this.
Popularity / Trending (0.073): non-personalized, yet more than 2× random. Always measure against this baseline; it's deceptively strong, and it's your cold-start fallback.
ContentBased (0.080): the first personalization, built from item features plus a decayed user profile. Modest here (genre features are coarse), but it's the method that handles new items and short histories.
ItemItemCF (0.093): learning from the crowd beats content, with no item features at all. Note its ndcg/map (0.051 / 0.038), the highest on the board: it ranks the hit earlier, not just somewhere within the top 10.
ImplicitALS (0.087): competitive, compact, and scalable (tiny vectors, ANN-servable), which makes it the production-friendly choice.
BPR (0.110): best recall@10, plausibly because it optimizes ranking directly, which is what recall@k rewards.
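
Why can one model lead on recall while another leads on ndcg/map? With a single held-out item per user, the three metrics reduce to simple functions of the hit's rank. The sketch below uses the common textbook definitions; the function name single_hit_metrics is hypothetical, and the recsys versions of recall_at_k, ndcg_at_k, and map_at_k may differ in detail.

```python
import math

def single_hit_metrics(ranked, held_out, k=10):
    """Per-user (recall@k, ndcg@k, ap@k) when exactly one item is held out."""
    top_k = ranked[:k]
    if held_out not in top_k:
        return 0.0, 0.0, 0.0
    rank = top_k.index(held_out) + 1     # 1-based position of the hit
    recall = 1.0                         # the only relevant item was found
    ndcg = 1.0 / math.log2(rank + 1)     # ideal DCG is 1 for a single item
    ap = 1.0 / rank                      # average precision with one relevant item
    return recall, ndcg, ap

print(single_hit_metrics([7, 3, 9], held_out=7))  # hit at rank 1: all three are 1.0
print(single_hit_metrics([3, 9, 7], held_out=7))  # hit at rank 3: recall 1.0, ndcg 0.5, ap 1/3
```

Recall only asks whether the hit is in the top k, while ndcg and ap shrink as the hit slides down the list. Averaged over users, a model with slightly fewer hits placed near rank 1 can therefore post higher ndcg/map than a model with more hits placed lower.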

And the cold-start lines show the graceful fallback: the personalized model returns nothing for a brand-new user, so the system serves trending instead.
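
The fallback logic can be sketched in a few lines. This is an illustrative wrapper, not code from demo.py: recommend_with_fallback and the two stand-in models are hypothetical, and the trending list is simply copied from the output above.

```python
def recommend_with_fallback(user, primary, fallback, k=5):
    """Serve the personalized list; top it up from the fallback if it is short."""
    recs = list(primary(user))[:k]
    for item in fallback():
        if len(recs) >= k:
            break
        if item not in recs:       # avoid duplicates across the two sources
            recs.append(item)
    return recs

# Stand-ins for real models (hypothetical, for illustration only).
canned_recs = {"alice": [42, 7]}

def personalized(user):
    return canned_recs.get(user, [])    # empty list for an unknown user

def trending():
    return [127, 105, 87, 78, 13]       # list copied from the demo output

print(recommend_with_fallback("new_user", personalized, trending))  # [127, 105, 87, 78, 13]
print(recommend_with_fallback("alice", personalized, trending))     # [42, 7, 127, 105, 87]
```

Topping up rather than switching wholesale also covers users with very short histories, whose personalized list may come back with fewer than k items.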

The honest caveats

Absolute numbers are dataset-dependent. Offline recall@10 on sparse implicit data is often in this range; what matters is the ordering and the gap over random/popularity, not the raw value.
No single winner. BPR leads recall@10; item-item leads ndcg/map. The "best" model depends on your metric, data, and constraints — which is why you run a leaderboard instead of trusting a reputation.
Offline ≠ online. This leaderboard only shortlists candidates worth A/B testing; the live test makes the final call (see the metrics chapter).

Try it yourself

Change seed in make_synthetic and re-run — the ordering is stable, the exact numbers wobble.
Raise pop_weight (more popularity-driven data) and watch Popularity climb.
Tune factors/iters (ALS) or epochs/lr (BPR) and watch their scores move.
Increase k to 20 and see recall rise (more chances to include the hit).
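
How much wobble should you expect when you change the seed? A rough estimate, under a simplifying assumption: with one held-out item per user, recall@k is the fraction of test users who got a hit, so its sampling noise can be approximated by the binomial standard error. This ignores other sources of variance (model initialization, the data generator itself), so treat it as a lower bound rather than a full error bar. The helper name recall_std_error is hypothetical.

```python
import math

def recall_std_error(recall, n_users):
    """Binomial standard error of a hit rate measured over n_users."""
    return math.sqrt(recall * (1.0 - recall) / n_users)

n_test = 300  # test users reported in the demo output
for name, recall in [("Popularity", 0.073), ("ItemItemCF", 0.093), ("BPR", 0.110)]:
    se = recall_std_error(recall, n_test)
    print(f"{name:12s} recall@10 = {recall:.3f} +/- {se:.3f}")
```

With 300 test users the standard error comes out between roughly 0.015 and 0.018, so a gap of a point or two of recall between adjacent rows can sit within the noise of a single run. Averaging the leaderboard over several seeds gives a steadier picture.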

Finally, the wisdom that doesn't fit in a metric: best practices and pitfalls. 👉