Measuring success: offline metrics
You can't improve what you can't measure. Before building recommenders we need a fair way to score them — otherwise we're just guessing. This chapter defines the metrics used in every later chapter's leaderboard.
The evaluation setup
We use leave-last-out: for each user, hide their chronologically last interaction, train on the rest, then ask the model for a top-k list and check whether the hidden item shows up. Using the last (not a random) interaction mimics reality — predict the future from the past, never peek ahead.
The held-out item is the ground truth; the model's ranked list is the prediction. All metrics compare the two.
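The split described above can be sketched as follows. This is a minimal illustration, assuming interactions arrive as `(user, item, timestamp)` tuples; the function name `leave_last_out` and the toy data are hypothetical and not taken from the chapter's codebase.

```python
from collections import defaultdict

def leave_last_out(interactions):
    """Split (user, item, timestamp) tuples into training pairs and held-out truth.

    Assumption: timestamps are comparable and every user has at least one
    interaction. A user with a single interaction gets no training history.
    """
    by_user = defaultdict(list)
    for user, item, ts in interactions:
        by_user[user].append((ts, item))

    train, truth = [], {}
    for user, events in by_user.items():
        events.sort(key=lambda e: e[0])        # chronological order
        *history, (_, last_item) = events      # peel off the most recent event
        truth[user] = last_item                # hidden ground truth
        train.extend((user, item) for _, item in history)
    return train, truth

# Hypothetical toy log: (user, item, timestamp)
interactions = [
    (0, 5, 10), (0, 7, 30), (0, 1, 20),
    (1, 3, 15), (1, 4, 5),
]
train, truth = leave_last_out(interactions)
print(truth)          # {0: 7, 1: 3}
print(sorted(train))  # [(0, 1), (0, 5), (1, 4)]
```

The resulting `truth` dictionary has the same `user -> item` shape that the metric functions in this chapter expect.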
The core metrics
Recall@k (a.k.a. Hit Rate here)
Of the users, what fraction had their hidden item appear in their top-k?
With one hidden item per user, "recall@k" and "hit rate" coincide: each user scores 1 if the item is in the top-k, else 0, and the scores are averaged over users. Higher is better. It answers "did we find it at all (within k)?" but ignores where in the list the item appears.
NDCG@k — rank-aware
Showing the right item at position 1 is better than at position 10. NDCG (Normalized Discounted Cumulative Gain) rewards higher placement with a logarithmic discount. For a single relevant item at 0-based rank $r$:
$$ \text{NDCG} = \frac{1}{\log_2(r + 2)}. $$
Rank 0 → $1/\log_2 2 = 1.0$ (perfect); rank 1 → $1/\log_2 3 \approx 0.63$; and so on. A user whose hidden item is not in the top-k scores 0, and the per-user values are averaged over users. NDCG is one of the most commonly reported offline metrics because it captures ordering quality, which is what users feel.
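To get a feel for how quickly the logarithmic discount falls off, the short sketch below evaluates the formula above for the first few 0-based ranks. The helper name `ndcg_single` is made up for illustration.

```python
import math

def ndcg_single(rank):
    """NDCG contribution of one relevant item found at 0-based `rank`."""
    return 1.0 / math.log2(rank + 2)

values = [round(ndcg_single(r), 3) for r in range(5)]
print(values)  # [1.0, 0.631, 0.5, 0.431, 0.387]
```

The drop from rank 0 to rank 1 is much larger than the drop from rank 3 to rank 4, so mistakes near the top of the list cost the most.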
MAP@k — precision-of-ranking
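Mean Average Precision also rewards ranking the hit early; with one relevant item per user it reduces to $1/(r + 1)$ for a hit at 0-based rank $r$ (the reciprocal of its 1-based position) and 0 for a miss, averaged over users. In this single-item setting it coincides with Mean Reciprocal Rank (MRR).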
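Where that reduction comes from can be shown in a couple of lines. The following is a sketch using one common definition of average precision for a single user; the symbols $p$ (1-based position), $P(p)$, $\mathrm{rel}(p)$ and $R$ are notation introduced here for illustration, and other normalizations (for example dividing by $\min(k, R)$) are also in use.

$$ \text{AP@}k = \frac{1}{R} \sum_{p=1}^{k} P(p)\,\mathrm{rel}(p), \qquad P(p) = \frac{\#\{\text{relevant items in the top } p\}}{p}. $$

Here $\mathrm{rel}(p)$ is 1 if the item at position $p$ is relevant and 0 otherwise, and $R$ is the number of relevant items. With a single held-out item ($R = 1$) found at position $p^\ast \le k$, only one term of the sum survives and $P(p^\ast) = 1/p^\ast$, so

$$ \text{AP@}k = \frac{1}{p^\ast} = \frac{1}{r + 1}, $$

where $r = p^\ast - 1$ is the 0-based rank; a miss gives 0.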
The code
import numpy as np

def recall_at_k(recs, truth, k):
    """Fraction of users whose held-out item appears in their top-k list."""
    hits = sum(1 for u, item in truth.items() if item in recs.get(u, [])[:k])
    return hits / len(truth)

def ndcg_at_k(recs, truth, k):
    """Mean of 1/log2(rank + 2) over users (0-based rank); a miss scores 0."""
    total = 0.0
    for u, item in truth.items():
        lst = recs.get(u, [])[:k]
        if item in lst:
            total += 1.0 / np.log2(lst.index(item) + 2)  # 0-based rank
    return total / len(truth)

def map_at_k(recs, truth, k):
    """Mean reciprocal of the hit's 1-based position; a miss scores 0."""
    total = 0.0
    for u, item in truth.items():
        lst = recs.get(u, [])[:k]
        if item in lst:
            total += 1.0 / (lst.index(item) + 1)  # 1-based rank
    return total / len(truth)
See them on a tiny example
Three users; one hits at position 1, one at position 3, one misses (positions are 1-based, so these are 0-based ranks 0 and 2):
truth = {0: 7, 1: 3, 2: 9}
recs = {0: [7, 1, 5],   # hit at position 1
        1: [4, 8, 3],   # hit at position 3
        2: [2, 6, 0]}   # miss
print("recall@3:", round(recall_at_k(recs, truth, 3), 3))
print("ndcg@3 :", round(ndcg_at_k(recs, truth, 3), 3))
print("map@3 :", round(map_at_k(recs, truth, 3), 3))
Output:
recall@3: 0.667
ndcg@3 : 0.5
map@3 : 0.444
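Two of three users got a hit → recall 0.667. NDCG and MAP are lower because the hits aren't all at the top: the hit at position 3 (0-based rank 2) contributes only $1/\log_2 4 = 0.5$ to NDCG and $1/3$ to MAP, giving $(1 + 0.5 + 0)/3 = 0.5$ and $(1 + 1/3 + 0)/3 \approx 0.444$.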
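To see how the cutoff k changes the picture, the sketch below re-scores the same toy data at k = 1, 2 and 3. So that the block runs on its own, it restates the hit test in two compact helpers (`hit_rank` and `scores_at_k`, both hypothetical names) instead of reusing the functions above.

```python
import math

truth = {0: 7, 1: 3, 2: 9}
recs = {0: [7, 1, 5], 1: [4, 8, 3], 2: [2, 6, 0]}

def hit_rank(recs, truth, user, k):
    """Return the 0-based rank of the hidden item in the top-k, or None on a miss."""
    top_k = recs.get(user, [])[:k]
    item = truth[user]
    return top_k.index(item) if item in top_k else None

def scores_at_k(recs, truth, k):
    """Return (recall@k, ndcg@k, map@k), each rounded to 3 decimals."""
    ranks = [hit_rank(recs, truth, u, k) for u in truth]
    hits = [r for r in ranks if r is not None]
    n = len(truth)
    recall = len(hits) / n
    ndcg = sum(1.0 / math.log2(r + 2) for r in hits) / n
    ap = sum(1.0 / (r + 1) for r in hits) / n
    return round(recall, 3), round(ndcg, 3), round(ap, 3)

for k in (1, 2, 3):
    print(k, scores_at_k(recs, truth, k))
# 1 (0.333, 0.333, 0.333)
# 2 (0.333, 0.333, 0.333)
# 3 (0.667, 0.5, 0.444)
```

At k = 1 and k = 2 only the first user's hit is visible, so all three metrics agree; they diverge at k = 3, when the second user's lower-ranked hit enters the list and is discounted differently by NDCG and MAP.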
Beyond accuracy
Accuracy isn't everything. Production teams also track:
- Coverage — what fraction of the catalog ever gets recommended? (A recommender that only ever shows the 10 most popular items has tiny coverage.)
- Diversity — are the k items varied, or ten near-duplicates?
- Novelty / serendipity — does it surface things the user wouldn't have found alone, or only the obvious?
- Popularity bias — is it just re-recommending blockbusters and ignoring the long tail?
These often trade off against raw accuracy, and tuning that balance is a core design decision (see Best practices).
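Two of these are simple enough to sketch in a few lines. The block below computes catalog coverage and the average training popularity of recommended items. The function names, the toy catalog, and the choice of "mean interaction count" as a proxy for popularity bias are all assumptions made for illustration; these metrics are defined in several different ways in practice.

```python
from collections import Counter

def catalog_coverage(recs, catalog, k):
    """Fraction of catalog items that appear in at least one user's top-k."""
    shown = {item for lst in recs.values() for item in lst[:k]}
    return len(shown & set(catalog)) / len(catalog)

def mean_popularity(recs, train_pairs, k):
    """Average training interaction count of recommended items.

    Higher values suggest the lists lean more heavily on popular items.
    """
    counts = Counter(item for _, item in train_pairs)
    shown = [item for lst in recs.values() for item in lst[:k]]
    return sum(counts[item] for item in shown) / len(shown)

catalog = list(range(10))                                  # hypothetical 10-item catalog
train_pairs = [(3, 1), (4, 1), (5, 1), (3, 2), (4, 4)]      # hypothetical (user, item) pairs
recs = {0: [1, 2], 1: [1, 4], 2: [1, 2]}

print(catalog_coverage(recs, catalog, 2))      # 0.3
print(mean_popularity(recs, train_pairs, 2))   # 2.0
```

In this toy case every list leads with item 1, the most popular item, and only three of ten catalog items are ever shown, which is the pattern the coverage and popularity-bias bullets warn about.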
The offline–online gap
Crucial caveat: good offline numbers don't guarantee a better product. Offline metrics score the model on past logged behavior, which was itself shaped by the old recommender (a feedback loop), so they can't tell whether an item the old system never surfaced would have been clicked had it been shown. The real verdict comes from online A/B tests: showing the new system to a slice of live traffic and comparing business metrics (engagement, retention, revenue). Offline metrics are a fast filter for which ideas to A/B test, not the final word.
With a scoreboard in hand, let's build the simplest possible recommenders: baselines. 👉