Measuring success: offline metrics

You can't improve what you can't measure. Before building recommenders we need a fair way to score them — otherwise we're just guessing. This chapter defines the metrics used in every later chapter's leaderboard.

The evaluation setup

We use leave-last-out: for each user, hide their chronologically last interaction, train on the rest, then ask the model for a top-k list and check whether the hidden item shows up. Holding out the last interaction rather than a random one mirrors deployment: the model predicts the future from the past and never trains on anything that happened after the interaction it is tested on.

The held-out item is the ground truth; the model's ranked list is the prediction. All metrics compare the two.
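
A minimal sketch of this split, assuming interactions arrive as (user, item, timestamp) tuples. The helper name leave_last_out, the tuple layout, and the handling of single-interaction users are illustrative assumptions rather than something this chapter fixes.

```python
from collections import defaultdict

def leave_last_out(interactions):
    """Split (user, item, timestamp) tuples into train pairs and held-out truth.

    Hypothetical helper: the tuple layout is an assumption for illustration.
    """
    by_user = defaultdict(list)
    for user, item, ts in interactions:
        by_user[user].append((ts, item))

    train, truth = [], {}
    for user, events in by_user.items():
        events.sort(key=lambda e: e[0])   # oldest first
        if len(events) < 2:
            # Assumption: a user with one interaction stays in training only,
            # since hiding it would leave nothing to learn from.
            train.extend((user, item) for _, item in events)
            continue
        *history, (_, last_item) = events
        train.extend((user, item) for _, item in history)
        truth[user] = last_item           # the chronologically last item
    return train, truth

interactions = [
    (0, 5, 10), (0, 7, 30), (0, 1, 20),
    (1, 3, 15), (1, 4, 5),
    (2, 9, 1),
]
train, truth = leave_last_out(interactions)
print(truth)   # {0: 7, 1: 3}
```

The resulting truth dict has the same shape (user to single held-out item) that the metric functions later in this chapter expect.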

The core metrics

Recall@k (a.k.a. Hit Rate here)

What fraction of users had their hidden item appear in their top-k list?

With one hidden item per user, "recall@k" = "hit rate": 1 if the item is in the top-k, else 0, averaged over users. Higher is better. It answers "did we find it at all (within k)?" but ignores where in the list the item landed.

NDCG@k — rank-aware

Showing the right item at position 1 is better than at position 10. NDCG (Normalized Discounted Cumulative Gain) rewards higher placement with a logarithmic discount. For a single relevant item found at 0-based rank $r$ within the top-k (a miss scores 0):

$$ \text{NDCG} = \frac{1}{\log_2(r + 2)}. $$

Rank 0 → $1/\log_2 2 = 1.0$ (perfect); rank 1 → $1/\log_2 3 \approx 0.63$; rank 2 → $1/\log_2 4 = 0.5$; and so on. The per-user scores are then averaged over users. NDCG is one of the most commonly reported offline metrics because it captures ordering quality, which is what users feel.
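
For readers wondering where the closed form comes from, here is a short derivation. It assumes the standard binary-relevance DCG with a $\log_2$ discount; the symbol $\text{rel}_i \in \{0, 1\}$ (relevance of the item at 0-based position $i$) is introduced here only for illustration.

$$ \text{DCG@}k = \sum_{i=0}^{k-1} \frac{\text{rel}_i}{\log_2(i + 2)}, \qquad \text{IDCG@}k = \frac{1}{\log_2 2} = 1, \qquad \text{NDCG@}k = \frac{\text{DCG@}k}{\text{IDCG@}k}. $$

With a single relevant item, the ideal ranking puts it at position 0, so IDCG is 1 and the normalization does nothing. The sum has one nonzero term, at $i = r$, which gives $1/\log_2(r + 2)$; if the item is outside the top-k, every term is zero and the score is 0.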

MAP@k — precision-of-ranking

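Mean Average Precision also rewards ranking the hit early. With one relevant item, average precision is simply the precision at the hit's position, so it reduces to $1/(r+1)$ for 0-based rank $r$ (equivalently $1/\text{rank}$ with a 1-based rank), or 0 on a miss, averaged over users. In this single-item setting MAP coincides with Mean Reciprocal Rank (MRR).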

The code

import numpy as np

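# recs:  dict mapping user -> ranked list of recommended items (best first)
# truth: dict mapping user -> the single held-out item for that user
def recall_at_k(recs, truth, k):
    # Count users whose held-out item appears anywhere in their top-k list.
    hits = sum(1 for u, item in truth.items() if item in recs.get(u, [])[:k])
    return hits / len(truth)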

def ndcg_at_k(recs, truth, k):
    total = 0.0
    for u, item in truth.items():
        lst = recs.get(u, [])[:k]
        if item in lst:
            total += 1.0 / np.log2(lst.index(item) + 2)   # 0-based rank
    return total / len(truth)

def map_at_k(recs, truth, k):
    total = 0.0
    for u, item in truth.items():
        lst = recs.get(u, [])[:k]
        if item in lst:
            total += 1.0 / (lst.index(item) + 1)          # 1-based rank
    return total / len(truth)

See them on a tiny example

Three users: one hit at position 1, one hit at position 3 (positions counted from 1, i.e. 0-based ranks 0 and 2), and one miss:

truth = {0: 7, 1: 3, 2: 9}
recs  = {0: [7, 1, 5],     # hit at position 1
         1: [4, 8, 3],     # hit at position 3
         2: [2, 6, 0]}     # miss
print("recall@3:", round(recall_at_k(recs, truth, 3), 3))
print("ndcg@3  :", round(ndcg_at_k(recs, truth, 3), 3))
print("map@3   :", round(map_at_k(recs, truth, 3), 3))

Output:

recall@3: 0.667
ndcg@3  : 0.5
map@3   : 0.444

Two of three users got a hit → recall 0.667. NDCG and MAP are lower because the hits aren't all at the top: the hit at position 3 is discounted to 0.5 by NDCG and to 1/3 by MAP, giving (1 + 0.5 + 0)/3 = 0.5 and (1 + 1/3 + 0)/3 ≈ 0.444.
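
To see what the "@k" cutoff does, the sketch below recomputes all three metrics on the same toy data for k = 1, 2, 3. It is a compact, self-contained re-implementation for illustration; the helper names hit_ranks and metrics_at_k are hypothetical and not part of the chapter's code.

```python
import math

truth = {0: 7, 1: 3, 2: 9}
recs = {0: [7, 1, 5], 1: [4, 8, 3], 2: [2, 6, 0]}

def hit_ranks(recs, truth, k):
    """0-based rank of each user's held-out item in the top-k, or None on a miss."""
    ranks = {}
    for u, item in truth.items():
        top = recs.get(u, [])[:k]
        ranks[u] = top.index(item) if item in top else None
    return ranks

def metrics_at_k(recs, truth, k):
    """Return (recall, ndcg, map) at cutoff k for the one-held-out-item setting."""
    ranks = [r for r in hit_ranks(recs, truth, k).values() if r is not None]
    n = len(truth)
    recall = len(ranks) / n
    ndcg = sum(1.0 / math.log2(r + 2) for r in ranks) / n
    map_ = sum(1.0 / (r + 1) for r in ranks) / n
    return recall, ndcg, map_

for k in (1, 2, 3):
    recall, ndcg, map_ = metrics_at_k(recs, truth, k)
    print(f"k={k}  recall={recall:.3f}  ndcg={ndcg:.3f}  map={map_:.3f}")
```

At k = 1 and k = 2 only user 0's hit is inside the cutoff, so all three metrics equal 1/3. User 1's hit enters at k = 3, lifting recall by a full 1/3 but NDCG and MAP by less, because that hit sits at the bottom of the list.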

Beyond accuracy

Accuracy isn't everything. Production teams also track:

  • Coverage — what fraction of the catalog ever gets recommended? (A recommender that only ever shows the 10 most popular items has tiny coverage.)
  • Diversity — are the k items varied, or ten near-duplicates?
  • Novelty / serendipity — does it surface things the user wouldn't have found alone, or only the obvious?
  • Popularity bias — is it just re-recommending blockbusters and ignoring the long tail?

These often trade off against raw accuracy, and tuning that balance is a core design decision (see Best practices).
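
Two of the measures in the list above are easy to sketch. The code below computes catalog coverage and, as one simple proxy for popularity bias, the mean training-set popularity of recommended items. The function names, the catalog and train_pairs inputs, and the choice of proxy are all assumptions made for illustration; diversity and novelty are left out because they need an item-similarity measure that this chapter does not define.

```python
from collections import Counter

def catalog_coverage(recs, catalog, k):
    """Fraction of catalog items that appear in at least one user's top-k."""
    shown = {item for lst in recs.values() for item in lst[:k]}
    return len(shown & set(catalog)) / len(catalog)

def mean_popularity(recs, train_pairs, k):
    """Average training interaction count of the recommended items.

    Assumed proxy for popularity bias: higher means the lists lean on
    items that were already popular in the training data.
    """
    counts = Counter(item for _, item in train_pairs)
    shown = [item for lst in recs.values() for item in lst[:k]]
    return sum(counts[item] for item in shown) / len(shown)

catalog = range(10)                      # hypothetical catalog: items 0..9
recs = {0: [7, 1, 5], 1: [4, 8, 3], 2: [2, 6, 0]}
train_pairs = [(0, 1), (1, 1), (2, 1), (0, 4), (1, 7)]   # (user, item)

print("coverage@3       :", catalog_coverage(recs, catalog, 3))
print("mean popularity@3:", round(mean_popularity(recs, train_pairs, 3), 3))
```

Here nine of the ten catalog items are recommended to someone, so coverage is 0.9. Comparing mean popularity across models, rather than reading it as an absolute number, is usually the more informative use.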

The offline–online gap

Crucial caveat: good offline numbers don't guarantee a better product. Offline metrics score the model on past logged behavior, which was itself shaped by the old recommender (a feedback loop). They also can't tell whether an item the user was never shown would have been clicked, so a genuinely good new recommendation can be scored as a miss. The real verdict comes from online A/B tests: showing the new system to a slice of live traffic and comparing business metrics (engagement, retention, revenue) against the current system. Offline metrics are a fast filter for which ideas to A/B test, not the final word.

With a scoreboard in hand, let's build the simplest possible recommenders: baselines. 👉