Best practices & pitfalls

The algorithms are the easy part. What separates a good recommender from a harmful or useless one is everything around them. This chapter collects that hard-won wisdom: the traps that sink real systems and the practices that keep them healthy.

Evaluation traps

Don't test on the past you trained on. Always split by time (leave-last-out or a time cutoff), never randomly. A random split puts interactions that happened after the test events into the training set, so the model "sees the future" and reports inflated offline scores that collapse in production.
Don't trust offline metrics alone. They score the model against logged behavior that the old recommender produced, so they can't tell whether an item the old system never showed would have been clicked. Use offline metrics to filter ideas, then A/B test the survivors on live traffic and judge by business metrics.
Always include dumb baselines. If you can't beat popularity, your fancy model is broken or pointless. Random and popularity belong on every leaderboard.
Beware leakage. A feature that secretly encodes the answer (e.g. "number of interactions with this item" computed including the test interaction) gives fantastic offline numbers and fails live.
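To make the first and third points concrete, here is a minimal sketch of a time-cutoff split with a popularity baseline scored on the held-out future. The tuple layout `(user, item, timestamp)`, the function names, and the toy log are all assumptions made up for illustration, not part of any particular library.

```python
from collections import Counter

def time_split(interactions, cutoff):
    """Split (user, item, timestamp) tuples at a time cutoff.

    Rows strictly before `cutoff` go to train, the rest to test,
    so nothing in train happened after anything in test.
    """
    train = [row for row in interactions if row[2] < cutoff]
    test = [row for row in interactions if row[2] >= cutoff]
    return train, test

def popularity_top_k(train, k):
    """Dumb baseline: the k most-interacted items in the training window."""
    counts = Counter(item for _, item, _ in train)
    return [item for item, _ in counts.most_common(k)]

def recall_at_k(recommended, test):
    """Fraction of test interactions whose item is in the recommended list."""
    if not test:
        return 0.0
    hits = sum(1 for _, item, _ in test if item in recommended)
    return hits / len(test)

# Toy log, invented for illustration: (user, item, timestamp)
log = [
    ("u1", "a", 1), ("u2", "a", 2), ("u3", "b", 3),
    ("u4", "a", 4), ("u2", "c", 5), ("u3", "a", 6),
]
train, test = time_split(log, cutoff=5)
baseline = popularity_top_k(train, k=1)   # ["a"]
score = recall_at_k(baseline, test)       # 0.5: one of two test items is "a"
```

Any model you build should be scored with the same split and the same metric, and compared against this number. Note that the popularity counts are computed from `train` only; computing them over the full log would be exactly the kind of leakage described above.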

Bias and feedback loops

Popularity bias. Models trained on logged clicks learn to recommend what's already popular, which gets more clicks, which reinforces it — the rich get richer and the long tail starves. Counter it with exploration (bandits), diversity constraints, and sometimes popularity de-biasing in the loss.
Position bias. Users click top-ranked items partly because they're on top, not because they're best. Training naively on clicks teaches "whatever we already ranked high is good." Mitigations: model the position, randomize positions slightly, or use inverse-propensity weighting.
The feedback loop. Recommendations shape the data, and that data shapes the next round of recommendations. Without deliberate exploration, the system narrows over time and you stop learning about anything you don't already show.
Filter bubbles. Over-personalization traps users in a narrow slice of the catalog. Inject diversity and serendipity on purpose — both for user experience and for catalog health.
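As an illustration of the inverse-propensity idea mentioned under position bias, the sketch below reweights each click by one over the probability that its slot was examined at all. The propensity values and the impression log are invented for the example; in a real system the propensities would have to be estimated, for instance from the slightly randomized positions mentioned above.

```python
def naive_click_rate(impressions):
    """Clicks divided by impressions per item, ignoring position."""
    shown, clicks = {}, {}
    for item, _position, clicked in impressions:
        shown[item] = shown.get(item, 0) + 1
        clicks[item] = clicks.get(item, 0) + (1 if clicked else 0)
    return {item: clicks[item] / shown[item] for item in shown}

def ips_click_rate(impressions, propensity):
    """Inverse-propensity-weighted click rate per item.

    impressions: iterable of (item, position, clicked) tuples.
    propensity:  dict mapping position -> assumed probability that a user
                 examines that slot. A click in a rarely examined slot
                 counts for more than a click in the top slot.
    """
    shown, weighted = {}, {}
    for item, position, clicked in impressions:
        shown[item] = shown.get(item, 0) + 1
        if clicked:
            weighted[item] = weighted.get(item, 0.0) + 1.0 / propensity[position]
    return {item: weighted.get(item, 0.0) / shown[item] for item in shown}

# Assumed examination probabilities: slot 2 is seen half as often as slot 1.
propensity = {1: 1.0, 2: 0.5}

# Invented log: "x" is always shown in slot 1, "y" always in slot 2.
impressions = [
    ("x", 1, True), ("x", 1, True), ("x", 1, False), ("x", 1, False),
    ("y", 2, True), ("y", 2, False), ("y", 2, False), ("y", 2, False),
]
raw = naive_click_rate(impressions)                 # x: 0.5, y: 0.25
debiased = ips_click_rate(impressions, propensity)  # x: 0.5, y: 0.5
```

On the raw numbers "x" looks twice as good as "y"; after correcting for where each item was shown, the two are indistinguishable. The same weights can be used as per-example weights in a training loss rather than in a summary statistic.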

Beyond accuracy: what to optimize

Maximizing recall@k alone produces a boring, narrow, popularity-heavy product. Balance it with:

Diversity — don't show ten near-duplicates; enforce variety in the top-k (e.g. with Maximal Marginal Relevance, which trades relevance against similarity to already-picked items).
Novelty / serendipity — surface things the user wouldn't have found alone.
Coverage — make sure the catalog's long tail gets shown, not just the hits.
Freshness — new items reach users quickly.
Business rules — in-stock, licensing, fairness, "don't recommend what they just bought", content-safety.

These usually live in the re-rank stage as constraints on top of the model's scores.
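A minimal sketch of such a re-rank step, using the greedy form of Maximal Marginal Relevance: each pick maximizes `lam * relevance - (1 - lam) * max similarity to items already picked`. The item names, scores, genre-based similarity function, and the `lam` value are assumptions chosen for illustration.

```python
def mmr_rerank(scores, similarity, k, lam=0.7):
    """Greedy Maximal Marginal Relevance re-ranking.

    scores:     dict mapping item -> relevance score from the model.
    similarity: function (item_a, item_b) -> similarity in [0, 1].
    k:          number of items to return.
    lam:        1.0 means pure relevance, lower values favor diversity.
    """
    candidates = dict(scores)
    selected = []
    while candidates and len(selected) < k:
        def mmr(item):
            # Penalize by similarity to the closest already-picked item.
            max_sim = max((similarity(item, s) for s in selected), default=0.0)
            return lam * candidates[item] - (1 - lam) * max_sim
        best = max(candidates, key=mmr)
        selected.append(best)
        del candidates[best]
    return selected

# Invented example: two near-duplicate action titles and one comedy.
genre = {"action_1": "action", "action_2": "action", "comedy_1": "comedy"}
scores = {"action_1": 0.9, "action_2": 0.85, "comedy_1": 0.6}

def same_genre(a, b):
    """Toy similarity: 1.0 if the genres match, else 0.0."""
    return 1.0 if genre[a] == genre[b] else 0.0

by_relevance = mmr_rerank(scores, same_genre, k=2, lam=1.0)
diversified = mmr_rerank(scores, same_genre, k=2, lam=0.5)
# by_relevance -> ["action_1", "action_2"]
# diversified  -> ["action_1", "comedy_1"]
```

Hard business rules (in-stock, licensing, content safety) are usually simpler: filter the candidate set before a step like this runs, rather than encoding them as a penalty.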

Practical engineering wisdom

Start simple. Popularity → content-based → item-item → MF. Ship the simplest thing that beats the baseline; add complexity only when it earns its keep in an A/B test.
Implicit ≠ explicit. Treat missing data as unknown, not negative (use confidence weighting / negative sampling). Reading every unobserved user-item pair as a dislike is the most common modeling mistake.
Prevent train/serve skew. Compute features the same way offline and online — use a feature store. Subtle mismatches quietly destroy quality.
Log everything, including what you showed. You need impressions (not just clicks) to model position bias and to debug. And log the context of each recommendation.
Tune the time decay. Half-life is a real product lever: short for fast-moving catalogs (news, short video), long for stable ones (movies, books).
Plan for cold start from day one — it's the first thing real users hit, not an edge case.
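To show what the half-life lever does in practice, here is a small sketch of exponential time decay, where an interaction's weight halves every `half_life_days`. The function names and the example ages are assumptions for illustration; the chapter does not prescribe a specific decay formula, and this is just one common choice.

```python
def decay_weight(age_days, half_life_days):
    """Weight of an interaction that is `age_days` old.

    The weight is 1.0 for a brand-new event and halves every
    `half_life_days`.
    """
    return 0.5 ** (age_days / half_life_days)

def decayed_score(event_ages_days, half_life_days):
    """Sum of decayed weights, e.g. a recency-aware popularity count."""
    return sum(decay_weight(age, half_life_days) for age in event_ages_days)

# Invented example: three interactions that are 0, 7, and 14 days old.
ages = [0, 7, 14]
fast_catalog = decayed_score(ages, half_life_days=7)    # 1 + 0.5 + 0.25 = 1.75
slow_catalog = decayed_score(ages, half_life_days=365)  # close to 3.0
```

With a 7-day half-life the two-week-old event barely counts; with a one-year half-life all three events weigh almost the same. Which behavior is right depends on the catalog, so it is worth treating the half-life as a parameter to tune in an A/B test rather than a constant.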

Ethical considerations

Recommenders shape attention at enormous scale, so design choices have real consequences:

Engagement ≠ well-being. Optimizing pure engagement can amplify outrage, misinformation, or addictive patterns. Pick objectives deliberately and include guardrails.
Fairness. Be fair both to users (don't entrench bias) and to item providers (give new or minority creators exposure; see exploration).
Transparency & control. Explain recommendations where you can ("because you watched X") and give users ways to steer or reset them.
Privacy. Interaction histories are sensitive; handle them accordingly.

The one-paragraph summary

Build the simplest model that beats popularity; treat implicit feedback honestly; split by time and validate with A/B tests; serve embeddings through an ANN index; handle cold start with trending + content + exploration; and deliberately balance accuracy against diversity, novelty, freshness, and fairness — because the metric you optimize is the product you get.

Last, a runnable tool that puts the content-based method in your hands. 👉
