Best practices & pitfalls

The algorithms are the easy part. What separates a good recommender from a harmful or useless one is everything around them. This chapter is the hard-won wisdom — the traps that sink real systems and the practices that keep them healthy.

Evaluation traps

  • Don't test on the past you trained on. Always split by time (leave-last-out or a time cutoff), never randomly — a random split lets the model "see the future", inflating offline scores that collapse in production.
  • Don't trust offline metrics alone. They score the model against logs the old recommender produced, so they can't tell you whether an item that was never shown would have been clicked. Use offline metrics to filter ideas, then A/B test the survivors on live traffic and judge by business metrics.
  • Always include dumb baselines. If you can't beat popularity, your fancy model is broken or pointless. Random and popularity belong on every leaderboard.
  • Beware leakage. A feature that secretly encodes the answer (e.g. "number of interactions with this item" computed including the test interaction) gives fantastic offline numbers and fails live.
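
To make the time split and the popularity baseline concrete, here is a minimal sketch in plain Python. The function names (split_by_time, popularity_top_k, recall_at_k), the (user, item, timestamp) tuple layout, and the toy log are all assumptions made for illustration, not part of any particular library.

```python
from collections import Counter

def split_by_time(interactions, cutoff):
    """Train on events strictly before `cutoff`, test on everything after."""
    train = [e for e in interactions if e[2] < cutoff]
    test = [e for e in interactions if e[2] >= cutoff]
    return train, test

def popularity_top_k(train, k):
    """Baseline: the k most-interacted items, counted on the training window only."""
    counts = Counter(item for _, item, _ in train)
    return [item for item, _ in counts.most_common(k)]

def recall_at_k(recommended, test, k):
    """Average over test users of the share of held-out items found in the top-k."""
    held_out = {}
    for user, item, _ in test:
        held_out.setdefault(user, set()).add(item)
    if not held_out:
        return 0.0
    top = set(recommended[:k])
    per_user = [len(items & top) / len(items) for items in held_out.values()]
    return sum(per_user) / len(per_user)

# Toy log of (user, item, timestamp) tuples; the values are made up.
log = [
    ("u1", "a", 1), ("u2", "a", 2), ("u3", "b", 3), ("u1", "b", 4),
    ("u2", "c", 5), ("u3", "a", 6), ("u1", "c", 7),
]
train, test = split_by_time(log, cutoff=5)
baseline = popularity_top_k(train, k=2)
print(baseline, recall_at_k(baseline, test, k=2))
```

On this toy log the baseline recovers one of the three held-out interactions. Because the counts come from the training window only, the baseline cannot leak test interactions. A real evaluation would usually also drop items each user has already seen, which this sketch skips for brevity.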

Bias and feedback loops

  • Popularity bias. Models trained on logged clicks learn to recommend what's already popular, which gets more clicks, which reinforces it — the rich get richer and the long tail starves. Counter it with exploration (bandits), diversity constraints, and sometimes popularity de-biasing in the loss.
  • Position bias. Users click top-ranked items partly because they're on top, not because they're best. Training naively on clicks teaches "whatever we already ranked high is good." Mitigations: model the position, randomize positions slightly, or use inverse-propensity weighting.
  • The feedback loop. Recommendations shape the data, and the data shapes the next round of recommendations. Without deliberate exploration, the system narrows over time and you stop learning about anything you don't already show.
  • Filter bubbles. Over-personalization traps users in a narrow slice of the catalog. Inject diversity and serendipity on purpose — both for user experience and for catalog health.
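
As one illustration of the position-bias mitigations above, inverse-propensity weighting can be sketched in a few lines. The function name ipw_weights and the examination probabilities are assumptions for this example. In practice the propensities have to be estimated, for instance from traffic where positions were randomized.

```python
def ipw_weights(events, propensity):
    """Return one training weight per logged impression.

    events: list of (position, clicked) pairs.
    propensity: dict mapping position -> estimated probability that a
    user examines that slot (hypothetical values in this sketch).
    """
    weights = []
    for position, clicked in events:
        # A click in a rarely examined slot is stronger evidence of
        # relevance, so it is up-weighted by 1 / propensity.
        # Non-clicks get weight 0 in this click-only formulation.
        weights.append(1.0 / propensity[position] if clicked else 0.0)
    return weights

examine_prob = {1: 1.0, 2: 0.5, 3: 0.25}  # assumed, not measured
events = [(1, True), (2, True), (3, True), (3, False)]
print(ipw_weights(events, examine_prob))  # [1.0, 2.0, 4.0, 0.0]
```

A click at position 3 counts four times as much as a click at position 1, which offsets the lower chance that the user looked there at all. Very small propensities produce very large weights, so real pipelines often clip them.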

Beyond accuracy: what to optimize

Maximizing recall@k alone produces a boring, narrow, popularity-heavy product. Balance it with:

  • Diversity — don't show ten near-duplicates; enforce variety in the top-k (e.g. with Maximal Marginal Relevance, which trades relevance against similarity to already-picked items).
  • Novelty / serendipity — surface things the user wouldn't have found alone.
  • Coverage — make sure the catalog's long tail gets shown, not just the hits.
  • Freshness — new items reach users quickly.
  • Business rules — in-stock, licensing, fairness, "don't recommend what they just bought", content-safety.

These usually live in the re-rank stage as constraints on top of the model's scores.
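
A minimal sketch of a Maximal Marginal Relevance re-rank, assuming the model's relevance scores arrive as a dict and item similarity is supplied as a function. The names mmr_rerank and lam, the default of 0.7, and the toy genre-based similarity are illustrative choices, not a reference implementation.

```python
def mmr_rerank(candidates, relevance, similarity, k, lam=0.7):
    """Greedily pick k items, trading relevance against redundancy.

    lam = 1.0 ranks by relevance only; lower values penalize items
    that resemble something already selected.
    """
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def mmr_score(item):
            # Similarity to the closest already-picked item (0 if none yet).
            max_sim = max((similarity(item, s) for s in selected), default=0.0)
            return lam * relevance[item] - (1 - lam) * max_sim
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy data: the first letter of the item id stands in for its genre.
relevance = {"a1": 0.9, "a2": 0.85, "b1": 0.6}

def same_genre(x, y):
    return 1.0 if x[0] == y[0] else 0.0

print(mmr_rerank(["a1", "a2", "b1"], relevance, same_genre, k=2))  # ['a1', 'b1']
```

Ranking by relevance alone would return the two near-duplicates a1 and a2. With lam=0.7 the second slot goes to b1, because a2's similarity penalty outweighs its higher score.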

Practical engineering wisdom

  • Start simple. Popularity → content-based → item-item → matrix factorization (MF). Ship the simplest thing that beats the baseline; add complexity only when it earns its keep in an A/B test.
  • Implicit ≠ explicit. With implicit feedback, treat missing data as unknown, not negative (use confidence weighting / negative sampling). Treating every unobserved interaction as a dislike is the most common modeling mistake.
  • Prevent train/serve skew. Compute features the same way offline and online — use a feature store. Subtle mismatches quietly destroy quality.
  • Log everything, including what you showed. You need impressions (not just clicks) to model position bias and to debug. Also log the context of each recommendation, such as its position on the page and when it was shown.
  • Tune the time decay. Half-life is a real product lever: short for fast-moving catalogs (news, short video), long for stable ones (movies, books).
  • Plan for cold start from day one — it's the first thing real users hit, not an edge case.
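
The half-life lever mentioned above can be sketched as an exponential decay on interaction age. The function names decay_weight and decayed_score, and the half-life values below, are assumptions chosen for illustration.

```python
def decay_weight(age_days, half_life_days):
    """Weight of an interaction that is `age_days` old.

    The weight halves every `half_life_days`: 1.0 at age 0, 0.5 at one
    half-life, 0.25 at two, and so on.
    """
    return 0.5 ** (age_days / half_life_days)

def decayed_score(ages_days, half_life_days):
    """Sum of decayed weights, e.g. as a recency-aware interaction count."""
    return sum(decay_weight(age, half_life_days) for age in ages_days)

# The same 30-day-old click under two hypothetical product settings.
news_weight = decay_weight(30, half_life_days=1)     # fast-moving catalog
movie_weight = decay_weight(30, half_life_days=365)  # stable catalog
print(news_weight, movie_weight)
```

With a one-day half-life the month-old click is effectively ignored, while with a one-year half-life it keeps most of its weight. The right value is a product decision to tune, not a constant.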

Ethical considerations

Recommenders shape attention at enormous scale, so design choices have real consequences:

  • Engagement ≠ well-being. Optimizing pure engagement can amplify outrage, misinformation, or addictive patterns. Pick objectives deliberately and include guardrails.
  • Fairness. Both to users (don't entrench bias) and to item providers (give new or minority creators exposure — see exploration).
  • Transparency & control. Explain recommendations where you can ("because you watched X") and give users ways to steer or reset them.
  • Privacy. Interaction histories are sensitive; handle them accordingly.

The one-paragraph summary

Build the simplest model that beats popularity; treat implicit feedback honestly; split by time and validate with A/B tests; serve embeddings through an ANN index; handle cold start with trending + content + exploration; and deliberately balance accuracy against diversity, novelty, freshness, and fairness — because the metric you optimize is the product you get.

Last, a runnable tool that puts the content-based method in your hands. 👉