Using the real MIND dataset

Why this chapter matters: the capstone shipped with a small, generated sample so it runs anywhere. To build something real (or a portfolio piece), you want a real dataset. This chapter shows how to plug in MIND — the standard real-world news-recommendation dataset — with zero code changes, because we coded to its format from the start.

What MIND is

MIND (Microsoft News Dataset) is a large public benchmark for news recommendation: real news articles and real anonymized user click logs collected from Microsoft News. It comes in two sizes:

  • MINDsmall — ~50,000 users, a manageable few-hundred-MB download. Great for a laptop / this capstone.
  • MINDlarge — ~1,000,000 users, the full benchmark used in research papers.

It naturally includes sports (and soccer/football) alongside many other categories — exactly the themed mix our sample imitates.

Getting it

  1. Download from https://msnews.github.io/ (MINDsmall is the easy starting point). You'll get news.tsv and behaviors.tsv for train and dev splits.
  2. Drop news.tsv and behaviors.tsv into the capstone's data/ directory (replacing the sample).
  3. Re-run the pipeline:
    python -m newsreco.train          # train + evaluate + log to MLflow
    uvicorn newsreco.api:app --port 8000
    

That's it. No code changes — the loader already speaks MIND.
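
Before kicking off a long training run, it can help to confirm that the files you dropped into data/ really have the MIND layout. The sketch below is not part of the capstone; check_mind_file and check_data_dir are hypothetical helper names. It relies only on the published MIND format: news.tsv has 8 tab-separated columns (ID, category, subcategory, title, abstract, URL, title entities, abstract entities) and behaviors.tsv has 5 (impression ID, user ID, time, history, impressions).

```python
from pathlib import Path

# Column counts taken from the published MIND format description.
EXPECTED_COLUMNS = {"news.tsv": 8, "behaviors.tsv": 5}


def check_mind_file(path, expected_cols, max_rows=1000):
    """Check the first max_rows rows of a MIND TSV file.

    Returns the number of rows checked; raises ValueError on the first
    row whose column count does not match expected_cols.
    """
    checked = 0
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            # Plain split on tabs: MIND fields are not quoted, and empty
            # fields (blank abstract, blank history) must be preserved.
            fields = line.rstrip("\r\n").split("\t")
            if len(fields) != expected_cols:
                raise ValueError(
                    f"{path}: row {checked + 1} has {len(fields)} columns, "
                    f"expected {expected_cols}"
                )
            checked += 1
            if checked >= max_rows:
                break
    return checked


def check_data_dir(data_dir="data"):
    """Check both MIND files in data_dir (hypothetical helper)."""
    return {
        name: check_mind_file(Path(data_dir) / name, cols)
        for name, cols in EXPECTED_COLUMNS.items()
    }
```

Calling check_data_dir("data") from the capstone root returns the number of rows checked per file, or raises with the offending row number if a file is truncated or is not MIND-formatted.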

Why no code changes are needed

Our data loader was written against the exact MIND schema, including its real-world quirks. Verified on a real-format fixture:

articles: 3 (note one has an EMPTY abstract)
empty-abstract ok: ''
interactions (clicks): [('U13740', 'N55528', 1573463158.0), ('U99', 'N61837', ...)]
user U99 empty-history ok: [('N61837', ...)]
times parsed (epoch>0): True

The loader correctly handles:

  • Empty abstracts — some MIND articles have a title but no abstract.
  • MIND's timestamp format, e.g. 11/11/2019 9:05:58 AM (note the non-padded hour), parsed to a sortable epoch for time decay and time-based splits.
  • Empty history — brand-new users have a blank history column (a cold-start case the loader and recommender handle).
  • The entity columns — MIND includes title_entities / abstract_entities (named entities); we read the first five columns and ignore the rest, but those entities are a great feature to add later.
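
The quirks above can be illustrated with a minimal parsing sketch. This is not the capstone's loader; parse_mind_time, parse_news_line, and parse_behavior_line are hypothetical names, and the rows in the usage note are made up. One assumption is labeled in the code: timestamps are interpreted as UTC, which makes the result independent of the machine's time zone and maps 11/11/2019 9:05:58 AM to 1573463158.0.

```python
from datetime import datetime, timezone

# MIND behaviors.tsv time format, e.g. "11/11/2019 9:05:58 AM".
MIND_TIME_FORMAT = "%m/%d/%Y %I:%M:%S %p"


def parse_mind_time(text):
    """Parse a MIND timestamp into a sortable epoch (seconds)."""
    # strptime accepts the non-padded hour ("9:05:58").
    dt = datetime.strptime(text, MIND_TIME_FORMAT)
    # Assumption: treat the naive timestamp as UTC.
    return dt.replace(tzinfo=timezone.utc).timestamp()


def parse_news_line(line):
    """Read the first five columns of a news.tsv row; ignore the rest."""
    fields = line.rstrip("\r\n").split("\t")
    news_id, category, subcategory, title, abstract = fields[:5]
    # abstract may be "" for articles that only have a title.
    return news_id, category, subcategory, title, abstract


def parse_behavior_line(line):
    """Return (user_id, history_ids, clicks) for one behaviors.tsv row."""
    _impression_id, user_id, time_text, history, impressions = (
        line.rstrip("\r\n").split("\t")
    )
    ts = parse_mind_time(time_text)
    # A blank history column (brand-new user) becomes an empty list.
    history_ids = history.split()
    clicks = []
    for item in impressions.split():
        # Each impression looks like "N61837-1" (clicked) or "N100-0".
        news_id, label = item.rsplit("-", 1)
        if label == "1":
            clicks.append((user_id, news_id, ts))
    return user_id, history_ids, clicks
```

For example, parse_behavior_line("2\tU99\t11/11/2019 9:05:58 AM\t\tN61837-1 N100-0") yields the user U99, an empty history, and a single click on N61837. Keeping the empty history as an empty list rather than None means downstream code can iterate over it without a special case.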

What changes at MIND scale (and how the design absorbs it)

MINDsmall runs as-is. As you scale toward MINDlarge, two components need their production swap: the vector index and the embedder. Both were designed for it. The other two concerns listed below need no swap at all.

Concern at scale                               | Swap (no logic change)
-----------------------------------------------+----------------------------------------------------------------------------
~100k+ articles → exact cosine scan too slow   | VectorIndex → hnswlib/FAISS (same interface, see ann.py)
TF-IDF vocabulary explodes                     | NEWSRECO_EMBEDDER=sbert → fixed-size neural embeddings
Dense user-item math too big                   | the recommender already uses per-item embeddings + ANN, not a dense matrix
k-means / ranker training slow                 | train on a sample of impressions (standard practice)

Everything else — the decayed profile, the two-stage ranker, MLflow tracking, the API, the UI, RAG — is unchanged.
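
To make the "same interface" idea concrete, here is a rough sketch of how a vector index can be swapped without touching the calling code. It is an illustration only and does not reproduce the capstone's ann.py: the add/search method names, the ExactCosineIndex class, and the recommend function are all assumptions. The exact backend shown is the brute-force cosine scan that becomes too slow at scale; an hnswlib- or FAISS-backed class would expose the same two methods.

```python
from __future__ import annotations

import math
from typing import Protocol, Sequence


class VectorIndex(Protocol):
    """Hypothetical interface: any backend with these two methods fits."""

    def add(self, item_id: str, vector: Sequence[float]) -> None: ...

    def search(self, query: Sequence[float], k: int) -> list[tuple[str, float]]: ...


def _normalize(vector):
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot index or query a zero vector")
    return [x / norm for x in vector]


class ExactCosineIndex:
    """Brute-force backend: scores every stored vector on each query."""

    def __init__(self):
        self._ids = []
        self._vectors = []

    def add(self, item_id, vector):
        # Store unit vectors so cosine similarity is a plain dot product.
        self._ids.append(item_id)
        self._vectors.append(_normalize(vector))

    def search(self, query, k):
        q = _normalize(query)
        scored = [
            (item_id, sum(a * b for a, b in zip(q, v)))
            for item_id, v in zip(self._ids, self._vectors)
        ]
        # O(n) scan plus sort: fine for a sample, slow for ~100k+ articles.
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:k]


def recommend(index: VectorIndex, profile, k=10):
    """Caller depends only on the interface, not on the backend."""
    return [item_id for item_id, _score in index.search(profile, k)]
```

Because recommend only calls search, replacing ExactCosineIndex with an approximate-nearest-neighbor backend changes one constructor call and nothing else, which is the property the table's "no logic change" column relies on.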

A realistic workflow

This is exactly how you'd work in practice, and a good habit to internalize:

  1. Develop on the sample (or MINDsmall) — fast iteration, cheap experiments.
  2. Track every run in MLflow (next chapter) so you can compare embedders, half-lives, and ranker settings.
  3. Scale to the full dataset once the pipeline is right, swapping in the ANN backend and neural embeddings.
  4. Validate online with an A/B test before trusting the offline numbers (see best practices).

Next: managing trained models with the MLflow registry. 👉