Using the real MIND dataset

Why this chapter matters: the capstone shipped with a small, generated sample so it runs anywhere. To build something real (or a portfolio piece), you want a real dataset. This chapter shows how to plug in MIND — the standard real-world news-recommendation dataset — with zero code changes, because we coded to its format from the start.

What MIND is

MIND (Microsoft News Dataset) is a large public benchmark for news recommendation: real news articles and real anonymized user click logs collected from Microsoft News. It comes in two sizes:

MINDsmall — ~50,000 users, a manageable few-hundred-MB download. Great for a laptop / this capstone.
MINDlarge — ~1,000,000 users, the full benchmark used in research papers.

It covers sports (soccer/football included) alongside many other categories, which is exactly the themed mix our sample imitates.

Getting it

Download from https://msnews.github.io/ (MINDsmall is the easy starting point). The train and dev splits come as separate archives, each with its own news.tsv and behaviors.tsv.
Drop one split's news.tsv and behaviors.tsv (the train split is the natural choice) into the capstone's data/ directory, replacing the sample.

Re-run the pipeline:

python -m newsreco.train          # train + evaluate + log to MLflow
uvicorn newsreco.api:app --port 8000

That's it. No code changes — the loader already speaks MIND.
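Before re-running, it can help to confirm that the files you dropped in really look like MIND. The sketch below assumes the published MIND layout (tab-separated, no header row, eight columns in news.tsv and five in behaviors.tsv). The helper name check_mind_file is hypothetical and not part of the capstone.

```python
# Column counts in the published MIND files (tab-separated, no header row).
EXPECTED_COLUMNS = {"news.tsv": 8, "behaviors.tsv": 5}


def check_mind_file(path, expected_columns, max_lines=1000):
    """Check the first max_lines rows; return how many rows were checked.

    Raises ValueError on the first row whose column count is wrong.
    """
    checked = 0
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            if line_number > max_lines:
                break
            # Strip only the line ending so empty trailing fields still count.
            columns = line.rstrip("\r\n").split("\t")
            if len(columns) != expected_columns:
                raise ValueError(
                    f"{path}: line {line_number} has {len(columns)} columns, "
                    f"expected {expected_columns}"
                )
            checked += 1
    return checked
```

For example, check_mind_file("data/news.tsv", EXPECTED_COLUMNS["news.tsv"]) returns the number of rows inspected, or raises with the offending line number if the file is not in MIND format.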

Why no code changes are needed

Our data loader was written against the exact MIND schema, including its real-world quirks. Verified on a real-format fixture:

articles: 3 (note one has an EMPTY abstract)
empty-abstract ok: ''
interactions (clicks): [('U13740', 'N55528', 1573463158.0), ('U99', 'N61837', ...)]
user U99 empty-history ok: [('N61837', ...)]
times parsed (epoch>0): True

The loader correctly handles:

Empty abstracts — some MIND articles have a title but no abstract.
MIND's timestamp format — 11/11/2019 9:05:58 AM (note the non-padded hour), parsed to a sortable epoch for time decay and time-based splits.
Empty history — brand-new users have a blank history column (a cold-start case the loader and recommender handle).
The entity columns — MIND includes title_entities / abstract_entities (named entities); we read the first five columns and ignore the rest, but those entities are a great feature to add later.
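The cases above can be sketched in a few lines. This is an illustration, not the capstone's actual loader: the function names (parse_mind_time, read_news, read_clicks) are made up for this example, the column order is the published MIND layout, and timestamps are treated as UTC, which is an assumption because MIND's timestamps carry no timezone.

```python
from datetime import datetime, timezone


def parse_mind_time(text):
    """Parse MIND's '11/11/2019 9:05:58 AM' format to epoch seconds."""
    # %I accepts both '9' and '09', so the non-padded hour is not a problem.
    parsed = datetime.strptime(text, "%m/%d/%Y %I:%M:%S %p")
    # Assumption: interpret the naive timestamp as UTC for reproducibility.
    return parsed.replace(tzinfo=timezone.utc).timestamp()


def read_news(lines):
    """Map news ID -> fields, reading only the first five columns."""
    articles = {}
    for line in lines:
        columns = line.rstrip("\r\n").split("\t")
        # Remaining columns (URL, title/abstract entities) are ignored here.
        news_id, category, subcategory, title, abstract = columns[:5]
        articles[news_id] = {
            "category": category,
            "subcategory": subcategory,
            "title": title,
            "abstract": abstract,  # may be '' for title-only articles
        }
    return articles


def read_clicks(lines):
    """Return (clicks, histories) parsed from behaviors.tsv rows."""
    clicks, histories = [], {}
    for line in lines:
        columns = line.rstrip("\r\n").split("\t")
        _impression_id, user_id, time_text, history, impressions = columns
        timestamp = parse_mind_time(time_text)
        # A blank history column splits to [], the cold-start case.
        histories[user_id] = history.split()
        for item in impressions.split():
            # Each impression looks like 'N55528-1' (clicked) or 'N35729-0'.
            news_id, _, label = item.rpartition("-")
            if label == "1":
                clicks.append((user_id, news_id, timestamp))
    return clicks, histories
```

Because every click carries an epoch timestamp, sorting the click list by its third element is enough to build a time-based split.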

What changes at MIND scale (and how the design absorbs it)

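MINDsmall runs as-is. As you scale toward MINDlarge, four concerns come up. Two need a production swap that the design already anticipates; the other two are covered by how the recommender is built and by standard training practice: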

Concern at scale	Swap (no logic change)
~100k+ articles → exact cosine scan too slow	`VectorIndex` → hnswlib/FAISS (same interface, see `ann.py`)
TF-IDF vocabulary explodes	`NEWSRECO_EMBEDDER=sbert` → fixed-size neural embeddings
Dense user-item math too big	the recommender already uses per-item embeddings + ANN, not a dense matrix
k-means / ranker training slow	train on a sample of impressions (standard practice)

Everything else — the decayed profile, the two-stage ranker, MLflow tracking, the API, the UI, RAG — is unchanged.
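To make the "same interface" idea in the first table row concrete, here is a minimal sketch. It is not the capstone's `VectorIndex` or `ann.py`: the names SupportsSearch, ExactCosineIndex, add, search, and recommend are assumptions chosen for illustration. Calling code depends only on two methods, so an approximate backend such as hnswlib or FAISS could be wrapped behind the same methods without touching the recommender logic.

```python
import heapq
import math
from typing import List, Protocol, Sequence, Tuple


class SupportsSearch(Protocol):
    """The only surface the recommender needs from an index (hypothetical)."""

    def add(self, item_id: str, vector: Sequence[float]) -> None: ...

    def search(self, query: Sequence[float], k: int) -> List[Tuple[str, float]]: ...


def _normalize(vector):
    """Scale to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)


class ExactCosineIndex:
    """Brute-force cosine search: every query scans every stored vector."""

    def __init__(self):
        self._ids = []
        self._vectors = []

    def add(self, item_id, vector):
        self._ids.append(item_id)
        self._vectors.append(_normalize(vector))

    def search(self, query, k):
        q = _normalize(query)
        scored = (
            (item_id, sum(a * b for a, b in zip(q, vec)))
            for item_id, vec in zip(self._ids, self._vectors)
        )
        # Top-k by similarity, highest first.
        return heapq.nlargest(k, scored, key=lambda pair: pair[1])


def recommend(index: SupportsSearch, profile, k=2):
    """Caller code sees only the interface, never the backend."""
    return [item_id for item_id, _score in index.search(profile, k)]
```

The exact scan costs time proportional to the number of articles per query, which is what becomes too slow past roughly 100k articles. An approximate index trades a small amount of recall for much faster lookups while `recommend` stays untouched.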

A realistic workflow

This is exactly how you'd work in practice, and a good habit to internalize:

Develop on the sample (or MINDsmall) — fast iteration, cheap experiments.
Track every run in MLflow (next chapter) so you can compare embedders, half-lives, and ranker settings.
Scale to the full dataset once the pipeline is right, swapping in the ANN backend and neural embeddings.
Validate online with an A/B test before trusting the offline numbers (best practices).

Next: managing trained models with the MLflow registry. 👉