Using the real MIND dataset
Why this chapter matters: the capstone shipped with a small, generated sample so it runs anywhere. To build something real (or a portfolio piece), you want a real dataset. This chapter shows how to plug in MIND — the standard real-world news-recommendation dataset — with zero code changes, because we coded to its format from the start.
What MIND is
MIND (Microsoft News Dataset) is a large public benchmark for news recommendation: real news articles and real anonymized user click logs collected from Microsoft News. It comes in two sizes:
- MINDsmall — ~50,000 users, a manageable few-hundred-MB download. Great for a laptop / this capstone.
- MINDlarge — ~1,000,000 users, the full benchmark used in research papers.
It naturally includes sports (and soccer/football) alongside many other categories — exactly the themed mix our sample imitates.
Getting it
- Download from https://msnews.github.io/ (MINDsmall is the easy starting point). You'll get news.tsv and behaviors.tsv for train and dev splits.
- Drop news.tsv and behaviors.tsv into the capstone's data/ directory (replacing the sample).
- Re-run the pipeline:

    python -m newsreco.train    # train + evaluate + log to MLflow
    uvicorn newsreco.api:app --port 8000
That's it. No code changes — the loader already speaks MIND.
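Before re-running the pipeline, it can help to confirm that the files you dropped in look like MIND files. The sketch below is not part of the capstone: check_mind_dir is a hypothetical helper, and the only facts it relies on are that the public MIND files are tab-separated, have no header row, and carry 8 columns (news.tsv) and 5 columns (behaviors.tsv).

```python
import csv
from pathlib import Path

# Column counts of the public MIND TSV files (they have no header row).
EXPECTED_COLUMNS = {"news.tsv": 8, "behaviors.tsv": 5}


def check_mind_dir(data_dir):
    """Hypothetical pre-flight check: return a list of problems (empty means OK)."""
    problems = []
    for name, expected in EXPECTED_COLUMNS.items():
        path = Path(data_dir) / name
        if not path.is_file():
            problems.append(f"{name}: missing")
            continue
        with path.open(encoding="utf-8", newline="") as fh:
            # QUOTE_NONE: titles may contain bare double quotes, so treat
            # quote characters as ordinary text instead of CSV quoting.
            reader = csv.reader(fh, delimiter="\t", quoting=csv.QUOTE_NONE)
            first = next(reader, None)
        if first is None:
            problems.append(f"{name}: empty file")
        elif len(first) != expected:
            problems.append(f"{name}: expected {expected} columns, found {len(first)}")
    return problems
```

Calling check_mind_dir("data") returns an empty list when both files are present and the first row of each has the expected shape. It only inspects the first row, so it catches a wrong or truncated file, not a malformed row deeper in the data.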
Why no code changes are needed
Our data loader was written against the exact MIND schema, including its real-world quirks. Verified on a real-format fixture:
articles: 3 (note one has an EMPTY abstract)
empty-abstract ok: ''
interactions (clicks): [('U13740', 'N55528', 1573463158.0), ('U99', 'N61837', ...)]
user U99 empty-history ok: [('N61837', ...)]
times parsed (epoch>0): True
The loader correctly handles:
- Empty abstracts: some MIND articles have a title but no abstract.
- MIND's timestamp format: 11/11/2019 9:05:58 AM (note the non-padded hour), parsed to a sortable epoch for time decay and time-based splits.
- Empty history: brand-new users have a blank history column (a cold-start case the loader and recommender handle).
- The entity columns: MIND includes title_entities / abstract_entities (named entities); we read the first five columns and ignore the rest, but those entities are a great feature to add later.
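The four cases above can be sketched in a few lines. This is an illustration, not the capstone's loader: the function names are hypothetical, and treating MIND timestamps as UTC is an assumption (it happens to reproduce the epoch shown in the fixture output).

```python
from datetime import datetime, timezone

# %I accepts both "9" and "09", which covers MIND's non-padded hour.
MIND_TIME_FORMAT = "%m/%d/%Y %I:%M:%S %p"


def parse_mind_time(text):
    """Parse a MIND timestamp to epoch seconds, treating it as UTC (assumption)."""
    parsed = datetime.strptime(text, MIND_TIME_FORMAT)
    return parsed.replace(tzinfo=timezone.utc).timestamp()


def parse_news_line(line):
    """Keep the first five columns; url and the entity columns are ignored."""
    fields = line.rstrip("\r\n").split("\t")
    news_id, category, subcategory, title, abstract = fields[:5]
    return {
        "news_id": news_id,
        "category": category,
        "subcategory": subcategory,
        "title": title,
        "abstract": abstract,  # may be "" when the article has no abstract
    }


def parse_behavior_line(line):
    """Return (user_id, history, clicks) for one behaviors.tsv row."""
    fields = line.rstrip("\r\n").split("\t")
    _impression_id, user_id, time_text, history, impressions = fields
    when = parse_mind_time(time_text)
    clicks = []
    for token in impressions.split():
        # Each impression token looks like "N61837-1": news id, then click label.
        news_id, _, label = token.rpartition("-")
        if label == "1":
            clicks.append((user_id, news_id, when))
    # "".split() is [], so a blank history column becomes an empty list.
    return user_id, history.split(), clicks
```

Splitting on tabs by hand, rather than with a quoting-aware CSV parser, keeps double quotes in titles from being misread. The empty-history case needs no special branch because splitting an empty string already yields an empty list.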
What changes at MIND scale (and how the design absorbs it)
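MINDsmall runs as-is. As you scale toward MINDlarge, two components need their production swap (the vector index and the embedder), and both were designed for it. The table also lists two concerns that need no swap: one is already absorbed by the design, the other is a change in training practice.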
| Concern at scale | Swap (no logic change) |
|---|---|
| ~100k+ articles → exact cosine scan too slow | VectorIndex → hnswlib/FAISS (same interface, see ann.py) |
| TF-IDF vocabulary explodes | NEWSRECO_EMBEDDER=sbert → fixed-size neural embeddings |
| Dense user-item math too big | the recommender already uses per-item embeddings + ANN, not a dense matrix |
| k-means / ranker training slow | train on a sample of impressions (standard practice) |
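To make the first row concrete, here is a minimal sketch of what an exact cosine scan behind a small index interface could look like. ExactCosineIndex and its add/search methods are hypothetical names chosen for illustration; the capstone's actual VectorIndex interface in ann.py may differ. The point is that an ANN backend can sit behind the same two methods, so callers do not change.

```python
import heapq
import math


class ExactCosineIndex:
    """Brute-force cosine search: every query scores every stored vector."""

    def __init__(self):
        self._ids = []
        self._vectors = []

    def add(self, item_id, vector):
        norm = math.sqrt(sum(x * x for x in vector))
        if norm == 0.0:
            raise ValueError("cannot index a zero vector")
        self._ids.append(item_id)
        # Store unit vectors so cosine similarity reduces to a dot product.
        self._vectors.append([x / norm for x in vector])

    def search(self, query, k):
        """Return up to k (item_id, cosine_similarity) pairs, best first."""
        norm = math.sqrt(sum(x * x for x in query))
        if norm == 0.0:
            return []
        unit = [x / norm for x in query]
        scored = (
            (sum(a * b for a, b in zip(unit, vec)), item_id)
            for item_id, vec in zip(self._ids, self._vectors)
        )
        return [(item_id, score) for score, item_id in heapq.nlargest(k, scored)]
```

Each search costs time proportional to the number of indexed articles, which is why the scan stops being practical as the catalogue grows. An approximate index trades a small amount of recall for sub-linear query time while keeping the add/search shape.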
Everything else — the decayed profile, the two-stage ranker, MLflow tracking, the API, the UI, RAG — is unchanged.
A realistic workflow
This is exactly how you'd work in practice, and a good habit to internalize:
- Develop on the sample (or MINDsmall) — fast iteration, cheap experiments.
- Track every run in MLflow (next chapter) so you can compare embedders, half-lives, and ranker settings.
- Scale to the full dataset once the pipeline is right, swapping in the ANN backend and neural embeddings.
- Validate online with an A/B test before trusting the offline numbers (see best practices).
Next: managing trained models with the MLflow registry. 👉