The dataset (MIND schema)
A real recommender starts with real data. We build on the schema of MIND (Microsoft News Dataset) — the standard public benchmark for news recommendation, with millions of impressions across categories including sports/soccer. To keep the capstone runnable anywhere, we ship a soccer-heavy sample in the exact MIND format; swapping in full MIND is a file drop.
The MIND format
Two tab-separated files:
news.tsv — one row per article:
news_id category subcategory title abstract url title_entities abstract_entities
behaviors.tsv — one row per impression (a session where articles were shown):
impression_id user_id time history impressions
- history: space-separated ids the user clicked before this session.
- impressions: what was shown, each tagged with a click label, e.g. N255-0 N64-1 N299-0 (N64 was clicked, the others weren't).
A real sample row from our generated news.tsv:
N0 sports soccer France reach the World Cup quarter-final after dramatic win Portugal advanced to the quarter-final ...
and behaviors.tsv:
0 U106 06/01/2026 08:00:00 AM N31 N3 N16 N52 ... N255-0 N64-1 N299-0 N21-0
This format carries everything a recommender needs: content (title/abstract), categories, who clicked what, and when (for time decay and time-based evaluation).
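To make the click-label convention concrete, here is a minimal sketch that splits one behaviors.tsv row into its five columns and separates clicked from non-clicked candidates. The helper name parse_behavior_row is illustrative and not part of the capstone code, and the history in the example row is cut down to the four ids visible in the sample above.

```python
def parse_behavior_row(line: str) -> dict:
    """Split one behaviors.tsv row into its five MIND columns (hypothetical helper)."""
    impression_id, user_id, time_s, history_s, impressions_s = (
        line.rstrip("\n").split("\t")
    )
    shown = []
    for token in impressions_s.split():
        # Each token is "<news_id>-<label>"; label 1 means clicked, 0 means not.
        news_id, label = token.rsplit("-", 1)
        shown.append((news_id, int(label)))
    return {
        "impression_id": impression_id,
        "user_id": user_id,
        "time": time_s,
        "history": history_s.split(),
        "shown": shown,
        "clicked": [news_id for news_id, label in shown if label == 1],
    }


# Abbreviated version of the sample row (history shortened to four ids).
row = "0\tU106\t06/01/2026 08:00:00 AM\tN31 N3 N16 N52\tN255-0 N64-1 N299-0 N21-0"
parsed = parse_behavior_row(row)
print(parsed["clicked"])  # ['N64']
```

The real loader later in this section does the same split, but additionally drops ids that are missing from news.tsv and converts the timestamp to epoch seconds.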
The bundled sample
scripts/make_sample_data.py generates 300 articles (≈40% soccer/FIFA, the rest
spread across politics, world, AI, gadgets, health, finance, movies, travel) and
4,000 impressions for 400 users with realistic, topical vocabulary so
content-based methods actually cluster. Users have category affinities (soccer
over-represented), and clicks reflect those affinities — so the recommenders have
genuine structure to learn.
$ python scripts/make_sample_data.py
wrote 300 articles to data/news.tsv
wrote 4000 impressions to data/behaviors.tsv
categories: 10 subcategories, soccer-heavy
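The generator script itself is not reproduced here, so the following is only a sketch of one way clicks could be made to follow category affinities. The function sample_click, the affinity values, and the 0.05 fallback weight are all assumptions for illustration; the actual scripts/make_sample_data.py may work differently.

```python
import random


def sample_click(candidates, affinity, rng):
    """Pick one clicked article, weighting each candidate by category affinity.

    candidates: list of (news_id, category) pairs shown in one impression.
    affinity:   {category: weight} for this user; unknown categories get a
                small fallback weight (an assumption) so they can still win.
    """
    weights = [affinity.get(category, 0.05) for _, category in candidates]
    news_id, _ = rng.choices(candidates, weights=weights, k=1)[0]
    return news_id


rng = random.Random(0)  # seeded so the sample is reproducible
affinity = {"soccer": 0.7, "finance": 0.2}  # a soccer-leaning user
candidates = [("N1", "soccer"), ("N2", "travel"), ("N3", "finance")]

clicks = [sample_click(candidates, affinity, rng) for _ in range(1000)]
counts = {news_id: clicks.count(news_id) for news_id, _ in candidates}
print(counts)  # the soccer article should collect most of the clicks
```

Sampling in proportion to affinity, rather than always clicking the top category, keeps some noise in the data, which is closer to real click logs.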
The loader
data.py reads both files into clean structures and, crucially, parses the click
labels into interactions (for training) and per-user history (for the
decayed profile). The same loader handles the sample and the full MIND download.
"""Load the MIND-schema news + behaviors files into usable structures.
Works for both the bundled sample and the real Microsoft MIND dataset (same
TSV format).
"""
from __future__ import annotations
from dataclasses import dataclass, field
from datetime import datetime


def _parse_time(s: str) -> float:
    """MIND uses '11/11/2019 9:05:58 AM'. Fall back to ISO. Return epoch seconds."""
    s = s.strip()
    for fmt in ("%m/%d/%Y %I:%M:%S %p", "%m/%d/%Y %H:%M:%S"):
        try:
            return datetime.strptime(s, fmt).timestamp()
        except ValueError:
            pass
    try:
        return datetime.fromisoformat(s).timestamp()
    except ValueError:
        return 0.0


@dataclass
class Article:
    news_id: str
    category: str
    subcategory: str
    title: str
    abstract: str

    @property
    def text(self) -> str:
        return f"{self.title}. {self.abstract}"


@dataclass
class NewsData:
    articles: dict = field(default_factory=dict)       # news_id -> Article
    interactions: list = field(default_factory=list)   # (user, news_id, time) clicks
    user_history: dict = field(default_factory=dict)   # user -> [(news_id, time), ...]
    impressions: list = field(default_factory=list)    # raw impressions for ranker

    @property
    def article_ids(self):
        return list(self.articles.keys())


def load_news(path: str) -> dict:
    """Return {news_id: Article}."""
    articles = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            if len(parts) < 5:
                continue
            nid, cat, sub, title, abstract = parts[0], parts[1], parts[2], parts[3], parts[4]
            articles[nid] = Article(nid, cat, sub, title, abstract)
    return articles


def load_behaviors(path: str, articles: dict):
    """
    Parse behaviors.tsv. Returns (interactions, user_history, impressions):
      interactions : list of (user, news_id, time) for CLICKS
      user_history : user -> [(news_id, time)] from the history column + clicks
      impressions  : list of dicts {user, time, candidates:[(news_id,label)]}
    Unknown news ids (not in `articles`) are skipped.
    """
    interactions, impressions = [], []
    user_history = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            if len(parts) < 5:
                continue
            _, user, time_s, history_s, imp_s = parts[0], parts[1], parts[2], parts[3], parts[4]
            t = _parse_time(time_s)
            hist = [h for h in history_s.split() if h in articles]
            user_history.setdefault(user, [])
            for h in hist:
                user_history[user].append((h, t))
            cands = []
            for tok in imp_s.split():
                if "-" not in tok:
                    continue
                nid, lab = tok.rsplit("-", 1)
                if nid in articles:
                    cands.append((nid, int(lab)))
                    if lab == "1":
                        interactions.append((user, nid, t))
                        user_history[user].append((nid, t))
            if cands:
                impressions.append({"user": user, "time": t, "candidates": cands})
    return interactions, user_history, impressions


def load_all(cfg) -> NewsData:
    articles = load_news(cfg.news_path)
    interactions, user_history, impressions = load_behaviors(cfg.behaviors_path, articles)
    return NewsData(articles=articles, interactions=interactions,
                    user_history=user_history, impressions=impressions)
Loading the sample:
articles=300 interactions=4000 users=400 impressions=4000
- interactions: (user, news_id, time) for every click; the training signal.
- user_history: each user's clicked items with timestamps; powers the time-decayed profile.
- impressions: the shown-and-labeled candidates; the training data for the stage-2 ranker.
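The timestamps in these structures are what make time decay and time-based evaluation possible. As a rough sketch of both ideas, the block below weights a user's history with an exponential half-life and splits interactions at a cutoff time. The function names, the 7-day half-life, and the toy data are assumptions for illustration; the capstone's own decay and split logic may differ.

```python
DAY = 86400.0  # seconds per day; timestamps are epoch seconds, as in the loader


def decay_weights(history, now, half_life_days=7.0):
    """Return {news_id: weight}; a click loses half its weight per half-life.

    history is a list of (news_id, time) pairs, the shape of user_history[user].
    """
    half_life_s = half_life_days * DAY
    weights = {}
    for news_id, t in history:
        age = max(0.0, now - t)
        w = 0.5 ** (age / half_life_s)
        # The same id can occur several times; keep its most recent (largest) weight.
        weights[news_id] = max(weights.get(news_id, 0.0), w)
    return weights


def split_by_time(interactions, cutoff):
    """Train on clicks before the cutoff, evaluate on clicks at or after it."""
    train = [x for x in interactions if x[2] < cutoff]
    test = [x for x in interactions if x[2] >= cutoff]
    return train, test


history = [("N3", 0.0), ("N64", 7 * DAY)]
print(decay_weights(history, now=14 * DAY))  # {'N3': 0.25, 'N64': 0.5}

interactions = [("U1", "N3", 0.0), ("U1", "N64", 7 * DAY), ("U2", "N9", 10 * DAY)]
train, test = split_by_time(interactions, cutoff=8 * DAY)
print(len(train), len(test))  # 2 1
```

Keeping only the largest weight per article matters with this loader, because it appends the history column again on every impression, so one id can appear many times in a user's list.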
Using the real MIND dataset
- Download MIND (small or large) from https://msnews.github.io/.
- Drop its news.tsv and behaviors.tsv into data/.
- Re-run training. No code changes are needed, because we coded to the MIND schema from the start.
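Before retraining on a fresh download, a quick column-count check can catch a truncated or misplaced file. This is a minimal sketch: check_columns is a hypothetical helper, not part of the capstone, and the expected counts (8 and 5) simply follow the column lists given in the format description above.

```python
import os


def check_columns(path: str, expected: int, max_rows: int = 1000) -> list:
    """Return 1-based line numbers (within the first max_rows) with a wrong column count."""
    bad = []
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if line_no > max_rows:
                break
            if len(line.rstrip("\n").split("\t")) != expected:
                bad.append(line_no)
    return bad


# Only runs when the files are present, so the snippet is safe to paste anywhere.
for path, expected in (("data/news.tsv", 8), ("data/behaviors.tsv", 5)):
    if os.path.exists(path):
        bad = check_columns(path, expected)
        print(f"{path}: {len(bad)} malformed rows among the first 1000")
```

The loader itself is more lenient: it accepts any row with at least five columns and silently skips the rest, so a check like this is mainly useful for spotting problems early.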
Why a sample at all? Full MIND is gigabytes — too big to bundle and slow to iterate on. Developing against a faithful small sample, then scaling to the real dataset, is exactly how you'd work in practice.
With data in hand, let's build the recommender and ranker. 👉