The dataset (MIND schema)

A real recommender starts with real data. We build on the schema of MIND (Microsoft News Dataset) — the standard public benchmark for news recommendation, with millions of impressions across categories including sports/soccer. To keep the capstone runnable anywhere, we ship a soccer-heavy sample in the exact MIND format; swapping in full MIND is a file drop.

The MIND format

Two tab-separated files:

news.tsv — one row per article:

news_id  category  subcategory  title  abstract  url  title_entities  abstract_entities

behaviors.tsv — one row per impression (a session where articles were shown):

impression_id  user_id  time  history  impressions
  • history — space-separated ids the user clicked before this session.
  • impressions — what was shown, each tagged with a click label, e.g. N255-0 N64-1 N299-0 (N64 was clicked, the others weren't).

A real sample row from our generated news.tsv:

N0  sports  soccer  France reach the World Cup quarter-final after dramatic win  Portugal advanced to the quarter-final ...

and behaviors.tsv:

0  U106  06/01/2026 08:00:00 AM  N31 N3 N16 N52 ...  N255-0 N64-1 N299-0 N21-0

This format carries everything a recommender needs: content (title/abstract), categories, who clicked what, and when (for time decay and time-based evaluation).
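
As a quick illustration of how the time and impressions fields decode, here is a minimal sketch using only the standard library. The timestamp and impression string are copied from the sample behaviors.tsv row above; the variable names are illustrative and not part of the capstone code.

```python
from datetime import datetime

# Values copied from the sample behaviors.tsv row above.
time_s = "06/01/2026 08:00:00 AM"
impressions_s = "N255-0 N64-1 N299-0 N21-0"

# MIND timestamps use month/day/year with a 12-hour clock and AM/PM.
shown_at = datetime.strptime(time_s, "%m/%d/%Y %I:%M:%S %p")

# Each token is "<news_id>-<label>"; split on the last hyphen only.
candidates = []
for token in impressions_s.split():
    news_id, label = token.rsplit("-", 1)
    candidates.append((news_id, int(label)))

# Keep only the ids whose click label is 1.
clicked = [news_id for news_id, label in candidates if label == 1]

print(shown_at.isoformat())  # 2026-06-01T08:00:00
print(clicked)               # ['N64']
```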

The bundled sample

scripts/make_sample_data.py generates 300 articles (≈40% soccer/FIFA, the rest spread across politics, world, AI, gadgets, health, finance, movies, travel) and 4,000 impressions for 400 users. Article text uses realistic, topical vocabulary so that content-based methods actually cluster. Users have category affinities (soccer is over-represented) and their clicks reflect those affinities, so the recommenders have genuine structure to learn.

$ python scripts/make_sample_data.py
wrote 300 articles to data/news.tsv
wrote 4000 impressions to data/behaviors.tsv
categories: 10 subcategories, soccer-heavy
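
The generator script itself is not reproduced here. As a rough sketch of the idea that clicks follow category affinity, the snippet below builds one MIND-style impressions field. Everything in it is hypothetical and chosen only for illustration (the CATALOGUE and AFFINITY tables, BASE_RATE, click_probability, make_impressions, and the probabilities); the real script may work quite differently.

```python
import random

# Hypothetical mini-catalogue: news_id -> subcategory.
CATALOGUE = {"N1": "soccer", "N2": "politics", "N3": "soccer", "N4": "travel"}

# Hypothetical per-user affinities: probability of clicking a shown article
# from each subcategory. Unlisted subcategories fall back to a low base rate.
AFFINITY = {"U1": {"soccer": 0.8, "travel": 0.1}}
BASE_RATE = 0.05


def click_probability(user, news_id):
    """Look up how likely this user is to click this article."""
    sub = CATALOGUE[news_id]
    return AFFINITY.get(user, {}).get(sub, BASE_RATE)


def make_impressions(user, shown, rng):
    """Return a MIND-style impressions field such as 'N1-1 N2-0'."""
    tokens = []
    for news_id in shown:
        label = 1 if rng.random() < click_probability(user, news_id) else 0
        tokens.append(f"{news_id}-{label}")
    return " ".join(tokens)


rng = random.Random(0)  # seeded so the sketch is reproducible
print(make_impressions("U1", ["N1", "N2", "N3", "N4"], rng))
```

Because a soccer fan clicks soccer articles far more often than the base rate, the resulting labels correlate with subcategory, which is the kind of structure a recommender can pick up.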

The loader

data.py reads both files into clean structures and, crucially, parses the click labels into interactions (for training) and per-user history (for the decayed profile). The same loader handles the sample and the full MIND download.

"""Load the MIND-schema news + behaviors files into usable structures.

Works for both the bundled sample and the real Microsoft MIND dataset (same
TSV format).
"""

from __future__ import annotations

from dataclasses import dataclass, field
from datetime import datetime


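def _parse_time(s: str) -> float:
    """Parse a timestamp into epoch seconds.

    Tries MIND's 12-hour format ('11/11/2019 9:05:58 AM'), then the same layout
    with a 24-hour clock, then ISO 8601. Returns 0.0 if nothing matches. Naive
    timestamps are interpreted in the local timezone.
    """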
    s = s.strip()
    for fmt in ("%m/%d/%Y %I:%M:%S %p", "%m/%d/%Y %H:%M:%S"):
        try:
            return datetime.strptime(s, fmt).timestamp()
        except ValueError:
            pass
    try:
        return datetime.fromisoformat(s).timestamp()
    except ValueError:
        return 0.0


@dataclass
class Article:
    news_id: str
    category: str
    subcategory: str
    title: str
    abstract: str

    @property
    def text(self) -> str:
        return f"{self.title}. {self.abstract}"


@dataclass
class NewsData:
    articles: dict = field(default_factory=dict)          # news_id -> Article
    interactions: list = field(default_factory=list)      # (user, news_id, time) clicks
    user_history: dict = field(default_factory=dict)      # user -> [(news_id, time), ...]
    impressions: list = field(default_factory=list)       # raw impressions for ranker

    @property
    def article_ids(self):
        return list(self.articles.keys())


def load_news(path: str) -> dict:
    """Return {news_id: Article}."""
    articles = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            if len(parts) < 5:
                continue
            # Only the first five columns are used; url and entity columns are ignored.
            nid, cat, sub, title, abstract = parts[:5]
            articles[nid] = Article(nid, cat, sub, title, abstract)
    return articles


def load_behaviors(path: str, articles: dict):
    """
    Parse behaviors.tsv. Returns (interactions, user_history, impressions):
      interactions : list of (user, news_id, time) for CLICKS
      user_history : user -> [(news_id, time)] from the history column + clicks
      impressions  : list of dicts {user, time, candidates:[(news_id,label)]}
    Unknown news ids (not in `articles`) are skipped.
    """
    interactions, impressions = [], []
    user_history = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            if len(parts) < 5:
                continue
            _, user, time_s, history_s, imp_s = parts[:5]
            t = _parse_time(time_s)
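            hist = [h for h in history_s.split() if h in articles]
            user_history.setdefault(user, [])
            # The history column carries no click times, so each history item is
            # stamped with this impression's time. It is appended once per
            # impression row, so the same id can appear more than once per user.
            for h in hist:
                user_history[user].append((h, t))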
            cands = []
            for tok in imp_s.split():
                if "-" not in tok:
                    continue
                nid, lab = tok.rsplit("-", 1)
                if nid in articles:
                    cands.append((nid, int(lab)))
                    if lab == "1":
                        interactions.append((user, nid, t))
                        user_history[user].append((nid, t))
            if cands:
                impressions.append({"user": user, "time": t, "candidates": cands})
    return interactions, user_history, impressions


def load_all(cfg) -> NewsData:
    articles = load_news(cfg.news_path)
    interactions, user_history, impressions = load_behaviors(cfg.behaviors_path, articles)
    return NewsData(articles=articles, interactions=interactions,
                    user_history=user_history, impressions=impressions)

Loading the sample:

articles=300 interactions=4000 users=400 impressions=4000
  • interactions: one (user, news_id, time) tuple for every click; the training signal.
  • user_history: each user's clicked items with timestamps; powers the time-decayed profile.
  • impressions: the shown-and-labeled candidates; the training data for the stage-2 ranker.
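
To make these structures concrete, the sketch below shows one way they might be consumed downstream: a time-based train/test split over interactions, and flattening impressions into labeled rows for a ranker. The hand-written records, the helper names split_by_time and ranker_rows, and the cutoff value are assumptions for illustration, not the capstone's actual training code.

```python
def split_by_time(interactions, cutoff):
    """Split (user, news_id, time) clicks into those before and at/after cutoff."""
    train = [row for row in interactions if row[2] < cutoff]
    test = [row for row in interactions if row[2] >= cutoff]
    return train, test


def ranker_rows(impressions):
    """Flatten impression dicts into (user, news_id, label) rows."""
    rows = []
    for imp in impressions:
        for news_id, label in imp["candidates"]:
            rows.append((imp["user"], news_id, label))
    return rows


# Hand-written records in the same shape the loader returns.
interactions = [("U1", "N64", 100.0), ("U2", "N3", 200.0), ("U1", "N7", 300.0)]
impressions = [
    {"user": "U1", "time": 100.0, "candidates": [("N255", 0), ("N64", 1)]},
    {"user": "U2", "time": 200.0, "candidates": [("N3", 1), ("N21", 0)]},
]

train, test = split_by_time(interactions, cutoff=250.0)
print(len(train), len(test))         # 2 1
print(ranker_rows(impressions)[:2])  # [('U1', 'N255', 0), ('U1', 'N64', 1)]
```

Splitting on time rather than at random keeps future clicks out of the training set, which is why the loader preserves a timestamp on every record.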

Using the real MIND dataset

  1. Download MIND (small or large) from https://msnews.github.io/.
  2. Drop its news.tsv and behaviors.tsv into data/.
  3. Re-run training — no code changes, because we coded to the MIND schema from the start.
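
Before re-running training on a fresh download, a quick sanity check that the dropped-in files have the expected shape can save a confusing failure later. The sketch below counts tab-separated columns against the layouts listed earlier (8 for news.tsv, 5 for behaviors.tsv). check_columns is a hypothetical helper, not part of the capstone code, and the demo file is a temporary stand-in.

```python
import os
import tempfile


def check_columns(path, expected, max_rows=1000):
    """Return 1-based line numbers (within the first max_rows) whose column count differs from expected."""
    bad = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if lineno > max_rows:
                break
            if len(line.rstrip("\n").split("\t")) != expected:
                bad.append(lineno)
    return bad


# Demo on a throwaway file: one well-formed behaviors row, one truncated row.
with tempfile.TemporaryDirectory() as tmp:
    demo = os.path.join(tmp, "behaviors.tsv")
    with open(demo, "w", encoding="utf-8") as f:
        f.write("0\tU106\t06/01/2026 08:00:00 AM\tN31 N3\tN255-0 N64-1\n")
        f.write("1\tU107\t06/01/2026 09:00:00 AM\n")
    bad_lines = check_columns(demo, expected=5)

print(bad_lines)  # [2]
```

On a real download the calls would be check_columns("data/news.tsv", 8) and check_columns("data/behaviors.tsv", 5); an empty list means every sampled row matched.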

Why a sample at all? Full MIND is gigabytes — too big to bundle and slow to iterate on. Developing against a faithful small sample, then scaling to the real dataset, is exactly how you'd work in practice.

With data in hand, let's build the recommender and ranker. 👉