Experiment tracking with MLflow

The moment you tune a recommender — different embedders, half-lives, ranker settings — you need to know which run was best and why. MLflow records every run's parameters, metrics, and artifacts so experiments are reproducible and comparable. This is the difference between "I think the new model is better" and "run #14 improved NDCG@10 from 0.044 to 0.048; here's the proof."

What MLflow gives you

Tracking — log params, metrics, and artifacts (the model file) per run.
UI — a web dashboard to compare runs side by side.
Model registry — promote a run's model through Staging → Production with versioning.

The training pipeline

train.py runs a proper offline evaluation and logs everything. It splits clicks leave-last-out (train on the past, test on each user's last click, never peeking ahead, per Chapter 3; users with only one click stay entirely in the training set), fits the recommender, trains the ranker on an 80/20 split of the impressions, evaluates, saves the model, and logs to MLflow. MLflow is optional: if it isn't installed, everything still runs and simply skips logging.

"""Training + offline evaluation pipeline, tracked with MLflow.

Steps:
  1. load data, split clicks leave-last-out (train on past, test on each user's
     last click);
  2. fit the recommender (embeddings + index + trending) on the train split;
  3. train the stage-2 logistic ranker on impression labels (80/20), report AUC;
  4. evaluate recall@k / ndcg@k on the held-out clicks;
  5. log params, metrics, and the model artifact to MLflow.

MLflow is optional: if it isn't installed, everything still runs and prints to
the console (logging is a no-op). Run:

    python -m newsreco.train
"""

from __future__ import annotations

import os
import pickle

from .config import Config
from .data import load_all, NewsData
from .recommender import NewsRecommender
from .ranker import LogisticRanker, build_training_samples
from . import metrics

try:                                   # MLflow is optional
    import mlflow
    _HAS_MLFLOW = True
except ImportError:                    # not installed: tracking becomes a no-op
    _HAS_MLFLOW = False


def leave_last_out(data: NewsData):
    """Split clicks: each user's chronologically last click -> test target."""
    by_user = {}
    for u, nid, t in data.interactions:
        by_user.setdefault(u, []).append((t, nid))
    train_inter, test = [], {}
    train_hist = {}
    for u, lst in by_user.items():
        lst.sort()  # tuples are (t, nid), so this orders clicks chronologically
        if len(lst) >= 2:
            *rest, last = lst
            test[u] = last[1]
            for t, nid in rest:
                train_inter.append((u, nid, t))
                train_hist.setdefault(u, []).append((nid, t))
        else:  # a single click: nothing to hold out, keep it in train
            for t, nid in lst:
                train_inter.append((u, nid, t))
                train_hist.setdefault(u, []).append((nid, t))
    train = NewsData(articles=data.articles, interactions=train_inter,
                     user_history=train_hist, impressions=data.impressions)
    return train, test


def run(cfg: Config | None = None):
    cfg = cfg or Config()
    data = load_all(cfg)
    train, test = leave_last_out(data)
    print(f"articles={len(data.articles)} clicks={len(data.interactions)} "
          f"test_users={len(test)}")

    rec = NewsRecommender(embedder=cfg.embedder, half_life_hours=cfg.half_life_hours)
    rec.fit(train)

    # stage-2 ranker: first 80% of impressions to train, last 20% held out
    imps = train.impressions
    cut = int(len(imps) * 0.8)
    Xtr, ytr = build_training_samples(rec, imps[:cut])
    Xte, yte = build_training_samples(rec, imps[cut:])
    ranker = LogisticRanker().fit(Xtr, ytr)
    train_auc = metrics.auc(ytr, ranker.predict(Xtr))
    test_auc = metrics.auc(yte, ranker.predict(Xte))

    # recall/ndcg on held-out clicks, candidate-gen vs. full two-stage
    k = cfg.top_k
    recs_cg = {u: rec.recommend(u, k) for u in test}
    recs_rk = {u: rec.recommend(u, k, ranker=ranker) for u in test}
    results = {
        "ranker_train_auc": round(train_auc, 4),
        "ranker_test_auc": round(test_auc, 4),
        f"recall@{k}_candgen": round(metrics.recall_at_k(recs_cg, test, k), 4),
        f"ndcg@{k}_candgen": round(metrics.ndcg_at_k(recs_cg, test, k), 4),
        f"recall@{k}_ranked": round(metrics.recall_at_k(recs_rk, test, k), 4),
        f"ndcg@{k}_ranked": round(metrics.ndcg_at_k(recs_rk, test, k), 4),
    }
    print("\nmetrics:")
    for kk, vv in results.items():
        print(f"  {kk:<22} {vv}")

    # persist the model
    os.makedirs("models", exist_ok=True)
    model_path = os.path.join("models", "newsreco.pkl")
    with open(model_path, "wb") as f:
        pickle.dump({"recommender": rec, "ranker": ranker}, f)
    print(f"\nsaved model -> {model_path}")

    # MLflow tracking
    if _HAS_MLFLOW:
        mlflow.set_tracking_uri(cfg.mlflow_uri)
        mlflow.set_experiment(cfg.experiment)
        with mlflow.start_run():
            mlflow.log_params({
                "embedder": cfg.embedder,
                "half_life_hours": cfg.half_life_hours,
                "top_k": k,
                "n_articles": len(data.articles),
                "n_clicks": len(data.interactions),
            })
            mlflow.log_metrics({kk.replace("@", "_at_"): float(vv)
                                for kk, vv in results.items()})
            mlflow.log_artifact(model_path)
        print(f"logged run to MLflow at {cfg.mlflow_uri} (experiment={cfg.experiment})")
    else:
        print("MLflow not installed — skipped tracking (pip install mlflow to enable)")

    return results


if __name__ == "__main__":
    run()
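
To see what the split does, here is a toy, standalone version of the same idea run on six clicks. It is a simplified sketch that works on bare (user, article, timestamp) tuples rather than the NewsData object, and the user and article IDs are made up for illustration.

```python
from collections import defaultdict


def split_last_click(interactions):
    """Hold out each user's latest click; users with one click stay in train."""
    by_user = defaultdict(list)
    for user, article, ts in interactions:
        by_user[user].append((ts, article))

    train, test = [], {}
    for user, clicks in by_user.items():
        clicks.sort()                      # oldest first
        if len(clicks) >= 2:
            *past, (_, last_article) = clicks
            test[user] = last_article
        else:
            past = clicks
        train.extend((user, article, ts) for ts, article in past)
    return train, test


clicks = [
    ("alice", "n1", 10), ("alice", "n3", 30), ("alice", "n2", 20),
    ("bob", "n1", 5), ("bob", "n4", 50),
    ("carol", "n2", 7),                    # single click: nothing to hold out
]
train, test = split_last_click(clicks)
print(test)           # {'alice': 'n3', 'bob': 'n4'}
print(sorted(train))  # alice's n1 and n2, bob's n1, carol's n2
```

Note that alice's held-out article is n3 (timestamp 30) even though it was not the last tuple in the input: the split goes by time, not by list order.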

Running it

$ python -m newsreco.train
articles=300 clicks=4000 test_users=399

metrics:
  ranker_train_auc       0.9217
  ranker_test_auc        0.947
  recall@10_candgen      0.0827
  ndcg@10_candgen        0.0436
  recall@10_ranked       0.0827
  ndcg@10_ranked         0.0477

saved model -> models/newsreco.pkl
logged run to MLflow at file:./mlruns (experiment=news-recommender)

Reading the results:

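ranker_test_auc 0.947: the click model separates clicked from non-clicked held-out impressions well (AUC 0.5 = random, 1.0 = perfect). The held-out figure landing slightly above the training figure (0.9217) is most likely noise from the small test split, not a sign that the model does better on unseen data.
recall@10 / ndcg@10: accuracy on the held-out clicks. The ranker only re-orders the candidates it is given, and in this run the same held-out items land in the top 10 either way (so recall stays at 0.0827), but it orders them better, nudging NDCG up (0.0436 → 0.0477). On larger, real datasets the gap is usually wider; the point is that the pipeline measures it honestly.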
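
If the "same recall, higher NDCG" result feels abstract, a tiny worked example helps. The chapter's metrics module isn't shown here, so the functions below are an assumption: the standard definitions for the case of one held-out item per user, plus a pairwise AUC. The article IDs and scores are invented.

```python
import math


def recall_at_k(ranked, target, k):
    """1.0 if the held-out item appears in the top k, else 0.0."""
    return 1.0 if target in ranked[:k] else 0.0


def ndcg_at_k(ranked, target, k):
    """With one relevant item, NDCG is 1 / log2(position + 1), or 0 on a miss."""
    top = ranked[:k]
    if target not in top:
        return 0.0
    position = top.index(target) + 1       # 1-based rank
    return 1.0 / math.log2(position + 1)


def auc(labels, scores):
    """Share of (clicked, not-clicked) pairs scored in the right order; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


target = "n7"
candidates = ["n1", "n2", "n3", "n4", "n7"]   # candidate generation: hit at rank 5
reranked = ["n1", "n7", "n2", "n3", "n4"]     # same items, hit moved up to rank 2

for name, recs in [("candgen", candidates), ("ranked", reranked)]:
    print(name, recall_at_k(recs, target, 5), round(ndcg_at_k(recs, target, 5), 4))
# candgen 1.0 0.3869
# ranked 1.0 0.6309

print(auc([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1]))   # 0.75
```

Recall only asks whether the item made the cut, so moving it from rank 5 to rank 2 changes nothing. NDCG discounts by position, so the same move is rewarded.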

Viewing and comparing runs

mlflow ui --backend-store-uri ./mlruns --port 5000      # http://localhost:5000

Each python -m newsreco.train (with different NEWSRECO_HALFLIFE, NEWSRECO_EMBEDDER, etc.) creates a new run; the UI plots them so you can pick the winner. Try:

NEWSRECO_HALFLIFE=24  python -m newsreco.train     # more recency-biased
NEWSRECO_EMBEDDER=sbert python -m newsreco.train   # semantic embeddings
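
How an environment variable ends up changing a run can be sketched as follows. This is not the book's Config class (which isn't shown in this chapter): EnvConfig, the _env helper, and the default values "tfidf" and 72.0 are placeholders invented for illustration. The recency_weight function also assumes the half-life feeds a plain exponential decay, which is the usual meaning of the term but is an assumption about the recommender's internals.

```python
import os
from dataclasses import dataclass, field


def _env(name, default, cast=str):
    """Read an environment variable, falling back to a default."""
    raw = os.environ.get(name)
    return cast(raw) if raw is not None else default


@dataclass
class EnvConfig:
    # Defaults are illustrative placeholders, not the project's real values.
    embedder: str = field(
        default_factory=lambda: _env("NEWSRECO_EMBEDDER", "tfidf"))
    half_life_hours: float = field(
        default_factory=lambda: _env("NEWSRECO_HALFLIFE", 72.0, float))


def recency_weight(age_hours, half_life_hours):
    """Exponential decay: the weight halves every `half_life_hours`."""
    return 0.5 ** (age_hours / half_life_hours)


os.environ["NEWSRECO_HALFLIFE"] = "24"           # same effect as the shell prefix
cfg = EnvConfig()
print(cfg.half_life_hours)                       # 24.0
print(recency_weight(48, cfg.half_life_hours))   # 0.25
print(round(recency_weight(48, 72.0), 3))        # 0.63
```

Under these assumptions a two-day-old article keeps a quarter of its weight with a 24-hour half-life but about 63% with a 72-hour one, which is what "more recency-biased" means in practice.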

The model registry (promoting to production)

Once a run looks good, register and stage its model so the serving layer always loads "the current Production model":

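import mlflow

mlflow.set_tracking_uri("file:./mlruns")   # the same store train.py logged to
# Replace <run_id> with the ID of the chosen run (shown on its page in the MLflow UI).
# "newsreco.pkl" is the artifact name produced by mlflow.log_artifact(model_path).
mlflow.register_model("runs:/<run_id>/newsreco.pkl", "news-recommender")
# Then, in the MLflow UI (or via the API), transition that version to "Production".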

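The API (Chapter 21) loads models/newsreco.pkl from disk. In a fuller setup it would pull the current Production version from the registry instead, so deploying a better model becomes a registry transition, not a code change. Keep in mind that newsreco.pkl is logged as a plain pickle artifact rather than in MLflow's own model format, so the serving code would still have to download and unpickle it itself.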
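
The "transition via the API" step could look roughly like this. It is a sketch, not part of the book's code: the two helper names are made up, and the client argument is expected to be an mlflow.tracking.MlflowClient. The two client methods used, transition_model_version_stage and get_latest_versions, belong to the stage-based registry workflow this chapter describes; recent MLflow releases mark stages as deprecated in favor of aliases (set_registered_model_alias), so check the version you run.

```python
def promote_to_production(client, name, version):
    """Move `version` of a registered model to Production and archive the old one."""
    client.transition_model_version_stage(
        name=name,
        version=version,
        stage="Production",
        archive_existing_versions=True,   # previous Production version -> Archived
    )


def production_source(client, name):
    """Return the artifact location of the current Production version, if any."""
    versions = client.get_latest_versions(name, stages=["Production"])
    return versions[0].source if versions else None


# Typical use (needs mlflow installed and a store that supports the registry):
#   from mlflow.tracking import MlflowClient
#   client = MlflowClient(tracking_uri="file:./mlruns")
#   promote_to_production(client, "news-recommender", "1")
#   print(production_source(client, "news-recommender"))
```

Rolling back is the same call with an older version number, which is why the registry makes reverting quick: nothing in the serving code changes.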

Why this matters in production

Reproducibility — every result is tied to its exact params and code.
Comparability — no more guessing whether a change helped.
Auditability & rollback — the registry records what's live and lets you revert instantly.

Next: the RAG news assistant. 👉

Recommendation Systems from Scratch