Experiment tracking with MLflow
The moment you tune a recommender — different embedders, half-lives, ranker settings — you need to know which run was best and why. MLflow records every run's parameters, metrics, and artifacts so experiments are reproducible and comparable. This is the difference between "I think the new model is better" and "run #14 improved NDCG@10 from 0.044 to 0.048; here's the proof."
What MLflow gives you
- Tracking — log params, metrics, and artifacts (the model file) per run.
- UI — a web dashboard to compare runs side by side.
- Model registry — promote a run's model through
Staging → Production, with versioning.
The training pipeline
train.py does a proper offline evaluation and logs everything. It splits clicks
leave-last-out (train on the past, test on each user's last click — never
peeking ahead, per Chapter 3), fits the recommender, trains the
ranker on an 80/20 impression split, evaluates, saves the model, and logs to
MLflow. MLflow is optional — if it isn't installed, everything still runs and
just skips logging.
"""Training + offline evaluation pipeline, tracked with MLflow.

Steps:
  1. load data, split clicks leave-last-out (train on past, test on each user's
     last click);
  2. fit the recommender (embeddings + index + trending) on the train split;
  3. train the stage-2 logistic ranker on impression labels (80/20), report AUC;
  4. evaluate recall@k / ndcg@k on the held-out clicks;
  5. log params, metrics, and the model artifact to MLflow.

MLflow is optional: if it isn't installed, everything still runs and prints to
the console (logging is a no-op). Run:

    python -m newsreco.train
"""
from __future__ import annotations

import os
import pickle

from .config import Config
from .data import load_all, NewsData
from .recommender import NewsRecommender
from .ranker import LogisticRanker, build_training_samples
from . import metrics

try:  # MLflow is optional
    import mlflow
    _HAS_MLFLOW = True
except Exception:
    _HAS_MLFLOW = False


def leave_last_out(data: NewsData):
    """Split clicks: each user's chronologically last click -> test target."""
    # group clicks per user as (timestamp, news_id) so sorting is chronological
    by_user = {}
    for u, nid, t in data.interactions:
        by_user.setdefault(u, []).append((t, nid))

    train_inter, test = [], {}
    train_hist = {}
    for u, lst in by_user.items():
        lst.sort()
        if len(lst) >= 2:
            # hold out the last click; everything before it is training data
            *rest, last = lst
            test[u] = last[1]
            for t, nid in rest:
                train_inter.append((u, nid, t))
                train_hist.setdefault(u, []).append((nid, t))
        else:
            # users with a single click have nothing to hold out
            for t, nid in lst:
                train_inter.append((u, nid, t))
                train_hist.setdefault(u, []).append((nid, t))

    train = NewsData(articles=data.articles, interactions=train_inter,
                     user_history=train_hist, impressions=data.impressions)
    return train, test


def run(cfg: Config | None = None):
    """Train, evaluate, save, and (if MLflow is available) log one run."""
    cfg = cfg or Config()
    data = load_all(cfg)
    train, test = leave_last_out(data)
    print(f"articles={len(data.articles)} clicks={len(data.interactions)} "
          f"test_users={len(test)}")

    rec = NewsRecommender(embedder=cfg.embedder, half_life_hours=cfg.half_life_hours)
    rec.fit(train)

    # stage-2 ranker (80/20 split of impressions)
    imps = train.impressions
    cut = int(len(imps) * 0.8)
    Xtr, ytr = build_training_samples(rec, imps[:cut])
    Xte, yte = build_training_samples(rec, imps[cut:])
    ranker = LogisticRanker().fit(Xtr, ytr)
    train_auc = metrics.auc(ytr, ranker.predict(Xtr))
    test_auc = metrics.auc(yte, ranker.predict(Xte))

    # recall/ndcg on held-out clicks, candidate-gen vs. full two-stage
    k = cfg.top_k
    recs_cg = {u: rec.recommend(u, k) for u in test}
    recs_rk = {u: rec.recommend(u, k, ranker=ranker) for u in test}
    results = {
        "ranker_train_auc": round(train_auc, 4),
        "ranker_test_auc": round(test_auc, 4),
        f"recall@{k}_candgen": round(metrics.recall_at_k(recs_cg, test, k), 4),
        f"ndcg@{k}_candgen": round(metrics.ndcg_at_k(recs_cg, test, k), 4),
        f"recall@{k}_ranked": round(metrics.recall_at_k(recs_rk, test, k), 4),
        f"ndcg@{k}_ranked": round(metrics.ndcg_at_k(recs_rk, test, k), 4),
    }
    print("\nmetrics:")
    for kk, vv in results.items():
        print(f" {kk:<22} {vv}")

    # persist the model
    os.makedirs("models", exist_ok=True)
    model_path = os.path.join("models", "newsreco.pkl")
    with open(model_path, "wb") as f:
        pickle.dump({"recommender": rec, "ranker": ranker}, f)
    print(f"\nsaved model -> {model_path}")

    # MLflow tracking
    if _HAS_MLFLOW:
        mlflow.set_tracking_uri(cfg.mlflow_uri)
        mlflow.set_experiment(cfg.experiment)
        with mlflow.start_run():
            mlflow.log_params({
                "embedder": cfg.embedder,
                "half_life_hours": cfg.half_life_hours,
                "top_k": k,
                "n_articles": len(data.articles),
                "n_clicks": len(data.interactions),
            })
            # "@" is not a valid character in MLflow metric names
            mlflow.log_metrics({kk.replace("@", "_at_"): float(vv)
                                for kk, vv in results.items()})
            mlflow.log_artifact(model_path)
        print(f"logged run to MLflow at {cfg.mlflow_uri} (experiment={cfg.experiment})")
    else:
        print("MLflow not installed; skipped tracking (pip install mlflow to enable)")
    return results


if __name__ == "__main__":
    run()
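To see the leave-last-out split in isolation, here is a minimal sketch on plain tuples. It is not the pipeline's own function: it skips the NewsData class, and the user names, article ids, and timestamps are made up for illustration.

```python
from collections import defaultdict


def leave_last_out(interactions):
    """Hold out each user's latest click; single-click users stay in train."""
    by_user = defaultdict(list)
    for user, news_id, ts in interactions:
        by_user[user].append((ts, news_id))

    train, test = [], {}
    for user, clicks in by_user.items():
        clicks.sort()  # chronological order, since the timestamp comes first
        if len(clicks) >= 2:
            *past, (_, last_id) = clicks
            test[user] = last_id
        else:
            past = clicks
        train.extend((user, nid, ts) for ts, nid in past)
    return train, test


# Toy click log: (user, article id, timestamp). Values are illustrative only.
clicks = [
    ("alice", "n1", 10), ("alice", "n3", 30), ("alice", "n2", 20),
    ("bob", "n2", 5),
]
train, test = leave_last_out(clicks)
print(test)   # {'alice': 'n3'}
print(train)  # [('alice', 'n1', 10), ('alice', 'n2', 20), ('bob', 'n2', 5)]
```

Alice's latest click (n3 at t=30) becomes her test target even though it was not last in the log, and Bob, with a single click, contributes only to training.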
Running it
$ python -m newsreco.train
articles=300 clicks=4000 test_users=399

metrics:
 ranker_train_auc       0.9217
 ranker_test_auc        0.947
 recall@10_candgen      0.0827
 ndcg@10_candgen        0.0436
 recall@10_ranked       0.0827
 ndcg@10_ranked         0.0477

saved model -> models/newsreco.pkl
logged run to MLflow at file:./mlruns (experiment=news-recommender)
Reading the results:
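- ranker_test_auc 0.947: the click model generalizes well to held-out impressions (AUC 0.5 = random, 1.0 = perfect). The test AUC sitting above the train AUC (0.9217) is most likely noise from the small, unshuffled 80/20 split rather than a real effect.
- recall@10 / ndcg@10: accuracy on the held-out clicks. The ranker doesn't change which items are in the candidate set, so recall is unchanged (0.0827), but it orders them better, nudging NDCG up (0.0436 → 0.0477). On larger, real data the gap would be expected to widen, though nothing here demonstrates that; the point is that the pipeline measures it honestly.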
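The metrics module isn't shown in this chapter, so the following is only a sketch of how recall@k and NDCG@k could be computed when each user has exactly one held-out click. The function bodies are assumptions rather than the project's code. It illustrates why reordering the same candidates leaves recall alone but moves NDCG.

```python
import math


def recall_at_k(recs, targets, k):
    """Fraction of users whose held-out item appears in their top-k list."""
    hits = sum(1 for u, target in targets.items()
               if target in recs.get(u, [])[:k])
    return hits / len(targets) if targets else 0.0


def ndcg_at_k(recs, targets, k):
    """Mean NDCG@k with a single relevant item per user.

    With one relevant item the ideal DCG is 1, so a hit at 1-based rank r
    contributes 1 / log2(r + 1) and a miss contributes 0.
    """
    total = 0.0
    for u, target in targets.items():
        top = recs.get(u, [])[:k]
        if target in top:
            rank = top.index(target) + 1
            total += 1.0 / math.log2(rank + 1)
    return total / len(targets) if targets else 0.0


targets = {"alice": "n3"}
before = {"alice": ["n1", "n2", "n3"]}  # held-out item at rank 3
after = {"alice": ["n3", "n1", "n2"]}   # same items, held-out item at rank 1

print(recall_at_k(before, targets, 3), recall_at_k(after, targets, 3))  # 1.0 1.0
print(ndcg_at_k(before, targets, 3), ndcg_at_k(after, targets, 3))      # 0.5 1.0
```

Both lists contain the held-out article, so recall is identical; only its position changes, and NDCG rewards the higher position.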
Viewing and comparing runs
mlflow ui --backend-store-uri ./mlruns --port 5000 # http://localhost:5000
Each python -m newsreco.train (with different NEWSRECO_HALFLIFE,
NEWSRECO_EMBEDDER, etc.) creates a new run; the UI plots them so you can pick the
winner. Try:
NEWSRECO_HALFLIFE=24 python -m newsreco.train # more recency-biased
NEWSRECO_EMBEDDER=sbert python -m newsreco.train # semantic embeddings
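Running several configurations by hand gets tedious, so a small driver script can launch them in sequence. The sketch below assumes only what the commands above show: that newsreco.train reads NEWSRECO_HALFLIFE and NEWSRECO_EMBEDDER from the environment. The helper names sweep_jobs and run_sweep, and the half-life values other than 24, are hypothetical.

```python
import os
import subprocess
import sys


def sweep_jobs(half_lives, embedders):
    """Build one (env, command) pair per half-life / embedder combination."""
    cmd = [sys.executable, "-m", "newsreco.train"]
    jobs = []
    for hl in half_lives:
        for emb in embedders:
            # Copy the current environment and override the two settings.
            env = {**os.environ,
                   "NEWSRECO_HALFLIFE": str(hl),
                   "NEWSRECO_EMBEDDER": emb}
            jobs.append((env, cmd))
    return jobs


def run_sweep(jobs):
    """Run each job in sequence; raises CalledProcessError if a run fails."""
    for env, cmd in jobs:
        subprocess.run(cmd, env=env, check=True)


jobs = sweep_jobs([12, 24, 48], ["sbert"])  # illustrative values
print(len(jobs))  # 3
```

Calling run_sweep(jobs) from the project root would then produce one MLflow run per combination, ready to compare in the UI.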
The model registry (promoting to production)
Once a run looks good, register and stage its model so the serving layer always loads "the current Production model":
import mlflow
# from a chosen run:
mlflow.register_model("runs:/<run_id>/newsreco.pkl", "news-recommender")
# then in the MLflow UI (or API) transition that version to "Production".
# note: MLflow 2.9+ deprecates stages in favor of model version aliases.
The API (Chapter 21) loads models/newsreco.pkl; in a
fuller setup you'd have it pull the current Production version from the registry,
so deploying a better model is a registry transition, not a code change.
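One way the "fuller setup" might look is a loader that asks the registry for the Production version first and falls back to the local pickle. This is a sketch under several assumptions: the version was registered from a run that logged newsreco.pkl, the tracking URI is already configured (for example through MLFLOW_TRACKING_URI), and the stage-based lookup is still available in your MLflow version. The function names are hypothetical.

```python
import pickle
import tempfile
from pathlib import Path

LOCAL_MODEL = Path("models") / "newsreco.pkl"  # path written by train.py
REGISTERED_NAME = "news-recommender"           # name passed to register_model


def fetch_production_pickle(name=REGISTERED_NAME):
    """Return a local path to the Production version's pickle, or None."""
    try:
        from mlflow.artifacts import download_artifacts
        from mlflow.exceptions import MlflowException
        from mlflow.tracking import MlflowClient
    except ImportError:
        return None  # MLflow not installed: caller falls back to local file
    try:
        versions = MlflowClient().get_latest_versions(name, stages=["Production"])
    except MlflowException:
        return None  # e.g. the registered model does not exist yet
    if not versions:
        return None
    # Assumes the artifact was logged as newsreco.pkl at the run's root.
    uri = f"runs:/{versions[0].run_id}/newsreco.pkl"
    return download_artifacts(artifact_uri=uri)


def load_bundle(local_path=LOCAL_MODEL, fetch=fetch_production_pickle):
    """Prefer the registry's Production model; fall back to the local file."""
    path = fetch() or local_path
    with open(path, "rb") as f:
        return pickle.load(f)


# Demo with a stand-in bundle and the registry lookup disabled.
with tempfile.TemporaryDirectory() as tmp:
    stand_in = Path(tmp) / "newsreco.pkl"
    with open(stand_in, "wb") as f:
        pickle.dump({"recommender": "rec-stub", "ranker": "ranker-stub"}, f)
    bundle = load_bundle(stand_in, fetch=lambda: None)
print(sorted(bundle))  # ['ranker', 'recommender']
```

Passing the lookup in as the fetch argument keeps the fallback path testable without a tracking server, and swapping the stage lookup for an alias lookup later would only touch fetch_production_pickle.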
Why this matters in production
- Reproducibility — every result is tied to its exact params and code.
- Comparability — no more guessing whether a change helped.
- Auditability & rollback — the registry records what's live and lets you revert instantly.
Next: the RAG news assistant. 👉