Monitoring & drift detection

Here's the failure mode that catches most teams at least once: you ship a great model, it works, everyone moves on, and six months later it's quietly making bad predictions because the world changed and nobody was watching. A deployed model rots. Monitoring is how you catch the rot before your users (or your boss) do. It's often the most-skipped stage of the lifecycle, and arguably the most important.

Install: the PSI check below needs only NumPy, and its output in this chapter comes from a real run. The richer reports later in the chapter need Evidently: pip install evidently.

Two kinds of monitoring (don't conflate them)

Don't be confused: operational monitoring vs. ML monitoring. Operational monitoring watches the service — latency, error rate, throughput, CPU. Standard DevOps (Prometheus/Grafana); if the API is slow or 500ing, you'll know. ML monitoring watches the model's quality — is it still accurate? A model can be perfectly healthy operationally (fast, 200 OK) while being completely wrong. Operational monitoring won't catch that. You need both.

This chapter is about the second kind — the one teams forget.

Why models rot: drift

The world your model learned no longer matches the world it's serving. Two flavors:

  • Data drift — the input distribution shifts. New users behave differently, an upstream feature changes units, a season turns. The model sees inputs unlike its training data.
  • Concept drift — the relationship between inputs and the answer shifts. Fraudsters invent new tactics; "expensive" means something different after inflation. Even with identical inputs, the right answer changed.

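Both silently erode accuracy, and you often can't measure accuracy live, because the true labels arrive late or never (you only learn whether a loan defaults months later). So you monitor drift in the inputs as an early-warning proxy. Keep in mind that this proxy sees data drift only: pure concept drift leaves the inputs unchanged, so it shows up only once labels arrive.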
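To make the two flavors concrete, here is a minimal simulation. Everything in it is an assumption made up for illustration: the toy model, the "true" labeling rule, and the synthetic distributions. It is a sketch of the idea, not anything from the chapter's code.

```python
import numpy as np

rng = np.random.default_rng(0)

def model(x):
    # Hypothetical deployed model: a simple threshold learned on training data.
    return x > 0.0

def true_label(x, threshold=0.0):
    # Hypothetical ground truth: positive only inside a band. The upper edge (3.0)
    # almost never appears in training data, so the model never learned it.
    return (x > threshold) & (x < 3.0)

def accuracy(x, y):
    return float(np.mean(model(x) == y))

x_train = rng.normal(0.0, 1.0, 10_000)   # training-like inputs
x_shift = rng.normal(2.5, 1.0, 10_000)   # data drift: the inputs moved
x_same = rng.normal(0.0, 1.0, 10_000)    # concept drift case: inputs unchanged

acc_baseline = accuracy(x_train, true_label(x_train))
acc_data_drift = accuracy(x_shift, true_label(x_shift))
# Concept drift: same input distribution, but the right answer changed.
acc_concept_drift = accuracy(x_same, true_label(x_same, threshold=0.5))

print(f"baseline       accuracy={acc_baseline:.3f}")
print(f"data drift     accuracy={acc_data_drift:.3f}")
print(f"concept drift  accuracy={acc_concept_drift:.3f}")
```

In the concept-drift case the inputs look exactly like training data, so an input-only drift check would stay quiet while accuracy drops. That is why labels, when they do arrive, still matter.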

Detecting drift with PSI

The Population Stability Index is the classic, dependency-free drift metric. It compares a reference distribution (your training data) to the current one (live traffic), bucket by bucket, and sums how much they diverge. From serving/drift_check.py:

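import numpy as np

def psi(reference, current, bins=10):
    # Bucket edges are quantiles of the reference, so each bucket holds ~1/bins of it
    edges = np.quantile(reference, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf   # catch live values outside the training range
    ref_pct = np.histogram(reference, edges)[0] / len(reference)
    cur_pct = np.histogram(current, edges)[0] / len(current)
    # Floor empty buckets at 1e-6 so the ratio and the log stay finite
    ref_pct, cur_pct = np.clip(ref_pct, 1e-6, None), np.clip(cur_pct, 1e-6, None)
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))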
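In symbols, the function computes the following, where the notation is ours rather than the chapter's: B is the number of buckets, and r_i and c_i are the fractions of reference and current samples that land in bucket i (each floored at 10^-6).

```latex
\mathrm{PSI} \;=\; \sum_{i=1}^{B} \left(c_i - r_i\right)\,\ln\frac{c_i}{r_i}
```

Every term is non-negative, because c_i - r_i and ln(c_i / r_i) always have the same sign. PSI is therefore zero only when the two bucketed distributions match exactly, and it grows as they diverge.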

The widely used rule-of-thumb thresholds:

PSI          Meaning                Action
< 0.1        no significant drift   fine
0.1 – 0.25   moderate drift         investigate
> 0.25       major drift            retrain
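A minimal sketch of how these bands could be applied in code. The psi function repeats the one above. The interpret helper and the synthetic samples are assumptions for illustration; they are not the data that serving/drift_check.py uses, so the numbers you get will differ from the chapter's run.

```python
import numpy as np

def psi(reference, current, bins=10):
    # Same computation as the psi function shown earlier.
    edges = np.quantile(reference, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    ref_pct = np.histogram(reference, edges)[0] / len(reference)
    cur_pct = np.histogram(current, edges)[0] / len(current)
    ref_pct, cur_pct = np.clip(ref_pct, 1e-6, None), np.clip(cur_pct, 1e-6, None)
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))

def interpret(value):
    """Map a PSI value to the rule-of-thumb bands (hypothetical helper)."""
    if value < 0.1:
        return "no significant drift"
    if value <= 0.25:
        return "moderate drift: investigate"
    return "major drift: retrain"

rng = np.random.default_rng(42)
reference = rng.normal(0.0, 1.0, 5_000)   # stand-in for training data
stable = rng.normal(0.0, 1.0, 5_000)      # live traffic, same distribution
drifted = rng.normal(1.0, 1.5, 5_000)     # live traffic, shifted mean and spread

for name, sample in [("stable", stable), ("drifted", drifted)]:
    value = psi(reference, sample)
    print(f"{name:8s} PSI={value:.4f} -> {interpret(value)}")
```

Treat the cutoffs as defaults to tune, not laws: a feature the model leans on heavily may deserve a tighter threshold than one it barely uses.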

It runs — live output

cd code
python serving/drift_check.py

Output:

stable    PSI=0.0012  -> no significant drift
drifted   PSI=0.3089  -> MAJOR drift — retrain

Two production scenarios compared against the training distribution: traffic that matches it scores PSI 0.0012 (no drift — all good), while traffic shifted in mean and spread scores PSI 0.3089 (well past 0.25 — retrain now). Run this nightly on a sample of live inputs and you get an automatic early warning, before accuracy visibly craters.

What to monitor

  • Input feature distributions — PSI/KS-test per feature vs. training (above).
  • Prediction distribution — if your model suddenly predicts "positive" 90% of the time when it used to be 50%, something changed upstream.
  • Model accuracy — when labels eventually arrive, compare predictions to truth and track accuracy/AUC over time.
  • Data quality — nulls, out-of-range values, schema changes. Often the real culprit: a broken upstream pipeline, not "concept drift."
  • Operational metrics — latency, errors, throughput (the other kind).
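Two of the checks in this list need nothing beyond plain Python. The sketch below assumes a made-up schema (EXPECTED_RANGES) and made-up rows and predictions; the function names are hypothetical, not part of the chapter's code.

```python
import math

# Hypothetical schema: feature name -> (min, max) allowed range.
EXPECTED_RANGES = {"age": (18, 100), "income": (0, 1_000_000)}

def quality_report(rows, expected=EXPECTED_RANGES):
    """Count missing, null, and out-of-range values per feature."""
    report = {name: {"missing": 0, "null": 0, "out_of_range": 0}
              for name in expected}
    for row in rows:
        for name, (lo, hi) in expected.items():
            if name not in row:
                report[name]["missing"] += 1      # schema change upstream
                continue
            value = row[name]
            if value is None or (isinstance(value, float) and math.isnan(value)):
                report[name]["null"] += 1
            elif not lo <= value <= hi:
                report[name]["out_of_range"] += 1
    return report

def positive_rate_shift(reference_preds, live_preds):
    """Absolute change in the share of positive (1) predictions."""
    ref_rate = sum(reference_preds) / len(reference_preds)
    live_rate = sum(live_preds) / len(live_preds)
    return abs(live_rate - ref_rate)

live_rows = [
    {"age": 34, "income": 52_000},
    {"age": None, "income": 48_000},   # null value
    {"age": 29, "income": -5},         # out of range
    {"age": 41},                       # income column missing
]
report = quality_report(live_rows)
shift = positive_rate_shift([1, 0, 1, 0], [1, 1, 1, 0])
print(report)
print(f"positive-rate shift: {shift:.2f}")
```

Running the data-quality check before the drift check is a reasonable ordering: a broken pipeline usually shows up as nulls or missing columns first, and fixing it is far cheaper than retraining.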

Evidently: drift reports out of the box

Computing PSI by hand teaches the idea; in practice Evidently generates full drift and quality reports with a few lines:

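# Import paths below match older Evidently releases; newer versions have
# reorganized the package, so check the docs for the version you install.
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

# train_df and live_df: pandas DataFrames with the same feature columns
report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=train_df, current_data=live_df)
report.save_html("drift_report.html")        # per-feature drift, with charts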

It checks every feature, picks appropriate statistical tests, and produces a shareable HTML report. Hosted tools (WhyLabs, Arize, Fiddler) add dashboards, alerts, and historical tracking.

Closing the loop

Monitoring isn't passive — it's the trigger for the rest of the lifecycle. When the drift check fires, it kicks off the Prefect retraining pipeline (Chapter 9), which trains a fresh model, gates it on quality, and promotes it via the registry (Chapter 3) — all automatically. That closed loop — serve → monitor → detect drift → retrain → promote → serve — is what "production ML" actually means, and it's the whole picture from Chapter 0 made real.

Don't be confused: retrain on drift, not on a fixed schedule (ideally). Many teams retrain "every week" by default — simple, but it retrains when nothing changed (waste) and misses drift that strikes mid-week. Drift-triggered retraining responds to the actual signal. A scheduled retrain is a fine starting point; a drift-triggered one is the goal.
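The trigger logic in this section can be sketched as a small decision function: retrain when any feature shows major drift, and keep a scheduled retrain as a safety net. The names, the 30-day fallback, and the PSI values below are all assumptions for illustration. The actual hand-off to the retraining pipeline is left as a comment, since it depends on your orchestration setup.

```python
from dataclasses import dataclass

@dataclass
class RetrainDecision:
    retrain: bool
    reason: str

def decide_retrain(feature_psi, days_since_training,
                   psi_threshold=0.25, max_age_days=30):
    """Drift-triggered retraining with a scheduled fallback.

    feature_psi: dict of feature name -> PSI versus the training data.
    max_age_days: hypothetical safety net, not a value from the chapter.
    """
    drifted = sorted(name for name, value in feature_psi.items()
                     if value > psi_threshold)
    if drifted:
        return RetrainDecision(True, "major drift in: " + ", ".join(drifted))
    if days_since_training >= max_age_days:
        return RetrainDecision(True, f"model older than {max_age_days} days")
    return RetrainDecision(False, "no major drift")

# Example nightly run with made-up PSI values.
decision = decide_retrain({"age": 0.03, "income": 0.31}, days_since_training=4)
print(decision)
if decision.retrain:
    pass  # here you would kick off the retraining pipeline (Chapter 9)
```

Returning a reason alongside the boolean makes the trigger auditable: when a retrain fires at 3 a.m., the log says which feature caused it.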

The takeaway

Models rot as the world drifts from their training data, so you must monitor quality (drift, prediction shifts, accuracy when labels arrive), not just uptime. PSI is the classic drift metric (our live run flagged shifted traffic at 0.31 and clean traffic at 0.001), and Evidently automates the full report. Crucially, monitoring triggers retraining, closing the production loop. That completes the classical MLOps stack. Next we add the part every modern AI team now needs: the GenAI / LLM stack, starting with the vector databases that power semantic search and RAG.