Evaluation metrics
A model is only as trustworthy as the number you judge it by — and accuracy, the metric everyone reaches for first, is often a trap. This chapter gives you the metrics that actually tell the truth, when to use each, and the confusions that make people ship broken models thinking they're great.
Why accuracy lies
Accuracy = fraction of predictions that are correct. Sounds perfect. Now imagine a fraud detector where 5% of transactions are fraud. A model that predicts "not fraud" for everything — a model that does literally nothing — scores:
import numpy as np
y = np.array([1]*5 + [0]*95) # 5 fraud, 95 legit
predict_all_negative = np.zeros(100, dtype=int)
print("accuracy:", (predict_all_negative == y).mean())
Output:
accuracy: 0.95
95% accuracy for a useless model. On imbalanced data (fraud, disease, click-through, anomalies), accuracy is actively misleading. You need metrics that look at which errors happen.
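To make "which errors happen" concrete, here is a minimal sketch that splits the do-nothing model's results by true class instead of collapsing them into one number. The variable names `fraud_caught`, `fraud_missed`, and `legit_flagged` are illustrative, not part of any library.

```python
import numpy as np

y = np.array([1] * 5 + [0] * 95)                 # 5 fraud, 95 legit
predict_all_negative = np.zeros(100, dtype=int)  # the do-nothing model

# Count outcomes separately for the fraud class and the legit class.
fraud_caught = int(((predict_all_negative == 1) & (y == 1)).sum())
fraud_missed = int(((predict_all_negative == 0) & (y == 1)).sum())
legit_flagged = int(((predict_all_negative == 1) & (y == 0)).sum())

print(f"fraud caught:  {fraud_caught} of {int((y == 1).sum())}")
print(f"fraud missed:  {fraud_missed}")
print(f"legit flagged: {legit_flagged}")
```

Every one of the model's errors is a missed fraud case, which is exactly the information the single accuracy number hides.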
The confusion matrix: the source of all classification metrics
Every binary prediction falls into one of four buckets. The 2×2 grid of their counts is the confusion matrix, and every other metric is just arithmetic on these four numbers:
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actually Positive | True Positive (TP) | False Negative (FN) |
| Actually Negative | False Positive (FP) | True Negative (TN) |
- TP — caught a real positive ✅
- FP — false alarm (predicted positive, was negative) — a Type I error
- FN — missed it (predicted negative, was positive) — a Type II error
- TN — correctly left a negative alone ✅
y_true = np.array([1,1,1,1, 0,0,0,0,0,0])
y_pred = np.array([1,1,0,0, 1,0,0,0,0,0])
TP = int(((y_pred==1) & (y_true==1)).sum())
FP = int(((y_pred==1) & (y_true==0)).sum())
FN = int(((y_pred==0) & (y_true==1)).sum())
TN = int(((y_pred==0) & (y_true==0)).sum())
print(f"TP={TP} FP={FP} FN={FN} TN={TN}")
Output:
TP=2 FP=1 FN=2 TN=5
Precision and recall: the two questions that matter
From those four numbers come the two metrics you'll quote constantly:
$$ \text{Precision} = \frac{TP}{TP+FP} \qquad \text{Recall} = \frac{TP}{TP+FN} $$
- Precision — "Of everything I flagged, how much was right?" Punishes false alarms (FP).
- Recall (sensitivity) — "Of everything I should have caught, how much did I?" Punishes misses (FN).
precision = TP / (TP + FP)
recall = TP / (TP + FN)
f1 = 2 * precision * recall / (precision + recall)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
print(f"accuracy={(y_pred==y_true).mean():.3f}")
Output:
precision=0.667 recall=0.500 f1=0.571
accuracy=0.700
Don't be confused: precision vs. recall — and which to optimize. They trade off. Spam filter: a false positive deletes a real email — you want high precision (be sure before flagging). Cancer screening: a false negative misses a sick patient — you want high recall (never miss one, tolerate false alarms). Decide which error is more expensive in your domain, then optimize that metric. You can always trade one for the other by moving the decision threshold.
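The threshold trade-off can be sketched as follows. The `scores` array is hypothetical, made up so that a 0.5 cutoff gives the same predictions as the example above; `precision_recall_at` is an illustrative helper, and returning 0.0 for an empty denominator is one common convention, not the only one.

```python
import numpy as np

# Same labels as the running example; the scores are hypothetical.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
scores = np.array([0.9, 0.8, 0.45, 0.3, 0.6, 0.4, 0.35, 0.2, 0.1, 0.05])

def precision_recall_at(threshold):
    """Precision and recall when everything scoring >= threshold is flagged."""
    y_pred = (scores >= threshold).astype(int)
    tp = int(((y_pred == 1) & (y_true == 1)).sum())
    fp = int(((y_pred == 1) & (y_true == 0)).sum())
    fn = int(((y_pred == 0) & (y_true == 1)).sum())
    # Convention assumed here: report 0.0 when a denominator is zero.
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall

# Raising the threshold makes the model pickier about what it flags.
for t in (0.25, 0.5, 0.85):
    p, r = precision_recall_at(t)
    print(f"threshold={t:.2f} precision={p:.3f} recall={r:.3f}")
```

On this toy data, raising the threshold pushes precision up and recall down; lowering it does the reverse.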
F1: one number when you want balance
The F1 score is the harmonic mean of precision and recall — high only when both are high (a model can't game it by maxing one).
$$ F_1 = 2\cdot\frac{\text{precision}\cdot\text{recall}}{\text{precision}+\text{recall}} $$
Why harmonic, not plain average? The harmonic mean is dragged down by the smaller value. Take a model that flags every transaction on the 5%-fraud data: recall 1.0, precision 0.05. The plain average is 0.525 (looks okay!), but F1 ≈ 0.095 (correctly says "nearly useless"). That's the whole point. For imbalanced data, F1 is a far better headline than accuracy.
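A small sketch comparing the plain average with F1 for a few precision/recall pairs. The helper name `f1_from` is illustrative, and returning 0.0 when both inputs are zero is an assumed convention. The last pair corresponds to a model that flags every transaction on the 5%-fraud data.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

pairs = [
    (0.9, 0.9),    # both high
    (0.667, 0.5),  # roughly the running example
    (0.05, 1.0),   # flag everything when 5% of cases are positive
]

for p, r in pairs:
    plain_mean = (p + r) / 2
    print(f"precision={p:.3f} recall={r:.3f} "
          f"plain mean={plain_mean:.3f} F1={f1_from(p, r):.3f}")
```

The two numbers agree when precision and recall are equal and drift apart as the pair becomes lopsided.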
ROC-AUC: quality across all thresholds
Precision/recall depend on where you set the decision threshold (default 0.5). The ROC curve plots true-positive rate vs. false-positive rate as you sweep the threshold from 0 to 1, and AUC (Area Under the Curve) summarizes it in one number:
- AUC = 1.0 → perfect ranking. AUC = 0.5 → no better than random.
- Interpretation: the probability that the model scores a random positive higher than a random negative.
Use AUC to compare models independent of threshold. (Related: PR-AUC, area under the precision–recall curve, which is more informative when positives are very rare, because ROC-AUC can look flattering when true negatives dominate.)
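The ranking interpretation can be computed directly, as a rough sketch: compare every positive's score with every negative's score and count how often the positive wins. The `scores` array is hypothetical, and counting ties as half a win is the usual convention, assumed here. This brute-force version is for illustration on small data, not a substitute for a library routine.

```python
import numpy as np

# Hypothetical labels and model scores, for illustration only.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
scores = np.array([0.9, 0.8, 0.45, 0.3, 0.6, 0.4, 0.35, 0.2, 0.1, 0.05])

pos = scores[y_true == 1]
neg = scores[y_true == 0]

# Pairwise differences: one row per positive, one column per negative.
diff = pos[:, None] - neg[None, :]

# Fraction of (positive, negative) pairs ranked correctly; ties count half.
auc = ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size
print(f"AUC={auc:.3f}")
```

Here 20 of the 24 positive/negative pairs are ordered correctly, so the sketch prints `AUC=0.833`. No threshold appears anywhere in the calculation.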
Don't be confused: threshold-dependent vs. threshold-free metrics. Precision, recall, F1, and accuracy all depend on the cutoff you choose (0.5 by default): change the threshold and they change. AUC is threshold-free: it measures how well the model ranks positives above negatives, regardless of cutoff. Report AUC to judge the model; pick a threshold afterward to deploy it.
Regression metrics (predicting numbers)
For continuous targets, the standard trio:
yt = np.array([3.0, -0.5, 2.0, 7.0])
yp = np.array([2.5, 0.0, 2.0, 8.0])
mae = np.mean(np.abs(yp - yt))
rmse = np.sqrt(np.mean((yp - yt) ** 2))
r2 = 1 - np.sum((yt-yp)**2) / np.sum((yt - yt.mean())**2)
print(f"MAE={mae:.3f} RMSE={rmse:.3f} R2={r2:.3f}")
Output:
MAE=0.500 RMSE=0.612 R2=0.949
- MAE — average absolute error, in the target's own units. Robust, easy to explain ("off by $0.50 on average").
- RMSE — root mean squared error; same units, but penalizes big misses more (it's the square root of MSE). Use when large errors are especially bad.
- R² (coefficient of determination) — fraction of variance explained, on a 0–1 scale (can go negative for a terrible model). R² = 0.949 means the model explains 95% of the variation. R² = 0 means "no better than always guessing the mean."
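To check the two reference points for R², here is a small sketch that scores an always-guess-the-mean baseline and a deliberately poor set of predictions against the same targets. The helper `r2` and the `bad_guess` array are illustrative assumptions, not from the original.

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot

yt = np.array([3.0, -0.5, 2.0, 7.0])

mean_guess = np.full_like(yt, yt.mean())     # always predict the mean
bad_guess = np.array([7.0, 2.0, -0.5, 3.0])  # hypothetical, deliberately poor

print(f"mean baseline R2={r2(yt, mean_guess):.3f}")
print(f"bad model     R2={r2(yt, bad_guess):.3f}")
```

The mean baseline lands at zero, and predictions worse than the mean push R² below zero.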
Don't be confused: RMSE vs. MAE (again, but for reporting). RMSE ≥ MAE always, and the gap between them signals the presence of large outlier errors. If RMSE ≫ MAE, a few predictions are badly wrong. Reporting both is more informative than either alone.
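A quick sketch of that gap, using two made-up error vectors with the same MAE: one spreads the error evenly, the other concentrates it in a single big miss. The arrays and helper names are illustrative.

```python
import numpy as np

def mae_of(errors):
    return float(np.mean(np.abs(errors)))

def rmse_of(errors):
    return float(np.sqrt(np.mean(errors ** 2)))

# Hypothetical prediction errors (prediction minus truth).
even_errors = np.array([1.0, 1.0, 1.0, 1.0])   # every prediction off by 1
one_big_miss = np.array([0.0, 0.0, 0.0, 4.0])  # three perfect, one off by 4

for name, e in [("even", even_errors), ("one big miss", one_big_miss)]:
    print(f"{name:>12}: MAE={mae_of(e):.3f} RMSE={rmse_of(e):.3f}")
```

Both cases have MAE 1.0, but RMSE is 1.0 for the even errors and 2.0 for the single large miss, so the gap between the two metrics is what exposes the outlier.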
A metric-picking cheat sheet
| Situation | Use |
|---|---|
| Balanced classification | accuracy is fine |
| Imbalanced classification | F1, precision/recall, PR-AUC — not accuracy |
| Compare classifiers, any threshold | ROC-AUC |
| Cost of FP ≠ cost of FN | tune the threshold, optimize the costly metric |
| Regression, explainable | MAE |
| Regression, punish big errors | RMSE |
| Regression, "how much variance explained" | R² |
| Ranking / recommendations | Precision@k, Recall@k, NDCG, MAP (see recsys book) |
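For the "cost of FP ≠ cost of FN" row, one possible way to tune the threshold is to assign each error type a cost and pick the cutoff with the lowest total. This is a sketch under stated assumptions: the scores, `COST_FP`, and `COST_FN` are all hypothetical values chosen for illustration, and in practice the threshold should be chosen on held-out data.

```python
import numpy as np

# Hypothetical labels and model scores.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
scores = np.array([0.9, 0.8, 0.45, 0.3, 0.6, 0.4, 0.35, 0.2, 0.1, 0.05])

COST_FP = 1.0   # assumed cost of one false alarm
COST_FN = 10.0  # assumed cost of one miss (10x worse, for illustration)

def total_cost(threshold):
    """Total cost of errors when everything scoring >= threshold is flagged."""
    y_pred = (scores >= threshold).astype(int)
    fp = int(((y_pred == 1) & (y_true == 0)).sum())
    fn = int(((y_pred == 0) & (y_true == 1)).sum())
    return COST_FP * fp + COST_FN * fn

# Try each observed score as a candidate cutoff and keep the cheapest.
candidates = np.unique(scores)
costs = [total_cost(t) for t in candidates]
best = float(candidates[int(np.argmin(costs))])

print(f"best threshold={best:.2f} total cost={total_cost(best):.1f}")
```

Because misses are assumed to be ten times as expensive as false alarms, the cheapest cutoff on this toy data is the lowest one that still catches every positive.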
The one rule above all
Pick your metric before you train, and match it to the real-world cost of each error. The metric is a proxy for what you actually care about — choose it carelessly and you'll cheerfully optimize the wrong thing all the way to production.
The takeaway
Accuracy lies on imbalanced data. The confusion matrix (TP/FP/FN/TN) generates precision (trust your alarms), recall (catch everything), and F1 (balance both). AUC judges ranking across thresholds. For regression use MAE/RMSE/R². Choose the metric by which mistake hurts most. We've now covered all of classical ML's machinery — time to stack linear units into something far more powerful: a neural network, built by hand. 👉