Evaluation metrics

A model is only as trustworthy as the number you judge it by — and accuracy, the metric everyone reaches for first, is often a trap. This chapter gives you the metrics that actually tell the truth, when to use each, and the confusions that make people ship broken models thinking they're great.

Why accuracy lies

Accuracy = fraction of predictions that are correct. Sounds perfect. Now imagine a fraud detector where 5% of transactions are fraud. A model that predicts "not fraud" for everything — a model that does literally nothing — scores:

import numpy as np
y = np.array([1]*5 + [0]*95)         # 5 fraud, 95 legit
predict_all_negative = np.zeros(100, dtype=int)
print("accuracy:", (predict_all_negative == y).mean())

Output:

accuracy: 0.95

95% accuracy for a useless model. On imbalanced data (fraud, disease, click-through, anomalies), accuracy is actively misleading. You need metrics that look at which errors happen.
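To see what that 95% hides, count the outcomes by kind instead of averaging them together. This is a minimal sketch in plain Python on the same 5-fraud, 95-legit example; the names `caught` and `missed` are illustrative and not part of the chapter's code.

```python
# Same toy data as above: 5 fraud (1) and 95 legitimate (0) transactions.
y = [1] * 5 + [0] * 95
predict_all_negative = [0] * 100

# Count fraud cases separately from the overall hit rate.
caught = sum(1 for t, p in zip(y, predict_all_negative) if t == 1 and p == 1)
missed = sum(1 for t, p in zip(y, predict_all_negative) if t == 1 and p == 0)
accuracy = sum(1 for t, p in zip(y, predict_all_negative) if t == p) / len(y)

print(f"accuracy={accuracy:.2f}  fraud caught={caught}  fraud missed={missed}")
```

The model scores 0.95 on accuracy while catching 0 of the 5 fraud cases, and the single accuracy number gives no hint of that.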

The confusion matrix: the source of all classification metrics

Every binary prediction falls into one of four buckets. The 2×2 grid of their counts is the confusion matrix, and every other metric is just arithmetic on these four numbers:

	Predicted Positive	Predicted Negative
Actually Positive	True Positive (TP)	False Negative (FN)
Actually Negative	False Positive (FP)	True Negative (TN)

TP — caught a real positive ✅
FP — false alarm (predicted positive, was negative) — a Type I error
FN — missed it (predicted negative, was positive) — a Type II error
TN — correctly left a negative alone ✅

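# Four actual positives followed by six actual negatives
y_true = np.array([1,1,1,1, 0,0,0,0,0,0])
y_pred = np.array([1,1,0,0, 1,0,0,0,0,0])
# Count each cell of the confusion matrix with boolean masks
TP = int(((y_pred==1) & (y_true==1)).sum())   # predicted 1, actually 1
FP = int(((y_pred==1) & (y_true==0)).sum())   # predicted 1, actually 0
FN = int(((y_pred==0) & (y_true==1)).sum())   # predicted 0, actually 1
TN = int(((y_pred==0) & (y_true==0)).sum())   # predicted 0, actually 0
print(f"TP={TP} FP={FP} FN={FN} TN={TN}")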

Output:

TP=2 FP=1 FN=2 TN=5

Precision and recall: the two questions that matter

From those four numbers come the two metrics you'll quote constantly:

$$ \text{Precision} = \frac{TP}{TP+FP} \qquad \text{Recall} = \frac{TP}{TP+FN} $$

Precision — "Of everything I flagged, how much was right?" Punishes false alarms (FP).
Recall (sensitivity) — "Of everything I should have caught, how much did I?" Punishes misses (FN).

precision = TP / (TP + FP)
recall    = TP / (TP + FN)
f1        = 2 * precision * recall / (precision + recall)
print(f"precision={precision:.3f}  recall={recall:.3f}  f1={f1:.3f}")
print(f"accuracy={(y_pred==y_true).mean():.3f}")

Output:

precision=0.667  recall=0.500  f1=0.571
accuracy=0.700

Don't be confused: precision vs. recall — and which to optimize. They trade off. Spam filter: a false positive deletes a real email — you want high precision (be sure before flagging). Cancer screening: a false negative misses a sick patient — you want high recall (never miss one, tolerate false alarms). Decide which error is more expensive in your domain, then optimize that metric. You can always trade one for the other by moving the decision threshold.
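The trade-off is easiest to see by sweeping the threshold over a model's scores. The sketch below uses made-up scores for 4 positives and 6 negatives; the helper `precision_recall` and the convention that precision is 1.0 when nothing is flagged are assumptions for illustration.

```python
def precision_recall(y_true, scores, threshold):
    """Precision and recall when every score >= threshold is flagged positive."""
    tp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 1)
    fp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 0)
    fn = sum(1 for t, s in zip(y_true, scores) if s < threshold and t == 1)
    # Assumed convention: with nothing flagged, precision is reported as 1.0.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical model scores, invented for this example.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.45, 0.3, 0.6, 0.4, 0.35, 0.2, 0.1, 0.05]

results = {}
for threshold in (0.25, 0.5, 0.85):
    results[threshold] = precision_recall(y_true, scores, threshold)
    p, r = results[threshold]
    print(f"threshold={threshold:.2f}  precision={p:.3f}  recall={r:.3f}")
```

With these scores, raising the threshold from 0.25 to 0.5 to 0.85 moves precision from 0.571 to 0.667 to 1.000 while recall falls from 1.000 to 0.500 to 0.250.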

F1: one number when you want balance

The F1 score is the harmonic mean of precision and recall — high only when both are high (a model can't game it by maxing one).

$$ F_1 = 2\cdot\frac{\text{precision}\cdot\text{recall}}{\text{precision}+\text{recall}} $$

Why harmonic, not plain average? The harmonic mean is dragged down by the smaller value. Precision 1.0, recall 0.0 → plain average 0.5 (looks okay!), but F1 = 0.0 (correctly says "useless"). That's the whole point. For imbalanced data, F1 is a far better headline than accuracy.
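A quick comparison makes the difference concrete. This sketch sets the plain average next to F1 for a few precision/recall pairs; the pairs are arbitrary, and returning 0.0 when both inputs are zero is an assumed convention.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Arbitrary (precision, recall) pairs, from lopsided to balanced.
pairs = [(1.0, 0.0), (0.9, 0.1), (0.5, 0.5), (0.8, 0.7)]
for precision, recall in pairs:
    plain = (precision + recall) / 2
    f1 = f1_score(precision, recall)
    print(f"P={precision:.1f} R={recall:.1f}  average={plain:.3f}  F1={f1:.3f}")
```

The pairs (1.0, 0.0), (0.9, 0.1), and (0.5, 0.5) all average to 0.5, but their F1 scores are 0.0, 0.18, and 0.5. Only the balanced pair keeps its score.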

ROC-AUC: quality across all thresholds

Precision/recall depend on where you set the decision threshold (default 0.5). The ROC curve plots true-positive rate vs. false-positive rate as you sweep the threshold from 0 to 1, and AUC (Area Under the Curve) summarizes it in one number:

AUC = 1.0 → perfect ranking. AUC = 0.5 → no better than random.
Interpretation: the probability that the model scores a random positive higher than a random negative.

Use AUC to compare models independent of threshold, especially on imbalanced data. (Related: PR-AUC, area under the precision–recall curve, which is even more informative when positives are very rare.)
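The ranking interpretation translates directly into code: compare every positive against every negative and count how often the positive wins. The sketch below does exactly that on made-up scores. The function name `pairwise_auc` is illustrative, ties count as half a win (a common convention, assumed here), and the double loop is only practical for small data.

```python
def pairwise_auc(y_true, scores):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # tie counts as half (assumed convention)
    return wins / (len(pos) * len(neg))

# Hypothetical scores for 4 positives and 6 negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.45, 0.3, 0.6, 0.4, 0.35, 0.2, 0.1, 0.05]
auc = pairwise_auc(y_true, scores)
print(f"AUC={auc:.3f}")
```

Here 20 of the 24 positive/negative pairs are ordered correctly, so the AUC is 0.833.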

Don't be confused: threshold-dependent vs. threshold-free metrics. Precision, recall, F1, and accuracy all depend on the decision cutoff (0.5 by default): change the threshold and they change. AUC is threshold-free: it measures how well the model ranks positives above negatives, regardless of cutoff. Report AUC to judge the model; pick a threshold afterward to deploy it.
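One way to check the distinction is to rescale the scores without changing their order. In this sketch the square root is an arbitrary order-preserving transform, and the scores and helper names are made up for illustration. AUC depends only on the ordering, while recall at a fixed 0.5 cutoff does not.

```python
import math

def auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def recall_at(y_true, scores, threshold):
    """Recall when every score >= threshold is flagged positive."""
    tp = sum(1 for t, s in zip(y_true, scores) if t == 1 and s >= threshold)
    return tp / sum(y_true)

# Hypothetical scores, then the same scores pushed upward by a square root.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.45, 0.3, 0.6, 0.4, 0.35, 0.2, 0.1, 0.05]
stretched = [math.sqrt(s) for s in scores]   # same ordering, different values

print(f"AUC:           {auc(y_true, scores):.3f} -> {auc(y_true, stretched):.3f}")
print(f"recall at 0.5: {recall_at(y_true, scores, 0.5):.3f} -> "
      f"{recall_at(y_true, stretched, 0.5):.3f}")
```

AUC stays at 0.833 because the ordering is unchanged, while recall at the 0.5 cutoff jumps from 0.500 to 1.000 because more scores now clear the bar.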

Regression metrics (predicting numbers)

For continuous targets, the standard trio:

yt = np.array([3.0, -0.5, 2.0, 7.0])
yp = np.array([2.5,  0.0, 2.0, 8.0])
mae  = np.mean(np.abs(yp - yt))
rmse = np.sqrt(np.mean((yp - yt) ** 2))
r2   = 1 - np.sum((yt-yp)**2) / np.sum((yt - yt.mean())**2)
print(f"MAE={mae:.3f}  RMSE={rmse:.3f}  R2={r2:.3f}")

Output:

MAE=0.500  RMSE=0.612  R2=0.949

MAE — average absolute error, in the target's own units. Robust, easy to explain ("off by $0.50 on average").
RMSE — root mean squared error; same units, but penalizes big misses more (it's the square root of MSE). Use when large errors are especially bad.
R² (coefficient of determination) — fraction of variance explained. It is at most 1, usually lands between 0 and 1, and goes negative when the model does worse than always guessing the mean. R² = 0.949 means the model explains about 95% of the variation. R² = 0 means "no better than always guessing the mean."
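The reference points for R² can be checked by hand. The sketch below reuses the chapter's four targets and scores three predictors: the chapter's predictions, a constant guess at the mean, and a deliberately bad constant guess of 10.0 (a hypothetical value chosen for illustration). The helper `r2_score` is written out here in plain Python and assumes the targets are not all identical.

```python
def r2_score(y_true, y_pred):
    """1 - SS_res / SS_tot. Assumes y_true is not constant (SS_tot > 0)."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

yt = [3.0, -0.5, 2.0, 7.0]
model_preds = [2.5, 0.0, 2.0, 8.0]            # the chapter's predictions
mean_guess = [sum(yt) / len(yt)] * len(yt)    # always predict the mean
bad_guess = [10.0] * len(yt)                  # hypothetical terrible model

for name, preds in [("model", model_preds), ("mean", mean_guess), ("bad", bad_guess)]:
    print(f"{name:>5}: R2={r2_score(yt, preds):.3f}")
```

The mean predictor lands exactly on 0, the chapter's predictions give 0.949, and the constant guess of 10.0 comes out strongly negative.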

Don't be confused: RMSE vs. MAE (again, but for reporting). RMSE ≥ MAE always, and the gap between them signals the presence of large outlier errors. If RMSE ≫ MAE, a few predictions are badly wrong. Reporting both is more informative than either alone.
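Two sets of errors with the same MAE show how the gap opens up. In this sketch, the error values are invented: one set is evenly spread, the other concentrates the whole miss in a single prediction.

```python
import math

def mae(errors):
    """Mean absolute error from a list of (prediction - target) errors."""
    return sum(abs(e) for e in errors) / len(errors)

def rmse(errors):
    """Root mean squared error from the same list of errors."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))

even_errors = [1.0, -1.0, 1.0, -1.0]   # every prediction off by 1
spiky_errors = [0.0, 0.0, 0.0, 4.0]    # one prediction off by 4

for name, errors in [("even", even_errors), ("spiky", spiky_errors)]:
    print(f"{name:>5}: MAE={mae(errors):.3f}  RMSE={rmse(errors):.3f}")
```

Both sets have MAE = 1.0, but RMSE is 1.0 for the even errors and 2.0 for the spiky ones. The gap appears only when the error is concentrated in a few predictions.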

A metric-picking cheat sheet

Situation	Use
Balanced classification	accuracy is fine
Imbalanced classification	F1, precision/recall, PR-AUC — not accuracy
Compare classifiers, any threshold	ROC-AUC
Cost of FP ≠ cost of FN	tune the threshold, optimize the costly metric
Regression, explainable	MAE
Regression, punish big errors	RMSE
Regression, "how much variance explained"	R²
Ranking / recommendations	Precision@k, Recall@k, NDCG, MAP (see recsys book)
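The "Cost of FP ≠ cost of FN" row can be sketched as a small search: attach a cost to each kind of error and pick the threshold with the lowest total. Everything specific here is an assumption for illustration: the scores, the 1-versus-10 cost ratios, and the helper names `total_cost` and `best_threshold`.

```python
def total_cost(y_true, scores, threshold, cost_fp, cost_fn):
    """Total cost of errors when every score >= threshold is flagged positive."""
    fp = sum(1 for t, s in zip(y_true, scores) if t == 0 and s >= threshold)
    fn = sum(1 for t, s in zip(y_true, scores) if t == 1 and s < threshold)
    return fp * cost_fp + fn * cost_fn

# Hypothetical scores for 4 positives and 6 negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.45, 0.3, 0.6, 0.4, 0.35, 0.2, 0.1, 0.05]

# Candidate thresholds: each distinct score is a point where predictions change.
candidates = sorted(set(scores))

def best_threshold(cost_fp, cost_fn):
    """Candidate threshold with the lowest total cost on this data."""
    return min(candidates,
               key=lambda th: total_cost(y_true, scores, th, cost_fp, cost_fn))

print("misses cost 10x:      ", best_threshold(cost_fp=1, cost_fn=10))
print("false alarms cost 10x:", best_threshold(cost_fp=10, cost_fn=1))
```

When a miss costs ten times a false alarm, the search settles on a low threshold (0.3, catching every positive). With the costs reversed, it settles on a high one (0.8, with no false alarms). In practice the threshold would be chosen on held-out validation data, not the data used for the final report.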

The one rule above all

Pick your metric before you train, and match it to the real-world cost of each error. The metric is a proxy for what you actually care about — choose it carelessly and you'll cheerfully optimize the wrong thing all the way to production.

The takeaway

Accuracy lies on imbalanced data. The confusion matrix (TP/FP/FN/TN) generates precision (trust your alarms), recall (catch everything), and F1 (balance both). AUC judges ranking across thresholds. For regression use MAE/RMSE/R². Choose the metric by which mistake hurts most. We've now covered all of classical ML's machinery — time to stack linear units into something far more powerful: a neural network, built by hand. 👉

AI Foundations in Depth