Loss functions

The loss is the single number that says how wrong the model is right now. Training is nothing but "make this number small." Choosing the right loss for your problem is one of the highest-leverage decisions in ML — the loss defines what "good" means, and the model will optimize exactly what you ask for, for better or worse.

Don't be confused: loss vs. metric. A loss is what the model optimizes during training, so it must be differentiable (at least almost everywhere) with gradients that point somewhere useful. A metric is what you read to judge quality (accuracy, F1, revenue), and it can be non-differentiable. You can't gradient-descend on accuracy: it is piecewise constant, flat almost everywhere and then jumping when a prediction crosses the decision threshold, so its gradient is zero or undefined. Instead you train on cross-entropy (the smooth stand-in) and report accuracy. They usually move together but aren't the same number.
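A small sketch of that difference, using made-up labels and predictions (the arrays and the helper names `accuracy` and `bce` are illustrative assumptions, not part of any library). Both sets of predictions classify every point correctly, so accuracy cannot tell them apart, while cross-entropy still registers the improvement:

```python
import numpy as np

def accuracy(p, y):
    # Threshold at 0.5, then count matches: a step function of p.
    return float(np.mean((p >= 0.5) == (y == 1)))

def bce(p, y):
    # Mean binary cross-entropy: changes smoothly as p changes.
    return float(np.mean(-(y * np.log(p) + (1 - y) * np.log(1 - p))))

y = np.array([1.0, 1.0, 0.0, 0.0])
p_before = np.array([0.55, 0.60, 0.45, 0.40])  # barely on the right side
p_after = np.array([0.90, 0.95, 0.10, 0.05])   # confidently on the right side

for name, p in [("before", p_before), ("after", p_after)]:
    print(f"{name:>6}: accuracy={accuracy(p, y):.2f}  bce={bce(p, y):.3f}")
```

Accuracy reads 1.00 in both rows, so it gives the optimizer nothing to follow; the cross-entropy column is the one that keeps shrinking.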

Regression losses (predicting a number)

Mean Squared Error (MSE / L2)

Average of the squared errors. Squaring punishes big misses far more than small ones, and makes the loss smooth.

$$ \text{MSE} = \frac{1}{n}\sum_i (\hat{y}_i - y_i)^2 $$

import numpy as np
y    = np.array([3.0, -0.5, 2.0, 7.0])
yhat = np.array([2.5,  0.0, 2.0, 8.0])
print("MSE:", round(float(np.mean((yhat - y) ** 2)), 3))

Output:

MSE: 0.375

Mean Absolute Error (MAE / L1)

Average of the absolute errors. Each error counts in proportion to its size rather than its square, so MAE is more robust to outliers than MSE: one wildly wrong point doesn't dominate.

$$ \text{MAE} = \frac{1}{n}\sum_i |\hat{y}_i - y_i| $$

print("MAE:", round(float(np.mean(np.abs(yhat - y))), 3))

Output:

MAE: 0.5

Don't be confused: MSE vs. MAE. MSE squares, so a single huge error dominates the loss and the model bends to chase outliers. The constant prediction that minimizes MSE is the mean of the targets; the one that minimizes MAE is the median, which is why MAE shrugs off outliers. Noisy data with outliers → lean MAE (or Huber loss, which is quadratic like MSE for small errors and linear like MAE in the tails, a compromise between the two).
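To make the mean-versus-median point concrete, here is a minimal sketch that brute-forces the best constant prediction under each loss on a toy dataset with one outlier. The data, the candidate grid, and the `huber` helper (with its `delta=1.0` threshold) are assumptions chosen for illustration:

```python
import numpy as np

def huber(residual, delta=1.0):
    # Quadratic for |r| <= delta, linear beyond it.
    abs_r = np.abs(residual)
    return np.where(abs_r <= delta,
                    0.5 * residual ** 2,
                    delta * (abs_r - 0.5 * delta))

data = np.array([1.0, 2.0, 3.0, 4.0, 100.0])   # one outlier
candidates = np.linspace(0.0, 100.0, 10001)     # constant predictions, step 0.01

# Residual of every candidate against every data point: shape (10001, 5).
residuals = candidates[:, None] - data

best_mse = candidates[np.argmin((residuals ** 2).mean(axis=1))]
best_mae = candidates[np.argmin(np.abs(residuals).mean(axis=1))]
best_hub = candidates[np.argmin(huber(residuals).mean(axis=1))]

print("mean of data  :", data.mean(), "  best constant under MSE  :", round(float(best_mse), 2))
print("median of data:", np.median(data), "  best constant under MAE  :", round(float(best_mae), 2))
print("best constant under Huber:", round(float(best_hub), 2))
```

The MSE minimizer is dragged to the mean (22.0) by the single outlier, while the MAE and Huber minimizers stay with the bulk of the data at 3.0.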

Classification losses (predicting a category)

Binary cross-entropy (log loss)

For yes/no problems. It rewards confident-correct predictions and savagely punishes confident-wrong ones.

$$ \text{BCE} = -\big[\,y\log p + (1-y)\log(1-p)\,\big] $$

Watch how the loss for a true label of 1 explodes as the prediction gets worse:

import numpy as np
y = 1.0
for p in [0.9, 0.5, 0.1]:
    bce = -(y*np.log(p) + (1-y)*np.log(1-p))
    print(f"true=1  pred={p}  loss={bce:.3f}")

Output:

true=1  pred=0.9  loss=0.105
true=1  pred=0.5  loss=0.693
true=1  pred=0.1  loss=2.303

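Predict 0.9 when the truth is 1 → tiny loss (0.105). Predict 0.1 → big loss (2.303). And 0.5 ("I have no idea") → $\ln 2 \approx 0.693$, a number you'll recognize everywhere as "the loss of a coin flip." The penalty is unbounded as the predicted probability of the true class approaches 0, whereas squared error on a probability never exceeds 1. That steep penalty for confident mistakes (confidence must be earned) is why cross-entropy is usually preferred over squared error for classification.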
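One way to see the difference in practice is to compare gradients with respect to the logit $z$ of a sigmoid output $p = \sigma(z)$. The sketch below assumes that sigmoid setup (the helper names `sigmoid` and `logit_grads` are illustrative); the two derivatives used are $p - y$ for binary cross-entropy and $2(p - y)\,p(1 - p)$ for squared error:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logit_grads(z, y):
    # Gradients of each loss with respect to the logit z, where p = sigmoid(z).
    p = sigmoid(z)
    grad_bce = p - y                          # d/dz of -[y log p + (1-y) log(1-p)]
    grad_se = 2.0 * (p - y) * p * (1.0 - p)   # d/dz of (p - y)^2
    return grad_bce, grad_se

# True label is 1; the logits run from "confidently right" to "confidently wrong".
for z in [2.0, 0.0, -2.0, -6.0]:
    g_bce, g_se = logit_grads(z, y=1.0)
    print(f"logit={z:>5}  p={sigmoid(z):.4f}  dBCE/dz={g_bce:+.4f}  dSE/dz={g_se:+.4f}")
```

At a confidently wrong logit such as -6, the cross-entropy gradient stays close to -1, while the squared-error gradient shrinks toward 0 because of the extra $p(1-p)$ factor, so the model gets almost no signal to correct its worst mistakes.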

Categorical cross-entropy

The multi-class version: take the model's softmax probability for the true class and penalize $-\log$ of it. Only the true class's probability matters.

$$ \text{CE} = -\log p_{\text{true class}} $$

probs_good = np.array([0.7, 0.2, 0.1])   # confident in the right class (0)
probs_bad  = np.array([0.1, 0.2, 0.7])   # confident in the WRONG class
print("CE (true class 0), good model:", round(float(-np.log(probs_good[0])), 3))
print("CE (true class 0), bad model :", round(float(-np.log(probs_bad[0])),  3))

Output:

CE (true class 0), good model: 0.357
CE (true class 0), bad model : 2.303

This is the loss behind virtually every classifier and every language model (which is just a classifier over the vocabulary at each position). Related term: perplexity = $e^{\text{CE}}$, where CE is the mean per-token cross-entropy measured in nats (natural log). It is the standard way to report language-model quality.
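A quick sketch of the perplexity calculation, assuming the cross-entropy is averaged per token and uses the natural log. The probabilities in `p_true` are invented for illustration; they stand for what a model assigned to the correct next token at four positions:

```python
import numpy as np

# Hypothetical probabilities assigned to the correct token at four positions.
p_true = np.array([0.5, 0.25, 0.125, 0.5])

mean_ce = float(np.mean(-np.log(p_true)))   # mean cross-entropy, nats per token
perplexity = float(np.exp(mean_ce))

print(f"mean CE    = {mean_ce:.4f} nats/token")
print(f"perplexity = {perplexity:.4f}")

# Sanity check: a model that spreads probability uniformly over V choices
# has perplexity V.
V = 10
uniform_perplexity = float(np.exp(-np.log(1.0 / V)))
print(f"uniform over {V} tokens -> perplexity {uniform_perplexity:.1f}")
```

Perplexity can be read as an effective branching factor: a value of about 3.36 here means the model is, on average, as uncertain as if it were choosing uniformly among roughly 3.4 tokens.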

Don't be confused: "cross-entropy" vs. "log loss" vs. "negative log likelihood" (NLL). For classification these are the same loss under different names. Cross-entropy = NLL of a softmax/sigmoid model = log loss. Don't let the three names fool you into thinking they're three things.
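The equivalence can be checked numerically. The sketch below assumes raw scores (logits) as input and uses the standard max-subtraction trick for a numerically stable log-softmax; the function names `log_softmax` and `cross_entropy` are illustrative, not a specific library's API:

```python
import numpy as np

def log_softmax(logits):
    # Subtracting the max leaves softmax unchanged but avoids overflow in exp.
    shifted = logits - np.max(logits)
    return shifted - np.log(np.sum(np.exp(shifted)))

def cross_entropy(logits, target):
    # Negative log likelihood of the target class under softmax(logits).
    return float(-log_softmax(logits)[target])

logits = np.array([2.0, 1.0, 0.1])
one_hot = np.array([1.0, 0.0, 0.0])          # true class is 0

probs = np.exp(log_softmax(logits))
ce_full = float(-np.sum(one_hot * np.log(probs)))   # cross-entropy against the one-hot target
nll = cross_entropy(logits, 0)                       # NLL of the true class

# Binary log loss for label 1 with p = sigmoid(z) equals 2-class softmax CE on [z, 0].
z = 0.7
log_loss = float(np.log1p(np.exp(-z)))
two_class_ce = cross_entropy(np.array([z, 0.0]), 0)

print(ce_full, nll)
print(log_loss, two_class_ce)
```

Each printed pair agrees up to floating-point rounding: the one-hot sum collapses to the single $-\log p_{\text{true class}}$ term, and the sigmoid case is just softmax over two classes with one logit fixed at 0.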

Hinge loss (the SVM loss)

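Used by support vector machines. It only cares that the correct class beats the others by a margin of at least 1; once you're safely correct, zero loss. In the binary case with labels $y \in \{-1, +1\}$ and raw score $f(x)$, the margin is $y \cdot f(x)$: positive when the sign is right, and larger the further the point sits from the decision boundary.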

$$ \text{hinge} = \max(0,\; 1 - \text{margin}) $$

def hinge(margin): return max(0.0, 1.0 - margin)
for m in [2.0, 1.0, 0.5, -1.0]:
    print(f"margin={m:>4}  hinge={hinge(m)}")

Output:

margin= 2.0  hinge=0.0
margin= 1.0  hinge=0.0
margin= 0.5  hinge=0.5
margin=-1.0  hinge=2.0

Don't be confused: hinge vs. cross-entropy. Once a point is correct with margin ≥ 1, hinge loss is exactly 0, so the model stops caring about it. Cross-entropy is never exactly 0; it always nudges toward more confidence. Hinge → "be right by a margin, then relax"; cross-entropy → "always grow confidence." Hinge gives SVMs their sparse, support-vector behavior.
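The contrast is easy to see side by side when both losses are written as functions of the margin. This sketch assumes the binary setup with labels in {-1, +1}, where cross-entropy on a sigmoid output can be rewritten as $\log(1 + e^{-\text{margin}})$; the labels and scores are invented for illustration:

```python
import numpy as np

def hinge(margin):
    return np.maximum(0.0, 1.0 - margin)

def logistic(margin):
    # Binary cross-entropy expressed via the margin y * f(x), labels in {-1, +1}.
    return np.log1p(np.exp(-margin))

labels = np.array([1, 1, -1, -1])
scores = np.array([3.0, 0.5, -2.0, 0.4])   # hypothetical raw model scores f(x)
margins = labels * scores                   # positive means the sign is correct

for m, h, l in zip(margins, hinge(margins), logistic(margins)):
    print(f"margin={m:>5.1f}  hinge={h:.3f}  cross-entropy={l:.3f}")
```

Points with margin 3.0 and 2.0 contribute nothing under hinge but still carry a small positive cross-entropy, which is the "then relax" versus "always grow confidence" difference in numbers.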

How to choose

Problem	Default loss
Predict a number	MSE (→ MAE/Huber if outliers)
Yes/no	binary cross-entropy (on logits)
Pick 1 of N classes	categorical cross-entropy (softmax)
Multi-label (several can be true)	per-label binary cross-entropy
Max-margin classifier	hinge
Rankings / recommendations	pairwise (BPR, triplet) — see the recsys book
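As a sketch of the "on logits" and "per-label" rows, the function below computes binary cross-entropy directly from raw scores using a commonly used numerically stable rearrangement, $\max(z, 0) - z\,y + \log(1 + e^{-|z|})$, and then averages it over independent labels. The name `bce_with_logits` and the example values are assumptions for illustration, not a particular framework's implementation:

```python
import numpy as np

def bce_with_logits(logits, targets):
    # Equivalent to -[y log sigmoid(z) + (1 - y) log(1 - sigmoid(z))],
    # but never exponentiates a large positive number.
    return (np.maximum(logits, 0.0)
            - logits * targets
            + np.log1p(np.exp(-np.abs(logits))))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical multi-label example: one item, three independent yes/no labels.
logits = np.array([3.0, -1.0, 0.5])
targets = np.array([1.0, 0.0, 1.0])

per_label = bce_with_logits(logits, targets)
multi_label_loss = float(per_label.mean())

# Cross-check against the textbook formula applied to probabilities.
p = sigmoid(logits)
naive = -(targets * np.log(p) + (1 - targets) * np.log(1 - p))

print("per-label loss  :", np.round(per_label, 4))
print("multi-label mean:", round(multi_label_loss, 4))
print("matches naive   :", bool(np.allclose(per_label, naive)))
```

Working from logits avoids taking the log of a probability that has rounded to exactly 0 or 1, which is the usual reason to prefer this form for very confident predictions.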

The takeaway

The loss is the scalar training minimizes, and it encodes your goal: MSE/MAE for numbers, cross-entropy for categories, hinge for margins. Keep loss (smooth, for optimizing) separate from metric (what you report). Now the central question: given a loss, how does the model actually find the parameters that minimize it? Enter gradient descent. 👉

AI Foundations in Depth