Numerical gotchas & "don't be confused"
Computers don't do real arithmetic — they do finite-precision floating-point
arithmetic, and the gap bites AI code constantly: a softmax returns NaN, a loss
goes inf, two "equal" numbers aren't, a model trains fine then mysteriously
explodes. This chapter is the field guide to those landmines, plus a consolidated
"don't be confused" glossary of the near-synonyms that trip everyone up. Knowing
these is the difference between "the model is broken" and "oh, that's just float32."
Floats are approximate — stop testing them with ==
import numpy as np
print("0.1 + 0.2 == 0.3 :", 0.1 + 0.2 == 0.3)
print("0.1 + 0.2 =", repr(0.1 + 0.2))
print("np.isclose :", np.isclose(0.1 + 0.2, 0.3))
Output:
0.1 + 0.2 == 0.3 : False
0.1 + 0.2 = 0.30000000000000004
np.isclose : True
0.1, 0.2, and 0.3 can't be represented exactly in binary, so the rounded values
don't add up exactly. Never compare computed floats with ==. Use np.isclose
(scalars, elementwise) or np.allclose (whole arrays) with a tolerance. This is a
common reason "my test passes locally but fails in CI": different hardware and
library builds can round the last bit differently.
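To make the tolerance concrete, here is a minimal sketch of how the comparison behaves. np.isclose checks |x - y| <= atol + rtol * |y|, and its default atol of 1e-8 can hide real differences between values near zero. The array values below are illustrative, not taken from the chapter.

```python
import math
import numpy as np

a = 0.1 + 0.2
b = 0.3

# Exact comparison fails because of rounding in the last bit.
exact_equal = (a == b)

# Standard-library scalar comparison with a relative tolerance.
close_scalar = math.isclose(a, b, rel_tol=1e-9)

# NumPy comparison: |x - y| <= atol + rtol * |y|
x = np.array([0.1 + 0.2, 1e-10])
y = np.array([0.3, 0.0])
close_default = np.isclose(x, y)           # default atol=1e-8 treats 1e-10 as equal to 0
close_strict = np.isclose(x, y, atol=0.0)  # purely relative: 1e-10 is not close to 0

print("==            :", exact_equal)
print("math.isclose  :", close_scalar)
print("isclose       :", close_default)
print("isclose atol=0:", close_strict)
```

When the values being compared are themselves tiny (small gradients, probabilities near zero), it is worth choosing atol and rtol deliberately rather than relying on the defaults.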
Softmax overflow — and the trick that fixes it
exp overflows fast: exp(1000) is larger than any float can hold (→ inf), and
inf / inf is NaN. A naive softmax detonates on large logits:
def naive_softmax(z): e = np.exp(z); return e / e.sum()
def stable_softmax(z): z = z - z.max(); e = np.exp(z); return e / e.sum()
big = np.array([1000., 1001., 1002.])
print("naive :", naive_softmax(big))
print("stable:", np.round(stable_softmax(big), 3))
Output:
naive : [nan nan nan]
stable: [0.09 0.245 0.665]
The fix — subtract the max before exponentiating — is mathematically a no-op
(softmax is shift-invariant: $\text{softmax}(z) = \text{softmax}(z - c)$) but
numerically it's the whole game: now the largest exponent is exp(0) = 1, no
overflow. Mainstream libraries implement softmax this way. It's also why you hand
raw logits to CrossEntropyLoss (Chapter 12): the framework applies the stable
version internally.
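In practice logits arrive in batches, one row per example. The sketch below extends the same max-subtraction trick to a 2-D array and checks shift invariance numerically. The function name softmax_rows and the sample logits are illustrative assumptions, not code from a library.

```python
import numpy as np

def softmax_rows(logits):
    """Stable softmax over the last axis (one distribution per row)."""
    # keepdims=True keeps shape (n, 1) so the subtraction broadcasts per row.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

logits = np.array([[1000., 1001., 1002.],
                   [-5., 0., 5.]])

probs = softmax_rows(logits)
shifted_probs = softmax_rows(logits + 123.0)  # adding a constant should change nothing

print(np.round(probs, 3))
print("rows sum to 1  :", np.allclose(probs.sum(axis=-1), 1.0))
print("shift-invariant:", np.allclose(probs, shifted_probs))
```

Taking the max per row (rather than over the whole batch) matters: a single global max would still be safe from overflow, but rows with much smaller logits could underflow to all zeros.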
log(0) = -inf, and the + eps habit
Cross-entropy takes log(p). If the model ever predicts exactly p = 0 for the true
class, you get -inf, and gradients become NaN:
print("log(0):", np.log(0.0))
Output:
log(0): -inf
The defensive habit you saw throughout this book — np.log(p + 1e-9) or clipping p
to [eps, 1-eps] — exists precisely to dodge this. (Better still: use the fused
BCEWithLogitsLoss / CrossEntropyLoss, which are built to be safe.)
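The clipping habit can be sketched as follows. The helper nll and its eps default are illustrative assumptions. The last line also shows one caveat that is easy to miss: an eps of 1e-9 is smaller than float32 can resolve next to 1.0, so the upper clip bound only protects you in float64.

```python
import numpy as np

def nll(p_true, eps=1e-9):
    """Mean negative log-likelihood of the true-class probabilities, with clipping."""
    p_safe = np.clip(p_true, eps, 1.0 - eps)  # keep p strictly inside (0, 1)
    return -np.mean(np.log(p_safe))

# Probabilities the model assigned to the true class; the first is exactly 0.
p_true = np.array([0.0, 1.0, 0.8])

with np.errstate(divide="ignore"):            # silence the log(0) warning for the demo
    unsafe_loss = -np.mean(np.log(p_true))    # log(0) = -inf, so the loss is inf

safe_loss = nll(p_true)

# In float32, 1 - 1e-9 rounds back to exactly 1.0.
eps_vanishes_in_float32 = (np.float32(1.0) - np.float32(1e-9)) == np.float32(1.0)

print("unsafe loss:", unsafe_loss)
print("safe loss  :", round(float(safe_loss), 4))
print("1 - 1e-9 == 1 in float32:", bool(eps_vanishes_in_float32))
```

If your probabilities are float32, a larger eps (around 1e-7) is a safer choice for the upper bound, which is one more reason to prefer losses that work directly from logits.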
Log-sum-exp: the pattern behind all of the above
When you must compute $\log\sum_i e^{z_i}$ (it appears in softmax, cross-entropy, and probabilistic models), the same overflow lurks. The log-sum-exp trick factors out the max:
$$ \log\sum_i e^{z_i} = m + \log\sum_i e^{z_i - m}, \qquad m = \max_i z_i $$
def logsumexp(z):
    m = z.max()
    return m + np.log(np.exp(z - m).sum())
z = np.array([1000., 1001.])
print("logsumexp (stable):", round(logsumexp(z), 3))
print("naive log(sum(exp)):", np.log(np.exp(z).sum()))
Output:
logsumexp (stable): 1001.313
naive log(sum(exp)): inf
The stable version gives the right answer (1001.313); the naive one overflows to
inf. Recognize this pattern — pull out the max, then exponentiate the
differences — and you'll understand the internals of half the loss functions in
PyTorch.
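To see why this pattern sits inside loss functions, note that $\log \text{softmax}(z)_i = z_i - \log\sum_j e^{z_j}$, so cross-entropy can be computed from logits without ever forming a probability. The sketch below illustrates that idea in NumPy. The helper names log_softmax and cross_entropy_from_logits are assumptions for illustration, not PyTorch's actual implementation.

```python
import numpy as np

def logsumexp(z):
    m = z.max()
    return m + np.log(np.exp(z - m).sum())

def log_softmax(z):
    # log(softmax(z)_i) = z_i - logsumexp(z); no division, no log of a tiny number
    return z - logsumexp(z)

def cross_entropy_from_logits(z, true_class):
    # Negative log-probability of the true class, computed in log space.
    return -log_softmax(z)[true_class]

z = np.array([1000., 1001., 1002.])
loss = cross_entropy_from_logits(z, true_class=0)

print("log-softmax:", np.round(log_softmax(z), 3))
print("loss       :", round(float(loss), 3))
print("probs sum  :", np.exp(log_softmax(z)).sum())
```

Compare this with computing softmax first and then taking its log: if a probability underflows to 0, the log becomes -inf, whereas the log-space version stays finite.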
NaN is contagious
A single NaN poisons everything it touches — one bad value silently turns your
whole loss into NaN:
a = np.array([1., np.nan, 3.])
print("mean :", np.mean(a))
print("nanmean:", np.nanmean(a))
Output:
mean : nan
nanmean: 2.0
If your loss suddenly becomes NaN mid-training, the usual suspects are: learning
rate too high (exploding gradients — Chapter 8),
log(0) / divide-by-0, or a NaN already in your input data. Use np.isnan(x).any()
to hunt it down, and nanmean/nansum when missing values are expected.
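A minimal sketch of that hunt on a small feature matrix follows. The array X is made-up data. np.isfinite is used alongside np.isnan because inf values usually turn into NaN a few operations later.

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [5.0, np.inf]])

has_nan = bool(np.isnan(X).any())           # is there any NaN at all?
nan_positions = np.argwhere(np.isnan(X))    # (row, col) index of each NaN
bad_rows = ~np.isfinite(X).all(axis=1)      # rows containing NaN or +/-inf
clean = X[~bad_rows]                        # keep only fully finite rows
first_col_mean = np.nanmean(X[:, 0])        # mean that skips the NaN

print("has NaN       :", has_nan)
print("NaN positions :", nan_positions.tolist())
print("bad rows      :", bad_rows.tolist())
print("clean rows    :", clean.tolist())
print("nanmean col 0 :", first_col_mean)
```

Running a check like this on inputs before training, and on the loss after each step, narrows down whether the NaN came from the data or from the optimization.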
A few more silent traps
- Integer division. np.array([1,2,3]) / 2 is fine (NumPy promotes to float), but integer tensors and // can silently truncate. Keep features in float32.
- In-place ops. x += 1 modifies the array other variables may share. With autograd, in-place edits on tensors that need gradients can raise errors; make a copy.
- Broadcasting surprises. Adding shape (n,) to (n,1) broadcasts to (n,n): a silent memory blowup, not an error (Chapter 2).
- float32 vs float64 mismatch at the NumPy↔PyTorch boundary (Chapter 12).
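Each of the traps in the list above can be reproduced in a few lines of NumPy. This is a sketch with made-up arrays. The autograd error is specific to PyTorch and is not shown, and the dtype example covers only the NumPy side of the boundary.

```python
import numpy as np

# Integer division: / promotes to float, // floors and keeps the integer dtype.
ints = np.array([1, 2, 3])
true_div = ints / 2       # [0.5, 1.0, 1.5], float64
floor_div = ints // 2     # [0, 1, 1], still integers

# In-place ops: `alias` is the same array as `x`, so += changes both.
x = np.zeros(3)
alias = x
x += 1
y = np.zeros(3)
snapshot = y.copy()       # an independent copy is unaffected
y += 1

# Broadcasting: (n,) + (n, 1) silently becomes (n, n).
n = 4
row = np.ones(n)          # shape (4,)
col = np.ones((n, 1))     # shape (4, 1)
summed = row + col        # shape (4, 4), probably not what was intended
fixed = row + col.ravel() # flatten first to get shape (4,)

# dtype mismatch: NumPy defaults to float64; mixing upcasts float32 data.
f32 = np.ones(3, dtype=np.float32)
f64 = np.ones(3)
mixed = f32 + f64

print(true_div, floor_div)
print(alias, snapshot)
print(summed.shape, fixed.shape)
print(f32.dtype, f64.dtype, mixed.dtype)
```

Printing shapes and dtypes at the boundaries of a pipeline is a cheap way to catch the last two before they cost memory or trigger a type error.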
The consolidated "don't be confused" glossary
Every near-synonym pair this book untangled, in one place:
| These sound alike but… | The distinction |
|---|---|
| parameter vs. hyperparameter | learned by the model vs. set by you |
| loss vs. metric | what you optimize (smooth) vs. what you report (any) |
| logit vs. probability | pre-sigmoid score $(-\infty,\infty)$ vs. post $(0,1)$ |
| sigmoid vs. softmax | one independent prob vs. competing probs summing to 1 |
| normalization vs. standardization | to unit length / [0,1] vs. mean-0/std-1 |
| L1 vs. L2 | absolute (sparse, robust) vs. squared (smooth, outlier-sensitive) |
| precision vs. recall | trust of alarms vs. coverage of positives |
| validation vs. test set | for tuning (seen often) vs. for reporting (seen once) |
| backprop vs. gradient descent vs. autograd | compute grads vs. apply them vs. automate it |
| cosine vs. dot product | direction only vs. direction + magnitude |
| similarity vs. distance | bigger = closer vs. smaller = closer |
| RAG vs. fine-tuning | adds knowledge vs. adds behavior |
| token vs. word vs. character | model's subword unit vs. word vs. letter |
| * vs. @ | elementwise vs. dot / matrix product |
| axis=0 vs. axis=1 | collapse rows (per-column) vs. collapse cols (per-row) |
| model.train() vs. .eval() | dropout active, batchnorm uses batch stats vs. dropout off, batchnorm uses running stats |
| batch vs. epoch vs. iteration | examples/step vs. full pass vs. one update |
| overfitting vs. underfitting | memorized noise vs. too simple |
| bias (the term) two ways | the +b in w·x+b vs. bias in bias–variance |
| kernel (×5) | SVM similarity / conv filter / GPU fn / OS core / stats window |
| tensor (ML) vs. (physics) | "n-D array" vs. a strict transformation object |
| inference two ways | running a trained model vs. statistical inference |
The meta-lesson: AI vocabulary is full of overloaded words and false twins because the field fused statistics, computer science, and engineering, each bringing its own dialect. When a term feels slippery, it's usually because it is — pin down which of two meanings is in play and the fog lifts.
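A few of the glossary rows are easiest to pin down by running them. The sketch below covers * vs. @, axis=0 vs. axis=1, and cosine vs. dot product on small made-up arrays.

```python
import numpy as np

A = np.array([[1., 2.],
              [3., 4.]])
B = np.array([[10., 20.],
              [30., 40.]])

# * vs. @
elementwise = A * B       # multiplies matching entries
matmul = A @ B            # rows of A dotted with columns of B

# axis=0 vs. axis=1
col_sums = A.sum(axis=0)  # collapses rows: one value per column
row_sums = A.sum(axis=1)  # collapses columns: one value per row

# cosine vs. dot product: same direction, very different magnitude
u = np.array([1., 0.])
v = np.array([10., 0.])
dot = u @ v
cosine = dot / (np.linalg.norm(u) * np.linalg.norm(v))

print(elementwise.tolist(), matmul.tolist())
print(col_sums.tolist(), row_sums.tolist())
print(dot, cosine)
```

The dot product grows with the length of v, while the cosine stays at 1.0 because the two vectors point the same way.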
The takeaway
Floats are approximate (isclose, never ==); exp overflows (subtract the max —
the softmax/log-sum-exp trick); log(0) = -inf (add eps or use fused losses); one
NaN poisons everything (suspect a high learning rate). And when two terms blur
together, consult the glossary above — almost every confusion in AI is a pair of
near-synonyms wearing one coat. That completes the deep-learning spine. What
follows broadens you into a complete, interview-ready engineer: the classical
algorithms (trees, boosting, k-NN), the math and statistics interviews assume, the
tools/cloud/landscape you'll ship with, and a system-design and interview playbook.
First up: the classical ML algorithm zoo that rules tabular data.