Numerical gotchas & "don't be confused"

Computers don't do real arithmetic — they do finite-precision floating-point arithmetic, and the gap bites AI code constantly: a softmax returns NaN, a loss goes inf, two "equal" numbers aren't, a model trains fine then mysteriously explodes. This chapter is the field guide to those landmines, plus a consolidated "don't be confused" glossary of the near-synonyms that trip everyone up. Knowing these is the difference between "the model is broken" and "oh, that's just float32."

Floats are approximate — stop testing them with ==

import numpy as np

print("0.1 + 0.2 == 0.3 :", 0.1 + 0.2 == 0.3)
print("0.1 + 0.2        =", repr(0.1 + 0.2))
print("np.isclose       :", np.isclose(0.1 + 0.2, 0.3))

Output:

0.1 + 0.2 == 0.3 : False
0.1 + 0.2        = 0.30000000000000004
np.isclose       : True

0.1, 0.2, and 0.3 have no exact binary representation, so the rounded sum of the first two differs from the rounded 0.3 in the last bit. Don't compare computed floats with ==. Use np.isclose (scalars) or np.allclose (arrays), which compare within a tolerance. This is also a common reason behind "my test passes locally but fails in CI": different hardware or library builds can round the last bit differently.
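As a minimal sketch of what "with a tolerance" means in practice, the snippet below compares the same values three ways. The variable names are illustrative only; the tolerances quoted in the comments are the documented defaults of np.isclose and math.isclose.

```python
import math
import numpy as np

a = 0.1 + 0.2
b = 0.3

# Exact comparison fails because of the last-bit rounding error.
exact_equal = (a == b)

# np.isclose checks |a - b| <= atol + rtol * |b| (defaults: rtol=1e-5, atol=1e-8).
close_np = bool(np.isclose(a, b))

# math.isclose is the standard-library equivalent (defaults: rel_tol=1e-9, abs_tol=0.0).
close_math = math.isclose(a, b)

# Near zero a purely relative tolerance is useless, so pass an absolute one.
tiny = 1e-12
near_zero_rel = math.isclose(tiny, 0.0)                # no abs_tol given
near_zero_abs = math.isclose(tiny, 0.0, abs_tol=1e-9)

# For whole arrays, np.allclose reduces the elementwise check to a single bool.
x = np.array([0.1, 0.2, 0.3]) * 3
y = np.array([0.3, 0.6, 0.9])
arrays_match = bool(np.allclose(x, y))

print(exact_equal, close_np, close_math, near_zero_rel, near_zero_abs, arrays_match)
# False True True False True True
```

The near-zero case is the one that tends to surprise people: a relative tolerance scales with the size of the numbers, so when one side is 0 you have to supply an absolute tolerance yourself.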

Softmax overflow — and the trick that fixes it

exp overflows fast: exp(1000) is larger than any float can hold (→ inf), and inf / inf is NaN. A naive softmax detonates on large logits:

def naive_softmax(z):
    e = np.exp(z)            # overflows to inf for large logits
    return e / e.sum()

def stable_softmax(z):
    z = z - z.max()          # shift so the largest logit becomes 0
    e = np.exp(z)
    return e / e.sum()

big = np.array([1000., 1001., 1002.])
print("naive :", naive_softmax(big))
print("stable:", np.round(stable_softmax(big), 3))

Output:

naive : [nan nan nan]
stable: [0.09  0.245 0.665]

The fix — subtract the max before exponentiating — is mathematically a no-op (softmax is shift-invariant: $\text{softmax}(z) = \text{softmax}(z - c)$) but numerically it's the whole game: now the largest exponent is exp(0) = 1, no overflow. Every library's softmax does this. It's why you hand logits to CrossEntropyLoss (Chapter 12) — the framework applies the stable version internally.
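The shift-invariance claim is easy to check numerically. The sketch below redefines stable_softmax so it runs on its own, confirms that shifting every logit by a constant leaves the output unchanged, and then asks NumPy where exp stops fitting in each float type. The loop over dtypes is an illustration, not something the chapter prescribes.

```python
import numpy as np

def stable_softmax(z):
    z = z - z.max()          # largest entry becomes 0, so exp() never exceeds 1
    e = np.exp(z)
    return e / e.sum()

# Shift invariance: adding a constant to every logit leaves softmax unchanged.
z = np.array([1.0, 2.0, 3.0])
p = stable_softmax(z)
p_shifted = stable_softmax(z + 1000.0)
print(np.allclose(p, p_shifted))    # True

# Where does exp() overflow? The cutoff is log(largest finite value of the dtype).
for dtype in (np.float32, np.float64):
    limit = float(np.log(np.finfo(dtype).max))
    print(dtype.__name__, "exp overflows above about", round(limit, 2))
```

The cutoffs come out near 88.72 for float32 and 709.78 for float64, so logits do not need to be anywhere near 1000 to break a naive softmax in float32.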

log(0) = -inf, and the + eps habit

Cross-entropy takes log(p). If the model ever predicts exactly p = 0 for the true class, log(p) is -inf, the loss -log(p) is +inf, and the gradients turn into inf or NaN:

print("log(0):", np.log(0.0))

Output:

log(0): -inf

The defensive habit you saw throughout this book — np.log(p + 1e-9) or clipping p to [eps, 1-eps] — exists precisely to dodge this. (Better still: use the fused BCEWithLogitsLoss / CrossEntropyLoss, which are built to be safe.)
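To make the two options concrete, here is a NumPy sketch of both: clipping probabilities before the log, and computing binary cross-entropy straight from logits. The function names bce_clipped and bce_from_logits are made up for this example, and the from-logits formula is a standard algebraic rearrangement written in the spirit of the fused losses, not a copy of any framework's internals.

```python
import numpy as np

def bce_clipped(p, y, eps=1e-7):
    # Clip probabilities into [eps, 1 - eps] so log() never sees 0 or 1.
    p = np.clip(p, eps, 1.0 - eps)
    return -(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

def bce_from_logits(x, y):
    # Algebraically equal to BCE(sigmoid(x), y), rearranged so exp() only
    # sees non-positive arguments: max(x, 0) - x*y + log(1 + exp(-|x|)).
    return np.maximum(x, 0.0) - x * y + np.log1p(np.exp(-np.abs(x)))

y = np.array([1.0, 1.0, 0.0])
logits = np.array([2.0, -800.0, 800.0])    # the last two are confidently wrong

with np.errstate(over="ignore"):           # exp(800) overflows; that is the point
    probs = 1.0 / (1.0 + np.exp(-logits))  # sigmoid saturates to exactly 0.0 and 1.0

loss_clipped = bce_clipped(probs, y)
loss_logits = bce_from_logits(logits, y)
print("clipped    :", loss_clipped)
print("from logits:", loss_logits)
```

Both versions stay finite, but they disagree on the confidently wrong predictions: clipping caps the loss at roughly -log(eps), about 16 here, while the from-logits form reports the true value of 800. That difference is one reason the fused losses are the better default.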

Log-sum-exp: the pattern behind all of the above

When you must compute $\log\sum_i e^{z_i}$ (it appears in softmax, cross-entropy, and probabilistic models), the same overflow lurks. The log-sum-exp trick factors out the max:

$$ \log\sum_i e^{z_i} = m + \log\sum_i e^{z_i - m}, \qquad m = \max_i z_i $$

def logsumexp(z):
    m = z.max()
    return m + np.log(np.exp(z - m).sum())

z = np.array([1000., 1001.])
print("logsumexp (stable):", round(logsumexp(z), 3))
print("naive log(sum(exp)):", np.log(np.exp(z).sum()))

Output:

logsumexp (stable): 1001.313
naive log(sum(exp)): inf

The stable version gives the right answer (1001.313); the naive one overflows to inf. Recognize this pattern — pull out the max, then exponentiate the differences — and you'll understand the internals of half the loss functions in PyTorch.
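As an illustration of how the pattern shows up inside a loss, the sketch below builds log-softmax and a cross-entropy-from-logits on top of the same logsumexp. The helper names log_softmax and cross_entropy_from_logits are chosen for this example; the final line cross-checks against np.logaddexp, NumPy's built-in pairwise version of the same operation.

```python
import numpy as np

def logsumexp(z):
    m = z.max()
    return m + np.log(np.exp(z - m).sum())

def log_softmax(z):
    # log(softmax(z)_i) = z_i - logsumexp(z). No probability is ever formed,
    # so nothing can round to 0 before the log is taken.
    return z - logsumexp(z)

def cross_entropy_from_logits(z, target):
    # Negative log-probability of the true class, computed directly from logits.
    return -log_softmax(z)[target]

z = np.array([1000.0, 1001.0, 1002.0])
loss = cross_entropy_from_logits(z, target=0)
print(round(float(loss), 3))    # 2.408, i.e. -log(0.09) up to rounding

# Cross-check against NumPy's pairwise log(exp(a) + exp(b)), reduced over the array.
matches_builtin = bool(np.isclose(logsumexp(z), np.logaddexp.reduce(z)))
print(matches_builtin)          # True
```

Going through log-softmax rather than softmax followed by log is the design choice that matters: the probability 0.09 never has to exist as a float, so there is no intermediate value that could underflow.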

NaN is contagious

A single NaN poisons everything it touches — one bad value silently turns your whole loss into NaN:

a = np.array([1., np.nan, 3.])
print("mean   :", np.mean(a))
print("nanmean:", np.nanmean(a))

Output:

mean   : nan
nanmean: 2.0

If your loss suddenly becomes NaN mid-training, the usual suspects are: learning rate too high (exploding gradients — Chapter 8), log(0) / divide-by-0, or a NaN already in your input data. Use np.isnan(x).any() to hunt it down, and nanmean/nansum when missing values are expected.
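A small sketch of the hunt itself, assuming you can get at the offending array. The helper find_nonfinite is hypothetical (it is not a NumPy function). It uses np.isfinite so that inf values, which tend to turn into NaN one step later, are caught as well.

```python
import numpy as np

def find_nonfinite(name, x):
    """Hypothetical debugging helper: list the positions of NaN/inf values in x."""
    x = np.asarray(x, dtype=float)
    bad = ~np.isfinite(x)                  # True for NaN, +inf and -inf
    positions = np.argwhere(bad).tolist()  # one index list per bad entry
    if positions:
        print(f"{name}: {len(positions)} non-finite value(s) at {positions}")
    return positions

# NaN is not equal to anything, itself included, so == can never find it.
nan = float("nan")
print(nan == nan)       # False
print(np.isnan(nan))    # True

features = np.array([0.5, np.nan, 2.0, np.inf])
bad_positions = find_nonfinite("features", features)

# inf is one operation away from NaN: inf - inf and 0 * inf both produce it.
print(np.inf - np.inf, 0.0 * np.inf)   # nan nan
```

Calling a check like this on the inputs, the loss, and the gradients at the first bad step usually narrows the source down to one of the suspects listed above.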

A few more silent traps

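  • Integer division. np.array([1,2,3]) / 2 is fine (NumPy promotes to float), but // always floors, and writing a float into an integer array silently drops the fraction. Keep features in float32.
  • In-place ops. x += 1 modifies the array itself, so every other variable that refers to the same array sees the change. With autograd, in-place edits on tensors that need gradients can raise errors; work on a copy instead.
  • Broadcasting surprises. Adding shape (n,) to (n,1) broadcasts to (n,n): no error, just a silently wrong shape and an n² memory footprint (Chapter 2).
  • float32 vs. float64 mismatch at the NumPy↔PyTorch boundary. NumPy creates float64 arrays by default while PyTorch parameters default to float32, so cast explicitly when converting (Chapter 12).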
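Each of the traps above can be reproduced in a few lines. The sketch below sticks to NumPy so it runs without a deep-learning framework; the autograd behaviour is therefore not shown, and the variable names are illustrative.

```python
import numpy as np

# Integer arrays: true division promotes to float, floor division does not,
# and writing a float into an integer array drops the fractional part.
counts = np.array([1, 2, 3])
print((counts / 2).dtype, counts // 2)     # float64 [0 1 1]
counts[0] = 7.9
print(counts)                              # [7 2 3]

# In-place ops: `alias` is the same buffer as `x`, so += changes both.
x = np.zeros(3)
alias = x
x += 1
print(alias)                               # [1. 1. 1.]
safe = x.copy()                            # an independent copy breaks the link
x += 1
print(safe)                                # [1. 1. 1.]

# Broadcasting: (n,) combined with (n, 1) silently becomes (n, n).
n = 4
pred = np.zeros(n)
target = np.zeros((n, 1))
print((pred - target).shape)               # (4, 4)
print((pred - target.ravel()).shape)       # (4,)

# dtype drift: NumPy defaults to float64, so cast before handing data to float32 code.
data = np.array([0.1, 0.2])
print(data.dtype, data.astype(np.float32).dtype)   # float64 float32
```

The broadcasting case is the most dangerous in a loss function: a mean over the (n, n) result still returns a perfectly plausible scalar, so nothing visibly fails. Asserting the expected shapes before reducing is a cheap guard.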

The consolidated "don't be confused" glossary

Every near-synonym pair this book untangled, in one place:

| These sound alike but… | The distinction |
|---|---|
| parameter vs. hyperparameter | learned by the model vs. set by you |
| loss vs. metric | what you optimize (smooth) vs. what you report (any) |
| logit vs. probability | pre-sigmoid score $(-\infty,\infty)$ vs. post $(0,1)$ |
| sigmoid vs. softmax | one independent prob vs. competing probs summing to 1 |
| normalization vs. standardization | to unit length / [0,1] vs. mean-0/std-1 |
| L1 vs. L2 | absolute (sparse, robust) vs. squared (smooth, outlier-sensitive) |
| precision vs. recall | trust of alarms vs. coverage of positives |
| validation vs. test set | for tuning (seen often) vs. for reporting (seen once) |
| backprop vs. gradient descent vs. autograd | compute grads vs. apply them vs. automate it |
| cosine vs. dot product | direction only vs. direction + magnitude |
| similarity vs. distance | bigger = closer vs. smaller = closer |
| RAG vs. fine-tuning | adds knowledge vs. adds behavior |
| token vs. word vs. character | model's subword unit vs. word vs. letter |
| * vs. @ | elementwise vs. dot / matrix product |
| axis=0 vs. axis=1 | collapse rows (per-column) vs. collapse cols (per-row) |
| model.train() vs. .eval() | dropout/batchnorm on vs. off |
| batch vs. epoch vs. iteration | examples/step vs. full pass vs. one update |
| overfitting vs. underfitting | memorized noise vs. too simple |
| bias (the term) two ways | the +b in w·x+b vs. bias in bias–variance |
| kernel (×5) | SVM similarity / conv filter / GPU fn / OS core / stats window |
| tensor (ML) vs. (physics) | "n-D array" vs. a strict transformation object |
| inference two ways | running a trained model vs. statistical inference |
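Three of the code-level pairs in the glossary (* vs. @, axis=0 vs. axis=1, cosine vs. dot product) can be pinned down with a few lines of NumPy. The arrays below are arbitrary illustrative values.

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
B = np.array([[10.0, 20.0],
              [30.0, 40.0]])

# * multiplies matching entries; @ dots rows of A with columns of B.
elementwise = A * B
matmul = A @ B
print(elementwise.tolist())    # [[10.0, 40.0], [90.0, 160.0]]
print(matmul.tolist())         # [[70.0, 100.0], [150.0, 220.0]]

# axis=0 collapses the rows (one value per column); axis=1 collapses the columns.
col_sums = A.sum(axis=0)
row_sums = A.sum(axis=1)
print(col_sums.tolist(), row_sums.tolist())    # [4.0, 6.0] [3.0, 7.0]

# Dot product grows with vector length; cosine only sees direction.
u = np.array([1.0, 2.0])
v = 10.0 * u                   # same direction, ten times the length
dot = float(u @ v)
cosine = dot / float(np.linalg.norm(u) * np.linalg.norm(v))
print(dot)                     # 50.0
print(np.isclose(cosine, 1.0)) # True
```

Scaling v changes the dot product but leaves the cosine at 1, which is the "direction only vs. direction + magnitude" row in action.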

The meta-lesson: AI vocabulary is full of overloaded words and false twins because the field fused statistics, computer science, and engineering, each bringing its own dialect. When a term feels slippery, it's usually because it is — pin down which of two meanings is in play and the fog lifts.

The takeaway

Floats are approximate (isclose, never ==); exp overflows (subtract the max: the softmax/log-sum-exp trick); log(0) = -inf (add eps or use fused losses); one NaN poisons everything (suspect a high learning rate). And when two terms blur together, consult the glossary above: almost every confusion in AI is a pair of near-synonyms wearing one coat.

That completes the deep-learning spine. What follows broadens you into a complete, interview-ready engineer: the classical algorithms (trees, boosting, k-NN), the math and statistics interviews assume, the tools/cloud/landscape you'll ship with, and a system-design and interview playbook. First up: the classical ML algorithm zoo that rules tabular data.