Probability & statistics for ML
Machine learning is applied statistics wearing a hoodie. Loss functions are likelihoods, regularization is a prior, evaluation is hypothesis testing, and "data scientist" interviews are half probability puzzles. This chapter is the statistical backbone — distributions, Bayes' theorem, expectation, the Central Limit Theorem, MLE, and hypothesis testing — with the intuition and the code, not the proofs.
The vocabulary of uncertainty
- Random variable — a quantity whose value is uncertain (a die roll, tomorrow's clicks).
- Probability distribution — how likely each value is. Discrete (countable outcomes) uses a probability mass function; continuous uses a probability density function.
- Expectation $E[X]$ — the long-run average. Variance $\text{Var}(X)$ — how spread out values are; standard deviation is its square root.
import numpy as np
x = np.arange(1, 7); p = np.ones(6) / 6 # a fair die
mean = (x * p).sum()                                   # E[X] = sum of value * probability
print("E[X] =", mean)
print("Var =", round(((x - mean)**2 * p).sum(), 4))    # Var(X) = E[(X - E[X])^2]
Output:
E[X] = 3.5
Var = 2.9167
The distributions you must recognize
| Distribution | Models | Show up as |
|---|---|---|
| Bernoulli | one yes/no trial | a single binary label |
| Binomial | # successes in n trials | conversion counts |
| Normal (Gaussian) | bell curve, sums of many effects | noise, weights, the CLT |
| Uniform | all values equally likely | random init, baselines |
| Poisson | counts of rare events per interval | arrivals, # of clicks |
| Exponential | time between events | wait times, survival |
The Normal is the celebrity — defined by its mean $\mu$ and standard deviation $\sigma$ — because of the next idea.
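To make the table concrete, here is a small sketch that draws samples from each distribution with NumPy's `Generator` API and compares the sample mean to the textbook mean. The parameter values (p = 0.3, n = 10, rate 0.5, and so on), the seed, and the name `gen` are illustrative assumptions, not values from the text above.

```python
import numpy as np

gen = np.random.default_rng(42)    # separate generator; the seed is arbitrary
N = 100_000                        # samples per distribution

# name -> (samples, theoretical mean); all parameters are illustrative
draws = {
    "Bernoulli(p=0.3)":        (gen.binomial(1, 0.3, N),     0.3),
    "Binomial(n=10, p=0.3)":   (gen.binomial(10, 0.3, N),    10 * 0.3),
    "Normal(mu=2, sigma=1.5)": (gen.normal(2.0, 1.5, N),     2.0),
    "Uniform(0, 1)":           (gen.uniform(0.0, 1.0, N),    0.5),
    "Poisson(lam=4)":          (gen.poisson(4.0, N),         4.0),
    # NumPy parameterizes the Exponential by scale = 1 / rate
    "Exponential(rate=0.5)":   (gen.exponential(1 / 0.5, N), 1 / 0.5),
}

for name, (samples, theory) in draws.items():
    print(f"{name:25s} sample mean = {samples.mean():.3f}   theory = {theory:.3f}")
```

Note that a Bernoulli draw is just a Binomial with a single trial, which is why both rows use `gen.binomial`.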
The Central Limit Theorem — why Normal is everywhere
The CLT says: average enough independent samples from almost any distribution (any with finite variance), and the distribution of that average is approximately Normal. Watch averages of uniform numbers become a tight bell curve:
rng = np.random.default_rng(0)
sample_means = [rng.uniform(0, 1, size=30).mean() for _ in range(10000)]
print("mean of sample-means: %.3f" % np.mean(sample_means))
print("std of sample-means: %.3f" % np.std(sample_means))
print("CLT prediction 1/sqrt(12*30) = %.3f" % (1/np.sqrt(12*30)))
Output:
mean of sample-means: 0.499
std of sample-means: 0.053
CLT prediction 1/sqrt(12*30) = 0.053
The spread of the average matches the CLT's prediction to three decimal places: a Uniform(0, 1) draw has variance $1/12$, so the mean of 30 draws has standard deviation $1/\sqrt{12 \cdot 30}$. This is why we can put confidence intervals on metrics and run A/B tests (Chapter 24): the uncertainty of an average is predictable, and it shrinks like $1/\sqrt{n}$ (quadruple the data to halve the error).
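As a rough illustration of the confidence-interval idea, the sketch below builds an approximate 95% interval for a mean from the CLT and shows the interval width roughly halving each time the sample size quadruples. The helper name `mean_ci95`, the seed, and the sample sizes are assumptions made for this example.

```python
import numpy as np

gen = np.random.default_rng(7)     # separate generator; the seed is arbitrary

def mean_ci95(samples):
    """Approximate 95% confidence interval for the mean, via the CLT."""
    samples = np.asarray(samples, dtype=float)
    mean = samples.mean()
    sem = samples.std(ddof=1) / np.sqrt(len(samples))   # standard error of the mean
    return mean - 1.96 * sem, mean + 1.96 * sem

widths = {}
for n in (100, 400, 1600):         # each step quadruples the data
    lo, hi = mean_ci95(gen.uniform(0, 1, size=n))
    widths[n] = hi - lo
    print(f"n={n:5d}  95% CI = [{lo:.3f}, {hi:.3f}]  width = {widths[n]:.3f}")
```

The 1.96 multiplier is the Normal quantile for a two-sided 95% interval. For very small samples a t-based interval would be the more careful choice.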
Bayes' theorem — updating belief with evidence
The single most important formula in ML statistics. It flips a conditional probability:
$$ P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)} $$
The classic interview gut-punch: a disease affects 1% of people; a test correctly flags 99% of sick people and has a 5% false-positive rate among the healthy. You test positive. What's the chance you're actually sick?
pD, p_pos_given_D, p_pos_given_notD = 0.01, 0.99, 0.05
p_pos = p_pos_given_D * pD + p_pos_given_notD * (1 - pD) # total prob of a positive
p_D_given_pos = p_pos_given_D * pD / p_pos # Bayes
print("P(sick | positive test) =", round(p_D_given_pos, 4))
Output:
P(sick | positive test) = 0.1667
Only 17% — wildly lower than the "99% accurate" framing suggests. Because the disease is rare, the false positives from the huge healthy population swamp the true positives. This base-rate fallacy is why a model with 99% accuracy on rare events can still be useless (Chapter 10) — and why interviewers love this question.
- Prior $P(A)$ — belief before evidence (the 1%).
- Likelihood $P(B\mid A)$ — how well evidence fits the hypothesis (the 99%).
- Posterior $P(A\mid B)$ — updated belief (the 17%).
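The prior, likelihood, and posterior roles can be wrapped in one small function. This is a minimal sketch for a binary hypothesis. The function name `posterior` and the extra priors tried below are assumptions for illustration, and the repeated-test step assumes the two test results are independent given the true disease status.

```python
def posterior(prior, p_evidence_given_h, p_evidence_given_not_h):
    """Bayes' theorem for a binary hypothesis H and one piece of evidence."""
    numerator = p_evidence_given_h * prior
    evidence = numerator + p_evidence_given_not_h * (1 - prior)   # P(B), total probability
    return numerator / evidence

# Same test as above (99% sensitivity, 5% false positives), different base rates.
for prior in (0.001, 0.01, 0.1, 0.5):
    print(f"prior = {prior:<5}  posterior = {posterior(prior, 0.99, 0.05):.4f}")

# A second positive result: the first posterior becomes the new prior.
# Assumes the two results are independent given the true disease status.
after_one = posterior(0.01, 0.99, 0.05)
after_two = posterior(after_one, 0.99, 0.05)
print(f"after one positive: {after_one:.4f}, after two: {after_two:.4f}")
```

A second positive lifts the posterior from about 17% to about 80%, which is the practical reason a rare-disease screen is usually followed by a confirmatory test.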
MLE — where loss functions come from
Maximum Likelihood Estimation picks the parameters that make the observed data most probable. It's not an aside — it's the origin of the losses in Chapter 7:
- Maximizing the likelihood of data with Gaussian noise of fixed variance ⟺ minimizing MSE.
- Maximizing the likelihood of Bernoulli labels ⟺ minimizing cross-entropy.
So when you minimize cross-entropy, you are doing maximum likelihood. MAP (Maximum A Posteriori) adds a prior via Bayes, and a zero-mean Gaussian prior on the weights is exactly L2 regularization (Chapter 9). The whole loss-and-regularization story is Bayesian statistics in disguise.
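The two equivalences can be checked numerically with a brute-force grid search. This is a sketch rather than how anyone would fit a real model: the true parameters (p = 0.3, mu = 5, sigma = 2), the grid ranges, and the seed are all illustrative assumptions, and the Gaussian case treats sigma as known.

```python
import numpy as np

gen = np.random.default_rng(3)                 # separate generator; the seed is arbitrary

# Bernoulli labels: find the constant prediction p that minimizes mean cross-entropy.
y = gen.binomial(1, 0.3, size=2000)            # true p = 0.3 (illustrative)
p_grid = np.linspace(0.001, 0.999, 999)        # grid step of 0.001
cross_entropy = -(y.mean() * np.log(p_grid) + (1 - y.mean()) * np.log(1 - p_grid))
p_mle = p_grid[np.argmin(cross_entropy)]
print(f"argmin cross-entropy: p = {p_mle:.3f}   fraction of positive labels = {y.mean():.3f}")

# Gaussian noise with known sigma: the mean log-likelihood is a constant
# minus MSE / (2 sigma^2), so maximizing one minimizes the other.
sigma = 2.0
x = gen.normal(5.0, sigma, size=2000)          # true mu = 5 (illustrative)
mu_grid = np.linspace(3.0, 7.0, 401)           # grid step of 0.01
mse = ((x[:, None] - mu_grid[None, :]) ** 2).mean(axis=0)
mean_log_lik = -0.5 * np.log(2 * np.pi * sigma**2) - mse / (2 * sigma**2)
mu_mse = mu_grid[np.argmin(mse)]
mu_mle = mu_grid[np.argmax(mean_log_lik)]
print(f"argmin MSE: mu = {mu_mse:.2f}   argmax likelihood: mu = {mu_mle:.2f}   sample mean = {x.mean():.2f}")
```

In both cases the grid search lands, up to the grid resolution, on the sample average, which is the closed-form maximum-likelihood estimate.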
Hypothesis testing & p-values
The framework for "is this difference real or just noise?" You assume a null hypothesis (no effect), then compute the p-value: the probability of seeing data this extreme or more if the null were true. Small p (conventionally < 0.05) → "unlikely to be chance" → reject the null. A clean way to compute one with minimal assumptions (only that the group labels are exchangeable under the null) is a permutation test, which shuffles the labels and counts how often chance reproduces your effect:
a = rng.normal(0.0, 1, 50); b = rng.normal(0.6, 1, 50) # b really is higher
observed = b.mean() - a.mean()
pool = np.concatenate([a, b]); count = 0; N = 20000
for _ in range(N):                       # shuffle group labels
    perm = rng.permutation(pool)
    if abs(perm[50:].mean() - perm[:50].mean()) >= abs(observed):
        count += 1
print("observed difference: %.3f" % observed)
print("permutation p-value: %.4f" % (count / N))
Output:
observed difference: 0.515
permutation p-value: 0.0141
p = 0.014 < 0.05 → the difference is statistically significant (unlikely to be chance). Classic tests (t-test, chi-squared, z-test) compute the same idea with formulas instead of shuffling.
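To see the "formulas instead of shuffling" point, the sketch below computes a two-sample z-test p-value from a formula and compares it with a permutation p-value on the same data. It generates its own data with its own generator (the name `gen` and the seed are assumptions), and it uses a Normal approximation where a t-test would use the slightly heavier-tailed t distribution.

```python
import math
import numpy as np

gen = np.random.default_rng(11)          # separate generator; the seed is arbitrary
a = gen.normal(0.0, 1, 50)
b = gen.normal(0.6, 1, 50)

# Formula route: two-sample z-test with a Normal approximation.
diff = b.mean() - a.mean()
se = math.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))   # std. error of the difference
z = diff / se
p_formula = math.erfc(abs(z) / math.sqrt(2))                      # two-sided tail area of N(0, 1)

# Shuffling route: permutation p-value on the same data.
pool = np.concatenate([a, b])
n_perm, hits = 5000, 0
for _ in range(n_perm):
    perm = gen.permutation(pool)
    if abs(perm[50:].mean() - perm[:50].mean()) >= abs(diff):
        hits += 1
p_perm = hits / n_perm

print(f"z = {z:.2f}   formula p = {p_formula:.4f}   permutation p = {p_perm:.4f}")
```

The two p-values should land close to each other. The formula is faster, while the permutation test leans less on the Normal approximation.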
Don't be confused: what a p-value is NOT. A p-value is not "the probability the null is true," and it says nothing about effect size. p = 0.001 on a trivial 0.01% improvement is significant but useless; a big effect with p = 0.06 may matter a lot. Always report the effect size and a confidence interval, not just "p < 0.05." Also: test many things and some will look significant by luck (the multiple-comparisons problem), so correct for it.
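The multiple-comparisons problem is easy to simulate. In the sketch below every comparison is an A/A test with no real effect, so each "significant" result is a false positive. The helper `z_test_p`, the number of tests, and the seed are assumptions for illustration, and Bonferroni is only one of several possible corrections.

```python
import math
import numpy as np

gen = np.random.default_rng(5)            # separate generator; the seed is arbitrary
n_tests, n, alpha = 1000, 50, 0.05        # illustrative sizes

def z_test_p(a, b):
    """Two-sided p-value for a difference in means (Normal approximation)."""
    se = math.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
    return math.erfc(abs(b.mean() - a.mean()) / se / math.sqrt(2))

# A/A tests: both groups come from the same distribution.
p_values = np.array([z_test_p(gen.normal(0, 1, n), gen.normal(0, 1, n))
                     for _ in range(n_tests)])

raw_hits = int((p_values < alpha).sum())                    # uncorrected threshold
bonferroni_hits = int((p_values < alpha / n_tests).sum())   # Bonferroni: alpha / number of tests

print(f"'significant' at p < {alpha}: {raw_hits} of {n_tests}")
print(f"'significant' after Bonferroni: {bonferroni_hits} of {n_tests}")
```

Roughly 5% of the null comparisons should clear the uncorrected threshold by luck alone, and the Bonferroni threshold removes nearly all of them at the cost of some power to detect real effects.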
Don't be confused: statistical vs. practical significance. "Significant" means "probably not noise," not "big enough to care about." With enough data, any tiny difference becomes statistically significant. Engineers ship on practical significance (is the lift worth the complexity?), informed by statistical significance (is it even real?).
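Here is a sketch of that gap between statistical and practical significance: with a million samples per group, a lift of 0.1% of the mean produces a tiny p-value even though the standardized effect size is negligible. The group means, sample size, and seed are made-up values for illustration.

```python
import math
import numpy as np

gen = np.random.default_rng(9)                 # separate generator; the seed is arbitrary
n = 1_000_000                                  # per group (illustrative)
control = gen.normal(10.00, 1.0, n)
variant = gen.normal(10.01, 1.0, n)            # true lift of 0.01, i.e. 0.1% of the mean

diff = variant.mean() - control.mean()
se = math.sqrt(control.var(ddof=1) / n + variant.var(ddof=1) / n)
p_value = math.erfc(abs(diff) / se / math.sqrt(2))       # two-sided z-test
ci_low, ci_high = diff - 1.96 * se, diff + 1.96 * se     # 95% CI for the lift

pooled_sd = math.sqrt((control.var(ddof=1) + variant.var(ddof=1)) / 2)
cohens_d = diff / pooled_sd                              # standardized effect size

print(f"lift = {diff:.4f}   95% CI = [{ci_low:.4f}, {ci_high:.4f}]")
print(f"p-value = {p_value:.2e}   Cohen's d = {cohens_d:.4f}")
```

The p-value says the lift is almost certainly real. The confidence interval and Cohen's d say how small it is, and that is the number the ship decision should rest on.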
Frequentist vs. Bayesian (the two schools)
- Frequentist — parameters are fixed unknowns; probability is long-run frequency. p-values, confidence intervals. The default in A/B testing.
- Bayesian — parameters have distributions representing belief; update with data via Bayes. Priors, posteriors, credible intervals. Natural for small data and online updating.
Both are correct tools; interviewers like to hear you know the difference and when each fits.
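The difference is easiest to see on one small dataset. The sketch below estimates a conversion rate both ways: a frequentist point estimate with a Normal-approximation (Wald) confidence interval, and a Bayesian posterior with a credible interval. The data (12 conversions out of 100 visitors), the uniform Beta(1, 1) prior, and the seed are assumptions chosen for illustration.

```python
import math
import numpy as np

gen = np.random.default_rng(21)                # separate generator; the seed is arbitrary
conversions, visitors = 12, 100                # made-up data

# Frequentist: point estimate plus a 95% Wald confidence interval.
p_hat = conversions / visitors
se = math.sqrt(p_hat * (1 - p_hat) / visitors)
conf_int = (p_hat - 1.96 * se, p_hat + 1.96 * se)

# Bayesian: a uniform Beta(1, 1) prior updated with the data gives a
# Beta(1 + successes, 1 + failures) posterior (Beta-Binomial conjugacy).
alpha_post = 1 + conversions
beta_post = 1 + (visitors - conversions)
posterior_mean = alpha_post / (alpha_post + beta_post)
posterior_draws = gen.beta(alpha_post, beta_post, size=200_000)
cred_int = tuple(np.percentile(posterior_draws, [2.5, 97.5]))   # 95% credible interval

print(f"frequentist: p_hat = {p_hat:.3f}, 95% CI = [{conf_int[0]:.3f}, {conf_int[1]:.3f}]")
print(f"bayesian:    mean  = {posterior_mean:.3f}, 95% credible = [{cred_int[0]:.3f}, {cred_int[1]:.3f}]")
```

With a flat prior and this much data the two intervals come out similar. They diverge more when the data are scarce or the prior is informative, and the posterior can be updated again as new visitors arrive by reusing it as the next prior.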
The takeaway
Probability is the language of ML uncertainty: distributions describe randomness, the CLT makes averages Normal (and uncertainty shrink like $1/\sqrt n$), Bayes updates belief with evidence (mind the base rate), MLE is where your loss functions come from, and hypothesis testing tells real effects from noise — but a p-value is not an effect size. This is half of any data-science interview. Next, the other mathematical pillar: the linear algebra that everything runs on. 👉