Experimentation & A/B testing

A model that scores well offline is a hypothesis, not a win. The only proof that it helps is a controlled experiment on real users: an A/B test. This is the bridge between ML and business impact, the skill product teams interview hardest on, and a place where careers are made or damaged. It's also where the statistics of Chapter 22 become a daily tool.

The core idea

Randomly split users into two groups, change one thing for one of them, and measure the difference:

Control (A) — the current experience.
Treatment (B) — the new model/feature/UI.
Randomization: the magic. Random assignment makes the groups statistically identical on average, so any difference in outcome beyond what chance can explain is caused by your change, not by confounds. This is what makes an A/B test a causal experiment rather than an observed correlation.

You pick one primary metric ahead of time (conversion, click-through, revenue per user) and ask: is B's metric reliably better than A's, or is the difference just noise?
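One common way to implement the random split described above is to hash a stable user ID, so the same user always lands in the same arm. The following is a minimal sketch under that assumption; the function name `assign_arm`, the experiment name `new-ranker-v2`, and the user IDs are all hypothetical.

```python
import hashlib

def assign_arm(user_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically map a user to 'control' or 'treatment'."""
    # Salting with the experiment name keeps assignments independent across experiments.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    # Interpret the first 8 hex digits as a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "treatment" if bucket < treatment_share else "control"

# Hypothetical user IDs, used only to check that the split is close to 50/50.
users = [f"user-{i}" for i in range(10_000)]
arms = [assign_arm(u, "new-ranker-v2") for u in users]
treated = arms.count("treatment")
print("treatment share: %.3f" % (treated / len(users)))
```

Hashing rather than drawing a fresh random number on every visit keeps a returning user in one arm for the whole test, which is what the causal argument needs.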

A worked test — and a humbling result

Say control converts at 10%, your new model at 12%, with 2,000 users per arm:

import numpy as np
from math import erf, sqrt
rng = np.random.default_rng(0)
nA = nB = 2000
a = rng.random(nA) < 0.10          # control: 10% true conversion
b = rng.random(nB) < 0.12          # treatment: 12% true conversion
cA, cB = a.mean(), b.mean()
print("control=%.3f  treatment=%.3f  observed lift=%.1f%%" % (cA, cB, 100*(cB-cA)/cA))

# two-proportion z-test
p_pool = (a.sum() + b.sum()) / (nA + nB)
se = np.sqrt(p_pool * (1 - p_pool) * (1/nA + 1/nB))
z = (cB - cA) / se
p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
print("z=%.3f  p-value=%.4f" % (z, p_value))
lo, hi = (cB-cA) - 1.96*se, (cB-cA) + 1.96*se
print("95%% CI for the difference: [%.4f, %.4f]" % (lo, hi))

Output:

control=0.105  treatment=0.118  observed lift=12.4%
z=1.306  p-value=0.1915
95% CI for the difference: [-0.0065, 0.0325]

A 12.4% relative lift that you cannot trust. The p-value is 0.19 (well above 0.05) and the confidence interval includes zero, so "no effect" is still plausible. With only 2,000 users per arm, this test is underpowered: the effect is real (we built it into the simulation), but there isn't enough data to distinguish it from noise. Shipping on this evidence would be a mistake, and recognizing that is exactly what separates a strong candidate from a weak one.

Power & sample size — decide before you run

The fix is to size the experiment up front. Statistical power is the probability of detecting a real effect of a given size. A back-of-envelope rule for the sample size per arm to detect an absolute change delta at baseline rate p (≈80% power, 5% significance):

def sample_size(p, delta):
    return int(16 * p * (1 - p) / delta ** 2)
print("users/arm to detect +2pp at 10% baseline:", sample_size(0.10, 0.02))

Output:

users/arm to detect +2pp at 10% baseline: 3600

We needed about 3,600 per arm, not 2,000. With 2,000 per arm the test had only roughly even odds of detecting a 2-point lift, so failing to reach significance is no surprise. The lesson interviewers want: compute the required sample size and runtime before launching, and do not peek and stop early. Smaller effects need quadratically more users (halve delta → 4× the sample).
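To make the power argument concrete, here is a rough sketch that estimates the power of the two-proportion z-test at different sample sizes using the normal approximation. The helper names `norm_cdf` and `approx_power` are illustrative, not from the text. The constant 16 in the rule of thumb is a rounding of 2 × (1.96 + 0.84)² ≈ 15.7, and the rule uses the baseline variance for both arms, so its answers are approximate.

```python
from math import erf, sqrt

def norm_cdf(x):
    """Standard normal CDF, built from erf as in the worked example."""
    return 0.5 * (1 + erf(x / sqrt(2)))

def approx_power(p_control, p_treatment, n_per_arm, z_alpha=1.96):
    """Approximate power of a two-sided two-proportion z-test (normal approximation)."""
    se = sqrt(p_control * (1 - p_control) / n_per_arm
              + p_treatment * (1 - p_treatment) / n_per_arm)
    # Ignores the negligible chance of rejecting in the wrong direction.
    return norm_cdf(abs(p_treatment - p_control) / se - z_alpha)

for n in (2000, 3600, 14400):
    print("n/arm=%d  power=%.2f" % (n, approx_power(0.10, 0.12, n)))

# Halving the effect (2pp -> 1pp) at 4x the users gives similar power.
print("1pp lift at n/arm=14400: power=%.2f" % approx_power(0.10, 0.11, 14400))
```

Running it for 2,000 and 3,600 users per arm shows how far the original test was from the usual 80% target.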

The four numbers of every experiment

Quantity	Question it answers
Significance (α)	how often will we falsely claim an effect? (usually 5%)
Power (1−β)	how often will we catch a real effect? (aim 80%)
Minimum detectable effect	the smallest lift worth detecting
Sample size / duration	how many users and how long, derived from the above
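The last row of the table can be turned into a quick calculation. This sketch assumes a fixed number of eligible users per day (the figure of 800 is hypothetical) and rounds up to whole weeks so that every weekday is equally represented; that rounding is a common practice assumed here, not a rule from the text.

```python
from math import ceil

def duration_days(n_per_arm, daily_eligible_users, n_arms=2, round_to_weeks=True):
    """Days needed to collect n_per_arm users in every arm at the given daily traffic."""
    days = ceil(n_per_arm * n_arms / daily_eligible_users)
    if round_to_weeks:
        # Whole weeks, so weekday and weekend behaviour are both covered evenly.
        days = ceil(days / 7) * 7
    return days

# 3,600 users per arm (from the sample-size rule) at a hypothetical 800 users/day.
print("planned duration (days):", duration_days(3600, daily_eligible_users=800))
```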

The classic mistakes (interview gold)

Watch out: these traps invalidate most amateur A/B tests.

Peeking / early stopping. Checking the p-value daily and stopping as soon as it dips below 0.05 massively inflates false positives, because you are effectively running dozens of tests and reporting whichever one crosses the line. Fix the duration in advance (or use sequential-testing methods built for peeking).
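A small simulation makes the peeking problem visible. The sketch below runs many A/A tests (both arms convert at 10%, so every "significant" result is a false positive) and compares stopping at the first daily p-value below 0.05 with looking once at the planned end. The traffic numbers and the helper name `p_value` are assumptions for illustration.

```python
import numpy as np
from math import erf, sqrt

def p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided two-proportion z-test, same formula as the worked example."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no conversions yet, nothing to test
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

rng = np.random.default_rng(1)
n_sims, days, users_per_day = 2000, 20, 200   # hypothetical traffic
n = users_per_day * np.arange(1, days + 1)    # cumulative users per arm after each day
peek_hits = fixed_hits = 0
for _ in range(n_sims):
    # Cumulative conversions per arm; both arms share the same true rate.
    a = rng.binomial(users_per_day, 0.10, size=days).cumsum()
    b = rng.binomial(users_per_day, 0.10, size=days).cumsum()
    daily_p = [p_value(a[d], n[d], b[d], n[d]) for d in range(days)]
    peek_hits += min(daily_p) < 0.05     # would have stopped on some day
    fixed_hits += daily_p[-1] < 0.05     # single look at the planned end

print("false-positive rate, peeking daily: %.3f" % (peek_hits / n_sims))
print("false-positive rate, fixed horizon: %.3f" % (fixed_hits / n_sims))
```

The fixed-horizon rate should sit near the nominal 5%, while the peeking rate comes out several times higher.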

Underpowered tests. Too few users → you'll miss real effects and call winners losers (exactly the demo above).

Multiple metrics / comparisons. Test 20 metrics at α = 0.05 and, on average, one will look "significant" by chance alone. Pick one primary metric; apply a multiple-comparison correction (such as Bonferroni) to the rest.
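The arithmetic behind this trap, plus one possible correction, can be sketched in a few lines. The text does not name a correction method; Holm-Bonferroni is used here as one standard choice, and the p-values are made up for illustration.

```python
alpha, n_metrics = 0.05, 20

# Chance that at least one of 20 independent no-effect metrics crosses alpha.
family_wise = 1 - (1 - alpha) ** n_metrics
print("P(at least one false positive) = %.2f" % family_wise)

def holm_reject(p_values, alpha=0.05):
    """Holm-Bonferroni step-down: returns a reject flag per p-value, in input order."""
    order = sorted(range(len(p_values)), key=lambda i: p_values[i])
    reject = [False] * len(p_values)
    for rank, i in enumerate(order):
        # The smallest p-value faces alpha/m, the next alpha/(m-1), and so on.
        if p_values[i] <= alpha / (len(p_values) - rank):
            reject[i] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return reject

secondary_p = [0.001, 0.04, 0.03, 0.20]   # hypothetical p-values for secondary metrics
print(holm_reject(secondary_p))
```

Note that 0.04 and 0.03 would both pass an uncorrected 0.05 threshold but do not survive the correction.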

The novelty effect. Users react to anything new; run long enough for the bump to fade.

Sample-ratio mismatch (SRM). If your 50/50 split arrives as 52/48 on a large sample, the imbalance is too big to be chance and the assignment or logging is broken. Trust nothing until you find and fix the cause.
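Whether a 52/48 split is alarming depends on the sample size, and a chi-square goodness-of-fit test is one standard way to check. The sketch below uses only the standard library; the function name `srm_p_value` and the counts are illustrative assumptions.

```python
from math import erfc, sqrt

def srm_p_value(n_control, n_treatment, expected_treatment_share=0.5):
    """Chi-square goodness-of-fit test (1 degree of freedom) for the traffic split."""
    total = n_control + n_treatment
    exp_t = total * expected_treatment_share
    exp_c = total - exp_t
    chi2 = (n_control - exp_c) ** 2 / exp_c + (n_treatment - exp_t) ** 2 / exp_t
    # For 1 degree of freedom, P(chi-square > x) = erfc(sqrt(x / 2)).
    return erfc(sqrt(chi2 / 2))

# The same 52/48 ratio at two very different sample sizes.
print("520 vs 480:       p=%.3f" % srm_p_value(520, 480))
print("52,000 vs 48,000: p=%.2e" % srm_p_value(52_000, 48_000))
```

With a thousand users the 52/48 split is well within chance; with a hundred thousand it is essentially impossible under a true 50/50 assignment. Teams often use a strict alert threshold for this check, which is a convention assumed here rather than something the text specifies.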

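Simpson's paradox. A treatment can win overall yet lose in every subgroup (or vice versa) when the arms contain different mixes of subgroups, for example when one arm is mostly desktop users and the other mostly mobile. Segment your results and check that each segment is split as planned.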
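A tiny numeric example shows how the reversal happens. All counts below are invented for illustration: the treatment arm is mostly desktop users (who convert more), while the control arm is mostly mobile.

```python
# Hypothetical counts: (users, conversions) per segment and arm.
data = {
    "desktop": {"control": (200, 60), "treatment": (800, 224)},
    "mobile":  {"control": (800, 80), "treatment": (200, 16)},
}

def rate(users, conversions):
    return conversions / users

# Per-segment conversion rates: treatment is lower in both segments.
for segment, arms in data.items():
    print("%-8s control=%.2f  treatment=%.2f" % (
        segment, rate(*arms["control"]), rate(*arms["treatment"])))

# Pooled conversion rates: treatment looks higher overall.
totals = {}
for arm in ("control", "treatment"):
    users = sum(data[s][arm][0] for s in data)
    conversions = sum(data[s][arm][1] for s in data)
    totals[arm] = rate(users, conversions)
print("overall  control=%.2f  treatment=%.2f" % (totals["control"], totals["treatment"]))
```

The pooled comparison is driven by the segment mix rather than by the treatment. In a correctly randomized test the mix should be nearly identical across arms, so an imbalance like this is itself a sign that assignment went wrong.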

Beyond the basic A/B test

A/A test — run control against itself to validate the pipeline; it should show no difference. A great sanity check.
Multi-armed bandits — adaptively send more traffic to the winning arm during the test, trading some statistical cleanliness for less regret. Good for many short-lived options.
CUPED / variance reduction — use pre-experiment data to shrink noise and reach significance faster.
Guardrail metrics — secondary metrics (latency, complaints, churn) that must not get worse even if the primary metric wins.
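Of the techniques above, CUPED is the easiest to show in a few lines. The sketch below uses simulated data (all numbers hypothetical) in which each user's pre-experiment value predicts their in-experiment value, and applies the standard adjustment: subtract theta times the centered covariate, with theta = cov(pre, post) / var(pre).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
# Simulated users: pre-experiment spend is correlated with in-experiment spend.
pre = rng.gamma(shape=2.0, scale=10.0, size=n)       # covariate, measured before the test
post = 0.8 * pre + rng.normal(0.0, 5.0, size=n)      # metric measured during the test

# CUPED adjustment: remove the part of the metric explained by the covariate.
theta = np.cov(pre, post)[0, 1] / np.var(pre, ddof=1)
post_cuped = post - theta * (pre - pre.mean())

print("variance before: %.1f" % post.var(ddof=1))
print("variance after:  %.1f" % post_cuped.var(ddof=1))
print("mean unchanged:  %s" % np.isclose(post.mean(), post_cuped.mean()))
```

Because the covariate is measured before assignment, it cannot be affected by the treatment, so the adjustment shrinks variance without biasing the difference between arms. In a real test, theta would typically be estimated on the pooled data from both arms.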

Connecting it to ML

This is how a model earns its place in production. Offline metrics (Chapter 10) tell you a model is probably better; the A/B test shows whether it actually moves the business metric, and the two often disagree (a higher-AUC ranker can lower engagement). The online experiment is the final, authoritative judge. Senior ML engineers live in this loop: offline eval → ship behind a flag → A/B test → roll out or roll back.

The takeaway

A/B testing is the causal proof that an ML change helps real users: randomization removes confounds, and a primary metric together with a p-value and confidence interval tells you whether the lift is real. Size the test for adequate power before running it (small effects cost quadratically more users), and avoid the killers: peeking, underpowering, multiple comparisons, novelty, SRM, and Simpson's paradox. Offline metrics propose; the experiment disposes. Now let's zoom out to the tools, languages, and platforms you'll use to do all of this for real.

AI Foundations in Depth