ML system design

The senior-level interview isn't "implement quicksort" — it's "design the recommendation system for a video app" or "build fraud detection for payments." These open-ended ML system design questions test whether you can turn a vague business goal into a working, scalable ML system. There's a repeatable framework, and this chapter gives it to you, then walks one example end to end.

Why this question exists

It reveals everything a coding puzzle can't: do you ask about requirements before coding? Do you know how to frame an ML problem? Can you reason about data, features, models, serving, scale, and failure? Structure beats brilliance here — a calm, organized walk through the framework impresses far more than jumping to "I'd use a transformer."

The framework (memorize the seven steps)

1. Clarify     →  2. Frame as ML  →  3. Data & labels  →  4. Features
      →  5. Model  →  6. Evaluation (offline + online)  →  7. Serve & scale

1. Clarify requirements (never skip this)

Ask before you design. Interviewers plant ambiguity to see if you probe:

Goal & scale — what business metric? How many users/items/requests per second?
Latency — real-time (<100ms) or batch overnight?
Constraints — cold start? Privacy? Interpretability required? Existing systems?

Don't be confused: jumping to the model is the #1 failure. Spending two minutes on requirements is the single highest-signal thing you can do. Candidates who start with "I'd train a neural net" before knowing the scale or latency budget signal inexperience. Clarify first, always.

2. Frame it as an ML problem

Translate the business goal into a precise ML task:

What is being predicted (the label/target)?
Is it classification, regression, ranking, retrieval, generation?
What's the simplest version that delivers value? (Start there; iterate.)

3. Data & labels

Where does training data come from? What's the label, and how do you get it?
Implicit vs. explicit feedback — a click is implicit and abundant but noisy; a rating is explicit but rare.
Watch for leakage (Chapter 16) and label delay (you learn whether a loan defaulted only months later).

4. Features

What signals predict the target? (User features, item features, context, interactions, history.)
Feature freshness — real-time (last click) vs. batch (30-day average) — and the feature store to serve them consistently (Chapter 26).

5. Model

Start with a baseline (popularity, logistic regression) — you must beat it.
Then the appropriate workhorse: gradient boosting for tabular, two-tower/embeddings for retrieval, a neural ranker for ordering.
Justify the choice by the requirements (latency, interpretability, data size), not by what's trendy.

6. Evaluation — offline and online

Offline — the right metric for the task (Chapter 10): AUC/F1 for classification, NDCG/Recall@k for ranking, RMSE for regression.
Online — the A/B test (Chapter 24) on the real business metric. Name your guardrail metrics.
State plainly that offline and online can disagree, and online wins.
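
To make the offline ranking metrics concrete, here is a minimal sketch of NDCG@k and Recall@k in plain Python. The function names and the binary relevance convention (1 = engaged, 0 = not) are assumptions for illustration, not a fixed standard; libraries such as scikit-learn offer their own implementations.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k items, in ranked order."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """DCG divided by the DCG of the ideal (best possible) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of all relevant items that appear in the top k."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked_ids[:k] if item in relevant)
    return hits / len(relevant)

# The one relevant item was ranked third instead of first.
print(ndcg_at_k([0, 0, 1], k=3))  # 0.5
```

NDCG rewards putting relevant items near the top, while Recall@k only asks whether they made the cut; reporting both separates "did retrieval find it" from "did ranking order it well".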

7. Serve, scale & maintain

Online vs. batch serving (Chapter 26); the latency budget decides.
Scale — caching, approximate nearest-neighbor retrieval (HNSW/IVF-PQ), precomputation, the candidate-generation → ranking two-stage pattern.
Monitoring & retraining — drift detection, the retraining trigger, rollback.
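
One common way to put a number on drift is the population stability index (PSI), which compares a feature's distribution at training time with its distribution in production. The sketch below is one possible implementation using NumPy; the function name, the ten quantile bins, and the smoothing constant are assumptions chosen for illustration.

```python
import numpy as np

def population_stability_index(reference, current, n_bins=10, eps=1e-6):
    """PSI between a reference sample (training) and a current sample (serving)."""
    reference = np.asarray(reference, dtype=float)
    current = np.asarray(current, dtype=float)
    # Interior bin edges taken from reference quantiles, so each bin holds
    # roughly the same share of the reference data.
    inner_edges = np.quantile(reference, np.linspace(0, 1, n_bins + 1)[1:-1])
    ref_counts = np.bincount(np.searchsorted(inner_edges, reference, side="right"),
                             minlength=n_bins)
    cur_counts = np.bincount(np.searchsorted(inner_edges, current, side="right"),
                             minlength=n_bins)
    # Clip to avoid log(0) when a bin is empty.
    ref_frac = np.clip(ref_counts / len(reference), eps, None)
    cur_frac = np.clip(cur_counts / len(current), eps, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))

rng = np.random.default_rng(0)
train_feature = rng.normal(0.0, 1.0, size=5000)
same_dist = rng.normal(0.0, 1.0, size=5000)
shifted = rng.normal(1.0, 1.0, size=5000)  # mean moved by one standard deviation

psi_same = population_stability_index(train_feature, same_dist)
psi_shifted = population_stability_index(train_feature, shifted)
print(psi_same, psi_shifted)
```

A frequently quoted rule of thumb treats PSI below about 0.1 as stable and above about 0.25 as a significant shift, but the threshold that triggers retraining is a judgment call to tune per feature.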

A worked example: "Design a news feed ranker"

Watch the framework turn a one-line prompt into a system.

1. Clarify. "Rank articles for each user's home feed. ~10M users, ~100K fresh articles/day, must serve in <150ms, optimize for long-term engagement (not just clicks — avoid clickbait)."

2. Frame. For each (user, article) pair, predict P(meaningful engagement). Then sort each user's candidates by that score. It's ranking, fed by retrieval.

3. Data & labels. Implicit feedback from logs: impressions + clicks + dwell time. Label = "engaged" if click and dwell > N seconds (discourages clickbait). Beware position bias (top items get more clicks regardless of quality) and the feedback loop (we only see labels for what we showed).
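
The label rule above can be sketched in a few lines. The `Impression` record and its field names are hypothetical, and the 30-second default stands in for the unspecified N; a real threshold would come from looking at the dwell-time distribution.

```python
from dataclasses import dataclass

@dataclass
class Impression:
    """One logged feed impression (hypothetical schema)."""
    user_id: str
    article_id: str
    position: int         # slot in the feed; kept so position bias can be modeled later
    clicked: bool
    dwell_seconds: float

def engagement_label(imp, min_dwell_seconds=30.0):
    """1 if the user clicked AND stayed longer than the threshold, else 0."""
    return int(imp.clicked and imp.dwell_seconds > min_dwell_seconds)

log = [
    Impression("u1", "a1", position=1, clicked=True, dwell_seconds=45.0),   # real read
    Impression("u1", "a2", position=2, clicked=True, dwell_seconds=3.0),    # clickbait bounce
    Impression("u1", "a3", position=3, clicked=False, dwell_seconds=0.0),   # ignored
]
labels = [engagement_label(imp) for imp in log]
print(labels)  # [1, 0, 0]
```

Only impressions that were actually shown appear in the log, which is exactly the feedback loop the paragraph warns about.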

4. Features. User: history, topic affinities, demographics, time of day. Article: topic, freshness, source quality, popularity. Interaction: embedding similarity between user and article (Chapter 14). Context: device, session.

5. Model — the two-stage pattern (the standard answer for feeds/recommenders):

100K articles ─► CANDIDATE GENERATION ─► ~500 candidates ─► RANKING ─► top 20
                 (cheap: embedding ANN          (expensive: gradient-boosted
                  retrieval, HNSW/IVF-PQ)         or neural ranker on rich features)

Stage 1 is fast and approximate: vector search narrows the ~100K fresh articles to a few hundred candidates. Stage 2 is slow and accurate: it scores those few hundred precisely. This split is how you hit <150ms at scale, and naming it signals real seniority.
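
The two-stage split can be sketched with NumPy. Everything here is a simplified stand-in: exact dot-product search replaces a real ANN index (HNSW/IVF-PQ), a hand-weighted formula replaces a trained ranker, and the embeddings, freshness scores, and weights are random or made-up values used only to show the data flow.

```python
import numpy as np

def generate_candidates(user_vec, article_vecs, n_candidates):
    """Stage 1: cheap similarity search over the whole corpus.
    Exact search here; a production system would use an ANN index."""
    scores = article_vecs @ user_vec
    # argpartition finds the top n without fully sorting every article.
    return np.argpartition(-scores, n_candidates - 1)[:n_candidates]

def rank(candidate_ids, user_vec, article_vecs, freshness, top_k):
    """Stage 2: a richer score on the few candidates only.
    The 0.8/0.2 weights are illustrative, not learned."""
    similarity = article_vecs[candidate_ids] @ user_vec
    score = 0.8 * similarity + 0.2 * freshness[candidate_ids]
    order = np.argsort(-score)[:top_k]
    return candidate_ids[order]

rng = np.random.default_rng(0)
n_articles, dim = 10_000, 32
article_vecs = rng.normal(size=(n_articles, dim))
article_vecs /= np.linalg.norm(article_vecs, axis=1, keepdims=True)
user_vec = rng.normal(size=dim)
user_vec /= np.linalg.norm(user_vec)
freshness = rng.uniform(size=n_articles)  # 1.0 = just published

candidates = generate_candidates(user_vec, article_vecs, n_candidates=500)
feed = rank(candidates, user_vec, article_vecs, freshness, top_k=20)
print(len(candidates), len(feed))  # 500 20
```

The expensive scoring touches 500 rows instead of the full corpus, which is where the latency savings come from.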

6. Evaluation. Offline: NDCG@20, Recall@k on held-out engagement (split by time, not randomly). Online: A/B test on long-term engagement and retention; guardrails on diversity, latency, and complaint rate.
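
Splitting by time rather than randomly is simple to express. The sketch below assumes each logged event is a dict with a `timestamp` field, which is a hypothetical schema; the point is that everything in the test set happens strictly after everything in the training set.

```python
from datetime import datetime

def split_by_time(events, cutoff):
    """Train on events before the cutoff, evaluate on events at or after it."""
    train = [e for e in events if e["timestamp"] < cutoff]
    test = [e for e in events if e["timestamp"] >= cutoff]
    return train, test

events = [
    {"user": "u1", "article": "a1", "engaged": 1, "timestamp": datetime(2024, 1, 1)},
    {"user": "u2", "article": "a2", "engaged": 0, "timestamp": datetime(2024, 1, 5)},
    {"user": "u1", "article": "a3", "engaged": 1, "timestamp": datetime(2024, 1, 9)},
    {"user": "u3", "article": "a1", "engaged": 0, "timestamp": datetime(2024, 1, 12)},
]
train, test = split_by_time(events, cutoff=datetime(2024, 1, 8))
print(len(train), len(test))  # 2 2
```

A random split would let the model train on engagement that happened after the events it is tested on, which inflates offline metrics in a way production never reproduces.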

7. Serve & scale. Precompute article embeddings hourly; cache user embeddings; ANN index for stage 1; the ranker behind a low-latency endpoint with a feature store. Monitor for drift; retrain daily on fresh engagement. Cold start: fall back to popularity/trending for new users and content-based features for new articles.
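
The cold-start fallback for new users amounts to a small routing decision at serving time. In this sketch, `build_feed`, the `personalized_ranker` callable, and the five-interaction threshold are all hypothetical names and values chosen for illustration.

```python
def build_feed(user_history, personalized_ranker, trending_ids,
               min_history=5, feed_size=20):
    """Serve trending articles until the user has enough history to personalize."""
    if len(user_history) < min_history:
        return trending_ids[:feed_size]
    return personalized_ranker(user_history)[:feed_size]

def stub_ranker(history):
    """Stand-in for the real two-stage pipeline."""
    return [f"personalized_{i}" for i in range(50)]

trending = [f"trending_{i}" for i in range(50)]

new_user_feed = build_feed([], stub_ranker, trending)
returning_user_feed = build_feed(["a1", "a2", "a3", "a4", "a5", "a6"], stub_ranker, trending)
print(new_user_feed[0], returning_user_feed[0])  # trending_0 personalized_0
```

New articles need the mirror-image treatment: with no engagement yet, they can only be scored from content-based features such as topic and source.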

(The sister Recommendation Systems book builds exactly this, including the production capstone.)

The common system-design prompts

Prepare a sketch for each — they recur constantly:

Recommendation / feed ranking (the two-stage pattern above)
Search / retrieval (embeddings + ANN + re-ranking)
Fraud / anomaly detection (imbalanced data, precision/recall trade-off, real-time)
Ad click-through-rate prediction (massive scale, calibration matters)
Content moderation / classification (precision/recall costs, human-in-the-loop)
An LLM-powered feature (RAG, latency, cost, hallucination guardrails)

The takeaway

ML system design rewards structure: clarify requirements, frame the ML problem, reason through data/labels → features → model → offline+online evaluation → serving & scale. Always start with a baseline and the simplest thing that works; know the two-stage candidate-generation → ranking pattern cold; remember the A/B test is the final judge; and talk about drift, monitoring, and cold start unprompted. Now the final chapter: the concept questions, coding drills, and prep plan to bring it all home.

AI Foundations in Depth