ML system design
The senior-level interview isn't "implement quicksort" — it's "design the recommendation system for a video app" or "build fraud detection for payments." These open-ended ML system design questions test whether you can turn a vague business goal into a working, scalable ML system. There's a repeatable framework, and this chapter gives it to you, then walks one example end to end.
Why this question exists
It reveals everything a coding puzzle can't: do you ask about requirements before coding? Do you know how to frame an ML problem? Can you reason about data, features, models, serving, scale, and failure? Structure beats brilliance here — a calm, organized walk through the framework impresses far more than jumping to "I'd use a transformer."
The framework (memorize the seven steps)
1. Clarify → 2. Frame as ML → 3. Data & labels → 4. Features
→ 5. Model → 6. Evaluation (offline + online) → 7. Serve & scale
1. Clarify requirements (never skip this)
Ask before you design. Interviewers plant ambiguity to see if you probe:
- Goal & scale — what business metric? How many users/items/requests per second?
- Latency — real-time (<100ms) or batch overnight?
- Constraints — cold start? Privacy? Interpretability required? Existing systems?
Don't be confused: jumping to the model is the #1 failure. Spending two minutes on requirements is the single highest-signal thing you can do. Candidates who start with "I'd train a neural net" before knowing the scale or latency budget signal inexperience. Clarify first, always.
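The scale answers you collect here feed a quick back-of-envelope calculation. The sketch below shows one way to turn them into a throughput estimate. Every number in it (10M daily users, 5 feed loads per user per day, a 3x peak factor) is a hypothetical assumption for illustration, and `estimate_qps` is an illustrative helper, not a standard function.

```python
def estimate_qps(daily_active_users, requests_per_user_per_day, peak_factor):
    """Return (average_qps, peak_qps) from daily traffic assumptions."""
    seconds_per_day = 24 * 60 * 60
    average_qps = daily_active_users * requests_per_user_per_day / seconds_per_day
    return average_qps, average_qps * peak_factor


# Hypothetical inputs: 10M daily users, 5 requests each, peak traffic 3x the average.
avg_qps, peak_qps = estimate_qps(10_000_000, 5, 3.0)
print(f"average ~{avg_qps:.0f} QPS, peak ~{peak_qps:.0f} QPS")
```

Saying the estimate out loud ("a few hundred requests per second on average, low thousands at peak") shows the interviewer that your later serving choices rest on numbers rather than guesses.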
2. Frame it as an ML problem
Translate the business goal into a precise ML task:
- What is being predicted (the label/target)?
- Is it classification, regression, ranking, retrieval, generation?
- What's the simplest version that delivers value? (Start there; iterate.)
3. Data & labels
- Where does training data come from? What's the label, and how do you get it?
- Implicit vs. explicit feedback — a click is implicit and abundant but noisy; a rating is explicit but rare.
- Watch for leakage (Chapter 16) and label delay (you learn whether a loan defaulted only months later).
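Label delay has a practical consequence: recent examples look like negatives only because their outcome has not arrived yet. A minimal sketch of one way to handle it is to train only on examples old enough for the label to be fully observed. The 180-day horizon, the field names, and the `mature_examples` helper are assumptions for illustration.

```python
from datetime import date, timedelta


def mature_examples(loans, as_of, horizon_days):
    """Keep only loans issued at least horizon_days before as_of."""
    cutoff = as_of - timedelta(days=horizon_days)
    return [loan for loan in loans if loan["issued"] <= cutoff]


loans = [
    {"id": 1, "issued": date(2023, 1, 10), "defaulted": True},
    {"id": 2, "issued": date(2023, 6, 1), "defaulted": False},
    # Too recent: "defaulted: False" may only mean "has not defaulted yet".
    {"id": 3, "issued": date(2023, 11, 20), "defaulted": False},
]

train = mature_examples(loans, as_of=date(2024, 1, 1), horizon_days=180)
print([loan["id"] for loan in train])
```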
4. Features
- What signals predict the target? (User features, item features, context, interactions, history.)
- Feature freshness — real-time (last click) vs. batch (30-day average) — and the feature store to serve them consistently (Chapter 26).
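Serving features "consistently" largely means that training sees the value a feature had at the time of each event, never a later one. The sketch below illustrates that point-in-time lookup with plain Python. It is an assumption about how the idea can be shown, not the API of any particular feature store, and `feature_as_of` is a hypothetical helper.

```python
from bisect import bisect_right


def feature_as_of(history, event_time):
    """Return the latest value recorded at or before event_time.

    history is a list of (timestamp, value) pairs sorted by timestamp.
    Returns None when no value existed yet at event_time.
    """
    timestamps = [ts for ts, _ in history]
    idx = bisect_right(timestamps, event_time)
    if idx == 0:
        return None
    return history[idx - 1][1]


# Hypothetical history of a user's 30-day click rate, keyed by integer timestamps.
click_rate_30d = [(100, 0.10), (200, 0.25), (300, 0.40)]

print(feature_as_of(click_rate_30d, 250))  # value in force at t=250
print(feature_as_of(click_rate_30d, 50))   # nothing recorded yet
```

Looking up the value at t=300 for an event that happened at t=250 would leak the future into training, which is exactly the leakage the previous step warns about.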
5. Model
- Start with a baseline (popularity, logistic regression) — you must beat it.
- Then the appropriate workhorse: gradient boosting for tabular, two-tower/embeddings for retrieval, a neural ranker for ordering.
- Justify the choice by the requirements (latency, interpretability, data size), not by what's trendy.
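To make "start with a baseline" concrete, here is a minimal sketch of a popularity baseline and a simple hit-rate check against held-out interactions. The toy data and the helper names (`popularity_baseline`, `hit_rate`) are assumptions for illustration. Any model you propose afterwards should beat this number.

```python
from collections import Counter


def popularity_baseline(interactions, k):
    """Recommend the k most-interacted-with items to everyone."""
    counts = Counter(item for _, item in interactions)
    return [item for item, _ in counts.most_common(k)]


def hit_rate(recommended, held_out):
    """Fraction of held-out interactions whose item was recommended."""
    hits = sum(1 for _, item in held_out if item in recommended)
    return hits / len(held_out)


# Hypothetical (user, item) logs.
train_logs = [("u1", "a"), ("u2", "a"), ("u3", "a"),
              ("u1", "b"), ("u2", "b"), ("u3", "c")]
held_out = [("u1", "a"), ("u2", "c"), ("u3", "b"), ("u4", "d")]

top_items = popularity_baseline(train_logs, k=2)
baseline_hit_rate = hit_rate(top_items, held_out)
print(top_items, baseline_hit_rate)
```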
6. Evaluation — offline and online
- Offline — the right metric for the task (Chapter 10): AUC/F1 for classification, NDCG/Recall@k for ranking, RMSE for regression.
- Online — the A/B test (Chapter 24) on the real business metric. Name your guardrail metrics.
- State plainly that offline and online can disagree, and online wins.
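For the ranking metrics named above, it helps to be able to write them down on the spot. The sketch below uses the linear-gain form of NDCG (other texts use an exponential gain) and assumes `relevances` lists the relevance of every candidate for one query, in the order the model ranked them.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top k positions (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


def recall_at_k(ranked_items, relevant_items, k):
    """Share of the relevant items that appear in the top k."""
    if not relevant_items:
        return 0.0
    hits = len(set(ranked_items[:k]) & set(relevant_items))
    return hits / len(relevant_items)


print(ndcg_at_k([1, 0, 0], k=3))  # relevant item ranked first
print(ndcg_at_k([0, 0, 1], k=3))  # relevant item ranked last
print(recall_at_k(["a", "b", "c"], {"a", "c", "z"}, k=2))
```

Pushing the single relevant item from first to third place halves NDCG here, because the gain is divided by log2(4) = 2.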
7. Serve, scale & maintain
- Online vs. batch serving (Chapter 26); the latency budget decides.
- Scale — caching, approximate nearest-neighbor retrieval (HNSW/IVF-PQ), precomputation, the candidate-generation → ranking two-stage pattern.
- Monitoring & retraining — drift detection, the retraining trigger, rollback.
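One common way to implement the drift check is the Population Stability Index (PSI), which compares the binned distribution of a feature or score in training against live traffic. This is a minimal sketch: the single bin edge, the toy values, and the 0.25 alert threshold (a widely quoted rule of thumb, not a law) are all assumptions.

```python
import math
from bisect import bisect_right


def bucket_fractions(values, edges):
    """Fraction of values falling into each bucket defined by sorted edges."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[bisect_right(edges, v)] += 1
    return [c / len(values) for c in counts]


def psi(expected_fracs, actual_fracs, eps=1e-6):
    """Population Stability Index: sum of (a - e) * ln(a / e) over buckets."""
    total = 0.0
    for e, a in zip(expected_fracs, actual_fracs):
        e, a = max(e, eps), max(a, eps)  # avoid log(0) on empty buckets
        total += (a - e) * math.log(a / e)
    return total


# Hypothetical model scores at training time vs. in production.
training_scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
live_scores = [0.6, 0.7, 0.8, 0.9, 0.95, 0.55, 0.65, 0.1]
edges = [0.5]

drift = psi(bucket_fractions(training_scores, edges),
            bucket_fractions(live_scores, edges))
needs_retraining = drift > 0.25  # assumed alert threshold
print(round(drift, 3), needs_retraining)
```

In an interview, the formula matters less than the loop it enables: measure drift on a schedule, trigger retraining when it crosses a threshold, and keep the previous model ready for rollback.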
A worked example: "Design a news feed ranker"
Watch the framework turn a one-line prompt into a system.
1. Clarify. "Rank articles for each user's home feed. ~10M users, ~100K fresh articles/day, must serve in <150ms, optimize for long-term engagement (not just clicks — avoid clickbait)."
2. Frame. For each (user, article) pair, predict P(meaningful engagement). Then sort each user's candidates by that score. It's ranking, fed by retrieval.
3. Data & labels. Implicit feedback from logs: impressions + clicks + dwell time. Label = "engaged" if click and dwell > N seconds (discourages clickbait). Beware position bias (top items get more clicks regardless of quality) and the feedback loop (we only see labels for what we showed).
4. Features. User: history, topic affinities, demographics, time of day. Article: topic, freshness, source quality, popularity. Interaction: embedding similarity between user and article (Chapter 14). Context: device, session.
5. Model — the two-stage pattern (the standard answer for feeds/recommenders):
100K articles ─► CANDIDATE GENERATION ─► ~500 candidates ─► RANKING ─► top 20
CANDIDATE GENERATION is cheap: embedding ANN retrieval (HNSW/IVF-PQ).
RANKING is expensive: a gradient-boosted or neural ranker on rich features.
Stage 1 is fast and approximate: vector search narrows the ~100K articles to a few hundred, and the same pattern scales to millions of items. Stage 2 is slow and accurate: it scores those few hundred precisely. This split is how you hit <150ms at scale, and naming it signals real seniority.
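The two stages can be sketched in a few lines. This is a toy illustration under loud assumptions: brute-force cosine similarity stands in for a real ANN index, a hand-weighted blend of similarity and freshness stands in for a trained ranker, and the embeddings, weights, and function names are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def generate_candidates(user_vec, article_vecs, n):
    """Stage 1: cheap similarity search (brute force here, ANN in production)."""
    by_similarity = sorted(article_vecs,
                           key=lambda aid: cosine(user_vec, article_vecs[aid]),
                           reverse=True)
    return by_similarity[:n]


def rank(candidates, user_vec, article_vecs, freshness, top_k):
    """Stage 2: richer scoring on the few survivors (assumed hand-set weights)."""
    def score(aid):
        return 0.7 * cosine(user_vec, article_vecs[aid]) + 0.3 * freshness[aid]
    return sorted(candidates, key=score, reverse=True)[:top_k]


user = [1.0, 0.0]
articles = {"sports_old": [1.0, 0.0], "sports_new": [0.9, 0.1],
            "politics": [0.0, 1.0], "mixed": [0.5, 0.5]}
freshness = {"sports_old": 0.1, "sports_new": 0.9, "politics": 1.0, "mixed": 0.5}

candidates = generate_candidates(user, articles, n=3)
feed = rank(candidates, user, articles, freshness, top_k=2)
print(candidates, feed)
```

Stage 1 drops the off-topic article using embeddings alone. Stage 2 then reorders the survivors with a feature that stage 1 never looked at, which is the whole point of the split.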
6. Evaluation. Offline: NDCG@20, Recall@k on held-out engagement (split by time, not randomly). Online: A/B test on long-term engagement and retention; guardrails on diversity, latency, and complaint rate.
7. Serve & scale. Precompute article embeddings hourly; cache user embeddings; ANN index for stage 1; the ranker behind a low-latency endpoint with a feature store. Monitor for drift; retrain daily on fresh engagement. Cold start: fall back to popularity/trending for new users and content-based features for new articles.
(The sister Recommendation Systems book builds exactly this, including the production capstone.)
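Two details from the worked example are easy to get wrong in practice: the "engaged" label from step 3 and the time-based split from step 6. The sketch below shows both. The 30-second dwell threshold is an assumed value for the N the example leaves open, and the log rows and helper names are hypothetical.

```python
def engaged_label(clicked, dwell_seconds, min_dwell_seconds):
    """1 if the user clicked and stayed longer than the threshold, else 0."""
    return int(clicked and dwell_seconds > min_dwell_seconds)


def time_split(rows, cutoff_ts):
    """Train on everything before cutoff_ts, evaluate on everything after."""
    train = [r for r in rows if r["ts"] < cutoff_ts]
    test = [r for r in rows if r["ts"] >= cutoff_ts]
    return train, test


# Hypothetical impression logs; ts is an integer timestamp.
logs = [
    {"ts": 1, "clicked": True, "dwell": 45},   # real read
    {"ts": 2, "clicked": True, "dwell": 3},    # clickbait bounce
    {"ts": 3, "clicked": False, "dwell": 0},   # impression only
    {"ts": 4, "clicked": True, "dwell": 120},  # real read, after the cutoff
]
for row in logs:
    row["label"] = engaged_label(row["clicked"], row["dwell"], min_dwell_seconds=30)

train_rows, test_rows = time_split(logs, cutoff_ts=4)
print([r["label"] for r in logs], len(train_rows), len(test_rows))
```

A random split would let the model train on engagement that happened after some of its test impressions, which overstates offline quality for a feed that always predicts forward in time.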
The common system-design prompts
Prepare a sketch for each — they recur constantly:
- Recommendation / feed ranking (the two-stage pattern above)
- Search / retrieval (embeddings + ANN + re-ranking)
- Fraud / anomaly detection (imbalanced data, precision/recall trade-off, real-time)
- Ad click-through-rate prediction (massive scale, calibration matters)
- Content moderation / classification (precision/recall costs, human-in-the-loop)
- An LLM-powered feature (RAG, latency, cost, hallucination guardrails)
The takeaway
ML system design rewards structure: clarify requirements, frame the ML problem, reason through data/labels → features → model → offline+online evaluation → serving & scale. Always start with a baseline and the simplest thing that works; know the two-stage candidate-generation → ranking pattern cold; remember the A/B test is the final judge; and talk about drift, monitoring, and cold start unprompted. Now the final chapter: the concept questions, coding drills, and prep plan to bring it all home.