Deployment & production checklist
The pieces are built; this chapter packages them to run together and lays out what it takes to run this for real.
One command: Docker Compose
The whole stack — API, MLflow UI, and the React dev server — comes up together:
docker compose up --build
# backend -> http://localhost:8000 (FastAPI + /docs)
# mlflow -> http://localhost:5000 (experiment dashboard)
# frontend -> http://localhost:5173 (the UI)
# Full stack: API backend, MLflow tracking UI, and the React frontend.
# Usage: docker compose up --build
services:
  backend:
    build: .
    ports: ["8000:8000"]
    environment:
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY:-}
      - MLFLOW_TRACKING_URI=http://mlflow:5000
      - NEWSRECO_EMBEDDER=${NEWSRECO_EMBEDDER:-tfidf}
    depends_on: [mlflow]
  mlflow:
    image: ghcr.io/mlflow/mlflow:v2.16.2
    command: mlflow server --host 0.0.0.0 --port 5000
      --backend-store-uri /mlruns --default-artifact-root /mlruns
    ports: ["5000:5000"]
    volumes: ["./mlruns:/mlruns"]
  frontend:
    image: node:20-slim
    working_dir: /app
    command: sh -c "npm install && npm run dev -- --host 0.0.0.0"
    environment:
      - VITE_API_BASE=http://localhost:8000
    ports: ["5173:5173"]
    volumes: ["./frontend:/app"]
    depends_on: [backend]
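Once the stack is up, a quick reachability check can confirm that all three ports answer. The following is a minimal sketch using only the standard library. The ports come from the compose file and /docs is the FastAPI docs page noted above; the root paths used for MLflow and the frontend, and the helper names http_ok and check_stack, are assumptions for illustration.

```python
import http.client
from urllib.request import urlopen

# Ports are taken from the compose file; the exact paths for mlflow and
# frontend are assumptions (any page that returns a response would do).
SERVICES = {
    "backend": "http://localhost:8000/docs",
    "mlflow": "http://localhost:5000/",
    "frontend": "http://localhost:5173/",
}


def http_ok(url, timeout=3.0):
    """Return True if the URL answers with a 2xx or 3xx status."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (OSError, http.client.HTTPException):
        # Covers refused connections, timeouts, and HTTP error statuses.
        return False


def check_stack(probe=http_ok, services=SERVICES):
    """Map each service name to True (reachable) or False."""
    return {name: probe(url) for name, url in services.items()}
```

Calling check_stack() with no arguments probes the live services. Passing a custom probe keeps the function testable without a running stack.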
The backend image trains a model at build time so it's self-contained:
# Backend image: trains the model at build time, then serves the API.
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
# Generate the sample data and train a model so the image is self-contained.
RUN python scripts/make_sample_data.py && python -m newsreco.train
EXPOSE 8000
CMD ["uvicorn", "newsreco.api:app", "--host", "0.0.0.0", "--port", "8000"]
Set a Claude API key (optional) before running docker compose up to enable generative RAG; without it, the assistant stays in offline mode:
export ANTHROPIC_API_KEY=sk-ant-...
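One detail worth noting: the compose file passes ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY:-}, so inside the container the variable exists but is empty when no key was exported on the host. A mode check should therefore test for a non-empty value, not just for presence. A minimal sketch, where rag_mode is a hypothetical helper and not necessarily how the project's code decides:

```python
import os


def rag_mode(env=os.environ):
    """Return "live" when a non-empty API key is set, else "offline".

    Hypothetical helper: an empty string counts as "no key", because the
    compose default (${ANTHROPIC_API_KEY:-}) yields an empty value.
    """
    key = env.get("ANTHROPIC_API_KEY", "").strip()
    return "live" if key else "offline"
```

Taking the environment as a parameter makes the decision easy to exercise in tests with a plain dict.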
The five phases (recap)
You can run any subset, no Docker required:
Phase 0 pip install -r requirements.txt && python scripts/make_sample_data.py
Phase 1 python -m newsreco.train # + mlflow ui --backend-store-uri ./mlruns
Phase 2 uvicorn newsreco.api:app --port 8000
Phase 3 cd frontend && npm install && npm run dev
Phase 4 export ANTHROPIC_API_KEY=... # real RAG (else offline)
Phase 5 docker compose up --build # everything at once
What was verified vs. what you run
Honest accounting, since this was built on a small ARM server:
| Component | Status |
|---|---|
| Data generator + loader | ✅ run & verified |
| Recommender + two-stage ranker (AUC 0.925) | ✅ run & verified |
| Training pipeline + MLflow logging | ✅ run & verified (./mlruns populated) |
| FastAPI endpoints | ✅ verified via TestClient |
| Test suite (pytest) | ✅ 6/6 passing |
| RAG offline mode | ✅ run & verified |
| RAG with live Claude | ⚙️ code complete; needs your ANTHROPIC_API_KEY |
| React UI | ⚙️ complete code; npm install && npm run dev on your machine |
| Docker Compose | ⚙️ complete; docker compose up on a Docker host |
Scaling to production
The capstone is correct and complete; scaling it is about swapping components, not rewriting logic:
- Embeddings. TF-IDF → a neural model (NEWSRECO_EMBEDDER=sbert, or a hosted embedding API). Same interface.
- ANN index. The exact VectorIndex → hnswlib/FAISS for millions of articles, or a vector DB (OpenSearch/Milvus/Qdrant). The interface in ann.py already matches.
- Ranker. Logistic regression → gradient-boosted trees / a neural ranker, with many more features from a feature store. Same (features → click) pipeline.
- Serving. Multiple uvicorn workers behind a load balancer; cache per-user candidates; load the Production model from the MLflow registry.
- Data pipeline. Batch jobs to (re)embed new articles and rebuild the index; stream clicks into the feature store; retrain on a schedule.
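The "swap components, keep the interface" idea can be sketched with a small protocol. Everything below is illustrative: the Index protocol, its add/search method names, and ExactIndex are assumptions made for this example and are not copied from the project's ann.py. The point is only that a caller written against the interface does not change when a brute-force index is replaced by an approximate one that offers the same two methods.

```python
import math
from typing import List, Protocol, Sequence, Tuple


class Index(Protocol):
    """Hypothetical index interface (method names are assumptions)."""

    def add(self, ids: Sequence[str], vectors: Sequence[Sequence[float]]) -> None: ...

    def search(self, query: Sequence[float], k: int) -> List[Tuple[str, float]]: ...


def _unit(v):
    """Scale a vector to unit length (zero vectors are left as zeros)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


class ExactIndex:
    """Brute-force cosine search: exact results, O(n) work per query."""

    def __init__(self):
        self._ids = []
        self._vecs = []

    def add(self, ids, vectors):
        for article_id, vec in zip(ids, vectors):
            self._ids.append(article_id)
            self._vecs.append(_unit(vec))

    def search(self, query, k):
        q = _unit(query)
        scored = [
            (article_id, sum(a * b for a, b in zip(q, vec)))
            for article_id, vec in zip(self._ids, self._vecs)
        ]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:k]


def recommend(index: Index, user_vector, k=2):
    """Depends only on the interface, so the backend can be swapped."""
    return [article_id for article_id, _ in index.search(user_vector, k)]


index = ExactIndex()
index.add(["a", "b", "c"], [[1, 0], [0, 1], [1, 1]])
top = recommend(index, [1, 0.1])
```

An approximate backend would be wrapped in a class exposing the same add and search methods, and recommend would stay untouched.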
Operating it well (the recsys lessons, applied)
From Best practices, the things that keep it healthy:
- Evaluate by time + A/B test. Offline metrics (logged in MLflow) filter ideas; live A/B tests decide. Watch for train/serve skew via the feature store.
- Cold start from day one — trending fallback is wired in; add onboarding and a little exploration (bandits) to surface new articles.
- Watch the feedback loop. Inject diversity (we already dedupe by title) and exploration so the system doesn't collapse onto a few popular stories.
- Monitor. Latency, error rate, recommendation coverage/diversity, and click-through — not just accuracy.
- Ground & cite RAG, and keep the offline fallback so the assistant degrades gracefully.
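The monitoring bullet names coverage and click-through as signals to track alongside accuracy. A minimal sketch of both follows, assuming recommendations are logged as one list of article IDs per request and that impression and click counts are available. The function names and input shapes are hypothetical and not part of the project's code.

```python
def catalog_coverage(rec_lists, catalog_size):
    """Share of the catalog shown in at least one recommendation list.

    A value that drifts toward zero suggests the system is collapsing
    onto a few popular stories.
    """
    if catalog_size <= 0:
        return 0.0
    shown = {article_id for recs in rec_lists for article_id in recs}
    return len(shown) / catalog_size


def click_through_rate(impressions, clicks):
    """Clicks divided by impressions; 0.0 when nothing was shown."""
    return clicks / impressions if impressions else 0.0
```

For example, catalog_coverage([["a", "b"], ["b", "c"]], 10) counts three distinct articles out of ten, and click_through_rate(200, 9) divides nine clicks by two hundred impressions.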
Where to go next
- Drop in the real MIND dataset (Chapter 17) and re-run the pipeline.
- Swap TF-IDF for neural embeddings and the exact index for HNSW — then compare runs in MLflow.
- Add a sequence model (what the user read in order) for next-article prediction.
That's a complete, production-shaped recommendation system — data, models, tracking, serving, UI, and an LLM assistant — built from the ideas in this book. 🎓