Deployment & production checklist

The pieces are built; this chapter packages them to run together and lays out what it takes to operate the system in production.

One command: Docker Compose

The whole stack — API, MLflow UI, and the React dev server — comes up together:

```bash
docker compose up --build
#   backend  -> http://localhost:8000   (FastAPI + /docs)
#   mlflow   -> http://localhost:5000   (experiment dashboard)
#   frontend -> http://localhost:5173   (the UI)
```

```yaml
# Full stack: API backend, MLflow tracking UI, and the React frontend.
# Usage:  docker compose up --build
services:
  backend:
    build: .
    ports: ["8000:8000"]
    environment:
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY:-}
      - MLFLOW_TRACKING_URI=http://mlflow:5000
      - NEWSRECO_EMBEDDER=${NEWSRECO_EMBEDDER:-tfidf}
    depends_on: [mlflow]

  mlflow:
    image: ghcr.io/mlflow/mlflow:v2.16.2
    command: mlflow server --host 0.0.0.0 --port 5000
             --backend-store-uri /mlruns --default-artifact-root /mlruns
    ports: ["5000:5000"]
    volumes: ["./mlruns:/mlruns"]

  frontend:
    image: node:20-slim
    working_dir: /app
    command: sh -c "npm install && npm run dev -- --host 0.0.0.0"
    environment:
      - VITE_API_BASE=http://localhost:8000
    ports: ["5173:5173"]
    volumes: ["./frontend:/app"]
    depends_on: [backend]
```
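Compose's `depends_on` only orders container startup; it does not wait for MLflow or the API to start accepting requests. A small readiness check helps before you open the UI or run smoke tests. The sketch below uses a hypothetical helper, `wait_for` (not part of the newsreco codebase), that polls a URL with the Python standard library until it answers HTTP 200 or a timeout expires.

```python
import http.client
import time
import urllib.request


def wait_for(url: str, timeout: float = 60.0, interval: float = 1.0) -> bool:
    """Poll `url` until it returns HTTP 200; give up after `timeout` seconds.

    Hypothetical readiness helper, not part of the newsreco codebase.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            with urllib.request.urlopen(url, timeout=interval) as resp:
                if resp.status == 200:
                    return True
        except (OSError, http.client.HTTPException):
            # Connection refused, socket timeouts and HTTPError (non-2xx
            # responses, a subclass of OSError) all mean "not ready yet".
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

After `docker compose up --build`, calling `wait_for("http://localhost:8000/docs")`, `wait_for("http://localhost:5000")` and `wait_for("http://localhost:5173")` should each return `True` once that service is serving on the port published above.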

The backend image trains a model at build time so it's self-contained:

```dockerfile
# Backend image: trains the model at build time, then serves the API.
FROM python:3.11-slim

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

# Generate the sample data and train a model so the image is self-contained.
RUN python scripts/make_sample_data.py && python -m newsreco.train

EXPOSE 8000
CMD ["uvicorn", "newsreco.api:app", "--host", "0.0.0.0", "--port", "8000"]
```

Optionally, set a Claude API key before running `docker compose up` to enable generative RAG; without one, the assistant runs in offline mode:

```bash
export ANTHROPIC_API_KEY=sk-ant-...
```
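Because the compose file passes `ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY:-}`, an unset host variable reaches the container as an empty string rather than as a missing variable. Whatever switch the backend uses between live and offline RAG therefore has to treat "empty" as "absent". A minimal sketch of that check, assuming a hypothetical `rag_mode` helper (the project's actual switch may be structured differently):

```python
import os
from typing import Mapping, Optional


def rag_mode(env: Optional[Mapping[str, str]] = None) -> str:
    """Return "live" if a non-blank ANTHROPIC_API_KEY is present, else "offline".

    Hypothetical helper. Compose's ${ANTHROPIC_API_KEY:-} turns an unset host
    variable into "", so a plain `"ANTHROPIC_API_KEY" in os.environ` test
    would wrongly report "live".
    """
    env = os.environ if env is None else env
    key = env.get("ANTHROPIC_API_KEY", "").strip()
    return "live" if key else "offline"
```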

The phases (recap)

Each phase is a single command, and only Phase 5 needs Docker:

```text
Phase 0  pip install -r requirements.txt && python scripts/make_sample_data.py
Phase 1  python -m newsreco.train          # + mlflow ui --backend-store-uri ./mlruns
Phase 2  uvicorn newsreco.api:app --port 8000
Phase 3  cd frontend && npm install && npm run dev
Phase 4  export ANTHROPIC_API_KEY=...       # real RAG (else offline)
Phase 5  docker compose up --build          # everything at once
```

What was verified vs. what you run

Honest accounting, since this was built on a small ARM server:

| Component | Status |
|---|---|
| Data generator + loader | ✅ run & verified |
| Recommender + two-stage ranker (AUC 0.925) | ✅ run & verified |
| Training pipeline + MLflow logging | ✅ run & verified (`./mlruns` populated) |
| FastAPI endpoints | ✅ verified via `TestClient` |
| Test suite (`pytest`) | ✅ 6/6 passing |
| RAG offline mode | ✅ run & verified |
| RAG with live Claude | ⚙️ code complete; needs your `ANTHROPIC_API_KEY` |
| React UI | ⚙️ code complete; run `npm install && npm run dev` on your machine |
| Docker Compose | ⚙️ code complete; run `docker compose up` on a Docker host |

Scaling to production

The capstone's design is complete; scaling it up is a matter of swapping components, not rewriting logic:

Embeddings. TF-IDF → a neural model (NEWSRECO_EMBEDDER=sbert, or a hosted embedding API). Same interface.
ANN index. The exact VectorIndex → hnswlib/FAISS for millions of articles, or a vector DB (OpenSearch/Milvus/Qdrant). The interface in ann.py already matches.
Ranker. Logistic regression → gradient-boosted trees / a neural ranker, with many more features from a feature store. Same (features → click) pipeline.
Serving. Multiple uvicorn workers behind a load balancer; cache per-user candidates; load the Production model from the MLflow registry.
Data pipeline. Batch jobs to (re)embed new articles and rebuild the index; stream clicks into the feature store; retrain on a schedule.
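The "same interface" point is easiest to see with the index. Below is a sketch, assuming a hypothetical `add`/`search` interface (the real `VectorIndex` in ann.py may use different names and array types): an exact, brute-force cosine index in pure Python. Callers depend only on the protocol, so replacing `ExactCosineIndex` with a thin wrapper around an approximate library (for example one that calls hnswlib's `Index.knn_query`) would leave them untouched.

```python
import heapq
import math
from typing import Protocol, Sequence


class VectorIndexLike(Protocol):
    """Hypothetical interface; ann.py's real VectorIndex may differ."""

    def add(self, ids: Sequence[str], vectors: Sequence[Sequence[float]]) -> None: ...

    def search(self, query: Sequence[float], k: int) -> list[tuple[str, float]]: ...


def _unit(v: Sequence[float]) -> list[float]:
    """Scale v to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else [float(x) for x in v]


class ExactCosineIndex:
    """Brute-force search: scores every stored vector, O(n * d) per query."""

    def __init__(self) -> None:
        self._ids: list[str] = []
        self._vecs: list[list[float]] = []

    def add(self, ids: Sequence[str], vectors: Sequence[Sequence[float]]) -> None:
        if len(ids) != len(vectors):
            raise ValueError("ids and vectors must have the same length")
        self._ids.extend(ids)
        self._vecs.extend(_unit(v) for v in vectors)

    def search(self, query: Sequence[float], k: int) -> list[tuple[str, float]]:
        q = _unit(query)
        scored = (
            (sum(a * b for a, b in zip(q, v)), doc_id)
            for doc_id, v in zip(self._ids, self._vecs)
        )
        # nlargest keeps only the k best (score, id) pairs.
        return [(doc_id, score) for score, doc_id in heapq.nlargest(k, scored)]


def top_ids(index: VectorIndexLike, query: Sequence[float], k: int) -> list[str]:
    """Caller code: works with any object that implements the protocol."""
    return [doc_id for doc_id, _ in index.search(query, k)]
```

An approximate index trades a little recall for sub-linear query time; keeping callers on the protocol is what lets you compare the exact and approximate versions run for run.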

Operating it well (the recsys lessons, applied)

Applying the Best practices chapter, these habits keep the system healthy:

Evaluate on time-based splits, then A/B test. Offline metrics (logged in MLflow) filter ideas; live A/B tests decide. Use the feature store to keep training and serving features identical and to catch train/serve skew.
Handle cold start from day one. The trending fallback is already wired in; add onboarding and a little exploration (bandits) to surface new articles.
Watch the feedback loop. Inject diversity and exploration so the system doesn't collapse onto a few popular stories; deduplicating by title, which is already in place, is only a first step.
Monitor. Track latency, error rate, recommendation coverage and diversity, and click-through, not just accuracy.
Ground & cite RAG. Base answers on the retrieved articles, cite them, and keep the offline fallback so the assistant degrades gracefully.
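Coverage, diversity and click-through are cheap to compute from impression logs. A sketch, assuming a hypothetical log format in which each impression records the slate shown and the articles clicked, plus an article-to-category map for diversity (category is one reasonable proxy; embedding distance between articles is another):

```python
from dataclasses import dataclass, field
from typing import Mapping, Sequence


@dataclass
class Impression:
    """Hypothetical log record: one slate shown to one user."""
    user_id: str
    shown: list[str]                                   # article ids, in display order
    clicked: set[str] = field(default_factory=set)     # articles the user clicked


def catalog_coverage(impressions: Sequence[Impression], catalog_size: int) -> float:
    """Fraction of the catalog that appeared in at least one slate."""
    if catalog_size <= 0:
        raise ValueError("catalog_size must be positive")
    shown = set().union(*(imp.shown for imp in impressions))
    return len(shown) / catalog_size


def click_through_rate(impressions: Sequence[Impression]) -> float:
    """Clicks on shown articles divided by the number of shown slots."""
    slots = sum(len(imp.shown) for imp in impressions)
    clicks = sum(len(imp.clicked & set(imp.shown)) for imp in impressions)
    return clicks / slots if slots else 0.0


def category_diversity(impressions: Sequence[Impression],
                       category: Mapping[str, str]) -> float:
    """Mean share of distinct categories per slate (1.0 means no repeats)."""
    ratios = [
        len({category[a] for a in imp.shown}) / len(imp.shown)
        for imp in impressions
        if imp.shown
    ]
    return sum(ratios) / len(ratios) if ratios else 0.0
```

Tracked over time, a falling coverage or diversity number is an early sign of the feedback-loop collapse described above, often before click-through moves.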

Where to go next

Drop in the real MIND dataset (Chapter 17) and re-run the pipeline.
Swap TF-IDF for neural embeddings and the exact index for HNSW — then compare runs in MLflow.
Add a sequence model (what the user read in order) for next-article prediction.
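Before training a neural sequence model, it helps to have a baseline to beat in MLflow. The simplest one, swapped in here as an illustration rather than the book's method, is a first-order Markov chain: count which article tends to follow which within reading sessions, predict from the last article read, and fall back to overall popularity when that article has no recorded successors. `MarkovNextArticle` and the session format (lists of article ids in reading order) are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Iterable, Sequence


class MarkovNextArticle:
    """First-order Markov baseline for next-article prediction (hypothetical)."""

    def __init__(self) -> None:
        self._next: defaultdict = defaultdict(Counter)   # prev article -> Counter of successors
        self._popularity: Counter = Counter()            # article -> total reads

    def fit(self, sessions: Iterable[Sequence[str]]) -> "MarkovNextArticle":
        for session in sessions:
            self._popularity.update(session)
            # Count each consecutive (previous, next) pair in reading order.
            for prev, nxt in zip(session, session[1:]):
                if prev != nxt:
                    self._next[prev][nxt] += 1
        return self

    def predict(self, history: Sequence[str], k: int = 3) -> list[str]:
        """Rank unseen articles: successors of the last read first, then popular ones."""
        last = history[-1] if history else None
        seen = set(history)
        ranked: list[str] = []
        for source in (self._next.get(last, Counter()), self._popularity):
            for article, _count in source.most_common():
                if len(ranked) >= k:
                    return ranked
                if article not in seen:
                    ranked.append(article)
                    seen.add(article)
        return ranked
```

For example, after fitting on the sessions `["a", "b", "c"]`, `["a", "b", "d"]` and `["a", "c"]`, `predict(["a"], k=2)` returns `["b", "c"]`, because "b" followed "a" twice and "c" once.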

That's a complete, production-shaped recommendation system — data, models, tracking, serving, UI, and an LLM assistant — built from the ideas in this book. 🎓

Recommendation Systems from Scratch