Deployment & production checklist

The pieces are built; this chapter packages them to run together and lays out what it takes to run this for real.

One command: Docker Compose

The whole stack — API, MLflow UI, and the React dev server — comes up together:

docker compose up --build
#   backend  -> http://localhost:8000   (FastAPI + /docs)
#   mlflow   -> http://localhost:5000   (experiment dashboard)
#   frontend -> http://localhost:5173   (the UI)

# Full stack: API backend, MLflow tracking UI, and the React frontend.
# Usage:  docker compose up --build
services:
  backend:
    build: .
    ports: ["8000:8000"]
    environment:
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY:-}
      - MLFLOW_TRACKING_URI=http://mlflow:5000
      - NEWSRECO_EMBEDDER=${NEWSRECO_EMBEDDER:-tfidf}
    depends_on: [mlflow]

  mlflow:
    image: ghcr.io/mlflow/mlflow:v2.16.2
    command: mlflow server --host 0.0.0.0 --port 5000
             --backend-store-uri /mlruns --default-artifact-root /mlruns
    ports: ["5000:5000"]
    volumes: ["./mlruns:/mlruns"]

  frontend:
    image: node:20-slim
    working_dir: /app
    command: sh -c "npm install && npm run dev -- --host 0.0.0.0"
    environment:
      - VITE_API_BASE=http://localhost:8000
    ports: ["5173:5173"]
    volumes: ["./frontend:/app"]
    depends_on: [backend]
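
Once the stack is up, a quick smoke check confirms that all three ports answer. The following is a minimal sketch using only the standard library, not part of the project itself: the helper names is_up and wait_for are hypothetical, and the URLs are simply the port mappings from the Compose file above (/docs is the FastAPI docs page noted there).

```python
import time
import urllib.error
import urllib.request

# URLs assumed from the port mappings in the Compose file above.
SERVICES = {
    "backend": "http://localhost:8000/docs",
    "mlflow": "http://localhost:5000",
    "frontend": "http://localhost:5173",
}


def is_up(url, timeout=2.0):
    """Return True if the URL answers with an HTTP status below 500."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status < 500
    except urllib.error.HTTPError as exc:
        # A 4xx response still proves that the server is listening.
        return exc.code < 500
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, and similar.
        return False


def wait_for(services, probe=is_up, attempts=30, delay=2.0, sleep=time.sleep):
    """Probe each service until it responds or attempts run out.

    Returns a dict mapping service name to True (up) or False (still down).
    """
    status = {name: False for name in services}
    for attempt in range(attempts):
        for name, url in services.items():
            if not status[name]:
                status[name] = probe(url)
        if all(status.values()):
            break
        if attempt < attempts - 1:
            sleep(delay)
    return status
```

Calling wait_for(SERVICES) in a second terminal after docker compose up --build returns a name-to-boolean map. The retry loop matters because depends_on only orders container start-up; it does not wait for a service to be ready, and the frontend runs npm install before it starts serving.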

The backend image trains a model at build time so it's self-contained:

# Backend image: trains the model at build time, then serves the API.
FROM python:3.11-slim

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

# Generate the sample data and train a model so the image is self-contained.
RUN python scripts/make_sample_data.py && python -m newsreco.train

EXPOSE 8000
CMD ["uvicorn", "newsreco.api:app", "--host", "0.0.0.0", "--port", "8000"]

Optionally, set an Anthropic API key before running docker compose up to enable generative RAG:

export ANTHROPIC_API_KEY=sk-ant-...
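
How the backend might choose between live and offline RAG can be sketched as follows. This is an illustration under assumptions, not the project's actual code: the function name rag_mode is hypothetical. One detail is grounded in the Compose file above: ${ANTHROPIC_API_KEY:-} passes an empty string when the variable is unset, so an empty value has to count as "no key".

```python
import os


def rag_mode(env=None):
    """Return "live" when an Anthropic key is present, else "offline".

    `env` defaults to the process environment; passing a dict makes the
    function easy to test. (Hypothetical helper, not from the project.)
    """
    env = os.environ if env is None else env
    # Compose substitutes an empty string for an unset variable,
    # so blank or whitespace-only values are treated as missing.
    key = env.get("ANTHROPIC_API_KEY", "").strip()
    return "live" if key else "offline"
```

For example, rag_mode({"ANTHROPIC_API_KEY": ""}) gives "offline", which matches the "real RAG (else offline)" behaviour described in this chapter.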

The five phases (recap)

You can run any subset, no Docker required:

Phase 0  pip install -r requirements.txt && python scripts/make_sample_data.py
Phase 1  python -m newsreco.train          # + mlflow ui --backend-store-uri ./mlruns
Phase 2  uvicorn newsreco.api:app --port 8000
Phase 3  cd frontend && npm install && npm run dev
Phase 4  export ANTHROPIC_API_KEY=...       # real RAG (else offline)
Phase 5  docker compose up --build          # everything at once

What was verified vs. what you run

Honest accounting, since this was built on a small ARM server:

Component                                    Status
Data generator + loader                      ✅ run & verified
Recommender + two-stage ranker (AUC 0.925)   ✅ run & verified
Training pipeline + MLflow logging           ✅ run & verified (./mlruns populated)
FastAPI endpoints                            ✅ verified via TestClient
Test suite (pytest)                          ✅ 6/6 passing
RAG offline mode                             ✅ run & verified
RAG with live Claude                         ⚙️ code complete; needs your ANTHROPIC_API_KEY
React UI                                     ⚙️ complete code; npm install && npm run dev on your machine
Docker Compose                               ⚙️ complete; docker compose up on a Docker host

Scaling to production

The capstone is functionally complete; scaling it is a matter of swapping components rather than rewriting logic:

  • Embeddings. TF-IDF → a neural model (NEWSRECO_EMBEDDER=sbert, or a hosted embedding API). Same interface.
  • ANN index. The exact VectorIndex → hnswlib/FAISS for millions of articles, or a vector DB (OpenSearch/Milvus/Qdrant). The interface in ann.py already matches.
  • Ranker. Logistic regression → gradient-boosted trees / a neural ranker, with many more features from a feature store. Same (features → click) pipeline.
  • Serving. Multiple uvicorn workers behind a load balancer; cache per-user candidates; load the Production model from the MLflow registry.
  • Data pipeline. Batch jobs to (re)embed new articles and rebuild the index; stream clicks into the feature store; retrain on a schedule.
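
The "same interface" idea behind these swaps can be sketched as below. This is a minimal illustration under assumptions: the Index protocol and its add/search methods are hypothetical stand-ins, not the actual interface in ann.py, and ExactIndex is a plain brute-force cosine search written in pure Python for readability.

```python
import math
from typing import Protocol, Sequence


class Index(Protocol):
    """Hypothetical index interface; the project's real one lives in ann.py."""

    def add(self, item_id: str, vector: Sequence[float]) -> None: ...

    def search(self, query: Sequence[float], k: int) -> list: ...


def _normalize(vector):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)


class ExactIndex:
    """Brute-force cosine search: exact results, O(n) work per query."""

    def __init__(self):
        self._ids = []
        self._vectors = []

    def add(self, item_id, vector):
        self._ids.append(item_id)
        self._vectors.append(_normalize(vector))

    def search(self, query, k):
        q = _normalize(query)
        # Dot product of unit vectors equals cosine similarity.
        scored = [
            (item_id, sum(a * b for a, b in zip(q, vec)))
            for item_id, vec in zip(self._ids, self._vectors)
        ]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:k]


def recommend(index: Index, user_vector, k=2):
    """Depends only on add/search, so any conforming backend can be passed in."""
    return [item_id for item_id, _ in index.search(user_vector, k)]
```

Because recommend only touches the two protocol methods, an approximate backend (for example a thin wrapper around hnswlib or FAISS) could replace ExactIndex without changing the calling code. Such a wrapper is not shown here.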

Operating it well (the recsys lessons, applied)

From Best practices, the things that keep it healthy:

  • Evaluate on time-based splits, then A/B test. Offline metrics (logged in MLflow) filter ideas; live A/B tests decide. Watch for train/serve skew via the feature store.
  • Cold start from day one — trending fallback is wired in; add onboarding and a little exploration (bandits) to surface new articles.
  • Watch the feedback loop. Inject diversity (we already dedupe by title) and exploration so the system doesn't collapse onto a few popular stories.
  • Monitor. Latency, error rate, recommendation coverage/diversity, and click-through — not just accuracy.
  • Ground & cite RAG, and keep the offline fallback so the assistant degrades gracefully.
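
Two of the practices above, time-based evaluation and coverage monitoring, are small enough to sketch directly. The code below is illustrative only: the function names and the "ts" event field are assumptions, not the project's schema.

```python
from datetime import datetime


def time_split(events, cutoff):
    """Split click events at a cutoff timestamp.

    Training on events before the cutoff and testing on events at or after
    it mimics production, where the model only ever sees the past.
    Each event is a dict with a "ts" datetime (assumed field name).
    """
    train = [e for e in events if e["ts"] < cutoff]
    test = [e for e in events if e["ts"] >= cutoff]
    return train, test


def catalog_coverage(recommendation_lists, catalog_size):
    """Fraction of the catalog that appears in at least one recommendation list.

    A falling value suggests the system is collapsing onto a few popular items.
    """
    shown = {item for recs in recommendation_lists for item in recs}
    return len(shown) / catalog_size


# Example usage with made-up data.
events = [
    {"user": "u1", "article": "a1", "ts": datetime(2024, 1, 1)},
    {"user": "u1", "article": "a2", "ts": datetime(2024, 1, 5)},
    {"user": "u2", "article": "a3", "ts": datetime(2024, 1, 9)},
]
train, test = time_split(events, cutoff=datetime(2024, 1, 5))
coverage = catalog_coverage([["a1", "a2"], ["a2", "a3"]], catalog_size=4)
```

A random split would leak future clicks into training and inflate offline metrics, which is why the split is by timestamp rather than by row.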

Where to go next

  • Drop in the real MIND dataset (Chapter 17) and re-run the pipeline.
  • Swap TF-IDF for neural embeddings and the exact index for HNSW — then compare runs in MLflow.
  • Add a sequence model (what the user read in order) for next-article prediction.
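
As a starting point for the sequence-model idea, a first-order Markov baseline is easy to build and gives a number to beat. To be clear, this is a simple counting baseline and not a neural sequence model; the function names and input format (one ordered list of article IDs per user) are assumptions for illustration.

```python
from collections import Counter, defaultdict


def fit_transitions(histories):
    """Count how often each article is read immediately after another.

    `histories` is an iterable of reading sequences, oldest article first.
    """
    counts = defaultdict(Counter)
    for history in histories:
        for current, following in zip(history, history[1:]):
            counts[current][following] += 1
    return counts


def predict_next(counts, last_read, k=1):
    """Return up to k articles most often read right after `last_read`."""
    followers = counts.get(last_read)
    if not followers:
        return []  # unseen article: the caller should fall back, e.g. to trending
    return [article for article, _ in followers.most_common(k)]


# Example usage with made-up reading histories.
histories = [
    ["a", "b", "c"],
    ["a", "b", "d"],
    ["x", "b", "c"],
]
transitions = fit_transitions(histories)
next_after_b = predict_next(transitions, "b")
```

Here "c" follows "b" twice and "d" once, so the baseline predicts "c". A learned sequence model would replace the counting step while keeping the same inputs and outputs.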

That's a complete, production-shaped recommendation system — data, models, tracking, serving, UI, and an LLM assistant — built from the ideas in this book. 🎓