Introduction

No prior tooling experience required. If you can train a model in a Python script, you're ready. This book teaches the production tools — the ones every job description lists and every ML team uses — by adding them, one at a time, around a single tiny model. Every chapter shows the exact commands to run and the output you'll see.

The gap this book fills

Most ML learning stops at "the model works in my notebook." But a model in a notebook helps no one. The leap from "it works on my machine" to "it serves a million requests, retrains nightly, and pages me when it drifts" is a different skill set — MLOps tooling — and it's rarely taught coherently. You end up googling "how do I deploy a model" and drowning in fifty disconnected tutorials.

This book is the coherent path. It picks the production-ready, industry-standard tools and teaches them in the order you'd actually adopt them, each solving a real problem the previous step created.

The one idea: one model, many wrappers

We train one tiny sentiment classifier in Chapter 1 — bag-of-words + logistic regression, pure NumPy, no downloads — and then never change it. Instead, every chapter wraps that same model in another production tool:

       track it (MLflow)  ─►  serve it (FastAPI)  ─►  package it (Docker)
            ─►  scale it (Celery+Redis)  ─►  version its data (DVC)
            ─►  orchestrate retraining (Prefect)  ─►  optimize it (ONNX)
            ─►  demo it (Streamlit)  ─►  watch it (monitoring)
            ─►  run the whole stack (docker-compose)

Keeping the model trivial means all your attention goes to the tool — which is the thing you're actually here to learn.

The tools, and why each was chosen

Tool	The problem it solves	Chapter
MLflow	"which experiment was best, and where's that model?"	2–3
FastAPI	"how do other systems call my model?"	4
Docker	"it works on my machine but nowhere else"	5
Celery + Redis	"this job is too slow to run in a request"	6–7
DVC	"which data trained this model?"	8
Prefect	"retrain every night, automatically"	9
ONNX	"make inference fast and framework-free"	10
Streamlit	"let a non-engineer try the model"	11
Monitoring	"is the model still working in production?"	12
Vector DBs + RAG	"answer from my own documents with an LLM"	14–15
LLM serving + observability	"call/run an LLM, and watch its cost & quality"	16–17
Testing + CI/CD	"prove it works on every change, automatically"	18
Config & security	"no leaked keys; lock the API"	19
docker-compose	"run the whole system with one command"	20

These aren't arbitrary: each is a widely adopted, battle-tested choice in its category. Learn these and you can read most ML teams' stacks.

How to read it

Chapters build on each other, so read in order the first time. Some tools (MLflow, FastAPI, ONNX, monitoring) are installed and run live in this book — you'll see real output. Others (Docker, Celery, Redis, DVC, Prefect, Streamlit) need background services or a daemon, so those chapters are precise follow-along tutorials: the exact install commands, the code, the commands to run, and the output to expect. All the code lives in code/ and runs standalone.

What you'll be able to do by the end

Take any model and: track its experiments, register and version it, wrap it in a production API, containerize it, offload heavy work to a queue, cache results, version its data, schedule its retraining, export it for fast portable inference, build a demo UI, and monitor it for drift — then run all of it together with one command. That's the full production lifecycle, and it's exactly what "MLOps" means on a job description.

Ready? Let's meet the lifecycle and the toolbox map. 👉

The production ML lifecycle & the toolbox map

Before any tool, you need the map — the lifecycle every production model lives in, and which tool owns which stage. Once you see the whole loop, each chapter clicks into place as "the tool for this box."

The lifecycle loop

A production model isn't a one-shot script; it's a loop that runs forever:

   ┌─────────────────────────────────────────────────────────────┐
   │                                                              ▼
  DATA ──► TRAIN ──► TRACK ──► REGISTER ──► SERVE ──► MONITOR ──► (drift?)
   ▲         │         │          │           │          │
   │      (Prefect) (MLflow)  (MLflow Reg) (FastAPI)  (drift check)
   │         │                              (Docker)
   └──── retrain when it drifts ◄───────────────────────────────────┘

Read it as a sentence: pull data, train a model, track the experiment, register the winner, serve it behind an API, monitor it in production, and when it drifts, retrain. Then repeat, indefinitely. Conventional software keeps behaving the same until someone changes the code; a model degrades even when nothing in the code changes, because the world it learned from keeps moving. The loop is the whole point, and MLOps is the discipline of automating it.

Two words you'll see everywhere

MLOps — "DevOps for machine learning": the practices and tools that take a model from notebook to reliable, monitored production and keep it healthy.
The model artifact — the saved, trained model (a file). Everything downstream — registry, serving, Docker — moves this artifact around. In our project it's a model.json; in yours it might be a .pkl, .pt, or .onnx.

Why ML needs more than DevOps

Regular software has one moving part: code. ML has three: code + data + model, and all three drift independently. That's why ML gets its own tools — you must version the data (DVC), track which data made which model (MLflow), and watch the model's quality in production, not just its uptime (monitoring). Keep this "three moving parts" idea in mind; it explains why each tool in this book exists.

The toolbox, mapped to the loop

Lifecycle stage	Tool in this book	The category (alternatives)
Orchestrate the loop	Prefect	workflow orchestration (Airflow, Dagster)
Version the data	DVC	data versioning (lakeFS, Delta Lake)
Track experiments	MLflow	experiment tracking (Weights & Biases)
Register & version models	MLflow Registry	model registry (SageMaker, Vertex)
Serve predictions	FastAPI	model serving (BentoML, TorchServe)
Package to run anywhere	Docker	containerization (Podman)
Background / async work	Celery + Redis	task queues (RQ, Dramatiq, Arq)
Cache & fast lookups	Redis	in-memory store (Memcached)
Fast portable inference	ONNX	inference runtime (TensorRT, OpenVINO)
Demo UI	Streamlit	quick UIs (Gradio, Dash)
Monitor & detect drift	PSI / Evidently	observability (WhyLabs, Arize)
Store & search embeddings	vector store	vector DBs (Qdrant, Chroma, pgvector, Pinecone)
Ground an LLM in your docs	RAG	LangChain, LlamaIndex
Serve / run an LLM	Claude API / vLLM / Ollama	OpenAI, TGI, Triton
Watch LLM cost & quality	token/cost tracking + LLM-judge	Langfuse, LangSmith, Ragas
Test & ship safely	pytest + GitHub Actions	CI/CD (GitLab CI, Jenkins)
Config & secrets	Pydantic Settings + secrets mgr	Vault, AWS/GCP Secrets Manager
Run it all together	docker-compose	local orchestration (Kubernetes at scale)

You're learning one representative from each category. Swap in the alternative at your job and the concepts transfer directly — a task queue is a task queue.

How the chapters run

Live (real output): MLflow, FastAPI, ONNX, the drift check, the vector store, RAG retrieval, LLM cost math, the test suite, config loading, and API-key auth are executed in this book. The output shown is genuine, and your runs should match it apart from run IDs and timestamps.
Follow-along: Docker, Celery, Redis, DVC, Prefect, Streamlit, and the LLM generation calls (which need a daemon, a background service, or an API key) give you the exact install command, the code, the commands to run, and the output to expect. Provide the service/key and they run identically.

Either way, all the code is in code/ and every file runs on its own.

A note on setup

You don't need to install everything up front. Each chapter names its one dependency (pip install mlflow, pip install fastapi uvicorn, …), so install as you go. The only thing the core model needs is NumPy. The full list lives in code/requirements.txt, and code/Makefile has a shortcut for every step.

With the map in hand, let's build the one model we'll spend the rest of the book productionizing. 👉

The project: one model we'll productionize

Here's the model the whole book wraps in tools. It's deliberately tiny — a sentiment classifier in pure NumPy, no downloads, no GPU — because the star of this book is the tooling, not the model. Whatever you learn here applies unchanged to a giant transformer; the model is just the thing inside the box.

All code is in code/sentiment/ and runs with only NumPy.

What it does

Given a sentence, predict whether it's positive or negative. Under the hood it's exactly the bag-of-words + logistic regression from the AI Foundations book: count the words, take a weighted sum, squash with a sigmoid to get P(positive).

The data

We keep a small labeled dataset inline (data.py) so the project runs anywhere with zero setup — 25 positive and 25 negative sentences that reuse strong polarity words so a simple model can generalize:

POSITIVE = ["i love this product it is great",
            "absolutely love it fantastic and great", ...]
NEGATIVE = ["i hate this product it is terrible",
            "absolutely hate it awful and terrible", ...]

In a real project this is the part that comes from a database, a warehouse, or a feature store — and the part you'll version with DVC in Chapter 8.

The model

model.py is a SentimentModel class with the four methods every production tool in this book needs: fit, predict_proba, save, load. That save/load pair produces the model artifact — the file that the registry, the API, and Docker all pass around.

class SentimentModel:
    def fit(self, texts, labels): ...          # train by gradient descent
    def predict_proba(self, texts): ...        # -> P(positive) per text
    def save(self, path): ...                  # write the artifact (model.json)
    @classmethod
    def load(cls, path): ...                   # read it back
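
To make "train by gradient descent" concrete, here is a minimal pure-Python sketch of fit and predict_proba written as plain functions. It is not the book's model.py: the function layout, the four training sentences, and the lr and epochs defaults are assumptions chosen for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def featurize(text, vocab):
    """Bag of words: count how often each known word appears."""
    counts = [0.0] * len(vocab)
    for word in text.lower().split():
        if word in vocab:
            counts[vocab[word]] += 1.0
    return counts

def fit(texts, labels, lr=0.5, epochs=200):
    """Learn one weight per word (plus a bias) by batch gradient descent."""
    vocab = {}
    for text in texts:
        for word in text.lower().split():
            vocab.setdefault(word, len(vocab))
    features = [featurize(text, vocab) for text in texts]
    weights = [0.0] * len(vocab)
    bias = 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w = [0.0] * len(vocab)
        grad_b = 0.0
        for x, y in zip(features, labels):
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            error = p - y                      # gradient of log loss w.r.t. the score
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return vocab, weights, bias

def predict_proba(texts, vocab, weights, bias):
    """Return P(positive) for each text."""
    return [
        sigmoid(sum(w * xi for w, xi in zip(weights, featurize(t, vocab))) + bias)
        for t in texts
    ]

# Hypothetical four-sentence dataset, just to exercise the functions.
texts = ["i love it", "great and love", "i hate it", "awful and hate"]
labels = [1, 1, 0, 0]
vocab, weights, bias = fit(texts, labels)
p_pos, p_neg = predict_proba(["love it", "hate it"], vocab, weights, bias)
print(f"P(positive | 'love it') = {p_pos:.3f}")
print(f"P(positive | 'hate it') = {p_neg:.3f}")
```

Words that appear equally in both classes ("i", "it", "and") end up with weights near zero, while "love" and "hate" carry the decision.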

Run it directly to see it learn and predict:

$ python model.py

Output:

train accuracy: 1.0
  P(positive)=0.595  <- 'this is wonderful'
  P(positive)=0.024  <- 'this is terrible'
  P(positive)=0.698  <- 'fast and reliable support'

Both positive sentences score above 0.5 and the negative one scores near zero, so all three land on the right side of the threshold. (Modest confidence on novel phrasings is expected from a 50-example model. That's fine; it's our stand-in for a real model.)

Training & evaluating

train.py does the production-shaped thing: split the data, train, evaluate on a held-out test set (honest evaluation matters), and save the artifact. It also has a --mlflow flag we'll use in the next chapter.

$ python train.py

Output:

examples: 50  train=38  test=12
vocab size: 68
train_accuracy: 1.000
test_accuracy:  1.000
saved model -> model.json

100% on the held-out set (the data is clean and separable by design), and it wrote model.json — the artifact everything downstream consumes.
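
The split-and-evaluate step can be sketched in a few lines. This is an illustration rather than the code in train.py: the helper names, the seed, and the 0.24 test fraction (picked only because it reproduces the 38/12 split shown above) are assumptions.

```python
import random

def train_test_split(examples, test_fraction=0.24, seed=0):
    """Shuffle deterministically, then hold out a fraction for testing."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def accuracy(predicted, actual):
    """Share of predictions that match the true labels."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)

# 50 placeholder examples standing in for the real sentences.
examples = [(f"sentence {i}", i % 2) for i in range(50)]
train, test = train_test_split(examples)
print(len(train), len(test))   # 38 12
```

Fixing the seed is what makes the split reproducible: the same examples land in the test set on every run, so accuracy numbers from different runs are comparable.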

The artifact: the thing tools move around

That model.json is the heart of every chapter to come. Internally it's just the learned vocabulary and weights as JSON:

{ "vocab": {"love": 0, "great": 1, ...}, "weights": [1.83, 2.41, ..., -0.12], ... }

Everything from here on is about that file's journey: tracked (MLflow), versioned (registry), wrapped in an API (FastAPI), baked into an image (Docker), produced by a scheduled pipeline (Prefect), converted for speed (ONNX), and watched for rot (monitoring). The model never changes — its operational maturity does.
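
A save/load round trip for such a file might look like the sketch below. It assumes the artifact holds just a vocab mapping and a weights list; the real model.json has more fields (elided above), and the function names here are hypothetical.

```python
import json
import os
import tempfile

def save_artifact(path, vocab, weights):
    """Write the learned state as plain JSON."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"vocab": vocab, "weights": weights}, f)

def load_artifact(path):
    """Read the learned state back."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    return data["vocab"], data["weights"]

vocab = {"love": 0, "hate": 1}          # illustrative values only
weights = [1.5, -1.5]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.json")
    save_artifact(path, vocab, weights)
    size_bytes = os.path.getsize(path)
    restored_vocab, restored_weights = load_artifact(path)
```

Because the artifact is plain JSON, any tool that can move a file can move the model, and any language that can parse JSON can read it.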

Why "keep the model trivial" is the right call

When a serving tutorial uses a 2 GB model, half of it is fighting downloads, CUDA, and memory, and the actual tool gets two paragraphs. By making the model a 5 KB JSON file, every chapter spends its energy on the tool you came to learn. And because the interface (fit/predict_proba/save/load) has the same shape a real model exposes, nothing you learn is toy-specific.

The takeaway

We have a small, honest, fully-working model and its artifact (model.json), exposing the universal fit/predict_proba/save/load interface. That's everything the rest of the book builds on. First stop on the artifact's journey: making experiments reproducible and comparable with MLflow. 👉

MLflow: experiment tracking

The moment you train a model more than once — a different learning rate, more epochs, new features — you face the question that haunts every ML project: which run was best, and what settings produced it? Without tooling the answer lives in scrollback, sticky notes, and filenames like model_final_v3_REALLY_final.json. MLflow replaces that chaos with a record of every run's parameters, metrics, and artifacts.

Install: pip install mlflow — that's all this chapter needs.

What MLflow Tracking gives you

Parameters — the inputs you chose (learning rate, epochs, vocab size).
Metrics — the numbers you got (train/test accuracy, loss).
Artifacts — the files produced (the model, plots, configs).
A UI — a web dashboard to sort, filter, and compare runs side by side.

Together these make experiments reproducible ("run 17 used lr=0.5, 500 epochs") and comparable ("lr=0.5 beat lr=0.1 by 4 points").

The five lines that matter

You wrap a training run in start_run() and log what you care about. From train.py:

import mlflow

mlflow.set_experiment("sentiment")               # a named bucket of runs
with mlflow.start_run() as run:                  # one run = one training execution
    mlflow.log_params({"lr": lr, "epochs": epochs, "vocab_size": len(model.vocab)})
    mlflow.log_metrics({"train_accuracy": train_acc, "test_accuracy": test_acc})
    mlflow.log_artifact("model.json")            # save the model file with the run

That's the whole API surface for tracking: log_params, log_metrics, log_artifact. Everything else is convenience on top.

Running it

Our train.py logs only when you pass --mlflow, so the core stays dependency-free:

cd code/sentiment
python train.py --mlflow

Output:

2026/06/28 04:09:24 INFO mlflow.tracking.fluent: Experiment with name 'sentiment' does not exist. Creating a new experiment.
examples: 50  train=38  test=12
vocab size: 68
train_accuracy: 1.000
test_accuracy:  1.000
saved model -> model.json
logged run 17dfd654a50e4a6d8a6f0271318d572e to MLflow

MLflow created a sentiment experiment and recorded a run with a unique ID. Every time you run it — with different settings — you get another logged run you can compare. Run it a few times with different flags:

python train.py --mlflow --lr 0.1 --epochs 200
python train.py --mlflow --lr 1.0 --epochs 1000

Viewing & comparing runs in the UI

MLflow wrote everything to a local mlruns/ folder. Launch the dashboard:

mlflow ui --backend-store-uri ./mlruns --port 5000      # open http://localhost:5000

You'll see a sortable table — one row per run, columns for every param and metric. Click two runs and Compare to see them side by side, sort by test_accuracy to find the winner, and download any run's logged model.json. This table is the "which run was best?" answer, forever.
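
Underneath, "sort by test_accuracy" is an ordinary query over run records. The toy below is not MLflow's storage format or API; it keeps hypothetical runs in a Python list, with made-up accuracy numbers, just to show the shape of the data the UI is sorting.

```python
runs = []   # MLflow keeps this in mlruns/ or a database; a list is enough here

def log_run(params, metrics):
    """Record one training execution and return its ID."""
    run_id = f"run-{len(runs) + 1}"
    runs.append({"run_id": run_id, "params": params, "metrics": metrics})
    return run_id

# Made-up numbers for illustration, not results from train.py.
log_run({"lr": 0.1, "epochs": 200}, {"test_accuracy": 0.92})
log_run({"lr": 1.0, "epochs": 1000}, {"test_accuracy": 0.96})

best = max(runs, key=lambda r: r["metrics"]["test_accuracy"])
print(best["run_id"], best["params"])   # run-2 {'lr': 1.0, 'epochs': 1000}
```

The value of a tracker is that this record is written for you on every run, so the comparison is still possible months later.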

Autolog: tracking for free

For popular frameworks, MLflow can capture everything automatically — one line instead of manual log_* calls:

mlflow.autolog()          # then just train as usual
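# for scikit-learn, XGBoost, and PyTorch Lightning it auto-logs params, metrics, and the model

autolog() is the fastest way to get tracking into an existing scikit-learn, XGBoost, or PyTorch Lightning script: it hooks the framework's fit() and records params, metrics, and the model with zero extra code. A plain PyTorch training loop has no fit() to hook, so there you still call log_params and log_metrics yourself.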

Where the data lives (and how it scales)

Local (this chapter): mlruns/ is a folder on disk — perfect for solo work.
Team: run an MLflow tracking server backed by a database (Postgres) and object storage (S3), and point everyone at it with mlflow.set_tracking_uri("http://mlflow.mycompany.com"). Now the whole team's experiments are in one searchable place.

Don't be confused: tracking vs. the registry. Tracking (this chapter) records experiments — every run you ever did, good or bad. The registry (next chapter) manages the chosen models — versioned, named, promoted to production. Tracking is your lab notebook; the registry is your shipping shelf. They're complementary, and MLflow does both.

Why this matters in practice

Reproducibility — months later you can see exactly what produced a model.
Comparison — no more guessing whether the new idea actually helped.
Collaboration — teammates see your runs instead of asking "what did you try?"
Auditability — regulated industries require a record of how each model was built. MLflow is that record.

This is the difference between "I think the new model is better" and "run #17 improved test accuracy from 0.92 to 0.96; here's the proof and the exact config."

The takeaway

MLflow Tracking records every run's params, metrics, and artifacts with three calls (log_params, log_metrics, log_artifact), gives you a UI to compare them, and scales from a local folder to a shared team server. It turns "which experiment was best?" from archaeology into a database query. Now that we can find our best model, let's manage and ship it with the MLflow Model Registry. 👉

MLflow: the model registry

Tracking tells you which run was best. But which model is in production right now? Which version do you roll back to when the new one misbehaves? How does your serving code fetch "the current model" without hard-coding a file path? That's the job of the Model Registry — a versioned, named shelf for the models you actually ship.

Install: pip install mlflow pandas (the registry's pyfunc models use pandas). Needs a database backend — we use a local SQLite file.

The problem it solves

Without a registry, "deploy the model" means copying a file and hoping everyone uses the right one. The registry replaces that with named, versioned models and aliases like @production and @staging, so your serving code asks for "models:/sentiment@production" and always gets the right version — and you can promote or roll back by moving an alias, not redeploying code.

Step 1 — wrap the model as a `pyfunc`

The registry works best with a pyfunc — MLflow's universal model interface. Wrapping our custom model in it means any MLflow-aware system can load and run it the same way, regardless of framework. From registry.py:

import mlflow.pyfunc

class SentimentPyfunc(mlflow.pyfunc.PythonModel):
    def load_context(self, context):
        self.model = SentimentModel.load(context.artifacts["model_path"])

    def predict(self, context, model_input, params=None):
        return self.model.predict_proba(model_input["text"].tolist())

load_context restores the model from its artifact; predict defines the universal inference call. That's the adapter that makes our NumPy model speak fluent MLflow.

Step 2 — log and register in one call

import mlflow

mlflow.set_tracking_uri("sqlite:///mlflow.db")     # DB backend enables the registry
with mlflow.start_run() as run:
    mlflow.pyfunc.log_model(
        name="model",                               # MLflow 3.x; on 2.x pass artifact_path="model"
        python_model=SentimentPyfunc(),
        artifacts={"model_path": "model.json"},
        registered_model_name="sentiment",          # <-- registers it
    )

The registered_model_name argument is the magic word: it logs the model and creates (or adds a new version to) a registered model called sentiment.

Step 3 — promote with an alias

A registered model accumulates versions (v1, v2, v3…). You point an alias at the one that's live:

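import mlflow
client = mlflow.MlflowClient()
# newest version = highest version number (get_latest_versions is tied to the deprecated stages)
versions = client.search_model_versions("name='sentiment'")
version = str(max(int(v.version) for v in versions))
client.set_registered_model_alias("sentiment", "production", version)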

Don't be confused: aliases vs. the old "stages." Older MLflow used fixed stages (Staging, Production, Archived). MLflow 2.9 deprecated them in favor of flexible aliases: you can have @production, @champion, @canary, whatever you need. If you see transition_model_version_stage in old tutorials, the modern equivalent is set_registered_model_alias.

Step 4 — load by alias (what serving does)

Your API never hard-codes a path or version. It asks for the alias:

import mlflow.pyfunc
import pandas as pd
model = mlflow.pyfunc.load_model("models:/sentiment@production")
out = model.predict(pd.DataFrame({"text": ["this is great", "this is terrible"]}))

Running the whole thing

cd code/sentiment
python registry.py

Output:

logged + registered 'sentiment' from run 6fc7412d
set alias 'production' -> sentiment v1
loaded model predictions: [0.977, 0.024]

It registered version 1, pointed @production at it, then loaded by that alias and predicted — 0.977 for the positive sentence, 0.024 for the negative. Run it again and you'll get v2, with @production moved to the new version — that's a deploy. Point the alias back to v1 and you've rolled back, instantly, without touching serving code.
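
The reason an alias move counts as a deploy is plain indirection: serving code resolves a name at load time instead of holding a version number. The toy registry below is a sketch of that idea only; ToyRegistry and its methods are hypothetical and do not mirror MLflow's internals.

```python
class ToyRegistry:
    """In-memory stand-in: versioned artifacts plus movable aliases."""

    def __init__(self):
        self.versions = {}   # model name -> list of artifacts (index 0 is v1)
        self.aliases = {}    # (model name, alias) -> version number

    def register(self, name, artifact):
        """Append a new version and return its number."""
        self.versions.setdefault(name, []).append(artifact)
        return len(self.versions[name])

    def set_alias(self, name, alias, version):
        """Point an alias at an existing version."""
        if not 1 <= version <= len(self.versions.get(name, [])):
            raise ValueError(f"{name} has no version {version}")
        self.aliases[(name, alias)] = version

    def load(self, uri):
        """Resolve a 'models:/<name>@<alias>' URI to an artifact."""
        prefix = "models:/"
        if not uri.startswith(prefix) or "@" not in uri:
            raise ValueError(f"unsupported URI: {uri}")
        name, alias = uri[len(prefix):].split("@", 1)
        return self.versions[name][self.aliases[(name, alias)] - 1]

registry = ToyRegistry()
v1 = registry.register("sentiment", "weights-v1")
registry.set_alias("sentiment", "production", v1)

v2 = registry.register("sentiment", "weights-v2")
registry.set_alias("sentiment", "production", v2)            # deploy
deployed = registry.load("models:/sentiment@production")

registry.set_alias("sentiment", "production", v1)            # rollback
rolled_back = registry.load("models:/sentiment@production")
print(deployed, rolled_back)   # weights-v2 weights-v1
```

The load call is identical before and after each alias move, which is why serving code never needs to change.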

The registry in the MLflow UI

mlflow ui --backend-store-uri sqlite:///mlflow.db --port 5000

The Models tab now shows sentiment with all its versions, their aliases, which run produced each, and its metrics. This is the team's source of truth for "what's in production and how did it get there."

How this looks at a real company

A CI pipeline trains a candidate, logs it, and registers a new version.
An automated gate (or a human) checks it beats the current @production on a holdout, then moves the alias. Teams often canary first: the serving layer routes about 5% of traffic to @canary (the new version) and the rest to @production, and the new version is watched before full promotion.
Serving code loads models:/sentiment@production and is blissfully unaware of version numbers.
An incident? Move the alias back to the previous version. Rollback in seconds.
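
Aliases name versions; they don't split traffic by themselves, so the 95/5 routing has to live in your serving layer. One common approach, sketched here under the assumption that each request carries a stable ID, is to hash the ID into a bucket. The function names and the choice of SHA-256 are illustrative.

```python
import hashlib

def pick_alias(request_id, canary_percent=5):
    """Map a request ID to 'canary' or 'production', the same way every time."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100   # 0..99
    return "canary" if bucket < canary_percent else "production"

def model_uri(request_id):
    """Build the registry URI the serving code would load for this request."""
    return f"models:/sentiment@{pick_alias(request_id)}"

request_ids = [f"req-{i}" for i in range(10_000)]
canary_share = sum(pick_alias(r) == "canary" for r in request_ids) / len(request_ids)
print(f"canary share: {canary_share:.1%}")
```

Hashing rather than rolling a random number means a given request ID always gets the same model, which keeps retries and debugging consistent.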

Managed platforms (SageMaker Model Registry, Vertex Model Registry) offer the same concept; the idea — named, versioned models with promotable aliases — transfers everywhere.

The takeaway

The Model Registry turns a pile of model files into named, versioned, promotable models. Wrap the model as a pyfunc, register a version, point a @production alias at it, and load by alias in serving — so deploys and rollbacks are just moving an alias, never editing code. Tracking found the best model; the registry ships it. Now let's actually expose it to the world with an API. 👉

FastAPI: a real prediction service

A registered model still can't be used by anything but Python code that imports it. To let a website, a mobile app, or another microservice get predictions, you wrap the model in a web API — an HTTP endpoint other systems call. FastAPI is the modern Python standard for this: fast, typed, and it generates interactive docs for free.

Install: pip install fastapi "uvicorn[standard]" — FastAPI is the framework, uvicorn is the server that runs it.

Why an API, and why FastAPI

An API turns your model into a service: any system, in any language, can POST some text and get a prediction back as JSON. FastAPI wins because it gives you, out of the box:

Automatic validation — declare the input shape with Pydantic; bad requests are rejected with clear errors before your code runs.
Interactive docs — a Swagger UI at /docs where anyone can try the API in a browser.
Speed & async — among the fastest Python frameworks; handles concurrency well.
Type safety — your editor and the framework both understand your data shapes.

The service

The full app is api/main.py. The core is three pieces: schemas (what goes in and out), the model loaded once at startup, and the endpoints.

Schemas — validation for free

from pydantic import BaseModel, Field

class PredictRequest(BaseModel):
    text: str = Field(..., min_length=1, examples=["this is wonderful"])

class Prediction(BaseModel):
    label: str
    score: float

Because the schema declares text as a non-empty string, FastAPI rejects bad input automatically: you never write validation code, and the rejection is a clean, typed error.

Load the model once, serve many times

from fastapi import FastAPI

app = FastAPI(title="Sentiment API")
model = _load_or_train()        # loaded ONCE at startup, reused for every request

Don't be confused: load the model at startup, not per request. A common beginner bug is loading the model file inside the endpoint, so every request re-reads it from disk and the API becomes painfully slow. Load it once at module level (or in a lifespan handler, which replaces the deprecated startup event in current FastAPI) and reuse it. Models are read-only at inference time; one copy serves all requests.
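
The difference is easy to demonstrate with a counter. In this sketch a hypothetical load_model_from_disk stands in for reading model.json, and functools.lru_cache gives a lazy load-once variant of the module-level pattern; neither name comes from api/main.py.

```python
import functools

load_count = 0

def load_model_from_disk():
    """Stand-in for an expensive file read; counts how often it runs."""
    global load_count
    load_count += 1
    return {"love": 1.5, "hate": -1.5}   # hypothetical word weights

@functools.lru_cache(maxsize=1)
def get_model():
    """First call loads the model; later calls reuse the cached copy."""
    return load_model_from_disk()

def handle_request(text):
    model = get_model()
    score = sum(model.get(word, 0.0) for word in text.split())
    return "positive" if score >= 0 else "negative"

for text in ["love it", "hate it", "love love"]:
    handle_request(text)
print(load_count)   # 1
```

Three requests, one load. Calling load_model_from_disk() directly inside handle_request would print 3 instead, and the gap grows with every request served.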

The endpoints

@app.get("/health")
def health():
    return {"status": "ok", "vocab_size": len(model.vocab)}

@app.post("/predict", response_model=Prediction)
def predict(req: PredictRequest):
    score = float(model.predict_proba([req.text])[0])
    return Prediction(label="positive" if score >= 0.5 else "negative",
                      score=round(score, 4))

/health is a liveness probe — Docker and Kubernetes ping it to know the service is up (Chapter 5). /predict is the real work. There's also a /predict/batch for scoring many texts in one call.

Running it

cd code
uvicorn api.main:app --reload          # --reload = auto-restart on code changes

Then open http://localhost:8000/docs for the interactive Swagger UI, or call it from the command line:

curl -X POST http://localhost:8000/predict \
     -H "Content-Type: application/json" \
     -d '{"text": "this is wonderful and great"}'
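
The same call from Python needs nothing beyond the standard library. This sketch assumes the service is running at http://localhost:8000 as above; build_predict_request and predict are hypothetical helper names, not part of the book's code.

```python
import json
import urllib.request

def build_predict_request(text, base_url="http://localhost:8000"):
    """Build (but don't send) the POST /predict request."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def predict(text, base_url="http://localhost:8000", timeout=5.0):
    """Send the request and decode the JSON reply. Needs the server running."""
    request = build_predict_request(text, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))

# With uvicorn running in another terminal:
# print(predict("this is wonderful and great"))
```

urlopen raises urllib.error.HTTPError for non-2xx replies, so a rejected request (such as an empty text) surfaces as an exception rather than a return value.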

What it returns

Here's the service answering, verified end to end (via FastAPI's test client):

GET  /health                          -> {"status": "ok", "vocab_size": 75}
POST /predict {"text":"... great"}    -> {"label": "positive", "score": 0.9841}
POST /predict {"text":"terrible ..."} -> {"label": "negative", "score": 0.0112}
POST /predict/batch {"texts":[...]}   -> [{"label":"positive","score":0.8847},
                                          {"label":"negative","score":0.1188}]
POST /predict {"text":""}             -> 422 Unprocessable Entity   (validation!)

Look at that last line: an empty string is rejected with HTTP 422 before your code ever runs, because the Pydantic schema said min_length=1. That's the validation you got for free — and it's exactly what keeps a production API from crashing on malformed input.

Production touches (so you sound senior)

The basics above are real, but a production service adds:

Multiple workers — uvicorn api.main:app --workers 4 (or Gunicorn) to use all CPU cores.
A timeout & request size limit — never let one request hang or flood you.
Structured logging & a request ID — so you can trace a prediction later.
Load the model from the registry — mlflow.pyfunc.load_model("models:/sentiment@production") (Chapter 3) instead of a local file, so deploys are alias moves.
Async for I/O-bound work — async def endpoints when you call databases or other services.

Don't be confused: uvicorn vs. FastAPI vs. Gunicorn. FastAPI is the framework you write code in. uvicorn is the ASGI server that actually runs it and speaks HTTP. Gunicorn is a process manager that runs several uvicorn workers for production. You write FastAPI; uvicorn serves it; Gunicorn scales it.

Connecting the dots

The service gets its model from _load_or_train(), which works from the model.json of Chapter 1. In production you'd load from the registry (Chapter 3) instead, so promoting a new model needs no code change. This API is also what the Streamlit UI (Chapter 11) and any other client will call.

The takeaway

FastAPI wraps your model in a typed HTTP service: Pydantic schemas validate input for free (the empty string got a 422), the model loads once at startup, and /health + /predict expose it to any system in any language — with interactive docs at /docs. This is how a model stops being a script and becomes a service. But a service that runs only on your laptop isn't deployed — next, we make it run anywhere with Docker. 👉

Docker: package it to run anywhere

Your FastAPI service runs on your laptop, with your Python version and your installed packages. Ship it to a server and it breaks — wrong Python, missing library, different OS. Docker ends "works on my machine" by packaging your code and its entire environment into a portable image that runs identically everywhere: your laptop, a teammate's, a server, the cloud.

Setup: install Docker Desktop (Mac/Windows) or Docker Engine (Linux). This chapter is follow-along — Docker needs a daemon, so commands and expected output are shown for you to run locally.

The three words you must know

Dockerfile — a recipe: the steps to build your environment (base OS, install deps, copy code, run command).
Image — the built, frozen result of that recipe. Immutable, shareable, tagged (sentiment-api:1.0).
Container — a running instance of an image. You can run many containers from one image.

Analogy: the Dockerfile is source code, the image is the compiled program, the container is a running process.

The Dockerfile, line by line

Here's code/Dockerfile, which packages our API. Every line is a real best practice:

# 1. small base image (fewer CVEs, less weight)
FROM python:3.11-slim

WORKDIR /app

# 2. deps FIRST, so Docker caches this layer
COPY requirements.txt .
RUN pip install --no-cache-dir fastapi "uvicorn[standard]" pydantic numpy

# 3. then the code (changes often)
COPY sentiment/ ./sentiment/
COPY api/ ./api/

# 4. don't run as root
RUN useradd --create-home appuser
USER appuser

EXPOSE 8000

HEALTHCHECK --interval=30s --timeout=3s --retries=3 \
    CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')"

CMD ["uvicorn", "api.main:app", "--host", "0.0.0.0", "--port", "8000"]

The four ideas worth internalizing:

Slim base image: python:3.11-slim is several times smaller than the full image, which means faster pulls and a smaller attack surface.
Copy requirements.txt before the code. Docker builds in cached layers; if you copy code first, every code change re-installs all dependencies. Deps-first means pip install is cached and rebuilds are seconds, not minutes.
Run as a non-root user. If the container is compromised, the attacker isn't root. Basic, essential hygiene.
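HEALTHCHECK hits our /health endpoint (Chapter 4) so Docker and docker-compose know when the container is actually ready, not just started. Kubernetes ignores this instruction and uses its own liveness and readiness probes, which you point at the same endpoint.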
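
Why does the order of COPY and RUN matter so much? A simplified model of layer caching helps: each layer's cache key depends on its own instruction, the content it copies, and the key of the layer before it. The sketch below is that simplified model, not Docker's actual implementation; the function names and the way content is represented are assumptions.

```python
import hashlib

def layer_keys(steps):
    """Return one cache key per step; each key chains in the previous key."""
    keys = []
    parent = ""
    for instruction, content in steps:
        raw = f"{parent}|{instruction}|{content}".encode("utf-8")
        parent = hashlib.sha256(raw).hexdigest()
        keys.append(parent)
    return keys

def cache_hits(old_steps, new_steps):
    """Instructions whose key is unchanged between two builds."""
    pairs = zip(layer_keys(old_steps), layer_keys(new_steps), new_steps)
    return [step[0] for old_key, new_key, step in pairs if old_key == new_key]

def deps_first(code):
    return [("FROM python:3.11-slim", ""),
            ("COPY requirements.txt", "fastapi uvicorn pydantic numpy"),
            ("RUN pip install", ""),
            ("COPY code", code)]

def code_first(code):
    return [("FROM python:3.11-slim", ""),
            ("COPY code", code),
            ("COPY requirements.txt", "fastapi uvicorn pydantic numpy"),
            ("RUN pip install", "")]

# Simulate editing the code between two builds ("v1" -> "v2").
print(cache_hits(deps_first("v1"), deps_first("v2")))
# ['FROM python:3.11-slim', 'COPY requirements.txt', 'RUN pip install']
print(cache_hits(code_first("v1"), code_first("v2")))
# ['FROM python:3.11-slim']
```

With dependencies first, a code edit invalidates only the final COPY. With code first, the same edit invalidates every later layer, including the slow pip install.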

Don't be confused: EXPOSE vs. -p. EXPOSE 8000 only documents that the app uses port 8000 — it doesn't open anything. You actually publish the port at run time with -p 8000:8000 (host port : container port). Forgetting -p is the #1 "why can't I reach my container?" gotcha.

Build and run

cd code
docker build -t sentiment-api .          # build the image from the Dockerfile
docker run -p 8000:8000 sentiment-api    # run a container, publish the port

Expected output:

[+] Building 12.3s (12/12) FINISHED
 => naming to docker.io/library/sentiment-api                          0.0s
...
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
INFO:     Application startup complete.

Now curl http://localhost:8000/health works exactly as in Chapter 4 — but it's running inside an isolated container, with its own Python and dependencies, that will behave identically on any machine with Docker.

The commands you'll use daily

docker images                    # list built images
docker ps                        # list running containers
docker logs <container>          # see a container's output
docker exec -it <container> bash # open a shell inside a running container
docker stop <container>          # stop it
docker build -t name:tag .       # build with a version tag
docker push registry/name:tag    # push to a registry (Docker Hub, ECR, GCR)

How images travel to production

You don't copy code to servers anymore — you build an image, push it to a container registry (Docker Hub, AWS ECR, Google Artifact Registry), and servers pull and run it:

build image  ─►  docker push  ─►  registry  ─►  docker pull  ─►  run on server / K8s

This is the foundation of modern deployment. Kubernetes (the full-stack chapter and the foundations book) orchestrates thousands of these containers; cloud "serverless container" services (AWS Fargate, Cloud Run) run them without you managing servers at all.

Don't be confused: image vs. container (again, because it matters). You build an image once and run many containers from it. Stopping a container doesn't delete the image. A container is ephemeral — anything written inside it (like a model.json created at runtime) vanishes when it stops, unless you mount a volume. Bake the model into the image, or load it from the registry/object storage at startup — never rely on files written inside a running container surviving.

A note on size & GPUs

Keep images small: slim bases, .dockerignore, multi-stage builds (build in a fat image, copy only the result into a slim one).
For GPU inference, use NVIDIA's CUDA base images and the NVIDIA container runtime — the same Dockerfile idea, heavier base.

The takeaway

Docker packages your code and its environment into a portable image — killing "works on my machine." Write a Dockerfile (slim base, deps before code for caching, non-root, healthcheck), build an image, run containers from it, and push to a registry so any server can pull and run it identically. Containers are ephemeral; don't rely on files written inside them. Our service is now portable — next, let's handle work that's too slow to do inside a request, with a task queue. 👉

Celery + Redis: an async task queue

Some work is too slow to do inside a web request. If a user asks to score 10,000 documents, or you need to retrain the model, you can't make them wait 30 seconds for the HTTP response — the request will time out and your server will be tied up. The fix is a task queue: hand the slow job to a background worker and return immediately. Celery is Python's standard task queue, and Redis is the fast in-memory store that connects the pieces.

Setup: pip install celery redis and run a Redis server (redis-server, or docker run -p 6379:6379 redis). Follow-along — needs Redis + a worker process.

The architecture

Three players pass a job along:

  your API  ──.delay(job)──►  Redis (the queue)  ──►  Celery worker runs it
     │                                                       │
     └────────  returns a task ID instantly  ◄──── result stored back in Redis

Producer — your code, which enqueues a task with .delay(...) and gets back a ticket (an AsyncResult) instantly, without waiting.
Broker (Redis) — the queue tasks wait in until a worker is free.
Worker — a separate process that pulls tasks off the queue and runs them.
Result backend (Redis) — where the worker stores the result for later pickup.

Don't be confused: Redis is playing two roles here. As the broker it's the message queue (tasks waiting to run). As the result backend it stores finished results. They're configured separately (and can be different systems — e.g. RabbitMQ broker + Redis backend), but using Redis for both is the simplest common setup.
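The four roles can be sketched in plain Python before any Celery is involved. This is only an in-memory illustration, assuming a thread stands in for the worker process, a queue.Queue for the broker, and a dict for the result backend; the helper names (delay, get, count_words) are made up for the example and are not Celery's API.

```python
import queue
import threading
import uuid

broker = queue.Queue()      # stands in for the Redis broker (tasks waiting to run)
results = {}                # stands in for the Redis result backend
done = {}                   # task_id -> Event that is set once the result is stored


def delay(func, *args):
    """Producer side: enqueue the job and return a ticket immediately."""
    task_id = str(uuid.uuid4())
    done[task_id] = threading.Event()
    broker.put((task_id, func, args))
    return task_id


def worker():
    """Worker side: pull jobs off the queue, run them, store the result."""
    while True:
        task_id, func, args = broker.get()
        if func is None:                    # sentinel: shut the worker down
            break
        results[task_id] = func(*args)
        done[task_id].set()


def get(task_id, timeout=5):
    """Block until the worker has stored the result, then return it."""
    if not done[task_id].wait(timeout):
        raise TimeoutError(task_id)
    return results[task_id]


def count_words(texts):                     # hypothetical "slow" job
    return [len(t.split()) for t in texts]


threading.Thread(target=worker, daemon=True).start()
ticket = delay(count_words, ["great product", "awful customer service"])
print(get(ticket))                          # [2, 3]
```

Unlike this sketch, a real broker lives outside your process, which is what lets queued work outlast an API restart.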

The Celery app

tasks/celery_app.py configures Celery to use Redis for both roles, with production-sane defaults:

from celery import Celery

app = Celery("sentiment", broker="redis://localhost:6379/0",
                          backend="redis://localhost:6379/0",
                          include=["tasks.tasks"])  # import the task module so the worker registers its tasks
app.conf.update(
    task_time_limit=300,            # hard-kill a task after 5 minutes
    worker_max_tasks_per_child=100, # recycle workers to avoid memory leaks
)

The tasks

A task is just a function with the @app.task decorator. From tasks/tasks.py:

@app.task
def batch_score(texts: list[str]) -> list[dict]:
    model = _model()
    probs = model.predict_proba(texts)
    return [{"text": t, "score": round(float(p), 4),
             "label": "positive" if p >= 0.5 else "negative"}
            for t, p in zip(texts, probs)]

@app.task
def retrain() -> dict:
    """Retrain from scratch and save. Schedule this nightly."""
    model = SentimentModel().fit(*load_dataset())
    model.save(MODEL_PATH)
    return {"status": "retrained", "vocab_size": len(model.vocab)}

Running it (three terminals)

# terminal 1 — the broker
redis-server

# terminal 2 — a worker (from code/)
celery -A tasks.celery_app worker --loglevel=info

# terminal 3 — enqueue a job and fetch the result
python -c "from tasks.tasks import batch_score; \
           print(batch_score.delay(['great product','awful service']).get(timeout=10))"

The worker logs the task running, and terminal 3 prints the result once it's done — the task logic produces exactly this (verified against the model):

[{'text': 'great product', 'score': 0.9768, 'label': 'positive'},
 {'text': 'awful service', 'score': 0.3242, 'label': 'negative'}]

The key move: .delay(...) returned instantly with a ticket; .get() waited for the worker to finish. In a real API you'd return {"task_id": result.id} immediately and let the client poll a /result/{task_id} endpoint, so the user never waits on an open connection.

The fire-and-forget pattern in an API

@app.post("/score-batch")
def score_batch(texts: list[str]):
    task = batch_score.delay(texts)       # enqueue, don't wait
    return {"task_id": task.id}           # respond in milliseconds

@app.get("/result/{task_id}")
def get_result(task_id: str):
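    res = batch_score.AsyncResult(task_id)
    if not res.ready():
        return {"ready": False, "result": None}
    if res.failed():                      # on failure res.result holds the exception
        return {"ready": True, "result": None, "error": str(res.result)}
    return {"ready": True, "result": res.result}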

This is how every "we're processing your request, check back" flow works — bulk exports, video processing, sending email, long ML inference.
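On the client side, "check back" is a polling loop. Here is a minimal sketch, assuming the status response is a dict with "ready" and "result" keys like the endpoint above returns; poll_until_ready and its parameters are hypothetical names, and the HTTP call is replaced by a stand-in callable so the example runs without a server.

```python
import time


def poll_until_ready(fetch_status, interval=0.5, timeout=30.0, sleep=time.sleep):
    """Call fetch_status() until it reports ready, or raise TimeoutError.

    fetch_status is any zero-argument callable that returns a dict shaped
    like {"ready": bool, "result": ...}.
    """
    waited = 0.0
    while True:
        status = fetch_status()
        if status["ready"]:
            return status["result"]
        if waited >= timeout:
            raise TimeoutError(f"task not ready after {timeout} seconds")
        sleep(interval)
        waited += interval


# Stand-in for the HTTP call: not ready twice, then ready.
responses = iter([
    {"ready": False, "result": None},
    {"ready": False, "result": None},
    {"ready": True, "result": [{"label": "positive"}]},
])
print(poll_until_ready(lambda: next(responses), interval=0.01))
```

In a real client, fetch_status would wrap an HTTP GET against the result endpoint; a fixed interval keeps the sketch short, and backing off between polls is a common refinement.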

Scheduled tasks (Celery Beat)

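Celery also runs tasks on a schedule via Celery Beat, a separate scheduler process you start with celery -A tasks.celery_app beat. For example, retrain every 24 hours (a plain number is an interval in seconds; celery.schedules.crontab pins a time of day):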

app.conf.beat_schedule = {
    "nightly-retrain": {"task": "tasks.tasks.retrain", "schedule": 86400.0},
}

(For richer pipelines with dependencies and retries, you'll reach for a real orchestrator — that's Chapter 9, Prefect. Celery is for tasks; Prefect is for workflows.)

Why a queue, not just threads

Don't be confused: background threads vs. a task queue. You could run slow work in a thread, but a task queue gives you what threads can't: queued work survives an API restart (it sits in Redis, not in the API process's memory), it scales across many machines (add more workers), failed tasks can be retried (Celery retries when you configure it, e.g. with autoretry_for or self.retry), and it doesn't compete with your web server for resources. For anything important or heavy, use the queue.

When Celery — and when not

Celery is the right tool surprisingly often, but it's not the only one. Match the tool to the job:

You need…	Reach for
A quick "fire and forget" after responding (send an email, log an event)	FastAPI `BackgroundTasks` — built in, no broker, runs in the same process
Durable jobs that survive restarts, retry, and scale across machines	Celery + Redis (this chapter) — the general-purpose workhorse
A simpler Redis-only queue	RQ, Dramatiq, Arq (async)
Distributed ML training or heavy parallel inference across a cluster	Ray — purpose-built for scaling Python/ML compute
Multi-step pipelines with dependencies between steps	a workflow orchestrator (Prefect, Chapter 9)

Don't be confused: BackgroundTasks vs. Celery vs. Ray. FastAPI's BackgroundTasks runs work in your web process after the response is sent — great for light, fire-and-forget jobs, but the work dies if the process restarts and can't scale past one machine. Celery runs work in separate worker processes backed by a broker — durable, retryable, horizontally scalable; the right default for real background jobs. Ray is a different animal: a framework for distributing heavy compute (training, large batch inference, hyperparameter sweeps) across a cluster. Reach for Ray when the bottleneck is CPU/GPU work that needs many machines, not when you simply need to get a job off the request path.

Where it fits in production

Decouple slow work from the request path → fast, responsive APIs.
Scale independently → add workers when the queue backs up, without touching the API.
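Reliability → tasks can be retried, and queued tasks survive an API crash because they wait in Redis, not in the web process. (A task a worker has already picked up is only redelivered after a worker crash if you enable late acknowledgement, task_acks_late.)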

Alternatives in this category: RQ (simpler, Redis-only), Dramatiq, Arq (async). Cloud-native options: AWS SQS + Lambda. The pattern — producer, broker, worker — is identical everywhere.

The takeaway

A task queue moves slow work (batch scoring, retraining, emails) out of the request path: your code calls .delay() and returns instantly, Redis holds the queue, and Celery workers do the work in their own processes (on the same machine or on many), so queued jobs survive an API restart and workers scale independently. Use Beat for scheduled jobs, a real orchestrator for complex pipelines. We've been leaning on Redis as the broker; next, let's use it for the other thing it's great at: caching. 👉

Redis: caching & feature lookups

We met Redis as Celery's broker, but that's a fraction of what it does. Redis is an in-memory key-value store — absurdly fast (sub-millisecond), and the Swiss Army knife of production systems. For ML it solves three recurring problems: caching predictions, serving features fast, and rate limiting. This chapter shows each.

Setup: pip install redis and run a Redis server. Follow-along — needs a running Redis.

Why in-memory matters

Redis keeps data in RAM, so reads and writes typically complete in well under a millisecond, network hop included. That is orders of magnitude faster than recomputing a model prediction, and usually much faster than a disk-backed database query. The trade-off: RAM is smaller and (by default) volatile. So you use Redis for hot, ephemeral data: things you can afford to lose and want now.

Use 1 — cache predictions

If the same text gets scored repeatedly (popular queries, repeated documents), don't re-run the model — cache the result. The pattern is cache-aside: check the cache; on a miss, compute and store it.

import json
import redis

r = redis.Redis(host="localhost", port=6379, db=0)

def cached_score(text, model, ttl=3600):
    key = f"pred:{text}"
    hit = r.get(key)
    if hit is not None:                       # cache HIT — skip the model entirely
        return json.loads(hit)
    score = float(model.predict_proba([text])[0])   # cache MISS — compute it
    r.set(key, json.dumps(score), ex=ttl)     # store with a 1-hour expiry
    return score

The ex=ttl gives the key a time-to-live — it auto-expires after an hour, so the cache never grows unbounded and stale predictions self-clean. Expected behavior:

first call ("great product")   -> MISS, runs model, stores -> 0.9768   (~5 ms)
second call ("great product")  -> HIT,  returns from Redis  -> 0.9768   (~0.2 ms)

The second call is ~25× faster because it never touched the model. For an expensive model (an LLM call costing money and seconds), caching is the single highest-impact optimization you can make.

Don't be confused: caching vs. the result backend. In Chapter 6 Redis stored task results so a producer could fetch them. Here it caches model outputs to avoid recomputation. Same Redis, different purpose — it's a general key-value store, and these are just two of its uses.

Use 2 — a feature store / fast feature lookups

Online models need features now: when a request arrives for user:123, you must fetch that user's features in single-digit milliseconds. A database might be too slow; Redis is built for it. Redis hashes store an object's fields under one key:

# precompute features offline, write them to Redis
r.hset("user:123", mapping={"avg_order": "84.50", "n_sessions": "12", "tier": "gold"})

# at request time, fetch instantly
features = r.hgetall("user:123")
# -> {b'avg_order': b'84.50', b'n_sessions': b'12', b'tier': b'gold'}

This is the core of an online feature store (the foundations book's MLOps chapter): batch jobs compute features and push them to Redis; the serving model reads them with hgetall in under a millisecond. Tools like Feast use exactly this pattern (often with Redis as the online store) — and they add the crucial guarantee that training and serving compute features identically, killing training/serving skew.
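Note that hgetall hands back bytes for both keys and values, so the serving code has to decode and cast them before the model sees them. A minimal sketch of that step, assuming the three fields from the example above; decode_features is a hypothetical helper name.

```python
def decode_features(raw):
    """Turn the bytes-to-bytes mapping from HGETALL into typed Python values."""
    text = {k.decode(): v.decode() for k, v in raw.items()}
    return {
        "avg_order": float(text["avg_order"]),
        "n_sessions": int(text["n_sessions"]),
        "tier": text["tier"],
    }


raw = {b"avg_order": b"84.50", b"n_sessions": b"12", b"tier": b"gold"}
print(decode_features(raw))
# {'avg_order': 84.5, 'n_sessions': 12, 'tier': 'gold'}
```

Creating the client with redis.Redis(..., decode_responses=True) makes redis-py return str instead of bytes, which removes the decode step but still leaves the numeric casts to you.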

Use 3 — rate limiting

Protect your API from abuse by counting requests per user per time window. Redis's atomic INCR plus a TTL makes this a few lines:

def allow_request(user_id, limit=100, window=60):
    key = f"rate:{user_id}"
    count = r.incr(key)             # atomic increment (creates the key at 1)
    if count == 1:
        r.expire(key, window)       # first hit sets the 60-second window
    return count <= limit           # allow until the limit is hit

requests 1..100  -> allow_request -> True
request 101      -> allow_request -> False   (blocked until the window resets)

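Because INCR is atomic, the count stays correct even with thousands of concurrent requests: no two requests can ever read the same value. The key auto-expires, resetting the window. One caveat: INCR and EXPIRE are two separate commands, so if the process dies between them the key never expires; production limiters usually send both in a transaction or a Lua script.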
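To see the window behavior without a Redis server, here is an in-memory stand-in for the same counter-plus-expiry logic. It is a sketch, not a replacement for the Redis version: FixedWindowLimiter and its injectable clock are assumptions made so the example can be run and tested deterministically.

```python
import time


class FixedWindowLimiter:
    """In-memory stand-in for the INCR + EXPIRE pattern, with an injectable clock."""

    def __init__(self, limit=100, window=60, clock=None):
        self.limit = limit
        self.window = window
        self.clock = clock or time.monotonic
        self.counters = {}                      # user_id -> (count, expires_at)

    def allow(self, user_id):
        now = self.clock()
        count, expires_at = self.counters.get(user_id, (0, 0.0))
        if count == 0 or now >= expires_at:     # key missing or expired: new window
            count, expires_at = 0, now + self.window
        count += 1
        self.counters[user_id] = (count, expires_at)
        return count <= self.limit


now = [0.0]                                     # fake clock we can advance by hand
limiter = FixedWindowLimiter(limit=3, window=60, clock=lambda: now[0])
print([limiter.allow("u1") for _ in range(4)])  # [True, True, True, False]
now[0] = 61.0                                   # the window has expired
print(limiter.allow("u1"))                      # True
```

A fixed window is simple but bursty at the boundary: a user can spend a full limit at the end of one window and another at the start of the next. Sliding-window or token-bucket limiters smooth that out.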

The data structures you'll actually use

Structure	Command	ML use
String	`SET`/`GET`	cache a prediction or config
Hash	`HSET`/`HGETALL`	a feature vector / user profile
Counter	`INCR`/`EXPIRE`	rate limiting, metrics
Sorted set	`ZADD`/`ZRANGE`	leaderboards, top-K, recent items
TTL	`EX`/`EXPIRE`	auto-expiring cache entries

Production notes

TTLs everywhere: cache entries should expire, or Redis fills up. Set an eviction policy (allkeys-lru) so it drops the least recently used keys under memory pressure.
Persistence — Redis can persist to disk (RDB/AOF), but treat it as a cache you can rebuild, not your source of truth.
It's a network hop — fast, but not free; batch reads (MGET, pipelines) when you can.
Cache invalidation — "one of the two hard problems." When the model changes, old cached predictions are wrong — version your cache keys (pred:v2:...) so a new model can't serve stale answers.
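Key versioning can be as small as one helper. The sketch below assumes a model version string is available at serving time (the "v1"/"v2" values are placeholders) and hashes the text so that long inputs still produce short, fixed-length keys; cache_key is a hypothetical name.

```python
import hashlib


def cache_key(text, model_version):
    """Build a fixed-length, versioned cache key for a prediction."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return f"pred:{model_version}:{digest}"


k1 = cache_key("great product", "v1")
k2 = cache_key("great product", "v2")
print(k1 != k2)     # True: a new model version never reads the old model's entries
```

Old-version entries are never deleted explicitly here; they simply stop being read and age out through their TTL.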

The takeaway

Redis is the in-memory workhorse: cache predictions (cache-aside with a TTL) to skip expensive recomputation, serve features in sub-millisecond lookups (the online feature-store pattern), and rate-limit with atomic counters. Always set TTLs, version your cache keys, and treat it as rebuildable. We can now serve fast and scale work — but which data trained the model we're serving? Time to version it. 👉

DVC: version data & models

Git is brilliant for code and terrible for data. Commit a 2 GB dataset or a 500 MB model and your repo becomes unusable — Git stores every version of every byte. Yet you need to know which data and which model produced a given result (it's required for reproducibility, debugging, and audits). DVC (Data Version Control) solves this: it versions large files with Git, without putting them in Git.

Setup: pip install dvc (add dvc[s3]/dvc[gs] for cloud remotes). Follow-along — DVC commands shown with expected output.

The core trick

DVC keeps your big files in separate storage (a "remote": S3, GCS, even a shared folder) and commits a tiny pointer file to Git instead. The pointer holds a hash of the data; Git versions the pointer, DVC versions the data:

Git repo:   data.csv.dvc   (a 100-byte text file with a hash)   ✅ tiny, versioned
DVC remote: the actual 2 GB data.csv, keyed by that hash        ✅ big, out of Git

So git checkout an old commit, then dvc checkout, and you get exactly the data and model that commit used — full reproducibility, no bloated repo.
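To make the pointer idea concrete, here is a minimal sketch of content-addressed storage in plain Python. It is not DVC's implementation or file format: the .pointer suffix, the sha256 field, and the track/checkout helpers are all invented for illustration.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def file_hash(path, chunk_size=1 << 20):
    """Hash a file's contents in chunks, so large files never sit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def track(path, store):
    """Copy the file into the store under its hash and write a small pointer."""
    path, store = Path(path), Path(store)
    digest = file_hash(path)
    store.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(path, store / digest)
    pointer = path.with_name(path.name + ".pointer")
    pointer.write_text(f"sha256: {digest}\n")
    return pointer


def checkout(pointer, store):
    """Restore the data file that a pointer refers to."""
    pointer = Path(pointer)
    digest = pointer.read_text().split(":", 1)[1].strip()
    target = pointer.with_name(pointer.name[: -len(".pointer")])
    shutil.copyfile(Path(store) / digest, target)
    return target


with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "data.csv"
    data.write_text("text,label\ngreat,1\n")
    ptr = track(data, Path(tmp) / "store")
    data.write_text("overwritten")              # simulate the file changing later
    checkout(ptr, Path(tmp) / "store")          # the pointer brings the old bytes back
    print(data.read_text() == "text,label\ngreat,1\n")   # True
```

The pointer is the only thing that would need to live in Git; the store directory plays the role of the remote.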

Versioning a dataset

cd code
dvc init                          # set up DVC in the repo (once)
dvc add data/reviews.csv          # tell DVC to track this file
git add data/reviews.csv.dvc data/.gitignore
git commit -m "track reviews dataset v1"

Expected output of dvc add:

100% Adding...|████████████████████████████████|1/1 [00:00, 12.4file/s]

To track the changes with git, run:
    git add data/reviews.csv.dvc data/.gitignore

DVC moved the real CSV into its cache, wrote reviews.csv.dvc (the pointer), and added the CSV to .gitignore so Git never tries to track the big file. You commit the pointer, not the data.

Don't be confused: dvc add vs. git add. git add stages a file into Git. dvc add does the opposite — it pulls a file out of Git's tracking and into DVC's, leaving behind a small .dvc pointer that Git does track. You then git add the pointer. Two tools, clear division: Git tracks small text (code, pointers), DVC tracks big binaries (data, models).

Pushing data to a remote

The data lives in your local DVC cache until you push it to shared storage:

dvc remote add -d storage s3://my-bucket/dvc-store    # configure once
dvc push                                              # upload tracked data

Now a teammate runs git pull (gets the pointers) then dvc pull (downloads the matching data). Everyone has the same data, versioned in lockstep with the code.

Reproducible pipelines (`dvc.yaml`)

DVC's second superpower: define your pipeline as stages with declared dependencies and outputs, and DVC only re-runs a stage when its inputs change:

# dvc.yaml
stages:
  train:
    cmd: python sentiment/train.py
    deps:
      - sentiment/train.py
      - data/reviews.csv          # if the data changes, retrain
    outs:
      - model.json                 # DVC versions the output model too
    metrics:
      - metrics.json               # track metrics across versions

dvc repro          # runs only the stages whose dependencies changed
dvc metrics show   # print the current metrics
dvc metrics diff   # compare metrics against the last commit (or between two revisions)

dvc repro is "make for ML": change the data and it knows to retrain; change nothing and it does nothing. This makes your whole train→evaluate flow reproducible and cached.
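The "only re-run when inputs change" rule boils down to comparing content hashes of the dependencies against the hashes recorded at the last run (DVC keeps its own record in dvc.lock). A simplified sketch of that check, assuming a JSON lock file and a Python callable as the stage command; fingerprint and run_if_changed are hypothetical names, and real DVC also tracks outputs and the command itself.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def fingerprint(paths):
    """Map each dependency path to a hash of its current contents."""
    return {str(p): hashlib.sha256(Path(p).read_bytes()).hexdigest() for p in paths}


def run_if_changed(deps, lock_path, command):
    """Run command() only if a dependency differs from the recorded fingerprint."""
    lock_path = Path(lock_path)
    current = fingerprint(deps)
    recorded = json.loads(lock_path.read_text()) if lock_path.exists() else None
    if current == recorded:
        return False                            # nothing changed: skip the stage
    command()
    lock_path.write_text(json.dumps(current, indent=2))
    return True


with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "reviews.csv"
    data.write_text("great,1\n")
    lock = Path(tmp) / "stage.lock"
    train = lambda: print("training...")
    print(run_if_changed([data], lock, train))  # True  (first run trains)
    print(run_if_changed([data], lock, train))  # False (unchanged, skipped)
    data.write_text("great,1\nawful,0\n")
    print(run_if_changed([data], lock, train))  # True  (data changed, retrains)
```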

Time-travel to any experiment

Because data, model, and metrics are all versioned alongside the code, you can jump to any past state completely:

git checkout <old-commit>     # the code + the .dvc pointers from back then
dvc checkout                  # the exact data + model that commit used

You're now looking at precisely the data and model that produced last month's result — the thing that's impossible with Git alone, and essential when a regulator or a postmortem asks "what exactly did we train on?"

DVC vs. the alternatives

Git LFS — also stores big files outside Git, but it's generic and lacks DVC's ML pipelines, metrics, and experiment features.
lakeFS / Delta Lake — data versioning at the data-lake scale (whole tables, branching petabytes). Bigger hammer.
MLflow — versions models and runs; DVC versions data and pipelines. They're complementary, and teams often use both.

Don't be confused: DVC vs. MLflow. They overlap but lead with different things. MLflow centers on experiment tracking and the model registry (params, metrics, model versions). DVC centers on data versioning and reproducible pipelines (which data + which code = which result). Use MLflow to compare runs and ship models; use DVC to guarantee you can reproduce the data behind any of them.

The takeaway

DVC versions large data and models with Git but not in it: the big files live in a remote (S3/GCS), and Git tracks tiny .dvc pointer files, so git checkout + dvc checkout reproduces any past data-and-model state exactly. dvc.yaml turns your pipeline into cached, dependency-aware stages ("make for ML"). Now that data and models are versioned, let's automate the loop that produces them — orchestrated retraining. 👉

Prefect: orchestrate retraining

A production model must retrain as new data arrives — but "retraining" isn't one step. It's pull data → train → evaluate → promote if better → maybe alert, with retries when a step fails and a schedule so it runs every night unattended. Stringing that together with cron and shell scripts is fragile. Prefect turns the pipeline into observable, retrying, schedulable Python — a workflow orchestrator.

Setup: pip install prefect. Follow-along — runs locally as shown; a schedule needs a Prefect server/cloud.

Tasks and flows

Prefect has two decorators:

@task — one step (pull data, train, evaluate). Prefect tracks it: logs, retries, caching, timing.
@flow — the function that wires tasks into a pipeline (a DAG). It's the unit you run and schedule.

From pipeline/retrain_flow.py:

from prefect import flow, task
from sentiment.data import load_dataset
from sentiment.model import SentimentModel

@task(retries=2, retry_delay_seconds=5)        # auto-retry flaky steps
def extract():
    return load_dataset()

@task
def train(data):
    return SentimentModel().fit(*data)

@task
def evaluate(model, data):
    return model.accuracy(*data)

@task
def promote(model, accuracy, threshold=0.9):
    if accuracy >= threshold:                  # an automated quality gate
        model.save("model.json")
        return f"promoted (acc={accuracy:.3f})"
    return f"rejected (acc={accuracy:.3f} < {threshold})"

@flow(name="retrain-sentiment", log_prints=True)
def retrain_pipeline():
    data = extract()
    model = train(data)
    acc = evaluate(model, data)
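    print(promote(model, acc))

if __name__ == "__main__":
    retrain_pipeline()                         # lets `python pipeline/retrain_flow.py` run the flow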

Notice the quality gate in promote: the new model only ships if it clears a bar. This is how you retrain automatically without risking a bad model reaching production — a critical safety valve in any automated pipeline.
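The gate above checks an absolute threshold. A common extension, matching the "promote if better" step in the chapter's opening, is to also compare against the model currently in production. A minimal sketch, assuming both accuracies are measured on the same held-out set (in the flow above, evaluate receives the same data the model was trained on); should_promote and min_gain are hypothetical names.

```python
def should_promote(new_acc, current_acc=None, threshold=0.9, min_gain=0.0):
    """Return (decision, reason) for a candidate model."""
    if new_acc < threshold:
        return False, f"rejected: acc={new_acc:.3f} below threshold {threshold}"
    if current_acc is not None and new_acc < current_acc + min_gain:
        return False, f"rejected: acc={new_acc:.3f} does not beat current {current_acc:.3f}"
    return True, f"promoted: acc={new_acc:.3f}"


print(should_promote(0.95, current_acc=0.93))   # (True, 'promoted: acc=0.950')
print(should_promote(0.92, current_acc=0.93))   # rejected: worse than production
print(should_promote(0.85))                     # rejected: below the threshold
```

Returning the reason alongside the decision keeps the flow's log line as informative as the original promote task.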

Running it

cd code
python pipeline/retrain_flow.py

Expected output (Prefect narrates each task as it runs):

14:32:01.245 | INFO | prefect.engine - Created flow run 'splendid-otter' for flow 'retrain-sentiment'
14:32:01.310 | INFO | Task run 'extract-0' - Finished in state Completed()
14:32:01.402 | INFO | Task run 'train-0' - Finished in state Completed()
14:32:01.455 | INFO | Task run 'evaluate-0' - Finished in state Completed()
14:32:01.501 | INFO | Task run 'promote-0' - Finished in state Completed()
14:32:01.503 | INFO | Flow run 'splendid-otter' - promoted (acc=1.000)
14:32:01.540 | INFO | Flow run 'splendid-otter' - Finished in state Completed()

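The model scored 1.000 ≥ 0.9, so the gate promoted it. (That 1.000 is measured on the same data the model was trained on; a real gate should score held-out data.) Each task shows its own state, so you see exactly which step broke and why: if extract had thrown, Prefect would retry it up to twice, 5 seconds apart, before marking it Failed; if train had thrown, it would be marked Failed straight away, because it declares no retries.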
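The retry policy on extract (retries=2, retry_delay_seconds=5) means one initial attempt plus up to two more. Roughly, and only as a plain-Python illustration of the semantics rather than Prefect's actual engine, it behaves like the hypothetical with_retries helper below.

```python
import time


def with_retries(func, retries=2, retry_delay_seconds=5, sleep=time.sleep):
    """Call func(); on an exception, wait and try again up to `retries` more times.

    Returns (result, attempts_used). Re-raises the last error once the
    initial attempt and every retry have failed.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            return func(), attempts
        except Exception:
            if attempts > retries:
                raise
            sleep(retry_delay_seconds)


calls = {"n": 0}


def flaky_extract():                    # fails twice, then succeeds
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("source unavailable")
    return "dataset"


print(with_retries(flaky_extract, sleep=lambda s: None))   # ('dataset', 3)
```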

Why an orchestrator beats cron + scripts

Don't be confused: Celery vs. Prefect — tasks vs. workflows. Celery runs independent tasks off a queue (great for "score this batch"). Prefect runs workflows — multi-step pipelines with dependencies between steps, where step B needs step A's output, with retries, scheduling, and a UI showing the whole DAG. Use Celery for fire-and-forget jobs; use Prefect for "the nightly retraining pipeline."

A cron job that calls a shell script gives you none of: retries, visibility into which step failed, passing data between steps, backfills, or alerting. Prefect gives you all of it, in Python you already know.

Scheduling it

To run unattended every night at 2 AM, deploy it with a schedule:

prefect deploy pipeline/retrain_flow.py:retrain_pipeline --cron "0 2 * * *"
prefect worker start --pool default               # a worker executes scheduled runs

Now the pipeline runs nightly, the Prefect UI shows every run's status and logs, and failures can page you. That's the "retrain when it drifts" arrow of the lifecycle loop (Chapter 0) made real — often triggered by the drift check from Chapter 12.

The orchestrator landscape

Prefect — Pythonic, modern, gentle learning curve (this chapter).
Airflow — the incumbent; powerful, ubiquitous in data engineering, heavier.
Dagster — asset-centric, strong typing and data-awareness.
Kubeflow Pipelines — Kubernetes-native ML pipelines.

They all express the same idea — a DAG of steps with scheduling and observability — so the concept transfers. Airflow is the one you'll most often see on job descriptions; Prefect is the friendliest to learn it on.

The takeaway

Prefect turns retraining into an orchestrated workflow: @task steps wired by an @flow, with automatic retries, a quality gate that only promotes good models, per-step visibility, and cron scheduling for unattended nightly runs. It's cron with a brain — and the engine of the lifecycle's retraining loop. Our pipeline produces models; next, let's make those models fast and portable for inference with ONNX. 👉

ONNX: portable, faster inference

Your model is a Python object using NumPy (or PyTorch, or TensorFlow). But what if you need to serve it from a C++ game engine, a Java backend, a browser, or a tiny edge device with no Python? And what if framework overhead makes inference slower than it needs to be? ONNX (Open Neural Network Exchange) solves both: a single, framework-neutral model format that runs fast via a dedicated runtime, anywhere.

Install: pip install torch onnx onnxruntime. This chapter runs live — the output below is real.

What ONNX is

ONNX is a standard file format for models, plus onnxruntime, a highly optimized engine that executes them. The workflow:

train in Python (PyTorch/sklearn/...)  ─►  export to model.onnx  ─►
        run with onnxruntime in ANY language, often 2-5x faster

You decouple training (Python, flexible, slow is OK) from serving (any language, optimized, latency matters). The .onnx file is portable; the runtime handles the speed.

Exporting our model

Our model is bag-of-words + logistic regression, i.e. sigmoid(counts @ w + b). We rebuild that as a one-line PyTorch Linear layer (so we can use torch's ONNX exporter) and export it. From serving/onnx_export.py:

import torch, torch.nn as nn

model = SentimentModel().fit(*load_dataset())
linear = nn.Linear(len(model.weights) - 1, 1)
with torch.no_grad():                              # copy the learned weights across
    linear.weight.copy_(torch.tensor(model.weights[:-1]).reshape(1, -1))
    linear.bias.copy_(torch.tensor([model.weights[-1]]))

class Net(nn.Module):
    def __init__(self): super().__init__(); self.fc = linear
    def forward(self, x): return torch.sigmoid(self.fc(x))

torch.onnx.export(
    Net().eval(), torch.zeros(1, len(model.weights) - 1), "sentiment.onnx",
    input_names=["counts"], output_names=["score"],
    dynamic_axes={"counts": {0: "batch"}, "score": {0: "batch"}},   # variable batch
    opset_version=17,
)

The dynamic_axes is the one subtlety worth knowing: it marks the batch dimension as variable, so the exported model accepts 1 row or 1,000. We also save the vocabulary alongside, because the runtime still has to turn text into word counts the same way.
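Whatever language ends up running the .onnx file has to reproduce the text-to-counts step itself. The sketch below writes that step, and the sigmoid(counts @ w + b) formula, in plain Python. The tokenizer (lowercase, split on whitespace), the three-word vocabulary, and the weights are assumptions for illustration; the real ones come from the trained SentimentModel.

```python
import math


def to_counts(text, vocab):
    """Bag-of-words: one count per vocabulary word, in vocabulary order."""
    index = {word: i for i, word in enumerate(vocab)}
    counts = [0.0] * len(vocab)
    for token in text.lower().split():
        if token in index:                  # words outside the vocabulary are ignored
            counts[index[token]] += 1.0
    return counts


def score(counts, weights, bias):
    """sigmoid(counts @ w + b), written out without NumPy."""
    z = sum(c * w for c, w in zip(counts, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))


vocab = ["great", "terrible", "this"]       # hypothetical saved vocabulary
weights, bias = [2.0, -2.0, 0.0], 0.0       # hypothetical learned weights
x = to_counts("This is great great", vocab)
print(x, round(score(x, weights, bias), 3)) # [2.0, 0.0, 1.0] 0.982
```

If the serving side tokenizes differently from training (casing, punctuation, word order in the vocabulary), the counts vector changes and the scores silently drift, which is why the vocabulary ships with the model.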

Running inference with onnxruntime — no PyTorch needed

The whole point: load the .onnx and predict with only onnxruntime, no training framework:

import onnxruntime as ort
sess = ort.InferenceSession("sentiment.onnx")
scores = sess.run(["score"], {"counts": counts})[0]    # counts = float32 bag-of-words matrix, shape (batch, vocab_size)

It runs — live output

cd code
python serving/onnx_export.py

Output:

exported -> sentiment.onnx  (602 bytes)
  0.977  positive  <- 'this is great'
  0.024  negative  <- 'this is terrible'

The model exported to a 602-byte .onnx file, and onnxruntime reproduced the exact predictions (0.977, 0.024) — without the original SentimentModel class. That file now runs in C#, Java, JavaScript (onnxruntime-web), C++, or on a Raspberry Pi, identically.

Don't be confused: ONNX (the format) vs. onnxruntime (the engine). ONNX is the .onnx file — a portable description of the model's computation graph. onnxruntime is the program that loads and executes it fast. You export to ONNX and run with onnxruntime. (Other engines — TensorRT, OpenVINO — also run ONNX files, tuned for specific hardware.)

Why teams reach for it

Speed — onnxruntime applies graph optimizations (operator fusion, constant folding) and uses optimized kernels; inference is often 2–5× faster than the source framework, especially on CPU.
Portability — one model file, every language and platform. No Python at serving time.
Smaller footprint — ship a lean runtime instead of a full PyTorch/TF install — big for containers (Chapter 5) and edge devices.
A bridge to faster hardware — ONNX is the on-ramp to TensorRT (NVIDIA GPUs), OpenVINO (Intel), and mobile/edge accelerators.

ONNX and quantization

ONNX pairs naturally with quantization — converting weights from float32 to int8 to shrink the model ~4× and speed it up further:

from onnxruntime.quantization import quantize_dynamic, QuantType
quantize_dynamic("sentiment.onnx", "sentiment.int8.onnx", weight_type=QuantType.QInt8)

This is how large models get squeezed onto modest hardware — the same compression idea the IVF-PQ book applies to vectors, applied to model weights.
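The arithmetic behind int8 quantization is short enough to write out. This is a simplified symmetric scheme with one scale for the whole weight list, meant only to show where the ~4× saving and the small rounding error come from; onnxruntime's quantizer makes its own choices (zero points, per-tensor or per-channel scales), and the function names here are hypothetical.

```python
def quantize_int8(weights):
    """Symmetric int8 quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate floats from the stored integers."""
    return [v * scale for v in q]


w = [0.5, -1.27, 0.01, 1.0]
q, scale = quantize_int8(w)
restored = dequantize(q, scale)
print(q)                                                # [50, -127, 1, 100]
print(max(abs(a - b) for a, b in zip(w, restored)))     # worst-case rounding error
```

Each weight now needs one byte instead of four, plus a single shared scale; the price is a rounding error of at most half a scale step per weight.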

When not to bother

ONNX shines for deployment of a settled model. During research and rapid iteration, stay in your native framework — exporting adds a step and not every exotic operation converts cleanly. Reach for ONNX when you're optimizing a model for production serving, targeting a non-Python environment, or chasing latency.

The takeaway

ONNX is a portable model format; onnxruntime is the fast engine that runs it. Export once (mind the dynamic_axes for variable batch sizes), and your model runs in any language, on any platform, typically 2–5× faster — and quantizes down for edge hardware. It's the standard bridge from "trained in Python" to "served fast and anywhere." Now let's give non-engineers a way to try the model — a UI. 👉

Streamlit & Gradio: a demo UI

A JSON API (Chapter 4) is perfect for machines, but useless for the product manager, the domain expert, or the executive who wants to try your model. For them you need a UI — and you do not want to write HTML, CSS, and JavaScript for a demo. Streamlit and Gradio turn a Python script into an interactive web app in minutes. They're how ML people ship UIs without becoming frontend engineers.

Install: pip install streamlit (or pip install gradio). Follow-along — both launch a local web server.

Streamlit: a script is the app

Streamlit's model is delightfully simple: write a normal top-to-bottom Python script, and each st.* call renders a widget. It re-runs the whole script on every interaction. Here's streamlit_app.py, a full UI for our model:

import streamlit as st
from sentiment.data import load_dataset
from sentiment.model import SentimentModel

@st.cache_resource                       # load the model ONCE, not on every rerun
def get_model():
    return SentimentModel().fit(*load_dataset())

st.title("🎭 Sentiment Analyzer")
text = st.text_area("Your text", "this product is absolutely wonderful")
if st.button("Analyze") and text.strip():
    score = float(get_model().predict_proba([text])[0])
    st.metric("P(positive)", f"{score:.1%}")
    st.progress(score)
    if score >= 0.5:
        st.success("Positive 😊")
    else:
        st.error("Negative 😞")

Run it:

cd code
streamlit run streamlit_app.py            # opens http://localhost:8501

You get a real web app: a title, a text box, a button, a live percentage metric, a progress bar, and a colored verdict — from ~12 lines of Python, no HTML.

Don't be confused: @st.cache_resource is not optional here. Streamlit re-runs your entire script on every click. Without the cache decorator, you'd retrain the model on every interaction, which is slow and wasteful. @st.cache_resource (for models and connections) and @st.cache_data (for dataframes and computations) memoize expensive work across reruns. Forgetting them is the #1 Streamlit performance bug.
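The effect of the cache decorator can be imitated without Streamlit. The sketch below is only an analogy: functools.lru_cache stands in for @st.cache_resource, and calling rerun_script repeatedly stands in for Streamlit re-executing the script on each interaction. Streamlit's real cache has its own rules for sessions and invalidation.

```python
import functools

load_count = 0


@functools.lru_cache(maxsize=None)
def get_model():
    """Pretend to load an expensive model; count how often that really happens."""
    global load_count
    load_count += 1
    return {"name": "sentiment", "loaded": True}


def rerun_script():
    """Stand-in for one rerun: the whole script body executes again."""
    return get_model()


for _ in range(5):          # five simulated user interactions
    rerun_script()
print(load_count)           # 1: the expensive load ran once, not five times
```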

The production pattern: UI calls the API

In the demo above the UI loads the model directly — fine for a prototype. In production you keep one model behind the FastAPI service and have the UI call it, so there's a single source of truth and the model isn't duplicated in every app. Our script supports both — set API_URL and it calls the service instead:

import os, requests
API_URL = os.environ.get("API_URL")       # unset means prototype mode

def score(text):
    if API_URL:                           # production: call the FastAPI service
        resp = requests.post(f"{API_URL}/predict", json={"text": text}, timeout=5)
        return resp.json()["score"]
    return float(get_model().predict_proba([text])[0])   # prototype: in-process

API_URL=http://localhost:8000 streamlit run streamlit_app.py

This is the right architecture: the API owns inference; the UI is just a client.

Gradio: even faster for ML demos

Gradio is the other popular choice, purpose-built for ML demos and tightly integrated with Hugging Face. You wrap a function in an Interface and it builds the UI:

import gradio as gr
from sentiment.model import SentimentModel
from sentiment.data import load_dataset

model = SentimentModel().fit(*load_dataset())

def classify(text):
    p = float(model.predict_proba([text])[0])
    return {"positive": p, "negative": 1 - p}     # Gradio renders a label/bar chart

gr.Interface(fn=classify, inputs="text", outputs="label").launch()

That's a complete app with a labeled confidence chart. Gradio also gives you a public shareable link (launch(share=True)) — great for sending a demo to someone — and one-click hosting on Hugging Face Spaces.

Don't be confused: Streamlit vs. Gradio — which to pick? Gradio is fastest for a single-model demo ("input → model → output"), with instant sharing and HF Spaces hosting. Streamlit is better for richer apps — dashboards, multiple inputs, charts, multi-step tools — because you control the full page layout. Demo a model → Gradio. Build an internal tool → Streamlit.

Where these fit (and where they don't)

Great for: internal tools, model demos, stakeholder reviews, data dashboards, quick experiments, hackathons.
Not for: customer-facing production apps at scale. They're single-process and not built for heavy concurrent traffic. For a real product UI, a frontend framework (React/Vue) talks to your FastAPI backend — but that's a frontend job, not yours.

The value is speed-to-demo: you can put a working model in front of a human in minutes, which is often what unblocks a project ("can I just try it?").

The takeaway

Streamlit and Gradio turn a Python script into a web UI with no frontend code — Gradio for quick single-model demos with instant sharing, Streamlit for richer multi-widget apps. Cache the model (@st.cache_resource), and in production have the UI call your FastAPI service rather than loading the model itself. These are for demos and internal tools, not customer-facing scale. We can now train, track, serve, scale, version, orchestrate, optimize, and demo the model — the last question is whether it's still working in production. 👉

Monitoring & drift detection

Here's the failure mode that catches every team once: you ship a great model, it works, everyone moves on — and six months later it's quietly making bad predictions, because the world changed and nobody was watching. A deployed model rots. Monitoring is how you catch the rot before your users (or your boss) do. It's the most-skipped and most-important stage of the lifecycle.

Install: the PSI check below is pure NumPy and runs live. Richer monitoring uses pip install evidently.

Two kinds of monitoring (don't conflate them)

Don't be confused: operational monitoring vs. ML monitoring. Operational monitoring watches the service — latency, error rate, throughput, CPU. Standard DevOps (Prometheus/Grafana); if the API is slow or 500ing, you'll know. ML monitoring watches the model's quality — is it still accurate? A model can be perfectly healthy operationally (fast, 200 OK) while being completely wrong. Operational monitoring won't catch that. You need both.

This chapter is about the second kind — the one teams forget.

Why models rot: drift

The world your model learned no longer matches the world it's serving. Two flavors:

Data drift — the input distribution shifts. New users behave differently, an upstream feature changes units, a season turns. The model sees inputs unlike its training data.
Concept drift — the relationship between inputs and the answer shifts. Fraudsters invent new tactics; "expensive" means something different after inflation. Even with identical inputs, the right answer changed.

Both silently erode accuracy — and you often can't measure accuracy live, because the true labels arrive late or never (you learn if a loan defaults months later). So you monitor drift in the inputs as an early-warning proxy.

Detecting drift with PSI

The Population Stability Index (PSI) is the classic lightweight drift metric; the version here needs only NumPy. It compares a reference distribution (your training data) to the current one (live traffic), bucket by bucket, and sums how much they diverge. From serving/drift_check.py:

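import numpy as np

def psi(reference, current, bins=10):
    # Bucket edges are reference quantiles, so each bucket holds ~1/bins of the reference data.
    edges = np.quantile(reference, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf          # catch live values outside the training range
    ref_pct = np.histogram(reference, edges)[0] / len(reference)
    cur_pct = np.histogram(current, edges)[0] / len(current)
    # Floor empty buckets so the log below never sees zero.
    ref_pct, cur_pct = np.clip(ref_pct, 1e-6, None), np.clip(cur_pct, 1e-6, None)
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))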

The thresholds most teams use as a rule of thumb:

PSI	Meaning	Action
< 0.1	no significant drift	fine
0.1 – 0.25	moderate drift	investigate
> 0.25	major drift	retrain

It runs — live output

cd code
python serving/drift_check.py

Output:

stable    PSI=0.0012  -> no significant drift
drifted   PSI=0.3089  -> MAJOR drift — retrain

Two production scenarios compared against the training distribution: traffic that matches it scores PSI 0.0012 (no drift — all good), while traffic shifted in mean and spread scores PSI 0.3089 (well past 0.25 — retrain now). Run this nightly on a sample of live inputs and you get an automatic early warning, before accuracy visibly craters.
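To tie the threshold table to the two scores above, here is a minimal sketch of a helper that turns a PSI value into the table's action. The function name and its return strings are illustrative and are not taken from serving/drift_check.py.

```python
def interpret_psi(value, moderate=0.1, major=0.25):
    """Map a PSI score to the action from the threshold table."""
    if value < moderate:
        return "no significant drift"
    if value <= major:
        return "moderate drift: investigate"
    return "major drift: retrain"

print(interpret_psi(0.0012))   # no significant drift
print(interpret_psi(0.18))     # moderate drift: investigate
print(interpret_psi(0.3089))   # major drift: retrain
```

Keeping the cut-offs as parameters makes it easy to tighten them for a sensitive feature without touching the PSI calculation itself.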

What to monitor

Input feature distributions — PSI/KS-test per feature vs. training (above).
Prediction distribution — if your model suddenly predicts "positive" 90% of the time when it used to be 50%, something changed upstream.
Model accuracy — when labels eventually arrive, compare predictions to truth and track accuracy/AUC over time.
Data quality — nulls, out-of-range values, schema changes. Often the real culprit: a broken upstream pipeline, not "concept drift."
Operational metrics — latency, errors, throughput (the other kind).
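The prediction-distribution check in the list above can reuse the PSI idea on class shares instead of feature buckets. The following is a dependency-free sketch under stated assumptions: the function names are hypothetical, the label lists are made up, and the 0.25 cut-off is simply borrowed from the PSI table.

```python
import math
from collections import Counter

def label_shares(labels, classes):
    """Fraction of predictions per class, floored to avoid log(0)."""
    counts = Counter(labels)
    total = len(labels)
    return {c: max(counts[c] / total, 1e-6) for c in classes}

def prediction_psi(reference_labels, current_labels, classes=("positive", "negative")):
    """PSI computed over predicted-class shares instead of feature buckets."""
    ref = label_shares(reference_labels, classes)
    cur = label_shares(current_labels, classes)
    return sum((cur[c] - ref[c]) * math.log(cur[c] / ref[c]) for c in classes)

baseline = ["positive"] * 50 + ["negative"] * 50     # made-up: 50% positive at launch
today = ["positive"] * 90 + ["negative"] * 10        # made-up: 90% positive now
print(round(prediction_psi(baseline, baseline), 4))  # 0.0
print(prediction_psi(baseline, today) > 0.25)        # True
```

A jump from 50% to 90% positive lands far above the "major drift" line, which is the signal that something changed upstream even before any labels arrive.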

Evidently: drift reports out of the box

Computing PSI by hand teaches the idea; in practice Evidently generates full drift and quality reports with a few lines:

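# Import paths for Evidently's Report API as used here; they can differ between Evidently releases.
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset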

report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=train_df, current_data=live_df)
report.save_html("drift_report.html")        # per-feature drift, with charts

It checks every feature, picks appropriate statistical tests, and produces a shareable HTML report. Hosted tools (WhyLabs, Arize, Fiddler) add dashboards, alerts, and historical tracking.

Closing the loop

Monitoring isn't passive — it's the trigger for the rest of the lifecycle. When the drift check fires, it kicks off the Prefect retraining pipeline (Chapter 9), which trains a fresh model, gates it on quality, and promotes it via the registry (Chapter 3) — all automatically. That closed loop — serve → monitor → detect drift → retrain → promote → serve — is what "production ML" actually means, and it's the whole picture from Chapter 0 made real.

Don't be confused: retrain on drift, not on a fixed schedule (ideally). Many teams retrain "every week" by default — simple, but it retrains when nothing changed (waste) and misses drift that strikes mid-week. Drift-triggered retraining responds to the actual signal. A scheduled retrain is a fine starting point; a drift-triggered one is the goal.
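A drift-triggered retrain boils down to one decision per monitoring run. The sketch below shows that decision only; the feature names and PSI values are hypothetical, and the step that would actually start the Prefect pipeline is left out because its name is not shown here.

```python
def should_retrain(psi_by_feature, major=0.25):
    """Return (decision, offending features) for one drift-check run."""
    drifted = sorted(name for name, value in psi_by_feature.items() if value > major)
    return bool(drifted), drifted

# Hypothetical nightly results: feature name -> PSI vs. training data.
nightly = {"text_length": 0.0012, "exclamation_rate": 0.3089}
retrain, reasons = should_retrain(nightly)
print(retrain, reasons)   # True ['exclamation_rate']
```

Returning the offending features alongside the decision gives the retraining run (and the person reading the alert) a reason, not just a trigger.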

The takeaway

Models rot as the world drifts from their training data, so you must monitor quality (drift, prediction shifts, accuracy when labels arrive), not just uptime. PSI is the classic drift metric — our live run flagged shifted traffic at 0.31 and clean traffic at 0.001 — and Evidently automates the full report. Crucially, monitoring triggers retraining, closing the production loop. That completes the classical MLOps stack — next we add the part every modern AI team now needs: the GenAI / LLM stack, starting with the vector databases that power semantic search and RAG. 👉

Vector databases

Modern AI runs on embeddings — vectors where geometric closeness means semantic similarity. Searching millions of them fast, with metadata and persistence, is the job of a vector database, and it's the storage layer under semantic search, recommendations, and RAG (next chapter). We build the core in NumPy so the magic is obvious, then map it to the real tools.

Setup: the in-memory store runs on NumPy only and the output below is real. Production uses Qdrant / Chroma / pgvector / Pinecone.

What a vector database actually does

Three operations, and the third is the whole point:

Store vectors, each with an id and metadata (the source text, a URL, a timestamp).
Filter by metadata (source = "docs", date > last_week).
Search: given a query vector, return the top-k most similar stored vectors — fast, even over millions.

The similarity is cosine (or dot product on normalized vectors) — exactly the recipe from the foundations book. Here's the entire core, from code/rag/vectorstore.py:

import numpy as np

class VectorStore:
    def __init__(self):
        self.ids, self.vecs, self.meta = [], [], []

    def add(self, id, vector, metadata=None):
        self.ids.append(id); self.vecs.append(np.asarray(vector, float))
        self.meta.append(metadata or {})

    def search(self, query_vec, k=3):
        M = np.vstack(self.vecs)
        sims = M @ np.asarray(query_vec, float)   # dot product vs. every stored vector (= cosine if unit-length)
        order = np.argsort(-sims)[:k]             # top-k, best first
        return [{"id": self.ids[i], "score": round(float(sims[i]), 3),
                 "metadata": self.meta[i]} for i in order]

That M @ query_vec — one matrix-vector product scoring every stored vector at once — is vector search. Everything a real vector DB adds is performance and durability around this idea.
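One detail worth making explicit: the search above scores with a plain dot product, which equals cosine similarity only when the stored and query vectors have unit length. Whether the book's embed() already normalizes is not shown here, so the following is a dependency-free sketch of the normalization step, with hypothetical helper names.

```python
import math

def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        return list(vec)            # leave an all-zero vector unchanged
    return [x / norm for x in vec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

a, b = [3.0, 4.0, 0.0], [6.0, 8.0, 0.0]            # same direction, different lengths
print(dot(a, b))                                    # 50.0 (raw dot product)
print(round(dot(normalize(a), normalize(b)), 6))    # 1.0 (cosine)
```

Normalizing once at insert time (and once per query) keeps the search itself a single matrix-vector product.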

Embeddings: where the vectors come from

A vector store is only as good as its embeddings. You don't compute these by hand — you call an embedding model that maps text to a learned vector. For a runnable, deterministic demo we use a tiny interpretable "topic embedder" (3 dimensions: pets / finance / tech); in production you swap in a real model (Voyage AI, OpenAI, or a local sentence-transformer) that produces 768–1536 dimensions of learned meaning.

$ python rag/vectorstore.py

Output:

query 'loyal pets to own' -> [{'id': 'd1', 'score': 1.0, 'metadata': {'src': 'pets.md'}}, {'id': 'd2', 'score': 0.0, 'metadata': {'src': 'news.md'}}]
query 'python programming' -> [{'id': 'd3', 'score': 1.0, 'metadata': {'src': 'tech.md'}}, {'id': 'd1', 'score': 0.0, 'metadata': {'src': 'pets.md'}}]

The pets query retrieved the pets document (score 1.0) and correctly ranked the finance doc last (0.0); the programming query found the tech doc. Same matmul, real ranking — with learned embeddings instead of our toy ones, "loyal pets" would also match "faithful companion animals" with no shared words. That semantic matching is why vector search beats keyword search.

Why you need a real vector database

Our NumPy store works for thousands of vectors. It falls over at scale, and that's exactly the gap the tools fill:

Our store	A vector database adds
`M @ q` scans every vector (O(n))	ANN indexes (HNSW, IVF-PQ) → sub-linear search over millions
Lives in RAM, lost on restart	Persistence to disk
No filtering	Metadata filters combined with vector search
One process	Scaling, sharding, replication
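The "metadata filters combined with vector search" row can be illustrated with a pre-filter: drop records whose metadata does not match, then rank what is left. This is a simplified sketch with made-up records and a hypothetical function name; real vector databases integrate the filter with their index rather than scanning a list.

```python
def filtered_search(records, query_vec, where, k=3):
    """Keep records whose metadata matches `where`, then rank them by dot product."""
    def matches(meta):
        return all(meta.get(key) == value for key, value in where.items())

    def score(vec):
        return sum(x * y for x, y in zip(vec, query_vec))

    candidates = [r for r in records if matches(r["metadata"])]
    ranked = sorted(candidates, key=lambda r: score(r["vector"]), reverse=True)
    return [r["id"] for r in ranked[:k]]

# Toy records in the 3-topic space (pets / finance / tech); values are made up.
records = [
    {"id": "d1", "vector": [1.0, 0.0, 0.0], "metadata": {"src": "pets.md"}},
    {"id": "d2", "vector": [0.0, 1.0, 0.0], "metadata": {"src": "news.md"}},
    {"id": "d4", "vector": [0.9, 0.1, 0.0], "metadata": {"src": "news.md"}},
]
print(filtered_search(records, [1.0, 0.0, 0.0], where={"src": "news.md"}))  # ['d4', 'd2']
```

Note that d1 is the best match overall but never appears, because the filter removed it before scoring.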

Don't be confused: exact vs. approximate search. Our argsort does exact nearest-neighbor search: perfect results, but O(n) per query. Vector databases use approximate nearest-neighbor (ANN) indexes that trade a tiny bit of recall for massive speed; graph indexes such as HNSW find the top-k in roughly O(log n). For millions of vectors under a tight latency budget, approximate search is usually the practical choice, and the HNSW and IVF-PQ sister books build those very indexes from scratch. A vector DB is essentially our VectorStore with an ANN index, persistence, and filtering bolted on.

The landscape

Tool	What it is
Chroma	dead-simple, embedded; great for prototypes and local RAG
Qdrant	fast Rust engine, rich filtering; popular self-hosted choice
pgvector	a Postgres extension — vectors in your existing database
Pinecone / Weaviate / Milvus	managed/scalable vector DBs for production
FAISS	Meta's library — the index, no server (the HNSW/IVF-PQ algorithms)

The same three calls — add, filter, search — exist in all of them, e.g. Chroma:

import chromadb
client = chromadb.Client()
col = client.create_collection("docs")
col.add(ids=["d1"], documents=["dogs are loyal pets"], metadatas=[{"src": "pets.md"}])
hits = col.query(query_texts=["faithful companion animals"], n_results=2)

Notice Chroma even calls the embedding model for you (documents= instead of raw vectors) — convenience over our explicit embed(), same mechanics underneath.

The takeaway

A vector database stores embeddings and answers "what's most similar to this?" in one matmul-plus-top-k — which you just built in NumPy. Production tools (Chroma, Qdrant, pgvector, Pinecone) add approximate-nearest-neighbor indexes (the HNSW/IVF-PQ engines), persistence, and metadata filtering so it scales to millions. Embeddings come from a model; the store finds neighbors. This is the retrieval half of RAG — now let's wire it to an LLM and answer questions. 👉

Building a RAG service

Retrieval-Augmented Generation (RAG) is the dominant pattern for building on top of LLMs — and it's built entirely from tools you now have: a vector store (last chapter) for retrieval, and an LLM for generation. It lets a model answer from your private or current documents — data it was never trained on — and it sharply reduces hallucination, because the model quotes retrieved text instead of inventing facts.

Setup: retrieval runs on NumPy (verified output below); generation needs pip install anthropic and an ANTHROPIC_API_KEY. Code in code/rag/rag.py.

The pattern in five steps

1. chunk + embed your documents  → vector store     (offline, once)
2. embed the user's question                        (per query)
3. retrieve the top-k similar chunks                (vector search)
4. stuff them into the prompt as context            (prompt assembly)
5. ask the LLM to answer USING ONLY that context    (grounded generation)

Steps 1–3 are vector search; steps 4–5 are one LLM call. That's the whole thing.

Steps 1–3: retrieve (you already built this)

We embed a small knowledge base into the VectorStore, then retrieve the chunks most relevant to the question:

from rag.vectorstore import VectorStore, embed

def retrieve(store, question, k=2):
    hits = store.search(embed(question), k=k)
    return [h["metadata"]["text"] for h in hits if h["score"] > 0]

Steps 4–5: assemble the prompt and generate

The grounding happens in the prompt: we hand the model the retrieved context and instruct it to answer only from that context. This is the single most important line in a RAG system:

import anthropic

def answer(question, context):
    context_block = "\n".join(f"- {c}" for c in context)
    prompt = (
        "Answer the question using ONLY the context below. "
        "If the context doesn't contain the answer, say you don't know.\n\n"
        f"Context:\n{context_block}\n\nQuestion: {question}"
    )
    client = anthropic.Anthropic()              # reads ANTHROPIC_API_KEY from env
    resp = client.messages.create(
        model="claude-opus-4-8",
        max_tokens=512,
        messages=[{"role": "user", "content": prompt}],
    )
    return "".join(b.text for b in resp.content if b.type == "text")

"Answer using ONLY the context… if it's not there, say you don't know" is what keeps the model honest. Without it, the LLM falls back on its training data and may confidently make things up.

Running it

cd code
python rag/rag.py "how loyal are dogs?"

The retrieval step runs and prints real output (the generation step needs the SDK and an API key):

question : how loyal are dogs?
retrieved: ['dogs are extremely loyal pets and bond closely with their owners', 'cats are independent pets that groom themselves']
answer   : [would send the assembled prompt to claude-opus-4-8]

The vector store surfaced the relevant chunk first ("dogs are extremely loyal…"), and the retrieved chunks get injected into the prompt. With the SDK installed and ANTHROPIC_API_KEY set, Claude returns a grounded answer like "According to the context, dogs are extremely loyal and bond closely with their owners.", sourced from your document, not its training data.
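Because retrieve() drops hits with a score of 0, it can return an empty list. One option in that case is to skip the LLM call entirely and abstain, rather than send a prompt with no context. The sketch below assumes a `generate` callable standing in for the real answer function; both helper names are hypothetical.

```python
def answer_or_abstain(question, context, generate):
    """Call the LLM only when retrieval found something to ground the answer in."""
    if not context:
        return "I don't know: no relevant documents were found."
    return generate(question, context)

def fake_generate(question, context):
    # Stand-in for the real LLM call so the sketch runs offline.
    return f"(answer grounded in {len(context)} chunk(s))"

print(answer_or_abstain("how loyal are dogs?", [], fake_generate))
print(answer_or_abstain("how loyal are dogs?", ["dogs are extremely loyal pets"], fake_generate))
```

The short-circuit saves a paid API call and guarantees the "say you don't know" behavior for the empty case, instead of relying on the model to follow the instruction.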

Why RAG instead of fine-tuning?

Don't be confused: RAG vs. fine-tuning. RAG adds knowledge — facts the model reads at query time, kept fresh by updating the vector store. Fine-tuning adds behavior — a tone, format, or skill baked into the weights. To make a model know your latest docs, use RAG (update a document → it's instantly retrievable); fine-tuning would require expensive retraining and still wouldn't cite sources. Most production "chat with your data" systems are RAG, not fine-tuned.

What makes RAG hard in production

The toy version is five steps; the production version is where the engineering lives — and these are the RAG interview questions:

Chunking. How do you split documents? Too big → noisy, blows the context budget; too small → loses context. Chunk by paragraph/section with overlap.
Retrieval quality. Pure vector search misses exact keywords; hybrid search (vector + keyword/BM25) plus a re-ranker (a model that re-scores the top candidates) is the standard upgrade.
The context budget. You can only fit so many chunks in the prompt — retrieve more, then re-rank down to the best few.
Evaluation. Did the answer actually use the right sources? Did it hallucinate? This is hard and essential — see LLM observability.
Freshness. Re-embed and re-index documents as they change (a job for Celery or Prefect).
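As an illustration of the chunking point above, here is a minimal sketch of overlapping chunks. It splits on words with a fixed window, which is a simplification of paragraph or section chunking; the function name and the sizes are assumptions chosen for the demo.

```python
def chunk_words(text, size=50, overlap=10):
    """Split text into word windows of `size`, each sharing `overlap` words with the previous."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break                      # the last window already reached the end
    return chunks

doc = " ".join(f"w{i}" for i in range(12))
for chunk in chunk_words(doc, size=5, overlap=2):
    print(chunk)
```

Each chunk repeats the last two words of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk.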

The production architecture

A real RAG service stacks the tools from this whole book:

[ingest pipeline]  docs → chunk → embed → vector DB        (Prefect, Ch 9 / Ch 14)
[query path]       FastAPI /ask (Ch 4) → embed → retrieve (Ch 14)
                   → re-rank → prompt → Claude → grounded answer
                   ↑ cache embeddings & answers (Redis, Ch 7)
                   ↑ track cost, latency, quality (Ch 17)

Every box is a chapter. RAG is the capstone that ties the GenAI stack together.

The takeaway

RAG grounds an LLM in your documents: embed and store them, retrieve the chunks relevant to a question, inject them into the prompt, and instruct the model to answer only from that context. It adds knowledge (use it over fine-tuning for facts) and curbs hallucination. The toy is five steps; production hardens chunking, hybrid retrieval, re-ranking, and evaluation. Next: how the LLM itself gets served — the API vs. running your own. 👉

Serving & running LLMs

You don't train large language models — you call them. But "call them" splits into two very different worlds: rent an API (Claude, GPT, Gemini) or run an open-weight model yourself (Llama, Mistral, Qwen via Ollama or vLLM). This chapter maps both, with the client patterns you'll actually write, so you can reason about the cost/control trade-off every AI team faces.

Setup: the API client needs pip install anthropic + ANTHROPIC_API_KEY; self-hosting needs Ollama or vLLM. Follow-along — code and commands shown.

The two worlds

	API model (Claude, GPT, Gemini)	Open-weight, self-hosted (Llama, Mistral)
You get	the strongest models, instantly	full control, privacy, fixed cost
You manage	nothing — rent by the token	GPUs, scaling, the serving stack
Cost	per token (can balloon at scale)	hardware (fixed, high upfront)
Privacy	data leaves your network	data stays in your VPC
Best for	most apps, fastest path	privacy-critical, high-volume, customization

Don't be confused: API vs. self-hosting is a real architectural decision. It's not "API = beginner, self-host = pro." Renting an API is the right call for the majority of products — you get frontier quality with zero ops. Self-hosting wins when data can't leave your infrastructure, when token volume makes per-call pricing painful, or when you need to fine-tune the weights. Reason about it on cost, privacy, and control — not prestige.

World 1: calling an API (the common case)

The whole interface is one HTTP call. With the Anthropic SDK (code/llm/claude_client.py):

import anthropic
client = anthropic.Anthropic()                 # reads ANTHROPIC_API_KEY from env

resp = client.messages.create(
    model="claude-opus-4-8",
    max_tokens=512,
    messages=[{"role": "user", "content": "Summarize MLOps in one sentence."}],
)
print(resp.content[0].text)

Stream for responsiveness

For anything user-facing, stream the tokens as they're generated so the user sees output immediately instead of waiting for the whole response:

with client.messages.stream(
    model="claude-opus-4-8",
    max_tokens=512,
    messages=[{"role": "user", "content": "Write a haiku about deployment."}],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)

Streaming is also how you avoid request timeouts on long outputs — the connection stays alive as tokens flow. This messages.stream / text_stream pattern is the one you'll write most often behind a chat UI.

The serving concerns are their problem

When you rent an API, the hard parts of LLM serving — GPU memory, batching, scaling, uptime — are handled for you. You focus on prompts, retrieval, cost, and latency.

World 2: running your own (open-weight models)

When you need control, you serve an open-weight model yourself. Two tools dominate:

Ollama — the easy on-ramp

Ollama runs open models locally with one command — perfect for development, prototyping, and privacy-sensitive local apps:

ollama pull llama3.2          # download an open-weight model
ollama run llama3.2           # chat in the terminal
# it also serves an HTTP API on localhost:11434

import requests
r = requests.post("http://localhost:11434/api/generate",
                  json={"model": "llama3.2", "prompt": "Hello", "stream": False})
print(r.json()["response"])

vLLM — production-grade serving

vLLM is the high-throughput inference server for self-hosting at scale. Its key trick, continuous batching (new requests join the running GPU batch as soon as earlier ones finish, rather than waiting for a whole batch to complete), gives far higher throughput than naive one-request-at-a-time serving. It exposes an OpenAI-compatible API:

pip install vllm
vllm serve meta-llama/Llama-3.1-8B-Instruct      # serves an OpenAI-style API
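Because the server speaks the OpenAI chat format, a client can be written with nothing but the standard library. The sketch below assumes the server is listening on localhost:8000 (vLLM's default, which would collide with a FastAPI service on the same port) and that the model name matches the one passed to vllm serve; the helper names are hypothetical.

```python
import json
import urllib.request

def build_chat_request(prompt, model="meta-llama/Llama-3.1-8B-Instruct",
                       base_url="http://localhost:8000"):
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def chat(prompt):
    """Send the request; requires a vLLM server running locally."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=60) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

req = build_chat_request("Summarize MLOps in one sentence.")
print(req.full_url)   # http://localhost:8000/v1/chat/completions
```

With the server up, chat("Summarize MLOps in one sentence.") returns the generated text. The same request shape is why existing OpenAI-style clients can usually be pointed at a vLLM endpoint by changing only the base URL.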

Other names you'll hear: TGI (Hugging Face Text Generation Inference), Triton (NVIDIA), TensorRT-LLM (fastest on NVIDIA GPUs), and llama.cpp (runs quantized models on CPUs/laptops).

Don't be confused: Ollama vs. vLLM. Ollama = easy, local, single-user, development and small apps. vLLM = production serving, high throughput, many concurrent users, GPU clusters. Prototype with Ollama; serve real traffic with vLLM (or a managed endpoint). Both run the same open-weight models — they're different engines, not different models.

Making self-hosted models affordable: quantization

A 70B model at 16-bit precision needs ~140 GB of GPU memory for the weights alone (70 billion × 2 bytes), which means multiple expensive cards. Quantization (Chapter 10's ONNX cousin) stores weights in 4–8 bits instead of 16, shrinking the model 2–4× so it fits on far less hardware (a single large GPU at 4-bit; smaller models fit on a laptop), with modest quality loss. Formats you'll see: GGUF (llama.cpp/Ollama), AWQ, GPTQ. This is what makes self-hosting economically viable.
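The memory arithmetic behind that ~140 GB figure is short enough to check by hand. This sketch counts the weights only; the KV cache, activations, and quantization metadata all add to the real footprint, so treat the results as lower bounds.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Approximate memory for the weights alone (no KV cache or activations)."""
    return n_params * bits_per_weight / 8 / 1e9

for bits in (16, 8, 4):
    print(f"70B at {bits:>2}-bit: ~{weight_memory_gb(70e9, bits):.0f} GB")
```

The three lines come out at roughly 140, 70, and 35 GB: halving the bits per weight halves the memory, which is the whole economic argument for quantization.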

Choosing, in practice

Need frontier quality, fast, low ops?            → API (Claude/GPT/Gemini)
Data must not leave your network?                → self-host (vLLM)
Prototyping / local / single user?               → Ollama
Huge token volume, cost-sensitive?               → self-host + quantize, or batch API
Need to fine-tune the weights?                   → open-weight + LoRA, self-hosted

Many production systems are hybrid: an API model for the hard requests, a small self-hosted model for cheap high-volume ones, routed by difficulty.

The takeaway

Serving an LLM means one of two things: rent an API (one messages.create call; stream for UX; serving is the provider's problem) or run an open-weight model yourself with Ollama (easy/local) or vLLM (production throughput via continuous batching), made affordable by quantization. Choose on cost, privacy, and control — not prestige; hybrids are common. Whichever you pick, you must watch it — cost, latency, and quality. That's LLM observability. 👉

LLM observability & evaluation

A classical model has metrics like accuracy; an LLM feature has cost, latency, and "was the answer any good?" — and that last one has no ground-truth label. LLM observability is monitoring (Chapter 12) adapted to this reality: track every call's tokens, cost, and latency, trace multi-step chains, and evaluate output quality without labels. Skip it and you'll get a surprise five-figure bill and a silently-degrading product.

Setup: the cost/token math runs with no dependencies (verified output below). Tracing and eval use hosted tools (Langfuse, LangSmith) or another LLM.

1. Track cost, tokens & latency on every call

Every LLM response reports its token usage; turn that into dollars and log it. From code/llm/claude_client.py:

PRICING = {                       # USD per 1,000,000 tokens (input, output)
    "claude-opus-4-8":   (5.00, 25.00),
    "claude-sonnet-4-6": (3.00, 15.00),
    "claude-haiku-4-5":  (1.00,  5.00),
}

def cost_usd(model, input_tokens, output_tokens):
    pin, pout = PRICING[model]
    return input_tokens/1e6 * pin + output_tokens/1e6 * pout

The eye-opener is what model choice does to the bill. A typical RAG call (~1,500 input tokens of context + question, ~300 output) across the model tiers:

$ python llm/claude_client.py

Output:

claude-opus-4-8     1500 in / 300 out  ->  $0.01500
claude-sonnet-4-6   1500 in / 300 out  ->  $0.00900
claude-haiku-4-5    1500 in / 300 out  ->  $0.00300

100k Opus calls/day  ->  $1,500.00/day  ~$45,000/month

That projection is why you instrument cost from day one. The same workload is $45k/month on Opus, ~$27k on Sonnet, ~$9k on Haiku — a 5× swing from one config value. You can only make that trade-off if you're measuring per-call cost in production. (And prompt caching can cut the input cost of repeated context by ~90% — another lever you only find by watching the numbers.)
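The monthly figures follow directly from the per-call cost. A minimal sketch of the projection, reusing the prices quoted above and assuming a 30-day month and the same 1,500-in / 300-out call shape (the monthly_cost helper is illustrative):

```python
PRICING = {                       # USD per 1,000,000 tokens (input, output), as quoted above
    "claude-opus-4-8":   (5.00, 25.00),
    "claude-sonnet-4-6": (3.00, 15.00),
    "claude-haiku-4-5":  (1.00,  5.00),
}

def cost_usd(model, input_tokens, output_tokens):
    pin, pout = PRICING[model]
    return input_tokens / 1e6 * pin + output_tokens / 1e6 * pout

def monthly_cost(model, calls_per_day, input_tokens=1500, output_tokens=300, days=30):
    """Project one call's cost across a month of steady traffic."""
    return cost_usd(model, input_tokens, output_tokens) * calls_per_day * days

for model in PRICING:
    print(f"{model:<18} ${monthly_cost(model, 100_000):>10,.0f}/month")
```

Changing calls_per_day or the token counts shows how quickly a longer prompt or a traffic spike moves the bill.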

The production version logs a structured record per call — not a print:

from dataclasses import dataclass

@dataclass
class CallRecord:
    model: str; input_tokens: int; output_tokens: int; cost: float; latency_s: float

Ship these to your metrics system and you get dashboards for cost per day, tokens per request, p95 latency, and error rate — the LLM's operational vitals.
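One way to fill such a record is to wrap every LLM call in a small timing helper. The sketch below is an assumption about how that could look, not the book's claude_client.py: `call` is a hypothetical zero-argument function returning the text plus token counts, and the prices are passed in explicitly.

```python
import time
from dataclasses import dataclass, asdict

@dataclass
class CallRecord:
    model: str
    input_tokens: int
    output_tokens: int
    cost: float
    latency_s: float

def timed_call(model, call, price_in, price_out):
    """Run `call()` and wrap its token counts in a CallRecord.

    `call` is a hypothetical zero-argument function returning
    (text, input_tokens, output_tokens); prices are USD per million tokens.
    """
    start = time.perf_counter()
    text, tokens_in, tokens_out = call()
    latency = time.perf_counter() - start
    cost = tokens_in / 1e6 * price_in + tokens_out / 1e6 * price_out
    return text, CallRecord(model, tokens_in, tokens_out, cost, latency)

def fake_llm():
    return "ok", 1500, 300       # stand-in for a real API call

text, record = timed_call("claude-opus-4-8", fake_llm, 5.00, 25.00)
print(asdict(record))
```

asdict turns the record into a plain dictionary, which is the shape most logging and metrics pipelines accept.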

2. Trace multi-step chains

A RAG request isn't one call — it's embed → retrieve → (re-rank) → generate, sometimes across several LLM calls. When the answer is wrong, which step failed? Tracing records the full tree of a request — inputs, outputs, tokens, and timing at each step — so you can see exactly where it went sideways.

Tools: Langfuse (open-source), LangSmith (LangChain's), Helicone, plus OpenTelemetry for generic spans. The pattern is a decorator or context manager around each step:

from langfuse.decorators import observe   # decorator import path; it can differ between Langfuse SDK versions

@observe()
def retrieve(question): ...        # each @observe step becomes a span in the trace

@observe()
def answer(question, context): ... # inputs, outputs, tokens, latency captured per span

Don't be confused: logging vs. tracing. A log is one event ("called Claude, 300 tokens"). A trace is the whole request tree — every step, nested, with timing and token counts at each node. For single calls, logs suffice; for RAG and agent chains, you need tracing to answer "the final answer was wrong — was it bad retrieval or bad generation?"
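To make "a trace is a tree of spans" concrete, here is a toy tracer written from scratch. It is not how Langfuse or OpenTelemetry are implemented; the Tracer class is hypothetical and only records nesting depth, name, and duration.

```python
import time
from contextlib import contextmanager

class Tracer:
    """Toy tracer: records nested spans as (depth, name, seconds)."""
    def __init__(self):
        self.spans = []
        self._depth = 0

    @contextmanager
    def span(self, name):
        depth = self._depth
        index = len(self.spans)
        self.spans.append(None)            # reserve the slot so parents precede children
        self._depth += 1
        start = time.perf_counter()
        try:
            yield
        finally:
            self._depth -= 1
            self.spans[index] = (depth, name, time.perf_counter() - start)

tracer = Tracer()
with tracer.span("rag_request"):
    with tracer.span("retrieve"):
        pass                                # vector search would run here
    with tracer.span("generate"):
        pass                                # the LLM call would run here

for depth, name, seconds in tracer.spans:
    print("  " * depth + name)
```

The printed indentation mirrors the request tree: one rag_request span with retrieve and generate nested under it, each carrying its own timing.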

3. Evaluate quality — without labels

The hard part: there's no accuracy score for "summarize this" or "answer from these docs." Three approaches, in rising rigor:

LLM-as-judge

Use a strong LLM to grade outputs against a rubric. It scales where human review can't:

judge_prompt = (
    "Score the ANSWER from 1-5 for how well it's supported by the CONTEXT.\n"
    "Reply with only the number.\n\n"
    f"CONTEXT:\n{context}\n\nANSWER:\n{answer}"
)
# send to claude-opus-4-8, parse the score, track it over time

This is how you measure RAG groundedness (did the answer stick to the retrieved context, or hallucinate?) at scale. Tools like Ragas and DeepEval package these metrics (faithfulness, answer-relevancy, context-precision).
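Even with "reply with only the number" in the prompt, a judge model occasionally wraps the score in extra words, so the parsing step deserves a little care. A minimal sketch, assuming a 1 to 5 rubric as above (the function name is illustrative):

```python
import re

def parse_judge_score(reply, low=1, high=5):
    """Extract the first integer in [low, high] from a judge reply, else None."""
    for token in re.findall(r"\d+", reply):
        value = int(token)
        if low <= value <= high:
            return value
    return None

print(parse_judge_score("4"))                 # 4
print(parse_judge_score("Score: 5/5"))        # 5
print(parse_judge_score("I cannot grade."))   # None
```

Returning None for an unparseable reply lets the caller count failed gradings separately instead of silently recording a wrong score.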

Golden datasets

Keep a fixed set of representative inputs with known-good answers; run them on every prompt or model change and check for regressions — the LLM equivalent of a test suite (Chapter 18).
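A golden-set check can start as a loop over fixed cases. The sketch below uses a simple "must contain these phrases" rule, which is an assumption made for brevity; real suites often combine it with a judge score. The cases and the fake generator are made up so the example runs offline.

```python
def run_golden_set(cases, generate):
    """Return the cases whose output is missing any required phrase."""
    failures = []
    for case in cases:
        output = generate(case["input"]).lower()
        missing = [p for p in case["must_contain"] if p.lower() not in output]
        if missing:
            failures.append({"input": case["input"], "missing": missing})
    return failures

# Hypothetical golden cases; a real set would hold dozens of representative inputs.
GOLDEN = [
    {"input": "how loyal are dogs?", "must_contain": ["loyal"]},
    {"input": "do cats groom themselves?", "must_contain": ["groom"]},
]

def fake_generate(question):
    # Stand-in for the RAG pipeline so the check runs offline.
    return "Dogs are extremely loyal." if "dogs" in question else "I don't know."

print(run_golden_set(GOLDEN, fake_generate))
```

Here the cats question fails because the stand-in answers "I don't know.", which is exactly the kind of regression a prompt or model change can introduce unnoticed.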

Human feedback

Collect 👍/👎 and corrections from real users. The highest-signal data — and the seed for fine-tuning or improving retrieval.

Don't be confused: offline eval vs. online monitoring. Offline eval runs your golden set against a new prompt/model before shipping — a gate, like CI. Online monitoring watches real production traffic (cost, latency, thumbs-down rate, judge scores on a sample) after shipping. You need both: eval to catch regressions before release, monitoring to catch drift after.

What to watch in production

Signal	Why
Cost / day & tokens / request	budgets blow up silently; catch runaway prompts
p50 / p95 latency	LLMs are slow; the tail kills UX
Error & timeout rate	rate limits, overloads, malformed outputs
Groundedness / judge score (sampled)	is RAG still answering from sources?
Thumbs-down rate	the realest quality signal you have
Prompt-injection / safety flags	users will try to break it
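The p50 / p95 row is worth a quick worked example, because the average hides exactly what the tail reveals. This sketch uses the nearest-rank definition of a percentile (one of several in common use) on made-up latencies.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

# Made-up latencies in seconds: mostly fast, with one slow outlier.
latencies = [0.8, 0.9, 1.0, 1.1, 1.2, 1.0, 0.9, 1.1, 1.0, 9.5]
print("p50:", percentile(latencies, 50))   # p50: 1.0
print("p95:", percentile(latencies, 95))   # p95: 9.5
```

The median says the service takes about a second; the p95 says one user in twenty may wait nearly ten. Alerting on the tail catches that.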

The takeaway

LLM observability = track cost/tokens/latency on every call (model choice is a 5× cost lever you must measure), trace multi-step chains to localize failures, and evaluate quality without labels via LLM-as-judge, golden datasets, and human feedback. Keep offline eval (a pre-ship gate) separate from online monitoring (post-ship drift). That completes the GenAI stack — now the engineering discipline that holds all of it together: testing and CI/CD. 👉

Testing & CI/CD

Every chapter so far added a capability; this one adds the discipline that keeps them from breaking. Testing proves your code and model work; CI/CD runs those tests automatically on every change and blocks anything broken from reaching production. This is the difference between "it worked when I tried it" and "it provably works, on every commit, forever."

Setup: pip install pytest (already a dependency). The tests in code/tests/ run live — the output below is real.

ML needs more tests, not fewer

Regular software tests ask "does this function return the right value?" ML code needs those plus tests for the things that can silently rot:

Test type	Asks
Unit	does this function work? (tokenizer, cost math)
Behavioral	does the model do its job? (positive text → positive)
Range/invariant	are outputs valid? (probabilities in [0, 1])
Convergence	does training actually learn? (train accuracy > 0.9)
Artifact round-trip	save → load → identical predictions?
API contract	right status codes, schema, validation?

Testing the model

From code/tests/test_model.py — note these are ML tests, not just unit tests:

import pytest
from sentiment.data import load_dataset
from sentiment.model import SentimentModel

@pytest.fixture(scope="module")
def trained_model():
    return SentimentModel(epochs=300).fit(*load_dataset())

def test_positive_text_scores_high(trained_model):
    assert trained_model.predict_proba(["i love this it is great"])[0] > 0.5

def test_predictions_are_valid_probabilities(trained_model):
    probs = trained_model.predict_proba(["great", "terrible", "the cat"])
    assert ((probs >= 0.0) & (probs <= 1.0)).all()       # invariant: never out of range

def test_training_converges(trained_model):
    texts, labels = load_dataset()
    assert trained_model.accuracy(texts, labels) > 0.9   # it actually learns

def test_artifact_roundtrip(trained_model, tmp_path):
    path = tmp_path / "m.json"
    trained_model.save(str(path))
    reloaded = SentimentModel.load(str(path))
    assert reloaded.predict_proba(["this is great"])[0] == pytest.approx(
        trained_model.predict_proba(["this is great"])[0])

The @pytest.fixture trains the model once and shares it across tests (fast); tmp_path is a pytest-provided temp directory (no cleanup needed). The round-trip test is the one beginners skip and regret — a serialization bug silently ships a broken model.

Testing the API

FastAPI's TestClient exercises the app in-process — no running server — so endpoint tests are CI-friendly (code/tests/test_api.py):

from fastapi.testclient import TestClient
from api.main import app
client = TestClient(app)

def test_health_ok():
    assert client.get("/health").json()["status"] == "ok"

def test_empty_text_is_rejected():
    assert client.post("/predict", json={"text": ""}).status_code == 422  # validation

Running the suite

cd code
pytest -q

Output:

............                                                             [100%]
12 passed in 0.76s

Twelve tests — model behavior, invariants, convergence, round-trip, and the API contract — in under a second. This is the gate. A change that breaks any of them fails here, before it ever reaches a user.

CI: run the tests automatically on every change

You won't remember to run pytest before every push — so a machine does it for you. Continuous Integration runs your tests on every commit and pull request. .github/workflows/ci.yml (GitHub Actions):

name: CI
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: "3.11", cache: pip }
      - run: pip install numpy fastapi httpx "uvicorn[standard]" pydantic pytest ruff   # httpx: needed by TestClient
      - run: ruff check .          # lint
      - run: pytest -q             # test

Now every pull request shows a green check ✅ or red ✗. Once a branch protection rule marks the CI check as required, a red run blocks the merge, so broken code cannot reach main. That's the whole point: the gate is automatic and unskippable.

CD: deploy automatically once tests pass

Continuous Delivery/Deployment extends the pipeline: after tests pass on main, build the Docker image (Chapter 5) and ship it.

  build-image:
    needs: test                              # ONLY if tests passed
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: docker build -t sentiment-api:${{ github.sha }} .
      # then: docker push <registry>/...   and   deploy (kubectl / cloud run / ...)

needs: test is the safety interlock — the image only builds if the tests pass.

Don't be confused: CI vs. CD. CI (Continuous Integration) = automatically test every change. CD = automatically deliver/deploy changes that pass. CI is the gate; CD is the conveyor belt after it. You adopt CI first (always), CD when you're confident the gate is strong enough to ship without a human in the loop.

What makes ML CI/CD special

Beyond code tests, mature ML pipelines add ML-specific gates (Chapter 9's quality gate, Chapter 17's eval):

Model-quality gate — the newly trained model must beat the current production model on a holdout before it can deploy.
Data validation — schema and distribution checks on incoming data (drift, Chapter 12) before training.
Eval gate for LLM features — run the golden-set eval; block on regressions.

This is CI/CD/CT — Continuous Training — the loop from Chapter 0 made fully automatic.
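
The model-quality gate in the list above can be sketched in a few lines. This is a minimal illustration, not the Chapter 9 implementation: the names quality_gate and main, the min_gain margin, and the example accuracies are all assumptions made for the sketch. Both accuracies are assumed to come from the same holdout set.

```python
def quality_gate(candidate_acc: float, production_acc: float, min_gain: float = 0.0) -> bool:
    """Return True only if the candidate beats production by more than min_gain."""
    return candidate_acc > production_acc + min_gain

def main(candidate_acc: float, production_acc: float) -> int:
    """Return a process exit code: 0 lets the pipeline continue, 1 fails the job."""
    if quality_gate(candidate_acc, production_acc):
        print(f"PASS: candidate {candidate_acc:.3f} beats production {production_acc:.3f}")
        return 0
    print(f"FAIL: candidate {candidate_acc:.3f} does not beat production {production_acc:.3f}")
    return 1

# Hypothetical numbers; a real pipeline would load both metrics from its tracking store
# and finish with sys.exit(exit_code) so CI sees the failure.
exit_code = main(candidate_acc=0.93, production_acc=0.91)
```

Run as a CI step, a script like this turns "the new model is worse" into a red check, exactly like a failing unit test.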

The takeaway

Test the model (behavior, invariants, convergence, round-trip) and the API (contract, validation) — ML needs more tests than ordinary code, and our 12 run in under a second. CI runs them on every commit and blocks broken merges; CD ships what passes; ML adds model-quality, data-validation, and eval gates on top. Automated, unskippable gates are what let a team move fast without breaking production. One discipline left before we assemble everything: configuration, secrets, and security. 👉

Config, secrets & security

The fastest way to turn a working service into an incident is a hardcoded API key in a public repo or an unprotected /predict endpoint. This chapter covers the unglamorous-but-essential basics: load configuration the right way, keep secrets out of your code, and put a lock on your API. None of it is hard — and skipping it is how breaches and surprise bills happen.

Setup: pip install pydantic-settings; the auth demo uses FastAPI's TestClient. Output below is real. Code in code/config/.

Configuration: never hardcode

Anything that changes between environments — a database URL, the model path, a log level — is configuration, and it belongs in the environment, not the code. The 12-factor rule: config lives in environment variables, so the same image runs in dev, staging, and prod with different settings and no code change.

Pydantic Settings does this cleanly: declare your config as a typed class, and it reads from environment variables (and a .env file), validates types, and fails fast. From code/config/settings.py:

from pydantic import Field
from pydantic_settings import BaseSettings, SettingsConfigDict

class Settings(BaseSettings):
    model_config = SettingsConfigDict(env_file=".env", extra="ignore")
    model_path: str = "model.json"
    redis_url: str = "redis://localhost:6379/0"
    max_batch_size: int = Field(default=128, ge=1, le=10_000)   # validated range
    api_key: str = Field(default="", repr=False)                # secret: hidden from repr()

$ MAX_BATCH_SIZE=256 API_KEY=secret123 python config/settings.py

Output:

model_path    : model.json
redis_url     : redis://localhost:6379/0
max_batch_size: 256
api_key set?  : True (value never printed)

MAX_BATCH_SIZE=256 was read and type-coerced to an int; an out-of-range or non-numeric value would fail at startup, not mid-request. And repr=False keeps the secret out of the settings object's repr, so printing or logging the object does not reveal the key (logging settings.api_key directly still would). Validated config that fails fast beats a typo discovered in production.
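
To make "fails at startup, not mid-request" concrete, here is a stdlib-only sketch of the same idea. It is not how Pydantic Settings is implemented, and the load_max_batch_size helper is hypothetical; it only mirrors the ge=1, le=10_000 range declared above.

```python
import os

def load_max_batch_size(env=os.environ, default=128, lo=1, hi=10_000):
    """Read MAX_BATCH_SIZE from the environment, coerce it to int, and range-check it."""
    raw = env.get("MAX_BATCH_SIZE")
    if raw is None:
        return default
    try:
        value = int(raw)                       # "256" -> 256; "lots" raises ValueError
    except ValueError:
        raise ValueError(f"MAX_BATCH_SIZE must be an integer, got {raw!r}") from None
    if not lo <= value <= hi:
        raise ValueError(f"MAX_BATCH_SIZE must be in [{lo}, {hi}], got {value}")
    return value

# Demo with fake environments; a real service would call this once at startup,
# so a bad value stops the process before it serves any traffic.
for fake_env in ({"MAX_BATCH_SIZE": "256"}, {}, {"MAX_BATCH_SIZE": "0"}, {"MAX_BATCH_SIZE": "lots"}):
    try:
        print(fake_env, "->", load_max_batch_size(fake_env))
    except ValueError as exc:
        print(fake_env, "-> startup error:", exc)
```

Pydantic Settings gives you this behavior for every field from the type annotations alone, which is why the book uses it instead of hand-written checks.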

Secrets: keep them out of the code

Don't be confused: configuration vs. secrets. Both come from the environment, but a secret (API key, DB password, token) is sensitive — it must never be committed, logged, or printed. Config like a log level can live in plain docker-compose.yml; a secret cannot.

The rules, in order of importance:

Never commit secrets. No keys in source. Add .env to .gitignore (the project does). One leaked key in git history is a breach — and git remembers forever.
Load from the environment (or a .env file locally that's gitignored).
In production, use a secrets manager — AWS Secrets Manager, GCP Secret Manager, HashiCorp Vault, or Kubernetes Secrets. These store, rotate, and audit access to secrets; your app fetches them at startup.
Rotate on exposure. If a key leaks, revoke and reissue it immediately — which is easy only if it was never hardcoded.

# .env  (gitignored — never committed)
API_KEY=sk-your-real-key
REDIS_URL=redis://prod-redis:6379/0

The classic mistake: committing a .env with real keys, or pasting a key into a notebook you later push. Scan your repo with tools like gitleaks or trufflehog; many CI pipelines (Chapter 18) run a secret-scan step to block leaks automatically.
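
To show what a secret-scan step does in principle, here is a toy scanner. It is only an illustration: the function scan_text and the two patterns (an sk- prefixed key like the .env example above, and the AKIA prefix used by AWS access key IDs) are assumptions chosen for the sketch. Real tools such as gitleaks ship far larger rule sets and also scan git history, which this does not.

```python
import re

# Illustrative patterns only; real scanners use many more rules.
PATTERNS = {
    "sk-style API key": re.compile(r"sk-[A-Za-z0-9_-]{16,}"),
    "AWS access key ID": re.compile(r"AKIA[0-9A-Z]{16}"),
}

def scan_text(text: str) -> list[tuple[int, str]]:
    """Return (line_number, rule_name) for every line that matches a pattern."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append((lineno, name))
    return findings

# A made-up key, used only to trigger the first rule.
sample = "LOG_LEVEL=info\nAPI_KEY=sk-abcdefghijklmnop1234\n"
print(scan_text(sample))
```

A CI step built on this idea would exit non-zero whenever the findings list is non-empty.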

Securing the API

Your model endpoint is on the internet — someone will find it. At minimum, it needs authentication (who are you?) and rate limiting (how often can you call?).

Authentication: require an API key

A FastAPI dependency checks a header on every protected route — one function, applied everywhere:

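import secrets
from fastapi import FastAPI, Depends, HTTPException, Header

def require_key(x_api_key: str = Header(default="")):
    expected = settings.api_key
    # Reject when no key is configured (otherwise an empty header would match the
    # empty default), and compare in constant time to avoid leaking timing information.
    if not expected or not secrets.compare_digest(x_api_key.encode(), expected.encode()):
        raise HTTPException(status_code=401, detail="invalid or missing API key")

@app.get("/secure")
def secure(_=Depends(require_key)):
    return {"ok": True}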

Verified end to end (via TestClient):

no key   -> 401
bad key  -> 401
good key -> 200  {'ok': True}

No key or a wrong key → 401 Unauthorized; the right key → 200. The endpoint is now locked. (For real user systems you'd graduate to OAuth2 / JWT tokens — FastAPI has first-class support — but a checked API key is the right baseline for service-to-service calls.)

Rate limiting: cap abuse

Without a limit, one client (or one bug, or one attacker) can flood your service or run up a huge LLM bill. The Redis atomic-counter pattern from Chapter 7 caps requests per client per window:

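import redis

r = redis.Redis.from_url("redis://localhost:6379/0")

def allow(user_id, limit=100, window=60):
    key = f"rate:{user_id}"
    n = r.incr(key)                 # atomic: creates the key at 1 on the first hit
    if n == 1:
        r.expire(key, window)       # first hit starts the window (60s by default)
    return n <= limit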

Reject with 429 Too Many Requests once the limit is hit. (Libraries like slowapi wire this into FastAPI for you.)
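
The window logic is easy to try without a Redis server. The sketch below is an in-memory stand-in, not the Chapter 7 code: the FixedWindowLimiter class and its injectable clock are assumptions for illustration, and a single-process dict gives none of the cross-worker sharing that Redis provides.

```python
import time

class FixedWindowLimiter:
    """In-memory analogue of the INCR + EXPIRE pattern (single process only)."""

    def __init__(self, limit=100, window=60, clock=time.monotonic):
        self.limit, self.window, self.clock = limit, window, clock
        self._counts = {}   # user_id -> (window_start, hits)

    def allow(self, user_id) -> bool:
        now = self.clock()
        start, hits = self._counts.get(user_id, (now, 0))
        if now - start >= self.window:      # window elapsed: like the Redis key expiring
            start, hits = now, 0
        hits += 1
        self._counts[user_id] = (start, hits)
        return hits <= self.limit

# A fake clock keeps the example deterministic.
fake_now = [0.0]
limiter = FixedWindowLimiter(limit=3, window=60, clock=lambda: fake_now[0])
print([limiter.allow("alice") for _ in range(4)])   # the fourth call exceeds the limit
fake_now[0] = 61.0
print(limiter.allow("alice"))                        # a new window has started
```

In an endpoint, a False return is where you would raise the 429 response.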

The security baseline checklist

For any service that goes live:

No secrets in code or git history (gitignore .env, scan in CI)
Config & secrets from the environment / a secrets manager, validated at startup
Authentication on every non-public endpoint (API key → OAuth2/JWT)
Rate limiting to cap abuse and runaway cost
Input validation (Pydantic — Chapter 4 — rejects junk before it runs)
HTTPS only (terminate TLS at the load balancer / gateway)
Least privilege — the service's credentials can do only what it needs
Dependency scanning (pip-audit, Dependabot) for known CVEs
Don't log secrets or full payloads (PII, keys)

You don't need all of it on day one, but you need this list in your head before exposing a model to the internet.

The takeaway

Load config and secrets from the environment with typed, fail-fast validation (Pydantic Settings); never commit secrets — gitignore .env, use a secrets manager in prod, rotate on leak. Lock your API with authentication (a checked key → 401/200) and rate limiting (429), on top of input validation and HTTPS. None of it is hard; all of it is the difference between a demo and a service you can trust in production. Now let's assemble every tool in this book into one running system. 👉

The full stack with docker-compose

You've wrapped one model in a dozen tools, one chapter at a time. In production they run together: the API, the worker, Redis, and the tracking server, all at once, talking to each other. Starting four services by hand in four terminals is painful and fragile. docker-compose defines the whole system in one file and launches it with one command.

Setup: Docker with Compose (bundled in Docker Desktop). Follow-along.

The whole system, declared

docker-compose.yml describes every service and how they connect:

services:
  redis:                                    # broker + cache
    image: redis:7-alpine
    ports: ["6379:6379"]

  api:                                      # the FastAPI model service
    build: .
    ports: ["8000:8000"]
    environment: [REDIS_URL=redis://redis:6379/0]
    depends_on: [redis]

  worker:                                   # Celery background worker
    build: .
    command: celery -A tasks.celery_app worker --loglevel=info
    environment: [REDIS_URL=redis://redis:6379/0]
    depends_on: [redis]

  mlflow:                                   # experiment tracking / registry UI
    image: ghcr.io/mlflow/mlflow:latest
    command: mlflow server --host 0.0.0.0 --port 5000 --backend-store-uri sqlite:////mlflow/mlflow.db
    ports: ["5000:5000"]
    volumes: [mlflow-data:/mlflow]

volumes:
  mlflow-data:

Four services, each from earlier chapters, now wired into one system. Read it top to bottom and you can see the whole architecture at a glance — which is itself a benefit.

One command to rule them all

cd code
docker compose up --build

Expected output:

[+] Running 5/5
 ✔ Network code_default      Created
 ✔ Container code-redis-1    Started
 ✔ Container code-mlflow-1   Started
 ✔ Container code-api-1      Started
 ✔ Container code-worker-1   Started
api-1     | INFO:     Uvicorn running on http://0.0.0.0:8000
worker-1  | celery@... ready.
mlflow-1  | Listening at: http://0.0.0.0:5000

The entire stack is live:

http://localhost:8000/docs — the prediction API
http://localhost:5000 — the MLflow UI
the worker consuming background jobs from Redis
Redis brokering and caching

Stop it all with one command:

docker compose down

How the services find each other

Notice REDIS_URL=redis://redis:6379/0 — the API reaches Redis by the service name redis, not an IP. Compose creates a private network where each service is reachable by its name. This is the key idea: services address each other by name, not address, so the same compose file works on any machine without editing IPs.
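
From inside the API container, this means the code only ever sees a URL whose host part is the Compose service name. A small sketch, assuming a hypothetical redis_target helper and a localhost fallback that matches the default shown in the settings chapter:

```python
import os
from urllib.parse import urlparse

def redis_target(env=os.environ):
    """Split REDIS_URL into the host, port, and database a client would connect to."""
    url = urlparse(env.get("REDIS_URL", "redis://localhost:6379/0"))
    return url.hostname, url.port, url.path.lstrip("/")

# Under Compose the variable comes from docker-compose.yml; on a laptop it falls back to localhost.
print(redis_target({"REDIS_URL": "redis://redis:6379/0"}))   # host is the service name
print(redis_target({}))
```

Nothing in the application changes between the two cases; only the environment does.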

Don't be confused: depends_on waits for start, not ready. depends_on: [redis] makes Compose start Redis before the API container, but it doesn't wait for Redis to be accepting connections. A service that crashes because its dependency isn't ready yet needs a real healthcheck (like the one in our Dockerfile), which the long form of depends_on can wait on with condition: service_healthy, plus retry-on-connect logic. "Started" ≠ "ready" is a classic compose gotcha.
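
Retry-on-connect can be as simple as the sketch below. It is an illustration under stated assumptions, not code from the book's repository: wait_for_port is a hypothetical helper, the attempt count and delays are arbitrary, and an open TCP port is only a rough proxy for "ready".

```python
import socket
import time

def wait_for_port(host, port, attempts=10, delay=0.5,
                  connect=socket.create_connection, sleep=time.sleep):
    """Try to open a TCP connection until it succeeds or the attempts run out."""
    for attempt in range(1, attempts + 1):
        try:
            conn = connect((host, port), timeout=1.0)
            conn.close()
            return True                      # the dependency is accepting connections
        except OSError:
            if attempt < attempts:
                sleep(delay * attempt)       # simple linear backoff between tries
    return False

# Typical use at service startup (not executed here, since it needs a running Redis):
#     if not wait_for_port("redis", 6379):
#         raise SystemExit("redis never became ready")
```

The connect and sleep parameters exist so the loop can be tested with fakes instead of real sockets and real waiting.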

docker-compose vs. Kubernetes

Don't be confused: compose vs. Kubernetes. docker-compose runs multiple containers on one machine — perfect for local development, CI, and small deployments. Kubernetes (K8s) runs containers across a cluster of machines with auto-scaling, self-healing, rolling updates, and load balancing — the standard for production at scale. The good news: a compose file maps conceptually onto K8s manifests, so what you learn here transfers. Start with compose; graduate to K8s when one machine isn't enough.

The complete lifecycle, assembled

Step back and look at what you've built across the book — the entire production loop from Chapter 0, now real:

  data ─► train ─► track (MLflow) ─► register ─► serve (FastAPI) ─► package (Docker)
       ─► scale (Celery+Redis) ─► version data (DVC) ─► orchestrate (Prefect)
       ─► optimize (ONNX) ─► demo (Streamlit) ─► monitor ─┐
       ▲                                                   │
       └──────────────── retrain on drift ◄────────────────┘

  GenAI stack:   vector DB ─► RAG service ─► LLM serving ─► LLM observability
  Engineering:   testing & CI/CD  ·  config, secrets & security
  … and run it all with one command (docker-compose)

Every box is a tool you can now use. Swap our tiny model for a real one and nothing about the tooling changes — that was the whole point of keeping the model trivial.

A production-readiness checklist

Before any model goes live, walk this list (each item maps to a chapter):

Experiments tracked and reproducible (MLflow, DVC)
Model versioned in a registry with a @production alias
Served behind a validated API with a /health check
Containerized; image in a registry; runs as non-root
Heavy work offloaded to a queue; hot paths cached
Retraining orchestrated and gated on quality
Inference optimized (ONNX/quantization) if latency matters
Monitoring for drift and operational metrics, with alerts
Tests passing in CI; deploys gated on green (Chapter 18)
Secrets out of code; API authenticated & rate-limited (Chapter 19)
For LLM features: cost/latency tracked, eval gate, grounded RAG (Chapter 17)
A rollback plan (move the alias back)

The takeaway

docker-compose declares your whole multi-service system — API, worker, Redis, MLflow — in one file and launches it with docker compose up; services find each other by name on a private network, and you graduate to Kubernetes when one machine isn't enough. You've now assembled the complete production loop: track, register, serve, package, scale, version, orchestrate, optimize, demo, monitor, and retrain. That's MLOps — and you can do it. Go ship something. 👉

References

The official docs are the best source for every tool here — they're well-written and current. This page points you to them, plus the books that go deeper.

Tool documentation

MLflow — mlflow.org/docs. Tracking, the Model Registry, pyfunc, autolog (Ch 2–Ch 3).
FastAPI — fastapi.tiangolo.com. The tutorial is outstanding (Ch 4).
Pydantic — docs.pydantic.dev. Validation and settings.
Uvicorn — uvicorn.org. The ASGI server.
Docker — docs.docker.com; Get Started + the Dockerfile best-practices guide (Ch 5).
Celery — docs.celeryq.dev (Ch 6).
Redis — redis.io/docs; the commands reference (Ch 7).
DVC — dvc.org/doc (Ch 8).
Prefect — docs.prefect.io (Ch 9).
ONNX / onnxruntime — onnx.ai and onnxruntime.ai (Ch 10).
Streamlit — docs.streamlit.io (Ch 11).
Gradio — gradio.app/docs (Ch 11).
Evidently — docs.evidentlyai.com; drift and quality reports (Ch 12).
Vector databases — Chroma (trychroma.com), Qdrant (qdrant.tech), pgvector, Pinecone (Ch 14).
Anthropic Claude API — docs.claude.com; the Messages API and Python SDK used in the RAG, serving, and observability chapters (Ch 15–Ch 17).
Ollama / vLLM — ollama.com, docs.vllm.ai; running open-weight models (Ch 16).
Langfuse / Ragas — langfuse.com, docs.ragas.io; LLM tracing & evaluation (Ch 17).
pytest / GitHub Actions — docs.pytest.org, docs.github.com/actions (Ch 18).
Pydantic Settings — docs.pydantic.dev/latest/concepts/pydantic_settings (Ch 19).

Going deeper

Chip Huyen. Designing Machine Learning Systems. O'Reilly, 2022. — The best single book on the production ML lifecycle; complements every chapter here.
Noah Gift et al. Practical MLOps. O'Reilly, 2021. — Hands-on cloud MLOps.
Google Cloud. MLOps: Continuous delivery and automation pipelines in machine learning. — The widely-cited MLOps maturity-levels white paper.
Martin Kleppmann. Designing Data-Intensive Applications. O'Reilly, 2017. — The systems foundations (queues, caching, storage) under all of this.

The alternatives, by category

So you recognize them on a job description:

Category	This book	Common alternatives
Experiment tracking	MLflow	Weights & Biases, Neptune, Comet
Model serving	FastAPI	BentoML, TorchServe, Triton, KServe
Task queue	Celery	RQ, Dramatiq, Arq, AWS SQS
Orchestration	Prefect	Airflow, Dagster, Kubeflow
Data versioning	DVC	lakeFS, Delta Lake, Git LFS
Inference runtime	ONNX	TensorRT, OpenVINO, TorchScript
Demo UI	Streamlit / Gradio	Dash, Panel
Monitoring	Evidently	WhyLabs, Arize, Fiddler
Vector database	Chroma / Qdrant	pgvector, Pinecone, Weaviate, Milvus
LLM API	Claude	GPT (OpenAI), Gemini (Google)
Self-hosted LLM serving	vLLM / Ollama	TGI, Triton, TensorRT-LLM, llama.cpp
LLM observability / eval	Langfuse / Ragas	LangSmith, Helicone, DeepEval
CI/CD	GitHub Actions	GitLab CI, Jenkins, CircleCI
Secrets management	env + Pydantic Settings	Vault, AWS/GCP Secrets Manager
Container orchestration	docker-compose	Kubernetes, Nomad, ECS

Sister books in this series

AI Foundations in Depth — the concepts behind the models you're deploying (and a cloud/MLOps overview chapter that maps the wider landscape).
HNSW and IVF & Product Quantization — the vector-search engines inside vector databases and feature stores.
Recommendation Systems from Scratch — a production capstone using MLflow, FastAPI, and a React frontend end to end.

This book's code

Everything lives in code/ and runs standalone: sentiment/ (the model + MLflow), api/ (FastAPI), tasks/ (Celery), pipeline/ (Prefect), serving/ (ONNX + drift), streamlit_app.py, plus the Dockerfile, docker-compose.yml, and Makefile. Only NumPy is required for the core model; each chapter installs its own tool.

Production ML & AI Tools: A Hands-On Field Guide