Introduction
No prior tooling experience required. If you can train a model in a Python script, you're ready. This book teaches the production tools — the ones every job description lists and every ML team uses — by adding them, one at a time, around a single tiny model. Every chapter shows the exact commands to run and the output you'll see.
The gap this book fills
Most ML learning stops at "the model works in my notebook." But a model in a notebook helps no one. The leap from "it works on my machine" to "it serves a million requests, retrains nightly, and pages me when it drifts" is a different skill set — MLOps tooling — and it's rarely taught coherently. You end up googling "how do I deploy a model" and drowning in fifty disconnected tutorials.
This book is the coherent path. It picks the production-ready, industry-standard tools and teaches them in the order you'd actually adopt them, each solving a real problem the previous step created.
The one idea: one model, many wrappers
We train one tiny sentiment classifier in Chapter 1 — bag-of-words + logistic regression, pure NumPy, no downloads — and then never change it. Instead, every chapter wraps that same model in another production tool:
track it (MLflow) ─► serve it (FastAPI) ─► package it (Docker)
─► scale it (Celery+Redis) ─► version its data (DVC)
─► orchestrate retraining (Prefect) ─► optimize it (ONNX)
─► demo it (Streamlit) ─► watch it (monitoring)
─► run the whole stack (docker-compose)
Keeping the model trivial means all your attention goes to the tool — which is the thing you're actually here to learn.
The tools, and why each was chosen
| Tool | The problem it solves | Chapter |
|---|---|---|
| MLflow | "which experiment was best, and where's that model?" | 2–3 |
| FastAPI | "how do other systems call my model?" | 4 |
| Docker | "it works on my machine but nowhere else" | 5 |
| Celery + Redis | "this job is too slow to run in a request" | 6–7 |
| DVC | "which data trained this model?" | 8 |
| Prefect | "retrain every night, automatically" | 9 |
| ONNX | "make inference fast and framework-free" | 10 |
| Streamlit | "let a non-engineer try the model" | 11 |
| Monitoring | "is the model still working in production?" | 12 |
| Vector DBs + RAG | "answer from my own documents with an LLM" | 14–15 |
| LLM serving + observability | "call/run an LLM, and watch its cost & quality" | 16–17 |
| Testing + CI/CD | "prove it works on every change, automatically" | 18 |
| Config & security | "no leaked keys; lock the API" | 19 |
| docker-compose | "run the whole system with one command" | 20 |
These aren't arbitrary — they're the default, battle-tested choices at real companies. Learn these and you can read almost any ML team's stack.
How to read it
Chapters build on each other, so read in order the first time. Some tools (MLflow,
FastAPI, ONNX, monitoring) are installed and run live in this book — you'll see real
output. Others (Docker, Celery, Redis, DVC, Prefect, Streamlit) need background
services or a daemon, so those chapters are precise follow-along tutorials: the
exact install commands, the code, the commands to run, and the output to expect. All
the code lives in code/ and runs standalone.
What you'll be able to do by the end
Take any model and: track its experiments, register and version it, wrap it in a production API, containerize it, offload heavy work to a queue, cache results, version its data, schedule its retraining, export it for fast portable inference, build a demo UI, and monitor it for drift — then run all of it together with one command. That's the full production lifecycle, and it's exactly what "MLOps" means on a job description.
Ready? Let's meet the lifecycle and the toolbox map. 👉
The production ML lifecycle & the toolbox map
Before any tool, you need the map — the lifecycle every production model lives in, and which tool owns which stage. Once you see the whole loop, each chapter clicks into place as "the tool for this box."
The lifecycle loop
A production model isn't a one-shot script; it's a loop that runs forever:
┌─────────────────────────────────────────────────────────────┐
│ ▼
DATA ──► TRAIN ──► TRACK ──► REGISTER ──► SERVE ──► MONITOR ──► (drift?)
▲ │ │ │ │ │
│ (Prefect) (MLflow) (MLflow Reg) (FastAPI) (drift check)
│ │ (Docker)
└──── retrain when it drifts ◄───────────────────────────────────┘
Read it as a sentence: pull data, train a model, track the experiment, register the winner, serve it behind an API, monitor it in production, and when it drifts, retrain — forever. Software ships once and is done; ML rots, because the world it learned changes. The loop is the whole point, and MLOps is the discipline of automating it.
Two words you'll see everywhere
- MLOps — "DevOps for machine learning": the practices and tools that take a model from notebook to reliable, monitored production and keep it healthy.
- The model artifact — the saved, trained model (a file). Everything downstream —
registry, serving, Docker — moves this artifact around. In our project it's a
model.json; in yours it might be a .pkl, .pt, or .onnx.
Why ML needs more than DevOps
Regular software has one moving part: code. ML has three: code + data + model, and all three drift independently. That's why ML gets its own tools — you must version the data (DVC), track which data made which model (MLflow), and watch the model's quality in production, not just its uptime (monitoring). Keep this "three moving parts" idea in mind; it explains why each tool in this book exists.
The toolbox, mapped to the loop
| Lifecycle stage | Tool in this book | The category (alternatives) |
|---|---|---|
| Orchestrate the loop | Prefect | workflow orchestration (Airflow, Dagster) |
| Version the data | DVC | data versioning (lakeFS, Delta Lake) |
| Track experiments | MLflow | experiment tracking (Weights & Biases) |
| Register & version models | MLflow Registry | model registry (SageMaker, Vertex) |
| Serve predictions | FastAPI | model serving (BentoML, TorchServe) |
| Package to run anywhere | Docker | containerization (Podman) |
| Background / async work | Celery + Redis | task queues (RQ, Dramatiq, Arq) |
| Cache & fast lookups | Redis | in-memory store (Memcached) |
| Fast portable inference | ONNX | inference runtime (TensorRT, OpenVINO) |
| Demo UI | Streamlit | quick UIs (Gradio, Dash) |
| Monitor & detect drift | PSI / Evidently | observability (WhyLabs, Arize) |
| Store & search embeddings | vector store | vector DBs (Qdrant, Chroma, pgvector, Pinecone) |
| Ground an LLM in your docs | RAG | LangChain, LlamaIndex |
| Serve / run an LLM | Claude API / vLLM / Ollama | OpenAI, TGI, Triton |
| Watch LLM cost & quality | token/cost tracking + LLM-judge | Langfuse, LangSmith, Ragas |
| Test & ship safely | pytest + GitHub Actions | CI/CD (GitLab CI, Jenkins) |
| Config & secrets | Pydantic Settings + secrets mgr | Vault, AWS/GCP Secrets Manager |
| Run it all together | docker-compose | local orchestration (Kubernetes at scale) |
You're learning one representative from each category. Swap in the alternative at your job and the concepts transfer directly — a task queue is a task queue.
How the chapters run
- Live (real output): MLflow, FastAPI, ONNX, the drift check, the vector store, RAG retrieval, LLM cost math, the test suite, config loading, and API-key auth are executed in this book, so the output shown is genuine. Your runs should match apart from run IDs and timestamps.
- Follow-along: Docker, Celery, Redis, DVC, Prefect, Streamlit, and the LLM generation calls (which need a daemon, a background service, or an API key) give you the exact install command, the code, the commands to run, and the output to expect. Provide the service/key and they run identically.
Either way, all the code is in code/ and every file runs
on its own.
A note on setup
You don't need to install everything up front. Each chapter names its one
dependency (pip install mlflow, pip install fastapi uvicorn, …), so install as
you go. The only thing the core model needs is NumPy. The full list lives in
code/requirements.txt, and code/Makefile has a shortcut for every step.
With the map in hand, let's build the one model we'll spend the rest of the book productionizing. 👉
The project: one model we'll productionize
Here's the model the whole book wraps in tools. It's deliberately tiny — a sentiment classifier in pure NumPy, no downloads, no GPU — because the star of this book is the tooling, not the model. Whatever you learn here applies unchanged to a giant transformer; the model is just the thing inside the box.
All code is in code/sentiment/ and runs with only NumPy.
What it does
Given a sentence, predict whether it's positive or negative. Under the hood
it's exactly the bag-of-words + logistic regression from the AI Foundations book:
count the words, take a weighted sum, squash with a sigmoid to get P(positive).
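As a worked example of that pipeline, the sketch below scores two sentences by hand. The vocabulary, weights, and bias are made-up illustrative values (not the book's trained parameters), and it uses only the standard library instead of NumPy so it runs anywhere.

```python
import math

# Hypothetical hand-picked parameters, for illustration only (not trained values).
VOCAB = {"love": 0, "great": 1, "hate": 2, "terrible": 3}
WEIGHTS = [1.5, 1.0, -1.5, -1.0]
BIAS = 0.0


def bag_of_words(text, vocab):
    """Count how often each vocabulary word appears in the text."""
    counts = [0] * len(vocab)
    for word in text.lower().split():
        if word in vocab:
            counts[vocab[word]] += 1
    return counts


def sigmoid(z):
    """Squash any real number into the (0, 1) range."""
    return 1.0 / (1.0 + math.exp(-z))


def p_positive(text):
    """Weighted sum of word counts, then sigmoid, giving P(positive)."""
    counts = bag_of_words(text, VOCAB)
    z = BIAS + sum(w * c for w, c in zip(WEIGHTS, counts))
    return sigmoid(z)


print(round(p_positive("i love it it is great"), 3))     # z = +2.5 -> 0.924
print(round(p_positive("i hate it it is terrible"), 3))  # z = -2.5 -> 0.076
```

Words outside the vocabulary contribute nothing, so a sentence made only of unknown words lands on exactly 0.5 with a zero bias.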
The data
We keep a small labeled dataset inline (data.py) so the
project runs anywhere with zero setup — 25 positive and 25 negative sentences that
reuse strong polarity words so a simple model can generalize:
POSITIVE = ["i love this product it is great",
"absolutely love it fantastic and great", ...]
NEGATIVE = ["i hate this product it is terrible",
"absolutely hate it awful and terrible", ...]
In a real project this is the part that comes from a database, a warehouse, or a feature store — and the part you'll version with DVC in Chapter 8.
The model
model.py is a SentimentModel class with the four methods
every production tool in this book needs: fit, predict_proba,
save, load. That save/load pair produces the model artifact — the
file that the registry, the API, and Docker all pass around.
class SentimentModel:
    def fit(self, texts, labels): ...      # train by gradient descent
    def predict_proba(self, texts): ...    # -> P(positive) per text
    def save(self, path): ...              # write the artifact (model.json)
    @classmethod
    def load(cls, path): ...               # read it back
Run it directly to see it learn and predict:
$ python model.py
Output:
train accuracy: 1.0
P(positive)=0.595 <- 'this is wonderful'
P(positive)=0.024 <- 'this is terrible'
P(positive)=0.698 <- 'fast and reliable support'
It scores both positive sentences above 0.5 and the negative one near zero. (Modest confidence on novel phrasings is expected from a 50-example model. That's fine; it's our stand-in for a real model.)
Training & evaluating
train.py does the production-shaped thing: split the data,
train, evaluate on a held-out test set (honest evaluation matters),
and save the artifact. It also has a --mlflow flag we'll use in the next chapter.
$ python train.py
Output:
examples: 50 train=38 test=12
vocab size: 68
train_accuracy: 1.000
test_accuracy: 1.000
saved model -> model.json
100% on the held-out set (the data is clean and separable by design), and it wrote
model.json — the artifact everything downstream consumes.
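The split-then-evaluate step can be sketched as follows. This is not the book's train.py: the 0.25 test fraction and the fixed seed are assumptions, chosen because they turn 50 examples into 38 training and 12 test examples as in the output above, and the placeholder sentences stand in for the real dataset.

```python
import random


def train_test_split(texts, labels, test_fraction=0.25, seed=0):
    """Shuffle indices with a fixed seed, then hold out the first slice as the test set."""
    indices = list(range(len(texts)))
    random.Random(seed).shuffle(indices)
    n_test = int(len(indices) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return (
        [texts[i] for i in train_idx], [labels[i] for i in train_idx],
        [texts[i] for i in test_idx], [labels[i] for i in test_idx],
    )


def accuracy(predictions, labels):
    """Fraction of predictions that equal the true label."""
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)


# Placeholder data with the same size as the book's dataset.
texts = [f"sentence {i}" for i in range(50)]
labels = [i % 2 for i in range(50)]
train_x, train_y, test_x, test_y = train_test_split(texts, labels)
print(f"examples: {len(texts)}  train={len(train_x)}  test={len(test_x)}")
```

Fixing the seed keeps the split identical between runs, so a change in test accuracy reflects a change in the model rather than a different held-out set.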
The artifact: the thing tools move around
That model.json is the heart of every chapter to come. Internally it's just the
learned vocabulary and weights as JSON:
{ "vocab": {"love": 0, "great": 1, ...}, "weights": [1.83, 2.41, ..., -0.12], ... }
Everything from here on is about that file's journey: tracked (MLflow), versioned (registry), wrapped in an API (FastAPI), baked into an image (Docker), produced by a scheduled pipeline (Prefect), converted for speed (ONNX), and watched for rot (monitoring). The model never changes — its operational maturity does.
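To make the artifact's round trip concrete, here is a minimal sketch of a class with the same four methods. It is a hypothetical stand-in, not the book's model.py: the class name, the "bias" key, the hyperparameter defaults, and the four training sentences are all assumptions, and it uses the standard library instead of NumPy.

```python
import json
import math
import os
import tempfile


class TinySentimentModel:
    """Hypothetical stand-in exposing fit / predict_proba / save / load."""

    def __init__(self):
        self.vocab = {}     # word -> column index
        self.weights = []   # one weight per vocabulary word
        self.bias = 0.0

    def _counts(self, text):
        counts = [0.0] * len(self.vocab)
        for word in text.lower().split():
            if word in self.vocab:
                counts[self.vocab[word]] += 1.0
        return counts

    def _score(self, counts):
        z = self.bias + sum(w * c for w, c in zip(self.weights, counts))
        return 1.0 / (1.0 + math.exp(-z))

    def predict_proba(self, texts):
        return [self._score(self._counts(t)) for t in texts]

    def fit(self, texts, labels, lr=0.5, epochs=200):
        words = sorted({w for t in texts for w in t.lower().split()})
        self.vocab = {w: i for i, w in enumerate(words)}
        self.weights = [0.0] * len(self.vocab)
        self.bias = 0.0
        rows = [self._counts(t) for t in texts]
        for _ in range(epochs):          # plain batch gradient descent
            grad_w = [0.0] * len(self.weights)
            grad_b = 0.0
            for row, y in zip(rows, labels):
                error = self._score(row) - y
                grad_b += error
                for j, c in enumerate(row):
                    grad_w[j] += error * c
            self.bias -= lr * grad_b / len(rows)
            self.weights = [w - lr * g / len(rows)
                            for w, g in zip(self.weights, grad_w)]
        return self

    def save(self, path):
        payload = {"vocab": self.vocab, "weights": self.weights, "bias": self.bias}
        with open(path, "w", encoding="utf-8") as f:
            json.dump(payload, f)

    @classmethod
    def load(cls, path):
        with open(path, encoding="utf-8") as f:
            data = json.load(f)
        model = cls()
        model.vocab, model.weights, model.bias = data["vocab"], data["weights"], data["bias"]
        return model


# Train, write the artifact to a throwaway directory, and read it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.json")
    model = TinySentimentModel().fit(
        ["i love it", "great product", "i hate it", "terrible product"], [1, 1, 0, 0])
    model.save(path)
    restored = TinySentimentModel.load(path)
```

The restored object needs nothing but the JSON file, which is why the file alone can be logged, registered, and copied into an image.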
Why "keep the model trivial" is the right call
When a serving tutorial uses a 2 GB model, half of it is fighting downloads, CUDA,
and memory, and the actual tool gets two paragraphs. By making the model a 5 KB
JSON file, every chapter spends 100% of its energy on the tool you came to learn —
and because the interface (fit/predict_proba/save/load) is identical to a real
model's, nothing you learn is toy-specific.
The takeaway
We have a small, honest, fully-working model and its artifact (model.json),
exposing the universal fit/predict_proba/save/load interface. That's
everything the rest of the book builds on. First stop on the artifact's journey:
making experiments reproducible and comparable with MLflow. 👉
MLflow: experiment tracking
The moment you train a model more than once — a different learning rate, more
epochs, new features — you face the question that haunts every ML project: which run
was best, and what settings produced it? Without tooling the answer lives in
scrollback, sticky notes, and filenames like model_final_v3_REALLY_final.json.
MLflow replaces that chaos with a record of every run's parameters, metrics, and
artifacts.
Install:
pip install mlflow (that's all this chapter needs).
What MLflow Tracking gives you
- Parameters — the inputs you chose (learning rate, epochs, vocab size).
- Metrics — the numbers you got (train/test accuracy, loss).
- Artifacts — the files produced (the model, plots, configs).
- A UI — a web dashboard to sort, filter, and compare runs side by side.
Together these make experiments reproducible ("run 17 used lr=0.5, 500 epochs") and comparable ("lr=0.5 beat lr=0.1 by 4 points").
The five lines that matter
You wrap a training run in start_run() and log what you care about. From
train.py:
import mlflow
mlflow.set_experiment("sentiment")             # a named bucket of runs
with mlflow.start_run() as run:                # one run = one training execution
    mlflow.log_params({"lr": lr, "epochs": epochs, "vocab_size": len(model.vocab)})
    mlflow.log_metrics({"train_accuracy": train_acc, "test_accuracy": test_acc})
    mlflow.log_artifact("model.json")          # save the model file with the run
That's the whole API surface for tracking: log_params, log_metrics,
log_artifact. Everything else is convenience on top.
Running it
Our train.py logs only when you pass --mlflow, so the core stays dependency-free:
cd code/sentiment
python train.py --mlflow
Output:
2026/06/28 04:09:24 INFO mlflow.tracking.fluent: Experiment with name 'sentiment' does not exist. Creating a new experiment.
examples: 50 train=38 test=12
vocab size: 68
train_accuracy: 1.000
test_accuracy: 1.000
saved model -> model.json
logged run 17dfd654a50e4a6d8a6f0271318d572e to MLflow
MLflow created a sentiment experiment and recorded a run with a unique ID. Every
time you run it — with different settings — you get another logged run you can
compare. Run it a few times with different flags:
python train.py --mlflow --lr 0.1 --epochs 200
python train.py --mlflow --lr 1.0 --epochs 1000
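One way to keep the core dependency-free while still offering these flags is to import MLflow lazily, only when --mlflow is passed. The sketch below is an assumption about how such a script could be wired, not the actual train.py; the default values for --lr and --epochs and the helper names are hypothetical.

```python
import argparse


def parse_args(argv=None):
    """Parse the three flags used in this chapter (defaults are assumed values)."""
    parser = argparse.ArgumentParser(description="Train the sentiment model.")
    parser.add_argument("--lr", type=float, default=0.5, help="learning rate")
    parser.add_argument("--epochs", type=int, default=500, help="training epochs")
    parser.add_argument("--mlflow", action="store_true",
                        help="also log the run to MLflow")
    return parser.parse_args(argv)


def log_run(args, metrics):
    """Log to MLflow only when asked, so plain runs never import it."""
    if not args.mlflow:
        return None
    import mlflow  # lazy import: only required with --mlflow

    mlflow.set_experiment("sentiment")
    with mlflow.start_run() as run:
        mlflow.log_params({"lr": args.lr, "epochs": args.epochs})
        mlflow.log_metrics(metrics)
    return run.info.run_id


args = parse_args(["--lr", "0.1", "--epochs", "200"])
print(args.lr, args.epochs, args.mlflow)  # 0.1 200 False
```

Because the import sits inside the function, a machine without MLflow installed can still train and save the model.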
Viewing & comparing runs in the UI
MLflow wrote everything to a local mlruns/ folder. Launch the dashboard:
mlflow ui --backend-store-uri ./mlruns --port 5000 # open http://localhost:5000
You'll see a sortable table — one row per run, columns for every param and metric.
Click two runs and Compare to see them side by side, sort by test_accuracy to
find the winner, and download any run's logged model.json. This table is the
"which run was best?" answer, forever.
Autolog: tracking for free
For popular frameworks, MLflow can capture everything automatically — one line
instead of manual log_* calls:
mlflow.autolog() # then just train as usual
# for scikit-learn / PyTorch Lightning / XGBoost it auto-logs params, metrics, and the model
autolog() is the fastest way to get tracking into an existing scikit-learn,
PyTorch Lightning, or XGBoost script: it patches the framework's training call
(such as fit()) and records params, metrics, and the model with no extra logging
code. A hand-written PyTorch training loop is not covered and still needs manual
log_* calls.
Where the data lives (and how it scales)
- Local (this chapter): mlruns/ is a folder on disk, perfect for solo work.
- Team: run an MLflow tracking server backed by a database (Postgres) and object storage (S3), and point everyone at it with mlflow.set_tracking_uri("http://mlflow.mycompany.com"). Now the whole team's experiments are in one searchable place.
Don't be confused: tracking vs. the registry. Tracking (this chapter) records experiments — every run you ever did, good or bad. The registry (next chapter) manages the chosen models — versioned, named, promoted to production. Tracking is your lab notebook; the registry is your shipping shelf. They're complementary, and MLflow does both.
Why this matters in practice
- Reproducibility — months later you can see exactly what produced a model.
- Comparison — no more guessing whether the new idea actually helped.
- Collaboration — teammates see your runs instead of asking "what did you try?"
- Auditability — regulated industries require a record of how each model was built. MLflow is that record.
This is the difference between "I think the new model is better" and "run #17 improved test accuracy from 0.92 to 0.96; here's the proof and the exact config."
The takeaway
MLflow Tracking records every run's params, metrics, and artifacts with three calls
(log_params, log_metrics, log_artifact), gives you a UI to compare them, and
scales from a local folder to a shared team server. It turns "which experiment was
best?" from archaeology into a database query. Now that we can find our best model,
let's manage and ship it with the MLflow Model Registry. 👉
MLflow: the model registry
Tracking tells you which run was best. But which model is in production right now? Which version do you roll back to when the new one misbehaves? How does your serving code fetch "the current model" without hard-coding a file path? That's the job of the Model Registry — a versioned, named shelf for the models you actually ship.
Install:
pip install mlflow pandas (the registry's pyfunc models use pandas). Needs a database backend; we use a local SQLite file.
The problem it solves
Without a registry, "deploy the model" means copying a file and hoping everyone uses
the right one. The registry replaces that with named, versioned models and
aliases like @production and @staging, so your serving code asks for
"models:/sentiment@production" and always gets the right version — and you can
promote or roll back by moving an alias, not redeploying code.
Step 1 — wrap the model as a pyfunc
The registry works best with a pyfunc — MLflow's universal model interface.
Wrapping our custom model in it means any MLflow-aware system can load and run it
the same way, regardless of framework. From registry.py:
import mlflow.pyfunc

class SentimentPyfunc(mlflow.pyfunc.PythonModel):
    def load_context(self, context):
        self.model = SentimentModel.load(context.artifacts["model_path"])

    def predict(self, context, model_input, params=None):
        return self.model.predict_proba(model_input["text"].tolist())
load_context restores the model from its artifact; predict defines the universal
inference call. That's the adapter that makes our NumPy model speak fluent MLflow.
Step 2 — log and register in one call
mlflow.set_tracking_uri("sqlite:///mlflow.db")   # DB backend enables the registry
with mlflow.start_run() as run:
    mlflow.pyfunc.log_model(
        name="model",
        python_model=SentimentPyfunc(),
        artifacts={"model_path": "model.json"},
        registered_model_name="sentiment",       # <-- registers it
    )
The registered_model_name argument is the magic word: it logs the model and
creates (or adds a new version to) a registered model called sentiment.
Step 3 — promote with an alias
A registered model accumulates versions (v1, v2, v3…). You point an alias at the one that's live:
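client = mlflow.MlflowClient()
# Pick the newest version by number (get_latest_versions belongs to the deprecated stages API).
versions = client.search_model_versions("name='sentiment'")
version = max(versions, key=lambda v: int(v.version)).version
client.set_registered_model_alias("sentiment", "production", version)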
Don't be confused: aliases vs. the old "stages." Older MLflow used fixed stages (Staging, Production, Archived). Modern MLflow (2.9+) replaced these with flexible aliases: you can have @production, @champion, @canary, whatever you need. If you see transition_model_version_stage in old tutorials, the modern equivalent is set_registered_model_alias.
Step 4 — load by alias (what serving does)
Your API never hard-codes a path or version. It asks for the alias:
import pandas as pd
model = mlflow.pyfunc.load_model("models:/sentiment@production")
out = model.predict(pd.DataFrame({"text": ["this is great", "this is terrible"]}))
Running the whole thing
cd code/sentiment
python registry.py
Output:
logged + registered 'sentiment' from run 6fc7412d
set alias 'production' -> sentiment v1
loaded model predictions: [0.977, 0.024]
It registered version 1, pointed @production at it, then loaded by that alias and
predicted — 0.977 for the positive sentence, 0.024 for the negative. Run it again
and you'll get v2, with @production moved to the new version — that's a
deploy. Point the alias back to v1 and you've rolled back, instantly, without
touching serving code.
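The deploy-and-rollback mechanics are easier to see in a toy model. The sketch below is an in-memory stand-in written for illustration, not MLflow's implementation or API; the class and method names are hypothetical.

```python
class ToyRegistry:
    """In-memory stand-in for a model registry: numbered versions plus movable aliases."""

    def __init__(self):
        self.versions = {}   # version number -> artifact (any object)
        self.aliases = {}    # alias name -> version number

    def register(self, artifact):
        """Store a new artifact and return its version number (1, 2, 3, ...)."""
        version = len(self.versions) + 1
        self.versions[version] = artifact
        return version

    def set_alias(self, alias, version):
        """Point an alias at an existing version."""
        if version not in self.versions:
            raise KeyError(f"unknown version {version}")
        self.aliases[alias] = version

    def load(self, alias):
        """What serving code does: resolve the alias, never a version number."""
        return self.versions[self.aliases[alias]]


registry = ToyRegistry()
v1 = registry.register("weights-from-run-1")
registry.set_alias("production", v1)
v2 = registry.register("weights-from-run-2")
registry.set_alias("production", v2)   # deploy: the alias moves to v2
registry.set_alias("production", v1)   # rollback: the alias moves back to v1
print(registry.load("production"))     # weights-from-run-1
```

Nothing in load() changes between the deploy and the rollback, which is the property the alias is there to provide.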
The registry in the MLflow UI
mlflow ui --backend-store-uri sqlite:///mlflow.db --port 5000
The Models tab now shows sentiment with all its versions, their aliases, which
run produced each, and its metrics. This is the team's source of truth for "what's in
production and how did it get there."
How this looks at a real company
- A CI pipeline trains a candidate, logs it, and registers a new version.
- An automated gate (or a human) checks it beats the current @production on a holdout, then moves the alias, often via a canary: @production for 95% of traffic, @canary (the new version) for 5%, watched before full promotion.
- Serving code loads models:/sentiment@production and is blissfully unaware of version numbers.
- An incident? Move the alias back to the previous version. Rollback in seconds.
Managed platforms (SageMaker Model Registry, Vertex Model Registry) offer the same concept; the idea — named, versioned models with promotable aliases — transfers everywhere.
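A 95/5 canary split needs some rule for deciding which requests go where. One common approach, sketched here as an assumption and not as something MLflow provides, is to hash a request or user ID into a bucket so the same caller always reaches the same alias. The function names and the ID format are hypothetical.

```python
import hashlib


def pick_alias(request_id, canary_percent=5):
    """Deterministically send roughly canary_percent% of IDs to the canary alias."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100   # stable bucket in 0..99
    return "canary" if bucket < canary_percent else "production"


def model_uri(request_id, canary_percent=5):
    """Build the registry URI that serving code would load for this request."""
    return f"models:/sentiment@{pick_alias(request_id, canary_percent)}"


ids = [f"req-{i}" for i in range(10000)]
share = sum(pick_alias(r) == "canary" for r in ids) / len(ids)
print(f"canary share: {share:.1%}")
```

Hashing instead of calling a random number generator keeps routing reproducible, which helps when comparing the two versions' metrics afterwards.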
The takeaway
The Model Registry turns a pile of model files into named, versioned, promotable
models. Wrap the model as a pyfunc, register a version, point a @production alias
at it, and load by alias in serving — so deploys and rollbacks are just moving an
alias, never editing code. Tracking found the best model; the registry ships it.
Now let's actually expose it to the world with an API. 👉
FastAPI: a real prediction service
A registered model still can't be used by anything but Python code that imports it. To let a website, a mobile app, or another microservice get predictions, you wrap the model in a web API — an HTTP endpoint other systems call. FastAPI is the modern Python standard for this: fast, typed, and it generates interactive docs for free.
Install:
pip install fastapi "uvicorn[standard]" (FastAPI is the framework; uvicorn is the server that runs it).
Why an API, and why FastAPI
An API turns your model into a service: any system, in any language, can POST
some text and get a prediction back as JSON. FastAPI wins because it gives you, out of
the box:
- Automatic validation: declare the input shape with Pydantic; bad requests are rejected with clear errors before your code runs.
- Interactive docs: a Swagger UI at /docs where anyone can try the API in a browser.
- Speed & async: among the fastest Python frameworks; handles concurrency well.
- Type safety: your editor and the framework both understand your data shapes.
The service
The full app is api/main.py. The core is three pieces:
schemas (what goes in and out), the model loaded once at startup, and the
endpoints.
Schemas — validation for free
from pydantic import BaseModel, Field

class PredictRequest(BaseModel):
    text: str = Field(..., min_length=1, examples=["this is wonderful"])

class Prediction(BaseModel):
    label: str
    score: float
By declaring text must be a non-empty string, FastAPI rejects bad input
automatically — you never write validation code, and the rejection is a clean,
typed error.
Load the model once, serve many times
app = FastAPI(title="Sentiment API")
model = _load_or_train() # loaded ONCE at startup, reused for every request
Don't be confused: load the model at startup, not per request. A common beginner bug is loading the model file inside the endpoint, so every request re-reads it from disk and the API becomes painfully slow. Load it once at module level (or in a lifespan handler, which replaces the older startup event) and reuse it. Models are read-only at inference time; one copy serves all requests.
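The load-once rule can be demonstrated without any web framework. The sketch below is a variation on the module-level approach: it caches the loaded artifact with functools.lru_cache so the file is read on first use and never again. The counter, the function names, and the tiny artifact are hypothetical and exist only for the demonstration.

```python
import functools
import json
import os
import tempfile

load_calls = 0  # demo-only counter showing how often the file is actually read


@functools.lru_cache(maxsize=None)
def get_model(path):
    """Read the artifact from disk on first use; later calls reuse the cached dict."""
    global load_calls
    load_calls += 1
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def handle_request(path, text):
    """Stand-in for an endpoint: fetches the cached model instead of re-reading the file."""
    model = get_model(path)
    return sum(model["weights"][model["vocab"][w]]
               for w in text.split() if w in model["vocab"])


# Serve 1000 "requests" against a throwaway artifact.
with tempfile.TemporaryDirectory() as tmp:
    artifact = os.path.join(tmp, "model.json")
    with open(artifact, "w", encoding="utf-8") as f:
        json.dump({"vocab": {"great": 0}, "weights": [2.0]}, f)
    for _ in range(1000):
        handle_request(artifact, "this is great")

print(load_calls)  # 1
```

Moving the open() call into handle_request would push the counter to 1000, which is the per-request bug described above.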
The endpoints
@app.get("/health")
def health():
    return {"status": "ok", "vocab_size": len(model.vocab)}

@app.post("/predict", response_model=Prediction)
def predict(req: PredictRequest):
    score = float(model.predict_proba([req.text])[0])
    return Prediction(label="positive" if score >= 0.5 else "negative",
                      score=round(score, 4))
/health is a liveness probe — Docker and Kubernetes ping it to know the service
is up (Chapter 5). /predict is the real work. There's also a
/predict/batch for scoring many texts in one call.
Running it
cd code
uvicorn api.main:app --reload # --reload = auto-restart on code changes
Then open http://localhost:8000/docs for the interactive Swagger UI, or call it from the command line:
curl -X POST http://localhost:8000/predict \
-H "Content-Type: application/json" \
-d '{"text": "this is wonderful and great"}'
What it returns
Here's the service answering, verified end to end (via FastAPI's test client):
GET /health -> {"status": "ok", "vocab_size": 75}
POST /predict {"text":"... great"} -> {"label": "positive", "score": 0.9841}
POST /predict {"text":"terrible ..."} -> {"label": "negative", "score": 0.0112}
POST /predict/batch {"texts":[...]} -> [{"label":"positive","score":0.8847},
{"label":"negative","score":0.1188}]
POST /predict {"text":""} -> 422 Unprocessable Entity (validation!)
Look at that last line: an empty string is rejected with HTTP 422 before your code
ever runs, because the Pydantic schema said min_length=1. That's the validation you
got for free — and it's exactly what keeps a production API from crashing on
malformed input.
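Any HTTP client can make the same call as the curl command. As one possible sketch, here is a client built on Python's standard library; the function names, the default base URL, and the timeout are assumptions, and predict() only works while the service is running.

```python
import json
import urllib.request


def build_predict_request(text, base_url="http://localhost:8000"):
    """Build the same POST request that the curl example sends."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(text, base_url="http://localhost:8000", timeout=5.0):
    """Send the request and decode the JSON reply (requires the running service)."""
    request = build_predict_request(text, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_predict_request("this is wonderful and great")
print(request.get_method(), request.full_url)
```

With this client a rejected request, such as the empty string, surfaces as a urllib.error.HTTPError carrying status code 422 instead of a normal return value.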
Production touches (so you sound senior)
The basics above are real, but a production service adds:
- Multiple workers: uvicorn api.main:app --workers 4 (or Gunicorn) to use all CPU cores.
- A timeout & request size limit: never let one request hang or flood you.
- Structured logging & a request ID: so you can trace a prediction later.
- Load the model from the registry: mlflow.pyfunc.load_model("models:/sentiment@production") (Chapter 3) instead of a local file, so deploys are alias moves.
- Async for I/O-bound work: async def endpoints when you call databases or other services.
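Of these touches, structured logging with a request ID is simple to sketch with the standard library alone. The formatter, logger name, and log fields below are illustrative assumptions, not part of the book's api/main.py.

```python
import json
import logging
import uuid


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
            "request_id": getattr(record, "request_id", None),
        }
        return json.dumps(payload)


def make_logger(name="sentiment-api"):
    """Create (or reuse) a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(name)
    if not logger.handlers:              # avoid duplicate handlers on re-import
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger


def log_prediction(logger, text, score):
    """Tag the log line with a fresh request ID and return that ID to the caller."""
    request_id = uuid.uuid4().hex
    logger.info("prediction score=%.4f chars=%d", score, len(text),
                extra={"request_id": request_id})
    return request_id


request_id = log_prediction(make_logger(), "this is wonderful", 0.9841)
```

Returning the ID lets the API include it in the response, so a user report can be matched to the exact log line.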
Don't be confused: uvicorn vs. FastAPI vs. Gunicorn. FastAPI is the framework you write code in. uvicorn is the ASGI server that actually runs it and speaks HTTP. Gunicorn is a process manager that runs several uvicorn workers for production. You write FastAPI; uvicorn serves it; Gunicorn scales it.
Connecting the dots
Notice the service can _load_or_train() from the model.json
(Chapter 1) — but in production you'd load from the registry
(Chapter 3), so promoting a new model needs no code change.
This API is also what the Streamlit UI (Chapter 11)
and any other client will call.
The takeaway
FastAPI wraps your model in a typed HTTP service: Pydantic schemas validate input for
free (the empty string got a 422), the model loads once at startup, and /health +
/predict expose it to any system in any language — with interactive docs at /docs.
This is how a model stops being a script and becomes a service. But a service that
runs only on your laptop isn't deployed — next, we make it run anywhere with Docker. 👉
Docker: package it to run anywhere
Your FastAPI service runs on your laptop, with your Python version and your installed packages. Ship it to a server and it breaks — wrong Python, missing library, different OS. Docker ends "works on my machine" by packaging your code and its entire environment into a portable image that runs identically everywhere: your laptop, a teammate's, a server, the cloud.
Setup: install Docker Desktop (Mac/Windows) or Docker Engine (Linux). This chapter is follow-along — Docker needs a daemon, so commands and expected output are shown for you to run locally.
The three words you must know
- Dockerfile: a recipe, the steps to build your environment (base OS, install deps, copy code, run command).
- Image: the built, frozen result of that recipe. Immutable, shareable, tagged (sentiment-api:1.0).
- Container: a running instance of an image. You can run many containers from one image.
Analogy: the Dockerfile is source code, the image is a compiled program, the container is a running process.
The Dockerfile, line by line
Here's code/Dockerfile, which packages our API. Every line is
a real best practice:
# 1. small base image (fewer CVEs, less weight)
FROM python:3.11-slim
WORKDIR /app
# 2. deps FIRST, so Docker caches this layer
COPY requirements.txt .
RUN pip install --no-cache-dir fastapi "uvicorn[standard]" pydantic numpy
# 3. then the code (changes often)
COPY sentiment/ ./sentiment/
COPY api/ ./api/
# 4. don't run as root
RUN useradd --create-home appuser
USER appuser
EXPOSE 8000
HEALTHCHECK --interval=30s --timeout=3s --retries=3 \
  CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')"
CMD ["uvicorn", "api.main:app", "--host", "0.0.0.0", "--port", "8000"]
The four ideas worth internalizing:
- Slim base image: python:3.11-slim is ~5× smaller than the full image, which means faster pulls and a smaller attack surface.
- Copy requirements.txt before the code. Docker builds in cached layers; if you copy code first, every code change re-installs all dependencies. Deps-first means pip install is cached and rebuilds are seconds, not minutes.
- Run as a non-root user. If the container is compromised, the attacker isn't root. Basic, essential hygiene.
- HEALTHCHECK: hits our /health endpoint (Chapter 4) so orchestrators know when the container is actually ready, not just started.
Don't be confused: EXPOSE vs. -p. EXPOSE 8000 only documents that the app uses port 8000; it doesn't open anything. You actually publish the port at run time with -p 8000:8000 (host port : container port). Forgetting -p is the #1 "why can't I reach my container?" gotcha.
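The HEALTHCHECK one-liner in the Dockerfile can also be written as a small function, which makes the timeout and failure handling explicit. This is a sketch under assumptions: the function names are hypothetical, and only the /health URL comes from the Dockerfile above.

```python
import urllib.request


def is_healthy(url="http://localhost:8000/health", timeout=3.0):
    """Return True only if the endpoint answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:   # covers URLError, HTTPError, timeouts, refused connections
        return False


def exit_code(url="http://localhost:8000/health"):
    """Map health to a process exit code: 0 = healthy, 1 = unhealthy."""
    return 0 if is_healthy(url) else 1
```

Docker treats a non-zero exit status from the health command as a failed probe, so a script ending in sys.exit(exit_code()) would behave like the inline version.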
Build and run
cd code
docker build -t sentiment-api . # build the image from the Dockerfile
docker run -p 8000:8000 sentiment-api # run a container, publish the port
Expected output:
[+] Building 12.3s (12/12) FINISHED
=> naming to docker.io/library/sentiment-api 0.0s
...
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
INFO: Application startup complete.
Now curl http://localhost:8000/health works exactly as in Chapter 4 — but it's
running inside an isolated container, with its own Python and dependencies, that
will behave identically on any machine with Docker.
The commands you'll use daily
docker images # list built images
docker ps # list running containers
docker logs <container> # see a container's output
docker exec -it <container> bash # open a shell inside a running container
docker stop <container> # stop it
docker build -t name:tag . # build with a version tag
docker push registry/name:tag # push to a registry (Docker Hub, ECR, GCR)
How images travel to production
You don't copy code to servers anymore — you build an image, push it to a container registry (Docker Hub, AWS ECR, Google Artifact Registry), and servers pull and run it:
build image ─► docker push ─► registry ─► docker pull ─► run on server / K8s
This is the foundation of modern deployment. Kubernetes (the full-stack chapter and the foundations book) orchestrates thousands of these containers; cloud "serverless container" services (AWS Fargate, Cloud Run) run them without you managing servers at all.
Don't be confused: image vs. container (again, because it matters). You build an image once and run many containers from it. Stopping a container doesn't delete the image. A container is ephemeral: anything written inside it (like a model.json created at runtime) is lost when the container is removed or replaced, unless you mount a volume. Bake the model into the image, or load it from the registry/object storage at startup; never rely on files written inside a running container surviving.
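One way to support both options, a model baked into the image and a model supplied from a mounted volume, is to let an environment variable override the default path. The variable name MODEL_PATH and the default location below are assumptions for illustration, not settings from the book's code.

```python
import os
from pathlib import Path

# Hypothetical location of the model baked into the image at build time.
DEFAULT_MODEL_PATH = Path("/app/sentiment/model.json")


def resolve_model_path(env=None):
    """Prefer MODEL_PATH (for example a mounted volume); otherwise use the baked-in file."""
    env = os.environ if env is None else env
    override = env.get("MODEL_PATH")
    return Path(override) if override else DEFAULT_MODEL_PATH


print(resolve_model_path({}))
print(resolve_model_path({"MODEL_PATH": "/models/model.json"}))
```

At run time this would pair with something like docker run -e MODEL_PATH=/models/model.json -v "$PWD/models:/models" sentiment-api, so the file lives outside the container's writable layer.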
A note on size & GPUs
- Keep images small: slim bases, .dockerignore, multi-stage builds (build in a fat image, copy only the result into a slim one).
- For GPU inference, use NVIDIA's CUDA base images and the NVIDIA container runtime: the same Dockerfile idea, heavier base.
The takeaway
Docker packages your code and its environment into a portable image — killing "works
on my machine." Write a Dockerfile (slim base, deps before code for caching, non-root,
healthcheck), build an image, run containers from it, and push to a registry so
any server can pull and run it identically. Containers are ephemeral; don't rely on
files written inside them. Our service is now portable — next, let's handle work that's
too slow to do inside a request, with a task queue. 👉
Celery + Redis: an async task queue
Some work is too slow to do inside a web request. If a user asks to score 10,000 documents, or you need to retrain the model, you can't make them wait 30 seconds for the HTTP response — the request will time out and your server will be tied up. The fix is a task queue: hand the slow job to a background worker and return immediately. Celery is Python's standard task queue, and Redis is the fast in-memory store that connects the pieces.
Setup: pip install celery redis and run a Redis server (redis-server, or docker run -p 6379:6379 redis). Follow-along: needs Redis and a worker process.
The architecture
Three players pass a job along:
your API ──.delay(job)──► Redis (the queue) ──► Celery worker runs it
│ │
└──────── returns a task ID instantly ◄──── result stored back in Redis
- Producer: your code, which enqueues a task with .delay(...) and gets back a ticket (an AsyncResult) instantly, without waiting.
- Broker (Redis): the queue tasks wait in until a worker is free.
- Worker: a separate process that pulls tasks off the queue and runs them.
- Result backend (Redis): where the worker stores the result for later pickup.
Don't be confused: Redis is playing two roles here. As the broker it's the message queue (tasks waiting to run). As the result backend it stores finished results. They're configured separately (and can be different systems — e.g. RabbitMQ broker + Redis backend), but using Redis for both is the simplest common setup.
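To make the four roles concrete, here is a toy, single-process sketch of the same hand-off using only the standard library. It is not Celery: a `queue.Queue` stands in for the broker, a dict for the result backend, and a thread for the worker. The names `delay`, `get`, and `word_count` are made up for this illustration.

```python
import queue
import threading
import uuid

broker = queue.Queue()   # stand-in for the Redis broker: jobs wait here
results = {}             # stand-in for the Redis result backend
done = {}                # task_id -> Event, set once the result is stored

def delay(func, *args):
    """Producer side: enqueue the job and return a ticket immediately."""
    task_id = str(uuid.uuid4())
    done[task_id] = threading.Event()
    broker.put((task_id, func, args))
    return task_id

def worker():
    """Worker side: pull jobs off the queue and store each result."""
    while True:
        task_id, func, args = broker.get()
        if func is None:              # sentinel: shut the worker down
            break
        results[task_id] = func(*args)
        done[task_id].set()

def get(task_id, timeout=5.0):
    """Block until the worker has stored the result, then return it."""
    if not done[task_id].wait(timeout):
        raise TimeoutError(task_id)
    return results[task_id]

def word_count(text):                 # a stand-in for a slow job
    return len(text.split())

thread = threading.Thread(target=worker, daemon=True)
thread.start()

ticket = delay(word_count, "great product overall")   # returns instantly
answer = get(ticket)                                   # waits for the worker
broker.put((None, None, ()))                           # stop the worker
thread.join()
print(ticket, answer)
```

Swapping the in-memory queue and dict for Redis, and the thread for a separate process, is what lets the real setup survive restarts and spread across machines.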
The Celery app
tasks/celery_app.py configures Celery to use Redis for both
roles, with production-sane defaults:
from celery import Celery

app = Celery("sentiment", broker="redis://localhost:6379/0",
             backend="redis://localhost:6379/0")
app.conf.update(
    task_time_limit=300,              # hard-kill a task after 5 minutes
    worker_max_tasks_per_child=100,   # recycle each worker process after 100 tasks to contain memory leaks
)
The tasks
A task is just a function with the @app.task decorator. From
tasks/tasks.py:
from sentiment.data import load_dataset
from sentiment.model import SentimentModel
from tasks.celery_app import app
# _model() and MODEL_PATH are defined elsewhere in tasks/tasks.py (not shown here)

@app.task
def batch_score(texts: list[str]) -> list[dict]:
    model = _model()
    probs = model.predict_proba(texts)
    return [{"text": t, "score": round(float(p), 4),
             "label": "positive" if p >= 0.5 else "negative"}
            for t, p in zip(texts, probs)]

@app.task
def retrain() -> dict:
    """Retrain from scratch and save. Schedule this nightly."""
    model = SentimentModel().fit(*load_dataset())
    model.save(MODEL_PATH)
    return {"status": "retrained", "vocab_size": len(model.vocab)}
Running it (three terminals)
# terminal 1 — the broker
redis-server
# terminal 2 — a worker (from code/)
celery -A tasks.celery_app worker --loglevel=info
# terminal 3 — enqueue a job and fetch the result
python -c "from tasks.tasks import batch_score; \
print(batch_score.delay(['great product','awful service']).get(timeout=10))"
The worker logs the task running, and terminal 3 prints the result once it's done — the task logic produces exactly this (verified against the model):
[{'text': 'great product', 'score': 0.9768, 'label': 'positive'},
{'text': 'awful service', 'score': 0.3242, 'label': 'negative'}]
The key move: .delay(...) returned instantly with a ticket; .get() waited for
the worker to finish. In a real API you'd return {"task_id": result.id} immediately
and let the client poll a /result/{task_id} endpoint, so the user never waits on an
open connection.
The fire-and-forget pattern in an API
# Here `app` is the FastAPI application, not the Celery app from tasks/celery_app.py.
@app.post("/score-batch")
def score_batch(texts: list[str]):
    task = batch_score.delay(texts)      # enqueue, don't wait
    return {"task_id": task.id}          # respond in milliseconds

@app.get("/result/{task_id}")
def get_result(task_id: str):
    res = batch_score.AsyncResult(task_id)
    return {"ready": res.ready(),
            "result": res.result if res.ready() else None}
This is how every "we're processing your request, check back" flow works — bulk exports, video processing, sending email, long ML inference.
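The client side of "check back later" is a polling loop. Below is a minimal sketch under stated assumptions: `fetch` is any callable that returns a dict shaped like the `/result/{task_id}` response (`ready`, `result`), and `poll_until_ready` and `fake_fetch` are hypothetical names. A real client would make an HTTP request inside `fetch`; here a canned sequence of responses stands in for the server.

```python
import time

def poll_until_ready(fetch, interval=0.5, timeout=30.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Call fetch() until it reports ready, or raise after `timeout` seconds."""
    deadline = clock() + timeout
    while True:
        status = fetch()
        if status["ready"]:
            return status["result"]
        if clock() >= deadline:
            raise TimeoutError("task did not finish in time")
        sleep(interval)

# Stand-in for GET /result/{task_id}: not ready twice, then done.
_responses = iter([
    {"ready": False, "result": None},
    {"ready": False, "result": None},
    {"ready": True, "result": [{"text": "great product", "label": "positive"}]},
])

def fake_fetch():
    return next(_responses)

result = poll_until_ready(fake_fetch, interval=0.01, timeout=1.0)
print(result)
```

The explicit deadline matters: without it, a task that never finishes would keep the client polling forever.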
Scheduled tasks (Celery Beat)
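Celery also runs tasks on a schedule via Celery Beat, e.g. retrain once every 24 hours. A numeric schedule is an interval in seconds, counted from when Beat starts; for a fixed wall-clock time such as 2 AM, use celery.schedules.crontab(hour=2, minute=0) instead:
app.conf.beat_schedule = {
    "nightly-retrain": {"task": "tasks.tasks.retrain", "schedule": 86400.0},
}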
(For richer pipelines with dependencies and retries, you'll reach for a real orchestrator — that's Chapter 9, Prefect. Celery is for tasks; Prefect is for workflows.)
Why a queue, not just threads
Don't be confused: background threads vs. a task queue. You could run slow work in a thread, but a task queue gives you what threads can't: queued work survives an API restart (it's in Redis, not in the API's memory), it scales across many machines (add more workers), it can retry failures automatically once you set a retry policy, and it doesn't compete with your web server for resources. For anything important or heavy, use the queue.
When Celery — and when not
Celery is the right tool surprisingly often, but it's not the only one. Match the tool to the job:
| You need… | Reach for |
|---|---|
| A quick "fire and forget" after responding (send an email, log an event) | FastAPI BackgroundTasks — built in, no broker, runs in the same process |
| Durable jobs that survive restarts, retry, and scale across machines | Celery + Redis (this chapter) — the general-purpose workhorse |
| A simpler Redis-only queue | RQ, Dramatiq, Arq (async) |
| Distributed ML training or heavy parallel inference across a cluster | Ray — purpose-built for scaling Python/ML compute |
| Multi-step pipelines with dependencies between steps | a workflow orchestrator (Prefect, Chapter 9) |
Don't be confused: BackgroundTasks vs. Celery vs. Ray. FastAPI's BackgroundTasks runs work in your web process after the response is sent: great for light, fire-and-forget jobs, but the work dies if the process restarts and can't scale past one machine. Celery runs work in separate worker processes backed by a broker: durable, retryable, horizontally scalable; the right default for real background jobs. Ray is a different animal: a framework for distributing heavy compute (training, large batch inference, hyperparameter sweeps) across a cluster. Reach for Ray when the bottleneck is CPU/GPU work that needs many machines, not when you simply need to get a job off the request path.
Where it fits in production
- Decouple slow work from the request path → fast, responsive APIs.
- Scale independently → add workers when the queue backs up, without touching the API.
- Reliability → tasks can be retried on failure, and queued work survives an API crash because it lives in Redis, not in the API's memory.
Alternatives in this category: RQ (simpler, Redis-only), Dramatiq, Arq (async). Cloud-native options: AWS SQS + Lambda. The pattern — producer, broker, worker — is identical everywhere.
The takeaway
A task queue moves slow work (batch scoring, retraining, emails) out of the request
path: your code calls .delay() and returns instantly, Redis holds the queue, and
Celery workers do the work in their own processes (on the same machine or many others),
so queued work survives API restarts and workers scale independently. Use Beat for
scheduled jobs, a real orchestrator for complex pipelines.
We've been leaning on Redis as the broker; next, let's use it for the other thing
it's great at — caching. 👉
Redis: caching & feature lookups
We met Redis as Celery's broker, but that's a fraction of what it does. Redis is an in-memory key-value store — absurdly fast (sub-millisecond), and the Swiss Army knife of production systems. For ML it solves three recurring problems: caching predictions, serving features fast, and rate limiting. This chapter shows each.
Setup: pip install redis and run a Redis server. Follow-along: needs a running Redis.
Why in-memory matters
Redis keeps data in RAM, so reads and writes take microseconds — orders of magnitude faster than a database or recomputing a model prediction. The trade-off: RAM is smaller and (by default) volatile. So you use Redis for hot, ephemeral data — things you can afford to lose and want now.
Use 1 — cache predictions
If the same text gets scored repeatedly (popular queries, repeated documents), don't re-run the model — cache the result. The pattern is cache-aside: check the cache; on a miss, compute and store it.
import json
import redis

r = redis.Redis(host="localhost", port=6379, db=0)

def cached_score(text, model, ttl=3600):
    key = f"pred:{text}"
    hit = r.get(key)
    if hit is not None:                              # cache HIT: skip the model entirely
        return json.loads(hit)
    score = float(model.predict_proba([text])[0])    # cache MISS: compute it
    r.set(key, json.dumps(score), ex=ttl)            # store with a 1-hour expiry
    return score
The ex=ttl gives the key a time-to-live — it auto-expires after an hour, so the
cache never grows unbounded and stale predictions self-clean. Expected behavior:
first call ("great product") -> MISS, runs model, stores -> 0.9768 (~5 ms)
second call ("great product") -> HIT, returns from Redis -> 0.9768 (~0.2 ms)
The second call is ~25× faster because it never touched the model. For an expensive model (an LLM call costing money and seconds), caching is the single highest-impact optimization you can make.
Don't be confused: caching vs. the result backend. In Chapter 6 Redis stored task results so a producer could fetch them. Here it caches model outputs to avoid recomputation. Same Redis, different purpose — it's a general key-value store, and these are just two of its uses.
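If you don't have Redis running, the cache-aside flow can still be traced with an in-memory stand-in. The sketch below is an illustration only: `TTLCache` is a hypothetical class that imitates GET and SET-with-expiry, `slow_model` is a fake model, and a fake clock replaces real waiting so the TTL expiry can be shown instantly.

```python
import time

class TTLCache:
    """Tiny in-memory stand-in for Redis GET / SET with an expiry."""
    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._store = {}                      # key -> (value, expires_at)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:       # expired: behave like a missing key
            del self._store[key]
            return None
        return value

    def set(self, key, value, ex):
        self._store[key] = (value, self._clock() + ex)

calls = []                                    # records every real model run

def slow_model(text):
    calls.append(text)
    return 0.9 if "great" in text else 0.1

now = [0.0]                                   # fake clock, so no sleeping is needed
cache = TTLCache(clock=lambda: now[0])

def cached_score(text, ttl=3600):
    key = f"pred:{text}"
    hit = cache.get(key)
    if hit is not None:
        return hit
    score = slow_model(text)
    cache.set(key, score, ex=ttl)
    return score

first = cached_score("great product")     # MISS: runs the model
second = cached_score("great product")    # HIT: served from the cache
now[0] = 3601.0                           # jump past the one-hour TTL
third = cached_score("great product")     # MISS again: the entry expired
print(first, second, third, len(calls))   # 0.9 0.9 0.9 2
```

Three calls, two model runs: the middle call never touched the model, and the TTL forced a recompute once the entry aged out.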
Use 2 — a feature store / fast feature lookups
Online models need features now: when a request arrives for user:123, you must
fetch that user's features in single-digit milliseconds. A database might be too slow;
Redis is built for it. Redis hashes store an object's fields under one key:
# precompute features offline, write them to Redis
r.hset("user:123", mapping={"avg_order": "84.50", "n_sessions": "12", "tier": "gold"})
# at request time, fetch instantly
features = r.hgetall("user:123")
# -> {b'avg_order': b'84.50', b'n_sessions': b'12', b'tier': b'gold'}
This is the core of an online feature store (the foundations book's MLOps
chapter): batch jobs compute features and push them to Redis;
the serving model reads them with hgetall in under a millisecond. Tools like Feast
use exactly this pattern (often with Redis as the online store) — and they add the
crucial guarantee that training and serving compute features identically, killing
training/serving skew.
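One practical detail from the reply shown above: by default the client returns bytes, so features need decoding and casting before a model can use them. A minimal sketch, where `decode_features` and `schema` are hypothetical names for this illustration:

```python
def decode_features(raw, schema):
    """Turn a bytes->bytes hash (as returned by HGETALL) into typed features."""
    features = {}
    for name, cast in schema.items():
        value = raw.get(name.encode())
        if value is None:
            raise KeyError(f"missing feature: {name}")
        features[name] = cast(value.decode())
    return features

# The raw reply from the example above, plus an assumed schema for it.
raw = {b"avg_order": b"84.50", b"n_sessions": b"12", b"tier": b"gold"}
schema = {"avg_order": float, "n_sessions": int, "tier": str}

features = decode_features(raw, schema)
print(features)   # {'avg_order': 84.5, 'n_sessions': 12, 'tier': 'gold'}
```

Creating the client with `redis.Redis(..., decode_responses=True)` removes the bytes-to-str step, but the numeric casts are still yours to do, and failing loudly on a missing feature is usually safer than silently feeding the model a default.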
Use 3 — rate limiting
Protect your API from abuse by counting requests per user per time window. Redis's
atomic INCR plus a TTL makes this a few lines:
def allow_request(user_id, limit=100, window=60):
    key = f"rate:{user_id}"
    count = r.incr(key)            # atomic increment (creates the key at 1)
    if count == 1:
        r.expire(key, window)      # first hit starts the 60-second window
    return count <= limit          # allow until the limit is hit
requests 1..100 -> allow_request -> True
request 101 -> allow_request -> False (blocked until the window resets)
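Because INCR is atomic, the count is correct even with thousands of concurrent
requests: no two requests can ever read the same value. The key auto-expires, resetting
the window. One caveat: INCR and EXPIRE are separate commands, so a client that dies
between them leaves a counter with no expiry; wrapping both in a MULTI/EXEC transaction
or a Lua script closes that gap.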
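The window behavior is easier to see with a clock you control. This sketch imitates the INCR-plus-expiry pattern in memory; it is a single-process illustration, not a replacement for Redis, and `FixedWindowLimiter` is a hypothetical name.

```python
import time

class FixedWindowLimiter:
    """In-memory imitation of the INCR + EXPIRE pattern (single process only)."""
    def __init__(self, limit=100, window=60, clock=time.monotonic):
        self.limit, self.window = limit, window
        self._clock = clock
        self._counters = {}          # user_id -> (count, window_expires_at)

    def allow(self, user_id):
        now = self._clock()
        count, expires_at = self._counters.get(user_id, (0, 0.0))
        if count == 0 or now >= expires_at:     # no key yet, or it expired
            count, expires_at = 0, now + self.window
        count += 1
        self._counters[user_id] = (count, expires_at)
        return count <= self.limit

now = [0.0]                                      # fake clock for the demo
limiter = FixedWindowLimiter(limit=3, window=60, clock=lambda: now[0])
in_window = [limiter.allow("user:123") for _ in range(4)]   # 4th exceeds the limit
now[0] = 61.0                                                # the window has expired
after_reset = limiter.allow("user:123")
print(in_window, after_reset)   # [True, True, True, False] True
```

This is a fixed-window counter, so a client can send up to twice the limit in a short burst that straddles a window boundary. If that matters, a sliding-window or token-bucket scheme is the usual next step.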
The data structures you'll actually use
| Structure | Command | ML use |
|---|---|---|
| String | SET/GET | cache a prediction or config |
| Hash | HSET/HGETALL | a feature vector / user profile |
| Counter | INCR/EXPIRE | rate limiting, metrics |
| Sorted set | ZADD/ZRANGE | leaderboards, top-K, recent items |
| TTL | EX/EXPIRE | auto-expiring cache entries |
Production notes
- TTLs everywhere: cache entries should expire, or Redis fills up. Set an eviction policy (allkeys-lru) so it drops the least recently used keys under memory pressure.
- Persistence: Redis can persist to disk (RDB/AOF), but treat it as a cache you can rebuild, not your source of truth.
- It's a network hop: fast, but not free; batch reads (MGET, pipelines) when you can.
- Cache invalidation: "one of the two hard problems." When the model changes, old cached predictions are wrong, so version your cache keys (pred:v2:...) and a new model can't serve stale answers.
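One way to build such keys, as a sketch: put the model version in the key and hash the text so long inputs still give short, fixed-length keys. The `cache_key` helper and the `pred:<version>:<digest>` layout are assumptions for illustration, not something the book's code defines.

```python
import hashlib

def cache_key(text, model_version, prefix="pred"):
    """Build a short, fixed-length key that changes whenever the model does."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()[:32]
    return f"{prefix}:{model_version}:{digest}"

k_v1 = cache_key("great product", "v1")
k_v2 = cache_key("great product", "v2")
print(k_v1)
print(k_v2)
```

After a model upgrade the old `pred:v1:...` entries are simply never read again and age out through their TTLs, so no explicit purge is needed.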
The takeaway
Redis is the in-memory workhorse: cache predictions (cache-aside with a TTL) to skip expensive recomputation, serve features in sub-millisecond lookups (the online feature-store pattern), and rate-limit with atomic counters. Always set TTLs, version your cache keys, and treat it as rebuildable. We can now serve fast and scale work — but which data trained the model we're serving? Time to version it. 👉
DVC: version data & models
Git is brilliant for code and terrible for data. Commit a 2 GB dataset or a 500 MB model and your repo becomes unusable — Git stores every version of every byte. Yet you need to know which data and which model produced a given result (it's required for reproducibility, debugging, and audits). DVC (Data Version Control) solves this: it versions large files with Git, without putting them in Git.
Setup: pip install dvc (add dvc[s3] / dvc[gs] for cloud remotes). Follow-along: DVC commands shown with expected output.
The core trick
DVC keeps your big files in separate storage (a "remote": S3, GCS, even a shared folder) and commits a tiny pointer file to Git instead. The pointer holds a hash of the data; Git versions the pointer, DVC versions the data:
Git repo: data.csv.dvc (a 100-byte text file with a hash) ✅ tiny, versioned
DVC remote: the actual 2 GB data.csv, keyed by that hash ✅ big, out of Git
So git checkout an old commit, then dvc checkout, and you get exactly the data
and model that commit used — full reproducibility, no bloated repo.
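The pointer idea is content addressing: hash the file, store the hash. The sketch below shows only that idea, assuming an MD5 content hash; DVC's actual .dvc files are YAML and carry more detail, and `make_pointer` is a hypothetical helper.

```python
import hashlib
import os
import tempfile

def make_pointer(path):
    """Hash a file's contents and return a small pointer record for it."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):   # read 1 MiB at a time
            md5.update(chunk)
    return {"md5": md5.hexdigest(), "size": os.path.getsize(path),
            "path": os.path.basename(path)}

with tempfile.TemporaryDirectory() as tmp:
    data = os.path.join(tmp, "reviews.csv")
    with open(data, "w") as f:
        f.write("text,label\ngreat product,1\n")
    v1 = make_pointer(data)

    with open(data, "a") as f:          # the dataset changes...
        f.write("awful service,0\n")
    v2 = make_pointer(data)             # ...and so does the pointer

print(v1)
print(v2)
```

Because the pointer depends only on the file's bytes, committing it to Git pins the exact dataset version without Git ever storing the data itself.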
Versioning a dataset
cd code
dvc init # set up DVC in the repo (once)
dvc add data/reviews.csv # tell DVC to track this file
git add data/reviews.csv.dvc data/.gitignore
git commit -m "track reviews dataset v1"
Expected output of dvc add:
100% Adding...|████████████████████████████████|1/1 [00:00, 12.4file/s]
To track the changes with git, run:
git add data/reviews.csv.dvc data/.gitignore
DVC moved the real CSV into its cache, wrote reviews.csv.dvc (the pointer), and added
the CSV to .gitignore so Git never tries to track the big file. You commit the
pointer, not the data.
Don't be confused: dvc add vs. git add. git add stages a file into Git. dvc add does the opposite: it pulls a file out of Git's tracking and into DVC's, leaving behind a small .dvc pointer that Git does track. You then git add the pointer. Two tools, clear division: Git tracks small text (code, pointers), DVC tracks big binaries (data, models).
Pushing data to a remote
The data lives in your local DVC cache until you push it to shared storage:
dvc remote add -d storage s3://my-bucket/dvc-store # configure once
dvc push # upload tracked data
Now a teammate runs git pull (gets the pointers) then dvc pull (downloads the
matching data). Everyone has the same data, versioned in lockstep with the code.
Reproducible pipelines (dvc.yaml)
DVC's second superpower: define your pipeline as stages with declared dependencies and outputs, and DVC only re-runs a stage when its inputs change:
# dvc.yaml
stages:
  train:
    cmd: python sentiment/train.py
    deps:
      - sentiment/train.py
      - data/reviews.csv        # if the data changes, retrain
    outs:
      - model.json              # DVC versions the output model too
    metrics:
      - metrics.json            # track metrics across versions
dvc repro # runs only the stages whose dependencies changed
dvc metrics show # print the tracked metrics (dvc metrics diff compares them across versions)
dvc repro is "make for ML": change the data and it knows to retrain; change nothing
and it does nothing. This makes your whole train→evaluate flow reproducible and
cached.
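The "re-run only when inputs change" rule boils down to comparing dependency hashes against the ones recorded after the last run. Here is a simplified sketch of that idea; DVC records its hashes in a dvc.lock file and tracks more than this, and `needs_rerun`, `run_stage`, and the JSON lock format are made up for illustration.

```python
import hashlib
import json
import os
import tempfile

def file_hash(path):
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()

def needs_rerun(deps, lock_path):
    """Return (stale, current_hashes); stale is True if any dependency changed."""
    current = {d: file_hash(d) for d in deps}
    if not os.path.exists(lock_path):
        return True, current                  # never run before
    with open(lock_path) as f:
        recorded = json.load(f)
    return current != recorded, current

def run_stage(deps, lock_path, cmd):
    """Run cmd only if the stage is stale, then record the new hashes."""
    stale, current = needs_rerun(deps, lock_path)
    if stale:
        cmd()
        with open(lock_path, "w") as f:
            json.dump(current, f)
    return stale

with tempfile.TemporaryDirectory() as tmp:
    data = os.path.join(tmp, "reviews.csv")
    lock = os.path.join(tmp, "stage.lock.json")
    with open(data, "w") as f:
        f.write("great product,1\n")

    runs = []
    train = lambda: runs.append("trained")    # stand-in for the training command

    history = [
        run_stage([data], lock, train),       # first run: no lock file yet
        run_stage([data], lock, train),       # nothing changed: skipped
    ]
    with open(data, "a") as f:                # the data changes...
        f.write("awful service,0\n")
    history.append(run_stage([data], lock, train))   # ...so the stage runs again

print(history, len(runs))   # [True, False, True] 2
```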
Time-travel to any experiment
Because data, model, and metrics are all versioned alongside the code, you can jump to any past state completely:
git checkout <old-commit> # the code + the .dvc pointers from back then
dvc checkout # the exact data + model that commit used
You're now looking at precisely the data and model that produced last month's result — the thing that's impossible with Git alone, and essential when a regulator or a postmortem asks "what exactly did we train on?"
DVC vs. the alternatives
- Git LFS — also stores big files outside Git, but it's generic and lacks DVC's ML pipelines, metrics, and experiment features.
- lakeFS / Delta Lake — data versioning at the data-lake scale (whole tables, branching petabytes). Bigger hammer.
- MLflow — versions models and runs; DVC versions data and pipelines. They're complementary, and teams often use both.
Don't be confused: DVC vs. MLflow. They overlap but lead with different things. MLflow centers on experiment tracking and the model registry (params, metrics, model versions). DVC centers on data versioning and reproducible pipelines (which data + which code = which result). Use MLflow to compare runs and ship models; use DVC to guarantee you can reproduce the data behind any of them.
The takeaway
DVC versions large data and models with Git but not in it: the big files live in a
remote (S3/GCS), and Git tracks tiny .dvc pointer files, so git checkout + dvc checkout reproduces any past data-and-model state exactly. dvc.yaml turns your
pipeline into cached, dependency-aware stages ("make for ML"). Now that data and models
are versioned, let's automate the loop that produces them — orchestrated retraining. 👉
Prefect: orchestrate retraining
A production model must retrain as new data arrives — but "retraining" isn't one step. It's pull data → train → evaluate → promote if better → maybe alert, with retries when a step fails and a schedule so it runs every night unattended. Stringing that together with cron and shell scripts is fragile. Prefect turns the pipeline into observable, retrying, schedulable Python — a workflow orchestrator.
Setup:
pip install prefect. Follow-along — runs locally as shown; a schedule needs a Prefect server/cloud.
Tasks and flows
Prefect has two decorators:
- @task: one step (pull data, train, evaluate). Prefect tracks it: logs, retries, caching, timing.
- @flow: the function that wires tasks into a pipeline (a DAG). It's the unit you run and schedule.
From pipeline/retrain_flow.py:
from prefect import flow, task
from sentiment.data import load_dataset
from sentiment.model import SentimentModel

@task(retries=2, retry_delay_seconds=5)   # auto-retry flaky steps
def extract():
    return load_dataset()

@task
def train(data):
    return SentimentModel().fit(*data)

@task
def evaluate(model, data):
    return model.accuracy(*data)

@task
def promote(model, accuracy, threshold=0.9):
    if accuracy >= threshold:             # an automated quality gate
        model.save("model.json")
        return f"promoted (acc={accuracy:.3f})"
    return f"rejected (acc={accuracy:.3f} < {threshold})"

@flow(name="retrain-sentiment", log_prints=True)
def retrain_pipeline():
    data = extract()
    model = train(data)
    acc = evaluate(model, data)
    print(promote(model, acc))

if __name__ == "__main__":                # lets `python pipeline/retrain_flow.py` run the flow
    retrain_pipeline()
Notice the quality gate in promote: the new model only ships if it clears a bar.
This is how you retrain automatically without risking a bad model reaching production —
a critical safety valve in any automated pipeline.
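A fixed threshold is the simplest gate. The chapter's opening describes the step as "promote if better", which suggests also comparing against the model currently in production. The sketch below is one hedged way to express that; `should_promote`, `current_acc`, and `min_gain` are hypothetical names, not part of retrain_flow.py.

```python
def should_promote(new_acc, current_acc=None, threshold=0.9, min_gain=0.0):
    """Promote only if the candidate clears the bar and does not regress."""
    if new_acc < threshold:
        return False, f"rejected (acc={new_acc:.3f} < {threshold})"
    if current_acc is not None and new_acc < current_acc + min_gain:
        return False, f"rejected (acc={new_acc:.3f} does not beat {current_acc:.3f})"
    return True, f"promoted (acc={new_acc:.3f})"

print(should_promote(0.95))                     # (True, 'promoted (acc=0.950)')
print(should_promote(0.85))                     # (False, 'rejected (acc=0.850 < 0.9)')
print(should_promote(0.92, current_acc=0.94))   # (False, 'rejected (acc=0.920 does not beat 0.940)')
```

The third case is the one a plain threshold misses: a candidate can clear 0.9 and still be worse than what is already serving traffic.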
Running it
cd code
python pipeline/retrain_flow.py
Expected output (Prefect narrates each task as it runs):
14:32:01.245 | INFO | prefect.engine - Created flow run 'splendid-otter' for flow 'retrain-sentiment'
14:32:01.310 | INFO | Task run 'extract-0' - Finished in state Completed()
14:32:01.402 | INFO | Task run 'train-0' - Finished in state Completed()
14:32:01.455 | INFO | Task run 'evaluate-0' - Finished in state Completed()
14:32:01.501 | INFO | Task run 'promote-0' - Finished in state Completed()
14:32:01.503 | INFO | Flow run 'splendid-otter' - promoted (acc=1.000)
14:32:01.540 | INFO | Flow run 'splendid-otter' - Finished in state Completed()
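The model scored 1.000 ≥ 0.9, so the gate promoted it. (The perfect score is partly an
artifact: this toy flow evaluates on the same data it trained on.) Each task shows its
own state: if train had thrown, Prefect would mark it Failed (train has no retry policy)
and the flow run would fail with it, while a failing extract would first be retried
twice, 5 seconds apart. Either way you'd see exactly which step broke and why.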
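For intuition, the retries=2, retry_delay_seconds=5 policy declared on extract amounts to roughly the loop below. This is a plain-Python approximation, not Prefect's implementation; `run_with_retries` and `flaky_extract` are hypothetical names.

```python
import time

def run_with_retries(func, retries=2, retry_delay_seconds=5, sleep=time.sleep):
    """Call func(); on an exception, wait and try again up to `retries` more times."""
    attempt = 0
    while True:
        try:
            return func()
        except Exception:
            if attempt >= retries:
                raise                     # out of retries: the failure surfaces
            attempt += 1
            sleep(retry_delay_seconds)

attempts = []

def flaky_extract():
    attempts.append(1)
    if len(attempts) < 3:                 # fail the first two calls
        raise ConnectionError("data source unavailable")
    return ["row1", "row2"]

data = run_with_retries(flaky_extract, sleep=lambda s: None)   # skip real sleeping
print(data, len(attempts))   # ['row1', 'row2'] 3
```

Two retries means up to three attempts in total. An orchestrator adds what this loop lacks: every attempt is logged and visible, and the retry state survives outside your process.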
Why an orchestrator beats cron + scripts
Don't be confused: Celery vs. Prefect — tasks vs. workflows. Celery runs independent tasks off a queue (great for "score this batch"). Prefect runs workflows — multi-step pipelines with dependencies between steps, where step B needs step A's output, with retries, scheduling, and a UI showing the whole DAG. Use Celery for fire-and-forget jobs; use Prefect for "the nightly retraining pipeline."
A cron job that calls a shell script gives you none of: retries, visibility into which step failed, passing data between steps, backfills, or alerting. Prefect gives you all of it, in Python you already know.
Scheduling it
To run unattended every night at 2 AM, deploy it with a schedule:
prefect deploy pipeline/retrain_flow.py:retrain_pipeline --cron "0 2 * * *"
prefect worker start --pool default # a worker executes scheduled runs
Now the pipeline runs nightly, the Prefect UI shows every run's status and logs, and failures can page you. That's the "retrain when it drifts" arrow of the lifecycle loop (Chapter 0) made real — often triggered by the drift check from Chapter 12.
The orchestrator landscape
- Prefect — Pythonic, modern, gentle learning curve (this chapter).
- Airflow — the incumbent; powerful, ubiquitous in data engineering, heavier.
- Dagster — asset-centric, strong typing and data-awareness.
- Kubeflow Pipelines — Kubernetes-native ML pipelines.
They all express the same idea — a DAG of steps with scheduling and observability — so the concept transfers. Airflow is the one you'll most often see on job descriptions; Prefect is the friendliest to learn it on.
The takeaway
Prefect turns retraining into an orchestrated workflow: @task steps wired by an
@flow, with automatic retries, a quality gate that only promotes good models,
per-step visibility, and cron scheduling for unattended nightly runs. It's cron with a
brain — and the engine of the lifecycle's retraining loop. Our pipeline produces models;
next, let's make those models fast and portable for inference with ONNX. 👉
ONNX: portable, faster inference
Your model is a Python object using NumPy (or PyTorch, or TensorFlow). But what if you need to serve it from a C++ game engine, a Java backend, a browser, or a tiny edge device with no Python? And what if framework overhead makes inference slower than it needs to be? ONNX (Open Neural Network Exchange) solves both: a single, framework-neutral model format that runs fast via a dedicated runtime, anywhere.
Install:
pip install torch onnx onnxruntime. This chapter runs live — the output below is real.
What ONNX is
ONNX is a standard file format for models, plus onnxruntime, a highly optimized engine that executes them. The workflow:
train in Python (PyTorch/sklearn/...) ─► export to model.onnx ─►
run with onnxruntime in ANY language, often 2-5x faster
You decouple training (Python, flexible, slow is OK) from serving (any language,
optimized, latency matters). The .onnx file is portable; the runtime handles the
speed.
Exporting our model
Our model is bag-of-words + logistic regression, i.e. sigmoid(counts @ w + b). We
rebuild that as a one-line PyTorch Linear layer (so we can use torch's ONNX exporter)
and export it. From serving/onnx_export.py:
import torch
import torch.nn as nn
from sentiment.data import load_dataset
from sentiment.model import SentimentModel

model = SentimentModel().fit(*load_dataset())
n_features = len(model.weights) - 1          # the last weight is the bias
linear = nn.Linear(n_features, 1)
with torch.no_grad():                        # copy the learned weights across
    linear.weight.copy_(torch.tensor(model.weights[:-1]).reshape(1, -1))
    linear.bias.copy_(torch.tensor([model.weights[-1]]))

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = linear

    def forward(self, x):
        return torch.sigmoid(self.fc(x))

torch.onnx.export(
    Net().eval(), torch.zeros(1, n_features), "sentiment.onnx",
    input_names=["counts"], output_names=["score"],
    dynamic_axes={"counts": {0: "batch"}, "score": {0: "batch"}},  # variable batch
    opset_version=17,
)
The dynamic_axes is the one subtlety worth knowing: it marks the batch dimension
as variable, so the exported model accepts 1 row or 1,000. We also save the vocabulary
alongside, because the runtime still has to turn text into word counts the same way.
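Turning text into counts at serving time might look like the sketch below. It assumes lowercase, whitespace-split tokens and a vocabulary saved as an ordered list of words; the real tokenization must match whatever SentimentModel uses, and `vectorize` and the sample vocabulary are hypothetical.

```python
def vectorize(texts, vocab):
    """Turn texts into bag-of-words count rows, one column per vocabulary word."""
    index = {word: i for i, word in enumerate(vocab)}
    rows = []
    for text in texts:
        row = [0.0] * len(vocab)
        for token in text.lower().split():
            if token in index:            # words outside the vocabulary are ignored
                row[index[token]] += 1.0
        rows.append(row)
    return rows

vocab = ["great", "terrible", "this", "is"]      # hypothetical saved vocabulary
counts = vectorize(["this is great", "great great stuff"], vocab)
print(counts)   # [[1.0, 0.0, 1.0, 1.0], [2.0, 0.0, 0.0, 0.0]]
```

Before handing the rows to onnxruntime, convert them with `numpy.asarray(counts, dtype=numpy.float32)`: the model was exported with a float32 input, and the runtime rejects other dtypes. Column order matters just as much, because each column must line up with the weight learned for that word.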
Running inference with onnxruntime — no PyTorch needed
The whole point: load the .onnx and predict with only onnxruntime, no training
framework:
import onnxruntime as ort
sess = ort.InferenceSession("sentiment.onnx")
scores = sess.run(["score"], {"counts": counts})[0] # counts = float32 bag-of-words matrix, shape (batch, vocab_size)
It runs — live output
cd code
python serving/onnx_export.py
Output:
exported -> sentiment.onnx (602 bytes)
0.977 positive <- 'this is great'
0.024 negative <- 'this is terrible'
The model exported to a 602-byte .onnx file, and onnxruntime reproduced the exact
predictions (0.977, 0.024) — without the original SentimentModel class. That
file now runs in C#, Java, JavaScript (onnxruntime-web), C++, or on a Raspberry Pi,
identically.
Don't be confused: ONNX (the format) vs. onnxruntime (the engine). ONNX is the .onnx file, a portable description of the model's computation graph. onnxruntime is the program that loads and executes it fast. You export to ONNX and run with onnxruntime. (Other engines, such as TensorRT and OpenVINO, also run ONNX files, tuned for specific hardware.)
Why teams reach for it
- Speed — onnxruntime applies graph optimizations (operator fusion, constant folding) and uses optimized kernels; inference is often 2–5× faster than the source framework, especially on CPU.
- Portability — one model file, every language and platform. No Python at serving time.
- Smaller footprint — ship a lean runtime instead of a full PyTorch/TF install — big for containers (Chapter 5) and edge devices.
- A bridge to faster hardware — ONNX is the on-ramp to TensorRT (NVIDIA GPUs), OpenVINO (Intel), and mobile/edge accelerators.
ONNX and quantization
ONNX pairs naturally with quantization — converting weights from float32 to int8 to shrink the model ~4× and speed it up further:
from onnxruntime.quantization import quantize_dynamic, QuantType
quantize_dynamic("sentiment.onnx", "sentiment.int8.onnx", weight_type=QuantType.QInt8)
This is how large models get squeezed onto modest hardware — the same compression idea the IVF-PQ book applies to vectors, applied to model weights.
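To see where the roughly 4× saving and the small accuracy cost come from, here is one common scheme, symmetric per-tensor int8 quantization, written out by hand. It only illustrates the idea; onnxruntime's `quantize_dynamic` makes its own choices about scales and which operators to quantize, and the sample weights are made up.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: floats -> int8 values plus one scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats from the int8 values."""
    return [v * scale for v in q]

weights = [0.82, -1.27, 0.05, 0.0, 0.4]          # made-up float32 weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q)
print(scale, max_error)
```

Each weight now needs 1 byte instead of 4, plus a single shared scale, and the rounding error per weight is at most half a quantization step (`scale / 2`).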
When not to bother
ONNX shines for deployment of a settled model. During research and rapid iteration, stay in your native framework — exporting adds a step and not every exotic operation converts cleanly. Reach for ONNX when you're optimizing a model for production serving, targeting a non-Python environment, or chasing latency.
The takeaway
ONNX is a portable model format; onnxruntime is the fast engine that runs it. Export
once (mind the dynamic_axes for variable batch sizes), and your model runs in any
language, on any platform, typically 2–5× faster — and quantizes down for edge
hardware. It's the standard bridge from "trained in Python" to "served fast and
anywhere." Now let's give non-engineers a way to try the model — a UI. 👉
Streamlit & Gradio: a demo UI
A JSON API (Chapter 4) is perfect for machines, but useless for the product manager, the domain expert, or the executive who wants to try your model. For them you need a UI — and you do not want to write HTML, CSS, and JavaScript for a demo. Streamlit and Gradio turn a Python script into an interactive web app in minutes. They're how ML people ship UIs without becoming frontend engineers.
Install: pip install streamlit (or pip install gradio). Follow-along: both launch a local web server.
Streamlit: a script is the app
Streamlit's model is delightfully simple: write a normal top-to-bottom Python script,
and each st.* call renders a widget. It re-runs the whole script on every interaction.
Here's streamlit_app.py, a full UI for our model:
import streamlit as st
from sentiment.data import load_dataset
from sentiment.model import SentimentModel

@st.cache_resource   # load the model ONCE, not on every rerun
def get_model():
    return SentimentModel().fit(*load_dataset())

st.title("🎭 Sentiment Analyzer")
text = st.text_area("Your text", "this product is absolutely wonderful")
if st.button("Analyze") and text.strip():
    score = float(get_model().predict_proba([text])[0])
    st.metric("P(positive)", f"{score:.1%}")
    st.progress(score)
    if score >= 0.5:   # a plain if/else keeps the verdict out of a bare expression
        st.success("Positive 😊")
    else:
        st.error("Negative 😞")
Run it:
cd code
streamlit run streamlit_app.py # opens http://localhost:8501
You get a real web app: a title, a text box, a button, a live percentage metric, a progress bar, and a colored verdict — from ~12 lines of Python, no HTML.
Don't be confused: @st.cache_resource is not optional here. Streamlit re-runs your entire script on every click. Without the cache decorator, you'd retrain the model on every interaction, which is slow and wasteful. @st.cache_resource (for models/connections) and @st.cache_data (for dataframes/computations) memoize expensive work across reruns. Forgetting them is the #1 Streamlit performance bug.
The production pattern: UI calls the API
In the demo above the UI loads the model directly — fine for a prototype. In production
you keep one model behind the FastAPI service and have the UI call it, so there's
a single source of truth and the model isn't duplicated in every app. Our script
supports both — set API_URL and it calls the service instead:
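import os
import requests
API_URL = os.environ.get("API_URL")   # assumption: read from the environment, as the run command below suggests

def score(text):
    if API_URL:   # production: call the FastAPI service
        return requests.post(f"{API_URL}/predict", json={"text": text}).json()["score"]
    return float(get_model().predict_proba([text])[0])   # prototype: in-process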
API_URL=http://localhost:8000 streamlit run streamlit_app.py
This is the right architecture: the API owns inference; the UI is just a client.
Gradio: even faster for ML demos
Gradio is the other popular choice, purpose-built for ML demos and tightly integrated
with Hugging Face. You wrap a function in an Interface and it builds the UI:
import gradio as gr
from sentiment.model import SentimentModel
from sentiment.data import load_dataset

model = SentimentModel().fit(*load_dataset())

def classify(text):
    p = float(model.predict_proba([text])[0])
    return {"positive": p, "negative": 1 - p}   # Gradio renders a label/bar chart

gr.Interface(fn=classify, inputs="text", outputs="label").launch()
That's a complete app with a labeled confidence chart. Gradio also gives you a public
shareable link (launch(share=True)) — great for sending a demo to someone — and
one-click hosting on Hugging Face Spaces.
Don't be confused: Streamlit vs. Gradio — which to pick? Gradio is fastest for a single-model demo ("input → model → output"), with instant sharing and HF Spaces hosting. Streamlit is better for richer apps — dashboards, multiple inputs, charts, multi-step tools — because you control the full page layout. Demo a model → Gradio. Build an internal tool → Streamlit.
Where these fit (and where they don't)
- Great for: internal tools, model demos, stakeholder reviews, data dashboards, quick experiments, hackathons.
- Not for: customer-facing production apps at scale. They're single-process and not built for heavy concurrent traffic. For a real product UI, a frontend framework (React/Vue) talks to your FastAPI backend — but that's a frontend job, not yours.
The value is speed-to-demo: you can put a working model in front of a human in minutes, which is often what unblocks a project ("can I just try it?").
The takeaway
Streamlit and Gradio turn a Python script into a web UI with no frontend code — Gradio
for quick single-model demos with instant sharing, Streamlit for richer multi-widget
apps. Cache the model (@st.cache_resource), and in production have the UI call your
FastAPI service rather than loading the model itself. These are for demos and internal
tools, not customer-facing scale. We can now train, track, serve, scale, version,
orchestrate, optimize, and demo the model — the last question is whether it's still
working in production. 👉
Monitoring & drift detection
Here's the failure mode that catches every team once: you ship a great model, it works, everyone moves on — and six months later it's quietly making bad predictions, because the world changed and nobody was watching. A deployed model rots. Monitoring is how you catch the rot before your users (or your boss) do. It's the most-skipped and most-important stage of the lifecycle.
Install: the PSI check below is pure NumPy and runs live. Richer monitoring uses
pip install evidently.
Two kinds of monitoring (don't conflate them)
Don't be confused: operational monitoring vs. ML monitoring. Operational monitoring watches the service — latency, error rate, throughput, CPU. Standard DevOps (Prometheus/Grafana); if the API is slow or 500ing, you'll know. ML monitoring watches the model's quality — is it still accurate? A model can be perfectly healthy operationally (fast, 200 OK) while being completely wrong. Operational monitoring won't catch that. You need both.
This chapter is about the second kind — the one teams forget.
Why models rot: drift
The world your model learned no longer matches the world it's serving. Two flavors:
- Data drift — the input distribution shifts. New users behave differently, an upstream feature changes units, a season turns. The model sees inputs unlike its training data.
- Concept drift — the relationship between inputs and the answer shifts. Fraudsters invent new tactics; "expensive" means something different after inflation. Even with identical inputs, the right answer changed.
Both silently erode accuracy — and you often can't measure accuracy live, because the true labels arrive late or never (you learn if a loan defaults months later). So you monitor drift in the inputs as an early-warning proxy.
Detecting drift with PSI
The Population Stability Index is the classic, dependency-free drift metric. It
compares a reference distribution (your training data) to the current one (live
traffic), bucket by bucket, and sums how much they diverge. From
serving/drift_check.py:
import numpy as np

def psi(reference, current, bins=10):
    # bucket edges come from quantiles of the reference (training) data
    edges = np.quantile(reference, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf   # catch values outside the training range
    ref_pct = np.histogram(reference, edges)[0] / len(reference)
    cur_pct = np.histogram(current, edges)[0] / len(current)
    # clip empty buckets so the log and the division stay finite
    ref_pct, cur_pct = np.clip(ref_pct, 1e-6, None), np.clip(cur_pct, 1e-6, None)
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))
The industry-standard thresholds:
| PSI | Meaning | Action |
|---|---|---|
| < 0.1 | no significant drift | fine |
| 0.1 – 0.25 | moderate drift | investigate |
| > 0.25 | major drift | retrain |
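The table reads naturally as a small lookup, so a nightly job can report an action instead of a raw number. Here is a minimal sketch, assuming the 0.1 and 0.25 cut-offs above; the name `psi_action` is hypothetical and is not part of serving/drift_check.py.

```python
def psi_action(psi_value: float) -> str:
    """Map a PSI score to the action column of the threshold table."""
    if psi_value < 0.1:
        return "fine"          # no significant drift
    if psi_value <= 0.25:
        return "investigate"   # moderate drift
    return "retrain"           # major drift


for score in (0.0012, 0.18, 0.3089):
    print(f"PSI={score:.4f} -> {psi_action(score)}")
```

Treat the cut-offs as defaults rather than laws: a feature that is naturally noisy may deserve a looser threshold than one that should never move.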
It runs — live output
cd code
python serving/drift_check.py
Output:
stable PSI=0.0012 -> no significant drift
drifted PSI=0.3089 -> MAJOR drift — retrain
Two production scenarios compared against the training distribution: traffic that matches it scores PSI 0.0012 (no drift — all good), while traffic shifted in mean and spread scores PSI 0.3089 (well past 0.25 — retrain now). Run this nightly on a sample of live inputs and you get an automatic early warning, before accuracy visibly craters.
What to monitor
- Input feature distributions — PSI/KS-test per feature vs. training (above).
- Prediction distribution — if your model suddenly predicts "positive" 90% of the time when it used to be 50%, something changed upstream.
- Model accuracy — when labels eventually arrive, compare predictions to truth and track accuracy/AUC over time.
- Data quality — nulls, out-of-range values, schema changes. Often the real culprit: a broken upstream pipeline, not "concept drift."
- Operational metrics — latency, errors, throughput (the other kind).
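The prediction-distribution check in the list above needs no labels at all, only the model's own outputs. A rough sketch of the idea follows; the function names and the 0.15 tolerance are assumptions chosen for illustration, not values from this book's code.

```python
def positive_rate(predictions):
    """Fraction of predictions equal to 1 (the positive class)."""
    return sum(1 for p in predictions if p == 1) / len(predictions)


def prediction_shift(reference_preds, current_preds, tolerance=0.15):
    """Return (shift, alert): absolute change in positive rate, and whether it exceeds tolerance."""
    shift = abs(positive_rate(current_preds) - positive_rate(reference_preds))
    return shift, shift > tolerance


reference = [1, 0] * 50          # 50% positive at training time
current = [1] * 90 + [0] * 10    # 90% positive in live traffic
shift, alert = prediction_shift(reference, current)
print(f"shift={shift:.2f} alert={alert}")
```

A jump like this rarely tells you what broke, only that something did. It is a cheap tripwire that sends you to the per-feature drift and data-quality checks.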
Evidently: drift reports out of the box
Computing PSI by hand teaches the idea; in practice Evidently generates full drift and quality reports with a few lines:
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset
report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=train_df, current_data=live_df)
report.save_html("drift_report.html") # per-feature drift, with charts
It checks every feature, picks appropriate statistical tests, and produces a shareable HTML report. Hosted tools (WhyLabs, Arize, Fiddler) add dashboards, alerts, and historical tracking.
Closing the loop
Monitoring isn't passive — it's the trigger for the rest of the lifecycle. When the drift check fires, it kicks off the Prefect retraining pipeline (Chapter 9), which trains a fresh model, gates it on quality, and promotes it via the registry (Chapter 3) — all automatically. That closed loop — serve → monitor → detect drift → retrain → promote → serve — is what "production ML" actually means, and it's the whole picture from Chapter 0 made real.
Don't be confused: retrain on drift, not on a fixed schedule (ideally). Many teams retrain "every week" by default — simple, but it retrains when nothing changed (waste) and misses drift that strikes mid-week. Drift-triggered retraining responds to the actual signal. A scheduled retrain is a fine starting point; a drift-triggered one is the goal.
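Drift-triggered retraining comes down to one decision: did any monitored feature cross the retrain threshold? The sketch below assumes you already have a PSI score per feature. `should_retrain`, `nightly_check`, and the `trigger_retrain` callback are hypothetical names; in a real setup the callback would start your retraining pipeline.

```python
def should_retrain(psi_by_feature, threshold=0.25):
    """Return the names of features whose PSI exceeds the threshold (empty list = no retrain)."""
    return sorted(name for name, score in psi_by_feature.items() if score > threshold)


def nightly_check(psi_by_feature, trigger_retrain):
    """Run the drift decision and call the retrain hook only when something drifted."""
    drifted = should_retrain(psi_by_feature)
    if drifted:
        trigger_retrain(drifted)   # hypothetical hook, e.g. kicks off the retraining flow
    return drifted


triggered = []
nightly_check({"text_length": 0.31, "hour_of_day": 0.02}, triggered.append)
print(triggered)
```

Passing the hook in as an argument keeps the decision logic testable without a running orchestrator.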
The takeaway
Models rot as the world drifts from their training data, so you must monitor quality (drift, prediction shifts, accuracy when labels arrive), not just uptime. PSI is the classic drift metric — our live run flagged shifted traffic at 0.31 and clean traffic at 0.001 — and Evidently automates the full report. Crucially, monitoring triggers retraining, closing the production loop. That completes the classical MLOps stack — next we add the part every modern AI team now needs: the GenAI / LLM stack, starting with the vector databases that power semantic search and RAG. 👉
Vector databases
Modern AI runs on embeddings — vectors where geometric closeness means semantic similarity. Searching millions of them fast, with metadata and persistence, is the job of a vector database, and it's the storage layer under semantic search, recommendations, and RAG (next chapter). We build the core in NumPy so the magic is obvious, then map it to the real tools.
Setup: the in-memory store runs on NumPy only and the output below is real. Production uses Qdrant / Chroma / pgvector / Pinecone.
What a vector database actually does
Three operations, and the third is the whole point:
- Store vectors, each with an id and metadata (the source text, a URL, a timestamp).
- Filter by metadata (source = "docs", date > last_week).
- Search: given a query vector, return the top-k most similar stored vectors, quickly, even over millions.
The similarity is cosine (or dot product on normalized vectors) — exactly the recipe
from the foundations book. Here's the entire core, from
code/rag/vectorstore.py:
import numpy as np
class VectorStore:
    def __init__(self):
        self.ids, self.vecs, self.meta = [], [], []

    def add(self, id, vector, metadata=None):
        self.ids.append(id)
        self.vecs.append(np.asarray(vector, float))
        self.meta.append(metadata or {})

    def search(self, query_vec, k=3):
        M = np.vstack(self.vecs)
        # one matmul scores the query against every stored vector
        # (equals cosine similarity when the vectors are unit-length)
        sims = M @ np.asarray(query_vec, float)
        order = np.argsort(-sims)[:k]   # top-k, best first
        return [{"id": self.ids[i], "score": round(float(sims[i]), 3),
                 "metadata": self.meta[i]} for i in order]
That M @ query_vec — one matrix-vector product scoring every stored vector at once
— is vector search. Everything a real vector DB adds is performance and durability
around this idea.
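The dot product only equals cosine similarity when every vector has length 1, so it helps to normalize before storing and before querying. A small sketch of that step; `normalize` is a hypothetical helper, not a function from code/rag/vectorstore.py.

```python
import numpy as np


def normalize(vector):
    """Scale a vector to unit length so that a dot product equals cosine similarity."""
    v = np.asarray(vector, dtype=float)
    norm = np.linalg.norm(v)
    if norm == 0.0:
        return v   # an all-zero vector has no direction; leave it unchanged
    return v / norm


a = normalize([3.0, 4.0, 0.0])
b = normalize([6.0, 8.0, 0.0])   # same direction as a, twice the length
c = normalize([0.0, 0.0, 5.0])   # orthogonal to a

print(round(float(a @ b), 3))    # same direction scores as a perfect match
print(round(float(a @ c), 3))    # orthogonal vectors score zero
```

Without this step, a long vector can outrank a better-aligned short one, because the raw dot product rewards magnitude as well as direction.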
Embeddings: where the vectors come from
A vector store is only as good as its embeddings. You don't compute these by hand — you call an embedding model that maps text to a learned vector. For a runnable, deterministic demo we use a tiny interpretable "topic embedder" (3 dimensions: pets / finance / tech); in production you swap in a real model (Voyage AI, OpenAI, or a local sentence-transformer) that produces 768–1536 dimensions of learned meaning.
$ python rag/vectorstore.py
Output:
query 'loyal pets to own' -> [{'id': 'd1', 'score': 1.0, 'metadata': {'src': 'pets.md'}}, {'id': 'd2', 'score': 0.0, 'metadata': {'src': 'news.md'}}]
query 'python programming' -> [{'id': 'd3', 'score': 1.0, 'metadata': {'src': 'tech.md'}}, {'id': 'd1', 'score': 0.0, 'metadata': {'src': 'pets.md'}}]
The pets query retrieved the pets document (score 1.0) and correctly ranked the finance doc last (0.0); the programming query found the tech doc. Same matmul, real ranking — with learned embeddings instead of our toy ones, "loyal pets" would also match "faithful companion animals" with no shared words. That semantic matching is why vector search beats keyword search.
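To make the toy embedder less mysterious, here is one way such a 3-dimensional topic embedder could work: count keyword hits per topic, then normalize. This is an illustrative guess, not the embed() shipped in code/rag/vectorstore.py; `toy_embed`, `TOPICS`, and the keyword lists are all assumptions.

```python
import numpy as np

# Hypothetical keyword lists, one per dimension: pets / finance / tech.
TOPICS = {
    "pets": {"dog", "dogs", "cat", "cats", "pet", "pets", "loyal"},
    "finance": {"stock", "stocks", "market", "bank", "rates"},
    "tech": {"python", "programming", "code", "software"},
}


def toy_embed(text):
    """Map text to a unit vector of per-topic keyword counts."""
    words = text.lower().split()
    counts = np.array(
        [sum(w in vocab for w in words) for vocab in TOPICS.values()],
        dtype=float,
    )
    norm = np.linalg.norm(counts)
    return counts / norm if norm > 0 else counts


pets = toy_embed("loyal pets to own")
tech = toy_embed("python programming")
print(pets, tech, float(pets @ tech))
```

A pure-pets sentence lands on the first axis and a pure-tech sentence on the third, so their dot product is zero. That is the same pattern as the 1.0 and 0.0 scores in the output above. A learned embedding model does the same job with hundreds of dimensions and no hand-written keyword lists.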
Why you need a real vector database
Our NumPy store works for thousands of vectors. It falls over at scale, and that's exactly the gap the tools fill:
| Our store | A vector database adds |
|---|---|
| M @ q scans every vector (O(n)) | ANN indexes (HNSW, IVF-PQ) → sub-linear search over millions |
| Lives in RAM, lost on restart | Persistence to disk |
| No filtering | Metadata filters combined with vector search |
| One process | Scaling, sharding, replication |
Don't be confused: exact vs. approximate search. Our argsort does exact nearest-neighbor search: perfect results, but O(n) per query. Vector databases use approximate nearest-neighbor (ANN) indexes that trade a tiny bit of recall for massive speed, finding the top-k in roughly O(log n) for graph indexes such as HNSW. For millions of vectors at interactive latency, approximate is the only practical option, and the HNSW and IVF-PQ sister books build those exact indexes from scratch. A vector DB is essentially our VectorStore with an ANN index, persistence, and filtering bolted on.
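Of the features a vector database adds, metadata filtering is the easiest to picture in NumPy: drop the non-matching items first, then run the same dot-product search on what is left. The sketch below does this "pre-filtering" with exact search. `filtered_search` and its `where` argument are hypothetical names, and real engines combine filters with their ANN index in more sophisticated ways.

```python
import numpy as np


def filtered_search(ids, vecs, meta, query_vec, where, k=3):
    """Exact top-k search restricted to items whose metadata matches every key in `where`."""
    keep = [i for i, m in enumerate(meta)
            if all(m.get(key) == val for key, val in where.items())]
    if not keep:
        return []                                    # nothing passed the filter
    M = np.vstack([vecs[i] for i in keep])           # only the surviving vectors
    sims = M @ np.asarray(query_vec, dtype=float)
    order = np.argsort(-sims)[:k]                    # best first, among survivors
    return [(ids[keep[j]], float(sims[j])) for j in order]


ids = ["d1", "d2", "d3"]
vecs = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
meta = [{"src": "docs"}, {"src": "blog"}, {"src": "docs"}]
print(filtered_search(ids, vecs, meta, [1.0, 0.0, 0.0], where={"src": "docs"}, k=2))
```

Note that d2 matches the query perfectly but never appears, because the filter removed it before scoring.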
The landscape
| Tool | What it is |
|---|---|
| Chroma | dead-simple, embedded; great for prototypes and local RAG |
| Qdrant | fast Rust engine, rich filtering; popular self-hosted choice |
| pgvector | a Postgres extension — vectors in your existing database |
| Pinecone / Weaviate / Milvus | managed/scalable vector DBs for production |
| FAISS | Meta's library — the index, no server (the HNSW/IVF-PQ algorithms) |
The same three calls — add, filter, search — exist in all of them, e.g. Chroma:
import chromadb
client = chromadb.Client()
col = client.create_collection("docs")
col.add(ids=["d1"], documents=["dogs are loyal pets"], metadatas=[{"src": "pets.md"}])
hits = col.query(query_texts=["faithful companion animals"], n_results=2)
Notice Chroma even calls the embedding model for you (documents= instead of raw
vectors) — convenience over our explicit embed(), same mechanics underneath.
The takeaway
A vector database stores embeddings and answers "what's most similar to this?" in one matmul-plus-top-k — which you just built in NumPy. Production tools (Chroma, Qdrant, pgvector, Pinecone) add approximate-nearest-neighbor indexes (the HNSW/IVF-PQ engines), persistence, and metadata filtering so it scales to millions. Embeddings come from a model; the store finds neighbors. This is the retrieval half of RAG — now let's wire it to an LLM and answer questions. 👉
Building a RAG service
Retrieval-Augmented Generation (RAG) is the dominant pattern for building on top of LLMs — and it's built entirely from tools you now have: a vector store (last chapter) for retrieval, and an LLM for generation. It lets a model answer from your private or current documents — data it was never trained on — and it sharply reduces hallucination, because the model quotes retrieved text instead of inventing facts.
Setup: retrieval runs on NumPy (verified output below); generation needs pip install anthropic and an ANTHROPIC_API_KEY. Code in code/rag/rag.py.
The pattern in five steps
1. chunk + embed your documents → vector store (offline, once)
2. embed the user's question (per query)
3. retrieve the top-k similar chunks (vector search)
4. stuff them into the prompt as context (prompt assembly)
5. ask the LLM to answer USING ONLY that context (grounded generation)
Steps 1–3 are vector search; steps 4–5 are one LLM call. That's the whole thing.
Steps 1–3: retrieve (you already built this)
We embed a small knowledge base into the VectorStore, then retrieve the chunks most
relevant to the question:
from rag.vectorstore import VectorStore, embed
def retrieve(store, question, k=2):
    hits = store.search(embed(question), k=k)
    return [h["metadata"]["text"] for h in hits if h["score"] > 0]
Steps 4–5: assemble the prompt and generate
The grounding happens in the prompt: we hand the model the retrieved context and instruct it to answer only from that context. This is the single most important line in a RAG system:
import anthropic
def answer(question, context):
    context_block = "\n".join(f"- {c}" for c in context)
    prompt = (
        "Answer the question using ONLY the context below. "
        "If the context doesn't contain the answer, say you don't know.\n\n"
        f"Context:\n{context_block}\n\nQuestion: {question}"
    )
    client = anthropic.Anthropic()   # reads ANTHROPIC_API_KEY from env
    resp = client.messages.create(
        model="claude-opus-4-8",
        max_tokens=512,
        messages=[{"role": "user", "content": prompt}],
    )
    return "".join(b.text for b in resp.content if b.type == "text")
"Answer using ONLY the context… if it's not there, say you don't know" is what keeps the model honest. Without it, the LLM falls back on its training data and may confidently make things up.
Running it
cd code
python rag/rag.py "how loyal are dogs?"
The retrieval step runs and prints real output (the generation step needs the SDK and an API key):
question : how loyal are dogs?
retrieved: ['dogs are extremely loyal pets and bond closely with their owners', 'cats are independent pets that groom themselves']
answer : [would send the assembled prompt to claude-opus-4-8]
The vector store surfaced the relevant chunk ("dogs are extremely loyal…"), which gets injected into the prompt. With the SDK installed, Claude returns a grounded answer like "According to the context, dogs are extremely loyal and bond closely with their owners." — sourced from your document, not its training data.
Why RAG instead of fine-tuning?
Don't be confused: RAG vs. fine-tuning. RAG adds knowledge — facts the model reads at query time, kept fresh by updating the vector store. Fine-tuning adds behavior — a tone, format, or skill baked into the weights. To make a model know your latest docs, use RAG (update a document → it's instantly retrievable); fine-tuning would require expensive retraining and still wouldn't cite sources. Most production "chat with your data" systems are RAG, not fine-tuned.
What makes RAG hard in production
The toy version is five steps; the production version is where the engineering lives — and these are the RAG interview questions:
- Chunking. How do you split documents? Too big → noisy, blows the context budget; too small → loses context. Chunk by paragraph/section with overlap.
- Retrieval quality. Pure vector search misses exact keywords; hybrid search (vector + keyword/BM25) plus a re-ranker (a model that re-scores the top candidates) is the standard upgrade.
- The context budget. You can only fit so many chunks in the prompt — retrieve more, then re-rank down to the best few.
- Evaluation. Did the answer actually use the right sources? Did it hallucinate? This is hard and essential — see LLM observability.
- Freshness. Re-embed and re-index documents as they change (a job for Celery or Prefect).
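The chunking advice above is easiest to see in code. This sketch splits on words rather than paragraphs, a simplification to keep it short, and repeats a few words between neighbouring chunks so that a sentence cut at a boundary still appears whole in one of them. `chunk_words` and its default sizes are assumptions for illustration.

```python
def chunk_words(text, size=50, overlap=10):
    """Split text into chunks of up to `size` words; each chunk repeats the last `overlap` words of the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break   # this chunk already reaches the end of the text
    return chunks


text = " ".join(str(i) for i in range(10))   # "0 1 2 ... 9"
for chunk in chunk_words(text, size=4, overlap=1):
    print(chunk)
```

Each chunk starts on the last word of the previous one. In a real pipeline you would usually split on paragraph or section boundaries first and fall back to a size limit like this only for oversized sections.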
The production architecture
A real RAG service stacks the tools from this whole book:
[ingest pipeline] docs → chunk → embed → vector DB (Prefect, Ch 9 / Ch 14)
[query path] FastAPI /ask (Ch 4) → embed → retrieve (Ch 14)
→ re-rank → prompt → Claude → grounded answer
↑ cache embeddings & answers (Redis, Ch 7)
↑ track cost, latency, quality (Ch 17)
Every box is a chapter. RAG is the capstone that ties the GenAI stack together.
The takeaway
RAG grounds an LLM in your documents: embed and store them, retrieve the chunks relevant to a question, inject them into the prompt, and instruct the model to answer only from that context. It adds knowledge (use it over fine-tuning for facts) and curbs hallucination. The toy is five steps; production hardens chunking, hybrid retrieval, re-ranking, and evaluation. Next: how the LLM itself gets served — the API vs. running your own. 👉
Serving & running LLMs
You don't train large language models — you call them. But "call them" splits into two very different worlds: rent an API (Claude, GPT, Gemini) or run an open-weight model yourself (Llama, Mistral, Qwen via Ollama or vLLM). This chapter maps both, with the client patterns you'll actually write, so you can reason about the cost/control trade-off every AI team faces.
Setup: the API client needs pip install anthropic + ANTHROPIC_API_KEY; self-hosting needs Ollama or vLLM. Follow-along: code and commands shown.
The two worlds
| | API model (Claude, GPT, Gemini) | Open-weight, self-hosted (Llama, Mistral) |
|---|---|---|
| You get | the strongest models, instantly | full control, privacy, fixed cost |
| You manage | nothing — rent by the token | GPUs, scaling, the serving stack |
| Cost | per token (can balloon at scale) | hardware (fixed, high upfront) |
| Privacy | data leaves your network | data stays in your VPC |
| Best for | most apps, fastest path | privacy-critical, high-volume, customization |
Don't be confused: API vs. self-hosting is a real architectural decision. It's not "API = beginner, self-host = pro." Renting an API is the right call for the majority of products — you get frontier quality with zero ops. Self-hosting wins when data can't leave your infrastructure, when token volume makes per-call pricing painful, or when you need to fine-tune the weights. Reason about it on cost, privacy, and control — not prestige.
World 1: calling an API (the common case)
The whole interface is one HTTP call. With the Anthropic SDK
(code/llm/claude_client.py):
import anthropic
client = anthropic.Anthropic() # reads ANTHROPIC_API_KEY from env
resp = client.messages.create(
model="claude-opus-4-8",
max_tokens=512,
messages=[{"role": "user", "content": "Summarize MLOps in one sentence."}],
)
print(resp.content[0].text)
Stream for responsiveness
For anything user-facing, stream the tokens as they're generated so the user sees output immediately instead of waiting for the whole response:
with client.messages.stream(
    model="claude-opus-4-8",
    max_tokens=512,
    messages=[{"role": "user", "content": "Write a haiku about deployment."}],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)
Streaming is also how you avoid request timeouts on long outputs — the connection
stays alive as tokens flow. This messages.stream / text_stream pattern is the one
you'll write most often behind a chat UI.
The serving concerns are their problem
When you rent an API, the hard parts of LLM serving — GPU memory, batching, scaling, uptime — are handled for you. You focus on prompts, retrieval, cost, and latency.
World 2: running your own (open-weight models)
When you need control, you serve an open-weight model yourself. Two tools dominate:
Ollama — the easy on-ramp
Ollama runs open models locally with one command — perfect for development, prototyping, and privacy-sensitive local apps:
ollama pull llama3.2 # download an open-weight model
ollama run llama3.2 # chat in the terminal
# it also serves an HTTP API on localhost:11434
import requests
r = requests.post("http://localhost:11434/api/generate",
json={"model": "llama3.2", "prompt": "Hello", "stream": False})
print(r.json()["response"])
vLLM — production-grade serving
vLLM is the high-throughput inference server for self-hosting at scale. Its key trick, continuous batching (packing many requests through the GPU together), gives far higher throughput than naive serving. It exposes an OpenAI-compatible API:
pip install vllm
vllm serve meta-llama/Llama-3.1-8B-Instruct # serves an OpenAI-style API
Other names you'll hear: TGI (Hugging Face Text Generation Inference), Triton (NVIDIA), TensorRT-LLM (fastest on NVIDIA GPUs), and llama.cpp (runs quantized models on CPUs/laptops).
Don't be confused: Ollama vs. vLLM. Ollama = easy, local, single-user, development and small apps. vLLM = production serving, high throughput, many concurrent users, GPU clusters. Prototype with Ollama; serve real traffic with vLLM (or a managed endpoint). Both run the same open-weight models — they're different engines, not different models.
Making self-hosted models affordable: quantization
A 70B model at 16-bit precision needs ~140 GB of GPU memory for the weights alone (70 billion parameters × 2 bytes), which means multiple expensive cards. Quantization (Chapter 10's ONNX cousin) stores weights in 4–8 bits instead of 16, shrinking the model 2–4× so it fits on one GPU (or a laptop), with modest quality loss. Formats you'll see: GGUF (llama.cpp/Ollama), AWQ, GPTQ. This is what makes self-hosting economically viable.
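The memory arithmetic is worth doing once by hand: parameters times bits per weight, divided by 8 to get bytes. A back-of-the-envelope sketch follows; `weight_memory_gb` is a hypothetical helper that counts weights only, ignoring the KV cache, activations, and runtime overhead that a real deployment also needs.

```python
def weight_memory_gb(params_billions, bits_per_weight):
    """Approximate memory for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


for bits in (16, 8, 4):
    print(f"70B model at {bits:>2}-bit: ~{weight_memory_gb(70, bits):.0f} GB")
```

Going from 16-bit to 4-bit weights takes the same 70B model from about 140 GB to about 35 GB, which is the difference between a multi-GPU server and a single large card.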
Choosing, in practice
Need frontier quality, fast, low ops? → API (Claude/GPT/Gemini)
Data must not leave your network? → self-host (vLLM)
Prototyping / local / single user? → Ollama
Huge token volume, cost-sensitive? → self-host + quantize, or batch API
Need to fine-tune the weights? → open-weight + LoRA, self-hosted
Many production systems are hybrid: an API model for the hard requests, a small self-hosted model for cheap high-volume ones, routed by difficulty.
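A hybrid setup needs some rule for deciding which requests are "hard". The sketch below uses a deliberately crude heuristic, prompt length plus a few marker phrases, to show the shape of a router. The tier labels, the marker list, and the 200-word cut-off are all made up for illustration; production routers often use a small classifier instead.

```python
def route(prompt, long_prompt_words=200,
          hard_markers=("prove", "analyze", "step by step")):
    """Pick a model tier using a crude difficulty heuristic (word count and marker phrases)."""
    text = prompt.lower()
    is_long = len(text.split()) > long_prompt_words
    looks_hard = any(marker in text for marker in hard_markers)
    return "api-frontier" if (is_long or looks_hard) else "self-hosted-small"


print(route("Is this review positive?"))
print(route("Analyze this contract step by step and list the risks."))
```

Whatever rule you choose, log the routing decision alongside cost and quality so you can tell whether the cheap path is actually good enough.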
The takeaway
Serving an LLM means one of two things: rent an API (one messages.create call;
stream for UX; serving is the provider's problem) or run an open-weight model yourself
with Ollama (easy/local) or vLLM (production throughput via continuous
batching), made affordable by quantization. Choose on cost, privacy, and control —
not prestige; hybrids are common. Whichever you pick, you must watch it — cost,
latency, and quality. That's LLM observability. 👉
LLM observability & evaluation
A classical model has metrics like accuracy; an LLM feature has cost, latency, and "was the answer any good?" — and that last one has no ground-truth label. LLM observability is monitoring (Chapter 12) adapted to this reality: track every call's tokens, cost, and latency, trace multi-step chains, and evaluate output quality without labels. Skip it and you'll get a surprise five-figure bill and a silently-degrading product.
Setup: the cost/token math runs with no dependencies (verified output below). Tracing and eval use hosted tools (Langfuse, LangSmith) or another LLM.
1. Track cost, tokens & latency on every call
Every LLM response reports its token usage; turn that into dollars and log it. From
code/llm/claude_client.py:
PRICING = {   # USD per 1,000,000 tokens (input, output)
    "claude-opus-4-8": (5.00, 25.00),
    "claude-sonnet-4-6": (3.00, 15.00),
    "claude-haiku-4-5": (1.00, 5.00),
}

def cost_usd(model, input_tokens, output_tokens):
    pin, pout = PRICING[model]
    return input_tokens/1e6 * pin + output_tokens/1e6 * pout
The eye-opener is what model choice does to the bill. A typical RAG call (~1,500 input tokens of context + question, ~300 output) across the model tiers:
$ python llm/claude_client.py
Output:
claude-opus-4-8 1500 in / 300 out -> $0.01500
claude-sonnet-4-6 1500 in / 300 out -> $0.00900
claude-haiku-4-5 1500 in / 300 out -> $0.00300
100k Opus calls/day -> $1,500.00/day ~$45,000/month
That projection is why you instrument cost from day one. The same workload is $45k/month on Opus, ~$27k on Sonnet, ~$9k on Haiku — a 5× swing from one config value. You can only make that trade-off if you're measuring per-call cost in production. (And prompt caching can cut the input cost of repeated context by ~90% — another lever you only find by watching the numbers.)
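The prompt-caching lever can be estimated with the same per-token arithmetic. The sketch below assumes cached input tokens are billed at a fixed discount (90% by default, matching the rough figure above) and ignores any fee for writing to the cache. Both are simplifying assumptions, so check your provider's pricing page for the real terms. `cost_with_cache_usd` is a hypothetical helper.

```python
def cost_with_cache_usd(input_tokens, cached_tokens, output_tokens,
                        input_price, output_price, cache_discount=0.9):
    """Cost of one call when `cached_tokens` of the input are read from a prompt cache.

    Prices are USD per 1,000,000 tokens. Cached tokens are billed at
    (1 - cache_discount) of the input price; cache-write fees are ignored.
    """
    fresh_tokens = input_tokens - cached_tokens
    cached_price = input_price * (1 - cache_discount)
    return (fresh_tokens / 1e6 * input_price
            + cached_tokens / 1e6 * cached_price
            + output_tokens / 1e6 * output_price)


# The RAG call from above at the Opus prices in PRICING (5.00 in / 25.00 out),
# with 1,200 of the 1,500 input tokens being repeated context.
no_cache = cost_with_cache_usd(1500, 0, 300, 5.00, 25.00)
with_cache = cost_with_cache_usd(1500, 1200, 300, 5.00, 25.00)
print(f"no cache: ${no_cache:.5f}   with cache: ${with_cache:.5f}")
```

Output tokens are untouched by caching, so the saving is bounded by how much of your bill is repeated input.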
The production version logs a structured record per call — not a print:
from dataclasses import dataclass

@dataclass
class CallRecord:
    model: str; input_tokens: int; output_tokens: int; cost: float; latency_s: float
Ship these to your metrics system and you get dashboards for cost per day, tokens per request, p95 latency, and error rate — the LLM's operational vitals.
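Those dashboard numbers are simple aggregates over the per-call records. Here is a minimal sketch of the roll-up, using a nearest-rank p95 and assuming a non-empty batch. It redefines a CallRecord with the same fields so the block runs on its own; `p95_latency` and `summarize` are hypothetical helpers.

```python
from dataclasses import dataclass


@dataclass
class CallRecord:
    model: str
    input_tokens: int
    output_tokens: int
    cost: float
    latency_s: float


def p95_latency(records):
    """Nearest-rank 95th percentile of latency, in seconds."""
    latencies = sorted(r.latency_s for r in records)
    rank = (95 * len(latencies) + 99) // 100   # ceil(0.95 * n) in integer math
    return latencies[rank - 1]


def summarize(records):
    """Aggregate a non-empty batch of call records into a few dashboard numbers."""
    return {
        "calls": len(records),
        "total_cost": sum(r.cost for r in records),
        "tokens_per_request": sum(r.input_tokens + r.output_tokens for r in records) / len(records),
        "p95_latency_s": p95_latency(records),
    }


records = [CallRecord("claude-opus-4-8", 1500, 300, 0.015, i / 10) for i in range(1, 21)]
print(summarize(records))
```

In production a metrics backend computes these for you; the point is that none of it is possible unless every call emits a record.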
2. Trace multi-step chains
A RAG request isn't one call — it's embed → retrieve → (re-rank) → generate, sometimes across several LLM calls. When the answer is wrong, which step failed? Tracing records the full tree of a request — inputs, outputs, tokens, and timing at each step — so you can see exactly where it went sideways.
Tools: Langfuse (open-source), LangSmith (LangChain's), Helicone, plus OpenTelemetry for generic spans. The pattern is a decorator or context manager around each step:
from langfuse.decorators import observe
@observe()
def retrieve(question): ... # each @observe step becomes a span in the trace
@observe()
def answer(question, context): ... # inputs, outputs, tokens, latency captured per span
Don't be confused: logging vs. tracing. A log is one event ("called Claude, 300 tokens"). A trace is the whole request tree — every step, nested, with timing and token counts at each node. For single calls, logs suffice; for RAG and agent chains, you need tracing to answer "the final answer was wrong — was it bad retrieval or bad generation?"
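To see what a tracing tool records, here is a toy tracer built from the standard library: each `with` block becomes a span that remembers its parent and how long it took. It sketches the concept only. The `Trace` class is hypothetical and bears no relation to Langfuse's internals, which also capture inputs, outputs, and token counts.

```python
import time
from contextlib import contextmanager


class Trace:
    """Collects nested spans for one request."""

    def __init__(self):
        self.spans = []     # finished spans, in completion order
        self._stack = []    # names of the spans that are currently open

    @contextmanager
    def span(self, name):
        parent = self._stack[-1] if self._stack else None
        self._stack.append(name)
        start = time.perf_counter()
        try:
            yield
        finally:
            self._stack.pop()
            self.spans.append({"name": name, "parent": parent,
                               "seconds": time.perf_counter() - start})


trace = Trace()
with trace.span("rag_request"):
    with trace.span("retrieve"):
        pass   # vector search would run here
    with trace.span("generate"):
        pass   # the LLM call would run here

for s in trace.spans:
    print(s["name"], "<-", s["parent"])
```

Child spans finish before their parent, so the root appears last. The parent links are what let a UI rebuild the request tree and show which step was slow or wrong.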
3. Evaluate quality — without labels
The hard part: there's no accuracy score for "summarize this" or "answer from these docs." Three approaches, in rising rigor:
LLM-as-judge
Use a strong LLM to grade outputs against a rubric. It scales where human review can't:
judge_prompt = (
"Score the ANSWER from 1-5 for how well it's supported by the CONTEXT.\n"
"Reply with only the number.\n\n"
f"CONTEXT:\n{context}\n\nANSWER:\n{answer}"
)
# send to claude-opus-4-8, parse the score, track it over time
This is how you measure RAG groundedness (did the answer stick to the retrieved context, or hallucinate?) at scale. Tools like Ragas and DeepEval package these metrics (faithfulness, answer-relevancy, context-precision).
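"Parse the score" deserves a little care, because a judge model does not always reply with a bare digit. A defensive sketch: take the first integer that falls inside the rubric's range and return None otherwise, so malformed replies are counted rather than silently scored. `parse_judge_score` is a hypothetical helper.

```python
import re


def parse_judge_score(reply, low=1, high=5):
    """Return the first integer in [low, high] found in the reply, or None if there is none."""
    for token in re.findall(r"\d+", reply):
        value = int(token)
        if low <= value <= high:
            return value
    return None


print(parse_judge_score("4"))
print(parse_judge_score("Score: 3/5, mostly supported."))
print(parse_judge_score("I cannot grade this answer."))
```

Track the rate of None results too: a sudden rise usually means the judge prompt or the judge model changed.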
Golden datasets
Keep a fixed set of representative inputs with known-good answers; run them on every prompt or model change and check for regressions — the LLM equivalent of a test suite (Chapter 18).
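A golden-set check can start very small: for each case, call the system and confirm that the phrases a good answer must contain are present. The sketch below uses a stand-in `generate` function in place of a real LLM call. `run_golden_set`, the case format, and the substring check are assumptions, and stricter setups score with an LLM judge instead.

```python
def run_golden_set(generate, golden_cases):
    """Return one failure record per case whose output is missing an expected phrase."""
    failures = []
    for case in golden_cases:
        output = generate(case["input"]).lower()
        missing = [p for p in case["must_contain"] if p.lower() not in output]
        if missing:
            failures.append({"input": case["input"], "missing": missing})
    return failures


golden = [
    {"input": "how loyal are dogs?", "must_contain": ["loyal"]},
    {"input": "what do cats do?", "must_contain": ["groom"]},
]


def fake_generate(question):
    """Stand-in for the real RAG pipeline, so the example runs offline."""
    return "Dogs are extremely loyal." if "dogs" in question else "I don't know."


print(run_golden_set(fake_generate, golden))
```

An empty list means no regressions. Wire the check into CI so a non-empty list fails the build.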
Human feedback
Collect 👍/👎 and corrections from real users. The highest-signal data — and the seed for fine-tuning or improving retrieval.
Don't be confused: offline eval vs. online monitoring. Offline eval runs your golden set against a new prompt/model before shipping — a gate, like CI. Online monitoring watches real production traffic (cost, latency, thumbs-down rate, judge scores on a sample) after shipping. You need both: eval to catch regressions before release, monitoring to catch drift after.
What to watch in production
| Signal | Why |
|---|---|
| Cost / day & tokens / request | budgets blow up silently; catch runaway prompts |
| p50 / p95 latency | LLMs are slow; the tail kills UX |
| Error & timeout rate | rate limits, overloads, malformed outputs |
| Groundedness / judge score (sampled) | is RAG still answering from sources? |
| Thumbs-down rate | the realest quality signal you have |
| Prompt-injection / safety flags | users will try to break it |
The takeaway
LLM observability = track cost/tokens/latency on every call (model choice is a 5× cost lever you must measure), trace multi-step chains to localize failures, and evaluate quality without labels via LLM-as-judge, golden datasets, and human feedback. Keep offline eval (a pre-ship gate) separate from online monitoring (post-ship drift). That completes the GenAI stack — now the engineering discipline that holds all of it together: testing and CI/CD. 👉
Testing & CI/CD
Every chapter so far added a capability; this one adds the discipline that keeps them from breaking. Testing proves your code and model work; CI/CD runs those tests automatically on every change and blocks anything broken from reaching production. This is the difference between "it worked when I tried it" and "it provably works, on every commit, forever."
Setup: pip install pytest (already a dependency). The tests in code/tests/ run live, and the output below is real.
ML needs more tests, not fewer
Regular software tests ask "does this function return the right value?" ML code needs those plus tests for the things that can silently rot:
| Test type | Asks |
|---|---|
| Unit | does this function work? (tokenizer, cost math) |
| Behavioral | does the model do its job? (positive text → positive) |
| Range/invariant | are outputs valid? (probabilities in [0, 1]) |
| Convergence | does training actually learn? (train accuracy > 0.9) |
| Artifact round-trip | save → load → identical predictions? |
| API contract | right status codes, schema, validation? |
Testing the model
From code/tests/test_model.py — note these are ML tests, not
just unit tests:
import pytest
from sentiment.data import load_dataset
from sentiment.model import SentimentModel
@pytest.fixture(scope="module")
def trained_model():
    return SentimentModel(epochs=300).fit(*load_dataset())

def test_positive_text_scores_high(trained_model):
    assert trained_model.predict_proba(["i love this it is great"])[0] > 0.5

def test_predictions_are_valid_probabilities(trained_model):
    probs = trained_model.predict_proba(["great", "terrible", "the cat"])
    assert ((probs >= 0.0) & (probs <= 1.0)).all()   # invariant: never out of range

def test_training_converges(trained_model):
    texts, labels = load_dataset()
    assert trained_model.accuracy(texts, labels) > 0.9   # it actually learns

def test_artifact_roundtrip(trained_model, tmp_path):
    path = tmp_path / "m.json"
    trained_model.save(str(path))
    reloaded = SentimentModel.load(str(path))
    assert reloaded.predict_proba(["this is great"])[0] == pytest.approx(
        trained_model.predict_proba(["this is great"])[0])
The @pytest.fixture trains the model once and shares it across tests (fast);
tmp_path is a pytest-provided temp directory (no cleanup needed). The round-trip test
is the one beginners skip and regret — a serialization bug silently ships a broken
model.
Testing the API
FastAPI's TestClient exercises the app in-process — no running server — so
endpoint tests are CI-friendly (code/tests/test_api.py):
from fastapi.testclient import TestClient
from api.main import app
client = TestClient(app)
def test_health_ok():
    assert client.get("/health").json()["status"] == "ok"

def test_empty_text_is_rejected():
    assert client.post("/predict", json={"text": ""}).status_code == 422   # validation
Running the suite
cd code
pytest -q
Output:
............ [100%]
12 passed in 0.76s
Twelve tests — model behavior, invariants, convergence, round-trip, and the API contract — in under a second. This is the gate. A change that breaks any of them fails here, before it ever reaches a user.
CI: run the tests automatically on every change
You won't remember to run pytest before every push — so a machine does it for you.
Continuous Integration runs your tests on every commit and pull request.
.github/workflows/ci.yml (GitHub Actions):
name: CI
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: "3.11", cache: pip }
      # httpx is required by FastAPI's TestClient
      - run: pip install numpy fastapi "uvicorn[standard]" pydantic httpx pytest ruff
      - run: ruff check .   # lint
      - run: pytest -q      # test
Now every pull request shows a green check ✅ or red ✗. Once a branch protection rule on main requires this check, a red CI blocks the merge, so broken code cannot reach main. That's the whole point: the gate is automatic and unskippable.
CD: deploy automatically once tests pass
Continuous Delivery/Deployment extends the pipeline: after tests pass on main,
build the Docker image (Chapter 5) and ship it.
  build-image:
    needs: test                          # ONLY if tests passed
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: docker build -t sentiment-api:${{ github.sha }} .
      # then: docker push <registry>/... and deploy (kubectl / cloud run / ...)
needs: test is the safety interlock — the image only builds if the tests pass.
Don't be confused: CI vs. CD. CI (Continuous Integration) = automatically test every change. CD = automatically deliver/deploy changes that pass. CI is the gate; CD is the conveyor belt after it. You adopt CI first (always), CD when you're confident the gate is strong enough to ship without a human in the loop.
What makes ML CI/CD special
Beyond code tests, mature ML pipelines add ML-specific gates (Chapter 9's quality gate, Chapter 17's eval):
- Model-quality gate — the newly trained model must beat the current production model on a holdout before it can deploy.
- Data validation — schema and distribution checks on incoming data (drift, Chapter 12) before training.
- Eval gate for LLM features — run the golden-set eval; block on regressions.
This is CI/CD/CT — Continuous Training — the loop from Chapter 0 made fully automatic.
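The model-quality gate in the list above comes down to a single comparison, which is why it is easy to automate. A minimal sketch, assuming both models were scored on the same holdout set; `should_promote` and the optional `min_gain` margin are hypothetical, and a real gate often checks several metrics.

```python
def should_promote(candidate_accuracy, production_accuracy, min_gain=0.0):
    """Promote only if the candidate beats production by more than `min_gain` on the same holdout."""
    return candidate_accuracy > production_accuracy + min_gain


print(should_promote(0.93, 0.91))                   # clearly better
print(should_promote(0.91, 0.91))                   # a tie is not an improvement
print(should_promote(0.915, 0.91, min_gain=0.01))   # better, but inside the noise margin
```

A small positive margin protects you from promoting a model that is "better" only by evaluation noise.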
The takeaway
Test the model (behavior, invariants, convergence, round-trip) and the API (contract, validation) — ML needs more tests than ordinary code, and our 12 run in under a second. CI runs them on every commit and blocks broken merges; CD ships what passes; ML adds model-quality, data-validation, and eval gates on top. Automated, unskippable gates are what let a team move fast without breaking production. One discipline left before we assemble everything: configuration, secrets, and security. 👉
Config, secrets & security
The fastest way to turn a working service into an incident is a hardcoded API key in a
public repo or an unprotected /predict endpoint. This chapter covers the
unglamorous-but-essential basics: load configuration the right way, keep secrets out
of your code, and put a lock on your API. None of it is hard — and skipping it is how
breaches and surprise bills happen.
Setup: pip install pydantic-settings; the auth demo uses FastAPI's TestClient. Output below is real. Code in code/config/.
Configuration: never hardcode
Anything that changes between environments — a database URL, the model path, a log level — is configuration, and it belongs in the environment, not the code. The 12-factor rule: config lives in environment variables, so the same image runs in dev, staging, and prod with different settings and no code change.
Pydantic Settings does this cleanly: declare your config as a typed class, and it
reads from environment variables (and a .env file), validates types, and fails fast.
From code/config/settings.py:
from pydantic import Field
from pydantic_settings import BaseSettings, SettingsConfigDict

class Settings(BaseSettings):
    model_config = SettingsConfigDict(env_file=".env", extra="ignore")
    model_path: str = "model.json"
    redis_url: str = "redis://localhost:6379/0"
    max_batch_size: int = Field(default=128, ge=1, le=10_000)  # validated range
    api_key: str = Field(default="", repr=False)               # secret: hidden from logs
$ MAX_BATCH_SIZE=256 API_KEY=secret123 python config/settings.py
Output:
model_path : model.json
redis_url : redis://localhost:6379/0
max_batch_size: 256
api_key set? : True (value never printed)
MAX_BATCH_SIZE=256 was read and type-coerced to an int; an out-of-range or
non-numeric value would fail at startup, not mid-request. And repr=False keeps the
secret out of logs and tracebacks. Validated config that fails fast beats a typo
discovered in production.
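To make "fail fast" concrete, here is a stdlib-only sketch of the same idea. It is not how Pydantic Settings is implemented; `AppSettings`, `load_settings`, and `ConfigError` are names invented for this illustration, and the fields and range simply mirror the `Settings` class above.

```python
import os
from dataclasses import dataclass, field


class ConfigError(ValueError):
    """Raised at startup when an environment value is invalid."""


@dataclass(frozen=True)
class AppSettings:
    model_path: str = "model.json"
    max_batch_size: int = 128
    api_key: str = field(default="", repr=False)  # left out of repr(), so out of logs


def load_settings(env=None):
    """Build AppSettings from an environment mapping, validating as we go."""
    env = os.environ if env is None else env
    raw = env.get("MAX_BATCH_SIZE", "128")
    try:
        batch = int(raw)  # type coercion: "256" -> 256
    except ValueError:
        raise ConfigError(f"MAX_BATCH_SIZE must be an integer, got {raw!r}") from None
    if not 1 <= batch <= 10_000:
        raise ConfigError(f"MAX_BATCH_SIZE must be in 1..10000, got {batch}")
    return AppSettings(
        model_path=env.get("MODEL_PATH", "model.json"),
        max_batch_size=batch,
        api_key=env.get("API_KEY", ""),
    )


settings = load_settings({"MAX_BATCH_SIZE": "256", "API_KEY": "secret123"})
print(settings)  # the api_key field does not appear in the printed repr
```

Calling `load_settings()` once at import time means a bad value stops the process before it accepts any traffic, which is the behavior the typed settings class gives you for free.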
Secrets: keep them out of the code
Don't be confused: configuration vs. secrets. Both come from the environment, but a secret (API key, DB password, token) is sensitive: it must never be committed, logged, or printed. Config like a log level can live in plain docker-compose.yml; a secret cannot.
The rules, in order of importance:
- Never commit secrets. No keys in source. Add .env to .gitignore (the project does). One leaked key in git history is a breach, and git remembers forever.
- Load from the environment (or a .env file locally that's gitignored).
- In production, use a secrets manager: AWS Secrets Manager, GCP Secret Manager, HashiCorp Vault, or Kubernetes Secrets. These store, rotate, and audit access to secrets; your app fetches them at startup.
- Rotate on exposure. If a key leaks, revoke and reissue it immediately — which is easy only if it was never hardcoded.
# .env (gitignored — never committed)
API_KEY=sk-your-real-key
REDIS_URL=redis://prod-redis:6379/0
The classic mistake: committing a .env with real keys, or pasting a key into a notebook you later push. Scan your repo with tools like gitleaks or trufflehog; many CI pipelines (Chapter 18) run a secret-scan step to block leaks automatically.
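To show roughly what such a scan does, here is a toy version in plain Python. The two patterns and the `scan_text` helper are assumptions for illustration only; real scanners like gitleaks and trufflehog ship many tuned rules and also walk git history, which this sketch does not.

```python
import re

# Illustrative patterns only; not a substitute for a real secret scanner.
PATTERNS = {
    "generic key assignment": re.compile(
        r"\b(api[_-]?key|secret|token)\b\s*[=:]\s*['\"]?[A-Za-z0-9_-]{16,}",
        re.IGNORECASE,
    ),
    "sk- style key": re.compile(r"\bsk-[A-Za-z0-9_-]{16,}"),
}


def scan_text(text):
    """Return (line_number, rule_name) for every line that looks like a secret."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                hits.append((lineno, name))
    return hits


sample = 'LOG_LEVEL=info\nAPI_KEY=sk-abcdefghijklmnop1234\nname = "demo"\n'
for lineno, rule in scan_text(sample):
    print(f"line {lineno}: possible secret ({rule})")
```

A CI step built on this idea would exit non-zero whenever `scan_text` returns any hits, so the commit is blocked before the key reaches a shared branch.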
Securing the API
Your model endpoint is on the internet — someone will find it. At minimum, it needs authentication (who are you?) and rate limiting (how often can you call?).
Authentication: require an API key
A FastAPI dependency checks a header on every protected route — one function, applied everywhere:
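from fastapi import FastAPI, Depends, HTTPException, Header

app = FastAPI()
# settings is the Settings() instance built from the class in config/settings.py

def require_key(x_api_key: str = Header(default="")):
    # An unset API_KEY defaults to "", which would match a missing header;
    # reject that case explicitly so the route is never left open.
    if not settings.api_key or x_api_key != settings.api_key:
        raise HTTPException(status_code=401, detail="invalid or missing API key")

@app.get("/secure")
def secure(_=Depends(require_key)):
    return {"ok": True}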
Verified end to end (via TestClient):
no key -> 401
bad key -> 401
good key -> 200 {'ok': True}
No key or a wrong key → 401 Unauthorized; the right key → 200. The endpoint is now locked. (For real user systems you'd graduate to OAuth2 / JWT tokens — FastAPI has first-class support — but a checked API key is the right baseline for service-to-service calls.)
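One optional hardening step: comparing secrets with `!=` can stop at the first differing character, which in principle leaks timing information. The standard library's `hmac.compare_digest` is designed to avoid that content-based short circuit. The helper name `key_is_valid` is hypothetical; it is a sketch of the comparison you could call from inside `require_key`, not code from the book's repository.

```python
import hmac


def key_is_valid(presented: str, expected: str) -> bool:
    """Compare API keys without content-based short-circuiting.

    Also refuses everything when no key is configured (expected == ""),
    so a missing API_KEY can never leave the endpoint open.
    """
    if not expected:
        return False
    return hmac.compare_digest(presented.encode("utf-8"), expected.encode("utf-8"))


print(key_is_valid("secret123", "secret123"))  # True
print(key_is_valid("", "secret123"))           # False
print(key_is_valid("", ""))                    # False
```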
Rate limiting: cap abuse
Without a limit, one client (or one bug, or one attacker) can flood your service or run up a huge LLM bill. The Redis atomic-counter pattern from Chapter 7 caps requests per client per window:
# r is the Redis client from Chapter 7
def allow(user_id, limit=100, window=60):
    n = r.incr(f"rate:{user_id}")             # atomic: safe across workers
    if n == 1:
        r.expire(f"rate:{user_id}", window)   # first hit starts the window
    return n <= limit
Reject with 429 Too Many Requests once the limit is hit. (Libraries like slowapi wire this into FastAPI for you.)
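To experiment with the fixed-window logic without a Redis server, the sketch below keeps the counters in a dictionary. `FixedWindowLimiter`, `status_for`, and the injectable `clock` are assumptions for this example. In production the counter belongs in Redis so that every API replica shares the same count; this in-memory stand-in only works inside a single process.

```python
import time


class FixedWindowLimiter:
    """In-memory stand-in for the Redis INCR + EXPIRE pattern (single process only)."""

    def __init__(self, limit=100, window=60, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock
        self._counters = {}  # user_id -> (window_start, count)

    def allow(self, user_id):
        now = self.clock()
        start, count = self._counters.get(user_id, (now, 0))
        if now - start >= self.window:  # window expired: start a fresh one
            start, count = now, 0
        count += 1
        self._counters[user_id] = (start, count)
        return count <= self.limit


def status_for(limiter, user_id):
    """HTTP status an endpoint would return: 200 if allowed, 429 if over the limit."""
    return 200 if limiter.allow(user_id) else 429


fake_now = [0.0]  # a controllable clock makes the window easy to demonstrate
limiter = FixedWindowLimiter(limit=3, window=60, clock=lambda: fake_now[0])
print([status_for(limiter, "alice") for _ in range(4)])  # fourth call is rejected
fake_now[0] = 61.0
print(status_for(limiter, "alice"))                      # new window, allowed again
```

Passing the clock in as a parameter is a design choice for testability: the window can be advanced instantly instead of sleeping for a minute.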
The security baseline checklist
For any service that goes live:
- No secrets in code or git history (gitignore .env, scan in CI)
- Config & secrets from the environment / a secrets manager, validated at startup
- Authentication on every non-public endpoint (API key → OAuth2/JWT)
- Rate limiting to cap abuse and runaway cost
- Input validation (Pydantic — Chapter 4 — rejects junk before it runs)
- HTTPS only (terminate TLS at the load balancer / gateway)
- Least privilege — the service's credentials can do only what it needs
- Dependency scanning (pip-audit, Dependabot) for known CVEs
- Don't log secrets or full payloads (PII, keys)
You don't need all of it on day one, but you need this list in your head before exposing a model to the internet.
The takeaway
Load config and secrets from the environment with typed, fail-fast validation
(Pydantic Settings); never commit secrets — gitignore .env, use a secrets manager
in prod, rotate on leak. Lock your API with authentication (a checked key → 401/200)
and rate limiting (429), on top of input validation and HTTPS. None of it is hard;
all of it is the difference between a demo and a service you can trust in production.
Now let's assemble every tool in this book into one running system. 👉
The full stack with docker-compose
You've wrapped one model in a dozen tools, one chapter at a time. In production they run together: the API, the worker, Redis, and the tracking server, all at once, talking to each other. Starting four services by hand in four terminals is painful and fragile. docker-compose defines the whole system in one file and launches it with one command.
Setup: Docker with Compose (bundled in Docker Desktop). Follow-along.
The whole system, declared
docker-compose.yml describes every service and how they
connect:
services:
  redis:                          # broker + cache
    image: redis:7-alpine
    ports: ["6379:6379"]
  api:                            # the FastAPI model service
    build: .
    ports: ["8000:8000"]
    environment: [REDIS_URL=redis://redis:6379/0]
    depends_on: [redis]
  worker:                         # Celery background worker
    build: .
    command: celery -A tasks.celery_app worker --loglevel=info
    environment: [REDIS_URL=redis://redis:6379/0]
    depends_on: [redis]
  mlflow:                         # experiment tracking / registry UI
    image: ghcr.io/mlflow/mlflow:latest
    command: mlflow server --host 0.0.0.0 --port 5000 --backend-store-uri sqlite:////mlflow/mlflow.db
    ports: ["5000:5000"]
    volumes: [mlflow-data:/mlflow]
volumes:
  mlflow-data:
Four services, each from earlier chapters, now wired into one system. Read it top to bottom and you can see the whole architecture at a glance — which is itself a benefit.
One command to rule them all
cd code
docker compose up --build
Expected output:
[+] Running 5/5
✔ Network code_default Created
✔ Container code-redis-1 Started
✔ Container code-mlflow-1 Started
✔ Container code-api-1 Started
✔ Container code-worker-1 Started
api-1 | INFO: Uvicorn running on http://0.0.0.0:8000
worker-1 | celery@... ready.
mlflow-1 | Listening at: http://0.0.0.0:5000
The entire stack is live:
- http://localhost:8000/docs — the prediction API
- http://localhost:5000 — the MLflow UI
- the worker consuming background jobs from Redis
- Redis brokering and caching
Stop it all with one command:
docker compose down
How the services find each other
Notice REDIS_URL=redis://redis:6379/0 — the API reaches Redis by the service name
redis, not an IP. Compose creates a private network where each service is reachable by
its name. This is the key idea: services address each other by name, not address,
so the same compose file works on any machine without editing IPs.
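On the application side this means the code never contains a host name at all; it only parses whatever `REDIS_URL` it is given. A small sketch with the standard library, where `redis_target` is a hypothetical helper and the `localhost` fallback mirrors the default in the settings class from the previous chapter:

```python
import os
from urllib.parse import urlsplit


def redis_target(env=None):
    """Split REDIS_URL into (host, port, db); falls back to the local default."""
    env = os.environ if env is None else env
    url = urlsplit(env.get("REDIS_URL", "redis://localhost:6379/0"))
    db = int(url.path.lstrip("/") or 0)
    return url.hostname, url.port or 6379, db


# Inside Compose the host is the service name; on a laptop it is localhost.
print(redis_target({"REDIS_URL": "redis://redis:6379/0"}))
print(redis_target({}))
```

The same code runs unchanged in both places; only the environment variable differs.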
Don't be confused: depends_on waits for start, not ready. depends_on: [redis] makes Compose start Redis before the API container, but it doesn't wait for Redis to be accepting connections. A service that crashes because its dependency isn't ready yet needs a real healthcheck (like the one in our Dockerfile) plus retry-on-connect logic. "Started" ≠ "ready" is a classic compose gotcha.
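Retry-on-connect can be as small as the sketch below, which polls a TCP port with exponential backoff. `wait_for_port` and its parameters are assumptions for this example, and a successful TCP connect is only a rough proxy for "ready": a service can accept connections while it is still loading data.

```python
import socket
import time


def wait_for_port(host, port, timeout=30.0, first_delay=0.1, max_delay=2.0):
    """Retry a TCP connect with exponential backoff until it succeeds or time runs out."""
    deadline = time.monotonic() + timeout
    delay = first_delay
    while True:
        try:
            with socket.create_connection((host, port), timeout=1.0):
                return True   # something is accepting connections
        except OSError:
            if time.monotonic() + delay > deadline:
                return False  # give up; the caller decides whether to crash
            time.sleep(delay)
            delay = min(delay * 2, max_delay)
```

At startup the API or worker could call `wait_for_port("redis", 6379)` before creating its Redis client, and exit with a clear error message if it returns `False`.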
docker-compose vs. Kubernetes
Don't be confused: compose vs. Kubernetes. docker-compose runs multiple containers on one machine — perfect for local development, CI, and small deployments. Kubernetes (K8s) runs containers across a cluster of machines with auto-scaling, self-healing, rolling updates, and load balancing — the standard for production at scale. The good news: a compose file maps conceptually onto K8s manifests, so what you learn here transfers. Start with compose; graduate to K8s when one machine isn't enough.
The complete lifecycle, assembled
Step back and look at what you've built across the book — the entire production loop from Chapter 0, now real:
data ─► train ─► track (MLflow) ─► register ─► serve (FastAPI) ─► package (Docker)
─► scale (Celery+Redis) ─► version data (DVC) ─► orchestrate (Prefect)
─► optimize (ONNX) ─► demo (Streamlit) ─► monitor ─┐
▲ │
└──────────────── retrain on drift ◄────────────────┘
GenAI stack: vector DB ─► RAG service ─► LLM serving ─► LLM observability
Engineering: testing & CI/CD · config, secrets & security
… and run it all with one command (docker-compose)
Every box is a tool you can now use. Swap our tiny model for a real one and nothing about the tooling changes — that was the whole point of keeping the model trivial.
A production-readiness checklist
Before any model goes live, walk this list (each item maps to a chapter):
- Experiments tracked and reproducible (MLflow, DVC)
- Model versioned in a registry with a @production alias
- Served behind a validated API with a /health check
- Containerized; image in a registry; runs as non-root
- Heavy work offloaded to a queue; hot paths cached
- Retraining orchestrated and gated on quality
- Inference optimized (ONNX/quantization) if latency matters
- Monitoring for drift and operational metrics, with alerts
- Tests passing in CI; deploys gated on green (Chapter 18)
- Secrets out of code; API authenticated & rate-limited (Chapter 19)
- For LLM features: cost/latency tracked, eval gate, grounded RAG (Chapter 17)
- A rollback plan (move the alias back)
The takeaway
docker-compose declares your whole multi-service system — API, worker, Redis, MLflow —
in one file and launches it with docker compose up; services find each other by name
on a private network, and you graduate to Kubernetes when one machine isn't enough.
You've now assembled the complete production loop: track, register, serve, package,
scale, version, orchestrate, optimize, demo, monitor, and retrain. That's MLOps — and
you can do it. Go ship something. 👉
References
The official docs are the best source for every tool here — they're well-written and current. This page points you to them, plus the books that go deeper.
Tool documentation
- MLflow — mlflow.org/docs. Tracking, the Model Registry, pyfunc, autolog (Ch 2–Ch 3).
- FastAPI — fastapi.tiangolo.com. The tutorial is outstanding (Ch 4).
- Pydantic — docs.pydantic.dev. Validation and settings.
- Uvicorn — uvicorn.org. The ASGI server.
- Docker — docs.docker.com; Get Started + the Dockerfile best-practices guide (Ch 5).
- Celery — docs.celeryq.dev (Ch 6).
- Redis — redis.io/docs; the commands reference (Ch 7).
- DVC — dvc.org/doc (Ch 8).
- Prefect — docs.prefect.io (Ch 9).
- ONNX / onnxruntime — onnx.ai and onnxruntime.ai (Ch 10).
- Streamlit — docs.streamlit.io (Ch 11).
- Gradio — gradio.app/docs (Ch 11).
- Evidently — docs.evidentlyai.com; drift and quality reports (Ch 12).
- Vector databases — Chroma (trychroma.com), Qdrant (qdrant.tech), pgvector, Pinecone (Ch 14).
- Anthropic Claude API — docs.claude.com; the Messages API and Python SDK used in the RAG, serving, and observability chapters (Ch 15–Ch 17).
- Ollama / vLLM — ollama.com, docs.vllm.ai; running open-weight models (Ch 16).
- Langfuse / Ragas — langfuse.com, docs.ragas.io; LLM tracing & evaluation (Ch 17).
- pytest / GitHub Actions — docs.pytest.org, docs.github.com/actions (Ch 18).
- Pydantic Settings — docs.pydantic.dev/latest/concepts/pydantic_settings (Ch 19).
Going deeper
- Chip Huyen. Designing Machine Learning Systems. O'Reilly, 2022. — The best single book on the production ML lifecycle; complements every chapter here.
- Noah Gift et al. Practical MLOps. O'Reilly, 2021. — Hands-on cloud MLOps.
- Google Cloud. MLOps: Continuous delivery and automation pipelines in machine learning. — The widely-cited MLOps maturity-levels white paper.
- Martin Kleppmann. Designing Data-Intensive Applications. O'Reilly, 2017. — The systems foundations (queues, caching, storage) under all of this.
The alternatives, by category
So you recognize them on a job description:
| Category | This book | Common alternatives |
|---|---|---|
| Experiment tracking | MLflow | Weights & Biases, Neptune, Comet |
| Model serving | FastAPI | BentoML, TorchServe, Triton, KServe |
| Task queue | Celery | RQ, Dramatiq, Arq, AWS SQS |
| Orchestration | Prefect | Airflow, Dagster, Kubeflow |
| Data versioning | DVC | lakeFS, Delta Lake, Git LFS |
| Inference runtime | ONNX | TensorRT, OpenVINO, TorchScript |
| Demo UI | Streamlit / Gradio | Dash, Panel |
| Monitoring | Evidently | WhyLabs, Arize, Fiddler |
| Vector database | Chroma / Qdrant | pgvector, Pinecone, Weaviate, Milvus |
| LLM API | Claude | GPT (OpenAI), Gemini (Google) |
| Self-hosted LLM serving | vLLM / Ollama | TGI, Triton, TensorRT-LLM, llama.cpp |
| LLM observability / eval | Langfuse / Ragas | LangSmith, Helicone, DeepEval |
| CI/CD | GitHub Actions | GitLab CI, Jenkins, CircleCI |
| Secrets management | env + Pydantic Settings | Vault, AWS/GCP Secrets Manager |
| Container orchestration | docker-compose | Kubernetes, Nomad, ECS |
Sister books in this series
- AI Foundations in Depth — the concepts behind the models you're deploying (and a cloud/MLOps overview chapter that maps the wider landscape).
- HNSW and IVF & Product Quantization — the vector-search engines inside vector databases and feature stores.
- Recommendation Systems from Scratch — a production capstone using MLflow, FastAPI, and a React frontend end to end.
This book's code
Everything lives in code/ and runs standalone:
sentiment/ (the model + MLflow), api/ (FastAPI), tasks/ (Celery), pipeline/
(Prefect), serving/ (ONNX + drift), streamlit_app.py, plus the Dockerfile,
docker-compose.yml, and Makefile. Only NumPy is required for the core model; each
chapter installs its own tool.