Cloud & MLOps

A model in a notebook helps no one. MLOps — "DevOps for machine learning" — is the discipline of getting models into production and keeping them working: reproducible training, automated deployment, monitoring, and retraining. This chapter maps the cloud platforms and the MLOps lifecycle so that when an interviewer asks "how would you deploy and maintain this?", you have a real answer.

The companion book Production ML & AI Tools builds many of these tools hands-on (MLflow, FastAPI, Docker, Celery, monitoring). This chapter is the conceptual map.

Why "just deploy it" is hard

Software has code; ML has code + data + model, and all three drift. A deployed model silently rots as the world changes (yesterday's spam isn't today's). MLOps exists to manage three moving parts at once — which is why it's harder than ordinary DevOps and why it's a whole job title now.

The big three cloud providers

You'll work on at least one. They mirror each other:

Capability	AWS	GCP	Azure
Managed ML platform	SageMaker	Vertex AI	Azure ML
Object storage (data/models)	S3	GCS	Blob Storage
Compute (VMs, GPUs)	EC2	Compute Engine	Virtual Machines
Serverless functions	Lambda	Cloud Functions	Azure Functions
Managed Kubernetes (K8s)	EKS	GKE	AKS
Data warehouse (SQL at scale)	Redshift	BigQuery	Synapse Analytics

Don't memorize all of it. Know the categories — storage, compute, a managed ML platform, a data warehouse, container orchestration — and that every cloud has one of each. The managed platforms (SageMaker / Vertex / Azure ML) bundle training, tuning, registry, and serving so you don't assemble it yourself.

Compute: CPU, GPU, TPU

CPU — data prep, classical ML, light inference.
GPU — the workhorse for deep-learning training and heavy inference (parallel matrix multiplies — Chapter 16). NVIDIA dominates; "A100/H100" are the names to know.
TPU — Google's custom ML accelerators, available on GCP and used mainly for large-scale training.

Don't be confused: training vs. inference hardware. Training is bursty and expensive — rent big GPUs, run, release. Inference is continuous — optimize for cost and latency (quantize the model, batch requests, often run on cheaper GPUs or CPUs). Treating them the same way wastes a fortune. This cost framing impresses interviewers.
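
To make the cost framing concrete, here is a back-of-the-envelope sketch. Every hourly rate, instance count, and duration below is a made-up placeholder, not a real cloud price; the point is only the shape of the comparison between a job you pay for once and a fleet you pay for around the clock.

```python
def training_cost(gpu_hourly_rate: float, num_gpus: int, hours: float) -> float:
    """Cost of a bursty training job: you pay only while the job runs."""
    return gpu_hourly_rate * num_gpus * hours


def monthly_inference_cost(instance_hourly_rate: float, num_instances: int,
                           hours_per_month: float = 730.0) -> float:
    """Cost of an always-on serving fleet for one month (about 730 hours)."""
    return instance_hourly_rate * num_instances * hours_per_month


# Hypothetical numbers for illustration only.
train = training_cost(gpu_hourly_rate=30.0, num_gpus=8, hours=12)
serve_big = monthly_inference_cost(instance_hourly_rate=30.0, num_instances=2)
serve_small = monthly_inference_cost(instance_hourly_rate=1.5, num_instances=2)

print(f"one training run:         ${train:,.0f}")
print(f"serving on training GPUs: ${serve_big:,.0f}/month")
print(f"serving on cheaper boxes: ${serve_small:,.0f}/month")
```

With these placeholder rates, leaving training-class hardware running for inference costs far more per month than the training run itself, which is the argument for sizing the two workloads separately.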

Containers: Docker & Kubernetes

Docker — package your code + dependencies + model into an image that runs identically everywhere. The unit of deployment; kills "works on my machine."
Kubernetes (K8s) — orchestrates many containers across many machines: scaling, restarts, rolling updates, load balancing. The production standard for serving at scale.

You don't need to be a K8s expert, but you must know what Docker is and why (covered hands-on in the tools book), and that K8s is how containers run at scale.

The MLOps lifecycle

The loop a production model lives in:

   data  ─►  train  ─►  evaluate  ─►  register  ─►  deploy  ─►  monitor
     ▲                                                            │
     └──────────────  retrain when it drifts  ◄──────────────────┘

Each stage has a tooling category:

Data versioning — track which data trained which model (DVC, lakeFS, Delta Lake). Reproducibility requires it.
Experiment tracking — log params/metrics/artifacts per run (MLflow, Weights & Biases — Chapter 16).
Model registry — versioned models with stages (staging → production), so you can roll back (MLflow Registry, SageMaker Model Registry).
Pipeline orchestration — schedule and chain steps (Airflow, Prefect, Dagster, Kubeflow). "Retrain every Sunday on fresh data" lives here.
CI/CD for ML — automated testing + deployment, plus model-specific gates (does the new model beat the old on a holdout before it ships?).
Serving — expose predictions via an API (FastAPI, BentoML, TorchServe, Triton, or a managed endpoint).
Monitoring — track latency, errors, and model quality/drift (see the drift section below, Chapter 16, and the tools book).
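
The model-specific gate mentioned under CI/CD can be sketched in a few lines. This is a minimal illustration, not any particular tool's API: `EvalResult`, `should_promote`, the accuracy figures, and the `min_improvement` margin are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    model_version: str
    holdout_accuracy: float


def should_promote(candidate: EvalResult, production: EvalResult,
                   min_improvement: float = 0.005) -> bool:
    """Return True only if the candidate beats production by a clear margin."""
    return candidate.holdout_accuracy >= production.holdout_accuracy + min_improvement


# Hypothetical evaluation results on the same holdout set.
production = EvalResult("v7", holdout_accuracy=0.912)
candidate = EvalResult("v8", holdout_accuracy=0.921)

if should_promote(candidate, production):
    print(f"promote {candidate.model_version} to production")
else:
    print(f"keep {production.model_version}; candidate did not clear the gate")
```

In a real pipeline a check like this would run as a CI step and fail the build when it returns False. The margin guards against shipping a model that is "better" only by noise; what margin is reasonable depends on the size of your holdout set.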

Serving patterns

Online / real-time — a REST/gRPC endpoint answers one request at a time (a recommendation as the page loads). Optimize latency.
Batch — score millions of rows on a schedule (nightly churn scores written to a table). Optimize throughput.
Streaming — score events as they flow (fraud on each transaction) via Kafka/Flink.

Don't be confused: online vs. batch serving. Online = low-latency, one-at-a-time, always-on infrastructure (and a feature store to fetch features fast). Batch = high-throughput, scheduled, cheaper, simpler. Many "we need real-time ML!" requirements are actually fine as batch — and saying so in a design interview shows maturity.
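
The difference between the two paths is easy to see in code. The sketch below uses a hand-written stand-in for the model (`predict_one`, with invented coefficients) and hypothetical feature names; what matters is that the online handler takes one row and returns immediately, while the batch scorer walks a large table in chunks.

```python
from typing import Dict, Iterable, Iterator, List

Row = Dict[str, float]


def predict_one(row: Row) -> float:
    """Stand-in model: a hypothetical hand-written churn score in [0, 1]."""
    score = 0.1 + 0.02 * row["days_inactive"] - 0.01 * row["orders_last_30d"]
    return min(1.0, max(0.0, score))


def handle_request(row: Row) -> Dict[str, float]:
    """Online path: one row in, one answer out, as fast as possible."""
    return {"churn_score": predict_one(row)}


def score_in_batches(rows: Iterable[Row], batch_size: int = 1000) -> Iterator[List[float]]:
    """Batch path: walk a large table in chunks and yield the scores per chunk."""
    batch: List[Row] = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield [predict_one(r) for r in batch]
            batch = []
    if batch:  # final partial chunk
        yield [predict_one(r) for r in batch]


# Toy table standing in for a nightly export.
table = [{"days_inactive": float(d), "orders_last_30d": 2.0} for d in range(5)]

print(handle_request(table[0]))
nightly_scores = [s for chunk in score_in_batches(table, batch_size=2) for s in chunk]
print(len(nightly_scores))
```

In practice `handle_request` would sit behind a web framework and `score_in_batches` inside a scheduled job that writes to a table. Both call the same model, so moving a use case from online to batch is mostly an infrastructure decision.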

Feature stores

A feature store (Feast, Tecton, SageMaker Feature Store) is a central place to define, store, and serve features so that training and serving use the exact same computation. It solves training/serving skew — the silent killer where a feature is computed one way in your training notebook and a subtly different way in production, quietly wrecking accuracy. Know the term and the problem it solves.
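
The core idea can be shown without any feature-store product: define each feature once and call that single definition from both the training path and the serving path. The sketch below is not how Feast or Tecton are implemented; the function names, the feature, and the dates are hypothetical.

```python
from datetime import date


def days_since_last_order(last_order: date, as_of: date) -> int:
    """Single definition of the feature, shared by training and serving."""
    return (as_of - last_order).days


def build_training_row(last_order: date, snapshot_date: date, churned: bool) -> dict:
    # Training path: compute the feature as of the historical snapshot date.
    return {
        "days_since_last_order": days_since_last_order(last_order, snapshot_date),
        "label": int(churned),
    }


def build_serving_row(last_order: date, today: date) -> dict:
    # Serving path: the same function, evaluated as of the request date.
    return {"days_since_last_order": days_since_last_order(last_order, today)}


train_row = build_training_row(date(2024, 1, 10), date(2024, 1, 31), churned=True)
serve_row = build_serving_row(date(2024, 1, 10), date(2024, 1, 31))
print(train_row, serve_row)
```

Skew creeps in when the serving team reimplements the feature, for example by counting calendar days in a different time zone or including the current day. A feature store enforces the "one definition" rule at the platform level instead of relying on discipline.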

Monitoring & drift — the part everyone forgets

A deployed model degrades silently. You must watch:

Operational metrics — latency, throughput, error rate, cost (standard ops).
Data drift — the input distribution shifts from training (new user behavior, a changed upstream feature).
Concept drift — the relationship between inputs and the target changes (fraud patterns evolve).
Prediction/label monitoring — are outputs and (eventually) real outcomes still in the expected range?

Tools: Evidently, WhyLabs, Arize, plus Prometheus/Grafana for ops. The trigger to retrain is drift, not the calendar — though scheduled retraining is a common simple proxy.
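
One common way to put a number on data drift for a single feature is the Population Stability Index (PSI), which compares the binned distribution of live inputs against the training distribution. The sketch below is a minimal standard-library version; the bin edges and toy samples are invented, and the monitoring tools named above do considerably more than this.

```python
import math
from typing import List, Sequence


def bin_fractions(values: Sequence[float], edges: Sequence[float]) -> List[float]:
    """Fraction of values falling in each bin defined by sorted inner edges."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        idx = sum(1 for e in edges if v >= e)  # index of the bin containing v
        counts[idx] += 1
    eps = 1e-6  # floor that avoids log(0) for empty bins
    return [max(c / len(values), eps) for c in counts]


def psi(train: Sequence[float], live: Sequence[float], edges: Sequence[float]) -> float:
    """Population Stability Index between a training and a live sample."""
    expected = bin_fractions(train, edges)
    actual = bin_fractions(live, edges)
    return sum((a - e) * math.log(a / e) for a, e in zip(actual, expected))


edges = [10.0, 20.0, 30.0]                      # hypothetical bin edges for one feature
train_sample = [5, 12, 15, 22, 25, 28, 31, 35]  # toy data
live_same = list(train_sample)
live_shifted = [v + 15 for v in train_sample]

print(f"PSI, no shift: {psi(train_sample, live_same, edges):.3f}")
print(f"PSI, shifted:  {psi(train_sample, live_shifted, edges):.3f}")
```

A frequently quoted rule of thumb treats a PSI above roughly 0.25 as a significant shift, but the threshold should be checked against your own data. Note that this only detects data drift; concept drift shows up in model quality once real labels arrive, so it needs outcome monitoring as well.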

MLOps maturity (a useful framing)

Level 0 — manual: train in a notebook, hand the file to engineering. Fragile.
Level 1 — automated pipeline: retraining and deployment are scripted and reproducible.
Level 2 — full CI/CD: code, data, and models flow to production automatically with tests, monitoring, and automated retraining.

Most companies are at level 0–1 and trying to climb. Knowing this ladder lets you answer "how mature is your ML process?" credibly.

The takeaway

MLOps is DevOps for the code-plus-data-plus-model trinity: every cloud gives you storage, compute (CPU/GPU/TPU), containers (Docker/K8s), a data warehouse, and a managed ML platform (SageMaker/Vertex/Azure ML). The lifecycle — version data, track experiments, register models, orchestrate pipelines, CI/CD, serve (online vs. batch), and monitor for drift — is what keeps a model alive in production. Feature stores fight training/serving skew; drift, not the calendar, triggers retraining. Next, the fast-moving frontier: the modern AI landscape and the trends you'll be asked about.
