Testing & CI/CD
Every chapter so far added a capability; this one adds the discipline that keeps them from breaking. Testing proves your code and model work; CI/CD runs those tests automatically on every change and blocks anything broken from reaching production. This is the difference between "it worked when I tried it" and "it provably works, on every commit, forever."
Setup:
pip install pytest (already a dependency). The tests in code/tests/ run live; the output below is real.
ML needs more tests, not fewer
Regular software tests ask "does this function return the right value?" ML code needs those plus tests for the things that can silently rot:
| Test type | Asks |
|---|---|
| Unit | does this function work? (tokenizer, cost math) |
| Behavioral | does the model do its job? (positive text → positive) |
| Range/invariant | are outputs valid? (probabilities in [0, 1]) |
| Convergence | does training actually learn? (train accuracy > 0.9) |
| Artifact round-trip | save → load → identical predictions? |
| API contract | right status codes, schema, validation? |
Testing the model
From code/tests/test_model.py — note these are ML tests, not
just unit tests:
```python
import pytest

from sentiment.data import load_dataset
from sentiment.model import SentimentModel


@pytest.fixture(scope="module")
def trained_model():
    return SentimentModel(epochs=300).fit(*load_dataset())


def test_positive_text_scores_high(trained_model):
    assert trained_model.predict_proba(["i love this it is great"])[0] > 0.5


def test_predictions_are_valid_probabilities(trained_model):
    probs = trained_model.predict_proba(["great", "terrible", "the cat"])
    assert ((probs >= 0.0) & (probs <= 1.0)).all()  # invariant: never out of range


def test_training_converges(trained_model):
    texts, labels = load_dataset()
    assert trained_model.accuracy(texts, labels) > 0.9  # it actually learns


def test_artifact_roundtrip(trained_model, tmp_path):
    path = tmp_path / "m.json"
    trained_model.save(str(path))
    reloaded = SentimentModel.load(str(path))
    assert reloaded.predict_proba(["this is great"])[0] == pytest.approx(
        trained_model.predict_proba(["this is great"])[0])
```
Because the fixture is declared with scope="module", pytest trains the model once per
test module and shares that instance across the tests (fast); tmp_path is a
pytest-provided temp directory unique to each test (no cleanup needed). The round-trip
test is the one beginners skip and regret: a serialization bug silently ships a broken
model.
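To make the round-trip failure mode concrete, here is a minimal, self-contained sketch. `ToyModel`, `ForgetsBias`, and `roundtrip_ok` are hypothetical names invented for illustration. They are not the chapter's `SentimentModel`, and the JSON layout is an assumption. The point is only that a `save` which drops one field still produces a loadable file, and that only comparing predictions before and after exposes it.

```python
import json
import math
import tempfile
from pathlib import Path


class ToyModel:
    """Hypothetical stand-in for a real model: one weight per word plus a bias."""

    def __init__(self, weights, bias=0.0):
        self.weights = dict(weights)
        self.bias = bias

    def predict_proba(self, texts):
        # Sigmoid of the summed word weights, so every output lies in (0, 1).
        scores = [self.bias + sum(self.weights.get(w, 0.0) for w in t.split())
                  for t in texts]
        return [1.0 / (1.0 + math.exp(-s)) for s in scores]

    def save(self, path):
        Path(path).write_text(json.dumps({"weights": self.weights, "bias": self.bias}))

    @classmethod
    def load(cls, path):
        state = json.loads(Path(path).read_text())
        return cls(state["weights"], state["bias"])


class ForgetsBias(ToyModel):
    def save(self, path):
        # Deliberate bug: the real bias is never written, so load() restores 0.0.
        Path(path).write_text(json.dumps({"weights": self.weights, "bias": 0.0}))


def roundtrip_ok(model, texts, tol=1e-9):
    """Save, reload, and compare predictions on the same inputs."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "m.json"
        model.save(path)
        reloaded = type(model).load(path)
    before = model.predict_proba(texts)
    after = reloaded.predict_proba(texts)
    return all(abs(a - b) <= tol for a, b in zip(before, after))


print(roundtrip_ok(ToyModel({"great": 2.0}, bias=1.5), ["great"]))     # True
print(roundtrip_ok(ForgetsBias({"great": 2.0}, bias=1.5), ["great"]))  # False
```

The buggy variant saves and loads without raising anything; the only symptom is that its predictions shift, which is exactly what the round-trip assertion checks.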
Testing the API
FastAPI's TestClient exercises the app in-process — no running server — so
endpoint tests are CI-friendly (code/tests/test_api.py):
```python
from fastapi.testclient import TestClient

from api.main import app

client = TestClient(app)


def test_health_ok():
    assert client.get("/health").json()["status"] == "ok"


def test_empty_text_is_rejected():
    assert client.post("/predict", json={"text": ""}).status_code == 422  # validation
```
Running the suite
```shell
cd code
pytest -q
```
Output:
```text
............ [100%]
12 passed in 0.76s
```
Twelve tests — model behavior, invariants, convergence, round-trip, and the API contract — in under a second. This is the gate. A change that breaks any of them fails here, before it ever reaches a user.
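What turns a test run into a gate is the process exit status: pytest exits with 0 when every test passes and with a nonzero code otherwise, and a CI step is marked failed when its command exits nonzero. The sketch below imitates that check with the standard library; `gate` is a hypothetical helper, and the two child commands are stand-ins so the demo does not depend on pytest being installed.

```python
import subprocess
import sys


def gate(command):
    """Run a command the way a CI step does and report whether it passed."""
    result = subprocess.run(command, capture_output=True, text=True)
    return result.returncode == 0


# Stand-ins for a passing and a failing check.
passing = [sys.executable, "-c", "assert 1 + 1 == 2"]
failing = [sys.executable, "-c", "assert 1 + 1 == 3"]

print(gate(passing))  # True
print(gate(failing))  # False
```

In a real pipeline the command would be the test runner itself, for example `["pytest", "-q"]` run from the code directory.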
CI: run the tests automatically on every change
You won't remember to run pytest before every push — so a machine does it for you.
Continuous Integration runs your tests on every commit and pull request.
.github/workflows/ci.yml (GitHub Actions):
```yaml
name: CI
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: "3.11", cache: pip }
      # httpx is required by fastapi.testclient, which the API tests import
      - run: pip install numpy fastapi "uvicorn[standard]" pydantic httpx pytest ruff
      - run: ruff check .   # lint
      - run: pytest -q      # test
```
Now every pull request shows a green check ✅ or red ✗. With a branch protection rule
that requires the CI check, a red run blocks the merge, so broken code cannot reach
main through a pull request. That's the whole point: the gate is automatic, and nobody
has to remember to run it.
CD: deploy automatically once tests pass
Continuous Delivery/Deployment extends the pipeline: after tests pass on main,
build the Docker image (Chapter 5) and ship it.
```yaml
  # a second job under the same jobs: key as test
  build-image:
    needs: test                            # ONLY if tests passed
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: docker build -t sentiment-api:${{ github.sha }} .
      # then: docker push <registry>/... and deploy (kubectl / cloud run / ...)
```
needs: test is the safety interlock — the image only builds if the tests pass.
Don't be confused: CI vs. CD. CI (Continuous Integration) = automatically test every change. CD = automatically deliver/deploy changes that pass. CI is the gate; CD is the conveyor belt after it. You adopt CI first (always), CD when you're confident the gate is strong enough to ship without a human in the loop.
What makes ML CI/CD special
Beyond code tests, mature ML pipelines add ML-specific gates (Chapter 9's quality gate, Chapter 17's eval):
- Model-quality gate — the newly trained model must beat the current production model on a holdout before it can deploy.
- Data validation — schema and distribution checks on incoming data (drift, Chapter 12) before training.
- Eval gate for LLM features — run the golden-set eval; block on regressions.
This is CI/CD/CT — Continuous Training — the loop from Chapter 0 made fully automatic.
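As a rough illustration of the first two gates, the sketch below compares a candidate model against the production model on the same holdout labels and runs a basic schema check on incoming rows. The function names, the `min_gain` parameter, and the `text`/`label` row format are all assumptions made for this example. It does not reproduce the gates from the other chapters, and it leaves out distribution (drift) checks entirely.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def quality_gate(candidate_preds, production_preds, labels, min_gain=0.0):
    """Pass only if the candidate beats production on the same holdout labels."""
    candidate_acc = accuracy(candidate_preds, labels)
    production_acc = accuracy(production_preds, labels)
    return candidate_acc > production_acc + min_gain, candidate_acc, production_acc


def validate_rows(rows):
    """Return a list of schema problems; an empty list means the data passed."""
    problems = []
    for i, row in enumerate(rows):
        if not isinstance(row.get("text"), str) or not row["text"].strip():
            problems.append(f"row {i}: text must be a non-empty string")
        if row.get("label") not in (0, 1):
            problems.append(f"row {i}: label must be 0 or 1")
    return problems


labels = [1, 0, 1, 1, 0]
production = [1, 0, 0, 1, 1]  # 3 of 5 correct
candidate = [1, 0, 1, 1, 1]   # 4 of 5 correct

passed, cand_acc, prod_acc = quality_gate(candidate, production, labels)
print(passed, cand_acc, prod_acc)  # True 0.8 0.6

print(validate_rows([{"text": "great", "label": 1}, {"text": "", "label": 2}]))
```

In a pipeline, a script like this would exit with a nonzero status when the gate fails or the problem list is non-empty, so the deploy job never starts.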
The takeaway
Test the model (behavior, invariants, convergence, round-trip) and the API (contract, validation) — ML needs more tests than ordinary code, and our 12 run in under a second. CI runs them on every commit and blocks broken merges; CD ships what passes; ML adds model-quality, data-validation, and eval gates on top. Automated, unskippable gates are what let a team move fast without breaking production. One discipline left before we assemble everything: configuration, secrets, and security. 👉