Testing & CI/CD

Every chapter so far added a capability; this one adds the discipline that keeps them from breaking. Tests check that your code and model behave as intended; CI/CD runs those tests automatically on every change and blocks anything that fails from reaching production. This is the difference between "it worked when I tried it" and "it is re-checked on every commit, automatically."

Setup: pip install pytest (already a dependency). The tests in code/tests/ run live — the output below is real.

ML needs more tests, not fewer

Regular software tests ask "does this function return the right value?" ML code needs those plus tests for the things that can silently rot:

Test type	Asks
Unit	does this function work? (tokenizer, cost math)
Behavioral	does the model do its job? (positive text → positive)
Range/invariant	are outputs valid? (probabilities in [0, 1])
Convergence	does training actually learn? (train accuracy > 0.9)
Artifact round-trip	save → load → identical predictions?
API contract	right status codes, schema, validation?
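The unit and range/invariant rows can be illustrated without any model at all. The snippet below is a minimal sketch, not the book's code: tokenize and request_cost are hypothetical helpers invented for this example, and the price per token is a made-up figure.

```python
import re

def tokenize(text: str) -> list[str]:
    """Hypothetical tokenizer: lowercase, keep runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

def request_cost(n_tokens: int, usd_per_1k_tokens: float = 0.002) -> float:
    """Hypothetical cost math: linear in token count (the price is made up)."""
    if n_tokens < 0:
        raise ValueError("n_tokens must be non-negative")
    return n_tokens / 1000 * usd_per_1k_tokens

# Unit tests: one known input, one known output.
def test_tokenize_lowercases_and_splits():
    assert tokenize("I LOVE this, it's great!") == ["i", "love", "this", "it's", "great"]

def test_cost_of_1000_tokens():
    assert abs(request_cost(1000) - 0.002) < 1e-12

# Invariant test: a property that must hold for every input, not one example.
def test_cost_is_never_negative_and_never_decreases():
    costs = [request_cost(n) for n in range(0, 5000, 250)]
    assert all(c >= 0 for c in costs)
    assert costs == sorted(costs)
```

The unit tests pin down single examples; the invariant test states a rule (cost is non-negative and does not fall as token count rises) and checks it across a range of inputs. The same split applies to model tests: one asserts a specific prediction, the other asserts that every probability stays in [0, 1].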

Testing the model

From code/tests/test_model.py — note these are ML tests, not just unit tests:

import pytest
from sentiment.data import load_dataset
from sentiment.model import SentimentModel

@pytest.fixture(scope="module")
def trained_model():
    return SentimentModel(epochs=300).fit(*load_dataset())

def test_positive_text_scores_high(trained_model):
    assert trained_model.predict_proba(["i love this it is great"])[0] > 0.5

def test_predictions_are_valid_probabilities(trained_model):
    probs = trained_model.predict_proba(["great", "terrible", "the cat"])
    assert ((probs >= 0.0) & (probs <= 1.0)).all()       # invariant: never out of range

def test_training_converges(trained_model):
    texts, labels = load_dataset()
    assert trained_model.accuracy(texts, labels) > 0.9   # it actually learns

def test_artifact_roundtrip(trained_model, tmp_path):
    path = tmp_path / "m.json"
    trained_model.save(str(path))
    reloaded = SentimentModel.load(str(path))
    assert reloaded.predict_proba(["this is great"])[0] == pytest.approx(
        trained_model.predict_proba(["this is great"])[0])

The @pytest.fixture with scope="module" trains the model once per test module and shares it across the tests in that file (fast); tmp_path is a pytest-provided temporary directory unique to each test, so no cleanup code is needed. The round-trip test is the one beginners skip and regret: a serialization bug silently ships a broken model.
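To see why the round-trip test earns its place, here is a self-contained sketch using only the standard library. TinyModel is a hypothetical stand-in invented for this example, not the book's SentimentModel; it only assumes a model whose state is a set of weights plus a bias saved as JSON.

```python
import json
import math
import tempfile
from pathlib import Path

class TinyModel:
    """Hypothetical stand-in for a trained model: per-word weights plus a bias."""

    def __init__(self, weights: dict[str, float], bias: float):
        self.weights = weights
        self.bias = bias

    def predict_proba(self, text: str) -> float:
        score = self.bias + sum(self.weights.get(tok, 0.0) for tok in text.split())
        return 1.0 / (1.0 + math.exp(-score))  # logistic function

    def save(self, path: Path) -> None:
        # A realistic bug would be to write only the weights and forget the bias.
        path.write_text(json.dumps({"weights": self.weights, "bias": self.bias}))

    @classmethod
    def load(cls, path: Path) -> "TinyModel":
        state = json.loads(path.read_text())
        return cls(state["weights"], state["bias"])

def roundtrip_matches(model: TinyModel, texts: list[str]) -> bool:
    """Save, reload, and compare predictions on every probe text."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "m.json"
        model.save(path)
        reloaded = TinyModel.load(path)
    return all(
        math.isclose(model.predict_proba(t), reloaded.predict_proba(t), rel_tol=1e-9)
        for t in texts
    )

model = TinyModel({"great": 2.0, "terrible": -2.0}, bias=0.5)
print(roundtrip_matches(model, ["this is great", "terrible", "the cat"]))
```

If save dropped the bias and load quietly defaulted it to 0.0, the neutral probe "the cat" would score 0.5 instead of about 0.62 and the check would fail. Including at least one probe that exercises every saved field is what makes the test catch that class of bug.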

Testing the API

FastAPI's TestClient exercises the app in-process — no running server — so endpoint tests are CI-friendly (code/tests/test_api.py):

from fastapi.testclient import TestClient
from api.main import app
client = TestClient(app)

def test_health_ok():
    assert client.get("/health").json()["status"] == "ok"

def test_empty_text_is_rejected():
    assert client.post("/predict", json={"text": ""}).status_code == 422  # validation

Running the suite

cd code
pytest -q

Output:

............                                                             [100%]
12 passed in 0.76s

Twelve tests — model behavior, invariants, convergence, round-trip, and the API contract — in under a second. This is the gate. A change that breaks any of them fails here, before it ever reaches a user.

CI: run the tests automatically on every change

You won't remember to run pytest before every push — so a machine does it for you. Continuous Integration runs your tests on every commit and pull request. .github/workflows/ci.yml (GitHub Actions):

name: CI
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: "3.11", cache: pip }
      - run: pip install numpy fastapi "uvicorn[standard]" pydantic httpx pytest ruff   # httpx: needed by TestClient
      - run: ruff check .          # lint
      - run: pytest -q             # test

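Now every pull request shows a green check ✅ or red ✗. Once a branch protection rule on main marks this check as required, a red CI blocks the merge, so broken code cannot reach main through a pull request. That's the whole point: the gate is automatic and, with the rule in place, not something a hurried developer can skip.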
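Mechanically, each run: step passes or fails on its command's exit code: zero is green, anything else fails the step and the steps after it are skipped by default. The same gate can be mirrored locally in a few lines of Python. This is a sketch only; run_gate is a name invented here, and the demo uses stand-in commands so it runs without ruff or pytest installed.

```python
import subprocess
import sys

def run_gate(commands: list[list[str]]) -> int:
    """Run commands in order; return the first non-zero exit code, or 0 if all pass."""
    for cmd in commands:
        result = subprocess.run(cmd)
        if result.returncode != 0:
            print(f"gate failed at: {' '.join(cmd)}", file=sys.stderr)
            return result.returncode  # later commands are skipped, as in CI
    return 0

# Stand-in commands so the sketch runs anywhere. In a real project the list
# would be [["ruff", "check", "."], ["pytest", "-q"]], lint first, then tests.
ok = [sys.executable, "-c", "pass"]
bad = [sys.executable, "-c", "import sys; sys.exit(3)"]

print(run_gate([ok, ok]))        # every command succeeded
print(run_gate([ok, bad, ok]))   # stops at the failing command
```

Wrapping the real commands this way and calling sys.exit(run_gate(...)) gives a pre-push check that fails for the same reasons CI would, which shortens the loop between breaking something and finding out.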

CD: deploy automatically once tests pass

Continuous Delivery/Deployment extends the pipeline: after tests pass on main, build the Docker image (Chapter 5) and ship it.

  build-image:
    needs: test                              # ONLY if tests passed
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: docker build -t sentiment-api:${{ github.sha }} .
      # then: docker push <registry>/...   and   deploy (kubectl / cloud run / ...)

needs: test is the safety interlock — the image only builds if the tests pass.

Don't be confused: CI vs. CD. CI (Continuous Integration) = automatically test every change. CD = automatically deliver/deploy changes that pass. CI is the gate; CD is the conveyor belt after it. You adopt CI first (always), CD when you're confident the gate is strong enough to ship without a human in the loop.

What makes ML CI/CD special

Beyond code tests, mature ML pipelines add ML-specific gates (Chapter 9's quality gate, Chapter 17's eval):

Model-quality gate — the newly trained model must beat the current production model on a holdout before it can deploy.
Data validation — schema and distribution checks on incoming data (drift, Chapter 12) before training.
Eval gate for LLM features — run the golden-set eval; block on regressions.
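The first two gates can be sketched in plain Python. Everything here is illustrative rather than the book's implementation: GateResult, model_quality_gate, and validate_rows are hypothetical names, the "models" are toy functions, and the data check covers schema only, not distribution drift.

```python
from dataclasses import dataclass

@dataclass
class GateResult:
    passed: bool
    reason: str

def accuracy(predict, texts, labels) -> float:
    correct = sum(int(predict(t) == y) for t, y in zip(texts, labels))
    return correct / len(labels)

def model_quality_gate(candidate, production, texts, labels, min_gain=0.0) -> GateResult:
    """Pass only if the candidate beats production on the same holdout set."""
    cand = accuracy(candidate, texts, labels)
    prod = accuracy(production, texts, labels)
    return GateResult(cand > prod + min_gain, f"candidate={cand:.3f} production={prod:.3f}")

def validate_rows(rows, allowed_labels=(0, 1)) -> GateResult:
    """Schema check: every row needs non-empty text and a known label."""
    for i, row in enumerate(rows):
        text = row.get("text")
        if not isinstance(text, str) or not text.strip():
            return GateResult(False, f"row {i}: missing or empty text")
        if row.get("label") not in allowed_labels:
            return GateResult(False, f"row {i}: unknown label {row.get('label')!r}")
    return GateResult(True, f"{len(rows)} rows ok")

# Toy data and toy "models", made up for the demo.
holdout_texts = ["great", "terrible", "love it", "awful"]
holdout_labels = [1, 0, 1, 0]
production = lambda t: 1                              # always predicts positive
candidate = lambda t: int(t in {"great", "love it"})  # lookup-table "model"

print(model_quality_gate(candidate, production, holdout_texts, holdout_labels))
print(validate_rows([{"text": "great", "label": 1}, {"text": "", "label": 0}]))
```

In a pipeline, a script like this would exit non-zero whenever passed is False, so the deploy job that depends on it never starts. Both models are scored on the same holdout because a comparison across different evaluation sets says little about which model is better.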

This is CI/CD/CT — Continuous Training — the loop from Chapter 0 made fully automatic.

The takeaway

Test the model (behavior, invariants, convergence, round-trip) and the API (contract, validation) — ML needs more tests than ordinary code, and our 12 run in under a second. CI runs them on every commit and blocks broken merges; CD ships what passes; ML adds model-quality, data-validation, and eval gates on top. Automated, unskippable gates are what let a team move fast without breaking production. One discipline left before we assemble everything: configuration, secrets, and security. 👉

Production ML & AI Tools: A Hands-On Field Guide