FastAPI: a real prediction service

A registered model still can't be used by anything but Python code that imports it. To let a website, a mobile app, or another microservice get predictions, you wrap the model in a web API: an HTTP endpoint other systems call. FastAPI is a widely used modern choice for this in Python: it is fast, typed, and generates interactive docs for free.

Install: pip install fastapi "uvicorn[standard]" — FastAPI is the framework, uvicorn is the server that runs it.

Why an API, and why FastAPI

An API turns your model into a service: any system, in any language, can POST some text and get a prediction back as JSON. FastAPI wins because it gives you, out of the box:

Automatic validation — declare the input shape with Pydantic; bad requests are rejected with clear errors before your code runs.
Interactive docs — a Swagger UI at /docs where anyone can try the API in a browser.
Speed & async — among the fastest Python frameworks; handles concurrency well.
Type safety — your editor and the framework both understand your data shapes.

The service

The full app is api/main.py. The core is three pieces: schemas (what goes in and out), the model loaded once at startup, and the endpoints.

Schemas — validation for free

from pydantic import BaseModel, Field

class PredictRequest(BaseModel):
    text: str = Field(..., min_length=1, examples=["this is wonderful"])

class Prediction(BaseModel):
    label: str
    score: float

Because the schema declares that text must be a non-empty string, FastAPI rejects bad input automatically. You never write validation code, and the rejection is a clean, typed error.
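To see the schema doing this work without starting a server, the sketch below builds PredictRequest directly and catches the error Pydantic raises. It assumes only that Pydantic is installed; is_valid is a hypothetical helper for illustration, not part of api/main.py. Inside the service, FastAPI turns the same kind of validation failure into the HTTP 422 response.

```python
from pydantic import BaseModel, Field, ValidationError


class PredictRequest(BaseModel):
    # Same constraint as the service schema: text is required and non-empty.
    text: str = Field(..., min_length=1)


def is_valid(payload: dict) -> bool:
    """Return True if payload satisfies the PredictRequest schema."""
    try:
        PredictRequest(**payload)
        return True
    except ValidationError:
        return False


print(is_valid({"text": "this is wonderful"}))  # a normal request
print(is_valid({"text": ""}))                   # empty string violates min_length=1
print(is_valid({}))                             # the required field is missing
```

Only the first payload passes; the other two are rejected by the schema itself, before any model code is involved.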

Load the model once, serve many times

from fastapi import FastAPI

app = FastAPI(title="Sentiment API")
# _load_or_train() is a helper from the full app (not shown here)
model = _load_or_train()        # loaded ONCE at startup, reused for every request

Don't be confused: load the model at startup, not per request. A common beginner bug is loading the model file inside the endpoint, so every request re-reads it from disk and the API becomes painfully slow. Load it once at module level (or in a lifespan handler, which replaces FastAPI's older startup event) and reuse it. Models are read-only at inference time; one copy serves all requests.
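The cost difference is easy to see with a stand-in loader that counts how often it runs. This is a minimal sketch in plain Python, not the chapter's api/main.py: load_model, handle_slow, handle_fast, and get_model are hypothetical names, and the dict stands in for a real model. It uses a cached accessor to get load-once behavior; module-level assignment, as in the service, achieves the same thing eagerly at import time.

```python
from functools import lru_cache

load_calls = 0  # how many times the (pretend) model file was read


def load_model() -> dict:
    """Stand-in for an expensive read of the model file from disk."""
    global load_calls
    load_calls += 1
    return {"vocab": ["good", "bad"]}


def handle_slow(text: str) -> int:
    # Anti-pattern: every request pays the full loading cost again.
    model = load_model()
    return len(model["vocab"])


@lru_cache(maxsize=1)
def get_model() -> dict:
    # Loads on the first call only; later calls reuse the cached object.
    return load_model()


def handle_fast(text: str) -> int:
    model = get_model()
    return len(model["vocab"])


for _ in range(3):
    handle_slow("hello")
slow_loads = load_calls  # one load per request

load_calls = 0
for _ in range(3):
    handle_fast("hello")
fast_loads = load_calls  # a single load shared by all requests

print(slow_loads, fast_loads)
```

Three requests cost three loads in the slow handler and one load in the fast one. With a real model file, that gap is the difference between a sluggish API and a responsive one.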

The endpoints

@app.get("/health")
def health():
    return {"status": "ok", "vocab_size": len(model.vocab)}

@app.post("/predict", response_model=Prediction)
def predict(req: PredictRequest):
    score = float(model.predict_proba([req.text])[0])
    return Prediction(label="positive" if score >= 0.5 else "negative",
                      score=round(score, 4))

/health is a liveness probe — Docker and Kubernetes ping it to know the service is up (Chapter 5). /predict is the real work. There's also a /predict/batch for scoring many texts in one call.

Running it

cd code
uvicorn api.main:app --reload          # --reload = auto-restart on code changes

Then open http://localhost:8000/docs for the interactive Swagger UI, or call it from the command line:

curl -X POST http://localhost:8000/predict \
     -H "Content-Type: application/json" \
     -d '{"text": "this is wonderful and great"}'
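The same call can be made from Python with only the standard library. This is a minimal sketch that mirrors the curl command above; build_predict_request and predict are hypothetical helper names, and the base URL assumes the server is running locally on port 8000.

```python
import json
import urllib.request


def build_predict_request(
    text: str, base_url: str = "http://localhost:8000"
) -> urllib.request.Request:
    """Build the same POST /predict request the curl command sends."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(text: str, base_url: str = "http://localhost:8000") -> dict:
    """Send the request and decode the JSON reply (needs the server running)."""
    request = build_predict_request(text, base_url)
    with urllib.request.urlopen(request, timeout=5) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Building the request does not touch the network, so this runs without a server.
req = build_predict_request("this is wonderful and great")
print(req.get_method(), req.full_url)
```

With the service up, predict("this is wonderful and great") returns the decoded JSON body as a dict. Any language with an HTTP client can do the equivalent, which is the whole point of putting the model behind an API.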

What it returns

Here's the service answering, verified end to end (via FastAPI's test client):

GET  /health                          -> {"status": "ok", "vocab_size": 75}
POST /predict {"text":"... great"}    -> {"label": "positive", "score": 0.9841}
POST /predict {"text":"terrible ..."} -> {"label": "negative", "score": 0.0112}
POST /predict/batch {"texts":[...]}   -> [{"label":"positive","score":0.8847},
                                          {"label":"negative","score":0.1188}]
POST /predict {"text":""}             -> 422 Unprocessable Entity   (validation!)

Look at that last line: an empty string is rejected with HTTP 422 before your code ever runs, because the Pydantic schema said min_length=1. That's the validation you got for free — and it's exactly what keeps a production API from crashing on malformed input.

Production touches (so you sound senior)

The basics above are real, but a production service adds:

Multiple workers — uvicorn api.main:app --workers 4 (or Gunicorn) to use all CPU cores.
A timeout & request size limit — never let one request hang or flood you.
Structured logging & a request ID — so you can trace a prediction later.
Load the model from the registry — mlflow.pyfunc.load_model("models:/sentiment@production") (Chapter 3) instead of a local file, so deploys are alias moves.
Async for I/O-bound work — async def endpoints when you call databases or other services.
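As one illustration of the logging item above, here is a minimal standard-library sketch of structured logging with a request ID. It is an assumption-laden example rather than the service's actual logging setup: make_json_logger and log_prediction are hypothetical helpers, the field names are illustrative, and a real deployment would typically attach the ID in middleware and return it to the caller in a response header.

```python
import io
import json
import logging
import uuid


def make_json_logger(stream) -> logging.Logger:
    """Create a logger that writes one raw message per line to the stream."""
    logger = logging.getLogger("sentiment_api_sketch")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()   # avoid duplicate handlers if called twice
    logger.propagate = False  # keep output out of the root logger
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(message)s"))
    logger.addHandler(handler)
    return logger


def log_prediction(logger: logging.Logger, text: str, label: str, score: float) -> str:
    """Log one prediction as a JSON line and return its request ID."""
    request_id = uuid.uuid4().hex
    logger.info(json.dumps({
        "request_id": request_id,
        "text_length": len(text),  # log the length, not the raw user text
        "label": label,
        "score": score,
    }))
    return request_id


buffer = io.StringIO()  # stands in for stdout or a log file
logger = make_json_logger(buffer)
rid = log_prediction(logger, "this is wonderful", "positive", 0.9841)
record = json.loads(buffer.getvalue())
print(record)
```

Because each log line is a JSON object carrying the request ID, you can later search the logs for one ID and recover exactly what the service predicted for that request.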

Don't be confused: uvicorn vs. FastAPI vs. Gunicorn. FastAPI is the framework you write code in. uvicorn is the ASGI server that actually runs it and speaks HTTP. Gunicorn is a process manager that runs several uvicorn workers for production. You write FastAPI; uvicorn serves it; Gunicorn scales it.

Connecting the dots

Notice that _load_or_train() can load the model from model.json (Chapter 1), but in production you'd load from the registry (Chapter 3), so promoting a new model needs no code change. This API is also what the Streamlit UI (Chapter 11) and any other client will call.

The takeaway

FastAPI wraps your model in a typed HTTP service: Pydantic schemas validate input for free (the empty string got a 422), the model loads once at startup, and /health + /predict expose it to any system in any language — with interactive docs at /docs. This is how a model stops being a script and becomes a service. But a service that runs only on your laptop isn't deployed — next, we make it run anywhere with Docker. 👉

Production ML & AI Tools: A Hands-On Field Guide