FastAPI: a real prediction service
A registered model still can't be used by anything but Python code that imports it. To let a website, a mobile app, or another microservice get predictions, you wrap the model in a web API: an HTTP endpoint other systems call. FastAPI is a widely used modern Python framework for this: fast, typed, and it generates interactive docs for free.
Install:
pip install fastapi "uvicorn[standard]"
FastAPI is the framework; uvicorn is the server that runs it.
Why an API, and why FastAPI
An API turns your model into a service: any system, in any language, can POST
some text and get a prediction back as JSON. FastAPI wins because it gives you, out of
the box:
- Automatic validation: declare the input shape with Pydantic; bad requests are rejected with clear errors before your code runs.
- Interactive docs: a Swagger UI at /docs where anyone can try the API in a browser.
- Speed & async: among the fastest Python frameworks; handles concurrency well.
- Type safety: your editor and the framework both understand your data shapes.
The service
The full app is api/main.py. The core is three pieces:
schemas (what goes in and out), the model loaded once at startup, and the
endpoints.
Schemas — validation for free
from pydantic import BaseModel, Field

class PredictRequest(BaseModel):
    text: str = Field(..., min_length=1, examples=["this is wonderful"])

class Prediction(BaseModel):
    label: str
    score: float
Because the schema declares that text must be a non-empty string, FastAPI rejects
bad input automatically: you never write validation code, and the rejection is a
clean, typed error.
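To see the schema reject input on its own, without any web server, a minimal sketch might look like the following. It assumes Pydantic is installed (it comes with FastAPI); the is_valid helper is a hypothetical name introduced only for illustration.

```python
from pydantic import BaseModel, Field, ValidationError


class PredictRequest(BaseModel):
    # Same schema as above: text must be a string of length >= 1.
    text: str = Field(..., min_length=1, examples=["this is wonderful"])


def is_valid(payload: dict) -> bool:
    """Hypothetical helper: True if the payload satisfies PredictRequest."""
    try:
        PredictRequest(**payload)
        return True
    except ValidationError:
        return False


print(is_valid({"text": "this is wonderful"}))  # a non-empty string passes
print(is_valid({"text": ""}))                   # too short: rejected
print(is_valid({}))                             # missing field: rejected
```

FastAPI runs this same check on the request body and turns a ValidationError into an HTTP 422 response, so the endpoint function is never called with invalid data.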
Load the model once, serve many times
from fastapi import FastAPI

app = FastAPI(title="Sentiment API")
model = _load_or_train()  # loaded ONCE at startup, reused for every request
Don't be confused: load the model at startup, not per request. A common beginner bug is loading the model file inside the endpoint — so every request re-reads it from disk, making the API painfully slow. Load it once at module level (or in a startup event) and reuse it. Models are read-only at inference time; one copy serves all requests.
The endpoints
@app.get("/health")
def health():
    return {"status": "ok", "vocab_size": len(model.vocab)}

@app.post("/predict", response_model=Prediction)
def predict(req: PredictRequest):
    score = float(model.predict_proba([req.text])[0])
    return Prediction(label="positive" if score >= 0.5 else "negative",
                      score=round(score, 4))
/health is a liveness endpoint: Docker and Kubernetes ping it to know the service
is up (Chapter 5). /predict is the real work. There's also a
/predict/batch endpoint for scoring many texts in one call.
Running it
cd code
uvicorn api.main:app --reload # --reload = auto-restart on code changes
Then open http://localhost:8000/docs for the interactive Swagger UI, or call it from the command line:
curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"text": "this is wonderful and great"}'
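The same call can be made from Python using only the standard library. This is a sketch that assumes the service is running on localhost:8000 as above; build_predict_request and predict are hypothetical helper names.

```python
import json
import urllib.request


def build_predict_request(text: str, base_url: str = "http://localhost:8000"):
    """Build (but do not send) the POST request that the curl command makes."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(text: str, base_url: str = "http://localhost:8000") -> dict:
    """Send the request and decode the JSON reply; needs a running server."""
    request = build_predict_request(text, base_url)
    with urllib.request.urlopen(request, timeout=5) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, predict("this is wonderful and great") should return a dict with label and score keys. A 422 from the service surfaces here as urllib.error.HTTPError.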
What it returns
Here's the service answering, verified end to end (via FastAPI's test client):
GET /health -> {"status": "ok", "vocab_size": 75}
POST /predict {"text":"... great"} -> {"label": "positive", "score": 0.9841}
POST /predict {"text":"terrible ..."} -> {"label": "negative", "score": 0.0112}
POST /predict/batch {"texts":[...]} -> [{"label":"positive","score":0.8847},
{"label":"negative","score":0.1188}]
POST /predict {"text":""} -> 422 Unprocessable Entity (validation!)
Look at that last line: an empty string is rejected with HTTP 422 before your code
ever runs, because the Pydantic schema said min_length=1. That's the validation you
got for free — and it's exactly what keeps a production API from crashing on
malformed input.
Production touches (so you sound senior)
The basics above are real, but a production service adds:
- Multiple workers: uvicorn api.main:app --workers 4 (or Gunicorn) to use all CPU cores.
- A timeout & request size limit: never let one request hang or flood you.
- Structured logging & a request ID: so you can trace a prediction later.
- Load the model from the registry: mlflow.pyfunc.load_model("models:/sentiment@production") (Chapter 3) instead of a local file, so deploys are alias moves.
- Async for I/O-bound work: async def endpoints when you call databases or other services.
Don't be confused: uvicorn vs. FastAPI vs. Gunicorn. FastAPI is the framework you write code in. uvicorn is the ASGI server that actually runs it and speaks HTTP. Gunicorn is a process manager that runs several uvicorn workers for production. You write FastAPI; uvicorn serves it; Gunicorn scales it.
Connecting the dots
Notice that the service calls _load_or_train() to get its model from the
model.json file (Chapter 1). In production you'd load from the registry
(Chapter 3) instead, so promoting a new model needs no code change.
This API is also what the Streamlit UI (Chapter 11)
and any other client will call.
The takeaway
FastAPI wraps your model in a typed HTTP service: Pydantic schemas validate input for
free (the empty string got a 422), the model loads once at startup, and /health +
/predict expose it to any system in any language — with interactive docs at /docs.
This is how a model stops being a script and becomes a service. But a service that
runs only on your laptop isn't deployed. Next, we make it run anywhere with Docker.