ONNX: portable, faster inference
Your model is a Python object using NumPy (or PyTorch, or TensorFlow). But what if you need to serve it from a C++ game engine, a Java backend, a browser, or a tiny edge device with no Python? And what if framework overhead makes inference slower than it needs to be? ONNX (Open Neural Network Exchange) solves both: a single, framework-neutral model format that runs fast via a dedicated runtime, anywhere.
Install:
pip install torch onnx onnxruntime
This chapter runs live: the output below is real.
What ONNX is
ONNX is a standard file format for models; onnxruntime is a highly optimized engine that executes them. The workflow:
train in Python (PyTorch/sklearn/...) ─► export to model.onnx ─►
run with onnxruntime in ANY language, often 2–5× faster
You decouple training (Python, flexible, slow is OK) from serving (any language,
optimized, latency matters). The .onnx file is portable; the runtime handles the
speed.
Exporting our model
Our model is bag-of-words + logistic regression, i.e. sigmoid(counts @ w + b). We
rebuild that as a one-line PyTorch Linear layer (so we can use torch's ONNX exporter)
and export it. From serving/onnx_export.py:
import torch
import torch.nn as nn

# SentimentModel and load_dataset are the project's own code (not shown here).
model = SentimentModel().fit(*load_dataset())
n_features = len(model.weights) - 1  # the last weight is the bias

linear = nn.Linear(n_features, 1)
with torch.no_grad():  # copy the learned weights across
    linear.weight.copy_(torch.tensor(model.weights[:-1]).reshape(1, -1))
    linear.bias.copy_(torch.tensor([model.weights[-1]]))

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = linear

    def forward(self, x):
        return torch.sigmoid(self.fc(x))

torch.onnx.export(
    Net().eval(), torch.zeros(1, n_features), "sentiment.onnx",
    input_names=["counts"], output_names=["score"],
    dynamic_axes={"counts": {0: "batch"}, "score": {0: "batch"}},  # variable batch
    opset_version=17,
)
The dynamic_axes argument is the one subtlety worth knowing: it marks the batch dimension
as variable, so the exported model accepts 1 row or 1,000. We also save the vocabulary
alongside, because the runtime still has to turn text into word counts the same way.
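One way to picture "saving the vocabulary alongside" is the sketch below. It assumes the vocabulary is a plain dict mapping each word to a column index and that tokenizing means lowercasing and splitting on whitespace. The names save_vocab, load_vocab, vectorize, and the file name sentiment.vocab.json are hypothetical and are not taken from serving/onnx_export.py.

```python
import json
import os
import tempfile

import numpy as np


def save_vocab(vocab, path):
    """Write the word -> column-index mapping next to the .onnx file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(vocab, f)


def load_vocab(path):
    """Read the mapping back at serving time."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def vectorize(texts, vocab):
    """Turn texts into a float32 bag-of-words matrix of shape (batch, len(vocab))."""
    counts = np.zeros((len(texts), len(vocab)), dtype=np.float32)
    for row, text in enumerate(texts):
        for word in text.lower().split():
            col = vocab.get(word)
            if col is not None:  # words outside the vocabulary are ignored
                counts[row, col] += 1.0
    return counts


# Toy vocabulary for illustration only; the real one comes from training.
toy_vocab = {"this": 0, "is": 1, "great": 2, "terrible": 3}
vocab_path = os.path.join(tempfile.mkdtemp(), "sentiment.vocab.json")
save_vocab(toy_vocab, vocab_path)
counts = vectorize(["this is great", "great great movie"], load_vocab(vocab_path))
print(counts)
```

Whatever the real tokenizer does, the serving side has to apply the same steps in the same order, otherwise the columns no longer line up with the exported weights.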
Running inference with onnxruntime — no PyTorch needed
The whole point: load the .onnx and predict with only onnxruntime, no training
framework:
import onnxruntime as ort

sess = ort.InferenceSession("sentiment.onnx")
# counts = bag-of-words matrix, a float32 NumPy array of shape (batch, vocab_size)
scores = sess.run(["score"], {"counts": counts})[0]
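A slightly fuller sketch of that call, under a few assumptions: the input must be a float32 NumPy array (the export traced a float32 tensor), the output comes back with shape (batch, 1), and a score of 0.5 or more counts as "positive". The predict helper and the 0.5 threshold are hypothetical, not part of the chapter's script. The bottom half only runs if sentiment.onnx already exists in the working directory.

```python
import os

import numpy as np


def predict(sess, counts, threshold=0.5):
    """Run an ONNX Runtime session on a bag-of-words matrix and label each row."""
    # The exported graph takes float32 input, so cast before running.
    counts = np.ascontiguousarray(counts, dtype=np.float32)
    scores = sess.run(["score"], {"counts": counts})[0]  # shape (batch, 1)
    scores = scores.reshape(-1)
    labels = ["positive" if s >= threshold else "negative" for s in scores]
    return scores, labels


if os.path.exists("sentiment.onnx"):  # needs the exported file and onnxruntime
    import onnxruntime as ort

    sess = ort.InferenceSession("sentiment.onnx", providers=["CPUExecutionProvider"])
    n_features = sess.get_inputs()[0].shape[1]  # vocabulary size baked into the graph
    for batch in (1, 1000):  # the dynamic batch axis lets both sizes through
        scores, _ = predict(sess, np.zeros((batch, n_features)))
        print(batch, scores.shape)
```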
It runs — live output
cd code
python serving/onnx_export.py
Output:
exported -> sentiment.onnx (602 bytes)
0.977 positive <- 'this is great'
0.024 negative <- 'this is terrible'
The model exported to a 602-byte .onnx file, and onnxruntime reproduced the same
predictions to three decimal places (0.977, 0.024), without the original SentimentModel
class. That file now runs in C#, Java, JavaScript (onnxruntime-web), C++, or on a
Raspberry Pi, with the same results up to floating-point rounding.
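If you want to check the match yourself instead of eyeballing three decimals, a tolerance comparison is the usual approach. The sketch below assumes the weight vector stores the bias as its last entry (as the export code does) and uses made-up weights. The helper names reference_scores and matches are hypothetical, and a float32 NumPy computation stands in for the runtime's output so the example runs without the .onnx file.

```python
import numpy as np


def reference_scores(counts, weights):
    """sigmoid(counts @ w + b) in float64, with the bias stored as the last weight."""
    w = np.asarray(weights[:-1], dtype=np.float64)
    b = float(weights[-1])
    return 1.0 / (1.0 + np.exp(-(np.asarray(counts, dtype=np.float64) @ w + b)))


def matches(reference, onnx_scores, atol=1e-5):
    """True when runtime scores agree with the reference within a small tolerance."""
    return np.allclose(reference, np.asarray(onnx_scores).reshape(-1), atol=atol)


# Made-up weights and counts for illustration; the last weight is the bias.
weights = [0.5, -0.25, 2.0, -3.0, 0.1]
counts = np.array([[1, 1, 1, 0], [1, 1, 0, 1]], dtype=np.float32)
expected = reference_scores(counts, weights)

# Stand-in for sess.run(...): the same math carried out in float32.
w32 = np.array(weights[:-1], dtype=np.float32)
simulated = 1.0 / (1.0 + np.exp(-(counts @ w32 + np.float32(weights[-1]))))
print(matches(expected, simulated))
```

With a real session, pass sess.run(["score"], {"counts": counts})[0] in place of simulated. Tiny differences between float32 and float64 arithmetic are expected, which is why the comparison uses a tolerance and not ==.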
Don't be confused: ONNX (the format) vs. onnxruntime (the engine). ONNX is the
.onnx file, a portable description of the model's computation graph. onnxruntime is the program that loads and executes it fast. You export to ONNX and run with onnxruntime. (Other engines, such as TensorRT and OpenVINO, also run ONNX files, tuned for specific hardware.)
Why teams reach for it
- Speed — onnxruntime applies graph optimizations (operator fusion, constant folding) and uses optimized kernels; inference is often 2–5× faster than the source framework, especially on CPU.
- Portability — one model file, every language and platform. No Python at serving time.
- Smaller footprint — ship a lean runtime instead of a full PyTorch/TF install — big for containers (Chapter 5) and edge devices.
- A bridge to faster hardware — ONNX is the on-ramp to TensorRT (NVIDIA GPUs), OpenVINO (Intel), and mobile/edge accelerators.
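Speed claims are worth measuring on your own model and hardware. The sketch below shows one hedged way to do it: a small timing helper (mean_latency_ms, a hypothetical name) plus a session created with graph optimizations set explicitly and the CPU execution provider selected. The batch size of 64 is arbitrary, and the lower half only runs if sentiment.onnx exists.

```python
import os
import time


def mean_latency_ms(fn, repeats=200, warmup=20):
    """Average wall-clock time of fn() in milliseconds, after a few warm-up calls."""
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    return (time.perf_counter() - start) * 1000.0 / repeats


if os.path.exists("sentiment.onnx"):  # needs the exported file and onnxruntime
    import numpy as np
    import onnxruntime as ort

    opts = ort.SessionOptions()
    # Ask for every graph-level rewrite (operator fusion, constant folding, and so on).
    opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
    sess = ort.InferenceSession(
        "sentiment.onnx", sess_options=opts, providers=["CPUExecutionProvider"]
    )
    n_features = sess.get_inputs()[0].shape[1]
    batch = np.zeros((64, n_features), dtype=np.float32)
    latency = mean_latency_ms(lambda: sess.run(None, {"counts": batch}))
    print(f"{latency:.3f} ms per call")
```

Timing the original framework with the same helper and the same batch gives a like-for-like comparison. Swapping the providers list is also how a session is pointed at other hardware, provided the matching onnxruntime build is installed.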
ONNX and quantization
ONNX pairs naturally with quantization — converting weights from float32 to int8 to shrink the model ~4× and speed it up further:
from onnxruntime.quantization import quantize_dynamic, QuantType
quantize_dynamic("sentiment.onnx", "sentiment.int8.onnx", weight_type=QuantType.QInt8)
This is how large models get squeezed onto modest hardware — the same compression idea the IVF-PQ book applies to vectors, applied to model weights.
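To make the float32-to-int8 idea concrete, here is a minimal NumPy sketch of symmetric per-tensor quantization. It illustrates the arithmetic only and is not claimed to be the exact scheme quantize_dynamic uses; the function names and the random weights are made up for the example.

```python
import numpy as np


def quantize_int8(weights):
    """Symmetric per-tensor quantization: map [-max|w|, +max|w|] onto [-127, 127]."""
    weights = np.asarray(weights, dtype=np.float32)
    max_abs = float(np.abs(weights).max())
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale


def dequantize(q, scale):
    """Recover approximate float32 weights from the int8 codes."""
    return q.astype(np.float32) * np.float32(scale)


rng = np.random.default_rng(0)
w = rng.normal(size=1000).astype(np.float32)  # made-up weights for illustration
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

print("bytes float32:", w.nbytes, "bytes int8:", q.nbytes)  # 4000 vs 1000
print("max abs error:", float(np.abs(w - w_hat).max()))     # about scale / 2 at most
```

Each weight shrinks from 4 bytes to 1, which is where the roughly 4× figure comes from, and the price is a rounding error of up to about half a quantization step per weight.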
When not to bother
ONNX shines for deployment of a settled model. During research and rapid iteration, stay in your native framework — exporting adds a step and not every exotic operation converts cleanly. Reach for ONNX when you're optimizing a model for production serving, targeting a non-Python environment, or chasing latency.
The takeaway
ONNX is a portable model format; onnxruntime is the fast engine that runs it. Export
once (mind the dynamic_axes for variable batch sizes), and your model runs in any
language, on any platform, typically 2–5× faster — and quantizes down for edge
hardware. It's the standard bridge from "trained in Python" to "served fast and
anywhere." Now let's give non-engineers a way to try the model — a UI. 👉