ONNX: portable, faster inference

Your model is a Python object using NumPy (or PyTorch, or TensorFlow). But what if you need to serve it from a C++ game engine, a Java backend, a browser, or a tiny edge device with no Python? And what if framework overhead makes inference slower than it needs to be? ONNX (Open Neural Network Exchange) solves both: a single, framework-neutral model format that runs fast via a dedicated runtime, anywhere.

Install: pip install torch onnx onnxruntime. This chapter runs live — the output below is real.

What ONNX is

ONNX is a standard file format for models, plus onnxruntime, a highly optimized engine that executes them. The workflow:

train in Python (PyTorch/sklearn/...)  ─►  export to model.onnx  ─►
        run with onnxruntime in ANY language, often 2-5x faster

You decouple training (Python, flexible, slow is OK) from serving (any language, optimized, latency matters). The .onnx file is portable; the runtime handles the speed.

Exporting our model

Our model is bag-of-words + logistic regression, i.e. sigmoid(counts @ w + b). We rebuild that as a one-line PyTorch Linear layer (so we can use torch's ONNX exporter) and export it. From serving/onnx_export.py:

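import torch
import torch.nn as nn

# SentimentModel and load_dataset are defined elsewhere in the book's code (import not shown here).
model = SentimentModel().fit(*load_dataset())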
linear = nn.Linear(len(model.weights) - 1, 1)
with torch.no_grad():                              # copy the learned weights across
    linear.weight.copy_(torch.tensor(model.weights[:-1]).reshape(1, -1))
    linear.bias.copy_(torch.tensor([model.weights[-1]]))

class Net(nn.Module):
    def __init__(self): super().__init__(); self.fc = linear
    def forward(self, x): return torch.sigmoid(self.fc(x))

torch.onnx.export(
    Net().eval(), torch.zeros(1, len(model.weights) - 1), "sentiment.onnx",
    input_names=["counts"], output_names=["score"],
    dynamic_axes={"counts": {0: "batch"}, "score": {0: "batch"}},   # variable batch
    opset_version=17,
)

The dynamic_axes argument is the one subtlety worth knowing: it marks the batch dimension (axis 0) as variable, so the exported model accepts 1 row or 1,000. Without it, the exporter fixes the batch size to that of the example input (here, 1). We also save the vocabulary alongside, because the runtime still has to turn text into word counts the same way.
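A minimal sketch of that text-to-counts step, assuming the vocabulary was saved as a JSON list of words. The helper name text_to_counts, the JSON layout, and the lowercase-and-split tokenizer are illustrative assumptions, not the book's actual code; whatever tokenizer was used in training has to be reproduced exactly.

```python
import json


def text_to_counts(text, vocab):
    """Return one bag-of-words row: how often each vocabulary word occurs in text."""
    index = {word: i for i, word in enumerate(vocab)}   # word -> column position
    row = [0.0] * len(vocab)
    for token in text.lower().split():                  # assumed tokenizer: lowercase + whitespace
        col = index.get(token)
        if col is not None:                             # out-of-vocabulary words are ignored
            row[col] += 1.0
    return row


# Round-trip a toy vocabulary through JSON, as it might be stored next to the .onnx file.
vocab = json.loads(json.dumps(["great", "terrible", "this", "is"]))
batch = [text_to_counts(t, vocab) for t in ["this is great", "great great movie"]]
print(batch)
```

Each row has one column per vocabulary word, and the number of rows is the batch size that dynamic_axes leaves free. Before passing such a batch to the runtime, it would need converting to a float32 array, for example with np.asarray(batch, dtype=np.float32).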

Running inference with onnxruntime — no PyTorch needed

The whole point: load the .onnx and predict with only onnxruntime, no training framework:

import onnxruntime as ort
sess = ort.InferenceSession("sentiment.onnx")
# counts: float32 NumPy array of shape (batch, vocab_size); float64 input is rejected
scores = sess.run(["score"], {"counts": counts})[0]

It runs — live output

cd code
python serving/onnx_export.py

Output:

exported -> sentiment.onnx  (602 bytes)
  0.977  positive  <- 'this is great'
  0.024  negative  <- 'this is terrible'

The model exported to a 602-byte .onnx file, and onnxruntime reproduced the predictions (0.977, 0.024) without the original SentimentModel class. That file now runs in C#, Java, JavaScript (onnxruntime-web), C++, or on a Raspberry Pi, with scores that should agree to within floating-point rounding.
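One way to check that a runtime reproduces the model is to compare its scores against the reference formula sigmoid(counts @ w + b) with a small tolerance rather than exact equality. The sketch below is plain Python with toy weights (not the trained weights from this chapter); reference_scores and scores_match are hypothetical helper names, and the rounded list stands in for what a float32 runtime would return.

```python
import math


def reference_scores(counts, weights, bias):
    """Compute sigmoid(counts @ w + b), one score per row, in plain Python."""
    scores = []
    for row in counts:
        z = sum(c * w for c, w in zip(row, weights)) + bias
        scores.append(1.0 / (1.0 + math.exp(-z)))
    return scores


def scores_match(expected, actual, tol=1e-5):
    """True when both lists have the same length and agree within tol."""
    return len(expected) == len(actual) and all(
        math.isclose(e, a, abs_tol=tol) for e, a in zip(expected, actual)
    )


# Toy numbers for illustration only.
weights, bias = [2.0, -2.0], 0.0
counts = [[1.0, 0.0], [0.0, 1.0]]

expected = reference_scores(counts, weights, bias)
runtime_like = [round(s, 6) for s in expected]   # stand-in for lower-precision runtime output
print(scores_match(expected, runtime_like))
```

In a real check, runtime_like would be replaced by the flattened output of sess.run, and the tolerance chosen to suit float32 arithmetic.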

Don't be confused: ONNX (the format) vs. onnxruntime (the engine). ONNX is the .onnx file — a portable description of the model's computation graph. onnxruntime is the program that loads and executes it fast. You export to ONNX and run with onnxruntime. (Other engines — TensorRT, OpenVINO — also run ONNX files, tuned for specific hardware.)

Why teams reach for it

  • Speed: onnxruntime applies graph optimizations (operator fusion, constant folding) and uses optimized kernels; inference is often 2–5× faster than the source framework, especially on CPU. The gain depends on the model, batch size, and hardware, so measure it on your own workload.
  • Portability: one model file, usable from most mainstream languages and platforms. No Python at serving time.
  • Smaller footprint: ship a lean runtime instead of a full PyTorch/TF install, which matters for containers (Chapter 5) and edge devices.
  • A bridge to faster hardware: ONNX is the on-ramp to TensorRT (NVIDIA GPUs), OpenVINO (Intel), and mobile/edge accelerators.
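Speed claims like these are worth verifying on your own model. A minimal timing harness, assuming each predictor is a callable that takes a batch, might look like the sketch below; median_latency_ms and both stand-in predictors are hypothetical names used only for illustration.

```python
import statistics
import time


def median_latency_ms(predict, batch, warmup=10, runs=100):
    """Median wall-clock time of predict(batch), in milliseconds."""
    for _ in range(warmup):              # let caches and lazy initialization settle
        predict(batch)
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        predict(batch)
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


# Stand-in predictors; swap in the framework call and the onnxruntime call.
def baseline_predict(batch):
    return [sum(row) for row in batch]


def candidate_predict(batch):
    return [sum(row) for row in batch]


batch = [[1.0, 0.0, 2.0]] * 64
baseline_ms = median_latency_ms(baseline_predict, batch)
candidate_ms = median_latency_ms(candidate_predict, batch)
print(f"baseline {baseline_ms:.4f} ms, candidate {candidate_ms:.4f} ms")
```

With the session from this chapter, the candidate would be something like lambda b: sess.run(["score"], {"counts": b}). The median is used instead of the mean so that a few slow outlier runs do not dominate the result.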

ONNX and quantization

ONNX pairs naturally with quantization: converting weights from float32 (4 bytes each) to int8 (1 byte each) shrinks weight storage about 4× and can speed up inference further, usually at a small cost in accuracy:

from onnxruntime.quantization import quantize_dynamic, QuantType
quantize_dynamic("sentiment.onnx", "sentiment.int8.onnx", weight_type=QuantType.QInt8)

This is how large models get squeezed onto modest hardware — the same compression idea the IVF-PQ book applies to vectors, applied to model weights.
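To make the float32-to-int8 idea concrete, here is a sketch of symmetric per-tensor quantization in plain Python. It illustrates the arithmetic only and is not onnxruntime's exact scheme (which can also use zero points and other settings); quantize_int8 and dequantize are hypothetical helper names, and the weights are toy values.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights) or 1.0    # avoid a zero scale for all-zero weights
    scale = max_abs / 127.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the int8 values."""
    return [q * scale for q in quantized]


weights = [0.50, -1.27, 0.03, 1.00]                  # toy values for illustration
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

float32_bytes = 4 * len(weights)                     # 4 bytes per float32 weight
int8_bytes = 1 * len(weights)                        # 1 byte per int8 weight
print(q, f"max error {max_error:.6f}", f"{float32_bytes // int8_bytes}x smaller")
```

The rounding error per weight is bounded by half the scale, which is why quantization usually costs little accuracy when the weights are not dominated by a few extreme values.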

When not to bother

ONNX shines for deployment of a settled model. During research and rapid iteration, stay in your native framework — exporting adds a step and not every exotic operation converts cleanly. Reach for ONNX when you're optimizing a model for production serving, targeting a non-Python environment, or chasing latency.

The takeaway

ONNX is a portable model format; onnxruntime is the fast engine that runs it. Export once (mind the dynamic_axes for variable batch sizes), and your model runs from most mainstream languages and platforms, often 2–5× faster depending on the workload, and quantizes down for edge hardware. It's the standard bridge from "trained in Python" to "served fast and anywhere." Now let's give non-engineers a way to try the model: a UI. 👉