Serving & running LLMs
You don't train large language models — you call them. But "call them" splits into two very different worlds: rent an API (Claude, GPT, Gemini) or run an open-weight model yourself (Llama, Mistral, Qwen via Ollama or vLLM). This chapter maps both, with the client patterns you'll actually write, so you can reason about the cost/control trade-off every AI team faces.
Setup: the API client needs pip install anthropic plus an ANTHROPIC_API_KEY environment variable; self-hosting needs Ollama or vLLM. Follow-along: code and commands shown.
The two worlds
| | API model (Claude, GPT, Gemini) | Open-weight, self-hosted (Llama, Mistral) |
|---|---|---|
| You get | the strongest models, instantly | full control, privacy, fixed cost |
| You manage | nothing — rent by the token | GPUs, scaling, the serving stack |
| Cost | per token (can balloon at scale) | hardware (fixed, high upfront) |
| Privacy | data leaves your network | data stays in your VPC |
| Best for | most apps, fastest path | privacy-critical, high-volume, customization |
Don't be confused: API vs. self-hosting is a real architectural decision. It's not "API = beginner, self-host = pro." Renting an API is the right call for the majority of products — you get frontier quality with zero ops. Self-hosting wins when data can't leave your infrastructure, when token volume makes per-call pricing painful, or when you need to fine-tune the weights. Reason about it on cost, privacy, and control — not prestige.
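The cost side of that decision is simple arithmetic. Here is a minimal sketch, assuming a made-up blended price per million tokens and a made-up fixed monthly bill for self-hosted hardware. Both numbers are placeholders rather than real quotes, and the function names are hypothetical.

```python
def monthly_api_cost(tokens_per_month: int, usd_per_million_tokens: float) -> float:
    """Pay-per-token bill: scales linearly with volume."""
    return tokens_per_month / 1_000_000 * usd_per_million_tokens


def break_even_tokens(fixed_monthly_usd: float, usd_per_million_tokens: float) -> float:
    """Monthly token volume at which the API bill equals the fixed self-hosting bill."""
    return fixed_monthly_usd / usd_per_million_tokens * 1_000_000


# Placeholder numbers for illustration only (assumptions, not real prices).
PRICE_PER_MILLION = 10.0   # USD per million tokens, hypothetical
FIXED_MONTHLY = 5000.0     # USD per month for GPUs and ops, hypothetical

print(monthly_api_cost(50_000_000, PRICE_PER_MILLION))      # 500.0
print(break_even_tokens(FIXED_MONTHLY, PRICE_PER_MILLION))  # 500000000.0
```

Below the break-even volume the API is cheaper; above it, the fixed hardware cost starts to pay off. A real comparison would also count engineering time and idle GPU hours, which this sketch ignores.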
World 1: calling an API (the common case)
The whole interface is one HTTP call. With the Anthropic SDK
(code/llm/claude_client.py):
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
resp = client.messages.create(
    model="claude-opus-4-8",
    max_tokens=512,
    messages=[{"role": "user", "content": "Summarize MLOps in one sentence."}],
)
print(resp.content[0].text)  # the first content block holds the reply text
Stream for responsiveness
For anything user-facing, stream the tokens as they're generated so the user sees output immediately instead of waiting for the whole response:
with client.messages.stream(
    model="claude-opus-4-8",
    max_tokens=512,
    messages=[{"role": "user", "content": "Write a haiku about deployment."}],
) as stream:
    # text_stream yields text fragments as the model generates them
    for text in stream.text_stream:
        print(text, end="", flush=True)
Streaming is also how you avoid request timeouts on long outputs — the connection
stays alive as tokens flow. This messages.stream / text_stream pattern is the one
you'll write most often behind a chat UI.
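To see why streaming feels faster, it helps to measure time to first chunk separately from total time. Here is a minimal sketch that works on any iterable of text fragments. The fake generator stands in for a real stream so the example runs offline, and both helper names are hypothetical.

```python
import time
from typing import Iterable, Iterator, Tuple


def fake_token_stream(tokens: Iterable[str], delay_s: float = 0.01) -> Iterator[str]:
    """Stand-in for a model stream: yields one fragment after each short delay."""
    for tok in tokens:
        time.sleep(delay_s)
        yield tok


def consume_with_timing(chunks: Iterable[str]) -> Tuple[str, float, float]:
    """Return (full_text, seconds_to_first_chunk, total_seconds)."""
    start = time.perf_counter()
    first_chunk_at = None
    parts = []
    for chunk in chunks:
        if first_chunk_at is None:
            first_chunk_at = time.perf_counter() - start
        parts.append(chunk)
    total = time.perf_counter() - start
    # An empty stream has no first chunk; fall back to the total time.
    return "".join(parts), (total if first_chunk_at is None else first_chunk_at), total


text, ttft, total = consume_with_timing(
    fake_token_stream(["Ship ", "it ", "on ", "Friday."])
)
print(f"{text!r}: first chunk after {ttft:.3f}s, complete after {total:.3f}s")
```

Because stream.text_stream is iterated as plain text fragments, it could be passed to consume_with_timing in place of the fake generator. The gap between the two numbers is the wait a non-streaming UI would impose on the user.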
The serving concerns are their problem
When you rent an API, the hard parts of LLM serving — GPU memory, batching, scaling, uptime — are handled for you. You focus on prompts, retrieval, cost, and latency.
World 2: running your own (open-weight models)
When you need control, you serve an open-weight model yourself. Two tools dominate:
Ollama — the easy on-ramp
Ollama runs open models locally with one command — perfect for development, prototyping, and privacy-sensitive local apps:
ollama pull llama3.2 # download an open-weight model
ollama run llama3.2 # chat in the terminal
# it also serves an HTTP API on localhost:11434
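import requests  # Python client for the local HTTP API
r = requests.post("http://localhost:11434/api/generate",
                  json={"model": "llama3.2", "prompt": "Hello", "stream": False},
                  timeout=120)
r.raise_for_status()  # fail loudly on HTTP errors
print(r.json()["response"])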
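The request above sets "stream": False to get one JSON object back. With streaming on, Ollama's generate endpoint instead returns one JSON object per line, each carrying a fragment in "response" and a "done" flag on the last one. Here is a minimal sketch of a parser for that line format. The sample lines are hand-written illustrations, not captured output, and the function name is hypothetical.

```python
import json
from typing import Iterable, Iterator


def iter_ollama_text(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield text fragments from newline-delimited JSON events."""
    for raw in lines:
        if not raw:          # skip keep-alive blank lines
            continue
        event = json.loads(raw)
        yield event.get("response", "")
        if event.get("done"):
            break


# Hand-written sample events in the shape described above (illustrative only).
sample = [
    b'{"model": "llama3.2", "response": "Hel", "done": false}',
    b'{"model": "llama3.2", "response": "lo", "done": false}',
    b'{"model": "llama3.2", "response": "", "done": true}',
]
print("".join(iter_ollama_text(sample)))  # Hello
```

Against a live server, the same function should accept r.iter_lines() from a requests.post call made with stream=True and "stream": True in the JSON body.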
vLLM — production-grade serving
vLLM is the high-throughput inference server for self-hosting at scale. Its key trick, continuous batching (packing many requests through the GPU together and admitting new ones as soon as earlier ones finish, rather than waiting for the whole batch), gives far higher throughput than serving requests one at a time. It exposes an OpenAI-compatible API:
pip install vllm
vllm serve meta-llama/Llama-3.1-8B-Instruct # serves an OpenAI-style API
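Because the API is OpenAI-compatible, a client only has to speak the chat-completions format. Here is a minimal sketch using only the standard library, assuming the server started above listens on vLLM's default address, http://localhost:8000. The helper names are made up for illustration.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumption: vLLM's default host and port


def build_chat_request(model: str, user_text: str, max_tokens: int = 128) -> dict:
    """Body for an OpenAI-style POST /v1/chat/completions call."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
        "max_tokens": max_tokens,
    }


def extract_text(response: dict) -> str:
    """Pull the assistant's reply out of an OpenAI-style response body."""
    return response["choices"][0]["message"]["content"]


def chat(model: str, user_text: str) -> str:
    """Send one chat turn to the local server (requires `vllm serve` to be running)."""
    payload = json.dumps(build_chat_request(model, user_text)).encode("utf-8")
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return extract_text(json.load(resp))
```

With the server up, chat("meta-llama/Llama-3.1-8B-Instruct", "Hello") would return the reply text. The same request shape is what lets existing OpenAI-style client code be pointed at a self-hosted model by changing only the base URL.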
Other names you'll hear: TGI (Hugging Face Text Generation Inference), Triton Inference Server (NVIDIA), TensorRT-LLM (NVIDIA's optimized engine, often the fastest option on NVIDIA GPUs), and llama.cpp (runs quantized models on CPUs/laptops).
Don't be confused: Ollama vs. vLLM. Ollama = easy, local, single-user, development and small apps. vLLM = production serving, high throughput, many concurrent users, GPU clusters. Prototype with Ollama; serve real traffic with vLLM (or a managed endpoint). Both run the same open-weight models — they're different engines, not different models.
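To make the throughput claim for continuous batching concrete, here is a toy simulation. It is a deliberately simplified model, not vLLM's actual scheduler: every request needs a known number of decode steps (at least one), the GPU has a fixed number of slots, and each step produces one token for every active request. The function names are hypothetical.

```python
from collections import deque
from typing import List


def static_batching_steps(lengths: List[int], batch_size: int) -> int:
    """Fixed batches: each batch runs until its longest request finishes."""
    total = 0
    for i in range(0, len(lengths), batch_size):
        total += max(lengths[i:i + batch_size])
    return total


def continuous_batching_steps(lengths: List[int], batch_size: int) -> int:
    """Free slots are refilled from the queue before every decode step."""
    queue = deque(lengths)
    active: List[int] = []  # remaining steps for each in-flight request
    steps = 0
    while queue or active:
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        steps += 1
        # Each active request emits one token; finished requests free their slot.
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps


lengths = [2, 8, 2, 8]  # output lengths in tokens, short and long mixed
print(static_batching_steps(lengths, batch_size=2))      # 16
print(continuous_batching_steps(lengths, batch_size=2))  # 12
```

In the static case, each short request leaves its slot idle while the long one in the same batch finishes. Refilling slots immediately is where the extra throughput comes from in this toy model.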
Making self-hosted models affordable: quantization
A 70B model at 16-bit precision (FP16/BF16) needs ~140 GB of GPU memory for the weights alone, which means multiple expensive cards. Quantization (Chapter 10's ONNX cousin) stores weights in 4–8 bits instead of 16, shrinking the model 2–4× so it fits on far less hardware (at 4 bits, a 70B model fits on a single large GPU, and smaller models fit on a laptop), with modest quality loss. Formats you'll see: GGUF (llama.cpp/Ollama), AWQ, GPTQ. This is what makes self-hosting economically viable.
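The ~140 GB figure is just parameters times bytes per weight, at 2 bytes (16 bits) each. Here is a minimal sketch of that arithmetic. It counts weights only and ignores the KV cache, activations, and the small per-format overhead of real quantization schemes, so treat the results as lower bounds. The function name is hypothetical.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate weight storage in decimal GB: params * bits / 8 bytes."""
    # 1e9 params * (bits / 8) bytes per param = params_billions * bits / 8 GB
    return params_billions * bits_per_weight / 8


for bits in (16, 8, 4):
    print(f"70B at {bits}-bit: ~{weight_memory_gb(70, bits):.0f} GB")
# 70B at 16-bit: ~140 GB
# 70B at 8-bit: ~70 GB
# 70B at 4-bit: ~35 GB
```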
Choosing, in practice
Need frontier quality, fast, low ops? → API (Claude/GPT/Gemini)
Data must not leave your network? → self-host (vLLM)
Prototyping / local / single user? → Ollama
Huge token volume, cost-sensitive? → self-host + quantize, or batch API
Need to fine-tune the weights? → open-weight + LoRA, self-hosted
Many production systems are hybrid: an API model for the hard requests, a small self-hosted model for cheap high-volume ones, routed by difficulty.
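A hybrid router can be as small as one function. Here is a minimal sketch, with a loud caveat: the difficulty heuristic (prompt length plus a few keywords) and the 0.5 threshold are invented for illustration. Real systems typically use a trained classifier or a cheap model to score requests.

```python
from typing import Callable


def estimate_difficulty(prompt: str) -> float:
    """Hypothetical heuristic: long prompts and reasoning keywords score higher (0 to 1)."""
    keywords = ("prove", "debug", "analyze", "plan")
    score = min(len(prompt) / 2000, 1.0)
    if any(word in prompt.lower() for word in keywords):
        score = max(score, 0.8)
    return score


def route(
    prompt: str,
    threshold: float = 0.5,
    scorer: Callable[[str], float] = estimate_difficulty,
) -> str:
    """Send hard requests to the rented API and easy ones to the self-hosted model."""
    return "api" if scorer(prompt) >= threshold else "self_hosted"


print(route("Translate 'hello' to French."))                # self_hosted
print(route("Debug this race condition in my scheduler."))  # api
```

Keeping the scorer pluggable means the routing policy can change without touching the code that calls either backend.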
The takeaway
Serving an LLM means one of two things: rent an API (one messages.create call;
stream for UX; serving is the provider's problem) or run an open-weight model yourself
with Ollama (easy/local) or vLLM (production throughput via continuous
batching), made affordable by quantization. Choose on cost, privacy, and control —
not prestige; hybrids are common. Whichever you pick, you must watch it — cost,
latency, and quality. That's LLM observability.