Serving & running LLMs

You don't train large language models — you call them. But "call them" splits into two very different worlds: rent an API (Claude, GPT, Gemini) or run an open-weight model yourself (Llama, Mistral, Qwen via Ollama or vLLM). This chapter maps both, with the client patterns you'll actually write, so you can reason about the cost/control trade-off every AI team faces.

Setup: the API client needs pip install anthropic and an ANTHROPIC_API_KEY environment variable; self-hosting needs Ollama or vLLM installed. This is a follow-along chapter: code and commands are shown inline.

The two worlds

           | API model (Claude, GPT, Gemini)   | Open-weight, self-hosted (Llama, Mistral)
You get    | the strongest models, instantly   | full control, privacy, fixed cost
You manage | nothing; rent by the token        | GPUs, scaling, the serving stack
Cost       | per token (can balloon at scale)  | hardware (fixed, high upfront)
Privacy    | data leaves your network          | data stays in your VPC
Best for   | most apps, fastest path           | privacy-critical, high-volume, customization

Don't be confused: API vs. self-hosting is a real architectural decision. It's not "API = beginner, self-host = pro." Renting an API is the right call for the majority of products — you get frontier quality with zero ops. Self-hosting wins when data can't leave your infrastructure, when token volume makes per-call pricing painful, or when you need to fine-tune the weights. Reason about it on cost, privacy, and control — not prestige.
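
The cost side of that decision reduces to a break-even volume: the monthly token count at which a fixed self-hosting bill equals the per-token API bill. A minimal sketch follows; the function name and both dollar figures are made-up placeholders, not real prices, and it ignores engineering time and the usual split between input and output token prices.

```python
def breakeven_tokens_per_month(fixed_monthly_cost: float,
                               api_price_per_million: float) -> float:
    """Monthly token volume at which a fixed self-hosting cost equals the API bill."""
    if api_price_per_million <= 0:
        raise ValueError("api_price_per_million must be positive")
    return fixed_monthly_cost / api_price_per_million * 1_000_000


# Hypothetical inputs: $2,000/month for a GPU server, $5 per million tokens via API.
volume = breakeven_tokens_per_month(2_000.0, 5.0)
print(f"Break-even at about {volume / 1e6:.0f}M tokens/month")
```

Below that volume the API is cheaper on these assumptions; above it, the fixed hardware cost starts to pay for itself.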

World 1: calling an API (the common case)

The whole interface is one HTTP call. With the Anthropic SDK (code/llm/claude_client.py):

import anthropic
client = anthropic.Anthropic()                 # reads ANTHROPIC_API_KEY from env

resp = client.messages.create(
    model="claude-opus-4-8",
    max_tokens=512,
    messages=[{"role": "user", "content": "Summarize MLOps in one sentence."}],
)
print(resp.content[0].text)

Stream for responsiveness

For anything user-facing, stream the tokens as they're generated so the user sees output immediately instead of waiting for the whole response:

with client.messages.stream(
    model="claude-opus-4-8",
    max_tokens=512,
    messages=[{"role": "user", "content": "Write a haiku about deployment."}],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)

Streaming is also how you avoid request timeouts on long outputs — the connection stays alive as tokens flow. This messages.stream / text_stream pattern is the one you'll write most often behind a chat UI.
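
Behind a chat UI you usually want two things from the stream: the chunks forwarded as they arrive, and the full text once it ends. The sketch below does both and also records time to first chunk, a common responsiveness metric. The helper name consume_stream and the fake_stream stand-in are illustrative assumptions, not part of the Anthropic SDK; the fake generator is there so the example runs without an API key.

```python
import time
from typing import Callable, Iterable, Optional, Tuple


def consume_stream(chunks: Iterable[str],
                   on_text: Callable[[str], None]) -> Tuple[str, Optional[float]]:
    """Forward each chunk to on_text; return (full_text, seconds_to_first_chunk)."""
    start = time.perf_counter()
    first_chunk_at: Optional[float] = None
    parts = []
    for chunk in chunks:
        if first_chunk_at is None:
            first_chunk_at = time.perf_counter() - start
        on_text(chunk)          # e.g. push to a websocket or print to the terminal
        parts.append(chunk)
    return "".join(parts), first_chunk_at


def fake_stream():
    # Stand-in for stream.text_stream so the sketch runs offline.
    yield from ["Ship ", "it ", "slowly."]


shown = []
full_text, ttft = consume_stream(fake_stream(), shown.append)
print(full_text)
```

With the real client, you would pass stream.text_stream as chunks from inside the with block shown earlier.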

The serving concerns are their problem

When you rent an API, the hard parts of LLM serving — GPU memory, batching, scaling, uptime — are handled for you. You focus on prompts, retrieval, cost, and latency.
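
Cost is the concern you can measure on every call, because the response reports token counts (resp.usage.input_tokens and resp.usage.output_tokens in the Anthropic SDK). A minimal sketch of turning those counts into dollars; the function name and the prices are placeholders, so substitute your provider's current rates.

```python
def estimate_cost_usd(input_tokens: int, output_tokens: int,
                      input_price_per_million: float,
                      output_price_per_million: float) -> float:
    """Cost of one call, given token counts and per-million-token prices."""
    return (input_tokens * input_price_per_million
            + output_tokens * output_price_per_million) / 1_000_000


# Hypothetical call: 1,200 input tokens, 300 output tokens,
# at placeholder prices of $3 and $15 per million tokens.
cost = estimate_cost_usd(1_200, 300, 3.0, 15.0)
print(f"${cost:.4f}")
```

Logging this number per request is the simplest way to see which prompts drive the bill.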

World 2: running your own (open-weight models)

When you need control, you serve an open-weight model yourself. Two tools dominate:

Ollama — the easy on-ramp

Ollama runs open models locally with one command — perfect for development, prototyping, and privacy-sensitive local apps:

ollama pull llama3.2          # download an open-weight model
ollama run llama3.2           # chat in the terminal
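# Ollama also serves an HTTP API on localhost:11434

Calling that API from Python:

import requests

r = requests.post("http://localhost:11434/api/generate",
                  json={"model": "llama3.2", "prompt": "Hello", "stream": False},
                  timeout=120)       # local generation can be slow; don't hang forever
r.raise_for_status()                 # surface HTTP errors instead of a KeyError below
print(r.json()["response"])          # "stream": False returns one JSON object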
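
The request above turns streaming off. With "stream": True, the sketch below assumes the endpoint sends newline-delimited JSON, one object per line, each carrying a "response" fragment and a "done" flag; check the Ollama API docs for the exact fields. The helper name iter_ollama_text is hypothetical, and the canned lines stand in for a live server.

```python
import json
from typing import Iterable, Iterator


def iter_ollama_text(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield text fragments from newline-delimited JSON until "done" is true."""
    for raw in lines:
        if not raw.strip():
            continue                      # skip blank lines
        event = json.loads(raw)
        yield event.get("response", "")
        if event.get("done"):
            break


# Canned lines in the assumed shape, so the sketch runs without a server.
sample = [
    b'{"response": "Hel", "done": false}',
    b'{"response": "lo", "done": false}',
    b'{"response": "", "done": true}',
]
print("".join(iter_ollama_text(sample)))
```

Against a running server, you would pass "stream": True in the JSON body, add stream=True to requests.post, and feed r.iter_lines() to the helper.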

vLLM — production-grade serving

vLLM is a high-throughput inference server for self-hosting at scale. Its key trick, continuous batching (admitting new requests into the running batch as earlier ones finish, rather than waiting for the whole batch to complete), keeps the GPU busy and gives far higher throughput than naive serving. It exposes an OpenAI-compatible API:

pip install vllm
vllm serve meta-llama/Llama-3.1-8B-Instruct      # serves an OpenAI-style API
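
Because the API is OpenAI-style, any HTTP client can talk to it. The sketch below uses only the standard library and assumes the server started by the command above listens on localhost:8000 and serves the /v1/chat/completions route; the helper names build_chat_request and send are illustrative. The request is built but not sent, so the example runs without a GPU.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str,
                       max_tokens: int = 128) -> urllib.request.Request:
    """Build (but do not send) a POST to the OpenAI-style chat completions route."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 60.0) -> str:
    """Send the request and return the first choice's message text."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Assumed address: the default local server on port 8000.
req = build_chat_request("http://localhost:8000",
                         "meta-llama/Llama-3.1-8B-Instruct",
                         "Summarize MLOps in one sentence.")
print(req.full_url)
# print(send(req))  # uncomment once the server is running
```

The same payload shape works with other OpenAI-compatible clients, so switching between a hosted API and a self-hosted server can come down to changing the base URL and model name.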

Other names you'll hear: TGI (Hugging Face Text Generation Inference), Triton (NVIDIA's Triton Inference Server), TensorRT-LLM (NVIDIA's optimized engine, often the fastest option on NVIDIA GPUs), and llama.cpp (runs quantized models on CPUs/laptops).

Don't be confused: Ollama vs. vLLM. Ollama = easy, local, single-user, development and small apps. vLLM = production serving, high throughput, many concurrent users, GPU clusters. Prototype with Ollama; serve real traffic with vLLM (or a managed endpoint). Both run the same open-weight models — they're different engines, not different models.
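
Why batching strategy matters for "many concurrent users" shows up even in a toy model. The sketch below is not vLLM's scheduler; it only counts decode steps, assuming each request needs a known number of steps and the GPU has batch_size slots. Fixed batches wait for their slowest member, while continuous batching refills a slot as soon as a request finishes. Function names and request lengths are illustrative.

```python
import heapq
from typing import List


def static_batching_steps(lengths: List[int], batch_size: int) -> int:
    """Total steps when each fixed batch runs until its longest request ends."""
    total = 0
    for i in range(0, len(lengths), batch_size):
        total += max(lengths[i:i + batch_size])
    return total


def continuous_batching_steps(lengths: List[int], batch_size: int) -> int:
    """Total steps when a freed slot is refilled immediately by the next request."""
    finish_times: List[int] = []          # min-heap of finish steps for busy slots
    for n in lengths:
        if len(finish_times) < batch_size:
            start = 0                     # a slot is still empty at step 0
        else:
            start = heapq.heappop(finish_times)  # earliest slot to free up
        heapq.heappush(finish_times, start + n)
    return max(finish_times, default=0)


requests_steps = [2, 8, 3, 7]             # decode steps needed per request
print(static_batching_steps(requests_steps, batch_size=2))
print(continuous_batching_steps(requests_steps, batch_size=2))
```

On these inputs the fixed-batch schedule takes 15 steps and the continuous one takes 12; the gap widens as output lengths vary more.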

Making self-hosted models affordable: quantization

A 70B model at 16-bit precision (FP16/BF16) needs ~140 GB of GPU memory for the weights alone, which means multiple expensive cards. Quantization (Chapter 10's ONNX cousin) stores weights in 4–8 bits instead of 16, shrinking the model 2–4× (about 35 GB for 70B at 4-bit) so it fits on a single large GPU, and smaller models fit on a laptop, with modest quality loss. Formats you'll see: GGUF (llama.cpp/Ollama), AWQ, GPTQ. This is what makes self-hosting economically viable.
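
The ~140 GB figure follows from simple arithmetic: parameters times bits per weight, divided by 8 to get bytes. A minimal sketch, counting weights only (the KV cache and activations need extra memory on top) and using 1 GB = 1e9 bytes; the function name is illustrative.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate memory for model weights only, in GB (1 GB = 1e9 bytes)."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


for bits in (16, 8, 4):
    print(f"70B at {bits}-bit: ~{weight_memory_gb(70, bits):.0f} GB")
```

Real quantized files run slightly larger than this estimate because formats store scaling factors alongside the low-bit weights.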

Choosing, in practice

Need frontier quality, fast, low ops?            → API (Claude/GPT/Gemini)
Data must not leave your network?                → self-host (vLLM)
Prototyping / local / single user?               → Ollama
Huge token volume, cost-sensitive?               → self-host + quantize, or batch API
Need to fine-tune the weights?                   → open-weight + LoRA, self-hosted

Many production systems are hybrid: an API model for the hard requests, a small self-hosted model for cheap high-volume ones, routed by difficulty.
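
A router for such a hybrid can start as a few lines of heuristics. The sketch below is one hypothetical way to guess difficulty, using prompt length and a keyword list; the function name, threshold, and markers are all assumptions for illustration, and real systems often use a small classifier or fall back to the API when the local answer looks weak.

```python
def route(prompt: str, max_local_words: int = 200) -> str:
    """Return "api" for prompts that look hard, "local" for everything else."""
    hard_markers = ("prove", "step by step", "analyze", "refactor")  # illustrative
    text = prompt.lower()
    if len(text.split()) > max_local_words:
        return "api"        # long context: send to the stronger model
    if any(marker in text for marker in hard_markers):
        return "api"        # wording hints at multi-step reasoning
    return "local"          # cheap, high-volume default


print(route("Translate 'hello' to French"))
print(route("Analyze this contract step by step"))
```

Keeping the routing decision in one function makes it easy to log which path each request took and to tune the rule against observed quality and cost.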

The takeaway

Serving an LLM means one of two things: rent an API (one messages.create call; stream for UX; serving is the provider's problem) or run an open-weight model yourself with Ollama (easy/local) or vLLM (production throughput via continuous batching), made affordable by quantization. Choose on cost, privacy, and control, not prestige; hybrids are common. Whichever you pick, you must watch it: cost, latency, and quality. That's LLM observability.