Serving & running LLMs

You don't train large language models — you call them. But "call them" splits into two very different worlds: rent an API (Claude, GPT, Gemini) or run an open-weight model yourself (Llama, Mistral, Qwen via Ollama or vLLM). This chapter maps both, with the client patterns you'll actually write, so you can reason about the cost/control trade-off every AI team faces.

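Setup: the API examples need the Anthropic SDK (pip install anthropic) and an ANTHROPIC_API_KEY environment variable; the self-hosting examples need Ollama or vLLM installed. This is a follow-along chapter: the code and commands are shown inline.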

The two worlds

	API model (Claude, GPT, Gemini)	Open-weight, self-hosted (Llama, Mistral)
You get	the strongest models, instantly	full control, privacy, fixed cost
You manage	nothing — rent by the token	GPUs, scaling, the serving stack
Cost	per token (can balloon at scale)	hardware (fixed, high upfront)
Privacy	data leaves your network	data stays in your VPC
Best for	most apps, fastest path	privacy-critical, high-volume, customization

Don't be confused: API vs. self-hosting is a real architectural decision. It's not "API = beginner, self-host = pro." Renting an API is the right call for the majority of products — you get frontier quality with zero ops. Self-hosting wins when data can't leave your infrastructure, when token volume makes per-call pricing painful, or when you need to fine-tune the weights. Reason about it on cost, privacy, and control — not prestige.
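
To make the cost side of that decision concrete, here is a minimal sketch of a break-even calculation. Every number in it is a hypothetical placeholder rather than a real price, and the function names are made up for illustration. It also ignores engineering time, which is often the largest self-hosting cost.

```python
def monthly_api_cost(tokens_per_month: float, usd_per_million_tokens: float) -> float:
    """Pay-per-token cost: grows linearly with usage."""
    return tokens_per_month / 1_000_000 * usd_per_million_tokens


def break_even_tokens(fixed_monthly_usd: float, usd_per_million_tokens: float) -> float:
    """Monthly token volume at which a fixed self-hosting bill equals the API bill."""
    return fixed_monthly_usd / usd_per_million_tokens * 1_000_000


# Hypothetical inputs for illustration only; substitute your own quotes.
API_PRICE = 10.0            # USD per million tokens (assumed blended rate)
SELF_HOST_FIXED = 5_000.0   # USD per month for GPUs plus ops (assumed)

threshold = break_even_tokens(SELF_HOST_FIXED, API_PRICE)
print(f"Break-even at about {threshold:,.0f} tokens per month")
```

Below the threshold the API is cheaper; above it, fixed hardware starts to pay for itself, provided you can keep the GPUs busy.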

World 1: calling an API (the common case)

The whole interface is one HTTP call. With the Anthropic SDK (code/llm/claude_client.py):

import anthropic
client = anthropic.Anthropic()                 # reads ANTHROPIC_API_KEY from env

resp = client.messages.create(
    model="claude-opus-4-8",                   # swap in a model ID your account can use
    max_tokens=512,
    messages=[{"role": "user", "content": "Summarize MLOps in one sentence."}],
)
print(resp.content[0].text)
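
To see what "one HTTP call" means underneath the SDK, here is a rough sketch that builds the same request by hand with the standard library. The endpoint URL, header names, and response shape reflect Anthropic's public HTTP API as I understand it, so check them against the current API reference. The helper names build_request and send are hypothetical.

```python
import json
import urllib.request

API_URL = "https://api.anthropic.com/v1/messages"


def build_request(api_key: str, model: str, prompt: str, max_tokens: int = 512):
    """Return (headers, body) for a Messages API call without sending it."""
    headers = {
        "x-api-key": api_key,
        "anthropic-version": "2023-06-01",
        "content-type": "application/json",
    }
    body = {
        "model": model,
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    }
    return headers, json.dumps(body).encode("utf-8")


def send(headers: dict, body: bytes) -> str:
    """POST the request and return the text of the first content block."""
    req = urllib.request.Request(API_URL, data=body, headers=headers, method="POST")
    with urllib.request.urlopen(req, timeout=60) as resp:
        payload = json.load(resp)
    return payload["content"][0]["text"]


# Usage (needs a real key and network access):
#   import os
#   headers, body = build_request(os.environ["ANTHROPIC_API_KEY"],
#                                 "claude-opus-4-8", "Summarize MLOps in one sentence.")
#   print(send(headers, body))
```

In practice you would still use the SDK, which adds retries, typed responses, and streaming helpers on top of this call.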

Stream for responsiveness

For anything user-facing, stream the tokens as they're generated so the user sees output immediately instead of waiting for the whole response:

with client.messages.stream(
    model="claude-opus-4-8",
    max_tokens=512,
    messages=[{"role": "user", "content": "Write a haiku about deployment."}],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)

Streaming is also how you avoid request timeouts on long outputs — the connection stays alive as tokens flow. This messages.stream / text_stream pattern is the one you'll write most often behind a chat UI.
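
Behind a chat UI you usually want two things from a stream: show each piece right away, and keep the full reply for history or logging. The sketch below shows that pattern with a hypothetical relay helper. A plain iterator stands in for stream.text_stream so the code runs without an API key.

```python
import sys
from typing import Callable, Iterable


def relay(chunks: Iterable[str],
          write: Callable[[str], object] = sys.stdout.write) -> str:
    """Forward each chunk as it arrives and return the concatenated reply."""
    parts = []
    for chunk in chunks:
        write(chunk)          # show the user this piece immediately
        parts.append(chunk)   # keep it for history or logging
    return "".join(parts)


# Stand-in for stream.text_stream so the sketch runs offline.
fake_stream = iter(["Ships ", "at ", "dusk"])
full_text = relay(fake_stream)
```

With the real client you would pass stream.text_stream as chunks, and write could be whatever pushes text to your frontend (a WebSocket send, a server-sent-events queue, and so on).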

The serving concerns are their problem

When you rent an API, the hard parts of LLM serving — GPU memory, batching, scaling, uptime — are handled for you. You focus on prompts, retrieval, cost, and latency.

World 2: running your own (open-weight models)

When you need control, you serve an open-weight model yourself. Two tools dominate:

Ollama — the easy on-ramp

Ollama runs open models locally with one command — perfect for development, prototyping, and privacy-sensitive local apps:

ollama pull llama3.2          # download an open-weight model
ollama run llama3.2           # chat in the terminal
# it also serves an HTTP API on localhost:11434

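import requests

# "stream": False asks Ollama for one JSON object instead of a stream of chunks
r = requests.post("http://localhost:11434/api/generate",
                  json={"model": "llama3.2", "prompt": "Hello", "stream": False},
                  timeout=120)
r.raise_for_status()          # fail loudly on HTTP errors (e.g. model not pulled)
print(r.json()["response"])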
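
If you leave streaming on, the same endpoint returns newline-delimited JSON: one object per line, each carrying a text fragment in "response" and a "done" flag on the last one. Here is a minimal sketch of a parser for that format. The helper name iter_ollama_text is hypothetical, and the sample lines are canned data in the shape I understand the endpoint to emit, so the code runs without a server.

```python
import json
from typing import Iterable, Iterator


def iter_ollama_text(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield the text fragment from each NDJSON line until "done" is true."""
    for raw in lines:
        if not raw.strip():
            continue                      # skip blank lines
        event = json.loads(raw)
        yield event.get("response", "")
        if event.get("done"):
            break


# Canned lines standing in for a live response body.
sample = [
    b'{"model": "llama3.2", "response": "Hel", "done": false}',
    b'{"model": "llama3.2", "response": "lo", "done": false}',
    b'{"model": "llama3.2", "response": "", "done": true}',
]
print("".join(iter_ollama_text(sample)))
```

Against a live server you would call requests.post(..., stream=True) and pass r.iter_lines() as lines.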

vLLM — production-grade serving

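vLLM is a high-throughput inference server for self-hosting at scale. Its key tricks are continuous batching (new requests join the running batch at each decoding step instead of waiting for the current batch to finish) and PagedAttention (KV-cache memory managed in pages to cut waste), which together give far higher throughput than serving requests one at a time. It exposes an OpenAI-compatible API: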

pip install vllm
vllm serve meta-llama/Llama-3.1-8B-Instruct      # serves an OpenAI-style API
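
Once the server is up, you talk to it with the same request shape as an OpenAI-style chat completion. The sketch below uses only the standard library and assumes the server is listening on vLLM's default port 8000 on localhost. The helper names are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"   # assumed default; adjust if you changed the port


def chat_payload(model: str, prompt: str, max_tokens: int = 128) -> bytes:
    """Build the JSON body for an OpenAI-style chat completion request."""
    body = {
        "model": model,
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    }
    return json.dumps(body).encode("utf-8")


def extract_reply(response_json: dict) -> str:
    """Pull the assistant text out of a chat completion response."""
    return response_json["choices"][0]["message"]["content"]


def chat(model: str, prompt: str) -> str:
    """Send one chat request to the local server and return the reply text."""
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=chat_payload(model, prompt),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return extract_reply(json.load(resp))


# Usage (needs the server from the command above to be running):
#   print(chat("meta-llama/Llama-3.1-8B-Instruct", "Say hello."))
```

Because the API is OpenAI-compatible, existing OpenAI client code can usually be pointed at this base URL instead of writing raw HTTP.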

Other names you'll hear: TGI (Hugging Face Text Generation Inference), Triton Inference Server (NVIDIA), TensorRT-LLM (NVIDIA's optimized inference library, often among the fastest options on NVIDIA GPUs), and llama.cpp (runs quantized models on CPUs and laptops).

Don't be confused: Ollama vs. vLLM. Ollama = easy, local, single-user, development and small apps. vLLM = production serving, high throughput, many concurrent users, GPU clusters. Prototype with Ollama; serve real traffic with vLLM (or a managed endpoint). Both run the same open-weight models — they're different engines, not different models.
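
To get a feel for why continuous batching helps with many concurrent users, here is a toy simulation. It is not how vLLM is implemented. It only counts decoding steps under two scheduling policies, assuming each active request produces one token per step and the GPU can hold batch_size requests at once.

```python
from collections import deque
from typing import List


def static_batching_steps(lengths: List[int], batch_size: int) -> int:
    """Fixed groups: every batch runs until its longest request finishes."""
    total = 0
    for i in range(0, len(lengths), batch_size):
        total += max(lengths[i:i + batch_size])
    return total


def continuous_batching_steps(lengths: List[int], batch_size: int) -> int:
    """Free slots are refilled from the queue before every decoding step."""
    waiting = deque(lengths)
    active: List[int] = []          # tokens still to generate per active request
    steps = 0
    while waiting or active:
        while waiting and len(active) < batch_size:
            active.append(waiting.popleft())
        steps += 1                  # one decoding step for every active request
        active = [r - 1 for r in active if r - 1 > 0]
    return steps


output_lengths = [2, 8, 2, 8]       # hypothetical tokens per request
print("static:    ", static_batching_steps(output_lengths, batch_size=2))
print("continuous:", continuous_batching_steps(output_lengths, batch_size=2))
```

With these made-up lengths the static policy needs 16 steps and the continuous one needs 12, because short requests free their slot for the next request instead of idling until the long one finishes.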

Making self-hosted models affordable: quantization

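A 70B model at 16-bit precision (FP16/BF16) needs ~140 GB of GPU memory for the weights alone (70 billion parameters × 2 bytes), which means multiple expensive cards. Quantization (Chapter 10's ONNX cousin) stores weights in 4–8 bits instead of 16, shrinking the model roughly 2× at 8-bit and 4× at 4-bit, so a 4-bit 70B model (~35 GB) fits on a single high-memory GPU and smaller models fit on a laptop, usually with modest quality loss. Formats you'll see: GGUF (llama.cpp/Ollama), AWQ, GPTQ. This is what makes self-hosting economically viable.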
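
The memory arithmetic is simple enough to check yourself. This sketch counts the weights only, as parameters times bits divided by eight. Real deployments also need room for the KV cache and activations, and real quantization formats add a little overhead for scales and metadata, so treat the results as lower bounds.

```python
def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Approximate memory for the weights alone, in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9


for bits in (16, 8, 4):
    print(f"70B at {bits}-bit: ~{weight_memory_gb(70e9, bits):.0f} GB")
```

That gives roughly 140 GB, 70 GB, and 35 GB for a 70B model at 16, 8, and 4 bits.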

Choosing, in practice

Need frontier quality, fast, low ops?            → API (Claude/GPT/Gemini)
Data must not leave your network?                → self-host (vLLM)
Prototyping / local / single user?               → Ollama
Huge token volume, cost-sensitive?               → self-host + quantize, or batch API
Need to fine-tune the weights?                   → open-weight + LoRA, self-hosted

Many production systems are hybrid: an API model for the hard requests, a small self-hosted model for cheap high-volume ones, routed by difficulty.
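
A hybrid router can start out very small. The sketch below is one hypothetical way to do it: a crude heuristic marks a prompt as hard (long, or containing a reasoning keyword) and picks a backend by name. The keyword list, the word threshold, and the stub backends are all assumptions for illustration. A real system might use a classifier or fall back to the API when the small model's answer looks weak.

```python
from typing import Callable, Dict, Tuple


def looks_hard(prompt: str, max_easy_words: int = 50) -> bool:
    """Hypothetical heuristic: long prompts or reasoning keywords count as hard."""
    keywords = ("prove", "analyze", "step by step", "refactor")
    text = prompt.lower()
    return len(text.split()) > max_easy_words or any(k in text for k in keywords)


def route(prompt: str, backends: Dict[str, Callable[[str], str]]) -> Tuple[str, str]:
    """Pick a backend by difficulty and return (backend_name, reply)."""
    name = "api" if looks_hard(prompt) else "local"
    return name, backends[name](prompt)


# Stub backends so the sketch runs offline; swap in real client calls.
backends = {
    "api": lambda p: f"[api model] {p}",
    "local": lambda p: f"[local model] {p}",
}

print(route("Translate 'hello' to French", backends)[0])
print(route("Analyze this contract step by step", backends)[0])
```

The first prompt goes to the local model and the second to the API.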

The takeaway

Serving an LLM means one of two things: rent an API (one messages.create call; stream for UX; serving is the provider's problem) or run an open-weight model yourself with Ollama (easy/local) or vLLM (production throughput via continuous batching), made affordable by quantization. Choose on cost, privacy, and control, not prestige; hybrids are common. Whichever you pick, you must watch it: cost, latency, and quality. That's LLM observability.

Production ML & AI Tools: A Hands-On Field Guide