LLM observability & evaluation

A classical model has metrics like accuracy; an LLM feature has cost, latency, and "was the answer any good?", and that last one has no ground-truth label. LLM observability is monitoring (Chapter 12) adapted to this reality: track every call's tokens, cost, and latency, trace multi-step chains, and evaluate output quality without labels. Skip it and you risk a surprise five-figure bill and a silently degrading product.

Setup: the cost/token math runs with no dependencies (verified output below). Tracing and eval rely on external tools (Langfuse, LangSmith) or on a second LLM acting as judge.

1. Track cost, tokens & latency on every call

Every LLM response reports its token usage; turn that into dollars and log it. From code/llm/claude_client.py:

PRICING = {                       # USD per 1,000,000 tokens (input, output)
    "claude-opus-4-8":   (5.00, 25.00),
    "claude-sonnet-4-6": (3.00, 15.00),
    "claude-haiku-4-5":  (1.00,  5.00),
}
def cost_usd(model, input_tokens, output_tokens):
    pin, pout = PRICING[model]
    return input_tokens/1e6 * pin + output_tokens/1e6 * pout

The eye-opener is what model choice does to the bill. A typical RAG call (~1,500 input tokens of context + question, ~300 output) across the model tiers:

$ python llm/claude_client.py

Output:

claude-opus-4-8     1500 in / 300 out  ->  $0.01500
claude-sonnet-4-6   1500 in / 300 out  ->  $0.00900
claude-haiku-4-5    1500 in / 300 out  ->  $0.00300

100k Opus calls/day  ->  $1,500.00/day  ~$45,000/month

That projection is why you instrument cost from day one. The same workload is $45k/month on Opus, ~$27k on Sonnet, and ~$9k on Haiku: a 5× swing from one config value. You can only make that trade-off if you're measuring per-call cost in production. (Prompt caching can also cut the input cost of repeated context by up to ~90% on cache hits, another lever you only find by watching the numbers.)
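The projection is plain arithmetic, so a small sketch makes it easy to rerun with your own traffic. The prices are copied from the PRICING table above; the call volume, the 30-day month, and the helper name `monthly_cost` are illustrative assumptions, not part of claude_client.py.

```python
# Sketch: project monthly spend per model for a fixed per-call token profile.
# CALLS_PER_DAY and DAYS_PER_MONTH are assumptions; replace with measured values.
PRICING = {  # USD per 1,000,000 tokens (input, output)
    "claude-opus-4-8": (5.00, 25.00),
    "claude-sonnet-4-6": (3.00, 15.00),
    "claude-haiku-4-5": (1.00, 5.00),
}
CALLS_PER_DAY = 100_000
DAYS_PER_MONTH = 30


def cost_usd(model, input_tokens, output_tokens):
    pin, pout = PRICING[model]
    return input_tokens / 1e6 * pin + output_tokens / 1e6 * pout


def monthly_cost(model, input_tokens=1500, output_tokens=300):
    """Monthly spend if every call has the same token profile."""
    per_call = cost_usd(model, input_tokens, output_tokens)
    return per_call * CALLS_PER_DAY * DAYS_PER_MONTH


for model in PRICING:
    print(f"{model:<20} ${monthly_cost(model):>10,.0f}/month")
```

Changing `input_tokens` shows how much a longer retrieved context costs at each tier before you commit to it.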

The production version logs a structured record per call — not a print:

from dataclasses import dataclass

@dataclass
class CallRecord:                 # one structured record per LLM call
    model: str; input_tokens: int; output_tokens: int; cost: float; latency_s: float

Ship these to your metrics system and you get dashboards for cost per day, tokens per request, p95 latency, and error rate — the LLM's operational vitals.
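One way to produce those records is to wrap each call in a timer and emit a JSON line. This is a minimal sketch: `timed_call`, `p95_latency`, and `fake_call` are hypothetical names, and `fake_call` stands in for a real client so the example runs offline.

```python
import json
import time
from dataclasses import asdict, dataclass
from statistics import quantiles


@dataclass
class CallRecord:
    model: str
    input_tokens: int
    output_tokens: int
    cost: float
    latency_s: float


def timed_call(model, call_fn, price_in, price_out):
    """Run call_fn(), time it, and return (text, CallRecord).

    call_fn is a hypothetical zero-argument callable returning
    (text, input_tokens, output_tokens); adapt it to your client's response.
    Prices are USD per 1,000,000 tokens.
    """
    start = time.perf_counter()
    text, tokens_in, tokens_out = call_fn()
    latency = time.perf_counter() - start
    cost = tokens_in / 1e6 * price_in + tokens_out / 1e6 * price_out
    return text, CallRecord(model, tokens_in, tokens_out, cost, latency)


def p95_latency(records):
    """95th-percentile latency; needs at least two records."""
    return quantiles([r.latency_s for r in records], n=20)[-1]


def fake_call():
    # Stand-in for a real API call.
    return "stub answer", 1500, 300


text, record = timed_call("claude-haiku-4-5", fake_call, 1.00, 5.00)
print(json.dumps(asdict(record)))  # one JSON line per call, ready for a log shipper
```

In practice the percentile is usually computed by the metrics backend; `p95_latency` is only here to show what the dashboard number means.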

2. Trace multi-step chains

A RAG request isn't one call — it's embed → retrieve → (re-rank) → generate, sometimes across several LLM calls. When the answer is wrong, which step failed? Tracing records the full tree of a request — inputs, outputs, tokens, and timing at each step — so you can see exactly where it went sideways.

Tools: Langfuse (open-source), LangSmith (LangChain's), Helicone, plus OpenTelemetry for generic spans. The pattern is a decorator or context manager around each step:

from langfuse.decorators import observe   # Langfuse Python SDK v2 path; v3 exposes it as `from langfuse import observe`

@observe()
def retrieve(question): ...        # each @observe step becomes a span in the trace

@observe()
def answer(question, context): ... # inputs, outputs, tokens, latency captured per span

Don't be confused: logging vs. tracing. A log is one event ("called Claude, 300 tokens"). A trace is the whole request tree — every step, nested, with timing and token counts at each node. For single calls, logs suffice; for RAG and agent chains, you need tracing to answer "the final answer was wrong — was it bad retrieval or bad generation?"

3. Evaluate quality — without labels

The hard part: there's no accuracy score for "summarize this" or "answer from these docs." Three approaches, in rising rigor:

LLM-as-judge

Use a strong LLM to grade outputs against a rubric. It scales where human review can't:

# context: the retrieved passages; answer: the model output being graded
judge_prompt = (
    "Score the ANSWER from 1-5 for how well it's supported by the CONTEXT.\n"
    "Reply with only the number.\n\n"
    f"CONTEXT:\n{context}\n\nANSWER:\n{answer}"
)
# send to claude-opus-4-8, parse the score, track it over time

This is how you measure RAG groundedness (did the answer stick to the retrieved context, or hallucinate?) at scale. Tools like Ragas and DeepEval package these metrics (faithfulness, answer-relevancy, context-precision).
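The "parse the score, track it over time" step deserves care, because a judge does not always reply with a bare number. A minimal sketch, assuming a hypothetical `ask_judge` callable that takes a prompt and returns the judge's reply text; `build_judge_prompt`, `parse_score`, and `judge_batch` are illustrative names, not part of Ragas or DeepEval.

```python
import re
from statistics import mean


def build_judge_prompt(context, answer):
    return (
        "Score the ANSWER from 1-5 for how well it's supported by the CONTEXT.\n"
        "Reply with only the number.\n\n"
        f"CONTEXT:\n{context}\n\nANSWER:\n{answer}"
    )


def parse_score(reply):
    """Return the score as an int in 1..5, or None if the reply is unusable."""
    match = re.fullmatch(r"\s*([1-5])\s*", reply)
    return int(match.group(1)) if match else None


def judge_batch(pairs, ask_judge):
    """Score (context, answer) pairs; ask_judge is a hypothetical prompt -> reply callable."""
    scores = [parse_score(ask_judge(build_judge_prompt(c, a))) for c, a in pairs]
    valid = [s for s in scores if s is not None]
    return {
        "mean_score": mean(valid) if valid else None,
        "unparsed": len(scores) - len(valid),
    }


# Stub judge so the sketch runs without an API key.
stub_replies = iter(["5", "3", "not sure"])
print(judge_batch([("ctx", "ans")] * 3, lambda prompt: next(stub_replies)))
```

The parser is deliberately strict: a reply like "4/5" counts as unparsed rather than being guessed at, and a rising `unparsed` count is itself a signal that the judge prompt needs work.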

Golden datasets

Keep a fixed set of representative inputs with known-good answers; run them on every prompt or model change and check for regressions — the LLM equivalent of a test suite (Chapter 18).
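A golden-set run can be as small as the sketch below. The cases, the `must_contain` substring check, and the names `run_golden_set` and `stub_generate` are all hypothetical; substring matching is the crudest possible check, and an exact match or a judge score could replace it.

```python
GOLDEN_SET = [  # hypothetical cases; real ones come from your product
    {"question": "What is the refund window?", "must_contain": ["30 days"]},
    {"question": "Which plan includes SSO?", "must_contain": ["Enterprise"]},
]


def run_golden_set(generate, cases, min_pass_rate=1.0):
    """Run every case through generate(question) and report failures."""
    failures = []
    for case in cases:
        output = generate(case["question"]).lower()
        missing = [s for s in case["must_contain"] if s.lower() not in output]
        if missing:
            failures.append({"question": case["question"], "missing": missing})
    pass_rate = 1 - len(failures) / len(cases)
    return {"pass_rate": pass_rate, "failures": failures, "ok": pass_rate >= min_pass_rate}


def stub_generate(question):
    # Stand-in for the real prompt + model under test.
    return "Refunds are accepted within 30 days." if "refund" in question else "Any plan."


report = run_golden_set(stub_generate, GOLDEN_SET)
print(report)
```

In CI, asserting on `report["ok"]` turns a prompt or model change that breaks a known-good answer into a failed build.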

Human feedback

Collect 👍/👎 and corrections from real users. The highest-signal data — and the seed for fine-tuning or improving retrieval.
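Turning raw votes into a signal usually means a rolling rate with a floor on sample size. A minimal sketch; the class name `ThumbsDownMonitor` and the window, threshold, and minimum-count defaults are assumptions to tune against your own traffic.

```python
from collections import deque


class ThumbsDownMonitor:
    """Rolling thumbs-down rate over the most recent `window` ratings."""

    def __init__(self, window=500, alert_rate=0.15, min_ratings=20):
        self.ratings = deque(maxlen=window)   # True = thumbs-down
        self.alert_rate = alert_rate
        self.min_ratings = min_ratings

    def record(self, thumbs_down):
        self.ratings.append(bool(thumbs_down))

    @property
    def rate(self):
        return sum(self.ratings) / len(self.ratings) if self.ratings else 0.0

    def should_alert(self):
        # Wait for enough ratings so a handful of early votes cannot trip the alert.
        return len(self.ratings) >= self.min_ratings and self.rate > self.alert_rate


monitor = ThumbsDownMonitor(window=10, alert_rate=0.2, min_ratings=5)
for vote in [True, False, False, False, True]:
    monitor.record(vote)
print(f"{monitor.rate:.0%}", monitor.should_alert())
```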

Don't be confused: offline eval vs. online monitoring. Offline eval runs your golden set against a new prompt/model before shipping — a gate, like CI. Online monitoring watches real production traffic (cost, latency, thumbs-down rate, judge scores on a sample) after shipping. You need both: eval to catch regressions before release, monitoring to catch drift after.
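Scoring "on a sample" needs a rule for which requests get judged, since each judged request costs an extra LLM call. One common approach is to hash the request ID so the choice is deterministic and repeatable. This is a sketch; the function name `in_judge_sample` and the 5% default rate are assumptions.

```python
import hashlib


def in_judge_sample(request_id, rate=0.05):
    """Deterministically pick roughly `rate` of requests for judge scoring."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")   # integer in [0, 2**64)
    return bucket < rate * 2**64


sampled = [rid for rid in (f"req-{i}" for i in range(1000)) if in_judge_sample(rid)]
print(f"{len(sampled)} of 1000 requests selected for judging")
```

Because the decision depends only on the ID, a retried or replayed request lands in the same bucket, which keeps sampled judge scores comparable across days.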

What to watch in production

Signal	Why
Cost / day & tokens / request	budgets blow up silently; catch runaway prompts
p50 / p95 latency	LLMs are slow; the tail kills UX
Error & timeout rate	rate limits, overloads, malformed outputs
Groundedness / judge score (sampled)	is RAG still answering from sources?
Thumbs-down rate	the realest quality signal you have
Prompt-injection / safety flags	users will try to break it
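These signals only help if something checks them. A minimal sketch of a threshold check over a snapshot of the metrics; every signal name and limit below is a hypothetical placeholder, and a real deployment would normally express these as alert rules in its monitoring system.

```python
# Hypothetical thresholds; tune them to your own traffic and budget.
THRESHOLDS = {
    "cost_usd_per_day": ("max", 2000.0),
    "p95_latency_s": ("max", 8.0),
    "error_rate": ("max", 0.02),
    "judge_score_mean": ("min", 4.0),
    "thumbs_down_rate": ("max", 0.10),
}


def breached(signals, thresholds=THRESHOLDS):
    """Return the names of signals outside their threshold; skip missing ones."""
    alerts = []
    for name, (kind, limit) in thresholds.items():
        value = signals.get(name)
        if value is None:
            continue
        if (kind == "max" and value > limit) or (kind == "min" and value < limit):
            alerts.append(name)
    return alerts


snapshot = {"cost_usd_per_day": 1500.0, "p95_latency_s": 9.5, "judge_score_mean": 3.6}
print(breached(snapshot))
```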

The takeaway

LLM observability = track cost/tokens/latency on every call (model choice is a 5× cost lever you must measure), trace multi-step chains to localize failures, and evaluate quality without labels via LLM-as-judge, golden datasets, and human feedback. Keep offline eval (a pre-ship gate) separate from online monitoring (post-ship drift). That completes the GenAI stack — now the engineering discipline that holds all of it together: testing and CI/CD. 👉

Production ML & AI Tools: A Hands-On Field Guide