Building a RAG service

Retrieval-Augmented Generation (RAG) is the dominant pattern for building on top of LLMs, and it's built entirely from tools you now have: a vector store (last chapter) for retrieval and an LLM for generation. It lets a model answer from your private or current documents, data it was never trained on, and it reduces hallucination because the model is instructed to answer from retrieved text rather than from memory.

Setup: retrieval runs on NumPy (verified output below); generation needs pip install anthropic and an ANTHROPIC_API_KEY. Code in code/rag/rag.py.

The pattern in five steps

1. chunk + embed your documents  → vector store     (offline, once)
2. embed the user's question                        (per query)
3. retrieve the top-k similar chunks                (vector search)
4. stuff them into the prompt as context            (prompt assembly)
5. ask the LLM to answer USING ONLY that context    (grounded generation)

Steps 1–3 are vector search; steps 4–5 are one LLM call. That's the whole thing.
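The five steps can be sketched end to end in plain Python. This is a toy under stated assumptions: `toy_embed` is a hypothetical bag-of-words stand-in for a real embedding model, the "vector store" is a plain list, and step 5 stops at the assembled prompt so nothing calls an LLM.

```python
import math
from collections import Counter

def toy_embed(text):
    # Hypothetical stand-in for a real embedding model: raw word counts.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse word-count vectors.
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

DOCS = [
    "dogs are extremely loyal pets and bond closely with their owners",
    "cats are independent pets that groom themselves",
    "python is a popular programming language",
]

# Step 1 (offline, once): embed every chunk and keep it next to its text.
INDEX = [(doc, toy_embed(doc)) for doc in DOCS]

def rag_prompt(question, k=2):
    q_vec = toy_embed(question)                                   # step 2: embed the question
    ranked = sorted(INDEX, key=lambda pair: cosine(q_vec, pair[1]), reverse=True)
    top = [doc for doc, _ in ranked[:k]]                          # step 3: top-k chunks
    context = "\n".join(f"- {c}" for c in top)                    # step 4: stuff into the prompt
    return (
        "Answer the question using ONLY the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )                                                             # step 5 would send this to the LLM

prompt = rag_prompt("how loyal are dogs?")
print(prompt)
```

Word counts are a crude proxy for meaning; a real embedding model would also match "loyal" to "faithful", which this toy cannot.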

Steps 1–3: retrieve (you already built this)

We embed a small knowledge base into the VectorStore, then retrieve the chunks most relevant to the question:

from rag.vectorstore import VectorStore, embed

def retrieve(store, question, k=2):
    hits = store.search(embed(question), k=k)
    return [h["metadata"]["text"] for h in hits if h["score"] > 0]
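`retrieve` relies on a contract with the store: `search` returns hits ordered best first, and each hit is a dict with a `score` and a `metadata` dict holding the chunk text. The stand-in below is hypothetical (`StubStore` and its canned scores are not the chapter's code). It only illustrates that contract and how the `score > 0` filter drops unrelated chunks. To run without `embed`, this variant of `retrieve` takes an already-embedded query.

```python
class StubStore:
    """Hypothetical stand-in for VectorStore: returns canned hits, best first."""

    def __init__(self, hits):
        self._hits = sorted(hits, key=lambda h: h["score"], reverse=True)

    def search(self, query_vector, k=2):
        # A real store would rank by similarity to query_vector; this stub ignores it.
        return self._hits[:k]


def retrieve(store, query_vector, k=2):
    # Same filter as above: keep only chunks with positive similarity.
    hits = store.search(query_vector, k=k)
    return [h["metadata"]["text"] for h in hits if h["score"] > 0]


store = StubStore([
    {"score": 0.62, "metadata": {"text": "dogs are extremely loyal pets"}},
    {"score": 0.00, "metadata": {"text": "the stock market closed higher"}},
    {"score": 0.21, "metadata": {"text": "cats are independent pets"}},
])

top3 = retrieve(store, query_vector=None, k=3)
print(top3)
```

Asking for three hits returns only two texts here, because the zero-score chunk is filtered out. A caller should not assume it always gets exactly `k` chunks back.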

Steps 4–5: assemble the prompt and generate

The grounding happens in the prompt: we hand the model the retrieved context and instruct it to answer only from that context. That instruction, the first sentence of the prompt, is the single most important line in a RAG system:

import anthropic

def answer(question, context):
    context_block = "\n".join(f"- {c}" for c in context)
    prompt = (
        "Answer the question using ONLY the context below. "
        "If the context doesn't contain the answer, say you don't know.\n\n"
        f"Context:\n{context_block}\n\nQuestion: {question}"
    )
    client = anthropic.Anthropic()              # reads ANTHROPIC_API_KEY from env
    resp = client.messages.create(
        model="claude-opus-4-8",
        max_tokens=512,
        messages=[{"role": "user", "content": prompt}],
    )
    return "".join(b.text for b in resp.content if b.type == "text")

"Answer using ONLY the context… if it's not there, say you don't know" is what keeps the model honest. Without it, the LLM falls back on its training data and may confidently make things up.
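One way to make that instruction testable without an API key is to split prompt assembly from the network call. The sketch below assumes two hypothetical helpers, `build_prompt` and `answer_with` (neither is in `rag.py`), and takes the LLM call as a plain function `llm(prompt) -> str` so a fake can stand in for Claude. Short-circuiting when retrieval returns nothing is a design choice added here, not something the code above does.

```python
def build_prompt(question, context):
    # Same prompt text as answer(), isolated so it can be inspected and tested.
    context_block = "\n".join(f"- {c}" for c in context)
    return (
        "Answer the question using ONLY the context below. "
        "If the context doesn't contain the answer, say you don't know.\n\n"
        f"Context:\n{context_block}\n\nQuestion: {question}"
    )


def answer_with(llm, question, context):
    # Assumption: with no retrieved chunks there is nothing to ground on,
    # so skip the LLM call entirely instead of sending an empty context.
    if not context:
        return "I don't know."
    return llm(build_prompt(question, context))


seen = []  # prompts the fake LLM received

def fake_llm(prompt):
    seen.append(prompt)
    return "stub answer"


reply = answer_with(fake_llm, "how loyal are dogs?", ["dogs are extremely loyal pets"])
empty = answer_with(fake_llm, "what is the capital of France?", [])
```

In the real service, `llm` would wrap the `client.messages.create` call shown earlier.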

Running it

cd code
python rag/rag.py "how loyal are dogs?"

The retrieval step runs and prints real output (the generation step needs the SDK and an API key):

question : how loyal are dogs?
retrieved: ['dogs are extremely loyal pets and bond closely with their owners', 'cats are independent pets that groom themselves']
answer   : [would send the assembled prompt to claude-opus-4-8]

The vector store ranked the relevant chunk first ("dogs are extremely loyal…"), and the retrieved chunks get injected into the prompt. With the SDK installed and ANTHROPIC_API_KEY set, Claude returns a grounded answer sourced from your document rather than its training data, such as "According to the context, dogs are extremely loyal and bond closely with their owners."

Why RAG instead of fine-tuning?

RAG and fine-tuning solve different problems, and they are easy to confuse. RAG adds knowledge: facts the model reads at query time, kept fresh by updating the vector store. Fine-tuning adds behavior: a tone, format, or skill baked into the weights. To make a model know your latest docs, use RAG (update a document, re-embed it, and it's retrievable on the next query); fine-tuning would require expensive retraining and still wouldn't cite sources. Most production "chat with your data" systems are RAG, not fine-tuned.

What makes RAG hard in production

The toy version is five steps; the production version is where the engineering lives — and these are the RAG interview questions:

Chunking. How do you split documents? Too big → noisy, blows the context budget; too small → loses context. Chunk by paragraph/section with overlap.
Retrieval quality. Pure vector search misses exact keywords; hybrid search (vector + keyword/BM25) plus a re-ranker (a model that re-scores the top candidates) is the standard upgrade.
The context budget. You can only fit so many chunks in the prompt — retrieve more, then re-rank down to the best few.
Evaluation. Did the answer actually use the right sources? Did it hallucinate? This is hard and essential — see LLM observability.
Freshness. Re-embed and re-index documents as they change (a job for Celery or Prefect).
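Three of these concerns are small enough to sketch. All names below are hypothetical and the choices are assumptions, not the book's implementation: `chunk_words` does fixed-size word windows with overlap (rather than paragraph-aware splitting), `rrf_fuse` merges a vector ranking and a keyword ranking with reciprocal rank fusion (one common way to combine hybrid results; it is not a model-based re-ranker), and `pack_to_budget` uses word counts as a rough proxy for tokens.

```python
def chunk_words(text, size=50, overlap=10):
    """Split text into windows of `size` words; neighbours share `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks


def rrf_fuse(rankings, k=60):
    """Reciprocal rank fusion: score(doc) = sum over rankings of 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


def pack_to_budget(chunks, max_words):
    """Keep best-ranked chunks, in order, until the word budget is spent."""
    kept, used = [], 0
    for chunk in chunks:
        n = len(chunk.split())
        if used + n > max_words:
            break
        kept.append(chunk)
        used += n
    return kept


text = " ".join(f"w{i}" for i in range(10))
chunks = chunk_words(text, size=4, overlap=1)

vector_ranking = ["a", "b", "c"]    # ids ordered by embedding similarity
keyword_ranking = ["b", "d", "a"]   # ids ordered by keyword/BM25 score
fused = rrf_fuse([vector_ranking, keyword_ranking])

kept = pack_to_budget(chunks, max_words=8)
```

Document "b" wins the fusion because it ranks high in both lists, which is the point of hybrid search: agreement between the two retrievers counts for more than a top spot in only one.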

The production architecture

A real RAG service stacks the tools from this whole book:

[ingest pipeline]  docs → chunk → embed → vector DB        (Prefect, Ch 9 / Ch 14)
[query path]       FastAPI /ask (Ch 4) → embed → retrieve (Ch 14)
                   → re-rank → prompt → Claude → grounded answer
                   ↑ cache embeddings & answers (Redis, Ch 7)
                   ↑ track cost, latency, quality (Ch 17)

Every box is a chapter. RAG is the capstone that ties the GenAI stack together.
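The query path can be sketched as a plain function with its stages injected, which keeps it runnable without FastAPI, Redis, or an API key. Everything named here is hypothetical: `make_ask`, the stage callables, and the in-memory dict standing in for Redis. In a real service each stage would be the tool from the corresponding chapter.

```python
import time

def make_ask(retrieve, rerank, generate, cache=None, keep=3):
    """Wire the query path: cache lookup, retrieve, re-rank, generate, record latency."""
    cache = {} if cache is None else cache   # dict stands in for Redis here

    def ask(question):
        key = question.strip().lower()       # naive cache key; an assumption
        if key in cache:
            return {**cache[key], "cached": True}

        started = time.perf_counter()
        candidates = retrieve(question)                     # retrieve generously
        context = rerank(question, candidates)[:keep]       # then keep the best few
        answer = generate(question, context)                # grounded LLM call
        result = {
            "answer": answer,
            "sources": context,
            "latency_s": time.perf_counter() - started,     # hook for observability
        }
        cache[key] = result
        return {**result, "cached": False}

    return ask


calls = []  # records how often the (fake) LLM is actually invoked

def fake_generate(question, context):
    calls.append(question)
    return f"answer from {len(context)} chunks"


ask = make_ask(
    retrieve=lambda q: ["chunk a", "chunk b", "chunk c", "chunk d"],
    rerank=lambda q, cands: sorted(cands, reverse=True),   # placeholder ordering
    generate=fake_generate,
    keep=2,
)

first = ask("How loyal are dogs?")
second = ask("how loyal are dogs?  ")
```

The second call is served from the cache, so the generation stage runs only once. Returning `sources` alongside the answer is what later makes evaluation possible.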

The takeaway

RAG grounds an LLM in your documents: embed and store them, retrieve the chunks relevant to a question, inject them into the prompt, and instruct the model to answer only from that context. It adds knowledge (use it over fine-tuning for facts) and curbs hallucination. The toy is five steps; production hardens chunking, hybrid retrieval, re-ranking, and evaluation. Next: how the LLM itself gets served, through an API or by running your own.

Production ML & AI Tools: A Hands-On Field Guide