The RAG news assistant
A recommender shows articles; a RAG (Retrieval-Augmented Generation) assistant answers questions about them. It's the natural companion feature for a news app — "What happened in the Champions League?" — and it reuses the exact same embeddings and vector index the recommender already built.
What RAG is
An LLM doesn't know your private/fresh news corpus, and asking it to recall facts invites hallucination. RAG fixes both:
question ──► embed ──► vector search over articles ──► top-k articles (context)
                                                                  │
                                  "Answer using ONLY these articles" + question
                                                                  │
                                                             LLM (Claude)
                                                                  │
                                                        grounded, cited answer
The LLM only reasons over retrieved facts, so answers stay grounded and you can cite sources. The retrieval step is precisely the nearest-neighbor search from the HNSW/IVF-PQ books — RAG is a recommender for context.
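The flow above can be sketched end to end in a few lines. This is a toy illustration, not the chapter's rag.py: the three-dimensional vectors, the article titles, and the helper names top_k and build_prompt are all made up for the example, and the brute-force cosine scan stands in for a real HNSW or IVF-PQ index.

```python
import math

# Toy corpus with hand-made 3-d "embeddings" (hypothetical values for illustration).
ARTICLES = [
    ("Champions League final recap", [0.9, 0.1, 0.0]),
    ("Central bank raises rates", [0.0, 0.9, 0.2]),
    ("World Cup semi-final report", [0.8, 0.0, 0.3]),
]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, k=2):
    """Brute-force nearest neighbours; a real vector index approximates this scan."""
    scored = [(cosine(query_vec, vec), title) for title, vec in ARTICLES]
    scored.sort(reverse=True)
    return scored[:k]

def build_prompt(question, hits):
    """Number the retrieved articles so the model can cite them as [1], [2], ..."""
    context = "\n".join(f"[{i}] {title}" for i, (_, title) in enumerate(hits, 1))
    return f"Articles:\n{context}\n\nQuestion: {question}"

hits = top_k([1.0, 0.0, 0.1])  # pretend this vector is the embedded question
prompt = build_prompt("What happened in the Champions League?", hits)
print(prompt)
```

The printed prompt is what would be sent to the LLM together with the "use ONLY these articles" instruction; the unrelated finance article never reaches the model because it scores near zero.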
Pluggable generation: Claude + offline fallback
rag.py retrieves with the recommender's index, then generates:
- if ANTHROPIC_API_KEY is set (and the anthropic package is installed), it calls Claude with the retrieved articles as context;
- otherwise it returns a transparent extractive answer (the lead sentences of the top articles), so the assistant always works, with no key required to demo.
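The two conditions in the list above can be checked up front. The helper below is a hypothetical sketch (pick_mode is not part of rag.py): rag.py itself only tests the key and relies on its try/except to catch a missing package, whereas this version asks importlib whether the anthropic package is importable before choosing a path.

```python
import importlib.util
import os

def pick_mode(api_key=None):
    """Return 'claude' only when both a key and the SDK are available."""
    # Same key resolution as the assistant: explicit argument first, then the environment.
    key = api_key if api_key is not None else os.environ.get("ANTHROPIC_API_KEY", "")
    # find_spec returns None when a top-level package is not installed.
    has_sdk = importlib.util.find_spec("anthropic") is not None
    return "claude" if key and has_sdk else "extractive"

print(pick_mode(api_key=""))  # extractive: an empty key always selects the fallback
```

Passing api_key="" forces the offline path even when the environment variable is set, which is convenient for demos and tests.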
"""RAG news assistant: answer a question grounded in retrieved articles.

Retrieval reuses the recommender's embedder + vector index (the same embeddings
that power recommendations). Generation is pluggable:

  * if ANTHROPIC_API_KEY is set and the `anthropic` package is installed, it
    calls Claude with the retrieved articles as context;
  * otherwise it falls back to a transparent extractive answer (the lead
    sentences of the top articles) so the system ALWAYS returns something.

This mirrors production RAG: retrieve over your vector store, then generate.
"""
from __future__ import annotations

import os

import numpy as np

SYSTEM_PROMPT = (
    "You are a news assistant. Answer the user's question using ONLY the provided "
    "articles. Cite article numbers like [1], [2]. If the articles don't contain "
    "the answer, say so."
)


class NewsAssistant:
    def __init__(self, recommender, api_key=None, model="claude-opus-4-8"):
        self.rec = recommender
        self.api_key = api_key if api_key is not None else os.environ.get("ANTHROPIC_API_KEY", "")
        self.model = model

    # ---------------------------------------------------------------- retrieve
    def retrieve(self, query, k=5):
        """Top-k articles most relevant to the query (vector search)."""
        q = self.rec.embedder.transform([query])[0]
        hits = self.rec.index.search(q, k)
        return [(nid, score, self.rec.data.articles[nid]) for nid, score in hits]

    # ----------------------------------------------------------------- answer
    def ask(self, query, k=5):
        """Return {'answer': str, 'sources': [...], 'mode': 'claude'|'extractive'}."""
        retrieved = self.retrieve(query, k)
        context = "\n".join(
            f"[{i+1}] {art.title}. {art.abstract}" for i, (_, _, art) in enumerate(retrieved))
        sources = [{"id": nid, "title": art.title, "score": round(score, 3)}
                   for nid, score, art in retrieved]

        if self.api_key:
            try:
                answer = self._claude(query, context)
                return {"answer": answer, "sources": sources, "mode": "claude"}
            except Exception as e:  # never fail the request
                answer = self._extractive(query, retrieved)
                return {"answer": answer, "sources": sources,
                        "mode": f"extractive (claude error: {e})"}
        return {"answer": self._extractive(query, retrieved),
                "sources": sources, "mode": "extractive"}

    # ------------------------------------------------------------ generators
    def _claude(self, query, context):
        import anthropic  # lazy import; only needed when a key is set
        client = anthropic.Anthropic(api_key=self.api_key)
        msg = client.messages.create(
            model=self.model,
            max_tokens=400,
            system=SYSTEM_PROMPT,
            messages=[{"role": "user",
                       "content": f"Articles:\n{context}\n\nQuestion: {query}"}],
        )
        # Keep only text blocks; other block types carry no answer text.
        return "".join(block.text for block in msg.content if block.type == "text")

    def _extractive(self, query, retrieved):
        """Offline fallback: stitch the lead sentence of each top article + cite."""
        if not retrieved:
            return "No relevant articles found."
        lines = [f"Based on {len(retrieved)} related articles:"]
        for i, (_, _, art) in enumerate(retrieved, 1):
            # Drop any trailing period so single-sentence abstracts don't end in "..".
            lead = art.abstract.split(". ")[0].strip().rstrip(".")
            lines.append(f"  [{i}] {lead}. ({art.title})")
        return "\n".join(lines)
See it work (offline mode)
>>> NewsAssistant(rec, api_key="").ask("Who won the World Cup match?", k=3)
mode = extractive
Based on 3 related articles:
[1] Netherlands advanced to the final of the World Cup, with Vinicius Junior inspiring a memorable victory ... (Croatia reach the World Cup final after dramatic win)
[2] Portugal advanced to the final of the World Cup, with Kylian Mbappe inspiring a memorable victory ... (Morocco reach the World Cup quarter-final after dramatic win)
[3] Netherlands advanced to the final of the World Cup ... (Portugal reach the World Cup quarter-final after dramatic win)
Retrieval found the right (World Cup) articles by meaning, and the assistant answered with citations — with zero external dependencies.
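To try the offline path without the full recommender, a stand-in object only needs the three attributes NewsAssistant touches: embedder.transform, index.search, and data.articles. The sketch below is an assumption-laden toy: ToyEmbedder, ToyIndex, the five-word vocabulary, and the two articles are invented for illustration and are not the chapter's real embedder or index.

```python
from dataclasses import dataclass
from types import SimpleNamespace

@dataclass
class Article:
    title: str
    abstract: str

VOCAB = ["world", "cup", "final", "rates", "bank"]  # hypothetical tiny vocabulary

class ToyEmbedder:
    def transform(self, texts):
        """Bag-of-words counts over VOCAB, one vector per input text."""
        out = []
        for t in texts:
            words = t.lower().replace("?", " ").replace(".", " ").split()
            out.append([words.count(w) for w in VOCAB])
        return out

class ToyIndex:
    def __init__(self, vectors):
        self.vectors = vectors

    def search(self, q, k):
        """Brute-force dot-product search returning (article_id, score) pairs."""
        scored = [(sum(a * b for a, b in zip(q, v)), i) for i, v in enumerate(self.vectors)]
        scored.sort(key=lambda s: (-s[0], s[1]))  # best score first, ties by id
        return [(i, float(score)) for score, i in scored[:k]]

articles = [
    Article("Croatia reach the World Cup final", "Croatia advanced to the final. Fans celebrated."),
    Article("Central bank raises rates", "The bank raised rates again. Markets fell."),
]
embedder = ToyEmbedder()
index = ToyIndex(embedder.transform([f"{a.title} {a.abstract}" for a in articles]))
rec = SimpleNamespace(embedder=embedder, index=index, data=SimpleNamespace(articles=articles))

# The same two calls NewsAssistant.retrieve makes:
q = rec.embedder.transform(["Who won the World Cup match?"])[0]
hits = rec.index.search(q, 1)
print(hits)  # [(0, 2.0)]
```

With rag.py importable, NewsAssistant(rec, api_key="").ask("Who won the World Cup match?", k=1) should run against this stub and return an answer with mode "extractive".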
Turning on Claude
export ANTHROPIC_API_KEY=sk-ant-...
Now ask() sends the retrieved articles + the question to Claude with a strict
system prompt ("answer using ONLY the provided articles, cite by number"), and
returns a fluent, grounded answer with the same sources list. The code is the
canonical Anthropic call:
client = anthropic.Anthropic(api_key=...)
msg = client.messages.create(
    model="claude-opus-4-8", max_tokens=400, system=SYSTEM_PROMPT,
    messages=[{"role": "user", "content": f"Articles:\n{context}\n\nQuestion: {query}"}],
)
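Because the system prompt asks for citations like [1], [2], the returned text can be checked against the number of retrieved articles before it is shown. The helper below is a hypothetical sketch (check_citations is not part of rag.py) and assumes the model follows the bracketed-number format the prompt requests.

```python
import re

def check_citations(answer, n_sources):
    """Return (cited, invalid): citation numbers found, and those out of range."""
    cited = sorted({int(m) for m in re.findall(r"\[(\d+)\]", answer)})
    invalid = [c for c in cited if not 1 <= c <= n_sources]
    return cited, invalid

cited, invalid = check_citations("Croatia won [1], as also reported in [4].", n_sources=3)
print(cited, invalid)  # [1, 4] [4]
```

An empty cited list or a non-empty invalid list is a cheap signal that the answer may not be grounded in the supplied articles and deserves a second look.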
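Robustness: if the Claude call fails (network, quota, or a missing anthropic package), the assistant catches the error and falls back to the extractive answer rather than erroring out, a small but important production habit. The response's mode field tells you which path ran: "claude", "extractive", or "extractive (claude error: ...)" when the fallback was triggered by a failure.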
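The same habit can be factored into a small reusable wrapper. This is a sketch under assumptions, not the chapter's code: answer_with_fallback and flaky are hypothetical names, and the wrapper reports the failure in a separate error field instead of folding it into the mode string as ask() does.

```python
def answer_with_fallback(primary, fallback):
    """Try primary(); on any exception use fallback() and record why."""
    try:
        return {"answer": primary(), "mode": "primary"}
    except Exception as exc:  # broad on purpose: the request must not fail
        return {"answer": fallback(), "mode": "fallback", "error": repr(exc)}

def flaky():
    """Stand-in for a generator call that hits a network problem."""
    raise TimeoutError("simulated network timeout")

result = answer_with_fallback(flaky, lambda: "extractive summary")
print(result["mode"], "|", result["error"])
```

Keeping the error separate lets a client compare mode by simple equality while still logging the cause.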
Production notes
- Chunking. Long articles should be split into passages and each embedded, so retrieval returns the relevant passage, not a whole document.
- Grounding & citations. Always instruct the model to use only retrieved context and cite it; surface the sources in the UI (our React app does).
- Same index, two products. Recommendations and RAG share one embedding + vector-search backend — build it once, serve both.
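The chunking note above can be sketched as a sentence-window splitter. Everything here is illustrative: chunk_sentences, its default window of three sentences with one sentence of overlap, and the regex-based sentence split are assumptions, and a production system would likely use a proper sentence tokenizer and size chunks by tokens.

```python
import re

def chunk_sentences(text, max_sentences=3, overlap=1):
    """Split text into passages of up to max_sentences, sharing `overlap` sentences."""
    if not 0 <= overlap < max_sentences:
        raise ValueError("overlap must be >= 0 and smaller than max_sentences")
    # Naive split: break on whitespace that follows ., ! or ?
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    step = max_sentences - overlap
    chunks = []
    for start in range(0, len(sentences), step):
        chunks.append(" ".join(sentences[start:start + max_sentences]))
        if start + max_sentences >= len(sentences):
            break  # the last window already reached the end of the text
    return chunks

article = "A scored early. B equalised. C hit the post. A won late. Fans celebrated."
for i, passage in enumerate(chunk_sentences(article), 1):
    print(i, passage)
```

Each passage would then be embedded and indexed on its own, with a pointer back to its parent article so that citations still resolve to a title.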
Next: wrap all of this in an API. 👉