The RAG news assistant

A recommender shows articles; a RAG (Retrieval-Augmented Generation) assistant answers questions about them. It's the natural companion feature for a news app — "What happened in the Champions League?" — and it reuses the exact same embeddings and vector index the recommender already built.

What RAG is

An LLM has not seen your private or recent news corpus, and asking it to recall facts from memory invites hallucination. RAG addresses both:

question ──► embed ──► vector search over articles ──► top-k articles (context)
                                                              │
                            "Answer using ONLY these articles" + question
                                                              │
                                                          LLM (Claude)
                                                              │
                                                    grounded, cited answer

The LLM is instructed to answer only from the retrieved articles, which keeps answers grounded in the corpus and lets you cite sources. The retrieval step is the same nearest-neighbor search covered in the HNSW/IVF-PQ books: RAG is, in effect, a recommender for context.
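To make the retrieval step concrete, here is a minimal sketch of exact (brute-force) cosine top-k search in NumPy. It is a stand-in for the approximate HNSW/IVF-PQ index, not the recommender's actual implementation; the function name `top_k_cosine` and the toy vectors are illustrative assumptions. It returns (id, score) pairs, the shape that `index.search` is assumed to return in the code further down.

```python
import numpy as np


def top_k_cosine(query_vec, article_vecs, k=5):
    """Return [(row_index, cosine_score), ...] for the k most similar rows.

    Exact search: scores every article. An ANN index (HNSW, IVF-PQ) trades a
    little recall to avoid this full scan on large corpora.
    """
    q = np.asarray(query_vec, dtype=float)
    m = np.asarray(article_vecs, dtype=float)
    # Normalise so the dot product equals cosine similarity.
    q = q / (np.linalg.norm(q) + 1e-12)
    m = m / (np.linalg.norm(m, axis=1, keepdims=True) + 1e-12)
    scores = m @ q
    k = min(k, len(scores))
    order = np.argsort(-scores)[:k]  # highest score first
    return [(int(i), float(scores[i])) for i in order]


# Toy 3-dimensional "embeddings" (hypothetical values, for illustration only).
articles = np.array([
    [1.0, 0.0, 0.0],  # 0: football
    [0.9, 0.1, 0.0],  # 1: football, slightly different wording
    [0.0, 1.0, 0.0],  # 2: politics
    [0.0, 0.0, 1.0],  # 3: weather
])
hits = top_k_cosine([1.0, 0.0, 0.0], articles, k=2)
# The two football rows (0, then 1) rank first.
```

The full scan costs one dot product per article, which is fine for a demo corpus and is the baseline that approximate indexes are measured against.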

Pluggable generation: Claude + offline fallback

rag.py retrieves with the recommender's index, then generates:

if ANTHROPIC_API_KEY is set (and the anthropic package is installed), it calls Claude with the retrieved articles as context;
otherwise it returns a transparent extractive answer (the lead sentences of the top articles), so the assistant always works — no key required to demo.

"""RAG news assistant: answer a question grounded in retrieved articles.

Retrieval reuses the recommender's embedder + vector index (the same embeddings
that power recommendations). Generation is pluggable:

  * if ANTHROPIC_API_KEY is set and the `anthropic` package is installed, it
    calls Claude with the retrieved articles as context;
  * otherwise it falls back to a transparent extractive answer (the lead
    sentences of the top articles) so the system ALWAYS returns something.

This mirrors production RAG: retrieve over your vector store, then generate.
"""

from __future__ import annotations

import os

import numpy as np


SYSTEM_PROMPT = (
    "You are a news assistant. Answer the user's question using ONLY the provided "
    "articles. Cite article numbers like [1], [2]. If the articles don't contain "
    "the answer, say so."
)


class NewsAssistant:
    def __init__(self, recommender, api_key=None, model="claude-opus-4-8"):
        self.rec = recommender
        self.api_key = api_key if api_key is not None else os.environ.get("ANTHROPIC_API_KEY", "")
        self.model = model

    # ---------------------------------------------------------------- retrieve
    def retrieve(self, query, k=5):
        """Top-k articles most relevant to the query (vector search)."""
        q = self.rec.embedder.transform([query])[0]
        hits = self.rec.index.search(q, k)
        return [(nid, score, self.rec.data.articles[nid]) for nid, score in hits]

    # ----------------------------------------------------------------- answer
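    def ask(self, query, k=5):
        """Return {'answer': str, 'sources': [...], 'mode': str}.

        mode is 'claude', 'extractive', or 'extractive (claude error: <message>)'
        when the Claude call failed and the fallback ran.
        """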
        retrieved = self.retrieve(query, k)
        context = "\n".join(
            f"[{i+1}] {art.title}. {art.abstract}" for i, (_, _, art) in enumerate(retrieved))
        # float() keeps the score JSON-serialisable even if the index returns a NumPy scalar
        sources = [{"id": nid, "title": art.title, "score": round(float(score), 3)}
                   for nid, score, art in retrieved]

        if self.api_key:
            try:
                answer = self._claude(query, context)
                return {"answer": answer, "sources": sources, "mode": "claude"}
            except Exception as e:                       # never fail the request
                answer = self._extractive(query, retrieved)
                return {"answer": answer, "sources": sources,
                        "mode": f"extractive (claude error: {e})"}
        return {"answer": self._extractive(query, retrieved),
                "sources": sources, "mode": "extractive"}

    # ------------------------------------------------------------ generators
    def _claude(self, query, context):
        import anthropic  # lazy import; only needed when a key is set
        client = anthropic.Anthropic(api_key=self.api_key)
        msg = client.messages.create(
            model=self.model,
            max_tokens=400,
            system=SYSTEM_PROMPT,
            messages=[{"role": "user",
                       "content": f"Articles:\n{context}\n\nQuestion: {query}"}],
        )
        return "".join(block.text for block in msg.content if block.type == "text")

    def _extractive(self, query, retrieved):
        """Offline fallback: stitch the lead sentence of each top article + cite."""
        if not retrieved:
            return "No relevant articles found."
        lines = [f"Based on {len(retrieved)} related articles:"]
        for i, (_, _, art) in enumerate(retrieved, 1):
            # rstrip(".") avoids a doubled period when the abstract is a single sentence
            lead = art.abstract.split(". ")[0].strip().rstrip(".")
            lines.append(f"  [{i}] {lead}. ({art.title})")
        return "\n".join(lines)

See it work (offline mode)

>>> NewsAssistant(rec, api_key="").ask("Who won the World Cup match?", k=3)
mode = extractive
Based on 3 related articles:
  [1] Netherlands advanced to the final of the World Cup, with Vinicius Junior inspiring a memorable victory ... (Croatia reach the World Cup final after dramatic win)
  [2] Portugal advanced to the final of the World Cup, with Kylian Mbappe inspiring a memorable victory ... (Morocco reach the World Cup quarter-final after dramatic win)
  [3] Netherlands advanced to the final of the World Cup ... (Portugal reach the World Cup quarter-final after dramatic win)

Retrieval found the right (World Cup) articles by meaning, and the assistant answered with citations, all without an API key or a network call.

Turning on Claude

export ANTHROPIC_API_KEY=sk-ant-...

Now ask() sends the retrieved articles and the question to Claude with a strict system prompt ("answer using ONLY the provided articles, cite by number") and returns a fluent, grounded answer with the same sources list. The call is a standard Anthropic Messages API request:

import anthropic

client = anthropic.Anthropic(api_key=...)
msg = client.messages.create(
    model="claude-opus-4-8", max_tokens=400, system=SYSTEM_PROMPT,
    messages=[{"role": "user", "content": f"Articles:\n{context}\n\nQuestion: {query}"}],
)

Robustness: if the Claude call fails (network, quota, or a missing anthropic package), the assistant catches the error and falls back to the extractive answer rather than erroring out, a small but important production habit. The response's mode field tells you which path ran and, after a fallback, carries the error message.
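Since the system prompt asks for citations like [1], [2], a cheap extra safeguard is to check that every cited number maps to a retrieved source. The sketch below is an assumption layered on top of ask(), not part of rag.py; `check_citations` is a hypothetical helper name and the sample answer string is made up for illustration.

```python
import re


def check_citations(answer, n_sources):
    """Split the [n] markers in `answer` into valid and out-of-range numbers.

    `n_sources` is the number of retrieved articles shown to the model, so
    valid citations are 1..n_sources (matching the [i+1] numbering used when
    the context string is built).
    """
    cited = sorted({int(m) for m in re.findall(r"\[(\d+)\]", answer)})
    valid = [n for n in cited if 1 <= n <= n_sources]
    invalid = [n for n in cited if n not in valid]
    return {"valid": valid, "invalid": invalid, "uncited": not cited}


# Made-up answer text: two citations in range, one pointing past the sources.
report = check_citations("The team reached the final [1][3]. See also [7].", n_sources=3)
# report == {"valid": [1, 3], "invalid": [7], "uncited": False}
```

An answer with out-of-range or missing citations could be flagged in the UI or routed to the extractive fallback; which policy fits is a product decision.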

Production notes

Chunking. Long articles should be split into passages and each embedded, so retrieval returns the relevant passage, not a whole document.
Grounding & citations. Always instruct the model to use only retrieved context and cite it; surface the sources in the UI (our React app does).
Same index, two products. Recommendations and RAG share one embedding + vector-search backend — build it once, serve both.
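As a rough illustration of the chunking note, the sketch below splits text into overlapping word windows so each passage can be embedded separately. The window and overlap sizes are arbitrary assumptions, `chunk_words` is a hypothetical helper, and real systems often split on sentence or paragraph boundaries instead.

```python
def chunk_words(text, size=120, overlap=20):
    """Split `text` into passages of up to `size` words.

    Consecutive passages share `overlap` words, so text near a boundary
    appears in both neighbouring passages.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):  # this window reached the end
            break
    return chunks


passages = chunk_words("w0 w1 w2 w3 w4 w5 w6 w7 w8 w9", size=4, overlap=1)
# passages == ["w0 w1 w2 w3", "w3 w4 w5 w6", "w6 w7 w8 w9"]
```

Each passage would then be embedded and indexed with a pointer back to its article id, so a citation still resolves to an article.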

Next: wrap all of this in an API.