The RAG news assistant

A recommender shows articles; a RAG (Retrieval-Augmented Generation) assistant answers questions about them. It's the natural companion feature for a news app — "What happened in the Champions League?" — and it reuses the exact same embeddings and vector index the recommender already built.

What RAG is

An LLM has not seen your private or freshly published news corpus, and asking it to recall facts from memory invites hallucination. RAG addresses both:

question ──► embed ──► vector search over articles ──► top-k articles (context)
                                                              │
                            "Answer using ONLY these articles" + question
                                                              │
                                                          LLM (Claude)
                                                              │
                                                    grounded, cited answer

The LLM is instructed to reason only over the retrieved articles, which keeps answers far better grounded and lets you cite sources. The retrieval step is the same nearest-neighbor search covered in the HNSW/IVF-PQ books: RAG is a recommender for context.
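
As a rough illustration of the pipeline above, the sketch below does the two non-LLM steps: a nearest-neighbor search and the prompt assembly. It makes several assumptions: the search is brute-force cosine similarity (a stand-in for an HNSW or IVF-PQ index), the 3-dimensional vectors are made-up toy values rather than real embeddings, and the helper names top_k_cosine and build_prompt are hypothetical.

```python
import numpy as np


def top_k_cosine(query_vec, doc_vecs, k=3):
    """Brute-force nearest-neighbor search by cosine similarity."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q                      # one cosine score per document
    order = np.argsort(-scores)[:k]     # indices of the k highest scores
    return [(int(i), float(scores[i])) for i in order]


def build_prompt(question, articles, hits):
    """Number the retrieved articles and append the question."""
    context = "\n".join(
        f"[{n}] {articles[i]}" for n, (i, _) in enumerate(hits, 1))
    return f"Articles:\n{context}\n\nQuestion: {question}"


# Toy data: hypothetical headlines and hand-written "embeddings".
articles = [
    "Madrid win Champions League final",
    "Central bank raises rates",
    "Champions League draw announced",
]
doc_vecs = np.array([[0.9, 0.1, 0.0],
                     [0.0, 0.1, 0.9],
                     [0.8, 0.3, 0.1]])
query_vec = np.array([1.0, 0.2, 0.0])

hits = top_k_cosine(query_vec, doc_vecs, k=2)
prompt = build_prompt("What happened in the Champions League?", articles, hits)
print(prompt)
```

The prompt string is what would be sent to the LLM together with the "answer using ONLY these articles" instruction; swapping the brute-force search for an approximate index changes only top_k_cosine.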

Pluggable generation: Claude + offline fallback

rag.py retrieves with the recommender's index, then generates:

  • if ANTHROPIC_API_KEY is set (and the anthropic package is installed), it calls Claude with the retrieved articles as context;
  • otherwise it returns a transparent extractive answer (the lead sentences of the top articles), so the assistant always works; no key is required for the demo.

"""RAG news assistant: answer a question grounded in retrieved articles.

Retrieval reuses the recommender's embedder + vector index (the same embeddings
that power recommendations). Generation is pluggable:

  * if ANTHROPIC_API_KEY is set and the `anthropic` package is installed, it
    calls Claude with the retrieved articles as context;
  * otherwise it falls back to a transparent extractive answer (the lead
    sentences of the top articles) so the system ALWAYS returns something.

This mirrors production RAG: retrieve over your vector store, then generate.
"""

from __future__ import annotations

import os


SYSTEM_PROMPT = (
    "You are a news assistant. Answer the user's question using ONLY the provided "
    "articles. Cite article numbers like [1], [2]. If the articles don't contain "
    "the answer, say so."
)


class NewsAssistant:
    def __init__(self, recommender, api_key=None, model="claude-opus-4-8"):
        self.rec = recommender
        self.api_key = api_key if api_key is not None else os.environ.get("ANTHROPIC_API_KEY", "")
        self.model = model

    # ---------------------------------------------------------------- retrieve
    def retrieve(self, query, k=5):
        """Top-k articles most relevant to the query (vector search)."""
        q = self.rec.embedder.transform([query])[0]
        hits = self.rec.index.search(q, k)
        return [(nid, score, self.rec.data.articles[nid]) for nid, score in hits]

    # ----------------------------------------------------------------- answer
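    def ask(self, query, k=5):
        """Return {'answer': str, 'sources': [...], 'mode': str}.

        mode is 'claude', 'extractive', or 'extractive (claude error: ...)'
        when the Claude call failed and the fallback ran.
        """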
        retrieved = self.retrieve(query, k)
        context = "\n".join(
            f"[{i+1}] {art.title}. {art.abstract}" for i, (_, _, art) in enumerate(retrieved))
        # float() guards against NumPy scalar scores, which json cannot serialize.
        sources = [{"id": nid, "title": art.title, "score": round(float(score), 3)}
                   for nid, score, art in retrieved]

        if self.api_key:
            try:
                answer = self._claude(query, context)
                return {"answer": answer, "sources": sources, "mode": "claude"}
            except Exception as e:                       # never fail the request
                answer = self._extractive(query, retrieved)
                return {"answer": answer, "sources": sources,
                        "mode": f"extractive (claude error: {e})"}
        return {"answer": self._extractive(query, retrieved),
                "sources": sources, "mode": "extractive"}

    # ------------------------------------------------------------ generators
    def _claude(self, query, context):
        import anthropic  # lazy import; only needed when a key is set
        client = anthropic.Anthropic(api_key=self.api_key)
        msg = client.messages.create(
            model=self.model,
            max_tokens=400,
            system=SYSTEM_PROMPT,
            messages=[{"role": "user",
                       "content": f"Articles:\n{context}\n\nQuestion: {query}"}],
        )
        return "".join(block.text for block in msg.content if block.type == "text")

    def _extractive(self, query, retrieved):
        """Offline fallback: stitch the lead sentence of each top article + cite."""
        if not retrieved:
            return "No relevant articles found."
        lines = [f"Based on {len(retrieved)} related articles:"]
        for i, (_, _, art) in enumerate(retrieved, 1):
            # Drop a trailing period so a one-sentence abstract doesn't end in "..".
            lead = art.abstract.split(". ")[0].strip().rstrip(".")
            lines.append(f"  [{i}] {lead}. ({art.title})")
        return "\n".join(lines)

See it work (offline mode)

>>> out = NewsAssistant(rec, api_key="").ask("Who won the World Cup match?", k=3)
>>> print("mode =", out["mode"]); print(out["answer"])
mode = extractive
Based on 3 related articles:
  [1] Netherlands advanced to the final of the World Cup, with Vinicius Junior inspiring a memorable victory ... (Croatia reach the World Cup final after dramatic win)
  [2] Portugal advanced to the final of the World Cup, with Kylian Mbappe inspiring a memorable victory ... (Morocco reach the World Cup quarter-final after dramatic win)
  [3] Netherlands advanced to the final of the World Cup ... (Portugal reach the World Cup quarter-final after dramatic win)

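Retrieval found World Cup articles by meaning, and the assistant returned a cited summary with no API key and no network call. Note that extractive mode only lists lead sentences; composing a direct answer to the question is the job of the generation step.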
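
To try the offline path without the real recommender, a stand-in object is enough. The sketch below is an assumption-heavy stub, not the book's recommender: it only mirrors the three attributes rag.py touches (embedder.transform, index.search, data.articles). The classes VocabEmbedder and BruteForceIndex, the helper make_stub_recommender, and the three articles are all hypothetical, and the embedder is a plain bag-of-words count rather than a learned model.

```python
import re
from dataclasses import dataclass
from types import SimpleNamespace

import numpy as np


@dataclass
class Article:
    title: str
    abstract: str


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


class VocabEmbedder:
    """Toy embedder: L2-normalized word counts over a fixed vocabulary."""

    def __init__(self, corpus):
        words = sorted({w for text in corpus for w in tokenize(text)})
        self.index = {w: i for i, w in enumerate(words)}

    def transform(self, texts):
        out = np.zeros((len(texts), len(self.index)))
        for row, text in enumerate(texts):
            for w in tokenize(text):
                col = self.index.get(w)
                if col is not None:          # ignore out-of-vocabulary words
                    out[row, col] += 1.0
        norms = np.linalg.norm(out, axis=1, keepdims=True)
        return out / np.maximum(norms, 1e-12)


class BruteForceIndex:
    """Exact cosine search; returns (article_id, score) pairs like rag.py expects."""

    def __init__(self, ids, vectors):
        self.ids, self.vectors = list(ids), vectors

    def search(self, q, k):
        scores = self.vectors @ q
        order = np.argsort(-scores)[:k]
        return [(self.ids[i], float(scores[i])) for i in order]


def make_stub_recommender(articles):
    ids = list(articles)
    texts = [f"{articles[i].title}. {articles[i].abstract}" for i in ids]
    embedder = VocabEmbedder(texts)
    index = BruteForceIndex(ids, embedder.transform(texts))
    return SimpleNamespace(embedder=embedder, index=index,
                           data=SimpleNamespace(articles=articles))


# Hypothetical mini-corpus, for illustration only.
articles = {
    "N1": Article("World Cup semi-final recap",
                  "The World Cup semi-final went to penalties. Fans celebrated overnight."),
    "N2": Article("Central bank holds rates",
                  "The central bank left interest rates unchanged. Markets barely moved."),
    "N3": Article("New phone launch",
                  "The company unveiled a new phone. Preorders open next week."),
}
rec = make_stub_recommender(articles)
q = rec.embedder.transform(["Who won the World Cup match?"])[0]
hits = rec.index.search(q, 2)
print(hits[0][0])  # id of the best-matching article
```

With NewsAssistant importable from rag.py, passing this rec and api_key="" should exercise the extractive branch end to end.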

Turning on Claude

export ANTHROPIC_API_KEY=sk-ant-...

Now ask() sends the retrieved articles + the question to Claude with a strict system prompt ("answer using ONLY the provided articles, cite by number"), and returns a fluent, grounded answer with the same sources list. The code is the canonical Anthropic call:

client = anthropic.Anthropic(api_key=...)
msg = client.messages.create(
    model="claude-opus-4-8", max_tokens=400, system=SYSTEM_PROMPT,
    messages=[{"role": "user", "content": f"Articles:\n{context}\n\nQuestion: {query}"}],
)

Robustness: if the Claude call fails (network error, exhausted quota, or the anthropic package not being installed, since the import happens inside the call), the assistant catches the exception and falls back to the extractive answer rather than erroring out, a small but important production habit. The response's mode field tells you which path ran.
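
The fallback habit can be isolated into a small helper. The following is a sketch of one possible variant, not the code in rag.py: it logs the exception instead of embedding its text in the mode field, so mode stays one of two fixed strings. The names generate_with_fallback, flaky_llm, and lead_sentences are hypothetical, and flaky_llm merely simulates a failing Claude call.

```python
import logging

logger = logging.getLogger("rag")


def generate_with_fallback(primary, fallback, query):
    """Try the primary generator; on any exception, log it and use the fallback."""
    try:
        return {"answer": primary(query), "mode": "claude"}
    except Exception:
        # Keep the details in the logs rather than in the user-facing response.
        logger.exception("primary generator failed; using extractive fallback")
        return {"answer": fallback(query), "mode": "extractive"}


def flaky_llm(query):
    """Hypothetical stand-in for the Claude call that always times out."""
    raise TimeoutError("simulated network timeout")


def lead_sentences(query):
    """Hypothetical stand-in for the extractive generator."""
    return "No relevant articles found."


result = generate_with_fallback(flaky_llm, lead_sentences, "Who won?")
print(result["mode"])
```

Whether to surface the error text to the caller, as rag.py does, or only log it is a design choice; the logged variant avoids leaking internal error messages through a public API.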

Production notes

  • Chunking. Long articles should be split into passages and each embedded, so retrieval returns the relevant passage, not a whole document.
  • Grounding & citations. Always instruct the model to use only retrieved context and cite it; surface the sources in the UI (our React app does).
  • Same index, two products. Recommendations and RAG share one embedding + vector-search backend — build it once, serve both.
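
The chunking note can be sketched as follows. This is a minimal example under stated assumptions: sentences are split with a simple regular expression (real news text with abbreviations would need a proper sentence splitter), the size budget is counted in words rather than tokens, and the function names and default values are hypothetical.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def chunk_sentences(sentences, max_words=40, overlap=1):
    """Group whole sentences into passages of roughly max_words words.

    Consecutive passages share `overlap` sentences so that a fact spanning
    a boundary still appears intact in at least one passage.
    """
    chunks, current = [], []
    for sent in sentences:
        n_words = sum(len(s.split()) for s in current) + len(sent.split())
        if current and n_words > max_words:
            chunks.append(" ".join(current))
            current = current[-overlap:] if overlap else []
        current.append(sent)
    if current:
        chunks.append(" ".join(current))
    return chunks


text = "A b c. D e f. G h i. J k l."
passages = chunk_sentences(split_sentences(text), max_words=6, overlap=1)
for p in passages:
    print(p)
```

Each passage would then be embedded and indexed separately, with the parent article id stored alongside it so the citation can still point at the article.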

Next: wrap all of this in an API. 👉