One-file CLI: an article recommender

Everything from the content-based chapter, distilled into one runnable tool. It follows the exact pipeline from the introduction: embed articles, build a time-decayed user profile, run a nearest-neighbor search over the catalog, and fall back to representative items (a content-only stand-in for trending) on cold start. It needs NumPy but no machine-learning libraries; the embeddings are TF-IDF vectors built from scratch.

The steps

  1. Load a catalog (one article per line).
  2. Embed each article as a TF-IDF vector (built from scratch).
  3. Profile the user: a time-decayed average of the articles they've read (most recent weighted most).
  4. Recommend: cosine nearest-neighbor between the profile and every unread article.
  5. Cold start: with no history, return the most representative articles (a content-only stand-in for trending).
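
To make step 3 concrete, here is a minimal sketch of the recency weights, assuming the same exponential half-life scheme the script uses (a read that is `half_life` positions older than the most recent one counts half as much). The helper name `decay_weights` is introduced here for illustration and is not part of the script.

```python
import numpy as np

def decay_weights(n_reads, half_life=3.0):
    """Recency weights for a history of n_reads items, oldest first."""
    lam = np.log(2) / half_life
    ages = np.arange(n_reads - 1, -1, -1)   # age 0 = most recent read (last in list)
    return np.exp(-lam * ages)

w = decay_weights(4, half_life=3.0)
print(np.round(w / w.sum(), 3))             # normalized share of each read, oldest first
```

With four reads and a half-life of 3, the oldest read is exactly three positions back, so it carries half the weight of the newest one. A smaller half-life pushes the profile further toward the last few reads.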

Install & run

pip install numpy        # that's all it needs

# personalized: the user read articles 0 and 1 (1 most recent)
python recommend_cli.py articles.txt --history 0,1 -k 3

# cold start: a brand-new user
python recommend_cli.py articles.txt -k 3

A sample articles.txt (20 short articles across tech / health / sports / space) ships alongside the script.

It works — real output

Personalized — a user who read two tech articles gets tech back:

$ python recommend_cli.py articles.txt --history 0,1 -k 3
catalog: 20 articles
history (oldest->newest): [0, 1]
  most recent read: 'Google releases smartphone software update improving camera battery and performance'

recommended for you:
  1. (score=0.453)  [3] Samsung reveals smartphone with upgraded camera processor and faster software
  2. (score=0.337)  [2] New laptop launches with a powerful processor faster software and better battery
  3. (score=0.311)  [4] Chipmaker announces faster processor boosting laptop and smartphone performance

Switch the history to sports articles and you get sports back instead — the taste vector moves to a different region of embedding space and kNN follows.

Cold start — no history, so fall back:

$ python recommend_cli.py articles.txt -k 3
catalog: 20 articles
no history -> COLD START: showing representative/popular-by-content

cold-start picks:
  1. (score=0.133)  [5] Daily exercise and a healthy diet reduce heart disease risk a new study finds
  2. (score=0.109)  [7] Study shows healthy diet and exercise improve heart health and reduce disease
  3. (score=0.107)  [12] Football team celebrates a championship victory after a dramatic final match
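
The cold-start score is each article's average cosine similarity to the rest of the catalog. A minimal sketch of that calculation on a made-up three-row matrix follows; the vectors are illustrative and are not taken from articles.txt.

```python
import numpy as np

# Three hypothetical unit-length vectors: row 1 sits between rows 0 and 2,
# while rows 0 and 2 are orthogonal to each other.
X = np.array([[1.0, 0.0],
              [0.6, 0.8],
              [0.0, 1.0]])

sim = X @ X.T                  # cosine similarity, because rows are unit length
np.fill_diagonal(sim, 0.0)     # ignore each row's similarity to itself
scores = sim.mean(axis=1)      # averages over all n columns, zeroed diagonal included
best = int(np.argmax(scores))  # index of the most "representative" row
print(scores, best)
```

Row 1 wins because it overlaps with both neighbors. In a catalog spanning several unrelated topics most pairwise similarities are near zero, which is likely why the cold-start scores in the output above are small in absolute terms.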

The complete script

#!/usr/bin/env python3
"""
recommend_cli.py — a content-based article recommender, from scratch.

This is the canonical "embed articles -> build a time-decayed user profile ->
nearest-neighbor over the catalog" pipeline, runnable on a plain text file.

  - Catalog: a text file, one article (title/summary) per line.
  - History: the indices of articles the user has read, oldest-first
    (most recent counts most, via time decay).
  - Recommend: cosine k-NN between the user's profile and every unread article.
  - Cold start: with no history, fall back to the most representative articles.

Usage:
    # personalized: user read articles 0, 5, 9 (9 most recent)
    python recommend_cli.py articles.txt --history 0,5,9 -k 5

    # cold start: a brand-new user
    python recommend_cli.py articles.txt -k 5

Requirements: numpy. (TF-IDF is built from scratch; no ML libraries.)
"""

from __future__ import annotations

import argparse
import re

import numpy as np


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf(docs):
    """Return an (n_docs, vocab) TF-IDF matrix (rows L2-normalized)."""
    df = {}
    for d in docs:
        for w in set(tokenize(d)):
            df[w] = df.get(w, 0) + 1
    vocab = {w: j for j, w in enumerate(sorted(df))}
    n = len(docs)
    idf = np.zeros(len(vocab))
    for w, j in vocab.items():
        idf[j] = np.log((1 + n) / (1 + df[w])) + 1.0
    X = np.zeros((n, len(vocab)))
    for r, d in enumerate(docs):
        for w in tokenize(d):
            X[r, vocab[w]] += 1.0
        X[r] *= idf
    X /= np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1e-9)   # unit rows
    return X


def decayed_profile(X, history, half_life):
    """Time-decayed average of read-article vectors. history: oldest-first ids."""
    lam = np.log(2) / half_life
    n = len(history)
    # age 0 = most recent (last in list) -> highest weight
    weights = np.array([np.exp(-lam * (n - 1 - pos)) for pos in range(n)])
    profile = (weights[:, None] * X[history]).sum(0) / weights.sum()
    return profile / max(np.linalg.norm(profile), 1e-9)


def main(argv=None):
    p = argparse.ArgumentParser(description="Content-based article recommender.")
    p.add_argument("articles", help="text file, one article per line")
    p.add_argument("--history", default="",
                   help="comma-separated article indices the user read, oldest first")
    p.add_argument("-k", type=int, default=5, help="recommendations to return")
    p.add_argument("--half-life", type=float, default=3.0,
                   help="recency half-life in #articles (smaller = more recency-biased)")
    args = p.parse_args(argv)

    with open(args.articles, encoding="utf-8") as f:
        docs = [ln.rstrip("\n") for ln in f if ln.strip()]
    X = tfidf(docs)
    print(f"catalog: {len(docs)} articles")

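    history = [int(x) for x in args.history.split(",") if x.strip() != ""]
    # reject out-of-range ids; negative ids would otherwise wrap around silently
    bad = [i for i in history if not 0 <= i < len(docs)]
    if bad:
        p.error(f"--history indices out of range (0..{len(docs) - 1}): {bad}")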

    if history:
        prof = decayed_profile(X, history, args.half_life)
        scores = X @ prof
        scores[history] = -np.inf                       # don't re-recommend read items
        print(f"history (oldest->newest): {history}")
        print(f"  most recent read: {docs[history[-1]]!r}\n")
        label = "recommended for you"
    else:
        # cold start: most "representative" articles (highest average similarity).
        sim = X @ X.T
        np.fill_diagonal(sim, 0.0)
        scores = sim.mean(1)
        print("no history -> COLD START: showing representative/popular-by-content\n")
        label = "cold-start picks"

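    # rank best-first; skip already-read items (score -inf) so a large -k never surfaces them
    top = [i for i in np.argsort(-scores) if np.isfinite(scores[i])][:args.k]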
    print(f"{label}:")
    for rank, i in enumerate(top, 1):
        print(f"  {rank}. (score={scores[i]:.3f})  [{i}] {docs[i]}")


if __name__ == "__main__":
    main()

From toy to production

This is a real, if small, content-based recommender. To productionize it you'd swap three pieces and keep the structure:

  • Better embeddings. Replace TF-IDF with a neural text-embedding model. The profile-and-kNN logic is unchanged, and relevance typically improves because related articles can match even when they share no words.
  • ANN serving. Replace the exhaustive X @ profile scan with an approximate nearest-neighbor index such as HNSW or IVF-PQ, so queries stay fast as the catalog grows to millions of articles.
  • Real trending for cold start. Swap the content-only "representative" fallback for true trending from your interaction logs.
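
As a rough illustration of the last bullet, trending can be approximated by counting recent interactions with the same kind of exponential decay used for the user profile. The sketch below assumes a hypothetical log format of (article_id, timestamp_in_hours) pairs; the function name `trending` and its parameters are illustrative, not part of the script.

```python
import math
from collections import defaultdict

def trending(events, now, half_life_hours=24.0, k=3):
    """Rank article ids by exponentially decayed interaction counts.

    events: iterable of (article_id, timestamp_in_hours) pairs (assumed format).
    """
    lam = math.log(2) / half_life_hours
    score = defaultdict(float)
    for article_id, ts in events:
        age = max(now - ts, 0.0)              # clamp events stamped in the future
        score[article_id] += math.exp(-lam * age)
    # highest score first; ties broken by id so the order is deterministic
    return sorted(score, key=lambda a: (-score[a], a))[:k]

# Article 7 has three clicks, but they are about three days old;
# articles 12 and 3 have fewer clicks from the last two hours.
log = [(7, 0.0), (7, 1.0), (7, 2.0), (12, 70.0), (12, 71.0), (3, 71.5)]
print(trending(log, now=72.0, k=2))
```

Here the two recent articles outrank the one with more but older clicks. In the CLI, a list like this would replace the content-only scores in the cold-start branch.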

That's the whole book in one tool: embeddings → time decay → nearest-neighbor → cold-start fallback — the backbone of real recommendation and retrieval systems. 🎉