One-file CLI: an article recommender
Everything from the content-based chapter, distilled into one runnable tool. It implements the exact pipeline from the introduction: embed articles, build a time-decayed user profile, run a nearest-neighbor search over the catalog, and fall back to representative/trending items on cold start. It needs no machine-learning libraries: the embeddings are from-scratch TF-IDF, and NumPy is the only dependency.
The steps
- Load a catalog (one article per line).
- Embed each article as a TF-IDF vector (built from scratch).
- Profile the user: a time-decayed average of the articles they've read (most recent weighted most).
- Recommend: cosine nearest-neighbor between the profile and every unread article.
- Cold start: with no history, return the most representative articles (a content-only stand-in for trending).
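The embedding step weights each term by how rare it is across the catalog. As a rough illustration, the smoothed IDF formula used in the complete script can be written in plain Python as follows. The function name `smoothed_idf` and the document counts are chosen only for this sketch and are not measured from articles.txt.

```python
import math


def smoothed_idf(n_docs: int, doc_freq: int) -> float:
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1.0


# Illustrative counts for a 20-article catalog (assumed, not measured).
for df in (1, 5, 20):
    print(f"term in {df:>2} of 20 articles -> idf = {smoothed_idf(20, df):.3f}")
```

A term that appears in every article gets a weight of exactly 1, while rarer terms get larger weights, so distinctive words dominate each article's vector.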
Install & run
pip install numpy # that's all it needs
# personalized: the user read articles 0 and 1 (1 most recent)
python recommend_cli.py articles.txt --history 0,1 -k 3
# cold start: a brand-new user
python recommend_cli.py articles.txt -k 3
A sample articles.txt (20 short articles across tech / health / sports / space) ships alongside the script.
It works — real output
Personalized — a user who read two tech articles gets tech back:
$ python recommend_cli.py articles.txt --history 0,1 -k 3
catalog: 20 articles
history (oldest->newest): [0, 1]
most recent read: 'Google releases smartphone software update improving camera battery and performance'
recommended for you:
1. (score=0.453) [3] Samsung reveals smartphone with upgraded camera processor and faster software
2. (score=0.337) [2] New laptop launches with a powerful processor faster software and better battery
3. (score=0.311) [4] Chipmaker announces faster processor boosting laptop and smartphone performance
Switch the history to sports articles and you get sports back instead — the taste vector moves to a different region of embedding space and kNN follows.
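To make the "taste vector moves" idea concrete, here is a minimal NumPy-free sketch on toy 2-D vectors, assuming one axis stands for "tech" and the other for "sports". The vectors and the half-life of 1 are made up for illustration; the weighting mirrors the script's time decay.

```python
import math


def decayed_profile(vectors, half_life):
    """Time-decayed, L2-normalized average of vectors listed oldest-first."""
    lam = math.log(2) / half_life
    n = len(vectors)
    # age 0 = most recent (last in list) -> weight 1.0
    weights = [math.exp(-lam * (n - 1 - pos)) for pos in range(n)]
    total = sum(weights)
    dims = len(vectors[0])
    avg = [sum(w * v[d] for w, v in zip(weights, vectors)) / total
           for d in range(dims)]
    norm = math.sqrt(sum(x * x for x in avg)) or 1.0
    return [x / norm for x in avg]


def dot(a, b):
    """Dot product; equals cosine similarity when both vectors are unit length."""
    return sum(x * y for x, y in zip(a, b))


TECH, SPORTS = [1.0, 0.0], [0.0, 1.0]  # toy unit "embeddings" (assumed)

# Two older tech reads, then one recent sports read.
profile = decayed_profile([TECH, TECH, SPORTS], half_life=1.0)
print(f"tech score:   {dot(profile, TECH):.2f}")
print(f"sports score: {dot(profile, SPORTS):.2f}")
```

With a half-life of one article, the weights are 0.25, 0.5, and 1.0, so the single recent sports read outweighs the two older tech reads and the sports score comes out higher (0.80 versus 0.60).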
Cold start — no history, so fall back:
$ python recommend_cli.py articles.txt -k 3
catalog: 20 articles
no history -> COLD START: showing representative/popular-by-content
cold-start picks:
1. (score=0.133) [5] Daily exercise and a healthy diet reduce heart disease risk a new study finds
2. (score=0.109) [7] Study shows healthy diet and exercise improve heart health and reduce disease
3. (score=0.107) [12] Football team celebrates a championship victory after a dramatic final match
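The cold-start scores above are each article's average similarity to the rest of the catalog. A small sketch of that scoring on three made-up unit vectors follows; the function name `representativeness` is hypothetical, and the division by n (rather than n - 1) is an assumption that matches the script's zeroed diagonal followed by a row mean.

```python
def representativeness(unit_vectors):
    """Sum of cosine similarities to every other vector, divided by n."""
    n = len(unit_vectors)
    scores = []
    for i, a in enumerate(unit_vectors):
        total = sum(
            sum(x * y for x, y in zip(a, b))
            for j, b in enumerate(unit_vectors)
            if j != i  # skip self-similarity
        )
        scores.append(total / n)
    return scores


# Toy unit vectors (assumed): the middle one overlaps with both neighbours.
vectors = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
scores = representativeness(vectors)
best = max(range(len(scores)), key=scores.__getitem__)
print([round(s, 3) for s in scores], "-> most representative:", best)
```

The middle vector wins because it shares content with both of the others, which is why articles from well-populated topics tend to surface on cold start.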
The complete script
#!/usr/bin/env python3
"""
recommend_cli.py: a content-based article recommender, from scratch.

This is the canonical "embed articles -> build a time-decayed user profile ->
nearest-neighbor over the catalog" pipeline, runnable on a plain text file.

  - Catalog: a text file, one article (title/summary) per line.
  - History: the indices of articles the user has read, oldest-first
    (most recent counts most, via time decay).
  - Recommend: cosine k-NN between the user's profile and every unread article.
  - Cold start: with no history, fall back to the most representative articles.

Usage:
    # personalized: user read articles 0, 5, 9 (9 most recent)
    python recommend_cli.py articles.txt --history 0,5,9 -k 5

    # cold start: a brand-new user
    python recommend_cli.py articles.txt -k 5

Requirements: numpy. (TF-IDF is built from scratch; no ML libraries.)
"""
from __future__ import annotations

import argparse
import re

import numpy as np


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf(docs):
    """Return an (n_docs, vocab) TF-IDF matrix (rows L2-normalized)."""
    df = {}
    for d in docs:
        for w in set(tokenize(d)):
            df[w] = df.get(w, 0) + 1
    vocab = {w: j for j, w in enumerate(sorted(df))}
    n = len(docs)
    idf = np.zeros(len(vocab))
    for w, j in vocab.items():
        idf[j] = np.log((1 + n) / (1 + df[w])) + 1.0
    X = np.zeros((n, len(vocab)))
    for r, d in enumerate(docs):
        for w in tokenize(d):
            X[r, vocab[w]] += 1.0
        X[r] *= idf
    X /= np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1e-9)  # unit rows
    return X


def decayed_profile(X, history, half_life):
    """Time-decayed average of read-article vectors. history: oldest-first ids."""
    lam = np.log(2) / half_life
    n = len(history)
    # age 0 = most recent (last in list) -> highest weight
    weights = np.array([np.exp(-lam * (n - 1 - pos)) for pos in range(n)])
    profile = (weights[:, None] * X[history]).sum(0) / weights.sum()
    return profile / max(np.linalg.norm(profile), 1e-9)


def main(argv=None):
    p = argparse.ArgumentParser(description="Content-based article recommender.")
    p.add_argument("articles", help="text file, one article per line")
    p.add_argument("--history", default="",
                   help="comma-separated article indices the user read, oldest first")
    p.add_argument("-k", type=int, default=5, help="recommendations to return")
    p.add_argument("--half-life", type=float, default=3.0,
                   help="recency half-life in #articles (smaller = more recency-biased)")
    args = p.parse_args(argv)

    with open(args.articles, encoding="utf-8") as f:
        docs = [ln.rstrip("\n") for ln in f if ln.strip()]
    X = tfidf(docs)
    print(f"catalog: {len(docs)} articles")

    history = [int(x) for x in args.history.split(",") if x.strip() != ""]
    if history:
        prof = decayed_profile(X, history, args.half_life)
        scores = X @ prof
        scores[history] = -np.inf  # don't re-recommend read items
        print(f"history (oldest->newest): {history}")
        print(f"  most recent read: {docs[history[-1]]!r}\n")
        label = "recommended for you"
    else:
        # cold start: most "representative" articles (highest average similarity).
        sim = X @ X.T
        np.fill_diagonal(sim, 0.0)
        scores = sim.mean(1)
        print("no history -> COLD START: showing representative/popular-by-content\n")
        label = "cold-start picks"

    top = np.argsort(-scores)[:args.k]
    print(f"{label}:")
    for rank, i in enumerate(top, 1):
        print(f"  {rank}. (score={scores[i]:.3f}) [{i}] {docs[i]}")


if __name__ == "__main__":
    main()
From toy to production
This is a real, if small, content-based recommender. To productionize it you'd swap three pieces and keep the structure:
- Better embeddings. Replace TF-IDF with a neural text-embedding model. The profile-and-kNN logic is unchanged, and relevance typically improves because the vectors capture meaning beyond exact word overlap.
- ANN serving. Replace the exhaustive X @ profile scan with an HNSW or IVF-PQ index so it scales to millions of articles in milliseconds.
- Real trending for cold start. Swap the content-only "representative" fallback for true trending from your interaction logs.
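As one possible shape for the "real trending" fallback, the sketch below ranks articles by time-decayed read counts. Everything about the log is an assumption for illustration: the `(article_id, age_hours)` event format, the 24-hour half-life, and the sample events are hypothetical, not taken from any real system.

```python
import math
from collections import defaultdict


def trending(events, half_life_hours=24.0, k=3):
    """Rank article ids by time-decayed read counts.

    events: iterable of (article_id, age_hours) pairs (hypothetical log format).
    """
    lam = math.log(2) / half_life_hours
    scores = defaultdict(float)
    for article_id, age_hours in events:
        # A read loses half its weight every half_life_hours.
        scores[article_id] += math.exp(-lam * age_hours)
    # Highest score first; ties broken by article id for determinism.
    return sorted(scores, key=lambda a: (-scores[a], a))[:k]


# Made-up interaction log: article 12 has the most reads, but they are old.
log = [(7, 1.0), (7, 2.0), (3, 0.5), (12, 48.0), (12, 50.0), (12, 72.0)]
print(trending(log, k=2))
```

Here article 12 has the most reads overall, yet it drops out of the top two because its reads are days old; that recency weighting is what separates trending from simply popular.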
That's the whole book in one tool: embeddings → time decay → nearest-neighbor → cold-start fallback — the backbone of real recommendation and retrieval systems. 🎉