Learning to rank: BPR

Matrix factorization minimized prediction error — how close $x_u\cdot y_i$ is to 0/1. But we never show users a number; we show a ranked list. What we actually want is for items they'll like to be ranked above items they won't. BPR (Bayesian Personalized Ranking, Rendle et al. 2009) optimizes that directly, and it introduces the pivotal idea of negative sampling.

The shift: pairs, not points

"Pointwise" methods (like ALS) score each item in isolation. Pairwise methods look at pairs: for a user, a positive item (one they interacted with) should score higher than a negative item (one they didn't). We don't care about the absolute scores, only the order.

BPR's assumption: for user $u$, an observed item $i$ is preferred over an unobserved item $j$. Train the model so $\hat{x}{ui} > \hat{x}{uj}$ for as many such pairs as possible.

This sidesteps the "missing = unknown" trap elegantly: we never claim $j$ is disliked, only that $i$ is preferred to $j$.

Negative sampling

There are astronomically many $(u, i, j)$ triples. BPR samples them: for each observed $(u, i)$, draw a random item $j$ the user hasn't interacted with as the negative. This negative sampling is what makes training over implicit data tractable, and it reappears in virtually every modern neural recommender and embedding model.

The objective

For a triple $(u, i, j)$, let the score gap be $\hat{x}{uij} = x_u\cdot y_i - x_u\cdot y_j$. BPR pushes this gap to be large and positive by maximizing $\ln \sigma(\hat{x}{uij})$, where $\sigma(z) = 1/(1+e^{-z})$ is the logistic (sigmoid) function. Maximizing it makes "positive ranked above negative" more probable.

The gradient gives a clean update. With $s = \sigma(-\hat{x}{uij}) = 1 - \sigma(\hat{x}{uij})$ (how wrong the current order is — big when the pair is mis-ranked):

x_u += lr * ( s * (y_i - y_j) - reg * x_u )
y_i += lr * ( s *  x_u        - reg * y_i )
y_j += lr * ( s * (-x_u)      - reg * y_j )

Read it: if the model already ranks $i$ above $j$ ($s \approx 0$), barely move; if it's wrong ($s \approx 1$), take a big step pulling $y_i$ toward the user and pushing $y_j$ away.

The code

import numpy as np

class BPR:
    def __init__(self, factors=32, lr=0.05, reg=0.01, epochs=30, seed=0):
        self.f, self.lr, self.reg, self.epochs, self.seed = factors, lr, reg, epochs, seed

    def fit(self, train, n_users, n_items):
        rng = np.random.default_rng(self.seed)
        X = rng.normal(scale=0.1, size=(n_users, self.f))
        Y = rng.normal(scale=0.1, size=(n_items, self.f))
        ui = {u: set() for u in range(n_users)}
        for u, i, t in train:
            ui[u].add(i)
        pos = [(u, i) for u in ui for i in ui[u]]      # all observed (user, item)

        for _ in range(self.epochs):
            rng.shuffle(pos)
            for u, i in pos:
                j = int(rng.integers(n_items))         # sample a negative
                while j in ui[u]:
                    j = int(rng.integers(n_items))
                diff = X[u] @ (Y[i] - Y[j])
                s = 1.0 / (1.0 + np.exp(diff))         # = sigma(-diff): "how wrong"
                xu = X[u].copy()
                X[u] += self.lr * (s * (Y[i] - Y[j]) - self.reg * X[u])
                Y[i] += self.lr * (s * xu - self.reg * Y[i])
                Y[j] += self.lr * (-s * xu - self.reg * Y[j])
        self.X, self.Y = X, Y
        self.seen = user_seen(train, n_users)
        return self

    def recommend(self, u, k=10):
        scores = self.Y @ self.X[u]
        return _topk_excluding(scores, self.seen.get(u, set()), k)

Same X/Y vectors and dot-product scoring as MF — only the training objective changed, from "predict the value" to "get the order right".

How it does

From the leaderboard:

  ItemItemCF         0.093
  ImplicitALS        0.087
  BPR                0.110

BPR tops the recall@10 board here — unsurprising, since it optimizes ranking, which is exactly what recall@k measures. (On other splits/metrics ALS or item-item can edge ahead; always measure on your data.)

The bigger picture

  • WARP (Weighted Approximate-Rank Pairwise) is a popular cousin that samples harder negatives (keep drawing until you find a mis-ranked one) and weights updates by the estimated rank — often better top-k accuracy.
  • Negative sampling + a ranking loss is the template for modern retrieval/embedding training (including two-tower models and many text-embedding models). BPR is the clearest place to understand it.

So far, every model produces embeddings scored by a dot product. The next chapter generalizes that to neural networks and shows the architecture that ties the whole book together. 👉