Matrix factorization (SGD & implicit ALS)

Matrix factorization (MF) is the idea that won the Netflix Prize and still underpins huge production systems. Instead of a giant similarity table, it learns a short latent vector for every user and every item, such that their dot product predicts interaction. This chapter explains it from the ground up, with the real math.

The core idea: hidden factors

Imagine every movie can be described by a few hidden dials — how much action, how much romance, how "indie" it is — and every user by how much they like each dial. You'd predict a user's interest in a movie by matching their preferences to the movie's traits.

MF learns these dials automatically. Pick a small number f of latent factors (say 32). Then:

each user $u$ gets a vector $x_u \in \mathbb{R}^f$,
each item $i$ gets a vector $y_i \in \mathbb{R}^f$,
the predicted affinity is their dot product $\hat{r}_{ui} = x_u \cdot y_i$.

Stacking these into matrices $X$ (users × f) and $Y$ (items × f), we're approximating the whole interaction matrix as a product:

$$ R \approx X\,Y^\top. $$

We compress an enormous, sparse R into two thin, dense matrices. The factors are "latent" because we never name them — the model discovers whatever dimensions best explain the data.

Version 1: explicit ratings via SGD

If we have explicit ratings, we want $x_u \cdot y_i$ to match the known rating $r_{ui}$. Minimize the squared error over observed ratings, plus a regularization term ($\lambda$) that keeps vectors small to avoid overfitting:

$$ \min_{X,Y} \sum_{(u,i)\,\text{observed}} \big(r_{ui} - x_u \cdot y_i\big)^2 + \lambda\big(\lVert x_u\rVert^2 + \lVert y_i\rVert^2\big). $$

Stochastic Gradient Descent (SGD) optimizes this by nudging the vectors toward lower error, one observed rating at a time. For the error $e_{ui} = r_{ui} - x_u\cdot y_i$:

for (u, i, r) in observed_ratings:
    e = r - X[u] @ Y[i]
    X[u] += lr * (e * Y[i] - reg * X[u])     # step toward lower error
    Y[i] += lr * (e * X[u] - reg * Y[i])

Each update moves $x_u$ and $y_i$ so their dot product gets closer to the true rating. Repeat over all ratings for several epochs. Simple and effective — but it assumes we have explicit ratings, and treats missing entries as "ignore", which isn't right for implicit data.

Version 2: implicit feedback via ALS (the real-world case)

Most data is implicit (clicks, views), and here the "missing means unknown, not disliked" problem (from Chapter 2) bites. The classic solution is Implicit ALS (Hu, Koren & Volinsky, 2008), and it's worth understanding deeply because it's everywhere.

Two new concepts: preference and confidence

For each user-item cell, define:

Preference $p_{ui} = 1$ if the user interacted with $i$, else $0$.
Confidence $c_{ui} = 1 + \alpha\, r_{ui}$, where $r_{ui}$ is the interaction count (or strength). We're more sure about a "1" the more the user engaged, and only weakly sure a "0" means disinterest.

The objective sums over every cell (observed and not), weighted by confidence:

$$ \min_{X,Y}\ \sum_{u,i} c_{ui}\big(p_{ui} - x_u\cdot y_i\big)^2 + \lambda\big(\lVert X\rVert^2 + \lVert Y\rVert^2\big). $$

The genius: every blank is included (as a low-confidence 0), so the model learns "probably not interested" softly, while strong interactions pull hard toward 1.

Why ALS instead of SGD here?

That sum is over all user×item cells — far too many to loop over with SGD. But notice: if we fix $Y$, the problem becomes an ordinary (weighted) least-squares problem in $X$ — solvable exactly, one user at a time. Then fix $X$ and solve for $Y$. Alternate until convergence. That's Alternating Least Squares.

For a single user, the optimal vector has a closed form:

$$ x_u = \big(Y^\top C^u Y + \lambda I\big)^{-1}\, Y^\top C^u p_u, $$

where $C^u$ is the diagonal matrix of that user's confidences and $p_u$ their preference vector. The matching formula for items swaps roles. (The Hu et al. speedup: $Y^\top C^u Y = Y^\top Y + Y^\top (C^u - I) Y$, and $C^u - I$ is zero except on the user's few interactions — so each solve is cheap.)

The code

import numpy as np

class ImplicitALS:
    def __init__(self, factors=32, reg=0.1, alpha=40.0, iters=15, seed=0):
        self.f, self.reg, self.alpha, self.iters, self.seed = factors, reg, alpha, iters, seed

    def fit(self, train, n_users, n_items):
        R = build_matrix(train, n_users, n_items)
        rng = np.random.default_rng(self.seed)
        X = rng.normal(scale=0.01, size=(n_users, self.f))
        Y = rng.normal(scale=0.01, size=(n_items, self.f))
        C = 1.0 + self.alpha * R          # confidence
        P = (R > 0).astype(float)         # preference (0/1)
        I = self.reg * np.eye(self.f)

        for _ in range(self.iters):
            YtY = Y.T @ Y                                  # shared across users
            for u in range(n_users):
                cu = C[u]
                A = YtY + Y.T @ (Y * (cu - 1.0)[:, None]) + I   # Y^T C^u Y + reg I
                b = (Y * (cu * P[u])[:, None]).sum(0)           # Y^T C^u p_u
                X[u] = np.linalg.solve(A, b)               # closed-form solve
            XtX = X.T @ X
            for i in range(n_items):
                ci = C[:, i]
                A = XtX + X.T @ (X * (ci - 1.0)[:, None]) + I
                b = (X * (ci * P[:, i])[:, None]).sum(0)
                Y[i] = np.linalg.solve(A, b)
        self.X, self.Y = X, Y
        self.seen = user_seen(train, n_users)
        return self

    def recommend(self, u, k=10):
        scores = self.Y @ self.X[u]        # dot product with every item vector
        return _topk_excluding(scores, self.seen.get(u, set()), k)

np.linalg.solve(A, b) is the exact least-squares solution for one user's vector — that's the "LS" in ALS. The outer loop alternates user and item updates.

How it does

From the leaderboard:

  ItemItemCF         0.093
  ImplicitALS        0.087
  BPR                0.110

ALS is competitive with item-item here and far more scalable and compact: it stores only f numbers per user/item (not an items×items table), and recommending is a dot product — which at scale becomes an ANN search over the item vectors (HNSW / IVF-PQ), the bridge to the serving chapter.

Why MF is so important

Generalization. It discovers latent structure, finding links that neighborhood methods miss (it can connect items that share no direct co-interactions but occupy the same latent region).
Compactness & speed. User/item vectors are tiny; scoring is a dot product → perfect for ANN serving.
The blueprint for deep models. "Learn a user vector and item vectors, score by dot product" is exactly what modern two-tower neural networks do — MF is their linear ancestor (next chapters).

But squared error optimizes rating prediction, not ranking. The next chapter optimizes the ranking directly. 👉

Recommendation Systems from Scratch