Neighborhood collaborative filtering

Collaborative filtering (CF) ignores item content and learns purely from the interaction patterns of the crowd: "people who liked what you liked also liked this." The simplest, most intuitive form is the neighborhood method, and it remains a strong, explainable baseline.

Two flavors

  • User-user: find users similar to you, recommend what they liked.
  • Item-item: find items similar to the ones you liked, recommend those.

Item-item is the workhorse (it is the approach Amazon famously deployed) for two reasons. First, item similarities are more stable than user similarities: an item's audience changes slowly, while individual tastes and the user base itself shift quickly. Second, because they are stable, item similarities can be precomputed. We'll build item-item.

The key idea: items are similar if the same people interact with them

Forget content. Two items are "similar" if they tend to be touched by the same users. Represent each item by its column in the user-item matrix R (which users interacted with it), and measure similarity between those columns with cosine similarity.

              item A column   item B column   item C column
   user0          1               1               0
   user1          1               1               0
   user2          0               0               1
   user3          1               0               1

   A and B share users 0,1  -> HIGH similarity  (cosine = 2/√6 ≈ 0.82)
   A and C share only user 3 -> LOWER similarity (cosine = 1/√6 ≈ 0.41)
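
To make the toy table concrete, here is a small sketch that computes the cosine similarity between the item columns directly. The helper name cosine is introduced only for this illustration.

```python
import numpy as np

# Toy user-item matrix from the table above:
# rows are users 0-3, columns are items A, B, C.
R = np.array([
    [1, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
    [1, 0, 1],
], dtype=float)

def cosine(x, y):
    """Cosine similarity between two 1-D vectors (illustrative helper)."""
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

sim_ab = cosine(R[:, 0], R[:, 1])  # A and B share users 0 and 1
sim_ac = cosine(R[:, 0], R[:, 2])  # A and C share only user 3
sim_bc = cosine(R[:, 1], R[:, 2])  # B and C share no users

print(f"A-B: {sim_ab:.3f}  A-C: {sim_ac:.3f}  B-C: {sim_bc:.3f}")
# prints: A-B: 0.816  A-C: 0.408  B-C: 0.000
```

Note that B and C, which have no users in common, get a similarity of exactly zero.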

The code

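import numpy as np

# build_matrix, user_seen and _topk_excluding are helpers this listing expects to be defined elsewhere.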
class ItemItemCF:
    def fit(self, train, n_users, n_items):
        R = build_matrix(train, n_users, n_items)        # users x items
        self.R = R
        norm = np.linalg.norm(R, axis=0, keepdims=True)  # length of each item column
        Rn = R / np.maximum(norm, 1e-9)                  # normalize columns
        self.S = Rn.T @ Rn                               # item-item cosine similarity
        np.fill_diagonal(self.S, 0.0)                    # an item isn't its own neighbor
        self.seen = user_seen(train, n_users)
        return self

    def recommend(self, u, k=10):
        scores = self.R[u] @ self.S                      # spread from items the user has
        return _topk_excluding(scores, self.seen.get(u, set()), k)

How the math works

  • Rn.T @ Rn computes, in one matrix multiply, the cosine similarity between every pair of items (rows/cols of the resulting S are items). Normalizing the columns first turns the dot product into a cosine.
  • self.R[u] @ self.S is the recommendation step: R[u] is a 0/1 vector of the items user u has. Multiplying by S spreads a vote from each of those items to its similar items and sums the votes. Items similar to many of the user's items score highest.

That's the whole "users who interacted with X also interacted with Y" intuition, expressed as two matrix products: one to build S, and one to score a user.
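
As a worked illustration, the sketch below runs both steps on the four-user toy matrix from earlier and scores user 2, who has only item C. The masking with -inf is a simplified stand-in for whatever _topk_excluding does, not the original helper.

```python
import numpy as np

# Toy matrix: rows are users 0-3, columns are items A, B, C.
R = np.array([
    [1, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
    [1, 0, 1],
], dtype=float)

# Step 1: normalize columns, then one matrix multiply gives all pairwise cosines.
norm = np.linalg.norm(R, axis=0, keepdims=True)
Rn = R / np.maximum(norm, 1e-9)
S = Rn.T @ Rn
np.fill_diagonal(S, 0.0)

# Step 2: sum the similarity "votes" cast by the items user 2 already has.
u = 2
scores = R[u] @ S

# Hide items the user already has, then take the best remaining one.
masked = np.where(R[u] > 0, -np.inf, scores)
best = int(np.argmax(masked))

print(np.round(scores, 3))
print("recommend item", "ABC"[best])
```

User 2's only item is C, so the scores are simply C's row of S: about 0.408 for A (which shares user 3 with C) and 0 for B. Item A is therefore the recommendation.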

How it does

From the leaderboard:

  Popularity         0.073
  ContentBased       0.080
  ItemItemCF         0.093

Item-item CF beats both popularity and content-based here, and it does so without any item features, purely from co-interaction patterns. Its recommendations are also easy to explain ("recommended because you interacted with X").

Strengths and limits

Strengths

  • Simple, intuitive, explainable.
  • No item features required.
  • Item-item similarities can be precomputed offline and served fast.

Limits

  • Cold start: a new item with no interactions has an all-zero column → no similarities → it can never be recommended. A new user with no history has the mirror-image problem: R[u] is all zeros, so every item scores zero and the ranking is arbitrary.
  • Sparsity: when the matrix is extremely sparse, co-interactions are rare and similarities get noisy.
  • Scale & memory: the item-item matrix is items × items. For millions of items that's far too big to store densely — you keep only each item's top-N neighbors, or move to the compact latent-vector models in the next chapter.
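
The "keep only each item's top-N neighbors" idea can be sketched as follows. The function name keep_top_n_neighbors is hypothetical, and the result is still a dense array for readability. A real system would presumably store the pruned rows in a sparse format such as scipy.sparse.csr_matrix.

```python
import numpy as np

def keep_top_n_neighbors(S, n):
    """Return a copy of S where each row keeps its n largest entries; the rest become 0.

    Illustrative only: the output is dense, and it is generally no longer
    symmetric, because j can be in i's top-n without i being in j's.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    S = np.asarray(S, dtype=float)
    n_items = S.shape[0]
    if n >= n_items:
        return S.copy()
    pruned = np.zeros_like(S)
    # argpartition puts the indices of each row's n largest values in the last n slots.
    top = np.argpartition(S, -n, axis=1)[:, -n:]
    rows = np.arange(n_items)[:, None]
    pruned[rows, top] = S[rows, top]
    return pruned

# Small made-up similarity matrix (zero diagonal), pruned to 2 neighbors per item.
S = np.array([
    [0.0, 0.8, 0.4, 0.1],
    [0.8, 0.0, 0.0, 0.3],
    [0.4, 0.0, 0.0, 0.9],
    [0.1, 0.3, 0.9, 0.0],
])
P = keep_top_n_neighbors(S, 2)
print(P)
```

Scoring then works as before, with R[u] @ P in place of R[u] @ S. Memory drops from items × items to roughly items × N once the zeros are no longer stored.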

Those limits motivate matrix factorization: instead of an items × items similarity table, learn a small vector per user and per item that explains all the interactions.