Unsupervised learning: clustering & dimensionality reduction

Everything so far has been supervised — every example came with a label. But most data in the world is unlabeled, and a huge part of practical ML is finding structure without answers: grouping similar things (clustering) and squeezing many features into a few (dimensionality reduction). These power customer segmentation, anomaly detection, recommendation, visualization, and the preprocessing behind embeddings.

Clustering: grouping without labels

k-means — the workhorse

k-means partitions data into k groups by repeating two steps until the assignments stop changing: assign each point to its nearest centroid, then move each centroid to the mean of its points. (This alternation is a hard-assignment special case of the expectation-maximization pattern.)

import numpy as np

def kmeans(X, k, iters=10, seed=0):
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), k, replace=False)]            # init: k distinct random points
    for _ in range(iters):
        d = ((X[:, None, :] - C[None, :, :]) ** 2).sum(2)  # squared distance to each centroid
        labels = d.argmin(1)                               # assign each point to its nearest centroid
        # move each centroid to the mean of its points; leave it in place if its cluster is empty
        C = np.array([X[labels == j].mean(0) if np.any(labels == j) else C[j]
                      for j in range(k)])
    return C, labels

# X is assumed to already hold the data: two well-separated blobs around (0,0) and (3,3)
C, labels = kmeans(X, k=2)
print("centroids:\n", C.round(3))
print("cluster sizes:", np.bincount(labels))

Output:

centroids:
 [[ 3.121  3.06 ]
 [-0.081  0.045]]
cluster sizes: [20 20]

k-means recovered the two blobs: centroids at ≈(3,3) and ≈(0,0), 20 points each. The cluster numbering is arbitrary; here cluster 0 happens to be the blob near (3,3). (This is the building block the IVF-PQ sister book uses to partition vector space for fast search.)

  • You must choose k. Use the elbow method (plot the within-cluster sum of squares, also called inertia, against k and look for the bend) or the silhouette score (higher is better).
  • It assumes round, equal-sized clusters and is sensitive to initialization — k-means++ seeding (spread initial centroids apart) is the standard fix.
  • Scale your features first (Chapter 3) — k-means uses Euclidean distance, so an unscaled big feature dominates.
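The seeding and elbow ideas in the list above can be sketched in a few lines of NumPy. This is a minimal illustration, not a library implementation: the helper names (`sq_dists`, `kmeans_pp_init`, `inertia`), the `n_init` restart count, and the three-blob `X_demo` data are all assumptions made for the example, and the data is not the chapter's X.

```python
import numpy as np

def sq_dists(X, C):
    """Squared Euclidean distance from every row of X to every row of C."""
    return ((X[:, None, :] - C[None, :, :]) ** 2).sum(2)

def kmeans_pp_init(X, k, rng):
    """k-means++ seeding: each new centroid is a data point drawn with probability
    proportional to its squared distance from the nearest centroid chosen so far."""
    C = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d2 = sq_dists(X, np.array(C)).min(1)
        C.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(C)

def inertia(X, k, iters=20, n_init=5, seed=0):
    """Best within-cluster sum of squares over n_init k-means++ restarts."""
    rng = np.random.default_rng(seed)
    best = np.inf
    for _ in range(n_init):
        C = kmeans_pp_init(X, k, rng)
        for _ in range(iters):
            labels = sq_dists(X, C).argmin(1)
            # keep the old centroid if a cluster ends up empty
            C = np.array([X[labels == j].mean(0) if np.any(labels == j) else C[j]
                          for j in range(k)])
        best = min(best, sq_dists(X, C).min(1).sum())
    return float(best)

# illustrative data (an assumption): three blobs of 30 points each
rng = np.random.default_rng(42)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X_demo = np.vstack([c + 0.5 * rng.standard_normal((30, 2)) for c in centers])

# elbow method: inertia should drop sharply up to k=3, then flatten
inertias = {k: inertia(X_demo, k) for k in range(1, 7)}
for k, value in inertias.items():
    print(k, round(value, 1))
```

Reading the printed table, the "bend" is the k after which extra clusters buy only small reductions in inertia. Restarting from several seeds and keeping the best run is a common guard against a poor initialization.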

Don't be confused: k-means vs. k-NN (again). k-means = unsupervised clustering (k = number of groups, no labels). k-NN = supervised classification (k = neighbors that vote, needs labels). Same letter, opposite problems.

DBSCAN & hierarchical clustering

  • DBSCAN — groups points that are densely packed, marking sparse points as noise/outliers. Finds arbitrary shapes and doesn't need k (you set a radius and a min-points instead). Great for anomaly detection and non-spherical clusters.
  • Hierarchical (agglomerative) — repeatedly merge the two closest clusters, building a tree (dendrogram) you can cut at any level. No k up front; great for understanding nested structure.
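To make the density idea concrete, here is a minimal DBSCAN sketch in plain NumPy. It is a teaching version under stated assumptions: the function name `dbscan`, the parameter names `eps` and `min_pts`, and the two-rings-plus-outlier data are illustrative, and the all-pairs distance matrix only suits small datasets.

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Minimal DBSCAN sketch. Returns labels 0, 1, ... for clusters and -1 for noise."""
    n = len(X)
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(2))
    neighbors = [np.flatnonzero(d[i] <= eps) for i in range(n)]  # each point counts itself
    is_core = np.array([len(nb) >= min_pts for nb in neighbors])
    labels = np.full(n, -1)
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or not is_core[i]:
            continue
        # grow a new cluster outward from the unvisited core point i
        labels[i] = cluster
        frontier = [i]
        while frontier:
            p = frontier.pop()
            for q in neighbors[p]:
                if labels[q] == -1:
                    labels[q] = cluster
                    if is_core[q]:
                        frontier.append(q)  # only core points keep expanding
        cluster += 1
    return labels

# illustrative data (an assumption): two concentric rings plus one far-away point
t1 = np.linspace(0, 2 * np.pi, 40, endpoint=False)
t2 = np.linspace(0, 2 * np.pi, 120, endpoint=False)
ring1 = np.c_[np.cos(t1), np.sin(t1)]
ring2 = 3 * np.c_[np.cos(t2), np.sin(t2)]
X_demo = np.vstack([ring1, ring2, [[10.0, 10.0]]])

labels = dbscan(X_demo, eps=0.3, min_pts=3)
print(np.unique(labels).tolist())        # [-1, 0, 1]
print(int((labels == -1).sum()))         # 1 point marked as noise
```

The rings are a shape k-means cannot separate, since both share the same mean; density-based growth follows each ring around and leaves the isolated point unlabeled.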

Don't be confused: which clustering algorithm? k-means → fast, round clusters, you know k. DBSCAN → arbitrary shapes, outlier detection, roughly uniform density, you don't know k (a single radius copes poorly with clusters of very different densities). Hierarchical → small data, you want the full merge tree.
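The "merge the two closest clusters" loop behind the hierarchical option can be sketched as follows. This is a deliberately naive version using single linkage (cluster distance = distance between the closest pair of members), which is only one of several linkage rules; the function name `single_linkage_merges` and the five-point dataset are assumptions for illustration.

```python
import numpy as np

def single_linkage_merges(X):
    """Naive agglomerative clustering with single linkage.
    Returns the merge history as (members_a, members_b, distance) tuples."""
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(2))
    clusters = [[i] for i in range(len(X))]
    merges = []
    while len(clusters) > 1:
        best = (np.inf, None, None)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # single linkage: distance between the closest pair of members
                dist = d[np.ix_(clusters[a], clusters[b])].min()
                if dist < best[0]:
                    best = (dist, a, b)
        dist, a, b = best
        merges.append((clusters[a], clusters[b], float(dist)))
        merged = clusters[a] + clusters[b]
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)] + [merged]
    return merges

X_demo = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 0.0], [5.0, 1.5], [20.0, 0.0]])
merges = single_linkage_merges(X_demo)
for left, right, dist in merges:
    print(left, "+", right, "at distance", dist)
# [0] + [1] at distance 1.0
# [2] + [3] at distance 1.5
# [0, 1] + [2, 3] at distance 5.0
# [4] + [0, 1, 2, 3] at distance 15.0
```

The merge distances are the heights of the dendrogram. Cutting the tree at a height of 3, for example, undoes the last two merges and leaves three clusters: {0, 1}, {2, 3}, and {4}.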

Dimensionality reduction: many features → few

High-dimensional data is hard to visualize, slow to process, and prone to overfitting (the curse of dimensionality). Dimensionality reduction compresses features while keeping the signal.
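One symptom of the curse of dimensionality is that distances concentrate: as the number of features grows, the nearest and farthest neighbors of a point end up almost equally far away. The sketch below measures that on uniform random data; the function name `relative_contrast`, the sample size, and the chosen dimensions are assumptions for illustration.

```python
import numpy as np

def relative_contrast(dim, n=500, seed=0):
    """(farthest - nearest) / nearest distance from one query to n random points."""
    rng = np.random.default_rng(seed)
    points = rng.random((n, dim))     # uniform in the unit hypercube
    query = rng.random(dim)
    dist = np.sqrt(((points - query) ** 2).sum(1))
    return float((dist.max() - dist.min()) / dist.min())

# the contrast should shrink as the dimension grows
for dim in (2, 10, 100, 1000):
    print(dim, round(relative_contrast(dim), 2))
```

When the contrast is small, "nearest" carries little information, which is part of why distance-based methods such as k-means and k-NN tend to benefit from reducing dimensions first.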

PCA — the linear workhorse

Principal Component Analysis finds the orthogonal directions of maximum variance and projects onto the top few. A numerically stable way to compute it, more so than forming the covariance matrix explicitly, is the SVD (Chapter 23) of the centered data:

# assumes numpy as np and the same X as in the k-means example
Xc = X - X.mean(0)                                   # 1. center the data
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)    # 2. SVD; rows of Vt are the principal directions
print("singular values     :", S.round(3))
print("PC1 direction        :", Vt[0].round(3))      # the sign of a component is arbitrary
print("variance explained   :", ((S**2) / (S**2).sum()).round(3))
print("project onto PC1 ->", (Xc @ Vt[0]).round(3)[:5], "...")

Output:

singular values     : [14.039  1.609]
PC1 direction        : [0.727 0.686]
variance explained   : [0.987 0.013]
project onto PC1 -> [-2.683 -2.45  -2.005 -2.668 -2.123] ...

The first component captures 98.7% of the variance because our two blobs lie mostly along the (1,1) diagonal, whose unit vector [0.707, 0.707] is close to the direction [0.727, 0.686] PCA found. We could drop the second dimension and lose almost nothing. The explained-variance ratio is how you decide how many components to keep (e.g. "enough to reach 95%").
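The "enough to reach 95%" rule can be written as a cumulative sum over the explained-variance ratios. A minimal sketch, where the helper name `n_components_for` and the five-value `S_demo` spectrum are assumptions for illustration; the two-value array reuses the singular values printed above.

```python
import numpy as np

def n_components_for(S, target=0.95):
    """Smallest number of components whose cumulative explained variance >= target."""
    ratio = S**2 / (S**2).sum()
    n = int(np.searchsorted(np.cumsum(ratio), target)) + 1
    return min(n, len(S))            # guard against rounding when target is 1.0

S_chapter = np.array([14.039, 1.609])          # singular values from the output above
print(n_components_for(S_chapter))             # 1

S_demo = np.array([10.0, 5.0, 2.0, 1.0, 0.5])  # made-up spectrum
print(n_components_for(S_demo))                # 2  (cumulative 0.768, 0.960, ...)
print(n_components_for(S_demo, target=0.99))   # 3
```

The squared singular values are proportional to the variance along each component, so no division by the sample count is needed when only ratios matter.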

  • Uses: compression, denoising, speeding up downstream models, and visualizing high-dim data in 2-D.
  • Limits: PCA is linear, so it can't unfold curved structure. Components are also harder to interpret than the original features, because each one is a mixture of all of them.
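The linearity limit is easy to see on two concentric rings, a hypothetical dataset chosen for illustration. No straight direction separates the rings, so PCA splits the variance evenly and its 1-D projection overlaps, while a simple non-linear feature (the radius) separates them cleanly.

```python
import numpy as np

theta = np.linspace(0, 2 * np.pi, 100, endpoint=False)
inner = np.c_[np.cos(theta), np.sin(theta)]   # ring of radius 1
outer = 3 * inner                             # ring of radius 3
X_rings = np.vstack([inner, outer])

Xc = X_rings - X_rings.mean(0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
ratio = S**2 / (S**2).sum()
print("variance explained:", ratio.round(3))  # both directions are equally "important"

pc1 = Xc @ Vt[0]                              # best linear 1-D view
radius = np.linalg.norm(Xc, axis=1)           # a non-linear feature
print("PC1 range, inner ring:", pc1[:100].min().round(2), pc1[:100].max().round(2))
print("PC1 range, outer ring:", pc1[100:].min().round(2), pc1[100:].max().round(2))
print("max inner radius:", radius[:100].max().round(2),
      "min outer radius:", radius[100:].min().round(2))
```

The inner ring's PC1 range sits entirely inside the outer ring's, so the projection mixes the two groups, whereas the radius keeps them apart.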

t-SNE and UMAP — visualization of non-linear structure

When you want to see clusters in high-dim data (e.g. embeddings from Chapter 14), t-SNE and UMAP are the go-to non-linear methods. They place similar points near each other in 2-D, revealing structure PCA misses.

Don't be confused: t-SNE is for looking, not for features. Use t-SNE/UMAP to visualize (2-D plots for humans), never as input to another model: the distances between clusters and the cluster sizes in a t-SNE plot are distorted and not meaningful, the result changes with the random seed and the perplexity setting, and standard t-SNE gives no mapping for new points. For actual dimensionality reduction feeding a model, use PCA (or an autoencoder). UMAP is usually faster than t-SNE and tends to preserve more global structure.
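Part of why PCA suits a modeling pipeline is that a fitted PCA is just a stored mean and a matrix, so unseen points get projected with exactly the same map as the training data. A minimal sketch, assuming hypothetical helper names `pca_fit` and `pca_transform` and random stand-in data:

```python
import numpy as np

def pca_fit(X_train, n_components):
    """Learn the centering vector and the top principal directions from training data."""
    mean = X_train.mean(0)
    _, _, Vt = np.linalg.svd(X_train - mean, full_matrices=False)
    return mean, Vt[:n_components]

def pca_transform(X, mean, components):
    """Project any points, seen or unseen, with the already-fitted map."""
    return (X - mean) @ components.T

rng = np.random.default_rng(0)
X_train = rng.standard_normal((200, 10))   # stand-in training features
X_new = rng.standard_normal((5, 10))       # points that arrive later

mean, components = pca_fit(X_train, n_components=3)
Z_train = pca_transform(X_train, mean, components)
Z_new = pca_transform(X_new, mean, components)
print(Z_train.shape, Z_new.shape)          # (200, 3) (5, 3)
```

Fitting on the training split only and reusing `mean` and `components` for validation and test data also avoids leaking information from those splits into the features.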

Where unsupervised learning shows up

  • Customer/user segmentation — cluster behavior to target groups.
  • Anomaly/fraud detection — points far from any cluster are suspicious (DBSCAN, isolation forests).
  • Recommendation — cluster users/items; compress embeddings.
  • Preprocessing — PCA before a model to cut noise and dimensions.
  • Visualization — t-SNE/UMAP to inspect embedding spaces and debug models.
  • Vector search — k-means partitions space (the IVF index).
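As one concrete instance from the list above, here is a rough sketch of how k-means cells can speed up nearest-neighbor search, in the spirit of an IVF index. It is a simplification under stated assumptions: the names `build_ivf`, `ivf_search`, and `n_probe` are illustrative, the data is random, and real IVF-PQ indexes additionally compress the vectors, which is not shown here.

```python
import numpy as np

def sq_dists(A, B):
    """Squared Euclidean distance from every row of A to every row of B."""
    return ((A[:, None, :] - B[None, :, :]) ** 2).sum(2)

def build_ivf(X, k, iters=10, seed=0):
    """Partition X with basic k-means; return centroids and one id list per cell."""
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = sq_dists(X, C).argmin(1)
        # keep the old centroid if a cell ends up empty
        C = np.array([X[labels == j].mean(0) if np.any(labels == j) else C[j]
                      for j in range(k)])
    labels = sq_dists(X, C).argmin(1)
    lists = [np.flatnonzero(labels == j) for j in range(k)]
    return C, lists

def ivf_search(q, X, C, lists, n_probe=2):
    """Scan only the n_probe cells whose centroids are closest to the query."""
    cells = sq_dists(q[None, :], C)[0].argsort()[:n_probe]
    candidates = np.concatenate([lists[j] for j in cells])
    best = candidates[((X[candidates] - q) ** 2).sum(1).argmin()]
    return int(best), len(candidates)

rng = np.random.default_rng(0)
X_demo = rng.standard_normal((2000, 8))
q = rng.standard_normal(8)
C, lists = build_ivf(X_demo, k=16)

brute = int(((X_demo - q) ** 2).sum(1).argmin())          # exact answer, scans everything
approx, scanned = ivf_search(q, X_demo, C, lists, n_probe=2)
exact, scanned_all = ivf_search(q, X_demo, C, lists, n_probe=16)
print("brute:", brute, "| approx:", approx, "after scanning", scanned, "of", len(X_demo))
```

Probing few cells is fast but approximate, since the true nearest neighbor may sit in a cell that was skipped; probing every cell falls back to the exact brute-force result.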

The takeaway

Unsupervised learning finds structure without labels. k-means partitions into k round clusters (choose k with the elbow/silhouette); DBSCAN finds dense arbitrary-shaped clusters and outliers without k; hierarchical builds a merge tree. PCA (via SVD) projects onto the directions of maximum variance for compression; t-SNE/UMAP are for visualization only. Scale features first, and never feed a t-SNE plot to a model. Next, the probability and statistics that underpin all of it, and half of every interview. 👉