Linear algebra you actually need

Every model in this book is linear algebra underneath: features are vectors, layers are matrix multiplies, embeddings are vectors in a space, and PCA, recommenders, and transformers all reduce to a few matrix operations. You don't need a semester of proofs — you need a working grip on six ideas. This chapter centralizes them, building on the tensor mechanics of Chapter 2.

Vectors & matrices as transformations

A vector is a point/arrow in space (Chapter 0). The key mental shift: a matrix is a function that transforms vectors — rotating, scaling, projecting them. Matrix-times-vector is applying that transformation:

import numpy as np
R = np.array([[0., -1.],
              [1.,  0.]])               # a 90-degree rotation
print("rotate [1,0] ->", (R @ np.array([1., 0.])).round(3))

Output:

rotate [1,0] -> [0. 1.]

The matrix R turned the rightward arrow [1,0] into the upward arrow [0,1] — a 90° rotation. A neural-network layer xW is the same thing: a learned transformation of the input vector into a new space (Chapter 11).
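Scaling and projection work the same way as rotation, and transformations compose by multiplying their matrices. The sketch below is illustrative only: the matrices `S` and `P` and the test vector `x` are made-up examples, not taken from a later chapter.

```python
import numpy as np

R = np.array([[0., -1.],
              [1.,  0.]])   # 90-degree rotation (same R as above)
S = np.array([[2., 0.],
              [0., 2.]])   # uniform scaling by 2 (illustrative)
P = np.array([[1., 0.],
              [0., 0.]])   # projection onto the x-axis (illustrative)
x = np.array([3., 4.])

scaled = S @ x        # doubles both coordinates
projected = P @ x     # drops the y-coordinate

# Composition: rotating and then scaling is one matrix, S @ R.
composed = (S @ R) @ x
step_by_step = S @ (R @ x)

print("scaled:   ", scaled)
print("projected:", projected)
print("composed: ", composed, "step by step:", step_by_step)
```

Stacking layers in a network is this kind of composition, with a nonlinearity between the matrix multiplies.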

The dot product, geometrically

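a · b = Σ aᵢbᵢ measures alignment: it equals $\lVert a\rVert\,\lVert b\rVert\cos\theta$, where θ is the angle between the two vectors. It is positive when the vectors point in similar directions, zero when they are perpendicular, and negative when they point in opposing directions. This is the entire basis of cosine similarity and attention (Chapter 4).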
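Dividing the dot product by both lengths isolates the cosine term. A minimal sketch follows; the helper name `cosine_similarity` is a hypothetical one chosen here, and it assumes neither input is the zero vector.

```python
import numpy as np

def cosine_similarity(a, b):
    """cos(theta) = (a . b) / (||a|| ||b||). Assumes a and b are nonzero."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1., 0.])
same = cosine_similarity(a, np.array([2., 0.]))    # same direction, different length
perp = cosine_similarity(a, np.array([0., 3.]))    # perpendicular
opp = cosine_similarity(a, np.array([-1., 0.]))    # opposite direction

print("same:", same, "perpendicular:", perp, "opposite:", opp)
```

The result depends only on direction, which is why `[1, 0]` and `[2, 0]` score as identical.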

Norms — measuring size

A norm is a vector's length. The two you'll meet:

v = np.array([3., 4.])
print("L2 (Euclidean):", np.linalg.norm(v))     # sqrt(9+16)
print("L1 (Manhattan):", np.linalg.norm(v, 1))  # |3|+|4|

Output:

L2 (Euclidean): 5.0
L1 (Manhattan): 7.0

These are the same L1/L2 from regularization (Chapter 9) and distance (Chapter 4). A regularization penalty is a norm of the weight vector, and a distance between two points is a norm of their difference vector.
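Both uses can be sketched with the same few calls. The points `p`, `q` and the weight vector `w` below are made-up values for illustration, and the penalties are shown without any regularization-strength coefficient.

```python
import numpy as np

# Distance: a norm of the difference between two points.
p = np.array([1., 2.])
q = np.array([4., 6.])
diff = p - q                              # [-3, -4]
euclidean = np.linalg.norm(diff)          # L2 distance
manhattan = np.linalg.norm(diff, 1)       # L1 distance

# Regularization: a norm of the weight vector itself.
w = np.array([0.5, -2.0, 0.0])
l2_penalty = np.sum(w ** 2)               # squared L2 norm (ridge-style)
l1_penalty = np.sum(np.abs(w))            # L1 norm (lasso-style)

print("distances:", euclidean, manhattan)
print("penalties:", l2_penalty, l1_penalty)
```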

Rank — how much independent information

The rank of a matrix is the number of linearly independent rows (or columns) — how many genuinely distinct directions it spans. A matrix whose rows are combinations of each other is low-rank (redundant):

M = np.array([[1., 2, 3],
              [2., 4, 6],     # exactly 2x the first row -> redundant
              [1., 1, 1]])
print("rank:", np.linalg.matrix_rank(M))         # 3 rows, but only 2 independent

Output:

rank: 2

Low rank = compressible. This idea underlies dimensionality reduction, matrix factorization for recommenders (a huge user×item matrix approximated by low-rank user/item embeddings — see the recsys book), and LoRA, the low-rank fine-tuning trick for LLMs (Chapter 27).
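To make "low rank = compressible" concrete, here is a small sketch that builds a matrix from two thin factors and counts the numbers each representation stores. The sizes, the seed, and the names `user_emb`, `item_emb`, and `ratings` are hypothetical; real rating matrices are only approximately low-rank.

```python
import numpy as np

rng = np.random.default_rng(0)             # fixed seed, arbitrary choice
n_users, n_items, k = 100, 80, 2           # hypothetical sizes

user_emb = rng.normal(size=(n_users, k))   # one k-dim vector per user
item_emb = rng.normal(size=(k, n_items))   # one k-dim vector per item
ratings = user_emb @ item_emb              # full 100 x 80 matrix, rank at most k

rank = np.linalg.matrix_rank(ratings)
full_params = ratings.size                        # entries in the full matrix
factored_params = user_emb.size + item_emb.size   # entries in the two factors

print("rank:", rank)
print("numbers stored:", full_params, "full vs", factored_params, "factored")
```

The two factors reproduce every entry of the full matrix while storing a small fraction of the numbers, which is the same saving a low-rank update aims for.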

Eigenvalues & eigenvectors — a matrix's natural axes

An eigenvector of a matrix is a special nonzero direction the matrix only stretches, never rotates; its eigenvalue is the stretch factor (a negative eigenvalue flips the direction along the same line). Formally $Av = \lambda v$:

A = np.array([[2., 1.],
              [1., 2.]])
vals, vecs = np.linalg.eig(A)
print("eigenvalues:", np.sort(vals))
v = vecs[:, 0]
print("A v      =", (A @ v).round(3))
print("lambda v =", (vals[0] * v).round(3))      # equal -> v is an eigenvector

Output:

eigenvalues: [1. 3.]
A v      = [2.121 2.121]
lambda v = [2.121 2.121]

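Av equals λv (up to floating-point rounding), so v is a direction the matrix purely scales. Note that np.linalg.eig does not sort its results: here vals[0] is 3, which is why Av is 3 times v, while the printed eigenvalues were sorted only for display. Eigenvectors are a matrix's "natural axes." They power PCA (the eigenvectors of the covariance matrix are the principal components), PageRank, and the analysis of dynamical systems.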
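The PCA connection can be sketched in a few lines. This is a toy illustration, not the full treatment: the data is synthetic (points scattered along a chosen direction plus small noise), and the seed, sample count, and noise level are arbitrary assumptions. It uses `np.linalg.eigh`, the eigen-solver for symmetric matrices, which returns eigenvalues in ascending order.

```python
import numpy as np

rng = np.random.default_rng(0)                    # fixed seed, arbitrary choice
direction = np.array([1., 1.]) / np.sqrt(2)       # the axis we hide in the data
t = rng.normal(size=200)                          # position along that axis
X = np.outer(t, direction) + 0.1 * rng.normal(size=(200, 2))  # plus small noise

C = np.cov(X, rowvar=False)                       # 2 x 2 covariance matrix
vals, vecs = np.linalg.eigh(C)                    # ascending eigenvalues
top = vecs[:, -1]                                 # eigenvector of the largest one

alignment = abs(top @ direction)                  # 1.0 means same axis (sign is arbitrary)
explained = vals[-1] / vals.sum()                 # share of variance on that axis

print("top eigenvector:", top.round(3))
print("alignment with hidden axis:", round(float(alignment), 3))
print("variance explained:", round(float(explained), 3))
```

The top eigenvector recovers the hidden axis up to sign, and its eigenvalue is the variance of the data along it.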

SVD — the Swiss Army knife

Singular Value Decomposition factors any matrix into $M = U\Sigma V^\top$: a rotation (or reflection), a scaling by the singular values $\Sigma$, and another rotation (or reflection). It's the most useful decomposition in all of applied ML because the singular values rank directions by importance, so keeping the top k gives the best possible rank-k approximation in the least-squares sense:

U, S, Vt = np.linalg.svd(M)
print("singular values:", S.round(3))
rank1 = S[0] * np.outer(U[:, 0], Vt[0])           # keep only the top component
print("rank-1 approximation error:", round(np.linalg.norm(M - rank1), 3))

Output:

singular values: [8.52  0.643 0.   ]
rank-1 approximation error: 0.643

The third singular value is 0 (confirming rank 2), and a single component already reconstructs the matrix with an error of 0.643. That error is exactly the second singular value, and it is about 7.5% of the matrix's overall (Frobenius) norm of roughly 8.54. SVD is the engine behind PCA (Chapter 21), recommender matrix factorization, latent semantic analysis, image compression, and noise reduction. If you remember one decomposition, remember SVD.
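The same idea generalizes to keeping the top k components. The sketch below wraps it in a helper, `truncated_svd`, a hypothetical name chosen here rather than a NumPy function, and checks that the Frobenius error of the rank-k approximation equals the square root of the sum of the squared discarded singular values.

```python
import numpy as np

def truncated_svd(M, k):
    """Rank-k approximation of M built from its top k singular triplets."""
    U, S, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :k] * S[:k]) @ Vt[:k]            # scale columns of U, then recombine

M = np.array([[1., 2, 3],
              [2., 4, 6],
              [1., 1, 1]])                        # the rank-2 matrix from above
S = np.linalg.svd(M, compute_uv=False)            # singular values only

err1 = np.linalg.norm(M - truncated_svd(M, 1))    # Frobenius norm by default
err2 = np.linalg.norm(M - truncated_svd(M, 2))
predicted1 = np.sqrt(np.sum(S[1:] ** 2))          # error predicted from discarded values
relative1 = err1 / np.linalg.norm(M)

print("rank-1 error:", round(float(err1), 3), "predicted:", round(float(predicted1), 3))
print("rank-1 relative error:", round(float(relative1), 3))
print("rank-2 error is numerically zero:", bool(err2 < 1e-10))
```

Because M has rank 2, two components reconstruct it up to floating-point rounding.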

Don't be confused: eigendecomposition vs. SVD. Eigendecomposition applies only to square matrices and can produce complex eigenvalues even when the matrix is real; SVD works on any matrix (any shape) and always gives real, non-negative singular values. For the symmetric positive semidefinite matrices common in ML (like covariance), the two coincide: the singular values equal the eigenvalues. When in doubt, reach for SVD, which is more general and more numerically stable.
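Both halves of that contrast can be checked on the two small matrices used earlier in this chapter. A minimal sketch, reusing the rotation `R` and the symmetric matrix `A`:

```python
import numpy as np

# A real rotation matrix has no real direction it merely scales,
# so its eigenvalues are complex, while its singular values stay real.
R = np.array([[0., -1.],
              [1.,  0.]])
eig_R = np.linalg.eigvals(R)                   # complex pair, +1j and -1j
sv_R = np.linalg.svd(R, compute_uv=False)      # both 1: a rotation preserves length

# For a symmetric positive semidefinite matrix the two decompositions agree.
A = np.array([[2., 1.],
              [1., 2.]])
eig_A = np.sort(np.linalg.eigvalsh(A))[::-1]   # eigenvalues, largest first
sv_A = np.linalg.svd(A, compute_uv=False)      # singular values, largest first

print("R eigenvalues:", eig_R, "singular values:", sv_R)
print("A eigenvalues:", eig_A, "singular values:", sv_A)
```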

The mental cheat sheet

| Object | Geometric meaning | Shows up in |
| --- | --- | --- |
| matrix × vector | transform a point | every NN layer |
| dot product | alignment / similarity | cosine, attention |
| norm | vector length | distance, regularization |
| rank | independent directions | compression, matrix factorization |
| eigenvector | axis a matrix only scales | PCA, PageRank |
| SVD | best low-rank factorization | PCA, recommenders, LoRA |

The takeaway

A matrix is a transformation; the dot product measures alignment; norms measure size (L1/L2 = your regularizers and distances); rank measures redundancy (the basis of compression and low-rank tricks); eigenvectors are a matrix's natural scaling axes; and SVD, the master tool, gives the best low-rank approximation of any matrix, powering PCA, recommenders, and LoRA. With probability and linear algebra in hand, we can tackle the applied-stats skill every product team needs: running a trustworthy experiment.