Background: kernels & change points
Two ideas power KTS: change-point detection and kernels. We'll build both from scratch, with numbers you can check by hand.
Change-point detection
We have an ordered sequence of frame fingerprints $x_1,\dots,x_n$. A change point is a position where the sequence suddenly starts behaving differently — the boundary between two shots.
If we mark change points at positions $t_1 < t_2 < \dots < t_m$, they cut the sequence into $m+1$ segments (the stretches between consecutive cuts). For example, with 10 frames and cuts after frame 3 and frame 7, we get three segments: frames 1–3, 4–7, 8–10.
How do we know a segmentation is good? We use within-segment variance — a measure of how spread out the frames inside a segment are. A tight, consistent segment (all frames similar) has low variance; a segment that mixes a beach and a party has high variance. The total cost of a segmentation is the sum of its segments' variances, and lower is better.
⚠️ One catch: if you're allowed unlimited cuts, the cheapest answer is to put every frame in its own segment (variance = 0). That's useless. So KTS either fixes the number of segments (Chapter 5) or adds a penalty for having too many (Chapter 6).
Why a "kernel"?
To measure variance we need to measure similarity between frames. The simplest similarity is the dot product (from the primer): multiply two fingerprints position-by-position and add up. Big result = similar.
But a plain dot product assumes the "natural" way to compare frames is a straight line in number-space. Often that's too rigid. A kernel is just a more flexible similarity function $k(x, y)$ — give it two fingerprints, it returns a number saying how alike they are. Different kernels encode different notions of "alike".
The beauty of kernels: we collect every pairwise similarity into one grid called the kernel matrix (or Gram matrix) $K$, where the entry in row $i$, column $j$ is
$$ K_{ij} = k(x_i, x_j) = \text{similarity between frame } i \text{ and frame } j. $$
Everything KTS does is read numbers out of this one grid.
Three common kernels
| Kernel | Formula | In plain words |
|---|---|---|
| Linear | $k(x,y) = x^\top y$ | Plain dot product. Recovers ordinary variance. |
| Cosine | $k(x,y) = \dfrac{x^\top y}{\lVert x\rVert\,\lVert y\rVert}$ | Dot product after rescaling each vector to length 1 — measures direction only, ignoring magnitude. A robust default. |
| RBF / Gaussian | $k(x,y) = \exp\!\big(-\lVert x-y\rVert^2 / 2\sigma^2\big)$ | "1" when identical, fading toward "0" as they get farther apart. Captures non-linear similarity. |
(Here $x^\top y$ is the dot product, and $\lVert x\rVert$ means the length of vector $x$.)
A concrete kernel matrix
Take three 2-number frames. Frames 0 and 1 point "right"; frame 2 points "up":
import numpy as np
X = np.array([[1.0, 0.0], # frame 0: points right
[0.9, 0.1], # frame 1: almost the same as frame 0
[0.1, 1.0]]) # frame 2: points up (different)
Using the cosine kernel (direction similarity), the kernel matrix is:
[[1. 0.994 0.1 ]
[0.994 1. 0.209]
[0.1 0.209 1. ]]
Read it like a similarity table:
- The diagonal is all
1.0: every frame is perfectly similar to itself. - Row 0, column 1 =
0.994: frames 0 and 1 are almost identical (both point right). High number → same shot. - Row 0, column 2 =
0.1: frames 0 and 2 are very different. Low number → a boundary lives between them.
That single grid already "knows" frame 2 belongs to a different shot. The next chapter turns this intuition into the exact cost formula KTS minimizes. 👉