Background: kernels & change points

Two ideas power KTS: change-point detection and kernels. We'll build both from scratch, with numbers you can check by hand.

Change-point detection

We have an ordered sequence of frame fingerprints $x_1,\dots,x_n$. A change point is a position where the sequence suddenly starts behaving differently — the boundary between two shots.

If we mark change points at positions $t_1 < t_2 < \dots < t_m$, they cut the sequence into $m+1$ segments (the stretches between consecutive cuts). For example, with 10 frames and cuts after frame 3 and frame 7, we get three segments: frames 1–3, 4–7, 8–10.

How do we know a segmentation is good? We use within-segment variance — a measure of how spread out the frames inside a segment are. A tight, consistent segment (all frames similar) has low variance; a segment that mixes a beach and a party has high variance. The total cost of a segmentation is the sum of its segments' variances, and lower is better.

⚠️ One catch: if you're allowed unlimited cuts, the cheapest answer is to put every frame in its own segment (variance = 0). That's useless. So KTS either fixes the number of segments (Chapter 5) or adds a penalty for having too many (Chapter 6).

Why a "kernel"?

To measure variance we need to measure similarity between frames. The simplest similarity is the dot product (from the primer): multiply two fingerprints position-by-position and add up. Big result = similar.

But a plain dot product assumes the "natural" way to compare frames is a straight line in number-space. Often that's too rigid. A kernel is just a more flexible similarity function $k(x, y)$ — give it two fingerprints, it returns a number saying how alike they are. Different kernels encode different notions of "alike".

The beauty of kernels: we collect every pairwise similarity into one grid called the kernel matrix (or Gram matrix) $K$, where the entry in row $i$, column $j$ is

$$ K_{ij} = k(x_i, x_j) = \text{similarity between frame } i \text{ and frame } j. $$

Everything KTS does is read numbers out of this one grid.

Three common kernels

KernelFormulaIn plain words
Linear$k(x,y) = x^\top y$Plain dot product. Recovers ordinary variance.
Cosine$k(x,y) = \dfrac{x^\top y}{\lVert x\rVert\,\lVert y\rVert}$Dot product after rescaling each vector to length 1 — measures direction only, ignoring magnitude. A robust default.
RBF / Gaussian$k(x,y) = \exp\!\big(-\lVert x-y\rVert^2 / 2\sigma^2\big)$"1" when identical, fading toward "0" as they get farther apart. Captures non-linear similarity.

(Here $x^\top y$ is the dot product, and $\lVert x\rVert$ means the length of vector $x$.)

A concrete kernel matrix

Take three 2-number frames. Frames 0 and 1 point "right"; frame 2 points "up":

import numpy as np

X = np.array([[1.0, 0.0],     # frame 0: points right
              [0.9, 0.1],     # frame 1: almost the same as frame 0
              [0.1, 1.0]])    # frame 2: points up (different)

Using the cosine kernel (direction similarity), the kernel matrix is:

[[1.    0.994 0.1  ]
 [0.994 1.    0.209]
 [0.1   0.209 1.   ]]

Read it like a similarity table:

  • The diagonal is all 1.0: every frame is perfectly similar to itself.
  • Row 0, column 1 = 0.994: frames 0 and 1 are almost identical (both point right). High number → same shot.
  • Row 0, column 2 = 0.1: frames 0 and 2 are very different. Low number → a boundary lives between them.

That single grid already "knows" frame 2 belongs to a different shot. The next chapter turns this intuition into the exact cost formula KTS minimizes. 👉