The math behind KTS

This is the only heavily mathematical chapter. We'll go slowly and explain every symbol. The payoff is one simple formula that scores how "tight" a segment is, using only the kernel matrix from the previous chapter.

Reading the notation

A few symbols appear a lot:

$\sum_{t=i}^{j}$ means "add up, for every $t$ from $i$ to $j$." For example $\sum_{t=1}^{3} t = 1 + 2 + 3 = 6$. The big $\Sigma$ is just shorthand for "sum".
$\phi_t$ (the Greek letter phi) is frame $t$'s fingerprint after the kernel's similarity transformation. You never compute it directly — it's a thinking tool.
$\mu$ (mu) is the average (mean) of a group of fingerprints.
$\lVert a \rVert^2$ is the squared length of vector $a$ — a measure of "how big" it is. $\lVert a - b \rVert^2$ is then the squared distance between $a$ and $b$: small when they're close, large when far apart.
$\langle a, b \rangle$ is the dot product of $a$ and $b$ (the similarity measure from before). Crucially, $\langle \phi_s, \phi_t \rangle = K_{st}$ — the dot product of two fingerprints is exactly the kernel-matrix entry.

Step 1: what "spread out" means

Take one segment covering frames $i, i+1, \dots, j$. Call its length $L = j - i + 1$ (the number of frames in it). Its average fingerprint is

$$ \mu = \frac{1}{L} \sum_{t=i}^{j} \phi_t. $$

(Add up the fingerprints in the segment, divide by how many there are — an ordinary average.)

The scatter of the segment is the total squared distance from each frame to that average:

$$ v(i, j) = \sum_{t=i}^{j} \lVert \phi_t - \mu \rVert^2. $$

In words: how far is each frame from the segment's "typical" frame, added up. A consistent segment → every frame near the average → small $v$. A jumbled segment → frames far from the average → large $v$. This single number is what KTS uses to grade a segment.

Step 2: rewrite it using only the kernel matrix

The problem: $\mu$ and $\phi_t$ live in an abstract space we can't compute in directly. The fix: expand the formula until only dot products $\langle \phi_s, \phi_t \rangle = K_{st}$ remain.

Expanding the squared distance (using $\lVert a-b\rVert^2 = \langle a,a\rangle - 2\langle a,b\rangle + \langle b,b\rangle$):

$$ v(i,j) = \sum_{t=i}^{j} \Big( \langle \phi_t, \phi_t \rangle - 2\langle \phi_t, \mu \rangle + \langle \mu, \mu \rangle \Big). $$

Now substitute $\mu = \frac{1}{L}\sum_s \phi_s$ and replace every $\langle \phi_s, \phi_t \rangle$ with $K_{st}$. The three pieces become:

$\displaystyle \sum_{t=i}^{j} \langle \phi_t, \phi_t \rangle = \sum_{t=i}^{j} K_{tt}$ — the diagonal entries summed.
$\displaystyle \sum_{t=i}^{j} 2\langle \phi_t, \mu \rangle = \frac{2}{L}\sum_{t=i}^{j}\sum_{s=i}^{j} K_{st}$ — twice the sum of the whole block, divided by $L$.
$\displaystyle \sum_{t=i}^{j} \langle \mu, \mu \rangle = \frac{1}{L}\sum_{s,t} K_{st}$ — the block sum divided by $L$.

The last two combine, leaving the key formula:

$$ \boxed{\,v(i,j) = \sum_{t=i}^{j} K_{tt} \;-\; \frac{1}{L} \sum_{s=i}^{j}\sum_{t=i}^{j} K_{st}\,} $$

In plain words:

scatter = (sum of the diagonal entries for the segment) − (sum of the whole square block of $K$ for the segment, divided by its length).

No averages, no abstract fingerprints — just sums of numbers we already have in the grid $K$. That's the whole trick.

A worked example you can check by hand

Take two frames that are each a single number: $x_1 = 0$ and $x_2 = 0.2$, with the linear kernel (so this is just ordinary distance). The segment containing both has length $L = 2$ and average $\mu = (0 + 0.2)/2 = 0.1$.

Directly from the definition:

$$ v = (0 - 0.1)^2 + (0.2 - 0.1)^2 = 0.01 + 0.01 = 0.02. $$

In Chapter 4 we compute this segment's scatter with code and get the matching value 0.02. Two completely different routes — by hand and by computer — agree.

The total objective

A full segmentation with change points $t_1 < \dots < t_m$ (giving $m+1$ segments) is scored by adding up every segment's scatter:

$$ J = \sum_{r=0}^{m} v\big(t_r,\ t_{r+1}-1\big). $$

KTS chooses the cut positions that make $J$ as small as possible. Two cases:

You know how many segments you want → minimize $J$ directly with dynamic programming (Chapter 5).
You don't → add a penalty for extra cuts and let the computer pick (Chapter 6).

Why we'll precompute sums

Each $v(i,j)$ needs a sum over a block of $K$. Done naively for every possible segment, that's far too slow. The next chapter shows a trick (cumulative sums) that makes each scatter an instant lookup. First, let's write the code to build $K$. 👉

Kernel Temporal Segmentation from Scratch