Introduction

No background required. This book assumes you know nothing about programming, AI, or math beyond high-school basics. Every concept, line of code, and symbol is explained, and every code example is followed by the exact output it produces. If you've never written Python, read the 5-minute primer first.

The one-sentence idea

Kernel Temporal Segmentation (KTS) automatically chops a video into a small number of consistent pieces ("shots" or "scenes") by finding the moments where the picture changes a lot.

An everyday analogy

Imagine flipping through a photo album. Pages 1–10 are a beach trip, pages 11–18 are a birthday party, pages 19–25 are a hike. You can instantly spot the boundaries — the points where the photos suddenly start looking different. KTS does exactly this, automatically, for the frames of a video. The boundaries it finds are called change points.

Where it's used

KTS was introduced by Potapov, Douze, Harchaoui & Schmid in "Category-Specific Video Summarization" (ECCV 2014). It became a standard first step in video summarization: before a computer decides which parts of a video are worth keeping, it first splits the video into shots — and KTS is the tool that does the splitting. It's used to prepare well-known datasets like TVSum and SumMe.

How a video becomes numbers

A computer can't "see" a picture the way we do — it works with numbers. So each video frame is turned into a list of numbers called a feature vector: a numeric fingerprint of that frame. Similar-looking frames get similar fingerprints.

So our input is a sequence of feature vectors, one per frame:

$$ x_1, x_2, \dots, x_n \in \mathbb{R}^d $$

Read that line as: "there are $n$ frames; each frame $x_t$ is a list of $d$ numbers." (The symbol $\mathbb{R}^d$ just means "a list of $d$ real numbers.") Frames within the same shot have similar fingerprints; at a shot boundary the fingerprint changes sharply. KTS finds those sharp changes.

What KTS actually does

Given those fingerprints, KTS finds the change points so that:

  • frames inside a segment are as similar to each other as possible, and
  • the number of segments stays small (we don't want one segment per frame).

It does this with two ideas, which are the two halves of this book:

  1. Measure how "spread out" a segment is using a kernel — a flexible similarity measure (Chapters 1–2).
  2. Find the cut points that minimize the total spread using dynamic programming, a technique that is guaranteed to find the best possible answer, not just a good guess (Chapters 3–6).

What you'll build

By the end you'll have one short, fully-understood Python file, kts.py, and you'll watch it recover hidden boundaries in test data exactly.

What you need

  • Python 3.8+ (the primer shows how code/output is presented)
  • NumPy (a number-crunching library — also covered in the primer)
  • (optional) Matplotlib, only for one picture in the final demo

Let's start with the two background ideas: kernels and change points. 👉

A 5-minute Python & NumPy primer

This book assumes no prior experience with Python, AI, or any library. This page gives you just enough to read every code example. If you already know Python and NumPy, skip ahead.

What is Python?

Python is a programming language — a way to write step-by-step instructions for a computer. You write text in a file ending in .py and run it. Throughout this book, grey boxes contain Python code, and the box right after shows what it prints when you run it:

print("hello")
print(2 + 3)

Output:

hello
5

print(...) displays a value on the screen. That's how we'll inspect results.

Variables, functions, lists

x = 10                 # a variable: the name x now refers to the number 10
name = "video"         # text (a "string") goes in quotes
nums = [4, 1, 7]       # a list: an ordered collection of values

A function is a reusable recipe. You "call" it with inputs (called arguments) in parentheses, and it gives back a result:

def double(value):     # define a function named "double"
    return value * 2   # "return" hands a result back to the caller

print(double(21))      # call it

Output:

42

What is NumPy?

NumPy is a Python library (a bundle of pre-written code you can reuse) for working with arrays of numbers efficiently. We load it once at the top of a file and, by convention, nickname it np:

import numpy as np

Arrays: 1-D (a list of numbers) and 2-D (a grid/matrix)

import numpy as np

v = np.array([1.0, 2.0, 3.0])          # a 1-D array (a "vector")
M = np.array([[1.0, 2.0],              # a 2-D array (a "matrix" / grid)
              [3.0, 4.0]])
print(v)
print(M)
print("shape of M:", M.shape)          # (rows, columns)

Output:

[1. 2. 3.]
[[1. 2.]
 [3. 4.]]
shape of M: (2, 2)
  • A vector is just a row of numbers. In this book, each video frame is described by a vector of numbers (its "features" — think of it as a numeric fingerprint of the image).
  • A matrix is a grid of numbers (rows × columns). M.shape tells you its size as (rows, columns).

The handful of NumPy operations we use

You'll seeWhat it means
A @ BMatrix multiplication (a specific way to combine two grids of numbers).
A.TTranspose — flip a matrix over its diagonal (rows become columns).
np.sum(A)Add up all the numbers in A.
np.cumsum(a)Running totals: [1,2,3][1,3,6].
np.diag(M)The diagonal entries of a square matrix (top-left to bottom-right).
A.shapeThe size of an array, as (rows, columns).
np.zeros((r, c))A grid of the given size filled with 0.
np.arange(k)The whole numbers 0, 1, ..., k-1.

You don't need to memorize these — each is explained again the first time it appears.

One more idea: the dot product

The dot product of two vectors multiplies them position-by-position and adds up the results. It's the basic "how aligned are these two things?" measurement, and it's the seed of everything in this book:

import numpy as np
a = np.array([1.0, 0.0])
b = np.array([0.9, 0.1])
print(np.dot(a, b))     # 1*0.9 + 0*0.1

Output:

0.9

A big dot product means the two vectors point in a similar direction (the frames look alike); a small one means they're different. Hold onto that intuition — the next chapter builds the whole method on it. 👉

Background: kernels & change points

Two ideas power KTS: change-point detection and kernels. We'll build both from scratch, with numbers you can check by hand.

Change-point detection

We have an ordered sequence of frame fingerprints $x_1,\dots,x_n$. A change point is a position where the sequence suddenly starts behaving differently — the boundary between two shots.

If we mark change points at positions $t_1 < t_2 < \dots < t_m$, they cut the sequence into $m+1$ segments (the stretches between consecutive cuts). For example, with 10 frames and cuts after frame 3 and frame 7, we get three segments: frames 1–3, 4–7, 8–10.

How do we know a segmentation is good? We use within-segment variance — a measure of how spread out the frames inside a segment are. A tight, consistent segment (all frames similar) has low variance; a segment that mixes a beach and a party has high variance. The total cost of a segmentation is the sum of its segments' variances, and lower is better.

⚠️ One catch: if you're allowed unlimited cuts, the cheapest answer is to put every frame in its own segment (variance = 0). That's useless. So KTS either fixes the number of segments (Chapter 5) or adds a penalty for having too many (Chapter 6).

Why a "kernel"?

To measure variance we need to measure similarity between frames. The simplest similarity is the dot product (from the primer): multiply two fingerprints position-by-position and add up. Big result = similar.

But a plain dot product assumes the "natural" way to compare frames is a straight line in number-space. Often that's too rigid. A kernel is just a more flexible similarity function $k(x, y)$ — give it two fingerprints, it returns a number saying how alike they are. Different kernels encode different notions of "alike".

The beauty of kernels: we collect every pairwise similarity into one grid called the kernel matrix (or Gram matrix) $K$, where the entry in row $i$, column $j$ is

$$ K_{ij} = k(x_i, x_j) = \text{similarity between frame } i \text{ and frame } j. $$

Everything KTS does is read numbers out of this one grid.

Three common kernels

KernelFormulaIn plain words
Linear$k(x,y) = x^\top y$Plain dot product. Recovers ordinary variance.
Cosine$k(x,y) = \dfrac{x^\top y}{\lVert x\rVert\,\lVert y\rVert}$Dot product after rescaling each vector to length 1 — measures direction only, ignoring magnitude. A robust default.
RBF / Gaussian$k(x,y) = \exp\!\big(-\lVert x-y\rVert^2 / 2\sigma^2\big)$"1" when identical, fading toward "0" as they get farther apart. Captures non-linear similarity.

(Here $x^\top y$ is the dot product, and $\lVert x\rVert$ means the length of vector $x$.)

A concrete kernel matrix

Take three 2-number frames. Frames 0 and 1 point "right"; frame 2 points "up":

import numpy as np

X = np.array([[1.0, 0.0],     # frame 0: points right
              [0.9, 0.1],     # frame 1: almost the same as frame 0
              [0.1, 1.0]])    # frame 2: points up (different)

Using the cosine kernel (direction similarity), the kernel matrix is:

[[1.    0.994 0.1  ]
 [0.994 1.    0.209]
 [0.1   0.209 1.   ]]

Read it like a similarity table:

  • The diagonal is all 1.0: every frame is perfectly similar to itself.
  • Row 0, column 1 = 0.994: frames 0 and 1 are almost identical (both point right). High number → same shot.
  • Row 0, column 2 = 0.1: frames 0 and 2 are very different. Low number → a boundary lives between them.

That single grid already "knows" frame 2 belongs to a different shot. The next chapter turns this intuition into the exact cost formula KTS minimizes. 👉

The math behind KTS

This is the only heavily mathematical chapter. We'll go slowly and explain every symbol. The payoff is one simple formula that scores how "tight" a segment is, using only the kernel matrix from the previous chapter.

Reading the notation

A few symbols appear a lot:

  • $\sum_{t=i}^{j}$ means "add up, for every $t$ from $i$ to $j$." For example $\sum_{t=1}^{3} t = 1 + 2 + 3 = 6$. The big $\Sigma$ is just shorthand for "sum".
  • $\phi_t$ (the Greek letter phi) is frame $t$'s fingerprint after the kernel's similarity transformation. You never compute it directly — it's a thinking tool.
  • $\mu$ (mu) is the average (mean) of a group of fingerprints.
  • $\lVert a \rVert^2$ is the squared length of vector $a$ — a measure of "how big" it is. $\lVert a - b \rVert^2$ is then the squared distance between $a$ and $b$: small when they're close, large when far apart.
  • $\langle a, b \rangle$ is the dot product of $a$ and $b$ (the similarity measure from before). Crucially, $\langle \phi_s, \phi_t \rangle = K_{st}$ — the dot product of two fingerprints is exactly the kernel-matrix entry.

Step 1: what "spread out" means

Take one segment covering frames $i, i+1, \dots, j$. Call its length $L = j - i + 1$ (the number of frames in it). Its average fingerprint is

$$ \mu = \frac{1}{L} \sum_{t=i}^{j} \phi_t. $$

(Add up the fingerprints in the segment, divide by how many there are — an ordinary average.)

The scatter of the segment is the total squared distance from each frame to that average:

$$ v(i, j) = \sum_{t=i}^{j} \lVert \phi_t - \mu \rVert^2. $$

In words: how far is each frame from the segment's "typical" frame, added up. A consistent segment → every frame near the average → small $v$. A jumbled segment → frames far from the average → large $v$. This single number is what KTS uses to grade a segment.

Step 2: rewrite it using only the kernel matrix

The problem: $\mu$ and $\phi_t$ live in an abstract space we can't compute in directly. The fix: expand the formula until only dot products $\langle \phi_s, \phi_t \rangle = K_{st}$ remain.

Expanding the squared distance (using $\lVert a-b\rVert^2 = \langle a,a\rangle - 2\langle a,b\rangle + \langle b,b\rangle$):

$$ v(i,j) = \sum_{t=i}^{j} \Big( \langle \phi_t, \phi_t \rangle - 2\langle \phi_t, \mu \rangle + \langle \mu, \mu \rangle \Big). $$

Now substitute $\mu = \frac{1}{L}\sum_s \phi_s$ and replace every $\langle \phi_s, \phi_t \rangle$ with $K_{st}$. The three pieces become:

  • $\displaystyle \sum_{t=i}^{j} \langle \phi_t, \phi_t \rangle = \sum_{t=i}^{j} K_{tt}$ — the diagonal entries summed.
  • $\displaystyle \sum_{t=i}^{j} 2\langle \phi_t, \mu \rangle = \frac{2}{L}\sum_{t=i}^{j}\sum_{s=i}^{j} K_{st}$ — twice the sum of the whole block, divided by $L$.
  • $\displaystyle \sum_{t=i}^{j} \langle \mu, \mu \rangle = \frac{1}{L}\sum_{s,t} K_{st}$ — the block sum divided by $L$.

The last two combine, leaving the key formula:

$$ \boxed{\,v(i,j) = \sum_{t=i}^{j} K_{tt} \;-\; \frac{1}{L} \sum_{s=i}^{j}\sum_{t=i}^{j} K_{st}\,} $$

In plain words:

scatter = (sum of the diagonal entries for the segment) − (sum of the whole square block of $K$ for the segment, divided by its length).

No averages, no abstract fingerprints — just sums of numbers we already have in the grid $K$. That's the whole trick.

A worked example you can check by hand

Take two frames that are each a single number: $x_1 = 0$ and $x_2 = 0.2$, with the linear kernel (so this is just ordinary distance). The segment containing both has length $L = 2$ and average $\mu = (0 + 0.2)/2 = 0.1$.

Directly from the definition:

$$ v = (0 - 0.1)^2 + (0.2 - 0.1)^2 = 0.01 + 0.01 = 0.02. $$

In Chapter 4 we compute this segment's scatter with code and get the matching value 0.02. Two completely different routes — by hand and by computer — agree.

The total objective

A full segmentation with change points $t_1 < \dots < t_m$ (giving $m+1$ segments) is scored by adding up every segment's scatter:

$$ J = \sum_{r=0}^{m} v\big(t_r,\ t_{r+1}-1\big). $$

KTS chooses the cut positions that make $J$ as small as possible. Two cases:

  • You know how many segments you want → minimize $J$ directly with dynamic programming (Chapter 5).
  • You don't → add a penalty for extra cuts and let the computer pick (Chapter 6).

Why we'll precompute sums

Each $v(i,j)$ needs a sum over a block of $K$. Done naively for every possible segment, that's far too slow. The next chapter shows a trick (cumulative sums) that makes each scatter an instant lookup. First, let's write the code to build $K$. 👉

Step 1 — The kernel matrix

Everything in KTS reads from the kernel matrix $K$: a grid where the entry in row $i$, column $j$ is the similarity between frame $i$ and frame $j$. Step one is to turn our table of frame fingerprints into this similarity grid.

Our input is an array X with one row per frame and one column per feature number. So X has shape (n, d): n frames, each described by d numbers. The output K has shape (n, n): a similarity for every pair of frames.

The code

import numpy as np


def linear_kernel(X):
    """K[i, j] = dot product of frame i and frame j. Plain variance."""
    return X @ X.T


def cosine_kernel(X, eps=1e-8):
    """Similarity by direction only (length-normalized). A good default."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)   # length of each frame vector
    Xn = X / np.maximum(norms, eps)                    # rescale each to length 1
    return Xn @ Xn.T


def rbf_kernel(X, sigma=None):
    """Gaussian similarity: 1 when identical, fading to 0 as frames differ."""
    sq = np.sum(X**2, axis=1)                          # squared length of each frame
    # squared distance between every pair: ||a-b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    d2 = np.maximum(d2, 0.0)                            # clip tiny negatives (rounding)
    if sigma is None:
        med = np.median(d2[d2 > 0]) if np.any(d2 > 0) else 1.0
        sigma = np.sqrt(0.5 * med) if med > 0 else 1.0  # auto bandwidth
    return np.exp(-d2 / (2.0 * sigma**2))


def build_kernel(X, kind="cosine", **kwargs):
    """Pick a kernel by name and build the matrix."""
    X = np.asarray(X, dtype=np.float64)
    kernels = {"linear": linear_kernel, "cosine": cosine_kernel, "rbf": rbf_kernel}
    if kind not in kernels:
        raise ValueError(f"unknown kernel {kind!r}; pick one of {list(kernels)}")
    return kernels[kind](X, **kwargs) if kind == "rbf" else kernels[kind](X)

Reading the code, line by line

  • X @ X.T@ is matrix multiplication and .T is the transpose (rows ↔ columns). Multiplying X by its own transpose computes the dot product of every pair of rows at once. That grid of dot products is the linear kernel.
  • np.linalg.norm(X, axis=1, keepdims=True) — the length of each frame vector. axis=1 means "do it per row" (one length per frame).
  • X / norms — divide each frame by its own length, so every frame becomes length 1. After this, the dot product measures direction only — that's the cosine kernel. np.maximum(norms, eps) just avoids dividing by zero.
  • In rbf_kernel, sq[:, None] + sq[None, :] - 2 * (X @ X.T) is the algebra identity for squared distance between every pair, computed in one shot.
  • np.exp(...) raises $e$ to a power elementwise; here it turns "distance" into "similarity" (distance 0 → similarity 1; large distance → similarity ~0).

Try it: input and output

import numpy as np

X = np.array([[1.0, 0.0],     # frame 0: points right
              [0.9, 0.1],     # frame 1: nearly the same as frame 0
              [0.1, 1.0]])    # frame 2: points up (different)

K = build_kernel(X, kind="cosine")
print(np.round(K, 3))         # round to 3 decimals for readability

Output:

[[1.    0.994 0.1  ]
 [0.994 1.    0.209]
 [0.1   0.209 1.   ]]

How to read this 3×3 grid:

  • Diagonal is all 1.0 — each frame is identical to itself.
  • K[0,1] = 0.994 — frames 0 and 1 are nearly identical (same shot).
  • K[0,2] = 0.1 — frames 0 and 2 are very different (a boundary sits between them).

The same three frames with the other two kernels:

print("linear:\n", np.round(build_kernel(X, "linear"), 3))
print("rbf:\n",    np.round(build_kernel(X, "rbf"), 3))

Output:

linear:
 [[1.   0.9  0.1 ]
 [0.9  0.82 0.19]
 [0.1  0.19 1.01]]
rbf:
 [[1.    0.986 0.287]
 [0.986 1.    0.368]
 [0.287 0.368 1.   ]]

All three agree on the story — frames 0 and 1 are similar, frame 2 is the odd one out — they just put the numbers on different scales.

Which kernel should I use?

  • cosine — the safe default for AI "deep" features; it ignores brightness/ scale and compares content.
  • linear — reproduces ordinary variance; simplest.
  • rbf — adds non-linear sensitivity, at the cost of a sigma knob (how fast similarity fades with distance). The code picks a reasonable sigma automatically.

Two practical notes

  • Memory. K has n × n entries. A 1-hour video has too many frames, so in practice you keep one frame per second, run KTS, then map the change points back to real time.
  • Sanity check. A valid kernel matrix is symmetric (K[i,j] == K[j,i]), and for the cosine kernel the diagonal is exactly 1:
Xr = np.random.randn(6, 4)            # 6 random frames, 4 features each
Kc = build_kernel(Xr, "cosine")
print("symmetric:", np.allclose(Kc, Kc.T))
print("diagonal all 1:", np.allclose(np.diag(Kc), 1.0))

Output:

symmetric: True
diagonal all 1: True

With $K$ in hand we can score any segment. Let's make that fast. 👉

Step 2 — The scatter (cost) matrix

Recall the per-segment cost from the math chapter: the scatter $v(i,j)$ of frames $i$ through $j$ is

$$ v(i,j) = \sum_{t=i}^{j} K_{tt} \;-\; \frac{1}{L} \sum_{s=i}^{j}\sum_{t=i}^{j} K_{st}, \qquad L = j - i + 1. $$

We want this number for every possible segment, and we want each one to be instant to look up. The trick that makes it instant is cumulative sums.

What is a cumulative sum?

A cumulative (running) sum replaces each number with the total of everything up to it: [5, 2, 4] becomes [5, 7, 11]. Why bother? Because once you have running totals, the sum of any stretch is a single subtraction. The sum of items 1–2 above is 11 − 5 = 6 — no re-adding. We use the same idea in two dimensions.

The code

import numpy as np


def calc_scatters(K):
    """
    Return J, an n x n table where J[i, j] is the scatter v(i, j) of frames
    i..j. Building it costs O(n^2); each entry is then an O(1) lookup.
    """
    n = K.shape[0]

    # running total of the diagonal: dcum[t] = sum of K[0,0] .. K[t-1,t-1]
    dcum = np.concatenate(([0.0], np.cumsum(np.diag(K))))

    # 2-D running total: Kcum[a, b] = sum of the top-left a x b block of K
    Kcum = np.zeros((n + 1, n + 1))
    Kcum[1:, 1:] = np.cumsum(np.cumsum(K, axis=0), axis=1)

    J = np.zeros((n, n))
    for i in range(n):
        for j in range(i, n):
            L = j - i + 1
            diag_sum = dcum[j + 1] - dcum[i]                 # diagonal part
            block = (Kcum[j + 1, j + 1] - Kcum[i, j + 1]     # block sum via
                     - Kcum[j + 1, i] + Kcum[i, i])          # 4 corner lookups
            J[i, j] = diag_sum - block / L
    return J

Reading the code

  • np.diag(K) pulls out the diagonal (each frame's self-similarity). np.cumsum(...) makes its running total, so dcum[j+1] - dcum[i] instantly gives the diagonal sum for frames $i..j$.
  • np.cumsum(np.cumsum(K, axis=0), axis=1) does the running total down then across, giving a 2-D running total Kcum. After that, the sum of any rectangular block of K is four corner lookups added/subtracted (see below).
  • The two for loops fill in J[i, j] for every segment using those instant lookups.

Why the four corners work

For a 2-D running total, the sum inside a rectangle equals (big rectangle) − (strip above) − (strip left) + (overlap added back):

sum = Kcum[j+1, j+1]   # everything from the origin down to (j, j)
    - Kcum[i,   j+1]   # subtract the strip above the block
    - Kcum[j+1, i  ]   # subtract the strip to the left
    + Kcum[i,   i  ]   # the top-left corner got subtracted twice; add it back

This is the standard "inclusion–exclusion" pattern for rectangle sums.

Try it: input and output

Four frames that are single numbers — two near 0, two near 5 (so clearly two groups). We use the linear kernel:

import numpy as np

X = np.array([[0.0], [0.2], [5.0], [5.1]])
K = X @ X.T               # linear kernel
J = calc_scatters(K)
print(np.round(J, 3))

Output:

[[ 0.     0.02  16.027 24.527]
 [ 0.     0.    11.52  15.687]
 [ 0.     0.     0.     0.005]
 [ 0.     0.     0.     0.   ]]

How to read J[i, j] (only the upper triangle is filled; j ≥ i):

  • J[0,1] = 0.02 — frames 0–1 (both near 0) form a tight segment. Tiny scatter = good segment. (This is the exact number we computed by hand in Chapter 2! ✔)
  • J[2,3] = 0.005 — frames 2–3 (both near 5) are also tight.
  • J[0,3] = 24.527 — lumping all four frames into one segment mixes the two groups, so the scatter is huge. Bad segment.

The table already tells us the right split: keep 0–1 together, keep 2–3 together, and don't merge across the gap. Dynamic programming (next chapter) makes that decision automatically.

Speed

Both the running totals and filling J cost on the order of $n^2$ operations — the same as just storing $K$. So we get every segment's cost cheaply.

Sanity check against the definition

We can confirm calc_scatters matches the original "distance to the average" definition by brute force:

def brute_scatter(X, i, j):
    seg = X[i:j + 1]
    mu = seg.mean(axis=0)              # the segment average
    return float(((seg - mu) ** 2).sum())

X = np.random.randn(8, 3)             # 8 random frames, 3 features
J = calc_scatters(X @ X.T)
ok = all(abs(J[i, j] - brute_scatter(X, i, j)) < 1e-8
         for i in range(8) for j in range(i, 8))
print("scatter matrix matches the definition:", ok)

Output:

scatter matrix matches the definition: True

Now every segment has a cost. Time to find the best combination of segments. 👉

Step 3 — Dynamic programming

We can now score any single segment instantly. The remaining problem: out of the enormous number of ways to place the cuts, find the one combination with the smallest total scatter

$$ J = \sum_{r=0}^{m} v\big(t_r,\ t_{r+1}-1\big). $$

Trying every possibility is hopeless — for $n$ frames and $m$ cuts there are $\binom{n-1}{m}$ combinations, which explodes. Dynamic programming (DP) finds the guaranteed-best answer without trying them all.

The idea, in plain words

DP is "solve small pieces, reuse the answers." Here's the intuition:

Suppose you already knew the best way to split the first part of the video into a few segments. Then to split a slightly longer part, you only need to decide where the last segment starts — everything before it is a smaller problem you've already solved.

So we build up the answer for longer and longer prefixes of the video, reusing the best answers to the shorter prefixes. Because every sub-answer is optimal, the final answer is optimal too.

The recurrence

Let $C[k, \ell]$ be the smallest possible total scatter when splitting the first $\ell$ frames into exactly $k+1$ segments (that's $k$ cuts). The last segment must start somewhere — call it $t$ — and run to frame $\ell-1$. Everything before $t$ is the best $k$-segment split of the first $t$ frames. So we just try every possible start $t$ and keep the cheapest:

$$ C[k, \ell] \;=\; \min_{t} \Big( C[k-1,\ t] \;+\; v(t,\ \ell-1) \Big). $$

  • Base case: $C[0, \ell] = v(0, \ell-1)$ — with zero cuts there's one segment covering everything.
  • Answer: $C[m, n]$ is the best total scatter for the whole video with $m$ cuts.

To recover where the cuts are (not just the cost), we remember, for each cell, which start $t$ won. Walking those choices backward is called backtracking.

The code

import numpy as np


def cpd_nonlin(K, ncp, lmin=1, lmax=None):
    """
    Globally optimal segmentation given the kernel matrix K.

    K    : (n, n) kernel matrix.
    ncp  : number of change points m  (=> m + 1 segments).
    lmin : smallest allowed segment length.
    lmax : largest allowed segment length (defaults to n).

    Returns (cps, cost):
      cps  : the m change-point positions (start frame of each new segment).
      cost : the total scatter of that optimal segmentation.
    """
    n = K.shape[0]
    m = int(ncp)
    if lmax is None:
        lmax = n
    if (m + 1) * lmin > n:
        raise ValueError("sequence too short for this many change points")

    J = calc_scatters(K)                 # from Step 2: every segment's cost

    INF = 1e18
    C = np.full((m + 1, n + 1), INF)     # C[k, l]: best cost, first l frames, k+1 segs
    back = np.zeros((m + 1, n + 1), dtype=int)   # remembers the winning start t

    # base row: a single segment [0 .. l-1]
    for l in range(lmin, min(lmax, n) + 1):
        C[0, l] = J[0, l - 1]

    # fill rows k = 1 .. m
    for k in range(1, m + 1):
        for l in range((k + 1) * lmin, n + 1):
            t_lo = max(k * lmin, l - lmax)
            t_hi = l - lmin              # last segment must be at least lmin long
            if t_lo > t_hi:
                continue
            for t in range(t_lo, t_hi + 1):
                cost = C[k - 1, t] + J[t, l - 1]
                if cost < C[k, l]:
                    C[k, l] = cost
                    back[k, l] = t

    # backtrack from the full problem to read off the cuts
    cps = np.zeros(m, dtype=int)
    l = n
    for k in range(m, 0, -1):
        t = back[k, l]
        cps[k - 1] = t
        l = t
    return cps, C[m, n]

Reading the code

  • C is the table of best costs; C[k, l] is "first l frames, k+1 segments."
  • np.full((m+1, n+1), INF) fills the table with a huge number (INF) meaning "not computed / impossible yet"; real costs will be smaller and overwrite it.
  • The three nested loops implement the recurrence: for each number of cuts k and each prefix length l, try every start t for the last segment and keep the cheapest, recording the winner in back.
  • lmin/lmax (min/max segment length) shrink the search and prevent silly one-frame segments.
  • The final loop reads back from the whole-video cell backward to list the cuts.

Try it: input and output

Six frames: the first three point "right", the last three point "up" — two obvious shots, so the boundary should be at frame 3.

import numpy as np
from kts import build_kernel, cpd_nonlin

X = np.array([[1.0, 0.0], [0.95, 0.10], [0.90, 0.05],   # shot A (frames 0,1,2)
              [0.10, 0.90], [0.05, 0.95], [0.00, 1.00]]) # shot B (frames 3,4,5)
K = build_kernel(X, "cosine")

cps, cost = cpd_nonlin(K, ncp=1, lmin=1)    # ask for exactly 1 cut
print("1 cut :", cps.tolist(), " cost:", round(cost, 4))

cps2, cost2 = cpd_nonlin(K, ncp=2, lmin=1)  # ask for exactly 2 cuts
print("2 cuts:", cps2.tolist(), " cost:", round(cost2, 4))

Output:

1 cut : [3]  cost: 0.0116
2 cuts: [3, 4]  cost: 0.0069
  • With 1 cut, KTS puts it at frame 3 — exactly the true boundary between the two shots. 🎯
  • With 2 cuts, it keeps the real boundary at 3 and adds a second cut at 4. Notice the cost barely drops (0.0116 → 0.0069): the first cut explains almost all the structure; the second is just splitting hairs. That tiny extra gain is the clue we'll use in the next chapter to decide the right number of cuts automatically.

How backtracking recovers the cuts

back[k, l] stored the best start of the last segment for each sub-problem. Starting from the whole video (m, n), we read that start, jump to the smaller sub-problem, and repeat — collecting one cut each step. Those starts are the change points.

Speed

The loops run on the order of $m \times n \times n$ steps in the worst case — polynomial and fast, versus the impossible "try everything" count. (The full implementation in Chapter 7 replaces the inner for t loop with a single NumPy operation for extra speed; the logic is identical.)

This solves the problem when you already know how many cuts you want. Usually you don't — so next we let KTS choose. 👉

Step 4 — Choosing the number of segments

cpd_nonlin makes you specify $m$, the number of cuts. But in real life you don't know how many shots a video has. This chapter makes KTS decide for you.

The trade-off

More cuts always lower the total scatter. (In the extreme, one frame per segment gives a scatter of zero — perfectly "tight" but completely useless.) So we can't just minimize scatter; that would always choose the maximum number of cuts.

We need to charge a fee for every cut, then pick the number of cuts that minimizes scatter + fees:

$$ m^\star = \arg\min_m \Big( J_m + \text{penalty}(m) \Big). $$

($\arg\min$ means "the $m$ that gives the smallest value".) This is the same spirit as not over-fitting in statistics: a new cut is only worth it if it lowers the scatter by more than its fee.

The penalty (and a bug to avoid)

A natural-but-wrong idea is to make the fee proportional to the scatter. That fails: as scatter shrinks toward zero, the fee shrinks too, so cuts become free and KTS never stops splitting. (We actually hit this and fixed it.)

The correct penalty grows with the number of cuts, independent of the scatter. For $m$ cuts over $n$ frames:

$$ \text{penalty}(m) = \frac{m}{2n},\Big( \log\!\big(\tfrac{n}{m}\big) + 1 \Big), \qquad \text{score}(m) = \frac{J_m}{n} + v_{\max}\cdot\text{penalty}(m). $$

The fee rises steadily with each extra cut, so the score goes down while cuts are explaining real structure, then up once they're just splitting hairs. The lowest point is the answer. The weight $v_{\max}$ is your one knob: larger → fewer segments.

The code

import numpy as np


def cpd_auto(K, ncp_max, vmax=1.0, lmin=1, lmax=None):
    """
    Automatically choose the number of change points.

    K       : (n, n) kernel matrix.
    ncp_max : largest number of cuts to consider.
    vmax    : penalty weight. Larger -> fewer segments. (try 0.5 .. 2.0)
    lmin    : minimum segment length.

    Returns (cps, scores):
      cps    : change points of the chosen segmentation.
      scores : dict {m: score} so you can see/plot the trade-off.
    """
    n = K.shape[0]
    ncp_max = min(ncp_max, n - 1)
    J = calc_scatters(K)                     # compute once, reuse for every m

    best_score = np.inf
    best_cps = np.array([], dtype=int)
    scores = {}

    for m in range(0, ncp_max + 1):
        if (m + 1) * lmin > n:
            break
        cps, J_m = cpd_nonlin(K, m, lmin=lmin, lmax=lmax, J=J)
        penalty = 0.0 if m == 0 else (m / (2.0 * n)) * (np.log(n / m) + 1.0)
        score = J_m / n + vmax * penalty
        scores[m] = score
        if score < best_score:
            best_score = score
            best_cps = cps
    return best_cps, scores

Reading the code

  • It runs the Chapter-5 solver once for each candidate number of cuts m, from 0 up to ncp_max.
  • For each, it adds the penalty fee and keeps the m with the lowest total score.
  • best_cps ends up holding the cuts of the winning segmentation.

Try it: input and output

A small but realistic test: 30 frames, the first 15 around one center and the next 15 around another (with random noise). The true boundary is at frame 15.

import numpy as np
from kts import build_kernel, cpd_auto

rng = np.random.default_rng(0)
A = np.array([3, 3, 0, 0, 0, 0, 0, 0.0])
B = np.array([0, 0, 0, 0, 3, 3, 0, 0.0])
X = np.vstack([A + rng.normal(scale=0.5, size=(15, 8)),    # frames 0..14
               B + rng.normal(scale=0.5, size=(15, 8))])   # frames 15..29
K = build_kernel(X, "cosine")

cps, scores = cpd_auto(K, ncp_max=8, vmax=1.0, lmin=2)
print("chosen change points:", cps.tolist(), "=>", len(cps) + 1, "segments")
for m, s in scores.items():
    print(f"   m={m}: score={s:.3f}")

Output:

chosen change points: [15] => 2 segments
   m=0: score=0.538
   m=1: score=0.159
   m=2: score=0.202
   m=3: score=0.237
   m=4: score=0.269
   m=5: score=0.298
   m=6: score=0.323
   m=7: score=0.345
   m=8: score=0.366

KTS picks one cut at frame 15 — exactly the real boundary — without being told there were two groups. 🎯 Look at the scores: they plunge from m=0 (0.538) to m=1 (0.159), then climb steadily. That "valley" at m=1 is the penalty doing its job — the first cut pays for itself; further cuts don't. Plotting score vs m shows the same valley as a classic elbow.

Tuning, in practice

  • vmax is the main dial. Too many tiny segments → raise it. Too few → lower it. (1.0 is a sensible start.)
  • ncp_max caps how many cuts to even consider — set it comfortably above what you expect.
  • lmin (minimum segment length) is a cheap way to forbid spurious one- or two-frame segments — e.g. "at least 1 second of frames".

We now have all four pieces. Next: the polished, all-in-one implementation. 👉

The full implementation

Here is the consolidated module that ties Steps 1–4 together. It's the same logic from the previous chapters, with the DP inner loops vectorized with NumPy for speed (the for t loop becomes a slice + argmin), and calc_scatters computed once and reused across every candidate $m$ in cpd_auto.

The runnable file lives at code/kts.py in this book's folder. It is reproduced below in full.

"""
Kernel Temporal Segmentation (KTS) — from scratch in NumPy.

Reference:
    Potapov, Douze, Harchaoui, Schmid.
    "Category-Specific Video Summarization." ECCV 2014.

This module is the consolidated, vectorized version of the step-by-step
derivation in the accompanying mdBook. Public functions:

    build_kernel(X, kind="cosine")  -> kernel/Gram matrix K
    calc_scatters(K)                -> n x n scatter (cost) table J
    cpd_nonlin(K, ncp, ...)         -> optimal segmentation for fixed m
    cpd_auto(K, ncp_max, ...)       -> auto-select the number of segments

All functions use only NumPy.
"""

from __future__ import annotations

import numpy as np


# --------------------------------------------------------------------------- #
# Step 1 — kernels
# --------------------------------------------------------------------------- #
def linear_kernel(X):
    """K[i, j] = <x_i, x_j>; reproduces ordinary Euclidean variance."""
    return X @ X.T


def cosine_kernel(X, eps=1e-8):
    """Scale-invariant kernel; a robust default for deep features."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = X / np.maximum(norms, eps)
    return Xn @ Xn.T


def rbf_kernel(X, sigma=None):
    """Gaussian kernel exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    d2 = np.maximum(d2, 0.0)
    if sigma is None:
        pos = d2[d2 > 0]
        med = np.median(pos) if pos.size else 1.0
        sigma = np.sqrt(0.5 * med) if med > 0 else 1.0
    return np.exp(-d2 / (2.0 * sigma ** 2))


def build_kernel(X, kind="cosine", **kwargs):
    """Build a kernel matrix from an (n, d) feature array."""
    X = np.asarray(X, dtype=np.float64)
    if kind == "linear":
        return linear_kernel(X)
    if kind == "cosine":
        return cosine_kernel(X)
    if kind == "rbf":
        return rbf_kernel(X, **kwargs)
    raise ValueError(f"unknown kernel {kind!r}; use 'linear', 'cosine' or 'rbf'")


# --------------------------------------------------------------------------- #
# Step 2 — scatter (cost) matrix via cumulative sums
# --------------------------------------------------------------------------- #
def calc_scatters(K):
    """
    J[i, j] = within-segment scatter of frames i..j (inclusive):

        v(i, j) = sum_t K[t, t]  -  (1 / L) * sum_{s,t in [i, j]} K[s, t]

    where L = j - i + 1. Built in O(n^2) with prefix sums; entries with
    j < i are left at 0.
    """
    K = np.asarray(K, dtype=np.float64)
    n = K.shape[0]

    dcum = np.concatenate(([0.0], np.cumsum(np.diag(K))))        # diagonal prefix
    Kcum = np.zeros((n + 1, n + 1))
    Kcum[1:, 1:] = np.cumsum(np.cumsum(K, axis=0), axis=1)       # 2-D prefix

    J = np.zeros((n, n))
    for i in range(n):
        # vectorized over j = i .. n-1
        js = np.arange(i, n)
        L = js - i + 1
        diag_sum = dcum[js + 1] - dcum[i]
        block = (Kcum[js + 1, js + 1] - Kcum[i, js + 1]
                 - Kcum[js + 1, i] + Kcum[i, i])
        J[i, i:] = diag_sum - block / L
    return J


# --------------------------------------------------------------------------- #
# Step 3 — dynamic programming for a fixed number of change points
# --------------------------------------------------------------------------- #
def cpd_nonlin(K, ncp, lmin=1, lmax=None, J=None):
    """
    Globally optimal segmentation with exactly `ncp` change points.

    Returns
    -------
    cps  : (ncp,) int array of change points (start frame of each new segment).
    cost : total scatter of the optimal segmentation.
    """
    n = K.shape[0]
    m = int(ncp)
    if lmax is None:
        lmax = n
    if (m + 1) * lmin > n:
        raise ValueError("sequence too short for this many change points")

    if J is None:
        J = calc_scatters(K)

    INF = 1e18
    C = np.full((m + 1, n + 1), INF)         # C[k, l]: best cost, first l frames, k+1 segs
    back = np.zeros((m + 1, n + 1), dtype=int)

    # base row: single segment [0 .. l-1]
    for l in range(lmin, min(lmax, n) + 1):
        C[0, l] = J[0, l - 1]

    for k in range(1, m + 1):
        for l in range((k + 1) * lmin, n + 1):
            t_lo = max(k * lmin, l - lmax)
            t_hi = l - lmin                  # last segment length >= lmin
            if t_lo > t_hi:
                continue
            ts = np.arange(t_lo, t_hi + 1)
            cand = C[k - 1, ts] + J[ts, l - 1]
            idx = int(np.argmin(cand))
            C[k, l] = cand[idx]
            back[k, l] = ts[idx]

    # backtrack
    cps = np.zeros(m, dtype=int)
    l = n
    for k in range(m, 0, -1):
        t = back[k, l]
        cps[k - 1] = t
        l = t
    return cps, float(C[m, n])


# --------------------------------------------------------------------------- #
# Step 4 — automatic selection of the number of segments
# --------------------------------------------------------------------------- #
def cpd_auto(K, ncp_max, vmax=1.0, lmin=1, lmax=None):
    """
    Pick the number of change points by penalized model selection.

    Returns
    -------
    cps    : change points of the selected segmentation.
    scores : dict {m: penalized score} for inspection.
    """
    n = K.shape[0]
    ncp_max = min(ncp_max, n - 1)
    J = calc_scatters(K)                     # compute once, reuse for every m

    best_score = np.inf
    best_cps = np.array([], dtype=int)
    scores = {}

    for m in range(0, ncp_max + 1):
        if (m + 1) * lmin > n:
            break
        cps, J_m = cpd_nonlin(K, m, lmin=lmin, lmax=lmax, J=J)
        # Penalty grows with the *number of change points* m, independent of the
        # scatter. (A scatter-proportional penalty would vanish as J_m -> 0 and
        # never stop adding cuts.) This is the KTS model-selection criterion.
        penalty = 0.0 if m == 0 else (m / (2.0 * n)) * (np.log(n / m) + 1.0)
        score = J_m / n + vmax * penalty
        scores[m] = score
        if score < best_score:
            best_score = score
            best_cps = cps
    return best_cps, scores


__all__ = [
    "build_kernel", "linear_kernel", "cosine_kernel", "rbf_kernel",
    "calc_scatters", "cpd_nonlin", "cpd_auto",
]

The {{#include}} directive above is an mdBook feature: when the book is built, it splices in the real code/kts.py, so the documentation can never drift from the actual implementation.

API at a glance

FunctionPurpose
build_kernel(X, kind)Turn an (n, d) feature array into the kernel matrix K. kind ∈ {linear, cosine, rbf}.
calc_scatters(K)The n × n cost table J[i, j] = v(i, j).
cpd_nonlin(K, ncp, lmin, lmax)Optimal segmentation for a fixed number of change points.
cpd_auto(K, ncp_max, vmax, lmin)Auto-select the number of segments via a penalized score.

Typical usage

import numpy as np
from kts import build_kernel, cpd_auto

features = np.load("frame_features.npy")   # shape (n_frames, d)
K = build_kernel(features, kind="cosine")

# Let KTS decide how many shots there are:
change_points, scores = cpd_auto(K, ncp_max=30, vmax=1.0, lmin=5)

# change_points are start-frame indices of each new segment.
segments = np.split(np.arange(len(features)), change_points)
print(f"{len(segments)} segments")

A complete run, start to finish

This single example uses every function in order — build features, build the kernel, auto-segment, and list the resulting segments:

import numpy as np
from kts import build_kernel, cpd_auto

# 1. Fake "video": 12 frames in 3 groups of 4 (groups differ in direction).
rng = np.random.default_rng(1)
g1 = np.array([1, 0, 0, 0.0]) + rng.normal(scale=0.05, size=(4, 4))
g2 = np.array([0, 1, 0, 0.0]) + rng.normal(scale=0.05, size=(4, 4))
g3 = np.array([0, 0, 1, 0.0]) + rng.normal(scale=0.05, size=(4, 4))
X = np.vstack([g1, g2, g3])                 # shape (12, 4)

# 2. Similarity grid.
K = build_kernel(X, kind="cosine")

# 3. Let KTS choose the cuts.
cps, _ = cpd_auto(K, ncp_max=6, vmax=1.0, lmin=2)

# 4. Turn cuts into actual frame ranges.
segments = np.split(np.arange(len(X)), cps)
print("change points :", cps.tolist())
print("segments       :", [s.tolist() for s in segments])

Output:

change points : [4, 8]
segments       : [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]

KTS recovers the three groups exactly — cuts at frames 4 and 8, splitting the 12 frames into 0–3, 4–7, 8–11. 🎯

In the final chapter we run the larger bundled demo and measure accuracy. 👉

A worked demo

Let's confirm KTS actually works. We build a synthetic "video" whose true structure we know, run KTS, and check it recovers the planted boundaries.

The full script is code/demo.py; here's how it's built.

A fake video with known change points

To test KTS we need data where we already know the right answer. So we manufacture a fake video: we pick a few random "centers" (each stands for a shot), and for each shot we generate a block of frames that are that center plus a little random noise. Frames in the same block look alike; frames in different blocks look different. The boundaries between blocks are the true change points we hope KTS will rediscover.

def make_synthetic_sequence(seed=0):
    """Return (X, true_change_points)."""
    rng = np.random.default_rng(seed)
    d = 8
    block_lengths = [40, 25, 35, 20, 30]   # 5 segments
    centers = rng.normal(scale=3.0, size=(len(block_lengths), d))

    chunks, true_cps, cursor = [], [], 0
    for i, L in enumerate(block_lengths):
        chunks.append(centers[i] + rng.normal(scale=0.4, size=(L, d)))
        cursor += L
        if i < len(block_lengths) - 1:
            true_cps.append(cursor)        # boundary = start of next block
    X = np.vstack(chunks)
    return X, np.array(true_cps)

A couple of unfamiliar bits:

  • np.random.default_rng(seed) creates a reproducible source of random numbers — using the same seed gives the same "random" data every run, so the results in this book are repeatable.
  • rng.normal(scale=0.4, size=(L, d)) draws L × d small random numbers (the per-frame noise). A bigger scale = noisier frames = harder problem.
  • np.vstack([...]) stacks the blocks on top of each other into one big (n_frames, d) array.

We use 5 blocks of lengths [40, 25, 35, 20, 30], so the true change points are at frames [40, 65, 100, 120] (4 cuts → 5 segments).

Running KTS

from kts import build_kernel, cpd_auto, cpd_nonlin

K = build_kernel(X, kind="cosine")

# automatic number of segments
cps_auto, scores = cpd_auto(K, ncp_max=12, vmax=1.0, lmin=5)

# or, if we already knew there were 5 segments:
cps_fixed, cost = cpd_nonlin(K, ncp=4, lmin=5)

What you get

Running python demo.py produces:

sequence length n = 150
true change points = [40, 65, 100, 120]  (5 segments)

[cpd_auto]   change points = [40, 65, 100, 120]  (5 segments)
[cpd_nonlin] change points = [40, 65, 100, 120]  (fixed m=4, total scatter=4.04)

fixed-m boundary errors (frames): [0, 0, 0, 0]  (mean 0.00)

🎯 Both modes recover all four true change points exactly — and cpd_auto independently figures out that there are 5 segments, without being told.

If Matplotlib is installed, the demo also saves kts_demo.png showing the feature heatmap with true boundaries (white dashed) and detected boundaries (red) overlaid.

Things to try next

  • Make it harder. Increase the per-cluster noise (scale=0.4) until segmentation starts to fail — that's the signal-to-noise limit.
  • Change the kernel. Swap "cosine" for "rbf" and watch how the bandwidth $\sigma$ changes sensitivity.
  • Tune vmax. Lower it toward 0 to over-segment; raise it to merge.
  • Plot the scores. scores from cpd_auto vs $m$ reveals the elbow that the penalty is exploiting.
  • Real features. Replace X with CNN features extracted one-per-second from a real video, then map change points back to timestamps.

That's KTS, from the math all the way to a working segmenter. 🎉

Finally, let's step back and look at the research paper this all came from — how to read it, how to reproduce a paper like it, and where KTS is used. 👉

The original paper: understand & reproduce

So far we built KTS as if from thin air. In reality, KTS comes from a research paper, and the previous chapters are a reproduction of one piece of it. This chapter steps back and tells that story: what the paper actually says, how to read a paper like it, the general process of turning any paper into working code, and an honest reflection on what we built versus what the paper specifies.

The paper

Danila Potapov, Matthijs Douze, Zaid Harchaoui, Cordelia Schmid. Category-Specific Video Summarization. ECCV 2014.

The paper's main goal is video summarization: given a long video and a known category (e.g. "changing a tire", "making a sandwich"), produce a short summary that keeps the important moments for that category. KTS — "Kernel Temporal Segmentation" — is one component of that system, not the whole thing.

The paper's pipeline, roughly:

  1. Describe each frame with features (visual descriptors).
  2. Segment the video into shots — this is KTS, the part we reproduced.
  3. Score each segment for importance using a category-specific model.
  4. Select the highest-scoring segments under a length budget to form the summary.

We rebuilt step 2. It's a clean, self-contained algorithm with a precise objective, which makes it an ideal target for a "from scratch" reproduction.

Where the idea comes from

KTS didn't appear out of nowhere either — it adapts kernel change-point detection from the statistics/machine-learning literature (notably Harchaoui & Cappé, Retrospective Multiple Change-Point Estimation with Kernels, 2007, and related work on penalized model selection by Arlot, Celisse and others). The paper's contribution here is applying that machinery to video frames and wiring it into a summarization system. Recognizing this lineage is useful: when the paper is terse about a detail, the cited prior work usually spells it out.

How to read a paper like this

Research papers are dense and are not meant to be read top-to-bottom like a tutorial. A practical recipe (a version of Keshav's well-known "three-pass" method):

Pass 1 — the 5-minute skim. Read the title, abstract, figures, and section headings. Goal: What problem? What's the key idea? Is this even the part I need? For us, the realization is "the summarization framing is interesting, but the reusable algorithm is the temporal segmentation."

Pass 2 — the method, carefully. Read the algorithm/method section and its math, but skip proofs and most experiments. Write down, in your own words:

  • the inputs and outputs (in: per-frame features; out: change points),
  • the objective (minimize total within-segment kernel variance),
  • the algorithm (dynamic programming),
  • the hyperparameters (kernel choice, number of segments / penalty weight),
  • the complexity (so you know if it'll scale).

Pass 3 — reproduce. Re-derive the key equations and implement them. This is the only pass that truly tests understanding — and it's the rest of this book.

Tip: keep a one-page "notation sheet" as you read. Papers reuse symbols heavily ($K$, $\phi$, $\mu$, $n$, $m$…); a glossary you write yourself prevents most confusion. Our math chapter literally opens with one.

From paper to code: the general process

Reproducing a paper is a repeatable craft. The steps we followed — and that transfer to almost any algorithmic paper:

  1. Isolate the algorithm. Separate the reusable core (objective + DP) from the surrounding system (the summarization pipeline) you don't need.
  2. Translate math into data structures. Each mathematical object becomes a concrete array:
    • kernel $K_{ij}$ → an n × n NumPy matrix (Chapter 3),
    • segment scatter $v(i,j)$ → a precomputed cost table (Chapter 4),
    • the DP recurrence $C[k,\ell]$ → a 2-D table filled bottom-up (Chapter 5).
  3. Make the slow thing fast. The naive scatter is too slow, so we used the cumulative-sum trick. Papers often state the efficient form without derivation; reconstructing it is part of the work.
  4. Fill the gaps. Papers omit details for space. The biggest gap here is the exact model-selection penalty for choosing the number of segments. We reconstructed it from the algorithm's intent and the companion reference code (more below).
  5. Validate against ground truth. Build synthetic data where you know the answer and check the code recovers it (Chapter 8), plus internal sanity checks (kernel symmetry; scatter matching its definition).
  6. Reflect on the differences. Document where your version diverges and why — the next section does exactly that.

What the paper specifies vs. what we filled in

Being honest about the boundary between "from the paper" and "our choices" is part of a good reproduction.

PieceFrom the paper / its lineageOur reproduction
Objective (within-segment kernel variance)Yes — the core idea.Implemented exactly (calc_scatters).
Optimal segmentation by dynamic programmingYes.Implemented exactly (cpd_nonlin).
Cumulative-sum speedupStandard, often implied.Implemented (the prefix-sum trick).
Choosing the number of segmentsA penalized model-selection criterion.The KTS reference-code penalty (see below).
Kernel & featuresVisual descriptors with an appropriate kernel.We expose linear / cosine / rbf; cosine is the practical default.
The summarization systemThe paper's main contribution.Out of scope — we reproduced only the segmentation.

A reflection on the penalty (and a real bug we hit)

The trickiest "gap" was the penalty that decides how many segments to keep. Our first attempt made the penalty proportional to the scatter — which seems reasonable but is wrong: as the scatter shrinks toward zero, the penalty vanishes, so extra cuts become free and the algorithm never stops splitting. We caught this because our auto-selection demo returned absurdly many segments.

The fix matches the widely-used KTS reference implementation: a penalty that grows with the number of change points, independent of the scatter —

$$ \text{penalty}(m) = \frac{m}{2n}\,\Big(\log\!\big(\tfrac{n}{m}\big) + 1\Big). $$

This is the lesson of step 4 above in miniature: the paper tells you a penalty exists and what it must accomplish; getting the exact form right takes a careful read of the criterion (and, when available, the authors' released code). It's also a reminder that validation catches reproduction bugs that "it compiles and looks plausible" never would.

How KTS is used — general use cases

KTS solves a generic problem — split an ordered sequence into homogeneous segments — so it's useful well beyond the original paper:

  • Video summarization (its home turf). Standard preprocessing for benchmarks like TVSum and SumMe: segment first, score segments, then select.
  • Shot / scene boundary detection. Find where one shot ends and the next begins, for indexing, editing, or thumbnail selection.
  • Generic change-point detection. Any sequential signal where the distribution shifts in blocks: sensor streams, financial/time-series data, audio segmentation, EEG, activity logs. Swap in a kernel appropriate to the data and KTS works unchanged.
  • A preprocessing step for downstream models. Whenever you want to classify or score segments rather than individual samples, KTS gives you the segments.

When KTS is a good fit — and when it isn't

Good fit when:

  • the signal is genuinely piecewise-homogeneous (blocks that are internally consistent), and
  • you can compute, or already have, per-item features and a sensible kernel.

Reach for something else when:

  • the sequence is very long — the kernel matrix is $n \times n$, so you must subsample (e.g. one feature per second) and map results back;
  • you need online/streaming detection — KTS is retrospective (it looks at the whole sequence at once), not real-time;
  • segments aren't well modeled as "constant-ish then a jump" (e.g. slow drifts).

A reusable checklist for reproducing any paper

  1. Skim for the one idea and locate the part you actually need.
  2. Write your own notation sheet and a plain-English statement of inputs, outputs, objective, and algorithm.
  3. Map each math object to a concrete data structure.
  4. Implement the naive version first; optimize only once it's correct.
  5. Identify gaps the paper leaves; fill them from cited work or released code, and record your assumptions.
  6. Validate on data with known answers, plus internal invariants.
  7. Document the deltas between your version and the paper.

That's the full arc: from a dense ECCV paper to a small, correct, well-understood implementation you can read, run, and reuse. 🎓

See the References for the paper and its intellectual roots.

One-file CLI: KTS on a real video

Everything in this book, distilled into one self-contained script that runs on an actual video file and does a useful job: automatic shot detection + storyboard extraction (one keyframe per shot). The KTS core is plain NumPy, written from scratch; only video decoding uses OpenCV.

The steps

  1. Sample frames. Decode the video and keep ~1 frame per second.
  2. Describe each frame. Compute an HSV color histogram — a model-free numeric fingerprint of the frame.
  3. Build the kernel. Make the n × n similarity matrix K between frames (cosine by default).
  4. Cost table. Precompute every segment's within-segment scatter J[i,j] using cumulative sums (instant lookups).
  5. Segment. Dynamic programming (cpd_nonlin) finds the optimal cuts for a given count; cpd_auto adds a penalty and chooses the count for you.
  6. Map back & report. Convert sampled-frame cuts to real timestamps, print a shot table, and save one keyframe per shot.

Install & run

pip install numpy opencv-python

# auto-detect shots and save a keyframe per shot into ./keyframes/
python kts_video.py myvideo.mp4

# sample 2 fps, fewer shots (higher penalty), custom output dir
python kts_video.py myvideo.mp4 --fps 2 --vmax 1.5 --outdir shots

# force an exact number of shots
python kts_video.py myvideo.mp4 --num-shots 8

# just print the shot table, no images
python kts_video.py myvideo.mp4 --no-keyframes

What it prints

[1/4] decoding & sampling myvideo.mp4 at 1.0 fps ...
      183 frames sampled (4575 total, 25.00 fps source)
[2/4] building cosine kernel matrix (183x183) ...
[3/4] auto-selecting shots (vmax=1.0) ...
[4/4] found 7 shots:

  shot       start         end      dur
  --------------------------------------
     0     0:00:00.0     0:00:21.0    21.0s
     1     0:00:21.0     0:00:48.0    27.0s
     2     0:00:48.0     0:01:33.0    45.0s
     ...
saved 7 keyframes to ./keyframes/

You get a timestamped shot list plus keyframes/shot_000.jpg, shot_001.jpg, … — a ready-made storyboard, and exactly the preprocessing step video-summarization pipelines need.

The exact numbers above are illustrative (they depend on your video). The shot count adapts to content via the penalty; tune it with --vmax (higher → fewer shots) or pin it with --num-shots.

The complete script

#!/usr/bin/env python3
"""
kts_video.py — Kernel Temporal Segmentation on a real video, from scratch.

A single-file command-line tool that:
  1. decodes a video and samples frames (e.g. 1 per second),
  2. describes each sampled frame with a color histogram (no deep model needed),
  3. builds a kernel (similarity) matrix between frames,
  4. runs Kernel Temporal Segmentation to find shot boundaries,
  5. prints a shot table (start/end timestamps) and saves one keyframe per shot
     — i.e. an automatic storyboard / video-summary preprocessing step.

The KTS core (scatter table + dynamic programming + auto model selection) is
implemented here in plain NumPy. Only video decoding uses OpenCV.

Usage:
    python kts_video.py path/to/video.mp4
    python kts_video.py video.mp4 --fps 2 --vmax 1.0 --outdir shots
    python kts_video.py video.mp4 --num-shots 8        # force exactly 8 shots

Requirements:
    pip install numpy opencv-python
"""

from __future__ import annotations

import argparse
import os
import sys

import numpy as np


# ===========================================================================
# 1. Feature extraction — turn each sampled frame into a number vector
# ===========================================================================
def color_histogram(frame_bgr, bins=8):
    """
    Describe a frame by an HSV color histogram (a robust, model-free fingerprint).

    Returns a 1-D feature vector of length 3*bins, normalized to sum to 1.
    """
    import cv2

    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    h = np.histogram(hsv[:, :, 0], bins=bins, range=(0, 180))[0]
    s = np.histogram(hsv[:, :, 1], bins=bins, range=(0, 256))[0]
    v = np.histogram(hsv[:, :, 2], bins=bins, range=(0, 256))[0]
    feat = np.concatenate([h, s, v]).astype(np.float64)
    total = feat.sum()
    return feat / total if total > 0 else feat


def extract_features(path, target_fps=1.0, bins=8):
    """
    Decode `path`, sample ~target_fps frames per second, and return:
        X            : (n_sampled, 3*bins) feature matrix
        frame_index  : original frame number of each sampled frame
        src_fps      : the video's real frame rate
        n_total      : total number of frames in the video
    """
    import cv2

    cap = cv2.VideoCapture(path)
    if not cap.isOpened():
        raise SystemExit(f"error: could not open video {path!r}")

    src_fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
    n_total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT) or 0)
    step = max(1, int(round(src_fps / max(target_fps, 1e-6))))

    feats, frame_index = [], []
    i = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if i % step == 0:
            feats.append(color_histogram(frame, bins=bins))
            frame_index.append(i)
        i += 1
    cap.release()

    if len(feats) < 2:
        raise SystemExit("error: video too short / too few sampled frames")
    return np.array(feats), np.array(frame_index), src_fps, (n_total or i)


# ===========================================================================
# 2. Kernel (similarity) matrix
# ===========================================================================
def build_kernel(X, kind="cosine", eps=1e-8):
    """Similarity grid K[i, j] between every pair of frame fingerprints."""
    X = np.asarray(X, dtype=np.float64)
    if kind == "linear":
        return X @ X.T
    if kind == "cosine":
        norms = np.linalg.norm(X, axis=1, keepdims=True)
        Xn = X / np.maximum(norms, eps)
        return Xn @ Xn.T
    if kind == "rbf":
        sq = np.sum(X ** 2, axis=1)
        d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X @ X.T), 0.0)
        pos = d2[d2 > 0]
        sigma = np.sqrt(0.5 * np.median(pos)) if pos.size else 1.0
        return np.exp(-d2 / (2.0 * max(sigma, eps) ** 2))
    raise ValueError(f"unknown kernel {kind!r}")


# ===========================================================================
# 3. Scatter (cost) table — within-segment variance for every segment, fast
# ===========================================================================
def calc_scatters(K):
    """J[i, j] = within-segment scatter of frames i..j, via prefix sums."""
    K = np.asarray(K, dtype=np.float64)
    n = K.shape[0]
    dcum = np.concatenate(([0.0], np.cumsum(np.diag(K))))
    Kcum = np.zeros((n + 1, n + 1))
    Kcum[1:, 1:] = np.cumsum(np.cumsum(K, axis=0), axis=1)

    J = np.zeros((n, n))
    for i in range(n):
        js = np.arange(i, n)
        L = js - i + 1
        diag_sum = dcum[js + 1] - dcum[i]
        block = (Kcum[js + 1, js + 1] - Kcum[i, js + 1]
                 - Kcum[js + 1, i] + Kcum[i, i])
        J[i, i:] = diag_sum - block / L
    return J


# ===========================================================================
# 4. Dynamic programming — optimal cuts for a fixed number of change points
# ===========================================================================
def cpd_nonlin(K, ncp, lmin=1, lmax=None, J=None):
    """Globally optimal segmentation with exactly `ncp` change points."""
    n = K.shape[0]
    m = int(ncp)
    if lmax is None:
        lmax = n
    if (m + 1) * lmin > n:
        raise ValueError("sequence too short for this many change points")
    if J is None:
        J = calc_scatters(K)

    INF = 1e18
    C = np.full((m + 1, n + 1), INF)
    back = np.zeros((m + 1, n + 1), dtype=int)
    for l in range(lmin, min(lmax, n) + 1):
        C[0, l] = J[0, l - 1]
    for k in range(1, m + 1):
        for l in range((k + 1) * lmin, n + 1):
            t_lo = max(k * lmin, l - lmax)
            t_hi = l - lmin
            if t_lo > t_hi:
                continue
            ts = np.arange(t_lo, t_hi + 1)
            cand = C[k - 1, ts] + J[ts, l - 1]
            idx = int(np.argmin(cand))
            C[k, l] = cand[idx]
            back[k, l] = ts[idx]

    cps = np.zeros(m, dtype=int)
    l = n
    for k in range(m, 0, -1):
        t = back[k, l]
        cps[k - 1] = t
        l = t
    return cps, float(C[m, n])


# ===========================================================================
# 5. Automatic model selection — let KTS choose the number of shots
# ===========================================================================
def cpd_auto(K, ncp_max, vmax=1.0, lmin=1, lmax=None):
    """Pick the number of change points by a penalized score; return cuts."""
    n = K.shape[0]
    ncp_max = min(ncp_max, n - 1)
    J = calc_scatters(K)
    best_score, best_cps = np.inf, np.array([], dtype=int)
    for m in range(0, ncp_max + 1):
        if (m + 1) * lmin > n:
            break
        cps, J_m = cpd_nonlin(K, m, lmin=lmin, lmax=lmax, J=J)
        penalty = 0.0 if m == 0 else (m / (2.0 * n)) * (np.log(n / m) + 1.0)
        score = J_m / n + vmax * penalty
        if score < best_score:
            best_score, best_cps = score, cps
    return best_cps


# ===========================================================================
# Helpers: segments, timestamps, keyframes
# ===========================================================================
def segments_from_cps(cps, n):
    """Turn change points into (start, end) index pairs covering 0..n-1."""
    bounds = [0] + list(cps) + [n]
    return [(bounds[i], bounds[i + 1] - 1) for i in range(len(bounds) - 1)]


def fmt_time(seconds):
    """Seconds -> H:MM:SS.s timestamp string."""
    h = int(seconds // 3600)
    m = int((seconds % 3600) // 60)
    s = seconds % 60
    return f"{h}:{m:02d}:{s:04.1f}"


def save_keyframes(path, segments, frame_index, outdir, max_side=480):
    """Save one representative frame (segment midpoint) per shot into outdir."""
    import cv2

    os.makedirs(outdir, exist_ok=True)
    cap = cv2.VideoCapture(path)
    saved = []
    for k, (a, b) in enumerate(segments):
        mid_sampled = (a + b) // 2
        src_frame = int(frame_index[mid_sampled])
        cap.set(cv2.CAP_PROP_POS_FRAMES, src_frame)
        ok, frame = cap.read()
        if not ok:
            continue
        h, w = frame.shape[:2]
        if max(h, w) > max_side:
            scale = max_side / max(h, w)
            frame = cv2.resize(frame, (int(w * scale), int(h * scale)))
        out = os.path.join(outdir, f"shot_{k:03d}.jpg")
        cv2.imwrite(out, frame)
        saved.append(out)
    cap.release()
    return saved


# ===========================================================================
# Command-line interface
# ===========================================================================
def main(argv=None):
    p = argparse.ArgumentParser(
        description="Detect shots in a video with Kernel Temporal Segmentation "
                    "and save one keyframe per shot (a storyboard).")
    p.add_argument("video", help="path to the input video file")
    p.add_argument("--fps", type=float, default=1.0,
                   help="frames sampled per second (default: 1)")
    p.add_argument("--vmax", type=float, default=1.0,
                   help="penalty weight; larger = fewer shots (default: 1.0)")
    p.add_argument("--num-shots", type=int, default=None,
                   help="force an exact number of shots (skips auto-selection)")
    p.add_argument("--max-shots", type=int, default=100,
                   help="upper bound for auto-selection (default: 100)")
    p.add_argument("--min-shot-sec", type=float, default=1.0,
                   help="minimum shot length in seconds (default: 1.0)")
    p.add_argument("--kernel", choices=["cosine", "linear", "rbf"],
                   default="cosine", help="similarity kernel (default: cosine)")
    p.add_argument("--bins", type=int, default=8,
                   help="histogram bins per channel (default: 8)")
    p.add_argument("--outdir", default="keyframes",
                   help="directory for keyframe images (default: keyframes)")
    p.add_argument("--no-keyframes", action="store_true",
                   help="only print the shot table; do not save images")
    args = p.parse_args(argv)

    # Step 1–2: features + kernel
    print(f"[1/4] decoding & sampling {args.video} at {args.fps} fps ...")
    X, frame_index, src_fps, n_total = extract_features(
        args.video, target_fps=args.fps, bins=args.bins)
    print(f"      {len(X)} frames sampled ({n_total} total, {src_fps:.2f} fps source)")

    print(f"[2/4] building {args.kernel} kernel matrix ({len(X)}x{len(X)}) ...")
    K = build_kernel(X, kind=args.kernel)

    # Step 3–5: segmentation
    sample_dt = (frame_index[1] - frame_index[0]) / src_fps if len(frame_index) > 1 else 1.0
    lmin = max(1, int(round(args.min_shot_sec / max(sample_dt, 1e-6))))
    if args.num_shots:
        print(f"[3/4] segmenting into exactly {args.num_shots} shots ...")
        cps, _ = cpd_nonlin(K, args.num_shots - 1, lmin=lmin)
    else:
        print(f"[3/4] auto-selecting shots (vmax={args.vmax}) ...")
        cps = cpd_auto(K, args.max_shots, vmax=args.vmax, lmin=lmin)

    segments = segments_from_cps(cps, len(X))

    # Report
    print(f"[4/4] found {len(segments)} shots:\n")
    print(f"  {'shot':>4}  {'start':>10}  {'end':>10}  {'dur':>7}")
    print("  " + "-" * 38)
    for k, (a, b) in enumerate(segments):
        t0 = frame_index[a] / src_fps
        t1 = (frame_index[b] / src_fps) + sample_dt
        print(f"  {k:>4}  {fmt_time(t0):>10}  {fmt_time(t1):>10}  {t1 - t0:>6.1f}s")

    if not args.no_keyframes:
        saved = save_keyframes(args.video, segments, frame_index, args.outdir)
        print(f"\nsaved {len(saved)} keyframes to ./{args.outdir}/")


if __name__ == "__main__":
    main()

That's the whole tool: decode → fingerprint → kernel → DP → timestamps, in one file you can read top to bottom.

References

  1. Danila Potapov, Matthijs Douze, Zaid Harchaoui, Cordelia Schmid. Category-Specific Video Summarization. ECCV 2014. — The paper that introduced KTS.

  2. Zaid Harchaoui, Olivier Cappé. Retrospective Multiple Change-Point Estimation with Kernels. IEEE/SP SSP 2007. — The kernel change-point framework KTS builds on.

  3. Ke Zhang, Wei-Lun Chao, Fei Sha, Kristen Grauman. Video Summarization with Long Short-Term Memory. ECCV 2016. — Popularized KTS as preprocessing for the TVSum / SumMe benchmarks.

  • Dynamic programming for change-point detection — the same optimal-substructure recurrence appears in PELT, segmented regression, and ruptures-style libraries.
  • The kernel trick — writing variance/distance purely through a Gram matrix is the same move behind kernel PCA and kernel k-means.

This book's code

Both depend only on NumPy (Matplotlib optional, for the demo plot).