Introduction

No background required. This book assumes you know nothing about programming, AI, or math beyond high-school basics. Every concept, line of code, and symbol is explained, and every code example is followed by the exact output it produces. If you've never written Python, read the 5-minute primer first.

The one-sentence idea

Kernel Temporal Segmentation (KTS) automatically chops a video into a small number of consistent pieces ("shots" or "scenes") by finding the moments where the picture changes a lot.

Imagine flipping through a photo album. Pages 1–10 are a beach trip, pages 11–18 are a birthday party, pages 19–25 are a hike. You can instantly spot the boundaries — the points where the photos suddenly start looking different. KTS does exactly this, automatically, for the frames of a video. The boundaries it finds are called change points.

Where it's used

KTS was introduced by Potapov, Douze, Harchaoui & Schmid in "Category-Specific Video Summarization" (ECCV 2014). It became a standard first step in video summarization: before a computer decides which parts of a video are worth keeping, it first splits the video into shots, and KTS is the tool that does the splitting. Work on well-known summarization benchmarks such as TVSum and SumMe commonly uses KTS to divide each video into shots.

How a video becomes numbers

A computer can't "see" a picture the way we do — it works with numbers. So each video frame is turned into a list of numbers called a feature vector: a numeric fingerprint of that frame. Similar-looking frames get similar fingerprints.

So our input is a sequence of feature vectors, one per frame:

$$ x_1, x_2, \dots, x_n \in \mathbb{R}^d $$

Read that line as: "there are $n$ frames; each frame $x_t$ is a list of $d$ numbers." (The symbol $\in$ means "belongs to," and $\mathbb{R}^d$ is the set of all lists of $d$ real numbers.) Frames within the same shot have similar fingerprints; at a shot boundary the fingerprint changes sharply. KTS finds those sharp changes.
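To make this concrete, here is a minimal sketch of a tiny "video" stored as a NumPy array. The numbers are made up for illustration (real fingerprints would come from an image model), and the names X, jumps, and after_frame are hypothetical. Looking at the jump between neighbouring frames is only a quick sanity check, not the KTS method itself, which compares whole segments.

```python
import numpy as np

# Toy "video": n = 6 frames, each described by d = 3 numbers.
# These values are invented for illustration only.
X = np.array([
    [0.9, 0.1, 0.0],   # frames 1-3 look alike (one "shot")
    [1.0, 0.0, 0.1],
    [0.9, 0.0, 0.0],
    [0.1, 0.8, 0.9],   # frames 4-6 look alike (a different "shot")
    [0.0, 0.9, 1.0],
    [0.1, 1.0, 0.9],
])

print(X.shape)  # prints: (6, 3)  -> n = 6 frames, d = 3 numbers each

# Distance between each frame and the next one.
jumps = np.linalg.norm(X[1:] - X[:-1], axis=1)

# The largest jump sits right after this frame (counting frames from 1).
after_frame = int(np.argmax(jumps)) + 1
print(after_frame)  # prints: 3
```

Each row of X is one frame's fingerprint $x_t$, so X has $n$ rows and $d$ columns.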

What KTS actually does

Given those fingerprints, KTS finds the change points so that:

frames inside a segment are as similar to each other as possible, and
the number of segments stays small (we don't want one segment per frame).
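
The two goals above can be combined into a single score to minimize. What follows is only a sketch in plain feature space: the symbols $m$ (the number of segments), $\mu_j$ (the average fingerprint of segment $j$), $C$ (a weight you choose), and the penalty $g$ are notation introduced here for illustration, and the kernel-based version replaces the plain distance with a kernel-based one.

$$ \min \; \sum_{j=1}^{m} \; \sum_{t \,\in\, \text{segment } j} \lVert x_t - \mu_j \rVert^2 \; + \; C \, g(m) $$

Read that line as: "for every segment, add up how far each frame is from that segment's average; then add a penalty $g(m)$ that grows with the number of segments." The first term rewards similar frames inside each segment, and the second term discourages using too many segments.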

It does this with two ideas, which are the two halves of this book:

Measure how "spread out" a segment is using a kernel — a flexible similarity measure (Chapters 1–2).
Find the cut points that minimize the total spread using dynamic programming, a technique that is guaranteed to find the best possible answer for the score being minimized, not just a good guess (Chapters 3–6).
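
As a preview of both ideas, here is a minimal sketch under simplifying assumptions. It uses a linear kernel (plain dot products), one-number fingerprints, and a brute-force search over a single cut in place of dynamic programming. The data and the names spread_direct, spread_from_kernel, and total_spread are hypothetical, chosen for this illustration only.

```python
import numpy as np

def spread_direct(X):
    """Sum of squared distances from each frame to the segment's average."""
    return float(((X - X.mean(axis=0)) ** 2).sum())

def spread_from_kernel(K, start, end):
    """Same quantity, computed only from the similarity table K.

    Covers frames start .. end-1 (Python-style, end is excluded).
    """
    block = K[start:end, start:end]
    return float(np.trace(block) - block.sum() / (end - start))

# Made-up data: 8 frames, 1 number each, hidden boundary after frame 5.
X = np.array([1.0, 1.1, 0.9, 1.0, 1.0, 5.0, 5.1, 4.9]).reshape(-1, 1)

# Linear kernel (an assumption for this sketch): K[i, j] = x_i . x_j
K = X @ X.T

def total_spread(cut):
    """Total spread if the video is split into frames [0, cut) and [cut, n)."""
    return spread_from_kernel(K, 0, cut) + spread_from_kernel(K, cut, len(X))

# Brute force: try every position for one cut and keep the cheapest.
best_cut = min(range(1, len(X)), key=total_spread)
print(best_cut)  # prints: 5

# Both ways of measuring spread agree (up to rounding).
print(np.isclose(spread_direct(X[:5]), spread_from_kernel(K, 0, 5)))  # prints: True
```

Trying every cut works here because there is only one cut to place. With several cuts the number of combinations grows quickly, which is why dynamic programming is needed.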

What you'll build

By the end you'll have one short, fully-understood Python file, kts.py, and you'll watch it recover hidden boundaries in test data exactly.

What you need

Python 3.8+ (the primer shows how code/output is presented)
NumPy (a number-crunching library — also covered in the primer)
(optional) Matplotlib, only for one picture in the final demo
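
If you want to check your setup before continuing, a small script along these lines should do. It is a sketch, not part of kts.py, and the names python_ok and numpy_version are hypothetical.

```python
import sys

# True if the running Python is version 3.8 or newer.
python_ok = sys.version_info >= (3, 8)
print("Python 3.8+:", python_ok)

# Try to load NumPy and report its version if it is installed.
try:
    import numpy as np
    numpy_version = np.__version__
    print("NumPy version:", numpy_version)
except ImportError:
    numpy_version = None
    print("NumPy is not installed.")
```

The output depends on your machine, so none is shown here.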

Let's start with the two background ideas: kernels and change points. 👉

Kernel Temporal Segmentation from Scratch