Introduction

No background required. This book assumes you know nothing about programming, AI, or math beyond high-school basics. Every concept, line of code, and symbol is explained, and every code example is followed by the exact output it produces. If you've never written Python, read the 5-minute primer first.

The one-sentence idea

Kernel Temporal Segmentation (KTS) automatically chops a video into a small number of visually consistent pieces ("shots" or "scenes") by finding the moments where the picture changes sharply.

An everyday analogy

Imagine flipping through a photo album. Pages 1–10 are a beach trip, pages 11–18 are a birthday party, pages 19–25 are a hike. You can instantly spot the boundaries — the points where the photos suddenly start looking different. KTS does exactly this, automatically, for the frames of a video. The boundaries it finds are called change points.

Where it's used

KTS was introduced by Potapov, Douze, Harchaoui & Schmid in "Category-Specific Video Summarization" (ECCV 2014). It became a standard first step in video summarization: before a computer decides which parts of a video are worth keeping, it first splits the video into shots, and KTS is the tool that does the splitting. It is widely used to pre-segment the videos in well-known benchmark datasets such as TVSum and SumMe.

How a video becomes numbers

A computer can't "see" a picture the way we do — it works with numbers. So each video frame is turned into a list of numbers called a feature vector: a numeric fingerprint of that frame. Similar-looking frames get similar fingerprints.

So our input is a sequence of feature vectors, one per frame:

$$ x_1, x_2, \dots, x_n \in \mathbb{R}^d $$

Read that line as: "there are $n$ frames; each frame $x_t$ is a list of $d$ numbers." (The symbol $\mathbb{R}^d$ just means "a list of $d$ real numbers.") Frames within the same shot have similar fingerprints; at a shot boundary the fingerprint changes sharply. KTS finds those sharp changes.
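To make this concrete, here is a minimal sketch of a made-up six-frame "video" stored as a NumPy array. The numbers are invented for illustration (real feature vectors usually have hundreds or thousands of entries), and measuring the jump between neighboring frames is only a quick way to see the boundary, not the method KTS itself uses.

```python
import numpy as np

# Invented toy data: n = 6 frames, each described by d = 2 numbers.
# Frames 1-3 belong to one shot, frames 4-6 to another.
X = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [1.0, 0.1],
    [0.0, 1.0],
    [0.1, 0.9],
    [0.0, 0.9],
])

n, d = X.shape  # n frames, d numbers per frame
print(n, d)  # prints: 6 2

# Distance between each frame's fingerprint and the next frame's fingerprint.
jumps = np.linalg.norm(X[1:] - X[:-1], axis=1)

# Position (0-based) of the largest jump between neighbors.
biggest = int(np.argmax(jumps))
print("biggest jump: between frames", biggest + 1, "and", biggest + 2)
# prints: biggest jump: between frames 3 and 4
```

Each row of X is one frame $x_t$, so X has $n$ rows and $d$ columns. The largest jump falls exactly where the fingerprints switch from one pattern to the other.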

What KTS actually does

Given those fingerprints, KTS finds the change points so that:

  • frames inside a segment are as similar to each other as possible, and
  • the number of segments stays small (we don't want one segment per frame).

It does this with two ideas, which are the two halves of this book:

  1. Measure how "spread out" a segment is using a kernel — a flexible similarity measure (Chapters 1–2).
  2. Find the cut points that minimize the total spread using dynamic programming, a technique that is guaranteed to find the cut points with the lowest possible total spread for a given number of segments, not just a good guess (Chapters 3–6).
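As a rough preview of both ideas, the sketch below measures spread and searches for a single cut on invented toy data. It rests on two simplifying assumptions that are not the book's final method: it uses the simplest kernel (a plain dot product, often called a linear kernel), and it tries every possible cut one by one instead of using dynamic programming, which only becomes necessary when there are several cuts. The function name spread is a hypothetical helper chosen for this illustration.

```python
import numpy as np

# Invented toy data: 6 frames, 2 numbers each; the true boundary is after frame 3.
X = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [1.0, 0.1],
    [0.0, 1.0],
    [0.1, 0.9],
    [0.0, 0.9],
])

# Assumption for this sketch: a linear kernel, i.e. similarity = dot product.
K = X @ X.T  # K[i, j] is the similarity between frame i and frame j


def spread(K, start, end):
    """Spread of frames start..end-1 (0-based, end exclusive).

    Equals the sum of squared distances from each frame to the segment's
    average frame, computed from the kernel matrix alone.
    """
    block = K[start:end, start:end]
    return float(np.trace(block) - block.sum() / (end - start))


n = K.shape[0]

# Try every possible single cut; `cut` is the number of frames in the first segment.
costs = {cut: spread(K, 0, cut) + spread(K, cut, n) for cut in range(1, n)}
best_cut = min(costs, key=costs.get)
print("best cut: after frame", best_cut)  # prints: best cut: after frame 3
```

Cutting after frame 3 leaves two tight groups of frames, so the total spread is far smaller than for any other cut.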

What you'll build

By the end you'll have one short, fully-understood Python file, kts.py, and you'll watch it recover hidden boundaries in test data exactly.

What you need

  • Python 3.8+ (the primer shows how code/output is presented)
  • NumPy (a number-crunching library — also covered in the primer)
  • (optional) Matplotlib, only for one picture in the final demo
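If you want to confirm your setup before going further, a quick check along these lines should do. This is only a sketch: the variable name python_ok is made up for this example, and the version numbers printed will depend on your machine.

```python
import sys

import numpy as np

# True if the running Python is version 3.8 or newer.
python_ok = sys.version_info >= (3, 8)
print("Python 3.8+:", python_ok)

# If the import above succeeded, NumPy is installed; show which version.
print("NumPy version:", np.__version__)
```

If the import line fails with an error, NumPy is not installed yet.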

Let's start with the two background ideas: kernels and change points. 👉