Step 1 — The kernel matrix

Everything in KTS reads from the kernel matrix $K$: a grid where the entry in row $i$, column $j$ is the similarity between frame $i$ and frame $j$. Step one is to turn our table of frame fingerprints into this similarity grid.

Our input is an array X with one row per frame and one column per feature number. So X has shape (n, d): n frames, each described by d numbers. The output K has shape (n, n): a similarity for every pair of frames.

The code

import numpy as np


def linear_kernel(X):
    """K[i, j] = dot product of frame i and frame j. Plain variance."""
    return X @ X.T


def cosine_kernel(X, eps=1e-8):
    """Similarity by direction only (length-normalized). A good default."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)   # length of each frame vector
    Xn = X / np.maximum(norms, eps)                    # rescale each to length 1
    return Xn @ Xn.T


def rbf_kernel(X, sigma=None):
    """Gaussian similarity: 1 when identical, fading to 0 as frames differ."""
    sq = np.sum(X**2, axis=1)                          # squared length of each frame
    # squared distance between every pair: ||a-b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    d2 = np.maximum(d2, 0.0)                            # clip tiny negatives (rounding)
    if sigma is None:
        med = np.median(d2[d2 > 0]) if np.any(d2 > 0) else 1.0
        sigma = np.sqrt(0.5 * med) if med > 0 else 1.0  # auto bandwidth
    return np.exp(-d2 / (2.0 * sigma**2))


def build_kernel(X, kind="cosine", **kwargs):
    """Pick a kernel by name and build the matrix."""
    X = np.asarray(X, dtype=np.float64)
    kernels = {"linear": linear_kernel, "cosine": cosine_kernel, "rbf": rbf_kernel}
    if kind not in kernels:
        raise ValueError(f"unknown kernel {kind!r}; pick one of {list(kernels)}")
    return kernels[kind](X, **kwargs) if kind == "rbf" else kernels[kind](X)

Reading the code, line by line

X @ X.T — @ is matrix multiplication and .T is the transpose (rows ↔ columns). Multiplying X by its own transpose computes the dot product of every pair of rows at once. That grid of dot products is the linear kernel.
np.linalg.norm(X, axis=1, keepdims=True) — the length of each frame vector. axis=1 means "do it per row" (one length per frame).
X / norms — divide each frame by its own length, so every frame becomes length 1. After this, the dot product measures direction only — that's the cosine kernel. np.maximum(norms, eps) just avoids dividing by zero.
In rbf_kernel, sq[:, None] + sq[None, :] - 2 * (X @ X.T) is the algebra identity for squared distance between every pair, computed in one shot.
np.exp(...) raises $e$ to a power elementwise; here it turns "distance" into "similarity" (distance 0 → similarity 1; large distance → similarity ~0).

Try it: input and output

import numpy as np

X = np.array([[1.0, 0.0],     # frame 0: points right
              [0.9, 0.1],     # frame 1: nearly the same as frame 0
              [0.1, 1.0]])    # frame 2: points up (different)

K = build_kernel(X, kind="cosine")
print(np.round(K, 3))         # round to 3 decimals for readability

Output:

[[1.    0.994 0.1  ]
 [0.994 1.    0.209]
 [0.1   0.209 1.   ]]

How to read this 3×3 grid:

Diagonal is all 1.0 — each frame is identical to itself.
K[0,1] = 0.994 — frames 0 and 1 are nearly identical (same shot).
K[0,2] = 0.1 — frames 0 and 2 are very different (a boundary sits between them).

The same three frames with the other two kernels:

print("linear:\n", np.round(build_kernel(X, "linear"), 3))
print("rbf:\n",    np.round(build_kernel(X, "rbf"), 3))

Output:

linear:
 [[1.   0.9  0.1 ]
 [0.9  0.82 0.19]
 [0.1  0.19 1.01]]
rbf:
 [[1.    0.986 0.287]
 [0.986 1.    0.368]
 [0.287 0.368 1.   ]]

All three agree on the story — frames 0 and 1 are similar, frame 2 is the odd one out — they just put the numbers on different scales.

Which kernel should I use?

cosine — the safe default for AI "deep" features; it ignores brightness/ scale and compares content.
linear — reproduces ordinary variance; simplest.
rbf — adds non-linear sensitivity, at the cost of a sigma knob (how fast similarity fades with distance). The code picks a reasonable sigma automatically.

Two practical notes

Memory. K has n × n entries. A 1-hour video has too many frames, so in practice you keep one frame per second, run KTS, then map the change points back to real time.
Sanity check. A valid kernel matrix is symmetric (K[i,j] == K[j,i]), and for the cosine kernel the diagonal is exactly 1:

Xr = np.random.randn(6, 4)            # 6 random frames, 4 features each
Kc = build_kernel(Xr, "cosine")
print("symmetric:", np.allclose(Kc, Kc.T))
print("diagonal all 1:", np.allclose(np.diag(Kc), 1.0))

Output:

symmetric: True
diagonal all 1: True

With $K$ in hand we can score any segment. Let's make that fast. 👉

Kernel Temporal Segmentation from Scratch

Step 1 — The kernel matrix

The code

Reading the code, line by line

Try it: input and output

Which kernel should I use?

Two practical notes