Kernels: five meanings of one word

Few words cause more confusion than "kernel," because it means five different things depending on who is talking. An ML researcher, a computer-vision engineer, a GPU programmer, an operating-systems person, and a statistician all say "kernel" and each means something different. This chapter pins down all five so you're never lost, then goes deep on the two that matter most for AI.

Don't be confused — the five kernels at a glance:

| Context | "Kernel" means | One-line definition |
|---|---|---|
| ML / SVM | a similarity function | $k(a,b)$ = a similarity that secretly equals a dot product in some richer space |
| CNN / image | a small filter | a tiny weight grid slid over an image to detect a pattern |
| GPU / CUDA | a function that runs on the GPU | code launched across thousands of parallel threads |
| Operating system | the core of the OS | the privileged code managing hardware (Linux kernel) |
| Stats / density | a weighting window | the bump function in kernel density estimation / RBF |

These are almost entirely unrelated meanings of one overloaded word; only meanings 1 and 5 share a root, the Gaussian bump. Which one is meant is nearly always clear from context, once you know there are five.

Meaning 1 — The ML kernel (a similarity that's a hidden dot product)

This is the one that earns the deepest understanding. Start with a problem: the XOR data below can't be separated by any straight line.

import numpy as np
X = np.array([[0,0],[0,1],[1,0],[1,1]], float)
y = np.array([0, 1, 1, 0])     # XOR: 1 only when exactly one input is 1

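No line splits the 1s from the 0s. But add one new feature, the product $x_1 x_2$, as a third coordinate, and the classes become linearly separable:

print("new feature x1*x2:", X[:, 0] * X[:, 1])

Output:

new feature x1*x2: [0. 0. 0. 1.]

The new feature is 1 only for input [1,1], which is exactly the point that made the problem non-linear. In the lifted space $(x_1, x_2, x_1 x_2)$ the linear score $x_1 + x_2 - 2x_1 x_2$ evaluates to [0, 1, 1, 0], the XOR labels themselves, so the simple rule "is the score > 0.5?" classifies all four points. Mapping data into a higher-dimensional space can make a hard problem linear. Call that map $\phi$ (phi).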
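As a quick check on the XOR example, here is a minimal sketch that lifts the four points into $(x_1, x_2, x_1 x_2)$ and applies a single linear rule. The weight vector `w` and the 0.5 threshold are hand-picked for illustration (an assumption, not something learned from data).

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], float)
y = np.array([0, 1, 1, 0])  # XOR labels

# Lift each point from (x1, x2) to (x1, x2, x1*x2).
lifted = np.column_stack([X[:, 0], X[:, 1], X[:, 0] * X[:, 1]])

# Hand-picked weights (illustrative assumption, not learned).
w = np.array([1.0, 1.0, -2.0])
scores = lifted @ w                 # x1 + x2 - 2*x1*x2 for each point
pred = (scores > 0.5).astype(int)   # one linear threshold in the lifted space

print("scores:", scores)
print("pred  :", pred)
print("matches XOR labels:", np.array_equal(pred, y))
```

One plane in three dimensions does what no line in two dimensions could.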

The trick: skip building $\phi$, compute the dot product directly

The catch: $\phi$ can be huge or even infinite-dimensional, so building it is expensive or outright impossible. The kernel trick is the beautiful shortcut: for many useful $\phi$, the dot product in the mapped space equals a cheap formula on the original vectors, so you never construct $\phi$ at all.

$$ k(a, b) = \phi(a) \cdot \phi(b) \quad\text{but computed without ever forming } \phi $$

Concretely, for the degree-2 polynomial map $\phi(x) = (x_1^2,\ \sqrt{2}\,x_1 x_2,\ x_2^2)$:

def phi(x):
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2)*x1*x2, x2**2])

a = np.array([1., 2.]); b = np.array([3., 1.])
explicit = phi(a) @ phi(b)      # build the 3-D features, then dot
implicit = (a @ b) ** 2         # the kernel: just (a·b)^2
print("explicit (build phi, then dot):", round(explicit, 3))
print("kernel   (a·b)^2 directly      :", round(implicit, 3))

Output:

explicit (build phi, then dot): 25.0
kernel   (a·b)^2 directly      : 25.0

Identical, but the kernel line never built the feature vectors. That's the kernel trick: $(a\cdot b)^2$ computes a dot product in the 3-D quadratic feature space without ever constructing a point in it. The same idea scales up: with the RBF kernel, one cheap evaluation stands in for a dot product in an infinite-dimensional space.

The kernel zoo (these are similarity functions)

| Kernel | Formula | Implied feature space |
|---|---|---|
| Linear | $k(a,b) = a\cdot b$ | the original space (no map) |
| Polynomial | $k(a,b) = (a\cdot b + c)^d$ | all degree-$\le d$ monomials (when $c > 0$; with $c = 0$, only degree-$d$ monomials) |
| RBF / Gaussian | $k(a,b)=\exp(-\gamma\lVert a-b\rVert^2)$ | infinite-dimensional |
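The three formulas in the table translate almost directly into NumPy. This is a minimal sketch; the function names and the default values for `c`, `d`, and `gamma` are illustrative assumptions, not recommended settings.

```python
import numpy as np

def linear_kernel(a, b):
    # Plain dot product in the original space.
    return a @ b

def polynomial_kernel(a, b, c=1.0, d=2):
    # (a.b + c)^d; c and d are illustrative defaults.
    return (a @ b + c) ** d

def rbf_kernel(a, b, gamma=0.5):
    # exp(-gamma * squared Euclidean distance); gamma is an illustrative default.
    diff = a - b
    return np.exp(-gamma * (diff @ diff))

a = np.array([1., 2.])
b = np.array([3., 1.])
print("linear    :", linear_kernel(a, b))
print("polynomial:", polynomial_kernel(a, b))
print("rbf       :", round(float(rbf_kernel(a, b)), 4))
print("rbf(a, a) :", rbf_kernel(a, a))
```

Each function takes two vectors in the original space and returns a single number, which is the whole interface a kernel method needs.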

You met linear, polynomial-ish, and RBF as similarities in Chapter 4. Now you know the deeper truth: a valid kernel is exactly a similarity function that corresponds to a dot product in some feature space. That's why the kernel matrix $K_{ij} = k(x_i, x_j)$ shows up everywhere — in SVMs, Gaussian processes, kernel PCA, and the Kernel Temporal Segmentation sister book, whose whole algorithm reads from this matrix.

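Don't be confused: a kernel is a similarity, not a distance. For a normalized kernel such as RBF, $k(a,a) = 1$ is the largest value it can take (a thing is maximally similar to itself), and every valid kernel matrix is symmetric and positive semi-definite. If you have a Euclidean distance $d$, exp(-gamma * d^2) turns it into a kernel; for other distances the result is not guaranteed to be positive semi-definite.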
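The two properties of a kernel matrix, symmetry and positive semi-definiteness, can be checked numerically. Below is a minimal sketch that builds the RBF kernel matrix for the four XOR points; the helper name `rbf_gram`, the choice `gamma=1.0`, and the `1e-10` tolerance are assumptions made for illustration.

```python
import numpy as np

def rbf_gram(X, gamma=1.0):
    # Pairwise squared Euclidean distances via ||a||^2 + ||b||^2 - 2 a.b
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2 * X @ X.T
    sq_dists = np.maximum(sq_dists, 0.0)   # clip tiny negatives from round-off
    return np.exp(-gamma * sq_dists)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], float)
K = rbf_gram(X)

eigvals = np.linalg.eigvalsh(K)            # eigenvalues of a symmetric matrix
print("symmetric:", np.allclose(K, K.T))
print("PSD (min eigenvalue >= 0 up to tolerance):", eigvals.min() >= -1e-10)
print("diagonal k(x, x):", np.diag(K))
```

The diagonal is all ones because each point is at distance zero from itself, and no off-diagonal entry exceeds it.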

Meaning 2 — The convolution kernel (a filter)

In CNNs, a kernel (a.k.a. filter) is a small grid of weights — say 3×3 — slid across an image. At each position it multiplies-and-sums the overlapping pixels, detecting a local pattern (an edge, a corner, a texture). A famous hand-built example, the vertical-edge detector:

import numpy as np
kernel = np.array([[-1, 0, 1],
                   [-1, 0, 1],
                   [-1, 0, 1]])   # responds to left-to-right brightness changes
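To see the multiply-and-sum in action, here is a minimal sketch that slides this kernel over a toy image with a dark left half and a bright right half. The helper name `slide` and the 5×6 image are made up for illustration. The loop applies the kernel without flipping it, which is technically cross-correlation; that is the operation deep-learning libraries implement under the name "convolution".

```python
import numpy as np

kernel = np.array([[-1, 0, 1],
                   [-1, 0, 1],
                   [-1, 0, 1]], float)

# Toy 5x6 image (made up): dark left half (0), bright right half (1).
image = np.zeros((5, 6))
image[:, 3:] = 1.0

def slide(image, kernel):
    """Slide the kernel over every fully overlapping position (no padding)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel)   # multiply-and-sum
    return out

response = slide(image, kernel)
print(response)
```

Every output row is `[0, 3, 3, 0]`: the response is zero over the flat regions and peaks where the window straddles the dark-to-bright edge.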

A CNN learns these kernels by gradient descent rather than relying on hand-coded ones. Early layers learn edge and color kernels; deeper layers respond to larger structures such as eyes, wheels, and faces. (More in Chapter 13.)

Don't be confused: ML kernel vs. convolution kernel. The SVM kernel is a function $k(a,b)$ returning a similarity; the CNN kernel is a small array of weights. Same word, no relationship. Context: "kernel trick / RBF kernel / kernel matrix" → meaning 1; "3×3 kernel / conv kernel / filter" → meaning 2.

Meaning 3 — The GPU/CUDA kernel (a function on the GPU)

In GPU programming, a kernel is a function compiled to run on the GPU across thousands of threads at once. "We wrote a custom CUDA kernel to make attention faster" means "we wrote a GPU function." FlashAttention is a well-known example: a fused attention implementation shipped as GPU kernels in this sense. When an ML-systems engineer says "kernel," they usually mean this, a unit of GPU work, not the SVM concept.
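The programming model behind a GPU kernel can be imitated on the CPU to show its shape. The sketch below is plain Python, not real CUDA: `saxpy_kernel` and `launch` are hypothetical names, and the loop in `launch` runs sequentially where a GPU would run the threads in parallel. The point is only that a kernel is written from the perspective of one thread, which uses its own index to pick the element it handles.

```python
import numpy as np

def saxpy_kernel(thread_id, alpha, x, y, out):
    """Body executed by ONE thread: it handles exactly one element."""
    if thread_id < x.size:   # guard: launches often have more threads than elements
        out[thread_id] = alpha * x[thread_id] + y[thread_id]

def launch(kernel, n_threads, *args):
    """CPU stand-in for a GPU launch; a real GPU runs these threads in parallel."""
    for thread_id in range(n_threads):
        kernel(thread_id, *args)

x = np.arange(5, dtype=float)
y = np.ones(5)
out = np.zeros(5)

launch(saxpy_kernel, 8, 2.0, x, y, out)   # 8 "threads" for 5 elements
print(out)
```

The bounds check inside the kernel body mirrors a common pattern in real GPU code, where the number of launched threads is rounded up past the data size.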

Meaning 4 — The OS kernel (the core of the operating system)

The Linux kernel is the privileged core that manages memory, processes, and hardware. It is relevant to AI only operationally (drivers, GPU access, containers). It shares nothing with the other four beyond the word itself, used here in its everyday sense of "core" or "center."

Meaning 5 — The statistics kernel (a weighting window)

In kernel density estimation and smoothing, a kernel is a small bump function (often Gaussian) placed over each data point and summed to estimate a smooth distribution. This is related to meaning 1 — the RBF kernel is the same Gaussian bump — which is why the word overlaps there.
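Kernel density estimation is short enough to write out directly. This is a minimal sketch: the three sample values, the bandwidth of 0.5, and the function names are made up for illustration. Each sample contributes one Gaussian bump, and the bumps are averaged so the estimate integrates to 1.

```python
import numpy as np

def gaussian_bump(u):
    # Standard normal density: the "kernel" in the statistics sense.
    return np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)

def kde(grid, samples, bandwidth):
    # One bump per sample, centered on the sample and scaled by the bandwidth.
    u = (grid[:, None] - samples[None, :]) / bandwidth
    return gaussian_bump(u).sum(axis=1) / (samples.size * bandwidth)

samples = np.array([1.0, 1.5, 4.0])        # made-up data
grid = np.linspace(-4.0, 9.0, 1301)
density = kde(grid, samples, bandwidth=0.5)

area = density.sum() * (grid[1] - grid[0])  # Riemann-sum approximation of the integral
print("area under the estimate:", round(float(area), 3))
```

The bump here has the same $\exp(-\text{const}\cdot\text{distance}^2)$ shape as the RBF kernel, which is where meanings 1 and 5 meet.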

How to tell which kernel is meant

  • "kernel trick," "RBF kernel," "kernel matrix," "kernel method," SVM → ML kernel (meaning 1).
  • "3×3 kernel," "conv kernel," "filter," CNN → convolution kernel (meaning 2).
  • "CUDA kernel," "GPU kernel," "fused kernel," "launch" → GPU kernel (meaning 3).
  • "Linux kernel," "kernel panic," "kernel space" → OS kernel (meaning 4).
  • "kernel density," "smoothing kernel" → stats kernel (meaning 5).

The takeaway

"Kernel" is five mostly unrelated ideas wearing one name. The two that matter for AI: the ML kernel is a similarity function that's secretly a dot product in a richer space (the kernel trick lets you use that space without building it), and the convolution kernel is a small learnable filter in a CNN. Know the other three exist so you're never thrown. With features and similarities in hand, we can now build actual predictive models, starting with the simplest: the line.