The deep-learning zoo (CNN, RNN, Transformer)

The MLP from Chapter 11 treats its input as a flat vector of numbers, with no built-in notion of which entries are neighbors. But real data has structure: images have spatial locality, text and audio have sequential order, graphs have connections. Each major neural architecture is a way of baking a prior about that structure into the network, so the network doesn't have to learn the structure from scratch. This chapter is the conceptual map: what each one is, the one idea that defines it, and what it's for.

The unifying idea: inductive bias

Every architecture encodes an inductive bias — a built-in assumption about the data that makes learning vastly more efficient:

Architecture	Data structure	Built-in assumption
MLP	none (flat vector)	every input interacts with every output
CNN	grid (images)	patterns are local and position-independent
RNN/LSTM	sequence	the past influences the present, step by step
Transformer	sequence/set	everything attends to everything, in parallel
GNN	graph	nodes are defined by their neighbors

CNN — convolutional neural network (images)

A CNN's core operation is the convolution from Chapter 5: slide a small learnable kernel/filter across the input, multiply-and-sum at each position. Here's the 1-D version detecting an edge — a jump in the signal:

import numpy as np
signal = np.array([0,0,0,1,1,1,0,0], float)
kernel = np.array([-1, 1], float)                 # responds to changes
conv = np.array([signal[i:i+2] @ kernel for i in range(len(signal)-1)])
print("signal:", signal)
print("conv  :", conv)

Output:

signal: [0. 0. 0. 1. 1. 1. 0. 0.]
conv  : [ 0.  0.  1.  0.  0. -1.  0.]

The output spikes (+1) where the signal rises and dips (−1) where it falls — the kernel found the edges and was silent on the flat regions. Two properties make this powerful and efficient:

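Parameter sharing: the same small kernel is reused at every position, so a 3×3 kernel has 9 weights (per input-output channel pair) whether the image is 32×32 or 4K. That is massively fewer parameters than an MLP connecting every pixel to every unit.
Translation equivariance: because the kernel slides everywhere, a pattern produces the same response wherever it appears, just at a shifted position in the output. Pooling and later layers turn this into approximate translation invariance, which is why a cat detector works wherever the cat is.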
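To make the two properties concrete, here is a small sketch rather than a definitive benchmark. The helpers `dense_weights`, `conv_weights`, and `conv1d` are hypothetical names introduced for illustration, and the image sizes are arbitrary. The first part counts weights for a fully connected layer versus a single 3×3 kernel; the second part shifts the signal from the edge example one step to the right and checks that the convolution output shifts with it.

```python
import numpy as np

def dense_weights(n_inputs, n_outputs):
    """Weights in one fully connected layer (biases ignored)."""
    return n_inputs * n_outputs

def conv_weights(kernel_h, kernel_w):
    """Weights in one single-channel convolution kernel (bias ignored)."""
    return kernel_h * kernel_w

# Hypothetical comparison: map an image to an output of the same size.
for side in (32, 256):
    n = side * side
    print(f"{side}x{side}: dense={dense_weights(n, n):,}  conv3x3={conv_weights(3, 3)}")

def conv1d(signal, kernel):
    """Slide the kernel over the signal, multiply-and-sum at each position."""
    k = len(kernel)
    return np.array([signal[i:i + k] @ kernel for i in range(len(signal) - k + 1)])

signal = np.array([0, 0, 0, 1, 1, 1, 0, 0], float)
kernel = np.array([-1, 1], float)
shifted = np.roll(signal, 1)           # same bump, moved one step to the right

out = conv1d(signal, kernel)
out_shifted = conv1d(shifted, kernel)
print("original:", out)
print("shifted :", out_shifted)        # same response pattern, one step later
```

The dense count grows with the square of the pixel count while the kernel stays at 9 weights, and the shifted input yields the same edge responses one position later.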

A CNN stacks many such layers (with pooling to shrink spatial size): early layers learn edges, middle layers textures, deep layers objects. Use CNNs for images, video, spectrograms — anything with grid/spatial structure. (Famous names: LeNet, AlexNet, ResNet, U-Net.)
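Pooling is easy to sketch as well. The example below is a minimal illustration, assuming non-overlapping max pooling with a window of 2 (one common choice among several); `max_pool_1d` is a hypothetical helper, and the two feature maps are the edge-detector output from earlier and the same output moved one step to the right.

```python
import numpy as np

def max_pool_1d(x, size=2):
    """Non-overlapping max pooling; a trailing remainder is dropped."""
    n = (len(x) // size) * size
    return x[:n].reshape(-1, size).max(axis=1)

# Edge-detector output from the 1-D example, and the same map shifted by one step.
feature_map = np.array([0, 0, 1, 0, 0, -1, 0], float)
feature_map_shifted = np.array([0, 0, 0, 1, 0, 0, -1], float)

pooled = max_pool_1d(feature_map)
pooled_shifted = max_pool_1d(feature_map_shifted)
print("pooled        :", pooled)
print("pooled shifted:", pooled_shifted)
```

The pooled map is less than half the length of the input, and in this particular case the one-step shift disappears after pooling because the spike stays inside the same window. Larger shifts would still move the pooled output, so the invariance is only approximate.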

RNN / LSTM — recurrent networks (sequences)

A recurrent neural network processes a sequence one element at a time, carrying a hidden state that summarizes everything seen so far:

$$ h_t = \text{activation}(W_x x_t + W_h h_{t-1} + b) $$
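As a rough illustration of this update rule, here is a minimal NumPy sketch. It assumes tanh as the activation, and the dimensions and randomly initialized weights are hypothetical stand-ins for weights that training would normally learn.

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 4           # hypothetical sizes

# Untrained random weights, for illustration only.
W_x = rng.normal(scale=0.5, size=(hidden_dim, input_dim))
W_h = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))
b = np.zeros(hidden_dim)

def rnn_forward(xs):
    """Apply h_t = tanh(W_x x_t + W_h h_{t-1} + b) over a sequence."""
    h = np.zeros(hidden_dim)           # h_0: nothing seen yet
    states = []
    for x_t in xs:                     # step t needs the state from step t-1
        h = np.tanh(W_x @ x_t + W_h @ h + b)
        states.append(h)
    return np.stack(states)

xs = rng.normal(size=(5, input_dim))   # a sequence of 5 input vectors
states = rnn_forward(xs)
print(states.shape)                    # one hidden state per timestep
```

The Python `for` loop is the point: each hidden state depends on the previous one, so the timesteps cannot be computed independently.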

The same weights run at every timestep; the loop is the sequence prior. Plain RNNs struggle to remember across long gaps (the vanishing gradient again), so the LSTM (Long Short-Term Memory) and GRU add gates — small learned valves that decide what to keep, forget, and output — letting information survive across hundreds of steps.
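The gates can be sketched in a few lines. The block below follows the standard LSTM cell equations, with the four gate weight matrices stacked into one matrix `W` (a layout chosen here for brevity, not something the chapter specifies). The weights in the demonstration are set by hand rather than learned, purely to show what a fully open forget gate and a closed input gate do to the cell state.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W has shape (4*H, H+D) and b has shape (4*H,)."""
    H = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b
    f = sigmoid(z[0:H])            # forget gate: how much old cell state to keep
    i = sigmoid(z[H:2 * H])        # input gate: how much new candidate to write
    o = sigmoid(z[2 * H:3 * H])    # output gate: how much cell state to expose
    g = np.tanh(z[3 * H:4 * H])    # candidate values
    c = f * c_prev + i * g         # updated cell state
    h = o * np.tanh(c)             # updated hidden state
    return h, c

D, H = 3, 4                        # hypothetical input and hidden sizes
rng = np.random.default_rng(0)

# Hand-set (not learned) parameters: forget gate open, input gate closed.
W = np.zeros((4 * H, H + D))
b = np.zeros(4 * H)
b[0:H] = 50.0                      # sigmoid(50) is 1.0 in float64
b[H:2 * H] = -50.0                 # sigmoid(-50) is about 2e-22

c_start = np.array([1.0, -2.0, 3.0, 0.5])
h, c = np.zeros(H), c_start.copy()
for x_t in rng.normal(size=(100, D)):
    h, c = lstm_step(x_t, h, c, W, b)
print(c)                           # cell state after 100 steps
```

With these settings the cell state passes through 100 steps unchanged. A trained LSTM learns input-dependent gate values instead, so it can choose when to hold information and when to overwrite it.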

Use RNNs/LSTMs for time series, and historically for text and speech — though for language they've largely been replaced by transformers, which train far faster.

Don't be confused: RNN vs. CNN vs. Transformer for sequences. All three can process sequences, but: RNNs are inherently sequential (step t needs step t−1) → slow to train, weak long-range memory. CNNs capture local patterns in parallel but have a limited window. Transformers see the whole sequence at once and connect any two positions directly → the reason they won.

Transformer — attention (the architecture behind modern AI)

The transformer (2017, "Attention Is All You Need") is the architecture behind nearly every modern large language model (LLM), and many recent vision and audio models too. Its one idea is self-attention: each element looks at every other element and decides how much to pay attention to each, then mixes their information accordingly.

The mechanism, in three letters — Query, Key, Value. Each token emits a query ("what am I looking for?"), a key ("what do I offer?"), and a value ("what I'll contribute"). Attention weights come from query·key similarity (a dot product — the similarity from Chapter 4!), softmaxed into a distribution, then used to take a weighted average of the values:

$$ \text{Attention}(Q,K,V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d}}\right)V $$

A stripped-down version (using one matrix as Q, K, and V to show the shape of it):

import numpy as np

def softmax(z):
    z = z - z.max(-1, keepdims=True)   # subtract the row max for numerical stability
    e = np.exp(z)
    return e / e.sum(-1, keepdims=True)

X = np.array([[1.,0,1,0], [0,1,0,1], [1,1,0,0]])   # 3 tokens, dim 4
scores = (X @ X.T) / np.sqrt(X.shape[1])           # token-to-token similarity
A = softmax(scores)                                # attention weights (rows sum to 1)
print("attention weights:\n", np.round(A, 3))
print("output:\n", np.round(A @ X, 3))

Output:

attention weights:
 [[0.506 0.186 0.307]
 [0.186 0.506 0.307]
 [0.274 0.274 0.452]]
output:
 [[0.814 0.494 0.506 0.186]
 [0.494 0.814 0.186 0.506]
 [0.726 0.726 0.274 0.274]]

Each row of the weight matrix is a probability distribution (sums to 1) saying how much that token attends to each token — and each output row is the corresponding weighted blend. That softmax(QKᵀ/√d)·V is, at heart, the entire transformer. Stack many attention layers (with MLPs and normalization between them), run them in parallel across the whole sequence, and you get GPT, BERT, and friends. We go deeper in Chapter 15.
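The stripped-down version reuses one matrix for Q, K, and V. In a real attention layer each comes from its own learned projection of the input. The sketch below adds those projections, assuming a single head, no masking, and random untrained matrices `W_q`, `W_k`, `W_v` (hypothetical names and sizes chosen for illustration).

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head self-attention: softmax(Q K^T / sqrt(d)) V."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v     # one query, key, value per token
    d = K.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d)) # (tokens, tokens), rows sum to 1
    return weights @ V, weights

X = np.array([[1., 0, 1, 0], [0, 1, 0, 1], [1, 1, 0, 0]])   # 3 tokens, dim 4

rng = np.random.default_rng(0)
d_model, d_head = 4, 2                      # hypothetical sizes
W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))

out, weights = self_attention(X, W_q, W_k, W_v)
print("weights shape:", weights.shape)      # one row of attention per token
print("output shape :", out.shape)          # each token now has a d_head vector

# With identity projections this reduces to the stripped-down version.
I = np.eye(d_model)
out_same, weights_same = self_attention(X, I, I, I)
print(np.round(weights_same, 3))
```

Separate projections let a token ask for something different from what it offers, which the shared-matrix version cannot express: there, the score matrix is always symmetric.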

Why it dominates: attention is parallel (no sequential bottleneck, so it trains fast on GPUs), has direct long-range connections (any token reaches any other token in the context window in one step), and scales well with data and parameters.

The rest of the menagerie (names you'll hear)

Embedding layers — turn discrete tokens/IDs into dense vectors (Chapter 14).
Autoencoders — compress to a bottleneck and reconstruct; learn representations unsupervised.
GANs — a generator vs. a discriminator in a contest; once the kings of image generation.
Diffusion models — generate by iteratively denoising random noise; behind Stable Diffusion, DALL·E, Midjourney.
Graph Neural Networks (GNNs) — learn on graph-structured data (molecules, social networks) by aggregating neighbor information.
Mixture of Experts (MoE) — route each input to a few specialist sub-networks; how the biggest models stay affordable.
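Of the entries above, neighbor aggregation in a GNN is perhaps the easiest to show in a few lines. The sketch below performs one aggregation step on a hypothetical 4-node graph, assuming mean aggregation with self-loops (one common choice, not the only one). A real GNN layer would follow this with a learned weight matrix and a nonlinearity, both omitted here.

```python
import numpy as np

# Hypothetical undirected graph with edges 0-1, 0-2, 2-3.
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 0],
              [1, 0, 0, 1],
              [0, 0, 1, 0]], float)

# One 2-dimensional feature vector per node (made-up values).
X = np.array([[1., 0.],
              [0., 1.],
              [1., 1.],
              [0., 0.]])

A_hat = A + np.eye(len(A))                 # self-loops: a node also keeps its own features
degree = A_hat.sum(axis=1, keepdims=True)  # number of contributors per node
H = (A_hat / degree) @ X                   # mean over each node and its neighbors
print(np.round(H, 3))
```

Node 3 starts with all-zero features and ends with the average of itself and node 2, which is the sense in which nodes come to be defined by their neighbors. Stacking several such steps lets information travel across more than one edge.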

How to choose an architecture

Your data	Start with
Tabular (rows & columns)	gradient-boosted trees or an MLP
Images / video	CNN (or a Vision Transformer)
Text / language	Transformer (use a pretrained one)
Time series	LSTM, temporal CNN, or Transformer
Generation (images)	Diffusion model
Graphs / networks	GNN

You will almost never train these from scratch. You'll download a pretrained model and fine-tune it — which is the heart of Chapter 15.

The takeaway

Each architecture bakes in a prior about its data: CNNs assume local, position-independent patterns (images); RNNs/LSTMs assume step-by-step sequence (time series); transformers assume everything-attends-to-everything via softmax(QKᵀ/√d)·V (language and beyond). Pick the architecture whose built-in assumption matches your data's structure. One word kept recurring: embedding. It deserves its own chapter. 👉

AI Foundations in Depth