The deep-learning zoo (CNN, RNN, Transformer)
The MLP from Chapter 11 treats its input as a flat bag of numbers. But real data has structure: images have spatial locality, text and audio have sequence, graphs have connections. Each major neural architecture is a way of baking a prior about that structure into the network, so the network doesn't have to learn it from scratch. This chapter is the conceptual map: what each one is, the one idea that defines it, and what it's for.
The unifying idea: inductive bias
Every architecture encodes an inductive bias — a built-in assumption about the data that makes learning vastly more efficient:
| Architecture | Data structure | Built-in assumption |
|---|---|---|
| MLP | none (flat vector) | every input interacts with every output |
| CNN | grid (images) | patterns are local and position-independent |
| RNN/LSTM | sequence | the past influences the present, step by step |
| Transformer | sequence/set | everything attends to everything, in parallel |
| GNN | graph | nodes are defined by their neighbors |
CNN — convolutional neural network (images)
A CNN's core operation is the convolution from Chapter 5: slide a small learnable kernel/filter across the input, multiply-and-sum at each position. Here's the 1-D version detecting an edge — a jump in the signal:
import numpy as np
signal = np.array([0,0,0,1,1,1,0,0], float)
kernel = np.array([-1, 1], float) # responds to changes
conv = np.array([signal[i:i+2] @ kernel for i in range(len(signal)-1)])
print("signal:", signal)
print("conv :", conv)
Output:
signal: [0. 0. 0. 1. 1. 1. 0. 0.]
conv : [ 0. 0. 1. 0. 0. -1. 0.]
The output spikes (+1) where the signal rises and dips (−1) where it falls — the kernel found the edges and was silent on the flat regions. Two properties make this powerful and efficient:
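- Parameter sharing: the same small kernel is reused at every position, so a 3×3 kernel has 9 weights whether the image is 32×32 or 4K. That is far fewer parameters than an MLP layer over the same image.
- Translation equivariance (often loosely called translation invariance): shift the cat and the detection shifts with it, because the kernel slides everywhere. Pooling and deeper layers then turn this into a detector that fires wherever the cat is.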
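Both properties can be checked in a few lines. This is a minimal sketch: the helper name `conv1d_valid` and the dense-layer comparison (a layer mapping a flattened image to an output of the same size) are illustrative assumptions, not something defined earlier in the book.

```python
import numpy as np

def conv1d_valid(signal, kernel):
    """Slide `kernel` over `signal` (no padding), multiply-and-sum at each position."""
    k = len(kernel)
    return np.array([signal[i:i + k] @ kernel for i in range(len(signal) - k + 1)])

# Parameter sharing: a 3x3 kernel has 9 weights regardless of image size,
# while a dense layer mapping H*W inputs to H*W outputs needs (H*W)^2 weights.
for side in (32, 256):
    n_pixels = side * side
    print(f"{side}x{side} image: conv weights = {3 * 3}, dense weights = {n_pixels ** 2:,}")

# Sliding the kernel everywhere: shifting the input shifts the output by the same amount.
kernel = np.array([-1.0, 1.0])
signal = np.array([0, 0, 0, 1, 1, 1, 0, 0, 0, 0], float)
shifted = np.roll(signal, 2)          # same pattern, two steps to the right

out = conv1d_valid(signal, kernel)
out_shifted = conv1d_valid(shifted, kernel)
print("original:", out)
print("shifted :", out_shifted)
print("same response, moved by 2:", np.array_equal(out_shifted, np.roll(out, 2)))
```

The edge response is identical in both runs; only its position changes, which is what lets one small kernel serve the whole input.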
A CNN stacks many such layers (with pooling to shrink spatial size): early layers learn edges, middle layers textures, deep layers objects. Use CNNs for images, video, spectrograms — anything with grid/spatial structure. (Famous names: LeNet, AlexNet, ResNet, U-Net.)
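Here is a rough sketch of one such layer in 2-D: a convolution followed by 2×2 max pooling. The function names, the hand-picked kernel, and the ReLU between the two steps are assumptions for illustration; a real CNN learns its kernels and uses a library implementation.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide a 2-D kernel over a 2-D image (no padding), multiply-and-sum."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

def max_pool2x2(fmap):
    """Keep the largest value in each non-overlapping 2x2 block (drops an odd last row/column)."""
    h, w = fmap.shape[0] // 2 * 2, fmap.shape[1] // 2 * 2
    blocks = fmap[:h, :w].reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

image = np.zeros((6, 6))
image[:, 3:] = 1.0                      # dark left half, bright right half: a vertical edge

kernel = np.array([[-1.0, 1.0],
                   [-1.0, 1.0]])        # hand-picked: responds to left-to-right brightening

fmap = np.maximum(conv2d_valid(image, kernel), 0.0)   # ReLU keeps positive responses
pooled = max_pool2x2(fmap)

print("feature map shape:", fmap.shape)   # (5, 5)
print("pooled shape     :", pooled.shape) # (2, 2)
print(pooled)
```

The pooled map is much smaller than the image but still records that an edge sits on the right-hand side, which is the trade pooling makes: less spatial detail, more compact evidence.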
RNN / LSTM — recurrent networks (sequences)
A recurrent neural network processes a sequence one element at a time, carrying a hidden state that summarizes everything seen so far:
$$ h_t = \text{activation}(W_x x_t + W_h h_{t-1} + b) $$
The same weights run at every timestep; the loop is the sequence prior. Plain RNNs struggle to remember across long gaps (the vanishing gradient again), so the LSTM (Long Short-Term Memory) and GRU add gates — small learned valves that decide what to keep, forget, and output — letting information survive across hundreds of steps.
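The recurrence above can be sketched directly in NumPy. Everything here is an assumption for illustration: the sizes, the random weights standing in for learned ones, and `tanh` as the activation. The `gated_step` function is a deliberately simplified single-gate update meant to show what a "valve" does; it is not a full LSTM or GRU.

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim, seq_len = 3, 4, 5

# Random weights stand in for learned ones (assumption for illustration).
W_x = rng.normal(scale=0.5, size=(hidden_dim, input_dim))
W_h = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))
b = np.zeros(hidden_dim)
W_zx = rng.normal(scale=0.5, size=(hidden_dim, input_dim))    # gate weights
W_zh = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def rnn_step(x_t, h_prev):
    """Plain RNN: h_t = tanh(W_x x_t + W_h h_{t-1} + b)."""
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

def gated_step(x_t, h_prev):
    """Simplified gated update: a gate z in (0, 1) blends the old state with a candidate."""
    z = sigmoid(W_zx @ x_t + W_zh @ h_prev)    # near 0: keep old state, near 1: overwrite
    candidate = rnn_step(x_t, h_prev)
    return (1.0 - z) * h_prev + z * candidate

def run(step, xs):
    """Apply the same step function (same weights) at every timestep."""
    h = np.zeros(hidden_dim)
    states = []
    for x_t in xs:                              # step t needs the state from step t-1
        h = step(x_t, h)
        states.append(h)
    return np.stack(states)

xs = rng.normal(size=(seq_len, input_dim))
plain_states = run(rnn_step, xs)
gated_states = run(gated_step, xs)
print(plain_states.shape, gated_states.shape)   # (5, 4) (5, 4)
```

When the gate is close to 0 the old state passes through almost unchanged, which is the mechanism that lets information survive many steps.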
Use RNNs/LSTMs for time series, and historically for text and speech — though for language they've largely been replaced by transformers, which train far faster.
Don't be confused: RNN vs. CNN vs. Transformer for sequences. All three can process sequences, but:
- RNNs are inherently sequential (step t needs step t−1) → slow to train, weak long-range memory.
- CNNs capture local patterns in parallel but have a limited window.
- Transformers see the whole sequence at once and connect any two positions directly → the reason they won.
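One way to put numbers on this comparison is to count how far information must travel to link two positions. The sketch below assumes stride-1 convolutions with no dilation, for which L stacked layers see 1 + L·(k − 1) consecutive positions; the function names are hypothetical.

```python
import math

def rnn_steps_between(distance):
    """Sequential state updates a signal passes through in an RNN to travel `distance` positions."""
    return distance

def conv_layers_to_connect(distance, kernel_size):
    """Stride-1 conv layers needed before one output sees two inputs `distance` apart.

    L stacked layers see 1 + L * (kernel_size - 1) consecutive positions.
    """
    return math.ceil(distance / (kernel_size - 1))

def attention_layers_to_connect(distance):
    """Self-attention links any two positions in the sequence in a single layer."""
    return 1

print("distance  rnn_steps  conv_layers(k=3)  attention_layers")
for distance in (4, 64, 1024):
    print(f"{distance:8d}  {rnn_steps_between(distance):9d}  "
          f"{conv_layers_to_connect(distance, 3):16d}  "
          f"{attention_layers_to_connect(distance):16d}")
```

The RNN and CNN columns grow with distance while the attention column stays at 1, which is the "connect any two positions directly" point in numbers.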
Transformer — attention (the architecture behind modern AI)
The transformer (2017, "Attention Is All You Need") is the architecture behind nearly every modern LLM, and many recent vision and audio models too. Its one idea is self-attention: each element looks at every other element and decides how much to pay attention to each, then mixes their information accordingly.
The mechanism, in three letters — Query, Key, Value. Each token emits a query ("what am I looking for?"), a key ("what do I offer?"), and a value ("what I'll contribute"). Attention weights come from query·key similarity (a dot product — the similarity from Chapter 4!), softmaxed into a distribution, then used to take a weighted average of the values:
$$ \text{Attention}(Q,K,V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d}}\right)V $$
A stripped-down version (using one matrix as Q, K, and V to show the shape of it):
def softmax(z):
    # Subtract the row max before exponentiating, for numerical stability.
    z = z - z.max(-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(-1, keepdims=True)
X = np.array([[1.,0,1,0], [0,1,0,1], [1,1,0,0]]) # 3 tokens, dim 4
scores = (X @ X.T) / np.sqrt(X.shape[1]) # token-to-token similarity
A = softmax(scores) # attention weights (rows sum to 1)
print("attention weights:\n", np.round(A, 3))
print("output:\n", np.round(A @ X, 3))
Output:
attention weights:
[[0.506 0.186 0.307]
[0.186 0.506 0.307]
[0.274 0.274 0.452]]
output:
[[0.814 0.494 0.506 0.186]
[0.494 0.814 0.186 0.506]
[0.726 0.726 0.274 0.274]]
Each row of the weight matrix is a probability distribution (sums to 1) saying how
much that token attends to each token — and each output row is the corresponding
weighted blend. That softmax(QKᵀ/√d)·V is, at heart, the entire transformer.
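The stripped-down version reuses one matrix for Q, K, and V. In the full mechanism each comes from its own learned projection of the tokens. Here is a hedged sketch: the projection matrices `W_q`, `W_k`, `W_v`, the head size, and the random initialization are assumptions for illustration, standing in for weights that training would learn.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)    # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head self-attention: softmax(Q K^T / sqrt(d)) V."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v      # queries, keys, values per token
    d = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d))  # (tokens, tokens), rows sum to 1
    return weights @ V, weights

X = np.array([[1., 0, 1, 0], [0, 1, 0, 1], [1, 1, 0, 0]])   # 3 tokens, dim 4

# Random projections stand in for learned ones (assumption).
rng = np.random.default_rng(0)
d_model, d_head = 4, 2
W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))

out, weights = self_attention(X, W_q, W_k, W_v)
print("row sums:", weights.sum(axis=1))      # each row is a distribution
print("output shape:", out.shape)            # (3, 2)

# With identity projections this reduces to the stripped-down version.
I = np.eye(d_model)
_, plain_weights = self_attention(X, I, I, I)
print(np.round(plain_weights, 3))
```

Separate projections let a token ask for something different from what it offers, which a single shared matrix cannot express.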
Stack many attention layers (with MLPs and normalization between them), run them in
parallel across the whole sequence, and you get GPT, BERT, and friends. We go deeper
in Chapter 15.
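To show the shape of that stacking, here is a minimal sketch of one block repeated three times. It makes several assumptions not spelled out in this chapter: a single attention head, residual connections around each sub-layer with normalization applied after them (the arrangement used in the original 2017 paper), a layer norm without learned scale or shift, and small random weights in place of trained ones. All function and parameter names are hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    """Normalize each token vector to zero mean and unit variance (no learned scale/shift)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def attention(x, p):
    Q, K, V = x @ p["W_q"], x @ p["W_k"], x @ p["W_v"]
    weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
    return weights @ V @ p["W_o"]

def mlp(x, p):
    return np.maximum(x @ p["W_1"], 0.0) @ p["W_2"]    # two layers with a ReLU between

def block(x, p):
    x = layer_norm(x + attention(x, p))   # tokens exchange information
    x = layer_norm(x + mlp(x, p))         # each token is processed on its own
    return x

def init_block(rng, d_model, d_hidden):
    shapes = {"W_q": (d_model, d_model), "W_k": (d_model, d_model),
              "W_v": (d_model, d_model), "W_o": (d_model, d_model),
              "W_1": (d_model, d_hidden), "W_2": (d_hidden, d_model)}
    return {name: rng.normal(scale=0.1, size=shape) for name, shape in shapes.items()}

rng = np.random.default_rng(0)
d_model = 4
layers = [init_block(rng, d_model, d_hidden=8) for _ in range(3)]

x = rng.normal(size=(5, d_model))         # 5 tokens
for p in layers:
    x = block(x, p)
print(x.shape)                            # (5, 4): same shape in, same shape out
```

Because every block maps a (tokens, d_model) array to another of the same shape, blocks can be stacked as deep as the budget allows.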
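Why it dominates: attention is parallel (no sequential bottleneck → trains fast on GPUs), has direct long-range connections (any token reaches any other token in the context in one step), and scales well with data and parameters. The price is that compute and memory grow quadratically with sequence length, since every token is compared with every other.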
The rest of the menagerie (names you'll hear)
- Embedding layers — turn discrete tokens/IDs into dense vectors (Chapter 14).
- Autoencoders — compress to a bottleneck and reconstruct; learn representations unsupervised.
- GANs — a generator vs. a discriminator in a contest; once the kings of image generation.
- Diffusion models — generate by iteratively denoising random noise; behind Stable Diffusion, DALL·E, Midjourney.
- Graph Neural Networks (GNNs) — learn on graph-structured data (molecules, social networks) by aggregating neighbor information.
- Mixture of Experts (MoE) — route each input to a few specialist sub-networks; how the biggest models stay affordable.
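Three of these ideas fit in a few lines each. The sketches below are toy versions under stated assumptions: random tables and weights stand in for learned ones, the GNN step is a plain neighbor average with no learned transformation, and the router is a simple top-k softmax gate. The names (`embedding_table`, `moe_forward`, and so on) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Embedding layer: a lookup table with one row per token ID.
vocab_size, embed_dim = 10, 4
embedding_table = rng.normal(size=(vocab_size, embed_dim))
token_ids = np.array([3, 7, 3])
embedded = embedding_table[token_ids]          # shape (3, 4); same ID -> same vector

# GNN layer: each node averages its neighbors' features (and its own).
adjacency = np.array([[0, 1, 0],
                      [1, 0, 1],
                      [0, 1, 0]], float)       # path graph 0 - 1 - 2
features = np.array([[1.0], [2.0], [6.0]])
with_self = adjacency + np.eye(3)              # include each node itself
aggregated = (with_self @ features) / with_self.sum(axis=1, keepdims=True)

# Mixture of Experts: a router scores the experts and only the top-k run.
def moe_forward(x, router_w, expert_ws, k=2):
    scores = x @ router_w                      # one score per expert
    top = np.argsort(scores)[-k:]              # indices of the k highest-scoring experts
    gate = np.exp(scores[top] - scores[top].max())
    gate /= gate.sum()                         # softmax over the chosen experts only
    out = sum(g * (x @ expert_ws[i]) for g, i in zip(gate, top))
    return out, top

n_experts, dim = 4, 4
router_w = rng.normal(size=(dim, n_experts))
expert_ws = rng.normal(size=(n_experts, dim, dim))
x = rng.normal(size=dim)
moe_out, chosen = moe_forward(x, router_w, expert_ws, k=2)

print("embedded shape:", embedded.shape)
print("aggregated:", aggregated.ravel())       # [1.5 3.  4. ]
print("experts used:", sorted(chosen.tolist()), "of", n_experts)
```

In the MoE sketch only 2 of the 4 expert matrices are multiplied for this input, which is where the compute saving comes from.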
How to choose an architecture
| Your data | Start with |
|---|---|
| Tabular (rows & columns) | gradient-boosted trees or an MLP |
| Images / video | CNN (or a Vision Transformer) |
| Text / language | Transformer (use a pretrained one) |
| Time series | LSTM, temporal CNN, or Transformer |
| Generation (images) | Diffusion model |
| Graphs / networks | GNN |
You will almost never train these from scratch. You'll download a pretrained model and fine-tune it — which is the heart of Chapter 15.
The takeaway
Each architecture bakes in a prior about its data: CNNs assume local, position-independent patterns (images); RNNs/LSTMs assume step-by-step sequence (time series); transformers assume everything-attends-to-everything via softmax(QKᵀ/√d)·V (language and beyond). Pick the architecture whose built-in assumption matches your data's structure. One word kept recurring: embedding. It deserves its own chapter. 👉