A neural network from scratch
Now we assemble everything (linear units, non-linearities, a loss, and backpropagation) into an actual neural network, built in pure NumPy with the backward pass derived by hand. No framework hiding the magic. After this chapter, the term "neural network" will hold no mystery: it's essentially stacked logistic regressions trained by gradient descent.
The full program is code/nn_from_scratch.py; it prints
exactly the output below.
A neuron is something you already know
One neuron computes activation(w·x + b) — a weighted sum (from
Chapter 6) pushed through a non-linear function. Stack many
neurons side by side → a layer. Stack layers → a multilayer perceptron
(MLP), the original neural network:
$$ \mathbf{h} = \text{activation}(\mathbf{x}W_1 + \mathbf{b}_1), \qquad \mathbf{output} = \mathbf{h}W_2 + \mathbf{b}_2 $$
The middle layers are hidden layers; their outputs are features the network invents for itself. That's the superpower: instead of you hand-crafting features (Chapter 3), the network learns them.
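To make the shapes in that formula concrete, here is a minimal sketch of the forward pass in NumPy. The sizes (4 samples, 2 inputs, 3 hidden neurons, 1 output), the seed, and the choice of tanh as the activation are illustrative assumptions, not values taken from the chapter's program.

```python
import numpy as np

rng = np.random.default_rng(0)          # arbitrary seed, for repeatability only

m, n_in, n_hidden, n_out = 4, 2, 3, 1   # illustrative sizes (assumed)
x = rng.normal(size=(m, n_in))          # one row per sample

W1 = rng.normal(size=(n_in, n_hidden))  # one column of weights per hidden neuron
b1 = np.zeros((1, n_hidden))
W2 = rng.normal(size=(n_hidden, n_out))
b2 = np.zeros((1, n_out))

h = np.tanh(x @ W1 + b1)                # hidden layer: every neuron at once
output = h @ W2 + b2                    # linear output layer

# A single neuron is just one column of the layer.
neuron_0 = np.tanh(x @ W1[:, 0] + b1[0, 0])

print(h.shape, output.shape)            # (4, 3) (4, 1)
print(np.allclose(neuron_0, h[:, 0]))   # True
```

The last check is the point: a "layer" is nothing more than many `activation(w·x + b)` neurons computed in one matrix multiply.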
Why non-linearity is the whole point
Don't be confused: without an activation function, depth is an illusion. Stack two linear layers and you get $(xW_1)W_2 = x(W_1W_2)$, which is still just one linear layer. Adding biases doesn't help: the result is still a single affine map. A hundred linear layers collapse into one. The non-linear activation between layers is what lets a deep network represent curved, complex functions. No activation, no deep learning.
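You can check the collapse numerically. The sketch below uses made-up sizes and random matrices (all assumptions) and a hypothetical helper, `best_linear_fit_error`, that asks: how well can one linear layer reproduce this output? For two stacked linear layers the answer is "perfectly"; put a tanh between them and it no longer can.

```python
import numpy as np

rng = np.random.default_rng(1)          # arbitrary seed
x = rng.normal(size=(50, 4))            # 50 samples, 4 features (illustrative)
W1 = rng.normal(size=(4, 6))
W2 = rng.normal(size=(6, 3))

def best_linear_fit_error(inputs, targets):
    """Largest error left after fitting targets = inputs @ W by least squares."""
    W, *_ = np.linalg.lstsq(inputs, targets, rcond=None)
    return np.abs(inputs @ W - targets).max()

two_linear = (x @ W1) @ W2              # two stacked linear layers
merged = x @ (W1 @ W2)                  # one layer with the merged matrix
with_tanh = np.tanh(x @ W1) @ W2        # same stack, tanh in between

print(np.allclose(two_linear, merged))          # True: the stack is one layer
print(best_linear_fit_error(x, two_linear))     # numerically zero
print(best_linear_fit_error(x, with_tanh))      # clearly non-zero
```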
The activation zoo
import numpy as np
z = np.array([-2., -0.5, 0., 0.5, 2.])
print("relu :", np.maximum(0, z))
print("sigmoid:", 1/(1+np.exp(-z)))
print("tanh :", np.tanh(z))
print("leaky :", np.where(z>0, z, 0.01*z))
Output:
relu : [0. 0. 0. 0.5 2. ]
sigmoid: [0.119 0.378 0.5 0.622 0.881]
tanh : [-0.964 -0.462 0. 0.462 0.964]
leaky : [-0.02 -0.005 0. 0.5 2. ]
| Activation | Range | Use it for |
|---|---|---|
| ReLU max(0,z) | $[0,\infty)$ | the default for hidden layers: cheap, and its slope stays at 1 for $z>0$, so it does not saturate there |
| Leaky ReLU | $(-\infty,\infty)$ | ReLU that avoids "dead" neurons (small slope for $z<0$) |
| Sigmoid | $(0,1)$ | a single probability output (binary) |
| Tanh | $(-1,1)$ | zero-centered hidden activations (older nets, RNNs) |
| GELU/SiLU | bounded below (dips slightly negative), unbounded above | smooth, ReLU-like curves; the modern default inside transformers |
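The zoo code above covers the first four rows of the table. For the last row, here is a small sketch using the same five inputs. SiLU is written as `z * sigmoid(z)`, and `gelu_tanh` is the common tanh approximation of GELU; the function names are my own, and frameworks may use the exact erf-based GELU instead.

```python
import numpy as np

def silu(z):
    # SiLU (also called swish): z * sigmoid(z)
    return z / (1.0 + np.exp(-z))

def gelu_tanh(z):
    # Common tanh approximation of GELU
    return 0.5 * z * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (z + 0.044715 * z**3)))

z = np.array([-2., -0.5, 0., 0.5, 2.])   # same inputs as the activation zoo
print("silu :", np.round(silu(z), 3))
print("gelu :", np.round(gelu_tanh(z), 3))
```

Both behave like ReLU for large positive inputs but stay smooth around zero and dip slightly below zero for negative inputs.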
Don't be confused: ReLU for hidden layers, sigmoid/softmax for outputs. Beginners reach for sigmoid everywhere because it's famous. In practice: ReLU (or GELU) in the hidden layers, and sigmoid/softmax only at the output to produce probabilities. Sigmoid in deep hidden layers causes the vanishing gradient problem: its slope is at most 0.25 and near zero in the tails, so gradients shrink at every layer on the way back. That problem held deep networks back for years, until ReLU became the default.
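A quick way to see the vanishing gradient is to look at the slopes themselves. This sketch ignores the weight matrices and only tracks the activation-derivative factor that backprop multiplies in at each layer; the depth of 10 and the sample inputs are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-6.0, -2.0, 0.0, 2.0, 6.0])
sigmoid_slope = sigmoid(z) * (1.0 - sigmoid(z))   # derivative of sigmoid
relu_slope = (z > 0).astype(float)                # derivative of ReLU (taken as 0 at z = 0)

print("sigmoid':", np.round(sigmoid_slope, 4))    # peaks at 0.25, tiny in the tails
print("relu'   :", relu_slope)                    # exactly 1 wherever the unit is active

# Backprop multiplies one such slope per layer (weights ignored here).
depth = 10                                        # illustrative depth
print("sigmoid, best case over 10 layers:", 0.25 ** depth)
print("relu, active path over 10 layers :", 1.0 ** depth)
```

Even in the best case (every unit sitting exactly at z = 0), ten sigmoid layers scale the gradient by 0.25^10, roughly 9.5e-07. An active ReLU path scales it by 1.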
The network we'll build
A tiny MLP for a problem no linear model can solve — the XOR layout, where each class lives in two opposite corners:
2 inputs -> 8 hidden neurons (tanh) -> 1 output neuron (sigmoid)
Two corners are class 0, the other two are class 1 — not separable by any line, so
the hidden layer must invent useful features. Loss: binary cross-entropy. The
forward pass is two matrix multiplies with a tanh between them; the backward pass
is the chain rule, by hand:
# ---- forward pass ----
z1 = X @ W1 + b1 # (m, hidden)
a1 = np.tanh(z1) # hidden activation
z2 = a1 @ W2 + b2 # (m, 1)
p = sigmoid(z2) # predicted probability
# ---- loss: binary cross-entropy ----
loss = -np.mean(y*np.log(p+eps) + (1-y)*np.log(1-p+eps))
# ---- backward pass (chain rule, layer by layer from the output) ----
dz2 = (p - y) / m # dL/dz2 — the clean (pred - target) form again
dW2 = a1.T @ dz2
db2 = dz2.sum(0, keepdims=True)
da1 = dz2 @ W2.T # push the error back through W2
dz1 = da1 * (1 - a1**2) # times tanh'(z) = 1 - tanh(z)^2
dW1 = X.T @ dz1
db1 = dz1.sum(0, keepdims=True)
# ---- gradient-descent update ----
W1 -= lr*dW1
b1 -= lr*db1
W2 -= lr*dW2
b2 -= lr*db2
Notice three things you already know reappearing: (p - y) is the cross-entropy
gradient from Chapter 7; dz2 @ W2.T is the backprop
"push the error backward" from Chapter 8; the final four
lines are plain gradient descent. A neural network is not a new idea — it's the
old ideas, composed.
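A hand-derived backward pass is easy to get subtly wrong, so it's worth checking against finite differences. The sketch below is not part of the chapter's program: the tiny random dataset, the sizes, the seed, and the helper names `loss` and `numeric_grad` are all assumptions. It reuses the backward-pass formulas exactly as written above and compares them with numerically estimated gradients.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)                    # arbitrary seed
m, hidden, eps = 6, 4, 1e-12                      # tiny problem (assumed sizes)
X = rng.normal(size=(m, 2))
y = rng.integers(0, 2, size=(m, 1)).astype(float)
W1 = 0.5 * rng.normal(size=(2, hidden))
b1 = np.zeros((1, hidden))
W2 = 0.5 * rng.normal(size=(hidden, 1))
b2 = np.zeros((1, 1))

def loss():
    """Binary cross-entropy of the current parameters (read from globals)."""
    a1 = np.tanh(X @ W1 + b1)
    p = sigmoid(a1 @ W2 + b2)
    return -np.mean(y*np.log(p+eps) + (1-y)*np.log(1-p+eps))

def numeric_grad(param, h=1e-5):
    """Central finite differences, perturbing `param` in place one entry at a time."""
    grad = np.zeros_like(param)
    for idx in np.ndindex(*param.shape):
        original = param[idx]
        param[idx] = original + h
        loss_up = loss()
        param[idx] = original - h
        loss_down = loss()
        param[idx] = original                     # restore the entry
        grad[idx] = (loss_up - loss_down) / (2 * h)
    return grad

# Analytic gradients: the same formulas as the backward pass above.
a1 = np.tanh(X @ W1 + b1)
p = sigmoid(a1 @ W2 + b2)
dz2 = (p - y) / m
dW2 = a1.T @ dz2
db2 = dz2.sum(0, keepdims=True)
dz1 = (dz2 @ W2.T) * (1 - a1**2)
dW1 = X.T @ dz1
db1 = dz1.sum(0, keepdims=True)

pairs = [(W1, dW1), (b1, db1), (W2, dW2), (b2, db2)]
max_err = max(np.abs(numeric_grad(param) - grad).max() for param, grad in pairs)
print("largest analytic-vs-numeric difference:", max_err)
```

If the largest difference is many orders of magnitude smaller than the gradients themselves, the chain-rule derivation is right. A sign error or a missing transpose shows up here immediately.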
Training it
$ python nn_from_scratch.py
Output:
data: 400 points, 2 features, 2 classes (XOR layout)
epoch 0 loss 0.6983 train-acc 0.505
epoch 400 loss 0.0377 train-acc 0.990
epoch 800 loss 0.0230 train-acc 0.998
epoch 1200 loss 0.0180 train-acc 0.998
epoch 1600 loss 0.0154 train-acc 0.998
epoch 1999 loss 0.0139 train-acc 0.998
final train accuracy: 0.998
sanity check on the four cluster centers:
point [-1. -1.] -> P(class 1) = 0.000
point [1. 1.] -> P(class 1) = 0.000
point [-1. 1.] -> P(class 1) = 1.000
point [ 1. -1.] -> P(class 1) = 1.000
It starts at 50.5%, which is chance level: the randomly initialized network has
not learned anything yet. From there it climbs to 99.8%. The sanity check
confirms it learned the XOR pattern: the two "0" corners get P≈0.000, the two
"1" corners get P≈1.000. A plain logistic regression on this data stays close to
50% no matter how long it trains, because no single line separates the classes;
the hidden layer is what made the difference, by bending the input space until
the classes became separable.
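You can try the logistic-regression baseline yourself. The chapter's data-generation code isn't shown here, so the sketch below builds a stand-in: 100 points per corner with Gaussian noise of standard deviation 0.2, which is purely my assumption, as are the seed and learning rate. The model is the same network with the hidden layer removed.

```python
import numpy as np

rng = np.random.default_rng(0)                     # arbitrary seed

# Assumed stand-in for the chapter's data: 100 noisy points per corner.
centers = np.array([[-1., -1.], [1., 1.], [-1., 1.], [1., -1.]])
labels = np.array([0., 0., 1., 1.])                # XOR layout from the sanity check
X = np.vstack([c + 0.2 * rng.normal(size=(100, 2)) for c in centers])
y = np.repeat(labels, 100).reshape(-1, 1)
m = len(X)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Logistic regression: one linear unit followed by a sigmoid, no hidden layer.
w = np.zeros((2, 1))
b = np.zeros((1, 1))
lr, eps = 0.5, 1e-12
for epoch in range(2000):
    p = sigmoid(X @ w + b)
    dz = (p - y) / m                               # same (pred - target) gradient
    w -= lr * (X.T @ dz)
    b -= lr * dz.sum(0, keepdims=True)

p = sigmoid(X @ w + b)
loss = -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
acc = np.mean((p > 0.5) == (y == 1))
print(f"logistic regression: loss {loss:.4f}  train-acc {acc:.3f}")
```

The loss should stay close to ln 2 ≈ 0.693, the value you get by predicting 0.5 for everything, and the accuracy should stay far below what the MLP reaches. A single line can get at most three of the four corners right.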
Initialization matters more than you'd think
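We initialized weights to small random values scaled by $1/\sqrt{n_{\text{in}}}$. This is the same variance-preserving idea as Xavier and He initialization, which differ only in the exact constant (He, for example, uses $\sqrt{2/n_{\text{in}}}$ to suit ReLU). Two failure modes this avoids: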
- All zeros → every neuron computes the same thing and gets the same gradient; they never differentiate. The network is effectively one neuron wide forever.
- Too large → activations saturate, gradients vanish or explode, and training stalls or produces NaNs.
Good initialization keeps the signal's variance stable as it flows through layers. Frameworks do this for you, but knowing why explains a whole class of "my network won't train" bugs.
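Both failure modes are easy to reproduce. The sketch below is an illustration under assumed settings (toy XOR-style data, 8 hidden tanh units, a 10-layer stack of width 256, arbitrary seed), not the chapter's program. `saturated_fraction` is a hypothetical helper that reports how many tanh units end up pinned near ±1.

```python
import numpy as np

rng = np.random.default_rng(0)                      # arbitrary seed

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# --- Failure 1: all-zero weights never break symmetry ---
X = rng.normal(size=(40, 2))                         # toy inputs (assumed)
y = (X[:, :1] * X[:, 1:] < 0).astype(float)          # XOR-style labels
m, hidden, lr = len(X), 8, 0.5
W1, b1 = np.zeros((2, hidden)), np.zeros((1, hidden))
W2, b2 = np.zeros((hidden, 1)), np.zeros((1, 1))
for _ in range(200):
    a1 = np.tanh(X @ W1 + b1)
    p = sigmoid(a1 @ W2 + b2)
    dz2 = (p - y) / m
    dz1 = (dz2 @ W2.T) * (1 - a1**2)                 # zero, because W2 is zero
    W2 -= lr * (a1.T @ dz2)                          # zero, because a1 is zero
    b2 -= lr * dz2.sum(0, keepdims=True)             # the only parameter that can move
    W1 -= lr * (X.T @ dz1)
    b1 -= lr * dz1.sum(0, keepdims=True)
print("hidden neurons still identical:", np.allclose(W1, W1[:, :1]))

# --- Failure 2: weights that are too large saturate tanh ---
def saturated_fraction(weight_std, depth=10, width=256):
    """Share of tanh units pinned near +/-1 after `depth` random layers."""
    a = rng.normal(size=(500, width))
    for _ in range(depth):
        a = np.tanh(a @ (weight_std * rng.normal(size=(width, width))))
    return np.mean(np.abs(a) > 0.99)

width = 256
print("scaled by 1/sqrt(n_in):", saturated_fraction(1.0 / np.sqrt(width)))
print("unscaled (std 1.0)    :", saturated_fraction(1.0))
```

With zero weights and tanh, the weight gradients are exactly zero, so the hidden layer never learns anything. With unscaled weights, most units saturate, which is where the gradient `1 - a1**2` is close to zero.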
The takeaway
A neuron is activation(w·x + b); a layer is many neurons; an MLP is stacked
layers with non-linearities between them — and that non-linearity is what makes
depth meaningful. Forward pass = matrix multiplies + activations; backward pass =
the chain rule applied layer by layer; update = gradient descent. We just did all
of it by hand and solved a problem no line can. Now let's let a framework write that
backward pass for us. 👉