A neural network from scratch

Now we assemble everything (linear units, non-linearities, a loss, and backpropagation) into an actual neural network, built in pure NumPy with the backward pass derived by hand. No framework hiding the magic. After this chapter, the term "neural network" will hold no mystery: it's stacked logistic regressions trained by gradient descent.

The full program is code/nn_from_scratch.py; it prints exactly the output below.

A neuron is something you already know

One neuron computes activation(w·x + b) — a weighted sum (from Chapter 6) pushed through a non-linear function. Stack many neurons side by side → a layer. Stack layers → a multilayer perceptron (MLP), the original neural network:

$$ \mathbf{h} = \text{activation}(\mathbf{x}W_1 + \mathbf{b}_1), \qquad \mathbf{output} = \mathbf{h}W_2 + \mathbf{b}_2 $$

The middle layers are hidden layers; their outputs are features the network invents for itself. That's the superpower: instead of you hand-crafting features (Chapter 3), the network learns them.
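To make the shapes in that formula concrete, here is a minimal sketch of the two-layer forward pass. The sizes, the random inputs, and the choice of tanh as the activation are illustrative assumptions, not values taken from the chapter's program.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumed): 4 examples, 2 inputs, 8 hidden neurons, 1 output
m, n_in, n_hidden, n_out = 4, 2, 8, 1

X  = rng.normal(size=(m, n_in))          # one example per row
W1 = rng.normal(size=(n_in, n_hidden))   # one column per hidden neuron
b1 = np.zeros((1, n_hidden))
W2 = rng.normal(size=(n_hidden, n_out))
b2 = np.zeros((1, n_out))

h = np.tanh(X @ W1 + b1)                 # hidden layer: all neurons at once
output = h @ W2 + b2                     # linear output layer

# A single neuron is just one column of W1 plus one entry of b1
j = 3
neuron_j = np.tanh(X @ W1[:, j] + b1[0, j])

print(h.shape, output.shape)             # (4, 8) (4, 1)
print(np.allclose(neuron_j, h[:, j]))    # True
```

The last check is the point of the section: a layer is nothing more than many neurons computed in one matrix multiply.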

Why non-linearity is the whole point

Don't be confused: without an activation function, depth is an illusion. Stack two linear layers and you get $(xW_1)W_2 = x(W_1W_2)$, which is still just one linear layer. A hundred linear layers collapse into one. The non-linear activation between layers is what lets a deep network represent curved, complex functions. No activation, no deep learning.
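The collapse is easy to check numerically. A small sketch, with arbitrary shapes and random matrices chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
x  = rng.normal(size=(5, 3))
W1 = rng.normal(size=(3, 4))
W2 = rng.normal(size=(4, 2))

# Two linear layers equal one linear layer with weights W1 @ W2
two_linear = (x @ W1) @ W2
one_linear = x @ (W1 @ W2)
collapsed = np.allclose(two_linear, one_linear)

# Put a tanh between the layers and the equivalence breaks
with_tanh = np.tanh(x @ W1) @ W2
still_collapsed = np.allclose(with_tanh, one_linear)

print(collapsed, still_collapsed)   # True False
```

Biases are left out to match the formula above; adding them changes nothing, since the composition of two affine maps is still affine.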

The activation zoo

import numpy as np
z = np.array([-2., -0.5, 0., 0.5, 2.])
print("relu   :", np.maximum(0, z))
print("sigmoid:", 1/(1+np.exp(-z)))
print("tanh   :", np.tanh(z))
print("leaky  :", np.where(z>0, z, 0.01*z))

Output:

relu   : [0.  0.  0.  0.5 2. ]
sigmoid: [0.119 0.378 0.5   0.622 0.881]
tanh   : [-0.964 -0.462  0.     0.462  0.964]
leaky  : [-0.02  -0.005  0.     0.5    2.   ]

| Activation | Range | Use it for |
|---|---|---|
| ReLU, $\max(0,z)$ | $[0,\infty)$ | the default for hidden layers: cheap, no vanishing gradient |
| Leaky ReLU | $(-\infty,\infty)$ | ReLU that avoids "dead" neurons (small slope for $z<0$) |
| Sigmoid | $(0,1)$ | a single probability output (binary) |
| Tanh | $(-1,1)$ | zero-centered hidden activations (older nets, RNNs) |
| GELU/SiLU | smooth | the modern default inside transformers |
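GELU and SiLU are listed without code, so here is a hedged sketch of both. SiLU is $z\cdot\sigma(z)$; for GELU this uses the common tanh approximation rather than the exact form based on the Gaussian CDF. The function names `silu` and `gelu_tanh` are our own, not from the chapter's program.

```python
import numpy as np

def silu(z):
    """SiLU (also called swish): z * sigmoid(z)."""
    return z / (1.0 + np.exp(-z))

def gelu_tanh(z):
    """GELU via the widely used tanh approximation (not the exact erf form)."""
    return 0.5 * z * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (z + 0.044715 * z**3)))

z = np.array([-2., -0.5, 0., 0.5, 2.])
print("silu   :", np.round(silu(z), 3))
print("gelu   :", np.round(gelu_tanh(z), 3))
```

Both behave like ReLU for large positive $z$, but they dip slightly below zero for small negative inputs instead of cutting off sharply.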

Don't be confused: ReLU for hidden layers, sigmoid/softmax for outputs. Beginners reach for sigmoid everywhere because it's famous. In practice: ReLU (or GELU) in the hidden layers, and sigmoid/softmax only at the output to produce probabilities. Sigmoid in deep hidden layers causes the vanishing gradient problem (its slope is near-zero in the tails, so gradients die on the way back) — the bug that stalled deep learning for years until ReLU.
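A rough way to see the vanishing gradient: the sigmoid's slope is $\sigma(z)(1-\sigma(z))$, which never exceeds 0.25. The sketch below multiplies that best-case slope across a stack of layers. It deliberately ignores the weight matrices, which also scale the gradient, so treat it as an illustration of the activation's contribution only. The depth of 10 is an arbitrary choice.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

z = np.array([-6., -2., 0., 2., 6.])
print("sigmoid'(z):", np.round(sigmoid_grad(z), 4))   # peaks at 0.25 when z = 0

# Best case: every layer sits exactly at z = 0, where the slope is largest
depth = 10
best_case = 0.25 ** depth
print("slope product through", depth, "sigmoid layers:", best_case)

# ReLU's slope is exactly 1 wherever the unit is active, so it does not shrink
relu_grad = (z > 0).astype(float)
print("relu'(z)   :", relu_grad)
```

Even in the best case the sigmoid chain shrinks the gradient by about a factor of a million after ten layers; in the tails it is far worse.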

The network we'll build

A tiny MLP for a problem no linear model can solve — the XOR layout, where each class lives in two opposite corners:

2 inputs  ->  8 hidden neurons (tanh)  ->  1 output neuron (sigmoid)

Two corners are class 0, the other two are class 1 — not separable by any line, so the hidden layer must invent useful features. Loss: binary cross-entropy. The forward pass is two matrix multiplies with a tanh between them; the backward pass is the chain rule, by hand:

# ---- forward pass ----
z1 = X @ W1 + b1          # (m, hidden)
a1 = np.tanh(z1)          # hidden activation
z2 = a1 @ W2 + b2         # (m, 1)
p  = sigmoid(z2)          # predicted probability

# ---- loss: binary cross-entropy ----
loss = -np.mean(y*np.log(p+eps) + (1-y)*np.log(1-p+eps))

# ---- backward pass (chain rule, layer by layer from the output) ----
dz2 = (p - y) / m              # dL/dz2 — the clean (pred - target) form again
dW2 = a1.T @ dz2
db2 = dz2.sum(0, keepdims=True)
da1 = dz2 @ W2.T               # push the error back through W2
dz1 = da1 * (1 - a1**2)        # times tanh'(z) = 1 - tanh(z)^2
dW1 = X.T @ dz1
db1 = dz1.sum(0, keepdims=True)

# ---- gradient-descent update ----
W1 -= lr*dW1; b1 -= lr*db1; W2 -= lr*dW2; b2 -= lr*db2

Notice three things you already know reappearing: (p - y) is the cross-entropy gradient from Chapter 7; dz2 @ W2.T is the backprop "push the error backward" from Chapter 8; the final four lines are plain gradient descent. A neural network is not a new idea — it's the old ideas, composed.
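A hand-derived backward pass is worth double-checking. One standard technique is a numerical gradient check: nudge each parameter by a tiny amount, measure how the loss changes, and compare with the analytic gradient. The sketch below reuses the formulas above on a small random problem; the helper names (`loss_fn`, `analytic_grads`, `numeric_grads`), the data, and the step size are our own assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_fn(params, X, y, eps=1e-12):
    W1, b1, W2, b2 = params
    a1 = np.tanh(X @ W1 + b1)
    p = sigmoid(a1 @ W2 + b2)
    return -np.mean(y*np.log(p+eps) + (1-y)*np.log(1-p+eps))

def analytic_grads(params, X, y):
    """Same chain-rule steps as the backward pass in the text."""
    W1, b1, W2, b2 = params
    m = X.shape[0]
    a1 = np.tanh(X @ W1 + b1)
    p = sigmoid(a1 @ W2 + b2)
    dz2 = (p - y) / m
    dW2 = a1.T @ dz2
    db2 = dz2.sum(0, keepdims=True)
    dz1 = (dz2 @ W2.T) * (1 - a1**2)
    dW1 = X.T @ dz1
    db1 = dz1.sum(0, keepdims=True)
    return [dW1, db1, dW2, db2]

def numeric_grads(params, X, y, h=1e-5):
    """Central finite differences, one parameter entry at a time."""
    grads = []
    for P in params:
        G = np.zeros_like(P)
        for idx in np.ndindex(*P.shape):
            old = P[idx]
            P[idx] = old + h
            up = loss_fn(params, X, y)
            P[idx] = old - h
            down = loss_fn(params, X, y)
            P[idx] = old                      # restore the original value
            G[idx] = (up - down) / (2 * h)
        grads.append(G)
    return grads

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))
y = rng.integers(0, 2, size=(6, 1)).astype(float)
params = [rng.normal(size=(2, 8)) / np.sqrt(2), np.zeros((1, 8)),
          rng.normal(size=(8, 1)) / np.sqrt(8), np.zeros((1, 1))]

max_err = max(np.max(np.abs(a - n))
              for a, n in zip(analytic_grads(params, X, y),
                              numeric_grads(params, X, y)))
print("largest |analytic - numeric| gradient gap:", max_err)
```

If the two disagree by much more than rounding error, there is a bug in the derivation or in the code. A check like this is far too slow for training, but it is cheap insurance while writing a backward pass.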

Training it

$ python nn_from_scratch.py

Output:

data: 400 points, 2 features, 2 classes (XOR layout)

epoch    0  loss 0.6983  train-acc 0.505
epoch  400  loss 0.0377  train-acc 0.990
epoch  800  loss 0.0230  train-acc 0.998
epoch 1200  loss 0.0180  train-acc 0.998
epoch 1600  loss 0.0154  train-acc 0.998
epoch 1999  loss 0.0139  train-acc 0.998

final train accuracy: 0.998

sanity check on the four cluster centers:
  point [-1. -1.]  ->  P(class 1) = 0.000
  point [1. 1.]  ->  P(class 1) = 0.000
  point [-1.  1.]  ->  P(class 1) = 1.000
  point [ 1. -1.]  ->  P(class 1) = 1.000

It starts at 50.5%, which is chance level for a randomly initialized network, and climbs to 99.8%. The sanity check confirms it learned the XOR pattern: the two "0" corners get P≈0.000, the two "1" corners get P≈1.000. A plain logistic regression cannot get there however long it trains, because no single line separates the classes; the hidden layer is what made the difference, by bending the input space until the classes became separable.
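For readers who want something runnable without opening the full program, here is a compact sketch of a training loop in the same spirit. It is not a copy of code/nn_from_scratch.py: the data generator (100 noisy points around each corner, noise 0.3), the seed, the learning rate of 0.5, and the initialization are all assumptions, so the printed numbers will differ from the output above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# Assumed data: noisy clusters at the four corners, labelled in the XOR layout
centers = np.array([[-1., -1.], [1., 1.], [-1., 1.], [1., -1.]])
corner_labels = np.array([0., 0., 1., 1.])
X = np.vstack([c + 0.3 * rng.normal(size=(100, 2)) for c in centers])
y = np.repeat(corner_labels, 100).reshape(-1, 1)
m = X.shape[0]

# Assumed hyperparameters and initialization
hidden, lr, eps = 8, 0.5, 1e-12
W1 = rng.normal(size=(2, hidden)) / np.sqrt(2)
b1 = np.zeros((1, hidden))
W2 = rng.normal(size=(hidden, 1)) / np.sqrt(hidden)
b2 = np.zeros((1, 1))

def forward(X):
    a1 = np.tanh(X @ W1 + b1)
    return a1, sigmoid(a1 @ W2 + b2)

for epoch in range(2000):
    a1, p = forward(X)
    loss = -np.mean(y*np.log(p+eps) + (1-y)*np.log(1-p+eps))

    # backward pass, same steps as in the text
    dz2 = (p - y) / m
    dW2 = a1.T @ dz2
    db2 = dz2.sum(0, keepdims=True)
    dz1 = (dz2 @ W2.T) * (1 - a1**2)
    dW1 = X.T @ dz1
    db1 = dz1.sum(0, keepdims=True)

    # full-batch gradient descent update
    W1 -= lr*dW1; b1 -= lr*db1; W2 -= lr*dW2; b2 -= lr*db2

    if epoch % 400 == 0:
        print(f"epoch {epoch:4d}  loss {loss:.4f}")

_, p = forward(X)
acc = np.mean((p > 0.5) == (y == 1))
print(f"final train accuracy: {acc:.3f}")
```

Swapping `hidden` down to 1 or 2 is a quick experiment: with too few hidden neurons the network has a much harder time bending the space enough to separate the corners.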

Initialization matters more than you'd think

We initialized weights to small random values scaled by $1/\sqrt{n_{\text{in}}}$, the idea behind Xavier and He initialization (He uses $\sqrt{2/n_{\text{in}}}$, which is tuned for ReLU). This avoids two failure modes:

  • All zeros → every neuron computes the same thing and gets the same gradient; they never differentiate. The network is effectively one neuron wide forever.
  • Too large → activations saturate, gradients vanish or explode, training stalls or NaNs.
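The symmetry problem can be seen directly in the gradient of `W1`, which has one column per hidden neuron. In the sketch below (toy data and the helper `hidden_weight_grads` are our own assumptions), any initialization that makes all neurons identical gives every neuron the same gradient column, so they stay identical after every update. With exact zeros the situation is even worse for this architecture: the gradient reaching `W1` is zero, because the error is pushed back through `W2`, which is itself zero.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hidden_weight_grads(W1, b1, W2, b2, X, y):
    """dL/dW1 for the tanh -> sigmoid network; one column per hidden neuron."""
    m = X.shape[0]
    a1 = np.tanh(X @ W1 + b1)
    p = sigmoid(a1 @ W2 + b2)
    dz2 = (p - y) / m
    dz1 = (dz2 @ W2.T) * (1 - a1**2)
    return X.T @ dz1

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
y = (X[:, :1] * X[:, 1:] < 0).astype(float)   # XOR-like labels (assumed toy data)

hidden = 8
b1, b2 = np.zeros((1, hidden)), np.zeros((1, 1))

dW1_zero  = hidden_weight_grads(np.zeros((2, hidden)), b1,
                                np.zeros((hidden, 1)), b2, X, y)
dW1_const = hidden_weight_grads(np.full((2, hidden), 0.5), b1,
                                np.full((hidden, 1), 0.5), b2, X, y)
dW1_rand  = hidden_weight_grads(rng.normal(size=(2, hidden)) / np.sqrt(2), b1,
                                rng.normal(size=(hidden, 1)) / np.sqrt(hidden), b2, X, y)

def columns_identical(G):
    return np.allclose(G, G[:, :1])

print("zeros    -> all-zero gradient:", np.all(dW1_zero == 0))          # True
print("constant -> identical columns:", columns_identical(dW1_const))   # True
print("random   -> identical columns:", columns_identical(dW1_rand))    # False
```

Random initialization is what breaks the tie and lets each neuron learn a different feature.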

Good initialization keeps the signal's variance stable as it flows through layers. Frameworks do this for you, but knowing why explains a whole class of "my network won't train" bugs.
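To see the effect of scale, the sketch below pushes random inputs through a stack of tanh layers and measures how many activations end up saturated (here defined as $|a| > 0.99$, an arbitrary threshold). The width, depth, and helper name `forward_stats` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def forward_stats(scale, n=256, depth=10, m=512):
    """Pass random inputs through `depth` tanh layers of width n.

    Weights are standard normal times `scale`. Returns the standard
    deviation of the final activations and the fraction that saturated.
    """
    a = rng.normal(size=(m, n))
    for _ in range(depth):
        W = rng.normal(size=(n, n)) * scale
        a = np.tanh(a @ W)
    saturated = np.mean(np.abs(a) > 0.99)
    return a.std(), saturated

n = 256
std_good, sat_good = forward_stats(scale=1 / np.sqrt(n), n=n)
std_big,  sat_big  = forward_stats(scale=1.0, n=n)

print(f"1/sqrt(n) scale: activation std {std_good:.3f}, saturated {sat_good:.1%}")
print(f"unit scale     : activation std {std_big:.3f}, saturated {sat_big:.1%}")
```

With unit-scale weights most activations are pinned near ±1, where the tanh slope $1 - a^2$ is close to zero, so almost no gradient gets back through those units. With the $1/\sqrt{n}$ scale the activations stay in the responsive middle of the tanh.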

The takeaway

A neuron is activation(w·x + b); a layer is many neurons; an MLP is stacked layers with non-linearities between them — and that non-linearity is what makes depth meaningful. Forward pass = matrix multiplies + activations; backward pass = the chain rule applied layer by layer; update = gradient descent. We just did all of it by hand and solved a problem no line can. Now let's let a framework write that backward pass for us. 👉