Linear & logistic models
Before neural networks, meet the two models that a large share of practical ML still runs on and that every neural net is built from: linear regression (predict a number) and logistic regression (predict a probability). A neural network is essentially a stack of these with non-linearities between them, so understand these two and the rest is repetition.
The linear model: a weighted sum
Every linear model computes the same thing — a weighted sum of the features plus a bias:
$$ z = w_1 x_1 + w_2 x_2 + \dots + w_d x_d + b = \mathbf{w}\cdot\mathbf{x} + b $$
Each weight $w_i$ says how much feature $i$ pushes the answer up or down; the bias $b$ is the baseline. That dot product is the entire model. What you do with $z$ is what separates regression from classification.
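A minimal sketch of that weighted sum in NumPy. The feature values, weights, and bias are made-up numbers chosen for illustration, not taken from any dataset:

```python
import numpy as np

# Hypothetical example: three features with hand-picked weights.
x = np.array([2.0, 0.5, 1.0])    # feature vector
w = np.array([0.5, -1.0, 2.0])   # one weight per feature
b = 0.25                         # bias (the baseline)

# Written out term by term: w_1*x_1 + w_2*x_2 + w_3*x_3 + b
z_loop = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b

# The same computation as a single dot product.
z_dot = np.dot(w, x) + b

print(z_loop, z_dot)  # both are 2.75 (1.0 - 0.5 + 2.0 + 0.25)
```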
Linear regression — predict a number
Use $z$ directly as the prediction. We already did this in Chapter 1: recover y = 2x + 1. The loss is mean squared error, and there's even a closed-form solution (the normal equations), but gradient descent generalizes to everything, so we lead with it.
Use it for: house prices, demand forecasting, any continuous target — and as a baseline you must beat before reaching for anything fancier.
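To make the closed-form versus gradient-descent comparison concrete, here is a small sketch on noise-free data from y = 2x + 1. The data grid, learning rate, and step count are assumptions picked for this illustration, and `np.linalg.lstsq` is used as a stand-in for solving the normal equations by hand (it returns the same least-squares solution):

```python
import numpy as np

# Assumed toy data: 21 evenly spaced points on the line y = 2x + 1.
x = np.linspace(-1.0, 1.0, 21)
y = 2.0 * x + 1.0
X = np.column_stack([np.ones_like(x), x])   # column of 1s = the bias term

# Closed form: the least-squares solution of X w = y.
w_closed, *_ = np.linalg.lstsq(X, y, rcond=None)

# Gradient descent on mean squared error.
w = np.zeros(2)
for step in range(2000):
    residual = X @ w - y                    # prediction - target
    grad = 2.0 * X.T @ residual / len(y)    # gradient of MSE w.r.t. w
    w -= 0.1 * grad

print("closed form      (bias, slope):", np.round(w_closed, 3))
print("gradient descent (bias, slope):", np.round(w, 3))
```

Both routes should land on a bias near 1 and a slope near 2. The closed form gets there in one solve, while the loop is the part that carries over to models with no closed-form solution.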
Logistic regression — predict a probability
For yes/no questions ("will this user click?"), a raw weighted sum is wrong: it ranges over all real numbers, but a probability must live in $[0, 1]$. So we squash $z$ through the sigmoid (logistic) function:
$$ \sigma(z) = \frac{1}{1 + e^{-z}} \qquad p = \sigma(\mathbf{w}\cdot\mathbf{x}+b) $$
import numpy as np
def sigmoid(z): return 1 / (1 + np.exp(-z))
print(np.round(sigmoid(np.array([-4., 0., 4.])), 3))  # round to 3 decimals for display
Output:
[0.018 0.5 0.982]
Sigmoid maps $-\infty\to0$, $0\to0.5$, $+\infty\to1$ — a smooth S-curve turning any score into a probability. Despite the name, logistic regression is a classification model.
A worked example: hours studied → pass/fail
import numpy as np
def sigmoid(z): return 1 / (1 + np.exp(-z))
hours = np.array([1.,2.,3.,4.,5.,6.,7.,8.]).reshape(-1, 1)
y = np.array([0.,0.,0.,0.,1.,1.,1.,1.])      # passed only with >= 5 hours
X = np.hstack([np.ones_like(hours), hours])  # column of 1s = the bias term
w = np.zeros(2)
for step in range(20000):                    # gradient descent
    p = sigmoid(X @ w)
    grad = X.T @ (p - y) / len(y)            # gradient of cross-entropy
    w -= 0.1 * grad
print("weights (bias, slope):", np.round(w, 3))
for h in [3.0, 4.5, 6.0]:
    print(f"hours={h} -> P(pass)={sigmoid(w[0] + w[1]*h):.3f}")
print("decision boundary at hours =", round(-w[0]/w[1], 3))
Output:
weights (bias, slope): [-18.223 4.077]
hours=3.0 -> P(pass)=0.002
hours=4.5 -> P(pass)=0.531
hours=6.0 -> P(pass)=0.998
decision boundary at hours = 4.469
The model learned a confident S-curve: 3 hours → basically fail, 6 hours → almost certain pass, and a decision boundary at ~4.47 hours (where $P=0.5$, i.e. $\mathbf{w}\cdot\mathbf{x}+b=0$). Notice the gradient X.T @ (p - y): the same clean (prediction − target) form we'll see again in backprop.
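One way to convince yourself that X.T @ (p - y) / len(y) really is the gradient of the mean cross-entropy is to compare it against finite differences. The sketch below does that on the same toy data. The helper `cross_entropy`, the test point `w`, and the step size `eps` are assumptions introduced for this check:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def cross_entropy(w, X, y):
    """Mean binary cross-entropy of the logistic model with weights w."""
    p = sigmoid(X @ w)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Same toy data as the worked example.
hours = np.array([1., 2., 3., 4., 5., 6., 7., 8.]).reshape(-1, 1)
y = np.array([0., 0., 0., 0., 1., 1., 1., 1.])
X = np.hstack([np.ones_like(hours), hours])

# Arbitrary test point (an assumption, not the trained weights).
w = np.array([-1.0, 0.3])

# Analytic gradient: the (prediction - target) form.
analytic = X.T @ (sigmoid(X @ w) - y) / len(y)

# Numerical gradient: central finite differences, one coordinate at a time.
eps = 1e-6
numeric = np.zeros_like(w)
for i in range(len(w)):
    step = np.zeros_like(w)
    step[i] = eps
    numeric[i] = (cross_entropy(w + step, X, y)
                  - cross_entropy(w - step, X, y)) / (2 * eps)

print("analytic:", analytic)
print("numeric: ", numeric)
```

The two vectors should agree to several decimal places, which is a useful sanity check whenever you hand-derive a gradient.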
Don't be confused: logits vs. probabilities. The raw score $z = \mathbf{w}\cdot\mathbf{x}+b$, before the sigmoid, is the logit. It lives in $(-\infty, +\infty)$. After the sigmoid it's a probability in $(0,1)$. Models output logits internally; loss functions like PyTorch's BCEWithLogitsLoss take logits (more numerically stable than applying sigmoid yourself). When someone says "the logits," they mean the pre-activation scores. The inverse of the sigmoid is the logit function $\log\frac{p}{1-p}$, hence the name.
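The stability point can be illustrated in plain NumPy. This is a sketch of the idea, not PyTorch's actual implementation: the function names `bce_naive` and `bce_from_logits` are made up here, and the rewrite used is $-[y\log\sigma(z) + (1-y)\log(1-\sigma(z))] = \log(1+e^{z}) - yz$.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def logit(p):
    """Inverse of the sigmoid: maps a probability back to its raw score."""
    return np.log(p / (1 - p))

def bce_naive(z, y):
    """Cross-entropy computed from probabilities; breaks for large |z|."""
    p = sigmoid(z)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def bce_from_logits(z, y):
    """Same loss computed straight from the logit: log(1 + e^z) - y*z."""
    return np.logaddexp(0.0, z) - y * z

# Moderate logits: the round trip recovers z and the two losses agree.
z = np.array([-2.0, 0.5, 3.0])
y = np.array([0.0, 1.0, 1.0])
print("round trip:", logit(sigmoid(z)))
print("naive :", bce_naive(z, y))
print("stable:", bce_from_logits(z, y))

# Extreme logit: sigmoid(40) rounds to exactly 1.0 in float64, so the
# naive version takes log(0) and returns inf; the logit version returns ~40.
z_big, y_big = np.array([40.0]), np.array([0.0])
with np.errstate(divide="ignore"):
    naive_big = bce_naive(z_big, y_big)
stable_big = bce_from_logits(z_big, y_big)
print("naive at z=40 :", naive_big)
print("stable at z=40:", stable_big)
```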
More than two classes: softmax
For 3+ classes (digit 0–9, which-category), generalize the sigmoid to the softmax: produce one logit per class, then normalize them into a probability distribution that sums to 1.
$$ \text{softmax}(z)_i = \frac{e^{z_i}}{\sum_j e^{z_j}} $$
import numpy as np
def softmax(z):
    z = z - np.max(z)   # numerically-stable shift (Chapter 17)
    e = np.exp(z)
    return e / e.sum()
print(np.round(softmax(np.array([2.0, 1.0, 0.1])), 3))
Output:
[0.659 0.242 0.099]
The biggest logit gets the biggest probability, and everything sums to 1. Softmax is the final layer of nearly every multi-class classifier, including the output of a language model over its vocabulary. (Multi-class logistic regression is also called softmax regression or, confusingly, a maximum-entropy classifier.)
Don't be confused: sigmoid vs. softmax. Sigmoid = one output, "yes/no" (binary, or independent multi-label). Softmax = several outputs that compete and sum to 1 (pick one of N mutually-exclusive classes). Two classes with softmax is mathematically the same as one sigmoid.
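The two-class equivalence follows from dividing through by $e^{z_0}$: $\frac{e^{z_0}}{e^{z_0}+e^{z_1}} = \frac{1}{1+e^{-(z_0-z_1)}} = \sigma(z_0 - z_1)$. A quick numerical check, with arbitrary logits chosen for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def softmax(z):
    z = z - np.max(z)   # numerically stable shift
    e = np.exp(z)
    return e / e.sum()

z = np.array([1.3, -0.4])          # arbitrary logits for class 0 and class 1
p_softmax = softmax(z)[0]          # P(class 0) from a two-way softmax
p_sigmoid = sigmoid(z[0] - z[1])   # one sigmoid on the logit difference

print(p_softmax, p_sigmoid)        # the two values match up to rounding
```

Only the difference between the two logits matters, which is why a binary classifier needs just one output.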
Why linear models matter even in the deep-learning era
- They're the baseline. If logistic regression on good features already hits 92%, a neural net that hits 92.3% may not be worth the complexity.
- They're interpretable. Each weight is a clear "this feature pushes the answer this much" — invaluable in regulated domains (credit, medicine).
- They're the atom of deep learning. One neural-network "neuron" with a sigmoid activation is exactly sigmoid(w·x + b). A layer is many of them; a deep net is layers stacked with non-linearities. You already understand the unit.
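The "atom" claim in the last bullet can be sketched directly: a layer is several logistic-regression units sharing the same input, with their weight vectors stacked as the rows of a matrix. The sizes and random weights below are assumptions for illustration only:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
x = np.array([0.5, -1.2, 3.0])        # 3 input features (made-up values)

# One neuron: a single weighted sum squashed by a sigmoid.
w = rng.normal(size=3)
b = 0.1
neuron_out = sigmoid(w @ x + b)

# A layer of 4 neurons: 4 weight vectors stacked as the rows of W.
W = rng.normal(size=(4, 3))
b_vec = np.zeros(4)
layer_out = sigmoid(W @ x + b_vec)    # 4 outputs, one per neuron

# Row 0 of the layer is the same computation as a standalone neuron.
first = sigmoid(W[0] @ x + b_vec[0])

print("single neuron:", neuron_out)
print("layer output :", layer_out)
```

Stacking a second such layer on top of `layer_out` would give the smallest possible deep network.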
The takeaway
Linear models compute a weighted sum (a dot product). Use it raw for regression, through a sigmoid for binary probability, through softmax for multi-class. Logits are the pre-squash scores. These are both the strongest simple baseline and the building block of every neural network. But we keep saying "minimize the loss": what exactly is a loss function? That's next.