Linear & logistic models

Before neural networks, meet the two models that much of practical ML still runs on and that neural nets are built from: linear regression (predict a number) and logistic regression (predict a probability). A neural network is, to a first approximation, a stack of these with non-linearities between them, so once you understand these two, much of the rest is repetition.

The linear model: a weighted sum

Every linear model computes the same thing — a weighted sum of the features plus a bias:

$$ z = w_1 x_1 + w_2 x_2 + \dots + w_d x_d + b = \mathbf{w}\cdot\mathbf{x} + b $$

Each weight $w_i$ says how much feature $i$ pushes the answer up or down; the bias $b$ is the baseline. That dot product is the entire model. What you do with $z$ is what separates regression from classification.
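As a minimal sketch of that weighted sum, here it is computed two ways in NumPy: term by term, and as a dot product. The feature values, weights, and bias are made-up numbers chosen for illustration, not taken from any dataset.

```python
import numpy as np

# Hypothetical example values (assumptions for illustration only)
x = np.array([3.0, 1.0, 2.0])    # three features
w = np.array([0.5, -1.0, 2.0])   # one weight per feature
b = 0.25                         # bias (the baseline)

# Explicit weighted sum: w_1*x_1 + w_2*x_2 + w_3*x_3 + b
z_loop = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b

# The same computation as a dot product
z_dot = np.dot(w, x) + b

print(z_loop, z_dot)   # both are 1.5 - 1.0 + 4.0 + 0.25 = 4.75
```

The second feature has a negative weight, so increasing it pulls $z$ down; the other two push it up.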

Linear regression — predict a number

Use $z$ directly as the prediction. We already did this in Chapter 1 when we recovered $y = 2x + 1$. The loss is mean squared error. Linear regression even has a closed-form solution (the normal equations), but gradient descent also works for models that have no closed form, so we lead with it.
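To make the closed-form versus gradient-descent comparison concrete, the sketch below fits the same line both ways. The data is an assumed noise-free sample of $y = 2x + 1$, and the learning rate and step count are arbitrary choices that happen to converge here. `np.linalg.lstsq` solves the least-squares problem directly, which gives the same answer as the normal equations.

```python
import numpy as np

# Noise-free points on the line y = 2x + 1 (assumed toy data)
x = np.linspace(-1.0, 1.0, 20)
y = 2.0 * x + 1.0
X = np.column_stack([np.ones_like(x), x])   # column of 1s carries the bias

# Closed form: least-squares solution of X w = y
w_closed, *_ = np.linalg.lstsq(X, y, rcond=None)

# Gradient descent on mean squared error
w_gd = np.zeros(2)
for _ in range(2000):
    residual = X @ w_gd - y                 # prediction minus target
    grad = 2.0 * X.T @ residual / len(y)    # gradient of the MSE
    w_gd -= 0.1 * grad

print("closed form      (bias, slope):", np.round(w_closed, 3))
print("gradient descent (bias, slope):", np.round(w_gd, 3))
```

Both routes should land on a bias of about 1 and a slope of about 2. The closed form gets there in one solve, while gradient descent needs many small steps but uses only the gradient.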

Use it for: house prices, demand forecasting, any continuous target — and as a baseline you must beat before reaching for anything fancier.

Logistic regression — predict a probability

For yes/no questions ("will this user click?"), a raw weighted sum is wrong: it ranges over all real numbers, but a probability must live in $[0, 1]$. So we squash $z$ through the sigmoid (logistic) function:

$$ \sigma(z) = \frac{1}{1 + e^{-z}} \qquad p = \sigma(\mathbf{w}\cdot\mathbf{x}+b) $$

import numpy as np
def sigmoid(z):
    return 1 / (1 + np.exp(-z))   # squash any real score into (0, 1)
print(np.round(sigmoid(np.array([-4., 0., 4.])), 3))   # rounded to 3 decimals

Output:

[0.018 0.5   0.982]

Sigmoid maps $-\infty\to0$, $0\to0.5$, $+\infty\to1$ — a smooth S-curve turning any score into a probability. Despite the name, logistic regression is a classification model.

A worked example: hours studied → pass/fail

import numpy as np
def sigmoid(z): return 1 / (1 + np.exp(-z))

hours = np.array([1.,2.,3.,4.,5.,6.,7.,8.]).reshape(-1, 1)
y     = np.array([0.,0.,0.,0.,1.,1.,1.,1.])      # passed only with >= 5 hours
X     = np.hstack([np.ones_like(hours), hours])  # column of 1s = the bias term

w = np.zeros(2)
for step in range(20000):                        # gradient descent
    p = sigmoid(X @ w)
    grad = X.T @ (p - y) / len(y)                # gradient of cross-entropy
    w -= 0.1 * grad

print("weights (bias, slope):", np.round(w, 3))
for h in [3.0, 4.5, 6.0]:
    print(f"hours={h} -> P(pass)={sigmoid(w[0] + w[1]*h):.3f}")
print("decision boundary at hours =", round(-w[0]/w[1], 3))

Output:

weights (bias, slope): [-18.223   4.077]
hours=3.0 -> P(pass)=0.002
hours=4.5 -> P(pass)=0.531
hours=6.0 -> P(pass)=0.998
decision boundary at hours = 4.469

The model learned a confident S-curve: 3 hours → basically fail, 6 hours → almost certain pass, and a decision boundary at ~4.47 hours (where $P=0.5$, i.e. $\mathbf{w}\cdot\mathbf{x}+b=0$). Notice the gradient X.T @ (p - y) — the same clean (prediction − target) form we'll see again in backprop.
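Why does the gradient take that form? For the binary cross-entropy $L = -[y\log p + (1-y)\log(1-p)]$ with $p = \sigma(z)$, the chain rule gives $\partial L/\partial z = p - y$, because the sigmoid's derivative $p(1-p)$ cancels the denominator of $\partial L/\partial p$. The sketch below checks that claim numerically with finite differences. The test point `w` is an arbitrary assumption, and `cross_entropy` is a helper written for this check.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def cross_entropy(w, X, y):
    """Mean binary cross-entropy of a logistic model with weights w."""
    p = sigmoid(X @ w)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Same toy data as the worked example
hours = np.arange(1.0, 9.0).reshape(-1, 1)
y = np.array([0., 0., 0., 0., 1., 1., 1., 1.])
X = np.hstack([np.ones_like(hours), hours])

w = np.array([-0.5, 0.2])                        # arbitrary test point (assumption)
analytic = X.T @ (sigmoid(X @ w) - y) / len(y)   # the formula used in training

# Central finite differences, one coordinate at a time
eps = 1e-6
numeric = np.zeros_like(w)
for i in range(len(w)):
    step = np.zeros_like(w)
    step[i] = eps
    numeric[i] = (cross_entropy(w + step, X, y)
                  - cross_entropy(w - step, X, y)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-6))
```

If the two gradients agree, the one-line formula in the training loop really is the slope of the cross-entropy loss.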

Don't be confused: logits vs. probabilities. The raw score $z = \mathbf{w}\cdot\mathbf{x}+b$, before the sigmoid, is the logit. It lives in $(-\infty, +\infty)$. After the sigmoid it's a probability in $(0,1)$. Models output logits internally; loss functions like PyTorch's BCEWithLogitsLoss take logits directly, which is more numerically stable than applying the sigmoid yourself and then taking a log. When someone says "the logits," they mean the pre-activation scores. The inverse of the sigmoid is the logit function $\log\frac{p}{1-p}$ (the log-odds), hence the name.
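A small NumPy sketch of both points: the logit function undoes the sigmoid, and a loss computed straight from logits stays finite where the sigmoid-then-log route would not. This illustrates the idea only and is not PyTorch's implementation. `bce_from_logits` is a hypothetical helper name, and the logits and labels are made up.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def logit(p):
    """Inverse of the sigmoid: maps a probability back to its logit."""
    return np.log(p / (1 - p))

def bce_from_logits(z, y):
    """Binary cross-entropy computed directly from logits.

    Uses the identity -[y*log(s(z)) + (1-y)*log(1-s(z))] = log(1 + e^z) - y*z,
    with log(1 + e^z) evaluated stably by np.logaddexp(0, z).
    """
    return np.logaddexp(0.0, z) - y * z

z = np.array([-2.0, 0.0, 3.0])     # made-up logits
y = np.array([0.0, 1.0, 1.0])      # made-up labels

# Round trip: logit(sigmoid(z)) recovers z
print(np.allclose(logit(sigmoid(z)), z))

# For moderate logits the naive route and the logit route agree
p = sigmoid(z)
naive = -(y * np.log(p) + (1 - y) * np.log(1 - p))
print(np.allclose(naive, bce_from_logits(z, y)))

# For an extreme logit, sigmoid rounds to exactly 1.0 and the naive route
# would take log(0); the logit form stays finite.
print(bce_from_logits(np.array([800.0]), np.array([0.0])))
```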

More than two classes: softmax

For 3+ classes (digit 0–9, which-category), generalize the sigmoid to the softmax: produce one logit per class, then normalize them into a probability distribution that sums to 1.

$$ \text{softmax}(z)_i = \frac{e^{z_i}}{\sum_j e^{z_j}} $$

def softmax(z):
    z = z - np.max(z)          # numerically-stable shift (Chapter 17)
    e = np.exp(z)
    return e / e.sum()

print(np.round(softmax(np.array([2.0, 1.0, 0.1])), 3))

Output:

[0.659 0.242 0.099]

The biggest logit gets the biggest probability, and everything sums to 1. Softmax is the final layer of essentially every multi-class classifier, including the output of a language model over its vocabulary. (Multi-class logistic regression is also called softmax regression, multinomial logistic regression, or, confusingly, a maximum-entropy classifier.)
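The softmax function above handles one vector of logits. In practice logits usually arrive as a batch, one row per example, so here is a hedged sketch of a row-wise version. `softmax_rows` is a name invented for this example, and the second row of logits is made up.

```python
import numpy as np

def softmax_rows(Z):
    """Softmax applied independently to each row of a 2-D array of logits."""
    Z = Z - Z.max(axis=1, keepdims=True)   # per-row stability shift
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

# Hypothetical logits: 2 examples, 3 classes
Z = np.array([[2.0, 1.0, 0.1],
              [0.0, 0.0, 5.0]])
P = softmax_rows(Z)

print(np.round(P, 3))
print("row sums:", P.sum(axis=1))
print("predicted class per row:", P.argmax(axis=1))
```

Each row is its own probability distribution, and the predicted class is the index of the largest probability, which is also the index of the largest logit.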

Don't be confused: sigmoid vs. softmax. Sigmoid = one output, "yes/no" (binary, or independent multi-label). Softmax = several outputs that compete and sum to 1 (pick one of N mutually-exclusive classes). Softmax over two classes is mathematically the same as one sigmoid applied to the difference of the two logits.
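The two-class equivalence is easy to check numerically. In this sketch the two logits are made-up values; the point is only that softmax over the pair gives the same probability as a sigmoid on their difference.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def softmax(z):
    z = z - np.max(z)              # numerically stable shift
    e = np.exp(z)
    return e / e.sum()

z_yes, z_no = 1.7, -0.4            # hypothetical logits for the two classes

p_softmax = softmax(np.array([z_yes, z_no]))[0]   # P(yes) from softmax
p_sigmoid = sigmoid(z_yes - z_no)                 # P(yes) from one sigmoid

print(np.isclose(p_softmax, p_sigmoid))
```

Only the difference between the two logits matters, which is why a binary classifier needs just one output.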

Why linear models matter even in the deep-learning era

  • They're the baseline. If logistic regression on good features already hits 92%, a neural net that hits 92.3% may not be worth the complexity.
  • They're interpretable. Each weight is a clear "this feature pushes the answer this much" — invaluable in regulated domains (credit, medicine).
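  • They're the atom of deep learning. A neural-network "neuron" with a sigmoid activation is exactly sigmoid(w·x + b); other activations such as ReLU change only the squashing function. A layer is many neurons side by side; a deep net is layers stacked with non-linearities between them. You already understand the unit.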
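To illustrate the last bullet, here is a rough sketch of one neuron and one layer in NumPy. The weights are random placeholders rather than trained values, and the input is made up; the only claim is structural, namely that a layer is several logistic-regression units sharing one input.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)       # fixed seed; weights are random placeholders
x = np.array([0.5, -1.2, 3.0])       # one input with 3 features (made-up values)

# One neuron: a single weight vector and a bias
w, b = rng.normal(size=3), 0.1
neuron_out = sigmoid(w @ x + b)

# One layer: 4 neurons = 4 weight vectors stacked as the rows of W
W, b_vec = rng.normal(size=(4, 3)), np.zeros(4)
layer_out = sigmoid(W @ x + b_vec)

# Output i of the layer is just neuron i computed on its own
assert np.isclose(layer_out[0], sigmoid(W[0] @ x + b_vec[0]))
print(neuron_out.shape, layer_out.shape)
```

Stacking a second layer would mean feeding `layer_out` into another weight matrix in the same way.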

The takeaway

Linear models compute a weighted sum (a dot product). Use it raw for regression, through a sigmoid for binary probability, through softmax for multi-class. Logits are the pre-squash scores. These are both the strongest simple baseline and the building block of every neural network. But we keep saying "minimize the loss" — what exactly is a loss function? That's next. 👉