What is a model, really?
Strip away the mystique and a model is a function with adjustable numbers inside it. You feed it an input, it produces an output, and the "adjustable numbers" are what get tuned during training so the outputs come out right.
$$ \text{output} = f(\text{input};\; \theta) $$
That $\theta$ ("theta") is the bag of adjustable numbers — the parameters. Training is the search for good values of $\theta$.
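To make "a function with adjustable numbers inside it" tangible, here is a minimal sketch. It assumes, purely for illustration, that `f` is a straight line and that `theta` is a pair `(w, b)`; these names and this form are a choice made for the example, not the only possibility.

```python
import numpy as np

def f(x, theta):
    """A model: takes an input x and parameters theta, returns an output."""
    w, b = theta            # unpack the adjustable numbers
    return w * x + b        # the fixed "shape" of the function

x = np.array([0.0, 1.0, 2.0])
print(f(x, theta=(0.0, 0.0)))   # -> [0. 0. 0.]
print(f(x, theta=(2.0, 1.0)))   # -> [1. 3. 5.]
```

Same function, same input, different `theta`, different output. Training never rewrites `f`; it only searches for better numbers to put in `theta`.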
The five words that define the whole field
| Word | Plain meaning |
|---|---|
| Parameters (weights) | the numbers inside the model that training adjusts |
| Training (fitting) | the process of adjusting parameters to reduce error |
| Loss | a single number measuring how wrong the model currently is |
| Inference (prediction) | running a trained model on new input |
| Hyperparameters | knobs you set that aren't learned (e.g. learning rate) |
Learn these five and you can follow almost any ML conversation. Let's make them concrete with the smallest possible model.
The smallest model: a straight line
Suppose the true relationship is y = 2x + 1, but we don't know the 2 or the
1. Our model is pred = w*x + b, with two parameters: a slope w and an
intercept b. We'll start them at zero and let training discover w=2, b=1.
```python
import numpy as np

X = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * X + 1.0              # the truth we're trying to recover

w, b = 0.0, 0.0                # parameters: start knowing nothing
lr = 0.05                      # learning rate (a hyperparameter)

for step in range(1000):
    pred = w * X + b                   # 1. forward pass: the model's guess
    err = pred - y
    loss = np.mean(err ** 2)           # 2. loss: mean squared error
    grad_w = np.mean(2 * err * X)      # 3. gradient: which way is "wronger"
    grad_b = np.mean(2 * err)
    w -= lr * grad_w                   # 4. update: step downhill
    b -= lr * grad_b
    if step in (0, 200, 999):
        print(f"step {step:4d} loss {loss:8.4f} w {w:.4f} b {b:.4f}")

print("learned:", round(w, 3), round(b, 3))
```
Output:
```
step    0 loss  33.0000 w 1.4000 b 0.5000
step  200 loss   0.0000 w 2.0002 b 0.9994
step  999 loss   0.0000 w 2.0000 b 1.0000
learned: 2.0 1.0
```
The model started clueless (w=0, b=0, loss 33) and discovered w=2, b=1
purely by repeatedly nudging the parameters to shrink the loss. That four-step
loop (forward → loss → gradient → update) is the heartbeat of gradient-based
machine learning. Training a transformer runs exactly this loop; it just has
billions of parameters instead of two, and a fancier f.
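Where do the two gradient lines in the loop come from? With the mean squared error over $n$ points, the chain rule gives:

$$ L(w, b) = \frac{1}{n}\sum_{i}(w x_i + b - y_i)^2, \qquad \frac{\partial L}{\partial w} = \frac{2}{n}\sum_{i}(w x_i + b - y_i)\,x_i, \qquad \frac{\partial L}{\partial b} = \frac{2}{n}\sum_{i}(w x_i + b - y_i) $$

A quick way to convince yourself is to compare those formulas against a brute-force numerical slope. The sketch below does that; the helper names `loss_fn`, `analytic_grad`, and `numeric_grad` are made up for this example.

```python
import numpy as np

X = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * X + 1.0

def loss_fn(w, b):
    """Mean squared error of the line w*x + b on the data."""
    return np.mean((w * X + b - y) ** 2)

def analytic_grad(w, b):
    """The same formulas the training loop uses."""
    err = w * X + b - y
    return np.mean(2 * err * X), np.mean(2 * err)

def numeric_grad(w, b, eps=1e-6):
    """Central differences: nudge each parameter and measure the slope."""
    gw = (loss_fn(w + eps, b) - loss_fn(w - eps, b)) / (2 * eps)
    gb = (loss_fn(w, b + eps) - loss_fn(w, b - eps)) / (2 * eps)
    return gw, gb

print("analytic:", analytic_grad(0.0, 0.0))
print("numeric: ", numeric_grad(0.0, 0.0))
```

At the starting point w=0, b=0 both should come out at about -28 and -10. Multiply by `lr = 0.05`, subtract, and you get the w=1.4, b=0.5 printed at step 0.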
Inference vs. training
Once w and b are learned, inference is trivial — no loss, no gradients,
just the forward pass on new data:
```python
x_new = 10.0
print("prediction:", round(w * x_new + b, 3))   # rounded to hide floating-point noise
```

Output:

```
prediction: 21.0
```
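Training is the expensive part: for a large model it means millions of forward/loss/gradient/update steps over a huge dataset. Inference is cheap by comparison: one forward pass per prediction (per generated token, for a language model). This is why training a large model can cost millions of dollars while answering a single query costs a fraction of a cent.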
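In practice, inference usually means wrapping the forward pass in a function and calling it on whatever new inputs arrive. A minimal sketch, assuming `w` and `b` already hold the values the training loop converged to (hard-coded here so the snippet runs on its own; `predict` is a name chosen for this example):

```python
import numpy as np

# Assumed: the parameters found by training (w=2, b=1).
w, b = 2.0, 1.0

def predict(x):
    """Inference: forward pass only. No loss, no gradients, no updates."""
    return w * np.asarray(x) + b

# One call handles a whole batch of new inputs.
print(predict([10.0, -3.0, 0.5]))   # values 21, -5 and 2
```

Compare the work involved: the training loop ran the forward pass 1000 times and computed gradients each time, while each prediction here is one multiply and one add per input.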
Parameters vs. hyperparameters
Don't be confused. A parameter is learned by the model (w and b above). A hyperparameter is chosen by you and held fixed during training (the learning rate lr, the number of steps, how many layers). Rule of thumb: if gradient descent changes it, it's a parameter; if you change it, it's a hyperparameter. Picking good hyperparameters is "tuning."
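Tuning can be as simple as rerunning training with a few candidate values and comparing the result. The sketch below reuses the straight-line example; the `train` helper, the three learning rates, and the 100-step budget are arbitrary choices for illustration.

```python
import numpy as np

X = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * X + 1.0

def train(lr, steps=100):
    """Run gradient descent and return the final loss for this learning rate."""
    w, b = 0.0, 0.0                       # parameters: gradient descent changes these
    for _ in range(steps):
        err = w * X + b - y
        w -= lr * np.mean(2 * err * X)
        b -= lr * np.mean(2 * err)
    return np.mean((w * X + b - y) ** 2)

# Hyperparameter: we pick it, and it stays fixed within each run.
final_loss = {lr: train(lr) for lr in (0.001, 0.05, 0.2)}
for lr, loss in final_loss.items():
    print(f"lr {lr:<6} final loss {loss:.4g}")
```

With these particular values, the smallest rate is still far from converged after 100 steps, the middle one gets very close to zero loss, and the largest one overshoots on every step and blows up. Nothing inside the loop ever touches `lr`; only you do.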
What "supervised" means
The example above is supervised learning: every input x came with the
correct answer y, and the loss measured the gap. The three classic flavors:
- Supervised: learn from (input, correct-answer) pairs. Spam/not-spam, house-price prediction, image labels. Most of applied ML.
- Unsupervised: no labels; find structure. Clustering, dimensionality reduction, most of how embeddings are pre-trained.
- Reinforcement: learn from rewards by trial and error. Game-playing, robotics, RLHF (the alignment step in modern chatbots).
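To see what "no labels; find structure" looks like next to the supervised line fit, here is a rough sketch of one-dimensional clustering in the style of k-means. The data are invented for the example and there is no `y` anywhere: the algorithm only sees inputs.

```python
import numpy as np

# Inputs only. Nobody told us there are two groups, or which point is in which.
X = np.array([0.9, 1.0, 1.1, 4.9, 5.0, 5.1])

# Start the two cluster centers at the extremes of the data.
centers = np.array([X.min(), X.max()])

for _ in range(10):
    # Assign each point to its nearest center.
    assign = np.argmin(np.abs(X[:, None] - centers[None, :]), axis=1)
    # Move each center to the mean of the points assigned to it.
    centers = np.array([X[assign == k].mean() for k in range(2)])

print("centers:", np.round(centers, 3))
print("assignments:", assign)
```

The centers settle near 1 and 5, which is the structure hiding in the data. In the supervised example the loss compared predictions to known answers; here there are no answers to compare against, only distances between the inputs themselves.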
Parametric vs. non-parametric (a useful distinction)
Our line is parametric: a fixed set of parameters (w, b) summarizes the
data, and you can throw the data away afterward. A non-parametric model like
k-nearest-neighbors keeps the data and answers by looking up similar examples
at query time — no w to learn. Both are "models"; they just store what they
learned differently. (The sister HNSW and IVF-PQ books are entirely about
making that lookup fast.)
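A small sketch makes the contrast concrete. It runs a bare-bones k-nearest-neighbors regressor on the same five points as the line example; `knn_predict` and the choice `k=2` are assumptions made for illustration.

```python
import numpy as np

# The "model" is the stored data itself: nothing is fitted ahead of time.
X = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * X + 1.0

def knn_predict(x_new, k=2):
    """Average the answers of the k stored points closest to x_new."""
    dist = np.abs(X - x_new)                      # distance to every stored point
    nearest = np.argsort(dist, kind="stable")[:k] # indices of the k closest
    return y[nearest].mean()

print(knn_predict(2.5))    # neighbors x=2 and x=3 -> (5 + 7) / 2 = 6.0
print(knn_predict(10.0))   # neighbors x=4 and x=3 -> (9 + 7) / 2 = 8.0
```

Inside the range of the data, the lookup agrees with the line (2 * 2.5 + 1 = 6). Far outside it, the lookup can only echo the nearest stored answers and returns 8, while the parametric line extrapolates to 21. Note also the cost profile: every query scans all stored points, which is exactly the work that fast nearest-neighbor indexes try to avoid.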
The takeaway
A model is parameters + a forward function. Training minimizes a loss by gradient descent. Inference is just the forward pass. Everything else in this book (features, similarities, neural layers, transformers) is detail layered on top of this skeleton. Next we'll look hard at the objects these computations run on: tensors and their shapes.