What is a model, really?

Strip away the mystique and a model is a function with adjustable numbers inside it. You feed it an input, it produces an output, and the "adjustable numbers" are what get tuned during training so the outputs come out right.

$$ \text{output} = f(\text{input};\; \theta) $$

That $\theta$ ("theta") is the bag of adjustable numbers — the parameters. Training is the search for good values of $\theta$.
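To make the notation concrete, here is a minimal sketch of $f(\text{input};\; \theta)$ as ordinary Python. The linear form and the names `model` and `theta` are assumptions chosen for illustration; any function with adjustable numbers inside it would serve equally well.

```python
def model(x, theta):
    """A model is just a function of the input and its parameters."""
    w, b = theta              # unpack the adjustable numbers
    return w * x + b

x = 3.0
print(model(x, (0.0, 0.0)))   # one choice of theta    -> 0.0
print(model(x, (2.0, 1.0)))   # another choice of theta -> 7.0
```

The function itself never changes; only the numbers in `theta` do, and that is the part training searches over.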

The five words that define the whole field

| Word | Plain meaning |
|---|---|
| Parameters (weights) | the numbers inside the model that training adjusts |
| Training (fitting) | the process of adjusting parameters to reduce error |
| Loss | a single number measuring how wrong the model currently is |
| Inference (prediction) | running a trained model on new input |
| Hyperparameters | knobs you set that aren't learned (e.g. learning rate) |

Learn these five and you can follow almost any ML conversation. Let's make them concrete with the smallest possible model.

The smallest model: a straight line

Suppose the true relationship is y = 2x + 1, but we don't know the 2 or the 1. Our model is pred = w*x + b, with two parameters: a slope w and an intercept b. We'll start them at zero and let training discover w=2, b=1.

import numpy as np

X = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * X + 1.0          # the truth we're trying to recover

w, b = 0.0, 0.0            # parameters: start knowing nothing
lr = 0.05                  # learning rate (a hyperparameter)

for step in range(1000):
    pred = w * X + b               # 1. forward pass: the model's guess
    err = pred - y
    loss = np.mean(err ** 2)       # 2. loss: mean squared error
    grad_w = np.mean(2 * err * X)  # 3. gradient: which way is "wronger"
    grad_b = np.mean(2 * err)
    w -= lr * grad_w               # 4. update: step downhill
    b -= lr * grad_b
    if step in (0, 200, 999):
        print(f"step {step:4d}  loss {loss:8.4f}  w {w:.4f}  b {b:.4f}")

print("learned:", round(w, 3), round(b, 3))

Output:

step    0  loss  33.0000  w 1.4000  b 0.5000
step  200  loss   0.0000  w 2.0002  b 0.9994
step  999  loss   0.0000  w 2.0000  b 1.0000
learned: 2.0 1.0

The model started clueless (w=0, b=0, loss 33) and discovered w=2, b=1 purely by repeatedly nudging the parameters to shrink the loss. That four-step loop (forward → loss → gradient → update) is the heartbeat of gradient-based machine learning. A transformer runs the same loop; it just has billions of parameters instead of two, and a fancier f.
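The two gradient lines in the loop come from differentiating the mean squared error. With $\text{err}_i = w x_i + b - y_i$, the chain rule gives $\partial L / \partial w = \text{mean}(2\,\text{err}_i\,x_i)$ and $\partial L / \partial b = \text{mean}(2\,\text{err}_i)$. One way to convince yourself is a finite-difference check, sketched below in plain Python. The helper names `loss_fn`, `analytic_grads`, and `numeric_grads` and the step size `eps` are illustrative assumptions, not part of the training code above.

```python
X = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [2.0 * x + 1.0 for x in X]

def loss_fn(w, b):
    """Mean squared error of the line w*x + b on the data."""
    return sum((w * x + b - t) ** 2 for x, t in zip(X, y)) / len(X)

def analytic_grads(w, b):
    """Gradients from the chain rule, matching the training loop."""
    errs = [w * x + b - t for x, t in zip(X, y)]
    grad_w = sum(2 * e * x for e, x in zip(errs, X)) / len(X)
    grad_b = sum(2 * e for e in errs) / len(X)
    return grad_w, grad_b

def numeric_grads(w, b, eps=1e-6):
    """Central differences: nudge each parameter and watch the loss."""
    grad_w = (loss_fn(w + eps, b) - loss_fn(w - eps, b)) / (2 * eps)
    grad_b = (loss_fn(w, b + eps) - loss_fn(w, b - eps)) / (2 * eps)
    return grad_w, grad_b

print(analytic_grads(0.0, 0.0))   # exact: (-28.0, -10.0)
print(numeric_grads(0.0, 0.0))    # should agree to several decimal places
```

Both gradients are negative at the starting point, so subtracting `lr * grad` increases w and b, which is the move from (0, 0) to (1.4, 0.5) seen at step 0.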

Inference vs. training

Once w and b are learned, inference is trivial — no loss, no gradients, just the forward pass on new data:

x_new = 10.0
print(f"prediction: {w * x_new + b:.1f}")   # forward pass only; rounded for display

Output:

prediction: 21.0

Training is the expensive part: a large model may need millions of forward/loss/gradient/update steps. Inference is cheap by comparison because it needs only the forward pass. This is why training a large model can cost millions of dollars while answering a single query typically costs a fraction of a cent.

Parameters vs. hyperparameters

These two are easy to mix up. A parameter is learned by the model (w, b above). A hyperparameter is chosen by you and held fixed during training (the learning rate lr, the number of steps, how many layers). Rule of thumb: if gradient descent changes it, it's a parameter; if you change it, it's a hyperparameter. Picking good hyperparameters is "tuning."
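To see why the choice matters, the sketch below reruns the straight-line fit with three learning rates and a fixed budget of 50 steps. The function name `train` and the values 0.001 and 0.2 are assumptions picked for illustration; on this particular dataset the smallest rate makes slow progress and the largest one overshoots so badly that the loss grows.

```python
X = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [2.0 * x + 1.0 for x in X]

def train(lr, steps):
    """Gradient descent on the line model; lr and steps are hyperparameters."""
    w, b = 0.0, 0.0                      # parameters: changed by gradient descent
    n = len(X)
    for _ in range(steps):
        errs = [w * x + b - t for x, t in zip(X, y)]
        grad_w = sum(2 * e * x for e, x in zip(errs, X)) / n
        grad_b = sum(2 * e for e in errs) / n
        w -= lr * grad_w
        b -= lr * grad_b
    final_loss = sum((w * x + b - t) ** 2 for x, t in zip(X, y)) / n
    return w, b, final_loss

for lr in (0.001, 0.05, 0.2):            # hyperparameter: changed by you
    w, b, loss = train(lr, steps=50)
    print(f"lr={lr}: w={w:.3g}, b={b:.3g}, loss={loss:.3g}")
```

Nothing inside `train` ever modifies `lr` or `steps`; the only way to get a better value is to rerun with a different setting and compare, which is all "tuning" means.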

What "supervised" means

The example above is supervised learning: every input x came with the correct answer y, and the loss measured the gap. The three classic flavors:

  • Supervised — learn from (input, correct-answer) pairs. Spam/not-spam, house-price prediction, image labels. Most of applied ML.
  • Unsupervised — no labels; find structure. Clustering, dimensionality reduction, most of how embeddings are pre-trained.
  • Reinforcement — learn from rewards by trial and error. Game-playing, robotics, RLHF (the alignment step in modern chatbots).
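As a contrast with the supervised line fit, here is a toy unsupervised example: a one-dimensional clustering with two centers, in the style of k-means. There is no y anywhere; the algorithm only looks at how the inputs group together. The data, the function name `two_means`, and the min/max initialization are all assumptions made for this sketch, and it expects at least two distinct values.

```python
points = [0.8, 1.0, 1.2, 4.8, 5.0, 5.2]    # inputs only: no labels

def two_means(xs, iters=10):
    """Toy 1-D k-means with k=2: alternate assignment and averaging."""
    c0, c1 = min(xs), max(xs)               # crude initial centers
    for _ in range(iters):
        # Assignment step: each point joins its nearest center.
        near0 = [x for x in xs if abs(x - c0) <= abs(x - c1)]
        near1 = [x for x in xs if abs(x - c0) > abs(x - c1)]
        # Update step: each center moves to the mean of its points.
        c0 = sum(near0) / len(near0)
        c1 = sum(near1) / len(near1)
    return c0, c1

print(two_means(points))   # two centers, one near each clump of points
```

The "structure" found here is simply that the points form two clumps, something no one told the algorithm in advance.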

Parametric vs. non-parametric (a useful distinction)

Our line is parametric: a fixed set of parameters (w, b) summarizes the data, and you can throw the data away afterward. A non-parametric model like k-nearest-neighbors keeps the data and answers by looking up similar examples at query time — no w to learn. Both are "models"; they just store what they learned differently. (The sister HNSW and IVF-PQ books are entirely about making that lookup fast.)
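A minimal sketch of the non-parametric side, using the same five points as the line fit: the "model" is the stored data plus a lookup rule. Averaging the targets of the k closest points is one common k-nearest-neighbors variant for regression; the function name `knn_predict` and the choice k=2 are assumptions for illustration.

```python
X = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [2.0 * x + 1.0 for x in X]            # the stored data is the model

def knn_predict(x_new, k=2):
    """Average the targets of the k stored points closest to x_new."""
    nearest = sorted(zip(X, y), key=lambda pair: abs(pair[0] - x_new))[:k]
    return sum(target for _, target in nearest) / k

print(knn_predict(2.4))    # neighbors x=2 and x=3 -> 6.0
print(knn_predict(10.0))   # neighbors x=4 and x=3 -> 8.0
```

Note the second result: the fitted line predicts 21 at x=10, while this lookup can only return an average of targets it has already seen. Storing what was learned as data rather than as w and b changes how the model behaves outside the range of its examples.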

The takeaway

A model is parameters + a forward function. Training minimizes a loss by gradient descent. Inference is just the forward pass. Everything else in this book — features, similarities, neural layers, transformers — is detail layered on top of this skeleton. Next we'll look hard at the objects these computations run on: tensors and their shapes.