The classical ML algorithm zoo

Here's a secret the deep-learning hype hides: for tabular data — the spreadsheets that run most companies — gradient-boosted trees still beat neural networks, and they're what you'll be asked about in interviews far more than transformers. This chapter fills the biggest gap between Parts II and III: the classical algorithms every ML engineer must know cold — k-NN, Naive Bayes, decision trees, random forests, gradient boosting, and SVMs. We build the core of each from scratch so the intuition sticks.

Where this fits: read this any time after Chapter 10 (you need losses and metrics first). It's placed here as an "advanced part" so the deep-learning spine stayed unbroken, but conceptually it sits right alongside linear models.

k-Nearest Neighbors (k-NN) — lazy, simple, surprisingly strong

The simplest classifier there is: to label a new point, find the k closest training points and take a majority vote. No training — it just memorizes the data (a non-parametric, instance-based method).

import numpy as np
Xtr = np.array([[1,1], [1.2,0.9], [3,3], [3.2,2.8]]); ytr = np.array([0,0,1,1])

def knn_predict(q, k=3):
    d = np.sqrt(((Xtr - q) ** 2).sum(1))      # distance to every training point
    nearest = d.argsort()[:k]                 # indices of the k closest
    return np.bincount(ytr[nearest]).argmax() # majority vote

print("near cluster 0:", knn_predict(np.array([1.1, 1.0])))
print("near cluster 1:", knn_predict(np.array([3.1, 3.0])))

Output:

near cluster 0: 0
near cluster 1: 1

Pros: zero training, no assumptions, naturally multi-class, a great baseline.
Cons: slow at prediction time (compares to all data — this is exactly what the HNSW and IVF-PQ sister books accelerate), needs feature scaling (Chapter 3), and degrades in high dimensions (the curse of dimensionality).
The one hyperparameter, k: small k → low bias, high variance (jagged boundary); large k → smoother, higher bias. Classic bias–variance.

Don't be confused: k-NN vs. k-means. Both have a "k" and both use distances, but k-NN is supervised classification (k = how many neighbors vote); k-means is unsupervised clustering (Chapter 21) (k = how many clusters). Totally different algorithms — a favorite interview trap.

Naive Bayes — probability with a bold assumption

Apply Bayes' theorem (Chapter 22) to classify, assuming every feature is independent given the class — the "naive" part. That assumption is usually false, yet it works astonishingly well, especially for text (spam filters were built on it):

$$ P(\text{class} \mid \text{features}) \propto P(\text{class}) \prod_i P(\text{feature}_i \mid \text{class}) $$

Pros: blazingly fast, needs little data, a strong text baseline.
Cons: the independence assumption hurts when features are correlated; probabilities are poorly calibrated (treat the ranking, not the exact number).
Variants: Multinomial (word counts), Bernoulli (binary features), Gaussian (continuous features).

Decision trees — the if/else machine

A decision tree asks a sequence of yes/no questions about features, splitting the data until each leaf is (nearly) one class. To pick each split, it searches for the question that most reduces impurity, measured by Gini impurity (or entropy):

$$ \text{Gini} = 1 - \sum_c p_c^2 \quad (\text{0 = pure, higher = mixed}) $$

def gini(labels):
    p = np.bincount(labels) / len(labels)
    return 1 - (p ** 2).sum()

parent = np.array([0,0,1,1,1,1])
print("impurity before split:", round(gini(parent), 3))
print("after a perfect split :", gini(np.array([0,0])), gini(np.array([1,1,1,1])))

Output:

impurity before split: 0.444
after a perfect split : 0.0 0.0

The split drove impurity from 0.444 to 0 — a clean separation. The tree greedily repeats this at every node.

Pros: interpretable ("why? because feature X > 5"), needs no scaling, handles non-linear boundaries and mixed feature types.
Cons: a single deep tree overfits badly (it can memorize the training set). The fix is to combine many trees — which gives us the two most important tabular algorithms in the field.

Don't be confused: Gini vs. entropy. Both measure node impurity and give nearly identical trees; Gini is slightly faster (no log). Don't agonize over the choice — interviewers want to know you understand impurity, not which formula you picked.

Random forests — many trees, bagged (variance ↓)

A random forest trains hundreds of decision trees, each on a random bootstrap sample of the rows and a random subset of features at each split, then averages their votes. This technique — training models on resampled data and averaging — is called bagging (bootstrap aggregating), and it crushes the single tree's variance:

Pros: robust, hard to overfit, little tuning, gives feature importances, a fantastic default for tabular data.
Cons: larger and slower than one tree, less interpretable than a single tree.
Key idea: the trees' errors are decorrelated (different rows, different features), so averaging cancels them out. Bagging reduces variance.

Gradient boosting — many trees, sequential (bias ↓)

The other ensemble, and the king of tabular ML: instead of averaging independent trees, gradient boosting builds trees sequentially, each one correcting the errors (the residuals) of the trees so far. Here's the entire idea — fit each new stump to what's left over, add a shrunken version of it:

x = np.linspace(0, 1, 20); y = np.where(x > 0.5, 1.0, 0.0)   # a step to learn
pred = np.zeros_like(y); lr = 0.3
for _ in range(50):
    residual = y - pred                       # what we still get wrong
    # fit the best decision stump (1 split) to the residual:
    best = (1e9, None, 0, 0)
    for t in x:
        l, r = residual[x <= t], residual[x > t]
        lm, rm = (l.mean() if len(l) else 0), (r.mean() if len(r) else 0)
        sse = ((l-lm)**2).sum() + ((r-rm)**2).sum()
        if sse < best[0]: best = (sse, t, lm, rm)
    _, t, lm, rm = best
    pred = pred + lr * np.where(x <= t, lm, rm)   # add the shrunken correction
print("MSE after 50 boosting rounds:", round(float(((y - pred)**2).mean()), 4))

Output:

MSE after 50 boosting rounds: 0.0

Fifty weak stumps, each fixing the last's mistakes, combined into a perfect fit. That's boosting: a sequence of weak learners summed into a strong one — the same "reduce the residual" spirit as gradient descent (Chapter 8), which is why it's called gradient boosting.

You will not implement this in practice — you'll use a library:

Library	Why it's famous
XGBoost	the Kaggle-winning workhorse; fast, regularized, battle-tested
LightGBM	faster on big data (histogram splits, leaf-wise growth)
CatBoost	best-in-class handling of categorical features

Don't be confused: bagging (random forest) vs. boosting (XGBoost). Both are tree ensembles, but: bagging trains trees independently and in parallel on resampled data, then averages → reduces variance (fixes overfitting). Boosting trains trees sequentially, each fixing the last's errors → reduces bias (fixes underfitting) but can overfit if over-trained. Rule of thumb: random forest is the safe default; tuned gradient boosting is usually the winner. This is one of the most common ML interview questions — know it cold.

Support Vector Machines (SVM) — the max-margin classifier

An SVM finds the decision boundary with the widest margin — the largest gap between the two classes. Only the points on the edge of that gap (the support vectors) matter. Combined with the kernel trick from Chapter 5, an SVM draws curved boundaries by implicitly mapping data into a higher-dimensional space (the RBF kernel is the usual choice). Its loss is the hinge loss from Chapter 7.

Pros: strong on small/medium datasets, effective in high dimensions, elegant theory.
Cons: doesn't scale to huge datasets, sensitive to the C and gamma hyperparameters, no native probabilities. Largely superseded by gradient boosting for tabular data and neural nets for perception — but still interview canon.

The decision guide

Your situation	Reach for
Tabular data, want the best score	gradient boosting (XGBoost/LightGBM)
Tabular data, want a robust default	random forest
Need interpretability	a single decision tree or logistic regression
Text classification, fast baseline	Naive Bayes
Small data, clear margin	SVM
Simplest possible baseline / similarity-based	k-NN
Images, audio, language	neural networks (Part III)

The takeaway

Classical ML is not obsolete — it rules tabular data and ML interviews. k-NN votes by distance; Naive Bayes multiplies independent feature probabilities; a decision tree splits to reduce impurity; random forests bag trees to cut variance; gradient boosting sequences trees to cut bias (and wins Kaggle); SVMs maximize the margin. Know bagging-vs-boosting and k-NN-vs-k-means cold. Next, the unsupervised side: finding structure with no labels at all. 👉

AI Foundations in Depth