Features & feature extraction

A model never sees "a customer," "a sentence," or "a photo." It sees a row of numbers. A feature is one of those numbers — one measurable property of the thing. The row is a feature vector, and the whole table is the feature matrix X of shape (n_samples, n_features).

Feature extraction is the craft of turning raw, messy reality into that clean numeric table. It is, in practice, where most of the real work — and most of the accuracy — lives. "Garbage in, garbage out" is never more true than here.

Don't be confused: the three "feature ___" terms.

Feature extraction — create numeric features from raw data (pixels → edges, text → TF-IDF, audio → spectrogram). Changes the representation.

Feature engineering — hand-craft new features from existing ones (price_per_sqft = price / area). Uses domain knowledge.

Feature selection — pick a subset of existing features and drop the rest (remove redundant or useless columns). Reduces dimensionality.

Deep learning's headline trick is that it does feature extraction for you — the hidden layers learn features instead of you crafting them. That's what "representation learning" means (Chapter 14).

All code below is in code/features.py and prints exactly what's shown.

1. Numeric features — and why you must scale them

Raw numbers come on wildly different scales: age ∈ [0, 100], income ∈ [0, 1,000,000]. Many models (anything using distances, gradients, or regularization) will let the big-scale feature dominate purely because its numbers are bigger. Standardization fixes this: per column, subtract the mean and divide by the standard deviation, giving every feature mean 0 and std 1.

$$ z = \frac{x - \mu}{\sigma} $$

import numpy as np

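def standardize(X):
    X = np.asarray(X, dtype=float)
    # Per-column z-score; the tiny constant guards against division by zero
    # when a column is constant (std = 0).
    return (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)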

age_income = np.array([[25, 30000.0],
                       [40, 60000.0],
                       [55, 90000.0]])
print(standardize(age_income))

Output:

[[-1.225 -1.225]
 [ 0.     0.   ]
 [ 1.225  1.225]]

Now age and income are on the same footing — both range over the same small spread, so neither bullies the other. Note we used axis=0 (per column / per feature), exactly as Chapter 2 warned.
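To make the "bullying" concrete, here is a small sketch (not taken from code/features.py; the helper name age_share_of_distance is made up for illustration). It measures what fraction of the squared Euclidean distance between the first two rows comes from the age column, before and after standardizing.

```python
import numpy as np

def standardize(X):
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)

def age_share_of_distance(a, b):
    """Fraction of the squared Euclidean distance between a and b due to column 0 (age)."""
    sq_diff = (np.asarray(a, dtype=float) - np.asarray(b, dtype=float)) ** 2
    return sq_diff[0] / sq_diff.sum()

age_income = np.array([[25, 30000.0],
                       [40, 60000.0],
                       [55, 90000.0]])

# Raw data: a 15-year age gap sits next to a 30,000 income gap.
raw_share = age_share_of_distance(age_income[0], age_income[1])

# Standardized data: both gaps are measured in standard deviations.
Z = standardize(age_income)
scaled_share = age_share_of_distance(Z[0], Z[1])

print(f"raw:    age share of squared distance = {raw_share:.1e}")
print(f"scaled: age share of squared distance = {scaled_share:.2f}")
```

On the raw numbers, age accounts for roughly 2.5e-07 of the squared distance, so a distance-based model effectively ignores it. After standardizing, the two features contribute equally (a share of 0.50).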

Don't be confused: normalization vs standardization. People use these loosely, but the usual meanings are:

Standardization → z-score: mean 0, std 1. Unbounded output. The default for tabular features.

Min-max normalization → rescale to a fixed range like [0, 1].

L2 normalization → scale each row to unit length (for cosine similarity; Chapter 4).

They answer different needs. When someone says "normalize," ask "to unit length, to [0,1], or to mean-0/std-1?"
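The other two meanings can be sketched in a few lines each. This is an illustrative sketch rather than part of code/features.py: the function names min_max_scale and l2_normalize and the sample matrix are assumptions chosen for the example.

```python
import numpy as np

def min_max_scale(X):
    """Rescale each column (feature) to the range [0, 1]."""
    X = np.asarray(X, dtype=float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / (hi - lo + 1e-12)   # small constant guards constant columns

def l2_normalize(X):
    """Scale each row (sample) to unit Euclidean length."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / (norms + 1e-12)            # small constant guards all-zero rows

X = np.array([[3.0, 4.0],
              [6.0, 8.0],
              [0.0, 10.0]])

print(min_max_scale(X))   # works per column: axis=0
print(l2_normalize(X))    # works per row: axis=1
```

Note the difference in axis: min-max (like standardization) works down each column, while L2 normalization works across each row. Here the first two rows point in the same direction, so after L2 normalization they become the same vector, [0.6, 0.8].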

The cardinal rule: fit the scaler on the training set only, then apply those same numbers to validation/test. Computing the mean and standard deviation over all the data leaks test information into training, a classic and silent mistake.
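One way to make the rule hard to break is to split scaling into a "fit" step and an "apply" step. The sketch below assumes that design; fit_scaler and apply_scaler are hypothetical names, and the single test row is invented for illustration.

```python
import numpy as np

def fit_scaler(X_train):
    """Learn per-column mean and std from the training data only."""
    X_train = np.asarray(X_train, dtype=float)
    return X_train.mean(axis=0), X_train.std(axis=0) + 1e-12

def apply_scaler(X, mean, std):
    """Apply previously learned statistics; never recompute them here."""
    return (np.asarray(X, dtype=float) - mean) / std

X_train = np.array([[25, 30000.0],
                    [40, 60000.0],
                    [55, 90000.0]])
X_test = np.array([[70, 120000.0]])   # made-up unseen row

mean, std = fit_scaler(X_train)               # statistics come from train only
Z_train = apply_scaler(X_train, mean, std)
Z_test = apply_scaler(X_test, mean, std)      # same numbers reused on test

print(np.round(Z_test, 3))
```

The test row lands at about 2.449 standard deviations above the training mean in both columns. That is expected: test data need not have mean 0 and std 1 after the transform, because the statistics were never computed from it.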

2. Categorical features — one-hot encoding

A category like color ∈ {red, green, blue} is not a number. Encoding it as red=0, green=1, blue=2 is a trap: it tells the model blue > red and green is "between" them, which is nonsense. One-hot encoding instead gives each category its own 0/1 column:

import numpy as np

def one_hot_encode(values):
    # Sorted vocabulary gives every category a stable column index.
    vocab = sorted(set(values))
    index = {v: i for i, v in enumerate(vocab)}
    out = np.zeros((len(values), len(vocab)))
    # Put a single 1.0 in each row, in the column of that row's category.
    for row, v in enumerate(values):
        out[row, index[v]] = 1.0
    return out, vocab

M, vocab = one_hot_encode(["red", "blue", "red", "green"])
print("vocab:", vocab)
print(M)

Output:

vocab: ['blue', 'green', 'red']
[[0. 0. 1.]
 [1. 0. 0.]
 [0. 0. 1.]
 [0. 1. 0.]]

Each row has a single 1 marking its category. No false ordering, no fake distances.

When not to one-hot: if a category has thousands of values (user IDs, product IDs), one-hot makes an absurdly wide, sparse matrix. There you use an embedding — a learned, dense vector per category — which is exactly the bridge to neural recommenders (Chapter 14). And ordinal categories with a real order (small < medium < large) can be integer-coded on purpose.
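Both alternatives can be sketched briefly. This is an illustration, not code from code/features.py: ordinal_encode is a hypothetical helper, the user IDs are invented, and the embedding table is filled with random numbers purely to show the lookup. In a real model those rows would be learned during training.

```python
import numpy as np

def ordinal_encode(values, order):
    """Map each value to its rank in an explicitly given order."""
    rank = {level: i for i, level in enumerate(order)}
    return np.array([rank[v] for v in values])

sizes = ["medium", "small", "large", "small"]
codes = ordinal_encode(sizes, order=["small", "medium", "large"])
print(codes)   # [1 0 2 0]

# High-cardinality case: an embedding is a lookup table with one dense row
# per category. The values here are random placeholders, not learned weights.
user_ids = ["u17", "u03", "u17"]
vocab = sorted(set(user_ids))
index = {u: i for i, u in enumerate(vocab)}

rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), 4))   # 4 dimensions, chosen arbitrarily

vectors = embedding_table[[index[u] for u in user_ids]]
print(vectors.shape)   # (3, 4)
```

Passing the order explicitly matters: sorting the strings alphabetically would give large < medium < small, which is the wrong ranking. In the lookup, the first and third rows are identical because they belong to the same user.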

3. Text features — bag-of-words and TF-IDF

Text is the classic feature-extraction problem: variable-length strings must become fixed-length vectors. The oldest, still-useful idea is the bag of words — count how often each vocabulary word appears, ignoring order.

docs = ["the cat sat",
        "the dog sat",
        "the cat and the dog"]
# vocab (sorted): ['and', 'cat', 'dog', 'sat', 'the']

The count matrix (rows = documents, columns = words):

[[0. 1. 0. 1. 1.]      # "the cat sat"
 [0. 0. 1. 1. 1.]      # "the dog sat"
 [1. 1. 1. 0. 2.]]     # "the cat and the dog"  -> "the" appears twice
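A minimal sketch of how such a count matrix could be built follows. It assumes the simplest possible tokenizer (lowercase, split on whitespace), and bag_of_words is an illustrative name rather than a function confirmed to exist in code/features.py.

```python
import numpy as np

def bag_of_words(docs):
    """Return (count matrix, sorted vocabulary) for a list of strings."""
    # Assumption: lowercase + whitespace split is enough for this toy corpus.
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({word for tokens in tokenized for word in tokens})
    index = {word: i for i, word in enumerate(vocab)}

    counts = np.zeros((len(docs), len(vocab)))
    for row, tokens in enumerate(tokenized):
        for word in tokens:
            counts[row, index[word]] += 1.0   # one more occurrence of this word
    return counts, vocab

docs = ["the cat sat",
        "the dog sat",
        "the cat and the dog"]

counts, vocab = bag_of_words(docs)
print(vocab)
print(counts)
```

Every document, whatever its length, becomes a row with len(vocab) entries. That fixed width is what lets variable-length text feed a model that expects a feature matrix.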

The problem: "the" appears in every document, so it dominates the counts while carrying no discriminating signal. TF-IDF (Term Frequency × Inverse Document Frequency) fixes this by down-weighting words that appear in many documents:

$$ \text{tfidf}(t, d) = \underbrace{\text{count}(t, d)}_{\text{term frequency}} \times \underbrace{\left(\log\frac{1 + N}{1 + \text{df}(t)} + 1\right)}_{\text{inverse doc. frequency}} $$

where $N$ is the number of documents, $\text{df}(t)$ is how many documents contain term $t$, and $\log$ is the natural logarithm. Rows are then L2-normalized. The result:

[[0.    0.62  0.    0.62  0.481]
 [0.    0.    0.62  0.62  0.481]
 [0.531 0.404 0.404 0.    0.627]]

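Compare column 4 ("the"). In the first two documents every word has a raw count of 1, yet "the" ends up at 0.481 while cat, dog, and sat get 0.62: "the" appears in every document, so its IDF takes the minimum value, log(4/4) + 1 = 1, against about 1.29 for a word found in two of the three documents. In the third document "the" still has the largest weight (0.627) because it occurs twice, but its lead over "and" (0.531, one occurrence) is far smaller than the 2:1 raw counts suggest. TF-IDF is the workhorse behind classic search ranking and a strong baseline for text classification, and it's the direct ancestor of the dense embeddings used today.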
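The formula and the row normalization fit in a few lines of NumPy. The sketch below starts from the count matrix shown earlier and uses the natural logarithm; the function name tfidf is an assumption for illustration.

```python
import numpy as np

def tfidf(counts):
    """Smoothed TF-IDF with L2-normalized rows, following the formula above."""
    counts = np.asarray(counts, dtype=float)
    n_docs = counts.shape[0]

    df = (counts > 0).sum(axis=0)                  # documents containing each term
    idf = np.log((1 + n_docs) / (1 + df)) + 1.0    # natural log; minimum value is 1
    weighted = counts * idf                        # term frequency x IDF

    norms = np.linalg.norm(weighted, axis=1, keepdims=True)
    return weighted / (norms + 1e-12)              # each row scaled to unit length

# Columns: ['and', 'cat', 'dog', 'sat', 'the']
counts = np.array([[0, 1, 0, 1, 1],
                   [0, 0, 1, 1, 1],
                   [1, 1, 1, 0, 2]], dtype=float)

weights = np.round(tfidf(counts), 3)
print(weights)
```

Rounded to three decimals, this gives the same values as the matrix above.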

The feature-extraction checklist

For any new dataset, walk these questions:

Per column, what type is it? Numeric, categorical, ordinal, text, date, image, ID?
Numeric → scale it (standardize by default).
Categorical → one-hot if low-cardinality, embed if high-cardinality.
Text → TF-IDF for a baseline, sentence-embeddings for semantics.
Dates → extract day_of_week, month, is_weekend, time-since-event.
Missing values → decide: drop, fill with mean/median, or add an "is-missing" flag (often the missingness itself is signal).
Fit transforms on train only, apply to test. Always.
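The date and missing-value items can be sketched with the standard library and NumPy. Everything named here is illustrative: date_features and impute_with_flag are hypothetical helpers, and the sample date, reference event, and ages are invented.

```python
from datetime import date
import numpy as np

def date_features(d, reference):
    """Expand one date into numeric features; `reference` is the event to count from."""
    return {
        "day_of_week": d.weekday(),               # Monday = 0 ... Sunday = 6
        "month": d.month,
        "is_weekend": int(d.weekday() >= 5),      # Saturday or Sunday
        "days_since_event": (d - reference).days,
    }

feats = date_features(date(2024, 3, 9), reference=date(2024, 1, 1))
print(feats)

def impute_with_flag(column):
    """Fill NaNs with the median and return an extra 0/1 'is-missing' column."""
    column = np.asarray(column, dtype=float)
    missing = np.isnan(column)
    # In a real pipeline this median would be computed on the training set only.
    fill = np.median(column[~missing])
    return np.where(missing, fill, column), missing.astype(float)

ages = [25.0, np.nan, 55.0, 40.0]
filled, was_missing = impute_with_flag(ages)
print(filled)        # [25. 40. 55. 40.]
print(was_missing)   # [0. 1. 0. 0.]
```

The flag column lets the model learn from the missingness itself, while the filled column stays fully numeric so it can be scaled like any other feature.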

The takeaway

A feature is a number describing one property; feature extraction builds the numeric table the model trains on. Scale numerics, one-hot small categories, embed big ones, TF-IDF text, and never let test data leak into your transforms. Now that we have feature vectors, the next question is the one that powers search, recommendation, and clustering: how do we measure how similar two vectors are?