Features & feature extraction
A model never sees "a customer," "a sentence," or "a photo." It sees a row of
numbers. A feature is one of those numbers — one measurable property of the
thing. The row is a feature vector, and the whole table is the feature
matrix X of shape (n_samples, n_features).
Feature extraction is the craft of turning raw, messy reality into that clean numeric table. It is, in practice, where most of the real work — and most of the accuracy — lives. "Garbage in, garbage out" is never more true than here.
Don't be confused: the three "feature ___" terms.
- Feature extraction: create numeric features from raw data (pixels → edges, text → TF-IDF, audio → spectrogram). Changes the representation.
- Feature engineering: hand-craft new features from existing ones (price_per_sqft = price / area). Uses domain knowledge.
- Feature selection: pick a subset of existing features and drop the rest (remove redundant or useless columns). Reduces dimensionality.
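To make the difference between engineering and selection concrete, here is a minimal sketch on a made-up table. The `houses` array, its column meanings, and the zero-variance rule for dropping a column are all illustrative assumptions, not something taken from code/features.py.

```python
import numpy as np

# Hypothetical toy table: columns are price, area, and a constant country code.
houses = np.array([[300000.0, 1500.0, 1.0],
                   [450000.0, 1800.0, 1.0],
                   [200000.0, 1000.0, 1.0]])

# Feature engineering: derive a new column from existing ones.
price_per_sqft = houses[:, 0] / houses[:, 1]
engineered = np.column_stack([houses, price_per_sqft])

# Feature selection: keep only columns that vary (a constant column is useless).
keep = engineered.std(axis=0) > 0
selected = engineered[:, keep]

print(price_per_sqft)   # [200. 250. 200.]
print(selected.shape)   # (3, 3)
```

Engineering added a fourth column and selection removed the constant one, so the table ends up with three columns again, but more informative ones.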
Deep learning's headline trick is that it does feature extraction for you — the hidden layers learn features instead of you crafting them. That's what "representation learning" means (Chapter 14).
All code below is in code/features.py and prints exactly
what's shown.
1. Numeric features — and why you must scale them
Raw numbers come on wildly different scales: age ∈ [0, 100], income ∈ [0,
1,000,000]. Many models (anything using distances, gradients, or regularization)
will let the big-scale feature dominate purely because its numbers are bigger.
Standardization fixes this: per column, subtract the mean and divide by the
standard deviation, giving every feature mean 0 and std 1.
$$ z = \frac{x - \mu}{\sigma} $$
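import numpy as np

def standardize(X):
    """Z-score each column: subtract its mean, divide by its std."""
    X = np.asarray(X, dtype=float)
    # The tiny epsilon avoids division by zero for constant columns.
    return (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)

age_income = np.array([[25, 30000.0],
                       [40, 60000.0],
                       [55, 90000.0]])
print(standardize(age_income))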
Output:
[[-1.225 -1.225]
 [ 0.     0.   ]
 [ 1.225  1.225]]
Now age and income are on the same footing — both range over the same
small spread, so neither bullies the other. Note we used axis=0 (per column /
per feature), exactly as Chapter 2 warned.
Don't be confused: normalization vs standardization. People use these loosely, but the usual meanings are:
- Standardization → z-score: mean 0, std 1. Unbounded output. The default for tabular features.
- Min-max normalization → rescale to a fixed range like [0, 1].
- L2 normalization → scale each row to unit length (for cosine similarity; Chapter 4).
They answer different needs. When someone says "normalize," ask "to unit length, to [0,1], or to mean-0/std-1?"
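The other two meanings of "normalize" take only a few lines each. This is a rough sketch; the function names `min_max_scale` and `l2_normalize` and the small epsilon guard are assumptions made for illustration.

```python
import numpy as np

def min_max_scale(X):
    """Rescale each column (feature) to the range [0, 1]."""
    X = np.asarray(X, dtype=float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / (hi - lo + 1e-12)   # epsilon guards constant columns

def l2_normalize(X):
    """Scale each row (sample) to unit Euclidean length."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / (norms + 1e-12)            # epsilon guards all-zero rows

X = np.array([[3.0, 4.0],
              [6.0, 8.0],
              [9.0, 0.0]])
print(np.round(min_max_scale(X), 3))
print(np.round(l2_normalize(X), 3))
```

Note the axes: min-max works per column (axis=0), while L2 normalization works per row (axis=1). After L2 normalization the first two rows become identical, since they point in the same direction and differ only in length.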
The cardinal rule: fit the scaler on the training set only, then apply those same numbers to validation/test. Computing the mean over all data leaks test information into training — a classic, silent mistake.
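One way to honor that rule is to split scaling into a "fit" step and an "apply" step. The sketch below assumes two hypothetical helpers, `fit_scaler` and `apply_scaler`, and a made-up test row; it is not the chapter's own code.

```python
import numpy as np

def fit_scaler(X_train):
    """Learn per-column mean and std from the training set only."""
    X_train = np.asarray(X_train, dtype=float)
    return X_train.mean(axis=0), X_train.std(axis=0) + 1e-12

def apply_scaler(X, mean, std):
    """Apply previously learned statistics to any split."""
    return (np.asarray(X, dtype=float) - mean) / std

X_train = np.array([[25, 30000.0],
                    [40, 60000.0],
                    [55, 90000.0]])
X_test = np.array([[70, 120000.0]])      # hypothetical unseen row

mean, std = fit_scaler(X_train)          # statistics come from train only
Z_train = apply_scaler(X_train, mean, std)
Z_test = apply_scaler(X_test, mean, std) # reuse the same numbers
print(np.round(Z_test, 3))
```

The test row is not forced to mean 0 and std 1. It lands wherever the training statistics put it (here about 2.45 standard deviations above the training mean in both columns), which is exactly the point.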
2. Categorical features — one-hot encoding
A category like color ∈ {red, green, blue} is not a number. Encoding it as
red=0, green=1, blue=2 is a trap: it tells the model blue > red and green is
"between" them, which is nonsense. One-hot encoding instead gives each category
its own 0/1 column:
import numpy as np

def one_hot_encode(values):
    # Sorted vocabulary gives a stable column order.
    vocab = sorted(set(values))
    index = {v: i for i, v in enumerate(vocab)}
    out = np.zeros((len(values), len(vocab)))
    for row, v in enumerate(values):
        out[row, index[v]] = 1.0   # mark this row's category
    return out, vocab

M, vocab = one_hot_encode(["red", "blue", "red", "green"])
print("vocab:", vocab)
print(M)
Output:
vocab: ['blue', 'green', 'red']
[[0. 0. 1.]
 [1. 0. 0.]
 [0. 0. 1.]
 [0. 1. 0.]]
Each row has a single 1 marking its category. No false ordering, no fake
distances.
When not to one-hot: if a category has thousands of values (user IDs, product IDs), one-hot makes an absurdly wide, sparse matrix. There you use an embedding, a learned dense vector per category, which is exactly the bridge to neural recommenders (Chapter 14). And ordinal categories with a real order (small < medium < large) can be integer-coded on purpose.
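Integer-coding an ordinal category is safe only when the order is stated explicitly rather than left to alphabetical sorting. A minimal sketch, assuming a hypothetical helper named `ordinal_encode`:

```python
def ordinal_encode(values, order):
    """Map each value to its rank in an explicitly given order."""
    rank = {level: i for i, level in enumerate(order)}
    # An unseen value raises KeyError, which surfaces bad data early.
    return [rank[v] for v in values]

sizes = ["medium", "small", "large", "small"]
codes = ordinal_encode(sizes, order=["small", "medium", "large"])
print(codes)  # [1, 0, 2, 0]
```

Sorting the labels alphabetically would have produced large < medium < small, which is the wrong order. Passing `order` by hand is where the domain knowledge enters.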
3. Text features — bag-of-words and TF-IDF
Text is the classic feature-extraction problem: variable-length strings must become fixed-length vectors. The oldest, still-useful idea is the bag of words — count how often each vocabulary word appears, ignoring order.
docs = ["the cat sat",
        "the dog sat",
        "the cat and the dog"]
# vocab (sorted): ['and', 'cat', 'dog', 'sat', 'the']
The count matrix (rows = documents, columns = words):
[[0. 1. 0. 1. 1.]   # "the cat sat"
 [0. 0. 1. 1. 1.]   # "the dog sat"
 [1. 1. 1. 0. 2.]]  # "the cat and the dog" -> "the" appears twice
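The count matrix above can be reproduced with a short function. This is a sketch that assumes plain whitespace tokenization with no lowercasing or punctuation handling; `bag_of_words` is an illustrative name and may differ from what code/features.py uses.

```python
import numpy as np

def bag_of_words(docs):
    """Count how often each vocabulary word appears in each document."""
    tokenized = [doc.split() for doc in docs]          # naive whitespace split
    vocab = sorted({word for tokens in tokenized for word in tokens})
    index = {word: i for i, word in enumerate(vocab)}
    counts = np.zeros((len(docs), len(vocab)))
    for row, tokens in enumerate(tokenized):
        for word in tokens:
            counts[row, index[word]] += 1.0
    return counts, vocab

docs = ["the cat sat",
        "the dog sat",
        "the cat and the dog"]
counts, vocab = bag_of_words(docs)
print(vocab)
print(counts)
```

Every document becomes a row of the same length (the vocabulary size), no matter how many words it contains. That is what "fixed-length vector" means here.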
The problem: the appears everywhere, so it dominates the counts while carrying
no discriminating signal. TF-IDF (Term Frequency × Inverse Document
Frequency) fixes this by down-weighting words that appear in many documents:
$$ \text{tfidf}(t, d) = \underbrace{\text{count}(t, d)}_{\text{term frequency}} \times \underbrace{\left(\log\frac{1 + N}{1 + \text{df}(t)} + 1\right)}_{\text{inverse doc. frequency}} $$
where $N$ is the number of documents, $\text{df}(t)$ is how many documents contain term $t$, and $\log$ is the natural logarithm. Each row is then L2-normalized. The result:
[[0.    0.62  0.    0.62  0.481]
 [0.    0.    0.62  0.62  0.481]
 [0.531 0.404 0.404 0.    0.627]]
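Compare the last column (the). In the first two documents every word occurs
once, yet the gets 0.481 while cat, dog, and sat get 0.62: the appears in every
document, so its IDF is the minimum possible (1.0), whereas words found in only
two documents get about 1.29. In the third document the still scores highest
(0.627) because it occurs twice, but its lead over cat and dog (0.404) shrinks
from 2:1 in raw counts to about 1.55:1. TF-IDF is the workhorse behind classic
search ranking and a strong baseline for text classification, and it's a
conceptual ancestor of the dense embeddings used today.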
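The formula translates almost line for line into NumPy. Here is a minimal sketch that starts from the count matrix shown earlier; the function name `tfidf` and the zero-norm guard are assumptions added for illustration.

```python
import numpy as np

def tfidf(counts):
    """Weight raw counts by smoothed IDF, then L2-normalize each row."""
    counts = np.asarray(counts, dtype=float)
    n_docs = counts.shape[0]
    df = (counts > 0).sum(axis=0)                    # documents containing each term
    idf = np.log((1 + n_docs) / (1 + df)) + 1.0      # natural log, as in the formula
    weighted = counts * idf
    norms = np.linalg.norm(weighted, axis=1, keepdims=True)
    return weighted / np.where(norms == 0, 1.0, norms)  # leave empty rows as zeros

# Columns: and, cat, dog, sat, the
counts = np.array([[0, 1, 0, 1, 1],
                   [0, 0, 1, 1, 1],
                   [1, 1, 1, 0, 2]])
weights = tfidf(counts)
print(np.round(weights, 3))
```

Rounded to three decimals, this gives the same weights as the matrix above. The IDF here is computed from the same documents it is applied to; with a train/test split it would be computed on the training documents only.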
The feature-extraction checklist
For any new dataset, walk these questions:
- Per column, what type is it? Numeric, categorical, ordinal, text, date, image, ID?
- Numeric → scale it (standardize by default).
- Categorical → one-hot if low-cardinality, embed if high-cardinality.
- Text → TF-IDF for a baseline, sentence-embeddings for semantics.
- Dates → extract day_of_week, month, is_weekend, time-since-event.
- Missing values → decide: drop, fill with mean/median, or add an "is-missing" flag (often the missingness itself is signal).
- Fit transforms on train only, apply to test. Always.
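The date and missing-value items on the checklist can be sketched in a few lines. The helper names `date_features` and `impute_with_flag`, the reference date, and the sample values are all hypothetical; the fill value is computed from the training column only, in keeping with the last rule.

```python
from datetime import date
import numpy as np

def date_features(d, reference):
    """Return [day_of_week, month, is_weekend, days_since_event]."""
    dow = d.weekday()                       # Monday=0 ... Sunday=6
    return [dow, d.month, int(dow >= 5), (reference - d).days]

def impute_with_flag(column, fill_value):
    """Fill NaNs and append a 0/1 'is-missing' indicator column."""
    column = np.asarray(column, dtype=float)
    missing = np.isnan(column)
    filled = np.where(missing, fill_value, column)
    return np.column_stack([filled, missing.astype(float)])

# 2024-03-09 was a Saturday, six days before the reference date.
feats = date_features(date(2024, 3, 9), reference=date(2024, 3, 15))
print(feats)  # [5, 3, 1, 6]

train_income = np.array([30000.0, np.nan, 90000.0])
fill = np.nanmedian(train_income)           # statistic from train only
imputed = impute_with_flag(train_income, fill)
print(imputed)
```

The flag column keeps the information that a value was missing even after the gap has been filled, so a model can still use missingness as a signal.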
The takeaway
A feature is a number describing one property; feature extraction builds the numeric table the model trains on. Scale numerics, one-hot small categories, embed big ones, TF-IDF text, and never let test data leak into your transforms. Now that we have feature vectors, the next question is the one that powers search, recommendation, and clustering: how do we measure how similar two vectors are?