Embeddings & representation learning

Embedding is one of the most-used and least-defined words in modern AI. It's the bridge between "raw discrete stuff" (words, users, products, images) and "vectors a model can do math on." Master this chapter and a huge swath of AI — search, RAG, recommendation, LLMs — clicks into place, because they all run on embeddings.

The one-sentence definition

An embedding is a learned, dense vector that represents something, arranged so that similar things land near each other in the vector space.

Contrast with the encodings from Chapter 3:

One-hot [0,0,1,0,...] — sparse, huge, and every word is equidistant from every other (no notion of similarity). "cat" is as far from "kitten" as from "spreadsheet."
Embedding [0.2, -0.7, 0.1, ...] — dense (say 300–1536 numbers), and "cat" sits near "kitten" because the geometry encodes meaning.

The embedding learns meaning; one-hot just assigns slots.
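The "equidistant" claim about one-hot vectors is easy to check numerically. The sketch below is illustrative only: the three-word vocabulary and the variable names are made up for the example, and it defines its own small cosine helper.

```python
import numpy as np

vocab = ["cat", "kitten", "spreadsheet"]   # toy vocabulary (illustrative)
one_hot = np.eye(len(vocab))               # row i is the one-hot vector for vocab[i]

def cos(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

cat, kitten, spreadsheet = one_hot

# Every pair of distinct one-hot vectors has the same similarity and distance.
print("cos(cat, kitten):      ", cos(cat, kitten))
print("cos(cat, spreadsheet): ", cos(cat, spreadsheet))
print("dist(cat, kitten):     ", np.linalg.norm(cat - kitten))
print("dist(cat, spreadsheet):", np.linalg.norm(cat - spreadsheet))
```

Both cosines come out as 0 and both distances as √2, so one-hot geometry cannot tell "kitten" from "spreadsheet". Only a learned embedding can pull related words together.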

Meaning becomes geometry

Here's the magic, with a toy 4-dimensional "meaning space" whose axes you can read as royalty, male, female, fruit:

import numpy as np

# Axes of the toy space: [royalty, male, female, fruit]
emb = {
    "king":  np.array([1., 1, 0, 0]),
    "queen": np.array([1., 0, 1, 0]),
    "man":   np.array([0., 1, 0, 0]),
    "woman": np.array([0., 0, 1, 0]),
    "apple": np.array([0., 0, 0, 1]),
}

def cos(a, b):
    """Cosine similarity: dot product divided by the product of the lengths."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print("cos(king, queen):", round(cos(emb['king'], emb['queen']), 3))
print("cos(king, apple):", round(cos(emb['king'], emb['apple']), 3))

Output:

cos(king, queen): 0.5
cos(king, apple): 0.0

king and queen are related (both royalty) → positive cosine; king and apple share nothing → cosine 0. Similarity in meaning became similarity in geometry — exactly what Chapter 4 measures.

The famous trick: analogies are vector arithmetic

Because directions in a good embedding space carry meaning, you can do algebra on concepts. The legendary example: king − man + woman ≈ queen.

v = emb['king'] - emb['man'] + emb['woman']     # remove "male", add "female"
print("king - man + woman =", v)
sims = {k: round(cos(v, emb[k]), 3) for k in emb if k not in ('king','man','woman')}
print("nearest words:", sims)

Output:

king - man + woman = [1. 0. 1. 0.]
nearest words: {'queen': 1.0, 'apple': 0.0}

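Subtracting man and adding woman moved us exactly to queen. The match is exact here only because the toy axes were hand-picked; real embeddings (word2vec, GloVe) land near queen rather than on it, hence the ≈. What makes them remarkable is that they learned this structure from raw text alone, with no one ever labeling "royalty" or "gender": the directions emerged from the training objective. That emergence is the deep idea of this chapter.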

Representation learning: features the model invents

Don't be confused: feature engineering vs. representation learning. In Chapter 3 you hand-built features (TF-IDF, ratios). Representation learning is when the model discovers the features — embeddings are learned, not designed. This is the headline difference between classical ML and deep learning: deep nets turn raw data into useful representations automatically. An embedding is literally the network's learned internal feature vector.

Where do embeddings come from? They're just a lookup table of vectors (in PyTorch, an nn.Embedding layer) whose entries are parameters trained by gradient descent, same as any weight. The training objective shapes them:

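word2vec / GloVe: word2vec learns by predicting a word from its neighbors (or the neighbors from the word), while GloVe fits word co-occurrence counts. Either way, words used in similar contexts get similar vectors ("you shall know a word by the company it keeps").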
Sentence/text embeddings (Sentence-BERT, OpenAI/Cohere embeddings) — map whole sentences to one vector so semantic search works on meaning, not keywords.
Recommender embeddings — one vector per user and per item, learned so that user · item predicts a rating (the matrix-factorization idea; see the recsys book).
Image embeddings (CLIP, ResNet features) — map images to vectors; CLIP even puts images and text in the same space, so you can search images by text.
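To make the "lookup table" idea concrete, here is a minimal NumPy sketch of what an embedding layer does on the forward pass. The names (`vocab`, `table`, `embed`) and the sizes are assumptions for illustration. A real layer such as nn.Embedding also tracks gradients so the rows can be trained, which this sketch leaves out.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = {"cat": 0, "kitten": 1, "spreadsheet": 2}   # token -> row index (toy vocabulary)
dim = 4                                              # embedding dimension (illustrative)

# The "embedding layer": one row of parameters per token, randomly initialized.
# Training would adjust these rows; here they stay random.
table = rng.normal(size=(len(vocab), dim))

def embed(tokens):
    """Look up one row of the table per token."""
    ids = [vocab[t] for t in tokens]
    return table[ids]

vecs = embed(["kitten", "cat"])
print(vecs.shape)   # one row of length dim per input token

# A lookup is equivalent to multiplying a one-hot vector by the table.
one_hot_cat = np.eye(len(vocab))[vocab["cat"]]
print(np.allclose(one_hot_cat @ table, embed(["cat"])[0]))
```

The last line shows why embeddings are a strict upgrade over one-hot encoding: the table is the weight matrix that a one-hot input would select a row from, so the lookup just skips the wasted multiplications.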

Why embeddings power half of modern AI

Once everything — words, documents, users, products, images — is a vector in a shared space, a single toolkit solves an enormous range of tasks:

Semantic search — embed the query, find nearest-neighbor document vectors.
RAG (retrieval-augmented generation) — embed your documents, retrieve the relevant ones at query time, feed them to an LLM. This is the dominant pattern for building on top of LLMs.
Recommendation — recommend items whose vectors are near the user's vector.
Clustering / deduplication / classification — all become geometry on vectors.
The input to every LLM — the first thing a transformer does is embed each token.

And finding those nearest neighbors fast, over millions of vectors, is exactly what the sister HNSW and IVF-PQ books build. This chapter is why those books exist.
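Before reaching for an approximate index, it helps to see the exact baseline those indexes speed up. The following is a brute-force sketch of nearest-neighbor search by cosine similarity. The document titles, the 3-dimensional vectors, and the `top_k` helper are all hypothetical; a real system would get its vectors from an embedding model.

```python
import numpy as np

def top_k(query, doc_vecs, k=2):
    """Return indices and cosine scores of the k documents most similar to query."""
    q = query / np.linalg.norm(query)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q                      # cosine similarity to every document
    order = np.argsort(-scores)[:k]     # highest score first
    return order, scores[order]

# Hypothetical document embeddings (made-up numbers, not model output).
docs = ["royal family news", "fruit salad recipe", "history of monarchs"]
doc_vecs = np.array([
    [0.9, 0.1, 0.0],
    [0.0, 0.1, 0.9],
    [0.8, 0.3, 0.1],
])
query_vec = np.array([1.0, 0.2, 0.0])   # pretend embedding of "kings and queens"

idx, scores = top_k(query_vec, doc_vecs)
for i, s in zip(idx, scores):
    print(f"{s:.3f}  {docs[i]}")
```

This compares the query against every stored vector, so the cost grows linearly with the collection size. That is fine for thousands of vectors and is the reason approximate indexes exist for millions.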

Don't be confused: "embedding" the noun vs. the verb, and the dimension. "An embedding" = the resulting vector. "To embed" = the act of producing it. "Embedding dimension" = how long the vector is (e.g. 768, 1536), a hyperparameter trading expressiveness for cost. Also: a token embedding (one token, typically a word or a piece of a word) and a sentence embedding (one whole sentence) are different granularities of the same idea.

A practical note: normalize before you compare

Embeddings are usually compared with cosine similarity, so a common convention is to L2-normalize them (Chapter 3). For unit-length vectors, cosine similarity is a plain dot product, which lets vector databases use fast inner-product search. If your retrieval results look random, the first thing to check is whether both the query and the stored vectors are normalized the same way.
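A small numeric check of both claims, as a sketch with made-up 2-dimensional vectors (the `l2_normalize` helper is an assumption for this example, not a library function): after normalization the dot product equals the cosine, and skipping normalization can flip a ranking.

```python
import numpy as np

def l2_normalize(x):
    """Scale a vector (or each row of a matrix) to unit length."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

a = np.array([3.0, 4.0])
b = np.array([1.0, 0.0])

cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
dot_after_norm = l2_normalize(a) @ l2_normalize(b)
print(cosine, dot_after_norm)   # the two values agree

stored = np.array([[10.0, 10.0],    # long vector, 45 degrees away from the query
                   [1.0, 0.0]])     # short vector, same direction as the query
query = np.array([1.0, 0.0])

raw_scores = stored @ query                               # unnormalized inner product
cos_scores = l2_normalize(stored) @ l2_normalize(query)   # cosine via normalized dot product

print("raw ranking:   ", np.argsort(-raw_scores))   # the long vector wins on length alone
print("cosine ranking:", np.argsort(-cos_scores))   # the aligned vector wins
```

The raw inner product rewards vector length, so the long but misaligned vector ranks first. Once both sides are normalized, only direction matters and the aligned vector wins.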

The takeaway

An embedding is a learned dense vector where geometric closeness = semantic similarity; directions carry meaning, so concept arithmetic works. Representation learning means the model invents these features instead of you engineering them. Embeddings turn words, users, items, and images into one common vector language, the substrate of search, RAG, recommendation, and LLMs. Speaking of which: how do those LLMs actually turn text into tokens and tokens into answers?

AI Foundations in Depth