Tokens, transformers & LLMs
Large language models are the reason most people care about AI today — and they are built entirely from the pieces in this book: tokens become embeddings (Chapter 14), attention mixes them (Chapter 13), a softmax over the vocabulary picks the next word (Chapter 6), and cross-entropy trains the whole thing (Chapter 7). This chapter assembles those parts into a working mental model of an LLM and decodes the vocabulary — tokens, context window, temperature, fine-tuning, RAG — so none of it is mysterious.
What an LLM actually does
Strip away the hype and a language model does one thing: predict the next token. Given some text, it outputs a probability distribution over its entire vocabulary for what comes next, samples one, appends it, and repeats. That's it. "Write me an email," "solve this bug," "summarize this" — all emerge from next-token prediction at scale. It's Chapter 6's softmax classifier, with a transformer computing the logits and a vocabulary of ~100k classes.
Step 1: Tokenization — text becomes integers
A model can't consume raw characters; it consumes integers. Tokenization splits text into tokens — usually subword chunks — and maps each to an ID. Subwords are the sweet spot between whole words (vocabulary too big, can't handle new words) and individual characters (sequences too long):
text = "unbelievable"
# a subword tokenizer might split it like this:
tokens = ["un", "believ", "able"] # 3 tokens, not 12 characters, not 1 word
Each token becomes an ID, each ID looks up an embedding vector, and that sequence of vectors is what the transformer actually processes. Practical consequences you must internalize:
- You pay per token, not per word. Roughly 1 token ≈ 0.75 English words, or ~4 characters. A 1,000-word email is ~1,300 tokens.
- Token boundaries cause famous quirks. "Why can't the model count the r's in
strawberry?" Because it sees tokens like
straw+berry, not letters. Many "dumb" LLM failures are really tokenization artifacts. - Rare words and code fragment into many tokens, inflating cost and context use.
Don't be confused: tokens vs. words vs. characters. A token is the model's atomic unit, typically a subword. It is neither a word nor a character. Budgeting context, latency, and API cost all happen in tokens — learn to think in them.
Step 2: The context window
The context window is the maximum number of tokens the model can attend to at once — its working memory. Everything (your prompt, the documents you paste, the conversation history, and the reply being generated) must fit. Modern windows range from a few thousand to millions of tokens.
Don't be confused: context window vs. training data vs. "memory." The context window is what the model sees right now, this request. Training data is what it learned from, months ago, and is baked into the weights. An LLM has no memory between calls — each API request is stateless; a chatbot "remembers" only because the app resends the prior conversation inside the context window every time. This single fact explains most confusion about what LLMs can and can't do.
Step 3: Generation — temperature, top-k, top-p
The model outputs logits over the vocabulary; a softmax turns them into probabilities; then a sampling strategy picks the next token. The key knob is temperature $T$, which divides the logits before the softmax:
$$ p_i = \text{softmax}(z_i / T) $$
import numpy as np
def softmax(z): z = z - z.max(); e = np.exp(z); return e / e.sum()
logits = np.array([2.0, 1.0, 0.5, 0.0]) # 4 candidate next-tokens
for T in [0.5, 1.0, 2.0]:
print(f"T={T}: {np.round(softmax(logits/T), 3)}")
Output:
T=0.5: [0.831 0.112 0.041 0.015]
T=1.0: [0.579 0.213 0.129 0.078]
T=2.0: [0.409 0.248 0.193 0.15 ]
- Low temperature (0.5) → distribution sharpens; the top token dominates → focused, deterministic, repetitive. Use for code, math, extraction.
- High temperature (2.0) → distribution flattens; long-shot tokens get a real chance → creative, varied, riskier. Use for brainstorming, fiction.
- Temperature 0 → always take the argmax (greedy) → fully deterministic.
Two companions trim the candidate pool before sampling: top-k keeps only the k most likely tokens; top-p (nucleus) keeps the smallest set whose probabilities sum to p. Both cut off the implausible tail so high temperature stays coherent.
Don't be confused: temperature ≠ "how smart" the model is. It only controls randomness of selection, not capability. High temperature doesn't make a model more creative-in-a-good-way so much as more random; too high and it produces incoherent text. For most precise tasks, lower is better.
How an LLM is built: pretraining → fine-tuning → alignment
- Pretraining — train next-token prediction on a huge text corpus. This is the expensive part (millions of dollars, months of GPU time) and produces a base model with broad knowledge but no instinct to be helpful.
- Fine-tuning — continue training on narrower data to specialize. Instruction tuning teaches it to follow instructions; domain fine-tuning teaches it your jargon.
- Alignment (RLHF/DPO) — use human preference data to make it helpful, honest, and harmless. This is the Chapter 1 reinforcement- learning flavor at work.
Building with LLMs: prompting vs. RAG vs. fine-tuning
You rarely train an LLM; you build on one. Three escalating options — know which to reach for:
| Approach | What it is | Use when |
|---|---|---|
| Prompting | craft the input text; few-shot examples | the model already knows enough |
| RAG | retrieve relevant docs, paste into the prompt | you need current or private facts |
| Fine-tuning | adjust the weights on your data | you need a consistent style/format or a narrow skill |
Don't be confused: RAG vs. fine-tuning — they solve different problems. RAG adds knowledge (facts the model can read at query time, kept fresh in a vector database). Fine-tuning adds behavior (tone, format, a skill baked into the weights). To make a model know your latest docs, use RAG, not fine-tuning — fine-tuning teaches how to act, not what's true today. Most production systems use RAG first and fine-tune only when style consistency demands it.
RAG is built entirely from this book: embed your documents (Chapter 14), index them for fast nearest-neighbor search (the HNSW/IVF-PQ books), retrieve the top matches by cosine similarity (Chapter 4), and stuff them into the context window. You now understand every layer of it.
A few more terms you'll meet
- Hallucination — the model states something false with confidence. It optimizes plausible next token, not truth — RAG and verification mitigate it.
- Parameters (e.g. "70B") — the count of learnable weights; loosely correlates with capability and cost.
- Quantization — store weights in fewer bits (int8/int4) to run big models on small hardware (the same compression spirit as the IVF-PQ book).
- Multimodal — models that take images/audio alongside text (GPT-4o, Claude), built by embedding all modalities into one space.
- Agents / tool use — let the model call functions and APIs in a loop, turning a text predictor into something that acts.
The takeaway
An LLM is a next-token predictor: text → tokens → embeddings → transformer attention → softmax over the vocabulary → sample (temperature controls randomness). It's stateless between calls; "memory" is the resent context window. Build on it with prompting, RAG (adds knowledge), or fine-tuning (adds behavior) — and RAG is just the embedding + nearest-neighbor stack you already know. Next, the unglamorous machinery that makes any of this trainable: the data pipeline. 👉