Project 1 — Train a GPT from scratch
This is the project at the center of modern AI: a GPT, a decoder-only transformer that generates text by predicting the next token, over and over. It's the same architecture family behind GPT-4, Claude, and Llama. We build the whole thing (embeddings, masked self-attention, feed-forward layers, residuals, layer norm), train it on a CPU in about 12 seconds, and watch it generate coherent text. Scale this up and it's an LLM.
Full code: code/projects/gpt.py (PyTorch, CPU-friendly).
What a GPT does
One thing, repeatedly: predict the next token given all previous tokens (Chapter 15). We work at the character level so the vocabulary is tiny and it trains instantly; real models use subword tokens, but the mechanics are identical.
input: "to be or not to b" -> model predicts -> "e"
Train it to do this well on a corpus, then feed its own output back in, and it generates.
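To make the "predict the next character" setup concrete, here is a minimal sketch of character-level tokenization and one (context, target) training window. The corpus string and the window size `BLOCK = 8` are assumptions chosen for illustration; the actual corpus and settings in gpt.py are not shown in this chapter.

```python
# Illustrative corpus (an assumption; gpt.py may use different text).
text = "to be or not to be that is the question"

# Character-level vocabulary: one integer id per distinct character.
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}
vocab_size = len(chars)

def encode(s):
    """Map a string to a list of token ids."""
    return [stoi[c] for c in s]

def decode(ids):
    """Map a list of token ids back to a string."""
    return "".join(itos[i] for i in ids)

ids = encode(text)

BLOCK = 8  # assumed context length for this sketch
# The target is the input shifted by one character:
# at every position the model must predict the character that follows.
x = ids[0:BLOCK]
y = ids[1:BLOCK + 1]

print(vocab_size)
print(repr(decode(x)), "->", repr(decode(y)))
```

Each position in `x` is paired with the character that follows it in `y`, so a single window supplies `BLOCK` training examples at once.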
The architecture, piece by piece
Each piece is a concept you already met, now assembled into the real thing:
class GPT(nn.Module):
    def __init__(self):
        super().__init__()  # must run before submodules are assigned
        self.tok = nn.Embedding(VOCAB, N_EMBD)   # token → vector (Ch 14)
        self.pos = nn.Embedding(BLOCK, N_EMBD)   # position → vector
        self.blocks = nn.ModuleList(Block() for _ in range(N_LAYER))
        self.head = nn.Linear(N_EMBD, VOCAB)     # vector → next-token logits (Ch 6)
- Token embedding — each character becomes a learned vector (Chapter 14).
- Positional embedding — attention has no inherent sense of order, so we add a vector encoding each position. ("to be" ≠ "be to".)
- Transformer blocks — the heart, below.
- Head — projects the final vector to a logit per vocabulary item; softmax → next-character probabilities.
The transformer block: attention + MLP, each with a residual
class Block(nn.Module):
    def forward(self, x):
        T = x.size(1)
        # causal mask: True above the diagonal marks "future" positions to block
        mask = torch.triu(torch.ones(T, T, device=x.device), diagonal=1).bool()
        h = self.ln1(x)  # normalize once, reuse as query, key, and value
        a, _ = self.attn(h, h, h, attn_mask=mask, ...)
        x = x + a                      # residual connection
        x = x + self.mlp(self.ln2(x))  # residual connection
        return x
Three ideas make this work, all from earlier chapters:
- Masked (causal) self-attention: each position attends to itself and all earlier positions, but never to future ones (the upper-triangular mask). That mask is what makes it a generator: the prediction for the token at position $t$ can only use positions $< t$. The attention itself is $\mathrm{softmax}(QK^\top/\sqrt{d})\,V$ from Chapter 13.
- Residual connections (x = x + ...): add each sublayer's output back to its input, so gradients flow cleanly through deep stacks (Chapter 11).
- Layer norm: stabilizes training (Chapter 9).
That's the entire transformer. Stack a few blocks and you have GPT.
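The block's constructor is not shown above, so the following is only a sketch of what a complete block might look like, followed by a check that the causal mask does what the text claims. The use of `nn.MultiheadAttention` with `batch_first=True`, the 4x-wide GELU MLP, and the sizes `N_EMBD = 32` and `N_HEAD = 4` are all assumptions made for illustration, not necessarily what gpt.py does.

```python
import torch
import torch.nn as nn

N_EMBD, N_HEAD = 32, 4  # assumed sizes, chosen only for this sketch

class Block(nn.Module):
    def __init__(self):
        super().__init__()
        self.ln1 = nn.LayerNorm(N_EMBD)
        self.ln2 = nn.LayerNorm(N_EMBD)
        # Assumption: PyTorch's built-in attention, batch dimension first.
        self.attn = nn.MultiheadAttention(N_EMBD, N_HEAD, batch_first=True)
        # Assumption: a 4x-wide feed-forward layer with GELU.
        self.mlp = nn.Sequential(
            nn.Linear(N_EMBD, 4 * N_EMBD),
            nn.GELU(),
            nn.Linear(4 * N_EMBD, N_EMBD),
        )

    def forward(self, x):
        T = x.size(1)
        # True above the diagonal = position i may not attend to j > i.
        mask = torch.triu(torch.ones(T, T, device=x.device), diagonal=1).bool()
        h = self.ln1(x)
        a, _ = self.attn(h, h, h, attn_mask=mask, need_weights=False)
        x = x + a                      # residual around attention
        x = x + self.mlp(self.ln2(x))  # residual around the MLP
        return x

torch.manual_seed(0)
block = Block()

x = torch.randn(1, 6, N_EMBD)          # (batch, time, channels)
x_changed = x.clone()
x_changed[:, -1, :] = torch.randn(N_EMBD)  # alter only the last position

y, y_changed = block(x), block(x_changed)

# Causality: outputs at earlier positions must not depend on the last token.
earlier_unchanged = torch.allclose(y[:, :-1], y_changed[:, :-1], atol=1e-6)
last_changed = not torch.allclose(y[:, -1], y_changed[:, -1])
print(earlier_unchanged, last_changed)
```

Changing the final token leaves every earlier output untouched, which is exactly the property that lets the model be trained on all positions of a window in parallel and still be used left to right at generation time.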
Training it
The training loop is the canonical one from Chapter 12: get a batch of (context, next-char) windows, compute cross-entropy loss (Chapter 7) between predictions and the actual next chars, backprop, step.
logits = model(x)
loss = F.cross_entropy(logits.reshape(-1, VOCAB), y.reshape(-1))
opt.zero_grad(); loss.backward(); opt.step()
Run it:
$ python gpt.py
Output:
vocab=13 params=103821
step 0 loss 2.658
step 100 loss 0.063
step 200 loss 0.082
step 300 loss 0.051
step 400 loss 0.051
trained in 11.9s
sample: that is the question to be or not to be that is the question
A 104k-parameter transformer, trained in 12 seconds on a CPU, went from loss
2.66 (roughly chance level: $\ln 13 \approx 2.56$ is the loss of a uniform guess
over 13 characters) to 0.05, and now generates coherent text continuing the corpus.
With a corpus this small, a loss that low means the model has essentially memorized
it, which is fine for a demo. That sample line is the model writing,
character by character, by repeatedly sampling its own next-token distribution.
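The chance-level figure quoted above can be checked directly. This is a small sketch using only the standard library; the helper name `cross_entropy` is ours and operates on a plain probability list, unlike `F.cross_entropy`, which takes logits.

```python
import math

VOCAB = 13  # vocabulary size reported in the run output

def cross_entropy(probs, target):
    """Negative log-probability assigned to the correct token."""
    return -math.log(probs[target])

# A model that knows nothing spreads probability evenly over the vocabulary.
uniform = [1.0 / VOCAB] * VOCAB

# The loss is the same whichever character is the true target.
losses = [cross_entropy(uniform, t) for t in range(VOCAB)]
baseline = losses[0]

print(round(baseline, 3))
```

An untrained network is not exactly uniform, which is why the step-0 loss of 2.658 sits slightly above this baseline rather than on it.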
Generation: sampling one token at a time
@torch.no_grad()
def generate(self, idx, n, temperature=0.8):
    for _ in range(n):
        logits = self(idx[:, -BLOCK:])                               # only the last BLOCK chars fit
        probs = F.softmax(logits[:, -1, :] / temperature, dim=-1)
        idx = torch.cat([idx, torch.multinomial(probs, 1)], dim=1)   # sample & append
    return idx
This is how ChatGPT and Claude generate — one token at a time, each conditioned on
everything so far, with temperature controlling randomness
(Chapter 15). The [:, -BLOCK:] is the context
window in action: the model can only see its last BLOCK characters.
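The same loop can be sketched without PyTorch to isolate the three moving parts: cropping to the context window, temperature-scaled softmax, and sampling. Here a table of bigram counts stands in for the trained network; `toy_logits`, the corpus string, and `block=8` are hypothetical choices for illustration only.

```python
import math
import random

text = "to be or not to be that is the question "  # illustrative corpus
chars = sorted(set(text))
stoi = {c: i for i, c in enumerate(chars)}

# Hypothetical stand-in for the network: how often each character follows another.
counts = [[0] * len(chars) for _ in chars]
for a, b in zip(text, text[1:]):
    counts[stoi[a]][stoi[b]] += 1

def toy_logits(context):
    """Next-character logits from bigram counts (looks only at the last token)."""
    return [math.log(c + 1e-3) for c in counts[context[-1]]]

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; lower temperature gives a sharper distribution."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                      # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def generate(idx, n, temperature=0.8, block=8, seed=0):
    rng = random.Random(seed)
    idx = list(idx)
    for _ in range(n):
        context = idx[-block:]           # the context window: drop anything older
        probs = softmax(toy_logits(context), temperature)
        nxt = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        idx.append(nxt)                  # feed the sample back in
    return idx

out = generate([stoi["t"]], n=20)
print("".join(chars[i] for i in out))

# Temperature effect on a made-up logit vector.
sharp = softmax([2.0, 1.0, 0.0], temperature=0.5)
flat = softmax([2.0, 1.0, 0.0], temperature=2.0)
print(round(sharp[0], 3), round(flat[0], 3))
```

Dividing the logits by a temperature below 1 widens the gaps between them before the softmax, so the most likely token gets more probability mass; a temperature above 1 flattens the distribution and makes the output more varied.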
From this to a real LLM
The gap is mostly scale and data, not the core architecture. A frontier model is the same basic design with:
- Subword tokenization instead of characters (a ~100k-token vocabulary).
- Billions of parameters (more layers, bigger N_EMBD, more heads) instead of 104k.
- Trillions of tokens of training data instead of one sentence.
- Thousands of GPUs, mixed precision, and learning-rate schedules.
- Then fine-tuning (next chapter) and alignment (RLHF/DPO, Chapter 27) to make it helpful.
Every one of those is an engineering scale-up of what you just ran. The well-known nanoGPT repository is essentially this file with the engineering needed to train medium-sized models.
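To see how "more layers, bigger N_EMBD" turns 104k parameters into billions, here is a rough parameter-count sketch. It assumes a standard layout: biased linear layers, a 4x-wide MLP, two layer norms per block, one final layer norm, and an output head that is not tied to the token embedding. The chapter does not state gpt.py's hyperparameters; `VOCAB=13, BLOCK=32, N_EMBD=64, N_LAYER=2` is simply one configuration that reproduces the reported count under these assumptions.

```python
def gpt_param_count(vocab, block, n_embd, n_layer):
    """Approximate parameter count for a GPT with the assumed standard layout."""
    d = n_embd
    embeddings = vocab * d + block * d           # token + positional tables
    attn = (3 * d * d + 3 * d) + (d * d + d)     # Q,K,V projection + output projection
    mlp = (d * 4 * d + 4 * d) + (4 * d * d + d)  # expand to 4d, project back to d
    norms = 2 * (2 * d)                          # two layer norms (scale + shift)
    per_block = attn + mlp + norms               # = 12*d*d + 13*d
    final_norm = 2 * d
    head = d * vocab + vocab                     # untied output projection
    return embeddings + n_layer * per_block + final_norm + head

# One configuration consistent with the "params=103821" line (an inference, not a given).
small = gpt_param_count(vocab=13, block=32, n_embd=64, n_layer=2)

# A GPT-3-sized shape, using publicly reported dimensions.
large = gpt_param_count(vocab=50257, block=2048, n_embd=12288, n_layer=96)

print(small)
print(f"{large / 1e9:.0f}B")
```

The per-block cost grows as roughly `12 * N_EMBD**2`, so widening the model is what drives the count up fastest; the embedding tables and head become a rounding error at scale.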
Make it production
- Serve it behind an API and stream tokens (the Production ML & AI Tools book's FastAPI + LLM-serving chapters).
- Don't train your own foundation model — pretraining costs millions. In practice you take a pretrained model and adapt it, which is the next project.
- Evaluate with perplexity ($e^{\text{loss}}$) and downstream task metrics (Chapter 10).
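Perplexity is easiest to read as "the effective number of equally likely choices per token." The sketch below applies $e^{\text{loss}}$ to the two losses from the run output and to the uniform baseline; the function names are ours.

```python
import math

def mean_nll(correct_token_probs):
    """Average negative log-likelihood over the probabilities given to the true tokens."""
    return -sum(math.log(p) for p in correct_token_probs) / len(correct_token_probs)

def perplexity(loss):
    """Perplexity is the exponential of the mean cross-entropy loss (in nats)."""
    return math.exp(loss)

chance = perplexity(math.log(13))  # uniform guess over 13 characters
start = perplexity(2.658)          # step 0 loss from the run output
end = perplexity(0.051)            # final loss from the run output

print(round(chance, 2), round(start, 2), round(end, 2))

# The same number computed from per-token probabilities:
# a model that always gives the true token probability 0.5 has perplexity 2.
print(round(perplexity(mean_nll([0.5, 0.5, 0.5, 0.5])), 6))
```

A perplexity near 1 means the model is almost never unsure of the next character, which is what you would expect once a one-sentence corpus has been memorized.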
The takeaway
You built a GPT: token + positional embeddings, masked self-attention with residuals and layer norm, trained by next-token cross-entropy, generating one token at a time with temperature. It's 104k parameters and 12 seconds on a CPU, but it's the real architecture, and what separates it from a frontier model is mostly scale. Now let's adapt a pretrained model to a new task, efficiently, with LoRA.