Project 1 — Train a GPT from scratch

This is the project at the center of modern AI: a GPT, a decoder-only transformer that generates text by predicting the next token, over and over. It is the same architectural family behind GPT-4, Claude, and Llama. We build the whole thing (embeddings, masked self-attention, feed-forward layers, residuals, layer norm), train it on a CPU in about 12 seconds, and watch it generate coherent text. Scale this up and it's an LLM.

Full code: code/projects/gpt.py (PyTorch, CPU-friendly).

What a GPT does

One thing, repeatedly: predict the next token given all previous tokens (Chapter 15). We work at the character level so the vocabulary is tiny and training takes seconds; real models use subword tokens, but the mechanics are identical.

input:  "to be or not to b"   ->  model predicts ->  "e"

Train it to do this well on a corpus, then feed its own output back in, and it generates.
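
As a rough illustration of what "a corpus" turns into before training, here is a minimal sketch in plain Python. It assumes a one-sentence toy corpus; the names `text`, `stoi`, `itos`, `encode`, `decode`, and `make_pairs` are illustrative and are not taken from gpt.py.

```python
# Illustrative only: build a character vocabulary and (context, target) windows.
text = "to be or not to be that is the question"  # assumed toy corpus

chars = sorted(set(text))                      # unique characters = the vocabulary
stoi = {ch: i for i, ch in enumerate(chars)}   # char -> integer id
itos = {i: ch for ch, i in stoi.items()}       # integer id -> char


def encode(s):
    """Turn a string into a list of integer token ids."""
    return [stoi[c] for c in s]


def decode(ids):
    """Turn a list of integer token ids back into a string."""
    return "".join(itos[i] for i in ids)


def make_pairs(ids, block):
    """Slide a window over ids: x is `block` tokens, y is the same window shifted by one."""
    pairs = []
    for start in range(len(ids) - block):
        x = ids[start : start + block]
        y = ids[start + 1 : start + block + 1]
        pairs.append((x, y))
    return pairs


ids = encode(text)
pairs = make_pairs(ids, block=8)
x, y = pairs[0]
print(len(chars))                                # vocabulary size
print(repr(decode(x)), "->", repr(decode(y)))    # each target is the next character
```

Every position in a window supplies its own training example, so one window of 8 characters yields 8 next-character predictions.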

The architecture, piece by piece

Each piece is a concept you already met, now assembled into the real thing:

class GPT(nn.Module):
    def __init__(self):
        super().__init__()                        # required before registering submodules
        self.tok = nn.Embedding(VOCAB, N_EMBD)    # token → vector (Ch 14)
        self.pos = nn.Embedding(BLOCK, N_EMBD)    # position → vector
        self.blocks = nn.ModuleList(Block() for _ in range(N_LAYER))
        self.head = nn.Linear(N_EMBD, VOCAB)      # vector → next-token logits (Ch 6)
  • Token embedding — each character becomes a learned vector (Chapter 14).
  • Positional embedding — attention has no inherent sense of order, so we add a vector encoding each position. ("to be" ≠ "be to".)
  • Transformer blocks — the heart, below.
  • Head — projects the final vector to a logit per vocabulary item; softmax → next-character probabilities.
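
To make the role of the positional embedding concrete, the sketch below adds a token vector and a position vector by hand in plain Python. It is not the PyTorch code from gpt.py: the tables are random rather than learned, and the tiny sizes and the names `tok_emb`, `pos_emb`, and `embed` are assumptions for illustration.

```python
import random

random.seed(0)
N_EMBD = 4          # tiny embedding width, for illustration only
BLOCK = 8           # maximum context length
vocab = ["b", "e", "o", "t", " "]


def rand_vec(n):
    """A random vector standing in for a learned embedding row."""
    return [random.gauss(0.0, 1.0) for _ in range(n)]


tok_emb = {ch: rand_vec(N_EMBD) for ch in vocab}     # one vector per token
pos_emb = [rand_vec(N_EMBD) for _ in range(BLOCK)]   # one vector per position


def embed(s):
    """Input to the first block: token vector plus position vector, elementwise."""
    return [
        [t + p for t, p in zip(tok_emb[ch], pos_emb[i])]
        for i, ch in enumerate(s)
    ]


a = embed("to be")
b = embed("be to")
# Same characters, different order: "t" at position 0 differs from "t" at position 3.
print(a[0] != b[3])
# The space sits at position 2 in both strings, so its input vector is identical.
print(a[2] == b[2])
```

Without the position term, the two strings would present the same set of vectors to attention, just in a different order that attention cannot detect on its own.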

The transformer block: attention + MLP, each with a residual

class Block(nn.Module):
    def forward(self, x):
        T = x.size(1)                   # sequence length
        # True above the diagonal marks future positions as blocked
        mask = torch.triu(torch.ones(T, T), diagonal=1).bool()   # causal mask
        h = self.ln1(x)                 # normalize once, reuse as query, key, and value
        a, _ = self.attn(h, h, h, attn_mask=mask, ...)
        x = x + a                       # residual connection
        x = x + self.mlp(self.ln2(x))   # residual connection
        return x

Three ideas make this work, all from earlier chapters:

  • Masked (causal) self-attention: each position attends to itself and all earlier positions, never to future ones (the upper-triangular mask blocks them). That mask is what makes the model a generator: the prediction for position $t$ can only use positions $< t$. The attention itself is softmax(QKᵀ/√d)·V from Chapter 13.
  • Residual connections (x = x + ...): add each sublayer's output back to its input, so gradients flow cleanly through deep stacks (Chapter 11).
  • Layer norm: applied to the input of each sublayer (ln1, ln2) to stabilize training (Chapter 9).

That's the entire transformer. Stack a few blocks and you have GPT.
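
The causal property is easy to check by hand. The following is a minimal single-head sketch in plain Python, assuming no learned projections (queries, keys, and values are all the raw inputs) and using a hypothetical helper named `causal_attention`. Skipping future positions in the loop has the same effect as setting their scores to negative infinity before the softmax.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(q, k, v):
    """Single-head softmax(QK^T / sqrt(d)) V where position t sees only positions <= t."""
    d = len(q[0])
    out = []
    for t in range(len(q)):
        # Scores against positions 0..t only; later positions are masked out.
        scores = [
            sum(qi * ki for qi, ki in zip(q[t], k[s])) / math.sqrt(d)
            for s in range(t + 1)
        ]
        weights = softmax(scores)
        out.append([
            sum(w * v[s][j] for s, w in enumerate(weights))
            for j in range(len(v[0]))
        ])
    return out


x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
base = causal_attention(x, x, x)

# Change only the last position: the outputs at earlier positions must not move.
x2 = [x[0], x[1], [5.0, -3.0]]
changed = causal_attention(x2, x2, x2)
print(base[:2] == changed[:2])   # earlier positions are unaffected
print(base[2] != changed[2])     # the changed position itself does differ
```

This independence from the future is what lets one forward pass train every position in the window at once.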

Training it

The training loop is the canonical one from Chapter 12: get a batch of (context, next-char) windows, compute cross-entropy loss (Chapter 7) between predictions and the actual next chars, backprop, step.

logits = model(x)
loss = F.cross_entropy(logits.reshape(-1, VOCAB), y.reshape(-1))
opt.zero_grad(); loss.backward(); opt.step()
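
For intuition about the number that `F.cross_entropy` reports, here is a hand-rolled version for a single position in plain Python. It is a sketch, not the library implementation; the function name `cross_entropy` and the example logits are assumptions. The PyTorch call additionally averages this quantity over every position in the batch.

```python
import math


def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits), in nats."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


VOCAB = 13

# A model with no preference among 13 characters pays ln(13) per prediction.
uniform = [0.0] * VOCAB
print(round(cross_entropy(uniform, 3), 3))    # 2.565, i.e. ln(13)

# A model that puts most of its mass on the correct character pays almost nothing.
confident = [0.0] * VOCAB
confident[3] = 8.0
print(cross_entropy(confident, 3) < 0.01)
```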

Run it:

$ python gpt.py

Output:

vocab=13  params=103821
step   0  loss 2.658
step 100  loss 0.063
step 200  loss 0.082
step 300  loss 0.051
step 400  loss 0.051
trained in 11.9s
sample: that is the question to be or not to be that is the question

A 104k-parameter transformer, trained in 12 seconds on a CPU, went from loss 2.66 (about what random guessing gives: $\ln 13 \approx 2.56$) to 0.05, and now generates coherent text continuing the corpus. With a corpus this small, a loss near zero means the model has essentially memorized the text. That sample line is the model writing, character by character, by repeatedly sampling its own next-token distribution.

Generation: sampling one token at a time

@torch.no_grad()
def generate(self, idx, n, temperature=0.8):
    for _ in range(n):
        logits = self(idx[:, -BLOCK:])                  # only the last BLOCK chars fit
        probs = F.softmax(logits[:, -1, :] / temperature, dim=-1)
        idx = torch.cat([idx, torch.multinomial(probs, 1)], dim=1)  # sample & append
    return idx

This is how ChatGPT and Claude generate — one token at a time, each conditioned on everything so far, with temperature controlling randomness (Chapter 15). The [:, -BLOCK:] is the context window in action: the model can only see its last BLOCK characters.
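
The effect of temperature and of the context crop can be seen without a trained network. Below is a minimal sketch in plain Python in which `toy_logits` is a hypothetical stand-in for the model (it simply prefers "last id + 1"), and `softmax_t` and this standalone `generate` are illustrative names, not the method from gpt.py.

```python
import math
import random


def softmax_t(logits, temperature):
    """Softmax of logits / temperature; lower temperature gives a sharper distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def generate(next_logits, ids, n, block, temperature=0.8, rng=random):
    """Append n sampled ids, showing the model only the last `block` ids each step."""
    ids = list(ids)
    for _ in range(n):
        context = ids[-block:]                        # the context window
        probs = softmax_t(next_logits(context), temperature)
        ids.append(rng.choices(range(len(probs)), weights=probs, k=1)[0])
    return ids


def toy_logits(context):
    """Hypothetical stand-in model over 4 ids: strongly prefers (last id + 1) mod 4."""
    want = (context[-1] + 1) % 4
    return [4.0 if i == want else 0.0 for i in range(4)]


logits = [2.0, 1.0, 0.0]
print(softmax_t(logits, 0.5)[0] > softmax_t(logits, 2.0)[0])   # low T is more peaked

rng = random.Random(0)
print(generate(toy_logits, [0], n=5, block=2, temperature=0.01, rng=rng))
```

At a temperature this close to zero, sampling is effectively greedy and the toy model just counts upward; raising the temperature toward 1 and above lets the less likely ids through more often.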

From this to a real LLM

The gap is scale and data, not architecture. A frontier model follows this same recipe, with:

  • Subword tokenization instead of characters (a ~100k-token vocabulary).
  • Billions of parameters (more layers, bigger N_EMBD, more heads) instead of 104k.
  • Trillions of tokens of training data instead of one sentence.
  • Thousands of GPUs, mixed precision, and learning-rate schedules.
  • Then fine-tuning (next chapter) and alignment (RLHF/DPO, Chapter 27) to make it helpful.

Every one of those is an engineering scale-up of what you just ran. The well-known nanoGPT repository is essentially this file, grown into a full training codebase.
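
To get a feel for how the parameter count grows with depth and width, here is a back-of-the-envelope estimate in plain Python. It is a rough sketch under stated assumptions: a 4x-wide MLP, biases and layer-norm parameters ignored, and an output head that shares weights with the token embedding unless `tied_head=False`. The function name `approx_params` is illustrative.

```python
def approx_params(n_layer, n_embd, vocab, block, tied_head=True):
    """Rough transformer parameter count, ignoring biases and layer-norm weights."""
    # Per block: 4*d^2 for the Q, K, V, and output projections,
    # plus 8*d^2 for an MLP that expands d -> 4d -> d.
    per_block = 12 * n_embd ** 2
    embeddings = vocab * n_embd + block * n_embd   # token table + position table
    head = 0 if tied_head else n_embd * vocab      # separate output projection, if any
    return n_layer * per_block + embeddings + head


# GPT-2 small: 12 layers, width 768, 50257-token vocabulary, 1024-token context.
gpt2_small = approx_params(n_layer=12, n_embd=768, vocab=50257, block=1024)
print(f"{gpt2_small / 1e6:.1f}M")   # 124.3M
```

The block term grows with the square of the width, which is why widening a model is so much more expensive than enlarging its vocabulary.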

Make it production

  • Serve it behind an API and stream tokens (the Production ML & AI Tools book's FastAPI + LLM-serving chapters).
  • Don't train your own foundation model: pretraining at that scale costs millions of dollars in compute. In practice you take a pretrained model and adapt it, which is the next project.
  • Evaluate with perplexity ($e^{\text{loss}}$) and downstream task metrics (Chapter 10).
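
Perplexity is a one-line conversion from the loss. The sketch below applies it to the first and last loss values from the training log above, assuming the loss is a mean cross-entropy in nats (which is what `F.cross_entropy` returns by default); the function name `perplexity` is illustrative.

```python
import math


def perplexity(mean_loss_nats):
    """Perplexity is exp(mean cross-entropy): the effective number of equally likely choices."""
    return math.exp(mean_loss_nats)


print(round(perplexity(2.658), 2))   # 14.27 at step 0, roughly all 13 characters in play
print(round(perplexity(0.051), 3))   # 1.052 at step 400, close to a single certain choice
```

A perplexity of 13 would correspond exactly to uniform guessing over this vocabulary, and 1 to perfect prediction.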

The takeaway

You built a GPT: token + positional embeddings, masked self-attention with residuals and layer norm, trained by next-token cross-entropy, generating one token at a time with temperature. It's 104k parameters and 12 seconds on a CPU, but it's the real architecture, and what separates it from GPT-4 is scale: data, parameters, compute, and the fine-tuning and alignment layered on top. Now let's adapt a pretrained model to a new task, efficiently, with LoRA. 👉