Project 2 — Fine-tuning & LoRA

You almost never train a model from scratch (last chapter showed why — it takes millions of dollars). Instead you take a pretrained model and fine-tune it: a little extra training to adapt its broad knowledge to your task. And the modern, efficient way to do that — the technique behind nearly every custom LLM today — is LoRA. This project builds LoRA from scratch and measures why it's a big deal.

Full code: code/projects/finetune_lora.py.

What fine-tuning is

A pretrained model already learned general features (language, edges, concepts) from a huge corpus. Fine-tuning continues training it on a smaller, task-specific dataset so it specializes — sentiment on your reviews, your company's writing style, a medical domain. It's transfer learning: reuse the expensive pretraining, pay only for the cheap adaptation.

The problem: full fine-tuning updates all the weights. For a 70-billion-parameter model that means storing and training 70 billion numbers — enormous memory and a full copy of the model per task. LoRA fixes exactly this.
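To put rough numbers on that, here is a back-of-the-envelope sketch. The byte counts are assumptions (a common mixed-precision Adam setup: 16-bit weights and gradients, plus 32-bit master weights and two 32-bit Adam moments), not measurements, and activations are ignored entirely.

```python
# Rough memory estimate for FULL fine-tuning.
# All byte counts are assumptions about a typical mixed-precision Adam setup;
# activation memory is ignored.
BYTES_PER_PARAM = {
    "fp16 weights": 2,
    "fp16 gradients": 2,
    "fp32 master weights": 4,
    "fp32 Adam first moment": 4,
    "fp32 Adam second moment": 4,
}

def full_finetune_gb(n_params: int) -> float:
    """Approximate training-state memory in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * sum(BYTES_PER_PARAM.values()) / 1e9

n_params = 70_000_000_000  # the 70B model from the text
for name, nbytes in BYTES_PER_PARAM.items():
    print(f"{name:<26} {n_params * nbytes / 1e9:>7.0f} GB")
print(f"{'total':<26} {full_finetune_gb(n_params):>7.0f} GB")
```

Under these assumptions the training state alone is on the order of a terabyte, and every task you fine-tune for leaves you with another full-size copy of the weights.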

The LoRA idea: freeze the giant, train a tiny add-on

LoRA (Low-Rank Adaptation) freezes the pretrained weights $W_0$ and adds a small trainable detour: a low-rank (Chapter 23) pair of matrices $A$ and $B$. The adapted layer is:

$$ y = x W_0 + \frac{\alpha}{r}\, x A B $$

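where $W_0$ is $(d \times d)$, $A$ is $(d \times r)$, and $B$ is $(r \times d)$, with the rank $r$ tiny (typically 4, 8, or 16). You train only $A$ and $B$, that is $2dr$ numbers instead of $d^2$, while the huge $W_0$ never moves. (For a non-square layer, $A$ is $(d_{\text{in}} \times r)$ and $B$ is $(r \times d_{\text{out}})$, as in the code.) From the project: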

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base, r=4, alpha=8):
        super().__init__()                          # required before registering parameters
        self.base = base                            # the pretrained nn.Linear
        for p in self.base.parameters():
            p.requires_grad = False                 # FREEZE the pretrained weights
        self.A = nn.Parameter(torch.randn(base.in_features, r) * 0.01)
        self.B = nn.Parameter(torch.zeros(r, base.out_features))   # starts as a no-op
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A @ self.B)   # frozen + low-rank

Why it works: the change needed to adapt a model is usually low-rank — it lives in a small subspace. B starts at zero so training begins exactly at the pretrained model and only departs as needed. This is the same low-rank, "best approximation with few numbers" idea as SVD (Chapter 23).
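The three claims in this section (no-op at initialization, only A and B trainable, frozen base) can be checked directly. The sketch below repeats the class so it runs on its own; the 32×32 layer is a hypothetical stand-in for a pretrained layer, not necessarily what the project file uses.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Same layer as above, repeated so this snippet is self-contained."""
    def __init__(self, base: nn.Linear, r: int = 4, alpha: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False
        self.A = nn.Parameter(torch.randn(base.in_features, r) * 0.01)
        self.B = nn.Parameter(torch.zeros(r, base.out_features))
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A @ self.B)

torch.manual_seed(0)
base = nn.Linear(32, 32)          # hypothetical stand-in for a pretrained layer
layer = LoRALinear(base, r=4)
x = torch.randn(5, 32)

# 1) B is all zeros, so the adapter contributes nothing at initialization.
same_at_init = torch.allclose(layer(x), base(x))

# 2) Only A and B are trainable; the base weight and bias are frozen.
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
frozen = sum(p.numel() for p in layer.parameters() if not p.requires_grad)

# 3) A backward pass never produces gradients for the frozen base.
layer(x).sum().backward()
base_has_grad = base.weight.grad is not None

print(same_at_init, trainable, frozen, base_has_grad)
```

For this layer the trainable count is 32·4 + 4·32 = 256 against 32·32 + 32 = 1056 frozen values.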

Measuring the win

The project adapts a "pretrained" layer to a new classification task two ways — full fine-tuning vs. LoRA — and counts trainable parameters:

$ python finetune_lora.py

Output:

full fine-tune : trainable params = 1188  acc = 1.000
LoRA (r=4)     : trainable params =  388  acc = 1.000

LoRA trains 3.1x fewer parameters for comparable accuracy.

Same 100% accuracy, a third of the trainable parameters. On our toy layer the ratio is 3×; on a real LLM, where the frozen base is billions of weights and the adapter is millions, the ratio is typically 100–1000× and can be higher at small ranks. That's the difference between needing a data center and fine-tuning on a single GPU.
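The printed counts can be reproduced with simple arithmetic. The layer sizes below are an inferred assumption (a 32×32 "pretrained" layer plus a trainable 32→4 classification head happens to give exactly these numbers); the project file may be organized differently. The 4096×4096 projection is likewise just an illustrative LLM-sized matrix.

```python
def linear_params(d_in: int, d_out: int, bias: bool = True) -> int:
    """Parameter count of a dense layer."""
    return d_in * d_out + (d_out if bias else 0)

def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Parameter count of a rank-r adapter: A is (d_in x r), B is (r x d_out)."""
    return r * (d_in + d_out)

# Assumed toy setup (hypothetical): 32x32 base layer + 32->4 head.
head = linear_params(32, 4)                  # trained in both runs
full = linear_params(32, 32) + head          # full fine-tune trains everything
lora = lora_params(32, 32, r=4) + head       # LoRA trains adapter + head only
print(full, lora, round(full / lora, 1))

# One LLM-sized projection (assumed 4096x4096, no bias) with r=8.
big_full = linear_params(4096, 4096, bias=False)
big_lora = lora_params(4096, 4096, r=8)
print(big_full, big_lora, big_full // big_lora)
```

The per-matrix ratio grows with the layer width because the adapter scales with d·r while the full matrix scales with d². Across a whole model the gap widens further, since adapters are usually attached to only some of the weight matrices.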

Why LoRA changed the game

  • Cheap — train <1% of the parameters; fits on modest hardware.
  • Small artifacts — a LoRA adapter is a few megabytes, not gigabytes. Ship one base model + many tiny adapters (one per customer/task) instead of many full copies.
  • No quality loss — matches full fine-tuning on most tasks.
  • Composable & reversible — swap adapters in and out; the base is untouched.
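"Reversible" has a concrete meaning: because the adapter is just an additive term, it can be folded into the base weight for serving and subtracted back out later. The following is a minimal sketch with made-up sizes and random matrices standing in for a trained adapter.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_in, d_out, r, alpha = 16, 16, 4, 8
scale = alpha / r

base = nn.Linear(d_in, d_out)                 # stand-in for a frozen pretrained layer
A = torch.randn(d_in, r) * 0.1                # pretend these were trained
B = torch.randn(r, d_out) * 0.1
x = torch.randn(3, d_in)

with torch.no_grad():
    adapted = base(x) + scale * (x @ A @ B)   # adapter applied as a side path

    # nn.Linear stores its weight as (out_features, in_features), hence the transpose.
    delta = scale * (A @ B).T
    original = base.weight.clone()

    base.weight += delta                      # merge: no extra matmul at inference
    merged_matches = torch.allclose(base(x), adapted, atol=1e-5)

    base.weight -= delta                      # unmerge: back to the pretrained weights
    restored = torch.allclose(base.weight, original, atol=1e-6)

print(merged_matches, restored)
```

The peft library offers a comparable operation on its LoRA models through merge_and_unload(); the sketch above only shows the underlying arithmetic.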

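QLoRA adds 4-bit quantization of the frozen base (Chapter 27), cutting its memory roughly fourfold compared with 16-bit weights. The QLoRA authors report fine-tuning a 65B model on a single 48 GB GPU, and models in the 7B–13B range typically fit on a consumer card. This is why a small team can now customize a giant model.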
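How much quantization buys can be estimated from the weights alone. This sketch deliberately ignores everything else that must fit in memory (quantization constants, the adapter and its optimizer state, activations), so treat the results as lower bounds rather than requirements.

```python
def weight_gb(n_params: int, bits_per_param: float) -> float:
    """Memory for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

# Model sizes chosen for illustration.
for n_params, label in [(7_000_000_000, "7B"), (70_000_000_000, "70B")]:
    for bits in (16, 8, 4):
        print(f"{label:>3} at {bits:>2}-bit: {weight_gb(n_params, bits):6.1f} GB")
```

Comparing these figures with the memory of the GPU you have in mind is a quick first check before attempting a fine-tune.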

The real recipe (follow-along)

In practice you don't hand-roll LoRALinear — you use Hugging Face transformers + peft. The whole production fine-tune is a few lines:

# pip install transformers peft datasets
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
from peft import LoraConfig, get_peft_model

model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")
lora = LoraConfig(task_type="SEQ_CLS",       # keeps the new classification head trainable
                  r=8, lora_alpha=16, target_modules=["q_lin", "v_lin"])
model = get_peft_model(model, lora)          # wraps the model: freezes base, adds LoRA
model.print_trainable_parameters()           # prints trainable vs. total parameter counts

# ds: your tokenized, labeled training dataset (not shown here)
Trainer(model, TrainingArguments("out", num_train_epochs=3),
        train_dataset=ds).train()            # trains the adapter (and the small head)
model.save_pretrained("my-adapter")          # a few MB, not GB

get_peft_model does exactly what our LoRALinear did — freeze the base, inject low-rank adapters into the attention layers — just for a real pretrained transformer.

Don't be confused: fine-tuning vs. RAG vs. prompting. Fine-tuning changes the model's behavior/skill by training (use it for tone, format, a narrow capability). RAG (Chapter 14) adds knowledge at query time without training. Prompting changes no weights at all; it only steers the existing model through its input. To make a model act a certain way → fine-tune; to make it know your latest docs → RAG. Reaching for fine-tuning when you needed RAG (or vice versa) is the most common modern-AI design mistake.

Make it production

  • Track every fine-tune with MLflow; version the adapter in a registry (tools book).
  • Evaluate before shipping — a held-out set and an eval gate (Chapter 10; the tools book's testing/eval chapters).
  • Serve the base model once and hot-swap adapters per task.
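The last bullet can be illustrated with a toy server: one base layer held in memory, and a dictionary of small adapters selected per request. Everything here is hypothetical (the task names, the sizes, and the random matrices standing in for adapters pulled from a registry); a real deployment would do this through its serving framework.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_in, d_out, r, alpha = 16, 4, 4, 8
scale = alpha / r

base = nn.Linear(d_in, d_out).eval()          # loaded once, shared by every task

def fake_trained_adapter():
    """Hypothetical stand-in for an adapter loaded from a registry."""
    return torch.randn(d_in, r) * 0.1, torch.randn(r, d_out) * 0.1

# Task names are made up for illustration.
adapters = {"support": fake_trained_adapter(), "legal": fake_trained_adapter()}

@torch.no_grad()
def serve(x, task=None):
    """Run the shared base, plus the low-rank path of the requested adapter."""
    out = base(x)
    if task is not None:
        A, B = adapters[task]
        out = out + scale * (x @ A @ B)
    return out

x = torch.randn(2, d_in)
weights_before = base.weight.clone()
y_base, y_support, y_legal = serve(x), serve(x, "support"), serve(x, "legal")

# Swapping adapters never touches the base weights...
print(torch.equal(base.weight, weights_before))
# ...yet each adapter produces its own outputs.
print(torch.allclose(y_support, y_legal))
```

Adding a task costs one more small (A, B) pair rather than another copy of the base, which is why the per-task artifact stays small.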

The takeaway

Fine-tuning adapts a pretrained model to your task; LoRA does it efficiently by freezing the base and training a tiny low-rank adapter. You measured 3× fewer trainable parameters here, and the gap grows to 100–1000× on real LLMs, with no accuracy loss on our toy task and little or none on most real ones. It's the same low-rank idea as SVD, and the reason custom LLMs are now within reach of small teams. Next, we give a model the ability to act: an agent with tools.