Project 2 — Fine-tuning & LoRA
You almost never train a model from scratch (last chapter showed why — it takes millions of dollars). Instead you take a pretrained model and fine-tune it: a little extra training to adapt its broad knowledge to your task. And the modern, efficient way to do that — the technique behind nearly every custom LLM today — is LoRA. This project builds LoRA from scratch and measures why it's a big deal.
Full code: code/projects/finetune_lora.py.
What fine-tuning is
A pretrained model already learned general features (language, edges, concepts) from a huge corpus. Fine-tuning continues training it on a smaller, task-specific dataset so it specializes — sentiment on your reviews, your company's writing style, a medical domain. It's transfer learning: reuse the expensive pretraining, pay only for the cheap adaptation.
The problem: full fine-tuning updates all the weights. For a 70-billion-parameter model that means storing and training 70 billion numbers — enormous memory and a full copy of the model per task. LoRA fixes exactly this.
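To put a rough number on "enormous memory", here is a back-of-the-envelope sketch. The byte counts are assumptions, not figures from this chapter: 16-bit weights and gradients, an Adam-style optimizer that keeps two 32-bit values per parameter, and activations ignored entirely.

```python
# Rough memory needed to train every weight of a model (assumed byte counts).
BYTES_WEIGHT = 2      # 16-bit weight
BYTES_GRAD = 2        # 16-bit gradient
BYTES_ADAM = 4 + 4    # two 32-bit optimizer moments per parameter

def weights_only_gb(n_params: int) -> float:
    """Memory just to hold the weights, in gigabytes."""
    return n_params * BYTES_WEIGHT / 1e9

def full_finetune_gb(n_params: int) -> float:
    """Weights + gradients + optimizer state, in gigabytes (activations ignored)."""
    return n_params * (BYTES_WEIGHT + BYTES_GRAD + BYTES_ADAM) / 1e9

n = 70_000_000_000
print(f"weights only : {weights_only_gb(n):.0f} GB")   # 140 GB
print(f"full training: {full_finetune_gb(n):.0f} GB")  # 840 GB
```

Under these assumptions, training state costs several times more than the weights alone, and that cost repeats for every task you fine-tune.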
The LoRA idea: freeze the giant, train a tiny add-on
LoRA (Low-Rank Adaptation) freezes the pretrained weights $W_0$ and adds a small trainable detour: a low-rank (Chapter 23) pair of matrices $A$ and $B$. The adapted layer is:
$$ y = x W_0 + \frac{\alpha}{r}\, x A B $$
where, for a square layer, $W_0$ is $(d \times d)$, $A$ is $(d \times r)$, and $B$ is $(r \times d)$, with the rank $r$ tiny (4, 8, 16). You train only $A$ and $B$, which together hold $2dr$ parameters instead of $d^2$, while the huge $W_0$ never moves. From the project:
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base, r=4, alpha=8):
        super().__init__()               # must run before assigning submodules or parameters
        self.base = base                 # the pretrained nn.Linear being adapted
        for p in self.base.parameters():
            p.requires_grad = False      # FREEZE the pretrained weights
        self.A = nn.Parameter(torch.randn(base.in_features, r) * 0.01)
        self.B = nn.Parameter(torch.zeros(r, base.out_features))  # starts as a no-op
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A @ self.B)  # frozen + low-rank
Why it works: the change needed to adapt a model is usually low-rank — it lives in a small subspace.
$B$ starts at zero, so training begins exactly at the pretrained model and only departs as needed. This is the same low-rank, "best approximation with few numbers" idea as SVD (Chapter 23).
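Two consequences of the formula can be checked numerically. The sketch below is illustrative only: it uses plain Python lists instead of tensors so it runs without PyTorch, and the sizes, the seed, and the random `B_trained` (a stand-in for values that training would produce) are all assumptions. It checks that a zero $B$ leaves the output unchanged, and that the adapter can be folded into a single matrix $W_0 + \frac{\alpha}{r} A B$, which follows from distributivity.

```python
import random

def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def add_scaled(X, Y, scale):
    """Return X + scale * Y, elementwise."""
    return [[a + scale * b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]

def rand_matrix(rows, cols, rng, std=1.0):
    return [[rng.gauss(0.0, std) for _ in range(cols)] for _ in range(rows)]

def max_abs_diff(X, Y):
    return max(abs(a - b) for rx, ry in zip(X, Y) for a, b in zip(rx, ry))

rng = random.Random(0)
d, r, alpha = 6, 2, 4                  # toy sizes, chosen only for illustration
scale = alpha / r

x = rand_matrix(3, d, rng)             # a batch of 3 inputs
W0 = rand_matrix(d, d, rng)            # stands in for the frozen pretrained weight
A = rand_matrix(d, r, rng, std=0.01)   # small random init, as in LoRALinear
B = [[0.0] * d for _ in range(r)]      # zero init

def lora_forward(x, W0, A, B):
    """y = x W0 + (alpha / r) * x A B"""
    return add_scaled(matmul(x, W0), matmul(matmul(x, A), B), scale)

# 1) With B = 0 the adapter adds nothing: the output equals the base output.
start_gap = max_abs_diff(lora_forward(x, W0, A, B), matmul(x, W0))

# 2) With a nonzero B, folding the adapter into one matrix gives the same output.
B_trained = rand_matrix(r, d, rng)     # hypothetical "trained" values
W_merged = add_scaled(W0, matmul(A, B_trained), scale)
merge_gap = max_abs_diff(lora_forward(x, W0, A, B_trained), matmul(x, W_merged))

print(start_gap, merge_gap)            # both should be zero up to float rounding
```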
Measuring the win
The project adapts a "pretrained" layer to a new classification task two ways — full fine-tuning vs. LoRA — and counts trainable parameters:
$ python finetune_lora.py
Output:
full fine-tune : trainable params = 1188 acc = 1.000
LoRA (r=4) : trainable params = 388 acc = 1.000
LoRA trains 3.1x fewer parameters for comparable accuracy.
Same 100% accuracy, a third of the trainable parameters. On our toy layer the ratio is 3×; on a real LLM, where the frozen base is billions of weights and the adapter is millions, the ratio is 100–1000×. That's the difference between needing a data center and fine-tuning on a single GPU.
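The ratio grows with layer width because a full update scales with $d^2$ while the adapter scales with $2dr$. A small sketch of that arithmetic follows; the hidden size of 4096 is a hypothetical value chosen for illustration, and bias terms are ignored.

```python
def full_params(d_in: int, d_out: int) -> int:
    """Trainable weights when the whole (d_in x d_out) matrix is updated."""
    return d_in * d_out

def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable weights in A (d_in x r) plus B (r x d_out)."""
    return r * (d_in + d_out)

d = 4096  # hypothetical hidden size, for illustration only
for r in (4, 8, 16):
    full = full_params(d, d)
    lora = lora_params(d, d, r)
    print(f"r={r}: full={full:,}  lora={lora:,}  ratio={full / lora:.0f}x")
```

For this assumed width the per-layer ratios come out to 512×, 256×, and 128× for ranks 4, 8, and 16. The ratio for a whole model depends on which layers receive adapters.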
Why LoRA changed the game
- Cheap — train <1% of the parameters; fits on modest hardware.
- Small artifacts — a LoRA adapter is a few megabytes, not gigabytes. Ship one base model + many tiny adapters (one per customer/task) instead of many full copies.
- No quality loss — matches full fine-tuning on most tasks.
- Composable & reversible — swap adapters in and out; the base is untouched.
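The "few megabytes" claim can be sanity-checked with simple arithmetic. The model shape below is hypothetical (32 layers, hidden size 4096, adapters on two square matrices per layer, rank 8, 16-bit storage); real adapters vary with the configuration.

```python
def adapter_params(n_layers: int, d_model: int, r: int, matrices_per_layer: int) -> int:
    """Total LoRA weights: each adapted (d_model x d_model) matrix gets an A and a B."""
    return n_layers * matrices_per_layer * r * (d_model + d_model)

# Hypothetical model shape, for illustration only.
n_layers, d_model, r = 32, 4096, 8
params = adapter_params(n_layers, d_model, r, matrices_per_layer=2)  # e.g. query and value
size_mib = params * 2 / 2**20   # 2 bytes per weight (16-bit storage)

print(f"{params:,} adapter weights, about {size_mib:.0f} MiB")  # 4,194,304 weights, 8 MiB
```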
QLoRA adds quantization (Chapter 27), which shrinks the frozen base enough that its authors report fine-tuning a 65B-parameter model on a single 48 GB GPU. This is why a small team can now customize a giant model.
The real recipe (follow-along)
In practice you don't hand-roll LoRALinear; you use Hugging Face transformers + peft. The whole production fine-tune is a few lines:
# pip install transformers peft datasets
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
from peft import LoraConfig, get_peft_model
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")
# task_type="SEQ_CLS" keeps the new classification head trainable and saves it with the adapter
lora = LoraConfig(task_type="SEQ_CLS", r=8, lora_alpha=16, target_modules=["q_lin", "v_lin"])
model = get_peft_model(model, lora)   # wraps the model: freezes base, adds LoRA
model.print_trainable_parameters()    # prints "trainable params: ... || all params: ... || trainable%: ..."
# ds is assumed to be your tokenized, labeled training dataset, prepared beforehand (not shown)
Trainer(model, TrainingArguments("out", num_train_epochs=3),
        train_dataset=ds).train()     # trains only the adapter and the small classifier head
model.save_pretrained("my-adapter")   # a few MB, not GB
get_peft_model does exactly what our LoRALinear did (freeze the base, inject low-rank adapters into the attention layers), just for a real pretrained transformer.
Don't be confused: fine-tuning vs. RAG vs. prompting. Fine-tuning changes the model's behavior/skill by training (use it for tone, format, a narrow capability). RAG (Chapter 14) adds knowledge at query time without training. Prompting changes nothing in the model; it only steers the output through the input text. To make a model act a certain way → fine-tune; to make it know your latest docs → RAG. Reaching for fine-tuning when you needed RAG (or vice versa) is the most common modern-AI design mistake.
Make it production
- Track every fine-tune with MLflow; version the adapter in a registry (tools book).
- Evaluate before shipping — a held-out set and an eval gate (Chapter 10; the tools book's testing/eval chapters).
- Serve the base model once and hot-swap adapters per task.
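The hot-swap idea in the last bullet can be sketched as one shared base weight plus a dictionary of small adapters. This is a toy illustration, not a serving framework: `AdapterServer`, the task names, and the hand-written 2×2 matrices are all hypothetical, and plain lists stand in for tensors.

```python
def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

class AdapterServer:
    """Toy server: one shared frozen weight W0, many small (A, B) adapters."""

    def __init__(self, W0, scale=1.0):
        self.W0 = W0          # loaded once, never modified
        self.scale = scale
        self.adapters = {}    # task name -> (A, B)

    def register(self, task, A, B):
        self.adapters[task] = (A, B)

    def predict(self, x, task=None):
        out = matmul(x, self.W0)
        if task is None:      # no adapter requested: plain base model
            return out
        A, B = self.adapters[task]
        delta = matmul(matmul(x, A), B)
        return [[o + self.scale * dl for o, dl in zip(ro, rd)]
                for ro, rd in zip(out, delta)]

# Tiny hand-written example: 2 features, rank-1 adapters.
W0 = [[1.0, 0.0], [0.0, 1.0]]                        # identity base
server = AdapterServer(W0)
server.register("task_a", A=[[1.0], [0.0]], B=[[0.0, 2.0]])
server.register("task_b", A=[[0.0], [1.0]], B=[[3.0, 0.0]])

x = [[1.0, 1.0]]
base_out = server.predict(x)            # [[1.0, 1.0]]
out_a = server.predict(x, "task_a")     # [[1.0, 3.0]]
out_b = server.predict(x, "task_b")     # [[4.0, 1.0]]
```

Each request picks its adapter by name while `W0` stays untouched, so adding a task costs only the memory of its `(A, B)` pair.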
The takeaway
Fine-tuning adapts a pretrained model to your task; LoRA does it efficiently by freezing the base and training a tiny low-rank adapter. You measured 3× fewer trainable parameters here with no loss of accuracy, and on real LLMs the reduction is 100–1000×. It's the same low-rank idea as SVD, and the reason custom LLMs are now within reach of small teams. Next, we give a model the ability to act: an agent with tools.