The sticky-notes analogy
Imagine a 1,000-page manual. To customise it for your team, you do not rewrite the manual — you stick post-it notes in the margins. The original is untouched. The notes are tiny. Anyone can peel them off and stick on a different set for a different team.
LoRA (Low-Rank Adaptation) does the same thing. The base model's weights stay frozen. Small adapter matrices are trained alongside specific layers, and their product is added to the original weights at inference time (or merged into them ahead of time). You ship one base plus many cheap adapters.
The math, in one line
For a frozen weight matrix W of shape d × d, LoRA learns two small matrices B (d × r) and A (r × d) where r ≪ d, so that the product B · A has the same d × d shape as W. The effective weight becomes:
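W_effective = W + (α / r) · B · A
r is the rank, usually 4, 8, 16, or 32. α is a scaling constant; the original paper and most libraries divide it by r, so α = 2r gives an update scale of 2. The original W is never updated.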
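The formula can be checked on a toy layer. The following is a minimal pure-Python sketch, not a production implementation. It assumes B has shape d × r and A has shape r × d (so that B · A is d × d) and the α / r scaling used in the original paper. The helper names `matmul`, `matvec`, `lora_forward`, and `merge` are illustrative, and the random B stands in for a trained adapter.

```python
import random

def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def lora_forward(W, A, B, x, alpha):
    """Compute W x + (alpha / r) * B (A x) without forming the d x d update."""
    r = len(A)                         # A has r rows
    scale = alpha / r
    base = matvec(W, x)                # frozen path
    low = matvec(B, matvec(A, x))      # adapter path: d -> r -> d
    return [b + scale * l for b, l in zip(base, low)]

def merge(W, A, B, alpha):
    """Fold the adapter into the weights: W + (alpha / r) * B A."""
    scale = alpha / len(A)
    BA = matmul(B, A)                  # (d x r)(r x d) -> d x d
    return [[w + scale * u for w, u in zip(w_row, u_row)]
            for w_row, u_row in zip(W, BA)]

random.seed(0)
d, r, alpha = 6, 2, 4                  # toy sizes, chosen only for illustration
W = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]     # frozen, d x d
A = [[random.gauss(0, 0.1) for _ in range(d)] for _ in range(r)]   # r x d
B = [[random.gauss(0, 0.1) for _ in range(r)] for _ in range(d)]   # d x r
x = [random.gauss(0, 1) for _ in range(d)]

unmerged = lora_forward(W, A, B, x, alpha)
merged = matvec(merge(W, A, B, alpha), x)
max_diff = max(abs(u - m) for u, m in zip(unmerged, merged))
print(f"max difference between merged and unmerged outputs: {max_diff:.2e}")
```

The two paths agree up to floating-point error, which is why an adapter can either be kept separate (easy to swap) or merged into W (no extra work per forward pass). In the original paper B starts at zero, so training begins from exactly the base model's behaviour.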
For a 4096×4096 layer with r=8, you train 4096×8 + 8×4096 = 65,536 (~65k) parameters instead of ~16.8M: 256× fewer.
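The same arithmetic as a small helper, so it can be repeated for other layer sizes and ranks. The function names are illustrative, and the sketch counts only the two adapter matrices for a single weight matrix.

```python
def lora_param_count(d_in, d_out, r):
    """Trainable parameters for one adapted matrix: A is r x d_in, B is d_out x r."""
    return r * d_in + d_out * r

def full_param_count(d_in, d_out):
    """Parameters in the frozen weight matrix itself."""
    return d_in * d_out

d, r = 4096, 8
adapter = lora_param_count(d, d, r)
full = full_param_count(d, d)
reduction = full / adapter

print(adapter)    # 65536
print(full)       # 16777216
print(reduction)  # 256.0
```

Because the adapter count grows linearly in r while the full matrix grows with d squared, the saving is largest on wide layers with small ranks.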
Why this is allowed to work
Empirically, the change a fine-tune wants to make to a pretrained model lives in a low-dimensional subspace. You do not need a full-rank update — a rank-8 nudge is usually enough to specialise a model on a narrow task. This was the surprising finding in the original LoRA paper.
What you save
- Memory at training: you do not need gradients or optimiser state for the frozen weights, and optimiser state is usually the biggest VRAM eater. A 7B model that needs roughly 80GB or more to fully fine-tune can fit in roughly 16GB with LoRA.
- Storage: adapters are MBs, not GBs. Ship 50 customised models from one base.
- Switching cost: load one base, swap adapters per request or per tenant.
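A rough sketch of the storage claim. Every number here is an assumption chosen for illustration: a hypothetical 7B-class model with 32 layers, hidden size 4096, four square attention projections adapted per layer, rank 8, and fp16 storage (2 bytes per parameter). Real architectures differ, for example when K and V projections are narrower than Q.

```python
def adapter_bytes(n_layers, matrices_per_layer, d_model, r, bytes_per_param=2):
    """Approximate on-disk size of one adapter set, assuming square d_model x d_model targets."""
    params_per_matrix = 2 * d_model * r          # A (r x d) plus B (d x r)
    return n_layers * matrices_per_layer * params_per_matrix * bytes_per_param

# Hypothetical 7B-class configuration (assumed, not taken from any specific model card).
one_adapter = adapter_bytes(n_layers=32, matrices_per_layer=4, d_model=4096, r=8)
base_model = 7_000_000_000 * 2                   # 7B parameters in fp16

fleet_lora = base_model + 50 * one_adapter       # one base plus 50 adapters
fleet_full = 50 * base_model                     # 50 fully fine-tuned copies

print(f"one adapter:      {one_adapter / 1e6:.1f} MB")
print(f"base + 50 LoRA:   {fleet_lora / 1e9:.2f} GB")
print(f"50 full models:   {fleet_full / 1e9:.0f} GB")
```

Under these assumptions one adapter is about 16.8 MB, and 50 of them add less than 1 GB on top of the 14 GB base, against 700 GB for 50 full copies.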
QLoRA — even cheaper
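QLoRA combines LoRA with 4-bit quantisation of the base model. The frozen weights are stored in 4 bits (a quarter of the size of fp16) and dequantised on the fly during the forward and backward passes; only the adapters are kept and trained in higher precision. Result: the weights of a 70B model shrink from ~140GB to ~35GB, which fits on a single 48GB GPU, while a 24GB consumer card can handle models up to roughly the 30B range.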
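The weight-memory side of this is simple arithmetic. The sketch below counts only the stored base weights; it ignores quantisation constants, activations, adapter weights, and optimiser state, so real usage is higher. The function name is illustrative.

```python
def weight_gigabytes(n_params, bits_per_weight):
    """Memory for the stored weights alone, in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9

for name, n_params in [("7B", 7e9), ("70B", 70e9)]:
    fp16 = weight_gigabytes(n_params, 16)
    four_bit = weight_gigabytes(n_params, 4)
    print(f"{name}: fp16 {fp16:.1f} GB, 4-bit {four_bit:.1f} GB")
```

For a 70B model this gives 140 GB in fp16 against 35 GB at 4 bits, which is the difference between needing several GPUs and fitting on one large card.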
When LoRA is enough
- Tone, format, domain language (✅).
- Tool-call adherence (✅).
- Narrow task quality (✅).
- Adding a fundamentally new skill the base model cannot do at all (⚠️ — full fine-tune is often needed).
- Substantial knowledge injection (⚠️ — RAG is still better than burning facts into adapters).
Practical tips
- Start with r=8, α=16. Almost always a reasonable default.
- Apply adapters to the attention projections (Q, K, V, O). Adding to the FFN helps marginally and roughly doubles adapter size.
- Don't go wild on rank: r=64 rarely beats r=16 and costs 4× the storage.
- Eval on the target task and on a held-out general task. LoRA can quietly damage general capability if you push too hard.
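As one possible starting point, the defaults above map onto a Hugging Face PEFT configuration roughly as follows. This is a sketch that assumes the `peft` package is installed and a causal language model whose attention projections are named `q_proj`, `k_proj`, `v_proj`, and `o_proj` (common in LLaMA-style models, but the names vary by architecture, so check your model's module names).

```python
from peft import LoraConfig

# r and lora_alpha follow the suggested defaults (rank 8, alpha 16).
# target_modules assumes LLaMA-style layer names; adjust for other architectures.
config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
```

The config is then typically passed to `get_peft_model(model, config)` together with an already loaded base model, which wraps the targeted layers with adapters and leaves everything else frozen.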