LoRA: Cheap Fine-Tuning Without Touching the Whole Model

LoRA freezes the giant model and trains tiny rank-r adapters next to it. On a 7B-parameter model that typically means training well under 1% as many weights as a full fine-tune, often with close to the same quality on a narrow task.

Apr 29, 2026 · 2 min read

The sticky-notes analogy

Imagine a 1,000-page manual. To customise it for your team, you do not rewrite the manual — you stick post-it notes in the margins. The original is untouched. The notes are tiny. Anyone can peel them off and stick on a different set for a different team.

LoRA (Low-Rank Adaptation) does the same thing. The base model's weights stay frozen. Small adapter matrices are trained alongside specific layers, and their product is added to the original weights, either merged in once before deployment or applied on the fly at inference time. You ship one base model plus many cheap adapters.

The math, in one line

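For a frozen weight matrix W of shape d × d, LoRA learns two small matrices B (d × r) and A (r × d) where r ≪ d, so the product B · A has the same shape as W. The effective weight becomes:

W_effective = W + (α / r) · B · A

r is the rank, usually 4, 8, 16 or 32. α is a scaling constant; the original paper and most implementations divide it by r, so α = 16 with r = 8 scales the update by 2. The original W is never updated.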
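
A minimal pure-Python sketch of the forward pass for one adapted layer may make the formula concrete. It assumes B has shape d × r and A has shape r × d (so that B · A matches W), uses the α / r scaling from the original LoRA paper, and starts B at zero so the adapter is a no-op before training. The helper names `matvec` and `lora_forward` are made up for illustration.

```python
import random

def matvec(M, x):
    """Multiply matrix M (a list of rows) by vector x."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, alpha, r):
    """Compute (W + (alpha / r) * B @ A) @ x without ever forming B @ A."""
    base = matvec(W, x)              # frozen path: W x
    low = matvec(B, matvec(A, x))    # adapter path: B (A x), squeezed through r dims
    scale = alpha / r
    return [b + scale * l for b, l in zip(base, low)]

random.seed(0)
d, r, alpha = 6, 2, 4
W = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]    # frozen, d x d
A = [[random.gauss(0, 0.1) for _ in range(d)] for _ in range(r)]  # trainable, r x d
B = [[0.0] * r for _ in range(d)]                                  # trainable, d x r, zero init
x = [random.gauss(0, 1) for _ in range(d)]

y_base = matvec(W, x)
y_init = lora_forward(W, A, B, x, alpha, r)   # equals y_base while B is all zeros

B[0][0] = 1.0                                  # pretend training changed one entry of B
y_tuned = lora_forward(W, A, B, x, alpha, r)  # only output component 0 moves
```

Keeping the two paths separate is what makes adapters swappable: W is shared, and only the small A and B change per task. Merging B · A into W ahead of time gives the same outputs with no extra inference cost, at the price of one weight copy per adapter.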

For a 4096×4096 layer with r=8, you train 4096×8 + 8×4096 = 65,536 (~65k) parameters instead of 4096×4096 ≈ 16.8M. That is about 256× fewer.
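
The arithmetic above is easy to check for other shapes and ranks. A small sketch, where `lora_param_count` is a hypothetical helper that assumes one d_out × r matrix and one r × d_in matrix per adapted layer:

```python
def lora_param_count(d_out, d_in, r):
    """Trainable parameters for one adapted matrix: B is d_out x r, A is r x d_in."""
    return d_out * r + r * d_in

full = 4096 * 4096                       # parameters in the frozen matrix
lora = lora_param_count(4096, 4096, 8)   # parameters in the adapter pair
ratio = full / lora
print(full, lora, ratio)                 # 16777216 65536 256.0
```

Because the count is linear in r, doubling the rank doubles the adapter while the frozen matrix stays the same size.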

Why this is allowed to work

Empirically, the change a fine-tune wants to make to a pretrained model lives in a low-dimensional subspace. You do not need a full-rank update: a rank-8 nudge is usually enough to specialise a model on a narrow task. This was the central hypothesis of the original LoRA paper, whose experiments found that very small ranks were often sufficient.

What you save

Memory at training: frozen weights need no gradients and no optimiser state, which together are usually the biggest VRAM eater. As a rough guide, a 7B model that needs ~80GB to fully fine-tune fits in ~16GB with LoRA; exact figures depend on precision, batch size and sequence length.
Storage: adapters are MBs, not GBs. Ship 50 customised models from one base.
Switching cost: load one base, swap adapters per request or per tenant.
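
A back-of-the-envelope estimate shows where the memory and storage savings come from. Everything here is an assumption for illustration: 16-bit weights and gradients (2 bytes each), an Adam-style optimiser holding two fp32 moments (8 bytes per trainable parameter), a hypothetical 7B layout of 32 layers with four 4096×4096 attention projections adapted at r=8, and no accounting for activations or framework overhead.

```python
GB = 1e9  # decimal gigabytes

def training_memory_gb(total_params, trainable_params,
                       weight_bytes=2, grad_bytes=2, optim_bytes=8):
    """Weights for every parameter; gradients and optimiser state only for trainable ones."""
    weights = total_params * weight_bytes
    train_state = trainable_params * (grad_bytes + optim_bytes)
    return (weights + train_state) / GB

base_params = 7e9
# Hypothetical adapter layout: 32 layers x 4 attention projections, 4096 x 4096 each, r = 8.
adapter_params = 32 * 4 * (4096 * 8 + 8 * 4096)

full_ft = training_memory_gb(base_params, base_params)
lora_ft = training_memory_gb(base_params + adapter_params, adapter_params)
adapter_mb = adapter_params * 2 / 1e6   # adapter file size if stored in 16-bit

print(round(full_ft), round(lora_ft, 1), round(adapter_mb, 1))   # 84 14.1 16.8
```

Under these assumptions the full fine-tune needs about 84GB before activations, the LoRA run about 14GB, and each adapter is roughly 17MB on disk. Activations and temporary buffers come on top in both cases, so treat the numbers as a lower bound.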

QLoRA — even cheaper

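QLoRA combines LoRA with 4-bit quantisation of the base model. The frozen weights are stored in 4 bits (a quarter of fp16) and dequantised on the fly during the forward and backward pass; only the adapters stay in higher precision. Result: the QLoRA paper reports fine-tuning a 65B model on a single 48GB GPU, and smaller models fit on consumer cards.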
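
To see what 4-bit storage buys, here is a rough sketch of the space taken by the frozen weights alone. The helper `weight_storage_gb` is hypothetical, and it deliberately ignores quantisation constants, adapters, activations and optimiser state, so real usage will be higher.

```python
def weight_storage_gb(n_params, bits_per_weight):
    """Storage for the weights alone, in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9

for n_params in (7e9, 65e9, 70e9):
    sixteen_bit = weight_storage_gb(n_params, 16)
    four_bit = weight_storage_gb(n_params, 4)
    print(f"{n_params / 1e9:.0f}B: {sixteen_bit:.1f} GB at 16-bit, {four_bit:.1f} GB at 4-bit")
```

At 4 bits a 65B base is about 32.5GB and a 70B base about 35GB, which is why models of that size need a 48GB-class card rather than a typical 24GB one.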

When LoRA is enough

Tone, format, domain language (✅).
Tool-call adherence (✅).
Narrow task quality (✅).
Adding a fundamentally new skill the base model cannot do at all (⚠️ — full fine-tune is often needed).
Substantial knowledge injection (⚠️ — RAG is still better than burning facts into adapters).

Practical tips

Start with r=8, α=16. Almost always a reasonable default.
Apply adapters to the attention projections (Q, K, V, O) first. Adding them to the FFN projections as well can help on harder tasks and typically more than doubles adapter size.
Don't go wild on rank — r=64 rarely beats r=16 and costs 4× the storage.
Eval on the target task and on a held-out general task. LoRA can quietly damage general capability if you push too hard.
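
The size trade-offs in these tips can be estimated before training anything. The sketch below assumes Llama-style 7B shapes (hidden size 4096, FFN size 11008, 32 layers, three FFN projections per layer); those numbers are not from this article, so substitute your own model's shapes.

```python
HIDDEN, FFN, LAYERS = 4096, 11008, 32   # assumed Llama-style 7B shapes

# (d_out, d_in) for each adapted projection in one layer
ATTENTION = [(HIDDEN, HIDDEN)] * 4                            # Q, K, V, O
FEED_FORWARD = [(FFN, HIDDEN), (FFN, HIDDEN), (HIDDEN, FFN)]  # gate, up, down

def adapter_params(shapes, r, layers=LAYERS):
    """Total LoRA parameters: r * (d_out + d_in) per adapted matrix, per layer."""
    return layers * sum(r * (d_out + d_in) for d_out, d_in in shapes)

attn_r16 = adapter_params(ATTENTION, r=16)
attn_r64 = adapter_params(ATTENTION, r=64)
all_r16 = adapter_params(ATTENTION + FEED_FORWARD, r=16)

print(attn_r64 / attn_r16)            # 4.0: size grows linearly with rank
print(round(all_r16 / attn_r16, 2))   # 2.38 under these assumed shapes
```

Models with grouped-query attention have smaller K and V projections, so the ratios shift a little, but the pattern holds: rank scales adapter size linearly, and the FFN projections are the larger share once you include them.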