Skip to main content
Training & Fine-Tuning Agent loop 3 Slider

LoRA: Cheap Fine-Tuning Without Touching the Whole Model

LoRA freezes the giant model and trains tiny rank-r adapters next to it. 7B-param model, ~1% of the trainable weights, 99% of the quality.

· 2 Min. Lesezeit
Zum Lab springen
▸ Selbst ausprobieren

Zieh einen Slider — das Diagramm reagiert in Echtzeit.

FR /100
¶ Die Analogie

The sticky-notes analogy

Imagine a 1,000-page manual. To customise it for your team, you do not rewrite the manual — you stick post-it notes in the margins. The original is untouched. The notes are tiny. Anyone can peel them off and stick on a different set for a different team.

LoRA — Low-Rank Adaptation does the same thing. The base model's weights stay frozen. Tiny adapter matrices are trained next to specific layers and added to the originals at inference time. You ship one base + many cheap adapters.

The math, in one line

For a frozen weight matrix W of shape d × d, LoRA learns two small matrices A (d × r) and B (r × d) where r ≪ d. The effective weight becomes:

W_effective = W + α · B · A

r is the rank — usually 4, 8, 16, 32. α is a scaling constant. The original W is never updated.

For a 4096×4096 layer with r=8, you train 4096×8 + 8×4096 = ~65k parameters instead of ~16M. ~250× fewer.

Why this is allowed to work

Empirically, the change a fine-tune wants to make to a pretrained model lives in a low-dimensional subspace. You do not need a full-rank update — a rank-8 nudge is usually enough to specialise a model on a narrow task. This was the surprising finding in the original LoRA paper.

What you save

  • Memory at training — you do not need optimiser state for the frozen weights, the biggest VRAM eater. A 7B model that needs ~80GB to fully fine-tune fits in ~16GB with LoRA.
  • Storage — adapters are MBs, not GBs. Ship 50 customised models from one base.
  • Switching cost — load one base, swap adapters per request or per tenant.

QLoRA — even cheaper

QLoRA combines LoRA with 4-bit quantisation of the base model. The frozen weights are stored in 4 bits (a quarter of fp16); only the adapters stay in higher precision. Result: a 70B model fine-tunable on a single consumer GPU.

When LoRA is enough

  • Tone, format, domain language (✅).
  • Tool-call adherence (✅).
  • Narrow task quality (✅).
  • Adding a fundamentally new skill the base model cannot do at all (⚠️ — full fine-tune is often needed).
  • Substantial knowledge injection (⚠️ — RAG is still better than burning facts into adapters).

Practical tips

  • Start with r=8, α=16. Almost always a reasonable default.
  • Apply adapters to the attention projections (Q, K, V, O). Adding to the FFN helps marginally and roughly doubles adapter size.
  • Don't go wild on rank — r=64 rarely beats r=16 and costs 4× the storage.
  • Eval on the target task and on a held-out general task. LoRA can quietly damage general capability if you push too hard.
Engr Mejba Ahmed

Engr Mejba Ahmed

Claude Code Expert · Online

👋

Hey there!

Quick Actions

WhatsApp Instant reply

Chat on WhatsApp

+880 1723 741224 · Instant reply

Popular Questions

Engr Mejba Ahmed is connected
Engr Mejba Ahmed is typing...
Engr Mejba Ahmed avatar

✉ Want me to follow up? Drop your email

Engr Mejba Ahmed avatar

📞 Connect Directly

Choose how you'd like to reach me

WhatsApp

+880 1723 741224

Email

[email protected]

✓ Details sent! I'll get back to you shortly.

Powered by OpenAI

335+

Blog Posts

25

AI Courses

63

Projects

Services & Expertise

Pricing & Process

Learning & Resources

Connect & Support