Chapter 8 Fine-Tuning LLMs — LoRA, QLoRA & Knowledge Augmentation

LoRA — Low-Rank Adaptation for Efficient Fine-Tuning

Lesson 36 / 50 · 22 min read


Fine-tuning a 7-billion-parameter model by updating every weight requires roughly 112 GB of GPU memory for the model, gradients, and optimizer states combined (about 16 bytes per parameter with mixed-precision Adam). That puts full fine-tuning out of reach for the vast majority of developers. LoRA closes that gap, delivering near-full fine-tune quality while training less than 1% of the parameters.

The Core Problem LoRA Solves

During fine-tuning, you're updating a weight matrix W of shape (d × k). For LLaMA 3 8B, a single attention projection matrix might be (4096 × 4096) = 16.7 million values.

The key insight from the LoRA paper (Hu et al., 2021): the weight updates needed for fine-tuning have low intrinsic rank. The change ΔW doesn't need all 16.7M degrees of freedom — it can be expressed as the product of two much smaller matrices.
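To make "low rank" concrete, here is a quick illustrative check (NumPy, with toy sizes chosen for this demo): the product of a (d × r) and an (r × k) matrix is a full (d × k) matrix, yet its rank can never exceed r.

```python
import numpy as np

d, k, r = 512, 512, 8  # illustrative sizes, much smaller than a real projection
rng = np.random.default_rng(0)

B = rng.standard_normal((d, r))
A = rng.standard_normal((r, k))
delta_W = B @ A  # a full (512, 512) matrix, but only rank 8

assert delta_W.shape == (512, 512)
assert np.linalg.matrix_rank(delta_W) == 8  # at most r independent directions
```

If the fine-tuning update really does have low intrinsic rank, almost nothing is lost by restricting ΔW to this factored form.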

The Mathematical Insight

Instead of learning ΔW directly, LoRA decomposes it:

ΔW = B × A

Where:
  W is the original frozen weight matrix: shape (d × k)
  B is a matrix of shape (d × r)
  A is a matrix of shape (r × k)
  r << min(d, k)  — the rank, a small number like 4, 8, or 16

Modified forward pass:
  h = Wx + BAx
    = (W + BA)x

For a (4096 × 4096) matrix with rank r=8:

  • Original: 4096 × 4096 = 16,777,216 parameters
  • LoRA: 4096×8 + 8×4096 = 65,536 parameters (0.39% of original)

The original W is completely frozen — its gradients are never computed, saving enormous memory.
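The arithmetic above generalizes: a (d × k) weight adapted with rank r costs r·(d + k) trainable parameters instead of d·k. A quick sanity check in plain Python:

```python
def lora_params(d: int, k: int, r: int) -> int:
    """Trainable parameters for one LoRA-adapted (d x k) matrix."""
    return r * (d + k)  # B is (d, r), A is (r, k)

d = k = 4096
full = d * k                    # full fine-tune: 16,777,216 parameters
lora = lora_params(d, k, r=8)   # LoRA: 65,536 parameters

print(full, lora, f"{100 * lora / full:.2f}%")  # 16777216 65536 0.39%
```

Because r·(d + k) grows linearly in r while d·k is fixed, doubling the rank simply doubles the trainable parameter count.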

Initialization Strategy

LoRA uses a careful initialization to ensure training starts stably:

import torch

d, k, r = 4096, 4096, 8  # layer dimensions and LoRA rank

# A is initialized with small random Gaussian values;
# B is initialized to zeros, so BA = 0 at initialization
# and the modified layer starts identical to the original.
A = torch.randn(r, k) * 0.02
B = torch.zeros(d, r)

During the forward pass, LoRA also applies a scaling factor:

h = Wx + (alpha / r) × BAx

Where alpha is a hyperparameter; common choices are alpha = r (a scaling of 1) or alpha = 2r. This scaling keeps the magnitude of the LoRA update consistent when you change r, so the learning rate doesn't need retuning for every rank.
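Putting the initialization and scaling together, a minimal LoRA linear layer might look like this. This is an illustrative sketch, not the peft implementation; the class name and toy dimensions are invented for the example:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen base linear layer plus a trainable low-rank update."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 8):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)  # freeze W: no gradients stored
        d, k = base.out_features, base.in_features
        self.A = nn.Parameter(torch.randn(r, k) * 0.02)  # small Gaussian
        self.B = nn.Parameter(torch.zeros(d, r))         # zeros => BA = 0 at start
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # h = Wx + (alpha / r) * BAx
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(16, 16, bias=False), r=4, alpha=4)
x = torch.randn(2, 16)
# At initialization the LoRA branch contributes nothing:
assert torch.allclose(layer(x), layer.base(x))
```

Only `A` and `B` appear in `layer.parameters()` with `requires_grad=True`, so the optimizer states that dominate full fine-tuning memory are allocated for the low-rank matrices alone.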

Which Modules to Target

Not all weight matrices benefit equally from LoRA. The standard approach targets attention projection matrices:

from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=16,                          # Rank — higher = more expressive, more params
    lora_alpha=32,                 # Scaling factor (alpha/r = 2.0 here)
    target_modules=[
        "q_proj",                  # Query projection
        "k_proj",                  # Key projection
        "v_proj",                  # Value projection
        "o_proj",                  # Output projection
        "gate_proj",               # FFN gate (optional but often helpful)
        "up_proj",                 # FFN up projection (optional)
        "down_proj",               # FFN down projection (optional)
    ],
    lora_dropout=0.05,             # Dropout on LoRA layers
    bias="none",                   # Don't train biases
    task_type="CAUSAL_LM",        # Decoder-only generation
)

model = get_peft_model(base_model, config)
model.print_trainable_parameters()
# trainable params: 41,943,040 || all params: 8,030,261,248 || trainable%: 0.5223

Rank Selection Guidelines

Rank (r)   Use Case                       Trainable Params (7B model)
   4       Simple style adaptation        ~10M
   8       Standard fine-tuning           ~20M
  16       Domain adaptation              ~41M
  32       Complex task learning          ~82M
  64       Near full fine-tune quality    ~164M

Diminishing returns set in above r=64 for most tasks. Start with r=16 and tune from there.
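The counts in the table follow directly from the r·(d + k) formula summed over the targeted modules. As an illustrative cross-check against the `print_trainable_parameters()` output shown earlier, the following assumes Llama 3 8B's projection shapes (32 layers, 4096 hidden size, 1024-dim grouped-query K/V heads, 14336-dim FFN), taken from the public model config:

```python
def lora_trainable_params(r: int) -> int:
    """Total LoRA parameters when targeting all seven projections in
    every layer of Llama 3 8B (shapes assumed from the public config)."""
    shapes = [          # (out_features, in_features) per targeted module
        (4096, 4096),   # q_proj
        (1024, 4096),   # k_proj  (grouped-query attention)
        (1024, 4096),   # v_proj
        (4096, 4096),   # o_proj
        (14336, 4096),  # gate_proj
        (14336, 4096),  # up_proj
        (4096, 14336),  # down_proj
    ]
    per_layer = sum(r * (d + k) for d, k in shapes)
    return 32 * per_layer  # 32 transformer layers

print(lora_trainable_params(16))  # -> 41943040, matching the earlier output
```

Since every term scales linearly with r, halving the rank halves the count: r=8 gives ~21M trainable parameters under the same assumptions.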

Merging LoRA Weights for Production

After training, you can merge the LoRA weights back into the base model, creating a single model with zero inference overhead:

# Merge and save
merged_model = model.merge_and_unload()
merged_model.save_pretrained("./my-fine-tuned-model")
tokenizer.save_pretrained("./my-fine-tuned-model")

# Load as a normal model (no PEFT dependency)
from transformers import AutoModelForCausalLM
final_model = AutoModelForCausalLM.from_pretrained("./my-fine-tuned-model")

After merging: W_merged = W_original + (alpha/r) × BA
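The merge is exact, not an approximation: folding the scaled BA product into W produces identical outputs. A small check with random matrices (illustrative sizes):

```python
import torch

d, k, r, alpha = 64, 64, 8, 16
W = torch.randn(d, k)
B = torch.randn(d, r)
A = torch.randn(r, k)
x = torch.randn(k)

scale = alpha / r
adapter_out = W @ x + scale * (B @ (A @ x))  # LoRA forward: Wx + (alpha/r)BAx
W_merged = W + scale * (B @ A)               # fold the update into the weights
merged_out = W_merged @ x                    # plain linear forward, no adapter

assert torch.allclose(adapter_out, merged_out, atol=1e-5)
```

Because the merged model is a plain weight matrix, it serves at exactly the same latency as the original base model.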

Actionable Takeaways

  • LoRA reduces GPU memory for fine-tuning by 10–100× depending on rank and target modules
  • The rank r is the most important hyperparameter — treat it like hidden layer width in a classical network
  • Always initialize B to zeros — this ensures a stable start from the pre-trained checkpoint
  • Targeting q_proj and v_proj alone (the minimal config) often achieves 85–90% of full LoRA performance
  • Merge weights before production deployment to eliminate the PEFT dependency and runtime overhead