Knowledge Distillation: Teaching a Small Model to Imitate a Big One

Distillation trains a small student model to mimic a big teacher's soft outputs. You ship the small one: much cheaper to run and often close to the teacher in quality.

The apprentice analogy

A master chef needs years of practice and pages of theory to become great. An apprentice does not need to read every book. They watch the master cook a thousand dishes and copy the exact gestures: how much salt, when to flip, how the wrist moves. They become nearly as good in a fraction of the time.

Knowledge distillation is the same. A big, slow teacher model generates outputs (or full probability distributions) for a pile of inputs. A small student model is trained to match those outputs. You deploy the student.

Hard vs soft labels

  • Hard labels — "this email is spam." Either-or.
  • Soft labels — "84% spam, 12% promotion, 4% personal." A full distribution over classes.

Distillation uses soft labels from the teacher. They carry a richer signal than the answer alone: how confident the teacher is, which alternatives it considered, and which classes it finds similar. The student learns the teacher's whole worldview, not just its conclusions.

For LLMs, the analogue is the full next-token probability vector at every position, not just the chosen token.
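
To make the distinction concrete, here is a minimal Python sketch that derives both kinds of label from one set of teacher scores. The logits are hypothetical values chosen so that the softmax lands near the 84/12/4 split above; they do not come from any real model.

```python
import math

def softmax(logits):
    """Convert raw scores to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def one_hot(index, size):
    """Hard label: all probability mass on a single class."""
    return [1.0 if i == index else 0.0 for i in range(size)]

classes = ["spam", "promotion", "personal"]
teacher_logits = [3.0, 1.0, 0.0]  # hypothetical teacher scores

# Soft label: the teacher's full distribution over classes.
soft_label = softmax(teacher_logits)

# Hard label: keep only the teacher's top choice and discard the rest.
top_class = soft_label.index(max(soft_label))
hard_label = one_hot(top_class, len(classes))

for name, soft, hard in zip(classes, soft_label, hard_label):
    print(f"{name:10s} soft={soft:.2f} hard={hard:.0f}")
```

For a language model the same idea applies per position: `teacher_logits` would have one entry per vocabulary token rather than three.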

The standard recipe

  1. Pick a strong teacher (big, expensive, accurate).
  2. Run the teacher on your dataset; capture its predictions / logits.
  3. Train a smaller student on the same data with a loss that combines:
    • matching the original ground-truth labels (if available),
    • matching the teacher's soft predictions (the distillation loss).
  4. Tune the temperature that softens the teacher's distribution. A higher temperature flattens it, so more of the teacher's relative preferences among the non-top classes reach the student.
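
The combined loss in step 3 can be sketched in plain Python as follows. This is one common formulation, not necessarily the exact one any given library uses: a cross-entropy term against the ground-truth label plus a KL-divergence term against the teacher's temperature-softened distribution. The soft term is scaled by T², a convention from Hinton et al.'s original distillation paper that keeps its gradient magnitude comparable as T changes. The function name `distillation_loss`, the `alpha` mixing weight, and the example logits are illustrative assumptions.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a temperature (numerically stable)."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def distillation_loss(student_logits, teacher_logits, true_index,
                      temperature=2.0, alpha=0.5):
    """alpha * hard-label loss + (1 - alpha) * T^2 * soft-label loss."""
    # Hard term: cross-entropy against the ground-truth class at T = 1.
    student_probs = softmax(student_logits)
    hard_loss = -math.log(student_probs[true_index] + 1e-12)

    # Soft term: match the teacher's softened distribution.
    teacher_soft = softmax(teacher_logits, temperature)
    student_soft = softmax(student_logits, temperature)
    soft_loss = kl_divergence(teacher_soft, student_soft)

    return alpha * hard_loss + (1.0 - alpha) * temperature ** 2 * soft_loss

teacher = [3.0, 1.0, 0.0]          # hypothetical teacher logits
agreeing_student = [3.0, 1.0, 0.0]
confused_student = [0.0, 1.0, 3.0]

loss_agree = distillation_loss(agreeing_student, teacher, true_index=0)
loss_confused = distillation_loss(confused_student, teacher, true_index=0)
print(f"agreeing student: {loss_agree:.3f}, confused student: {loss_confused:.3f}")
```

If no ground-truth labels are available, setting `alpha=0.0` leaves only the teacher-matching term.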

Why students often beat from-scratch small models

A small model trained from scratch on hard labels has to learn the task by trial and error. A distilled student learns from a richly annotated dataset where every example carries the teacher's full opinion. It is the difference between a textbook that gives only the answers and one that also gives the worked solutions.
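
One way to see the extra signal is to look at the gradient each kind of label sends to the student's logits. For a softmax output trained with cross-entropy, that gradient is the student's probabilities minus the target distribution. The sketch below uses hypothetical numbers to compare the two, then runs plain gradient descent on the student's logits until they match the teacher. Treating the logits as free parameters is a simplification of a real network.

```python
import math

def softmax(logits):
    """Convert raw scores to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def logit_gradient(student_logits, target_probs):
    """Gradient of cross-entropy(target, softmax(logits)) w.r.t. the logits."""
    student_probs = softmax(student_logits)
    return [q - p for q, p in zip(student_probs, target_probs)]

teacher_probs = [0.84, 0.12, 0.04]   # soft label: spam, promotion, personal
hard_label = [1.0, 0.0, 0.0]         # hard label: spam only
student_logits = [0.0, 0.0, 0.0]     # untrained student starts out uniform

# The hard label pushes both wrong classes down by the same amount;
# the soft label says which wrong class is the more plausible one.
hard_grad = logit_gradient(student_logits, hard_label)
soft_grad = logit_gradient(student_logits, teacher_probs)
print("hard-label gradient:", [round(g, 3) for g in hard_grad])
print("soft-label gradient:", [round(g, 3) for g in soft_grad])

# Plain gradient descent toward the teacher's distribution.
learning_rate = 0.5
for _ in range(1000):
    grad = logit_gradient(student_logits, teacher_probs)
    student_logits = [z - learning_rate * g for z, g in zip(student_logits, grad)]

student_probs = softmax(student_logits)
print("student after training:", [round(p, 3) for p in student_probs])
```

This shows how the signal differs, not that a distilled student always ends up better; that depends on the task, the teacher, and the student's capacity.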

Where distillation shines

  • On-device deployment — phones, browsers, edge servers. A 70B teacher distilled into a 1B student fits where the teacher cannot.
  • Latency-critical paths — real-time speech, search ranking, autocomplete.
  • Cost reduction at fixed quality — same task, fraction of the inference bill.
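
A back-of-the-envelope calculation shows why the 70B-to-1B example matters for on-device use. The sketch assumes 16-bit weights (2 bytes per parameter) and counts weights only, ignoring activations, KV cache, and runtime overhead; quantised deployments would need less. The helper name `weight_memory_gb` is illustrative.

```python
def weight_memory_gb(num_params, bytes_per_param=2):
    """Approximate memory for model weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9

teacher_gb = weight_memory_gb(70e9)  # 70B-parameter teacher
student_gb = weight_memory_gb(1e9)   # 1B-parameter student

print(f"teacher weights: ~{teacher_gb:.0f} GB")
print(f"student weights: ~{student_gb:.0f} GB")
print(f"reduction: {teacher_gb / student_gb:.0f}x")
```

Under these assumptions the teacher's weights need about 140 GB and the student's about 2 GB, which is the gap between a multi-GPU server and a phone-class device.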

What it cannot fix

  • Skills the teacher lacks — the student inherits the teacher's blind spots and biases.
  • Domain shift — distil on the data you will actually serve; distillation on the wrong distribution does not generalise.
  • Truly different behaviour — distillation makes the student more like the teacher, not better than it.

Practical levers

  • Student size — too small and quality collapses; just-small-enough is the sweet spot. Start at ~10–20% of teacher params.
  • Distillation data volume — more is almost always better; teachers can label cheaply at scale.
  • Temperature — 2–5 is a sane default. Sweep on a held-out eval.
  • Mixing weight — how much to weight ground truth vs teacher logits. Often 0.5/0.5 is fine; tune if you have a strong eval signal.
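
To get a feel for the temperature lever before sweeping it on a real eval, the sketch below softens one hypothetical set of teacher logits at a few temperatures and reports the entropy of each result. Higher entropy means a flatter distribution, with more probability mass on the non-top classes. The logits and the chosen temperatures are assumptions for illustration only.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a temperature (numerically stable)."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; larger means a flatter distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

teacher_logits = [3.0, 1.0, 0.0]   # hypothetical teacher scores
temperatures = [1.0, 2.0, 5.0]

softened = {t: softmax(teacher_logits, t) for t in temperatures}
entropies = [entropy(softened[t]) for t in temperatures]

for t, h in zip(temperatures, entropies):
    rounded = [round(p, 3) for p in softened[t]]
    print(f"T={t}: probs={rounded}, entropy={h:.3f}")
```

The ranking of the classes never changes with temperature; only the gaps between them shrink. Which temperature actually helps the student is an empirical question for the held-out eval.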

Distillation is the unsung workhorse of production AI. Many of the "small, fast" models you have used were distilled from something much larger.

Engr Mejba Ahmed
