Knowledge Distillation: Teaching a Small Model to Imitate a Big One

Distillation trains a small student model to mimic a big teacher's soft outputs. You ship the small one — much cheaper, surprisingly close in quality.

Apr 29, 2026 · 3 min read

The apprentice analogy

A master chef needs years of practice and shelves of theory to become great. An apprentice does not need to read every book. They watch the master cook a thousand dishes and copy the exact gestures: how much salt, when to flip, how the wrist moves. They become 95% as good in a fraction of the time.

Knowledge distillation is the same. A big, slow teacher model generates outputs (or full probability distributions) for a pile of inputs. A small student model is trained to match those outputs. You deploy the student.

Hard vs soft labels

Hard labels — "this email is spam." Either-or.
Soft labels — "84% spam, 12% promotion, 4% personal." A full distribution over classes.

Distillation uses soft labels from the teacher. They carry a richer signal than the answer alone: how confident the teacher is, which alternatives it considered, and which classes it finds similar. The student learns the teacher's whole worldview, not just its conclusions.

For LLMs, the analogue is the full next-token probability vector at every position, not just the chosen token.
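
To make the contrast concrete, here is a minimal sketch that turns a teacher's raw scores into a soft label and compares it with the hard label for the same email. The logit values and the class list are made up for illustration; they are not taken from any real model.

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

classes = ["spam", "promotion", "personal"]
teacher_logits = [3.0, 1.0, 0.0]  # hypothetical teacher scores for one email

# Soft label: the teacher's full distribution over classes.
soft_label = softmax(teacher_logits)

# Hard label: a one-hot vector that keeps only the top class.
top = soft_label.index(max(soft_label))
hard_label = [1.0 if i == top else 0.0 for i in range(len(classes))]

for name, hard, soft in zip(classes, hard_label, soft_label):
    print(f"{name:10s} hard={hard:.0f}  soft={soft:.2f}")
```

The hard label throws away the fact that the teacher thought "promotion" was more plausible than "personal". The soft label keeps it. For an LLM the same idea applies per position, with the vocabulary playing the role of the class list.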

The standard recipe

Pick a strong teacher (big, expensive, accurate).
Run the teacher on your dataset; capture its predictions / logits.
Train a smaller student on the same data with a loss that combines:
- matching the original ground-truth labels (if available),
- matching the teacher's soft predictions (the distillation loss).
Tune the temperature that softens the teacher's distribution. A higher temperature flattens it, so the student sees more of the teacher's relative preferences among the non-top classes.
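
The combined loss in the recipe above can be sketched for a single example as follows. This is an illustrative version, not a reference implementation: the function names, the `alpha` mixing weight, and the multiplication of the soft term by the squared temperature (a common convention for keeping the two terms on a comparable scale) are assumptions that the text above does not spell out.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a higher temperature gives a flatter output."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, target_index):
    """Loss against the hard ground-truth label."""
    return -math.log(probs[target_index])

def kl_divergence(p, q):
    """KL(p || q): how far the student distribution q is from the teacher p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(student_logits, teacher_logits, target_index,
                      temperature=2.0, alpha=0.5):
    # Hard term: student vs ground truth, at temperature 1.
    hard = cross_entropy(softmax(student_logits), target_index)
    # Soft term: student vs teacher, both softened by the same temperature.
    soft = kl_divergence(softmax(teacher_logits, temperature),
                         softmax(student_logits, temperature))
    # alpha weights the ground truth; (1 - alpha) weights the teacher.
    return alpha * hard + (1 - alpha) * temperature ** 2 * soft

teacher = [3.0, 1.0, 0.0]        # hypothetical teacher logits
untrained = [0.0, 0.0, 0.0]      # student with no opinion yet
close = [2.9, 1.1, 0.1]          # student that nearly matches the teacher

loss_untrained = distillation_loss(untrained, teacher, target_index=0)
loss_close = distillation_loss(close, teacher, target_index=0)
print(loss_untrained > loss_close)
```

In a real training loop the same two terms would be computed over batches with an autodiff framework. When no ground-truth labels are available, setting `alpha` to 0 leaves only the teacher-matching term.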

Why students often beat from-scratch small models

A small model trained from scratch on hard labels has to learn the task by trial and error. A distilled student learns from a richly annotated dataset where every example carries the teacher's full opinion. It is the difference between a textbook with only the answers and one that also includes the worked solutions.

Where distillation shines

On-device deployment — phones, browsers, edge servers. A 70B teacher distilled into a 1B student fits where the teacher cannot.
Latency-critical paths — real-time speech, search ranking, autocomplete.
Cost reduction at fixed quality — same task, fraction of the inference bill.
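
A rough back-of-the-envelope check of the on-device claim: the sketch below estimates the memory needed just to hold the weights of a 70B and a 1B model. It assumes 2 bytes per parameter for 16-bit weights and 0.5 bytes for 4-bit quantised weights, and it ignores activations, caches, and runtime overhead, so real footprints will be higher.

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Memory for the weights alone, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9

models = [("teacher (70B)", 70e9), ("student (1B)", 1e9)]

for name, params in models:
    fp16 = weight_memory_gb(params, 2)    # 16-bit weights
    int4 = weight_memory_gb(params, 0.5)  # 4-bit quantised weights
    print(f"{name:14s} fp16={fp16:6.1f} GB  int4={int4:5.1f} GB")
```

Under these assumptions the teacher needs about 140 GB at 16 bits, while the student needs about 2 GB, which is within reach of a phone or a browser tab.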

What it cannot fix

Skills the teacher lacks — the student inherits the teacher's blind spots and biases.
Domain shift — distil on the data you will actually serve; distillation on the wrong distribution does not generalise.
Truly different behaviour — distillation makes the student more like the teacher, not better than it.

Practical levers

Student size — too small and quality collapses; just-small-enough is the sweet spot. Start at ~10–20% of teacher params.
Distillation data volume — more is almost always better; teachers can label cheaply at scale.
Temperature — 2–5 is a sane default. Sweep on a held-out eval.
Mixing weight — how much to weight ground truth vs teacher logits. Often 0.5/0.5 is fine; tune if you have a strong eval signal.
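
To build intuition for the temperature lever before sweeping it, the sketch below applies several temperatures to one set of hypothetical teacher logits and reports how flat the resulting distribution is, using entropy as the measure. It only illustrates the softening effect; choosing the best value still requires a held-out eval.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; higher means a flatter distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

teacher_logits = [6.0, 2.0, 0.0]  # hypothetical, fairly confident teacher
temperatures = [1.0, 2.0, 5.0, 10.0]

results = []
for t in temperatures:
    probs = softmax(teacher_logits, t)
    results.append((t, probs, entropy(probs)))
    rounded = [round(p, 3) for p in probs]
    print(f"T={t:4.1f}  probs={rounded}  entropy={entropy(probs):.3f}")
```

At temperature 1 almost all the mass sits on the top class, so the soft label is close to a hard label. As the temperature rises the other classes become visible, but at very high values the distribution approaches uniform and the signal washes out, which is why a moderate range is the usual starting point.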

Distillation is the unsung workhorse of production AI. Many of the "small, fast" models you've used were distilled from something much larger.