The apprentice analogy
A master chef needs years of practice and shelves of theory to become great. An apprentice does not have to read every book; they watch the master cook a thousand dishes and copy the exact gestures: how much salt, when to flip, how the wrist moves. They become perhaps 95% as good in a fraction of the time.
Knowledge distillation is the same. A big, slow teacher model generates outputs (or full probability distributions) for a pile of inputs. A small student model is trained to match those outputs. You deploy the student.
Hard vs soft labels
- Hard labels — "this email is spam." Either-or.
- Soft labels — "84% spam, 12% promotion, 4% personal." A full distribution over classes.
Distillation uses soft labels from the teacher. They carry a richer signal than the answer alone: how confident the teacher is, which alternatives it considered, and which classes it finds similar. The student learns the teacher's whole worldview, not just its conclusions.
For LLMs, the analogue is the full next-token probability vector at every position, not just the chosen token.
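To make the difference concrete, here is a minimal sketch using the spam example above. The two student predictions and the helper name `cross_entropy` are made up for illustration. Both students put the same probability on "spam", so a hard label cannot tell them apart, but the soft label favours the one whose leftover probability is split the way the teacher splits it.

```python
import math

# Classes, in order: spam, promotion, personal.
hard_label = [1.0, 0.0, 0.0]     # "this email is spam"
soft_label = [0.84, 0.12, 0.04]  # the teacher's full distribution


def cross_entropy(target, predicted):
    """Cross-entropy of a predicted distribution against a target distribution."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)


# Two hypothetical students, equally confident about "spam",
# but disagreeing on how to split the remaining probability.
student_a = [0.70, 0.20, 0.10]  # promotion > personal, like the teacher
student_b = [0.70, 0.05, 0.25]  # personal > promotion, unlike the teacher

hard_a = cross_entropy(hard_label, student_a)
hard_b = cross_entropy(hard_label, student_b)
soft_a = cross_entropy(soft_label, student_a)
soft_b = cross_entropy(soft_label, student_b)

print(f"hard-label loss: a={hard_a:.3f}  b={hard_b:.3f}")  # identical
print(f"soft-label loss: a={soft_a:.3f}  b={soft_b:.3f}")  # a is lower
```

The hard-label loss only looks at the probability assigned to the correct class. The soft-label loss also rewards getting the ranking of the wrong classes right, which is the extra signal distillation relies on.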
The standard recipe
- Pick a strong teacher (big, expensive, accurate).
- Run the teacher on your dataset; capture its predictions / logits.
- Train a smaller student on the same data with a loss that combines:
  - matching the original ground-truth labels (if available),
  - matching the teacher's soft predictions (the distillation loss).
- Tune the temperature that softens the teacher's distribution. A higher temperature flattens it, so the student sees more of the relative probabilities among the non-top classes.
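The combined loss in the recipe above can be sketched in a few lines of plain Python. This is an illustration for a single example, not a training loop: the function names, the example logits, and the defaults `temperature=2.0` and `alpha=0.5` are assumptions. Multiplying the soft term by the temperature squared is a common convention that keeps its gradient scale comparable to the hard term; treat it as one reasonable choice rather than a requirement.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; a higher temperature flattens the result."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(student_logits, teacher_logits, true_class,
                      temperature=2.0, alpha=0.5):
    """alpha * hard-label loss + (1 - alpha) * soft-label (distillation) loss."""
    # Hard term: ordinary cross-entropy against the ground-truth class (T = 1).
    hard_loss = -math.log(softmax(student_logits)[true_class])

    # Soft term: match the teacher's temperature-softened distribution.
    teacher_soft = softmax(teacher_logits, temperature)
    student_soft = softmax(student_logits, temperature)
    soft_loss = kl_divergence(teacher_soft, student_soft) * temperature ** 2

    return alpha * hard_loss + (1 - alpha) * soft_loss


# Made-up logits for a three-class problem; class 0 is the true label.
teacher_logits = [4.0, 1.0, 0.5]
student_logits = [2.0, 1.5, 0.2]

loss = distillation_loss(student_logits, teacher_logits, true_class=0)
print(f"combined loss: {loss:.4f}")
```

If no ground-truth labels are available, setting `alpha=0.0` leaves only the distillation term. In a real framework you would compute the same quantity over batches with the library's own softmax and KL-divergence functions.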
Why students often beat from-scratch small models
A small model trained from scratch on hard labels has to learn the task by trial and error. A distilled student learns from a richly annotated dataset where every example carries the teacher's full opinion. It is the difference between a textbook with only the answers and one that also includes the worked solutions.
Where distillation shines
- On-device deployment — phones, browsers, edge servers. A 70B teacher distilled into a 1B student fits where the teacher cannot.
- Latency-critical paths — real-time speech, search ranking, autocomplete.
- Cost reduction at fixed quality — same task, fraction of the inference bill.
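A quick back-of-the-envelope calculation shows why the 70B-to-1B example matters for on-device use. The sketch below assumes 16-bit weights (2 bytes per parameter) and counts only the weights; activations, caches, and runtime overhead are ignored, so real footprints are larger. The helper name `weight_memory_gb` is made up for illustration.

```python
def weight_memory_gb(num_params, bytes_per_param=2):
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


teacher_gb = weight_memory_gb(70e9)  # 70B parameters at 16-bit precision
student_gb = weight_memory_gb(1e9)   # 1B parameters at 16-bit precision

print(f"teacher weights: {teacher_gb:.0f} GB")
print(f"student weights: {student_gb:.0f} GB")
```

Under these assumptions the teacher needs about 140 GB for its weights and the student about 2 GB, which is the gap between a multi-accelerator server and a phone.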
What it cannot fix
- Skills the teacher lacks — the student inherits the teacher's blind spots and biases.
- Domain shift — distil on the data you will actually serve; distillation on the wrong distribution does not generalise.
- Truly different behaviour — distillation makes the student more like the teacher, not better than it.
Practical levers
- Student size — too small and quality collapses; just-small-enough is the sweet spot. Start at ~10–20% of teacher params.
- Distillation data volume — more is almost always better; teachers can label cheaply at scale.
- Temperature — 2–5 is a sane default. Sweep on a held-out eval.
- Mixing weight — how much to weight ground truth vs teacher logits. Often 0.5/0.5 is fine; tune if you have a strong eval signal.
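To get a feel for the temperature lever before running a real sweep, it helps to look at what it does to a single teacher distribution. The sketch below uses made-up logits and hypothetical helper names; it only shows the softening effect, not an evaluation on held-out data.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; a higher temperature flattens the result."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; higher means a flatter distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


teacher_logits = [4.0, 1.0, 0.5]  # made-up logits for three classes

sweep = {}
for temperature in (1.0, 2.0, 5.0):
    probs = softmax(teacher_logits, temperature)
    sweep[temperature] = (max(probs), entropy(probs))
    rounded = [round(p, 3) for p in probs]
    print(f"T={temperature}: probs={rounded} entropy={entropy(probs):.3f}")
```

As the temperature rises, the top class loses probability mass and the entropy goes up, so the smaller classes contribute more to the distillation loss. Which setting works best still has to be decided on your own held-out eval.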
Distillation is the unsung workhorse of production AI. Many of the "small fast models" you've used were distilled from something much larger.