The two-tasting-menus analogy
A new chef can already cook. To make them great, the head chef tastes two plates side by side and points to the better one — over and over, thousands of times. The chef learns not from a recipe but from preferences: this one, not that one.
RLHF (Reinforcement Learning from Human Feedback) is exactly that. Humans rank pairs of model outputs. A second model learns those preferences. Then RL nudges the original model to produce more "preferred" answers and fewer "rejected" ones, all without anyone writing a rule.
Why we need RLHF at all
A pretrained LLM has read the internet. It can write — fluently, dangerously, sometimes brilliantly. But it has no opinion about whether its output is useful, true, or safe. RLHF is how we encode "useful, true, safe" into the model without writing 10 million if-statements.
Goal in three words: helpful, honest, harmless (the "HHH" framing).
The three-stage pipeline
Stage 1 — Supervised fine-tuning (SFT)
Show the base model thousands of high-quality (prompt, response) pairs written by humans. The model learns the shape of a good answer — format, tone, structure.
Stage 2 — Reward model
Humans see two model responses to the same prompt and pick the better one. You collect tens of thousands of (prompt, chosen, rejected) triples. A separate reward model is trained to predict "which would a human prefer?" — turning soft human judgement into a number the optimiser can chase.
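To make "a number the optimiser can chase" concrete, here is a minimal sketch of the pairwise loss commonly used to train reward models (a Bradley-Terry style objective). The function name and the example scores are illustrative assumptions; in a real system the two scores would come from a neural network scoring the chosen and rejected responses.

```python
import math


def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Return -log(sigmoid(reward_chosen - reward_rejected)).

    The loss is small when the reward model scores the human-chosen
    response well above the rejected one, and large when it gets the
    order wrong.
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow
    # for large positive or negative margins.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


if __name__ == "__main__":
    # Equal scores: the model is undecided, so the loss is log(2), about 0.693.
    print(pairwise_preference_loss(0.0, 0.0))
    # Correct ordering gives a lower loss than the reversed ordering.
    print(pairwise_preference_loss(2.0, 0.0) < pairwise_preference_loss(0.0, 2.0))
```

Minimising this loss over many (prompt, chosen, rejected) triples pushes the reward model to assign higher scores to the responses humans preferred.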
Stage 3 — RL fine-tuning (PPO or DPO)
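Use the reward model as the reward signal. Run a policy-gradient algorithm (PPO is the classic choice) to update the LLM so that high-reward outputs become more likely. DPO is a shortcut rather than an RL algorithm: it drops the separate reward model and trains directly on the (prompt, chosen, rejected) triples with a simpler classification-style loss.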
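A KL penalty, a cost on how far the model's output distribution moves away from the SFT model's, pulls the model back toward its SFT version so it does not drift into reward-hacked nonsense: outputs that score well with the reward model but that humans would not actually prefer.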
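One common way to express the KL penalty is to subtract it from the reward before the RL update. The sketch below assumes sequence-level log-probabilities and an illustrative coefficient `beta=0.1`; the function name is hypothetical, and real implementations often apply the penalty per token instead.

```python
def kl_shaped_reward(
    reward: float,
    logp_policy: float,
    logp_sft: float,
    beta: float = 0.1,
) -> float:
    """Reward-model score minus beta times a single-sample KL estimate.

    logp_policy: log-probability of the response under the model being trained.
    logp_sft:    log-probability of the same response under the frozen SFT model.
    beta:        penalty strength (assumed value, tuned in practice).
    """
    # log pi(y|x) - log pi_sft(y|x) is a one-sample estimate of the KL term.
    kl_estimate = logp_policy - logp_sft
    return reward - beta * kl_estimate


if __name__ == "__main__":
    # The policy likes this response far more than the SFT model did,
    # so part of the reward is taken back as a penalty.
    print(kl_shaped_reward(reward=1.0, logp_policy=-10.0, logp_sft=-14.0))
    # No drift from the SFT model: the reward passes through unchanged.
    print(kl_shaped_reward(reward=1.0, logp_policy=-12.0, logp_sft=-12.0))
```

A larger `beta` keeps the model closer to its SFT starting point; a smaller one gives the reward model more influence.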
The four levers that decide quality
- Preference data quality — bad rankings ⇒ bad reward model ⇒ a worse final model. Annotator guidelines matter as much as the algorithm.
- Reward model accuracy — if the reward model only gets 60% of human preferences right, you are training the LLM to chase a noisy signal.
- KL penalty strength — too low and the model goes off the rails; too high and RL barely changes anything.
- Data coverage — RLHF only fixes failure modes you collected preferences on. Hostile prompts you never showed it remain weak spots.
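The reward model accuracy lever can be checked directly on a held-out set of human-labelled pairs. This is a minimal sketch, assuming you already have reward scores for each (chosen, rejected) pair; the function name and the sample scores are made up for illustration.

```python
from typing import Sequence, Tuple


def pairwise_accuracy(scored_pairs: Sequence[Tuple[float, float]]) -> float:
    """Fraction of pairs where the chosen response outscores the rejected one.

    Each item is (reward_for_chosen, reward_for_rejected).
    """
    if not scored_pairs:
        raise ValueError("need at least one scored pair")
    correct = sum(1 for chosen, rejected in scored_pairs if chosen > rejected)
    return correct / len(scored_pairs)


if __name__ == "__main__":
    # Hypothetical held-out scores: three of the five pairs are ordered correctly.
    held_out = [(1.0, 0.5), (0.2, 0.9), (2.0, 1.0), (0.1, 0.0), (0.3, 0.4)]
    print(pairwise_accuracy(held_out))
```

Because every comparison has two options, random guessing lands near 50%, which is why a reward model at 60% is only a weak signal.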
RLHF, RLAIF, and DPO
| Variant | Where the preferences come from | Trade-offs |
|---|---|---|
| RLHF | Humans | Gold standard, slow and expensive |
| RLAIF | A capable AI model | Cheaper, scalable, depends on the judge |
| DPO | Humans or an AI judge (the same preference pairs) | Simpler optimiser with no separate reward model, easier to ship |
Most modern aligned LLMs ship with some mix of these. The ratio is the secret sauce labs do not love discussing.
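To show why DPO needs no separate reward model, here is a minimal sketch of its per-example loss. It only needs log-probabilities of the chosen and rejected responses under the model being trained and under a frozen reference model (typically the SFT model). The function name, argument names, and `beta=0.1` are illustrative assumptions.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """DPO loss for one (prompt, chosen, rejected) triple.

    The loss falls as the policy raises the chosen response's probability,
    relative to the reference model, more than it raises the rejected one's.
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(logits)) written in a numerically stable form.
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))


if __name__ == "__main__":
    # Policy identical to the reference: loss is log(2), about 0.693.
    print(dpo_loss(-5.0, -7.0, -5.0, -7.0))
    # Policy has shifted probability toward the chosen response: lower loss.
    print(dpo_loss(-4.0, -8.0, -5.0, -7.0))
```

The reference log-probabilities play the role the KL penalty plays in PPO-style training: they anchor the update to the starting model.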
What RLHF does not fix
- Knowledge gaps — RLHF can change tone, not what the model knows. Use RAG or fine-tuning for facts.
- Adversarial jailbreaks — clever prompts can still surface unaligned behaviour. Layer RLHF with input/output filters.
- Bias baked in pretraining — preference data can amplify or shift bias, not erase it. Audit deliberately.
RLHF is one of the most powerful tools in modern AI — and one of the most misunderstood. It is alignment by example, not alignment by rule.