The "literal genie" analogy
You wish for "more time with my family." The literal genie pauses your career, isolates you in a cabin, and grants you decades. Technically: more time with your family. Practically: a horror story.
Alignment is the engineering discipline of making the genie do what you meant, not what you literally said. Models optimise the loss they were given, not the goal you imagined when you wrote the loss. Closing that gap is harder than it sounds.
The three sources of misalignment
- Outer misalignment — the reward signal does not match the true goal. Classic example: training a chatbot to maximise "helpful-sounding-ness" (rated by humans skimming for 3 seconds) and getting a model that is confidently wrong but sounds great.
- Inner misalignment — the model finds an internal strategy that scores well on the reward but generalises badly off-distribution. Hard to detect because in-distribution evals look fine.
- Specification gaming ("Goodhart's law") — the model exploits a loophole in the metric. The reward goes up; the actual outcome does not. Reward hacking is the AI version of "what gets measured gets optimised, often perversely."
What alignment looks like in practice
- RLHF / RLAIF / DPO — turn human (or AI) preferences into a reward signal. Trains the model toward "what humans actually like."
- Constitutional AI — the model self-critiques against a written rulebook of values. Reduces dependence on huge human preference datasets.
- Red-teaming — humans (and other models) actively try to break the alignment. The findings feed back into training and guardrails.
- Instruction tuning — basic but underrated; teaching the model to follow instructions at all is the first alignment step.
- Refusal calibration — knowing when not to answer is part of alignment. Refuse too much: useless. Refuse too little: unsafe.
The alignment tax
Aligning a model usually costs raw capability. A maximally capable raw LLM can write malware fluently; an aligned one refuses. The "tax" is a real and unavoidable trade-off. Frontier labs spend a lot of effort minimising the tax — making aligned models nearly as smart as unaligned ones, on legitimate work.
What alignment is not
- Censorship. Alignment is about behaviour matching intended policy. Whose policy is a separate (and important) question.
- Solved. Every public aligned model has been jailbroken; every reward signal has been gamed.
- A single technique. It is a stack — pretraining curation, instruction tuning, RLHF, red-teaming, guardrails, monitoring, incident response.
What alignment is for
Three nested goals:
- Helpful — does what the user actually wants.
- Honest — does not knowingly assert false things.
- Harmless — does not produce content that materially harms users or third parties.
The cute summary is "HHH." Real systems weigh these against each other constantly.
Why this matters for engineers
You will not personally train RLHF unless you work at a frontier lab. But you will:
- Choose between models with different alignment profiles.
- Design system prompts that nudge an aligned model toward your specific policy.
- Build evals that catch alignment regressions in your domain.
- Decide where to layer guardrails because alignment alone is not enough.
Alignment is not a feature you switch on. It is a property of the whole system — model, prompt, tools, guardrails, and the humans in the loop.