Home Concept Explainers AI Evaluation & Safety AI Alignment: Making Models Want What We Want

AI Evaluation & Safety Agent loop 3 sliders

AI Alignment: Making Models Want What We Want

Alignment is the gap between what we say we want and what the model actually optimises. Get it wrong and the model wins by Goodharting your reward.

Apr 29, 2026 · 3 min de lectura

Ir al laboratorio Sin registro · Gratis para siempre

▸ Pruébalo tú mismo

Arrastra un slider — el diagrama reacciona en tiempo real.

Espacio para play · ←/→ para scrubear

Agent loop

FR /100 SN-74A

SPACE · ◄ ►

¶ La analogía

The "literal genie" analogy

You wish for "more time with my family." The literal genie pauses your career, isolates you in a cabin, and grants you decades. Technically: more time with your family. Practically: a horror story.

Alignment is the engineering discipline of making the genie do what you meant, not what you literally said. Models optimise the loss they were given, not the goal you imagined when you wrote the loss. Closing that gap is harder than it sounds.

The three sources of misalignment

Outer misalignment — the reward signal does not match the true goal. Classic example: training a chatbot to maximise "helpful-sounding-ness" (rated by humans skimming for 3 seconds) and getting a model that is confidently wrong but sounds great.
Inner misalignment — the model finds an internal strategy that scores well on the reward but generalises badly off-distribution. Hard to detect because in-distribution evals look fine.
Specification gaming ("Goodhart's law") — the model exploits a loophole in the metric. The reward goes up; the actual outcome does not. Reward hacking is the AI version of "what gets measured gets optimised, often perversely."

What alignment looks like in practice

RLHF / RLAIF / DPO — turn human (or AI) preferences into a reward signal. Trains the model toward "what humans actually like."
Constitutional AI — the model self-critiques against a written rulebook of values. Reduces dependence on huge human preference datasets.
Red-teaming — humans (and other models) actively try to break the alignment. The findings feed back into training and guardrails.
Instruction tuning — basic but underrated; teaching the model to follow instructions at all is the first alignment step.
Refusal calibration — knowing when not to answer is part of alignment. Refuse too much: useless. Refuse too little: unsafe.

The alignment tax

Aligning a model usually costs raw capability. A maximally capable raw LLM can write malware fluently; an aligned one refuses. The "tax" is a real and unavoidable trade-off. Frontier labs spend a lot of effort minimising the tax — making aligned models nearly as smart as unaligned ones, on legitimate work.

What alignment is not

Censorship. Alignment is about behaviour matching intended policy. Whose policy is a separate (and important) question.
Solved. Every public aligned model has been jailbroken; every reward signal has been gamed.
A single technique. It is a stack — pretraining curation, instruction tuning, RLHF, red-teaming, guardrails, monitoring, incident response.

What alignment is for

Three nested goals:

Helpful — does what the user actually wants.
Honest — does not knowingly assert false things.
Harmless — does not produce content that materially harms users or third parties.

The cute summary is "HHH." Real systems weigh these against each other constantly.

Why this matters for engineers

You will not personally train RLHF unless you work at a frontier lab. But you will:

Choose between models with different alignment profiles.
Design system prompts that nudge an aligned model toward your specific policy.
Build evals that catch alignment regressions in your domain.
Decide where to layer guardrails because alignment alone is not enough.

Alignment is not a feature you switch on. It is a property of the whole system — model, prompt, tools, guardrails, and the humans in the loop.

From the field

Big-A Alignment is a research problem, but there's a small-a version every app owner does whether they name it or not: making the model actually serve your users' intent inside your product. My practical version is unglamorous — a clear system prompt about what the assistant is and isn't for, refusal rules for out-of-scope asks, and evals that check it behaves on the inputs I actually get. The lesson that scales down from the research is the sharp one: a model optimises what you specify, not what you meant, so the gap between "follows instructions" and "does the right thing" is yours to close with guardrails.

→ ¿Lo quieres en tu stack?

AI Integration for Your App — ChatGPT, Claude & RAG

Your product already works. The goal here is to make it smarter, deflect repetitive support, turn your own content and data into answers, and automate the manual steps, without rebuilding from scratch...

Ver cómo puedo ayudar