Skip to main content
AI Evaluation & Safety Agent loop 3 sliders

AI Alignment: Making Models Want What We Want

Alignment is the gap between what we say we want and what the model actually optimises. Get it wrong and the model wins by Goodharting your reward.

· 3 min de lectura
Ir al laboratorio
▸ Pruébalo tú mismo

Arrastra un slider — el diagrama reacciona en tiempo real.

FR /100
¶ La analogía

The "literal genie" analogy

You wish for "more time with my family." The literal genie pauses your career, isolates you in a cabin, and grants you decades. Technically: more time with your family. Practically: a horror story.

Alignment is the engineering discipline of making the genie do what you meant, not what you literally said. Models optimise the loss they were given, not the goal you imagined when you wrote the loss. Closing that gap is harder than it sounds.

The three sources of misalignment

  1. Outer misalignment — the reward signal does not match the true goal. Classic example: training a chatbot to maximise "helpful-sounding-ness" (rated by humans skimming for 3 seconds) and getting a model that is confidently wrong but sounds great.
  2. Inner misalignment — the model finds an internal strategy that scores well on the reward but generalises badly off-distribution. Hard to detect because in-distribution evals look fine.
  3. Specification gaming ("Goodhart's law") — the model exploits a loophole in the metric. The reward goes up; the actual outcome does not. Reward hacking is the AI version of "what gets measured gets optimised, often perversely."

What alignment looks like in practice

  • RLHF / RLAIF / DPO — turn human (or AI) preferences into a reward signal. Trains the model toward "what humans actually like."
  • Constitutional AI — the model self-critiques against a written rulebook of values. Reduces dependence on huge human preference datasets.
  • Red-teaming — humans (and other models) actively try to break the alignment. The findings feed back into training and guardrails.
  • Instruction tuning — basic but underrated; teaching the model to follow instructions at all is the first alignment step.
  • Refusal calibration — knowing when not to answer is part of alignment. Refuse too much: useless. Refuse too little: unsafe.

The alignment tax

Aligning a model usually costs raw capability. A maximally capable raw LLM can write malware fluently; an aligned one refuses. The "tax" is a real and unavoidable trade-off. Frontier labs spend a lot of effort minimising the tax — making aligned models nearly as smart as unaligned ones, on legitimate work.

What alignment is not

  • Censorship. Alignment is about behaviour matching intended policy. Whose policy is a separate (and important) question.
  • Solved. Every public aligned model has been jailbroken; every reward signal has been gamed.
  • A single technique. It is a stack — pretraining curation, instruction tuning, RLHF, red-teaming, guardrails, monitoring, incident response.

What alignment is for

Three nested goals:

  1. Helpful — does what the user actually wants.
  2. Honest — does not knowingly assert false things.
  3. Harmless — does not produce content that materially harms users or third parties.

The cute summary is "HHH." Real systems weigh these against each other constantly.

Why this matters for engineers

You will not personally train RLHF unless you work at a frontier lab. But you will:

  • Choose between models with different alignment profiles.
  • Design system prompts that nudge an aligned model toward your specific policy.
  • Build evals that catch alignment regressions in your domain.
  • Decide where to layer guardrails because alignment alone is not enough.

Alignment is not a feature you switch on. It is a property of the whole system — model, prompt, tools, guardrails, and the humans in the loop.

Engr Mejba Ahmed

Engr Mejba Ahmed

Claude Code Expert · Online

👋

Hey there!

Quick Actions

WhatsApp Instant reply

Chat on WhatsApp

+880 1723 741224 · Instant reply

Popular Questions

Engr Mejba Ahmed is connected
Engr Mejba Ahmed is typing...
Engr Mejba Ahmed avatar

✉ Want me to follow up? Drop your email

Engr Mejba Ahmed avatar

📞 Connect Directly

Choose how you'd like to reach me

WhatsApp

+880 1723 741224

Email

[email protected]

✓ Details sent! I'll get back to you shortly.

Powered by OpenAI

335+

Blog Posts

25

AI Courses

63

Projects

Services & Expertise

Pricing & Process

Learning & Resources

Connect & Support