AI Evals: How to Tell If Your Model Is Actually Better

Without evals, "the new prompt feels better" is just vibes. A good eval suite catches regressions before users do — here is how to build one.

Apr 29, 2026 · 3 min read

The chef-tasting analogy

You changed the recipe. Is the dish better? Answering "yeah, I think so" is not evidence. A real kitchen tastes the new dish side-by-side with the old, with the same five tasters, on the same metric (salt? acid? texture?), and writes the result down. Otherwise the chef will simply favour whatever they cooked most recently.

AI evals are the same. Without them, prompt tweaks feel like progress and quietly aren't. With them, every change is a number you can defend.

What an eval actually is

An eval is a script that:

Loads a fixed set of inputs (your dataset).
Runs the system on each input.
Scores each output against a target: exact match, a regex, a schema check, a model judge, or a human rater.
Aggregates into a small number of metrics you trust.

If you can run it from a CI job with one click, it is a real eval. If it requires "a senior engineer feels the difference," it is not.
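The four steps above can be sketched as a short script. This is a minimal illustration, not a prescribed harness: `toy_system` is a hypothetical stand-in for your model call, the three-line JSONL dataset is made up for the example, and exact match is just one of the scoring options listed.

```python
import json
from statistics import mean

# Hypothetical dataset, inlined as JSONL so the sketch is self-contained.
# In practice this would be a fixed file checked into the repo.
JSONL = """\
{"input": "ok", "target": "OK"}
{"input": "yes", "target": "YES"}
{"input": "no", "target": "nope"}
"""

def load_dataset(jsonl_text):
    """Step 1: load a fixed set of inputs."""
    return [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]

def toy_system(text):
    """Hypothetical stand-in for the system under test (an LLM call in practice)."""
    return text.upper()

def exact_match(output, target):
    """Step 3: score one output against its target."""
    return 1.0 if output.strip() == target.strip() else 0.0

def run_eval(dataset, system, scorer):
    """Steps 2 to 4: run the system, score each output, aggregate."""
    scores = [scorer(system(ex["input"]), ex["target"]) for ex in dataset]
    return {"n": len(scores), "mean_score": mean(scores), "scores": scores}

report = run_eval(load_dataset(JSONL), toy_system, exact_match)
print(json.dumps({"n": report["n"], "mean_score": round(report["mean_score"], 3)}))
# prints {"n": 3, "mean_score": 0.667}
```

Because the script takes no human input and prints a number, a CI job can run it unattended.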

Three families of metrics

Reference-based — compare output to a known correct answer. Exact match, BLEU, ROUGE for text; precision/recall/F1 for classification. Cheap, brittle to phrasing.
Reference-free / model-judged — another LLM grades the output (faithfulness, helpfulness, format). Scales, but the judge has its own biases. Always validate against humans on a sample.
Behavioural — does the output cause the desired downstream effect? Tests pass? Code compiles? User clicked? The gold standard when you can wire it.
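To see why reference-based metrics are "cheap, brittle to phrasing", here is a small sketch comparing exact match with a token-overlap F1. The tokenisation rule (lowercase, word characters only) and the example strings are assumptions chosen for illustration; token F1 is a common choice for short answers, not something this article prescribes.

```python
import re
from collections import Counter

def tokens(text):
    """Lowercase and split into word tokens (an illustrative normalisation)."""
    return re.findall(r"\w+", text.lower())

def exact_match(pred, ref):
    return 1.0 if pred.strip() == ref.strip() else 0.0

def token_f1(pred, ref):
    """Harmonic mean of token precision and recall against the reference."""
    p, r = tokens(pred), tokens(ref)
    if not p or not r:
        return 1.0 if p == r else 0.0
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((Counter(p) & Counter(r)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(p)
    recall = overlap / len(r)
    return 2 * precision * recall / (precision + recall)

ref = "Paris"
pred = "The capital is Paris."

print(exact_match(pred, ref))          # 0.0: correct answer, different phrasing
print(round(token_f1(pred, ref), 2))   # 0.4: partial credit for the shared token
```

Neither score says whether the answer is actually right, which is where model-judged and behavioural metrics come in.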

How to build one in a day

Collect 50 real inputs from production logs and anonymise them. 50 is enough to start. 500 is good. 5000 is great.
Label the desired outputs, even fuzzy ones (for example a rubric: helpfulness from 1 to 5, schema-valid yes/no).
Pick metrics. Combine an automated metric with a model-judged one for richer signal.
Lock the dataset. Treat it like a benchmark: you do not edit the dataset to make scores go up.
Run on every change. A red eval is a blocked deploy.
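Two of these steps lend themselves to a few lines of code: locking the dataset and blocking a deploy on a red eval. The sketch below is one possible approach under stated assumptions. The SHA-256 fingerprint, the `gate` helper, the metric names, the scores, and the 0.01 tolerance are all hypothetical choices made for illustration.

```python
import hashlib
import json

def dataset_fingerprint(examples):
    """Hash a canonical JSON dump so any edit to the dataset changes the hash."""
    canonical = json.dumps(examples, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def gate(baseline, candidate, tolerance=0.01):
    """Return the metrics where the candidate fell more than `tolerance` below baseline."""
    return sorted(
        name for name, base in baseline.items()
        if candidate.get(name, 0.0) < base - tolerance
    )

# Hypothetical two-example dataset; store the fingerprint next to the baseline scores.
examples = [
    {"input": "reset my password", "target": "account_help"},
    {"input": "where is my order", "target": "shipping"},
]
locked = dataset_fingerprint(examples)

# Hypothetical scores, each normalised to the 0..1 range.
baseline = {"schema_valid": 0.98, "helpfulness": 0.80}
candidate = {"schema_valid": 0.99, "helpfulness": 0.74}

regressions = gate(baseline, candidate)
print(regressions)  # ['helpfulness']
# In CI you would exit non-zero when `regressions` is non-empty,
# or when the fingerprint no longer matches `locked`.
```

Checking every metric separately means a gain on one score cannot mask a drop on another.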

Common pitfalls

Tiny eval sets: 5 examples is not an eval, it's a vibe. Aim for at least 50.
Eval / training set leakage: never tune prompts using the eval set itself; you'll overfit to it. Tune on a separate validation set and keep the eval set held out.
One-metric tunnel vision: a single number hides regressions in dimensions you didn't measure (e.g. faithfulness up, but tone became corporate).
Judge LLM from the same family as the system LLM: judges favour their own family. Use a different model as judge when possible.
Stale ground truth: what was correct in 2024 may be wrong in 2026. Refresh the ground truth periodically.
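The "tiny eval sets" pitfall can be made concrete with a confidence interval. The sketch below uses the Wilson score interval for a pass rate, a standard statistical tool that this article does not itself name. The 80% observed pass rate and the sample sizes are assumptions for illustration.

```python
import math

def wilson_interval(passes, n, z=1.96):
    """Approximate 95% Wilson score interval for a pass rate of `passes` out of `n`."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = passes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half

# Same observed pass rate (80%) at three hypothetical eval-set sizes.
widths = {}
for n in (5, 50, 500):
    low, high = wilson_interval(n * 4 // 5, n)
    widths[n] = high - low
    print(f"n={n:>3}: pass rate 0.80, interval {low:.2f} to {high:.2f}")
```

With 5 examples the interval covers more than half of the possible range, so two prompts that "score" 80% and 60% are statistically indistinguishable. The interval tightens as the set grows, which is the case for starting at 50 and growing from there.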

When humans are still required

Open-ended prose — novels, marketing copy, emotional writing. Model judges are fooled too easily here.
High-stakes domains — medical, legal, safety. Bring in domain experts, not just any annotators.
Tail behaviour — adversarial prompts, jailbreak attempts, edge cases. Humans see what evals miss.

A pragmatic mix is automated evals on every PR, a model judge on every release, and human review on every major change. The more expensive the review, the bigger the change that should trigger it.