The chef-tasting analogy
You changed the recipe. Is the dish better? Asking yourself "yeah, I think so" is not evidence. A real kitchen tastes the new dish side-by-side with the old, with the same five tasters, on the same metric (salt? acid? texture?), and writes the result down. Otherwise the chef will flip to whatever they cooked most recently.
AI evals are the same. Without them, prompt tweaks feel like progress and quietly aren't. With them, every change is a number you can defend.
What an eval actually is
An eval is a script that:
- Loads a fixed set of inputs (your dataset).
- Runs the system on each input.
- Scores each output against a target — exact match, regex, schema, judged by another model, judged by a human.
- Aggregates into a small number of metrics you trust.
If you can run it on a CI button, it is a real eval. If it requires "a senior engineer feels the difference," it is not.
Three families of metrics
- Reference-based — compare output to a known correct answer. Exact match, BLEU, ROUGE for text; precision/recall/F1 for classification. Cheap, brittle to phrasing.
- Reference-free / model-judged — another LLM grades the output (faithfulness, helpfulness, format). Scales, but the judge has its own biases. Always validate against humans on a sample.
- Behavioural — does the output cause the desired downstream effect? Tests pass? Code compiles? User clicked? The gold standard when you can wire it.
How to build one in a day
- Collect 50 real inputs from production logs (anonymise). 50 is enough to start. 500 is good. 5000 is great.
- Label the desired outputs — even fuzzy ones (a rubric: 1–5 helpfulness, schema-valid yes/no).
- Pick metrics. Combine an automated metric with a model-judged one for richer signal.
- Lock the dataset. Treat it like a benchmark — you do not iterate the dataset to make scores go up.
- Run on every change. A red eval is a blocked deploy.
Common pitfalls
- Tiny eval sets — 5 examples is not an eval, it's a vibe. Aim for ≥50 minimum.
- Eval / training set leakage — never tune prompts using the eval set itself; you'll overfit to it. Hold out a separate validation set.
- One-metric tunnel vision — a single number hides regressions in dimensions you didn't measure (e.g. faithfulness up, but tone became corporate).
- Judge LLM = same family as system LLM — judges favour their own family. Use a different model as judge when possible.
- Stale ground truth — what was correct in 2024 may be wrong in 2026. Refresh ground-truth periodically.
When humans are still required
- Open-ended prose — novels, marketing copy, emotional writing. Model judges are fooled too easily here.
- High-stakes domains — medical, legal, safety. Bring in domain experts, not just any annotators.
- Tail behaviour — adversarial prompts, jailbreak attempts, edge cases. Humans see what evals miss.
A pragmatic mix is automated evals on every PR, model-judge on every release, human review on every major change. The cost shifts as the change does.
In one line
If you cannot point at a number that went up, the prompt change did not happen.