Hallucinations: Why LLMs Make Stuff Up Confidently

Hallucinations are not bugs — they are the model doing exactly what it was trained to do. Plausibility is the loss; truth is not. Understand the trap, then engineer around it.

Apr 29, 2026 · 3 min read

The improv-actor analogy

An improv actor on stage is rewarded for making the scene flow. Pause too long, look unsure, break character — that kills the show. So they confidently invent a name, an address, a backstory. It does not have to be true. It has to be believable.

LLMs are trained the same way. The pretraining objective is to predict the next token of the training text, which rewards plausible continuations. Plausibility is everything. Truth is not in the loss. When the model lacks the relevant fact, the training pressure still says "say something that fits." That something is a hallucination.
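
Written out, the usual next-token objective makes the point explicit. This is the standard cross-entropy form for autoregressive language models; the symbols (θ for the model parameters, x_t for the t-th token of a training sequence of length T) are introduced here for illustration only:

```latex
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(x_t \mid x_{<t}\right)
```

Every term rewards matching the token that actually appeared in the training text. No term checks whether the resulting sentence is true.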

Two kinds of hallucination

Intrinsic: the answer contradicts the prompt or the model's own context (it misstates a detail you literally just gave it).
Extrinsic: the answer is not supported by any reference at all (a citation that does not exist, a person who does not exist, a function that is not in the API).

Intrinsic ones are usually fixable with better prompting. Extrinsic ones often need RAG or tool use.

Why models do not "know they don't know"

Calibrating uncertainty is hard. The model produces token probabilities, which measure how expected a token is, not how likely a claim is to be true. A model can assign 99% probability to a token that is part of a confidently wrong sentence. The output carries no dependable "hold on, I'm guessing" signal; that signal has to be engineered in.
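
A minimal sketch of why token probabilities are not a factuality score. The per-token probabilities below are made-up illustrative numbers, not output from any real model, and `sequence_stats` is a hypothetical helper:

```python
import math

def sequence_stats(token_probs):
    """Return (joint probability, geometric-mean per-token probability)."""
    log_sum = sum(math.log(p) for p in token_probs)
    joint = math.exp(log_sum)
    per_token = math.exp(log_sum / len(token_probs))
    return joint, per_token

# Hypothetical probabilities for the tokens of one fluent sentence.
# Nothing in these numbers records whether the sentence is true.
fluent_sentence = [0.99, 0.97, 0.98, 0.99, 0.96]

joint, per_token = sequence_stats(fluent_sentence)
print(f"joint={joint:.3f} per_token={per_token:.3f}")
```

Both numbers come out high because the sentence is fluent. A true sentence and a fabricated one can produce the same scores, which is why a separate check is needed.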

The five engineering moves

Ground in retrieval (RAG). If the model can cite a real chunk, it is less likely to invent one.
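Force structured output with verification. Validating output against a schema catches structural hallucinations (invented fields, wrong types, values outside an allowed set) as parse errors before users see them. A well-formed but false value still passes, so pair the schema with a check against the source.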
Self-consistency / multiple samples. Generate N answers to the same prompt; if they disagree substantially, flag the response. Simple to add and effective, at the cost of N inference calls.
Tool use for facts. Calculator for math. Database for lookups. Code execution for code. Hallucinations rarely survive a real tool.
Refusal training. Teach the model to say "I do not know" instead of guessing. Hardest because it must be calibrated — refuse too often and the model becomes useless.
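
The self-consistency move from the list above can be sketched as follows. This assumes short answers that can be compared as normalized strings; `normalize`, `self_consistency`, and the 0.6 threshold are illustrative choices, not a standard API, and the sampled answers are hard-coded in place of real model calls:

```python
from collections import Counter

def normalize(answer: str) -> str:
    """Lowercase, drop a trailing period, and collapse whitespace."""
    return " ".join(answer.lower().strip().rstrip(".").split())

def self_consistency(answers, min_agreement=0.6):
    """Return (most common answer, agreement ratio, flagged?)."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(normalize(a) for a in answers)
    top_answer, top_count = counts.most_common(1)[0]
    agreement = top_count / len(answers)
    # Flag the response when the samples do not agree often enough.
    return top_answer, agreement, agreement < min_agreement

# Hypothetical samples for the same prompt.
samples = ["Paris", "paris.", "Paris", "Lyon", "Paris"]
print(self_consistency(samples))
```

Agreement is evidence, not proof: a model can be consistently wrong, so high agreement only means the answer is stable. Longer free-text answers would need a looser comparison than string equality.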

What does not fix hallucinations

More parameters. Bigger models tend to hallucinate less, but the rate never reaches zero. The qualitative failure mode persists.
Lower temperature. Reduces variance, not factuality. A confident wrong answer at temperature 0 is still wrong.
Tougher system prompts. "Do not hallucinate" is the AI equivalent of "do not be wrong." It does little.
Bigger context. Stuffing more text into a prompt does not change the loss objective. Models can still confabulate against their own context.
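
The temperature point can be checked with a few lines of arithmetic. This is a sketch of temperature-scaled softmax over hypothetical logits; the values are invented, and the first candidate is simply assumed to be the wrong answer:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for three candidate tokens.
# Assume index 0 is the model's favourite and is factually wrong.
logits = [2.0, 1.5, 0.5]

for t in (1.0, 0.5, 0.1):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Lowering the temperature concentrates probability on whichever token already ranked first. The ranking does not change, so a wrong favourite becomes a more confident wrong favourite.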

Measurement is the unsung hero

You cannot fix what you do not measure. Build a hallucination eval specific to your domain:

Curate a set of factual prompts where you know the correct answer.
Score: exact-match for factual claims, faithfulness against the source for RAG, schema-validity for structured outputs.
Track per release. Hallucination rate is a metric like latency — only valuable when continuously monitored.
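
A minimal sketch of such an eval, covering only the exact-match case. `EvalCase`, `hallucination_rate`, and `fake_model` are hypothetical names; `fake_model` is a canned stand-in for a real model call, and one of its answers is deliberately wrong:

```python
from dataclasses import dataclass

@dataclass
class EvalCase:
    prompt: str
    expected: str

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace before comparing."""
    return " ".join(text.lower().split())

def hallucination_rate(cases, answer_fn):
    """Fraction of cases whose answer fails normalized exact match."""
    if not cases:
        raise ValueError("eval set is empty")
    misses = sum(
        1 for case in cases
        if normalize(answer_fn(case.prompt)) != normalize(case.expected)
    )
    return misses / len(cases)

def fake_model(prompt: str) -> str:
    """Stand-in for a model call; replace with your own client."""
    canned = {
        "Capital of France?": "Paris",
        "Chemical symbol for gold?": "Ag",  # deliberately wrong
    }
    return canned.get(prompt, "")

cases = [
    EvalCase("Capital of France?", "paris"),
    EvalCase("Chemical symbol for gold?", "Au"),
]
print(hallucination_rate(cases, fake_model))
```

Running the same cases against every release turns the result into a number that can be tracked over time. Faithfulness and schema-validity scoring would need their own checks alongside this one.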

A useful mental model

Treat the LLM as a confident, fluent intern with no internet access and a fading memory. They will absolutely produce an answer if you ask. Whether that answer is true is your job to engineer for — through retrieval, tools, structured outputs, and evals. The intern is not lying. They are doing the only thing they were trained to do.