The poll-the-experts analogy
Ask one expert a hard question, you get one answer. Ask five experts independently, count the answers, and the majority is right startlingly often — even when each expert alone is only 70% accurate. The wisdom of independent samples beats any single oracle.
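A short calculation shows where that intuition comes from. The sketch below assumes the experts are fully independent and the question has a simple right/wrong outcome, both of which are simplifications; `majority_accuracy` is an illustrative helper name, not something from a library.

```python
from math import comb

def majority_accuracy(n: int, p: float) -> float:
    """Probability that a strict majority of n independent voters is right,
    when each voter is right with probability p (n is assumed odd)."""
    need = n // 2 + 1  # smallest number of correct votes that wins
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(need, n + 1))

print(round(majority_accuracy(5, 0.7), 3))  # 0.837
```

Under those assumptions, five 70% experts voting together are right about 84% of the time. Real samples from one model are correlated, so the practical gain is usually smaller than this idealised figure.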
Self-consistency runs that poll on a single LLM. Generate N answers at non-zero temperature, take the most common one. Cheap, easy, and the accuracy lift on hard reasoning tasks is often huge.
The recipe
- Take a prompt that benefits from chain-of-thought.
- Run it N times with temperature > 0 so the samples differ.
- Extract the final answer from each sample.
- Pick the majority answer.
That's it. Three more lines of code, one more loop, N× the cost.
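The recipe can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: `sample_fn` stands in for whatever call produces one completion at temperature > 0, and `extract_fn` for whatever pulls the final answer out of it. Both are hypothetical callables supplied by the caller.

```python
from collections import Counter
from typing import Callable, Optional

def self_consistency(
    sample_fn: Callable[[str], str],   # hypothetical: one model call, temperature > 0
    prompt: str,
    extract_fn: Callable[[str], str],  # hypothetical: completion text -> final answer
    n: int = 5,
) -> Optional[str]:
    """Sample n completions, extract each final answer, return the most common one."""
    answers = [extract_fn(sample_fn(prompt)) for _ in range(n)]
    if not answers:
        return None
    # On a tie, the answer that appeared first wins.
    return Counter(answers).most_common(1)[0][0]

# Demo with canned completions instead of a real model.
canned = iter(["... so 12", "... so 12", "... so 15", "... so 12", "... so 11"])
result = self_consistency(lambda p: next(canned), "q", extract_fn=lambda t: t.split()[-1], n=5)
print(result)  # 12
```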
Why it works
LLMs trained to reason often reach the right answer through multiple valid paths. At temperature 0, the model picks one path — and if the "best-guess" path is the wrong one, you're stuck with a wrong answer.
At higher temperature, the model samples different paths. Wrong paths tend to disagree (each is wrong differently). The right path is a stable attractor — multiple samples converge on it. Majority vote surfaces the convergence.
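A toy simulation makes the convergence argument concrete. It assumes samples are independent, that the right answer comes up with probability `p_right`, and that the remaining probability is spread evenly over `n_wrong` distinct wrong answers. All numbers here are illustrative, not measurements of any model.

```python
import random
from collections import Counter

def plurality_accuracy(p_right: float, n_wrong: int, n_samples: int,
                       trials: int = 2000, seed: int = 0) -> float:
    """Fraction of trials in which the most common sampled answer is the right one."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    options = ["right"] + [f"wrong{i}" for i in range(n_wrong)]
    weights = [p_right] + [(1 - p_right) / n_wrong] * n_wrong
    hits = 0
    for _ in range(trials):
        votes = rng.choices(options, weights=weights, k=n_samples)
        winner = Counter(votes).most_common(1)[0][0]  # ties go to the first-seen answer
        hits += winner == "right"
    return hits / trials

single = plurality_accuracy(p_right=0.4, n_wrong=6, n_samples=1)
voted = plurality_accuracy(p_right=0.4, n_wrong=6, n_samples=15)
print(single, voted)
```

With these settings a single sample is right about 40% of the time by construction, while the 15-sample plurality lands well above that, because the wrong votes are split six ways and rarely outnumber the right ones.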
When the lift is biggest
- Multi-step math — single-shot is brittle; majority-of-5 cleans up most arithmetic flips.
- Logic puzzles — branchy reasoning where one wrong subgoal sinks the whole answer.
- Diagnostic / classification with rationale — when the answer is one of a small set.
- Code unit tests — generate N candidate functions, run them all against tests, pick the one that passes most.
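The code-generation case in the last bullet swaps the vote for a test run. A minimal sketch, assuming the candidates are already available as Python callables and the tests are `(args, expected)` pairs; `passed_count` and `pick_best` are illustrative names.

```python
from typing import Callable, List, Sequence, Tuple

TestCase = Tuple[tuple, object]  # (positional args, expected result)

def passed_count(fn: Callable, tests: Sequence[TestCase]) -> int:
    """Number of test cases the candidate passes."""
    passed = 0
    for args, expected in tests:
        try:
            if fn(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crashing candidate simply fails that test
    return passed

def pick_best(candidates: List[Callable], tests: Sequence[TestCase]) -> Callable:
    """Return the candidate that passes the most tests (first one wins ties)."""
    return max(candidates, key=lambda fn: passed_count(fn, tests))

# Three hypothetical model-written attempts at an absolute-value function.
candidates = [lambda x: x, lambda x: -x, lambda x: x if x >= 0 else -x]
tests = [((3,), 3), ((-2,), 2), ((0,), 0)]
best = pick_best(candidates, tests)
print(best(-7))  # 7
```

In practice, model-generated code would normally be executed in a sandbox rather than in-process as it is here.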
When the lift is small
- Tasks where the model is already near-perfect. No room to improve.
- Open-ended creative outputs. "Majority" makes no sense for a poem.
- Tasks where wrong answers happen to agree — biased models converge on wrong answers as confidently as right ones.
- Strict latency budgets. N× the cost = N× the wait if calls are sequential.
Engineering it well
- Parallelise the N calls. Sequential = N× latency; parallel = roughly single-call latency (bounded by the slowest sample), N× cost. Almost always parallel.
- Pick N pragmatically. N=3 catches a surprising amount; N=5 is the sweet spot for most tasks; N=10+ has diminishing returns.
- Use a moderate temperature — 0.6–0.8. Too low and samples cluster on the same wrong path; too high and the samples become noise.
- Extract answers reliably. Force structured output (<answer>...</answer>) so parsing is trivial.
- Track agreement. If 5/5 agree, ship confident. If 3/2 split, flag it for review. The vote distribution is itself a signal.
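Several of these points fit into one small sketch: parallel sampling, tag-based extraction, and an agreement score. It assumes `sample_fn` is a blocking, thread-safe call that returns one completion containing an `<answer>...</answer>` tag; `sample_fn` and the helper names are hypothetical, and the standard-library thread pool is only one of several ways to parallelise.

```python
import re
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional, Tuple

ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)

def extract_answer(text: str) -> Optional[str]:
    """Return the contents of the first <answer> tag, or None if it is missing."""
    match = ANSWER_RE.search(text)
    return match.group(1).strip() if match else None

def sample_parallel(sample_fn: Callable[[str], str], prompt: str, n: int = 5) -> List[str]:
    """Issue the n calls concurrently so latency stays near that of a single call."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        return list(pool.map(lambda _: sample_fn(prompt), range(n)))

def vote_with_agreement(outputs: List[str]) -> Tuple[Optional[str], float]:
    """Majority answer plus the share of all samples that voted for it."""
    answers = [a for a in map(extract_answer, outputs) if a is not None]
    if not answers:
        return None, 0.0
    winner, count = Counter(answers).most_common(1)[0]
    return winner, count / len(outputs)  # unparseable samples lower the agreement

outputs = ["... <answer>42</answer>", "<answer> 42 </answer>", "<answer>41</answer>",
           "<answer>42</answer>", "<answer>41</answer>"]
print(vote_with_agreement(outputs))  # ('42', 0.6)
```

A caller could then ship the answer when agreement is 1.0 and route a 0.6 split to review; the exact thresholds are a judgment call for each task.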
Variants
- Weighted self-consistency — different samples get different weights (longer reasoning = more weight, or per-sample confidence).
- Self-consistency with verification — for each sample, run a verifier (tests, lookup, sanity check) and use only verified answers in the vote.
- Best-of-N with reward model — score each sample with a reward model, take the best (not the majority). Strong when you have a good scorer.
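The three variants differ only in how samples are combined. A rough sketch, assuming each sample has already been reduced to an answer string; the weights, `verifier`, and `score_fn` are hypothetical stand-ins for a confidence estimate, a checker, and a reward model.

```python
from collections import defaultdict
from typing import Callable, Iterable, List, Optional, Tuple

def weighted_vote(samples: Iterable[Tuple[str, float]]) -> Optional[str]:
    """Sum the weights per answer and return the answer with the largest total."""
    totals = defaultdict(float)
    for answer, weight in samples:
        totals[answer] += weight
    return max(totals, key=totals.get) if totals else None

def verified_vote(answers: List[str], verifier: Callable[[str], bool]) -> Optional[str]:
    """Plain majority vote restricted to answers that pass the verifier."""
    return weighted_vote((a, 1.0) for a in answers if verifier(a))

def best_of_n(answers: List[str], score_fn: Callable[[str], float]) -> Optional[str]:
    """Take the single highest-scoring sample instead of the most common one."""
    return max(answers, key=score_fn, default=None)

print(weighted_vote([("A", 0.9), ("B", 0.4), ("B", 0.4)]))   # A
print(verified_vote(["10", "12", "12", "x"], str.isdigit))   # 12
print(best_of_n(["short", "a longer answer"], len))          # a longer answer
```

Note that in the first example the weighted vote picks "A" even though "B" appears twice, which is exactly the behaviour that separates it from a plain majority.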
Cost vs quality
This is a knob. N=1 is cheap and brittle. N=10 costs 10× as much and, on hard tasks, can buy an absolute accuracy lift on the order of 10–20 percentage points. Use it on the high-value, hard-reasoning slice of your traffic, not on every call. Routing rules and confidence-based escalation make this practical:
- Run N=1 first.
- If the confidence signals (unusual response length, hedging language, a rejected structured output) look shaky, escalate to N=5.
- Most calls finish at N=1; expensive escalations only happen where they pay.
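The escalation steps above might look like the following. It is a sketch under stated assumptions: `sample_fn` returns an already-extracted answer, `looks_shaky` is whatever heuristic the application trusts, and the first sample is reused as one of the five votes so an escalation costs five calls in total. Both callables are hypothetical.

```python
from collections import Counter
from typing import Callable, Tuple

def answer_with_escalation(
    sample_fn: Callable[[str], str],     # hypothetical: one call -> extracted answer
    prompt: str,
    looks_shaky: Callable[[str], bool],  # hypothetical confidence heuristic
    n_escalated: int = 5,
) -> Tuple[str, int]:
    """Return (answer, number of calls made). Votes only when the first try looks shaky."""
    first = sample_fn(prompt)
    if not looks_shaky(first):
        return first, 1  # the common, cheap path
    # Escalate: keep the first sample and draw the rest, then take the majority.
    samples = [first] + [sample_fn(prompt) for _ in range(n_escalated - 1)]
    winner = Counter(samples).most_common(1)[0][0]
    return winner, n_escalated

# Demo: the first reply is an unparseable "?", which triggers the vote.
replies = iter(["?", "7", "7", "8", "7"])
print(answer_with_escalation(lambda p: next(replies), "q", lambda a: a == "?"))  # ('7', 5)
```

Reusing the first sample is a design choice; discarding it and drawing five fresh samples is equally defensible when the shaky output is likely to be malformed.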
What it does not solve
- Bias. A model that's wrong consistently will vote wrong consistently.
- Hallucinations of fact. All five samples can hallucinate the same nonexistent function. Self-consistency is about reasoning quality, not factual grounding — pair with RAG for the latter.
- Long-horizon planning. Voting on 50-step plans is wonky; majority across complex artifacts is ill-defined.
In one line
Self-consistency is the cheapest way to turn a "smart-sometimes" model into a "smart-usually" model — a few extra calls, a real bump in correctness, almost no engineering complexity.