Self-Consistency: Voting Across Multiple LLM Samples

Run the same prompt N times at non-zero temperature, take the majority answer. A few extra calls, big accuracy gains on hard reasoning.

Apr 29, 2026 · 3 min read

The poll-the-experts analogy

Ask one expert a hard question and you get one answer. Ask five experts independently, count the answers, and the majority is right startlingly often, even when each expert alone is only 70% accurate. Pooled independent samples tend to beat any single one of them.
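The 70% figure can be checked with a short calculation. This is a minimal sketch under two simplifying assumptions that the analogy leaves implicit: the question has only two possible answers, and every expert is right independently with the same probability. `majority_accuracy` is an illustrative name, not a library function.

```python
from math import comb


def majority_accuracy(p: float, n: int) -> float:
    """Probability that a strict majority of n independent voters is correct.

    Assumes a two-way question, an odd n, and that every voter is
    correct with the same probability p.
    """
    need = n // 2 + 1  # smallest number of correct votes that forms a majority
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(need, n + 1))


print(round(majority_accuracy(0.7, 1), 3))  # 0.7
print(round(majority_accuracy(0.7, 5), 3))  # 0.837
```

Samples drawn from one model are correlated rather than independent, so the gain in practice is usually smaller than this idealised figure suggests.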

Self-consistency runs that poll on a single LLM: generate N answers at non-zero temperature and take the most common one. It is cheap to build, and the accuracy lift on hard reasoning tasks is often large.

The recipe

Take a prompt that benefits from chain-of-thought.
Run it N times with temperature > 0 so the samples differ.
Extract the final answer from each sample.
Pick the majority answer.

That's it. Three more lines of code, one more loop, N× the cost.
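The four steps can be sketched as follows. This is a minimal illustration, not a reference implementation: `sample` stands for whatever function calls your model (its `(prompt, temperature)` signature is an assumption), `fake_sample` is a stand-in that replays canned completions, and the `<answer>` tag convention assumes the prompt asks the model to wrap its final answer that way.

```python
import re
from collections import Counter
from itertools import cycle
from typing import Callable, Optional

ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def extract_answer(completion: str) -> Optional[str]:
    """Return the text of the last <answer>...</answer> tag, or None if there is none."""
    matches = ANSWER_RE.findall(completion)
    return matches[-1].strip() if matches else None


def self_consistency(
    sample: Callable[[str, float], str],  # hypothetical: (prompt, temperature) -> completion
    prompt: str,
    n: int = 5,
    temperature: float = 0.7,
) -> Optional[str]:
    """Sample n completions, parse each final answer, and return the most common one."""
    answers = [extract_answer(sample(prompt, temperature)) for _ in range(n)]
    votes = Counter(a for a in answers if a is not None)  # unparsable samples do not vote
    return votes.most_common(1)[0][0] if votes else None


# Stand-in for a real model call: replays canned completions in order.
_canned = cycle([
    "6 * 7 = 42. <answer>42</answer>",
    "6 * 7 = 41? <answer>41</answer>",
    "Six sevens are 42. <answer>42</answer>",
    "I think it is 42.",  # no tag, so this sample is dropped
    "<answer> 42 </answer>",
])


def fake_sample(prompt: str, temperature: float) -> str:
    return next(_canned)


print(self_consistency(fake_sample, "What is 6 * 7?"))  # 42
```

Answers are compared as exact strings after stripping whitespace, so "42" and "42.0" would count as different votes. Normalising answers before counting is usually worth the extra line.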

Why it works

LLMs trained to reason often reach the right answer through multiple valid paths. At temperature 0, the model picks one path — and if the "best-guess" path is the wrong one, you're stuck with a wrong answer.

At higher temperature, the model samples different paths. Wrong paths tend to disagree (each is wrong differently). The right path is a stable attractor — multiple samples converge on it. Majority vote surfaces the convergence.
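A toy simulation makes the "wrong paths disagree" argument concrete. The numbers here are invented for illustration: the sampler is right only 40% of the time and otherwise picks one of six wrong answers at random, so no single wrong answer gathers many votes.

```python
import random
from collections import Counter

CORRECT = "42"
WRONG = ["40", "41", "43", "44", "45", "46"]


def sample_answer(rng: random.Random) -> str:
    """Toy sampler: correct 40% of the time, otherwise a random wrong answer."""
    if rng.random() < 0.4:
        return CORRECT
    return rng.choice(WRONG)


def vote(rng: random.Random, n: int) -> str:
    """Most common of n samples; ties go to whichever answer was sampled first."""
    answers = [sample_answer(rng) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]


def accuracy(n: int, trials: int = 5000, seed: int = 0) -> float:
    """Fraction of trials in which the n-sample vote lands on the correct answer."""
    rng = random.Random(seed)
    return sum(vote(rng, n) == CORRECT for _ in range(trials)) / trials


for n in (1, 5, 15):
    print(n, accuracy(n))
```

In this setup the vote is right more often than a single sample even though the single sample is wrong most of the time. The effect depends on the wrong answers being scattered; if they all landed on the same value, the vote would converge on that value instead.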

When the lift is biggest

Multi-step math — single-shot is brittle; majority-of-5 cleans up most arithmetic flips.
Logic puzzles — branchy reasoning where one wrong subgoal sinks the whole answer.
Diagnostic / classification with rationale — when the answer is one of a small set.
Code unit tests — generate N candidate functions, run them all against tests, pick the one that passes most.
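The code case differs from the others in that the tests do the choosing rather than a vote. A minimal sketch, assuming the candidates are already available as Python callables (in practice they would be generated source that has to run in a sandbox); `pick_best`, `passes`, and the three candidates are all illustrative.

```python
from typing import Any, Callable, Sequence, Tuple

TestCase = Tuple[tuple, Any]  # (positional arguments, expected result)


def passes(candidate: Callable, args: tuple, expected: Any) -> bool:
    """True if the candidate returns the expected value without raising."""
    try:
        return candidate(*args) == expected
    except Exception:
        return False


def pick_best(candidates: Sequence[Callable], tests: Sequence[TestCase]) -> Callable:
    """Return the candidate that passes the most tests (first one wins a tie)."""
    return max(candidates, key=lambda c: sum(passes(c, a, e) for a, e in tests))


# Three hypothetical model outputs for "implement absolute value".
def cand_identity(x):
    return x  # wrong for negative inputs


def cand_correct(x):
    return -x if x < 0 else x


def cand_crashes(x):
    return x // 0  # always raises ZeroDivisionError


tests = [((3,), 3), ((-3,), 3), ((0,), 0)]
best = pick_best([cand_identity, cand_correct, cand_crashes], tests)
print(best.__name__)  # cand_correct
```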

When the lift is small

Tasks where the model is already near-perfect. No room to improve.
Open-ended creative outputs. "Majority" makes no sense for a poem.
Tasks where wrong answers happen to agree — biased models converge on wrong answers as confidently as right ones.
Strict latency budgets. N× the cost = N× the wait if calls are sequential.

Engineering it well

Parallelise the N calls. Sequential = N× latency; parallel = roughly the latency of the slowest call, at N× cost. Parallel is almost always the right choice.
Pick N pragmatically. N=3 catches a surprising amount; N=5 is the sweet spot for most tasks; N=10+ has diminishing returns.
Use a moderate temperature — 0.6–0.8. Too low and samples cluster on the same wrong path; too high and the samples become noise.
Extract answers reliably. Force structured output (<answer>...</answer>) so parsing is trivial.
Track agreement. If 5/5 agree, ship confident. If 3/2 split, flag it for review. The vote distribution is itself a signal.
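Several of these points fit in one short function. The sketch below issues the calls from a thread pool, parses the `<answer>` tags, and reports the vote distribution alongside the winner. The `sample` signature, the `VoteResult` type, and the `min_agreement=0.8` threshold are assumptions for illustration; `make_fake_sampler` stands in for a real model client.

```python
import queue
import re
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, List, Optional

ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


@dataclass
class VoteResult:
    answer: Optional[str]  # winning answer, or None if nothing parsed
    agreement: float       # share of the n samples that back the winner
    counts: Counter        # full vote distribution
    needs_review: bool     # True when agreement is below the threshold


def parallel_vote(
    sample: Callable[[str, float], str],
    prompt: str,
    n: int = 5,
    temperature: float = 0.7,
    min_agreement: float = 0.8,
) -> VoteResult:
    # Threads suit this because each call mostly waits on the network.
    with ThreadPoolExecutor(max_workers=n) as pool:
        completions = list(pool.map(lambda _: sample(prompt, temperature), range(n)))
    matches = (ANSWER_RE.search(c) for c in completions)
    counts = Counter(m.group(1).strip() for m in matches if m)
    if not counts:
        return VoteResult(None, 0.0, counts, True)
    answer, top = counts.most_common(1)[0]
    agreement = top / n
    return VoteResult(answer, agreement, counts, agreement < min_agreement)


def make_fake_sampler(completions: List[str]) -> Callable[[str, float], str]:
    """Stand-in for a model client: hands out canned completions, thread-safely."""
    pending: "queue.Queue[str]" = queue.Queue()
    for text in completions:
        pending.put(text)
    return lambda prompt, temperature: pending.get_nowait()


split = make_fake_sampler(["<answer>42</answer>"] * 3 + ["<answer>41</answer>"] * 2)
result = parallel_vote(split, "What is 6 * 7?")
print(result.answer, result.agreement, result.needs_review)  # 42 0.6 True
```

Agreement is measured against all n samples, so completions that fail to parse lower it. That seems a reasonable default, since unparsable output is itself a warning sign.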

Variants

Weighted self-consistency — different samples get different weights (longer reasoning = more weight, or per-sample confidence).
Self-consistency with verification — for each sample, run a verifier (tests, lookup, sanity check) and use only verified answers in the vote.
Best-of-N with reward model — score each sample with a reward model, take the best (not the majority). Strong when you have a good scorer.
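Each variant changes only the aggregation step. The following sketch shows the three side by side on already-extracted answers; the weights, the `verify` predicate, and the `score` function are toy placeholders standing in for a real confidence estimate, verifier, or reward model.

```python
from collections import Counter, defaultdict
from typing import Callable, Dict, List, Optional, Tuple


def weighted_vote(samples: List[Tuple[str, float]]) -> str:
    """Sum the weight behind each answer and return the heaviest (samples must be non-empty)."""
    totals: Dict[str, float] = defaultdict(float)
    for answer, weight in samples:
        totals[answer] += weight
    return max(totals, key=lambda a: totals[a])


def verified_vote(answers: List[str], verify: Callable[[str], bool]) -> Optional[str]:
    """Majority vote over only the answers that pass the verifier."""
    kept = [a for a in answers if verify(a)]
    return Counter(kept).most_common(1)[0][0] if kept else None


def best_of_n(answers: List[str], score: Callable[[str], float]) -> str:
    """Return the single highest-scoring answer, ignoring how often it appears."""
    return max(answers, key=score)


# One confident "42" outweighs two low-confidence "41" votes.
print(weighted_vote([("42", 0.9), ("41", 0.3), ("41", 0.3)]))  # 42

# Toy verifier: suppose the answer is known to be even.
print(verified_vote(["12", "12", "15", "15", "15"], lambda a: int(a) % 2 == 0))  # 12

# Toy scorer standing in for a reward model.
print(best_of_n(["12", "15", "13"], lambda a: -abs(int(a) - 12)))  # 12
```

Note that the first two examples return an answer that a plain majority vote would have rejected, which is the point of these variants and also their risk: the result is only as good as the weights or the verifier.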

Cost vs quality

This is a knob. N=1 is cheap and brittle. N=10 is 10× the cost and can buy on the order of 10–20 percentage points of accuracy on hard tasks, though the gain varies a lot by model and task, so measure it on your own data. Use it on the high-value, hard-reasoning slice of your traffic, not on every call. Routing rules and confidence-based escalation make this practical:

Run N=1 first.
If the confidence signals (unusual length, hedging language, output that fails structured parsing) look shaky, escalate to N=5.
Most calls finish at N=1; expensive escalations only happen where they pay.
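The escalation rule might look like the sketch below. The heuristics in `looks_shaky` (a missing `<answer>` tag, a length cap, a few hedging phrases) are illustrative guesses at what "shaky" could mean, and the choice to reuse the first sample as one of the five votes is a design assumption rather than something the recipe prescribes.

```python
import re
from collections import Counter
from typing import Callable, List, Optional, Tuple

ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)
HEDGES = ("not sure", "wait,", "on second thought")  # illustrative phrases only


def extract_answer(completion: str) -> Optional[str]:
    match = ANSWER_RE.search(completion)
    return match.group(1).strip() if match else None


def looks_shaky(completion: str, max_chars: int = 2000) -> bool:
    """Hypothetical confidence check: unparsable, overlong, or hedging output."""
    if extract_answer(completion) is None:
        return True
    if len(completion) > max_chars:
        return True
    lowered = completion.lower()
    return any(phrase in lowered for phrase in HEDGES)


def answer_with_escalation(
    sample: Callable[[str, float], str],
    prompt: str,
    n_escalated: int = 5,
    temperature: float = 0.7,
) -> Tuple[Optional[str], int]:
    """Return (answer, number of model calls). Escalates only when the first sample looks shaky."""
    first = sample(prompt, temperature)
    if not looks_shaky(first):
        return extract_answer(first), 1
    # Keep the first sample as one vote and draw the remaining ones.
    completions = [first] + [sample(prompt, temperature) for _ in range(n_escalated - 1)]
    votes = Counter(a for a in map(extract_answer, completions) if a is not None)
    return (votes.most_common(1)[0][0] if votes else None), n_escalated


def make_replay(completions: List[str]) -> Callable[[str, float], str]:
    """Stand-in for a model client: replays canned completions in order."""
    it = iter(completions)
    return lambda prompt, temperature: next(it)


confident = make_replay(["6 * 7 = 42. <answer>42</answer>"])
shaky = make_replay(
    ["Hmm, not sure. <answer>41</answer>"]
    + ["<answer>42</answer>", "<answer>42</answer>", "<answer>41</answer>", "<answer>42</answer>"]
)
print(answer_with_escalation(confident, "What is 6 * 7?"))  # ('42', 1)
print(answer_with_escalation(shaky, "What is 6 * 7?"))      # ('42', 5)
```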

What it does not solve

Bias. A model that's wrong consistently will vote wrong consistently.
Hallucinations of fact. All five samples can hallucinate the same nonexistent function. Self-consistency is about reasoning quality, not factual grounding; pair it with retrieval-augmented generation (RAG) for the latter.
Long-horizon planning. Voting on 50-step plans is wonky; majority across complex artifacts is ill-defined.

In one line

Self-consistency is the cheapest way to turn a "smart-sometimes" model into a "smart-usually" model — a few extra calls, a real bump in correctness, almost no engineering complexity.