Self-Consistency: Voting Across Multiple LLM Samples

Run the same prompt N times at non-zero temperature, take the majority answer. A few extra calls, big accuracy gains on hard reasoning.

The poll-the-experts analogy

Ask one expert a hard question, you get one answer. Ask five experts independently, count the answers, and the majority is right startlingly often — even when each expert alone is only 70% accurate. The wisdom of independent samples beats any single oracle.

Self-consistency runs that poll on a single LLM. Generate N answers at non-zero temperature, take the most common one. Cheap, easy, and the accuracy lift on hard reasoning tasks is often huge.
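The 70% figure in the analogy can be checked with a short binomial calculation. This is a sketch under two simplifying assumptions that real LLM samples only approximate: the question has two possible answers, and the voters are fully independent. The function name `majority_accuracy` is illustrative.

```python
from math import comb


def majority_accuracy(p: float, n: int) -> float:
    """Probability that a strict majority of n independent voters is correct,
    when each voter is correct with probability p (two-answer question)."""
    need = n // 2 + 1  # smallest number of correct votes that forms a majority
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(need, n + 1))


for n in (1, 3, 5):
    print(n, round(majority_accuracy(0.7, n), 3))
# 1 0.7
# 3 0.784
# 5 0.837
```

Five voters at 70% each reach about 84% as a panel. Samples drawn from one model share its blind spots, so the real gain is usually smaller than this idealised number.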

The recipe

  1. Take a prompt that benefits from chain-of-thought.
  2. Run it N times with temperature > 0 so the samples differ.
  3. Extract the final answer from each sample.
  4. Pick the majority answer.

That's it. Three more lines of code, one more loop, N× the cost.
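The four steps can be sketched roughly as follows. The `sample` callable stands in for whatever client call returns one completion at temperature > 0; it and `fake_sample` are placeholders, not a real API. The sketch also assumes the prompt asks the model to wrap its final answer in `<answer>...</answer>` tags.

```python
import itertools
import re
from collections import Counter
from typing import Callable, Optional

ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def extract_answer(completion: str) -> Optional[str]:
    """Step 3: pull the final answer out of one completion, or None if missing."""
    match = ANSWER_RE.search(completion)
    return match.group(1).strip() if match else None


def self_consistency(sample: Callable[[str], str], prompt: str, n: int = 5) -> Optional[str]:
    """Steps 2-4: draw n samples, extract each answer, return the most common one."""
    answers = [extract_answer(sample(prompt)) for _ in range(n)]
    answers = [a for a in answers if a is not None]  # drop unparseable samples
    if not answers:
        return None
    # On a tie, Counter.most_common keeps the answer that was seen first.
    return Counter(answers).most_common(1)[0][0]


# Hypothetical stand-in for a model call: cycles through canned completions.
_canned = itertools.cycle([
    "17 + 25 = 42. <answer>42</answer>",
    "17 + 25 = 32. <answer>32</answer>",
    "20 + 22 = 42. <answer>42</answer>",
    "I am not sure.",
    "<answer> 42 </answer>",
])


def fake_sample(prompt: str) -> str:
    return next(_canned)


print(self_consistency(fake_sample, "What is 17 + 25?"))  # prints 42
```

Unparseable samples are dropped here instead of counted as a vote; counting them as an explicit "no answer" bucket is an equally reasonable choice.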

Why it works

LLMs trained to reason often reach the right answer through multiple valid paths. At temperature 0, the model picks one path — and if the "best-guess" path is the wrong one, you're stuck with a wrong answer.

At higher temperature, the model samples different paths. Wrong paths tend to disagree (each is wrong differently). The right path is a stable attractor — multiple samples converge on it. Majority vote surfaces the convergence.
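A toy simulation makes the "wrong paths disagree" argument concrete. It is a sketch with made-up numbers, not a measurement of any model: each sample is right with probability 0.5, and otherwise lands on one of five distinct wrong answers chosen uniformly.

```python
import random
from collections import Counter


def simulate(n: int, trials: int = 2000, p_correct: float = 0.5,
             n_wrong: int = 5, seed: int = 0) -> float:
    """Fraction of trials in which the most common of n samples is 'right'."""
    rng = random.Random(seed)
    wrong = [f"wrong_{i}" for i in range(n_wrong)]
    hits = 0
    for _ in range(trials):
        samples = [
            "right" if rng.random() < p_correct else rng.choice(wrong)
            for _ in range(n)
        ]
        winner = Counter(samples).most_common(1)[0][0]
        hits += winner == "right"
    return hits / trials


for n in (1, 5, 15):
    print(n, simulate(n))
```

Because the wrong answers are spread across several values, the right answer often wins the vote even when it appears in only two of five samples. If the wrong answers instead concentrated on one value, the advantage would shrink or vanish.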

When the lift is biggest

  • Multi-step math: single-shot is brittle; a majority of 5 cleans up most arithmetic slips.
  • Logic puzzles: branchy reasoning where one wrong subgoal sinks the whole answer.
  • Diagnostic / classification with rationale: when the answer is one of a small set.
  • Code generation with unit tests: generate N candidate functions, run them all against the tests, and pick the one that passes the most.

When the lift is small

  • Tasks where the model is already near-perfect. No room to improve.
  • Open-ended creative outputs. "Majority" makes no sense for a poem.
  • Tasks where wrong answers happen to agree — biased models converge on wrong answers as confidently as right ones.
  • Strict latency budgets. N× the cost = N× the wait if calls are sequential.

Engineering it well

  • Parallelise the N calls. Sequential means N× latency; parallel means roughly the latency of the slowest single call, at N× cost. Parallel is almost always the right choice.
  • Pick N pragmatically. N=3 already catches many single-sample slips; N=5 is a reasonable default for most tasks; beyond N=10 the returns diminish. For yes/no answers, an odd N rules out ties.
  • Use a moderate temperature, roughly 0.6–0.8. Too low and samples cluster on the same wrong path; too high and the samples become noise.
  • Extract answers reliably. Force structured output (<answer>...</answer>) so parsing is trivial.
  • Track agreement. If 5/5 agree, ship with confidence. If the vote splits 3–2, flag it for review. The vote distribution is itself a signal.
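Parallel sampling and agreement tracking might look like the sketch below. It uses a thread pool because model calls are typically I/O-bound. The `sample` callable is a placeholder for a real client call, and the 0.8 threshold in `decide` is an arbitrary illustration (it ships a 4/5 vote and flags a 3/2 split), not a recommended value.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple


def sample_parallel(sample: Callable[[str], str], prompt: str, n: int = 5) -> List[str]:
    """Issue the n calls concurrently so wall-clock time is about one call."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        return list(pool.map(sample, [prompt] * n))


def vote_with_agreement(answers: List[str]) -> Tuple[str, float]:
    """Return the winning answer and the share of votes it received.

    Assumes answers is non-empty.
    """
    winner, count = Counter(answers).most_common(1)[0]
    return winner, count / len(answers)


def decide(answers: List[str], threshold: float = 0.8) -> Tuple[str, str]:
    """Ship when agreement meets the (hypothetical) threshold, else flag for review."""
    winner, agreement = vote_with_agreement(answers)
    return winner, "ship" if agreement >= threshold else "review"


print(decide(["42"] * 5))                       # ('42', 'ship')
print(decide(["42", "42", "42", "32", "32"]))   # ('42', 'review')
```

Logging the agreement ratio alongside the answer also gives a cheap way to find the prompts where the model is least stable.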

Variants

  • Weighted self-consistency — different samples get different weights (longer reasoning = more weight, or per-sample confidence).
  • Self-consistency with verification — for each sample, run a verifier (tests, lookup, sanity check) and use only verified answers in the vote.
  • Best-of-N with reward model — score each sample with a reward model, take the best (not the majority). Strong when you have a good scorer.
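The three variants differ only in how the samples are combined, which a small sketch can show side by side. The weights, `verifier`, and `score` callables are placeholders supplied by the caller; nothing here assumes a particular reward model or test harness.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Optional, Tuple


def weighted_vote(samples: Iterable[Tuple[str, float]]) -> Optional[str]:
    """Weighted self-consistency: sum the weights per answer, return the heaviest."""
    totals: Dict[str, float] = defaultdict(float)
    for answer, weight in samples:
        totals[answer] += weight
    if not totals:
        return None
    return max(totals, key=lambda answer: totals[answer])


def verified_vote(answers: List[str], verifier: Callable[[str], bool]) -> Optional[str]:
    """Self-consistency with verification: only verified answers get a vote."""
    return weighted_vote((a, 1.0) for a in answers if verifier(a))


def best_of_n(answers: List[str], score: Callable[[str], float]) -> str:
    """Best-of-N: take the single highest-scoring sample, ignoring frequency."""
    return max(answers, key=score)


# "B" appears twice, but "A" carries more total weight (0.9 vs 0.8).
print(weighted_vote([("A", 0.9), ("B", 0.5), ("B", 0.3)]))  # A
```

With every weight set to 1.0, `weighted_vote` reduces to the plain majority vote.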

Cost vs quality

This is a knob. N=1 is cheap and brittle. N=10 costs 10× as much and, on hard tasks, can lift accuracy by something like 10–20 percentage points, though the gain depends heavily on the model and the task. Use it on the high-value, hard-reasoning slice of your traffic, not on every call. Routing rules and confidence-based escalation make this practical:

  • Run N=1 first.
  • If the confidence signals (unusual length, hedging language, a rejected structured output) look shaky, escalate to N=5.
  • Most calls finish at N=1; expensive escalations only happen where they pay.
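One way to wire up that escalation is sketched below. All three callables are placeholders: `sample` for the model call, `extract` for the answer parser, and `looks_shaky` for whatever confidence heuristic you trust. Reusing the first sample as one of the N votes is a design choice of this sketch, made to keep the total at N calls.

```python
from collections import Counter
from typing import Callable, Optional, Tuple


def answer_with_escalation(
    sample: Callable[[str], str],
    extract: Callable[[str], Optional[str]],
    looks_shaky: Callable[[str], bool],
    prompt: str,
    n_escalated: int = 5,
) -> Tuple[Optional[str], int]:
    """Return (answer, number_of_model_calls)."""
    first = sample(prompt)
    answer = extract(first)
    # Cheap path: the answer parsed and the heuristic raised no concern.
    if answer is not None and not looks_shaky(first):
        return answer, 1

    # Expensive path: top up to n_escalated samples and take a majority vote.
    completions = [first] + [sample(prompt) for _ in range(n_escalated - 1)]
    answers = [a for a in map(extract, completions) if a is not None]
    if not answers:
        return None, n_escalated
    return Counter(answers).most_common(1)[0][0], n_escalated
```

Returning the call count makes it easy to measure how often the expensive path fires and what it costs.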

What it does not solve

  • Bias. A model that's wrong consistently will vote wrong consistently.
  • Hallucinations of fact. All five samples can hallucinate the same nonexistent function. Self-consistency is about reasoning quality, not factual grounding — pair with RAG for the latter.
  • Long-horizon planning. Voting on 50-step plans is wonky; majority across complex artifacts is ill-defined.

In one line

Self-consistency is the cheapest way to turn a "smart-sometimes" model into a "smart-usually" model — a few extra calls, a real bump in correctness, almost no engineering complexity.

Engr Mejba Ahmed
