The maths-test analogy
You ask a student "what's 47 × 38?" Most stare and guess. Ask the same student "show your work" and they multiply digits, carry, add — and produce the correct answer. The reasoning was always there. Asking for it changed the output.
Chain-of-thought (CoT) prompting is the same trick for LLMs. Ask the model to think through the problem in plain language before giving the final answer, and accuracy on multi-step tasks (math, logic, planning) jumps measurably. The reasoning costs tokens, but it pays in correctness.
The minimal recipe
Add a short instruction:
Think step by step before answering.
Or for stronger results:
Before answering, walk through your reasoning. Then provide the final answer in
<answer>...</answer>tags.
The first costs nothing structural. The second lets you parse the answer cleanly while still benefiting from the reasoning.
Why it works
LLMs are autoregressive — each token conditions on the ones before it. When the model commits to a final answer too quickly, it's locked in. Letting it produce intermediate reasoning tokens means the final-answer token is conditioned on a richer context the model just generated. The math gets done in tokens, not in a hidden mental step.
Variants worth knowing
- Zero-shot CoT — just the "think step by step" instruction. Cheap, often enough.
- Few-shot CoT — show 1–3 worked examples. Better for unusual problem shapes.
- Self-consistency — generate N CoT samples, take the majority answer. Big quality lift on math/logic; high cost (covered in its own explainer).
- Tree of thoughts — branch the reasoning, search across paths. For genuinely multi-path problems.
Where CoT helps most
- Multi-step math. Word problems, accounting, unit conversions.
- Logic puzzles with nested conditions.
- Planning — break a goal into ordered steps.
- Diagnostic reasoning — code debugging, root-cause analysis.
- Anything with constraints to track — schedules, allocations, eligibility rules.
Where it doesn't help (or hurts)
- Simple lookups — "what's the capital of France?" CoT just adds tokens.
- Creative writing — explicit reasoning can flatten voice; let the model write.
- Pattern-recognition tasks the model has memorised — CoT introduces noise.
- Latency-critical paths — the reasoning costs real wall-time.
Hide the scratchpad in production
Users do not want to read 800 tokens of reasoning before the answer. Two clean patterns:
- Tagged reasoning —
<thinking>...</thinking><answer>...</answer>. Render only the answer; optionally expose the thinking on a "show reasoning" toggle. - Two-call flow — first call: ask for a plan / reasoning. Second call: ask for the final answer using the plan. Lets you cap reasoning length and parallelise.
Reasoning models change the equation
Newer "reasoning models" (often labelled with a thinking step) bake long deliberation into the model itself, with internal tokens you may not even pay for. With those, explicit CoT prompting is often redundant — the model is already doing it. Always test on your task; sometimes simpler prompts beat over-instructed ones on a reasoning model.
Anti-patterns to avoid
- "Think very carefully" — magic phrases that move the needle 0%.
- Long reasoning when the task is trivial — pay for tokens you don't need.
- Letting the user see raw reasoning by default — looks unprofessional, and exposes prompt internals.
- Trusting the reasoning text as ground truth — the model can produce convincing wrong reasoning. Verify the answer.
In one line
Chain-of-thought is the cheapest reliability lever for hard prompts — turn it on, hide the scratchpad, ship.