The chess-tree analogy
A chess player does not pick a move and march. They branch — "if I push the pawn, they could do A, B, or C; let me consider each" — prune unpromising lines, and commit to the one that survives evaluation. The search is the strategy.
Tree of Thoughts (ToT) brings that to LLM reasoning. Instead of one chain of thought, the model considers several candidate next steps, expands the promising ones, kills the weak ones, and only commits when the search has done its job.
The four moving parts
- Thought decomposition — break the problem into smaller "thought" units (a paragraph, a step, a partial plan).
- Thought generation — at each node, the model proposes K candidate next thoughts.
- Thought evaluation — the model (or a separate critic) scores each candidate: is it making progress? plausible? a dead end?
- Search strategy — BFS, DFS, beam search, or Monte Carlo tree search over the node graph.
The transcript becomes a tree (or graph) of thoughts, not a single chain.
When ToT beats CoT
ToT helps when:
- The problem has multiple plausible paths (not just one obvious one).
- Early commitment is costly — once the chain marches down a wrong road, recovery is hard.
- Backtracking is natural — sometimes you have to abandon a partial solution.
- Verification is cheaper than generation — you can score a candidate quickly, even if you can't author it directly.
Examples that fit:
- Game of 24 (combine numbers with arithmetic to hit 24) — multiple decompositions, easy to verify.
- Crosswords — branching choices, verifiable correctness.
- Multi-step planning with optional steps and dependencies.
- Code synthesis with constraints — try a structure, evaluate, backtrack.
When ToT is overkill
- Single-path reasoning — math problem with a deterministic algorithm.
- Simple lookups — no branching gives you no value.
- Latency-critical UX — the search is multiplicatively more expensive.
- When self-consistency (generate N CoTs, vote) gives you 90% of the lift at a fraction of the complexity.
ToT is a heavier hammer. Use it when the simpler hammers (CoT, self-consistency, ReAct) demonstrably aren't enough.
Engineering it
- Branching factor (K) — typical 3–5. Higher K = more coverage, more cost.
- Depth limit — cap how deep the tree grows; 5–10 is enough for most tasks.
- Evaluation cost — every node-scoring is an LLM call. The eval call should be much cheaper than generation (smaller model, shorter prompt).
- Pruning aggressiveness — keep top-N at each level (beam search) to bound the explosion.
- Backtrack triggers — score below threshold, stuck on the same step twice, contradiction detected.
Variants and cousins
- Graph of Thoughts — generalises tree to a DAG; nodes can merge.
- Algorithm of Thoughts — uses heuristics from a known algorithm to guide search (a star with thoughts).
- MCTS-prompted reasoning — Monte Carlo tree search but the rollouts are LLM completions.
What ToT doesn't fix
- Bad evaluation function. Garbage scoring → garbage tree. The critic is half the system.
- Hallucination at the leaves. ToT can find consistent reasoning that's still factually wrong.
- Real-world side effects. Branching is great in pure reasoning, dangerous when tools mutate state. Use ReAct + smart rollback if branches must act.
A pragmatic decision rule
If self-consistency on chain-of-thought already crushes your eval, stop there. If the problem genuinely has branching paths and verification is cheap, reach for ToT. If it has branching paths and every branch needs side-effects, redesign — branching with side-effects is a foot-gun.
In one line
Tree of Thoughts is search added to thinking — useful when the problem has many roads, and the wrong one is expensive to walk all the way down.