The proofreader analogy
A first draft is rarely the best draft. Writers re-read, spot what's weak, and revise. Some of the best work happens in the second pass, not the first. The skill is not producing perfect prose on the first try; it is noticing what's broken and fixing it.
Reflexion gives an LLM that proofreader habit. The model produces an answer, critiques itself ("did I follow the spec? are there bugs? could this be simpler?"), and revises. The first pass is rarely the last.
The basic loop
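def reflect(model, task, max_rounds=3):
    # `model` is any object exposing generate / critique / revise.
    draft = model.generate(task)
    for _ in range(max_rounds):
        critique = model.critique(draft, task)
        if critique.is_done:
            break  # the critic has nothing left to fix
        draft = model.revise(draft, critique)
    return draft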
Three roles played by the same model (or different models): generator, critic, reviser. The loop runs 1–3 times in practice.
Why it works
- Critique is easier than generation. Spotting a bug in code is often easier than writing the bug-free code in one pass. The critic exploits this asymmetry.
- The revision sees what the draft missed. New information at revise time = better output.
- Iterative improvement on hard examples. The model tends to converge on better answers across rounds, especially when the critique surfaces concrete issues.
Variants
- Reflexion (canonical) — the model writes an explicit "lesson" after a failed attempt and reuses it on the next try. Good for agents that retry tasks.
- Plain self-critique — one critique → one revision. Simpler, often enough.
- Critic separate from generator — different model, different prompt, sometimes a smaller and cheaper model.
- Debate / multi-critic — two critics with different perspectives; model reconciles. Useful for high-stakes outputs.
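As a rough illustration of the multi-critic variant, the sketch below runs two critics with different perspectives over one draft and merges their findings. Everything here is hypothetical: `length_critic` and `hedging_critic` are simple heuristics standing in for model calls, and the reconciliation step is a plain dedupe-and-sort rather than a model reconciling the critiques.

```python
from typing import Callable, Dict, List

Issue = Dict[str, str]
Critic = Callable[[str], List[Issue]]

# Lower number = more urgent. Unknown severities sort last.
SEVERITY_ORDER = {"high": 0, "medium": 1, "low": 2}


def length_critic(draft: str) -> List[Issue]:
    """Hypothetical critic that only cares about length."""
    if len(draft.split()) > 12:
        return [{"severity": "low", "location": "whole draft",
                 "fix": "trim to 12 words or fewer"}]
    return []


def hedging_critic(draft: str) -> List[Issue]:
    """Hypothetical critic that only cares about hedging words."""
    issues = []
    for word in ("maybe", "possibly"):
        if word in draft.lower():
            issues.append({"severity": "medium", "location": word,
                           "fix": f"remove '{word}' or state the claim directly"})
    return issues


def merge_critiques(draft: str, critics: List[Critic]) -> List[Issue]:
    """Collect issues from every critic, drop duplicates, most severe first."""
    seen = set()
    merged: List[Issue] = []
    for critic in critics:
        for issue in critic(draft):
            key = (issue["location"], issue["fix"])
            if key not in seen:
                seen.add(key)
                merged.append(issue)
    return sorted(
        merged,
        key=lambda issue: SEVERITY_ORDER.get(issue["severity"], len(SEVERITY_ORDER)),
    )
```

The merged list can then be handed to the reviser as a single critique, so the revision step does not need to know how many critics were involved.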
Where it shines
- Code generation — "does this compile? does it match the spec? are there edge cases I missed?"
- Long-form writing — structure, clarity, factual claims.
- Complex reasoning — sanity-check the math, the steps, the conclusion.
- Multi-constraint outputs — schema validity + tone + length all at once.
- Agent retries — after a failed action, write a lesson and try a different approach next time.
Where it doesn't help
- Simple tasks — overhead beats the gain.
- Latency-critical UX — N× the calls means N× the wait (parallelise where possible, but the loop is inherently sequential).
- Cases where the model can't self-critique — narrow domains where the model has no purchase. A reflective wrong answer is just a longer wrong answer.
- When you have ground-truth verification — if you can run tests, the test result is a better critic than the model's prose.
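When ground truth is available, the critic can be a test run rather than a model call. The sketch below is a minimal illustration under assumed names: `run_tests`, `draft_v1`, `draft_v2` and `refine_with_tests` are all hypothetical, and the two drafts stand in for successive model outputs.

```python
from typing import Callable, List, Optional, Tuple

Candidate = Callable[[int], int]


def run_tests(func: Candidate) -> Optional[str]:
    """Return None if every check passes, else a short failure message."""
    cases = [(0, 0), (3, 9), (-2, 4)]
    for arg, expected in cases:
        got = func(arg)
        if got != expected:
            return f"square({arg}) returned {got}, expected {expected}"
    return None


def draft_v1(x: int) -> int:
    return x * 2  # bug: doubles instead of squaring


def draft_v2(x: int) -> int:
    return x * x


def refine_with_tests(
    generate: Callable[[], Candidate],
    revise: Callable[[Candidate, str], Candidate],
    max_rounds: int = 3,
) -> Tuple[Candidate, List[str]]:
    """Draft, test, revise. The test failure message plays the critic's role."""
    draft = generate()
    failures: List[str] = []
    for _ in range(max_rounds):
        failure = run_tests(draft)
        if failure is None:
            break
        failures.append(failure)
        draft = revise(draft, failure)  # a real reviser would see `failure` in its prompt
    # If the cap is hit, the last revision is returned without a final test run.
    return draft, failures


best, failures = refine_with_tests(lambda: draft_v1, lambda draft, failure: draft_v2)
```

Here the failure message is concrete and checkable, which is exactly what a prose critique from the model cannot guarantee.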
Engineering it
- Cap the rounds. 1–3 max. Past that you get diminishing returns and runaway cost.
- Force structured critique. { issues: [{severity, location, fix}], ready_to_ship: bool }. Trivially parseable; loop exits cleanly.
- Different model for critic vs generator (sometimes). A bigger critic checking a smaller generator can be the right cost / quality trade-off.
- Diff-based revision. Ask the reviser to apply specific changes from the critique, not regenerate from scratch. Preserves what was already good.
- Track ready-to-ship rate — your eval signal. If round 1 ships 60% and round 2 ships 80%, you know the loop is working.
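Two of the points above lend themselves to a short sketch: parsing the structured critique and tracking the ready-to-ship rate. The field names follow the schema given in the list; the class and function names (`Issue`, `Critique`, `parse_critique`, `ship_rate_by_round`) are assumptions, as is the choice to read the ship rate as cumulative (the share of tasks shipped by round k).

```python
import json
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Issue:
    severity: str
    location: str
    fix: str


@dataclass
class Critique:
    issues: List[Issue]
    ready_to_ship: bool


def parse_critique(raw: str) -> Critique:
    """Parse the critic's JSON reply; raise ValueError if it is malformed."""
    try:
        data = json.loads(raw)
        issues = [Issue(**item) for item in data["issues"]]
        ready = data["ready_to_ship"]
    except (json.JSONDecodeError, KeyError, TypeError) as exc:
        raise ValueError(f"malformed critique: {exc}") from exc
    if not isinstance(ready, bool):
        raise ValueError("ready_to_ship must be a boolean")
    return Critique(issues=issues, ready_to_ship=ready)


def ship_rate_by_round(shipped_at: List[Optional[int]], max_rounds: int) -> List[float]:
    """Cumulative share of tasks marked ready by each round.

    shipped_at holds, per task, the 1-based round at which the critic first
    said ready_to_ship, or None if it never did.
    """
    total = len(shipped_at)
    if total == 0:
        return [0.0] * max_rounds
    return [
        sum(1 for r in shipped_at if r is not None and r <= k) / total
        for k in range(1, max_rounds + 1)
    ]
```

A reply that fails to parse (for example a bare "Looks great!") raises `ValueError`, which the loop can treat as "not ready" instead of silently shipping. For ten tasks where six ship at round 1, two at round 2 and two never, `ship_rate_by_round` returns 0.6 and 0.8 for rounds 1 and 2.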
Where reflexion beats simpler tricks
Reflexion's superpower is learning from a failure. After an action fails:
- Model writes a lesson ("when the API returns 422, also include the email field").
- The lesson is appended to the next attempt's prompt.
- The model retries and now avoids the same failure.
Across many runs, those lessons accumulate into a memory the agent uses to perform better on the same task class. This compounds in long-running agent systems in ways one-shot critique does not.
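The lesson mechanism described above can be sketched in a few lines. This is an illustration under assumptions, not the canonical Reflexion implementation: `LessonMemory`, `run_with_retries`, `toy_act` and `toy_reflect` are hypothetical names, the `/users` endpoint is invented, and the toy action simply succeeds once the prompt mentions the email field.

```python
from typing import Callable, List, Optional, Tuple


class LessonMemory:
    """Keeps the most recent lessons and prepends them to the next prompt."""

    def __init__(self, max_lessons: int = 5) -> None:
        self.max_lessons = max_lessons
        self.lessons: List[str] = []

    def add(self, lesson: str) -> None:
        if lesson and lesson not in self.lessons:
            self.lessons.append(lesson)
            self.lessons = self.lessons[-self.max_lessons:]  # bound prompt growth

    def build_prompt(self, task: str) -> str:
        if not self.lessons:
            return task
        bullets = "\n".join(f"- {lesson}" for lesson in self.lessons)
        return f"{task}\n\nLessons from earlier attempts:\n{bullets}"


def run_with_retries(
    task: str,
    act: Callable[[str], Tuple[bool, str]],
    reflect: Callable[[str, str], str],
    memory: LessonMemory,
    max_attempts: int = 3,
) -> Tuple[Optional[int], Optional[str]]:
    """Try the task; after each failure, store a lesson and retry with it."""
    for attempt in range(1, max_attempts + 1):
        ok, result = act(memory.build_prompt(task))
        if ok:
            return attempt, result
        memory.add(reflect(task, result))
    return None, None


def toy_act(prompt: str) -> Tuple[bool, str]:
    # Stand-in for a real action: succeeds only if the prompt mentions email.
    if "email" in prompt:
        return True, "201 Created"
    return False, "422 Unprocessable Entity"


def toy_reflect(task: str, error: str) -> str:
    # Stand-in for the model writing a lesson from the failure.
    return "when the API returns 422, also include the email field"


memory = LessonMemory()
attempt, result = run_with_retries("Create a user via POST /users", toy_act, toy_reflect, memory)
```

Because `memory` outlives the call, a second run of the same task class starts with the lesson already in its prompt, which is the compounding effect described above.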
Common pitfalls
- Sycophantic critic. "Looks great!" on every output. Force structure: explicit issue list with severity.
- Critic and generator share a blind spot. If the same model is wrong about the same thing in both roles, the loop endorses it. Mix in tools or a separate critic model when this matters.
- Loop never ends. Always cap rounds; don't trust the model's own "is_done" signal as the only stop.
- Critique cost ignored. If your critic call is as long as your generator call, you've doubled latency and cost. Make the critic small and tight.
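To make the cost pitfall concrete, here is a back-of-the-envelope estimate. It assumes the worst case, where every round runs one critique and one revision, and uses token counts as a rough proxy for both cost and latency; `loop_cost` and its parameters are hypothetical names.

```python
from typing import Tuple


def loop_cost(
    gen_tokens: int, critic_tokens: int, revise_tokens: int, rounds: int
) -> Tuple[int, int]:
    """Worst-case (calls, tokens) for one generate plus `rounds` critique/revise pairs."""
    calls = 1 + 2 * rounds
    tokens = gen_tokens + rounds * (critic_tokens + revise_tokens)
    return calls, tokens


def overhead(gen_tokens: int, critic_tokens: int, revise_tokens: int, rounds: int) -> float:
    """Total tokens relative to a single generate call."""
    _, tokens = loop_cost(gen_tokens, critic_tokens, revise_tokens, rounds)
    return tokens / gen_tokens
```

With 1000-token generate, critique and revise calls and a single round, this gives 3 calls and 3000 tokens. Shrinking the critic to 200 tokens brings the total to 2200, which is the argument for keeping the critic small and tight.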
In one line
Reflexion teaches a model to be its own first reviewer — cheap, often surprisingly effective, especially in agent loops where you want the system to learn from failures across runs.