What if the test passes all 100 local runs and only fails in CI?

That points to an environment difference, not the test logic itself — usually test order, parallelism, the system clock, locale, or resource limits. Tell the agent your local loop is green and have it reproduce CI's conditions: randomize test order, run with the same parallelism as CI, or shift the clock. The repro step is non-negotiable; without a local failure you're back to guessing.

Why does the prompt forbid wrapping the test in retry?

Retries hide the symptom without touching the cause. A flaky test signals a real defect — a race, leaked state, or a time assumption — that can also bite in production. Banning retry forces an actual fix plus a regression test that fails on the old code, so you get proof the bug is gone instead of a green checkmark that masks it.

What should I put in the [test_name] variable?

The fully qualified name of one specific failing test (class plus method, or the exact path your runner accepts), not a file or suite. The whole method hinges on isolating a single failure with a single root cause. If several tests in a file flake, run the prompt once per test — their causes are often unrelated.

Claude Code Prompt to Debug a Flaky Test with a Reliable Repro | AI Prompt Library

What this prompt does

This prompt turns "it fails sometimes in CI" into a closed, evidence-based investigation. It forces the agent to first earn a reproduction — run the named test 100 times under a seed loop and report the actual pass/fail ratio — before it's allowed to theorize. That ordering matters: most flaky-test fixes are wrong because they're guesses layered on a failure nobody can trigger on demand.

It works because the five steps mirror how real intermittency gets cornered. Capturing stdout/stderr/logs (step 2) anchors hypotheses to observed output, not vibes. Demanding three hypotheses across the usual suspects — race condition, time dependency, shared state, external dependency — stops the model from fixating on the first plausible story. Then deterministic instrumentation (step 4) is what separates the real cause from a coincidence: you confirm the winning hypothesis instead of assuming it.

The closing constraint is the whole point. By requiring a regression test that deterministically reproduces the bug before the fix, and explicitly banning retry wrappers, the prompt blocks the two laziest "solutions" — hiding the failure behind reruns, or shipping a fix with no proof it addressed the cause.

When to use it

A test passes locally every time but fails ~1 in N runs on CI, and reruns make it "go away."
You suspect test-order dependence or leaked state between tests but can't yet prove it.
A date/time- or timezone-sensitive test fails only at certain hours or near midnight boundaries.
Async or concurrent code intermittently fails on assertions that race the code under test.
Someone has already wrapped the test in retries and you want the actual root cause instead.
A test touches a real network, clock, filesystem, or DB and you need to isolate the external dependency.

Example output

Repro: 100 runs, seed loop → 7 failures (7%). Seeds 41, 88, 93… reproduce.
Logs: failures all assert order [A,C,B]; passes assert [A,B,C].

Hypotheses:
  1) Shared state — module-level cache persists across tests  ← strongest
  2) Race — async fetch resolves out of order
  3) Time dependency — ruled out (no clock use in path)

Instrumentation: logged cache identity + test name per run.
  → Confirmed: test_user_list mutates the cache test_order reads.

Fix: reset cache in teardown (isolate state).
Regression test: seeds the polluted cache, asserts correct order —
  fails on old code, passes on patched code. No retry added.

Pro tips

Set [test_name] to a single test, not a whole file. One flaky case has one root cause; a file mixes several and muddies the hypotheses.
If 100 runs yield zero failures, the flakiness is environmental — push the agent to add CI-only variance: randomized test order, parallelism, a shifted system clock, or constrained resources.
Watch for a "fix" that just narrows the timing window. Insist the regression test fail reliably on the pre-fix code; a test that only sometimes catches the bug is the same flake wearing a disguise.
Make the loop honest with your runner's randomization flags (a seed flag, or shuffled test order) — otherwise "100 runs" just repeats the same code path 100 times and proves nothing.
For shared-state suspects, ask it to bisect by running the failing test immediately after each other test; the polluter usually surfaces fast.

Details

Claude Code Prompt to Debug a Flaky Test with a Reliable Repro

Fill in the placeholders

What this prompt does

When to use it

Example output

Pro tips

Frequently Asked Questions

Engr Mejba Ahmed

More in Claude Code Prompts

Claude Code Prompt to Ship a Production PR from Ticket to Merge

Claude Code Prompt for a Safe Large-Scale Rename Across a Repo

Claude/ChatGPT (or Cursor) Prompt to Run a Senior Code Review on a Diff

Ready to Transform

Your Ideas?

Engr Mejba Ahmed

Hey there!