Skip to main content

Claude Code Prompt to Debug a Flaky Test with a Reliable Repro

Stop an intermittent CI failure for good: reproduce it locally with a seed loop, confirm the root cause, and ship a fix plus a deterministic regression test.

Fill in the placeholders

Edit the values, then copy your finished prompt.

Your Prompt
prompt.txt

                                

What this prompt does

This prompt turns "it fails sometimes in CI" into a closed, evidence-based investigation. It forces the agent to first earn a reproduction — run the named test 100 times under a seed loop and report the actual pass/fail ratio — before it's allowed to theorize. That ordering matters: most flaky-test fixes are wrong because they're guesses layered on a failure nobody can trigger on demand.

It works because the five steps mirror how real intermittency gets cornered. Capturing stdout/stderr/logs (step 2) anchors hypotheses to observed output, not vibes. Demanding three hypotheses across the usual suspects — race condition, time dependency, shared state, external dependency — stops the model from fixating on the first plausible story. Then deterministic instrumentation (step 4) is what separates the real cause from a coincidence: you confirm the winning hypothesis instead of assuming it.

The closing constraint is the whole point. By requiring a regression test that deterministically reproduces the bug before the fix, and explicitly banning retry wrappers, the prompt blocks the two laziest "solutions" — hiding the failure behind reruns, or shipping a fix with no proof it addressed the cause.

When to use it

  • A test passes locally every time but fails ~1 in N runs on CI, and reruns make it "go away."
  • You suspect test-order dependence or leaked state between tests but can't yet prove it.
  • A date/time- or timezone-sensitive test fails only at certain hours or near midnight boundaries.
  • Async or concurrent code intermittently fails on assertions that race the code under test.
  • Someone has already wrapped the test in retries and you want the actual root cause instead.
  • A test touches a real network, clock, filesystem, or DB and you need to isolate the external dependency.

Example output

Repro: 100 runs, seed loop → 7 failures (7%). Seeds 41, 88, 93… reproduce.
Logs: failures all assert order [A,C,B]; passes assert [A,B,C].

Hypotheses:
  1) Shared state — module-level cache persists across tests  ← strongest
  2) Race — async fetch resolves out of order
  3) Time dependency — ruled out (no clock use in path)

Instrumentation: logged cache identity + test name per run.
  → Confirmed: test_user_list mutates the cache test_order reads.

Fix: reset cache in teardown (isolate state).
Regression test: seeds the polluted cache, asserts correct order —
  fails on old code, passes on patched code. No retry added.

Pro tips

  • Set [test_name] to a single test, not a whole file. One flaky case has one root cause; a file mixes several and muddies the hypotheses.
  • If 100 runs yield zero failures, the flakiness is environmental — push the agent to add CI-only variance: randomized test order, parallelism, a shifted system clock, or constrained resources.
  • Watch for a "fix" that just narrows the timing window. Insist the regression test fail reliably on the pre-fix code; a test that only sometimes catches the bug is the same flake wearing a disguise.
  • Make the loop honest with your runner's randomization flags (a seed flag, or shuffled test order) — otherwise "100 runs" just repeats the same code path 100 times and proves nothing.
  • For shared-state suspects, ask it to bisect by running the failing test immediately after each other test; the polluter usually surfaces fast.

Frequently Asked Questions

What if the test passes all 100 local runs and only fails in CI?
That points to an environment difference, not the test logic itself — usually test order, parallelism, the system clock, locale, or resource limits. Tell the agent your local loop is green and have it reproduce CI's conditions: randomize test order, run with the same parallelism as CI, or shift the clock. The repro step is non-negotiable; without a local failure you're back to guessing.
Why does the prompt forbid wrapping the test in retry?
Retries hide the symptom without touching the cause. A flaky test signals a real defect — a race, leaked state, or a time assumption — that can also bite in production. Banning retry forces an actual fix plus a regression test that fails on the old code, so you get proof the bug is gone instead of a green checkmark that masks it.
What should I put in the [test_name] variable?
The fully qualified name of one specific failing test (class plus method, or the exact path your runner accepts), not a file or suite. The whole method hinges on isolating a single failure with a single root cause. If several tests in a file flake, run the prompt once per test — their causes are often unrelated.
Engr Mejba Ahmed

Need this built for real?

Engr Mejba Ahmed

AI Developer · Software Engineer

I'm Mejba — I design and ship production AI systems, automations, and full-stack apps. If you want this turned into a working solution for your team, let's talk.

More in Claude Code Prompts

Engr Mejba Ahmed

Engr Mejba Ahmed

Claude Code Expert · Online

👋

Hey there!

Quick Actions

WhatsApp Instant reply

Chat on WhatsApp

+880 1723 741224 · Instant reply

Popular Questions

Engr Mejba Ahmed is connected
Engr Mejba Ahmed is typing...
Engr Mejba Ahmed avatar

✉ Want me to follow up? Drop your email

Engr Mejba Ahmed avatar

📞 Connect Directly

Choose how you'd like to reach me

WhatsApp

+880 1723 741224

Email

[email protected]

✓ Details sent! I'll get back to you shortly.

Powered by OpenAI

335+

Blog Posts

25

AI Courses

63

Projects

Services & Expertise

Pricing & Process

Learning & Resources

Connect & Support