What this prompt does
This prompt turns "it fails sometimes in CI" into a closed, evidence-based investigation. It forces the agent to first earn a reproduction (run the named test 100 times under a seed loop and report the actual pass/fail ratio) before it's allowed to theorize. That ordering matters: many flaky-test fixes miss because they're guesses layered on a failure nobody can trigger on demand.
It works because the five steps mirror how real intermittency gets cornered. Capturing stdout/stderr/logs (step 2) anchors hypotheses to observed output, not vibes. Demanding three hypotheses across the usual suspects — race condition, time dependency, shared state, external dependency — stops the model from fixating on the first plausible story. Then deterministic instrumentation (step 4) is what separates the real cause from a coincidence: you confirm the winning hypothesis instead of assuming it.
The closing constraint is the whole point. By requiring a regression test that deterministically reproduces the bug before the fix, and explicitly banning retry wrappers, the prompt blocks the two laziest "solutions" — hiding the failure behind reruns, or shipping a fix with no proof it addressed the cause.
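The reproduction step described above can be sketched as a small seed loop. This is a minimal illustration, not the prompt's own harness: `run_seed_loop` and `order_sensitive_test` are hypothetical names, and the stand-in test fails far more often than a typical flake. It only shows the shape of "run N times, record which seeds fail."

```python
import random
from typing import Callable, List, Tuple


def run_seed_loop(test_fn: Callable[[int], None], runs: int = 100) -> Tuple[int, List[int]]:
    """Run test_fn once per seed; return (failure_count, failing_seeds)."""
    failing_seeds: List[int] = []
    for seed in range(runs):
        try:
            test_fn(seed)
        except AssertionError:
            # Keep the seed so the exact failing run can be replayed later.
            failing_seeds.append(seed)
    return len(failing_seeds), failing_seeds


def order_sensitive_test(seed: int) -> None:
    """Hypothetical stand-in for a flaky test whose outcome depends on ordering."""
    rng = random.Random(seed)  # all randomness flows from the seed
    items = ["A", "B", "C"]
    rng.shuffle(items)
    if items[0] != "A":
        raise AssertionError(f"unexpected order {items} for seed {seed}")


failures, seeds = run_seed_loop(order_sensitive_test, runs=100)
print(f"{failures}/100 runs failed; first failing seeds: {seeds[:3]}")
```

Because every source of randomness is derived from the seed, any seed in the failing list replays the same failure on demand, which is what makes the later hypotheses testable.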
When to use it
- A test passes locally every time but fails ~1 in N runs on CI, and reruns make it "go away."
- You suspect test-order dependence or leaked state between tests but can't yet prove it.
- A date/time- or timezone-sensitive test fails only at certain hours or near midnight boundaries.
- Async or concurrent code intermittently fails on assertions that race the code under test.
- Someone has already wrapped the test in retries and you want the actual root cause instead.
- A test touches a real network, clock, filesystem, or DB and you need to isolate the external dependency.
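For the time-sensitive case in the list above, one common way to isolate the dependency is to inject the clock rather than read it inside the code under test. The sketch below assumes a hypothetical `expires_today` function; the point is only that a pinned clock makes the midnight boundary reproducible at any hour.

```python
from datetime import datetime, timedelta, timezone
from typing import Callable


def expires_today(expiry: datetime, now: Callable[[], datetime]) -> bool:
    """Return True if expiry falls on the current UTC date, using an injected clock."""
    today = now().astimezone(timezone.utc).date()
    return expiry.astimezone(timezone.utc).date() == today


def test_expiry_across_midnight() -> None:
    # Pin the clock one second before midnight UTC instead of reading real time.
    just_before_midnight = datetime(2024, 1, 1, 23, 59, 59, tzinfo=timezone.utc)
    expiry = just_before_midnight + timedelta(seconds=2)  # lands on 2024-01-02

    assert expires_today(expiry, now=lambda: just_before_midnight) is False
    assert expires_today(just_before_midnight, now=lambda: just_before_midnight) is True


test_expiry_across_midnight()
```

With the real clock, this boundary would only be exercised for two seconds a day; with the injected one, it is exercised on every run.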
Example output
Repro: 100 runs, seed loop → 7 failures (7%). Seeds 41, 88, 93… reproduce.
Logs: failures all assert order [A,C,B]; passes assert [A,B,C].
Hypotheses:
1) Shared state — module-level cache persists across tests ← strongest
2) Race — async fetch resolves out of order
3) Time dependency — ruled out (no clock use in path)
Instrumentation: logged cache identity + test name per run.
→ Confirmed: test_user_list mutates the cache test_order reads.
Fix: reset cache in teardown (isolate state).
Regression test: seeds the polluted cache, asserts correct order —
fails on old code, passes on patched code. No retry added.
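A regression test of the kind this example output describes might look like the sketch below. The cache, `pollute_cache`, and `reset_cache` are hypothetical stand-ins (the example output does not show the real code); the shape is what matters: the test recreates the polluted state on purpose, so it fails every time without isolation and passes every time with it.

```python
from typing import Dict, List

# Hypothetical module-level cache shared across tests.
_CACHE: Dict[str, List[str]] = {}


def get_order() -> List[str]:
    """Return the cached order, populating it on first use."""
    if "order" not in _CACHE:
        _CACHE["order"] = ["A", "B", "C"]
    return _CACHE["order"]


def reset_cache() -> None:
    """The fix: what a teardown would call to isolate state between tests."""
    _CACHE.clear()


def pollute_cache() -> None:
    """Stand-in for the polluting test: reorders the cached list in place."""
    order = get_order()
    order[1], order[2] = order[2], order[1]


def observed_order(isolate: bool) -> List[str]:
    reset_cache()        # start from a known state
    pollute_cache()      # deterministically reproduce the pollution
    if isolate:
        reset_cache()    # teardown reset between the two tests
    return list(get_order())


def test_order_regression() -> None:
    # Without isolation the bug reproduces on every run, not 7% of them.
    assert observed_order(isolate=False) == ["A", "C", "B"]
    # With the teardown reset, the reader sees the correct order.
    assert observed_order(isolate=True) == ["A", "B", "C"]


test_order_regression()
```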
Pro tips
- Set [test_name] to a single test, not a whole file. One flaky case usually has one root cause; a file mixes several and muddies the hypotheses.
- If 100 runs yield zero failures, the flakiness is likely environmental. Push the agent to add CI-only variance: randomized test order, parallelism, a shifted system clock, or constrained resources.
- Watch for a "fix" that just narrows the timing window. Insist the regression test fail reliably on the pre-fix code; a test that only sometimes catches the bug is the same flake wearing a disguise.
- Make the loop honest with your runner's randomization flags (a seed flag, or shuffled test order) — otherwise "100 runs" just repeats the same code path 100 times and proves nothing.
- For shared-state suspects, ask it to bisect by running the failing test immediately after each other test; the polluter usually surfaces fast.