The pre-prepped meal analogy
A restaurant pre-preps stock, sauces, mise en place at the start of service. When orders arrive, those prep ingredients get reused all night. Re-prepping them every order would be madness.
A long Claude prompt usually has the same shape: a giant system prompt + reference docs that does not change, plus a short user message that does. Without caching, the API re-charges you for the prep on every call. Prompt caching lets you mark the prep as cached — re-runs hit it at a fraction of the cost.
What gets cached
You add cache_control markers to chunks of your input that are stable across calls. Typical candidates:
- System prompt — your role + constraints + output rules.
- Long instructions / playbook — RAG with your company conventions baked in.
- Tool definitions — the JSON schemas for every tool the agent can call.
- Documents in context — a 50k-token PDF the model needs to reference.
- Conversation history prefix — older turns that are stable.
Anything that changes every request (the user's new message) goes after the cache marker.
What you save
- Input cost — cache reads are roughly 10% of normal input pricing. A 50k-token system prompt that costs $$X uncached costs ~$$X/10 on every cached subsequent call.
- Latency — cached prefixes process much faster on the server. TTFT improves on long contexts.
- Throughput — less prefill work means the same hardware serves more concurrent requests.
Output cost is unchanged. The savings are entirely on the input side.
What it costs
The first call that writes the cache pays a small premium (usually ~25%) over normal input cost. Break-even is typically after 1–3 reuses. After that you're in profit indefinitely (within the cache TTL).
Default cache TTL is short — a few minutes — but extended TTLs (e.g. 1 hour) are available at slightly different pricing.
When caching is a no-brainer
- Multi-turn chat — same system prompt + same growing history prefix on every turn.
- RAG with stable corpus — same retrieved docs reused within a session.
- Agent loops — same tool definitions and rules across every step.
- Code review on a single file — same file content across many critique iterations.
When it does not help
- Truly stateless one-shot calls with always-fresh inputs (e.g. classifying random web pages).
- Cache-busting prompts where you change wording every call (avoid this if you can).
- Tiny prompts where the absolute saving is too small to bother with.
Implementation pattern
- Identify the longest stable prefix of your prompt.
- Insert
cache_control: { "type": "ephemeral" }at the boundary between stable and dynamic parts. - Send the request. The first hit writes the cache; subsequent ones read it.
- Monitor
cache_creation_input_tokensvscache_read_input_tokensin the response — your scoreboard.
Engineering tips
- Order matters — once you cache, do not reorder stable chunks. Any change invalidates the cache.
- Multiple breakpoints — you can mark several cache points (e.g. system + tools + docs each separately) for finer-grained reuse.
- Watch the TTL — bursty traffic plus a short TTL means many cache misses. If your traffic is spiky, use the extended TTL.
- Cache bust deliberately when you actually change the prompt — appending a version tag inside the cached chunk forces a fresh write.
In one sentence
Prompt caching turns "expensive long context" into "cheap long context, paid for once" — the single biggest cost lever on the Claude API.