Skip to main content
Claude Platform MCP handshake 3 sliders

Claude Prompt Caching: 90% Cheaper, Same Quality

Mark the stable parts of your prompt as cacheable. Claude bills the cache hit at ~10% of the input cost. On chat-shaped workloads, the savings are huge.

· 3 min lezen
Naar het lab
▸ Probeer het zelf

Sleep een slider — het diagram reageert in real time.

FR /100
¶ De analogie

The pre-prepped meal analogy

A restaurant pre-preps stock, sauces, mise en place at the start of service. When orders arrive, those prep ingredients get reused all night. Re-prepping them every order would be madness.

A long Claude prompt usually has the same shape: a giant system prompt + reference docs that does not change, plus a short user message that does. Without caching, the API re-charges you for the prep on every call. Prompt caching lets you mark the prep as cached — re-runs hit it at a fraction of the cost.

What gets cached

You add cache_control markers to chunks of your input that are stable across calls. Typical candidates:

  • System prompt — your role + constraints + output rules.
  • Long instructions / playbook — RAG with your company conventions baked in.
  • Tool definitions — the JSON schemas for every tool the agent can call.
  • Documents in context — a 50k-token PDF the model needs to reference.
  • Conversation history prefix — older turns that are stable.

Anything that changes every request (the user's new message) goes after the cache marker.

What you save

  • Input cost — cache reads are roughly 10% of normal input pricing. A 50k-token system prompt that costs $$X uncached costs ~$$X/10 on every cached subsequent call.
  • Latency — cached prefixes process much faster on the server. TTFT improves on long contexts.
  • Throughput — less prefill work means the same hardware serves more concurrent requests.

Output cost is unchanged. The savings are entirely on the input side.

What it costs

The first call that writes the cache pays a small premium (usually ~25%) over normal input cost. Break-even is typically after 1–3 reuses. After that you're in profit indefinitely (within the cache TTL).

Default cache TTL is short — a few minutes — but extended TTLs (e.g. 1 hour) are available at slightly different pricing.

When caching is a no-brainer

  • Multi-turn chat — same system prompt + same growing history prefix on every turn.
  • RAG with stable corpus — same retrieved docs reused within a session.
  • Agent loops — same tool definitions and rules across every step.
  • Code review on a single file — same file content across many critique iterations.

When it does not help

  • Truly stateless one-shot calls with always-fresh inputs (e.g. classifying random web pages).
  • Cache-busting prompts where you change wording every call (avoid this if you can).
  • Tiny prompts where the absolute saving is too small to bother with.

Implementation pattern

  1. Identify the longest stable prefix of your prompt.
  2. Insert cache_control: { "type": "ephemeral" } at the boundary between stable and dynamic parts.
  3. Send the request. The first hit writes the cache; subsequent ones read it.
  4. Monitor cache_creation_input_tokens vs cache_read_input_tokens in the response — your scoreboard.

Engineering tips

  • Order matters — once you cache, do not reorder stable chunks. Any change invalidates the cache.
  • Multiple breakpoints — you can mark several cache points (e.g. system + tools + docs each separately) for finer-grained reuse.
  • Watch the TTL — bursty traffic plus a short TTL means many cache misses. If your traffic is spiky, use the extended TTL.
  • Cache bust deliberately when you actually change the prompt — appending a version tag inside the cached chunk forces a fresh write.

In one sentence

Prompt caching turns "expensive long context" into "cheap long context, paid for once" — the single biggest cost lever on the Claude API.

Engr Mejba Ahmed

Engr Mejba Ahmed

Claude Code Expert · Online

👋

Hey there!

Quick Actions

WhatsApp Instant reply

Chat on WhatsApp

+880 1723 741224 · Instant reply

Popular Questions

Engr Mejba Ahmed is connected
Engr Mejba Ahmed is typing...
Engr Mejba Ahmed avatar

✉ Want me to follow up? Drop your email

Engr Mejba Ahmed avatar

📞 Connect Directly

Choose how you'd like to reach me

WhatsApp

+880 1723 741224

Email

[email protected]

✓ Details sent! I'll get back to you shortly.

Powered by OpenAI

335+

Blog Posts

25

AI Courses

63

Projects

Services & Expertise

Pricing & Process

Learning & Resources

Connect & Support