Home Concept Explainers Claude Platform Claude Prompt Caching: 90% Cheaper, Same Quality

Claude Platform MCP handshake 3 sliders

Claude Prompt Caching: 90% Cheaper, Same Quality

Mark the stable parts of your prompt as cacheable. Claude bills the cache hit at ~10% of the input cost. On chat-shaped workloads, the savings are huge.

Apr 29, 2026 · 4 min lezen

Naar het lab Geen registratie · Voor altijd gratis

▸ Probeer het zelf

Sleep een slider — het diagram reageert in real time.

Spatie voor play · ←/→ om te scrubben

MCP handshake

FR /100 SN-312

SPACE · ◄ ►

¶ De analogie

The pre-prepped meal analogy

A restaurant pre-preps stock, sauces, mise en place at the start of service. When orders arrive, those prep ingredients get reused all night. Re-prepping them every order would be madness.

A long Claude prompt usually has the same shape: a giant system prompt + reference docs that does not change, plus a short user message that does. Without caching, the API re-charges you for the prep on every call. Prompt caching lets you mark the prep as cached — re-runs hit it at a fraction of the cost.

What gets cached

You add cache_control markers to chunks of your input that are stable across calls. Typical candidates:

System prompt — your role + constraints + output rules.
Long instructions / playbook — RAG with your company conventions baked in.
Tool definitions — the JSON schemas for every tool the agent can call.
Documents in context — a 50k-token PDF the model needs to reference.
Conversation history prefix — older turns that are stable.

Anything that changes every request (the user's new message) goes after the cache marker.

What you save

Input cost — cache reads are roughly 10% of normal input pricing. A 50k-token system prompt that costs $$X uncached costs ~$$X/10 on every cached subsequent call.
Latency — cached prefixes process much faster on the server. TTFT improves on long contexts.
Throughput — less prefill work means the same hardware serves more concurrent requests.

Output cost is unchanged. The savings are entirely on the input side.

What it costs

The first call that writes the cache pays a small premium (usually ~25%) over normal input cost. Break-even is typically after 1–3 reuses. After that you're in profit indefinitely (within the cache TTL).

Default cache TTL is short — a few minutes — but extended TTLs (e.g. 1 hour) are available at slightly different pricing.

When caching is a no-brainer

Multi-turn chat — same system prompt + same growing history prefix on every turn.
RAG with stable corpus — same retrieved docs reused within a session.
Agent loops — same tool definitions and rules across every step.
Code review on a single file — same file content across many critique iterations.

When it does not help

Truly stateless one-shot calls with always-fresh inputs (e.g. classifying random web pages).
Cache-busting prompts where you change wording every call (avoid this if you can).
Tiny prompts where the absolute saving is too small to bother with.

Implementation pattern

Identify the longest stable prefix of your prompt.
Insert cache_control: { "type": "ephemeral" } at the boundary between stable and dynamic parts.
Send the request. The first hit writes the cache; subsequent ones read it.
Monitor cache_creation_input_tokens vs cache_read_input_tokens in the response — your scoreboard.

Engineering tips

Order matters — once you cache, do not reorder stable chunks. Any change invalidates the cache.
Multiple breakpoints — you can mark several cache points (e.g. system + tools + docs each separately) for finer-grained reuse.
Watch the TTL — bursty traffic plus a short TTL means many cache misses. If your traffic is spiky, use the extended TTL.
Cache bust deliberately when you actually change the prompt — appending a version tag inside the cached chunk forces a fresh write.

In one sentence

Prompt caching turns "expensive long context" into "cheap long context, paid for once" — the single biggest cost lever on the Claude API.

From the field

Prompt caching only pays off if you design the prompt for it, and the rule is brutally simple: stable content first, volatile content last. I've watched a cache deliver almost nothing because a timestamp or the user's name got injected near the top, invalidating everything below it on every call. Put the system prompt, tool definitions, and reference docs in a fixed prefix; put the user's new message at the very end. Then actually watch your cache-hit metric — if it's low, something dynamic crept into the prefix. Done right it's the cheapest optimisation in the stack: same quality, a fraction of the input cost.

→ Wilt u dit in uw stack?

Custom Claude Code AI Agents & Workflows

Stop doing the repetitive, multi-step work that eats your team's day. This service delivers a working AI agent system that handles tasks like lead processing, data enrichment, content pipelines, and r...

Zie hoe ik kan helpen