Home Concept Explainers AI Operations & Production AI Cost Optimization: Cutting LLM Bills 80%

AI Operations & Production Crawler graph 3 Slider

AI Cost Optimization: Cutting LLM Bills 80%

Most LLM bills can be cut by 50–90% without quality loss. Caching, model routing, prompt diet, and output caps deliver the bulk of it.

Apr 29, 2026 · 4 Min. Lesezeit

Zum Lab springen Keine Anmeldung · Für immer kostenlos

▸ Selbst ausprobieren

Zieh einen Slider — das Diagramm reagiert in Echtzeit.

Leertaste für Play · ←/→ zum Scrubben

Crawler graph

FR /100 SN-514

SPACE · ◄ ►

¶ Die Analogie

The utility-bill audit analogy

Most homes pay 30–50% more in utilities than they need to — old bulbs, leaky windows, that one fridge from 1998. None of the fixes feel like sacrifices, but they compound to a much smaller monthly bill.

LLM bills work the same way. The first version of an AI feature is almost always 2–10× more expensive than necessary. The fixes are not "use a worse model" — they're hygiene. Caching, routing, and a token diet turn a scary monthly invoice into a manageable line item.

The five highest-yield levers

1. Prompt caching

Mark the stable parts of your prompt as cacheable. Cache reads cost roughly 10% of input pricing. On chat-shaped workloads (where the system prompt + history is reused), savings of 50–90% are normal.

Apply to: system prompts, tool definitions, long instruction blocks, retrieved-doc context that's reused within a session.

2. Model routing

Triage calls and send them to the cheapest model that handles them.

Haiku for: classification, extraction, routing, simple Q&A
Sonnet for: agent steps, RAG, code, content
Opus for: novel reasoning, hard refactors, hard debugging

A simple front-line classifier (Haiku) deciding the route saves 60–80% on workloads where most queries are easy.

3. Prompt diet

System prompts grow like cabling under a desk. Audit them quarterly:

Remove obsolete instructions and dead branches.
Replace long examples with concise ones.
Move stable conventions into a cached prefix.
Cut repeated phrasing.

A 30% prompt-token cut, applied to every call, is real money on busy features.

4. Output caps

max_tokens is a budget control, not just a safety net. Most "answer the question" prompts produce 300–800 tokens. The 8000-token max your code defaulted to never gets used — but pays for itself the one time the model decides to ramble.

Combine with structured outputs that have natural length bounds.

5. Caching at the application layer (semantic cache)

For tasks with common inputs (FAQ-style queries, repeated classifications), a semantic cache returns the previous answer when a new query is "close enough" by embedding similarity. Big wins on customer support, search assist, and common how-to queries.

See the dedicated semantic-caching explainer for the engineering.

Lower-yield but cumulative

Streaming + early cancellation — let users (or downstream code) cut off when the answer is done.
Batching where supported — some providers offer ~50% discount for offline batch APIs.
Provider tiers / commitment plans — meaningful discounts for predictable spend.
Region pinning — sometimes regional endpoints are cheaper or faster.
Quantized / distilled self-host — for high-volume narrow tasks, a smaller open-weights model may beat hosted on TCO.

Cost analysis: where to look

Top callers by spend. Usually a single feature dominates. Fix that first.
Top callers by tokens-per-call. If one call is 30k tokens, ask why.
Output / input ratio. Output is more expensive; high ratios mean you might be over-generating.
Cache hit rate. Should be 70%+ on chat-shaped workloads. Lower is leak.
Model mix. What % of calls go to your most expensive model? Usually too high.

The trap of "just optimise later"

The half-life of bad cost decisions is long. A feature that ships at $0.40/call gets baked into product expectations; later cuts feel like regressions. The engineering effort to introduce caching after launch is meaningfully harder than wiring it in from day one.

Ship cost-aware from the start. Cache markers, output caps, and a routing skeleton cost you a day; recouping on launch costs weeks.

What's not worth optimising

Tiny features. A feature that costs $20/month doesn't justify a sprint.
Quality-critical paths at the expense of the user. Saving $0.10 by routing to a worse model and losing customers is a bad trade.
Premature exotic stacks (custom inference, fine-tunes) when caching + routing wasn't tried.

In one line

Most LLM bills are 2–5× too high not from greed but from defaults. Cache aggressively, route by task, diet your prompts, and cap your outputs — that's the 80%.

From the field

Before optimising a single token I pull actual spend by feature and by model, because cost intuition is almost always wrong — the bill is usually dominated by one chatty endpoint or one task pointlessly running on the flagship. Nine times out of ten the big wins are boring: route simple tasks to a cheaper tier, trim a bloated prompt, cache the stable prefix, cap output length. The exotic stuff comes last and saves least. I don't optimise prematurely — at low volume, engineering time costs more than tokens — but the moment a feature scales, a day on those boring levers routinely cuts the bill by more than half.

→ Wollen Sie das in Ihrem Stack?

AWS Cloud Infrastructure & DevOps Engineering

Get an AWS environment that is secure by default, sized to what you actually run, and documented so your team can operate it without guesswork. Whether you are launching a new workload or inheriting a...

So kann ich helfen