Semantic Caching: Cache LLM Responses That Mean the Same

A normal cache matches exact keys. A semantic cache matches *meanings* — return the cached answer when the new query is close enough by embedding similarity.

Apr 29, 2026 · 4 min read

The librarian's-memory analogy

A traditional cache is a librarian who only retrieves books by exact title. Misspell anything and they shrug. A great librarian remembers books by what they're about — close enough wording, same answer.

A semantic cache turns the LLM cache from the first librarian into the second. "How do I reset my password?" and "I forgot my login credentials" mean the same thing — and should hit the same cached answer.

How it works

For each new query, compute its embedding (a vector).
Search a vector store of past (query embedding → answer) pairs.
If there's a match within a similarity threshold, return the cached answer.
Otherwise, call the LLM, store the new (query, answer) pair, return.

You're trading exact-key matching for similarity matching. The vector store does the work.
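The four steps above can be sketched in a few lines. This is a minimal illustration, not a production design: `toy_embed` is a hypothetical bag-of-words stand-in for a real embedding model (so it only catches near-identical wording, not true paraphrases), the linear scan stands in for a vector store, and `call_llm` is whatever function you use to reach your model.

```python
import math
from collections import Counter
from typing import Callable, Optional

Vector = dict[str, float]


def toy_embed(text: str) -> Vector:
    """Hypothetical stand-in embedder: lowercase bag-of-words counts.
    A real system would call an embedding model here."""
    words = [w.strip("?.,!").lower() for w in text.split()]
    return dict(Counter(w for w in words if w))


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    def __init__(self, embed: Callable[[str], Vector], threshold: float):
        self.embed = embed
        self.threshold = threshold
        # Past (query embedding, answer) pairs; a vector store in practice.
        self.entries: list[tuple[Vector, str]] = []

    def lookup(self, query: str) -> Optional[str]:
        """Return the best cached answer if it clears the threshold."""
        q = self.embed(query)
        best_score, best_answer = 0.0, None
        for vec, cached_answer in self.entries:
            score = cosine(q, vec)
            if score > best_score:
                best_score, best_answer = score, cached_answer
        return best_answer if best_score >= self.threshold else None

    def store(self, query: str, answer: str) -> None:
        self.entries.append((self.embed(query), answer))


def answer(query: str, cache: SemanticCache, call_llm: Callable[[str], str]) -> str:
    cached = cache.lookup(query)
    if cached is not None:
        return cached              # hit: the LLM call is skipped
    fresh = call_llm(query)        # miss: pay for the call once
    cache.store(query, fresh)
    return fresh
```

Swapping `toy_embed` for a real embedder and the list for a vector index changes the quality and speed of matching, but not the control flow.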

What it's good at

Customer support FAQs — same questions, paraphrased a thousand ways.
Documentation Q&A — "how do I install X?" in many flavours.
Routing layers — classify into categories that repeat constantly.
Stable-answer queries — "what does this error code mean?" doesn't change weekly.
Read-heavy assistants — far more queries than unique answers.

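Hit rates of 30–60% are common on FAQ-style traffic. Each hit skips the LLM call, so that is roughly a 30–60% cut in LLM spend on that traffic (less the small cost of embedding and lookup), not a marginal gain.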

What it's bad at

Personalised answers — the answer depends on the user, not just the question.
Stateful conversations — the cache key is "this query"; the right answer might depend on previous turns.
Time-sensitive answers — "what's the weather?" cached at 3pm is wrong at 9pm.
Generative creative tasks — every prompt should produce something new.
Hallucination retention — cache a wrong answer, ship it forever. Worst-case for quality.

The design knobs

Similarity threshold — how close is "close enough." Too tight = low hit rate; too loose = wrong answers. Typical: cosine similarity of at least 0.92–0.97 for text, depending on the embedding model.
Embedding model — strong, domain-matched embeddings = better matches. Don't reuse a generic embedder for medical or legal terms.
TTL — how long before a cached answer expires. Hours to days for FAQ-style; minutes to seconds for fast-moving content.
Scope keys — segment cache by tenant, locale, user-tier, model version. A free-tier answer should not serve a paid-tier user.
Negative cache — also cache "I don't know" responses to avoid recomputing low-value paths.
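Two of these knobs, TTL and scope keys, are easy to show in code. The sketch below is one possible shape, with hypothetical names throughout (`Scope`, `Entry`, `ScopedStore`); it uses an exact string key inside each partition for brevity, where a real semantic cache would run its similarity search inside the partition instead.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Scope:
    """Everything that must match before an answer may be reused."""
    tenant: str
    locale: str
    tier: str
    model_version: str  # bump when the prompt or model changes


@dataclass
class Entry:
    answer: str
    stored_at: float
    ttl_seconds: float

    def is_fresh(self, now: float) -> bool:
        return now - self.stored_at < self.ttl_seconds


class ScopedStore:
    """Partitions entries by scope so lookups never cross tenant, tier or version."""

    def __init__(self) -> None:
        self._partitions: dict[Scope, dict[str, Entry]] = {}

    def put(self, scope: Scope, key: str, answer: str,
            ttl_seconds: float, now: Optional[float] = None) -> None:
        now = time.time() if now is None else now
        self._partitions.setdefault(scope, {})[key] = Entry(answer, now, ttl_seconds)

    def get(self, scope: Scope, key: str, now: Optional[float] = None) -> Optional[str]:
        now = time.time() if now is None else now
        entry = self._partitions.get(scope, {}).get(key)
        if entry is None:
            return None
        if not entry.is_fresh(now):
            del self._partitions[scope][key]  # expired: evict and report a miss
            return None
        return entry.answer
```

Because `model_version` is part of the scope, changing it makes every older entry unreachable without an explicit purge. A negative cache fits the same structure: store the "I don't know" reply like any other answer, usually with a shorter `ttl_seconds`.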

Engineering it well

Validate before serving cached. Run a cheap check ("does the cached answer mention the query's key entity?") to catch bad matches.
Log near-misses. Queries close to a hit but below threshold are tuning data.
Background re-warming. Periodically re-run cached queries through the LLM to refresh stale answers.
Versioned cache keys. When the underlying prompt or model changes, bump the version so old answers don't leak.
Monitor wrongness. Sample served-cached answers for quality. A bad cached answer is now serving 40% of your users.
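The first two practices, validating before serving and logging near-misses, can share one decision function. This is a rough sketch under stated assumptions: the "key entity" check is approximated by simple word overlap, and `STOPWORDS`, `near_miss_margin` and the return labels are illustrative choices, not a standard.

```python
import logging
import re

logger = logging.getLogger("semantic_cache")

# Small illustrative stopword list; a real check would use proper entity extraction.
STOPWORDS = {"what", "does", "this", "that", "with", "from", "have", "your", "when", "where"}


def key_terms(query: str) -> set[str]:
    """Crude proxy for the query's key entities: longer, non-stopword tokens."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    return {w for w in words if len(w) > 3 and w not in STOPWORDS}


def passes_entity_check(query: str, cached_answer: str) -> bool:
    """Cheap validation: the cached answer mentions at least one key term."""
    terms = key_terms(query)
    answer_words = set(re.findall(r"[a-z0-9]+", cached_answer.lower()))
    return not terms or bool(terms & answer_words)


def decide(query: str, cached_answer: str, score: float,
           threshold: float, near_miss_margin: float = 0.05) -> str:
    """Return 'hit', 'rejected', 'near_miss' or 'miss' for the best candidate."""
    if score >= threshold:
        if passes_entity_check(query, cached_answer):
            return "hit"
        # Similar enough by embedding, but the answer looks off-topic.
        logger.info("rejected by validation: score=%.3f", score)
        return "rejected"
    if score >= threshold - near_miss_margin:
        # Tuning data: just below the threshold. Scrub PII before logging query text.
        logger.info("near miss: score=%.3f threshold=%.3f", score, threshold)
        return "near_miss"
    return "miss"
```

Only `"hit"` serves the cached answer; the other three outcomes fall through to the LLM, and the `"rejected"` and `"near_miss"` logs become the dataset for tuning the threshold.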

Semantic cache vs prompt caching

These are different things, often confused:

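Prompt caching (Anthropic, OpenAI features) — provider-side cache of stable input prefixes; cached reads are billed at a discount (around 10% of the input price on some providers; the exact rate varies by provider and model).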
Semantic cache — application-side cache of (query, answer) pairs by similarity. The LLM call is skipped entirely on hits.

You can and should use both. They stack.
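To see how the two stack, here is a back-of-the-envelope cost model. All numbers in the usage note are hypothetical, and the model makes simplifying assumptions: every semantic-cache miss reads the full stable prefix from the provider's prompt cache, cache-write surcharges are ignored, and `lookup_cost` lumps embedding and vector search together.

```python
def llm_call_cost(input_tokens: int, output_tokens: int,
                  price_in: float, price_out: float,
                  cached_prefix_tokens: int = 0,
                  cached_read_multiplier: float = 0.1) -> float:
    """Cost of one LLM call, with part of the input billed at the cached-read rate.
    Prices are per token; cached_read_multiplier is an assumed discount factor."""
    uncached = input_tokens - cached_prefix_tokens
    input_cost = (cached_prefix_tokens * price_in * cached_read_multiplier
                  + uncached * price_in)
    return input_cost + output_tokens * price_out


def expected_cost_per_query(hit_rate: float, call_cost: float,
                            lookup_cost: float = 0.0) -> float:
    """Every query pays for the lookup; only misses pay for the LLM call."""
    return lookup_cost + (1.0 - hit_rate) * call_cost
```

With hypothetical prices of $3 per million input tokens and $15 per million output tokens, a 2,000-token prompt (1,500 of it a stable prefix) and a 300-token reply, `llm_call_cost(2000, 300, 3e-6, 15e-6)` gives the baseline per-call cost, adding `cached_prefix_tokens=1500` gives the prompt-cached cost, and `expected_cost_per_query(0.4, ...)` applies a 40% semantic hit rate on top. The two savings multiply because they act on different parts of the bill: the semantic cache removes calls, prompt caching makes the remaining calls cheaper.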

Where it pays off most

Customer support sites with high query overlap.
Internal Q&A bots answering company-knowledge questions.
High-volume classification or routing where the input distribution has heavy reuse.

Where it backfires

Personalised RAG — a cache that ignores user context returns a stranger's answers.
Compliance-sensitive replies — a wrong cached answer in a regulated context is a real problem.
Rapidly evolving knowledge — cached answer about a feature that changed yesterday.

Common pitfalls

No PII handling. Storing user queries verbatim in a cache is a privacy minefield. Hash, scrub, or scope.
Threshold set by guess. Build a tuning set; pick threshold by precision/recall, not vibes.
Forgotten invalidation. Source content changed; cache didn't. Tie cache version to source version.
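Picking the threshold from a tuning set can be as simple as a sweep. The sketch below assumes you have labelled pairs of (similarity score, whether the cached answer was actually correct for the new query); the function names and the "lowest threshold that meets a precision target" rule are one reasonable choice, not the only one.

```python
from typing import Optional

Pair = tuple[float, bool]  # (similarity score, cached answer was correct)


def precision_recall(pairs: list[Pair], threshold: float) -> tuple[float, float]:
    """Precision: share of served answers that were correct.
    Recall: share of reusable answers that were actually served."""
    served = [ok for score, ok in pairs if score >= threshold]
    true_positives = sum(served)
    total_positives = sum(ok for _, ok in pairs)
    precision = true_positives / len(served) if served else 1.0
    recall = true_positives / total_positives if total_positives else 0.0
    return precision, recall


def pick_threshold(pairs: list[Pair], candidates: list[float],
                   min_precision: float) -> Optional[tuple[float, float, float]]:
    """Lowest candidate threshold meeting the precision target.
    Recall never rises as the threshold rises, so the lowest qualifying
    threshold also has the best recall among those that qualify."""
    for threshold in sorted(candidates):
        precision, recall = precision_recall(pairs, threshold)
        if precision >= min_precision:
            return threshold, precision, recall
    return None
```

Near-miss and wrong-answer logs are a natural source for the labelled pairs. A strict precision target trades hit rate for safety, which matches the advice to start conservative and loosen only on evidence.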

In one line

Semantic caching is one of the highest-leverage tricks for read-heavy LLM workloads: paraphrased duplicates become near-free hits, and the bill drops without quality moving, as long as the threshold, scoping, and invalidation are right.

From the field

Semantic caching is a real cost win and a real footgun, and the whole risk lives in the similarity threshold. Set it too loose and you'll serve a cached answer to a question that only sounds like the cached one — "how do I cancel" and "how do I cancel my premium plan" can want very different replies. So I use it where questions genuinely repeat and answers are stable (FAQ, docs help) and keep it well away from anything personalised or stateful, where two similar-looking prompts have different correct answers. Start with a conservative threshold, watch for wrong-answer reports, and loosen only on evidence. A confidently-wrong cache is worse than none.
