Semantic Caching: Cache LLM Responses That Mean the Same

A normal cache matches exact keys. A semantic cache matches *meanings*: it returns the cached answer when the new query is close enough by embedding similarity.

The librarian's-memory analogy

A traditional cache is a librarian who only retrieves books by exact title. Misspell anything and they shrug. A great librarian remembers books by what they're about — close enough wording, same answer.

A semantic cache turns the LLM cache from the first librarian into the second. "How do I reset my password?" and "I forgot my login credentials" mean the same thing — and should hit the same cached answer.

How it works

  1. For each new query, compute its embedding (a vector).
  2. Search a vector store of past (query embedding → answer) pairs.
  3. If there's a match within a similarity threshold, return the cached answer.
  4. Otherwise, call the LLM, store the new (query, answer) pair, return.

You're trading exact-key matching for similarity matching. The vector store does the work.
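The four steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production design: `SemanticCache`, `answer_query`, `fake_embed`, and `fake_llm` are hypothetical names, the "vector store" is a plain list scanned linearly, and the embedder is a hand-written lookup table standing in for a real embedding model.

```python
import math
from typing import Callable, List, Optional, Tuple

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class SemanticCache:
    """In-memory list of (query embedding, answer) pairs, searched linearly."""

    def __init__(self, embed: Callable[[str], Vector], threshold: float = 0.92):
        self.embed = embed
        self.threshold = threshold
        self.entries: List[Tuple[Vector, str]] = []

    def lookup(self, query: str) -> Optional[str]:
        """Steps 1-3: embed the query, find the nearest entry, apply the threshold."""
        q = self.embed(query)
        best_score, best_answer = -1.0, None
        for vec, answer in self.entries:
            score = cosine(q, vec)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None

    def store(self, query: str, answer: str) -> None:
        self.entries.append((self.embed(query), answer))


def answer_query(query: str, cache: SemanticCache,
                 call_llm: Callable[[str], str]) -> str:
    """Step 4: on a miss, call the LLM, store the new pair, return the answer."""
    cached = cache.lookup(query)
    if cached is not None:
        return cached
    fresh = call_llm(query)
    cache.store(query, fresh)
    return fresh


# Stand-in embedder: hand-picked 2-D vectors, NOT a real embedding model.
FAKE_VECTORS = {
    "How do I reset my password?": [1.0, 0.0],
    "I forgot my login credentials": [0.98, 0.2],
    "What are your prices?": [0.0, 1.0],
}


def fake_embed(text: str) -> Vector:
    return FAKE_VECTORS[text]


llm_calls: List[str] = []  # records which queries actually reached the "LLM"


def fake_llm(query: str) -> str:
    llm_calls.append(query)
    return f"answer to: {query}"
```

With these stand-ins, the two password questions share one cached answer, while the pricing question misses and triggers its own call. A real deployment would swap the list for an approximate-nearest-neighbour index and the lookup table for an actual embedding model.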

What it's good at

  • Customer support FAQs — same questions, paraphrased a thousand ways.
  • Documentation Q&A — "how do I install X?" in many flavours.
  • Routing layers — classify into categories that repeat constantly.
  • Stable-answer queries — "what does this error code mean?" doesn't change weekly.
  • Read-heavy assistants — far more queries than unique answers.

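Hit rates of 30–60% are common on FAQ-style traffic. Because each hit skips the LLM call entirely, that is roughly a 30–60% cut in LLM spend (less the small cost of embedding and lookup), not a marginal gain.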
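To see how a hit rate translates into spend, here is a small back-of-the-envelope sketch. The function names and the per-query prices are made up for illustration; it assumes every query pays for the embedding and lookup, and only misses pay for the LLM call.

```python
def cost_per_query(hit_rate: float, llm_cost: float, lookup_cost: float) -> float:
    """Expected cost per query: lookup always, LLM call only on a miss."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return lookup_cost + (1.0 - hit_rate) * llm_cost


def savings_fraction(hit_rate: float, llm_cost: float, lookup_cost: float) -> float:
    """Fraction of the no-cache bill saved (negative if the cache costs more)."""
    if llm_cost <= 0.0:
        raise ValueError("llm_cost must be positive")
    return 1.0 - cost_per_query(hit_rate, llm_cost, lookup_cost) / llm_cost


# Illustrative numbers only: $0.01 per LLM call, $0.0002 per embed + lookup.
example = savings_fraction(hit_rate=0.4, llm_cost=0.01, lookup_cost=0.0002)
# example is about 0.38: a 40% hit rate saves about 38% once lookup cost is counted.
```

The gap between hit rate and savings grows as the lookup gets more expensive relative to the LLM call, which is why cheap embedders matter on low-cost models.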

What it's bad at

  • Personalised answers — the answer depends on the user, not just the question.
  • Stateful conversations — the cache key is "this query"; the right answer might depend on previous turns.
  • Time-sensitive answers — "what's the weather?" cached at 3pm is wrong at 9pm.
  • Generative creative tasks — every prompt should produce something new.
  • Hallucination retention — cache a wrong answer, ship it forever. Worst-case for quality.

The design knobs

  • Similarity threshold — how close is "close enough." Too tight = low hit rate; too loose = wrong answers. Typical thresholds for text sit at a cosine similarity of about 0.92–0.97; the right value depends on the embedding model.
  • Embedding model — strong, domain-matched embeddings = better matches. Don't reuse a generic embedder for medical or legal terms.
  • TTL — how long before a cached answer expires. Hours to days for FAQ-style; minutes to seconds for fast-moving content.
  • Scope keys — segment cache by tenant, locale, user-tier, model version. A free-tier answer should not serve a paid-tier user.
  • Negative cache — also cache "I don't know" responses to avoid recomputing low-value paths.
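Several of these knobs can live in one small structure. The sketch below is one possible arrangement, not a reference design: `ScopedSemanticCache` and the four-part scope tuple are hypothetical, the caller is assumed to supply embeddings already computed, and the clock is injectable so expiry can be tested without waiting.

```python
import math
import time
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

Vector = List[float]
Scope = Tuple[str, str, str, str]  # (tenant, locale, tier, model_version), an assumed layout


@dataclass
class Entry:
    vector: Vector
    answer: str
    expires_at: float


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ScopedSemanticCache:
    """Semantic cache partitioned by scope key, with a per-entry TTL."""

    def __init__(self, threshold: float = 0.95, ttl_seconds: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic):
        self.threshold = threshold
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self.partitions: Dict[Scope, List[Entry]] = {}

    def store(self, scope: Scope, vector: Vector, answer: str) -> None:
        entry = Entry(vector, answer, self.clock() + self.ttl_seconds)
        self.partitions.setdefault(scope, []).append(entry)

    def lookup(self, scope: Scope, vector: Vector) -> Optional[str]:
        now = self.clock()
        # Only entries in the caller's own scope are candidates; expired ones are dropped.
        live = [e for e in self.partitions.get(scope, []) if e.expires_at > now]
        self.partitions[scope] = live
        best = max(live, key=lambda e: cosine(vector, e.vector), default=None)
        if best is not None and cosine(vector, best.vector) >= self.threshold:
            return best.answer
        return None
```

Because lookups never cross partitions, a free-tier entry cannot be served to a paid-tier user, and bumping `model_version` in the scope starts a fresh partition. A negative cache falls out for free: store the "I don't know" response like any other answer, usually with a shorter TTL.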

Engineering it well

  • Validate before serving cached. Run a cheap check ("does the cached answer mention the query's key entity?") to catch bad matches.
  • Log near-misses. Queries close to a hit but below threshold are tuning data.
  • Background re-warming. Periodically re-run cached queries through the LLM to refresh stale answers.
  • Versioned cache keys. When the underlying prompt or model changes, bump the version so old answers don't leak.
  • Monitor wrongness. Sample served-cached answers for quality. A bad cached answer is now serving 40% of your users.
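The first two practices, validating before serving and logging near-misses, can be sketched as a single gate placed between the similarity search and the response. Everything here is illustrative: `serve_from_cache`, `mentions_key_entity`, the `near_miss_margin` parameter, and the idea of a `known_entities` list are assumptions, and the entity check is a plain substring match rather than real entity extraction.

```python
import logging
from typing import Optional, Sequence

logger = logging.getLogger("semantic_cache")


def mentions_key_entity(query: str, cached_answer: str,
                        known_entities: Sequence[str]) -> bool:
    """Cheap guard: every known entity named in the query must appear in the answer."""
    q, a = query.lower(), cached_answer.lower()
    required = [e.lower() for e in known_entities if e.lower() in q]
    return all(e in a for e in required)


def serve_from_cache(query: str, cached_answer: str, score: float,
                     threshold: float = 0.92, near_miss_margin: float = 0.05,
                     known_entities: Sequence[str] = ()) -> Optional[str]:
    """Return the cached answer only if it clears the threshold AND the guard."""
    if score >= threshold:
        if mentions_key_entity(query, cached_answer, known_entities):
            return cached_answer
        # Similar wording, different subject: treat as a miss and record it.
        logger.info("rejected hit score=%.3f query=%r", score, query)
        return None
    if score >= threshold - near_miss_margin:
        # Near-misses are tuning data for the threshold. Scrub PII before logging.
        logger.info("near miss score=%.3f query=%r", score, query)
    return None
```

A `None` result means "fall through to the LLM". The guard matters most for queries that differ by a single product or error-code name, where embeddings tend to score very close together.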

Semantic cache vs prompt caching

These are different things, often confused:

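  • Prompt caching (Anthropic, OpenAI features) — provider-side cache of stable input prefixes; cached reads are billed at a fraction of the normal input price (around 10% on Anthropic; the discount varies by provider and model).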
  • Semantic cache — application-side cache of (query, answer) pairs by similarity. The LLM call is skipped entirely on hits.

You can and should use both. They stack.

Where it pays off most

  • Customer support sites with high query overlap.
  • Internal Q&A bots answering company-knowledge questions.
  • High-volume classification or routing where the input distribution has heavy reuse.

Where it backfires

  • Personalised RAG — a cache that ignores user context returns a stranger's answers.
  • Compliance-sensitive replies — wrong cached answer in a regulated context is a real problem.
  • Rapidly evolving knowledge — cached answer about a feature that changed yesterday.

Common pitfalls

  • No PII handling. Storing user queries verbatim in a cache is a privacy minefield. Hash, scrub, or scope.
  • Threshold set by guess. Build a tuning set; pick threshold by precision/recall, not vibes.
  • Forgotten invalidation. Source content changed; cache didn't. Tie cache version to source version.
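Picking the threshold from a tuning set rather than by guess might look like the sketch below. It assumes you have labelled query pairs, each with a similarity score and a human judgement of whether one cached answer is correct for both; the function names, the sample data, and the 0.99 precision target are all illustrative.

```python
from typing import List, Optional, Sequence, Tuple

# (similarity score, True if the same cached answer is correct for both queries)
LabeledPair = Tuple[float, bool]


def precision_recall(pairs: Sequence[LabeledPair],
                     threshold: float) -> Tuple[float, float]:
    """Precision and recall of 'serve from cache' decisions at a given threshold."""
    tp = sum(1 for s, same in pairs if s >= threshold and same)
    fp = sum(1 for s, same in pairs if s >= threshold and not same)
    fn = sum(1 for s, same in pairs if s < threshold and same)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def pick_threshold(pairs: Sequence[LabeledPair], candidates: Sequence[float],
                   min_precision: float = 0.99) -> Optional[float]:
    """Lowest candidate meeting the precision target, which also maximises recall."""
    for t in sorted(candidates):
        precision, _ = precision_recall(pairs, t)
        if precision >= min_precision:
            return t
    return None


# Made-up tuning set for illustration.
tuning_set: List[LabeledPair] = [
    (0.99, True), (0.96, True), (0.94, True), (0.93, False),
    (0.91, True), (0.88, False), (0.80, False),
]
chosen = pick_threshold(tuning_set, candidates=[0.90, 0.92, 0.94, 0.96])
```

Precision here is the share of served cache hits that were correct, and recall is the share of reusable answers the cache actually reused. Favouring precision reflects the cost asymmetry: a missed hit costs one LLM call, while a wrong hit ships a wrong answer.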

In one line

Semantic caching is one of the highest-leverage tricks for read-heavy LLM workloads: turn paraphrased duplicates into free hits, and watch your bill drop without quality moving, as long as the threshold, scoping, and invalidation are handled.

Engr Mejba Ahmed
