The librarian's-memory analogy
A traditional cache is a librarian who only retrieves books by exact title. Misspell anything and they shrug. A great librarian remembers books by what they're about — close enough wording, same answer.
A semantic cache turns the LLM cache from the first librarian into the second. "How do I reset my password?" and "I forgot my login credentials" mean the same thing — and should hit the same cached answer.
How it works
- For each new query, compute its embedding (a vector).
- Search a vector store of past (query embedding → answer) pairs.
- If there's a match within a similarity threshold, return the cached answer.
- Otherwise, call the LLM, store the new (query embedding, answer) pair, and return the answer.
You're trading exact-key matching for similarity matching. The vector store does the work.
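The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production design: the `SemanticCache` class, the linear scan standing in for a vector store, and the `fake_embed` / `fake_llm` stubs with hand-made vectors are all assumptions made for the example. A real system would call an embedding model and a proper vector index.

```python
import math
from typing import Callable, Optional

Vector = list[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    """Toy cache: a linear scan over stored (embedding, answer) pairs."""

    def __init__(self, embed: Callable[[str], Vector],
                 llm: Callable[[str], str], threshold: float = 0.95) -> None:
        self.embed = embed
        self.llm = llm
        self.threshold = threshold
        self.entries: list[tuple[Vector, str]] = []

    def lookup(self, vector: Vector) -> Optional[str]:
        """Return the best-matching answer if it clears the threshold."""
        best_score, best_answer = -1.0, None
        for stored, answer in self.entries:
            score = cosine(vector, stored)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None

    def ask(self, query: str) -> str:
        vector = self.embed(query)           # step 1: embed the query
        cached = self.lookup(vector)         # step 2: search past pairs
        if cached is not None:
            return cached                    # step 3: hit, skip the LLM
        answer = self.llm(query)             # step 4: miss, call the LLM
        self.entries.append((vector, answer))
        return answer


# Hypothetical stand-ins for a real embedding model and a real LLM.
FAKE_VECTORS = {
    "How do I reset my password?": [0.90, 0.10, 0.00],
    "I forgot my login credentials": [0.88, 0.15, 0.02],
    "What is your refund policy?": [0.00, 0.20, 0.95],
}
llm_calls: list[str] = []


def fake_embed(query: str) -> Vector:
    return FAKE_VECTORS[query]


def fake_llm(query: str) -> str:
    llm_calls.append(query)
    return f"answer to: {query}"


cache = SemanticCache(fake_embed, fake_llm, threshold=0.95)
first = cache.ask("How do I reset my password?")     # miss: calls the LLM
second = cache.ask("I forgot my login credentials")  # hit: reuses first answer
third = cache.ask("What is your refund policy?")     # miss: unrelated query
```

With these hand-made vectors the two password questions sit well above the 0.95 threshold, so the stub LLM is called only twice for three queries.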
What it's good at
- Customer support FAQs — same questions, paraphrased a thousand ways.
- Documentation Q&A — "how do I install X?" in many flavours.
- Routing layers — classify into categories that repeat constantly.
- Stable-answer queries — "what does this error code mean?" doesn't change weekly.
- Read-heavy assistants — far more queries than unique answers.
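Hit rates of 30–60% are common on FAQ-style traffic. That is roughly a 30–60% cut in LLM spend, not a marginal gain, since each hit skips the LLM call and pays only for an embedding and a vector lookup.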
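To make the cost arithmetic concrete, here is a small sketch. It assumes every query pays a lookup cost (embedding plus vector search) and only misses pay for the LLM call. The function names and the prices are made up for illustration; they are not real provider rates.

```python
def expected_cost_per_query(llm_cost: float, lookup_cost: float,
                            hit_rate: float) -> float:
    """Average cost per query: every query pays the lookup, misses pay the LLM."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return lookup_cost + (1.0 - hit_rate) * llm_cost


def savings_fraction(llm_cost: float, lookup_cost: float,
                     hit_rate: float) -> float:
    """Fraction of the no-cache bill that the cache removes."""
    return 1.0 - expected_cost_per_query(llm_cost, lookup_cost, hit_rate) / llm_cost


# Illustrative prices only (assumptions, not real rates).
LLM_COST = 0.01       # cost of one LLM call
LOOKUP_COST = 0.0002  # cost of one embedding plus one vector search

for rate in (0.3, 0.6):
    saved = savings_fraction(LLM_COST, LOOKUP_COST, rate)
    print(f"hit rate {rate:.0%}: about {saved:.0%} saved")
```

Under these made-up prices, a 30% hit rate saves about 28% and a 60% hit rate about 58%. The saving tracks the hit rate closely as long as the lookup is cheap relative to the LLM call.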
What it's bad at
- Personalised answers — the answer depends on the user, not just the question.
- Stateful conversations — the cache key is "this query"; the right answer might depend on previous turns.
- Time-sensitive answers — "what's the weather?" cached at 3pm is wrong at 9pm.
- Generative creative tasks — every prompt should produce something new.
- Hallucination retention — cache a wrong answer, ship it forever. Worst-case for quality.
The design knobs
- Similarity threshold — how close is "close enough." Too tight = low hit rate; too loose = wrong answers. Typical starting range for text: cosine similarity of at least 0.92–0.97, tuned per embedding model.
- Embedding model — strong, domain-matched embeddings = better matches. Don't reuse a generic embedder for medical or legal terms.
- TTL — how long before a cached answer expires. Hours to days for FAQ-style content; seconds to minutes for fast-moving content.
- Scope keys — segment cache by tenant, locale, user-tier, model version. A free-tier answer should not serve a paid-tier user.
- Negative cache — also cache "I don't know" responses to avoid recomputing low-value paths.
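Two of these knobs, scope keys and TTL, are easy to show in code. The sketch below partitions entries by a scope tuple and drops them once they expire. `Scope`, `Entry`, `ScopedStore`, and the field names are hypothetical, and the clock is injectable only so the expiry behaviour can be demonstrated without waiting. Similarity search within a partition is left out here.

```python
import time
from dataclasses import dataclass
from typing import Callable, NamedTuple


class Scope(NamedTuple):
    """Cache partition key: entries never cross these boundaries."""
    tenant: str
    locale: str
    tier: str
    model_version: str


@dataclass
class Entry:
    vector: list[float]
    answer: str
    expires_at: float


class ScopedStore:
    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self.partitions: dict[Scope, list[Entry]] = {}

    def put(self, scope: Scope, vector: list[float], answer: str) -> None:
        expires_at = self.clock() + self.ttl_seconds
        self.partitions.setdefault(scope, []).append(
            Entry(vector, answer, expires_at))

    def live_entries(self, scope: Scope) -> list[Entry]:
        """Return unexpired entries for one scope, pruning expired ones."""
        now = self.clock()
        live = [e for e in self.partitions.get(scope, []) if e.expires_at > now]
        if scope in self.partitions:
            self.partitions[scope] = live
        return live


# Demo with a fake clock so expiry can be shown deterministically.
now = [0.0]
store = ScopedStore(ttl_seconds=60.0, clock=lambda: now[0])
free = Scope("acme", "en-US", "free", "model-v1")
paid = Scope("acme", "en-US", "paid", "model-v1")
store.put(free, [0.9, 0.1], "Free-tier answer")
```

A lookup for the `paid` scope sees nothing stored under `free`, and once the fake clock passes 60 seconds the free-tier entry disappears as well. Putting the model version in the scope also means a model upgrade starts from an empty partition.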
Engineering it well
- Validate before serving a cached answer. Run a cheap check ("does the cached answer mention the query's key entity?") to catch bad matches.
- Log near-misses. Queries close to a hit but below threshold are tuning data.
- Background re-warming. Periodically re-run cached queries through the LLM to refresh stale answers.
- Versioned cache keys. When the underlying prompt or model changes, bump the version so old answers don't leak.
- Monitor wrongness. Sample cached answers that were actually served and review them for quality. A bad entry is repeated to every user whose query matches it, and at a 40% hit rate that can be a lot of users.
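The first two practices, validating before serving and logging near-misses, might look roughly like this. It is a sketch under simple assumptions: the entity check is plain case-insensitive substring matching against a known entity list, and `passes_entity_check`, `serve_from_cache`, the `margin` parameter, and the example entities are all hypothetical names chosen for illustration.

```python
from typing import Optional


def passes_entity_check(query: str, cached_answer: str,
                        entities: list[str]) -> bool:
    """Every known entity named in the query must also appear in the answer.

    If the query names no known entity, the check cannot object and passes.
    """
    q, a = query.lower(), cached_answer.lower()
    mentioned = [e.lower() for e in entities if e.lower() in q]
    return all(e in a for e in mentioned)


def serve_from_cache(query: str, cached_answer: str, score: float,
                     entities: list[str],
                     near_misses: list[tuple[str, float]],
                     threshold: float = 0.95,
                     margin: float = 0.05) -> Optional[str]:
    """Return the cached answer, or None to fall through to the LLM."""
    if score >= threshold:
        if passes_entity_check(query, cached_answer, entities):
            return cached_answer
        return None  # similar wording, wrong subject: treat as a miss
    if score >= threshold - margin:
        near_misses.append((query, score))  # tuning data for the threshold
    return None


ENTITIES = ["Postgres", "MySQL"]
near_misses: list[tuple[str, float]] = []

good = serve_from_cache("How do I install Postgres?",
                        "Install Postgres with your package manager.",
                        0.97, ENTITIES, near_misses)
bad = serve_from_cache("How do I install Postgres?",
                       "Install MySQL with your package manager.",
                       0.97, ENTITIES, near_misses)
close = serve_from_cache("Postgres setup steps?",
                         "Install Postgres with your package manager.",
                         0.92, ENTITIES, near_misses)
```

Here `good` is served, `bad` is rejected despite its high similarity score because the answer is about a different product, and `close` falls through to the LLM while leaving a record in `near_misses` for later threshold tuning.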
Semantic cache vs prompt caching
These are different things, often confused:
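- Prompt caching (Anthropic, OpenAI features) — provider-side cache of stable input prefixes; cached reads are billed at a fraction of the normal input price (about 10% on Anthropic; the discount varies by provider and model), but the model still runs.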
- Semantic cache — application-side cache of (query, answer) pairs by similarity. The LLM call is skipped entirely on hits.
You can and should use both. They stack.
Where it pays off most
- Customer support sites with high query overlap.
- Internal Q&A bots answering company-knowledge questions.
- High-volume classification or routing where the input distribution has heavy reuse.
Where it backfires
- Personalised RAG — a cache that ignores user context returns another user's answers.
- Compliance-sensitive replies — a wrong cached answer in a regulated context is a real problem.
- Rapidly evolving knowledge — a cached answer about a feature that changed yesterday is already wrong.
Common pitfalls
- No PII handling. Storing user queries verbatim in a cache is a privacy minefield. Hash, scrub, or scope.
- Threshold set by guess. Build a tuning set; pick threshold by precision/recall, not vibes.
- Forgotten invalidation. Source content changed; cache didn't. Tie cache version to source version.
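Picking the threshold from a tuning set can be sketched as follows. The assumption is that each tuning example has been reduced to a pair: the similarity score of the best cached candidate, and a human label saying whether serving that candidate would have been correct. The data, the candidate thresholds, and the function names are invented for illustration.

```python
def precision_recall(pairs: list[tuple[float, bool]],
                     threshold: float) -> tuple[float, float]:
    """Precision and recall of 'serve from cache' decisions at one threshold."""
    tp = sum(1 for score, ok in pairs if score >= threshold and ok)
    fp = sum(1 for score, ok in pairs if score >= threshold and not ok)
    fn = sum(1 for score, ok in pairs if score < threshold and ok)
    precision = tp / (tp + fp) if tp + fp else 1.0  # no hits served: no errors
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def pick_threshold(pairs: list[tuple[float, bool]], candidates: list[float],
                   min_precision: float):
    """Lowest candidate threshold that meets the precision target.

    Recall can only fall as the threshold rises, so the lowest qualifying
    threshold keeps the most hits. Returns None if no candidate qualifies.
    """
    for threshold in sorted(candidates):
        precision, recall = precision_recall(pairs, threshold)
        if precision >= min_precision:
            return threshold, precision, recall
    return None


# Made-up tuning set: (similarity of best candidate, was it a correct match?)
TUNING_SET = [
    (0.99, True), (0.97, True), (0.96, True), (0.94, False),
    (0.93, True), (0.91, False), (0.88, False), (0.85, True),
]

choice = pick_threshold(TUNING_SET, [0.90, 0.92, 0.95, 0.97], min_precision=0.95)
```

On this toy data a threshold of 0.92 would serve one wrong answer out of five hits, so the first candidate that meets a 95% precision target is 0.95, at the cost of missing two of the five correct matches. That trade-off is the one the threshold knob controls.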
In one line
Semantic caching is one of the highest-leverage tricks for read-heavy LLM workloads: it turns paraphrased duplicates into near-free hits, and with a tuned threshold, proper scoping, and invalidation, your bill drops without quality moving.