← Hub
Pulse ← Library ⚡ Hire a Fractional CRO
Pulse Reviews and Analysis

What is a semantic cache and how much can it cut inference costs?

Kory WhiteCurated by Kory White · Fractional CRO, CRO Syndicate
👍 Yup or 👎 Nope — vote this up its category:
📅 Published · 7 min read
semantic cache for LLM inference cover

What is a semantic cache and how much can it cut inference costs?

Direct Answer

A semantic cache stores past LLM responses keyed by the *meaning* of the request rather than its exact text, so that a new query with the same intent — even if worded differently — returns the cached answer instead of hitting the model again. It works by embedding each incoming prompt, searching a vector store for a previously answered prompt above a similarity threshold, and serving the stored response on a hit.

For applications with repetitive or overlapping queries — FAQs, support bots, search, documentation assistants — a well-tuned semantic cache commonly serves anywhere from a small fraction to over half of traffic from cache, cutting token spend and latency proportionally on those requests.

The exact savings depend entirely on how repetitive your traffic is and how aggressively you tune the similarity threshold.

How a semantic cache works

A traditional cache (like Redis used as a key-value store) only returns a hit when the new key is *byte-for-byte identical* to a stored key. That is useless for natural language, where "What's your refund policy?" and "How do I get my money back?" are different strings but the same question. A semantic cache solves this by matching on meaning.

The flow is straightforward:

  1. Embed the incoming prompt. Convert the user's query into a vector using an embedding model (OpenAI text-embedding-3, Cohere Embed, or an open model like bge / gte served locally).
  2. Search the cache. Query a vector store (Redis with vector search, Milvus, Qdrant, pgvector, or a purpose-built layer) for the nearest stored prompt embedding.
  3. Apply a similarity threshold. If the nearest match scores above your threshold (for example cosine similarity ≥ 0.95), it is treated as a cache hit and the stored response is returned. If not, it is a miss.
  4. On a miss, call the model and store the result. The new prompt embedding and its generated response are written back to the cache for future hits.
flowchart LR Q[User query] --> E[Embed query] E --> S[Vector search cache] S -->|score >= threshold| H[Cache hit: return stored answer] S -->|score < threshold| M[Call LLM] M --> R[Return answer] R --> W[Write embedding + answer to cache]

The cost win is direct: a cache hit costs one cheap embedding call plus a vector lookup instead of a full, expensive generation call. Embedding a query is typically orders of magnitude cheaper than generating a multi-hundred-token completion, and the lookup is sub-millisecond to low-millisecond.

How much can it actually save?

The honest answer is that savings scale with your cache hit rate, and the hit rate depends on your traffic. There is no universal percentage. The math is simple:

Because completions are far more expensive than embeddings, the net saving per hit is close to the full generation cost. So a 40% hit rate translates to roughly a 35–40% reduction in generation spend on that traffic. Applications with highly repetitive queries — customer-support FAQs, internal knowledge bots, ecommerce product Q&A, documentation search — see the highest hit rates.

Applications where every query is unique and personalized (open-ended creative writing, per-user data analysis) see very low hit rates and benefit little.

flowchart TD A[Total requests] --> B{Cache hit?} B -->|Hit: cheap embed + lookup| C[Big per-request saving] B -->|Miss: embed + full generation| D[Small added overhead] C --> E[Net saving = hit rate x near-full completion cost] D --> E

The latency benefit is just as important as cost. A cache hit returns in milliseconds versus the seconds a generation can take, so the same mechanism that lowers spend also makes your fast path dramatically faster for repeated questions.

CRO Syndicate — Need a fractional Chief Revenue Officer? CRO Syndicate connects you with vetted fractional and interim revenue leaders. Kory White, Fractional CRO · 25 yrs · $0 to $200M scaled.

Reach Kory White, Fractional CRO: 📅 Book a Quick Call · 💼 Kory on LinkedIn · 🏢 CRO Syndicate

Tuning the similarity threshold

The threshold is the single most important knob, and it is a tradeoff:

False hits are the real danger of semantic caching, because they return confidently wrong answers. Best practice is to start strict, measure both hit rate and false-hit rate on real traffic, and loosen carefully. Many teams add safeguards: scoping the cache per user or per tenant, including key context (selected model, parameters, retrieved documents) in the cache key, and excluding personalized or time-sensitive queries entirely.

When you should and should not use one

Good fits: FAQ and support bots, documentation and knowledge assistants, search and autocomplete, classification of recurring inputs, and any workload where many users ask overlapping questions. RAG systems benefit too, since users frequently re-ask the same questions of the same corpus.

Poor fits: highly personalized responses that depend on user-specific or real-time data, creative generation where variety is the point, and anything where a stale answer is harmful (live pricing, account balances, breaking news). For these, either skip semantic caching or restrict it tightly and add freshness controls (TTLs, invalidation on data change).

Tools that implement semantic caching

You can build one yourself with an embedding model plus a vector store, or use purpose-built tools:

Frequently Asked Questions

Is a semantic cache the same as prompt caching from model providers? No. Provider-side prompt caching (offered by OpenAI, Anthropic, and others) reuses the *computation* of a long, repeated prompt prefix to lower per-call cost and latency, but it still calls the model. A semantic cache avoids the model call entirely on a hit.

They are complementary — use prompt caching to cheapen the calls you do make and semantic caching to skip calls altogether.

What hit rate should I expect? There is no guaranteed number; it is entirely workload-dependent. Repetitive support and FAQ traffic can hit a large share of requests, while unique, personalized workloads may hit almost none. The only reliable approach is to measure hit rate on your own traffic before assuming any saving.

How do I avoid serving wrong answers from the cache? Use a conservative similarity threshold, include all answer-affecting context (model, parameters, retrieved docs, user/tenant) in the cache key, scope caches per tenant where appropriate, and exclude personalized or time-sensitive queries.

Monitor false-hit rate with sampled human or LLM-judge review and tighten the threshold if it climbs.

Does the embedding model choice matter? Yes. The cache only matches as well as its embeddings represent meaning. A stronger retrieval embedding model produces more reliable similarity judgments, while a weak one causes both missed hits and false hits.

Keep the same embedding model for writing and reading the cache, and re-embed if you change models.

Where should the cache live in my architecture? Centralizing it at an AI gateway or a shared service in front of your models gives consistent behavior, one place to tune thresholds, and shared hit rates across applications. Per-app caches are simpler to start with but fragment your hit rate and complicate invalidation.

How do I keep cached answers fresh? Set time-to-live (TTL) values appropriate to how fast the underlying knowledge changes, and invalidate entries when source data updates (for RAG, when the indexed documents change). For volatile data, either bypass the cache or use very short TTLs.

Sources

Keep reading
Was this helpful?  
Related in the library
More from the library
pulse-speeches · speechesHow to Practice a Speech So It Sounds Naturalpulse-ai-infrastructure · ai-infrastructureHow do you set up observability for a RAG application?revops · current-events-2027How does the expanding size of B2B buying committees increase the risk of vendor consolidation paralysis?pulse-ai-infrastructure · ai-infrastructureThe 10 Best LLM Inference Servers in 2027pulse-ai-infrastructure · ai-infrastructureThe 10 Best LLM Guardrails and Safety Tools in 2027pulse-speeches · speechesA Graduation Speech for a Homeschool Graduationpulse-speeches · speechesA Speech for a City Council Swearing-Inpulse-speeches · speechesA Speech for a Charity Fundraiserpulse-ai-infrastructure · ai-infrastructureThe 10 Best GPU Cloud Providers for AI Training in 2027pulse-speeches · speechesA Retirement Speech for a Coachpulse-ai-infrastructure · ai-infrastructureHow do you route requests across multiple LLM providers?pulse-speeches · speechesA Speech for a Board Dinnerpulse-ai-infrastructure · ai-infrastructureHow do you version datasets and models for reproducibility?pulse-ai-infrastructure · ai-infrastructureThe 10 Best AI Workflow Orchestration Tools in 2027