What is the best way to cache embeddings at scale?

Direct Answer
The best way to cache embeddings at scale is to treat the embedding model as a deterministic, expensive function and cache its output keyed by a hash of the exact input text and model version, so identical inputs are never re-embedded. In practice that means a fast key-value cache such as Redis in front of the embedding API for hot lookups, durable storage of vectors in a vector database (pgvector, Qdrant, Milvus, Pinecone, or Weaviate) as the system of record, and a content-hash key that includes the model name and dimensions so a model upgrade invalidates cleanly.
For very large corpora you add batch deduplication before embedding, persistent on-disk caches for offline pipelines, and a clear policy for when embeddings expire or get recomputed.
Why caching embeddings matters
Generating an embedding is not free: each call costs money on a hosted API or GPU time on a self-hosted model, and at scale those calls dominate both the bill and the latency of any retrieval pipeline. Yet the same text — a product description, a support article, a repeated user query — gets embedded again and again across ingestion runs, retries, and similar requests.
Because an embedding model is effectively deterministic for a fixed model version, the same input yields the same vector (up to small floating-point differences on some hosted APIs and GPUs), which makes embeddings an ideal thing to cache. Done well, caching cuts embedding spend roughly in proportion to the cache hit rate, shortens ingestion time on re-runs, and removes the embedding step from the latency path for repeated queries.
Design the cache key correctly
Everything depends on the key. A correct embedding cache key is a hash of the normalized input text combined with the model identifier and dimensions — for example sha256(model_name + ":" + dimensions + ":" + normalized_text). Including the model and dimensions is essential: the same sentence embedded by two different models produces different vectors, so a key that omits the model will silently serve wrong vectors after an upgrade.
Normalize the text before hashing so trivial differences do not cause misses: trim whitespace, optionally lowercase if your use case is case-insensitive, and standardize unicode. Be deliberate, though — if casing or punctuation is semantically meaningful for your model, do not strip it.
The goal is that semantically identical inputs map to the same key while genuinely different inputs do not collide. Hashing also keeps keys a fixed, compact size regardless of input length, which matters when the input is a long document chunk.
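The key construction described above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the function names, the choice of NFC unicode normalization, and the collapsing of internal whitespace runs are assumptions made for the example, and lowercasing is off by default because it is only safe for case-insensitive use cases.

```python
import hashlib
import unicodedata


def normalize_text(text: str, lowercase: bool = False) -> str:
    """Standardize unicode and whitespace; lowercase only if requested."""
    # Assumption: NFC is the chosen unicode form for this sketch.
    text = unicodedata.normalize("NFC", text)
    # Trims both ends and collapses internal whitespace runs to one space.
    text = " ".join(text.split())
    return text.lower() if lowercase else text


def embedding_cache_key(text: str, model_name: str, dimensions: int,
                        lowercase: bool = False) -> str:
    """Return sha256(model_name:dimensions:normalized_text) as a hex string."""
    normalized = normalize_text(text, lowercase=lowercase)
    payload = f"{model_name}:{dimensions}:{normalized}".encode("utf-8")
    # The digest is always 64 hex characters, whatever the input length.
    return hashlib.sha256(payload).hexdigest()
```

With this shape, "  hello world " and "hello   world" map to the same key, while the same text under a different model name or dimension count maps to a different one.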

Choose the right storage tiers
Embedding caching usually spans more than one layer, each suited to a different access pattern.
- Hot key-value cache (Redis or Memcached): for low-latency lookups of recently or frequently used embeddings, especially query-side caching where identical queries recur (an exact-hash cache only hits on text that is identical after normalization). Redis can store the raw vector bytes keyed by the content hash with a TTL.
- Vector database as system of record (pgvector, Qdrant, Milvus, Pinecone, Weaviate): the durable home of your corpus embeddings. Because the vectors already live here for similarity search, the database itself acts as the persistent cache for document embeddings — you check whether a chunk's id or content hash already has a vector before re-embedding it.
- Persistent on-disk cache for offline pipelines: for large batch ingestion, a local store (a key-value file store, SQLite, or object storage) lets re-runs skip everything already embedded, which is invaluable when a pipeline fails partway and restarts.
A common production shape uses Redis for hot query embeddings, a vector database for the corpus, and a content-hash check at ingestion so nothing is embedded twice.
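A read-through lookup across these tiers might look like the sketch below. To keep it self-contained and runnable, a plain dict stands in for the hot cache (Redis or Memcached) and an SQLite table stands in for the durable store; both substitutions, the class name TieredEmbeddingCache, and the embed_fn callable are assumptions for illustration, not any product's API.

```python
import hashlib
import sqlite3
import struct
from typing import Callable, Dict, List

Vector = List[float]


class TieredEmbeddingCache:
    """Hot dict -> durable SQLite table -> embedding function (hypothetical)."""

    def __init__(self, embed_fn: Callable[[str], Vector], model_name: str,
                 dimensions: int, db_path: str = ":memory:") -> None:
        self.embed_fn = embed_fn
        self.model_name = model_name
        self.dimensions = dimensions
        self.hot: Dict[str, Vector] = {}      # stand-in for Redis/Memcached
        self.db = sqlite3.connect(db_path)    # stand-in for the durable store
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS embeddings "
            "(key TEXT PRIMARY KEY, vec BLOB NOT NULL)"
        )

    def _key(self, text: str) -> str:
        payload = f"{self.model_name}:{self.dimensions}:{text.strip()}"
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get(self, text: str) -> Vector:
        key = self._key(text)
        if key in self.hot:                   # tier 1: hot cache
            return self.hot[key]
        row = self.db.execute(
            "SELECT vec FROM embeddings WHERE key = ?", (key,)
        ).fetchone()
        if row is not None:                   # tier 2: durable store
            blob = row[0]
            vec = list(struct.unpack(f"<{len(blob) // 8}d", blob))
        else:                                 # miss everywhere: embed once
            vec = list(self.embed_fn(text))
            self.db.execute(
                "INSERT OR REPLACE INTO embeddings (key, vec) VALUES (?, ?)",
                (key, struct.pack(f"<{len(vec)}d", *vec)),
            )
            self.db.commit()
        self.hot[key] = vec                   # promote for the next lookup
        return vec
```

Vectors are packed as little-endian 64-bit floats so they round-trip exactly; a real deployment would more likely store 32-bit floats and would add a TTL or size bound to the hot tier.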
Deduplicate before you embed
At scale the biggest savings come before any cache lookup. Large corpora contain heavy duplication — boilerplate, repeated chunks, near-identical records. Deduplicate the batch first by grouping inputs on their content hash, embedding each unique string once, and fanning the resulting vector back out to every occurrence.
This turns a job that would make millions of calls into one that makes only as many as there are unique inputs. Pair exact-hash dedup with a check against already-stored hashes so incremental ingestion only embeds what is genuinely new. For pipelines, make embedding an idempotent step keyed by content hash so retries and re-runs are automatically cheap.
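The dedup-then-fan-out step can be sketched as follows. The names embed_deduplicated and embed_batch are hypothetical, the stored dict stands in for whatever holds already-embedded hashes (for example a lookup against the vector database), and the hash covers only the text because the sketch assumes stored holds vectors for a single model.

```python
import hashlib
from typing import Callable, Dict, List, Sequence

Vector = List[float]


def content_hash(text: str) -> str:
    """Hash of the whitespace-trimmed text (single-model assumption)."""
    return hashlib.sha256(text.strip().encode("utf-8")).hexdigest()


def embed_deduplicated(texts: Sequence[str],
                       embed_batch: Callable[[List[str]], List[Vector]],
                       stored: Dict[str, Vector]) -> List[Vector]:
    """Embed each unseen unique text once, then fan vectors back out."""
    hashes = [content_hash(t) for t in texts]

    # Keep the first occurrence of every hash that is not already stored.
    pending: Dict[str, str] = {}
    for h, t in zip(hashes, texts):
        if h not in stored and h not in pending:
            pending[h] = t

    if pending:
        new_vectors = embed_batch(list(pending.values()))
        if len(new_vectors) != len(pending):
            raise ValueError("embed_batch returned the wrong number of vectors")
        # Dicts preserve insertion order, so keys and vectors line up.
        for h, vec in zip(pending.keys(), new_vectors):
            stored[h] = vec

    # Fan out: every occurrence gets the vector for its hash.
    return [stored[h] for h in hashes]
```

Calling it twice with overlapping inputs sends only the new unique texts to embed_batch on the second call, which is what makes retries and incremental runs cheap.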
Handle query-side caching and semantic hits
Caching document embeddings during ingestion is straightforward; query-side caching needs more thought. Exact-match query caching — hash the query, return the stored vector — works well when users repeat identical queries. For broader reuse, some teams add a semantic cache that returns a cached *result* when a new query is very similar to a previous one, using a similarity threshold against recent queries; tools like GPTCache implement this pattern.
Semantic caching trades a small accuracy risk (a near-but-not-identical query served the same answer) for big latency and cost wins, so tune the similarity threshold carefully and reserve it for tolerant use cases.
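To make the threshold trade-off concrete, here is a small sketch of a semantic cache built on cosine similarity. It is not GPTCache's API: the SemanticCache class, its linear scan over recent entries, and the default threshold of 0.95 are all assumptions chosen for illustration, and a production version would use an approximate nearest-neighbor index instead of a scan.

```python
import math
from typing import Any, List, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class SemanticCache:
    """Return a cached result when a query vector is close enough to a past one."""

    def __init__(self, threshold: float = 0.95, max_entries: int = 1000) -> None:
        self.threshold = threshold        # assumption: tune per use case
        self.max_entries = max_entries
        self.entries: List[Tuple[List[float], Any]] = []

    def lookup(self, query_vec: Sequence[float]) -> Optional[Any]:
        best_score, best_result = -1.0, None
        for vec, result in self.entries:
            score = cosine_similarity(query_vec, vec)
            if score > best_score:
                best_score, best_result = score, result
        # Below the threshold counts as a miss, so the caller recomputes.
        return best_result if best_score >= self.threshold else None

    def store(self, query_vec: Sequence[float], result: Any) -> None:
        self.entries.append((list(query_vec), result))
        if len(self.entries) > self.max_entries:
            self.entries.pop(0)           # drop the oldest entry
```

Raising the threshold toward 1.0 makes the cache behave more like exact matching; lowering it increases hits and the chance of serving an answer to a query that only looks similar. Note that this sketch cannot distinguish a miss from a cached result of None.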
Plan for invalidation and model upgrades
The hardest part of any cache is knowing when an entry is stale. For embeddings the main triggers are a model upgrade and changed source content. Because the model version is part of the key, upgrading to a new embedding model naturally produces new keys and a clean cache miss — but you must then re-embed the corpus under the new model, since old and new vectors are not comparable in the same index.
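For changed content, the content hash changes, so the new text simply misses and gets embedded. In a hot cache the old entry ages out through its TTL; in the vector database it does not expire on its own, so delete or overwrite the old vector for that chunk, or stale text will keep surfacing in search results. Set TTLs on hot caches to bound memory, keep the vector database as the durable source, and document a re-embedding runbook so a model change is a planned migration rather than a surprise.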
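The TTL behavior of a hot cache, and the way a versioned key turns a model upgrade into a plain miss, can be illustrated with a small in-process sketch. The TTLCache class and the example key strings are hypothetical; with Redis the same effect would come from setting an expiry on each key rather than from code like this.

```python
import time
from typing import Callable, Dict, List, Optional, Tuple


class TTLCache:
    """In-process hot cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl_seconds = ttl_seconds
        self.clock = clock                # injectable so expiry is testable
        self._items: Dict[str, Tuple[float, List[float]]] = {}

    def set(self, key: str, vector: List[float]) -> None:
        self._items[key] = (self.clock() + self.ttl_seconds, list(vector))

    def get(self, key: str) -> Optional[List[float]]:
        item = self._items.get(key)
        if item is None:
            return None
        expires_at, vector = item
        if self.clock() >= expires_at:
            del self._items[key]          # lazily evict the expired entry
            return None
        return vector


# Illustrative keys: the model version is part of the key, so a lookup
# under a new model cannot return a vector produced by the old one.
cache = TTLCache(ttl_seconds=3600)
cache.set("model-v1:1536:abc123", [0.1, 0.2])
old_hit = cache.get("model-v1:1536:abc123")   # served from cache
new_miss = cache.get("model-v2:1536:abc123")  # None: must be re-embedded
```

Expired entries are removed lazily on read here; a long-running process would also want a size bound or periodic sweep so keys that are never read again do not accumulate.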
Frequently Asked Questions
What should the cache key be based on? A hash of the normalized input text combined with the embedding model name and output dimensions. Including the model and dimensions prevents serving vectors from the wrong model after an upgrade, and hashing keeps keys compact and fixed-size even for long inputs.
Normalize whitespace and unicode before hashing, but only strip case or punctuation if that is semantically safe for your model.
Do I need a separate cache if I already use a vector database? Often the vector database is your cache for document embeddings — you check whether a chunk's content hash already has a stored vector before re-embedding. A separate hot cache like Redis is mainly valuable for query-side embeddings and low-latency repeated lookups.
Many setups use both: the database as the durable store and Redis for hot keys.
How do I avoid re-embedding an entire corpus on every run? Make embedding idempotent and keyed by content hash. Deduplicate each batch so unique inputs are embedded once, check new inputs against already-stored hashes, and use a persistent cache so a failed or repeated pipeline run skips everything already done.
Only genuinely new or changed content should reach the model.
What is semantic caching and when should I use it? Semantic caching returns a cached result when a new query is sufficiently similar to a previous one, rather than requiring an exact match. It cuts latency and cost for repeated, paraphrased queries but risks serving a near-match the same answer, so it suits tolerant use cases with a carefully tuned similarity threshold.
Tools like GPTCache implement it.
What happens to my cache when I upgrade the embedding model? Because the model version is part of the key, the new model produces new keys and clean misses, so you will not serve stale vectors. However, old and new embeddings are not comparable in the same index, so a model upgrade requires re-embedding the corpus and rebuilding the index as a planned migration.
Keep a runbook for this.
Sources
- pgvector documentation — storing and querying vectors in Postgres (github.com/pgvector/pgvector)
- Redis documentation — caching and vector storage (redis.io/docs)
- Qdrant documentation — vector database and storage (qdrant.tech/documentation)
- Milvus documentation — scalable vector storage (milvus.io/docs)
- Pinecone documentation — managed vector database (docs.pinecone.io)
- Weaviate documentation — vector storage and search (weaviate.io/developers/weaviate)
- GPTCache documentation — semantic caching for LLM and embeddings (github.com/zilliztech/GPTCache)
- OpenAI embeddings guide — deterministic embeddings and best practices (platform.openai.com/docs/guides/embeddings)
