How do you architect a RAG pipeline for low latency?

You architect a low-latency RAG pipeline by attacking every stage that adds time between the user's question and the first token of the answer: embed the query fast, retrieve from a vector index tuned for speed, cap and re-rank the context, and stream tokens from an optimized inference server.
The biggest wins come from caching (semantic and prompt caches that skip work entirely), parallelism (running retrieval and any metadata lookups concurrently), streaming (so perceived latency is the time to *first* token, not the whole answer), and right-sizing context (less retrieved text means faster prompts and cheaper generation).
A well-built RAG pipeline can deliver its first token in under a second, even over millions of documents.
Map the latency budget first
Before optimizing, decompose end-to-end latency into its stages. A typical RAG request spends time on: query embedding, vector search, optional re-ranking, prompt assembly, LLM time-to-first-token (TTFT), and token streaming. The LLM generation step almost always dominates, but retrieval and re-ranking can add hundreds of milliseconds if done naively.
Measure each stage with tracing (LangSmith, Langfuse, Arize Phoenix, or OpenTelemetry) so you optimize the real bottleneck instead of guessing.
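As a rough illustration of a per-stage latency budget, the sketch below times each stage with a small context manager. It is not a replacement for the tracing tools named above; the `LatencyBudget` class, the stage names, and the `time.sleep` calls standing in for real work are all assumptions made for the example.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class LatencyBudget:
    """Accumulates wall-clock time per named pipeline stage (hypothetical helper)."""

    def __init__(self):
        self.totals_ms = defaultdict(float)

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record elapsed time even if the stage raises.
            self.totals_ms[name] += (time.perf_counter() - start) * 1000.0

    def slowest(self):
        """Return the (stage, milliseconds) pair with the largest total."""
        return max(self.totals_ms.items(), key=lambda kv: kv[1])


budget = LatencyBudget()
with budget.stage("embed"):
    time.sleep(0.005)  # stand-in for the query embedding call
with budget.stage("vector_search"):
    time.sleep(0.010)  # stand-in for the vector search
with budget.stage("llm_ttft"):
    time.sleep(0.050)  # stand-in for waiting on the first token

# Print stages from slowest to fastest.
for name, ms in sorted(budget.totals_ms.items(), key=lambda kv: -kv[1]):
    print(f"{name:>14}: {ms:6.1f} ms")
```

In a real service the same stage boundaries would typically become spans in whichever tracing backend you use, so the breakdown is visible per request rather than only in aggregate.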
Optimize embedding and retrieval
The query must be embedded before you can search, so use a fast embedding model and keep it close to your app. Small, efficient models (OpenAI text-embedding-3-small, Cohere Embed, or a self-hosted BGE/E5 model on GPU) embed a short query in a few milliseconds. Cache embeddings for repeated or templated queries.
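One way to cache embeddings for repeated queries is an in-process LRU cache keyed on a normalized query string, sketched below. The `embed_remote` function is a hypothetical stand-in for whichever embedding model you actually call, and the normalization rule (lowercase, collapse whitespace) is an assumption that may be too aggressive for case-sensitive corpora.

```python
from functools import lru_cache

calls = {"count": 0}  # counts how often the (fake) embedding model is invoked


def embed_remote(text):
    """Hypothetical stand-in for a real embedding call.

    Returns a tuple so the cached value is immutable and safe to share.
    """
    calls["count"] += 1
    return tuple(float(len(word)) for word in text.split())


def normalize(query):
    """Lowercase and collapse whitespace so trivially different queries share a key."""
    return " ".join(query.lower().split())


@lru_cache(maxsize=10_000)
def _embed_cached(normalized_query):
    return embed_remote(normalized_query)


def embed_query(query):
    return _embed_cached(normalize(query))


first = embed_query("How do I reset my password?")
second = embed_query("  how do I RESET my password? ")
print(first == second, calls["count"])
```

The second call never reaches the embedding model because both strings normalize to the same key. Across multiple app instances, a shared store such as Redis would play the role of `lru_cache`.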
For the vector search itself, the index type and parameters set the latency floor. HNSW indexes (used by Qdrant, Weaviate, Milvus, and pgvector) deliver low-latency approximate nearest-neighbor search; lowering the search-time candidate list size (ef_search, though the exact parameter name varies by database) trades a little recall for a large speedup. Keep the index in memory, co-locate the vector database in the same region (ideally the same VPC) as your app, and constrain the search with metadata filters so you scan fewer candidates.
Managed services like Pinecone are built for low-latency retrieval at scale, but measure p95 against your own index size and filters rather than relying on headline figures.
Re-rank selectively, not always
Cross-encoder re-rankers (Cohere Rerank, BGE-reranker, or a hosted reranker) sharply improve relevance but add a model call over every retrieved candidate. To keep latency low: retrieve a modest candidate set (e.g., top 20–50), re-rank only those, and return a small final set (top 3–5).
For latency-critical paths, you can skip re-ranking entirely and rely on a strong embedding model plus hybrid (keyword + vector) search. Reserve re-ranking for queries where relevance matters most.
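The retrieve-then-re-rank funnel described above can be sketched as follows. The `overlap_score` function is a toy stand-in for a real cross-encoder call, and `rerank_funnel` and its parameter names are hypothetical; only the shape (cap the candidates, score them, keep a small final set) reflects the text.

```python
def overlap_score(query, doc):
    """Toy relevance score: number of query terms that appear in the document.

    A real pipeline would call a cross-encoder re-ranker here instead.
    """
    return len(set(query.lower().split()) & set(doc.lower().split()))


def rerank_funnel(candidates, query, score_fn, max_candidates=50, final_k=5):
    """Re-rank only the first max_candidates results and return the top final_k."""
    shortlist = candidates[:max_candidates]  # cap the re-ranker's workload
    ranked = sorted(shortlist, key=lambda doc: score_fn(query, doc), reverse=True)
    return ranked[:final_k]


# Candidates in the order the (hypothetical) first-stage retriever returned them.
retrieved = [
    "billing faq for invoices",
    "how to reset a password",
    "password policy and rotation",
    "office locations",
    "reset your password by email",
]

top = rerank_funnel(retrieved, "reset password", overlap_score,
                    max_candidates=5, final_k=2)
print(top)
```

Because re-ranking cost grows with the shortlist size, `max_candidates` is the main latency knob. On latency-critical paths the function call can be skipped entirely and the first `final_k` retrieved items used as-is.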

Cache aggressively
Caching is often the highest-leverage latency optimization because a semantic cache hit skips retrieval *and* generation entirely, while a prompt cache shortens the prefill work that remains. There are two layers:
- Semantic cache — embed the incoming query and check whether a semantically similar question was answered before (GPTCache, Redis with vector search, or a managed semantic cache). On a hit you return the stored answer in milliseconds. Tune the similarity threshold carefully to avoid serving stale or subtly-wrong matches.
- Prompt / KV cache — provider-side prompt caching (offered by Anthropic, OpenAI, and self-hosted vLLM via prefix caching) reuses the model's computation over a repeated context prefix, cutting TTFT substantially when many requests share the same system prompt or retrieved boilerplate.
For frequently asked questions, a semantic cache can serve a large fraction of traffic without ever touching the LLM, slashing both latency and cost.
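A minimal sketch of the semantic cache idea follows, assuming a linear scan over stored entries and a cosine-similarity threshold. The `SemanticCache` class is hypothetical and is not the API of GPTCache or Redis. The `toy_embed` bag-of-words function stands in for a real embedding model purely so the example runs without dependencies.

```python
import math

VOCAB = ["how", "do", "i", "reset", "my", "password", "refund", "order"]


def toy_embed(text):
    """Toy bag-of-words embedding over a fixed vocabulary (illustration only)."""
    words = text.lower().replace("?", "").split()
    return [float(words.count(term)) for term in VOCAB]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    """Returns a stored answer when a new query is similar enough to an old one."""

    def __init__(self, embed, threshold=0.9):
        self.embed = embed
        self.threshold = threshold
        self.entries = []  # list of (query_vector, answer)

    def put(self, query, answer):
        self.entries.append((self.embed(query), answer))

    def get(self, query):
        query_vec = self.embed(query)
        best_score, best_answer = -1.0, None
        for vec, answer in self.entries:  # linear scan; fine for a sketch only
            score = cosine(query_vec, vec)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None


cache = SemanticCache(toy_embed, threshold=0.9)
cache.put("How do I reset my password?", "Use the 'Forgot password' link.")

hit = cache.get("how do i reset password")      # similar wording: served from cache
miss = cache.get("How do I refund my order?")   # different intent: falls through
print(hit, miss)
```

Raising `threshold` reduces the risk of serving a subtly wrong answer at the cost of fewer hits. A production cache would also replace the linear scan with a vector index and expire entries when the underlying documents change.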
Optimize the generation step
Generation usually dominates latency, so optimize it directly. Serve open models on vLLM, TGI, or TensorRT-LLM, which use continuous batching, paged attention, and prefix caching to maximize throughput and minimize TTFT. Stream the response so the user sees the first token in a few hundred milliseconds rather than waiting for the full answer — this transforms *perceived* latency even when total generation time is unchanged.
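To make the difference between time-to-first-token and total generation time concrete, here is a small sketch that consumes a token stream and records both. The `fake_token_stream` generator simulates a model with `time.sleep` and is purely an assumption; a real client would iterate over whatever streaming iterator your inference server or provider SDK returns.

```python
import time


def fake_token_stream(tokens, first_delay=0.05, per_token_delay=0.01):
    """Simulated model: a prefill pause, then tokens at a steady rate."""
    time.sleep(first_delay)  # stands in for prefill before the first token
    for token in tokens:
        yield token
        time.sleep(per_token_delay)  # stands in for per-token decode time


def consume_with_timing(stream):
    """Return (full_text, seconds_to_first_token, total_seconds)."""
    start = time.perf_counter()
    ttft = None
    pieces = []
    for token in stream:
        if ttft is None:
            ttft = time.perf_counter() - start  # user sees output from here on
        pieces.append(token)
    total = time.perf_counter() - start
    return "".join(pieces), ttft, total


text, ttft, total = consume_with_timing(
    fake_token_stream(["Low", " latency", " starts", " with", " streaming."])
)
print(f"TTFT {ttft * 1000:.0f} ms, total {total * 1000:.0f} ms: {text}")
```

Without streaming, the user would wait for `total` before seeing anything; with streaming, the wait they perceive is `ttft`.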
Keep prompts lean: every extra thousand tokens of retrieved context adds prefill time, so retrieve the minimum context that answers the question. For simple queries, route to a smaller, faster model and reserve larger models for hard questions.
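One simple way to keep prompts lean is to pack ranked chunks into a fixed token budget and stop once it is full, as in this sketch. The `fit_context` helper is hypothetical, and the default token counter (whitespace-separated words) is a crude assumption; swap in your model's real tokenizer for accurate budgets.

```python
def count_words(text):
    """Crude token estimate (assumption): one token per whitespace-separated word."""
    return len(text.split())


def fit_context(ranked_chunks, max_tokens, count_tokens=count_words):
    """Keep the highest-ranked chunks that fit within max_tokens.

    Stops at the first chunk that would overflow, so rank order is preserved.
    Returns (kept_chunks, tokens_used).
    """
    kept, used = [], 0
    for chunk in ranked_chunks:  # assumed sorted best-first
        cost = count_tokens(chunk)
        if used + cost > max_tokens:
            break
        kept.append(chunk)
        used += cost
    return kept, used


chunks = [
    "Reset your password from the login page.",
    "Password resets expire after one hour.",
    "Our office is closed on public holidays and weekends.",
]

kept, used = fit_context(chunks, max_tokens=15)
print(len(kept), used)
```

Here the third, least relevant chunk is dropped because it would push the context past the budget, which directly reduces prefill time for the request.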
Run stages in parallel and at the edge
Wherever stages are independent, run them concurrently. Query embedding can overlap with metadata or permission lookups; multiple retrievers (vector + keyword) can run in parallel and merge. Co-locate the whole pipeline in one region to avoid cross-region network hops, and put the embedding model, vector DB, and inference server close together.
For globally distributed users, place retrieval and caching near users while keeping generation in a GPU region, accepting one fast hop for the LLM call.
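The concurrency idea can be sketched with `asyncio.gather`, running two retrievers and a permission lookup at the same time and then merging the results. The three coroutines are hypothetical stand-ins that simulate I/O with `asyncio.sleep`. The merge uses reciprocal rank fusion, which is one common choice rather than something the text above prescribes, and `k=60` is a conventional default, not a tuned value.

```python
import asyncio
import time


async def vector_search(query):
    await asyncio.sleep(0.05)  # simulated network call to the vector store
    return ["doc3", "doc1", "doc7"]


async def keyword_search(query):
    await asyncio.sleep(0.04)  # simulated call to a keyword index
    return ["doc1", "doc9", "doc3"]


async def allowed_docs(user_id):
    await asyncio.sleep(0.03)  # simulated permission lookup
    return {"doc1", "doc3", "doc9"}


def rrf_merge(rankings, k=60):
    """Reciprocal rank fusion: sum 1 / (k + rank) across rankings, best first."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by score descending, then by id for a deterministic order on ties.
    return sorted(scores, key=lambda d: (-scores[d], d))


async def retrieve(query, user_id):
    # All three calls start together; total wait is roughly the slowest one.
    vec, kw, allowed = await asyncio.gather(
        vector_search(query), keyword_search(query), allowed_docs(user_id)
    )
    return [doc for doc in rrf_merge([vec, kw]) if doc in allowed]


start = time.perf_counter()
results = asyncio.run(retrieve("reset password", "user-42"))
elapsed_ms = (time.perf_counter() - start) * 1000.0
print(results, f"{elapsed_ms:.0f} ms")
```

Run sequentially, the three simulated calls would take about 120 ms; run concurrently, the wall-clock time should be close to the slowest single call.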
A reference low-latency stack
A pragmatic 2027 stack looks like: a fast embedding model (hosted or self-hosted BGE/E5 on GPU); an in-region HNSW vector store (Qdrant, Weaviate, Milvus, or Pinecone) with metadata filtering and tuned ef_search; a semantic cache (Redis or GPTCache) in front; selective Cohere/BGE re-ranking; and generation on vLLM with prefix caching and streaming, fronted by an LLM gateway for routing and fallback.
Wrap everything in tracing (Langfuse, Arize Phoenix) so you can see per-stage latency and keep tightening the slowest link.
Frequently Asked Questions
What usually causes the most latency in a RAG pipeline? LLM generation — specifically time-to-first-token and total tokens generated — is almost always the largest single contributor. Retrieval and re-ranking are typically smaller but can balloon if the index isn't in memory, the region is far away, or you re-rank a huge candidate set.
Always measure before optimizing.
How much can caching reduce latency? A semantic-cache hit returns an answer in milliseconds versus hundreds of milliseconds to seconds for full retrieval plus generation, and prompt/prefix caching can cut TTFT for cache-eligible requests significantly. For workloads with repetitive questions, caching can offload a large share of traffic entirely.
Does streaming actually make it faster? Streaming does not reduce total generation time, but it dramatically reduces *perceived* latency: the user sees the first words in a few hundred milliseconds and reads as the rest arrives. For interactive apps, time-to-first-token is the metric that matters most, and streaming optimizes exactly that.
Should I always use a re-ranker? No. Re-ranking improves relevance but adds a model call. Use it on quality-critical or ambiguous queries, and skip it on latency-critical paths where a strong embedding model and hybrid search already return good results. Re-rank a small candidate set, never the whole index.
How do I keep vector search fast at scale? Use an HNSW (or comparable graph) index kept in memory, tune the search parameter (ef_search) for your recall/latency target, apply metadata filters to shrink the search space, co-locate the database with your app, and shard if the corpus is very large.
Managed vector databases are tuned for low p95 latency out of the box.
Which inference server is best for low latency? For self-hosted open models, vLLM, TGI, and TensorRT-LLM all deliver low TTFT and high throughput via continuous batching and paged attention; vLLM's prefix caching is especially valuable for RAG's repeated context. For hosted models, choose providers offering prompt caching and streaming.
Sources
- vLLM documentation (paged attention, prefix caching) — https://docs.vllm.ai/
- Pinecone — low-latency vector search — https://docs.pinecone.io/
- Qdrant HNSW and search tuning — https://qdrant.tech/documentation/
- Cohere Rerank documentation — https://docs.cohere.com/docs/rerank-overview
- GPTCache (semantic caching) — https://github.com/zilliztech/GPTCache
- Anthropic prompt caching — https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching
- Langfuse / Arize Phoenix tracing for RAG — https://langfuse.com/docs and https://docs.arize.com/phoenix
