What causes high latency in LLM inference and how do you fix it?
Direct Answer
High latency in LLM inference comes from a handful of root causes: autoregressive decoding that generates one token at a time, memory-bandwidth limits that make each token-step wait on weights and the KV cache, long prompts that inflate the prefill stage, poor batching that leaves the GPU idle, cold starts from loading weights, and network/queueing overhead in front of the model.
You fix it by attacking the right stage: cut time-to-first-token with shorter prompts, prefix caching, and prefill optimization; cut per-token latency with quantization, faster kernels (FlashAttention), tensor parallelism, and speculative decoding; and cut tail latency with continuous batching, KV-cache management (PagedAttention), autoscaling, and a semantic cache for repeat queries.
The discipline is to measure each stage separately — TTFT, inter-token latency, and queue time — so you optimize the bottleneck that actually exists rather than guessing.
The two phases of inference (and why latency splits in two)
Every LLM request has two distinct phases, and they have different performance characteristics:
- Prefill processes the entire input prompt in one parallel pass and produces the first token. It is compute-bound and scales with prompt length — a long prompt means a slow first token.
- Decode generates the rest of the output one token at a time, each step depending on the previous one. It is memory-bandwidth-bound: every token-step must read the model weights and the growing KV cache from GPU memory, so throughput is limited by how fast the GPU can move data, not how fast it can compute.
This split is why the two metrics that matter most are time-to-first-token (TTFT), dominated by prefill and queueing, and inter-token latency (ITL) or tokens-per-second, dominated by decode. A request can have great TTFT but slow streaming, or vice versa — and the fixes differ.
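As a rough illustration of how the two metrics separate, the sketch below computes TTFT and inter-token latency from client-side timestamps of one streamed response. The names `LatencyMetrics` and `summarize_latency` are hypothetical, and the timestamps are assumed to be seconds on a single monotonic clock.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class LatencyMetrics:
    ttft_s: float               # delay until the first token (queueing + prefill)
    mean_itl_s: float           # mean gap between consecutive tokens (decode)
    decode_tokens_per_s: float  # streaming rate implied by the mean gap


def summarize_latency(request_sent_s: float,
                      token_times_s: Sequence[float]) -> LatencyMetrics:
    """Split one streamed response into TTFT and inter-token latency."""
    if not token_times_s:
        raise ValueError("need at least one token timestamp")
    ttft = token_times_s[0] - request_sent_s
    # Gaps between neighbouring tokens; empty when only one token arrived.
    gaps = [later - earlier
            for earlier, later in zip(token_times_s, token_times_s[1:])]
    mean_itl = sum(gaps) / len(gaps) if gaps else 0.0
    tokens_per_s = 1.0 / mean_itl if mean_itl > 0 else 0.0
    return LatencyMetrics(ttft, mean_itl, tokens_per_s)


# Made-up timestamps: first token after 0.5 s, then one token every 50 ms.
metrics = summarize_latency(0.0, [0.50, 0.55, 0.60, 0.65])
print(metrics)
```

Recording queue time separately on the server side would let the TTFT figure be split further into waiting and prefill.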
Cause 1: Autoregressive decoding and memory bandwidth
The fundamental tax is sequential generation: you cannot produce token N+1 until token N exists. Each step reads gigabytes of weights and KV cache from memory, so latency is gated by memory bandwidth.
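A back-of-the-envelope sketch of that bound, assuming batch size 1 and that each decode step reads every weight plus the KV cache from GPU memory exactly once. The function name and all numbers (7B parameters, 1 GB of KV cache, 1 TB/s of bandwidth) are illustrative assumptions, not measurements.

```python
def decode_step_floor_ms(n_params: float, bytes_per_param: float,
                         kv_cache_bytes: float,
                         bandwidth_bytes_per_s: float) -> float:
    """Lower bound on one decode step: bytes that must be read / bandwidth."""
    bytes_moved = n_params * bytes_per_param + kv_cache_bytes
    return 1000.0 * bytes_moved / bandwidth_bytes_per_s


# Hypothetical 7B-parameter model on a GPU with 1 TB/s of memory bandwidth.
fp16_ms = decode_step_floor_ms(7e9, 2.0, 1e9, 1e12)  # 15 GB moved per token
int4_ms = decode_step_floor_ms(7e9, 0.5, 1e9, 1e12)  # 4.5 GB moved per token
print(f"FP16 floor: {fp16_ms:.1f} ms/token, INT4 floor: {int4_ms:.1f} ms/token")
```

Real steps are slower than this floor, but the ratio shows why moving fewer bytes per token is the main lever on decode latency.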
Fixes:
- Quantization (INT8, FP8, INT4 via GPTQ/AWQ) shrinks weights so each step moves less data — directly raising tokens/sec.
- Speculative decoding uses a small draft model to propose several tokens that the large model verifies in one pass, generating multiple tokens per step when the draft is right.
- Optimized kernels like FlashAttention reduce memory traffic in the attention computation, and engines like vLLM, TensorRT-LLM, and SGLang ship them by default.
- Tensor parallelism splits each layer's weights across GPUs so their combined memory bandwidth is applied to every step, at the cost of inter-GPU communication (most useful for very large models).
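The speculative decoding idea in the list above can be sketched with toy models. This is a minimal greedy variant, not any engine's implementation: `toy_target` and `toy_draft` are hypothetical stand-ins for the large and small models, and the verification loop only mimics what a real engine would score in a single forward pass.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def speculative_step(target: NextToken, draft: NextToken,
                     context: List[int], k: int) -> List[int]:
    """One greedy speculative step; returns the tokens to append."""
    # 1. The draft model proposes k tokens autoregressively (cheap).
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(context + proposed))
    # 2. The target checks each proposed position in order.
    accepted: List[int] = []
    for token in proposed:
        expected = target(context + accepted)
        if token != expected:
            accepted.append(expected)  # replace the first wrong guess and stop
            return accepted
        accepted.append(token)
    # 3. Every guess matched, so the same pass also yields one extra token.
    accepted.append(target(context + accepted))
    return accepted


def generate(target: NextToken, draft: NextToken, prompt: List[int],
             n_new: int, k: int = 3) -> Tuple[List[int], int]:
    """Generate n_new tokens; also count simulated target passes."""
    out = list(prompt)
    target_passes = 0
    while len(out) - len(prompt) < n_new:
        out += speculative_step(target, draft, out, k)
        target_passes += 1
    return out[:len(prompt) + n_new], target_passes


def toy_target(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10            # "large" model: counts upward mod 10


def toy_draft(ctx: List[int]) -> int:
    # "Small" model: agrees with the target except after a 4.
    return 0 if ctx[-1] % 5 == 4 else (ctx[-1] + 1) % 10


tokens, passes = generate(toy_target, toy_draft, [0], n_new=12)
print(tokens, passes)
```

Because every emitted token is either confirmed or supplied by the target, the greedy output matches plain decoding, while the number of target passes drops whenever the draft guesses correctly.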
Cause 2: Long prompts inflate prefill and TTFT
Big system prompts, long RAG contexts, and few-shot examples make prefill expensive and push out the first token. Latency grows with input length.
Fixes:
- Prefix / prompt caching reuses the KV cache for shared prefixes (system prompts, common context) so repeated tokens are not recomputed. vLLM's automatic prefix caching and SGLang's RadixAttention do this.
- Trim context: retrieve fewer, better chunks in RAG; rerank to put the answer-bearing chunk first; drop redundant few-shot examples.
- Chunked prefill interleaves prefill and decode so a long prompt does not block other requests' streaming.
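To make the prefix-caching fix concrete, here is a toy sketch that only counts how many prompt tokens could skip prefill. It is not how vLLM or SGLang implement it: the `PrefixCache` class and the block size of 4 are hypothetical, and a real engine stores KV tensors rather than a set of seen prefixes.

```python
from typing import List, Set, Tuple

BLOCK = 4  # tokens per cached block (hypothetical size for illustration)


class PrefixCache:
    """Toy stand-in for a KV prefix cache keyed by whole-prefix blocks."""

    def __init__(self) -> None:
        self._seen: Set[Tuple[int, ...]] = set()

    def prefill(self, tokens: List[int]) -> Tuple[int, int]:
        """Return (tokens_reused, tokens_computed) for this prompt."""
        reused = 0
        still_matching = True
        for i in range(len(tokens) // BLOCK):
            # A block is reusable only if everything before it matches too,
            # so each block is identified by its entire prefix.
            key = tuple(tokens[:(i + 1) * BLOCK])
            if still_matching and key in self._seen:
                reused += BLOCK
            else:
                still_matching = False
                self._seen.add(key)
        return reused, len(tokens) - reused


cache = PrefixCache()
system_prompt = [1, 2, 3, 4, 5, 6, 7, 8]            # shared by every request
first = cache.prefill(system_prompt + [20, 21, 22])  # nothing cached yet
second = cache.prefill(system_prompt + [30, 31, 32])  # system prompt reused
print(first, second)
```

In this toy run the second request only needs to prefill its three new tokens, which is the effect a shared system prompt or common RAG context has on TTFT.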

Cause 3: Poor batching leaves the GPU idle
Naive serving processes requests one at a time or in static batches, wasting GPU cycles and creating queue backlogs under load.
Fixes:
- Continuous (in-flight) batching adds and removes requests from the batch every step instead of waiting for a whole batch to finish, keeping the GPU saturated and slashing tail latency. This is a core feature of vLLM, TGI, and TensorRT-LLM.
- PagedAttention-style KV-cache management eliminates memory fragmentation so more requests fit concurrently, raising throughput at a given latency target.
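A small simulation can show why continuous batching helps. It assumes every decode step costs the same regardless of batch contents and ignores prefill; the function names and request lengths are made up for illustration.

```python
from collections import deque
from typing import List


def static_batching_steps(lengths: List[int], slots: int) -> int:
    """Total decode steps when each batch runs until its longest request ends."""
    return sum(max(lengths[i:i + slots])
               for i in range(0, len(lengths), slots))


def continuous_batching_steps(lengths: List[int], slots: int) -> int:
    """Total decode steps when a finished request frees its slot at once."""
    queue = deque(lengths)
    active: List[int] = []  # remaining tokens for each in-flight request
    steps = 0
    while queue or active:
        while queue and len(active) < slots:  # admit waiting requests
            active.append(queue.popleft())
        steps += 1                            # one decode step for the batch
        # Requests with one token left finish on this step and leave.
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps


# Output lengths in tokens: two long requests mixed with short ones.
lengths = [100, 5, 5, 5, 100, 5, 5, 5]
print(static_batching_steps(lengths, slots=2))      # 210
print(continuous_batching_steps(lengths, slots=2))  # 115
```

In the static case the short requests wait behind the long ones in their batch; in the continuous case their slots are refilled as soon as they finish, so the same work drains in roughly half the steps here.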
Cause 4: Cold starts and model loading
Loading tens of gigabytes of weights into GPU memory takes seconds to minutes; on serverless or scale-to-zero setups, the first request after a scale-up pays that cost.
Fixes:
- Keep a warm pool of replicas so you rarely cold-start under traffic.
- Use faster weight loading (memory-mapped/streamed loads, safetensors) and pre-bake weights into the image or a fast volume.
- For serverless inference, set min replicas above zero for latency-sensitive endpoints and reserve scale-to-zero for spiky, latency-tolerant ones.
Cause 5: Queueing, routing, and network overhead
Even a fast model feels slow if requests sit in a queue or traverse slow hops. Under-provisioned replicas, single-region serving, and synchronous gateways add latency that has nothing to do with the GPU.
Fixes:
- Autoscale on the right signal — queue depth or concurrent requests, not just CPU — so capacity tracks load.
- Load-balance across replicas and, for global users, serve from multiple regions.
- Stream tokens to the client so perceived latency tracks TTFT, not full completion.
- Put a lightweight AI gateway in front for routing and caching without adding heavy synchronous work.
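One way to picture autoscaling on concurrency rather than CPU is the sketch below. The function, its parameters, and the per-replica target of 8 concurrent requests are hypothetical; a real target would come from load-testing a single replica against the latency goal.

```python
import math


def desired_replicas(in_flight: int, queued: int, target_per_replica: int,
                     min_replicas: int = 1, max_replicas: int = 20) -> int:
    """Size the fleet so each replica handles about target_per_replica requests."""
    demand = in_flight + queued  # queued requests count as unmet demand
    wanted = math.ceil(demand / target_per_replica)
    return max(min_replicas, min(max_replicas, wanted))


print(desired_replicas(in_flight=24, queued=16, target_per_replica=8))  # 5
print(desired_replicas(in_flight=0, queued=0, target_per_replica=8))    # 1
```

Keeping `min_replicas` above zero for latency-sensitive endpoints avoids paying a cold start on the first request after an idle period.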
Cause 6: Recomputing answers you already have
Many production workloads ask similar or identical questions repeatedly, paying full inference cost each time.
Fixes:
- A semantic cache (for example GPTCache or built-in gateway caches) returns a stored answer when a new query is semantically close to a previous one, cutting both latency and cost to near zero for cache hits.
- Cache at the embedding and retrieval layer too, so RAG does not re-embed and re-search identical queries.
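A minimal sketch of the semantic-cache idea, assuming the caller already has an embedding vector for each query. This is not GPTCache's API: the `SemanticCache` class is hypothetical, the 0.95 threshold is an arbitrary conservative starting point, and the linear scan would be replaced by a vector index in practice.

```python
import math
from typing import List, Optional, Tuple

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, threshold: float = 0.95) -> None:
        self.threshold = threshold
        self._entries: List[Tuple[Vector, str]] = []

    def get(self, query_vec: Vector) -> Optional[str]:
        """Return the closest stored answer if it clears the threshold."""
        best_score, best_answer = 0.0, None
        for vec, answer in self._entries:
            score = cosine(query_vec, vec)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None

    def put(self, query_vec: Vector, answer: str) -> None:
        self._entries.append((query_vec, answer))


cache = SemanticCache(threshold=0.95)
cache.put([1.0, 0.0], "Reset it from the account settings page.")
print(cache.get([0.99, 0.05]))  # near-duplicate query: cache hit
print(cache.get([0.0, 1.0]))    # unrelated query: None, so call the model
```

On a miss the caller runs normal inference and stores the result with `put`; queries that must be fresh can simply skip the `get` call.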
Putting it together: a latency playbook
Start by measuring TTFT, inter-token latency, queue time, and tokens/sec under realistic load. Then:
- If TTFT is high: shorten prompts, enable prefix caching, use chunked prefill, and add warm replicas.
- If inter-token latency is high: quantize, enable FlashAttention/optimized kernels, add speculative decoding, and consider tensor parallelism.
- If tail latency under load is high: turn on continuous batching and PagedAttention, and autoscale on queue depth.
- If repeat queries dominate: add a semantic cache.
Optimize one bottleneck, re-measure, then move to the next. The same model can run several times faster purely by fixing serving, before you touch the model itself.
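The playbook can be written down as a small lookup, sketched here under the assumption that you set your own budget for each metric. The metric names and the example budgets are hypothetical; only the fix lists come from the steps above.

```python
from typing import Dict, List

# Fix lists copied from the playbook; the dictionary keys are hypothetical names.
PLAYBOOK: Dict[str, List[str]] = {
    "ttft_s": ["shorten prompts", "enable prefix caching",
               "use chunked prefill", "add warm replicas"],
    "itl_s": ["quantize", "enable FlashAttention/optimized kernels",
              "add speculative decoding", "consider tensor parallelism"],
    "p99_s": ["turn on continuous batching and PagedAttention",
              "autoscale on queue depth"],
    "repeat_share": ["add a semantic cache"],
}


def next_fixes(measured: Dict[str, float],
               budget: Dict[str, float]) -> List[str]:
    """Return fixes for the metric that overshoots its budget the most.

    Budgets must be positive; metrics within budget are ignored.
    """
    overshoot = {name: measured[name] / limit
                 for name, limit in budget.items()
                 if measured.get(name, 0.0) > limit}
    if not overshoot:
        return []
    worst = max(overshoot, key=overshoot.get)
    return PLAYBOOK[worst]


measured = {"ttft_s": 2.0, "itl_s": 0.06, "p99_s": 4.0, "repeat_share": 0.1}
budget = {"ttft_s": 0.5, "itl_s": 0.05, "p99_s": 5.0, "repeat_share": 0.3}
print(next_fixes(measured, budget))  # TTFT overshoots most, so its fixes come first
```

Returning only the worst offender mirrors the advice to fix one bottleneck, re-measure, and then move on.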
Frequently Asked Questions
What is the single biggest cause of slow LLM responses? For streaming throughput it is memory-bandwidth-bound autoregressive decoding; for the first token it is usually long prompts and queueing. Which dominates depends on your workload, which is why you must measure TTFT and inter-token latency separately.
Does quantization make inference faster or just smaller? Both. Smaller weights mean less data moved from memory each decode step, which is the bottleneck — so quantization typically raises tokens-per-second in addition to fitting the model in less GPU memory.
What is time-to-first-token and why does it matter? TTFT is the delay before the first output token appears, driven by queueing and the prefill of the prompt. It governs perceived responsiveness in streaming interfaces, so it is often the most user-visible latency metric.
How does continuous batching reduce latency? Instead of waiting for an entire batch to finish, continuous batching adds and evicts requests every decoding step. The GPU stays saturated and individual requests are not stuck behind slow neighbors, which cuts tail latency and raises throughput.
Can speculative decoding really speed things up without hurting quality? Yes, when verification is exact. The large model checks every token the small draft model proposes: under greedy decoding the output is identical to non-speculative decoding, and under sampling the standard accept/reject rule preserves the large model's output distribution. The speedup comes when the draft is frequently right; quality is preserved by construction.
When should I use a semantic cache? When a meaningful share of queries are repeated or near-duplicate — FAQs, support, internal tools. A cache hit returns an answer in milliseconds at near-zero cost; just set a similarity threshold conservatively and bypass the cache for queries that must be fresh.
Sources
- vLLM documentation (PagedAttention, prefix caching): https://docs.vllm.ai/
- NVIDIA TensorRT-LLM: https://github.com/NVIDIA/TensorRT-LLM
- FlashAttention: https://github.com/Dao-AILab/flash-attention
- Hugging Face Text Generation Inference: https://huggingface.co/docs/text-generation-inference
- SGLang (RadixAttention): https://github.com/sgl-project/sglang
- GPTCache (semantic caching): https://github.com/zilliztech/GPTCache
- NVIDIA developer blog on LLM inference optimization: https://developer.nvidia.com/blog/
