What causes high latency in LLM inference and how do you fix it?

Question

Pulse RevOps · The Machine · Accepted Answer

![LLM inference latency causes and fixes](https://image.pollinations.ai/prompt/LLM%20inference%20latency%20time%20to%20first%20token%20KV%20cache%20batching%20GPU%20bottleneck%20optimization%20speed%20glowing%20red%20diagram?width=1280&height=720&nologo=true)

# What causes high latency in LLM inference and how do you fix it?

### Direct Answer
High latency in LLM inference comes from a handful of root causes: **autoregressive decoding** that generates one token at a time, **memory-bandwidth limits** that make each token-step wait on weights and the KV cache, **long prompts** that inflate the prefill stage, **poor batching** that leaves the GPU idle, **cold starts** from loading weights, and **network/queueing overhead** in front of the model. You fix it by attacking the right stage: cut time-to-first-token with shorter prompts, prefix caching, and prefill optimization; cut per-token latency with quantization, faster kernels (FlashAttention), tensor parallelism, and speculative decoding; and cut tail latency with continuous batching, KV-cache management (PagedAttention), autoscaling, and a semantic cache for repeat queries. The discipline is to **measure each stage separately** — TTFT, inter-token latency, and queue time — so you optimize the bottleneck that actually exists rather than guessing.

## The two phases of inference (and why latency splits in two)

Every LLM request has two distinct phases, and they have different performance characteristics:

- **Prefill** processes the entire input prompt in one parallel pass and produces the first token. It is **compute-bound**, and its cost grows with prompt length (roughly linearly in the feed-forward layers and quadratically in attention), so a long prompt means a slow first token.
- **Decode** generates the rest of the output **one token at a time**, each step depending on the previous one. It is **memory-bandwidth-bound**: every token-step must read the model weights and the growing KV cache from GPU memory, so throughput is limited by how fast the GPU can move data, not how fast it can compute.

This split is why the two metrics that matter most are **time-to-first-token (TTFT)**, dominated by prefill and queueing, and **inter-token latency (ITL)** or tokens-per-second, dominated by decode. A request can have great TTFT but slow streaming, or vice versa — and the fixes differ.

```mermaid
flowchart LR
    P[Prompt] --> PF[Prefill: parallel, compute-bound]
    PF --> FT[First token TTFT]
    FT --> DC[Decode: 1 token/step, bandwidth-bound]
    DC --> DC
    DC --> OUT[Full response]
```
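To make the split concrete, here is a minimal sketch of how the three numbers can be derived from timestamps you log per request. The function name `summarize`, the `LatencyReport` fields, and the sample timestamps are illustrative assumptions, not part of any serving engine's API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LatencyReport:
    queue_time: float  # seconds between arrival and the start of prefill
    ttft: float        # seconds between arrival and the first token
    mean_itl: float    # mean gap between consecutive tokens, in seconds


def summarize(arrival: float, started: float, token_times: List[float]) -> LatencyReport:
    """Split one request's latency into queue time, TTFT, and mean ITL."""
    if not token_times:
        raise ValueError("need at least one token timestamp")
    # Gaps between consecutive tokens; empty when only one token was produced.
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    mean_itl = sum(gaps) / len(gaps) if gaps else 0.0
    return LatencyReport(
        queue_time=started - arrival,
        ttft=token_times[0] - arrival,
        mean_itl=mean_itl,
    )


# Hypothetical request: waited 0.25 s in the queue, first token at 0.75 s,
# then one token every 0.25 s.
report = summarize(arrival=0.0, started=0.25, token_times=[0.75, 1.0, 1.25, 1.5])
print(report)
```

Tracking these per request, rather than one end-to-end number, is what lets you tell a queueing problem from a prefill problem from a decode problem.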

## Cause 1: Autoregressive decoding and memory bandwidth

The fundamental tax is sequential generation: you cannot produce token N+1 until token N exists. Each step reads gigabytes of weights and KV cache from memory, so latency is gated by memory bandwidth.
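A back-of-envelope estimate shows why bandwidth is the gate. The sketch below assumes a single request, that every decode step reads all weights plus the KV cache exactly once, and that compute is free. The model size, cache size, and bandwidth figures are hypothetical round numbers, not measurements of any specific GPU.

```python
def min_decode_step_seconds(
    n_params: float,
    bytes_per_param: float,
    kv_cache_bytes: float,
    bandwidth_bytes_per_s: float,
) -> float:
    """Lower bound on one decode step: bytes that must be read / memory bandwidth."""
    bytes_moved = n_params * bytes_per_param + kv_cache_bytes
    return bytes_moved / bandwidth_bytes_per_s


# Hypothetical setup: 7e9 parameters, 1 GB of KV cache, 1 TB/s of memory bandwidth.
two_byte_weights = min_decode_step_seconds(7e9, 2.0, 1e9, 1e12)
half_byte_weights = min_decode_step_seconds(7e9, 0.5, 1e9, 1e12)

print(f"2 bytes/param:   {two_byte_weights * 1000:.1f} ms per token")
print(f"0.5 bytes/param: {half_byte_weights * 1000:.1f} ms per token")
```

Under these assumptions the floor on per-token latency falls almost in proportion to the bytes stored per weight, which is the intuition behind most of the fixes that follow.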

**Fixes:**
- **Quantization** (INT8, FP8, INT4 via GPTQ/AWQ) shrinks weights so each step moves less data — directly raising tokens/sec.
- **Speculative decoding** uses a small draft model to propose several tokens that the large model verifies in one pass, generating multiple tokens per step when the draft is right.
- **Optimized kernels** like **FlashAttention** reduce memory traffic in the attention computation, and engines like **vLLM**, **TensorRT-LLM**, and **SGLang** ship them by default.
- **Tensor parallelism** splits the model across GPUs so more memory bandwidth is applied to each step. It adds inter-GPU communication on every layer, so it pays off mainly for very large models.
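The speculative decoding idea above can be sketched with toy models. This is the greedy variant only: a drafted token is accepted when it matches the target's own greedy choice. Engines that sample with a temperature use a probabilistic acceptance rule instead. The `target` and `draft` functions here are hypothetical stand-ins for real models, and a real engine scores all `k` drafted positions in one batched forward pass rather than in a Python loop.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]  # greedy next-token function


def speculative_generate(
    target: NextToken, draft: NextToken, prompt: List[int],
    max_new_tokens: int, k: int = 3,
) -> Tuple[List[int], int]:
    """Greedy speculative decoding. Returns (new tokens, number of target passes)."""
    out = list(prompt)
    passes = 0
    while len(out) - len(prompt) < max_new_tokens:
        # 1. The cheap draft model proposes k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))
        # 2. The target checks the proposal; counted as a single target pass.
        passes += 1
        accepted: List[int] = []
        for tok in proposal:
            expected = target(out + accepted)
            if expected != tok:
                accepted.append(expected)  # the target's token replaces the miss
                break
            accepted.append(tok)
        else:
            accepted.append(target(out + accepted))  # bonus token when all k match
        out.extend(accepted)
    return out[len(prompt):][:max_new_tokens], passes


def target(ctx: List[int]) -> int:
    """Toy 'large model': counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def draft(ctx: List[int]) -> int:
    """Toy 'small model': agrees with the target except after a 4."""
    return 0 if ctx[-1] % 5 == 4 else (ctx[-1] + 1) % 10


tokens, target_passes = speculative_generate(target, draft, [0], max_new_tokens=12)

# Reference: plain greedy decoding with the target alone, one pass per token.
reference = [0]
for _ in range(12):
    reference.append(target(reference))

print(tokens == reference[1:], target_passes)
```

The output is identical to decoding with the target alone. The only thing that changes is how many target passes it takes, and that depends entirely on how often the draft agrees.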

## Cause 2: Long prompts inflate prefill and TTFT

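Big system prompts, long RAG contexts, and few-shot examples make prefill expensive and push out the first token. TTFT grows at least linearly with input length, because every prompt token must pass through the model before the first output token appears.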

**Fixes:**
- **Prefix / prompt caching** reuses the KV cache for shared prefixes (system prompts, common context) so repeated tokens are not recomputed. vLLM's automatic prefix caching and SGLang's RadixAttention do this.
- **Trim context**: retrieve fewer, better chunks in RAG; rerank to put the answer-bearing chunk first; drop redundant few-shot examples.
- **Chunked prefill** interleaves prefill and decode so a long prompt does not block other requests' streaming.
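As a rough illustration of prefix caching, the toy below caches fixed-size token blocks keyed by the full prefix that ends at each block, and reports how many leading tokens a new request can skip. It is a sketch of the idea only, not the actual data structure used by vLLM or SGLang. The class name, block size, and token IDs are assumptions, and strings stand in for real KV tensors.

```python
from typing import Dict, List, Tuple

BLOCK = 4  # tokens per cache block (hypothetical size)


class PrefixCache:
    def __init__(self) -> None:
        # Maps a full-prefix key to a placeholder for that block's KV tensors.
        self._blocks: Dict[Tuple[int, ...], str] = {}

    def lookup_and_insert(self, tokens: List[int]) -> int:
        """Return how many leading tokens were already cached, then cache the rest."""
        reused = 0
        full = len(tokens) - len(tokens) % BLOCK  # only whole blocks are cached
        for end in range(BLOCK, full + 1, BLOCK):
            key = tuple(tokens[:end])
            if key in self._blocks:
                # Without eviction, hits always form one contiguous leading run.
                reused = end
            else:
                self._blocks[key] = f"kv[{end - BLOCK}:{end}]"
        return reused


cache = PrefixCache()
system_prompt = list(range(8))  # 8 shared tokens, i.e. two full blocks

first = cache.lookup_and_insert(system_prompt + [100, 101, 102, 103, 104])
second = cache.lookup_and_insert(system_prompt + [200, 201])

print(first, second)  # tokens of prefill skipped for each request
```

The second request only needs prefill for its two new tokens. This is also why it helps to keep shared content at the very start of the prompt: a change early in the prompt invalidates every block after it.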

## Cause 3: Poor batching leaves the GPU idle

Naive serving processes requests one at a time or in static batches, wasting GPU cycles and creating queue backlogs under load.

**Fixes:**
- **Continuous (in-flight) batching** adds and removes requests from the batch every step instead of waiting for a whole batch to finish, keeping the GPU saturated and slashing tail latency. This is a core feature of serving engines including vLLM and TGI.
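A small simulation shows where the gain from continuous batching comes from. It assumes every decode step costs the same regardless of what is in the batch, ignores prefill, and uses made-up output lengths. The two function names are illustrative.

```python
from typing import List


def static_batching_steps(lengths: List[int], batch_size: int) -> int:
    """Static batches: each batch runs until its longest request finishes."""
    total = 0
    for i in range(0, len(lengths), batch_size):
        total += max(lengths[i:i + batch_size])
    return total


def continuous_batching_steps(lengths: List[int], batch_size: int) -> int:
    """Continuous batching: a finished request frees its slot immediately."""
    waiting = list(lengths)   # remaining output tokens per queued request
    running: List[int] = []   # remaining output tokens per in-flight request
    steps = 0
    while waiting or running:
        # Refill free slots before every decode step.
        while waiting and len(running) < batch_size:
            running.append(waiting.pop(0))
        steps += 1
        # Each running request emits one token; drop the ones that just finished.
        running = [r - 1 for r in running if r > 1]
    return steps


# Hypothetical workload: one long response among several short ones.
output_lengths = [2, 10, 2, 2, 2, 2]

static_steps = static_batching_steps(output_lengths, batch_size=2)
continuous_steps = continuous_batching_steps(output_lengths, batch_size=2)

print(static_steps, continuous_steps)
```

With static batches, the slot next to the long request sits idle for most of the first batch while the short requests wait in the queue. With continuous batching, that slot serves all of them while the long request is still generating.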
