What causes high latency in LLM inference and how do you fix it?

Curated by Kory White · Fractional CRO, CRO Syndicate
📅 Published · 6 min read

Direct Answer

High latency in LLM inference comes from a handful of root causes: autoregressive decoding that generates one token at a time, memory-bandwidth limits that make each token-step wait on weights and the KV cache, long prompts that inflate the prefill stage, poor batching that leaves the GPU idle, cold starts from loading weights, and network/queueing overhead in front of the model.

You fix it by attacking the right stage: cut time-to-first-token with shorter prompts, prefix caching, and prefill optimization; cut per-token latency with quantization, faster kernels (FlashAttention), tensor parallelism, and speculative decoding; and cut tail latency with continuous batching, KV-cache management (PagedAttention), autoscaling, and a semantic cache for repeat queries.

The discipline is to measure each stage separately — TTFT, inter-token latency, and queue time — so you optimize the bottleneck that actually exists rather than guessing.

The two phases of inference (and why latency splits in two)

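Every LLM request has two distinct phases with different performance characteristics:

Prefill: the model processes the whole prompt in parallel. This phase is compute-bound and ends when the first token is produced.

Decode: the model generates the rest of the response one token per step. This phase is memory-bandwidth-bound.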

This split is why the two metrics that matter most are time-to-first-token (TTFT), dominated by prefill and queueing, and inter-token latency (ITL) or tokens-per-second, dominated by decode. A request can have great TTFT but slow streaming, or vice versa — and the fixes differ.
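As a minimal sketch of how the two metrics can be separated, the snippet below derives TTFT and inter-token latency from client-side timestamps of a streamed response. The function name `latency_metrics` and the example timestamps are illustrative assumptions, not part of any particular serving stack.

```python
from statistics import mean

def latency_metrics(request_sent: float, token_times: list[float]) -> dict[str, float]:
    """Split one streamed response into TTFT and inter-token latency.

    request_sent: wall-clock time (seconds) when the request left the client.
    token_times: wall-clock arrival time of each streamed output token.
    """
    if not token_times:
        raise ValueError("need at least one token timestamp")
    ttft = token_times[0] - request_sent
    # Gaps between consecutive tokens; empty when only one token arrived.
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    itl = mean(gaps) if gaps else 0.0
    tokens_per_second = 1.0 / itl if itl > 0 else float("inf")
    return {"ttft_s": ttft, "itl_s": itl, "tokens_per_s": tokens_per_second}

# Hypothetical timestamps: 0.8 s to the first token, then one token every 50 ms.
print(latency_metrics(0.0, [0.8, 0.85, 0.90, 0.95]))
```

Measured this way, TTFT includes network and queue time as seen by the client, so server-side queue time still needs its own timer to tell the two apart.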

Figure: Prompt → Prefill (parallel, compute-bound) → First token (TTFT) → Decode (1 token per step, bandwidth-bound, repeated until done) → Full response

Cause 1: Autoregressive decoding and memory bandwidth

The fundamental tax is sequential generation: you cannot produce token N+1 until token N exists. Each step reads gigabytes of weights and KV cache from memory, so latency is gated by memory bandwidth.
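A back-of-the-envelope sketch of that bandwidth limit: if every decode step must read the weights and the KV cache once, the step cannot finish faster than those bytes can cross the memory bus. The model size, cache size, and bandwidth below are assumed round numbers for a single unbatched request, not measurements of any specific GPU.

```python
def min_decode_step_seconds(weight_bytes: float, kv_cache_bytes: float,
                            bandwidth_bytes_per_s: float) -> float:
    """Lower bound on one decode step: every byte is read from memory once."""
    return (weight_bytes + kv_cache_bytes) / bandwidth_bytes_per_s

GB = 1e9
# Assumed example: 7e9 parameters, 1 GB of KV cache, 1,000 GB/s of bandwidth.
for label, bytes_per_param in (("16-bit weights", 2.0), ("4-bit weights", 0.5)):
    step = min_decode_step_seconds(7e9 * bytes_per_param, 1 * GB, 1000 * GB)
    print(f"{label}: >= {step * 1000:.1f} ms per token, <= {1 / step:.0f} tokens/s")
```

The bound ignores compute and kernel overhead, so real steps are slower, but it shows why shrinking the bytes moved per step raises the ceiling on tokens per second.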

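Fixes: quantize the weights so each step moves less data, use faster attention kernels (FlashAttention), add speculative decoding, and consider tensor parallelism.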

Cause 2: Long prompts inflate prefill and TTFT

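Big system prompts, long RAG contexts, and few-shot examples make prefill expensive and push out the first token. Prefill work grows at least linearly with the number of input tokens, and the attention portion grows quadratically, so TTFT rises with prompt length.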
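To make the scaling concrete, here is a rough estimate that uses the common rule of thumb of about 2 FLOPs per parameter per token for a dense transformer forward pass. The parameter count, peak FLOP/s, and 50% utilization are assumptions for illustration, and the estimate deliberately leaves out the attention term that grows faster than linearly.

```python
def estimate_prefill_seconds(n_params: float, prompt_tokens: int,
                             peak_flops: float, utilization: float = 0.5) -> float:
    """Rough prefill time from ~2 FLOPs per parameter per prompt token.

    Ignores the attention cost that grows quadratically with prompt length,
    so long-context prefill will be slower than this estimate.
    """
    flops = 2.0 * n_params * prompt_tokens
    return flops / (peak_flops * utilization)

# Assumed figures: 7e9 parameters, 400e12 FLOP/s peak, 50% utilization.
for tokens in (500, 4000, 32000):
    t = estimate_prefill_seconds(7e9, tokens, peak_flops=400e12)
    print(f"{tokens:>6} prompt tokens -> ~{t * 1000:.0f} ms prefill")
```

Even under this optimistic model, an 8x longer prompt costs 8x the prefill time, which is why trimming context pays off directly in TTFT.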

Fixes: shorten prompts, enable prefix caching, and use chunked prefill.

Cause 3: Poor batching leaves the GPU idle

Naive serving processes requests one at a time or in static batches, wasting GPU cycles and creating queue backlogs under load.

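Fixes: turn on continuous batching, manage the KV cache with PagedAttention, and autoscale on queue depth.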

Figure: Incoming requests → Continuous batching → Paged KV cache (packs many requests) → GPU stays saturated → Stream tokens per request. Traffic spikes also trigger autoscaling of replicas, which feed back into continuous batching.
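The difference between static and continuous batching can be sketched with a toy simulation that counts decode steps only. It assumes every request needs at least one output token, ignores prefill, and treats a step as costing the same regardless of how many slots are filled; the function names are illustrative.

```python
def static_batching_steps(output_lens: list[int], batch_size: int) -> int:
    """Each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(output_lens), batch_size):
        steps += max(output_lens[i:i + batch_size])
    return steps

def continuous_batching_steps(output_lens: list[int], batch_size: int) -> int:
    """A finished request frees its slot for the next queued request."""
    queue = list(output_lens)
    active: list[int] = []  # tokens still to generate per in-flight request
    steps = 0
    while queue or active:
        # Fill free slots from the queue before every decode step.
        while queue and len(active) < batch_size:
            active.append(queue.pop(0))
        steps += 1
        # Generate one token per active request and drop the finished ones.
        active = [r - 1 for r in active if r - 1 > 0]
    return steps

# Hypothetical workload: two long answers mixed in with short ones.
lens = [100, 5, 5, 5, 100, 5, 5, 5]
print(static_batching_steps(lens, batch_size=4))      # 200
print(continuous_batching_steps(lens, batch_size=4))  # 105
```

In the static case the second long request cannot start until the first batch drains; with continuous batching it starts as soon as a short request frees a slot, so the two long requests overlap almost entirely.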

Cause 4: Cold starts and model loading

Loading tens of gigabytes of weights into GPU memory takes seconds to minutes; on serverless or scale-to-zero setups, that hits the first request after a scale-up.
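The "seconds to minutes" range follows from simple arithmetic on where the weights are read from. The throughput figures below are hypothetical placeholders for illustration only; substitute numbers measured on your own storage and network.

```python
def load_seconds(model_gb: float, read_gb_per_s: float) -> float:
    """Time to stream the weights once from a source at a given throughput."""
    return model_gb / read_gb_per_s

# Hypothetical throughputs in GB/s; measure your own stack.
sources = {"remote object store": 0.2, "local NVMe": 3.0, "already in host RAM": 20.0}
for name, gb_per_s in sources.items():
    print(f"{name:>20}: {load_seconds(40.0, gb_per_s):6.1f} s for a 40 GB model")
```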

Fixes: keep warm replicas with weights already loaded so requests do not wait on a model load.

Cause 5: Queueing, routing, and network overhead

Even a fast model feels slow if requests sit in a queue or traverse slow hops. Under-provisioned replicas, single-region serving, and synchronous gateways add latency that has nothing to do with the GPU.
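One way to see why under-provisioning hurts so much is the textbook M/M/1 queue, where mean waiting time is utilization divided by spare capacity. This is an idealized sketch: a real LLM replica batches requests and does not behave exactly like M/M/1, and the rates below are assumed values.

```python
def mean_queue_wait_s(arrival_rate: float, service_rate: float) -> float:
    """Mean time a request waits before service in an M/M/1 queue.

    arrival_rate: requests per second arriving at one replica.
    service_rate: requests per second the replica can complete.
    """
    if arrival_rate >= service_rate:
        return float("inf")  # the queue grows without bound
    utilization = arrival_rate / service_rate
    return utilization / (service_rate - arrival_rate)

# Assumed replica capacity of 10 requests/s at three load levels.
for load in (0.5, 0.8, 0.95):
    wait = mean_queue_wait_s(arrival_rate=load * 10.0, service_rate=10.0)
    print(f"{load:.0%} utilized -> {wait * 1000:.0f} ms mean queue wait")
```

Waiting time climbs steeply as utilization approaches 100%, which is the argument for scaling on queue depth before replicas are fully saturated.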

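Fixes: autoscale replicas on queue depth so capacity tracks load, and measure queue time separately from model time to see how much latency comes from outside the GPU.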

Cause 6: Recomputing answers you already have

Many production workloads ask similar or identical questions repeatedly, paying full inference cost each time.

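Fixes: add a semantic cache that returns stored answers for repeated or near-duplicate queries, with a conservative similarity threshold and a bypass for queries that must be fresh.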
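A semantic cache can be sketched in a few lines. The `embed` function here is a toy stand-in (lowercased word counts) so the example runs without dependencies; a real deployment would call an embedding model instead, and the class name, threshold, and sample queries are all illustrative assumptions.

```python
import math
from collections import Counter
from typing import Optional

def embed(text: str) -> Counter:
    """Toy stand-in for a real embedding model: lowercase word counts."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

class SemanticCache:
    def __init__(self, threshold: float = 0.9):
        self.threshold = threshold  # higher means fewer but safer hits
        self.entries: list[tuple[Counter, str]] = []

    def get(self, query: str) -> Optional[str]:
        """Return the answer of the most similar stored query, if close enough."""
        q = embed(query)
        best_score, best_answer = 0.0, None
        for vec, answer in self.entries:
            score = cosine(q, vec)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None

    def put(self, query: str, answer: str) -> None:
        self.entries.append((embed(query), answer))

cache = SemanticCache(threshold=0.9)
cache.put("how do I reset my password", "Use the reset link on the login page.")
print(cache.get("How do I reset my password"))  # differs only in case: hit
print(cache.get("how do I cancel my plan"))     # different intent: None
```

The linear scan is fine for a sketch; at scale the lookup would normally go through a vector index, and queries that must be fresh would skip `get` entirely.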

Putting it together: a latency playbook

Start by measuring TTFT, inter-token latency, queue time, and tokens/sec under realistic load. Then:

  1. If TTFT is high: shorten prompts, enable prefix caching, use chunked prefill, and add warm replicas.
  2. If inter-token latency is high: quantize, enable FlashAttention/optimized kernels, add speculative decoding, and consider tensor parallelism.
  3. If tail latency under load is high: turn on continuous batching and PagedAttention, and autoscale on queue depth.
  4. If repeat queries dominate: add a semantic cache.

Optimize one bottleneck, re-measure, then move to the next. The same model can run several times faster purely by fixing serving, before you touch the model itself.
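The playbook can be written down as a small triage function. The thresholds in `Targets` are hypothetical defaults, and using the p99-to-p50 ratio as the signal for tail latency is an assumption of this sketch; set both from your own requirements.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Targets:
    """Hypothetical latency targets; replace with your own requirements."""
    ttft_s: float = 1.0          # acceptable time-to-first-token
    itl_s: float = 0.05          # acceptable inter-token latency
    p99_over_p50: float = 3.0    # acceptable tail-to-median latency ratio
    repeat_share: float = 0.2    # share of repeat queries worth caching

def triage(ttft_s: float, itl_s: float, p99_over_p50: float,
           repeat_share: float, targets: Targets = Targets()) -> list[str]:
    """Map measured metrics to the playbook steps, in playbook order."""
    actions = []
    if ttft_s > targets.ttft_s:
        actions.append("TTFT high: shorten prompts, prefix caching, chunked prefill, warm replicas")
    if itl_s > targets.itl_s:
        actions.append("ITL high: quantize, optimized kernels, speculative decoding, tensor parallelism")
    if p99_over_p50 > targets.p99_over_p50:
        actions.append("Tail high: continuous batching, PagedAttention, autoscale on queue depth")
    if repeat_share > targets.repeat_share:
        actions.append("Repeats dominate: add a semantic cache")
    return actions

# Hypothetical measurements: slow first token and a heavy tail.
for action in triage(ttft_s=2.4, itl_s=0.03, p99_over_p50=5.0, repeat_share=0.05):
    print(action)
```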

Frequently Asked Questions

What is the single biggest cause of slow LLM responses? For streaming throughput it is memory-bandwidth-bound autoregressive decoding; for the first token it is usually long prompts and queueing. Which dominates depends on your workload, which is why you must measure TTFT and inter-token latency separately.

Does quantization make inference faster or just smaller? Both. Smaller weights mean less data moved from memory each decode step, which is the bottleneck — so quantization typically raises tokens-per-second in addition to fitting the model in less GPU memory.

What is time-to-first-token and why does it matter? TTFT is the delay before the first output token appears, driven by queueing and the prefill of the prompt. It governs perceived responsiveness in streaming interfaces, so it is often the most user-visible latency metric.

How does continuous batching reduce latency? Instead of waiting for an entire batch to finish, continuous batching adds and evicts requests every decoding step. The GPU stays saturated and individual requests are not stuck behind slow neighbors, which cuts tail latency and raises throughput.

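Can speculative decoding really speed things up without hurting quality? Yes. The large model verifies every token the small draft model proposes and rejects any it would not have produced. With greedy decoding the output is identical to non-speculative decoding; with sampling, the standard acceptance rule preserves the large model's output distribution. The speedup comes when the draft is frequently right; quality is preserved by construction.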
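For the greedy case, the "identical output" property can be sketched with toy models. Both models here are hypothetical arithmetic functions over token IDs, and the verification loop is a sequential stand-in for what a real server does in one parallel forward pass; the sampling variant with its acceptance rule is not shown.

```python
from typing import Callable

NextToken = Callable[[list[int]], int]

def greedy_decode(model: NextToken, prompt: list[int], n_new: int) -> list[int]:
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(model(seq))
    return seq

def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: list[int], n_new: int, k: int = 4) -> list[int]:
    seq = list(prompt)
    goal = len(prompt) + n_new
    while len(seq) < goal:
        # 1. The draft model cheaply proposes up to k tokens.
        proposal: list[int] = []
        for _ in range(min(k, goal - len(seq))):
            proposal.append(draft(seq + proposal))
        # 2. The target checks each proposed position in order.
        for tok in proposal:
            expected = target(seq)
            seq.append(expected)   # always emit the target's own choice
            if expected != tok:
                break              # mismatch: discard the rest of the draft
    return seq

def toy_target(seq: list[int]) -> int:
    return (sum(seq) * 31 + len(seq)) % 50

def toy_draft(seq: list[int]) -> int:
    # Agrees with the target except when the length is a multiple of 3.
    guess = toy_target(seq)
    return guess if len(seq) % 3 else (guess + 1) % 50

prompt = [1, 2, 3]
assert speculative_decode(toy_target, toy_draft, prompt, 20) == greedy_decode(toy_target, prompt, 20)
```

Because only the target's choice is ever appended, a bad draft can cost speed but cannot change the result.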

When should I use a semantic cache? When a meaningful share of queries are repeated or near-duplicate — FAQs, support, internal tools. A cache hit returns an answer in milliseconds at near-zero cost; just set a similarity threshold conservatively and bypass the cache for queries that must be fresh.
