How do you load-test an LLM inference service?
Direct Answer
You load-test an LLM inference service by replaying realistic traffic at controlled concurrency levels while measuring the metrics that actually matter for generation: time to first token (TTFT), time per output token (TPOT) or inter-token latency, end-to-end latency, throughput in tokens per second, and request success rate.
Unlike a normal REST API, an LLM's cost and latency depend heavily on input and output token counts, so your test must use representative prompt and response lengths and enable streaming so that TTFT can be measured. The practical path has four steps: pick a load-testing tool built for or adapted to LLMs (such as GenAI-Perf, vLLM's benchmark scripts, LLMPerf, Locust, or k6); define a workload that mirrors your production prompt mix; ramp concurrency until latency or error rates breach your service-level objectives; and record the throughput-versus-latency curve so you can size GPUs and set autoscaling thresholds.
Why LLM load testing is different
A conventional API load test measures requests per second and p95 latency against a roughly fixed amount of work per request. LLM inference breaks those assumptions. The work per request scales with the number of input tokens (the prompt the model must process during the prefill phase) and the number of output tokens (each generated during the slower decode phase).
A 50-token prompt asking for a 1,000-token answer behaves nothing like a 4,000-token prompt asking for a 20-token answer, even though both are "one request."
Servers also batch requests together to maximize GPU utilization, so latency for any single request depends on how many others are in flight. That means concurrency is the primary dial, and the relationship between concurrency, throughput, and per-request latency is the whole point of the test.
If you measure only average response time at one concurrency level, you learn almost nothing useful.
The metrics that matter
Measure these explicitly rather than relying on a single latency number:
- Time to first token (TTFT): how long until the user sees the first token. This dominates perceived responsiveness for streaming chat and is driven by prompt length and queue/batch wait.
- Time per output token (TPOT) / inter-token latency: the steady-state speed of generation after the first token. Multiply it by the number of output tokens after the first to estimate decode time.
- End-to-end latency: total time from request to last token, the sum of TTFT and decode time.
- Throughput: both requests per second and, more importantly, output tokens per second across the whole server, which reflects true GPU efficiency.
- Success rate and errors: rate of timeouts, 429s, and out-of-memory failures as load climbs.
- Goodput: the throughput of requests that still met your latency SLO, which separates "fast but failing" from genuinely usable capacity.
Always report percentiles (p50, p95, p99), not averages, because tail latency is what users and downstream systems actually feel.
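As a rough illustration of how the metrics above fall out of raw timestamps, here is a minimal sketch. It assumes the load client records when each request was sent and when each output token arrived; the names `RequestTiming`, `percentile`, and `goodput_rps` are hypothetical helpers, not part of any tool mentioned in this article, and the nearest-rank percentile is just one common definition.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class RequestTiming:
    """Client-side timestamps (seconds) for one streamed request (hypothetical record)."""
    sent_at: float
    token_times: List[float]  # arrival time of each output token; assumed non-empty


def ttft(r: RequestTiming) -> float:
    # Time to first token: first token arrival minus send time.
    return r.token_times[0] - r.sent_at


def tpot(r: RequestTiming) -> float:
    # Mean gap between consecutive tokens after the first one.
    n = len(r.token_times)
    if n < 2:
        return 0.0
    return (r.token_times[-1] - r.token_times[0]) / (n - 1)


def e2e(r: RequestTiming) -> float:
    # End-to-end latency: last token arrival minus send time.
    return r.token_times[-1] - r.sent_at


def percentile(values: List[float], p: float) -> float:
    """Nearest-rank percentile for p in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def goodput_rps(requests: List[RequestTiming], window_s: float,
                ttft_slo_s: float, tpot_slo_s: float) -> float:
    # Requests per second that met both latency SLOs during the window.
    good = sum(1 for r in requests
               if ttft(r) <= ttft_slo_s and tpot(r) <= tpot_slo_s)
    return good / window_s
```

With these definitions, end-to-end latency equals TTFT plus TPOT times the number of tokens after the first, which is a handy sanity check on recorded data.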

Choosing a load-testing tool
Several tools handle LLM-aware load generation:
- NVIDIA GenAI-Perf is a benchmarking tool purpose-built for generative AI endpoints. It measures TTFT, inter-token latency, and throughput against OpenAI-compatible and Triton servers, and supports configurable input/output token distributions.
- vLLM's benchmark scripts (benchmark_serving.py) drive load against a vLLM server with realistic datasets like ShareGPT and report throughput and latency, making them ideal when you serve with vLLM.
- LLMPerf (from Ray/Anyscale) load-tests LLM endpoints, reporting TTFT, inter-token latency, and throughput across providers and self-hosted servers.
- Locust and k6 are general-purpose load-testing frameworks you can script to call streaming chat endpoints and parse token timings; they are flexible and integrate with CI, though you implement the token-level metrics yourself.
Pick GenAI-Perf, LLMPerf, or the vLLM scripts when you want LLM-native metrics out of the box, and Locust or k6 when you need to model complex multi-step user journeys or already use them elsewhere.
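If you go the general-purpose route and implement token-level metrics yourself, the core of it is timestamping each streamed chunk as it arrives. The sketch below is one way to do that; `time_stream`, `open_stream`, and `clock` are hypothetical names, and `open_stream` stands in for whatever function in your client sends the request and returns an iterator over streamed chunks. Note that a streamed chunk does not always correspond to exactly one token, so treat the result as an approximation unless your server guarantees that.

```python
import time
from typing import Callable, Iterable, List, Tuple


def time_stream(
    open_stream: Callable[[], Iterable[str]],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, List[float]]:
    """Send one request and record when each streamed chunk arrives.

    open_stream is a placeholder for client code that issues the request and
    yields chunks. The clock is read before the request is opened so that
    TTFT includes connection, queueing, and prefill time.
    """
    sent_at = clock()
    arrivals: List[float] = []
    for _chunk in open_stream():
        arrivals.append(clock())
    return sent_at, arrivals
```

The clock is injectable so the timing logic can be checked with a fake clock before it is wired into a Locust task or a similar harness.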
Designing a realistic workload
A test is only as good as the traffic it replays. Capture or approximate your production distribution of prompt lengths and output lengths, because those dominate cost and latency. Use a representative dataset (ShareGPT-style conversational data, or sampled real prompts with PII removed) rather than a single fixed prompt that the server might cache or batch unrealistically.
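One simple way to approximate a production length distribution is to resample observed (input tokens, output tokens) pairs. The sketch below assumes you already have such pairs from logs or a public dataset; `sample_workload` and the example numbers are illustrative, not taken from any real system.

```python
import random
from typing import List, Tuple

LengthPair = Tuple[int, int]  # (input_tokens, output_tokens)


def sample_workload(observed: List[LengthPair], n: int, seed: int = 0) -> List[LengthPair]:
    """Draw n length pairs with replacement from observed traffic.

    Sampling pairs jointly (rather than input and output lengths separately)
    preserves any correlation between prompt length and response length.
    A fixed seed keeps runs comparable across test iterations.
    """
    rng = random.Random(seed)
    return [rng.choice(observed) for _ in range(n)]


# Illustrative values only: a mix of short chat turns and long-context requests.
observed_pairs = [(50, 1000), (4000, 20), (300, 250), (1200, 400)]
plan = sample_workload(observed_pairs, n=1000)
```

Each sampled pair can then drive one request: build or pick a prompt of roughly the input length and cap generation at the output length.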
Model the arrival pattern, too. Steady closed-loop concurrency (N clients each waiting for a response before sending the next) measures peak capacity, while open-loop Poisson arrivals at a target rate better mimic real bursty traffic. Decide whether to test with streaming enabled (required to measure TTFT) and whether requests share a system prompt (which exercises prefix caching).
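To make the open-loop option concrete, the sketch below generates Poisson arrival times by summing exponentially distributed gaps, which is the standard construction. The function name and parameters are hypothetical; a real harness would schedule a request at each offset without waiting for earlier responses.

```python
import random
from typing import List


def poisson_arrival_times(rate_per_s: float, duration_s: float, seed: int = 0) -> List[float]:
    """Return request send offsets (seconds) for a Poisson process.

    Inter-arrival gaps are drawn from an exponential distribution with mean
    1 / rate_per_s, so arrivals cluster and spread out the way independent
    users tend to, instead of arriving at perfectly even intervals.
    """
    rng = random.Random(seed)
    t = 0.0
    times: List[float] = []
    while True:
        t += rng.expovariate(rate_per_s)
        if t >= duration_s:
            break
        times.append(t)
    return times


# Illustrative: about 10 requests per second for 60 seconds.
schedule = poisson_arrival_times(rate_per_s=10.0, duration_s=60.0)
```

Because the schedule is independent of server response times, queueing delay shows up in the measured latency instead of silently throttling the offered load, which is the main difference from a closed-loop test.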
Include a warm-up period so cold-start and cache effects do not distort early measurements.
Running the test and reading results
Start with a low concurrency and ramp in steps, holding each level long enough to reach steady state before recording. At each step capture TTFT, inter-token latency, total throughput, and error rate. As concurrency rises you will typically see throughput climb while per-request latency stays acceptable, then a knee where latency spikes, errors appear, or the GPU saturates.
The concurrency just before that knee — where you still meet your SLOs — is your maximum sustainable load per replica.
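Picking that point from ramp results can be automated. The following is a minimal sketch, assuming you have summarized each concurrency step into p95 TTFT, p95 inter-token latency, error rate, and throughput; `StepResult`, `max_sustainable`, and all thresholds and numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StepResult:
    """Summary of one concurrency step (hypothetical record layout)."""
    concurrency: int
    p95_ttft_s: float
    p95_tpot_s: float
    error_rate: float
    output_tokens_per_s: float


def max_sustainable(steps: List[StepResult], ttft_slo_s: float,
                    tpot_slo_s: float, max_error_rate: float) -> Optional[StepResult]:
    """Return the highest-concurrency step before the first SLO violation.

    Scanning stops at the first failing step so that a noisy pass at a
    higher level, beyond the knee, is not mistaken for usable capacity.
    """
    best: Optional[StepResult] = None
    for step in sorted(steps, key=lambda s: s.concurrency):
        within_slo = (step.p95_ttft_s <= ttft_slo_s
                      and step.p95_tpot_s <= tpot_slo_s
                      and step.error_rate <= max_error_rate)
        if not within_slo:
            break
        best = step
    return best


# Made-up ramp data showing a knee between concurrency 32 and 64.
ramp = [
    StepResult(8, 0.20, 0.030, 0.0, 900.0),
    StepResult(16, 0.30, 0.035, 0.0, 1700.0),
    StepResult(32, 0.45, 0.040, 0.001, 3000.0),
    StepResult(64, 1.80, 0.090, 0.05, 3300.0),
]
best_step = max_sustainable(ramp, ttft_slo_s=0.5, tpot_slo_s=0.05, max_error_rate=0.01)
```

In this made-up data, throughput still rises slightly at concurrency 64, but latency and errors breach the SLOs, so the step at 32 is the one to plan around.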
Use these results to size capacity and set autoscaling. If one replica sustains, say, a target tokens-per-second within SLO at concurrency 32, you can compute how many replicas your expected peak traffic needs and set scaling thresholds below the knee. Re-run the test after changing the model, quantization, batch settings (like vLLM's max_num_seqs), or hardware, since each shifts the curve.
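The sizing arithmetic is simple enough to write down. This sketch assumes you measure capacity in output tokens per second and want to keep some headroom below the knee; `replicas_needed`, the `headroom` parameter, and the numbers are illustrative assumptions.

```python
import math


def replicas_needed(peak_tokens_per_s: float, per_replica_tokens_per_s: float,
                    headroom: float = 0.25) -> int:
    """Estimate the replica count for a peak load.

    per_replica_tokens_per_s is the throughput one replica sustained within
    SLO in the load test. headroom is the fraction of that capacity left
    unused so that normal operation stays below the saturation knee.
    """
    usable = per_replica_tokens_per_s * (1.0 - headroom)
    return math.ceil(peak_tokens_per_s / usable)


# Illustrative: 10,000 tokens/s peak, 2,500 tokens/s per replica, 50% headroom.
count = replicas_needed(10_000, 2_500, headroom=0.5)
```

The same per-replica figure, scaled by the headroom factor, is a reasonable starting point for an autoscaling threshold, to be adjusted once you see how quickly new replicas come online.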
Common pitfalls to avoid
Several mistakes produce misleading numbers. Using one fixed prompt lets caching and batching flatter the server unrealistically. Ignoring streaming hides TTFT, the metric users care about most.
Reporting averages masks the p99 tail that breaks SLAs. Skipping warm-up captures cold-start artifacts. Testing from too few client machines can bottleneck the generator rather than the server — verify the load tool itself is not the limit.
Finally, load-test against a production-like deployment, including the gateway, autoscaler, and any token rate limits, so you measure the real path and not just the raw model server.
Frequently Asked Questions
What is the single most important LLM serving metric to test? There is no single one, but time to first token (TTFT) for responsiveness and output tokens per second for capacity are the two that matter most. Together they tell you whether users feel the service is fast and how much traffic each GPU replica can handle.
Which tool should I use to load-test a vLLM server? Use vLLM's own benchmark_serving.py script or NVIDIA GenAI-Perf, both of which understand streaming and report TTFT, inter-token latency, and throughput. LLMPerf is another good option for OpenAI-compatible endpoints.
How do I pick realistic prompt and output lengths? Sample them from your actual production logs (with sensitive data removed) or use a representative public dataset like ShareGPT. The distribution of input and output tokens drives cost and latency, so matching it is more important than any single tool choice.
Why measure percentiles instead of average latency? Averages hide the tail. A service can have a great average while p99 requests time out, which is exactly what breaks user experience and SLAs. Report p50, p95, and p99 so you see both the typical and worst-case behavior.
How does load testing help with cost and autoscaling? It reveals the maximum concurrency one replica sustains within your SLOs, which lets you compute how many GPUs your peak traffic needs and set autoscaling thresholds just below the saturation knee, avoiding both overspend and latency cliffs.
Sources
- NVIDIA GenAI-Perf documentation — https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/perf_analyzer/genai-perf/README.html
- vLLM benchmarking guide — https://docs.vllm.ai/en/latest/contributing/benchmarks.html
- LLMPerf (Ray project) — https://github.com/ray-project/llmperf
- k6 documentation — https://grafana.com/docs/k6/latest/
- Locust documentation — https://docs.locust.io/
- NVIDIA technical blog on LLM inference metrics — https://developer.nvidia.com/blog/
