Pulse RevOps · The Machine · Accepted Answer

![Load-testing an LLM inference service](https://image.pollinations.ai/prompt/load%20testing%20an%20LLM%20inference%20service%20concurrency%20throughput%20latency%20graphs%20glowing%20teal%20diagram?width=1280&height=720&nologo=true)

# How do you load-test an LLM inference service?

### Direct Answer
You load-test an LLM inference service by replaying realistic traffic at controlled concurrency levels while measuring the metrics that matter for generation: **time to first token (TTFT)**, **time per output token (TPOT) or inter-token latency**, **end-to-end latency**, **throughput in tokens per second**, and **request success rate**. Unlike a typical REST API, an LLM's cost and latency depend heavily on input and output token counts, so the test must use representative prompt and response lengths and stream tokens to capture TTFT. In practice: pick a load-testing tool built for or adapted to LLMs, such as **GenAI-Perf**, **vLLM's benchmark scripts**, **LLMPerf**, **Locust**, or **k6**; define a workload that mirrors the production prompt mix; ramp concurrency until latency or error rates breach your service-level objectives (SLOs); and record the throughput-vs-latency curve so you can size GPUs and set autoscaling thresholds.

## Why LLM load testing is different

A conventional API load test measures requests per second and p95 latency against a roughly fixed amount of work per request. LLM inference breaks those assumptions. The work per request scales with the **number of input tokens** (the prompt the model must process during the prefill phase) and the **number of output tokens** (each generated during the slower decode phase). A 50-token prompt asking for a 1,000-token answer behaves nothing like a 4,000-token prompt asking for a 20-token answer, even though both are "one request."
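To make the contrast concrete, here is a rough back-of-the-envelope sketch. It assumes a deliberately simple cost model (TTFT is approximately prefill time, and each later token costs one TPOT), and the prefill rate and TPOT values are hypothetical placeholders, not measurements from any real server.

```python
def estimate_latency(input_tokens, output_tokens, prefill_tokens_per_s, tpot_s):
    """Estimate (ttft_s, decode_s, total_s) for one request on an idle server.

    Simplifying assumptions: no queueing or batching delay, prefill time is
    linear in prompt length, and every token after the first costs tpot_s.
    """
    ttft_s = input_tokens / prefill_tokens_per_s
    decode_s = max(output_tokens - 1, 0) * tpot_s
    return ttft_s, decode_s, ttft_s + decode_s


# Hypothetical hardware numbers, chosen only for illustration.
PREFILL_TOKENS_PER_S = 5000.0
TPOT_S = 0.02

if __name__ == "__main__":
    for name, n_in, n_out in [("short prompt, long answer", 50, 1000),
                              ("long prompt, short answer", 4000, 20)]:
        ttft, decode, total = estimate_latency(n_in, n_out, PREFILL_TOKENS_PER_S, TPOT_S)
        print(f"{name}: TTFT={ttft:.2f}s decode={decode:.2f}s total={total:.2f}s")
```

Under these assumed rates the short-prompt request is dominated by decode time and the long-prompt request by prefill time, which is why a single "requests per second" figure hides most of the behavior.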

Servers also batch requests together to maximize GPU utilization, so latency for any single request depends on how many others are in flight. That means concurrency is the primary dial, and the relationship between concurrency, throughput, and per-request latency is the whole point of the test. If you measure only average response time at one concurrency level, you learn almost nothing useful.

```mermaid
flowchart LR
    PROMPT[Realistic prompt mix] --> GEN[Load generator]
    GEN --> SVC[LLM inference service]
    SVC --> M1[TTFT]
    SVC --> M2[Inter-token latency]
    SVC --> M3[Throughput tok/s]
    SVC --> M4[Errors / success rate]
    M1 --> CURVE[Latency vs concurrency curve]
    M2 --> CURVE
    M3 --> CURVE
```
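The flow in the diagram can be sketched as a small closed-loop concurrency sweep. This is a minimal illustration, not a production load generator: `FakeLLMService` is a hypothetical stand-in whose per-token delay grows with the number of requests in flight (a crude imitation of batching), and you would replace it with a client for your real streaming endpoint.

```python
import asyncio
import statistics
import time


class FakeLLMService:
    """Hypothetical stand-in for a streaming endpoint (not a real server)."""

    def __init__(self, base_token_delay=0.0005):
        self.base_token_delay = base_token_delay
        self.in_flight = 0

    async def generate(self, output_tokens):
        self.in_flight += 1
        try:
            for _ in range(output_tokens):
                # Crude batching model: more requests in flight, slower decode steps.
                await asyncio.sleep(self.base_token_delay * (1 + 0.1 * self.in_flight))
                yield "tok"
        finally:
            self.in_flight -= 1


async def one_request(service, output_tokens):
    """Stream one response and record TTFT, end-to-end latency, and token count."""
    start = time.perf_counter()
    ttft = None
    tokens = 0
    async for _ in service.generate(output_tokens):
        if ttft is None:
            ttft = time.perf_counter() - start
        tokens += 1
    return {"ttft": ttft, "e2e": time.perf_counter() - start, "tokens": tokens}


async def run_level(service, concurrency, requests_per_worker, output_tokens):
    """Closed loop: each worker sends its next request when the previous one ends."""

    async def worker():
        return [await one_request(service, output_tokens)
                for _ in range(requests_per_worker)]

    start = time.perf_counter()
    batches = await asyncio.gather(*(worker() for _ in range(concurrency)))
    wall = time.perf_counter() - start
    results = [r for batch in batches for r in batch]
    return {
        "concurrency": concurrency,
        "requests": len(results),
        "tokens_per_s": sum(r["tokens"] for r in results) / wall,
        "p50_ttft_s": statistics.median(r["ttft"] for r in results),
        "p50_e2e_s": statistics.median(r["e2e"] for r in results),
    }


def sweep(levels=(1, 2, 4, 8), requests_per_worker=3, output_tokens=20):
    """Run one measurement per concurrency level and return the curve points."""

    async def _run():
        curve = []
        for level in levels:
            service = FakeLLMService()  # fresh state per level
            curve.append(await run_level(service, level, requests_per_worker, output_tokens))
        return curve

    return asyncio.run(_run())


if __name__ == "__main__":
    for point in sweep():
        print(point)
```

Each returned point is one position on the latency-vs-concurrency curve. A closed-loop design is assumed here because it maps directly onto "concurrency" as the dial; an open-loop generator driven by arrival rate is a different choice that this sketch does not cover.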

## The metrics that matter

Measure these explicitly rather than relying on a single latency number:

- **Time to first token (TTFT):** how long until the user sees the first token. This dominates perceived responsiveness for streaming chat and is driven by prompt length and queue/batch wait.
- **Time per output token (TPOT) / inter-token latency:** the steady-state speed of generation after the first token. Multiply it by the number of remaining output tokens to estimate the generation time that follows the first token.
- **End-to-end latency:** total time from request to last token, the sum of TTFT and decode time.
- **Throughput:** both **requests per second** and, more importantly, **output tokens per second** across the whole server, which reflects true GPU efficiency.
- **Success rate and errors:** rate of timeouts, 429s, and out-of-memory failures as load climbs.
- **Goodput:** the throughput of requests that still met your latency SLO, which separates "fast but failing" from genuinely usable capacity.

Always report **percentiles (p50, p95, p99)**, not averages, because tail latency is what users and downstream systems actually feel.
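The metrics above can be derived from per-token timestamps. The sketch below assumes you have recorded, for each request, a start time and the arrival time of every streamed token; `RequestTrace`, `summarize`, and the SLO parameters are illustrative names, and the percentile uses the nearest-rank definition (other tools may interpolate).

```python
import math
from dataclasses import dataclass, field


@dataclass
class RequestTrace:
    start: float                                      # when the request was sent
    token_times: list = field(default_factory=list)   # arrival time of each token
    ok: bool = True                                   # False for timeouts, 429s, OOM


def percentile(values, p):
    """Nearest-rank percentile of a non-empty list, for p in (0, 100]."""
    if not values:
        raise ValueError("percentile of empty list")
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(traces, wall_seconds, ttft_slo_s, e2e_slo_s):
    """Aggregate one test run into the metrics listed above."""
    done = [t for t in traces if t.ok and t.token_times]
    ttft = [t.token_times[0] - t.start for t in done]
    e2e = [t.token_times[-1] - t.start for t in done]
    # TPOT needs at least two tokens: average gap after the first token.
    tpot = [(t.token_times[-1] - t.token_times[0]) / (len(t.token_times) - 1)
            for t in done if len(t.token_times) > 1]
    # Goodput counts only requests that met both latency SLOs.
    good = sum(1 for a, b in zip(ttft, e2e) if a <= ttft_slo_s and b <= e2e_slo_s)
    return {
        "success_rate": len(done) / len(traces),
        "requests_per_s": len(done) / wall_seconds,
        "output_tokens_per_s": sum(len(t.token_times) for t in done) / wall_seconds,
        "goodput_requests_per_s": good / wall_seconds,
        "ttft": {p: percentile(ttft, p) for p in (50, 95, 99)},
        "tpot": {p: percentile(tpot, p) for p in (50, 95, 99)},
        "e2e": {p: percentile(e2e, p) for p in (50, 95, 99)},
    }
```

Note that this sketch expects at least one successful multi-token request; a run where every request fails would need separate handling.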

## Choosing a load-testing tool

Several tools handle LLM-aware load generation:

- **NVIDIA GenAI-Perf** is a benchmarking tool purpose-built for generative AI endpoints. It measures TTFT, inter-token latency, and throughput against OpenAI-compatible and Triton servers, and supports configurable input/output token distributions.
- **vLLM's benchmark scripts** (`benchmark_serving.py`) drive load against a vLLM server with realistic datasets like ShareGPT and report throughput and latency, making them ideal when you serve with vLLM.
- **LLMPerf** (from Ray/Anyscale) load-tests LLM endpoints, reporting TTFT, inter-token latency, and throughput across providers and self-hosted servers.
- **Locust** and **k6** are general-purpose load-testing frameworks you can script to call streaming chat endpoints.
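When you script a general-purpose framework yourself, the token timing has to come from your own code. The following is a minimal sketch of that timing logic, assuming the endpoint streams OpenAI-compatible server-sent events (`data: {...}` lines with `choices[0].delta.content`, ending in `data: [DONE]`). The function name `measure_sse_stream` is hypothetical, and the `clock` parameter exists only so the logic can be checked without a live server.

```python
import json
import time


def measure_sse_stream(lines, clock=time.perf_counter):
    """Consume an iterable of SSE lines and time each content-bearing chunk.

    `lines` can be any iterable of str or bytes lines from a streaming response.
    Assumes the OpenAI-compatible chunk shape; adjust the parsing if your
    server streams a different format.
    """
    start = clock()
    arrivals = []   # seconds since start for each chunk that carried text
    parts = []
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and SSE comment/keep-alive lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        choices = chunk.get("choices") or []
        delta = choices[0].get("delta", {}) if choices else {}
        content = delta.get("content")
        if content:  # role-only or empty chunks do not count as a token
            arrivals.append(clock() - start)
            parts.append(content)
    gaps = [later - earlier for earlier, later in zip(arrivals, arrivals[1:])]
    return {
        "ttft": arrivals[0] if arrivals else None,
        "inter_chunk_gaps": gaps,
        "chunks": len(arrivals),
        "text": "".join(parts),
    }
```

In a Python-based script, `lines` would typically be the line iterator of a streaming HTTP response (for example `Response.iter_lines()` from the `requests` library). A streamed chunk can carry more than one token, so the gaps measured here are inter-chunk times and only approximate inter-token latency.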
