How do you A/B test different LLMs in production?

Pulse RevOps · The Machine · Accepted Answer

![A/B testing LLMs in production](https://image.pollinations.ai/prompt/AB%20testing%20LLMs%20in%20production%20traffic%20split%20experiment%20metrics%20online%20evaluation%20gateway%20routing%20glowing%20green%20diagram?width=1280&height=720&nologo=true)

# How do you A/B test different LLMs in production?

### Direct Answer
You A/B test LLMs in production by splitting live traffic between two or more model variants (different models, prompts, or settings), holding everything else constant, and comparing them on the metrics that matter — answer quality, user engagement, latency, and cost — until you have enough data to call a winner with statistical confidence. In practice you put an **LLM gateway or routing layer** (LiteLLM, Portkey, or a feature-flag tool like LaunchDarkly/Statsig) in front of your app to deterministically assign each user or session to a variant, log every request and outcome to an observability platform (LangSmith, Langfuse, Arize), and score quality with a mix of **online LLM-as-a-judge evaluations**, automatic signals (thumbs, edits, task completion), and offline evals on a frozen dataset. Roll out gradually — shadow test, then a small canary, then a controlled split — and only promote the new model when it wins on your primary metric without regressing cost or latency.

## Why A/B testing LLMs is different

Classic A/B testing assumes a clear, easily measured outcome — a click, a conversion. LLM outputs are open-ended and non-deterministic, so "which is better" is harder to define and the same input can yield different outputs across runs. That changes the playbook in three ways.

First, you need a **quality metric**, not just engagement. A model can be more engaging but less accurate, so you measure both. Second, outcomes are often delayed or implicit (did the answer actually solve the user's problem?), so you combine direct signals with model-graded evaluation. Third, **cost and latency are first-class metrics** — a model that is marginally better but twice as expensive or slow may still lose. A good LLM experiment compares quality, cost, and latency together.
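As a minimal sketch of comparing the three dimensions side by side, the snippet below aggregates logged requests per variant. The record fields (`variant`, `quality`, `latency_ms`, `cost_usd`) are hypothetical names chosen for illustration, not the schema of any particular observability tool, and the p95 uses a simple nearest-rank rule.

```python
from statistics import mean


def p95(values):
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(values)
    # Integer ceiling of 0.95 * n, avoiding floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def summarize(records):
    """Group request records by variant and report quality, latency, and cost."""
    by_variant = {}
    for record in records:
        by_variant.setdefault(record["variant"], []).append(record)
    return {
        variant: {
            "n": len(rows),
            "mean_quality": mean(r["quality"] for r in rows),
            "p95_latency_ms": p95([r["latency_ms"] for r in rows]),
            "mean_cost_usd": mean(r["cost_usd"] for r in rows),
        }
        for variant, rows in by_variant.items()
    }
```

Reading all three columns for each variant in one table makes it harder to declare a winner on quality alone while missing a cost or latency regression.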

```mermaid
flowchart LR
    U[User request] --> R[Router / feature flag]
    R -->|50%| A[Variant A: model/prompt A]
    R -->|50%| B[Variant B: model/prompt B]
    A --> L[(Log + evaluate)]
    B --> L
    L --> M[Compare quality, cost, latency]
    M --> D{Winner with significance?}
    D -->|Yes| P[Promote]
    D -->|No| C[Keep testing]
```

## Step 1: Decide what you are testing and your hypothesis

An LLM "variant" can differ in the model (GPT vs. Claude vs. an open model), the prompt or system instructions, temperature and other parameters, the retrieval configuration in a RAG app, or the fine-tune. Change **one thing at a time** so you can attribute the result. Write a hypothesis and pick a **primary metric** up front (for example, "answer correctness as judged, with cost and p95 latency as guardrails"). Pre-committing to the primary metric prevents cherry-picking a flattering result after the fact.
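One way to make the pre-commitment concrete is to write the experiment down as data before any traffic flows. The sketch below is illustrative only: `ExperimentSpec`, `Guardrail`, and `guardrails_hold` are hypothetical names, and expressing a guardrail as a maximum relative increase over control is an assumption, not a standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Guardrail:
    metric: str
    max_relative_increase: float  # 0.10 allows up to +10% versus control


@dataclass(frozen=True)
class ExperimentSpec:
    name: str
    hypothesis: str
    changed_factor: str   # the single thing that differs between variants
    primary_metric: str   # fixed before the test starts
    guardrails: tuple = ()


def guardrails_hold(spec, control, treatment):
    """Return True if no guardrail metric regressed beyond its allowed margin."""
    for guardrail in spec.guardrails:
        limit = control[guardrail.metric] * (1 + guardrail.max_relative_increase)
        if treatment[guardrail.metric] > limit:
            return False
    return True


spec = ExperimentSpec(
    name="model-swap",
    hypothesis="Variant B improves judged answer correctness",
    changed_factor="model",
    primary_metric="judge_correctness",
    guardrails=(Guardrail("cost_usd", 0.10), Guardrail("p95_latency_ms", 0.10)),
)
```

Freezing the dataclass is a small design choice that mirrors the point of the step: the primary metric and guardrails should not change once the experiment is running.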

## Step 2: Build the routing and assignment layer

You need to deterministically and consistently assign traffic to variants. Two common approaches:

- **LLM gateways** like **LiteLLM** and **Portkey** sit between your app and providers, offering routing, load balancing, and the ability to split traffic across models behind one API — plus logging and cost tracking built in.
- **Feature-flag / experimentation platforms** like **LaunchDarkly**, **Statsig**, or **GrowthBook** assign users to variants, support gradual rollouts and targeting, and provide the statistics engine to read results.

The key requirement is **sticky assignment**: a given user or session should stay on the same variant for the duration so their experience is consistent and your unit of analysis is clean. Hash the user ID to a bucket rather than randomizing per request.
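Hashing a user ID to a bucket can be sketched in a few lines. This is a minimal illustration, not how any specific gateway or flag platform does it: `assign_variant`, the experiment-name salt, and the weight dictionary are assumptions made for the example.

```python
import hashlib


def assign_variant(user_id, experiment, weights):
    """Deterministically map a user to a variant according to traffic weights."""
    if not weights:
        raise ValueError("weights must contain at least one variant")
    # Salt with the experiment name so a user's bucket in one experiment
    # does not determine their bucket in another.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer and scale it into [0, 1).
    point = int.from_bytes(digest[:8], "big") / 2**64
    total = sum(weights.values())
    cumulative = 0.0
    for name, weight in weights.items():
        cumulative += weight / total
        if point < cumulative:
            return name
    return name  # fallback for floating-point rounding at the upper edge


if __name__ == "__main__":
    split = {"A": 0.5, "B": 0.5}
    print(assign_variant("user-123", "model-swap", split))
```

Because the result depends only on the user ID, the experiment name, and the weights, the same user lands on the same variant on every request without any stored state. Note that changing the weights mid-experiment can move some users between variants.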

## Step 3: Define and capture your metrics

Capture three families for every request:

- **Quality:** correctness, groundedness (for RAG), helpfulness, format adherence, and safety. Since you rarely have ground truth live, score a sample with **LLM-as-a-judge** evaluators and reinforce with **direct user signals** — thumbs up/down, edits, copy/regenerate, escalation to a human, or downstream task completion.
- **Operational:** latency (especially time-to-first-token and p95), error and timeout rate, throughput.
- **Cost:** tokens in/out and dollar cost per request and per resol
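The three metric families can be captured in one record per request. The sketch below is an assumption-laden illustration: the `RequestLog` fields and the `request_cost_usd` helper are hypothetical, and per-million-token prices must come from your provider's current price list rather than from this example.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class RequestLog:
    request_id: str
    user_id: str
    variant: str
    # Operational metrics
    latency_ms: float
    ttft_ms: float                       # time to first token
    error: bool
    # Cost metrics
    input_tokens: int
    output_tokens: int
    cost_usd: float
    # Quality metrics, filled in later once a judge or the user has responded
    judge_score: Optional[float] = None
    thumbs_up: Optional[bool] = None


def request_cost_usd(input_tokens, output_tokens, usd_per_m_input, usd_per_m_output):
    """Dollar cost of one request given prices per million tokens."""
    return (input_tokens * usd_per_m_input + output_tokens * usd_per_m_output) / 1_000_000


def to_json_line(log):
    """Serialize a record as one JSON line for an append-only log."""
    return json.dumps(asdict(log))
```

Keeping the quality fields optional reflects that judge scores and user feedback usually arrive after the response has been served, so the record is written first and enriched later.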
