
How do you A/B test different LLMs in production?

Curated by Kory White · Fractional CRO, CRO Syndicate
📅 Published · 6 min read

Direct Answer

You A/B test LLMs in production by splitting live traffic between two or more model variants (different models, prompts, or settings), holding everything else constant, and comparing them on the metrics that matter — answer quality, user engagement, latency, and cost — until you have enough data to call a winner with statistical confidence.

In practice you put an LLM gateway or routing layer (LiteLLM, Portkey, or a feature-flag tool like LaunchDarkly/Statsig) in front of your app to deterministically assign each user or session to a variant, log every request and outcome to an observability platform (LangSmith, Langfuse, Arize), and score quality with a mix of online LLM-as-a-judge evaluations, automatic signals (thumbs, edits, task completion), and offline evals on a frozen dataset.

Roll out gradually — shadow test, then a small canary, then a controlled split — and only promote the new model when it wins on your primary metric without regressing cost or latency.

Why A/B testing LLMs is different

Classic A/B testing assumes a clear, easily measured outcome — a click, a conversion. LLM outputs are open-ended and non-deterministic, so "which is better" is harder to define and the same input can yield different outputs across runs. That changes the playbook in three ways.

First, you need a quality metric, not just engagement. A model can be more engaging but less accurate, so you measure both. Second, outcomes are often delayed or implicit (did the answer actually solve the user's problem?), so you combine direct signals with model-graded evaluation.

Third, cost and latency are first-class metrics — a model that is marginally better but twice as expensive or slow may still lose. A good LLM experiment compares quality, cost, and latency together.

flowchart LR
    U[User request] --> R[Router / feature flag]
    R -->|50%| A[Variant A: model/prompt A]
    R -->|50%| B[Variant B: model/prompt B]
    A --> L[(Log + evaluate)]
    B --> L
    L --> M[Compare quality, cost, latency]
    M --> D{Winner with significance?}
    D -->|Yes| P[Promote]
    D -->|No| C[Keep testing]

Step 1: Decide what you are testing and your hypothesis

An LLM "variant" can differ in the model (GPT vs. Claude vs. an open model), the prompt or system instructions, temperature and other parameters, the retrieval configuration in a RAG app, or the fine-tune.

Change one thing at a time so you can attribute the result. Write a hypothesis and pick a primary metric up front (for example, "answer correctness as scored by an LLM judge, with cost and p95 latency as guardrails"). Pre-committing to the primary metric prevents cherry-picking a flattering result after the fact.

Step 2: Build the routing and assignment layer

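You need to deterministically and consistently assign traffic to variants. Two common approaches are an LLM gateway such as LiteLLM or Portkey, which splits traffic across models behind one API, and a feature-flag or experimentation platform such as LaunchDarkly or Statsig, which handles assignment and gradual rollout.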

The key requirement is sticky assignment: a given user or session should stay on the same variant for the duration of the experiment, so their experience is consistent and your unit of analysis is clean. Hash the user ID to a bucket rather than randomizing per request.
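As a minimal sketch of sticky assignment, the function below hashes the experiment name together with the user ID and maps the result onto weighted buckets. The function name, the experiment label, and the weights are illustrative assumptions, not the API of any gateway or flag tool named above.

```python
import hashlib


def assign_variant(user_id: str, experiment: str, weights: dict) -> str:
    """Deterministically map a user to a variant via a salted hash.

    `experiment` acts as a salt so the same user can land in different
    buckets across unrelated experiments (an assumed design choice).
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer and scale it into [0, 1).
    point = int.from_bytes(digest[:8], "big") / 2**64
    total = sum(weights.values())
    cumulative = 0.0
    for name, weight in weights.items():
        cumulative += weight / total
        if point < cumulative:
            return name
    return name  # floating-point rounding guard: fall back to the last variant


if __name__ == "__main__":
    split = {"A": 0.5, "B": 0.5}
    # The same user always receives the same variant for this experiment.
    print(assign_variant("user-123", "model-swap-test", split))
```

Because the assignment depends only on the inputs, no lookup table is needed to keep users sticky; changing the weights mid-test will move some users between arms, so ramps are easier to analyze if you fix the split for each phase.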

Step 3: Define and capture your metrics

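Capture three families for every request: quality (LLM-as-a-judge scores plus user feedback), operations (latency and errors), and cost (tokens and dollars).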

Log all of it to an observability platform — LangSmith, Langfuse, or Arize — tagged with the variant, so you can slice quality, cost, and latency by arm of the experiment.

flowchart TD
    REQ[Request + variant tag] --> Q[Quality: LLM-judge + user feedback]
    REQ --> O[Ops: latency, errors]
    REQ --> CST[Cost: tokens, dollars]
    Q --> STORE[(Experiment store)]
    O --> STORE
    CST --> STORE
    STORE --> STAT[Significance test by variant]
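One way to picture the per-request record and the slice-by-variant step is the sketch below. The field names, the 0 to 1 judge scale, and the prices in `PRICE_PER_MTOK` are hypothetical placeholders; a real setup would read these from your observability platform rather than from in-memory lists.

```python
import math
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean
from typing import Optional

# Hypothetical (input, output) prices in USD per million tokens; not real rates.
PRICE_PER_MTOK = {"A": (0.50, 1.50), "B": (3.00, 15.00)}


@dataclass
class RequestLog:
    variant: str
    judge_score: float         # assumed 0..1 score from an LLM judge
    thumbs_up: Optional[bool]  # None when the user left no feedback
    latency_ms: float
    error: bool
    input_tokens: int
    output_tokens: int

    @property
    def cost_usd(self) -> float:
        price_in, price_out = PRICE_PER_MTOK[self.variant]
        return (self.input_tokens * price_in + self.output_tokens * price_out) / 1_000_000


def p95(values):
    """Nearest-rank 95th percentile."""
    ordered = sorted(values)
    return ordered[math.ceil(0.95 * len(ordered)) - 1]


def summarize(logs):
    """Group logs by variant and compute quality, ops, and cost metrics."""
    by_variant = defaultdict(list)
    for log in logs:
        by_variant[log.variant].append(log)
    summary = {}
    for variant, rows in by_variant.items():
        rated = [r.thumbs_up for r in rows if r.thumbs_up is not None]
        summary[variant] = {
            "n": len(rows),
            "mean_judge_score": mean(r.judge_score for r in rows),
            "thumbs_up_rate": sum(rated) / len(rated) if rated else None,
            "error_rate": sum(r.error for r in rows) / len(rows),
            "p95_latency_ms": p95([r.latency_ms for r in rows]),
            "mean_cost_usd": mean(r.cost_usd for r in rows),
        }
    return summary
```

Keeping the variant tag on every record is what makes the later significance test possible; without it, quality and cost cannot be attributed to an arm.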

Step 4: Roll out safely — shadow, canary, split

De-risk before you expose users:

  1. Shadow / offline test: replay logged production traffic against the new variant (or run it silently in parallel) and compare outputs and evals — no user impact.
  2. Canary: route a small slice (1–5%) to the new variant and watch guardrail metrics (errors, latency, cost, safety) closely.
  3. Controlled A/B split: once the canary is clean, expand to a real split (for example 50/50 or a planned ramp) and run until you reach your pre-set sample size.
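The shadow step in the list above can be sketched as a replay loop: logged prompts are run through the candidate, both outputs are scored with the same evaluator, and nothing is returned to users. The `candidate` and `score` callables here are stand-ins you would supply (for example a model client and an LLM judge); their names and signatures are assumptions for illustration.

```python
from statistics import mean
from typing import Callable, Iterable, Tuple


def shadow_replay(
    logged: Iterable[Tuple[str, str]],
    candidate: Callable[[str], str],
    score: Callable[[str, str], float],
) -> dict:
    """Replay (prompt, champion_output) pairs through a candidate variant.

    Candidate outputs are scored and discarded; they are never served.
    """
    champion_scores, candidate_scores = [], []
    for prompt, champion_output in logged:
        champion_scores.append(score(prompt, champion_output))
        candidate_scores.append(score(prompt, candidate(prompt)))
    if not champion_scores:
        raise ValueError("no logged traffic to replay")
    wins = sum(c > b for c, b in zip(candidate_scores, champion_scores))
    return {
        "n": len(champion_scores),
        "champion_mean": mean(champion_scores),
        "candidate_mean": mean(candidate_scores),
        "candidate_win_rate": wins / len(champion_scores),
    }
```

A replay like this only tells you how the candidate scores on past inputs under your evaluator; it says nothing about user behavior, which is why the canary and split stages still follow.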

Multi-armed bandit routing is an alternative that automatically shifts traffic toward the better-performing arm — useful when you want to minimize regret, though a fixed split is simpler to analyze.
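For intuition, here is a small sketch of one common bandit strategy, Thompson sampling with a Beta prior. It assumes a binary reward per request (for instance thumbs-up or "judged correct"), which is a simplification; the class and method names are hypothetical and not taken from any routing tool.

```python
import random


class ThompsonRouter:
    """Thompson sampling over variants with a binary success signal."""

    def __init__(self, variants, seed=None):
        self._rng = random.Random(seed)
        # Beta(1, 1) prior per variant, stored as [alpha, beta].
        self._stats = {v: [1, 1] for v in variants}

    def choose(self) -> str:
        # Sample a plausible success rate for each arm and pick the highest.
        draws = {
            v: self._rng.betavariate(alpha, beta)
            for v, (alpha, beta) in self._stats.items()
        }
        return max(draws, key=draws.get)

    def record(self, variant: str, success: bool) -> None:
        # A success increments alpha; a failure increments beta.
        self._stats[variant][0 if success else 1] += 1


if __name__ == "__main__":
    router = ThompsonRouter(["A", "B"], seed=0)
    variant = router.choose()
    router.record(variant, success=True)
```

If you combine this with sticky assignment, call `choose()` once when a new user or session enters the experiment and keep that user on the chosen arm afterwards.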

Step 5: Read the results with statistics

Do not eyeball it. Determine sample size in advance for the effect you care about, and use a proper significance test (or the experimentation platform's built-in stats) on your primary metric. Watch for peeking (calling a winner early inflates false positives) and check that guardrail metrics — cost, latency, safety — did not regress.
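To make the sample-size and significance steps concrete, the sketch below assumes the primary metric is a binary outcome per user (for example "answer judged correct") and uses the standard two-proportion formulas. If your metric is a continuous score you would need a different test; the function names are illustrative.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(p_base: float, p_target: float,
                        alpha: float = 0.05, power: float = 0.80) -> int:
    """Users needed per arm to detect p_base -> p_target (two-sided test)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    p_bar = (p_base + p_target) / 2
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_power * math.sqrt(p_base * (1 - p_base) + p_target * (1 - p_target))
    ) ** 2
    return math.ceil(numerator / (p_target - p_base) ** 2)


def two_proportion_z_test(success_a: int, n_a: int, success_b: int, n_b: int):
    """Return (z, two-sided p-value) for the difference B minus A."""
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("no variation in outcomes; the test is undefined")
    z = (success_b / n_b - success_a / n_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


if __name__ == "__main__":
    # Hypothetical numbers: detect a lift from 70% to 75% judged-correct.
    print(sample_size_per_arm(0.70, 0.75))
    print(two_proportion_z_test(700, 1000, 750, 1000))
```

Run the test once, at the sample size you computed up front. Re-running it every day and stopping at the first significant result is the peeking problem described above.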

Only promote the new model when it wins on the primary metric and holds the guardrails. If results are flat, that itself is a decision: keep the cheaper or faster option.
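The promotion rule can be written down as a small decision function. This is a sketch under assumed thresholds: the 10% tolerance on latency and cost and the field names are hypothetical, and you would set them from your own guardrail definitions.

```python
from dataclasses import dataclass


@dataclass
class ArmSummary:
    quality: float               # primary metric, higher is better
    p95_latency_ms: float
    cost_per_request_usd: float


def decide(champion: ArmSummary, challenger: ArmSummary, p_value: float,
           alpha: float = 0.05, max_latency_ratio: float = 1.10,
           max_cost_ratio: float = 1.10) -> str:
    """Return 'challenger' or 'champion' as the variant to keep."""
    significant = p_value < alpha
    guardrails_ok = (
        challenger.p95_latency_ms <= champion.p95_latency_ms * max_latency_ratio
        and challenger.cost_per_request_usd
        <= champion.cost_per_request_usd * max_cost_ratio
    )
    if significant and challenger.quality > champion.quality:
        # A quality win only counts if the guardrails hold.
        return "challenger" if guardrails_ok else "champion"
    if significant:
        return "champion"  # challenger is significantly worse
    # Flat result: keep the cheaper option, breaking ties on latency.
    champion_key = (champion.cost_per_request_usd, champion.p95_latency_ms)
    challenger_key = (challenger.cost_per_request_usd, challenger.p95_latency_ms)
    return "challenger" if challenger_key < champion_key else "champion"
```

Writing the rule before the test starts serves the same purpose as pre-committing to the primary metric: the outcome is decided by the data, not by whichever reading looks best afterwards.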

Step 6: Promote, monitor, and keep iterating

Promotion is a config change at the gateway or flag layer — no redeploy needed if you routed through one. After promotion, keep the observability and online evals running, because model behavior can drift as inputs change or as the provider updates the model. Treat A/B testing as continuous: new models and prompts arrive constantly, and the same harness lets you evaluate each one against your current champion.

Frequently Asked Questions

What metrics decide an LLM A/B test? A primary quality metric (correctness, groundedness, or helpfulness, often via LLM-as-a-judge plus user feedback) decides the winner, with latency and cost as guardrails. A variant that is slightly better but much slower or more expensive usually should not win.

How do I measure quality without ground-truth answers in production? Sample live requests and score them with an LLM-as-a-judge evaluator for groundedness and helpfulness, and combine that with implicit signals like thumbs, edits, regenerations, and task completion. Reserve labeled datasets for offline evaluation in CI.

What is the difference between shadow testing and an A/B test? Shadow testing runs the new variant on real or replayed traffic without showing its output to users — zero risk, good for catching regressions. An A/B test actually serves the variant to a slice of users and compares real outcomes. Shadow first, then A/B.

Which tools route traffic between models for A/B tests? LLM gateways like LiteLLM and Portkey split traffic across models behind one API with built-in logging and cost tracking, while feature-flag/experimentation platforms like LaunchDarkly, Statsig, and GrowthBook handle assignment, gradual rollout, and the statistics.

How long should I run an LLM A/B test? Long enough to reach the sample size your pre-test power calculation requires for the effect you care about, and at least a full traffic cycle (often a week) to cover daily and weekly variation. Avoid stopping early the moment a result looks significant.
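As a rough worked example of that rule, the helper below converts a per-arm sample size into a minimum run length and enforces at least one full traffic cycle. It assumes users are the unit of analysis and that `new_users_per_day` counts unique users entering the experiment each day; both the name and the numbers are hypothetical.

```python
import math


def min_test_days(n_per_arm: int, arms: int, new_users_per_day: int,
                  cycle_days: int = 7) -> int:
    """Days needed to reach the sample size, never less than one full cycle."""
    days_for_sample = math.ceil(n_per_arm * arms / new_users_per_day)
    return max(cycle_days, days_for_sample)


if __name__ == "__main__":
    # Hypothetical: 1,251 users per arm, two arms, 200 new users per day.
    print(min_test_days(1251, 2, 200))
```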

Can I use a multi-armed bandit instead of a fixed split? Yes. A bandit automatically shifts more traffic to the better-performing variant, reducing the cost of serving a worse model during the test. The trade-off is that the analysis is less clean than a fixed split, so use bandits when minimizing regret matters more than a tidy statistical readout.
