How do you A/B test different LLMs in production?
Direct Answer
You A/B test LLMs in production by splitting live traffic between two or more model variants (different models, prompts, or settings), holding everything else constant, and comparing them on the metrics that matter — answer quality, user engagement, latency, and cost — until you have enough data to call a winner with statistical confidence.
In practice you put an LLM gateway or routing layer (LiteLLM, Portkey, or a feature-flag tool like LaunchDarkly/Statsig) in front of your app to deterministically assign each user or session to a variant, and log every request and outcome to an observability platform (LangSmith, Langfuse, Arize). You then score quality with a mix of online LLM-as-a-judge evaluations, user signals (thumbs, edits, task completion), and offline evals on a frozen dataset.
Roll out gradually — shadow test, then a small canary, then a controlled split — and only promote the new model when it wins on your primary metric without regressing cost or latency.
Why A/B testing LLMs is different
Classic A/B testing assumes a clear, easily measured outcome — a click, a conversion. LLM outputs are open-ended and non-deterministic, so "which is better" is harder to define and the same input can yield different outputs across runs. That changes the playbook in three ways.
First, you need a quality metric, not just engagement. A model can be more engaging but less accurate, so you measure both. Second, outcomes are often delayed or implicit (did the answer actually solve the user's problem?), so you combine direct signals with model-graded evaluation.
Third, cost and latency are first-class metrics — a model that is marginally better but twice as expensive or slow may still lose. A good LLM experiment compares quality, cost, and latency together.
Step 1: Decide what you are testing and your hypothesis
An LLM "variant" can differ in the model (GPT vs. Claude vs. an open model), the prompt or system instructions, temperature and other parameters, the retrieval configuration in a RAG app, or the fine-tune.
Change one thing at a time so you can attribute the result. Write a hypothesis and pick a primary metric up front (for example, "answer correctness as scored by an LLM judge, with cost and p95 latency as guardrails"). Pre-committing to the primary metric prevents cherry-picking a flattering result after the fact.
Step 2: Build the routing and assignment layer
You need to deterministically and consistently assign traffic to variants. Two common approaches:
- LLM gateways like LiteLLM and Portkey sit between your app and providers, offering routing, load balancing, and the ability to split traffic across models behind one API — plus logging and cost tracking built in.
- Feature-flag / experimentation platforms like LaunchDarkly, Statsig, or GrowthBook assign users to variants, support gradual rollouts and targeting, and provide the statistics engine to read results.
The key requirement is sticky assignment: a given user or session should stay on the same variant for the duration of the experiment so their experience is consistent and your unit of analysis is clean. Hash the user ID to a bucket rather than randomizing per request.
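As a minimal sketch of hash-based sticky assignment, the function below maps a user ID to a variant using only the Python standard library. The function name, the experiment-name salt, and the weights dictionary are illustrative assumptions, not the API of any gateway or feature-flag tool named above.

```python
import hashlib


def assign_variant(user_id: str, experiment: str, weights: dict[str, float]) -> str:
    """Deterministically map a user to a variant (hypothetical helper).

    `weights` maps variant name -> relative traffic share and must be non-empty.
    """
    # Salt with the experiment name so a user's bucket in one experiment
    # does not determine their bucket in another.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    # Turn the first 8 bytes into a point on the unit interval.
    point = int.from_bytes(digest[:8], "big") / 2**64

    total = sum(weights.values())
    cumulative = 0.0
    for name, weight in weights.items():
        cumulative += weight / total
        if point < cumulative:
            return name
    return name  # floating-point rounding at the upper edge: use the last variant


if __name__ == "__main__":
    split = {"control": 0.5, "candidate": 0.5}
    # The same user always lands on the same arm, on any server, with no stored state.
    print(assign_variant("user-42", "model-swap-test", split))
```

Because the assignment is a pure function of the user ID and experiment name, it needs no lookup table, and changing the experiment name reshuffles users for the next test.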

Step 3: Define and capture your metrics
Capture three families for every request:
- Quality: correctness, groundedness (for RAG), helpfulness, format adherence, and safety. Since you rarely have ground truth live, score a sample with LLM-as-a-judge evaluators and reinforce with direct user signals — thumbs up/down, edits, copy/regenerate, escalation to a human, or downstream task completion.
- Operational: latency (especially time-to-first-token and p95), error and timeout rate, throughput.
- Cost: tokens in/out and dollar cost per request and per resolved task.
Log all of it to an observability platform — LangSmith, Langfuse, or Arize — tagged with the variant, so you can slice quality, cost, and latency by arm of the experiment.
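One way to picture a per-request record that carries all three metric families is sketched below. The field names, the `RequestLog` class, and the per-million-token prices are assumptions made for illustration; they are not the schema of LangSmith, Langfuse, or Arize, and real prices depend on your provider.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class RequestLog:
    """Hypothetical per-request record, tagged with the experiment arm."""
    experiment: str
    variant: str
    user_id: str
    latency_ms: float          # operational: total latency
    ttft_ms: float             # operational: time to first token
    input_tokens: int          # cost input
    output_tokens: int         # cost input
    error: bool
    judge_score: Optional[float] = None  # quality: filled in for sampled requests
    thumbs: Optional[int] = None         # quality: +1, -1, or None if no feedback


def cost_usd(input_tokens: int, output_tokens: int,
             price_in_per_1m: float, price_out_per_1m: float) -> float:
    """Dollar cost of one request from token counts and per-million-token prices."""
    return (input_tokens * price_in_per_1m + output_tokens * price_out_per_1m) / 1_000_000


def to_log_line(record: RequestLog, price_in_per_1m: float, price_out_per_1m: float) -> str:
    """Serialize the record as one JSON line with the derived cost attached."""
    row = asdict(record)
    row["cost_usd"] = cost_usd(record.input_tokens, record.output_tokens,
                               price_in_per_1m, price_out_per_1m)
    return json.dumps(row, sort_keys=True)


if __name__ == "__main__":
    rec = RequestLog("model-swap-test", "candidate", "user-42",
                     latency_ms=820.0, ttft_ms=210.0,
                     input_tokens=1000, output_tokens=500, error=False)
    # Prices here are placeholders, not any provider's actual rates.
    print(to_log_line(rec, price_in_per_1m=3.0, price_out_per_1m=15.0))
```

Keeping the variant on every row is what makes it possible to group by arm later and compare quality, latency, and cost side by side.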
Step 4: Roll out safely — shadow, canary, split
De-risk before you expose users:
- Shadow / offline test: replay logged production traffic against the new variant (or run it silently in parallel) and compare outputs and evals — no user impact.
- Canary: route a small slice (1–5%) to the new variant and watch guardrail metrics (errors, latency, cost, safety) closely.
- Controlled A/B split: once the canary is clean, expand to a real split (for example 50/50 or a planned ramp) and run until you reach your pre-set sample size.
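The shadow stage above can be sketched as a replay loop over logged traffic. Both `call_candidate` (a function that sends a prompt to the new variant) and `judge` (a pairwise evaluator, for example an LLM-as-a-judge call) are hypothetical callables you would supply; the row keys `"prompt"` and `"response"` are likewise assumed.

```python
from collections import Counter
from typing import Callable, Iterable, Mapping


def shadow_replay(
    logged: Iterable[Mapping[str, str]],
    call_candidate: Callable[[str], str],
    judge: Callable[[str, str, str], str],
) -> Counter:
    """Replay logged prompts through the candidate and tally pairwise verdicts.

    Each row needs a "prompt" and the "response" that production actually served.
    `judge(prompt, baseline, candidate)` must return "baseline", "candidate", or "tie".
    """
    tally: Counter = Counter()
    for row in logged:
        candidate_output = call_candidate(row["prompt"])  # never shown to a user
        verdict = judge(row["prompt"], row["response"], candidate_output)
        if verdict not in ("baseline", "candidate", "tie"):
            raise ValueError(f"unexpected verdict: {verdict!r}")
        tally[verdict] += 1
    return tally


if __name__ == "__main__":
    # Stub callables stand in for a real model call and a real judge.
    rows = [{"prompt": "What is 2+2?", "response": "4"}]
    result = shadow_replay(rows, lambda p: "4", lambda p, b, c: "tie" if b == c else "baseline")
    print(dict(result))
```

A pairwise judge can favor whichever answer it sees first, so a fuller version would likely randomize the order in which the two outputs are presented.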
Multi-armed bandit routing is an alternative that automatically shifts traffic toward the better-performing arm — useful when you want to minimize regret, though a fixed split is simpler to analyze.
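As an illustration of bandit routing, here is a minimal Thompson sampling sketch. Thompson sampling is one common bandit algorithm, chosen here as an example; the text does not prescribe a specific one. The `ThompsonRouter` class is hypothetical and assumes a binary reward per request (for example, thumbs up vs. not).

```python
import random
from typing import Optional


class ThompsonRouter:
    """Beta-Bernoulli Thompson sampling over a fixed set of arms (illustrative)."""

    def __init__(self, arms: list[str], seed: Optional[int] = None) -> None:
        self._rng = random.Random(seed)
        # [alpha, beta] per arm, starting from a uniform Beta(1, 1) prior.
        self._stats = {arm: [1, 1] for arm in arms}

    def choose(self) -> str:
        """Sample a plausible success rate for each arm and route to the highest."""
        samples = {
            arm: self._rng.betavariate(alpha, beta)
            for arm, (alpha, beta) in self._stats.items()
        }
        return max(samples, key=samples.get)

    def update(self, arm: str, success: bool) -> None:
        """Record the observed outcome for the arm that served the request."""
        self._stats[arm][0 if success else 1] += 1


if __name__ == "__main__":
    router = ThompsonRouter(["control", "candidate"], seed=0)
    simulated = random.Random(1)
    true_rates = {"control": 0.3, "candidate": 0.7}  # made-up rates for the simulation
    picks = {"control": 0, "candidate": 0}
    for _ in range(2000):
        arm = router.choose()
        picks[arm] += 1
        router.update(arm, simulated.random() < true_rates[arm])
    print(picks)  # traffic should concentrate on the arm with the higher rate
```

Note that this sketch chooses an arm per request, which trades away the sticky per-user assignment of a fixed split; that is part of why the resulting data is harder to analyze.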
Step 5: Read the results with statistics
Do not eyeball it. Determine sample size in advance for the effect you care about, and use a proper significance test (or the experimentation platform's built-in stats) on your primary metric. Watch for peeking (calling a winner early inflates false positives) and check that guardrail metrics — cost, latency, safety — did not regress.
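For a binary primary metric (for example, "answer judged correct: yes/no" per user), the sample-size and significance steps can be sketched with the standard two-proportion formulas and the standard library's `statistics.NormalDist`. The function names are illustrative, and the sketch assumes independent users, equal-sized arms, and a two-sided test; an experimentation platform's stats engine may use different methods.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(p_control: float, p_candidate: float,
                        alpha: float = 0.05, power: float = 0.80) -> int:
    """Users needed per arm to detect p_control -> p_candidate (two-sided test)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p_control + p_candidate) / 2
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p_control * (1 - p_control)
                             + p_candidate * (1 - p_candidate))
    ) ** 2
    return math.ceil(numerator / (p_candidate - p_control) ** 2)


def two_proportion_p_value(successes_a: int, n_a: int,
                           successes_b: int, n_b: int) -> float:
    """Two-sided p-value for a pooled two-proportion z-test."""
    pooled = (successes_a + successes_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        return 1.0  # all outcomes identical: no evidence of a difference
    z = (successes_b / n_b - successes_a / n_a) / std_err
    return 2 * (1 - NormalDist().cdf(abs(z)))


if __name__ == "__main__":
    # Plan: detect a lift from 70% to 75% correct.
    print(sample_size_per_arm(0.70, 0.75))
    # Readout: run once, after the planned sample size is reached.
    print(two_proportion_p_value(700, 1000, 760, 1000))
```

Computing the p-value once at the pre-planned sample size, rather than repeatedly during the run, is what keeps the false-positive rate at the chosen alpha.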
Only promote the new model when it wins on the primary metric and holds the guardrails. If results are flat, that itself is a decision: keep the cheaper or faster option.
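The promotion rule can be written down as a small decision function. Everything here is an assumption for illustration: the `ArmSummary` fields, the 10% guardrail tolerances, and the tie-break for flat results (prefer the candidate only if it is no worse on both cost and latency and strictly better on at least one).

```python
from dataclasses import dataclass


@dataclass
class ArmSummary:
    """Hypothetical per-arm rollup of the logged metrics."""
    quality: float            # primary metric, higher is better
    cost_per_task_usd: float  # guardrail
    p95_latency_ms: float     # guardrail


def pick_winner(control: ArmSummary, candidate: ArmSummary, p_value: float,
                alpha: float = 0.05,
                max_cost_ratio: float = 1.10,
                max_latency_ratio: float = 1.10) -> str:
    """Return "candidate" or "control" given the primary result and guardrails."""
    cost_ratio = candidate.cost_per_task_usd / control.cost_per_task_usd
    latency_ratio = candidate.p95_latency_ms / control.p95_latency_ms
    significant = p_value < alpha

    if significant and candidate.quality > control.quality:
        # Wins on the primary metric: promote only if the guardrails hold.
        holds = cost_ratio <= max_cost_ratio and latency_ratio <= max_latency_ratio
        return "candidate" if holds else "control"
    if significant:
        return "control"  # candidate is significantly worse on quality

    # Flat result: keep the cheaper or faster option.
    no_worse = cost_ratio <= 1.0 and latency_ratio <= 1.0
    strictly_better = cost_ratio < 1.0 or latency_ratio < 1.0
    return "candidate" if no_worse and strictly_better else "control"


if __name__ == "__main__":
    control = ArmSummary(quality=0.70, cost_per_task_usd=0.020, p95_latency_ms=1200)
    candidate = ArmSummary(quality=0.76, cost_per_task_usd=0.021, p95_latency_ms=1250)
    print(pick_winner(control, candidate, p_value=0.003))
```

Writing the thresholds down before the test starts serves the same purpose as pre-committing to the primary metric: the decision is made by the rule, not by whoever reads the dashboard.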
Step 6: Promote, monitor, and keep iterating
Promotion is a config change at the gateway or flag layer — no redeploy needed if you routed through one. After promotion, keep the observability and online evals running, because model behavior can drift as inputs change or as the provider updates the model. Treat A/B testing as continuous: new models and prompts arrive constantly, and the same harness lets you evaluate each one against your current champion.
Frequently Asked Questions
What metrics decide an LLM A/B test? A primary quality metric (correctness, groundedness, or helpfulness, often via LLM-as-a-judge plus user feedback) decides the winner, with latency and cost as guardrails. A variant that is slightly better but much slower or more expensive usually should not win.
How do I measure quality without ground-truth answers in production? Sample live requests and score them with an LLM-as-a-judge evaluator for groundedness and helpfulness, and combine that with user signals such as thumbs, edits, regenerations, and task completion. Reserve labeled datasets for offline evaluation in CI.
What is the difference between shadow testing and an A/B test? Shadow testing runs the new variant on real or replayed traffic without showing its output to users — zero risk, good for catching regressions. An A/B test actually serves the variant to a slice of users and compares real outcomes. Shadow first, then A/B.
Which tools route traffic between models for A/B tests? LLM gateways like LiteLLM and Portkey split traffic across models behind one API with built-in logging and cost tracking, while feature-flag/experimentation platforms like LaunchDarkly, Statsig, and GrowthBook handle assignment, gradual rollout, and the statistics.
How long should I run an LLM A/B test? Long enough to reach the sample size your pre-test power calculation requires for the effect you care about, and at least a full traffic cycle (often a week) to cover daily and weekly variation. Avoid stopping early the moment a result looks significant.
Can I use a multi-armed bandit instead of a fixed split? Yes. A bandit automatically shifts more traffic to the better-performing variant, reducing the cost of serving a worse model during the test. The trade-off is that the analysis is less clean than a fixed split, so use bandits when minimizing regret matters more than a tidy statistical readout.
Sources
- LiteLLM (LLM gateway/routing) — https://docs.litellm.ai/
- Portkey AI gateway — https://portkey.ai/docs
- LaunchDarkly experimentation — https://docs.launchdarkly.com/
- Statsig — https://docs.statsig.com/
- LangSmith (tracing + evaluation) — https://docs.smith.langchain.com/
- Langfuse — https://langfuse.com/docs
- Arize AI observability — https://docs.arize.com/
