A/B testing for AI features behaves differently than A/B testing for traditional product changes. Required sample sizes are typically 3 to 8x larger because user behavior is more heterogeneous, effect sizes are smaller, and cost-of-error is asymmetric. The result is that sample size, more than statistical sophistication, metric selection, or decision-rule cleverness, dominates the economics of AI A/B testing. Teams that do not have the traffic to reach the required sample size should not be running A/B tests on aggregate metrics; they should be using offline evals as the primary decision instrument and reserving A/B tests for behavior-dependent uncertainty that offline evals cannot resolve. Underpowered A/B tests are worse than no test because they produce false confidence in wrong decisions. This piece is the framework for when to A/B test AI features, when to skip, and how to size the test when one is justified.
This is a spoke under the AI project economics manifesto. The manifesto argues that AI economics requires evaluation-cost framing rather than feature-cost framing. A/B testing is the experimentation budget within that framing: the cost of resolving behavior-dependent uncertainty that the offline eval suite cannot resolve.
Why AI A/B is different
Traditional product A/B testing (a button color change, a copy change, a layout change) has well-understood sample-size economics. Effect sizes are typically detectable at 5,000 to 30,000 users per arm. Variance on the primary metric is moderate. Decision rules are well-established.
AI feature A/B testing breaks each of these assumptions. The change is in model output, prompt, retrieval strategy, or reasoning approach. The user-visible effect is mediated by user behavior, comprehension, and task variation. Required sample sizes balloon. Decision rules need to handle distributional changes, not just mean shifts. Guardrails matter more because AI changes can degrade silently along dimensions that are not the primary metric.
The result is that A/B testing is not the default experimentation tool for AI features the way it is for traditional product changes. The default is offline evaluation; A/B testing is the layer above for behavior-dependent uncertainty. This inverts the priority order most product teams are used to.
The three sample-size pressures
Heterogeneous user behavior. AI features interact with user behavior in ways that are highly variable across user segments. A coding assistant change helps senior engineers and hurts junior ones. A customer support summarizer helps users with simple tickets and confuses users with complex ones. The aggregate effect averages out, masking segment-level effects and inflating variance on the primary metric.
The sample size needed to detect a 3 percent aggregate effect with high heterogeneity is roughly 2 to 4x the size needed for a homogeneous-behavior change of the same effect size. AI A/B tests routinely run into this multiplier.
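The mechanism behind this multiplier can be sketched with the law of total variance: when segments respond differently, the pooled variance is the within-segment variance plus the spread between segment means, and required sample size scales linearly with that pooled variance. The segment weights, means, and variances below are hypothetical values chosen for illustration, not measurements.

```python
def pooled_variance(segments):
    """Variance of a metric pooled across segments.

    segments: list of (weight, mean, variance) tuples; weights sum to 1.
    Uses the law of total variance: within-segment + between-segment.
    """
    overall_mean = sum(w * m for w, m, _ in segments)
    within = sum(w * v for w, _, v in segments)
    between = sum(w * (m - overall_mean) ** 2 for w, m, _ in segments)
    return within + between


# Hypothetical: one homogeneous population versus two equal-sized segments
# (for example senior and junior engineers) whose responses point in
# opposite directions around the same overall mean.
homogeneous = pooled_variance([(1.0, 0.5, 1.0)])
mixed = pooled_variance([(0.5, -0.5, 1.0), (0.5, 1.5, 1.0)])

# Required sample size is proportional to variance, so this ratio is the
# sample-size multiplier for this particular made-up scenario.
inflation = mixed / homogeneous
print(f"variance inflation: {inflation:.1f}x")
```

Both populations have the same average response in this sketch; only the between-segment spread differs, and that alone doubles the variance the test has to overcome.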
Small effect sizes. Traditional A/B tests on visible UI changes often produce 8 to 15 percent effects on engagement metrics. AI feature changes typically produce 1 to 4 percent effects on aggregate quality metrics; most of the system’s quality budget is set by the model and prompt that everyone in test and control has access to, and the change moves only a sliver of the total quality.
The sample size required scales as 1 over effect size squared. Detecting a 3 percent effect requires roughly 7x the sample of detecting an 8 percent effect.
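As a rough check on that scaling, here is a minimal sketch using the standard normal-approximation formula for a two-sided, two-sample test on a difference in means. The function name, the default 5 percent alpha and 80 percent power, and the placeholder standard deviation of 1.0 are assumptions for illustration; a real power calculation should use the variance estimated from your own metric.

```python
from statistics import NormalDist


def sample_size_per_arm(effect, sd, alpha=0.05, power=0.80):
    """Approximate per-arm sample size for a two-sided z-test.

    effect and sd must be in the same units as the metric.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return 2 * ((z_alpha + z_beta) * sd / effect) ** 2


# Same variance, different effect sizes (placeholder sd of 1.0).
n_large_effect = sample_size_per_arm(effect=0.08, sd=1.0)
n_small_effect = sample_size_per_arm(effect=0.03, sd=1.0)

# The ratio depends only on the effect sizes: (0.08 / 0.03) ** 2.
ratio = n_small_effect / n_large_effect
print(f"sample-size ratio: {ratio:.1f}x")
```

The alpha, power, and variance terms cancel in the ratio, which is why the multiplier comes out at roughly 7x regardless of the metric's scale.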
Asymmetric cost-of-error. Shipping an AI regression has higher downstream cost than shipping no change. A regression in a customer-facing AI feature can produce an incident, a brand-trust hit, or an audit-committee escalation, none of which a traditional A/B test risks. The asymmetric cost of false positives drives AI A/B tests toward higher confidence thresholds (often 99 percent rather than 95 percent), which further inflates required sample size.
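To see how much the stricter threshold alone adds, the sketch below compares the critical-value term of the normal-approximation sample-size formula at 99 percent and 95 percent confidence. It assumes a two-sided test and 80 percent power in both cases; those choices, and the helper name, are illustrative assumptions rather than anything prescribed here.

```python
from statistics import NormalDist


def z_factor(alpha, power):
    """Squared sum of critical values; sample size is proportional to this."""
    z = NormalDist()
    return (z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) ** 2


# Holding effect size, variance, and power fixed, compare the thresholds.
confidence_multiplier = z_factor(alpha=0.01, power=0.80) / z_factor(alpha=0.05, power=0.80)
print(f"99% vs 95% confidence: {confidence_multiplier:.2f}x the sample")
```

Under these assumptions the move from 95 to 99 percent confidence costs roughly 1.5x the sample, on top of whatever the effect size and variance already require.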
The combined effect: AI A/B tests need 50,000 to 200,000 user-experiences per arm to reach defensible conclusions on small effects with confidence appropriate to the cost-of-error. Most AI features at most companies do not have this traffic available within a reasonable test window.
When to skip the A/B test
The most common A/B test failure mode in AI projects is running tests that should never have been launched. Three categories of AI changes do not need an A/B test:
Offline-eval-decisive changes. If the offline eval suite shows a clear regression or improvement (typically 8 percent or larger movement on a primary quality metric), the A/B test will not produce different information. Ship the improvement, fix the regression, and use the A/B test budget on something more uncertain. The AI project evaluation budget piece covers offline-eval allocation.
Behavior-independent changes. Backend latency optimizations, model swaps that do not change user-visible output, and internal eval pipeline changes do not need an A/B test because there is no user-behavior dimension. Force-fitting A/B onto these wastes the experimentation budget. Ship them after offline validation.
Low-traffic features. Features that cannot reach the required sample size within a reasonable window (typically 4 to 8 weeks) should be evaluated offline, not via A/B. Running an underpowered A/B on a low-traffic feature produces noise that is worse than no test.
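The low-traffic check is simple arithmetic and worth running before any test is scheduled. The sketch below is one way to do it; the function names, the two-arm default, and the example traffic figures are hypothetical, and the 8-week ceiling is taken from the upper end of the window mentioned above.

```python
import math


def weeks_to_reach_sample(n_per_arm, weekly_eligible_users, arms=2):
    """Whole weeks of traffic needed to fill every arm."""
    return math.ceil(n_per_arm * arms / weekly_eligible_users)


def is_feasible(n_per_arm, weekly_eligible_users, max_weeks=8):
    """True if the test can reach its sample size within the window."""
    return weeks_to_reach_sample(n_per_arm, weekly_eligible_users) <= max_weeks


# Hypothetical features needing 50,000 user-experiences per arm.
low_traffic_weeks = weeks_to_reach_sample(50_000, weekly_eligible_users=5_000)
high_traffic_weeks = weeks_to_reach_sample(50_000, weekly_eligible_users=25_000)

print(low_traffic_weeks, is_feasible(50_000, 5_000))     # too slow: evaluate offline
print(high_traffic_weeks, is_feasible(50_000, 25_000))   # fits the window
```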
The negative-space rule: A/B tests are expensive in calendar time and statistical infrastructure. The default for AI feature changes is offline eval. A/B is reserved for the subset of changes where user behavior is the dominant uncertainty and the offline eval cannot resolve it.
When to run it longer
When an AI A/B test is justified but traffic is borderline, the math says run longer rather than ship at a noisy stopping point. Three guidelines:
Run for at least 4 weeks. Below 4 weeks, day-of-week effects and onboarding effects dominate. AI features in particular show different effects in week 1 (novelty) versus week 4 (steady state).
Run through one full eval cycle. AI systems have natural eval cycles (model upgrades, prompt revisions, retrieval improvements). A/B tests that conclude before the next eval cycle do not give the test arm time to reach steady state on the underlying system.
Run until the confidence interval resolves against the minimum meaningful effect. If the test is borderline at the planned end date, extend until the confidence interval clearly includes or excludes a meaningful effect size. Stopping at “p = 0.06” produces no decision and burns the experiment slot.
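One possible reading of the confidence-interval guideline, written as a decision function: ship when the whole interval sits at or above the minimum meaningful effect, kill when the whole interval sits below it, and keep running when the interval straddles it. The function name, the three labels, and the example intervals are assumptions for illustration; the exact rule belongs in the pre-registration document.

```python
def decide(ci_low, ci_high, min_effect):
    """Map a confidence interval on the treatment effect to a decision.

    ci_low, ci_high: bounds of the interval on the primary-metric effect.
    min_effect: smallest effect worth shipping for (same units).
    """
    if ci_low >= min_effect:
        return "ship"        # entire interval clears the meaningful threshold
    if ci_high < min_effect:
        return "kill"        # a meaningful effect is excluded
    return "run-longer"      # interval straddles the threshold


# Hypothetical intervals against a 3 percent minimum meaningful effect.
print(decide(0.035, 0.060, min_effect=0.03))
print(decide(-0.010, 0.020, min_effect=0.03))
print(decide(0.010, 0.050, min_effect=0.03))
```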
The stop-budgeting-in-story-points piece covers how A/B tests fit into the eval-run budget cadence.
Pre-registration and decision rules
The most reliable failure-prevention discipline for AI A/B tests is pre-registration. Before traffic starts splitting:
- Name the primary metric. Single metric, written in the kickoff document.
- Name the effect size. What movement is meaningful enough to ship the change for.
- Compute the sample size. From effect size and variance estimates, run the power calculation, document the result.
- Name the guardrails. Latency, error rate, cost per request, downstream task completion.
- Name the decision rule. What outcome means ship, what means kill, what means run-longer.
Pre-registration prevents the most common AI A/B failure mode: running until the data “looks interesting” and stopping early at a noise spike. Pre-registered tests have stable conclusions; ad-hoc tests do not. The discipline takes 2 to 4 hours per test and saves the experimentation slot.
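The five items can be captured as a small record that is checked before traffic starts splitting. This is a sketch of one possible shape, not a prescribed schema: the class name, field names, and example values are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class PreRegistration:
    primary_metric: str
    min_effect: float                 # smallest movement worth shipping for
    sample_size_per_arm: int          # result of the documented power calculation
    guardrails: dict[str, float] = field(default_factory=dict)    # name -> max regression
    decision_rule: dict[str, str] = field(default_factory=dict)   # outcome -> condition

    def missing_items(self):
        """List the pre-registration items that are still unfilled."""
        missing = []
        if not self.primary_metric:
            missing.append("primary_metric")
        if self.min_effect <= 0:
            missing.append("min_effect")
        if self.sample_size_per_arm <= 0:
            missing.append("sample_size_per_arm")
        if not self.guardrails:
            missing.append("guardrails")
        if not {"ship", "kill", "run-longer"} <= set(self.decision_rule):
            missing.append("decision_rule")
        return missing


# Hypothetical registration for a summarizer change.
complete = PreRegistration(
    primary_metric="task_completion_rate",
    min_effect=0.03,
    sample_size_per_arm=60_000,
    guardrails={"latency_p95": 0.10},
    decision_rule={
        "ship": "interval at or above min_effect, no guardrail breach",
        "kill": "interval below min_effect",
        "run-longer": "interval straddles min_effect",
    },
)
incomplete = PreRegistration(primary_metric="", min_effect=0.0, sample_size_per_arm=0)

print(complete.missing_items())
print(incomplete.missing_items())
```

Gating the traffic split on an empty `missing_items()` list is one lightweight way to enforce the discipline without adding process overhead.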
Guardrail metrics for AI tests
AI features can move primary metrics in the right direction while degrading other dimensions silently. A summarizer that improves average summary quality can also increase hallucination rate on a tail subset. A retrieval improvement that improves answer quality can also increase latency past acceptable thresholds. A reasoning model that improves task completion can also increase cost per request 3x.
Guardrails are non-primary metrics with one-sided alarm thresholds. They are not the deciding factor on their own, but they can veto a change that wins on the primary. Standard AI guardrails:
- Latency. P50, P95, P99. Alarm threshold typically 10 to 20 percent regression.
- Error rate. Tool failures, malformed outputs, refusals. Alarm threshold typically zero increase.
- Cost per request. Token cost from upstream provider plus downstream serving. Alarm threshold typically 30 to 50 percent regression.
- Downstream task completion. Whether users use the AI output. Alarm threshold typically 5 to 10 percent regression.
A test that wins on the primary but breaks a guardrail does not ship. The piece on the hidden cost of AI evals covers how guardrails are computed within the eval infrastructure.
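The veto logic can be sketched as a one-sided check per guardrail. The metric names and example measurements below are hypothetical, and the thresholds are the low ends of the ranges listed above; the sketch also assumes every control value is positive so that a relative change is defined.

```python
# name -> (max allowed relative regression, True if an increase is the bad direction)
GUARDRAILS = {
    "latency_p95": (0.10, True),
    "error_rate": (0.00, True),
    "cost_per_request": (0.30, True),
    "task_completion": (0.05, False),   # a decrease is the regression here
}


def breached_guardrails(control, treatment):
    """Return the guardrails whose one-sided regression exceeds its threshold."""
    breached = []
    for name, (limit, higher_is_worse) in GUARDRAILS.items():
        change = (treatment[name] - control[name]) / control[name]
        regression = change if higher_is_worse else -change
        if regression > limit:
            breached.append(name)
    return breached


# Hypothetical measurements from a test that won on the primary metric.
control = {"latency_p95": 800, "error_rate": 0.02, "cost_per_request": 0.010, "task_completion": 0.60}
treatment = {"latency_p95": 920, "error_rate": 0.02, "cost_per_request": 0.012, "task_completion": 0.58}

vetoes = breached_guardrails(control, treatment)
print(vetoes)  # any non-empty list blocks the ship decision
```

A production version would compare confidence bounds rather than point estimates, since a point estimate on a noisy guardrail can trip or miss the alarm by chance.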
The economics of underpowered tests
The temptation to run smaller tests is constant. Calendar pressure, traffic constraints, and the desire to ship push teams toward 2-week tests with 10,000 users per arm. The economics of these tests:
Direct cost. 2 weeks of engineering and infrastructure time to run the test, plus statistical analysis cost. Typical: $20K to $40K.
Indirect cost. An underpowered test that fails to detect a real improvement costs the value of the unshipped improvement, often 10x to 50x the test cost. An underpowered test that detects noise as effect costs the downstream regression cleanup, often 5x to 20x the test cost.
Expected cost. For an underpowered test with 40 percent power instead of 80 percent, expected cost is 2 to 3x a properly powered test because the test misses a real effect 60 percent of the time instead of 20 percent. The “savings” from running a small test are negative in expectation.
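A back-of-the-envelope version of that comparison, under deliberately simple assumptions: the improvement is real, both tests have the same direct cost, false-positive cleanup is ignored, and a miss forfeits the full value of the improvement. The $30K direct cost and the 10x missed-value figure are picked from the ranges quoted above purely for illustration.

```python
def expected_cost(direct_cost, missed_value, power):
    """Direct cost plus the expected value lost when a real effect is missed."""
    miss_probability = 1 - power
    return direct_cost + miss_probability * missed_value


direct_cost = 30_000                 # illustrative, midpoint of the quoted range
missed_value = 10 * direct_cost      # illustrative, low end of the 10x to 50x range

properly_powered = expected_cost(direct_cost, missed_value, power=0.80)
underpowered = expected_cost(direct_cost, missed_value, power=0.40)

cost_ratio = underpowered / properly_powered
print(f"expected cost ratio: {cost_ratio:.2f}x")
```

With these inputs the underpowered test lands inside the 2 to 3x range; larger missed-value multiples push the ratio toward the top of that range.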
The defensible posture: do not run A/B tests that cannot reach 80 percent power on a meaningful effect size. If the traffic is not there, switch to offline eval as the primary instrument or accept that the decision will be made without an A/B test.
Frequently asked questions
Why do AI A/B tests need larger samples than traditional A/B tests?
Three reasons. Heterogeneous user behavior produces higher variance in the metric. Effect sizes for AI feature changes are typically smaller than for visible UI changes. Cost-of-error is asymmetric: shipping a regression has higher downstream cost than shipping nothing. The combination drives required sample size 3 to 8x higher than equivalent traditional A/B tests.
What sample size do AI A/B tests typically need?
For an effect size of 2 to 5 percent on a primary metric with 80 percent power and 5 percent alpha, sample sizes of 50,000 to 200,000 user-experiences per arm are typical for AI feature tests. Smaller samples can detect larger effects but most AI feature changes produce small effects on aggregate metrics.
When should an AI A/B test be skipped entirely?
When the eval suite already shows a clear regression or improvement. If offline evals on a held-out test set show 8 percent or larger movement on a primary quality metric, the A/B test will not produce different information. Ship the improvement (or fix the regression) and use the A/B test budget on a more uncertain decision.
What’s the cost of running an underpowered AI A/B test?
Two costs. The direct cost of the experiment time (2 to 8 weeks of traffic split). The indirect cost of a misleading conclusion: an underpowered test that fails to detect a real improvement leads to shipping the wrong variant. Underpowered AI A/B tests are worse than no test because they produce false confidence in a wrong decision.
How do you handle low-traffic AI features that cannot reach sample size?
Three options. Run the test for longer (months, not weeks) and accept the slow decision cycle. Switch to offline eval as the primary decision instrument. Pool decisions across multiple low-traffic features into a portfolio test that aggregates power. Most low-traffic AI features are better evaluated offline; A/B is wasted on them.
What’s heterogeneous user behavior in the AI A/B context?
AI features interact with user behavior in ways that are highly variable across user segments. A coding assistant change might help senior engineers and hurt junior ones. The aggregate effect averages out, masking the segment-level effects. Heterogeneous behavior inflates the variance on the primary metric and pushes required sample size higher.
Should AI A/B tests use guardrail metrics?
Yes, usually. AI features can move primary metrics in the right direction while degrading guardrail metrics (latency, error rate, cost per request, downstream task completion). Guardrail metrics are checked at lower power than primary metrics but with one-sided alarm thresholds. A test that wins on the primary but breaks a guardrail does not ship.
How do offline evals and A/B tests fit together?
Offline evals are the primary quality gate; A/B tests are the secondary gate for behavior-dependent effects. The sequence: build, run offline evals, ship to a small holdout if evals pass, run A/B if user behavior is the dominant uncertainty. Many AI feature changes never need an A/B test because the offline eval is decisive.
What’s the right pre-registration discipline for AI A/B tests?
Name the primary metric, the effect size, the sample size, the guardrails, and the decision rule before traffic starts splitting. Pre-registration prevents the most common AI A/B failure mode: running until the data “looks interesting” and stopping early at a noise spike. Pre-registered tests have stable conclusions; ad-hoc tests do not.
When does A/B testing fail outright for AI projects?
When the change is not behavior-dependent. A backend latency optimization, a model swap that does not change user-visible output, or an internal eval pipeline change does not need an A/B test because there is no user-behavior dimension. Force-fitting A/B onto these wastes the experimentation budget that should fund the behavior-dependent decisions.
Key takeaways
- AI A/B tests need 3 to 8x larger samples than traditional A/B tests due to heterogeneous behavior, small effect sizes, and asymmetric cost-of-error.
- Sample size dominates the economics of AI A/B testing; teams without the traffic should be using offline evals as the primary decision instrument.
- Skip A/B tests when offline evals are decisive, when changes are behavior-independent, or when traffic cannot reach required sample size.
- Pre-registration of primary metric, effect size, sample size, guardrails, and decision rule prevents the most common A/B failure mode (early stopping on noise).
- Underpowered A/B tests are worse than no test because they produce false confidence in wrong decisions; the expected cost is 2 to 3x a properly powered test.
A/B testing for AI features is the experimentation layer above the offline eval. Most AI feature decisions can and should be made offline; A/B is reserved for the behavior-dependent uncertainty that the eval cannot resolve. Teams that respect this priority order produce reliable AI ship decisions with manageable experimentation budgets; teams that A/B everything spend their experimentation slots on tests that should never have been launched.
Arthur Wandzel