Variance is a measure of how much a metric's values spread out across users. High variance means individual users differ widely on the metric (on a revenue metric, for example, some users generate nothing while others generate $500). Low variance means users cluster close to the average. In A/B testing, variance directly determines how large your experiment needs to be: higher variance requires more sample size to detect the same effect.
Variance is the single biggest driver of experiment runtime for most teams. Two metrics can have the same average treatment effect, but if one has 10x the variance, it needs roughly 10x the sample size to reach the same power. This is why teams that understand and manage variance run faster experimentation programs. At Spotify, where 300+ teams share experiment bandwidth, variance reduction techniques are among the most valuable investments in experimentation infrastructure. Reducing metric variance doesn't just help one test. It increases the effective capacity of the entire program.
How does variance affect experiment design?
Variance enters the sample size formula directly. For a standard two-sample z-test, the required sample size per group is proportional to the metric variance and inversely proportional to the squared minimum detectable effect:
n ∝ σ² / δ²
where σ² is the variance and δ is the MDE. Double the variance, double the sample size. This relationship means that a 50% reduction in variance cuts the required sample size in half, which can turn a four-week experiment into a two-week experiment.
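To make the relationship concrete, here is a minimal sketch of the two-sample sample size calculation, assuming a two-sided test at α = 0.05 with 80% power. The numbers are illustrative; this is not Confidence's calculator.

```python
from scipy.stats import norm

def sample_size_per_group(sigma_sq: float, delta: float,
                          alpha: float = 0.05, power: float = 0.80) -> int:
    """Users needed per group to detect an absolute effect `delta`
    on a metric with variance `sigma_sq`."""
    z_alpha = norm.ppf(1 - alpha / 2)  # critical value for the two-sided test
    z_beta = norm.ppf(power)           # quantile for the desired power
    n = 2 * (z_alpha + z_beta) ** 2 * sigma_sq / delta ** 2
    return int(round(n))

# Halving the variance halves the required sample size at a fixed MDE.
print(sample_size_per_group(sigma_sq=4.0, delta=0.1))  # ~6,300 per group
print(sample_size_per_group(sigma_sq=2.0, delta=0.1))  # ~3,100 per group
```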
Confidence's sample size calculator uses the historical variance of each metric, updated with pre-experiment data, to compute required sample sizes. When variance reduction methods like CUPED are enabled, the calculator shows the adjusted variance and the correspondingly smaller sample requirement. This lets teams see the runtime savings before they commit to running the test.
What causes high variance in experiment metrics?
Outlier users. Revenue metrics are classic high-variance metrics because a small number of heavy spenders pull the distribution far from the average. One user spending $1,000 can move the group mean measurably (illustrated in the sketch after this list).
Heterogeneous populations. If your experiment includes both daily power users and users who visit once a month, their metric values will differ enormously. The power users generate most of the signal; the infrequent visitors add noise.
Metrics that aggregate across many events. Total session time, total items viewed, and similar cumulative metrics naturally have higher variance than rates (like conversion rate), because they compound variation across events.
Short observation windows. With only a day of data per user, metric values fluctuate more than over a week-long window. Longer observation windows smooth out day-to-day variation but extend experiment runtime.
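A toy simulation (made-up numbers, not Spotify data) shows how a single heavy spender inflates the variance of a revenue metric far more than it moves the mean:

```python
import numpy as np

rng = np.random.default_rng(7)
typical = rng.exponential(scale=5.0, size=999)   # most users: small spend
with_whale = np.append(typical, 1_000.0)         # add one $1,000 spender

print(f"mean without outlier: {typical.mean():.2f},  variance: {typical.var():.1f}")
print(f"mean with outlier:    {with_whale.mean():.2f},  variance: {with_whale.var():.1f}")
# The outlier shifts the mean by roughly $1 but multiplies the variance
# many times over -- and required sample size scales with variance.
```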
How do you reduce variance?
CUPED (Controlled-experiment Using Pre-Existing Data). The most effective general-purpose method. CUPED uses each user's pre-experiment metric value as a covariate to remove predictable variation. If a user's revenue was high before the experiment, it's likely high during the experiment regardless of treatment assignment. CUPED subtracts that predictable component, leaving only the variation caused by the treatment plus residual noise. Confidence applies the Negi-Wooldridge full regression estimator, which is more precise than the original CUPED formulation. Variance reductions of ~50% are common for metrics with stable user-level baselines.
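Below is a minimal sketch of the classic single-covariate CUPED adjustment, assuming per-user arrays of the in-experiment metric y and its pre-experiment value x. Confidence itself uses the Negi-Wooldridge regression estimator, so treat this as the textbook formulation rather than the production implementation; the simulated data is only for illustration.

```python
import numpy as np

def cuped_adjust(y: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Return the CUPED-adjusted metric: y - theta * (x - mean(x))."""
    theta = np.cov(y, x)[0, 1] / np.var(x, ddof=1)  # variance-minimizing coefficient
    return y - theta * (x - x.mean())

# Simulated users whose pre-period spend predicts in-experiment spend.
rng = np.random.default_rng(42)
x = rng.gamma(shape=2.0, scale=10.0, size=10_000)   # pre-experiment revenue
y = 0.5 * x + rng.normal(0.0, 7.1, size=10_000)     # in-experiment revenue

y_adj = cuped_adjust(y, x)
print(f"variance before CUPED: {y.var():.1f}")
print(f"variance after CUPED:  {y_adj.var():.1f}")  # roughly half
```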
Metric capping (winsorization). Replacing extreme values with a cap at a chosen percentile (e.g., capping revenue at the 99th percentile) removes the outsized influence of outlier users. This is a blunt instrument compared to CUPED but effective when outliers are the primary source of variance.
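A minimal capping sketch, assuming a per-user revenue array and a 99th-percentile cap (the percentile itself is a design choice, not a fixed rule):

```python
import numpy as np

def winsorize_upper(values: np.ndarray, percentile: float = 99.0) -> np.ndarray:
    """Replace values above the chosen percentile with the percentile value."""
    cap = np.percentile(values, percentile)
    return np.minimum(values, cap)

rng = np.random.default_rng(0)
revenue = rng.lognormal(mean=1.0, sigma=1.5, size=50_000)  # heavy-tailed spend
capped = winsorize_upper(revenue)

print(f"variance before capping: {revenue.var():.1f}")
print(f"variance after capping:  {capped.var():.1f}")  # the tail drove most of it
```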
Trigger analysis. If only a fraction of users encounter the changed feature, the untriggered users contribute variance without contributing signal. Restricting the analysis to triggered users removes that dilution. This doesn't reduce per-user variance, but it reduces the effective sample size needed by focusing on users where the treatment could have had an effect.
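A toy illustration of that dilution, assuming only 20% of users actually encounter the feature and the treatment moves the metric only for them (made-up numbers):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
triggered = rng.random(n) < 0.20         # 20% of users encounter the feature
lift = np.where(triggered, 0.5, 0.0)     # the true effect exists only for them

control = rng.normal(10.0, 3.0, n)
treatment = rng.normal(10.0, 3.0, n) + lift

print(f"all users:      {treatment.mean() - control.mean():+.3f}")                        # ~ +0.10, diluted 5x
print(f"triggered only: {treatment[triggered].mean() - control[triggered].mean():+.3f}")  # ~ +0.50
```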
Better metric design. Sometimes the highest-variance metric isn't the right one to measure. A binary metric (converted / didn't convert) typically has lower variance than a continuous metric (total revenue), and may be a better fit for the hypothesis being tested.
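As a rough sketch of this point (simulated distributions, not Confidence data), compare variance relative to the squared mean, which is what drives sample size when the target is the same relative MDE:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000
converted = (rng.random(n) < 0.05).astype(float)                   # binary metric
revenue = converted * rng.lognormal(mean=3.0, sigma=1.0, size=n)   # continuous metric

for name, metric in [("conversion", converted), ("revenue", revenue)]:
    noise_per_signal = metric.var() / metric.mean() ** 2  # proportional to required n
    print(f"{name}: variance / mean^2 = {noise_per_signal:.1f}")
# Detecting the same relative lift on revenue needs roughly 3x the users here.
```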