Statistical Methods

What is variance reduction?

Variance reduction is a set of statistical techniques that tighten the confidence intervals of an A/B test without requiring more traffic. By removing noise that isn't related to the treatment effect, these methods make it easier to detect real changes in your metrics, effectively giving every experiment more statistical power for free.

Variance reduction matters because experiment bandwidth is a finite resource. Most product teams can't simply double their sample size to detect smaller effects. Techniques like CUPED, metric capping, and trigger analysis reduce the noise in metric estimates, which means you can detect the same effect with fewer users or detect smaller effects with the same number of users. At Spotify, where 300+ teams run over 10,000 experiments per year, variance reduction is built into the default analysis pipeline in Confidence. Without it, a large share of those experiments would require weeks of additional runtime to reach the same statistical power.

How does variance reduction improve an experiment?

Every A/B test metric has two components: the signal (the actual treatment effect you're trying to measure) and the noise (the natural variation in user behavior that has nothing to do with the change you made). Variance reduction techniques shrink the noise component.

Consider a streaming metric like "minutes listened per user per day." Some users listen for six hours. Others open the app once a week. That range of behavior creates enormous variance in your metric, which makes it hard to detect a 1% improvement even with millions of users. Variance reduction attacks this problem at the source.

The practical impact is substantial. CUPED, the most widely used method, typically reduces variance by 20-50% depending on the metric and the quality of the pre-experiment covariate. That's equivalent to increasing your sample size by 25-100% at no cost. Metric capping handles the extreme tails: a single user who listens 20 hours a day can dominate the variance of an entire treatment group. Trigger analysis removes users who never encountered the change, which eliminates dilution from irrelevant observations.
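The sample-size equivalence follows from the fact that required sample size scales linearly with metric variance: reducing variance by a fraction r multiplies your effective sample size by 1 / (1 − r). A minimal sketch of the arithmetic:

```python
# Required sample size scales linearly with metric variance, so reducing
# variance by a fraction r is equivalent to multiplying the effective
# sample size by 1 / (1 - r).
def effective_sample_size_gain(variance_reduction: float) -> float:
    """Equivalent sample-size increase for a given fractional variance reduction."""
    return 1.0 / (1.0 - variance_reduction) - 1.0

print(f"{effective_sample_size_gain(0.20):.0%}")  # 25% -- a 20% variance reduction
print(f"{effective_sample_size_gain(0.50):.0%}")  # 100% -- a 50% variance reduction
```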

Which variance reduction methods does Confidence support?

Confidence applies three methods, each targeting a different source of noise.

CUPED (Controlled-experiment Using Pre-Existing Data) uses each user's pre-experiment behavior to adjust their post-experiment metric. If a user was already a heavy listener before the experiment started, CUPED accounts for that, so the remaining variation reflects the treatment effect more cleanly. Confidence uses the Negi-Wooldridge 2021 full regression estimator, which is more precise than the original CUPED formulation.
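For intuition, here is a minimal sketch of the original single-covariate CUPED adjustment, not the Negi-Wooldridge regression estimator that Confidence actually uses; the simulated data and variable names are illustrative:

```python
import numpy as np

def cuped_adjust(y: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Classic CUPED: remove the part of the in-experiment metric y that is
    linearly predictable from the pre-experiment covariate x."""
    theta = np.cov(y, x, ddof=1)[0, 1] / np.var(x, ddof=1)
    return y - theta * (x - x.mean())

# Simulated example: pre-period listening strongly predicts in-period listening.
rng = np.random.default_rng(0)
x = rng.gamma(shape=2.0, scale=30.0, size=100_000)        # pre-experiment minutes
y = 0.8 * x + rng.normal(0.0, 20.0, size=x.shape)         # in-experiment minutes
y_adj = cuped_adjust(y, x)
print(np.var(y_adj) / np.var(y))  # fraction of variance remaining after adjustment
```

The stronger the correlation between the pre-experiment covariate and the metric, the larger the reduction, which is why metrics with stable user-level behavior benefit the most.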

Metric capping winsorizes extreme values to reduce the influence of outliers. Rather than letting a single power user dominate your metric's variance, capping clips values at a threshold (for example, the 99th percentile), keeping the directional information while removing disproportionate noise.
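A minimal sketch of one-sided winsorization, where the 99th-percentile threshold is just an example and the simulated data is illustrative:

```python
import numpy as np

def cap_metric(values: np.ndarray, percentile: float = 99.0) -> np.ndarray:
    """Clip values above the given percentile so a handful of extreme
    observations can't dominate the metric's variance."""
    cap = np.percentile(values, percentile)
    return np.minimum(values, cap)

# Heavy-tailed simulated "minutes listened": a few users listen for hours.
rng = np.random.default_rng(1)
minutes = rng.lognormal(mean=3.0, sigma=1.5, size=100_000)
print(np.var(cap_metric(minutes)) / np.var(minutes))  # fraction of variance kept
```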

Trigger analysis restricts the analysis to users who actually experienced the change being tested. If you're testing a new checkout flow, including users who never reached checkout adds noise without adding signal. Trigger analysis removes that dilution.
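In code, trigger analysis amounts to a filter on exposure, as in this sketch; the DataFrame schema and the reached_checkout flag are hypothetical:

```python
import pandas as pd

def triggered_only(users: pd.DataFrame) -> pd.DataFrame:
    """Restrict the analysis to users who actually encountered the change.
    Non-triggered users contribute no treatment effect but full metric
    variance, so dropping them removes pure dilution.
    'reached_checkout' is a hypothetical trigger flag."""
    return users[users["reached_checkout"]]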

These methods compose with the rest of Confidence's statistical stack. Variance reduction adjustments carry through to sequential testing, power analysis, and multiple testing corrections. This is a deliberate design choice: shipping CUPED without adapting sample size calculations to account for the variance reduction leaves teams with power estimates that don't match their actual analysis.
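As one example of that composition, a power calculation should use the variance-reduced standard deviation rather than the raw one. A minimal sketch using a plain fixed-horizon two-sample z-test, not Confidence's actual sequential machinery:

```python
import math
from scipy import stats

def required_n_per_group(mde: float, sd: float, alpha: float = 0.05,
                         power: float = 0.8, variance_reduction: float = 0.0) -> int:
    """Per-group sample size for a two-sample z-test, carrying the
    variance reduction through to the standard deviation."""
    sd_adj = sd * math.sqrt(1.0 - variance_reduction)
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)
    return math.ceil(2 * (z_alpha + z_beta) ** 2 * (sd_adj / mde) ** 2)

print(required_n_per_group(mde=1.0, sd=50.0))                          # without CUPED
print(required_n_per_group(mde=1.0, sd=50.0, variance_reduction=0.3))  # with CUPED
```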

When should you use variance reduction?

Use it by default. There's rarely a good reason not to. The only case where CUPED doesn't help is when the metric has no pre-experiment history to condition on, such as metrics that measure entirely new behavior introduced by the experiment itself. Even then, metric capping and trigger analysis still apply.

The gains compound across an experimentation program. If variance reduction cuts required runtime by 30% on average, a team running 50 experiments per year frees enough runtime for roughly 15 additional experiments at the original cadence. That's 15 more product questions answered, 15 more opportunities to learn.