Lesson 8: Variance reduction
Understanding variance reduction is essential for metric selection, not just statistics. A metric with high raw variance might still be the best choice if it has a strongly correlated covariate—because variance reduction can bring its effective variance well below that of a seemingly simpler metric. This lesson explains how regression adjustment works, what drives how much reduction you get, and how to handle outliers so you can make smarter metric choices.
Why metric selection and variance reduction are inseparable
When you choose a metric, raw variance is only half the picture. What matters for your experiment's power is effective variance—variance after applying regression adjustment. A continuous metric like total streams per user may look noisy in isolation, but if user behavior is stable over time, a pre-experiment covariate will absorb most of that noise. The result can be a far more sensitive metric than a binary alternative that seemed cleaner on the surface.
This means you can't evaluate metrics without understanding variance reduction, and you can't apply variance reduction thoughtfully without understanding which metrics it works well for. The two decisions are made together.
CUPED, CUPAC, and their relatives are all regression adjustment
CUPED, CUPAC, and every other branded variance reduction technique in online experimentation are fundamentally the same thing: regression adjustment. You regress a pre-experiment covariate out of the outcome and analyze the residuals. The adjusted outcome has variance equal to the original variance multiplied by (1 − ρ²), where ρ is the correlation between the covariate and the outcome. In other words, a share ρ² of the variance is removed.
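A short derivation shows where the (1 − ρ²) factor comes from. This is a sketch for a single pre-experiment covariate and a linear adjustment. The symbols Y (outcome), X (covariate), and θ (adjustment coefficient) are notation introduced here for illustration.

```latex
\begin{align*}
Y_{\text{adj}} &= Y - \theta\,(X - \mathbb{E}[X]) \\
\operatorname{Var}(Y_{\text{adj}}) &= \operatorname{Var}(Y) - 2\theta\,\operatorname{Cov}(Y, X) + \theta^{2}\,\operatorname{Var}(X) \\
\theta^{*} &= \frac{\operatorname{Cov}(Y, X)}{\operatorname{Var}(X)} \\
\operatorname{Var}(Y_{\text{adj}})\Big|_{\theta = \theta^{*}} &= \operatorname{Var}(Y) - \frac{\operatorname{Cov}(Y, X)^{2}}{\operatorname{Var}(X)} = \operatorname{Var}(Y)\,(1 - \rho^{2})
\end{align*}
```

The variance is a quadratic in θ, and θ* is the value that minimizes it. This is the ordinary least squares slope from regressing Y on X. Subtracting θ(X − E[X]) leaves the mean of Y unchanged, so only the noise is affected.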
The statistical principle goes back decades. The efficiency gains from adjusting for pre-treatment covariates were formalized by Cochran (1957) under the name analysis of covariance (ANCOVA), building on the potential outcomes framework introduced by Neyman (1923). What the 2013 CUPED paper genuinely contributed was adapting these classical results to online A/B testing at scale, extending them to ratio metrics, and adding a conceptually important insight: because the covariate is measured before the experiment, treatment cannot influence it, so the adjusted estimate of the treatment effect remains unbiased without requiring any modeling assumption about the covariate-outcome relationship. The adjustment is valid by design.
CUPAC, for instance, is CUPED with an ML-predicted outcome as the covariate instead of the raw pre-experiment metric—useful when the simple covariate is weakly correlated. Most subsequent methods follow the same pattern: a different covariate choice within the same regression framework. When you see a new acronym, the question that cuts through is: "what covariate, and how correlated is it with the outcome?" Same principle, different covariate.
The (1 − ρ²) factor is the key lever. If your covariate has a correlation of 0.7 with the outcome, ρ² = 0.49, so you reduce variance by 49% and 51% remains. A correlation of 0.9 gives you 81% variance reduction. This is why choosing a strongly correlated covariate matters much more than which specific method you use.
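The relationship between ρ and the remaining variance can be checked on simulated data. The sketch below is illustrative only: the data are synthetic, the function name `cuped_adjust` is hypothetical, and the adjustment coefficient is estimated on the simulated sample itself for simplicity. With a true correlation of 0.7, the share of variance remaining should land near 1 − 0.7² = 0.51, meaning about 49% is removed.

```python
import numpy as np


def cuped_adjust(y, x):
    """Remove the linear effect of covariate x from outcome y.

    theta is the OLS slope of y on x; centering x keeps the mean of y unchanged.
    """
    theta = np.cov(y, x)[0, 1] / np.var(x, ddof=1)
    return y - theta * (x - x.mean())


# Synthetic data with a known correlation between covariate and outcome.
rng = np.random.default_rng(0)
n = 200_000
rho = 0.7
x = rng.normal(size=n)                                   # pre-experiment covariate
y = rho * x + np.sqrt(1 - rho**2) * rng.normal(size=n)   # outcome with corr(x, y) ~ rho

y_adj = cuped_adjust(y, x)

sample_rho = np.corrcoef(y, x)[0, 1]
remaining = y_adj.var(ddof=1) / y.var(ddof=1)

print(f"sample correlation:        {sample_rho:.3f}")
print(f"share of variance left:    {remaining:.3f}")
print(f"1 - sample correlation^2:  {1 - sample_rho**2:.3f}")
```

The last two printed values agree because, for an in-sample least squares slope, the remaining share of variance equals one minus the squared sample correlation.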
How much variance reduction should you expect?
The answer depends on how stable the metric is for your users over time—specifically, how well past behavior predicts future behavior.
At Spotify, for behavioral metrics with high temporal correlation—such as listening minutes or streams per user—variance reduction of 50-80% is common.
For sparser metrics like purchase conversion or binary activation outcomes, reductions of 20-30% are more typical. Your results will depend on how stable the metric is for your specific user base and time horizon.
High temporal correlation, strong variance reduction: A user who streamed 400 minutes last week is very likely to stream a similar amount next week. Using last week's streaming minutes as the CUPED covariate gives a high ρ, which translates to large variance reduction. The experiment reaches the same statistical power in substantially less time, or detects a smaller effect with the same sample.
Low temporal correlation, modest variance reduction: Conversion to a paid plan is largely a one-time event, so a user's pre-experiment conversion status or activity says relatively little about whether they will convert during the experiment. The covariate has low predictive power, ρ is small, and the variance reduction is correspondingly modest.
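To make the "same power in less time" point concrete, here is a rough sketch of how the required sample size scales with ρ. It assumes that, for a fixed effect size, significance level, and power, the required sample size is proportional to the metric's variance. The function name `relative_sample_size` and the example correlations are hypothetical, chosen only for illustration.

```python
def relative_sample_size(rho: float) -> float:
    """Fraction of the unadjusted sample size needed after regression adjustment.

    Assumes required sample size is proportional to variance, so the
    (1 - rho^2) variance factor carries over directly to sample size.
    """
    if not -1.0 <= rho <= 1.0:
        raise ValueError("rho must be between -1 and 1")
    return 1.0 - rho**2


# Illustrative correlations, from a weakly to a strongly predictive covariate.
for rho in (0.3, 0.5, 0.7, 0.9):
    share = relative_sample_size(rho)
    print(f"rho = {rho:.1f}: about {share:.0%} of the unadjusted sample size")
```

Under this assumption, a weakly correlated covariate saves only a few percent of the sample, while a strongly correlated one cuts it to a fraction.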
The best covariate: the metric itself
In practice, the single most reliable covariate for most behavioral metrics is the pre-experiment measurement of the metric you're trying to reduce variance on. If you're measuring "streams per user" in the experiment, using "streams per user in the weeks before the experiment" as your covariate tends to be hard to beat.
The intuition is straightforward: past behavior is the best predictor of future behavior. A user's pre-experiment streaming behavior reflects their baseline preferences, habits, and engagement level far better than any demographic or derived feature. This gives a high ρ, which translates directly into large variance reduction.
This simple choice, the pre-experiment metric itself, is exactly what CUPED uses. Even with sophisticated feature engineering or ML-predicted outcomes, the extra variance reduction you can squeeze out beyond it is limited: at most a further 29% narrowing of confidence intervals (Ting and Hung, 2023). More complex covariates can still be worth exploring, but the gains over this default are bounded.
When in doubt, start with the pre-experiment measurement of your metric as the covariate. It requires no feature engineering, is easy to explain and audit, and performs well empirically. Move to more complex covariates only if there's a specific reason to expect they'll do better.
Outlier treatment: cap versus winsorize
Variance reduction via regression adjustment addresses noise from natural behavioral variation. But a separate problem is outliers: a small number of extreme users can dominate metric variance and distort your estimates even after regression adjustment.
Two common approaches exist for handling this.
Cap
Capping sets an absolute maximum value for the metric. For example, you might cap daily streams at 500. Any user who streamed more than 500 times in a day is treated as though they streamed exactly 500.
The advantage of capping is that the threshold is fixed, predictable, and consistent across experiments. If your team agrees that 500 streams per day is the cap, every experiment that uses this metric applies the same rule, regardless of the population being tested or when the experiment runs.
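A minimal sketch of capping, using the 500-streams-per-day threshold from the example above. The function name `cap_metric` and the input values are made up for illustration.

```python
import numpy as np

DAILY_STREAMS_CAP = 500  # fixed, agreed-upon threshold from the example in the text


def cap_metric(values, cap=DAILY_STREAMS_CAP):
    """Replace every value above `cap` with `cap`; leave other values unchanged."""
    return np.minimum(np.asarray(values, dtype=float), cap)


# Made-up daily stream counts, including one extreme user.
daily_streams = [12, 80, 499, 500, 2300]
capped = cap_metric(daily_streams)
print(capped)
```

Because the threshold is a constant rather than something computed from the data, the same input value always maps to the same output, whichever experiment it appears in.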
Winsorize
Winsorizing is conceptually similar but uses a percentile-derived threshold rather than a fixed value. You might winsorize at the ninety-ninth percentile, replacing any value above that percentile with the percentile value itself.
The problem with winsorizing is that the threshold is a function of the sample. Different experiments targeting different user populations will produce different capping points—and those differences are unpredictable and non-comparable. An experiment targeting heavy users might winsorize at 900 streams per day; one targeting casual users might winsorize at 80. These are not the same metric, even if the winsorizing rule is nominally identical.
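The sample dependence can be illustrated with synthetic data. In the sketch below, the two populations are drawn from lognormal distributions with arbitrary parameters, and the function name `winsorize_upper` is hypothetical. The same 99th-percentile rule yields very different thresholds for the two groups.

```python
import numpy as np


def winsorize_upper(values, percentile=99.0):
    """Clip values above the given sample percentile.

    Returns the clipped array and the threshold that was used.
    """
    values = np.asarray(values, dtype=float)
    threshold = np.percentile(values, percentile)
    return np.minimum(values, threshold), threshold


# Two synthetic populations; the distribution parameters are arbitrary.
rng = np.random.default_rng(1)
heavy_users = rng.lognormal(mean=5.0, sigma=0.6, size=50_000)
casual_users = rng.lognormal(mean=2.5, sigma=0.6, size=50_000)

heavy_clipped, heavy_threshold = winsorize_upper(heavy_users)
casual_clipped, casual_threshold = winsorize_upper(casual_users)

print(f"99th-percentile threshold, heavy users:  {heavy_threshold:.0f}")
print(f"99th-percentile threshold, casual users: {casual_threshold:.0f}")
```

The rule is identical in both calls, but the clipping point is a property of each sample, so the two winsorized metrics are not on a common footing.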
Prefer capping over winsorizing for metrics used consistently across experiments. An absolute cap is predictable, stable, and easy to reason about when comparing results across teams and over time. Reserve winsorizing for exploratory analysis where cross-experiment comparability is not a concern.
Both approaches involve a trade-off: you lose some information about extreme users in exchange for lower variance and more reliable estimates. The key question is whether the extreme values reflect genuine user behavior you want to capture, or noise and edge cases you'd rather control for. In most experimentation contexts, the latter is more common—an extreme outlier is rarely the user your feature change is targeting.
What do CUPED, CUPAC, and similar variance reduction techniques have in common?
If the correlation between your covariate and outcome is ρ = 0.8, by approximately what factor does regression adjustment reduce variance?
Why is capping generally preferred over winsorizing for metrics used across multiple experiments?
Notes for nerds
Variance reduction for ratio metrics. In Confidence, variance reduction is applied to all metric types—including ratio metrics—by using the method introduced in Ying Jin and Shan Ba (2021), which extends regression adjustment to ratio metrics directly. This means you don't need to pre-aggregate ratio metrics to the user level before applying variance reduction; the platform handles the joint estimation of numerator and denominator covariates automatically. For reference, the delta method variance formula and the general framework for ratio metrics in online experimentation are covered in: Deng, A., Lu, J., & Wang, S. (2018). "Applying the Delta Method in Metric Analytics: A Practical Guide with Novel Ideas." Proceedings of KDD 2018.
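For orientation, the standard first-order delta method approximation for the variance of a ratio of sample means is sketched below. The notation is introduced here: Y is the per-unit numerator, Z the per-unit denominator, n the number of independent units, μ and σ² their means and variances, and σ_YZ their covariance. This is the textbook form, not necessarily the exact expression Confidence implements.

```latex
\operatorname{Var}\!\left(\frac{\bar{Y}}{\bar{Z}}\right)
\approx \frac{1}{n\,\mu_Z^{2}}
\left(
  \sigma_Y^{2}
  - 2\,\frac{\mu_Y}{\mu_Z}\,\sigma_{YZ}
  + \frac{\mu_Y^{2}}{\mu_Z^{2}}\,\sigma_Z^{2}
\right)
```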
Covariate selection and rerandomization. Schultzberg and Johansson (2020) examine using historical data to predict experimental outcomes and using those predicted outcomes as covariates. This is the same covariate construction idea as CUPAC, applied in a rerandomization context. A related result from Li and Ding (2020) shows that Mahalanobis-distance rerandomization is asymptotically equivalent to regression adjustment using the same covariates. Together, these papers establish a clean theoretical bridge between design-based variance reduction (rerandomization) and analysis-based variance reduction (CUPED/ANCOVA): the two approaches converge when they use the same covariates.