Lesson 8: Variance reduction

Why metric selection and variance reduction are inseparable

When you choose a metric, raw variance is only half the picture. What matters for your experiment's power is effective variance—variance after applying regression adjustment. A continuous metric like total streams per user may look noisy in isolation, but if user behavior is stable over time, a pre-experiment covariate will absorb most of that noise. The result can be a far more sensitive metric than a binary alternative that seemed cleaner on the surface.

This means you can't evaluate metrics without understanding variance reduction, and you can't apply variance reduction thoughtfully without understanding which metrics it works well for. The two decisions are made together.

CUPED, CUPAC, and their relatives are all regression adjustment

CUPED, CUPAC, and every other branded variance reduction technique in online experimentation are fundamentally the same thing: regression adjustment. You regress a pre-experiment covariate out of the outcome and analyze the residuals. This multiplies the variance by a factor of (1 − ρ²), where ρ is the correlation between the covariate and the outcome; equivalently, it removes a fraction ρ² of the variance. A covariate with ρ = 0.8, for example, removes 64% of the variance.
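The adjustment can be sketched in a few lines of NumPy. This is a minimal illustration on simulated data, not any platform's implementation: the function name cuped_adjust, the simulated distributions, and the seed are all assumptions made for the example.

```python
import numpy as np

def cuped_adjust(y, x):
    """Return the regression-adjusted outcome y - theta * (x - mean(x))."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    # theta is the OLS slope of y on x: cov(x, y) / var(x)
    theta = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
    # Centering x keeps the mean of the adjusted outcome equal to mean(y)
    return y - theta * (x - x.mean())

# Simulated data (assumption): a stable per-user metric plus noise
rng = np.random.default_rng(0)
n = 10_000
x = rng.normal(100, 20, size=n)           # pre-experiment metric
y = 0.8 * x + rng.normal(0, 10, size=n)   # in-experiment metric

y_adj = cuped_adjust(y, x)
rho = np.corrcoef(x, y)[0, 1]
ratio = np.var(y_adj, ddof=1) / np.var(y, ddof=1)
print(f"variance ratio = {ratio:.3f}, 1 - rho^2 = {1 - rho**2:.3f}")
```

In a real experiment you would typically estimate theta on the pooled treatment and control data, then compare the group means of the adjusted outcome. The sample variance ratio matches 1 − ρ² computed from the sample correlation, which is the factor described above.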

The statistical principle goes back decades. The efficiency gains from adjusting for pre-treatment covariates were formalized by Cochran (1957) under the name analysis of covariance (ANCOVA), building on the potential outcomes framework introduced by Neyman (1923). What the 2013 CUPED paper genuinely contributed was adapting these classical results to online A/B testing at scale, extending them to ratio metrics, and adding a conceptually important insight: by estimating the adjustment coefficient from pre-experiment data rather than the experimental sample, the adjusted outcome is unbiased without requiring any modeling assumption about the covariate-outcome relationship. Because the pre-experiment period cannot be influenced by treatment, the adjustment is valid by design.

CUPAC, for instance, is CUPED with an ML-predicted outcome as the covariate instead of the raw pre-experiment metric—useful when the simple covariate is weakly correlated. Most subsequent methods follow the same pattern: a different covariate choice within the same regression framework. When you see a new acronym, the question that cuts through is: "what covariate, and how correlated is it with the outcome?" Same principle, different covariate.
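To make "same principle, different covariate" concrete, here is a hedged sketch of the CUPAC idea on simulated data. Everything in it is an assumption for illustration: the three features, their weights, and the use of ordinary least squares as a stand-in for whatever ML model you would actually train.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5_000

# Hypothetical pre-experiment features per user (illustrative only)
features = rng.normal(size=(n, 3))
true_w = np.array([2.0, -1.0, 0.5])

# Historical outcome, observed before the experiment; used only for training
y_hist = features @ true_w + rng.normal(0, 1, size=n)
# In-experiment outcome for the same users (no treatment effect simulated)
y = features @ true_w + rng.normal(0, 1, size=n)

# Step 1: fit a predictor on pre-experiment data only.
# Least squares stands in for an arbitrary ML model here.
X = np.column_stack([np.ones(n), features])
w_hat, *_ = np.linalg.lstsq(X, y_hist, rcond=None)

# Step 2: the model's prediction becomes the covariate
y_pred = X @ w_hat

# Step 3: ordinary regression adjustment with that covariate
theta = np.cov(y_pred, y, ddof=1)[0, 1] / np.var(y_pred, ddof=1)
y_adj = y - theta * (y_pred - y_pred.mean())

print(f"raw variance      = {np.var(y, ddof=1):.2f}")
print(f"adjusted variance = {np.var(y_adj, ddof=1):.2f}")
```

Only the construction of the covariate differs from plain CUPED; the last step is the same regression adjustment. The predictor must be trained on data that treatment could not have influenced.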

How much variance reduction should you expect?

The answer depends on how stable the metric is for your users over time—specifically, how well past behavior predicts future behavior.

At Spotify, for behavioral metrics with high temporal correlation—such as listening minutes or streams per user—variance reduction of 50-80% is common.

For sparser metrics like purchase conversion or binary activation outcomes, reductions of 20-30% are more typical. Treat these ranges as rough guides: the reduction you get depends on your specific user base and on the length of the pre-experiment and experiment windows.
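To translate these percentages into practical terms, recall that confidence interval width scales with the square root of the variance, and the sample size needed for a fixed power and effect size scales with the variance itself. The helper names below are hypothetical, and the sketch assumes the reduction applies equally to both groups.

```python
import math

def ci_width_ratio(variance_reduction):
    """Adjusted CI width as a fraction of the unadjusted width."""
    if not 0.0 <= variance_reduction < 1.0:
        raise ValueError("variance_reduction must be in [0, 1)")
    return math.sqrt(1.0 - variance_reduction)

def sample_size_ratio(variance_reduction):
    """Sample size needed after adjustment, as a fraction of the original."""
    if not 0.0 <= variance_reduction < 1.0:
        raise ValueError("variance_reduction must be in [0, 1)")
    return 1.0 - variance_reduction

for vr in (0.2, 0.3, 0.5, 0.8):
    rho = math.sqrt(vr)  # correlation needed to achieve this reduction
    print(
        f"reduction {vr:.0%}: rho = {rho:.2f}, "
        f"CI width x{ci_width_ratio(vr):.2f}, "
        f"sample size x{sample_size_ratio(vr):.2f}"
    )
```

A 50% variance reduction, for example, narrows confidence intervals by about 29% and halves the sample size needed for the same power.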

The best covariate: the metric itself

In practice, the single most reliable covariate for most behavioral metrics is the pre-experiment measurement of the metric you're trying to reduce variance on. If you're measuring "streams per user" in the experiment, using "streams per user in the weeks before the experiment" as your covariate tends to be hard to beat.

The intuition is straightforward: past behavior is the best predictor of future behavior. A user's pre-experiment streaming behavior reflects their baseline preferences, habits, and engagement level far better than any demographic or derived feature. This gives a high ρ, which translates directly into large variance reduction.

This simple choice, the pre-experiment metric itself, is exactly what CUPED uses. Even with sophisticated feature engineering or ML-predicted outcomes, the extra variance reduction you can squeeze out beyond it is limited: at most a further 29% narrowing of confidence intervals (Ting and Hung, 2023). More complex covariates can still be worth exploring, but the pre-experiment metric is a strong default that requires no feature engineering and is easy to explain and audit.

Outlier treatment: cap versus winsorize

Variance reduction via regression adjustment addresses noise from natural behavioral variation. But a separate problem is outliers: a small number of extreme users can dominate metric variance and distort your estimates even after regression adjustment.

Two common approaches exist for handling this.

Cap

Capping sets an absolute maximum value for the metric. For example, you might cap daily streams at 500. Any user who streamed more than 500 times in a day is treated as though they streamed exactly 500.

The advantage of capping is that the threshold is fixed, predictable, and consistent across experiments. If your team agrees that 500 streams per day is the cap, every experiment that uses this metric applies the same rule, regardless of the population being tested or when the experiment runs.
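A minimal sketch of capping, using the 500 streams per day threshold from the example above. The function name cap_metric and the sample values are assumptions for illustration.

```python
import numpy as np

DAILY_STREAMS_CAP = 500  # fixed threshold agreed on once, reused everywhere

def cap_metric(values, cap=DAILY_STREAMS_CAP):
    """Replace every value above the fixed cap with the cap itself."""
    return np.minimum(np.asarray(values, dtype=float), cap)

# Illustrative per-user daily stream counts, including two extreme users
daily_streams = np.array([3, 42, 120, 500, 730, 12_000])
capped = cap_metric(daily_streams)
print(capped.tolist())
```

Because the cap is a constant rather than something computed from the data, the same input value always maps to the same output, whichever experiment or population it comes from.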

Winsorize

Winsorizing is conceptually similar but uses a percentile-derived threshold rather than a fixed value. You might winsorize at the 99th percentile, replacing any value above that percentile with the percentile value itself.

The problem with winsorizing is that the threshold is a function of the sample. Different experiments targeting different user populations will produce different thresholds, and those differences are unpredictable and make results non-comparable. An experiment targeting heavy users might winsorize at 900 streams per day; one targeting casual users might winsorize at 80. These are not the same metric, even if the winsorizing rule is nominally identical.
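The sample dependence is easy to see in a small simulation. This is a sketch under stated assumptions: the function winsorize_upper is a hypothetical helper, and the two lognormal populations are invented stand-ins for heavy and casual users.

```python
import numpy as np

def winsorize_upper(values, percentile=99.0):
    """Clip values above the sample's own percentile; return (clipped, threshold)."""
    values = np.asarray(values, dtype=float)
    threshold = np.percentile(values, percentile)
    return np.minimum(values, threshold), threshold

rng = np.random.default_rng(2)
# Simulated daily streams for two different target populations (assumption)
heavy_users = rng.lognormal(mean=5.0, sigma=0.6, size=20_000)
casual_users = rng.lognormal(mean=2.5, sigma=0.6, size=20_000)

heavy_clipped, heavy_threshold = winsorize_upper(heavy_users)
casual_clipped, casual_threshold = winsorize_upper(casual_users)

print(f"heavy users:  99th percentile threshold = {heavy_threshold:.0f}")
print(f"casual users: 99th percentile threshold = {casual_threshold:.0f}")
```

The rule is identical in both calls, yet the effective threshold differs by roughly an order of magnitude because it is computed from each sample.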

Both approaches involve a trade-off: you lose some information about extreme users in exchange for lower variance and more reliable estimates. The key question is whether the extreme values reflect genuine user behavior you want to capture, or noise and edge cases you'd rather control for. In most experimentation contexts, the latter is more common—an extreme outlier is rarely the user your feature change is targeting.

Notes for nerds

Variance reduction for ratio metrics. In Confidence, variance reduction is applied to all metric types, including ratio metrics, using the method introduced by Ying Jin and Shan Ba (2021), which extends regression adjustment to ratio metrics directly. This means you don't need to pre-aggregate ratio metrics to the user level before applying variance reduction; the platform handles the joint estimation of numerator and denominator covariates automatically. For reference, the delta method variance formula and the general framework for ratio metrics in online experimentation are covered in: Deng, A., Knoblich, U., & Lu, J. (2018). "Applying the Delta Method in Metric Analytics: A Practical Guide with Novel Ideas." Proceedings of KDD 2018.
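For orientation, the standard first-order delta method variance of a ratio metric can be sketched as follows. This is the unadjusted textbook formula only; it is not the Jin and Ba variance reduction method and not Confidence's implementation. The function name and the simulated per-user data are assumptions.

```python
import numpy as np

def ratio_metric_variance(numerator, denominator):
    """Delta-method variance of R = mean(numerator) / mean(denominator).

    Inputs are per-user totals, one entry per randomization unit.
    """
    num = np.asarray(numerator, dtype=float)
    den = np.asarray(denominator, dtype=float)
    n = num.size
    mu_d = den.mean()
    r = num.mean() / mu_d
    var_n = num.var(ddof=1)
    var_d = den.var(ddof=1)
    cov_nd = np.cov(num, den, ddof=1)[0, 1]
    # First-order Taylor expansion of the ratio around the two means
    return (var_n - 2 * r * cov_nd + r**2 * var_d) / (n * mu_d**2)

# Simulated example (assumption): streams per session, aggregated per user
rng = np.random.default_rng(3)
sessions = rng.poisson(5, size=10_000) + 1
streams = sessions * 4 + rng.normal(0, 3, size=10_000)

ratio = streams.mean() / sessions.mean()
se = np.sqrt(ratio_metric_variance(streams, sessions))
print(f"streams per session = {ratio:.3f} (standard error {se:.4f})")
```

The covariance term is what separates this from naively treating the numerator and denominator as independent: users with more sessions also have more streams, and ignoring that would misstate the variance.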

Covariate selection and rerandomization. Schultzberg and Johansson (2020) examine using historical data to predict experimental outcomes and using those predicted outcomes as covariates: the same covariate construction idea as CUPAC, applied in a rerandomization context. A related result from Li and Ding (2020) shows that Mahalanobis-distance rerandomization is asymptotically equivalent to regression adjustment using the same covariates. Together, these papers establish a clean theoretical bridge between design-based variance reduction (rerandomization) and analysis-based variance reduction (CUPED/ANCOVA): the two approaches converge when they use the same covariates.