Lesson 8: Variance reduction

Summary

Understanding variance reduction is essential for metric selection, not just statistics. A metric with high raw variance might still be the best choice if it has a strongly correlated covariate—because variance reduction can bring its effective variance well below that of a seemingly simpler metric. This lesson explains how regression adjustment works, what drives how much reduction you get, and how to handle outliers so you can make smarter metric choices.

Why metric selection and variance reduction are inseparable

When you choose a metric, raw variance is only half the picture. What matters for your experiment's power is effective variance—variance after applying regression adjustment. A continuous metric like total streams per user may look noisy in isolation, but if user behavior is stable over time, a pre-experiment covariate will absorb most of that noise. The result can be a far more sensitive metric than a binary alternative that seemed cleaner on the surface.

This means you can't evaluate metrics without understanding variance reduction, and you can't apply variance reduction thoughtfully without understanding which metrics it works well for. The two decisions are made together.

CUPED, CUPAC, and their relatives are all regression adjustment

CUPED, CUPAC, and every other branded variance reduction technique in online experimentation are fundamentally the same thing: regression adjustment. You regress the outcome on a pre-experiment covariate and analyze the adjusted outcome (the residuals). This multiplies the variance by the factor (1 − ρ²), where ρ is the correlation between the covariate and the outcome. In other words, a fraction ρ² of the variance is removed.
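As a minimal sketch of the adjustment described above (not the Confidence implementation), the snippet below simulates a pre-experiment covariate and an outcome, applies the adjustment, and compares the variance ratio with 1 − ρ². The function names, the simulated data, and the choice to estimate the coefficient `theta` from the same sample when none is supplied are all assumptions made for illustration. A coefficient estimated from pre-experiment data can be passed in instead.

```python
import random
from statistics import fmean


def sample_cov(a, b):
    """Unbiased sample covariance; sample_cov(a, a) is the sample variance."""
    mean_a, mean_b = fmean(a), fmean(b)
    return sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b)) / (len(a) - 1)


def cuped_adjust(y, x, theta=None):
    """Return y - theta * (x - mean(x)) for outcome y and covariate x."""
    if theta is None:
        # Assumption for this sketch: use the OLS slope Cov(x, y) / Var(x)
        # computed on the sample itself.
        theta = sample_cov(x, y) / sample_cov(x, x)
    x_mean = fmean(x)
    return [yi - theta * (xi - x_mean) for yi, xi in zip(y, x)]


# Hypothetical data: x is the pre-experiment metric, y the in-experiment metric.
rng = random.Random(0)
n = 20_000
x = [rng.gauss(100.0, 20.0) for _ in range(n)]
y = [0.8 * xi + rng.gauss(0.0, 12.0) for xi in x]

y_adj = cuped_adjust(y, x)
rho = sample_cov(x, y) / (sample_cov(x, x) * sample_cov(y, y)) ** 0.5
variance_ratio = sample_cov(y_adj, y_adj) / sample_cov(y, y)

print(f"rho = {rho:.3f}")
print(f"variance ratio = {variance_ratio:.3f}, 1 - rho^2 = {1 - rho ** 2:.3f}")
```

Centering the covariate keeps the mean of the adjusted outcome equal to the mean of the raw outcome, so only the spread changes.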

The statistical principle goes back decades. The efficiency gains from adjusting for pre-treatment covariates were formalized by Cochran (1957) under the name analysis of covariance (ANCOVA), building on the potential outcomes framework introduced by Neyman (1923). What the 2013 CUPED paper genuinely contributed was adapting these classical results to online A/B testing at scale, extending them to ratio metrics, and adding a conceptually important insight: by estimating the adjustment coefficient from pre-experiment data rather than the experimental sample, the adjusted outcome is unbiased without requiring any modeling assumption about the covariate-outcome relationship. Because the pre-experiment period cannot be influenced by treatment, the adjustment is valid by design.

CUPAC, for instance, is CUPED with an ML-predicted outcome as the covariate instead of the raw pre-experiment metric—useful when the simple covariate is weakly correlated. Most subsequent methods follow the same pattern: a different covariate choice within the same regression framework. When you see a new acronym, the question that cuts through is: "what covariate, and how correlated is it with the outcome?" Same principle, different covariate.

Note

The (1 − ρ²) factor is the key lever: it is the share of variance that remains after adjustment, so the reduction itself is ρ². If your covariate has a correlation of 0.7 with the outcome, you reduce variance by 49%. A correlation of 0.9 gives you 81% variance reduction. This is why choosing a strongly correlated covariate matters much more than which specific method you use.
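For readers who want to see where the factor comes from, here is a short derivation. The symbols Y (outcome), X (covariate), and θ (adjustment coefficient) are notation introduced here for illustration, not taken from the lesson.

```latex
\begin{aligned}
\operatorname{Var}(Y - \theta X)
  &= \operatorname{Var}(Y) - 2\theta \operatorname{Cov}(X, Y) + \theta^2 \operatorname{Var}(X), \\
\theta^{*}
  &= \frac{\operatorname{Cov}(X, Y)}{\operatorname{Var}(X)}
  \quad \text{(the value of } \theta \text{ that minimizes the variance)}, \\
\operatorname{Var}(Y - \theta^{*} X)
  &= \operatorname{Var}(Y) - \frac{\operatorname{Cov}(X, Y)^2}{\operatorname{Var}(X)}
   = (1 - \rho^2)\operatorname{Var}(Y).
\end{aligned}
```

For instance, ρ = 0.5 leaves 1 − 0.25 = 0.75 of the original variance, which is a 25% reduction.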

How much variance reduction should you expect?

The answer depends on how stable the metric is for your users over time—specifically, how well past behavior predicts future behavior.

At Spotify, for behavioral metrics with high temporal correlation—such as listening minutes or streams per user—variance reduction of 50-80% is common.

For sparser metrics like purchase conversion or binary activation outcomes, reductions of 20-30% are more typical. Your results will depend on how stable the metric is for your specific user base and time horizon.

Example

High temporal correlation—strong variance reduction: A user who streamed 400 minutes last week is very likely to stream a similar amount next week. Using last week's streaming minutes as the CUPED covariate gives a high ρ, which translates to large variance reduction. The experiment reaches the same statistical power in substantially less time—or detects a smaller effect with the same sample.

Low temporal correlation—modest variance reduction: Whether a user converts to a paid plan this week tells you relatively little about whether they'll convert again next week (most users either have or haven't converted). The covariate has low predictive power, ρ is small, and the variance reduction is correspondingly modest.
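To make "less time" concrete, here is a rough sketch of how the correlation translates into sample size and confidence interval width. It assumes a standard z-test setting in which, for fixed alpha, power, and minimum detectable effect, the required sample size is proportional to the metric's variance. The function names and the example numbers are hypothetical.

```python
import math


def variance_factor(rho):
    """Share of variance left after regression adjustment: 1 - rho^2."""
    if not -1.0 <= rho <= 1.0:
        raise ValueError("rho must be between -1 and 1")
    return 1.0 - rho ** 2


def adjusted_sample_size(n_unadjusted, rho):
    """Required sample size after adjustment, assuming n scales with variance."""
    # Rounding first avoids an off-by-one from floating-point noise.
    return math.ceil(round(n_unadjusted * variance_factor(rho), 6))


def ci_width_factor(rho):
    """Relative confidence interval width, which scales with the standard error."""
    return math.sqrt(variance_factor(rho))


# Hypothetical baseline: 100,000 users needed without variance reduction.
for rho in (0.3, 0.8):
    n_adj = adjusted_sample_size(100_000, rho)
    print(f"rho={rho}: about {n_adj:,} users, CI width x{ci_width_factor(rho):.2f}")
```

Under these assumptions, a covariate with ρ = 0.8 cuts the required sample to about 36% of the unadjusted figure, while ρ = 0.3 only trims it to about 91%.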

The best covariate: the metric itself

In practice, the single most reliable covariate for most behavioral metrics is the pre-experiment measurement of the metric you're trying to reduce variance on. If you're measuring "streams per user" in the experiment, using "streams per user in the weeks before the experiment" as your covariate tends to be hard to beat.

The intuition is straightforward: past behavior is the best predictor of future behavior. A user's pre-experiment streaming behavior reflects their baseline preferences, habits, and engagement level far better than any demographic or derived feature. This gives a high ρ, which translates directly into large variance reduction.

This simple choice, the pre-experiment metric itself, is exactly what CUPED uses. Even with sophisticated feature engineering or ML-predicted outcomes, the extra variance reduction you can squeeze out beyond it is limited: at most a further 29% narrowing of confidence intervals (Ting and Hung, 2023). More complex covariates can still be worth exploring, but the pre-experiment metric is a strong default.

Recommendation

When in doubt, start with the pre-experiment measurement of your metric as the covariate. It requires no feature engineering, is easy to explain and audit, and performs well empirically. Move to more complex covariates only if there's a specific reason to expect they'll do better.

Outlier treatment: cap versus winsorize

Variance reduction via regression adjustment addresses noise from natural behavioral variation. But a separate problem is outliers: a small number of extreme users can dominate metric variance and distort your estimates even after regression adjustment.

Two common approaches exist for handling this.

Cap

Capping sets an absolute maximum value for the metric. For example, you might cap daily streams at 500. Any user who streamed more than 500 times in a day is treated as though they streamed exactly 500.

The advantage of capping is that the threshold is fixed, predictable, and consistent across experiments. If your team agrees that 500 streams per day is the cap, every experiment that uses this metric applies the same rule, regardless of the population being tested or when the experiment runs.
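A minimal sketch of capping, using the 500 streams per day threshold from the example above. The constant name, the function name, and the sample values are made up for illustration.

```python
DAILY_STREAMS_CAP = 500  # fixed threshold, agreed on once and reused everywhere


def cap(values, upper=DAILY_STREAMS_CAP):
    """Replace every value above the fixed threshold with the threshold."""
    return [min(value, upper) for value in values]


# Hypothetical daily stream counts, including one extreme user.
daily_streams = [3, 40, 120, 500, 750, 12_000]
capped = cap(daily_streams)
print(capped)  # [3, 40, 120, 500, 500, 500]
```

Because the threshold does not depend on the data, the same input value always maps to the same output value, whichever experiment it appears in.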

Winsorize

Winsorizing is conceptually similar but uses a percentile-derived threshold rather than a fixed value. You might winsorize at the ninety-ninth percentile, replacing any value above that percentile with the percentile value itself.

The problem with winsorizing is that the threshold is a function of the sample. Different experiments targeting different user populations will produce different capping points—and those differences are unpredictable and non-comparable. An experiment targeting heavy users might winsorize at 900 streams per day; one targeting casual users might winsorize at 80. These are not the same metric, even if the winsorizing rule is nominally identical.
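The sample dependence can be seen in a small simulation. This is an illustrative sketch only: the two populations are drawn from made-up lognormal distributions, and `winsorize_upper` is a hypothetical helper that clips the upper tail at a percentile computed from the data it is given.

```python
import random
from statistics import quantiles


def winsorize_upper(values, percentile=99):
    """Clip values above the given integer percentile (1-99) of this sample."""
    # quantiles(..., n=100) returns the 99 cut points for percentiles 1..99.
    threshold = quantiles(values, n=100)[percentile - 1]
    return [min(value, threshold) for value in values], threshold


# Hypothetical populations with very different typical activity levels.
rng = random.Random(1)
heavy_users = [rng.lognormvariate(5.0, 0.8) for _ in range(20_000)]
casual_users = [rng.lognormvariate(2.5, 0.8) for _ in range(20_000)]

heavy_winsorized, heavy_threshold = winsorize_upper(heavy_users)
casual_winsorized, casual_threshold = winsorize_upper(casual_users)

print(f"99th percentile, heavy users:  {heavy_threshold:.0f}")
print(f"99th percentile, casual users: {casual_threshold:.0f}")
```

The rule is identical in both calls, yet the resulting thresholds differ by roughly an order of magnitude, so the two winsorized metrics are not directly comparable.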

Recommendation

Prefer capping over winsorizing for metrics used consistently across experiments. An absolute cap is predictable, stable, and easy to reason about when comparing results across teams and over time. Reserve winsorizing for exploratory analysis where cross-experiment comparability is not a concern.

Both approaches involve a trade-off: you lose some information about extreme users in exchange for lower variance and more reliable estimates. The key question is whether the extreme values reflect genuine user behavior you want to capture, or noise and edge cases you'd rather control for. In most experimentation contexts, the latter is more common—an extreme outlier is rarely the user your feature change is targeting.

Reader exercise

What do CUPED, CUPAC, and similar variance reduction techniques have in common?

Reader exercise

If the correlation between your covariate and outcome is ρ = 0.8, by approximately what factor does regression adjustment reduce variance?

Reader exercise

Why is capping generally preferred over winsorizing for metrics used across multiple experiments?

Notes for nerds

Variance reduction for ratio metrics. In Confidence, variance reduction is applied to all metric types, including ratio metrics, by using the method introduced in Jin and Ba (2021), which extends regression adjustment to ratio metrics directly. This means you don't need to pre-aggregate ratio metrics to the user level before applying variance reduction; the platform handles the joint estimation of numerator and denominator covariates automatically. For reference, the delta method variance formula and the general framework for ratio metrics in online experimentation are covered in: Deng, A., Knoblich, U., & Lu, J. (2018). "Applying the Delta Method in Metric Analytics: A Practical Guide with Novel Ideas." Proceedings of KDD 2018.
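For orientation, the first-order delta method approximation for the variance of a ratio of sample means is commonly written as below. The notation is introduced here for illustration: the ratio metric is the mean of Y divided by the mean of X over n independent units, with means μ, variances σ², and covariance σ_XY. This is the textbook form, not necessarily the exact estimator Confidence uses.

```latex
\operatorname{Var}\!\left(\frac{\bar{Y}}{\bar{X}}\right)
  \approx \frac{1}{n}\,\frac{1}{\mu_X^{2}}
  \left[
    \sigma_Y^{2}
    - 2\,\frac{\mu_Y}{\mu_X}\,\sigma_{XY}
    + \frac{\mu_Y^{2}}{\mu_X^{2}}\,\sigma_X^{2}
  \right]
```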

Covariate selection and rerandomization. Schultzberg and Johansson (2020) examine using historical data to predict experimental outcomes and using those predicted outcomes as covariates. This is the same covariate construction idea as CUPAC, applied in a rerandomization context. A related result from Li and Ding (2020) shows that Mahalanobis-distance rerandomization is asymptotically equivalent to regression adjustment using the same covariates. Together, these papers establish a clean theoretical bridge between design-based variance reduction (rerandomization) and analysis-based variance reduction (CUPED/ANCOVA): the two approaches converge when they use the same covariates.


© Copyright 2026. All rights reserved.
