Statistical Methods

What is CUPED?

CUPED (Controlled-experiment Using Pre-Existing Data) is a variance reduction method that uses data from before an experiment started to remove predictable noise from metric estimates, producing tighter confidence intervals without requiring additional traffic. It was introduced by Deng et al. at Microsoft in 2013 and has since become the standard variance reduction technique in online experimentation.

The core insight is simple: if you know how a user behaved before the experiment, you can subtract the predictable part of their behavior from the post-experiment metric. What remains is a cleaner estimate of the treatment effect. In practice, CUPED reduces metric variance by 20-50%; because the required sample size scales linearly with variance, that is equivalent to running the experiment with 25-100% more users at no additional cost. At Spotify, CUPED is applied by default to every experiment analyzed in Confidence, across 10,000+ experiments per year.

How does CUPED work?

CUPED adjusts each user's outcome metric using a covariate: typically the same metric measured during a pre-experiment period. If you're measuring "streams per user per day" during the experiment, the covariate would be "streams per user per day" during the two weeks before the experiment started.

The adjustment works through regression. For each user, CUPED estimates how much of their post-experiment metric is explained by their pre-experiment behavior and subtracts that component. A user who streamed 100 songs per day before the experiment and 102 during the experiment contributes a residual of roughly +2, adjusted for the population-level relationship between pre and post behavior. A user who streamed 5 songs before and 7 during contributes a similar residual. The treatment effect estimate comes from comparing these residuals across control and treatment groups.
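In code, the classic adjustment is only a few lines. Here is a minimal sketch on synthetic data (an illustration, not Confidence's implementation): the coefficient theta is the pooled OLS slope of the post-period metric on the pre-period covariate, and each user's adjusted value subtracts the part of their outcome that theta predicts.

```python
import numpy as np

def cuped_adjust(post, pre):
    """Classic CUPED: Y_adj = Y - theta * (X - mean(X)), where theta is
    the pooled OLS slope of the outcome on the pre-experiment covariate."""
    theta = np.cov(pre, post, ddof=1)[0, 1] / np.var(pre, ddof=1)
    return post - theta * (pre - pre.mean())

# Synthetic users: pre-experiment streams strongly predict post-experiment streams.
rng = np.random.default_rng(0)
pre = rng.normal(50, 20, size=10_000)
post = pre + rng.normal(2, 5, size=10_000)  # post tracks pre, shifted by ~2

adjusted = cuped_adjust(post, pre)
# The adjustment leaves the mean untouched but removes most of the variance.
print(post.mean(), adjusted.mean())
print(post.var(ddof=1), adjusted.var(ddof=1))
```

Subtracting the covariate's mean keeps the adjusted metric on the same scale as the original, so the effect estimate is unchanged in expectation; only its variance shrinks.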

The variance reduction depends on the correlation between the covariate and the outcome. When the correlation is high (which it typically is for behavioral metrics measured over consecutive time windows), the variance reduction can exceed 50%. When the correlation is low, CUPED still helps but the gains are smaller.
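The relationship is exact: the adjusted variance is (1 - rho^2) times the original, where rho is the correlation between covariate and outcome. A quick numerical check with synthetic data (hypothetical numbers, constructed so rho is about 0.7):

```python
import numpy as np

rng = np.random.default_rng(1)
pre = rng.normal(0, 1, size=100_000)
# Build an outcome whose correlation with the covariate is ~0.7.
post = 0.7 * pre + np.sqrt(1 - 0.7 ** 2) * rng.normal(0, 1, size=100_000)

rho = np.corrcoef(pre, post)[0, 1]
theta = np.cov(pre, post, ddof=1)[0, 1] / np.var(pre, ddof=1)
adjusted = post - theta * (pre - pre.mean())

# Fraction of variance removed equals rho^2: a correlation of 0.7
# halves the variance (0.7^2 = 0.49).
reduction = 1 - adjusted.var(ddof=1) / post.var(ddof=1)
print(rho ** 2, reduction)
```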

What does Confidence do differently with CUPED?

Confidence uses the fully interacted regression estimator of Negi and Wooldridge (2021) rather than the original CUPED formulation. The difference matters for precision.

The original CUPED paper uses a single control variate adjustment with a population-level coefficient. The Negi-Wooldridge approach runs a full regression of the outcome on the covariate, interacted with treatment assignment. This allows the relationship between pre-experiment behavior and post-experiment behavior to differ between control and treatment groups. In practice, the Negi-Wooldridge estimator is weakly more efficient: it's always at least as precise as original CUPED, and strictly more precise when the covariate-outcome relationship differs across groups.
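A sketch of the interacted estimator on synthetic data (a simplified illustration, not Confidence's production code): regress the outcome on an intercept, the treatment indicator, the centered covariate, and their interaction. With the covariate centered, the coefficient on the treatment indicator is the average treatment effect, and the interaction term lets the pre/post slope differ across groups.

```python
import numpy as np

def interacted_ate(y, x, d):
    """Fully interacted regression adjustment (Negi-Wooldridge style).

    y: outcome, x: pre-experiment covariate, d: 0/1 treatment indicator.
    Centering x makes the coefficient on d the average treatment effect,
    even when the covariate-outcome slope differs between groups."""
    xc = x - x.mean()
    X = np.column_stack([np.ones_like(y), d, xc, d * xc])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1]  # coefficient on the treatment indicator

rng = np.random.default_rng(2)
n = 20_000
x = rng.normal(50, 20, size=n)
d = rng.integers(0, 2, size=n).astype(float)
# True effect of +2, with a slope that differs across groups.
y = 2.0 * d + 0.9 * x + 0.1 * d * (x - x.mean()) + rng.normal(0, 5, size=n)

print(interacted_ate(y, x, d))  # close to 2.0
```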

This matters at Spotify's scale. A 2-3% improvement in precision across thousands of experiments translates into meaningfully shorter runtimes or the ability to detect smaller effects. The improvement is free in computational terms, since the regression runs inside the data warehouse alongside the rest of the analysis.

Confidence also integrates CUPED with the full statistical stack. Power calculations account for the expected variance reduction, so sample size estimates reflect what the analysis will actually do. Sequential testing boundaries are computed on the CUPED-adjusted statistics. Multiple testing corrections apply to the adjusted confidence intervals. This full integration is a deliberate design choice: applying CUPED to the analysis without adjusting the power calculation leaves teams with sample size estimates that don't match reality.
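The power integration is conceptually a rescaling: if CUPED is expected to remove a fraction rho^2 of the variance, the required sample size shrinks by the same factor. A hedged sketch of the arithmetic using the standard two-sample z-test formula (not Confidence's sequential-testing power engine):

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_group(mde, sigma, rho=0.0, alpha=0.05, power=0.8):
    """Per-group sample size for a two-sided two-sample z-test,
    rescaled by the expected CUPED variance reduction (1 - rho^2).

    mde: minimum detectable effect (absolute), sigma: outcome std dev,
    rho: expected correlation between covariate and outcome."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    n = 2 * ((z_alpha + z_beta) * sigma / mde) ** 2
    return ceil(n * (1 - rho ** 2))

print(sample_size_per_group(0.5, 10.0))            # without CUPED
print(sample_size_per_group(0.5, 10.0, rho=0.7))   # rho = 0.7 cuts n roughly in half
```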

When does CUPED not help?

CUPED requires a pre-experiment covariate that correlates with the outcome metric. Two scenarios limit its effectiveness.

First, new metrics with no history. If your experiment introduces an entirely new feature and you're measuring engagement with that feature, there's no pre-experiment covariate to use. The metric didn't exist before the experiment started.

Second, metrics with low temporal autocorrelation. If user behavior on your metric is essentially random from week to week, the pre-experiment value won't predict the post-experiment value well, and the variance reduction will be small.

For most product experimentation metrics (engagement, revenue, retention, conversion rates), neither limitation applies. Users' behavior last week strongly predicts their behavior this week, which is exactly the structure CUPED exploits.