The difference-in-means estimator is the simplest and most common estimator of the treatment effect in an A/B test. It computes the average outcome in the treatment group minus the average outcome in the control group. When assignment is random, this difference is an unbiased estimate of the average causal effect of the change being tested.
Its simplicity is its strength. The difference-in-means estimator doesn't require modeling assumptions, distributional assumptions, or tuning parameters. It works for any metric type: continuous (revenue per user), binary (converted or not), count (sessions per week). At Spotify, where Confidence analyzes over 10,000 experiments per year, the difference-in-means estimator is the starting point for every experiment's analysis. More sophisticated methods like CUPED build on top of it by reducing the variance of the estimate, but the core quantity being estimated remains the same.
How is the difference-in-means estimator calculated?
The calculation is direct. Let Y_t be the sample mean of the metric in the treatment group and Y_c be the sample mean in the control group. The estimated treatment effect is:
tau_hat = Y_t - Y_c
The standard error of this estimate is:
SE = sqrt(s_t^2/n_t + s_c^2/n_c)
where s_t^2 and s_c^2 are the sample variances in the treatment and control groups, and n_t and n_c are the sample sizes. The 95% confidence interval is tau_hat plus or minus 1.96 times SE.
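The two formulas can be sketched in a few lines of Python, using the sample variance (the n − 1 denominator). The data and function name below are illustrative, not Confidence's API:

```python
from math import sqrt
from statistics import mean, variance

def diff_in_means(treatment, control):
    """Difference-in-means estimate with a normal-approximation 95% CI."""
    n_t, n_c = len(treatment), len(control)
    tau_hat = mean(treatment) - mean(control)          # Y_t - Y_c
    se = sqrt(variance(treatment) / n_t + variance(control) / n_c)
    return tau_hat, se, (tau_hat - 1.96 * se, tau_hat + 1.96 * se)

# Made-up per-user streams-per-day values, just to exercise the formulas.
treatment = [12.1, 12.6, 12.4, 12.9, 12.0]
control = [11.8, 12.2, 11.9, 12.3, 12.1]
tau, se, (lo, hi) = diff_in_means(treatment, control)
```

Note that `statistics.variance` is the sample variance s^2, which is what the standard-error formula calls for.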
For a concrete example: if the treatment group (500,000 users) has an average of 12.3 streams per day and the control group (500,000 users) has an average of 12.0 streams per day, the difference-in-means estimate is +0.3 streams per day. The confidence interval tells you the range of plausible true effects, and the p-value tells you whether the difference is distinguishable from chance.
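Under the same normal approximation, the two-sided p-value follows directly from the estimate and its standard error. A minimal sketch (the standard error here is assumed for illustration, since the example above doesn't state one):

```python
from math import erfc, sqrt

def two_sided_p(tau_hat, se):
    """Two-sided p-value for tau_hat under the normal approximation."""
    return erfc(abs(tau_hat) / (se * sqrt(2)))

# An estimate exactly 1.96 standard errors from zero sits right at p = 0.05.
p = two_sided_p(1.96, 1.0)
```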
Why does randomization make it work?
The difference-in-means estimator is unbiased specifically because of random assignment. Without randomization, the groups might differ in ways that affect the outcome. If power users are more likely to end up in the treatment group, the treatment mean will be higher regardless of whether the treatment had any effect.
Random assignment ensures that, in expectation, the two groups are identical on every characteristic, both observed and unobserved. The only systematic difference is the treatment itself. This means the difference in sample means reflects the treatment effect, not pre-existing differences between the groups.
Sample ratio mismatch (SRM) checks verify that this property holds in practice. If the observed split deviates significantly from the intended ratio, something in the assignment or logging pipeline has introduced a systematic difference, and the difference-in-means estimate may be biased. Confidence runs SRM checks automatically for every experiment.
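One common form of SRM check is a chi-square goodness-of-fit test of the observed counts against the intended split. The sketch below implements that test for a two-group experiment (one degree of freedom); it is illustrative, not Confidence's internal implementation:

```python
from math import erfc, sqrt

def srm_p_value(n_t, n_c, expected_ratio=0.5):
    """Chi-square test (1 df) of observed counts vs. the intended split."""
    n = n_t + n_c
    exp_t, exp_c = n * expected_ratio, n * (1 - expected_ratio)
    chi2 = (n_t - exp_t) ** 2 / exp_t + (n_c - exp_c) ** 2 / exp_c
    # Survival function of a chi-square with 1 df: P(Z^2 > x) = erfc(sqrt(x/2)).
    return erfc(sqrt(chi2 / 2))

p_ok = srm_p_value(500_300, 499_700)   # tiny imbalance: no alarm
p_bad = srm_p_value(510_000, 490_000)  # 51/49 at this scale: clear SRM
```

SRM alerts typically use a very conservative threshold (e.g. p < 0.001), since at large sample sizes even small pipeline bugs produce vanishingly small p-values.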
How does CUPED improve on the difference-in-means estimator?
The difference-in-means estimator is unbiased but can be noisy. CUPED (Controlled-experiment Using Pre-Experiment Data) reduces that noise by adjusting for pre-experiment covariates.
Think of it this way: if a user streamed 100 songs per day before the experiment, you'd expect them to stream roughly 100 songs per day during the experiment regardless of treatment assignment. CUPED subtracts this predictable component, so the remaining variation more clearly reflects the treatment effect.
The CUPED-adjusted estimator is still a difference-in-means, but computed on the adjusted (residualized) metric rather than the raw metric. The adjustment reduces the variance of the estimator without introducing bias, which tightens the confidence interval. Confidence uses the Negi-Wooldridge 2021 full regression estimator, which allows the covariate adjustment to differ between treatment and control groups, making it weakly more efficient than the original CUPED formulation.
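The residualization step can be sketched as follows. This is the classic single-coefficient CUPED formulation (one pooled theta = Cov(Y, X) / Var(X)), not the Negi-Wooldridge full regression estimator mentioned above, and the simulated data is illustrative:

```python
import random
from statistics import mean

def cuped_adjust(y, x):
    """Residualize metric y against pre-experiment covariate x (pooled theta)."""
    y_bar, x_bar = mean(y), mean(x)
    theta = (sum((yi - y_bar) * (xi - x_bar) for yi, xi in zip(y, x))
             / sum((xi - x_bar) ** 2 for xi in x))
    # Centering x keeps the adjusted metric's mean equal to the raw mean,
    # so the difference-in-means on the adjusted metric stays unbiased.
    return [yi - theta * (xi - x_bar) for yi, xi in zip(y, x)]

random.seed(7)
pre = [random.gauss(100, 10) for _ in range(1000)]   # pre-period streams/day
post = [p + random.gauss(0, 2) for p in pre]         # in-experiment streams/day
adjusted = cuped_adjust(post, pre)
```

Because the pre-period covariate predicts most of the in-experiment variation in this simulation, the adjusted metric's variance is a small fraction of the raw metric's, while its mean is unchanged.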
The improvement is meaningful. For metrics with high temporal autocorrelation (which includes most behavioral product metrics), CUPED reduces variance by 20-50%. That's equivalent to running the experiment with 25-100% more users for free.
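The sample-size equivalence follows from the fact that the squared standard error scales as 1/n, so cutting variance by a fraction r is equivalent to multiplying the sample size by 1 / (1 − r):

```python
def equivalent_sample_multiplier(r):
    """Sample-size multiplier equivalent to a variance reduction of fraction r."""
    return 1.0 / (1.0 - r)

# 20% variance reduction ~ 1.25x the users; 50% ~ 2x (i.e., 25-100% more).
m_low = equivalent_sample_multiplier(0.2)
m_high = equivalent_sample_multiplier(0.5)
```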
What are the limitations?
The difference-in-means estimator assumes independent observations: each user contributes one observation (or one summary statistic) to the analysis. When users interact with each other (network effects, marketplace dynamics), the independence assumption breaks down, and the standard error underestimates the true uncertainty.
For most product A/B tests, where assignment is at the user level and the metric is computed per-user, independence holds well enough. When it doesn't, cluster-randomized designs (randomizing at the group or market level) with cluster-robust standard errors are the standard fix.
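The simplest version of the cluster-level fix is to collapse each cluster to its mean and run difference-in-means on the cluster means, so the independent unit of analysis matches the unit of randomization. A sketch with hypothetical market-level data (a regression with cluster-robust standard errors is the more general tool):

```python
from math import sqrt
from statistics import mean, variance

def cluster_diff_in_means(clusters_t, clusters_c):
    """Difference-in-means on per-cluster means; SE uses clusters as the unit."""
    m_t = [mean(c) for c in clusters_t]
    m_c = [mean(c) for c in clusters_c]
    tau = mean(m_t) - mean(m_c)
    se = sqrt(variance(m_t) / len(m_t) + variance(m_c) / len(m_c))
    return tau, se

# Hypothetical markets, each a list of per-user streams-per-day outcomes.
treated_markets = [[12.5, 12.1, 12.8], [11.9, 12.4], [12.6, 12.2, 12.0]]
control_markets = [[12.0, 11.8, 12.1], [11.7, 12.2], [11.9, 12.0, 11.6]]
tau, se = cluster_diff_in_means(treated_markets, control_markets)
```

With only a handful of clusters the standard error is driven by between-cluster variation, which is exactly the uncertainty the naive user-level formula would understate.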