Statistical Methods

What is sample size?

Sample size is the number of experimental units (typically users) needed in an A/B test to detect a given effect with a specified level of confidence and power. Too few users and the experiment can't distinguish real effects from noise. Too many and you've used traffic that could have powered another test.

Getting sample size right is the most consequential pre-experiment decision. An experiment with inadequate sample size produces ambiguous results regardless of how good the hypothesis, implementation, or metrics are. At Spotify, where over 10,000 experiments run per year and 58 teams shared the mobile home screen for 520 experiments in 2025, sample size planning directly determines how many experiments the organization can run concurrently. Every test that runs longer than necessary, whether because it was undersized and had to be extended or because it was oversized for its MDE, consumes bandwidth that could have gone to another team's experiment.

How is sample size calculated?

Sample size depends on four parameters:

Significance level (alpha). The false positive rate you'll tolerate, typically 0.05.

Statistical power. The probability of detecting a real effect, typically 0.80 or higher.

Minimum detectable effect (MDE). The smallest change in the metric you want to be able to detect. Smaller MDEs require more users.

Metric variance. How much the metric varies across users. Higher variance means more noise, which means more users to cut through it.

For a two-sample z-test on a continuous metric, the per-group sample size is roughly:

n = (Z_alpha + Z_beta)^2 * 2 * sigma^2 / delta^2

where sigma^2 is the metric variance and delta is the MDE. The formula makes the trade-offs explicit: halving the MDE quadruples the required sample. Doubling the variance doubles it. These aren't independent knobs. They're constraints that bind against each other.
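As a sketch, the calculation is a few lines of Python; scipy's normal quantile function stands in for the Z terms, and the parameter values below are illustrative rather than recommendations.

    # Minimal sketch of the per-group sample size formula above.
    from scipy.stats import norm

    def sample_size_per_group(sigma, delta, alpha=0.05, power=0.80):
        """Approximate per-group n for a two-sample z-test on a continuous metric."""
        z_alpha = norm.ppf(1 - alpha / 2)   # two-sided significance level
        z_beta = norm.ppf(power)            # power = 1 - beta
        return 2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / delta ** 2

    # Example: metric with standard deviation 10, MDE of 0.5
    print(round(sample_size_per_group(sigma=10, delta=0.5)))   # ~6279 per group
    # Halving the MDE quadruples the requirement
    print(round(sample_size_per_group(sigma=10, delta=0.25)))  # ~25116 per group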

Confidence provides sample size calculators that account for the specific statistical method being used. Sequential testing methods like Group Sequential Tests require a larger maximum sample size than a fixed-horizon test (typically 20-30% more) because they spend some statistical budget on interim analyses. The calculator shows this overhead and lets teams decide whether the ability to stop early is worth the additional maximum sample.
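As a rough illustration of that overhead, the sketch below applies an assumed 25% inflation factor (a value inside the 20-30% range above) to the fixed-horizon sample from the earlier example; the actual factor depends on the number of interim analyses and the spending function, which is what the calculator accounts for.

    # Illustrative sequential-testing overhead; the 1.25 factor is an assumption.
    fixed_horizon_n = 6279                      # per group, from the sketch above
    gst_inflation = 1.25                        # assumed overhead for interim analyses
    max_n_sequential = round(fixed_horizon_n * gst_inflation)
    print(max_n_sequential)                     # ~7849 per group at the final analysis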

What reduces the required sample size?

Variance reduction. CUPED (Controlled-experiment Using Pre-Existing Data) uses pre-experiment metric values to remove predictable noise. Confidence applies the Negi-Wooldridge full regression estimator, which is more precise than the original CUPED formulation. Variance reduction of ~50% is common for metrics with stable user-level baselines, which translates to roughly halving the required sample size.
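To show the mechanism, the sketch below implements the classic single-covariate CUPED adjustment, not the Negi-Wooldridge regression estimator that Confidence uses; the simulated correlation between pre-experiment and in-experiment values is an assumption chosen so the adjustment removes roughly half the variance.

    # Classic CUPED adjustment (illustrative; not Confidence's estimator).
    import numpy as np

    def cuped_adjust(y, x_pre):
        """Return the CUPED-adjusted metric and the variance reduction achieved."""
        theta = np.cov(y, x_pre)[0, 1] / np.var(x_pre, ddof=1)
        y_adj = y - theta * (x_pre - x_pre.mean())
        reduction = 1 - np.var(y_adj) / np.var(y)
        return y_adj, reduction

    # Simulated data where pre- and in-experiment values are strongly correlated.
    rng = np.random.default_rng(0)
    x_pre = rng.normal(50, 10, size=100_000)
    y = 0.7 * x_pre + rng.normal(0, 7, size=100_000)
    _, reduction = cuped_adjust(y, x_pre)
    print(f"variance reduction ~ {reduction:.0%}")  # roughly 50% for this correlation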

Trigger analysis. If only 10% of users encounter the changed feature, including all assigned users in the analysis dilutes the effect by 10x. Restricting the analysis to triggered users recovers the undiluted effect and dramatically reduces the sample needed to detect it.
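The back-of-the-envelope math behind that dilution, with illustrative numbers (a 5% lift among triggered users, a 10% trigger rate):

    # Dilution arithmetic for trigger analysis; all values are illustrative.
    lift_among_triggered = 0.05
    trigger_rate = 0.10

    # Averaged over all assigned users, the effect shrinks by the trigger rate.
    diluted_lift = lift_among_triggered * trigger_rate
    # Required sample scales with 1 / delta^2, so a 10x smaller effect needs
    # roughly 100x more users to detect at the same power.
    sample_inflation = (1 / trigger_rate) ** 2
    print(f"diluted lift ~ {diluted_lift:.3f}, sample inflation ~ {sample_inflation:.0f}x")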

Bolder implementations. A treatment that produces a 5% lift needs a quarter of the sample required to detect a 2.5% lift. Testing the "Maximum Viable Change" first (the loudest version of the idea that still works as a user experience) is often the cheapest way to reduce required sample size.

Better metrics. Some metrics are inherently lower-variance than others. Revenue per user has higher variance than conversion rate in most products. Choosing a metric with lower natural variance, or capping extreme values, reduces sample requirements.
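A quick illustration of the capping point: the sketch below caps a simulated heavy-tailed revenue-per-user metric at its 99th percentile (an arbitrary illustrative cutoff, not a rule) and compares the variance before and after.

    # Capping extreme values lowers variance; cutoff and data are illustrative.
    import numpy as np

    rng = np.random.default_rng(1)
    # Heavy-tailed revenue-like metric: most users spend little, a few spend a lot.
    revenue = rng.lognormal(mean=1.0, sigma=1.5, size=100_000)

    cap = np.percentile(revenue, 99)
    revenue_capped = np.minimum(revenue, cap)

    print(f"variance before capping: {revenue.var():.1f}")
    print(f"variance after capping:  {revenue_capped.var():.1f}")
    # The lower variance feeds directly into the sample size formula above.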

What goes wrong when sample size is too small or too large?

Undersized experiments produce results that can't be interpreted. The confidence interval is wide enough to include both meaningful improvement and meaningful harm. The team spent the engineering effort to build, instrument, and monitor the experiment and got nothing actionable in return.

Oversized experiments waste traffic. If you only needed 50,000 users but ran 200,000, then at 50,000 eligible users per week you've consumed three extra weeks of bandwidth that another experiment could have used. At organizations running many concurrent tests, this matters.

The discipline is in doing the calculation before the experiment launches and committing to the plan. Confidence flags experiments where the projected runtime exceeds the team's planned window or where the MDE is unrealistically small for the available traffic, helping teams avoid both failure modes before they start.