Statistical Methods

What is Statistical Power?

Statistical power is the probability that an experiment will detect a real effect when one exists. A test with 80% power has an 80% chance of producing a statistically significant result if the treatment genuinely changes the metric by at least the minimum detectable effect. The remaining 20% is the false negative rate: the chance of missing a real improvement.
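
This definition is easy to check by simulation. The sketch below is illustrative only: the baseline conversion rate, lift, and sample size are assumed values chosen so that power lands near 80%, not numbers from any real experiment.

```python
# Monte Carlo sketch of the power definition above (assumed, illustrative numbers).
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

n_per_group = 3_900          # users per arm (assumed)
p_control = 0.10             # baseline conversion rate (assumed)
p_treatment = 0.12           # treatment genuinely lifts conversion by 2 points
alpha = 0.05

n_sims = 5_000
significant = 0
for _ in range(n_sims):
    control = rng.binomial(1, p_control, n_per_group)
    treatment = rng.binomial(1, p_treatment, n_per_group)
    # A two-sample t-test on binary outcomes approximates the usual z-test here.
    _, p_value = stats.ttest_ind(treatment, control)
    significant += p_value < alpha

power = significant / n_sims
print(f"empirical power ~ {power:.2f}, false negative rate ~ {1 - power:.2f}")
```

With these assumptions the loop flags a significant result in roughly 80% of runs; the other 20% are the false negatives described above.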

Underpowered experiments are the most common source of wasted experiment bandwidth. A test with 30% power detects a real effect, when one exists, less than a third of the time. The rest of the time, the result is ambiguous: you can't tell whether the change had no effect or whether the test was simply too small to see it. At Spotify, where 58 teams ran 520 experiments on the mobile home screen alone in 2025, running underpowered tests doesn't just waste one team's time. It consumes shared experiment capacity that another team could have used to learn something.

What determines statistical power?

Power depends on four linked parameters. Change one, and the others shift. The sketch after this list shows how each lever moves power.

Sample size. More users means more data, which means a narrower confidence interval and higher power. This is the most direct lever.

Minimum detectable effect (MDE). The smaller the effect you want to detect, the more power you need, which means more sample. Detecting a 5% lift in conversion requires far fewer users than detecting a 0.5% lift.

Significance level (alpha). A stricter significance threshold (lower alpha) demands stronger evidence, which reduces power at the same sample size. Most teams hold alpha at 0.05 and adjust other parameters.

Variance. Higher variance in the metric makes it harder to distinguish a real effect from noise, reducing power. Variance reduction techniques like CUPED directly increase power without requiring more traffic. Confidence uses the Negi-Wooldridge full regression estimator for CUPED, which is more precise than the original formulation and can meaningfully reduce the sample size needed to reach a given power level.
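
One way to see how the four levers interact is a back-of-envelope power calculation. The sketch below uses the standard normal approximation for a two-sided, two-sample test on a difference in means; the sample size, MDE, and standard deviation are assumed values, and this is not how Confidence computes power internally.

```python
# Normal-approximation power for a two-sided, two-sample test (assumed numbers).
from scipy.stats import norm

def power(n_per_group: float, mde: float, sigma: float, alpha: float = 0.05) -> float:
    """Approximate probability of detecting a true effect of size `mde`."""
    se = sigma * (2.0 / n_per_group) ** 0.5        # standard error of the difference
    z_crit = norm.ppf(1.0 - alpha / 2.0)           # critical value at the significance level
    return norm.sf(z_crit - mde / se)              # P(significant | true effect = mde)

print(f"baseline:                  {power(14_000, 0.010, 0.30):.2f}")
print(f"2x the users:              {power(28_000, 0.010, 0.30):.2f}")
print(f"half the MDE:              {power(14_000, 0.005, 0.30):.2f}")
print(f"stricter alpha (0.01):     {power(14_000, 0.010, 0.30, alpha=0.01):.2f}")
print(f"20% lower std dev (CUPED): {power(14_000, 0.010, 0.24):.2f}")
```

Starting from roughly 80% power, doubling traffic or shrinking variance pushes power up, while halving the MDE or tightening alpha pulls it down sharply.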

How much power do you need?

The conventional target is 80%, meaning you'll detect the effect four out of five times it's real. Some teams target 90% for high-stakes decisions, accepting the larger sample sizes required.

Targeting less than 80% is rarely justified. At 50% power, your experiment is a coin flip: half the time it can't tell a real effect from nothing. The cost of running the experiment (engineering time, user exposure, bandwidth consumed) stays the same regardless of power. The only thing that changes is how often you get a useful answer.
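
The same normal approximation can be rearranged to show what a higher power target costs in sample size. The baseline rate and MDE below are assumptions for illustration, not defaults from Confidence.

```python
# Required users per arm at different power targets (normal approximation, assumed numbers).
from scipy.stats import norm

def n_per_group(mde: float, sigma: float, power: float, alpha: float = 0.05) -> int:
    """Users per arm for a two-sided, two-sample test at the given power."""
    z = norm.ppf(1.0 - alpha / 2.0) + norm.ppf(power)
    return round(2.0 * (sigma * z / mde) ** 2)

sigma = (0.10 * 0.90) ** 0.5    # std dev of a binary metric with a 10% baseline rate
for target in (0.80, 0.90):
    print(f"{target:.0%} power: {n_per_group(mde=0.01, sigma=sigma, power=target):,} users per arm")
```

Under these assumptions, moving from 80% to 90% power costs roughly a third more users per arm.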

Confidence's sample size calculator shows the power curve for any experiment configuration. Before a test launches, you can see how power changes with different sample sizes, MDEs, and variance assumptions. If the test can't reach 80% power within a reasonable timeline, that's a signal to rethink the approach: use a bolder implementation to increase the expected effect size, apply CUPED to reduce variance, or use trigger analysis to restrict to users who actually encounter the change.
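
To make the variance lever concrete, here is a minimal sketch of the classic one-covariate CUPED adjustment on simulated data. It is not the Negi-Wooldridge regression estimator Confidence uses, but it shows the mechanism: a pre-experiment covariate that correlates with the metric soaks up variance, and every unit of variance removed is power gained without extra traffic.

```python
# Classic one-covariate CUPED adjustment on simulated data (illustrative only).
import numpy as np

rng = np.random.default_rng(42)
n = 50_000

pre = rng.normal(10.0, 3.0, n)                 # pre-experiment value of the metric per user
post = 0.8 * pre + rng.normal(2.0, 2.0, n)     # in-experiment metric, correlated with `pre`

theta = np.cov(post, pre)[0, 1] / np.var(pre)  # variance-minimizing adjustment coefficient
adjusted = post - theta * (pre - pre.mean())   # CUPED-adjusted metric

print(f"raw variance:       {post.var():.2f}")
print(f"adjusted variance:  {adjusted.var():.2f}")
print(f"variance reduction: {1 - adjusted.var() / post.var():.0%}")
```

With a correlation this strong between pre- and in-experiment behavior, the adjustment removes more than half the variance, which translates directly into higher power at the same traffic.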

What happens when you ignore power?

Underpowered experiments don't produce neutral results. They produce misleading ones. Confidence flags this risk before launch.

When a low-power test happens to reach significance, the observed effect size is almost always inflated. This is called the winner's curse: among the results that clear the significance bar, the ones that got lucky with noise dominate. The team ships a feature expecting a 10% improvement and sees 2% in production.
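
The winner's curse is easy to reproduce in simulation. The numbers below are illustrative: a true lift of half a percentage point tested with far too few users.

```python
# Simulating the winner's curse: significant results from a low-power test overstate the effect.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

p_control = 0.10        # baseline conversion rate (assumed)
true_lift = 0.005       # real but small improvement
n_per_group = 2_000     # far too small to detect a 0.5-point lift reliably
alpha = 0.05

wins = []
n_sims = 20_000
for _ in range(n_sims):
    control = rng.binomial(1, p_control, n_per_group)
    treatment = rng.binomial(1, p_control + true_lift, n_per_group)
    diff = treatment.mean() - control.mean()
    _, p_value = stats.ttest_ind(treatment, control)
    if p_value < alpha and diff > 0:            # a "significant win"
        wins.append(diff)

print(f"share of runs that are significant wins: {len(wins) / n_sims:.2f}")
print(f"true lift:                               {true_lift:.3f}")
print(f"average observed lift among those wins:  {np.mean(wins):.3f}")
```

In this setup only a small fraction of runs reach significance, and among those that do, the average observed lift is several times the true half point.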

When a low-power test doesn't reach significance, it gets misinterpreted as "no effect." Null results from underpowered tests are uninformative. They consume the same bandwidth as a properly powered test but produce nothing actionable. At Spotify, the experimentation platform flags experiments that are projected to be underpowered before they launch, helping teams avoid this trap.