Powered ≠ Trustworthy


"Don't trust underpowered experiments" is a recurring piece of advice. It is correct. But it is less helpful than it appears, because most people draw the obvious opposite conclusion: trust powered experiments. Power depends on choices the experimenter makes before running the study, and to trust a powered experiment you must also trust the experimenter that planned it.

Trust, as used here, means: is the observed effect likely to be close to the true effect? The experimenter does not know the true effect (that is why they are running the experiment), so they guess and size the study around that guess. When the true effect is larger than assumed, the experiment is overpowered and the result is reliable. When it is smaller, the experimenter is unknowingly underpowered, and significant results will be inflated.

The rest of this post works through that breakdown: why hitting your power target is not a fix, what the significance filter actually selects for, and what to ask instead.

Why "powered" doesn't mean trustworthy

Power is always calculated for a specific effect size: the MDE (minimum detectable effect). An 80% power target means: if the true effect equals the MDE, you will detect it 80% of the time. When the true effect differs from the MDE, your actual power differs from the power you planned for, and because the true effect is unknown, you cannot tell by how much.
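A minimal sketch of how planned and actual power separate. It assumes a normally distributed estimator with a known standard error and a two-sided test where only positive wins count; the function names and the MDE of 0.5 are hypothetical, chosen for illustration.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, sd 1


def power(effect: float, se: float, alpha: float = 0.05) -> float:
    """Chance of a significant positive result for a given true effect.

    Assumes a normal estimator with known standard error `se` and a
    two-sided test at level `alpha`; only positive wins are counted.
    """
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    return STD_NORMAL.cdf(effect / se - z_crit)


def se_for_target(mde: float, target_power: float = 0.8, alpha: float = 0.05) -> float:
    """Standard error at which `mde` is detected with `target_power`."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    return mde / (z_crit + STD_NORMAL.inv_cdf(target_power))


mde = 0.5                # hypothetical planned MDE
se = se_for_target(mde)  # a design that hits 80% power exactly at the MDE
for true_effect in (1.0, 0.5, 0.25, 0.1):
    print(f"true effect {true_effect:.2f}: power {power(true_effect, se):.1%}")
```

Under these assumptions, a true effect at half the MDE leaves the same design with a bit under 30% power, even though the plan said 80%.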

The MDE was designed to be a business question: what is the smallest effect worth acting on? In modern software development, there are many changes where the cost of implementation and maintenance is very low, and very small true effects can have real business value. So many teams work the MDE out backwards instead: given traffic and time, what is the smallest detectable effect? That number becomes the MDE: a statistical constraint, not a business judgment.
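Working the MDE out backwards can be sketched as follows. This assumes equal group sizes, a known common standard deviation, and a difference-in-means estimator; the function name and the traffic levels are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def smallest_detectable_effect(n_per_group: int, sd: float,
                               alpha: float = 0.05,
                               target_power: float = 0.8) -> float:
    """Effect size that a two-group test detects with `target_power`.

    Assumes equal group sizes, a known common standard deviation `sd`,
    and a two-sided test at level `alpha`.
    """
    z = NormalDist()
    se = sd * sqrt(2 / n_per_group)  # standard error of the difference in means
    return (z.inv_cdf(1 - alpha / 2) + z.inv_cdf(target_power)) * se


# Hypothetical traffic levels; sd = 1 puts the result in standard deviations.
for n in (1_000, 10_000, 100_000):
    print(n, round(smallest_detectable_effect(n, sd=1.0), 4))
```

Nothing in this calculation asks what effect would be worth acting on. The output is driven entirely by traffic, variance, and the chosen thresholds.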

The 80% power threshold adds a second layer of arbitrariness. Cohen adopted it in the 1960s on informal reasoning he himself described as tentative. The convention stuck anyway. Even at exactly 80% power, when the MDE matches the true effect, the average significant result still overstates the true effect by roughly 12–13% under standard settings.
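The 12–13% figure can be checked with the mean of a truncated normal distribution. This is a sketch that assumes a normal estimator, a two-sided test at α = 0.05, and that only significant results in the direction of the true effect are kept; the function name is hypothetical.

```python
from statistics import NormalDist

Z = NormalDist()


def exaggeration_at_power(actual_power: float, alpha: float = 0.05) -> float:
    """Average significant estimate divided by the true effect.

    Works in standard-error units. Requires actual_power > alpha / 2,
    so that the implied true effect is positive.
    """
    z_crit = Z.inv_cdf(1 - alpha / 2)
    signal = z_crit + Z.inv_cdf(actual_power)  # true effect in standard errors
    # Mean of a normal truncated below at z_crit (inverse Mills ratio).
    mean_significant = signal + Z.pdf(z_crit - signal) / actual_power
    return mean_significant / signal


for p in (0.8, 0.5, 0.2):
    print(f"power {p:.0%}: exaggeration {exaggeration_at_power(p):.2f}x")
```

At 80% power the ratio comes out at about 1.12, and it grows as the actual power drops.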

What significance filters out

When an experiment is underpowered, only the largest observed effects clear the significance threshold. Among experiments that reach significance, measured effects skew upward relative to reality. Kohavi, Deng, and Vermeer put it plainly: low statistical power means the effects you detect are inflated. This is the winner's curse.

Gelman and Carlin named the formal mechanism Type M error: the factor by which a significant result overstates the true effect. In low-power settings this can easily reach 2× or more. An analysis of neuroscience studies with power between 8% and 31% found inflation of 25%–50% across the literature. Airbnb documented it in their own data: six significant experiments showed a combined observed effect of 7.2%, but a holdout measured the true aggregate effect at 4%. The gap was large enough that Airbnb built debiasing corrections directly into their reporting framework.

[Interactive figure: the sampling distribution of the difference-in-means estimator under the true effect. Adjusting the true effect (δ), MDE, sample size (n), and significance level (α) shows how planned and actual power diverge. An orange line marks the MDE; the green area shows experiments that would find a significant positive result. Settings shown: δ = 0.20, MDE = 0.50, n = 100, α = 0.05.]

The MDE is set to 0.50, giving a planned power of 30.3%. The true effect is 0.20, so the actual power is 8.3%. Observed effects need to exceed 0.68 to reach significance, so the average significant result (in the green area) is 0.84, a 4.19× overestimate of the true effect.

This interactive is also part of the Confidence Bootcamp lesson on the winner's curse.
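The numbers quoted for the interactive can be approximated with a small simulation of the significance filter. This is a sketch under one loud assumption: the standard error of about 0.346 is back-solved from the quoted 0.68 cutoff, since the post does not state it. The seed and number of repetitions are arbitrary.

```python
import random
from statistics import NormalDist, fmean

TRUE_EFFECT = 0.20
ALPHA = 0.05
# Assumption: standard error back-solved from the quoted 0.68 cutoff
# (0.68 / 1.96 is about 0.347); it is not given in the text.
SE = 0.3464

rng = random.Random(7)  # fixed seed so the run is repeatable
cutoff = NormalDist().inv_cdf(1 - ALPHA / 2) * SE

# Each draw stands for the observed effect from one repetition of the experiment.
observed = [rng.gauss(TRUE_EFFECT, SE) for _ in range(200_000)]
significant = [x for x in observed if x > cutoff]  # positive wins only

actual_power = len(significant) / len(observed)
mean_significant = fmean(significant)
type_m = mean_significant / TRUE_EFFECT

print(f"cutoff {cutoff:.2f}, power {actual_power:.1%}, "
      f"mean significant {mean_significant:.2f}, Type M {type_m:.2f}x")
```

The unfiltered draws average out near the true effect. Only the subset that clears the cutoff is biased upward, which is the winner's curse in miniature.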

Post-hoc power fools you

When a significant result looks surprisingly large, a common instinct is to calculate post-hoc power using the observed effect. If it comes out high, people treat the result as trustworthy.

It does not help. Post-hoc power is a one-to-one function of the p-value: a result sitting exactly at p = 0.05 has post-hoc power of about 50%, every significant result has more, and every non-significant result has less. Post-hoc power adds no information beyond the p-value you already have.
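A short sketch makes the dependence on the p-value explicit. It assumes a two-sided z-test; the function name is hypothetical. Note that the p-value is the only input about the experiment.

```python
from statistics import NormalDist

Z = NormalDist()


def post_hoc_power(p_value: float, alpha: float = 0.05) -> float:
    """Power computed by plugging in the observed effect as the true effect.

    Assumes a two-sided z-test. The observed z-score is recovered from the
    p-value, so nothing else about the experiment enters the calculation.
    """
    z_obs = Z.inv_cdf(1 - p_value / 2)
    z_crit = Z.inv_cdf(1 - alpha / 2)
    return Z.cdf(z_obs - z_crit) + Z.cdf(-z_obs - z_crit)


for p in (0.001, 0.01, 0.05, 0.2, 0.5):
    print(f"p = {p}: post-hoc power {post_hoc_power(p):.1%}")
```

Two experiments with the same p-value get the same post-hoc power regardless of sample size or true effect, so the number cannot tell a real effect from an inflated one.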

Reasoning about plausibility

Power and post-hoc calculations cannot answer the trustworthiness question. The following checks can narrow it.

Twyman's Law: "Any figure that looks interesting or different is usually wrong." A powered experiment with an implausibly large effect deserves the same scrutiny as an underpowered one.

Check your experiment history. An effect three or more times larger than anything comparable in that history should not ship on the p-value alone. At Spotify, the win rate is around 12%; most features move the needle modestly if at all.

Check for a sample ratio mismatch (SRM). If the split between treatment and control differs from the configured ratio, randomization is broken and the result is untrustworthy regardless of power.
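One way to run an SRM check is a chi-square goodness-of-fit test on the group counts. The sketch below uses only the standard library and assumes two groups; the function name, the counts, and the 0.001 alert cutoff are illustrative assumptions, not a description of any particular platform's check.

```python
from math import erfc, sqrt


def srm_p_value(n_control: int, n_treatment: int,
                treatment_share: float = 0.5) -> float:
    """Chi-square goodness-of-fit p-value for a two-group split.

    With two groups the test has one degree of freedom, for which the
    chi-square tail probability equals erfc(sqrt(statistic / 2)).
    """
    total = n_control + n_treatment
    expected_treatment = total * treatment_share
    expected_control = total - expected_treatment
    statistic = ((n_treatment - expected_treatment) ** 2 / expected_treatment
                 + (n_control - expected_control) ** 2 / expected_control)
    return erfc(sqrt(statistic / 2))


# Hypothetical counts for a configured 50/50 split.
p = srm_p_value(n_control=50_000, n_treatment=51_200)
print(f"SRM p-value: {p:.5f}")
# Assumption: a strict cutoff such as 0.001 is used here to limit false alarms.
print("possible SRM" if p < 0.001 else "split is consistent with the configuration")
```

A strict cutoff is a design choice: the check runs on every experiment, so a looser threshold would flag many healthy ones.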

Replicate before a major launch. A rollout is the natural path: expose the winner to more users and check whether the effect holds.

Power is a design tool

Power targets are useful. The mistake is treating a significant result as credible because the experiment was powered. A powered experiment can produce a suspect result if the MDE was optimistic, and since MDEs are difficult to choose and practices for setting them vary, optimistic MDEs are common. Conversely, a nominally underpowered experiment can produce a credible result if the true effect is far larger than the MDE.

After a significant result, "Was this powered?" is the wrong question. Ask it before the experiment runs, when you are deciding what effect size to power for. Afterwards, ask whether the effect size is believable given what you know about this feature and user population.

That question has no formula. There is no significance threshold that answers it for you. At Spotify, we often replicate business-critical A/B test findings with a rollout. That gives us both a replication study and a larger-sample estimate of the true effect under real traffic.
