A fixed-power design is a sequential experiment plan where the stopping rule is based on achieving a pre-specified level of statistical power rather than on observing a statistically significant result. Instead of asking "is the effect significant yet?" at each interim analysis, a fixed-power design asks "do I now have enough data to guarantee the power level I planned for?" When the answer is yes, the experiment stops and the final analysis is conducted.
This approach was introduced in a 2024 research paper by Nordin and Schultzberg at Spotify, which proposed precision-based stopping rules as an alternative to significance-based sequential methods. The core idea: if your stopping criterion doesn't depend on the observed treatment effect, you can peek at intermediate results without biasing inference. Power, unlike significance, is a function of sample size and variance alone (given the minimum detectable effect you planned for), not of the observed effect. So peeking at power is safe in a way that peeking at p-values is not.
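To make that concrete, here is a minimal sketch of power as a function of sample size and variance, assuming a two-sided two-sample z-test (our assumption for illustration, not a detail spelled out above). Note that the observed effect estimate never appears; only the planned minimum detectable effect (MDE) does.

```python
from scipy import stats

def power_two_sample(n_per_arm, variance, mde, alpha=0.05):
    """Power of a two-sided two-sample z-test against the planned MDE.
    Depends only on n and the variance estimate -- the observed
    treatment effect never enters. (Ignores the negligible far tail.)"""
    se = (2 * variance / n_per_arm) ** 0.5   # standard error of the difference
    z_crit = stats.norm.ppf(1 - alpha / 2)   # critical value at level alpha
    return stats.norm.sf(z_crit - mde / se)  # P(reject | true effect = mde)

print(power_two_sample(n_per_arm=1000, variance=4.0, mde=0.2))  # ~0.61
```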
How does a fixed-power design differ from standard sequential testing?
In a group sequential test (GST) or always-valid inference method, the decision to stop depends on the observed test statistic: the data you've seen determines whether you stop. Because the stopping rule is tied to the outcome, sequential corrections (alpha spending, confidence sequences) are required to maintain valid false positive rates.
A fixed-power design decouples stopping from the observed effect. You stop when the accumulated sample guarantees a target power level, not when the data looks promising. Since the stopping rule doesn't depend on the treatment effect estimate, it doesn't introduce the selection bias that causes the peeking problem. The final analysis can use a standard fixed-horizon test at the nominal significance level.
This is a genuine structural difference. Standard sequential methods solve peeking by adjusting thresholds. Fixed-power designs avoid the problem entirely by changing what you peek at.
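One way to see the structural difference is an A/A simulation: stop each run when the variance-based power estimate hits the target, then apply an ordinary t-test with no sequential correction. A rough sketch (parameter values are illustrative, not from the paper):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
alpha, target_power, mde = 0.05, 0.8, 0.2       # planning inputs (illustrative)
z_crit = stats.norm.ppf(1 - alpha / 2)
n_sims, batch, max_n = 2000, 50, 5000

false_positives = 0
for _ in range(n_sims):
    a = np.empty(0)
    b = np.empty(0)
    while True:
        # accrue another batch per arm; both arms share one distribution,
        # so any rejection at the end is a false positive
        a = np.append(a, rng.normal(0, 1, batch))
        b = np.append(b, rng.normal(0, 1, batch))
        n = len(a)
        pooled_var = (a.var(ddof=1) + b.var(ddof=1)) / 2
        se = np.sqrt(2 * pooled_var / n)
        power = stats.norm.sf(z_crit - mde / se)  # depends on n and variance only
        if power >= target_power or n >= max_n:
            break
    # final analysis: standard fixed-horizon test at the nominal level
    false_positives += stats.ttest_ind(a, b).pvalue < alpha

print(f"empirical false positive rate: {false_positives / n_sims:.3f}")  # ~ alpha
```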
What can you peek at during a fixed-power experiment?
The Spotify blog post on fixed-power designs explains the distinction. You can safely peek at anything that determines whether you've reached your power target: the current sample size, the observed variance of your metric, and the resulting power estimate. These quantities carry no information about the sign or size of the treatment effect, so monitoring them doesn't bias the final test.
You still can't peek at the treatment effect itself, the p-value, or the confidence interval during the experiment. Those remain off-limits until the power criterion is met and the experiment stops.
In practice, this means you can build a dashboard showing "this experiment currently has 72% power and needs approximately 4 more days to reach 80%" without compromising the statistical validity of the result. Teams at Spotify get a real-time view of when their experiment will be ready for analysis, which helps with planning and bandwidth allocation across the 10,000+ experiments running annually.
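A dashboard like that needs only the quantities that are safe to peek at. Here is a hypothetical status helper (the names `n_now`, `users_per_day_per_arm`, and so on are ours, not Spotify's API), again assuming a two-sample z-test:

```python
from scipy import stats

def experiment_status(n_now, variance, mde, users_per_day_per_arm,
                      target_power=0.8, alpha=0.05):
    """Current power and a rough ETA to the power target, computed from
    sample size and variance alone -- no peeking at the effect estimate."""
    z = stats.norm.ppf
    se = (2 * variance / n_now) ** 0.5
    current_power = stats.norm.sf(z(1 - alpha / 2) - mde / se)
    n_needed = 2 * variance * (z(1 - alpha / 2) + z(target_power)) ** 2 / mde ** 2
    days_left = max(0.0, (n_needed - n_now) / users_per_day_per_arm)
    return current_power, days_left

power, eta = experiment_status(n_now=1200, variance=4.0, mde=0.2,
                               users_per_day_per_arm=100)
print(f"{power:.0%} power, ~{eta:.1f} days to 80%")  # 69% power, ~3.7 days to 80%
```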
When is a fixed-power design the right choice?
Fixed-power designs work best when the primary uncertainty is about variance, not about traffic. If you're running an experiment on a metric whose variance you haven't observed before (a new metric, a new user segment, a recently launched feature), the required sample size is hard to estimate in advance. A fixed-power design adapts: it monitors the accumulating variance and stops when enough data has been collected to guarantee the target power, regardless of whether that takes more or less time than initially expected.
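The adaptation falls out of the standard sample-size formula: required n per arm scales linearly with the variance, so a planning-time under- or over-estimate simply stretches or shrinks the run. A quick sketch (two-sample z-test assumed, as before):

```python
from scipy import stats

def n_required(sample_variance, mde, alpha=0.05, power=0.8):
    """Per-arm sample size for a two-sample z-test; re-evaluated on each
    peek with the running variance estimate (our sketch, not Spotify's code)."""
    z = stats.norm.ppf
    return 2 * sample_variance * (z(1 - alpha / 2) + z(power)) ** 2 / mde ** 2

# required n scales linearly with variance: a metric twice as noisy
# as the planning guess just means the experiment runs twice as long
print(n_required(1.0, mde=0.1))  # ~1570 per arm
print(n_required(2.0, mde=0.1))  # ~3140 per arm
```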
They're also useful when the cost of the experiment is proportional to its duration and you want to stop as soon as the power target is met, with no overshoot past the sample size you actually need.
Fixed-power designs are less useful when you want the option to stop early because the effect is very large. In a GST, a large effect can be detected at the first interim analysis. In a fixed-power design, you still wait until the power criterion is satisfied, even if the evidence for the effect is already overwhelming. The two approaches optimize for different things: GSTs optimize for expected sample size given the effect, fixed-power designs optimize for guaranteed power regardless of the effect.
How does this connect to precision-based designs?
The fixed-power design is one of two precision-based stopping rules proposed in the same paper. The other is a fixed-width confidence interval design, which stops when the confidence interval around the treatment effect is narrow enough (i.e., precise enough) to be useful for decision-making, regardless of whether it contains zero.
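Under the usual normal approximation, the CI width is itself a precision quantity: it depends only on n and variance, so the same peeking argument applies. A minimal sketch of the stopping check, where the half-width threshold `max_half_width` is an assumed decision input:

```python
from scipy import stats

def fixed_width_stop(n_per_arm, variance, max_half_width, alpha=0.05):
    """Fixed-width CI design: stop once the (1 - alpha) interval around the
    treatment effect is narrow enough to act on, whatever it contains."""
    se = (2 * variance / n_per_arm) ** 0.5
    half_width = stats.norm.ppf(1 - alpha / 2) * se
    return half_width <= max_half_width

print(fixed_width_stop(n_per_arm=2000, variance=1.0, max_half_width=0.07))  # True
```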
Both share the same principle: the stopping rule is a function of precision (sample size, variance), not of the observed effect. Both allow peeking at precision metrics without sequential corrections. Confidence supports fixed-power designs as part of its sequential testing toolkit, giving teams an alternative when the standard GST or always-valid inference framework isn't the best fit.