Optional stopping is the practice of ending an experiment based on observed results rather than a pre-determined stopping rule. When an experimenter checks the data, sees a significant p-value, and decides to stop early, that's optional stopping. Without a statistical framework designed for it, this practice inflates false positive rates: results that look significant at the time of stopping may reflect random fluctuation rather than a real treatment effect.
The term describes a behavior, not a method. Optional stopping isn't inherently wrong. It becomes a problem when the analysis method assumes the experiment was evaluated at a single fixed time point but the actual decision to stop was influenced by the data. The mismatch between assumption and reality is what inflates error rates.
Why does optional stopping inflate false positives?
A standard significance test sets a threshold (usually p < 0.05) and asks whether the observed data would be unlikely under the null hypothesis. That probability calculation assumes you're looking at the data exactly once, at a fixed sample size.
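To make that single-look assumption concrete, here is a minimal sketch of the fixed-sample calculation, written as a two-sample z-test on a binary metric. The numbers and the function name are illustrative, not any platform's API.

```python
import numpy as np
from scipy import stats

def fixed_sample_z_test(conversions_a, n_a, conversions_b, n_b):
    """Two-sided two-sample z-test on conversion rates.

    The p-value is only calibrated if this is the one pre-planned look
    at a fixed sample size."""
    p_a, p_b = conversions_a / n_a, conversions_b / n_b
    p_pool = (conversions_a + conversions_b) / (n_a + n_b)
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return 2 * stats.norm.sf(abs(z))

# Illustrative numbers: 5.0% vs 5.4% conversion on 20,000 users per arm.
print(fixed_sample_z_test(1_000, 20_000, 1_080, 20_000))
```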
When you stop because the result looks good, you've introduced a selection effect. You're more likely to stop when random noise happens to favor the treatment than when it doesn't. Over many experiments, this means the experiments that get stopped early are enriched for false positives.
The inflation can be substantial. An experiment checked once a week for five weeks at a nominal 5% significance level, stopping at the first significant result, has a true false positive rate of roughly 14%. Check it 20 times and the rate approaches 25%. The more often you look and the more willing you are to stop on a positive result, the worse the inflation gets.
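A small Monte Carlo sketch reproduces these numbers under idealized assumptions: A/A data with unit variance, equal traffic, a naive z-test at every look, and stopping at the first significant result. It isn't how any production system computes this, but it shows where the inflation comes from.

```python
import numpy as np

rng = np.random.default_rng(0)

def false_positive_rate(n_looks, obs_per_look=1_000, n_sims=100_000, z_crit=1.96):
    """Simulated A/A tests (no true effect) with a naive two-sided z-test at
    every look, stopping at the first 'significant' result."""
    # Per-look increment of (sum of B outcomes - sum of A outcomes); with
    # unit-variance outcomes each increment has variance 2 * obs_per_look.
    increments = rng.normal(scale=np.sqrt(2 * obs_per_look),
                            size=(n_sims, n_looks))
    cum_diff = increments.cumsum(axis=1)          # sum_B - sum_A at each look
    n = obs_per_look * np.arange(1, n_looks + 1)  # observations per arm so far
    z = (cum_diff / n) / np.sqrt(2 / n)           # z-statistic at each look
    return (np.abs(z) > z_crit).any(axis=1).mean()

for looks in (1, 5, 20):
    print(f"{looks:>2} looks: false positive rate ~ {false_positive_rate(looks):.3f}")
```

With one look the simulated rate sits near the nominal 5%; with five looks it climbs toward 14%, and with 20 looks toward 25%.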
At Spotify, where 300+ teams run experiments simultaneously and each team faces pressure to free up experiment bandwidth, optional stopping without correction would systematically degrade the reliability of the entire experimentation program. That's not a hypothetical. It's why sequential testing methods are built into Confidence as a default, not an add-on.
How is optional stopping different from the peeking problem?
The two concepts are closely related but describe different aspects of the same issue. The peeking problem is the statistical phenomenon: checking results multiple times inflates false positives. Optional stopping is the behavioral pattern: deciding when to stop based on what you've seen.
You can peek without optionally stopping (you look at interim results but commit to running the experiment to its planned end regardless). You can also optionally stop without peeking at the test statistic itself (for example, stopping when a power target is met, as in a fixed-power design). The false positive inflation occurs specifically when the decision to stop is correlated with the observed treatment effect.
What methods make optional stopping valid?
Two classes of sequential testing methods handle optional stopping correctly.
Group sequential tests (GSTs) pre-plan a finite number of interim analyses and use an alpha spending function to adjust the significance threshold at each look. Optional stopping is allowed, but only at the pre-planned analysis points. Between analyses, you don't look. This controlled form of optional stopping maintains the correct overall false positive rate. Spotify's framework comparison showed GSTs provide the best power when maximum sample sizes can be estimated.
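As a sketch of how alpha spending turns into per-look thresholds, the snippet below calibrates boundaries for five evenly spaced analyses using the Lan-DeMets Pocock-type spending function. The calibration is done by simulation for brevity; real GST tooling typically solves for these boundaries with numerical integration, and the look schedule here is illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def pocock_spending(t, alpha=0.05):
    """Lan-DeMets Pocock-type spending function: how much of the total
    alpha may be spent by information fraction t."""
    return alpha * np.log(1 + (np.e - 1) * np.asarray(t, dtype=float))

def gst_boundaries(info_fractions, alpha=0.05, n_sims=500_000):
    """Calibrate two-sided z thresholds so the cumulative false positive rate
    at each pre-planned look matches the spent alpha (simulation-based)."""
    t = np.asarray(info_fractions, dtype=float)
    # Simulate the joint null distribution of the z-statistics across looks.
    steps = rng.normal(scale=np.sqrt(np.diff(t, prepend=0.0)), size=(n_sims, len(t)))
    z = steps.cumsum(axis=1) / np.sqrt(t)
    alive = np.ones(n_sims, dtype=bool)   # paths not rejected at an earlier look
    spent_prev, thresholds = 0.0, []
    for k, spent in enumerate(pocock_spending(t, alpha)):
        # Incremental alpha for this look, conditional on surviving earlier looks.
        p_extra = (spent - spent_prev) / alive.mean()
        c = np.quantile(np.abs(z[alive, k]), 1 - p_extra)
        thresholds.append(round(float(c), 2))
        alive &= np.abs(z[:, k]) <= c
        spent_prev = spent
    return thresholds

print(gst_boundaries([0.2, 0.4, 0.6, 0.8, 1.0]))  # roughly flat thresholds near 2.4
```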
Always-valid inference (AVI) methods construct confidence sequences valid at any stopping time. With AVI, you can optionally stop at literally any point, not just pre-planned analyses, and the inference remains valid. The trade-off is lower statistical power compared to GSTs.
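One common AVI construction is a mixture sequential probability ratio test (mSPRT). Below is a minimal sketch for a normal mean difference with known variance; the mixture scale tau2 and the function name are illustrative choices, not a specific platform's API.

```python
import numpy as np

def msprt_p_values(mean_diffs, n_per_arm, sigma2, tau2=1.0):
    """Always-valid p-value sequence from a normal mixture SPRT.

    Assumes the observed mean difference at each look is approximately
    Normal(effect, 2*sigma2/n) and places a Normal(0, tau2) mixture over
    possible effects."""
    diffs = np.asarray(mean_diffs, dtype=float)
    V = 2 * sigma2 / np.asarray(n_per_arm, dtype=float)   # variance of the diff
    mixture_lr = np.sqrt(V / (V + tau2)) * np.exp(
        diffs**2 * tau2 / (2 * V * (V + tau2)))
    # The running minimum keeps the sequence monotone; it remains a valid
    # p-value no matter when you choose to stop.
    return np.minimum.accumulate(np.minimum(1.0, 1.0 / mixture_lr))

# Illustrative looks: mean difference and users per arm at each check.
print(msprt_p_values(mean_diffs=[0.40, 0.25, 0.18],
                     n_per_arm=[100, 400, 900], sigma2=1.0))
```

Stopping whenever the current value dips below alpha keeps the overall false positive rate at or below alpha, which is the property that makes fully flexible optional stopping safe.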
A third approach avoids the problem entirely. Fixed-power designs base the stopping decision on precision metrics (sample size, variance, estimated power) that don't depend on the observed treatment effect. Since the stopping rule is independent of the outcome, there's no selection bias and no sequential correction is needed.
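Here is a sketch of what an outcome-independent stopping rule can look like, using the standard two-sample normal-approximation sizing formula; the function names and inputs are illustrative.

```python
import numpy as np
from scipy import stats

def required_n_per_arm(mde, sigma2, alpha=0.05, power=0.8):
    """Observations per arm needed to detect an absolute effect of `mde`
    with the given power, two-sided z-test, variance sigma2 per arm."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)
    return int(np.ceil(2 * sigma2 * (z_alpha + z_beta) ** 2 / mde ** 2))

def should_stop(n_per_arm_so_far, variance_estimate, mde):
    """Stop once the accumulated sample reaches the size implied by the
    current variance estimate; the rule never looks at the observed effect."""
    return n_per_arm_so_far >= required_n_per_arm(mde, variance_estimate)

print(required_n_per_arm(mde=0.1, sigma2=1.0))  # ~1,570 per arm with these inputs
```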
Confidence supports all three approaches, reflecting the principle that teams should choose the method that matches their experimental context rather than being forced into a single framework.
What happens in organizations that ignore optional stopping?
The damage is cumulative and hard to detect. Each individual experiment might or might not be a false positive. You can't tell from the result alone. But across hundreds of experiments, the base rate of false positives is higher than the nominal rate, which means more features get shipped that have no real effect.
The downstream costs are concrete: engineering time spent building out "winning" features that don't actually improve the product, follow-up experiments that fail to replicate the original result, and a gradual loss of trust in experimentation as a decision-making tool. At Spotify, where 42% of experiments are rolled back after guardrail metrics detect regressions, maintaining statistical rigor in stopping decisions is part of what makes the guardrail system trustworthy. If the significance of the original result was inflated by optional stopping, the entire detection chain is compromised.