Sequential Testing

What is a Peeking Problem?

The peeking problem is the inflation of false positive rates that occurs when experimenters check statistical results before the planned sample size has been reached and stop the experiment early based on what they see. A standard fixed-horizon test is designed to be evaluated once, at the end. Every additional look at the data creates another chance for random noise to cross the significance threshold, and the cumulative false positive rate grows with each look. With five equally spaced looks at a nominal 5% significance level, the actual false positive rate is about 14%.

This matters because peeking is the default human behavior. Product managers want to know if something is working. Engineers want to free up experiment bandwidth. Stakeholders ask for updates. At Spotify, where 300+ teams run over 10,000 experiments per year, the pressure to peek is structural, not a character flaw. The solution is methodology that accounts for how people actually use results.

Why does peeking inflate false positives?

A p-value measures surprise under the null hypothesis at a single, pre-specified time point. When you compute it at multiple time points, you're running multiple tests, each with its own chance of producing a false alarm. The more you look, the more likely it is that at least one look will cross the threshold by chance alone.

The math is straightforward. If each look had an independent 5% false positive probability, the probability of at least one false positive after k looks would be 1 - (0.95)^k. After 10 looks, that's roughly 40%. In practice the looks aren't independent (each look reuses the data accumulated so far), so the inflation is lower than in the independent case but still substantial. Five equally spaced looks at a 5% nominal level yield a true type I error rate of around 14%, not 5%.
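The two figures above can be checked with a small simulation. The sketch below is illustrative only: it assumes a one-sample z-test with known unit variance, equally spaced looks, and a naive "stop at the first significant look" rule. The function names and default parameters are hypothetical, not part of any Confidence API.

```python
import numpy as np


def independent_bound(n_looks, alpha=0.05):
    """Chance of at least one false positive if every look were independent."""
    return 1 - (1 - alpha) ** n_looks


def peeking_false_positive_rate(n_looks, n_per_look=200, n_sims=20000,
                                z_crit=1.959964, seed=0):
    """Estimate the false positive rate when peeking at a true null effect.

    Assumption: observations are N(0, 1), so the sum of each batch of
    n_per_look observations is N(0, n_per_look). Batch sums are drawn
    directly instead of simulating every observation.
    """
    rng = np.random.default_rng(seed)
    batch_sums = rng.normal(0.0, np.sqrt(n_per_look), size=(n_sims, n_looks))
    # Each look reuses all data collected so far, so the looks are correlated.
    cumulative_sums = np.cumsum(batch_sums, axis=1)
    n_seen = n_per_look * np.arange(1, n_looks + 1)
    z_scores = cumulative_sums / np.sqrt(n_seen)
    # The experiment counts as a false positive if any look crosses the threshold.
    rejected_at_any_look = (np.abs(z_scores) > z_crit).any(axis=1)
    return rejected_at_any_look.mean()


if __name__ == "__main__":
    for k in (1, 5, 10):
        print(k, round(independent_bound(k), 3),
              round(peeking_false_positive_rate(k), 3))
```

The simulated rate sits below the independent-looks bound at every k greater than 1, because consecutive looks share most of their data, but it is still well above the nominal 5%.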

The consequence is concrete: teams ship changes that had no real effect, consume experiment bandwidth on follow-up experiments to understand why the "improvement" disappeared, and gradually erode organizational trust in experimentation.

How do sequential testing methods solve peeking?

Sequential testing frameworks are designed for repeated looks at the data. They adjust the significance threshold at each analysis point so the overall false positive rate stays at the desired level.

Two main approaches exist. Group sequential tests (GSTs) pre-plan a fixed number of interim analyses at specific information fractions and distribute the significance budget across those looks using an alpha spending function. Always-valid inference (AVI) methods construct confidence sequences that remain valid at any stopping time, with no requirement to pre-specify when you'll look. Confidence supports both GSTs and always-valid inference, giving teams the right tool for their experimental context.
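To make the idea of an alpha spending function concrete, here is a minimal sketch of two widely used Lan-DeMets spending functions, the O'Brien-Fleming type and the Pocock type. The source does not say which spending function Confidence uses, so the choice of functions, the helper names, and the five equally spaced information fractions are all assumptions made for illustration.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def obrien_fleming_spend(t, alpha=0.05):
    """Cumulative alpha spent at information fraction t (0 < t <= 1).

    Lan-DeMets O'Brien-Fleming-type function: spends very little early on.
    """
    z = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    return 2 * (1 - _STD_NORMAL.cdf(z / math.sqrt(t)))


def pocock_spend(t, alpha=0.05):
    """Cumulative alpha spent at information fraction t (0 <= t <= 1).

    Lan-DeMets Pocock-type function: spends alpha more evenly across looks.
    """
    return alpha * math.log(1 + (math.e - 1) * t)


def incremental_alpha(spend, fractions, alpha=0.05):
    """Alpha newly released at each look, given increasing information fractions."""
    cumulative = [spend(t, alpha) for t in fractions]
    previous = [0.0] + cumulative[:-1]
    return [c - p for c, p in zip(cumulative, previous)]


if __name__ == "__main__":
    fractions = [0.2, 0.4, 0.6, 0.8, 1.0]  # hypothetical analysis schedule
    print("O'Brien-Fleming:", incremental_alpha(obrien_fleming_spend, fractions))
    print("Pocock:         ", incremental_alpha(pocock_spend, fractions))
```

Both functions spend exactly alpha by the final look, which is what keeps the overall false positive rate at the desired level. Turning the per-look increments into actual critical values requires accounting for the correlation between looks, which this sketch does not attempt.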

Spotify's engineering blog published a detailed comparison of these two frameworks, concluding that GSTs provide superior statistical power when the maximum sample size can be estimated in advance, while AVI is preferable when sample sizes are genuinely uncertain.

Is there a peeking problem beyond the classical one?

Yes. Spotify's research identified what they called the "Peeking Problem 2.0": even with sequential testing corrections applied, false positive rates can inflate in experiments with longitudinal data (repeated measurements per user over time). Standard sequential tests assume independent observations, but user-level data in digital experiments violates that assumption. The follow-up post showed that modeling within-unit dependencies using longitudinal models with robust standard errors restores valid false positive control.
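The effect of ignoring within-user dependence can be illustrated with a toy simulation. This is not the longitudinal model from Spotify's follow-up post. It is a simplified sketch that assumes a balanced design, a per-user random effect, and a per-user aggregation as a stand-in for a cluster-robust standard error. All names and parameter values are hypothetical.

```python
import numpy as np


def naive_and_cluster_se(n_users=2000, n_obs_per_user=5,
                         user_sd=1.0, noise_sd=1.0, seed=0):
    """Compare two standard errors for the overall mean of repeated measurements.

    Assumption: each measurement is user_effect + noise, so measurements from
    the same user are positively correlated.
    """
    rng = np.random.default_rng(seed)
    user_effect = rng.normal(0.0, user_sd, size=(n_users, 1))
    y = user_effect + rng.normal(0.0, noise_sd, size=(n_users, n_obs_per_user))

    # Naive: treat every measurement as an independent observation.
    naive_se = y.std(ddof=1) / np.sqrt(y.size)

    # Cluster-aware: collapse to one mean per user, then use the
    # between-user variability, since users are the independent units.
    user_means = y.mean(axis=1)
    cluster_se = user_means.std(ddof=1) / np.sqrt(n_users)
    return naive_se, cluster_se


if __name__ == "__main__":
    naive, cluster = naive_and_cluster_se()
    print(f"naive SE: {naive:.4f}, cluster-aware SE: {cluster:.4f}")
```

Under these assumptions the naive standard error is noticeably too small, so any test built on it, sequential or not, rejects a true null more often than its nominal level suggests.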

The lesson: solving the peeking problem requires more than applying an off-the-shelf sequential correction. The correction itself has to account for the data structure of your experiment.

Related terms

Sequential Testing
Sequential Testing

Sequential testing is a statistical framework that allows experimenters to make valid decisions at multiple analysis points during an experiment, rather than waiting for a single final evaluation.

Sequential Testing
Group Sequential Test

A group sequential test (GST) is a sequential testing method that pre-plans a fixed number of interim analyses at specific points during an experiment, using an alpha spending function to distribut...

Sequential Testing
Always-Valid Inference

Always-valid inference (AVI) is a class of sequential testing methods that construct confidence intervals remaining valid at any stopping time, without requiring the experimenter to pre-plan when o...

Sequential Testing
Alpha Spending

Alpha spending is the method of distributing a fixed significance budget (alpha, typically 5%) across multiple interim analyses in a group sequential test.

Sequential Testing
Optional Stopping

Optional stopping is the practice of ending an experiment based on observed results rather than a pre-determined stopping rule.

Sequential Testing
Fixed-Power Design

A fixed-power design is a sequential experiment plan where the stopping rule is based on achieving a pre-specified level of statistical power rather than on observing a statistically significant re...

Statistical Methods
Statistical Significance

Statistical significance is the determination that an observed difference between experiment groups is unlikely to have occurred by chance alone.

© 2026 Spotify

The Confidence name and logo are registered trademarks of Spotify.