The peeking problem is the inflation of false positive rates that occurs when experimenters check statistical results before the planned sample size has been reached and stop the experiment early based on what they see. A standard fixed-horizon test is designed to be evaluated once, at the end. Every additional look at the data creates another chance for random noise to cross the significance threshold, and the cumulative false positive rate grows with each look. At five interim checks using a nominal 5% significance level, the actual false positive rate can exceed 14%.
This matters because peeking is the default human behavior. Product managers want to know if something is working. Engineers want to free up experiment bandwidth. Stakeholders ask for updates. At Spotify, where 300+ teams run over 10,000 experiments per year, the pressure to peek is structural, not a character flaw. The solution is methodology that accounts for how people actually use results.
Why does peeking inflate false positives?
A p-value measures surprise under the null hypothesis at a single, pre-specified time point. When you compute it at multiple time points, you're running multiple tests, each with its own chance of producing a false alarm. The more you look, the more likely it is that at least one look will cross the threshold by chance alone.
The math is straightforward. If each look has an independent 5% false positive probability, the probability of at least one false positive after k looks is 1 - (0.95)^k. After 10 looks, that's roughly 40%. In practice the looks aren't independent (they share accumulated data), so the inflation is lower than in the independent case, but it is still substantial: five looks at a 5% nominal level yield a true type I error rate of around 14%, not 5%.
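To see where the 14% figure comes from, here is a minimal Monte Carlo sketch (not from the original posts): an A/A test on a normally distributed metric with known variance, checked with a two-sided z-test at five equally spaced looks, stopping at the first look that crosses p < 0.05. The sample sizes, seed, and test choice are illustrative assumptions.

```python
# Minimal Monte Carlo sketch of how peeking inflates false positives.
# Assumptions: A/A test on a unit-variance normal metric, 5 equally
# spaced looks, two-sided z-test at each look, stop at first "win".
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

n_sims = 20_000        # number of simulated A/A experiments
n_per_arm = 1_000      # planned sample size per arm
looks = 5              # interim checks, including the final one
alpha = 0.05

checkpoints = np.linspace(n_per_arm // looks, n_per_arm, looks, dtype=int)

false_positives = 0
for _ in range(n_sims):
    a = rng.normal(size=n_per_arm)   # control arm, no true effect
    b = rng.normal(size=n_per_arm)   # treatment arm, no true effect
    for n in checkpoints:
        diff = b[:n].mean() - a[:n].mean()
        se = np.sqrt(2 / n)          # known unit variance in both arms
        p = 2 * stats.norm.sf(abs(diff) / se)
        if p < alpha:                # stop and "ship" at the first significant look
            false_positives += 1
            break

print(f"Realized false positive rate: {false_positives / n_sims:.3f}")
```

With these settings the realized rate typically lands near 0.14, against the nominal 0.05.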
The consequence is concrete: teams ship changes that had no real effect, consume experiment bandwidth on follow-up experiments to understand why the "improvement" disappeared, and gradually erode organizational trust in experimentation.
How do sequential testing methods solve peeking?
Sequential testing frameworks are designed for repeated looks at the data. They adjust the significance threshold at each analysis point so the overall false positive rate stays at the desired level.
Two main approaches exist. Group sequential tests (GSTs) pre-plan a fixed number of interim analyses at specific information fractions and distribute the significance budget across those looks using an alpha spending function. Always-valid inference (AVI) methods construct confidence sequences that remain valid at any stopping time, with no requirement to pre-specify when you'll look. Confidence supports both GSTs and always-valid inference, giving teams the right tool for their experimental context.
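As a rough sketch of the GST idea (the look schedule and spending functions here are illustrative, not Confidence's exact configuration), the standard Lan-DeMets O'Brien-Fleming-like and Pocock-like spending functions show how the 5% budget is allocated across five pre-planned looks. Turning spent alpha into exact per-look boundaries requires numerical integration over the joint distribution of the test statistics, which is omitted here.

```python
# Sketch of alpha spending for a group sequential test with 5 looks.
# Spending functions are the standard Lan-DeMets forms; the schedule
# of information fractions is an illustrative assumption.
import numpy as np
from scipy import stats

alpha = 0.05
info_fractions = np.array([0.2, 0.4, 0.6, 0.8, 1.0])  # 5 equally spaced looks

def obrien_fleming_spend(t, alpha):
    # O'Brien-Fleming-like spending: spends very little alpha early
    return 2 - 2 * stats.norm.cdf(stats.norm.ppf(1 - alpha / 2) / np.sqrt(t))

def pocock_spend(t, alpha):
    # Pocock-like spending: spends alpha more evenly across looks
    return alpha * np.log(1 + (np.e - 1) * t)

for name, spend in [("O'Brien-Fleming", obrien_fleming_spend), ("Pocock", pocock_spend)]:
    cumulative = spend(info_fractions, alpha)
    incremental = np.diff(cumulative, prepend=0.0)
    print(name)
    for t, c, inc in zip(info_fractions, cumulative, incremental):
        print(f"  look at {t:.0%} of data: cumulative alpha {c:.4f}, spent this look {inc:.4f}")
```

Both schedules sum to the full 5% at the final look; they differ only in how much of the budget is available at early looks.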
Spotify's engineering blog published a detailed comparison of these two frameworks, concluding that GSTs provide superior statistical power when the maximum sample size can be estimated in advance, while AVI is preferable when sample sizes are genuinely uncertain.
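On the AVI side, a common construction is the mixture sequential probability ratio test (mSPRT). The sketch below is an illustration of that general technique, not Confidence's implementation: it assumes a one-sample stream with known variance, a null of zero mean, and a normal mixing distribution over the alternative. By Ville's inequality, min(1, 1/likelihood ratio) can be checked after every observation without inflating the false positive rate.

```python
# Sketch of always-valid p-values via an mSPRT with a normal mixture.
# Assumptions: one-sample stream, known variance sigma2, H0: mean = 0,
# normal mixing distribution with variance tau2. Synthetic data.
import numpy as np

def mixture_likelihood_ratio(xs, sigma2=1.0, tau2=1.0):
    """Mixture likelihood ratio against H0: mean = 0 after each observation."""
    xs = np.asarray(xs, dtype=float)
    n = np.arange(1, len(xs) + 1)
    s = np.cumsum(xs)                       # running sum of observations
    scale = np.sqrt(sigma2 / (sigma2 + n * tau2))
    exponent = tau2 * s**2 / (2 * sigma2 * (sigma2 + n * tau2))
    return scale * np.exp(exponent)

def always_valid_p_values(xs, **kwargs):
    lr = mixture_likelihood_ratio(xs, **kwargs)
    # min(1, 1/LR) is valid at any stopping time; the running minimum
    # keeps the reported p-value from increasing between looks.
    return np.minimum.accumulate(np.minimum(1.0, 1.0 / lr))

rng = np.random.default_rng(0)
null_stream = rng.normal(0.0, 1.0, size=5_000)     # no true effect
effect_stream = rng.normal(0.1, 1.0, size=5_000)   # small true effect

print("A/A final p-value:", always_valid_p_values(null_stream)[-1])
print("A/B final p-value:", always_valid_p_values(effect_stream)[-1])
```

The price of this flexibility is wider intervals than a GST at any fixed sample size, which is the power trade-off the comparison above describes.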
Is there a peeking problem beyond the classical one?
Yes. Spotify's research identified what they called the "Peeking Problem 2.0": even with sequential testing corrections applied, false positive rates can inflate in experiments with longitudinal data (repeated measurements per user over time). Standard sequential tests assume independent observations, but user-level data in digital experiments violates that assumption. The follow-up post showed that modeling within-unit dependencies using longitudinal models with robust standard errors restores valid false positive control.
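As a rough illustration of the fix (not Spotify's exact model), the sketch below simulates repeated measurements per user with user-level treatment assignment and compares a naive OLS standard error, which treats every row as independent, with a cluster-robust standard error grouped by user. The variable names and data-generating process are assumptions.

```python
# Illustration: repeated measurements per user and cluster-robust SEs.
# Synthetic data; names and settings are assumptions for the example.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(1)
n_users, obs_per_user = 500, 10

users = np.repeat(np.arange(n_users), obs_per_user)
treated = np.repeat(rng.integers(0, 2, size=n_users), obs_per_user)    # user-level assignment
user_effect = np.repeat(rng.normal(0, 1, size=n_users), obs_per_user)  # within-user correlation
y = 0.0 * treated + user_effect + rng.normal(0, 1, size=n_users * obs_per_user)  # no true effect

X = sm.add_constant(pd.DataFrame({"treated": treated}))

naive = sm.OLS(y, X).fit()                                   # assumes independent rows
robust = sm.OLS(y, X).fit(cov_type="cluster",
                          cov_kwds={"groups": users})        # clusters by user

print("naive SE: ", naive.bse["treated"])    # too small -> inflated false positives
print("robust SE:", robust.bse["treated"])   # accounts for repeated measurements
```

The robust standard error is noticeably larger here because rows from the same user are correlated; feeding the naive one into a sequential test would undo the false positive control the correction was supposed to provide.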
The lesson: solving the peeking problem requires more than applying an off-the-shelf sequential correction. The correction itself has to account for the data structure of your experiment.