Sequential Testing

What is Sequential Testing?

Sequential testing is a statistical framework that allows experimenters to make valid decisions at multiple analysis points during an experiment, rather than waiting for a single final evaluation. By adjusting how evidence is assessed at each look, sequential methods control the overall false positive rate while letting teams stop experiments earlier when the evidence is already clear.

The practical value is speed. A fixed-horizon test locks you into a pre-determined sample size regardless of how obvious the result becomes. If the treatment effect is twice as large as your minimum detectable effect, you still wait for the full sample. Sequential testing lets you stop early in that scenario while maintaining the same statistical guarantees. At Spotify, where teams run over 10,000 experiments per year across 300+ teams, this translates directly into higher experiment throughput: each experiment that concludes early frees bandwidth for the next one.

How does sequential testing differ from a fixed-horizon test?

In a fixed-horizon test, you calculate the required sample size in advance, run the experiment until you reach it, and then analyze the data exactly once. The significance level (typically 5%) applies to that single analysis. If you look at the results before the planned end, you've introduced the peeking problem: the actual false positive rate exceeds your nominal level.
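The peeking problem is easy to demonstrate by simulation. The sketch below (illustrative Python with made-up parameters, not any platform's actual implementation) runs many null A/B tests, checks a z-test at each of ten interim looks, and counts how often at least one look appears "significant" at the nominal 5% level.

```python
import math
import random

def peeking_false_positive_rate(n_sims=2000, n_per_look=100, n_looks=10, seed=0):
    """Simulate null experiments where the experimenter peeks at every
    interim look and stops at the first nominally significant z-test."""
    random.seed(seed)
    z_crit = 1.96  # two-sided 5% threshold for a single analysis
    hits = 0
    for _ in range(n_sims):
        total, total_sq, n = 0.0, 0.0, 0
        for _ in range(n_looks):
            for _ in range(n_per_look):
                x = random.gauss(0.0, 1.0)  # null is true: effect is exactly zero
                total += x
                total_sq += x * x
                n += 1
            m = total / n
            var = (total_sq - n * m * m) / (n - 1)
            z = m / math.sqrt(var / n)
            if abs(z) > z_crit:
                hits += 1
                break  # experimenter stops at the first "significant" look
    return hits / n_sims

fpr = peeking_false_positive_rate()
print(round(fpr, 3))
```

In simulations like this, ten unadjusted looks typically push the realized false positive rate to roughly four times the nominal 5% level.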

Sequential tests are designed for the opposite workflow. They assume you'll look at the data multiple times and adjust the statistical thresholds accordingly. Each look uses a slightly stricter significance criterion so that the total probability of a false positive across all looks stays at the desired level.

The trade-off is statistical power. Because sequential methods spend some of their significance budget on early looks, they typically require a somewhat larger maximum sample size than a fixed-horizon test to achieve the same power, assuming the experiment runs to completion. In practice, the expected sample size is often smaller because many experiments stop early.

What are the main sequential testing approaches?

Two families dominate modern experimentation platforms.

Group sequential tests (GSTs) pre-plan a fixed number of interim analyses at specified information fractions (e.g., 25%, 50%, 75%, 100% of the planned sample). An alpha spending function distributes the significance budget across these analyses. GSTs require estimating the maximum sample size in advance, but in return they provide higher statistical power than methods that allow arbitrary stopping. Spotify's comparison of sequential frameworks concluded that GSTs have superior power when the maximum sample size can be estimated, which describes most product A/B tests where traffic is predictable.
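To illustrate how an alpha spending function allocates the budget, here is a sketch of the Lan-DeMets O'Brien-Fleming-style spending function (a common choice, used here purely as an example; the 1.959964 constant is the standard two-sided 5% z-value). Note this shows only the spending schedule: converting cumulative spend into actual stopping boundaries requires joint-distribution computations handled by dedicated software.

```python
import math

def phi(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def obf_spending(t, z_half=1.959964):
    """Lan-DeMets O'Brien-Fleming-style spending function (two-sided):
    alpha(t) = 2 * (1 - Phi(z_{alpha/2} / sqrt(t))) at information fraction t."""
    return 2.0 * (1.0 - phi(z_half / math.sqrt(t)))

fractions = [0.25, 0.50, 0.75, 1.00]  # the planned interim analyses
cumulative = [obf_spending(t) for t in fractions]
incremental = [cumulative[0]] + [
    cumulative[i] - cumulative[i - 1] for i in range(1, len(cumulative))
]
for t, c, inc in zip(fractions, cumulative, incremental):
    print(f"t={t:.2f}  cumulative alpha={c:.5f}  spent this look={inc:.5f}")
```

The pattern is characteristic of O'Brien-Fleming-style designs: almost no alpha is spent at the 25% look, so early stopping requires overwhelming evidence, while most of the budget is preserved for the final analysis.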

Always-valid inference (AVI) methods construct confidence sequences or e-processes that remain valid at any stopping time, with no requirement to pre-plan analysis times. AVI is the right choice when you genuinely don't know how long the experiment will run, when data arrives continuously and you want the flexibility to stop at any moment, or when the experiment has no natural maximum duration. The cost is lower statistical power compared to GSTs at the same sample size.
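One widely cited AVI construction is the mixture sequential probability ratio test (mSPRT). The sketch below is a simplified illustration, not a production implementation: it assumes Gaussian observations with known variance sigma2, a Gaussian mixing distribution with variance tau2 (a tuning choice), and tests H0: mu = 0. The resulting p-value process is nonincreasing and valid at any stopping time.

```python
import math
import random

def msprt_pvalues(xs, sigma2=1.0, tau2=1.0):
    """Always-valid p-values from a mixture SPRT for a normal mean.

    Lambda_n = sqrt(sigma2 / (sigma2 + n*tau2))
               * exp(n^2 * xbar^2 * tau2 / (2 * sigma2 * (sigma2 + n*tau2)))
    p_n = min(p_{n-1}, 1 / Lambda_n), valid at any stopping time.
    """
    p, total, ps = 1.0, 0.0, []
    for n, x in enumerate(xs, start=1):
        total += x
        xbar = total / n
        lam = math.sqrt(sigma2 / (sigma2 + n * tau2)) * math.exp(
            (n * n * xbar * xbar * tau2) / (2.0 * sigma2 * (sigma2 + n * tau2))
        )
        p = min(p, 1.0 / lam)
        ps.append(p)
    return ps

random.seed(1)
data = [random.gauss(0.5, 1.0) for _ in range(500)]  # a genuine effect of 0.5
ps = msprt_pvalues(data)
print(f"always-valid p after {len(ps)} observations: {ps[-1]:.2e}")
```

Because `ps` is valid at every index, the experimenter can stop the moment it crosses 0.05 without inflating the false positive rate, which is exactly the flexibility AVI buys in exchange for power.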

Confidence supports both GSTs and always-valid inference, reflecting the position that the right sequential method depends on what you know about your experiment before it starts.

What happens if you use sequential testing without accounting for your data structure?

Sequential corrections assume certain properties of the test statistic at each analysis point. When those assumptions are violated, the corrections fail silently: the reported error guarantees no longer hold, but nothing in the output signals the problem.

The most important example: longitudinal data. In digital experiments, the same user generates repeated observations over time. Standard sequential tests assume independent data points. Spotify's research showed that applying group sequential corrections to longitudinal data still inflates false positive rates because the within-user correlations violate the independence assumption. The solution is to use longitudinal models with robust standard errors before applying the sequential correction.
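A toy example shows why the independence assumption matters for variance estimation (this is a simplified sketch of the mechanism, not Spotify's actual longitudinal model; all parameters and helper names are made up). Each user contributes ten observations that share a persistent per-user effect; the naive standard error treats all 2,000 observations as independent, while the cluster-robust version sums residuals within each user before squaring.

```python
import random
from statistics import mean

def mean_se_naive_vs_cluster(values_by_user):
    """Standard error of the overall mean: naive (assumes independent
    observations) versus cluster-robust (sums residuals within each user
    before squaring; finite-sample corrections omitted for brevity)."""
    obs = [v for user in values_by_user for v in user]
    n = len(obs)
    ybar = mean(obs)
    naive_var = sum((y - ybar) ** 2 for y in obs) / (n - 1) / n
    cluster_var = sum(sum(y - ybar for y in user) ** 2
                      for user in values_by_user) / n ** 2
    return naive_var ** 0.5, cluster_var ** 0.5

random.seed(7)
users = []
for _ in range(200):
    u = random.gauss(0.0, 1.0)  # persistent per-user effect -> within-user correlation
    users.append([u + random.gauss(0.0, 0.5) for _ in range(10)])

naive_se, cluster_se = mean_se_naive_vs_cluster(users)
print(f"naive SE={naive_se:.4f}  cluster-robust SE={cluster_se:.4f}")
```

The naive standard error understates the true uncertainty by a wide margin here, so any sequential boundary computed from it would be crossed far too often under the null.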

This is an example of a broader principle from the Confidence platform: a statistical method only works correctly when it comes with its full supporting methodology. Sequential testing without the right variance estimation, without adapted sample size calculations, and without multiple testing correction is a partial implementation that may not control what it claims to control.

How does sequential testing interact with variance reduction?

Variance reduction techniques like CUPED (using pre-experiment data to tighten confidence intervals) reduce the variance of your test statistic, which means experiments reach conclusive results faster. When combined with sequential testing, the interaction needs to be handled correctly: the variance reduction changes the information fraction at each analysis, and the sequential boundaries need to be recalculated accordingly.
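To make the CUPED mechanics concrete, here is a minimal sketch (illustrative Python; the data and helper name are invented): the coefficient theta = cov(Y, X) / var(X) is the variance-minimizing choice, and because the adjustment term has mean zero, the adjusted metric keeps the same mean while its variance shrinks in proportion to the squared pre/post correlation.

```python
import random
from statistics import mean, variance

def cuped_adjust(y, x):
    """CUPED adjustment: y_i - theta * (x_i - xbar), with
    theta = cov(y, x) / var(x), the variance-minimizing coefficient."""
    xbar, ybar = mean(x), mean(y)
    cov = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / (len(x) - 1)
    theta = cov / variance(x)
    return [yi - theta * (xi - xbar) for xi, yi in zip(x, y)]

random.seed(3)
pre = [random.gauss(0.0, 1.0) for _ in range(1000)]       # pre-experiment metric
post = [0.8 * p + random.gauss(0.0, 0.6) for p in pre]    # correlated in-experiment metric
adj = cuped_adjust(post, pre)
print(f"variance before={variance(post):.3f}  after={variance(adj):.3f}")
```

In a sequential design, that smaller variance means each observation carries more statistical information, which is precisely why the information fractions and stopping boundaries must be recomputed rather than reused from the unadjusted plan.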

Confidence adapts its sequential testing implementation to work with CUPED variance reduction, so the sample size calculations, spending functions, and stopping boundaries all reflect the reduced variance. This is one of the capability-matrix cells that matters in practice: supporting both features individually is different from supporting them together correctly.

Related terms

Peeking Problem
The peeking problem is the inflation of false positive rates that occurs when experimenters check statistical results before the planned sample size has been reached and stop the experiment early b...

Group Sequential Test
A group sequential test (GST) is a sequential testing method that pre-plans a fixed number of interim analyses at specific points during an experiment, using an alpha spending function to distribut...

Always-Valid Inference
Always-valid inference (AVI) is a class of sequential testing methods that construct confidence intervals remaining valid at any stopping time, without requiring the experimenter to pre-plan when o...

Alpha Spending
Alpha spending is the method of distributing a fixed significance budget (alpha, typically 5%) across multiple interim analyses in a group sequential test.

Information Fraction
The information fraction is the proportion of the total planned statistical information that has been observed so far in a sequential experiment.

Fixed-Power Design
A fixed-power design is a sequential experiment plan where the stopping rule is based on achieving a pre-specified level of statistical power rather than on observing a statistically significant re...

Optional Stopping
Optional stopping is the practice of ending an experiment based on observed results rather than a pre-determined stopping rule.

Variance Reduction
Variance reduction is a set of statistical techniques that tighten the confidence intervals of an A/B test without requiring more traffic.