What is a Group Sequential Test?

A group sequential test (GST) is a sequential testing method that pre-plans a fixed number of interim analyses at specific points during an experiment, using an alpha spending function to distribute the significance budget across those analyses. Each interim analysis occurs at a predetermined information fraction (e.g., after 25%, 50%, 75%, and 100% of the planned sample has been observed). The test can stop early at any of those points if the evidence is strong enough to reject the null hypothesis or, in designs that also include futility boundaries, if a meaningful effect has become unlikely.

Confidence uses group sequential tests as its primary sequential testing method. The reason is power: Spotify's published comparison of sequential frameworks showed that GSTs provide superior statistical power compared to always-valid inference when the maximum sample size can be estimated in advance. For most product A/B tests, where traffic volumes are predictable and experiments have a natural maximum duration, that condition holds.

How does a group sequential test work?

A GST requires three inputs before the experiment starts: the maximum sample size, the number and timing of interim analyses, and the alpha spending function that governs how the significance budget is allocated.
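As a minimal sketch, the three inputs can be captured in a small design object that is fixed before the experiment starts. The class name `GSTDesign`, its fields, and the example numbers are hypothetical and chosen for illustration; they are not the Confidence API.

```python
from dataclasses import dataclass
from math import ceil


@dataclass(frozen=True)
class GSTDesign:
    """Hypothetical container for the three pre-registered inputs of a GST."""
    max_sample_size: int          # total units planned for the full experiment
    information_fractions: tuple  # timing of the looks, e.g. (0.25, 0.5, 0.75, 1.0)
    spending_function: str        # label only, e.g. "obrien_fleming" or "pocock"

    def __post_init__(self):
        fr = self.information_fractions
        if self.max_sample_size <= 0:
            raise ValueError("max_sample_size must be positive")
        if not fr or fr[-1] != 1.0:
            raise ValueError("the final look must be at information fraction 1.0")
        increasing = all(a < b for a, b in zip(fr, fr[1:]))
        if fr[0] <= 0 or not increasing:
            raise ValueError("fractions must be strictly increasing within (0, 1]")

    def look_sample_sizes(self):
        """Sample size at which each planned analysis takes place."""
        return [ceil(f * self.max_sample_size) for f in self.information_fractions]


design = GSTDesign(40_000, (0.25, 0.5, 0.75, 1.0), "obrien_fleming")
print(design.look_sample_sizes())  # [10000, 20000, 30000, 40000]
```

Freezing the dataclass mirrors the requirement that the plan is committed to up front rather than adjusted after seeing data.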

At each interim analysis, the test computes a test statistic from the accumulated data and compares it to a critical boundary. The boundary at each look is determined by how much of the total alpha has been "spent" so far. With a conservative spending function, early looks receive only a small portion of the budget, producing strict thresholds that require strong evidence to cross. Later looks are more lenient because more of the cumulative budget has been released by then.
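To make the link between spent alpha and boundaries concrete, here is a rough sketch that approximates two-sided z boundaries by simulating experiments under the null hypothesis. It assumes a Pocock-type spending function and a normally distributed test statistic; production tools compute these boundaries by numerical integration instead, and the function names here are illustrative only.

```python
import random
from math import e, log, sqrt
from statistics import NormalDist


def pocock_type_spend(t, alpha=0.05):
    """Cumulative alpha spent at information fraction t (Pocock-type, assumed)."""
    return alpha * log(1 + (e - 1) * t)


def calibrate_boundaries(fractions, alpha=0.05, n_paths=100_000, seed=1):
    """Approximate two-sided z boundaries by simulating the null hypothesis."""
    rng = random.Random(seed)
    paths = []
    for _ in range(n_paths):
        s, prev_t, zs = 0.0, 0.0, []
        for t in fractions:
            s += rng.gauss(0.0, sqrt(t - prev_t))  # independent increment
            zs.append(abs(s) / sqrt(t))            # |z| at this look
            prev_t = t
        paths.append(zs)

    boundaries, alive, spent = [], paths, 0.0
    for k, t in enumerate(fractions):
        increment = pocock_type_spend(t, alpha) - spent  # alpha released now
        spent += increment
        if k == 0:
            # The first look has a closed form: an ordinary two-sided z test.
            c = NormalDist().inv_cdf(1 - increment / 2)
        else:
            # Pick c so that about `increment` of all paths cross for the
            # first time at this look (only paths not yet stopped count).
            n_reject = round(increment * n_paths)
            zk = sorted((p[k] for p in alive), reverse=True)
            c = zk[n_reject]
        boundaries.append(c)
        alive = [p for p in alive if p[k] <= c]
    return boundaries


boundaries = calibrate_boundaries([0.25, 0.5, 0.75, 1.0])
print([round(c, 2) for c in boundaries])
```

Because the later boundaries come from simulation, they carry Monte Carlo noise; the sketch is meant to show the mechanism, not to replace a validated boundary calculation.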

If the test statistic crosses the boundary at any interim analysis, the experiment can stop with a conclusion. If it doesn't cross at any interim look, the experiment continues to the next planned analysis or to the final analysis at full sample size.
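The stopping rule can be sketched as a loop over the planned looks. The boundaries below use the commonly tabulated O'Brien-Fleming constant for four equally spaced looks at a two-sided alpha of 0.05 (about 2.024 at the final look); the function name `run_gst`, the decision labels, and the z-values are made up for illustration.

```python
from math import sqrt


def run_gst(z_by_look, boundaries):
    """Return (look, decision) given the z-statistics observed so far.

    z_by_look holds one z-statistic per completed look, in order.
    """
    for k, (z, c) in enumerate(zip(z_by_look, boundaries), start=1):
        if abs(z) >= c:
            return k, "reject_null"          # boundary crossed: stop here
    if len(z_by_look) < len(boundaries):
        return len(z_by_look), "continue"    # wait for the next planned look
    return len(boundaries), "fail_to_reject"  # final analysis, no crossing


# Assumed textbook boundaries: c_k = 2.024 * sqrt(K / k) with K = 4 looks.
obf_boundaries = [2.024 * sqrt(4 / k) for k in range(1, 5)]

print(run_gst([1.2, 2.1, 2.6], obf_boundaries))  # (3, 'reject_null')
print(run_gst([0.5, 0.4], obf_boundaries))       # (2, 'continue')
```

In the first call the statistic stays below the strict early thresholds (about 4.05 and 2.86) and crosses at the third look, where the threshold has dropped to about 2.34.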

The key distinction from other sequential methods: in a GST, you commit to when you'll look before the experiment starts, and you don't look between planned analyses. This structure is what gives GSTs their power advantage. Because decisions happen only at the planned points, less of the significance budget is consumed protecting against arbitrary stopping.

Why do GSTs have higher power than always-valid methods?

Statistical power is the probability of detecting a real effect when one exists. Every method that allows early stopping pays a power cost relative to a single-look fixed-horizon test, because some significance budget goes toward controlling false positives at interim looks instead of concentrating it all at the end.
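A small simulation can illustrate this power cost. The sketch below compares a single-look fixed-horizon test with two four-look designs at the same maximum sample size, assuming a normally distributed test statistic and commonly tabulated two-sided boundary constants (2.361 for Pocock, 2.024 at the final look for O'Brien-Fleming). The effect size is set so that the fixed-horizon test has roughly 80% power; all names and settings are illustrative.

```python
import random
from math import sqrt
from statistics import NormalDist


def simulate_power(boundaries, drift, n_trials=20_000, seed=7):
    """Share of simulated experiments whose |z| crosses a boundary at any look.

    drift is the expected z-statistic at the final analysis.
    """
    rng = random.Random(seed)
    k_total = len(boundaries)
    step_mean = drift / sqrt(k_total)  # mean of each equally sized increment
    rejections = 0
    for _ in range(n_trials):
        s, crossed = 0.0, False
        for k, c in enumerate(boundaries, start=1):
            s += rng.gauss(step_mean, 1.0)
            if abs(s) / sqrt(k) >= c:   # z-statistic at look k
                crossed = True
        rejections += crossed
    return rejections / n_trials


z = NormalDist()
drift = z.inv_cdf(0.975) + z.inv_cdf(0.80)  # ~80% power for a fixed test

inf = float("inf")
fixed = simulate_power([inf, inf, inf, z.inv_cdf(0.975)], drift)  # one look
obf = simulate_power([2.024 * sqrt(4 / k) for k in range(1, 5)], drift)
pocock = simulate_power([2.361] * 4, drift)

print(f"fixed horizon: {fixed:.3f}")
print(f"O'Brien-Fleming (4 looks): {obf:.3f}")
print(f"Pocock (4 looks): {pocock:.3f}")
```

Under these assumptions the Pocock design gives up visibly more power than the O'Brien-Fleming design, which stays close to the fixed-horizon test. The simulation does not cover always-valid methods.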

GSTs minimize this cost by restricting analyses to a small number of pre-planned points. An always-valid inference method, by contrast, must maintain valid confidence intervals at every possible stopping time. That continuous validity guarantee consumes more of the significance budget, resulting in lower power at any given sample size.

The practical difference is meaningful. In Spotify's simulations, GSTs with four interim analyses achieved notably higher power than always-valid methods at the same maximum sample size. For a platform running 10,000+ experiments per year, that power difference translates into more experiments producing clear answers and fewer producing ambiguous results.

When should you use a GST vs. always-valid inference?

GSTs are the right choice when you can estimate the maximum sample size before the experiment starts. This describes most product A/B tests: you know your daily active users, you've calculated the sample needed to detect your minimum detectable effect, and you can predict roughly how long the experiment will run.
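As a rough illustration of that planning step, the sketch below uses the standard normal-approximation formula for a two-sample comparison of means to estimate a fixed-horizon sample size, then converts it to an expected duration. The metric standard deviation, minimum detectable effect, and traffic numbers are made-up assumptions. A GST's maximum sample size is typically somewhat larger than this fixed-horizon figure, and that adjustment is not computed here.

```python
from math import ceil
from statistics import NormalDist


def fixed_sample_size_per_group(mde, sd, alpha=0.05, power=0.8):
    """Units per group to detect an absolute difference `mde` in means."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # two-sided significance threshold
    z_beta = z.inv_cdf(power)           # quantile for the target power
    return ceil(2 * ((z_alpha + z_beta) * sd / mde) ** 2)


def estimated_duration_days(n_per_group, daily_users, n_groups=2):
    """Days needed if `daily_users` new users enter the experiment per day."""
    return ceil(n_groups * n_per_group / daily_users)


# Hypothetical inputs: metric sd of 10, MDE of 0.5, 1,000 new users per day.
n = fixed_sample_size_per_group(mde=0.5, sd=10.0)
days = estimated_duration_days(n, daily_users=1_000)
print(n, days)
```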

Always-valid inference is preferable when the maximum sample size is genuinely uncertain, when data streams in continuously with no natural endpoint, or when you need the flexibility to stop at arbitrary times rather than pre-planned analysis points.

At Spotify, the vast majority of experiments fit the GST case. Traffic is predictable. Experiment durations are bounded by business needs. The maximum sample size can be estimated with reasonable accuracy. That's why Confidence defaults to GSTs for sequential analysis.

What is the relationship between GSTs and alpha spending?

An alpha spending function is what makes a GST work. It defines the cumulative proportion of the total significance level (alpha) that has been allocated through each interim analysis.

Two common choices are spending functions that approximate the O'Brien-Fleming and Pocock boundaries. O'Brien-Fleming-type spending is conservative early (very strict thresholds at the first looks, making early stopping unlikely unless the effect is large) and reserves most of the alpha for the final analysis. Pocock-type spending distributes alpha more evenly across looks, making early stopping more likely but reducing power at the final analysis.
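The contrast between the two can be made concrete with the Lan-DeMets forms of these spending functions, which are the standard approximations to the O'Brien-Fleming and Pocock boundaries. This is a sketch under the assumption of a two-sided alpha of 0.05 and four equally spaced looks; the function names are illustrative.

```python
from math import e, log, sqrt
from statistics import NormalDist


def obrien_fleming_spend(t, alpha=0.05):
    """Lan-DeMets O'Brien-Fleming-type cumulative alpha at fraction t."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return 2 * (1 - NormalDist().cdf(z / sqrt(t)))


def pocock_spend(t, alpha=0.05):
    """Lan-DeMets Pocock-type cumulative alpha at fraction t."""
    return alpha * log(1 + (e - 1) * t)


fractions = [0.25, 0.5, 0.75, 1.0]
for name, spend in [("O'Brien-Fleming", obrien_fleming_spend),
                    ("Pocock", pocock_spend)]:
    cumulative = [spend(t) for t in fractions]
    # Alpha newly released at each look = difference of cumulative values.
    increments = [b - a for a, b in zip([0.0] + cumulative, cumulative)]
    print(name, [round(x, 4) for x in increments])
```

Both functions spend exactly the full alpha at an information fraction of 1. The O'Brien-Fleming-type function releases almost nothing at the first look, while the Pocock-type function releases a sizeable share immediately.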

The choice of spending function reflects a trade-off: how aggressively do you want to stop early versus how much power do you want to preserve if the experiment runs to completion? For most product experiments, O'Brien-Fleming-style spending is the better default because the majority of experiments run close to their planned duration and benefit from having most of the alpha available at the end.

Related terms

Sequential Testing

Sequential testing is a statistical framework that allows experimenters to make valid decisions at multiple analysis points during an experiment, rather than waiting for a single final evaluation.

Alpha Spending

Alpha spending is the method of distributing a fixed significance budget (alpha, typically 5%) across multiple interim analyses in a group sequential test.

Information Fraction

The information fraction is the proportion of the total planned statistical information that has been observed so far in a sequential experiment.

Always-Valid Inference

Always-valid inference (AVI) is a class of sequential testing methods that construct confidence intervals remaining valid at any stopping time, without requiring the experimenter to pre-plan when o...

Peeking Problem

The peeking problem is the inflation of false positive rates that occurs when experimenters check statistical results before the planned sample size has been reached and stop the experiment early b...

Optional Stopping

Optional stopping is the practice of ending an experiment based on observed results rather than a pre-determined stopping rule.

Fixed-Power Design

A fixed-power design is a sequential experiment plan where the stopping rule is based on achieving a pre-specified level of statistical power rather than on observing a statistically significant re...

© 2026 Spotify

The Confidence name and logo are registered trademarks of Spotify.