A group sequential test (GST) is a sequential testing method that pre-plans a fixed number of interim analyses at specific points during an experiment, using an alpha spending function to distribute the significance budget across those analyses. Each interim analysis occurs at a predetermined information fraction (e.g., after 25%, 50%, 75%, and 100% of the planned sample has been observed), and the test can stop early if the evidence at any of those points is strong enough to reject the null hypothesis, or to stop for futility when the experiment is unlikely to reach significance.
Confidence uses group sequential tests as its primary sequential testing method. The reason is power: Spotify's published comparison of sequential frameworks showed that GSTs provide superior statistical power compared to always-valid inference when the maximum sample size can be estimated in advance. For most product A/B tests, where traffic volumes are predictable and experiments have a natural maximum duration, that condition holds.
How does a group sequential test work?
A GST requires three inputs before the experiment starts: the maximum sample size, the number and timing of interim analyses, and the alpha spending function that governs how the significance budget is allocated.
At each interim analysis, the test computes a test statistic from the accumulated data and compares it to a critical boundary. The boundary at each look is determined by how much of the total alpha has been "spent" so far. Early looks use a small portion of the budget, producing strict thresholds that require strong evidence to cross. Later looks are more lenient because more of the budget is available.
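To make the spending-to-boundary step concrete, here is a minimal sketch that converts a cumulative spending schedule into per-look z-thresholds. One loud assumption: real GST boundaries (e.g., computed with the Lan-DeMets algorithm) account for the correlation between looks; this sketch converts each alpha increment directly into a two-sided threshold, a Bonferroni-style approximation that is slightly conservative. The schedule numbers are illustrative, not from any specific spending function.

```python
# Crude, conservative GST boundaries from an alpha spending schedule.
# ASSUMPTION: exact boundaries (e.g., Lan-DeMets) adjust for correlation
# between looks; converting each increment directly to a z-threshold is
# a Bonferroni-style approximation that is slightly too strict.
from statistics import NormalDist

def boundaries_from_spending(cumulative_alpha):
    """cumulative_alpha: total alpha spent through each look, e.g. [0.001, 0.01, 0.05]."""
    z = NormalDist()
    bounds, prev = [], 0.0
    for spent in cumulative_alpha:
        increment = spent - prev                      # alpha newly available at this look
        bounds.append(z.inv_cdf(1 - increment / 2))   # two-sided z-threshold
        prev = spent
    return bounds

# An O'Brien-Fleming-like schedule: most alpha is saved for the final look,
# so thresholds start very strict and relax toward the end.
print(boundaries_from_spending([0.0001, 0.005, 0.05]))
```

Note how the strict early thresholds fall out directly from the small early increments: spending almost nothing at the first look forces a boundary near z = 3.9.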
If the test statistic crosses the boundary at any interim analysis, the experiment can stop with a conclusion. If it doesn't cross at any interim look, the experiment continues to the next planned analysis or to the final analysis at full sample size.
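The decision rule itself is just a loop over the pre-planned looks. A minimal sketch, where the z-statistics and boundaries are illustrative numbers rather than output of a real spending calculation:

```python
# Sketch of the GST decision loop: at each pre-planned look, compare the
# accumulated z-statistic to that look's boundary and stop on a crossing.
# The numbers below are illustrative, not from a real spending function.

def run_looks(z_stats, bounds):
    """Return the 1-based look at which the test stopped, or None if it never crossed."""
    for look, (z, bound) in enumerate(zip(z_stats, bounds), start=1):
        if abs(z) >= bound:   # boundary crossed: stop and reject
            return look
    return None               # no crossing: fail to reject at full sample size

print(run_looks([1.1, 2.9, 2.1], [3.9, 2.8, 2.0]))  # stops at look 2
print(run_looks([0.5, 0.6, 0.4], [3.9, 2.8, 2.0]))  # None: runs to completion
```

The same structure extends to futility stopping by adding a lower boundary that, when the statistic falls below it, stops the experiment with a fail-to-reject conclusion.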
The key distinction from other sequential methods: in a GST, you commit to when you'll look before the experiment starts. You don't look between planned analyses. This structure is what gives GSTs their power advantage. By restricting when decisions happen, less of the significance budget is consumed protecting against arbitrary stopping.
Why do GSTs have higher power than always-valid methods?
Statistical power is the probability of detecting a real effect when one exists. Every method that allows early stopping pays a power cost relative to a single-look fixed-horizon test, because some significance budget goes toward controlling false positives at interim looks instead of concentrating it all at the end.
GSTs minimize this cost by restricting analyses to a small number of pre-planned points. An always-valid inference method, by contrast, must maintain valid confidence intervals at every possible stopping time. That continuous validity guarantee consumes more of the significance budget, resulting in lower power at any given sample size.
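The power cost of interim looks can be seen in a small Monte Carlo sketch. To keep it self-contained, the multi-look design below splits alpha evenly via Bonferroni, which is a crude stand-in for a real spending-based GST (a proper spending function would lose less power, and an always-valid method would lose more); the effect size, sample size, and look schedule are all illustrative assumptions.

```python
# Monte Carlo sketch of the power cost of interim looks: a single
# fixed-horizon z-test vs. a crude 4-look design that splits alpha
# evenly via Bonferroni. ASSUMPTION: this is a stand-in for a real
# spending-based GST, which would lose less power than Bonferroni.
import random
from statistics import NormalDist

random.seed(0)
norm = NormalDist()
ALPHA, LOOKS, N, EFFECT, SIMS = 0.05, 4, 400, 0.15, 2000  # illustrative values

def z_at(n, xs):
    return sum(xs[:n]) / (n ** 0.5)       # z-statistic on the first n observations

fixed_bound = norm.inv_cdf(1 - ALPHA / 2)
look_bound  = norm.inv_cdf(1 - ALPHA / (2 * LOOKS))       # same bound at every look
look_ns     = [N * k // LOOKS for k in range(1, LOOKS + 1)]

fixed_hits = seq_hits = 0
for _ in range(SIMS):
    xs = [random.gauss(EFFECT, 1.0) for _ in range(N)]
    fixed_hits += abs(z_at(N, xs)) >= fixed_bound
    seq_hits   += any(abs(z_at(n, xs)) >= look_bound for n in look_ns)

print(fixed_hits / SIMS, seq_hits / SIMS)  # multi-look power comes out lower
```

The gap between the two estimates is the budget spent protecting the interim looks; a GST's spending function shrinks that gap by making the early thresholds strict, and an always-valid method widens it by protecting every possible stopping time.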
The practical difference is meaningful. In Spotify's simulations, GSTs with four interim analyses achieved notably higher power than always-valid methods at the same maximum sample size. For a platform running 10,000+ experiments per year, that power difference translates into more experiments producing clear answers and fewer producing ambiguous results.
When should you use a GST vs. always-valid inference?
GSTs are the right choice when you can estimate the maximum sample size before the experiment starts. This describes most product A/B tests: you know your daily active users, you've calculated the sample needed to detect your minimum detectable effect, and you can predict roughly how long the experiment will run.
Always-valid inference is preferable when the maximum sample size is genuinely uncertain, when data streams in continuously with no natural endpoint, or when you need the flexibility to stop at arbitrary times rather than pre-planned analysis points.
At Spotify, the vast majority of experiments fit the GST case. Traffic is predictable. Experiment durations are bounded by business needs. The maximum sample size can be estimated with reasonable accuracy. That's why Confidence defaults to GSTs for sequential analysis.
What is the relationship between GSTs and alpha spending?
An alpha spending function is what makes a GST work. It defines the cumulative proportion of the total significance level (alpha) that has been allocated through each interim analysis.
Two common choices are the O'Brien-Fleming and Pocock spending functions, named for the boundary families they approximate. O'Brien-Fleming spending is conservative early (very strict thresholds at the first looks, making early stopping unlikely unless the effect is large) and concentrates most of the alpha at the final analysis. Pocock spending distributes alpha more evenly, making early stopping more likely but reducing power at the final analysis.
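The two spending shapes are easy to compare directly. A minimal sketch using the Lan-DeMets spending functions that approximate the O'Brien-Fleming and Pocock boundary families, each returning the cumulative alpha spent by information fraction t for a two-sided test:

```python
# Lan-DeMets spending functions approximating the O'Brien-Fleming and
# Pocock boundary families: cumulative alpha spent by information
# fraction t (0 < t <= 1) for a two-sided test at level alpha.
from math import e, log, sqrt
from statistics import NormalDist

norm = NormalDist()

def obf_spend(t, alpha=0.05):
    # Conservative early: spends almost nothing at small t.
    z = norm.inv_cdf(1 - alpha / 2)
    return 2 * (1 - norm.cdf(z / sqrt(t)))

def pocock_spend(t, alpha=0.05):
    # Near-linear: spreads alpha much more evenly across looks.
    return alpha * log(1 + (e - 1) * t)

for t in (0.25, 0.5, 0.75, 1.0):
    print(f"t={t:.2f}  OBF={obf_spend(t):.4f}  Pocock={pocock_spend(t):.4f}")
```

At t = 0.25 the O'Brien-Fleming function has spent a tiny sliver of the 0.05 budget while Pocock has already spent more than a third of it; both reach exactly alpha at t = 1.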
The choice of spending function reflects a trade-off: how aggressively do you want to stop early versus how much power do you want to preserve if the experiment runs to completion? For most product experiments, O'Brien-Fleming-style spending is the better default because the majority of experiments run close to their planned duration and benefit from having most of the alpha available at the end.