Always-valid inference (AVI) is a class of sequential testing methods that construct confidence intervals that remain valid at any stopping time, without requiring the experimenter to pre-plan when or how often they'll look at the data. Unlike group sequential tests, which fix a schedule of interim analyses in advance, always-valid methods let you check results at any point and stop whenever the evidence is sufficient, while maintaining the correct overall false positive rate.
The practical appeal is flexibility. Some experiments have unpredictable sample sizes. Data may stream in continuously with no natural endpoint. A team might need to stop an experiment early for business reasons they couldn't anticipate at design time. In all these cases, always-valid inference provides the guarantee that, at whatever time you happen to stop, the confidence interval you observe is honestly calibrated.
How does always-valid inference work?
Traditional confidence intervals are valid at a single, pre-specified time point. If you compute a 95% confidence interval and check it repeatedly as data accumulates, the probability that the interval excludes the true parameter at some point during the experiment grows well beyond 5%. This is the peeking problem.
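A small simulation makes the inflation concrete (a sketch, not tied to any particular library; the sample size, number of looks, and simulation count are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def peeking_false_positive_rate(n_max=1000, n_looks=20, n_sims=2000):
    """Fraction of null experiments declared 'significant' at ANY of
    several interim looks, using an ordinary two-sided 5% z-test each time."""
    z_crit = 1.96
    checkpoints = np.linspace(n_max // n_looks, n_max, n_looks, dtype=int)
    hits = 0
    for _ in range(n_sims):
        x = rng.normal(0.0, 1.0, n_max)  # null is true: mean is exactly 0
        for n in checkpoints:
            z = x[:n].mean() * np.sqrt(n)  # known unit variance for simplicity
            if abs(z) > z_crit:
                hits += 1  # false positive: stopped and declared an effect
                break
    return hits / n_sims

print(peeking_false_positive_rate())  # far above the nominal 0.05
```

With 20 looks the realized false positive rate lands around a quarter, roughly five times the nominal level, and it keeps growing with more looks.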
Always-valid methods replace point-in-time confidence intervals with confidence sequences: sequences of intervals that are simultaneously valid at every sample size. With probability at least 95%, every interval in the sequence contains the true value, at every time t at once, regardless of how many times you've looked before or when you decide to stop.
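One concrete construction, purely illustrative: Robbins' classical normal-mixture confidence sequence for the mean of 1-sub-Gaussian observations (not necessarily the formula any particular tool uses; rho is a tuning parameter of the mixture):

```python
import numpy as np

def cs_halfwidth(t, alpha=0.05, rho=1.0):
    """Half-width of Robbins' normal-mixture confidence sequence for the
    mean of 1-sub-Gaussian data: the intervals x̄_t ± cs_halfwidth(t)
    cover the true mean simultaneously for ALL t >= 1 with probability
    at least 1 - alpha."""
    t = np.asarray(t, dtype=float)
    return np.sqrt(((t + rho) / t**2) * np.log((t + rho) / (rho * alpha**2)))

# The intervals shrink roughly like sqrt(log(t) / t), slightly slower than
# the 1/sqrt(t) rate of a fixed-sample interval: the price of time-uniform
# validity.
for t in (100, 1_000, 10_000):
    print(t, float(cs_halfwidth(t)))
```

Any stopping rule that waits until this interval excludes zero keeps the overall false positive rate at or below alpha.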
The mathematical machinery varies across implementations. Some use mixture martingales. Others construct e-processes: measures of evidence that, under the null hypothesis, have no tendency to grow, so large values can be trusted as a sign the null is false. The common thread is that validity doesn't depend on a pre-specified stopping rule.
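As a toy example of the e-process idea, here is a running likelihood ratio for unit-variance Gaussian data, with a hypothetical effect size delta chosen purely for illustration:

```python
import numpy as np

def e_process(x, delta=0.5):
    """Running likelihood ratio for H0: mean = 0 vs H1: mean = delta,
    assuming unit-variance observations. Under H0 this is a nonnegative
    martingale with expectation 1, so by Ville's inequality the chance it
    EVER reaches 1/alpha is at most alpha: rejecting when it crosses 20
    (for alpha = 0.05) is valid at any data-dependent stopping time."""
    return np.exp(np.cumsum(delta * x - delta**2 / 2))

rng = np.random.default_rng(42)
e_null = e_process(rng.normal(0.0, 1.0, 500))  # evidence decays under H0
e_alt = e_process(rng.normal(0.5, 1.0, 500))   # grows exponentially under H1
print(e_null[-1], e_alt[-1])
```

Under the null the process drifts toward zero; under the alternative its logarithm grows linearly, so the 1/alpha threshold is crossed quickly.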
Confidence supports always-valid inference as one of its two sequential testing options, alongside group sequential tests (GSTs). The choice between them depends on what you know about your experiment before it starts.
When is always-valid inference the right choice?
Always-valid inference is best suited to three scenarios.
First, experiments with genuinely uncertain sample sizes. If you're launching a feature in a new market and don't know whether you'll get 10,000 users or 100,000, pre-planning interim analyses at fixed information fractions (as a GST requires) is guesswork. AVI handles this naturally because it doesn't need a maximum sample size estimate.
Second, continuous monitoring. Some platform-level experiments run indefinitely, with the goal of detecting regressions as early as possible. There's no "final analysis" because the experiment has no planned end. AVI's guarantee of validity at every time point fits this use case directly.
Third, situations where stopping decisions are driven by external factors. A product launch date, a competitor move, a shift in business priorities. When the stopping rule comes from outside the experiment design, AVI ensures the statistical conclusions are still valid.
What is the power trade-off compared to group sequential tests?
The flexibility of always-valid inference comes at a cost: lower statistical power at any given sample size compared to a well-designed group sequential test.
Spotify's published comparison quantified this trade-off. GSTs with a fixed number of interim analyses achieved higher power than AVI methods across a range of effect sizes and sample sizes. The reason is structural: a GST restricts when decisions can happen, which concentrates the significance budget more efficiently. AVI must maintain validity at every possible stopping time, spreading the budget thinner.
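The cost shows up directly in interval widths. Comparing an ordinary fixed-sample z-interval with one simple closed-form confidence sequence (Robbins' normal mixture, used here only as a stand-in for AVI methods generally, not as Spotify's benchmarked implementation):

```python
import numpy as np

def fixed_halfwidth(n, alpha=0.05):
    """Ordinary two-sided 95% z-interval for a unit-variance mean."""
    return 1.96 / np.sqrt(n)

def avi_halfwidth(n, alpha=0.05, rho=1.0):
    """Robbins normal-mixture confidence sequence, valid at every n."""
    return np.sqrt(((n + rho) / n**2) * np.log((n + rho) / (rho * alpha**2)))

# The always-valid interval is consistently wider at the same n; wider
# intervals mean smaller effects go undetected, i.e. lower power.
for n in (100, 1_000, 10_000):
    print(n, round(avi_halfwidth(n) / fixed_halfwidth(n), 2))
```

At these sample sizes the always-valid interval is roughly 1.7 to 2 times wider than the fixed-sample one, which is the geometric face of the power gap.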
For most product A/B tests at Spotify, where traffic is predictable and maximum sample sizes can be estimated, the power advantage of GSTs matters. That's why Confidence uses GSTs as the default sequential method. But for the scenarios described above, where flexibility outweighs raw power, AVI is the right tool. Having both options available, with each correctly integrated into the full statistical stack (sample size calculation, variance reduction, multiple testing correction), is what makes the choice real rather than theoretical.
How does always-valid inference relate to optional stopping?
Optional stopping is the practice of stopping an experiment based on observed results, which inflates error rates when using standard (non-sequential) tests. Always-valid inference is one of the methods that makes optional stopping statistically valid.
With AVI, stopping whenever your confidence sequence excludes zero doesn't inflate the false positive rate. That's the core guarantee. The term "always-valid" means exactly this: the inference is valid regardless of the (possibly data-dependent) rule you used to stop.
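A simulation of exactly this stopping rule, stopping the moment the sequence excludes zero, illustrates the guarantee (sketch only; Robbins' normal-mixture sequence again stands in for AVI methods generally):

```python
import numpy as np

rng = np.random.default_rng(7)

def stopped_fpr(n_max=2000, n_sims=1000, alpha=0.05, rho=1.0):
    """Run null experiments, stopping the first time the confidence
    sequence x̄_t ± halfwidth(t) excludes zero. Despite the fully
    data-dependent stopping rule, the false positive rate is bounded
    by alpha."""
    t = np.arange(1, n_max + 1)
    halfwidth = np.sqrt(((t + rho) / t**2) * np.log((t + rho) / (rho * alpha**2)))
    hits = 0
    for _ in range(n_sims):
        x = rng.normal(0.0, 1.0, n_max)       # null: true mean is 0
        means = np.cumsum(x) / t              # running sample mean
        if np.any(np.abs(means) > halfwidth): # would have stopped and rejected
            hits += 1
    return hits / n_sims

print(stopped_fpr())  # stays at or below 0.05 up to simulation noise
```

The bound holds over an infinite time horizon, so a finite run like this one is typically well below the nominal 5%.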
This distinction matters in practice. Teams that peek at results and stop early aren't doing anything wrong if the analysis method supports it. The error is in using a method that wasn't designed for that behavior.