Lesson 9: Sequential and non-sequential tests

In Confidence, sometimes you can see results during the experiment and sometimes only after it ends. This depends on the evaluation strategy chosen for the experiment. Understanding the difference is important because it determines when you can act on what you see.

The peeking problem

Standard statistical tests are designed to be run once: you collect your data, run the test, and look at the result. If you look at the result repeatedly (checking every day, or every time data updates) and you are willing to act the moment you see significance, you are effectively running many tests. Each look is an opportunity to find a false positive by chance, and the more often you look, the more likely you are to get one.

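With a 5% false positive rate per test and no correction, checking results 20 times gives you roughly a 64% chance of seeing at least one false positive across all those looks, even if there is truly no effect. That figure treats each look as an independent test. Looks at the same accumulating data are correlated, so the real inflation is smaller, but it still ends up well above the 5% you intended.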
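The arithmetic behind the 64% figure can be sketched in a few lines of Python. The function name is illustrative, and the formula assumes every look is an independent test at the same false positive rate.

```python
def chance_of_any_false_positive(alpha: float, looks: int) -> float:
    """Chance of at least one false positive across `looks` independent tests.

    Assumption: each look is an independent test with false positive rate `alpha`.
    """
    # (1 - alpha) ** looks is the chance that every single look stays clean.
    return 1.0 - (1.0 - alpha) ** looks


for looks in (1, 5, 10, 20):
    rate = chance_of_any_false_positive(0.05, looks)
    print(f"{looks:>2} looks: {rate:.1%}")
```

The chance grows quickly with the number of looks, which is why an uncorrected "check every day" habit is risky.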

Sequential tests: built for continuous evaluation

Sequential tests solve the peeking problem by adjusting the statistical test to account for the fact that you will look at the results multiple times. They allocate the acceptable false positive rate across all the looks you will take, so that the overall false positive rate stays controlled. Sequential testing is the right choice when speed matters: when you want to be able to ship as soon as you have strong enough evidence, without waiting for a pre-specified end date.
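To make the idea of spreading the false positive rate across looks concrete, here is a minimal simulation sketch. It is not how Confidence implements sequential testing: it uses the simplest possible allocation (an even, Bonferroni-style split of alpha across a known number of looks), and the function and variable names are hypothetical. Each simulated experiment has no true effect, and each look adds one batch of standardized data.

```python
import random
from statistics import NormalDist


def false_positive_rate(looks: int, critical_z: float, runs: int, seed: int = 0) -> float:
    """Share of simulated no-effect experiments that cross critical_z at any look."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(runs):
        total = 0.0
        for k in range(1, looks + 1):
            total += rng.gauss(0.0, 1.0)   # one more batch of data, no true effect
            z = total / k ** 0.5           # z-statistic on all data collected so far
            if abs(z) > critical_z:        # "act the moment you see significance"
                hits += 1
                break
    return hits / runs


ALPHA, LOOKS, RUNS = 0.05, 20, 5000
# Uncorrected: the usual two-sided 5% threshold, reused at every look.
naive_z = NormalDist().inv_cdf(1 - ALPHA / 2)
# Assumption for illustration: alpha split evenly across the looks (Bonferroni).
split_z = NormalDist().inv_cdf(1 - ALPHA / (2 * LOOKS))

naive_rate = false_positive_rate(LOOKS, naive_z, RUNS)
split_rate = false_positive_rate(LOOKS, split_z, RUNS)
print(f"uncorrected peeking: {naive_rate:.3f}")
print(f"alpha split evenly:  {split_rate:.3f}")
```

In this sketch the uncorrected rate lands far above 5%, while the evenly split threshold keeps the overall rate under it. An even split is cruder and more conservative than purpose-built sequential tests, so treat it only as an illustration of the principle.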

There are two main types of sequential tests:

  • Group Sequential Tests: used when you provide an expected sample size. These are more powerful than always-valid tests and are the preferred option when you have a reasonable estimate of how many users your experiment will collect.
  • Always-Valid Tests: used when you do not provide an expected sample size. These can run indefinitely without inflating the false positive rate, but they have lower power than group sequential tests for a given sample size.

The trade-off for both types is that sequential tests require slightly more data than non-sequential tests to achieve the same power for a given effect size.

Non-sequential tests: the highest power option

A standard non-sequential test (also called a fixed horizon test) is designed to be run once, after data collection ends. With this approach, you should not act on the metric results you see while the experiment is still running. Mid-experiment numbers are informational, but they do not carry the statistical guarantees of the final result.

The benefit is that non-sequential tests achieve higher statistical power for the same sample size compared to sequential tests. If you know in advance you will run the experiment for a fixed period regardless of what you see, this is the most efficient choice.
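The power difference can be illustrated with a small simulation, again as a sketch under stated assumptions rather than a description of Confidence's tests. Both designs see the same amount of data and the same true effect. The fixed horizon design tests once at the end. The sequential stand-in tests at every look with alpha split evenly across looks (Bonferroni). All names and the chosen effect size are hypothetical.

```python
import random
from statistics import NormalDist


def detection_rate(looks: int, drift: float, critical_z: float,
                   check_every_look: bool, runs: int, seed: int = 1) -> float:
    """Share of simulated experiments with a true effect that reach significance."""
    rng = random.Random(seed)
    detected = 0
    for _ in range(runs):
        total = 0.0
        for k in range(1, looks + 1):
            total += rng.gauss(drift, 1.0)     # each batch carries the true effect
            is_final = k == looks
            if (check_every_look or is_final) and abs(total / k ** 0.5) > critical_z:
                detected += 1
                break
    return detected / runs


ALPHA, LOOKS, RUNS = 0.05, 10, 4000
# Effect chosen so the fixed horizon test has roughly 80% power.
DRIFT = 2.8 / LOOKS ** 0.5
fixed_z = NormalDist().inv_cdf(1 - ALPHA / 2)
# Assumption for illustration: even (Bonferroni) split of alpha across looks.
sequential_z = NormalDist().inv_cdf(1 - ALPHA / (2 * LOOKS))

fixed_power = detection_rate(LOOKS, DRIFT, fixed_z, check_every_look=False, runs=RUNS)
sequential_power = detection_rate(LOOKS, DRIFT, sequential_z, check_every_look=True, runs=RUNS)
print(f"fixed horizon power:          {fixed_power:.3f}")
print(f"evenly split sequential power: {sequential_power:.3f}")
```

The even split exaggerates the gap. Group sequential designs give up much less power than this stand-in does, but the direction of the trade-off is the same.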

Deterioration checks always run sequentially

Regardless of which evaluation strategy you choose, Confidence always monitors your metrics for deterioration using sequential tests. This means you will be alerted if a metric starts to move in the wrong direction, even if you chose non-sequential (upon conclusion) evaluation and have not yet reached the end of your experiment.

This is an important design choice: the ability to detect harm early is always on. You do not sacrifice protection against regressions by choosing a non-sequential test for your main results.

What this means in practice

When you look at experiment results, the key question to ask is: what evaluation strategy is this experiment using?

  • Sequential (continuous) evaluation: the results you see are statistically valid to act on at any point. The test has accounted for repeated looks.
  • Non-sequential (upon conclusion) evaluation: the final result after the experiment ends is valid to act on. Mid-experiment numbers should be used for awareness only, not decisions.
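The rule in the two bullets above can be written as a tiny decision helper. This is a hypothetical sketch for your own tooling or checklists. The enum and function names are made up for illustration and are not part of any Confidence API.

```python
from enum import Enum


class EvaluationStrategy(Enum):
    SEQUENTIAL = "sequential"             # continuous evaluation
    UPON_CONCLUSION = "upon_conclusion"   # non-sequential, fixed horizon


def can_act_on_results(strategy: EvaluationStrategy, experiment_ended: bool) -> bool:
    """Return True when the main metric results are valid to base a decision on."""
    if strategy is EvaluationStrategy.SEQUENTIAL:
        # The test already accounts for repeated looks.
        return True
    # Non-sequential: only the final result carries the statistical guarantees.
    return experiment_ended


print(can_act_on_results(EvaluationStrategy.SEQUENTIAL, experiment_ended=False))
print(can_act_on_results(EvaluationStrategy.UPON_CONCLUSION, experiment_ended=False))
```

The helper covers only the main results. Deterioration alerts are evaluated sequentially in either case, so they remain valid to act on mid-experiment.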

For a deeper look at how to choose between these strategies when setting up an experiment, including trade-offs around time effects, novelty effects, and power, see Lesson 2: Choose evaluation strategy in the Advance your experimentation course.

Notes for nerds

Confidence supports two types of sequential tests. Group Sequential Tests (GST) are used when you provide an expected sample size at setup. They are more powerful than the alternative for a given sample size and are the preferred option when you have a reasonable estimate of how many users the experiment will collect. Always-Valid Tests are used when no expected sample size is provided. They can run indefinitely without inflating the false positive rate, at the cost of somewhat lower power than a GST for the same amount of data.

The sample size calculator in Confidence accounts for this choice. When you switch from non-sequential to sequential evaluation, you can see the impact on required sample size directly in the calculator. Sequential tests require more data to achieve the same power for a given effect size, and the calculator shows you exactly how much.