Lesson 9: Sequential and non-sequential tests
In Confidence, sometimes you can see results during the experiment and sometimes only after it ends. Which of the two applies depends on the evaluation strategy chosen for the experiment. This lesson explains why, and what it means for when you can act on what you see.
The peeking problem
Standard statistical tests are designed to be run once: you collect your data, run the test, and look at the result. If you look at the result repeatedly (checking every day, or every time data updates) and you are willing to act the moment you see significance, you are effectively running many tests. Each look is an opportunity to find a false positive by chance, and the more often you look, the more likely you are to get one.
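With a 5% false positive rate per test and no correction, checking results 20 times gives you roughly a 64% chance of seeing at least one false positive result across all those looks, even if there is truly no effect. That figure is 1 - 0.95^20 and treats each look as an independent test. Looks at the same accumulating data overlap, so the real inflation is smaller, but it is still far above the 5% you intended.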
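The arithmetic behind the peeking problem can be checked with a short sketch. The first function reproduces the 64% figure under the simplifying assumption that every look is an independent test. The second simulates experiments with no true effect and an uncorrected two-sided z-test at every look. The function names, batch size, and number of simulated runs are illustrative assumptions, not anything Confidence exposes.

```python
import random
from statistics import NormalDist


def independent_looks_fpr(alpha: float, looks: int) -> float:
    """Chance of at least one false positive across `looks` independent tests."""
    return 1 - (1 - alpha) ** looks


def simulated_peeking_fpr(alpha=0.05, looks=20, batch=100, runs=2000, seed=1):
    """Share of no-effect experiments flagged significant at any of `looks` peeks."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    flagged = 0
    for _ in range(runs):
        total, n = 0.0, 0
        for _ in range(looks):
            # Sum of `batch` standard-normal observations with zero true effect.
            total += rng.gauss(0.0, batch ** 0.5)
            n += batch
            # Uncorrected z-test on all data collected so far.
            if abs(total / n ** 0.5) > z_crit:
                flagged += 1
                break
    return flagged / runs


print(round(independent_looks_fpr(0.05, 20), 3))  # 0.642
print(simulated_peeking_fpr())
```

Because each look reuses the data from earlier looks, the simulated rate comes out lower than the independence figure, yet still several times the nominal 5%.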
Sequential tests: built for continuous evaluation
Sequential tests solve the peeking problem by adjusting the statistical test to account for the fact that you will look at the results multiple times. They allocate the acceptable false positive rate across all the looks you will take, so that the overall false positive rate stays controlled. Sequential testing is the right choice when speed matters: when you want to be able to ship as soon as you have strong enough evidence, without waiting for a pre-specified end date.
There are two main types of sequential tests:
- Group Sequential Tests: used when you provide an expected sample size. These are more powerful than always-valid tests and are the preferred option when you have a reasonable estimate of how many users your experiment will collect.
- Always-Valid Tests: used when you do not provide an expected sample size. These can run indefinitely without inflating the false positive rate, but they have lower power than group sequential tests for a given sample size.
The trade-off for both types is that sequential tests require slightly more data than non-sequential tests to achieve the same power for a given effect size.
In Confidence, you use sequential tests by selecting Continuous as the evaluation frequency. Confidence automatically picks between Group Sequential Tests and Always-Valid Tests based on whether you provide an expected sample size.
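To make the idea of allocating the false positive rate across looks more concrete, here is a minimal sketch of two textbook alpha-spending functions of the Lan-DeMets kind. This lesson does not say which spending rule Confidence uses, so treat the two functions, the helper `alpha_per_look`, and the example numbers as assumptions chosen for illustration.

```python
from math import e, log
from statistics import NormalDist


def pocock_type_spend(alpha: float, t: float) -> float:
    """Cumulative alpha spent at information fraction t (Pocock-type)."""
    return alpha * log(1 + (e - 1) * t)


def obrien_fleming_type_spend(alpha: float, t: float) -> float:
    """Cumulative alpha spent at information fraction t (O'Brien-Fleming-type)."""
    norm = NormalDist()
    return 2 * (1 - norm.cdf(norm.inv_cdf(1 - alpha / 2) / t ** 0.5))


def alpha_per_look(spend, alpha: float, expected_n: int, looks_at: list[int]) -> list[float]:
    """Alpha newly made available at each look, given the sample size at each look."""
    spent_so_far = 0.0
    increments = []
    for n in looks_at:
        t = min(n / expected_n, 1.0)  # fraction of the expected sample size seen
        cumulative = spend(alpha, t)
        increments.append(cumulative - spent_so_far)
        spent_so_far = cumulative
    return increments


looks = [2500, 5000, 7500, 10000]  # hypothetical sample sizes at four looks
for spend in (pocock_type_spend, obrien_fleming_type_spend):
    print(spend.__name__, [round(a, 5) for a in alpha_per_look(spend, 0.05, 10000, looks)])
```

Both rules spend exactly the full alpha once the expected sample size is reached. The O'Brien-Fleming-type rule spends very little early and saves most of it for the end, while the Pocock-type rule spreads it more evenly. Turning the spent alpha into actual stopping boundaries takes further numerical work that is not shown here.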
Non-sequential tests: the highest power option
A standard non-sequential test (also called a fixed horizon test) is designed to be run once, after data collection ends. With this approach, you should not act on the metric results you see while the experiment is still running. The mid-experiment numbers are informational, but they do not carry the statistical guarantees of the final result.
The benefit is that non-sequential tests achieve higher statistical power for the same sample size compared to sequential tests. If you know in advance you will run the experiment for a fixed period regardless of what you see, this is the most efficient choice.
In Confidence, non-sequential tests correspond to the Upon Conclusion evaluation frequency. You can see the current powered effect at any point during the experiment, which helps you judge when it makes sense to end it.
The recommendation at Confidence is to use upon-conclusion evaluation unless speed of decision is a priority. This is because most teams benefit from running experiments for a full, pre-specified period to account for time effects and novelty effects, and upon-conclusion evaluation maximizes power for that use case.
Deterioration checks always run sequentially
Regardless of which evaluation strategy you choose, Confidence always monitors your metrics for deterioration using sequential tests. This means you will be alerted if a metric starts to move in the wrong direction, even if you chose upon-conclusion evaluation and have not yet reached the end of your experiment.
This is an important design choice: the ability to detect harm early is always on. You do not sacrifice protection against regressions by choosing a non-sequential test for your main results.
What this means in practice
When you look at experiment results, the key question to ask is: what evaluation strategy is this experiment using?
- Sequential (continuous) evaluation: the results you see are statistically valid to act on at any point. The test has accounted for repeated looks.
- Non-sequential (upon conclusion) evaluation: the final result after the experiment ends is valid to act on. Mid-experiment numbers should be used for awareness only, not decisions.
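The rule in the two bullets above can be written down as a small helper. This is only a sketch: `EvaluationFrequency` and `results_valid_for_decisions` are hypothetical names for illustration, not part of any Confidence API.

```python
from enum import Enum


class EvaluationFrequency(Enum):
    CONTINUOUS = "continuous"            # sequential test
    UPON_CONCLUSION = "upon conclusion"  # non-sequential (fixed horizon) test


def results_valid_for_decisions(frequency: EvaluationFrequency, experiment_ended: bool) -> bool:
    """Return True when the results shown are statistically valid to act on."""
    if frequency is EvaluationFrequency.CONTINUOUS:
        return True  # the test already accounts for repeated looks
    return experiment_ended  # mid-experiment numbers are for awareness only


print(results_valid_for_decisions(EvaluationFrequency.UPON_CONCLUSION, experiment_ended=False))  # False
```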
For a deeper look at how to choose between these strategies when setting up an experiment, including trade-offs around time effects, novelty effects, and power, see Lesson 2: Choose evaluation strategy in the Advance your experimentation course.
You are running an experiment with upon-conclusion evaluation. Why shouldn't you act on the results before the experiment has ended?
What is always true about deterioration monitoring in Confidence, regardless of evaluation strategy?
Notes for nerds
Confidence supports two types of sequential tests. Group Sequential Tests (GST) are used when you provide an expected sample size at setup. They are more powerful than the alternative for a given sample size and are the preferred option when you have a reasonable estimate of how many users the experiment will collect. Always-Valid Tests are used when no expected sample size is provided. They can run indefinitely without inflating the false positive rate, at the cost of somewhat lower power than a GST for the same amount of data.
The sample size calculator in Confidence accounts for this choice. When you switch from non-sequential to sequential evaluation, you can see the impact on required sample size directly in the calculator. Sequential tests require more data to achieve the same power for a given effect size, and the calculator shows you exactly how much.
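For a rough feel of why repeated looks cost data, the sketch below uses the standard per-group sample size formula for a two-sided, two-sample z-test and compares a single fixed-horizon test with a deliberately crude sequential stand-in: five planned looks, each tested at alpha / 5 (a Bonferroni split). This is not how the Confidence calculator works. The function name, the effect size, and the number of looks are assumptions made for illustration.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(delta: float, sigma: float, alpha: float = 0.05, power: float = 0.8) -> int:
    """Per-group sample size for a two-sided, two-sample z-test."""
    norm = NormalDist()
    z_alpha = norm.inv_cdf(1 - alpha / 2)
    z_power = norm.inv_cdf(power)
    return ceil(2 * (z_alpha + z_power) ** 2 * sigma ** 2 / delta ** 2)


# Fixed horizon: one test at the full alpha of 0.05.
fixed = n_per_group(delta=0.1, sigma=1.0)

# Crude sequential stand-in: size the final look at alpha / 5.
crude_sequential = n_per_group(delta=0.1, sigma=1.0, alpha=0.05 / 5)

print(fixed)  # 1570
print(crude_sequential, round(crude_sequential / fixed, 2))
```

A Bonferroni split is far more conservative than a purpose-built sequential test, so the gap it shows overstates the real cost, which this lesson describes as slight. The direction is the point: guarding against repeated looks means a stricter threshold at each look, and a stricter threshold needs more data for the same power.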