Lesson 2: Choose evaluation strategy: Sequential or non-sequential tests

In this lesson, you learn how to select an evaluation frequency and which trade-offs to consider when making this choice.

This video gives a 2-minute, 10-second overview of the evaluation frequency topic.

The evaluation frequency determines the statistical test

The evaluation frequency has two options:

  • Continuous: lets you see results during the experiment
  • Upon conclusion: lets you see results only when the experiment has ended

Continuous evaluation uses sequential tests

When you select continuous evaluation, Confidence automatically uses a sequential test behind the scenes. Calculating results continuously throughout the experiment means testing the same hypothesis repeatedly. Sequential tests ensure that this repeated testing does not inflate the risk of a false positive result.
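To see why repeated testing needs special treatment, the following minimal sketch simulates A/A experiments (no true effect) and applies an ordinary, uncorrected z-test at every look. It is an illustration only, not how Confidence computes results: the function name, the number of looks, and the simplification that each batch of data contributes one standard normal increment are all assumptions made for the example.

```python
import math
import random
from statistics import NormalDist


def false_positive_rate(n_looks, n_sims=2000, alpha=0.05, seed=1):
    """Share of simulated A/A experiments that look significant at any of n_looks peeks."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    false_positives = 0
    for _ in range(n_sims):
        cumulative = 0.0
        for look in range(1, n_looks + 1):
            # With no true effect, each new batch adds a standard normal increment
            # (a simplifying assumption for this sketch).
            cumulative += rng.gauss(0.0, 1.0)
            z = cumulative / math.sqrt(look)  # z-statistic on all data so far
            if abs(z) >= z_crit:
                false_positives += 1
                break  # the experimenter stops at the first "significant" peek
    return false_positives / n_sims


if __name__ == "__main__":
    for looks in (1, 5, 10):
        print(looks, false_positive_rate(looks))
```

With a single look the rate should stay close to the nominal 5%, while with several uncorrected looks it climbs well above that. A sequential test adjusts the decision boundary so that the overall rate stays at the intended level.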

There are two types of sequential tests:

  • Group Sequential Tests: require the experimenter to provide a maximum sample size and must be stopped once this sample size is reached; they have higher power than Always Valid Tests for a given sample size
  • Always Valid Tests: do not require any pre-specified sample size and can keep letting new users into the sample indefinitely; they have lower power than Group Sequential Tests for a given sample size
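The power difference between the two families can be illustrated with a small simulation. The sketch below is an assumption-laden example, not the tests Confidence implements: it uses a Pocock boundary as one example of a Group Sequential Test (2.413 is the tabulated Pocock critical value for five equally spaced looks at two-sided alpha = 0.05) and a mixture SPRT with a normal mixing prior as one example of an Always Valid Test. The data model (one-sample, unit variance), the prior variance tau2, and all function names are hypothetical choices for illustration.

```python
import math
import random

# Tabulated Pocock critical value: 5 equally spaced looks, two-sided alpha = 0.05.
POCOCK_Z_5_LOOKS = 2.413


def msprt_rejects(sum_x, n, alpha=0.05, tau2=0.01):
    """Mixture SPRT for H0: mean = 0 with unit variance and a N(0, tau2) mixing prior."""
    mean = sum_x / n
    log_lr = -0.5 * math.log(1 + n * tau2) + (n * n * tau2 * mean * mean) / (2 * (1 + n * tau2))
    return log_lr >= math.log(1 / alpha)


def simulate_power(effect, n_max=500, step=10, n_looks=5, n_sims=2000, seed=7):
    """Return (group sequential, always valid) rejection rates up to n_max observations."""
    rng = random.Random(seed)
    look_every = n_max // n_looks
    gst_hits = 0
    avt_hits = 0
    for _ in range(n_sims):
        sum_x = 0.0
        gst_done = False
        avt_done = False
        for n in range(step, n_max + 1, step):
            # Sum of `step` unit-variance observations with the given mean effect.
            sum_x += rng.gauss(effect * step, math.sqrt(step))
            # The always valid test may be checked at every step.
            if not avt_done and msprt_rejects(sum_x, n):
                avt_done = True
            # The group sequential test is only checked at its planned looks.
            if not gst_done and n % look_every == 0:
                if abs(sum_x) / math.sqrt(n) >= POCOCK_Z_5_LOOKS:
                    gst_done = True
        gst_hits += gst_done
        avt_hits += avt_done
    return gst_hits / n_sims, avt_hits / n_sims


if __name__ == "__main__":
    print("no effect:  ", simulate_power(0.0))
    print("true effect:", simulate_power(0.125))
```

Under these assumptions, both tests should keep the false positive rate at or below roughly 5% when there is no effect, and the group sequential test should reject more often when there is a true effect. The price is that it needs the maximum sample size up front and must stop there.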

Upon Conclusion evaluation uses non-sequential tests

When you select Upon Conclusion, Confidence automatically uses a fixed horizon test, the classical kind of statistical test taught in introductory statistics courses.

Fixed horizon tests, such as the z-test, do not let you see results during the experiment. In return, once you stop the experiment and run the analysis, they have higher power than sequential tests for the same sample size. Although you should run sample size calculations before the experiment, you can view the powered effect at any point during the experiment, which helps you judge when it is time to end it.
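As a rough illustration of the sample size calculation for a two-sided, two-sample z-test, the sketch below uses the textbook formula n = 2 * ((z_(1-alpha/2) + z_power) * sd / effect)^2 per group. It also inverts the formula to get the smallest effect a given sample size can detect, which is how "powered effect" is interpreted here. That interpretation, the function names, and the defaults (alpha = 0.05, power = 0.8) are assumptions for the example, not Confidence's exact calculation.

```python
import math
from statistics import NormalDist


def _z_sum(alpha, power):
    """Sum of the critical values for the significance level and the target power."""
    normal = NormalDist()
    return normal.inv_cdf(1 - alpha / 2) + normal.inv_cdf(power)


def required_n_per_group(effect, sd, alpha=0.05, power=0.8):
    """Users needed per group to detect `effect` with a two-sided two-sample z-test."""
    return math.ceil(2 * (_z_sum(alpha, power) * sd / effect) ** 2)


def powered_effect(n_per_group, sd, alpha=0.05, power=0.8):
    """Smallest effect detectable with the target power at the current sample size."""
    return _z_sum(alpha, power) * sd * math.sqrt(2 / n_per_group)


if __name__ == "__main__":
    n = required_n_per_group(effect=0.1, sd=1.0)
    print(n)                                 # 1570
    print(round(powered_effect(n, 1.0), 4))  # just under 0.1
```

As the sample grows during the experiment, the powered effect shrinks. When it falls to the effect size you planned for, the experiment has reached its intended power.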

A note on quality tests during the experiment

Regardless of your choice of evaluation frequency, a well-designed experimentation platform uses continuous evaluation (sequential tests) on all quality and deterioration checks during the experiment. This means that you will not miss any experiments that deteriorate or have fatal errors, even if you choose Upon Conclusion as your evaluation frequency.

As a result, the only reason to select continuous evaluation is that you want to make a shipping decision as soon as possible.

A note on early stopping to ship a variant

Using sequential tests to detect harm and errors, so that failing experiments can be aborted early, is standard in modern A/B testing. However, stopping early to ship is less straightforward. This is because it is often easier to prove that something is too bad to ship than to prove that it is sufficiently good to ship.

A few things to consider are:

  • Time effects. If there are strong time effects, for example between weekdays, it might be important to run the experiment over a full week to average them out.
  • Novelty effects. If there is a strong novelty effect, it might be important to observe users for a longer time to determine the long-term effect of the change.
  • Power. Underpowered experiments tend to overestimate the treatment effect when they do find a significant result. If the effect is larger than you expected early in the data collection, you might want to run the experiment longer to ensure it is well powered.
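The point about power can be made concrete with a small simulation. The sketch below draws many effect estimates around a known true effect and averages only those that come out significantly positive, which mimics looking only at experiments that "won". The true effect, the standard errors, and the function name are hypothetical values chosen for illustration.

```python
import random
from statistics import NormalDist, mean


def mean_significant_estimate(true_effect, std_error, n_sims=20000, alpha=0.05, seed=3):
    """Average estimate among simulated experiments that are significantly positive."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    # Each simulated experiment yields a normally distributed effect estimate.
    estimates = [rng.gauss(true_effect, std_error) for _ in range(n_sims)]
    # Keep only the estimates that would be declared significant wins.
    winners = [e for e in estimates if e / std_error >= z_crit]
    return mean(winners)


if __name__ == "__main__":
    true_effect = 0.1
    # Large standard error (little data): low power.
    print("underpowered:", mean_significant_estimate(true_effect, std_error=0.10))
    # Small standard error (more data): high power.
    print("well powered:", mean_significant_estimate(true_effect, std_error=0.03))
```

In the underpowered setting, only unusually large estimates clear the significance threshold, so the average significant estimate ends up far above the true effect. In the well-powered setting, most estimates clear the threshold and their average stays close to the truth.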

How to choose

Selecting the evaluation frequency, and thereby the type of statistical test, is a trade-off between competing interests. Some general guidelines are:

  • The ability to abort failing experiments should not influence the choice of evaluation frequency. All experiments have sequential checks for deterioration and quality errors, regardless of the result evaluation frequency.
  • If speed of decision making is more important than the quality of the estimates, choose continuous evaluation.
  • If you are using continuous evaluation, provide a maximum sample size if you can, so that Confidence can use a Group Sequential Test, which maximizes power.
  • If you are going to run the experiment for a fixed number of days anyway before making the decision to ship, use Upon Conclusion evaluation.

For a guide on how to read and act on results depending on which evaluation strategy your experiment uses, see Lesson 9: Sequential and non-sequential tests in the Interpreting experiment results course.

The recommendation to experimenters at Spotify is to use the Upon Conclusion evaluation frequency, for three reasons. First, Confidence offers continuous deterioration and quality tests regardless of the evaluation frequency. Second, there are many reasons to run tests for at least a fixed time period, such as time effects and novelty effects. Third, Upon Conclusion maximizes the chance of detecting a true effect (power).

If you are interested in reading more about the types of sequential tests and how they compare, check out this Spotify engineering blog post about choosing a Sequential Testing Framework. There are also more details about the tests in the documentation.