Lesson 6: Calculation frequency
- Calculating results only once ("Upon Conclusion") is the most efficient way to set up your experiments. You only get to see the results at the end of your experiment, which gives you higher sensitivity.
- Calculating results continuously sacrifices sensitivity for faster results. Select this if getting a rough estimate early on is important to you.
When you set up an experiment, you need to decide when you want results to be displayed.
Deliver results continuously or upon conclusion
There are two options for when to calculate results:
- Continuously display results to see new results every hour or day when the experiment is live.
- Upon Conclusion to see results when you end the experiment.
For both settings, all experiments run until you decide to stop them. The largest difference between the two options is that in experiments with results delivered Upon Conclusion, you can't view the results for your metrics during the experiment. The results show up when you end your experiment. For experiments using results calculated Continuously, you can follow the experiment results during the course of the experiment and decide to end it whenever you want.
If you're wondering why not always choose to view results continuously—stay tuned! We'll explain that shortly.
Two strategies to avoid being fooled by randomness
Imagine that you run an experiment that truly has no impact whatsoever on any metric. You just split users into two groups, measure some outcome, and calculate the difference between the groups every day. Even though the experiment didn't change anything, you can expect the results to fluctuate over time, just by random chance. If you check the results every day, on some days you might see a difference so large that you wrongly conclude that the experiment impacts the metric. Such results are called false positives. Checking the results every day gives you multiple chances to find a falsely significant result. To avoid being fooled by randomness and drawing the wrong conclusions from a false positive, you can use either of two strategies to keep this risk under control.
- Calculate results upon conclusion and use standard statistical tests
Before starting your experiment, you define a point in time when you plan to calculate the results. To make sure that you have a good chance of finding an impact, you calculate what sample size you need to reliably detect an effect of a certain size. By only calculating the results this one time, you have only one chance to be fooled by a false positive result. This method is the most efficient way to minimize the risks of false positives.
- Calculate the results continuously and use sequential tests
You calculate results every day or every hour, and correct for the increased risk of being fooled by randomness. The statistical methods used, known as sequential tests, correct for the multiple peeks at the data by using a stricter threshold to conclude significance. With this approach, you can check the results daily without hesitation. Because the tests use a stricter threshold for calling significance, the impact needs to be larger to be reliably detected. This means that for an experiment with results updated continuously, you either need to increase the sample size, or accept that you lose some sensitivity to detect changes. Sequential tests trade off sensitivity for faster results.
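The effect of peeking can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: it simulates experiments with no true effect, uses a plain z-test, and stands in a simple Bonferroni correction for a real sequential test. It is not the method Confidence uses, and the function name and all parameter values (14 days, 500 users per group per day, 2000 simulated experiments) are hypothetical.

```python
import math
import random
from statistics import NormalDist


def simulate_aa_tests(n_experiments=2000, n_days=14, users_per_day=500,
                      alpha=0.05, seed=1):
    """Simulate experiments with NO true effect and return the share that
    are (wrongly) called significant under three ways of checking results."""
    rng = random.Random(seed)
    # Threshold for a single look at the data.
    z_single = NormalDist().inv_cdf(1 - alpha / 2)
    # Stricter threshold: Bonferroni correction for n_days looks.
    # (Illustrative stand-in; real sequential tests are less conservative.)
    z_strict = NormalDist().inv_cdf(1 - alpha / (2 * n_days))

    hits = {"once_at_end": 0, "daily_naive": 0, "daily_corrected": 0}
    for _ in range(n_experiments):
        diff_sum = 0.0          # running (sum of group A) - (sum of group B)
        naive = corrected = False
        z = 0.0
        for day in range(1, n_days + 1):
            # Each user's outcome is N(0, 1), so a day's sum for one group
            # is N(0, users_per_day). Both groups share the same mean.
            sum_a = rng.gauss(0.0, math.sqrt(users_per_day))
            sum_b = rng.gauss(0.0, math.sqrt(users_per_day))
            diff_sum += sum_a - sum_b
            n = users_per_day * day                  # users per group so far
            z = (diff_sum / n) / math.sqrt(2.0 / n)  # z-score of mean difference
            naive = naive or abs(z) > z_single
            corrected = corrected or abs(z) > z_strict
        hits["once_at_end"] += int(abs(z) > z_single)
        hits["daily_naive"] += int(naive)
        hits["daily_corrected"] += int(corrected)
    return {name: count / n_experiments for name, count in hits.items()}


rates = simulate_aa_tests()
for name, rate in rates.items():
    print(f"{name}: {rate:.3f}")
```

With these settings, the single look at the end should produce false positives in roughly 5% of the simulated experiments, the uncorrected daily check should produce clearly more, and the corrected daily check should stay at or below the 5% level.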
Automatic sequential monitoring
Confidence runs automatic sequential monitoring checks on your experiment regardless of the evaluation frequency you choose, and it uses sequential tests for all of these checks. This means that you don't have to calculate results continuously to sleep well at night: the platform monitors your experiment for deterioration in all your metrics.
What you should choose
The trade-off differs from experiment to experiment. Sequential tests let you stop an experiment early, at the expense of some sensitivity and precision. If speed is more important than a precise estimate of the treatment effect, continuously updating the results is the better choice.
For a given experiment length, say two weeks, calculating the results only once at the end gives more precise estimates. If you are going to run the experiment for a fixed amount of time regardless, select Upon Conclusion to maximize your chances of finding effects.
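To get a feel for how much sensitivity is at stake, the sketch below uses the standard sample-size formula for a two-sided z-test comparing two group means. It compares a single look at the 5% level with a stricter per-look threshold. The stricter threshold here is a Bonferroni correction for 14 daily looks, which is an assumption chosen for illustration and is more conservative than a real sequential test. The function names, the effect size, and the standard deviation are likewise hypothetical.

```python
import math
from statistics import NormalDist


def sample_size_per_group(effect, sd, alpha=0.05, power=0.8):
    """Users needed per group to detect a mean difference of `effect`
    with a two-sided z-test at level `alpha` and the given power."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_power) * sd / effect) ** 2)


def minimal_detectable_effect(n_per_group, sd, alpha=0.05, power=0.8):
    """Smallest mean difference that can be detected with the given power
    when each group has `n_per_group` users."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return (z_alpha + z_power) * sd * math.sqrt(2 / n_per_group)


looks = 14  # hypothetical: one look per day for two weeks

# Sample size needed to detect an effect of 0.1 standard deviations.
n_once = sample_size_per_group(effect=0.1, sd=1.0)
n_strict = sample_size_per_group(effect=0.1, sd=1.0, alpha=0.05 / looks)

# Smallest detectable effect when the sample size is held fixed.
mde_once = minimal_detectable_effect(n_once, sd=1.0)
mde_strict = minimal_detectable_effect(n_once, sd=1.0, alpha=0.05 / looks)

print(f"users per group, one look:         {n_once}")
print(f"users per group, stricter level:   {n_strict}")
print(f"detectable effect, one look:       {mde_once:.4f}")
print(f"detectable effect, stricter level: {mde_strict:.4f}")
```

The two comparisons mirror the two choices described earlier: with a stricter threshold you either collect more users to detect the same effect, or you keep the sample size and accept that only larger effects are reliably detected.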
At Spotify, most experiments use Upon Conclusion and only display results when the experiment ends. To stop experiments early because of errors or negative user experiences, experimenters rely on Confidence's monitoring of their experiments. For the product decision, most teams want at least two weeks of data before deciding whether to ship.