For a successful and well-planned experiment, you should commit beforehand to a strategy for when and how to evaluate the results, which avoids the infamous pitfalls of peeking. Similarly, knowing how much traffic you need to detect the effects of interest with high probability is essential for trustworthy results.

Test Evaluation Frequency

When setting up an experiment, you need to select how often to evaluate the results of the test. You have two options:
  • View results continuously
  • View results upon conclusion
Viewing results continuously means that the results are updated and presented hourly or daily using sequential tests. Viewing results only upon conclusion separates the data collection and analysis phases of the experiment: you view the results after you end the experiment, using fixed horizon tests. Read more about the details on the sequential tests page.
Selecting a strategy before launching the experiment is crucial, as it makes it possible to control the risk of false positives regardless of which strategy you choose. Looking at the results at times your strategy does not support is commonly referred to as the “peeking problem.”
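To see why this matters, the sketch below simulates an A/A experiment (both groups draw from the same distribution, so every significant result is a false positive) and compares applying a naive z-test at every daily peek against a single test at the fixed horizon. All parameter values and names here are illustrative assumptions, not part of Confidence:

```python
import numpy as np

rng = np.random.default_rng(7)
n_sims, n_days, users_per_day = 2000, 14, 200
z_crit = 1.96  # two-sided critical value for alpha = 0.05

peeking_hits = 0  # significant at ANY daily peek
final_hits = 0    # significant only at the fixed horizon

for _ in range(n_sims):
    # A/A experiment: treatment and control are identical by construction
    a = rng.normal(0, 1, n_days * users_per_day)
    b = rng.normal(0, 1, n_days * users_per_day)

    significant_early = False
    for day in range(1, n_days + 1):
        n = day * users_per_day
        diff = b[:n].mean() - a[:n].mean()
        se = np.sqrt(a[:n].var(ddof=1) / n + b[:n].var(ddof=1) / n)
        if abs(diff / se) > z_crit:
            significant_early = True  # a naive daily peek fired

    peeking_hits += significant_early

    # Fixed horizon: one naive z-test on the full data set
    n = n_days * users_per_day
    diff = b.mean() - a.mean()
    se = np.sqrt(a.var(ddof=1) / n + b.var(ddof=1) / n)
    final_hits += abs(diff / se) > z_crit

print(f"False positive rate with daily peeking: {peeking_hits / n_sims:.2%}")
print(f"False positive rate at fixed horizon:   {final_hits / n_sims:.2%}")
```

With these settings, peeking at every daily update flags a “significant” difference in well over 5% of the simulated A/A experiments, while the single fixed-horizon test stays close to the nominal 5%. Sequential tests address this by using wider, peeking-aware bounds at each look.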
If you choose to view results only at the end of the experiment, Confidence still uses sequential tests to run daily checks on all your metrics to ensure they have not deteriorated. Read more about that in the monitoring section.
The benefit of viewing results only at the end of the experiment is higher precision: a fixed horizon test yields less uncertainty in the final estimates of the treatment effects than a sequential test whose results update daily.

Alpha and Power

The false positive rate, also known as alpha, is 5% by default. The power level, also known as the true positive rate, defaults to 80%. A lower alpha makes false positives less likely, but it also reduces the chance of detecting an effect when there is one. The power level sets the desired probability of finding an effect if one exists. See the next section on power analysis for how the required sample size can inform how much traffic you need to achieve the desired level of power.
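As a concrete illustration, the textbook sample size formula for a two-sided two-sample z-test shows how alpha, power, the metric's standard deviation, and the smallest effect of interest interact. This is a generic statistical sketch, not Confidence's internal calculation; the function name and example values are assumptions:

```python
import math
from scipy.stats import norm

def sample_size_per_group(delta, sigma, alpha=0.05, power=0.80):
    """Approximate users needed per group for a two-sided two-sample z-test.

    delta: smallest difference in means you want to detect
    sigma: standard deviation of the metric
    """
    z_alpha = norm.ppf(1 - alpha / 2)  # critical value at the chosen alpha
    z_power = norm.ppf(power)          # quantile matching the desired power
    return math.ceil(2 * (sigma * (z_alpha + z_power) / delta) ** 2)

# Detecting a 0.1-unit shift on a metric with standard deviation 1,
# at alpha = 5% and power = 80%, requires about 1,570 users per group.
print(sample_size_per_group(delta=0.1, sigma=1.0))
```

Note how the required sample size grows quadratically as the effect of interest shrinks: halving delta quadruples the traffic you need.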
Confidence adjusts the selected alpha and power levels for multiple comparisons using a Bonferroni correction that handles success and guardrail metrics differently. The correction ensures that the error rates for the decision to ship the feature are at most the error rates implied by the false positive rate, determined by alpha, and the power level you provide. Read more about adjustment for multiple comparisons.
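For intuition, a plain Bonferroni correction simply divides the overall alpha evenly across the comparisons being made. The sketch below shows only this basic idea; Confidence's actual correction, which treats success and guardrail metrics differently, is described on the multiple comparisons page. All metric names and p-values here are made up:

```python
# Plain Bonferroni sketch: split the overall alpha evenly across metrics.
# This is NOT Confidence's exact scheme, which handles success and
# guardrail metrics differently; it only illustrates the correction.
def bonferroni_alpha(alpha, n_metrics):
    return alpha / n_metrics

p_values = {"conversion": 0.012, "latency": 0.030, "retention": 0.004}
adjusted = bonferroni_alpha(0.05, len(p_values))  # 0.05 / 3 ~ 0.0167

for metric, p in p_values.items():
    verdict = "significant" if p < adjusted else "not significant"
    print(f"{metric}: p = {p:.3f} -> {verdict} at adjusted alpha {adjusted:.4f}")
```

Under this stricter per-metric threshold, latency's p-value of 0.030 would no longer count as significant even though it is below the unadjusted 5% level.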