Lesson 4: Number of comparisons
This lesson explains how the number of comparisons affects the required sample size in an experiment. All else being equal, more treatment groups mean more comparisons, and the more comparisons you make, the smaller the alpha you must use per metric to keep the false positive rate of the overall decision below your intended alpha. A smaller alpha per metric leads to a larger required sample size. The number of comparisons does not affect the power you need per metric.
The impact of multiple comparisons
The most common pattern in product A/B tests is to compare all treatment groups against a control group. This means there are as many comparisons as there are treatment groups being tested.
In principle, it is also possible to compare all treatment groups against each other. In that case, the number of comparisons equals the number of pairs of treatments, which is k(k-1)/2 for k treatments and grows much faster than the number of treatments itself.
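As a small illustration of the two comparison patterns described above, the sketch below counts comparisons for a given number of treatment groups. The function names are hypothetical and exist only for this example.

```python
from math import comb


def comparisons_vs_control(num_treatments: int) -> int:
    """Each treatment is compared against the control once."""
    return num_treatments


def pairwise_comparisons(num_treatments: int) -> int:
    """Every pair of treatments is compared once: k * (k - 1) / 2."""
    return comb(num_treatments, 2)


# Print both counts side by side for a few experiment sizes.
for k in (2, 4, 10):
    print(k, comparisons_vs_control(k), pairwise_comparisons(k))
```

With more than three treatments, the pairwise pattern already requires more comparisons than the treatment-versus-control pattern.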
The number of comparisons affects the required sample size: the more comparisons, the more samples are required. This is because the probability of making at least one Type I error (false positive) increases with the number of comparisons. To counter this, we lower the alpha level used for each comparison, which increases the required sample size.
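To make the relationship concrete, here is a minimal sketch of how a lower alpha translates into a larger sample size. It rests on several assumptions not stated in this lesson: a Bonferroni correction (alpha divided by the number of comparisons) as the adjustment, a two-sided z-test on a difference in means with equal group sizes, and illustrative values for the effect size and standard deviation. The function names are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def bonferroni_alpha(alpha: float, num_comparisons: int) -> float:
    """Assumed adjustment: split alpha evenly across the comparisons."""
    return alpha / num_comparisons


def sample_size_per_group(alpha: float, power: float,
                          effect: float, std_dev: float) -> int:
    """Normal-approximation sample size per group for a two-sided
    two-sample z-test with equal group sizes and a known std_dev."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the test
    z_power = z.inv_cdf(power)          # quantile for the desired power
    return ceil(2 * ((z_alpha + z_power) * std_dev / effect) ** 2)


# Illustrative inputs (assumptions): alpha 5%, power 80%,
# effect 0.1, standard deviation 1.0.
for m in (1, 3, 10):
    adjusted = bonferroni_alpha(0.05, m)
    n = sample_size_per_group(adjusted, 0.8, effect=0.1, std_dev=1.0)
    print(f"comparisons={m:2d}  alpha per comparison={adjusted:.4f}  n per group={n}")
```

The power argument stays the same across all three rows. Only the alpha per comparison changes, and the required sample size per group rises with it.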
The intuition behind this adjustment is that the more tests we run, the more chances there are to find a significant result by random chance. For example, if we run an experiment with 100 treatments and an alpha of 10%, even if no treatment has any effect, we would expect to see 10 treatments with a (false positive) significant result just by chance.
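The arithmetic in this example can be checked with a short sketch. The expected number of false positives is simply the number of comparisons times alpha. The probability of at least one false positive is computed here under the added assumption that the comparisons are independent, which the lesson does not state. The function names are hypothetical.

```python
def expected_false_positives(alpha: float, num_comparisons: int) -> float:
    """Expected count of false positives when no treatment has an effect."""
    return alpha * num_comparisons


def family_wise_error_rate(alpha: float, num_comparisons: int) -> float:
    """Probability of at least one false positive, assuming the
    comparisons are independent (an assumption for this sketch)."""
    return 1 - (1 - alpha) ** num_comparisons


# The lesson's example: 100 treatments, alpha of 10%, no real effects.
print(expected_false_positives(0.10, 100))
print(family_wise_error_rate(0.10, 100))

# A single comparison for contrast: the error rate equals alpha.
print(family_wise_error_rate(0.10, 1))
```

With unadjusted alpha and 100 comparisons, seeing at least one false positive is close to certain, which is the reason for the adjustment.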
How does the number of comparisons in an experiment affect the required sample size?
Why is the alpha level adjusted when there are multiple comparisons in an experiment?
Notes for nerds
Some people wonder what to do if more than one treatment is significantly better than the control group. This is a deep question. You can test the treatments against each other to see if one is better than the other. However, the difference between the treatment groups is likely smaller than the difference between each of them and the control group. For the same sample size, this makes the power to detect a difference between treatments lower than the power to detect a difference between a treatment and the control group.
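The drop in power can be sketched numerically. The example assumes a two-sided two-sample z-test with equal group sizes and a known standard deviation. The effect sizes, standard deviation, and sample size are illustrative values, and the function name is hypothetical.

```python
from statistics import NormalDist


def power_two_sided(effect: float, std_dev: float,
                    n_per_group: int, alpha: float) -> float:
    """Approximate power of a two-sided two-sample z-test."""
    z = NormalDist()
    std_error = std_dev * (2 / n_per_group) ** 0.5
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = abs(effect) / std_error  # true effect in standard-error units
    # Probability of landing in either rejection region.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


# Assumed scenario: each treatment beats control by about 0.1,
# but the two treatments differ from each other by only 0.05.
n = 1570
print(power_two_sided(0.10, 1.0, n, alpha=0.05))  # treatment vs control
print(power_two_sided(0.05, 1.0, n, alpha=0.05))  # treatment vs treatment
```

Under these assumptions, a sample size that gives roughly 80% power against the control gives far less power for the treatment-versus-treatment comparison.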
There are more advanced multiple-comparison methods that can help when choosing among many treatments, such as Tukey's and Scheffé's methods, and Dunnett's test. However, we don't recommend them because of the effort involved in learning to use them correctly. Instead, gather stakeholders to decide which of the significant treatments to implement based on factors such as:
- Complexity
- Cost
- Future extensibility