Lesson 2: Number of success metrics
This lesson teaches you how the number of success metrics affects the required sample size in experiments. The more success metrics you have, the smaller the alpha you must use per metric to keep the overall false positive rate of the decision at or below alpha. A smaller per-metric alpha, in turn, requires a larger sample size. The number of success metrics does not affect the power you need per metric.
False positive rate
Recall Spotify's multi-metric decision rule:
Ship if, and only if, at least one success metric has significantly improved, and the treatment is significantly non-inferior to control in all guardrail metrics.
Each additional success metric in the experiment gives an additional chance of reaching the conclusion to ship the treatment. If each success metric is tested with a significance level of 5%, the probability of having at least one false positive increases with the number of success metrics.
To counter this, we adjust the significance level for each success metric test to keep the chance of at least one false positive at 5% or smaller. The simplest way to do this is the Bonferroni correction, where alpha is adjusted by dividing it by the number of success metrics.
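The effect of the correction can be checked with a few lines of arithmetic. The sketch below is a minimal illustration, not part of the lesson's tooling: the function names are hypothetical, and it assumes the success metrics are independent and that none of them has a true effect.

```python
def fwer_independent(alpha_per_metric: float, n_metrics: int) -> float:
    """Probability of at least one false positive across n_metrics tests.

    Assumes the tests are independent and every null hypothesis is true.
    """
    return 1.0 - (1.0 - alpha_per_metric) ** n_metrics


def bonferroni_alpha(alpha: float, n_metrics: int) -> float:
    """Per-metric significance level under the Bonferroni correction."""
    return alpha / n_metrics


ALPHA = 0.05
for m in (1, 2, 5, 10):
    unadjusted = fwer_independent(ALPHA, m)
    adjusted = fwer_independent(bonferroni_alpha(ALPHA, m), m)
    print(f"{m:>2} metrics: unadjusted {unadjusted:.3f}, Bonferroni {adjusted:.3f}")
```

Under these assumptions, ten success metrics each tested at 5% give roughly a 40% chance of at least one false positive, while the Bonferroni-adjusted tests stay just below 5%.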
[Figure: Probability of at least one false positive]
The Bonferroni correction effectively bounds the probability of at least one false positive at alpha or smaller, regardless of the number of success metrics.
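The price of the smaller per-metric alpha is a larger required sample size. The sketch below illustrates this with the textbook sample size formula for a one-sided z-test of a difference in means with equal group sizes and a known, common standard deviation. These are simplifying assumptions chosen for the example, and the function name and inputs (`effect`, `sd`) are hypothetical rather than taken from the lesson.

```python
import math
from statistics import NormalDist


def required_n_per_group(alpha: float, power: float, effect: float,
                         sd: float, n_metrics: int = 1) -> int:
    """Sample size per group for one success metric after Bonferroni correction.

    Assumes a one-sided z-test, equal group sizes, and a known common
    standard deviation `sd`; `effect` is the absolute difference to detect.
    """
    alpha_per_metric = alpha / n_metrics              # Bonferroni-adjusted alpha
    z_alpha = NormalDist().inv_cdf(1 - alpha_per_metric)
    z_power = NormalDist().inv_cdf(power)             # power is NOT adjusted
    return math.ceil(2 * ((z_alpha + z_power) * sd / effect) ** 2)


for m in (1, 2, 5, 10):
    n = required_n_per_group(alpha=0.05, power=0.8, effect=0.1, sd=1.0, n_metrics=m)
    print(f"{m:>2} success metrics: {n} users per group")
```

Only the alpha term changes with the number of success metrics. The power term stays the same, which is why the required sample size grows, but fairly slowly, as metrics are added.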
True positive rate
The power of the decision rule is also affected by the number of success metrics. However, adding success metrics increases the true positive rate rather than reducing it.
If each success metric is powered at 80%, the probability that at least one success metric produces a true positive is 80% or larger, and it increases with the number of success metrics. This means the per-metric power setting needs no adjustment: the overall power of the decision rule is already at least as high as the desired level.
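A short calculation makes this concrete. The sketch below assumes, purely for illustration, that the success metrics are independent, that each one has a true effect of the size it was powered for, and that the guardrail part of the decision rule is ignored. The function name is hypothetical.

```python
def power_at_least_one(power_per_metric: float, n_metrics: int) -> float:
    """Probability that at least one success metric is a true positive.

    Assumes independent metrics that all have a true effect of the size
    they were powered for; guardrail metrics are not considered.
    """
    return 1.0 - (1.0 - power_per_metric) ** n_metrics


for m in (1, 2, 3, 5):
    print(f"{m} success metrics: {power_at_least_one(0.8, m):.4f}")
```

With one metric the probability is exactly the per-metric power of 80%. With two it is 96%, and it keeps approaching 100% as metrics are added, so it never drops below the per-metric setting.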
How does the number of success metrics affect the required sample size in experiments?
What is the purpose of adjusting alpha for each success metric in experiments?
What is the simplest method to adjust alpha for multiple success metrics?
Notes for nerds
- False positive rate: independent metrics represent the worst-case scenario. Each metric creates a completely new chance for a false positive, necessitating alpha adjustment.
- True positive rate: the worst-case scenario is that the metrics are perfectly dependent. In this case, only one chance exists for a true positive.
If we knew the correlation structure between metrics, or made assumptions about it, we could make the alpha adjustment less conservative and decrease the power per success metric. However, such methods make the analysis harder to understand and interpret.
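The role of dependence can be explored with a small simulation. The sketch below is an illustration under stated assumptions, not a method from the lesson: it draws one-sided z-statistics under the null for metrics that all share the same pairwise correlation `rho` (a hypothetical parameter), applies the Bonferroni-adjusted threshold, and estimates how often at least one metric is falsely significant.

```python
import random
from statistics import NormalDist


def simulated_fwer(n_metrics: int, rho: float, alpha: float = 0.05,
                   n_sims: int = 20_000, seed: int = 1) -> float:
    """Monte Carlo estimate of P(at least one false positive) with Bonferroni.

    Assumes standard normal test statistics under the null with equal
    pairwise correlation `rho` (0 <= rho <= 1) and one-sided tests.
    """
    rng = random.Random(seed)
    critical = NormalDist().inv_cdf(1 - alpha / n_metrics)
    shared_weight, own_weight = rho ** 0.5, (1 - rho) ** 0.5
    hits = 0
    for _ in range(n_sims):
        shared = rng.gauss(0, 1)  # component common to all metrics
        if any(shared_weight * shared + own_weight * rng.gauss(0, 1) > critical
               for _metric in range(n_metrics)):
            hits += 1
    return hits / n_sims


for rho in (0.0, 0.5, 1.0):
    print(f"rho = {rho}: estimated false positive rate {simulated_fwer(5, rho):.3f}")
```

With independent metrics (`rho = 0`) the estimate sits close to alpha, so the Bonferroni correction is nearly exact. As the correlation grows, the estimate falls toward alpha divided by the number of metrics, which is the conservativeness that correlation-aware methods try to recover.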