Lesson 2: Number of success metrics


False positive rate

Recall Spotify's multi-metric decision rule:

Ship if, and only if, at least one success metric has significantly improved, and the treatment is significantly non-inferior to control in all guardrail metrics.

Each additional success metric in the experiment gives an additional chance of reaching the conclusion to ship the treatment. If each success metric is tested with a significance level of 5%, the probability of having at least one false positive increases with the number of success metrics.

To counter this, we adjust the significance level for each success metric test to keep the chance of at least one false positive at 5% or smaller. The simplest way to do this is the Bonferroni correction, where alpha is adjusted by dividing it by the number of success metrics.
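The arithmetic behind this can be sketched in a few lines of Python. This is a minimal illustration, not Spotify's implementation: the function names are hypothetical, and the uncorrected calculation assumes the success metrics are independent and that none of them has a true effect.

```python
def prob_at_least_one_false_positive(alpha: float, n_metrics: int) -> float:
    """Chance that at least one of n independent tests at level alpha is a false positive."""
    return 1.0 - (1.0 - alpha) ** n_metrics


def bonferroni_alpha(alpha: float, n_metrics: int) -> float:
    """Per-metric significance level under the Bonferroni correction."""
    return alpha / n_metrics


alpha = 0.05
for n_metrics in (1, 2, 4, 8):
    uncorrected = prob_at_least_one_false_positive(alpha, n_metrics)
    corrected = prob_at_least_one_false_positive(
        bonferroni_alpha(alpha, n_metrics), n_metrics
    )
    print(f"{n_metrics} metrics: uncorrected={uncorrected:.3f}, Bonferroni={corrected:.3f}")
```

Under these assumptions the uncorrected probability climbs quickly as metrics are added, while the Bonferroni-corrected probability stays just below 0.05.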

Probability of at least one false positive

Example: with alpha = 0.05 and 8 success metrics (uncorrected, assuming independent metrics), the probability that at least 1 metric is significant is 0.337.

As we can see, the Bonferroni correction bounds the probability of at least one false positive at alpha or smaller, regardless of the number of success metrics.


True positive rate

The power of the decision rule is also affected by the number of success metrics. However, adding success metrics increases the true positive rate rather than reducing it.

If each success metric is powered at 80%, the probability that at least one of them detects a true improvement is 80% or larger, and it grows with the number of success metrics that truly improve. This means no adjustment to the per-metric power setting is needed for the overall power of the decision rule to be at least as high as the desired level.
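A small sketch of the two extremes makes the bound concrete. It assumes every success metric truly improves and is powered at 80% at whatever significance level is actually used for it; the function name is hypothetical. With independent metrics the chance of at least one true positive is 1 - (1 - power)^m, and with perfectly dependent metrics it stays at the per-metric power.

```python
def power_at_least_one(power_per_metric: float, n_metrics: int) -> float:
    """Chance that at least one of n independent, truly improved metrics is detected."""
    return 1.0 - (1.0 - power_per_metric) ** n_metrics


power = 0.80
for n_metrics in (1, 2, 4, 8):
    independent = power_at_least_one(power, n_metrics)
    # Perfectly dependent metrics all succeed or fail together,
    # so the overall power equals the per-metric power.
    perfectly_dependent = power
    print(
        f"{n_metrics} metrics: independent={independent:.4f}, "
        f"perfectly dependent={perfectly_dependent:.4f}"
    )
```

In both cases the overall power never falls below the 80% target, which is why the per-metric power setting can be left as it is.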



Notes for nerds

  1. False positive rate:

    • For the false positive rate, independent metrics represent the worst-case scenario. Each metric creates a completely new chance for a false positive, necessitating alpha adjustment.
  2. True positive rate:

    • For the true positive rate, the worst-case scenario is that the metrics are perfectly dependent. In this case, only one chance exists for a true positive.

If we knew the correlation structure between the metrics, or were willing to make assumptions about it, we could make the alpha adjustment less conservative and lower the power required per success metric. However, such methods make the results harder to understand and interpret.
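To see how dependence makes the Bonferroni adjustment conservative, here is a rough simulation sketch. It rests on several assumptions that are not part of the lesson: no metric has a true effect, each metric is tested with a one-sided z-test, and every pair of test statistics shares the same correlation rho. The function name and defaults are hypothetical.

```python
import random
from statistics import NormalDist


def simulated_false_positive_rate(rho, n_metrics=8, alpha=0.05, n_sims=20_000, seed=0):
    """Share of simulated experiments with at least one significant metric
    when every metric is tested at the Bonferroni level alpha / n_metrics."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1.0 - alpha / n_metrics)
    shared_weight = rho ** 0.5
    own_weight = (1.0 - rho) ** 0.5
    hits = 0
    for _ in range(n_sims):
        shared = rng.gauss(0.0, 1.0)  # component common to all metrics
        if any(
            shared_weight * shared + own_weight * rng.gauss(0.0, 1.0) > z_crit
            for _metric in range(n_metrics)
        ):
            hits += 1
    return hits / n_sims


for rho in (0.0, 0.5, 1.0):
    print(f"rho={rho}: simulated rate={simulated_false_positive_rate(rho):.4f}")
```

In this setup the simulated rate should sit close to alpha when rho is 0 and fall toward alpha divided by the number of metrics as rho approaches 1, which is the slack that correlation-aware methods try to recover.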