Lesson 3: Number of guardrail metrics
This lesson teaches you how the number of guardrail metrics affects the required sample size in experiments. Only guardrail metrics with NIMs affect the sample size: the more of them you have, the higher the power each metric needs so that the true positive rate of the overall decision stays at or above the intended power, which leads to a larger required sample size. Guardrail metrics without NIMs add no cost to the shipping decision's sample size.
Guardrail metrics with and without NIMs
Not all guardrail metrics have a Non-Inferiority Margin (NIM). A guardrail metric with a NIM requires the experiment to pass a non-inferiority test before shipping. A guardrail metric without a NIM is tested only for regression (whether it has significantly deteriorated) and does not need to pass a non-inferiority test for the experiment to ship.
Adding a guardrail metric without a NIM has no cost to the shipping decision's sample size. It does not affect the alpha correction (it cannot increase the shipping false positive rate) and does not affect the beta correction (it cannot block the ship by not being significant).
There is, however, an alpha cost for the abort decision. Regression tests across all user-specified metrics, required metrics, and health checks like the sample ratio mismatch test all contribute to the alpha correction for abort. Adding many guardrail metrics without NIMs inflates this correction, reducing the power to detect actual regressions. We don't currently target a specific power for the abort decision, so this effect is not tracked.
The rest of this lesson applies to guardrail metrics with NIMs. Read more about how guardrail metrics with and without NIMs fit into smaller-sample experimentation in Experiments with Smaller Samples.
False positive rate
Since all guardrail metrics with NIMs must be simultaneously non-inferior, the probability of a false positive for the decision rule decreases as the number of guardrail metrics increases.
You can think about this as the difference between having 5 dice and needing at least one six versus needing all 5 dice to show sixes. The more dice you have, the less likely it is that all of them are sixes at the same time by chance.
However, if the guardrail metrics are highly correlated, the probability of getting all sixes doesn't decrease as quickly with the number of dice.
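Either way, requiring all metrics to pass can only lower the false positive rate of the decision compared with a single metric, never raise it. For this reason, we don't need to adjust alpha for the number of guardrail metrics.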
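The dice analogy can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of the platform: the function names are hypothetical, and `joint_false_positive_rate` assumes independent metrics that all sit exactly at their NIM, which is a simplification chosen for illustration.

```python
from fractions import Fraction


def p_at_least_one_six(n_dice: int) -> Fraction:
    """Probability that at least one of n fair dice shows a six."""
    return 1 - Fraction(5, 6) ** n_dice


def p_all_sixes(n_dice: int) -> Fraction:
    """Probability that all n fair dice show a six."""
    return Fraction(1, 6) ** n_dice


def joint_false_positive_rate(alpha: float, n_metrics: int) -> float:
    """Chance that every metric passes by luck alone.

    Assumption: metrics are independent and each one passes falsely
    with probability alpha.
    """
    return alpha ** n_metrics


if __name__ == "__main__":
    print(float(p_at_least_one_six(5)))  # "at least one six" is fairly likely
    print(float(p_all_sixes(5)))         # "all sixes" is very unlikely
    for k in (1, 2, 5):
        print(k, joint_false_positive_rate(0.05, k))
```

With perfectly correlated metrics the joint rate would stay at alpha rather than shrink, so in neither case does it exceed the per-metric alpha.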
True positive rate
The power of the experiment is affected by the number of guardrail metrics. Since all guardrail metrics must be non-inferior simultaneously, it is not enough for each metric on its own to reach the target power of being significantly non-inferior under the alternative hypothesis: the probability that all of them are significant at the same time can be much lower than the power of any single metric.
Figure: Probability that all metrics are non-inferior
The plot above assumes that the metrics are independent. If they are highly correlated, the probability of all metrics being significant doesn't decrease as quickly with the number of metrics. However, without knowing the correlation, we must ensure that the power is high enough in the worst-case scenario of independent metrics.
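To make the independent case concrete, here is a hedged sketch of how per-metric power and sample size could scale with the number of guardrail metrics with NIMs. It assumes independence (so joint power is a simple product) and a one-sided z-test whose sample size is proportional to (z_{1-alpha} + z_{power})^2 for a fixed variance and NIM. The function names are hypothetical, and the lesson does not state the exact correction the platform applies, so treat this as an illustration only.

```python
from statistics import NormalDist


def joint_power(per_metric_power: float, n_metrics: int) -> float:
    """Probability that all metrics are significant, assuming independence."""
    return per_metric_power ** n_metrics


def required_per_metric_power(target_power: float, n_metrics: int) -> float:
    """Per-metric power needed so the joint power reaches the target,
    assuming independent metrics."""
    return target_power ** (1.0 / n_metrics)


def relative_sample_size(alpha: float, target_power: float, n_metrics: int) -> float:
    """Sample size with the corrected per-metric power, relative to the
    uncorrected single-metric design (one-sided z-test assumption)."""
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha)
    base = (z_alpha + z(target_power)) ** 2
    corrected_power = required_per_metric_power(target_power, n_metrics)
    corrected = (z_alpha + z(corrected_power)) ** 2
    return corrected / base


if __name__ == "__main__":
    for k in (1, 2, 5, 10):
        print(
            k,
            round(joint_power(0.8, k), 3),
            round(required_per_metric_power(0.8, k), 3),
            round(relative_sample_size(0.05, 0.8, k), 2),
        )
```

Under these assumptions, the first column of results shows how quickly uncorrected joint power falls, and the last shows the sample size multiplier needed to restore it.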
How does the number of guardrail metrics with NIMs affect the required sample size?
Why does the probability of a false positive decision decrease as the number of guardrail metrics increases?
Why is it unnecessary to adjust alpha for the number of guardrail metrics?
How does the number of guardrail metrics affect the power of the decision of an experiment?
What happens to the simultaneous power of the overall decision if the power correction is used and the guardrail metrics are highly correlated?
What is the effect of adding a guardrail metric without a NIM on the required sample size for the shipping decision?
Notes for nerds
- False positive rate: For guardrail metrics, the false positive rate decreases as the number of metrics increases. This is because all metrics must be simultaneously non-inferior, making it harder to have a false positive as metrics are added.
- True positive rate: For true positives, the worst-case scenario is independent metrics, where each metric is treated as a separate hurdle. This decreases the overall probability of all metrics being significant.
If we knew the correlation structure between metrics, we could adjust the power to account for dependencies, but this adds complexity to the design and interpretation.
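The effect of correlation on joint power can be explored with a small simulation. This is a sketch under stated assumptions, not the platform's method: test statistics are modeled as equicorrelated normal variables with a shared correlation `rho`, each with the same marginal power, and `simulate_joint_power` is a hypothetical helper.

```python
import random
from statistics import NormalDist


def simulate_joint_power(
    n_metrics: int,
    per_metric_power: float,
    rho: float,
    alpha: float = 0.05,
    n_sims: int = 20000,
    seed: int = 0,
) -> float:
    """Estimate the probability that all metrics are significant when the
    test statistics share a common correlation rho (0 <= rho <= 1)."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf
    critical = z(1 - alpha)
    # Mean shift that gives each statistic the requested marginal power.
    shift = critical + z(per_metric_power)
    hits = 0
    for _ in range(n_sims):
        common = rng.gauss(0, 1)  # component shared by all metrics
        all_pass = True
        for _ in range(n_metrics):
            noise = rng.gauss(0, 1)  # metric-specific component
            stat = shift + rho ** 0.5 * common + (1 - rho) ** 0.5 * noise
            if stat <= critical:
                all_pass = False
                break
        hits += all_pass
    return hits / n_sims


if __name__ == "__main__":
    for rho in (0.0, 0.5, 0.9, 1.0):
        print(rho, simulate_joint_power(5, 0.8, rho))
```

In this model, joint power is lowest at `rho = 0` and rises toward the per-metric power as `rho` approaches 1, which is why a correction sized for independent metrics leaves highly correlated metrics with more power than targeted.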