Lesson 1: Multi-metric decision making
This lesson teaches you how to formalize decision-making from experiments with guardrail and success metrics. Confidence uses a decision rule to map the results of all success and guardrail metrics to one decision: Ship or not.
The two main types of metrics
Experiments use two types of metrics: success metrics and guardrail metrics. Success metrics are metrics that we want to improve. Guardrail metrics are metrics that we want to make sure don't move in the wrong direction because of our product change. By combining these metrics, we can make better, more precise product decisions.
Spotify's decision rule
At Spotify we use the decision rule: Ship if, and only if, at least one success metric has significantly improved, and the treatment is significantly non-inferior to control in all guardrail metrics.
This is also the default decision rule for all recommendations in Confidence.
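As a minimal sketch, the rule above can be written as a small function. The `MetricResult` type and the `should_ship` name are hypothetical and only illustrate the logic; they are not part of the Confidence API.

```python
from dataclasses import dataclass


@dataclass
class MetricResult:
    """Outcome of the statistical test for one metric (hypothetical type)."""
    name: str
    # Success metric: True if significantly improved.
    # Guardrail metric: True if significantly non-inferior to control.
    passed: bool


def should_ship(success, guardrails):
    """Ship if at least one success metric passed and all guardrails passed."""
    any_success_improved = any(m.passed for m in success)
    all_guardrails_ok = all(m.passed for m in guardrails)
    return any_success_improved and all_guardrails_ok


# Illustrative, made-up results.
success = [MetricResult("success_metric_a", True), MetricResult("success_metric_b", False)]
guardrails = [MetricResult("guardrail_a", True), MetricResult("guardrail_b", True)]
decision = should_ship(success, guardrails)
```

Note that a single failing guardrail blocks the ship decision, no matter how many success metrics improved. In this sketch, an experiment with no success metrics never ships, because `any` over an empty list is false.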
The goal of the experimental design is to manage the risks associated with decisions made using this rule. This means accounting for all metrics, and how they affect the decision rule, simultaneously.
In the following lessons, we explain how the number of success and guardrail metrics affects the sample size calculation. To learn more about how Confidence manages risk in decision-making, see this blog post.
Intro to guardrail metrics and non-inferiority tests
Guardrail metrics are different from success metrics. They are metrics that we want to ensure do not significantly worsen in the treatment group compared to the control group.
This means that we aim to show that the metric did not deteriorate by more than a tolerated margin in the treatment group compared to the control group. To do this, we use non-inferiority tests.
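To make the idea concrete, here is a minimal sketch of a one-sided non-inferiority z-test. It assumes a metric where higher is better, a non-inferiority margin (NIM) expressed in the metric's own units, and a normal approximation for the difference in means. The function name and the numbers are illustrative and do not describe how Confidence implements the test.

```python
from statistics import NormalDist


def non_inferiority_p_value(mean_treatment, mean_control, se_diff, nim):
    """One-sided p-value for H0: treatment - control <= -nim.

    A small p-value is evidence that treatment is worse than control
    by less than the margin, i.e. that treatment is non-inferior.
    """
    # Shift the observed difference by the margin before standardizing.
    z = (mean_treatment - mean_control + nim) / se_diff
    return 1.0 - NormalDist().cdf(z)


# Made-up inputs: identical means, standard error 0.05, margin 0.1.
p = non_inferiority_p_value(10.0, 10.0, se_diff=0.05, nim=0.1)
is_non_inferior = p < 0.05  # 0.05 is an assumed significance level
```

With a margin of zero the same function reduces to an ordinary one-sided superiority test, which shows that the margin is what separates a guardrail test from a success test in this sketch.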
This video introduces the concept of non-inferiority tests and how the choice of the Non-Inferiority Margin (NIM) affects the sample size calculation.
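The relationship between the NIM and the sample size can be sketched with the textbook formula for a one-sided two-sample z-test. The sketch assumes equal group sizes, a known standard deviation `sigma`, and a true difference of zero between treatment and control. It ignores the adjustments for multiple metrics that the following lessons cover, and the function name and default values are assumptions made for illustration.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(sigma, nim, alpha=0.05, power=0.8):
    """Approximate users per group needed for a non-inferiority test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)  # one-sided critical value
    z_power = NormalDist().inv_cdf(power)
    return ceil(2 * sigma**2 * (z_alpha + z_power) ** 2 / nim**2)


n_wide = sample_size_per_group(sigma=1.0, nim=0.10)
n_narrow = sample_size_per_group(sigma=1.0, nim=0.05)
ratio = n_narrow / n_wide  # close to 4
```

Because the margin enters the formula squared in the denominator, halving the NIM roughly quadruples the required sample size. A tighter guardrail is therefore more expensive to verify.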
Non-inferiority tests are not the only way to use guardrail metrics in Confidence. Read more in this blog post.
What is the primary purpose of guardrail metrics in experiments?
What is Spotify's decision rule for shipping a product change?
What is the relationship between the Non-Inferiority Margin (NIM) and sample size?
Notes for nerds
Deterioration metrics and health checks also affect the risk management of decision-making. We don't want to ship if:
- Any metric included in the experiment moves significantly in the wrong direction.
- Any health check, like the sample ratio mismatch test, triggers.
These checks are also based on statistical tests, so each one has a risk of triggering incorrectly (a false positive).
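A rough way to see why this matters: if each check has a small chance of a false trigger, the chance that at least one triggers grows with the number of checks. The sketch below assumes the checks are statistically independent and all run at the same level `alpha`. Both are simplifying assumptions, and the values are made up.

```python
def prob_any_false_trigger(alpha, n_checks):
    """Chance that at least one of n independent checks triggers falsely."""
    return 1.0 - (1.0 - alpha) ** n_checks


# Ten checks, each with an assumed 1% false trigger rate.
risk = prob_any_false_trigger(alpha=0.01, n_checks=10)
```

Under these assumptions, ten checks at a 1% level give close to a 10% chance of blocking a change that is actually fine, which is why these checks are part of the overall risk management and not a free addition.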
Read more in this blog post and in this academic paper.