Lesson 1: Multi-metric decision making

The two main types of metrics

There are two types of metrics used in experiments: success metrics and guardrail metrics. Success metrics are metrics that we want to improve. Guardrail metrics are metrics that must not move in the wrong direction because of our product change. By combining both types, we can make better, more precise product decisions.

Spotify's decision rule

At Spotify we use the decision rule: Ship if, and only if, at least one success metric has significantly improved, and the treatment is significantly non-inferior to control in all guardrail metrics.

This is also the default decision rule for all recommendations in Confidence.
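
The rule above can be sketched as a small function. This is only an illustration, not how Confidence implements it: the names MetricResult and should_ship are hypothetical, and each metric is assumed to have already been reduced to a single "significant" flag by its own test (a superiority test for success metrics, a non-inferiority test for guardrail metrics).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MetricResult:
    """Hypothetical container for the outcome of one metric's test."""
    name: str
    significant: bool  # True if the metric's test rejected its null hypothesis


def should_ship(success: List[MetricResult], guardrails: List[MetricResult]) -> bool:
    """Ship if, and only if, at least one success metric significantly improved
    and every guardrail metric is significantly non-inferior."""
    any_success_improved = any(m.significant for m in success)
    # Note: all() over an empty list is True, so an experiment without
    # guardrail metrics only needs one improved success metric.
    all_guardrails_non_inferior = all(m.significant for m in guardrails)
    return any_success_improved and all_guardrails_non_inferior
```

Because every guardrail must pass while only one success metric needs to, adding metrics of either type changes the overall risk of a wrong decision in different ways.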

The goal of the experimental design is to manage the risks associated with decisions made using this rule. This means accounting for all metrics simultaneously, including how each one affects the decision.

In the following lessons, we explain how the number of success and guardrail metrics affects the sample size calculation. To learn more about how Confidence manages risk in decision-making, see this blog post.

Intro to guardrail metrics and non-inferiority tests

Guardrail metrics differ from success metrics: rather than trying to improve them, we want to ensure that they do not significantly worsen in the treatment group compared to the control group.

This means that we aim to show that the metric did not deteriorate in the treatment group, by more than a small tolerated margin, compared to the control group. To do this, we use non-inferiority tests.
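
To make the idea concrete, here is a minimal sketch of a one-sided non-inferiority z-test for a difference in means. It assumes that higher metric values are better, that samples are large enough for a normal approximation, and that the margin nim is given in the metric's own units. The function name and parameters are hypothetical, and this is not necessarily the test Confidence runs.

```python
from math import sqrt
from statistics import NormalDist


def non_inferiority_test(mean_t, mean_c, var_t, var_c, n_t, n_c, nim, alpha=0.05):
    """One-sided z-test, assuming higher values are better.

    H0: mean_t - mean_c <= -nim  (treatment is worse by at least the margin)
    H1: mean_t - mean_c >  -nim  (treatment is non-inferior)
    Returns (p_value, is_non_inferior).
    """
    # Standard error of the difference in means
    se = sqrt(var_t / n_t + var_c / n_c)
    # Shift the observed difference by the margin before standardizing
    z = (mean_t - mean_c + nim) / se
    p_value = 1.0 - NormalDist().cdf(z)
    return p_value, p_value < alpha
```

For example, if both groups have the same observed mean, the test can still conclude non-inferiority as long as the standard error is small relative to the margin. If the observed difference sits exactly at the margin, the p-value is about 0.5 and the guardrail does not pass.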

This video introduces the concept of non-inferiority tests and how the choice of the Non-Inferiority Margin (NIM) affects the sample size calculation.
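
As a rough illustration of why the NIM matters for sample size, the sketch below uses the standard textbook formula for a one-sided, two-sample z-test with equal group sizes, assuming the true difference between the groups is zero. The function name is hypothetical, and the formula ignores the adjustments for multiple metrics discussed in this course, so treat it as an approximation rather than the calculation Confidence performs.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(sigma, nim, alpha=0.05, power=0.8):
    """Approximate users needed per group for a non-inferiority test.

    Assumes equal group sizes, a common standard deviation sigma,
    and a true treatment effect of zero.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha)  # one-sided significance level
    z_beta = NormalDist().inv_cdf(power)       # desired power
    return ceil(2 * sigma**2 * (z_alpha + z_beta) ** 2 / nim**2)
```

Because the margin enters the formula squared in the denominator, halving the NIM roughly quadruples the required sample size.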

Non-inferiority tests are not the only way to use guardrail metrics in Confidence. Read more in this blog post.

Notes for nerds

Deterioration metrics and health checks also affect the risk management of decision-making. We don't want to ship if:

  • Any metric included in the experiment moves significantly in the wrong direction.
  • Any health check, like the sample ratio mismatch test, triggers.

These checks are also based on statistical tests and therefore carry a risk of triggering incorrectly.
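
As an example of such a health check, a sample ratio mismatch test can be sketched as a chi-square goodness-of-fit test on the group sizes. The version below assumes two groups and uses only the standard library. The function name and the default 50/50 split are assumptions for illustration, not the Confidence implementation.

```python
from math import erfc, sqrt


def srm_p_value(n_control, n_treatment, expected_control_share=0.5):
    """P-value of a chi-square test (1 degree of freedom) comparing the
    observed group sizes with the intended allocation."""
    total = n_control + n_treatment
    expected_control = total * expected_control_share
    expected_treatment = total - expected_control
    chi2 = ((n_control - expected_control) ** 2 / expected_control
            + (n_treatment - expected_treatment) ** 2 / expected_treatment)
    # For 1 degree of freedom, the chi-square survival function
    # equals erfc(sqrt(chi2 / 2)).
    return erfc(sqrt(chi2 / 2))
```

A small p-value suggests that the observed split deviates from the intended one. Since this is itself a statistical test, a correctly configured experiment will still trigger it occasionally, at a rate set by the chosen threshold.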

Read more in this blog post and in this academic paper.