
Lesson 1: Multi-metric decision making

Summary

This lesson teaches you how to formalize decision-making in experiments that use both success and guardrail metrics. Confidence uses a decision rule to map the results of all success and guardrail metrics to a single decision: ship or don't ship.

The two main types of metrics

There are two types of metrics used in experiments: success metrics and guardrail metrics. Success metrics are metrics that we want to improve. Guardrail metrics are metrics that we want to make sure don't move in the wrong direction because of our product change. Combining the two lets us make better, more precise product decisions.

Spotify's decision rule

At Spotify, we use the following decision rule: ship if, and only if, at least one success metric has significantly improved and the treatment is significantly non-inferior to control on all guardrail metrics.

This is also the default decision rule for all recommendations in Confidence.
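As a rough illustration, the decision rule can be written as a small function. This is a minimal sketch, not how Confidence implements its recommendations: the `MetricResult` type, its fields, and the `should_ship` name are hypothetical, and each metric's statistical test is assumed to have been run already.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class MetricResult:
    """Outcome of one metric's test (hypothetical structure for illustration)."""
    name: str
    role: str      # "success" or "guardrail"
    passed: bool   # success: significantly improved
                   # guardrail: significantly non-inferior to control


def should_ship(results: Iterable[MetricResult]) -> bool:
    """Ship if, and only if, at least one success metric significantly
    improved AND every guardrail metric is significantly non-inferior."""
    results = list(results)
    success = [r for r in results if r.role == "success"]
    guardrails = [r for r in results if r.role == "guardrail"]

    any_success_improved = any(r.passed for r in success)
    # all() of an empty list is True: with no guardrails, the rule
    # reduces to "at least one success metric improved".
    all_guardrails_non_inferior = all(r.passed for r in guardrails)
    return any_success_improved and all_guardrails_non_inferior


example = [
    MetricResult("minutes_played", "success", passed=True),
    MetricResult("search_success", "success", passed=False),
    MetricResult("crash_rate", "guardrail", passed=True),
]
decision = should_ship(example)
```

Note the asymmetry: one improved success metric is enough, but a single guardrail that fails to show non-inferiority blocks the ship decision.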

The goal of the experimental design is to manage the risks associated with decisions made using this rule. This means accounting for all metrics, and how they affect the decision rule, simultaneously.

In the following lessons, we explain how the number of success and guardrail metrics affects the sample size calculation. If you want to learn more about how Confidence manages risk in decision-making, read more in this blog post.

Intro to guardrail metrics and non-inferiority tests

Guardrail metrics are different from success metrics. They are metrics that we want to ensure do not significantly worsen in the treatment group compared to the control group.

This means that we aim to prove that the metric did not deteriorate in the treatment group compared to the control group. To do this, we use non-inferiority tests.

This video introduces the concept of non-inferiority tests and how the choice of the Non-Inferiority Margin (NIM) affects the sample size calculation.
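To get a feel for how the NIM drives the required sample size, here is a minimal sketch based on the standard textbook approximation for a one-sided non-inferiority z-test on a difference in means. It assumes equal group sizes, a known common standard deviation `sigma`, and a true difference of zero; the function name and defaults are illustrative and this is not necessarily the calculation Confidence performs.

```python
from math import ceil
from statistics import NormalDist


def noninferiority_sample_size(sigma: float, nim: float,
                               alpha: float = 0.05,
                               power: float = 0.8) -> int:
    """Approximate per-group sample size for a one-sided non-inferiority
    z-test, assuming the true treatment effect is zero.

    n = 2 * sigma^2 * (z_{1-alpha} + z_{power})^2 / nim^2
    """
    if sigma <= 0 or nim <= 0:
        raise ValueError("sigma and nim must be positive")
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    z_power = NormalDist().inv_cdf(power)
    return ceil(2 * sigma ** 2 * (z_alpha + z_power) ** 2 / nim ** 2)


# The NIM enters the formula squared in the denominator, so halving
# the margin roughly quadruples the required sample size.
n_wide = noninferiority_sample_size(sigma=1.0, nim=0.10)
n_narrow = noninferiority_sample_size(sigma=1.0, nim=0.05)
```

Under these assumptions, a tighter margin (a smaller tolerated deterioration) is more expensive to prove: the required sample size grows with the inverse square of the NIM.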

Non-inferiority tests are not the only way to use guardrail metrics in Confidence. Read more in this blog post.

Reader exercises

  • What is the primary purpose of guardrail metrics in experiments?
  • What is Spotify's decision rule for shipping a product change?
  • What is the relationship between the Non-Inferiority Margin (NIM) and sample size?

Notes for nerds

Deterioration metrics and health checks also affect the risk management of decision-making. We don't want to ship if:

  • Any metric included in the experiment moves significantly in the wrong direction.
  • Any health check, like the sample ratio mismatch test, triggers.

These are also based on statistical tests and therefore have a risk of incorrectly triggering.
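To see why this matters, consider how the chance of at least one incorrect trigger grows with the number of checks. The sketch below makes a simplifying assumption that is not stated in this lesson: every check is an independent test run at the same level `alpha`. Real metrics are usually correlated, so treat the result as an illustration rather than the risk calculation Confidence uses.

```python
def prob_any_false_trigger(num_checks: int, alpha: float) -> float:
    """Probability that at least one of `num_checks` tests triggers
    incorrectly, ASSUMING independent tests that each have a false
    positive rate of `alpha` (illustrative assumption)."""
    if num_checks < 0:
        raise ValueError("num_checks must be non-negative")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    return 1.0 - (1.0 - alpha) ** num_checks


# Compare a single check with ten checks at the same level.
single = prob_any_false_trigger(1, 0.01)
ten = prob_any_false_trigger(10, 0.01)
```

Under the independence assumption the combined risk rises with every added check, which is why deterioration tests and health checks have to be accounted for when managing the overall risk of the decision.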

Read more in this blog post and in this academic paper.


© Copyright 2026. All rights reserved.
