
Lesson 7: Success metrics

Summary

In this lesson, you learn how to define success metrics for an experiment. Success metrics are what you use to evaluate whether your change is successful.

A good success metric:

  • Has a strong, ideally direct, relationship with the success of your product and business.
  • Is not too noisy, so that changes in it are easy to detect.

After you've written the hypothesis, you should have a clear idea of which user behavior the experiment should influence and what outcome you expect to see. Now you need to pick metrics that measure whether the experiment achieves this outcome. An ideal success metric directly measures the desired outcome and is:

  • Observable in the short term
  • Sensitive to changes
  • Relevant for the business in the long term

In the best case, you can measure your desired outcome directly and with a reasonable delay after a user's exposure to the change.

Example

Consider an experiment that changes the user flow for subscribing to premium. The experimenters can measure the share of users who successfully sign up. This metric is directly tied to the change in the user flow, is measurable in the short term, and is highly relevant to the business.
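As a rough illustration of how such a share could be compared between groups, here is a minimal sketch. It assumes a two-sided z-test on the difference in proportions with an unpooled standard error. The function name and the counts are hypothetical and do not come from a real experiment, and this is not necessarily the analysis that Confidence runs.

```python
from math import sqrt
from statistics import NormalDist


def signup_share_comparison(control_signups, control_users,
                            treatment_signups, treatment_users):
    """Compare sign-up shares between two groups with a two-sided z-test."""
    control_share = control_signups / control_users
    treatment_share = treatment_signups / treatment_users
    difference = treatment_share - control_share

    # Unpooled standard error of the difference between two proportions.
    standard_error = sqrt(
        control_share * (1 - control_share) / control_users
        + treatment_share * (1 - treatment_share) / treatment_users
    )
    z = difference / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return {
        "control_share": control_share,
        "treatment_share": treatment_share,
        "difference": difference,
        "z": z,
        "p_value": p_value,
    }


# Hypothetical counts, for illustration only.
result = signup_share_comparison(500, 10_000, 560, 10_000)
print(result)
```

The metric here is a single number per group (the sign-up share), which is part of what makes it easy to interpret.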

Unfortunately, the outcome of interest often happens further in the future and is difficult to measure directly in the experiment.

Example

When we create a new feature at Spotify, we often hope to improve the user experience and reduce churn in the long term. But the subjective user experience is difficult to measure, and the impact of the user experience on churn takes time to detect. In those cases, we need to use proxy metrics that we can measure in the short term and that are reliable predictors of the long-term outcome that is our primary interest.

Example

At Spotify, common proxy metrics are share of active users (measured over a day or a week) and minutes played. These metrics measure short-term engagement with the product and correlate with long-term outcomes like churn and premium subscription.
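To make these proxy metrics concrete, the sketch below computes a share of active users and the average minutes played from a small play log. The log format, the user IDs, and the function names are all hypothetical and chosen for illustration. They are not Spotify's or Confidence's actual data model.

```python
from collections import defaultdict

# Hypothetical play log: (user_id, day_index, minutes_played).
plays = [
    ("u1", 0, 30.0),
    ("u1", 1, 12.5),
    ("u2", 0, 5.0),
    ("u3", 6, 45.0),
]
# All users in the group, whether or not they were active.
exposed_users = {"u1", "u2", "u3", "u4"}


def share_active(plays, users, days):
    """Share of `users` with at least one play on any day in `days`."""
    active = {user for user, day, _ in plays if day in days and user in users}
    return len(active) / len(users)


def mean_minutes_played(plays, users, days):
    """Average minutes played per user over `days`; inactive users count as 0."""
    totals = defaultdict(float)
    for user, day, minutes in plays:
        if day in days and user in users:
            totals[user] += minutes
    return sum(totals.values()) / len(users)


first_week = set(range(7))
daily_share = share_active(plays, exposed_users, days={0})
weekly_share = share_active(plays, exposed_users, days=first_week)
weekly_minutes = mean_minutes_played(plays, exposed_users, days=first_week)
print(daily_share, weekly_share, weekly_minutes)
```

Dividing by all exposed users, not only the active ones, keeps the metric comparable between groups even if the treatment changes how many users are active.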

Select a few specific metrics

Success metrics should be as specific to the hypothesis as possible. You may be curious about all the possible effects of your treatment, and it's often tempting to add every metric that your change could possibly impact. However, when deciding on success metrics, you should limit yourself to a few relevant ones and keep exploration separate from the criteria that define success.

You should select only a few success metrics because:

  • It's harder to reliably measure success with many metrics
  • More metrics require a larger sample size
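The second point can be sketched with a standard sample size approximation. The example below assumes a two-sided z-test on a difference in means, equal group sizes, and a Bonferroni correction that divides alpha by the number of success metrics. The correction method, the function name, and the input values are assumptions for illustration, and the exact calculation in Confidence may differ.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(mde, std_dev, n_success_metrics=1,
                          alpha=0.05, power=0.8):
    """Approximate users per group needed to detect an absolute effect `mde`.

    Assumes a two-sided z-test on a difference in means, equal group sizes,
    and a Bonferroni correction (alpha divided by the number of metrics).
    """
    adjusted_alpha = alpha / n_success_metrics
    z_alpha = NormalDist().inv_cdf(1 - adjusted_alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) * std_dev / mde) ** 2)


# Hypothetical metric: standard deviation of 20, smallest effect of interest 1.
sizes = {m: sample_size_per_group(mde=1.0, std_dev=20.0, n_success_metrics=m)
         for m in (1, 3, 10)}
for n_metrics, n_users in sizes.items():
    print(f"{n_metrics} success metric(s): {n_users} users per group")
```

Under these assumptions, the required sample size grows with the number of success metrics. It also grows with `std_dev`, which is why a noisy metric makes changes harder to detect.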

After your experiment ends, you can explore the effects on other metrics using exploratory analysis. This can help you understand the results better and inspire new hypotheses. However, you should base the decision on whether to ship a change on your pre-defined success metrics, not on metrics that you added afterwards. Pre-defining decision criteria helps you avoid confirmation bias, where you selectively look for evidence that confirms your beliefs and ignore evidence against them.

In Confidence

In Confidence, you can run exploratory analysis after your experiment ends to dig deeper into results and get inspiration for new hypotheses.

Example

Consider a team working on the Spotify home page. The team wants to test whether adding a "shuffle" button to the "Try something else" shelf increases user engagement on the home screen. They create an experiment with two treatment groups: one called "Control", which gets the default experience (no shuffle button), and one called "Treatment", which gets the shuffle button.

They need to choose a success metric to judge whether the shuffle button improves the user experience. Which metric should they choose?

Figure: Treatment and control variants

If the goal of the button is to increase interaction with the "Try something else" shelf, then one possible metric is "Share of users who play from the Try something else shelf". This directly measures the behavior that the feature aims to influence. But is it also relevant for the user and the business? Measuring success by that metric makes it tempting to introduce more features that direct traffic towards this shelf and away from the "Jump back in" and "Podcasts to try" shelves. A better success metric is "Minutes played on Week 1", because it measures overall user activity. You could add "Share of users who play from the Try something else shelf" as an additional metric to confirm that an increase in plays from "Try something else" is what caused the increase in overall activity.

Reader exercise

What is the purpose of a success metric in an experiment?


© Copyright 2026. All rights reserved.
