Lesson 3: How to measure impact with success and guardrail metrics

When you run an experiment, such as an A/B test or a rollout, the ultimate goal is to learn about the impact of the change you made. To know what the impact is, you need to measure the outcome on a relevant set of metrics. The metrics you select can serve different purposes, and even be subject to different statistical tests. This page describes the two main types of metrics you can use to measure impact, and how to select them.

The two types of metrics are:

  • Success metrics. Metrics that you aim to improve with your change.
  • Guardrail metrics. Metrics that you don't expect to improve, but that you want to make sure your change doesn't harm.

Success and guardrail metrics

Success metrics are the metrics that you aim to improve with your change. They're what you use to show that your change had a positive impact.

Alongside success metrics, you should also select guardrail metrics. Guardrail metrics help you make sure that your change doesn't have a negative impact on other aspects of your product. This means that a hypothesis for an experiment includes two criteria: one for the success metric and one for the guardrail metric. Evidence must support both criteria before you decide to launch the change.

Let's look at some examples of success and guardrail metrics.

Example: Checkout flow

You run an A/B test with an improvement to the checkout flow of your e-commerce website. Your goal is to make the checkout flow more efficient so that your visitors spend less time in the checkout flow. You want to make sure that the improvement in the checkout flow doesn't come at the expense of the number of purchases.

  • Success metric: Average time to completed checkout per visitor.
  • Guardrail metric: Number of purchases per visitor.

Example: Search algorithm

With your new Spotify search algorithm, your hypothesis is that users get better podcast recommendations. You want to measure the impact of the new algorithm on podcast consumption. You also want to make sure that the change doesn't reduce how much music users listen to.

  • Success metric: The average number of podcast minutes played per user.
  • Guardrail metric: The average number of music minutes played per user.

Example: Dating app

You have a dating app that requires new users to complete their profile before they can interact with others. You run a test where your hypothesis is that showing a dialog with onboarding advice increases the number of users that complete their setup. The dialog uses new technology, and you want to make sure you don't introduce any bugs.

  • Success metric: Share of users that complete their profile setup.
  • Guardrail metric: Number of crashes per user.

How success and guardrail metrics together define a successful variant

For a change to be worth shipping, at least one success metric must have improved significantly, while all guardrail metrics must not have regressed beyond acceptable limits. Both conditions must hold: a win on the success metric doesn't override a failure on a guardrail.
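The decision rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any particular experimentation tool implements it. It assumes each metric result is summarized as a confidence interval for the relative change, oriented so that a positive value always means "better" (for example, a reduction in checkout time counts as positive). It also assumes that "acceptable limits" for guardrails are expressed as a single tolerance margin. The names MetricResult, should_ship, and margin, and all the numbers, are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class MetricResult:
    """Hypothetical summary of one metric's test result.

    The interval is for the relative change (treatment vs. control),
    oriented so that positive always means "better".
    """
    name: str
    ci_lower: float
    ci_upper: float


def success_improved(result: MetricResult) -> bool:
    # Assumption: "improved significantly" means the whole interval is above zero.
    return result.ci_lower > 0.0


def guardrail_ok(result: MetricResult, margin: float) -> bool:
    # Assumption: "not regressed beyond acceptable limits" means the interval
    # excludes any regression larger than the tolerated margin.
    return result.ci_lower > -margin


def should_ship(
    success: Sequence[MetricResult],
    guardrails: Sequence[MetricResult],
    margin: float,
) -> bool:
    # At least one success metric must win, and every guardrail must pass.
    any_win = any(success_improved(r) for r in success)
    all_safe = all(guardrail_ok(r, margin) for r in guardrails)
    return any_win and all_safe


# Made-up numbers for the checkout flow example.
checkout_time = MetricResult("time_to_checkout", ci_lower=0.02, ci_upper=0.08)
purchases = MetricResult("purchases_per_visitor", ci_lower=-0.01, ci_upper=0.03)

decision = should_ship([checkout_time], [purchases], margin=0.02)
print(decision)
```

With these made-up numbers, the success metric's interval lies entirely above zero and the guardrail's lower bound stays within the 2% tolerance, so the sketch recommends shipping. If the guardrail's lower bound dropped to -0.05, the same call would return False regardless of how large the success metric's win was.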