Lesson 3: Different types of metrics in experiments

The goal of the statistical analysis in an experiment is to provide a solid foundation for product decisions, covered in detail in a previous blog post. This means that the statistical analysis focuses on an overall decision, informed by all metrics in the experiment. A good experiment uses different types of metrics to balance the validity of the product decision with learning as much as possible, safeguarding against regressions, and inspiring new iterations. Learn how to use different types of metrics to maximize efficiency and to avoid inflating the sample sizes you need to collect.

You can add metrics to an experiment in the following ways:

  • Required metrics on surfaces. Configure metrics that all experiments on specific surfaces must add. The platform verifies that these metrics don't deteriorate.
  • Success and guardrail metrics. Add the metrics meant to inform the decision when setting up the experiment.
  • Exploratory metrics. Create exploratory analyses to dig deeper and split by dimensions during or after launching your experiment.

Use required metrics to align decision scrutiny

Required metrics are metrics that every experiment running on a surface checks for regressions. These metrics show up on the experiment design page automatically when you select the surface. Importantly, required metrics are only checked for deterioration: Confidence recommends aborting the experiment if a required metric moves significantly in the unintended direction. These metrics have a negligible effect on the required sample size of the experiment. See this blog post for more details.

Required metrics increase the autonomy of teams. A single team focuses on optimizing the user experience they own, using metrics related directly to that experience. If an experiment has an unintended effect on business-critical metrics, Confidence alerts the team.

Use success and guardrail metrics to construct the basis for your decision

Success metrics

For success metrics, the recommendation is to ship if at least one success metric has significantly improved. This means that each success metric you add gives you one more chance of a false positive result. To control the overall false positive rate, the platform corrects for the number of success metrics, which leads to a higher required sample size.
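To make the cost of extra success metrics concrete, here is a minimal sketch of how a multiple-testing correction inflates the sample size. It assumes a Bonferroni split of the false positive rate across success metrics and a standard two-sided, two-sample z-test; the text above doesn't say which correction the platform applies, so treat this as an illustration rather than the platform's calculation. The function name and all parameter values (`alpha`, `power`, `sigma`, `mde`) are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def success_sample_size(n_success_metrics, alpha=0.05, power=0.8,
                        sigma=1.0, mde=0.1):
    """Per-group sample size for a two-sided, two-sample z-test.

    Assumption: alpha is split evenly across success metrics (Bonferroni).
    sigma is the metric's standard deviation and mde the minimum
    detectable effect, both in the metric's own units.
    """
    if n_success_metrics < 1:
        raise ValueError("need at least one success metric")
    z = NormalDist().inv_cdf
    alpha_per_metric = alpha / n_success_metrics  # stricter level per metric
    z_alpha = z(1 - alpha_per_metric / 2)
    z_power = z(power)
    return ceil(2 * (sigma * (z_alpha + z_power) / mde) ** 2)


# Each added success metric lowers the per-metric alpha and raises n.
for m in (1, 2, 5):
    print(m, success_sample_size(m))
```

Under these assumptions the growth is slow but steady, which is one reason to keep the number of success metrics small.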

Guardrail metrics

For guardrail metrics, you can choose to use them in one of two ways:

  • With non-inferiority margins
  • Without non-inferiority margins

Learn more about the two ways to analyze guardrail metrics in the lesson on the topic.

If you use non-inferiority margins, which is the most rigorous way of using guardrail metrics, all guardrail metrics must be simultaneously powered. You should only ship if the treatment is significantly non-inferior to control for all guardrail metrics. Each guardrail metric adds a chance of not finding a significant non-inferior result, which means that the power per metric is corrected upwards, leading to a higher required sample size.
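The upward power correction can be sketched in the same way. This example assumes that the allowed false negative rate (1 minus power) is split evenly across the guardrail metrics, that each guardrail uses a one-sided non-inferiority z-test, and that the true difference between treatment and control is zero. The platform's exact correction may differ, and the function name and parameter values are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def guardrail_sample_size(n_guardrails, alpha=0.05, power=0.8,
                          sigma=1.0, margin=0.1):
    """Return (power per metric, per-group sample size).

    Assumptions: one-sided non-inferiority z-test per guardrail, true
    difference of zero, and the false negative budget (1 - power) split
    evenly across guardrails so that all are powered simultaneously.
    """
    if n_guardrails < 1:
        raise ValueError("need at least one guardrail metric")
    z = NormalDist().inv_cdf
    beta_per_metric = (1 - power) / n_guardrails
    power_per_metric = 1 - beta_per_metric  # corrected upwards
    z_alpha = z(1 - alpha)
    z_power = z(power_per_metric)
    n = ceil(2 * (sigma * (z_alpha + z_power) / margin) ** 2)
    return power_per_metric, n


# More guardrails with margins means higher per-metric power and larger n.
for g in (1, 2, 4):
    print(g, guardrail_sample_size(g))
```

With four guardrails and 80% overall power, each guardrail needs 95% power in this sketch, which is what drives the larger sample size.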

If you don't use non-inferiority margins for guardrail metrics, they are only checked for significant regressions. Interpreting the lack of evidence for deterioration as a signal for shipping the variant is a less rigorous way to work with guardrail metrics. Guardrail metrics without non-inferiority margins don't affect the required sample size. This means that if the sample size requirements are too large when all guardrails have non-inferiority margins, you can trade off rigor for a smaller sample size by changing some guardrail metrics to not use non-inferiority margins.
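The rules in this section, together with the abort rule for required metrics, can be combined into one decision function. The sketch below is illustrative only: the class names, fields, and return labels are hypothetical and not part of Confidence. It also assumes that a significant regression in a guardrail metric blocks shipping, which the text implies but doesn't state as an explicit rule.

```python
from dataclasses import dataclass
from enum import Enum


class Role(Enum):
    REQUIRED = "required"
    SUCCESS = "success"
    GUARDRAIL_NIM = "guardrail with non-inferiority margin"
    GUARDRAIL = "guardrail without non-inferiority margin"


@dataclass
class MetricResult:
    name: str
    role: Role
    significantly_improved: bool = False      # used for success metrics
    significantly_non_inferior: bool = False  # used for guardrails with margins
    significantly_regressed: bool = False     # used for required and guardrail metrics


def recommend(results):
    """Return 'abort', 'ship', or 'do not ship' for a list of MetricResult."""
    # A required metric moving significantly the wrong way: abort.
    if any(r.role is Role.REQUIRED and r.significantly_regressed for r in results):
        return "abort"
    # Assumption: a significantly regressed guardrail blocks shipping.
    guardrails = [r for r in results
                  if r.role in (Role.GUARDRAIL, Role.GUARDRAIL_NIM)]
    if any(r.significantly_regressed for r in guardrails):
        return "do not ship"
    # At least one success metric must have significantly improved.
    success_ok = any(r.role is Role.SUCCESS and r.significantly_improved
                     for r in results)
    # Every guardrail with a margin must be significantly non-inferior.
    nim_ok = all(r.significantly_non_inferior for r in results
                 if r.role is Role.GUARDRAIL_NIM)
    return "ship" if success_ok and nim_ok else "do not ship"
```

Note that a guardrail without a margin only has to avoid a significant regression here, which is why it is the less rigorous option.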

Use exploratory metrics to learn more

You can add any metrics to an exploratory analysis of the experiment. You can also slice and dice these results on dimensions to see the results in different subgroups. Use explorations to dig deeper into the results after you make a decision from the experiment. The exploration is separate from the analysis on the result page, and does not affect the required sample size of the experiment.

Recommendations

Consider the following recommendations when you configure metrics for your experiments:

  • Add as required metrics the metrics that should never deteriorate, but that you rarely use as decision metrics in experiments.

    • Add them to the global surface if they apply business-wide; otherwise, add them to a local surface.
    • Required metrics don't inflate the required sample size, but the more you add, the lower the chance of detecting a deterioration.
  • Add as success metrics the metrics that quantify the success of a treatment.

    • These should generally be few.
    • The more success metrics you add, the larger the sample size you need to collect.
  • Add as guardrail metrics with non-inferiority margins the metrics that your success metrics are likely to cannibalize.

    • The margin lets you quantify the trade-off between a guardrail metric deteriorating and a success metric improving.
    • The more guardrail metrics with non-inferiority margins you add, the larger the sample size you need to collect.
  • Add as guardrail metrics without non-inferiority margins the metrics that you don't want to see deteriorate, but that you have a less strict trade-off for.

    • Guardrail metrics without non-inferiority margins don't inflate the sample size you need to collect, but detecting true deterioration becomes harder the more you add.
  • Add as exploratory metrics the metrics that you don't base your decision on, but that help you learn and understand the results of the experiment.

    • Exploratory metrics don't influence the sample size you need to collect, but adding more makes it harder to find effects in your explorations.