Lesson 4: Success metrics

After you've written the hypothesis, you should have a clear idea of which user behavior the experiment should influence and what outcome you expect to see. Now you need to pick metrics that measure whether the experiment achieves this outcome. An ideal success metric directly measures the desired outcome and is:

  • Observable in the short term
  • Sensitive to changes
  • Relevant for the business in the long term

In the best case, you can measure your desired outcome directly and within a reasonable time after a user's exposure to the change.

Unfortunately, the outcome of interest often lies further in the future and is difficult to measure directly within the experiment.

Select a few specific metrics

Success metrics should be as specific to the hypothesis as possible. You may be curious to learn about all the possible effects that your treatment may have. It's often tempting to just add every single metric that your change could possibly impact. However, when deciding on a success metric you should limit yourself to a few relevant metrics, and separate explorations from the criterion that defines success.

You should select only a few success metrics because:

  • It's harder to reliably measure success with many metrics
  • More metrics require a larger sample size
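
The two reasons above can be illustrated with a rough calculation. This is a minimal sketch, not a prescribed method: it assumes the metrics are independent, that each is tested with a two-sided z-test, and that a Bonferroni correction is used to keep the overall false positive rate at the chosen level. The function names and the default values for `alpha` and `power` are illustrative assumptions.

```python
from statistics import NormalDist


def familywise_false_positive_rate(num_metrics, alpha=0.05):
    """Chance that at least one metric shows a false positive.

    Assumes independent metrics, each tested at level `alpha`
    with no correction applied.
    """
    return 1 - (1 - alpha) ** num_metrics


def relative_sample_size(num_metrics, alpha=0.05, power=0.8):
    """Sample size needed with a Bonferroni correction, relative to one metric.

    Assumption: required sample size is proportional to
    (z_alpha + z_beta)^2, as in the usual z-test approximation.
    """
    normal = NormalDist()
    z_beta = normal.inv_cdf(power)
    # Uncorrected two-sided critical value for a single metric.
    z_single = normal.inv_cdf(1 - alpha / 2)
    # Bonferroni: split alpha evenly across all success metrics.
    z_corrected = normal.inv_cdf(1 - alpha / (2 * num_metrics))
    return (z_corrected + z_beta) ** 2 / (z_single + z_beta) ** 2


if __name__ == "__main__":
    for k in (1, 2, 5, 10):
        print(k, round(familywise_false_positive_rate(k), 3),
              round(relative_sample_size(k), 2))
```

Under these assumptions, five uncorrected metrics give a chance of roughly 23% that at least one of them shows a false positive, and correcting for this raises the required sample size to roughly 1.5 times that of a single metric.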

After your experiment ends, you can explore the effects on other metrics using exploratory analysis. This can help you understand the results better and inspire new hypotheses. However, you should base the decision on whether to ship a change on your pre-defined success metrics, not on metrics that you added afterwards. Pre-defining decision criteria helps to avoid confirmation bias, where you selectively look for evidence that confirms your beliefs and ignore evidence against them.

Use the minimum detectable effect to set the sensitivity of the experiment

After you decide which metric to use to measure success, you need to define what effect size you want the experiment to be able to reliably detect. This effect size is called the "minimum detectable effect" (MDE), or sometimes the "minimum relevant effect." You use the MDE to set up and plan the experiment so that it has enough sensitivity to detect effects you consider meaningful.

Selecting the MDE is a trade-off between:

  • the smallest business-relevant effect
  • the smallest practically measurable effect

As an experimenter, use your domain expertise and discuss with stakeholders what the smallest effect is that you consider meaningful. In the next step, you use the MDE to calculate how much traffic you need to reliably detect this effect. If the sample size you need to measure the chosen MDE is unrealistically large, then you need to adjust the MDE upwards.
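
As a rough illustration of how an MDE translates into required traffic, the sketch below uses the standard normal approximation for comparing two conversion rates. It assumes a binary (conversion-type) metric, a two-sided test, equal group sizes, and an MDE expressed as an absolute difference. The function name, the default `alpha` and `power`, and the example rates are illustrative assumptions, not values from this lesson.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(baseline_rate, mde_abs, alpha=0.05, power=0.8):
    """Approximate users needed per group to detect an absolute lift of `mde_abs`.

    Uses the normal approximation for a two-sided two-proportion z-test
    with equally sized control and treatment groups.
    """
    p_control = baseline_rate
    p_treatment = baseline_rate + mde_abs
    normal = NormalDist()
    z_alpha = normal.inv_cdf(1 - alpha / 2)  # critical value for significance
    z_beta = normal.inv_cdf(power)           # quantile for the desired power
    variance_sum = (p_control * (1 - p_control)
                    + p_treatment * (1 - p_treatment))
    return ceil((z_alpha + z_beta) ** 2 * variance_sum / mde_abs ** 2)


if __name__ == "__main__":
    # Hypothetical example: 10% baseline conversion rate.
    for mde in (0.02, 0.01, 0.005):
        print(f"MDE {mde:.3f}: {sample_size_per_group(0.10, mde)} users per group")
```

Running this for progressively smaller MDE values shows the trade-off: the required sample size grows roughly with the inverse square of the MDE, which is why an MDE that is too ambitious for the available traffic has to be adjusted upwards.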

One way to understand the MDE of an experiment is to imagine your experiment as a microscope.

Illustration: MDE is like the resolution of a microscope

Imagine looking at cells under a microscope. The minimum detectable effect of an experiment is analogous to the resolution of a microscope. With a blurry, low-resolution image you can see only large structures. If you are specifically interested in smaller structures, you need a higher resolution. For even smaller structures you need an even higher resolution. In experiments, you can increase the sensitivity by increasing the sample size. This allows you to detect smaller changes.
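
To make the microscope analogy concrete, the sketch below turns the question around and estimates the "resolution" you get from a given sample size. It is an approximation under stated assumptions: a binary metric, a two-sided z-test, equal group sizes, and the variance in both groups approximated by the baseline variance. The function name and the default `alpha` and `power` are illustrative.

```python
from math import sqrt
from statistics import NormalDist


def mde_for_sample_size(baseline_rate, users_per_group, alpha=0.05, power=0.8):
    """Approximate absolute MDE achievable with `users_per_group` users.

    Simplification: both groups are assumed to have the baseline variance
    p * (1 - p), so the result is a rough planning estimate only.
    """
    normal = NormalDist()
    z_alpha = normal.inv_cdf(1 - alpha / 2)
    z_beta = normal.inv_cdf(power)
    variance = baseline_rate * (1 - baseline_rate)
    return (z_alpha + z_beta) * sqrt(2 * variance / users_per_group)


if __name__ == "__main__":
    # Hypothetical example: 10% baseline conversion rate.
    for n in (10_000, 40_000, 160_000):
        print(f"{n} users per group: MDE of about {mde_for_sample_size(0.10, n):.4f}")
```

Because the MDE shrinks with the square root of the sample size in this approximation, quadrupling the traffic only halves the smallest detectable effect.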

Watch this video to learn more about what the MDE is and what to consider when you set it