Lesson 1: Guardrail metrics with non-inferiority margins

In this lesson you learn how to test guardrail metrics and how inferiority and non-inferiority tests differ. You can test guardrail metrics in two different ways:

  • Use an inferiority test. This test evaluates whether there is evidence that the guardrail metric does worse in the treatment group compared to the control group.
  • Use a non-inferiority test. This test instead evaluates whether there is evidence that the guardrail metric in the treatment group is not worse than in the control group by more than a pre-defined margin.

In practice you choose between the two tests by specifying or not specifying a non-inferiority margin. If you specify a non-inferiority margin, you use a non-inferiority test. If you don't, you use an inferiority test.

Inferiority test

The inferiority test seeks evidence that the treatment group does worse than the control group. A significant result signals that the change causes the metric to deteriorate, so the change is not a good launch candidate.

The downside of the inferiority test is that it can never establish evidence that the change is safe to launch. Instead, it treats a lack of evidence for a negative impact as if it meant that the change is safe to launch.

If you can't prove that the metric deteriorated, then you conclude that the change is safe to launch.
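
As a minimal sketch of the logic above, the following Python function runs a one-sided z-test for deterioration. It assumes that higher metric values are better, that you already have the group means and the standard error of their difference, and that a normal approximation is acceptable. The name `inferiority_test` and its parameters are hypothetical and are not part of Confidence.

```python
from statistics import NormalDist


def inferiority_test(mean_treatment, mean_control, se_diff, alpha=0.05):
    """One-sided z-test for deterioration (assumes higher is better).

    H0: treatment - control >= 0  (no deterioration)
    H1: treatment - control <  0  (deterioration)
    """
    z = (mean_treatment - mean_control) / se_diff
    # A small p-value is evidence that the metric deteriorated.
    p_value = NormalDist().cdf(z)
    return p_value, p_value < alpha


# Hypothetical numbers: observed difference -0.05 with standard error 0.1.
p_value, deteriorated = inferiority_test(9.95, 10.0, 0.1)
print(f"p-value: {p_value:.3f}, deteriorated: {deteriorated}")
```

With these made-up numbers the test is not significant. That is only a lack of evidence for deterioration, not evidence that the change is safe.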

Non-inferiority test

The non-inferiority test seeks evidence that the difference between the treatment and control groups is not worse than a pre-specified margin, known as the non-inferiority margin. A significant result is evidence that the change does better than the margin, so the change is a good launch candidate.

One way to think about the non-inferiority margin is as a threshold for how much a metric can deteriorate before you consider it to be a negative impact. If you can prove that the metric did not deteriorate by more than the non-inferiority margin, then you conclude that the change is safe to launch.
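
The same idea can be sketched in Python by shifting the null hypothesis from zero to the margin. This is an illustration under assumptions: higher metric values are better, the margin is a positive number in the metric's own units, and a normal approximation holds. The name `non_inferiority_test` and its parameters are hypothetical and are not part of Confidence.

```python
from statistics import NormalDist


def non_inferiority_test(mean_treatment, mean_control, se_diff, margin, alpha=0.05):
    """One-sided z-test against a non-inferiority margin (assumes higher is better).

    H0: treatment - control <= -margin  (worse than the margin)
    H1: treatment - control >  -margin  (non-inferior)
    """
    if margin <= 0:
        raise ValueError("margin must be positive")
    z = (mean_treatment - mean_control + margin) / se_diff
    # A small p-value is evidence that the metric did not
    # deteriorate by more than the margin.
    p_value = 1.0 - NormalDist().cdf(z)
    return p_value, p_value < alpha


# Hypothetical numbers: observed difference -0.05, standard error 0.1, margin 0.3.
p_value, non_inferior = non_inferiority_test(9.95, 10.0, 0.1, margin=0.3)
print(f"p-value: {p_value:.4f}, non-inferior: {non_inferior}")
```

Here the observed difference is slightly negative, yet the test is significant because the difference sits well above the margin of -0.3.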

How inferiority and non-inferiority tests compare

To see the difference between the two tests, consider the following illustration.

[Figure: Non-inferiority margin]

The figure shows a confidence interval for a guardrail metric. The non-inferiority margin is below the interval's lower bound, which means that there is evidence that the difference between the treatment and control groups is not worse than the margin. In other words, the metric is significantly non-inferior, and the non-inferiority test is successful. If you use the non-inferiority test, the conclusion is that there is evidence that the change doesn't have a negative impact on the metric beyond the non-inferiority margin that you set.

For the same illustration, if you use an inferiority test and disregard the non-inferiority margin, there is no evidence of deterioration because the confidence interval covers zero. The conclusion is then that the change is safe to launch.

From non-inferiority to inferiority

If you use a non-inferiority test for a guardrail, Confidence still automatically checks if the metric has deteriorated as part of its monitoring checks.

Consider the following illustration. In this case the metric has not deteriorated beyond the non-inferiority margin, but it has deteriorated in relation to zero. The overall conclusion is that it is not safe to launch this feature.

[Figure: Non-inferiority margin]
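
The two cases discussed in this lesson can be summarized as a small decision rule on a confidence interval for the difference between treatment and control. The sketch below assumes higher metric values are better and uses a single interval for both checks. This lesson doesn't describe how Confidence combines the checks internally, so the hypothetical function `guardrail_decision` is only an illustration of the reasoning.

```python
def guardrail_decision(ci_lower, ci_upper, margin):
    """Classify a guardrail result from a confidence interval for treatment - control.

    Assumes higher metric values are better and `margin` is positive.
    """
    if ci_upper < 0:
        # The whole interval is below zero: evidence of deterioration,
        # even if the interval also lies above -margin.
        return "deteriorated: not safe to launch"
    if ci_lower > -margin:
        # The whole interval is above -margin: significantly non-inferior.
        return "non-inferior: safe to launch"
    return "inconclusive: no evidence of non-inferiority"


# Hypothetical intervals with a margin of 0.3.
print(guardrail_decision(-0.20, 0.10, margin=0.3))   # covers zero, above -margin
print(guardrail_decision(-0.25, -0.05, margin=0.3))  # below zero, above -margin
print(guardrail_decision(-0.45, 0.10, margin=0.3))   # crosses -margin
```

The second call mirrors the situation above: the interval stays within the margin but lies entirely below zero, so the deterioration check takes precedence.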

Practical considerations

Inferiority or non-inferiority test

Inferiority tests can be a good starting point for guardrail metrics as they don't require the experimenter to specify a non-inferiority margin. However, keep the following in mind:

  • Non-significant doesn't mean neutral. You shouldn't make a product decision on a non-significant result. Lack of significance means you don't have evidence that the metric has deteriorated. It doesn't mean you have evidence that the metric hasn't deteriorated. With a wide enough confidence interval, for example because of a small sample size, any effect can look neutral.
  • A non-inferiority test is a better, but more complicated, approach. With a non-inferiority test, you seek evidence that the change is safe to launch. A significant result gives you a specified level of certainty that your change doesn't decrease your metric by more than the margin you've chosen.

Set the non-inferiority margin

You need to consider both the business perspective and the statistical perspective when you set the non-inferiority margin. The business perspective is about what decrease is acceptable for the business. The statistical perspective is about what margin is reasonable given the variability of the metric and the sample size you can expect.

To decide on the non-inferiority margin, you can consider the following:

  • What negative change in the metric is acceptable?
  • What change in the metric is so small that you would consider the treatment and control groups practically equal?
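
To make the statistical perspective concrete, the following sketch estimates how many users per group a given margin requires. It uses the standard normal-approximation formula for a one-sided test and assumes equal group sizes, a known common standard deviation, and a true difference of zero. The function `sample_size_per_group` and the example numbers are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(std_dev, margin, alpha=0.05, power=0.8):
    """Approximate per-group sample size for a one-sided non-inferiority z-test.

    Assumes equal group sizes, a common known standard deviation, and a true
    difference of zero between treatment and control.
    """
    if margin <= 0:
        raise ValueError("margin must be positive")
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    z_power = NormalDist().inv_cdf(power)
    return ceil(2 * (std_dev * (z_alpha + z_power) / margin) ** 2)


# Hypothetical metric with standard deviation 1.0.
for margin in (0.2, 0.1, 0.05):
    n = sample_size_per_group(1.0, margin)
    print(f"margin {margin}: about {n} users per group")
```

Halving the margin roughly quadruples the required sample size, which is why a margin that is acceptable to the business may still be impractical to test.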