Lesson 1: Guardrail metrics with non-inferiority margins
Guardrail metrics can be tested with inferiority tests or non-inferiority tests. Using non-inferiority tests improves the quality of the decision to launch a feature, but requires you to specify a non-inferiority margin.
In this lesson, you learn how to test guardrail metrics and how inferiority and non-inferiority tests differ. You can test guardrail metrics in two different ways:
- Use an inferiority test. This test evaluates whether there is evidence that the guardrail metric does worse in the treatment group compared to the control group.
- Use a non-inferiority test. This test instead evaluates whether there is evidence that the guardrail metric in the treatment group is not worse than in the control group by more than a pre-defined threshold.
In practice you choose between the two tests by specifying or not specifying a non-inferiority margin. If you specify a non-inferiority margin, you use a non-inferiority test. If you don't, you use an inferiority test.
Read more about the statistical details of inferiority and non-inferiority tests in the Confidence documentation.
Inferiority test
The inferiority test seeks evidence that the treatment group does worse than the control group. If the metric deteriorates because of the treatment, a significant result signals that the change causes the deterioration and is not a good launch candidate.
The downside of the inferiority test is that it can never establish evidence that the change is safe to launch. Instead, it treats a lack of evidence for a negative impact as if the change were safe to launch.
Your guardrail metric number of purchases per visitor changes by -0.2% with a 95% confidence interval of [-0.7%, 0.3%]. There is no evidence that your change has a negative impact on the number of purchases per visitor. There is also no evidence that your change doesn't have a negative impact on the metric.
If you can't prove that the metric deteriorated, then you conclude that the change is safe to launch.
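The decision rule in this example can be sketched in a few lines of Python. This is a minimal illustration, not how Confidence implements the test: the function name `shows_deterioration` is hypothetical, and the sketch assumes that higher values of the metric are better and that the decision is read directly off the bounds of the confidence interval.

```python
def shows_deterioration(ci_lower: float, ci_upper: float) -> bool:
    """Inferiority check: does the whole interval lie below zero?

    Assumes higher metric values are better and that both bounds are
    relative changes in percent (treatment versus control).
    """
    if ci_lower > ci_upper:
        raise ValueError("ci_lower must not exceed ci_upper")
    return ci_upper < 0.0


# Interval from the example above: [-0.7%, 0.3%].
deteriorated = shows_deterioration(-0.7, 0.3)
print("Evidence of deterioration:", deteriorated)
```

Because the interval covers zero, the check finds no evidence of deterioration, and the inferiority test treats the change as safe to launch even though a decrease of up to 0.7% remains plausible.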
Non-inferiority test
The non-inferiority test seeks evidence that the difference between the treatment and control groups is not worse than a pre-specified margin, known as the non-inferiority margin. A significant result is evidence that the change does better than the margin, which makes it a good launch candidate.
Your guardrail metric number of purchases per visitor changes by -0.2% with a 95% confidence interval of [-0.7%, 0.3%]. Your non-inferiority margin is -1%, meaning that you are willing to accept a 1% decrease. The lower bound of the interval exceeds this value and you can conclude that there is evidence that the change does better than your margin.
One way to think about the non-inferiority margin is as a threshold for how much a metric can deteriorate before you consider it to be a negative impact. If you can prove that the metric did not deteriorate by more than the non-inferiority margin, then you conclude that the change is safe to launch.
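As a rough sketch of the same rule in Python, the comparison is between the lower bound of the interval and the margin. The function name `is_non_inferior` is hypothetical, and the sketch assumes that higher values of the metric are better, so the margin is a negative number. It is an illustration of the idea, not the Confidence implementation.

```python
def is_non_inferior(ci_lower: float, margin: float) -> bool:
    """Non-inferiority check: is the lower bound above the margin?

    Assumes higher metric values are better, so the margin is a negative
    number such as -1.0 for "a 1% decrease is acceptable".
    """
    if margin >= 0.0:
        raise ValueError("margin must be negative in this sketch")
    return ci_lower > margin


# Values from the example above: lower bound -0.7%, margin -1%.
non_inferior = is_non_inferior(ci_lower=-0.7, margin=-1.0)
print("Evidence of non-inferiority:", non_inferior)
```

With the same interval and a stricter margin of -0.5%, the lower bound would fall below the margin and the test would no longer provide evidence that the change is safe.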
Learn about what the non-inferiority margin is and how to set it in 3 minutes and 44 seconds.
How inferiority and non-inferiority tests compare
The following illustration shows the difference between the two tests.

The figure shows a confidence interval for a guardrail metric. The non-inferiority margin is below the interval's lower bound, which means that there is evidence that the difference between the treatment and control groups is not worse than the margin. In other words, the change in the metric is significantly non-inferior, and the non-inferiority test succeeds. With the non-inferiority test, the conclusion is that there is evidence that the change doesn't have a negative impact on the metric beyond the non-inferiority margin that you set.
For the same illustration, if you use an inferiority test and disregard the non-inferiority margin, there is no evidence of deterioration because the confidence interval covers zero. The conclusion is therefore that the change is safe to launch.
From non-inferiority to inferiority
If you use a non-inferiority test for a guardrail, Confidence still automatically checks if the metric has deteriorated as part of its monitoring checks.
Consider the following illustration. In this case the metric has not deteriorated beyond the non-inferiority margin, but it has deteriorated in relation to zero. The overall conclusion is that it is not safe to launch this feature.
Read more about how the various checks feed into the recommendation to launch.
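The interplay between the non-inferiority test and the deterioration check can be sketched as follows. The function `assess_guardrail`, its return labels, and the example intervals are all hypothetical, and the sketch assumes that higher values of the metric are better. It does not reproduce the actual recommendation logic in Confidence.

```python
def assess_guardrail(ci_lower: float, ci_upper: float, margin: float) -> str:
    """Combine a non-inferiority test with a deterioration check.

    Assumes higher metric values are better and a negative margin.
    """
    if ci_upper < 0.0:
        # The whole interval is below zero: the metric has deteriorated,
        # even if it stays within the non-inferiority margin.
        return "deteriorated"
    if ci_lower > margin:
        # The lower bound clears the margin and zero is still plausible.
        return "non-inferior"
    # The interval reaches below the margin: no evidence either way.
    return "inconclusive"


# Hypothetical intervals, all with a margin of -1%.
for lower, upper in [(-0.8, -0.1), (-0.7, 0.3), (-1.5, 0.4)]:
    print((lower, upper), assess_guardrail(lower, upper, margin=-1.0))
```

The first interval stays above the margin but lies entirely below zero, which corresponds to the case described above where the feature is not safe to launch.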

Practical considerations
Inferiority or non-inferiority test
Inferiority tests can be a good starting point for guardrail metrics because they don't require the experimenter to specify a non-inferiority margin. However, keep the following in mind:
- Non-significant doesn't mean neutral. You shouldn't make a product decision based on a non-significant result. Lack of significance means you don't have evidence that the metric has deteriorated. It doesn't mean you have evidence that the metric hasn't deteriorated. With a wide enough confidence interval, for example because of a small sample size, any change can look neutral.
- A non-inferiority test is a better, but more complicated, approach. With a non-inferiority test, you seek evidence that the change is safe to launch. This means you have a specified certainty that your change doesn't decrease your metric by more than the margin you've chosen.
Set the non-inferiority margin
You need to consider both the business perspective and the statistical perspective when you set the non-inferiority margin. The business perspective is about what is an acceptable decrease for the business. The statistical perspective is about what is a reasonable margin given the variability of the metric and the sample size you can expect.
To decide on the non-inferiority margin, you can consider the following:
- What negative change in the metric is acceptable?
- What change in a given metric would you consider to be so small as to deem it practically equal?
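To get a feel for the statistical perspective, the sketch below estimates how many users each group needs for a given margin. It relies on assumptions that are not part of this lesson: a difference-in-means metric, equal group sizes, a common standard deviation, a normal approximation, and a true difference of zero. The function name and the input values are hypothetical, and the margin is expressed in the same absolute units as the standard deviation.

```python
import math
from statistics import NormalDist


def required_sample_size_per_group(
    std_dev: float, margin: float, alpha: float = 0.025, power: float = 0.8
) -> int:
    """Approximate users per group for a non-inferiority test.

    Uses the textbook normal-approximation formula
    n = 2 * (sigma * (z_alpha + z_power) / margin)^2,
    with a one-sided alpha and an assumed true difference of zero.
    """
    if std_dev <= 0 or margin == 0:
        raise ValueError("std_dev must be positive and margin non-zero")
    if not (0 < alpha < 1 and 0 < power < 1):
        raise ValueError("alpha and power must be between 0 and 1")
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    z_power = NormalDist().inv_cdf(power)
    n = 2 * (std_dev * (z_alpha + z_power) / abs(margin)) ** 2
    return math.ceil(n)


# Hypothetical metric with standard deviation 1.0: tighter margins
# require more users per group.
for margin in (-0.2, -0.1, -0.05):
    n = required_sample_size_per_group(std_dev=1.0, margin=margin)
    print(f"margin {margin}: about {n} users per group")
```

Halving the margin roughly quadruples the required sample size, which is why a margin that looks attractive from a business perspective may be out of reach for the traffic you can expect.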