A non-inferiority test confirms that a treatment is not meaningfully worse than control on a guardrail metric. It provides positive evidence of safety: if the test passes, you can be statistically confident that any degradation the treatment causes is smaller than the pre-specified non-inferiority margin (NIM). This is a stronger guarantee than an inferiority test, which only detects clear harm without confirming the absence of it.
Non-inferiority tests are the mature form of guardrail metric evaluation. They turn the ship decision from "we didn't see harm" into "we confirmed the harm, if any, is within bounds we defined as acceptable."
How does a non-inferiority test work?
The null hypothesis of a non-inferiority test is that the treatment is inferior: it's worse than control by at least the NIM. The alternative hypothesis is that the treatment is non-inferior: any degradation is smaller than the NIM.
For a metric where higher is better, degradation shows up as a negative treatment effect. If the lower bound of the confidence interval for the treatment effect sits above the negative NIM threshold, the test rejects the null and declares the treatment non-inferior. If the confidence interval extends below that threshold, the test fails to reject and you can't confirm safety. (For a lower-is-better metric such as latency, the logic mirrors: degradation is a positive effect, so the upper bound must stay below the positive NIM.)
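As a minimal sketch, the decision rule reduces to comparing one bound of a confidence interval against the margin. The function names here are hypothetical, and the sketch assumes the confidence interval for the treatment effect (treatment minus control) has already been computed:

```python
# Hypothetical helpers illustrating the non-inferiority decision rule.
# Both assume a pre-computed confidence interval for (treatment - control).

def is_non_inferior(ci_lower: float, nim: float) -> bool:
    """Higher-is-better metric: degradation is a negative effect, so the
    worst plausible effect (the CI's lower bound) must stay above -NIM."""
    return ci_lower > -nim

def is_non_inferior_lower_is_better(ci_upper: float, nim: float) -> bool:
    """Lower-is-better metric (e.g. latency): degradation is a positive
    effect, so the CI's upper bound must stay below +NIM."""
    return ci_upper < nim
```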
Here's a concrete example. A team sets a NIM of 50 milliseconds for page load time. The experiment finds that the treatment increases load time by 20ms, with a 95% confidence interval of [5ms, 35ms]. The upper bound (35ms) is below the NIM (50ms), so the treatment is non-inferior. The team can ship knowing the load time regression is bounded.
If instead the confidence interval were [5ms, 65ms], the test would be inconclusive. The treatment might be causing harm within the acceptable range, or it might be causing harm beyond it. The experiment doesn't have enough power to distinguish.
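Run through the hypothetical helpers above, the two scenarios come out as expected (the intervals are taken straight from the example, not recomputed):

```python
nim_ms = 50.0  # non-inferiority margin for page load time

# 95% CI of [5ms, 35ms]: upper bound below the NIM, so non-inferior.
print(is_non_inferior_lower_is_better(ci_upper=35.0, nim=nim_ms))  # True

# 95% CI of [5ms, 65ms]: upper bound exceeds the NIM, so inconclusive.
print(is_non_inferior_lower_is_better(ci_upper=65.0, nim=nim_ms))  # False
```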
Why is a non-inferiority test harder to implement than an inferiority test?
Two reasons.
First, it requires a NIM, and setting a good NIM requires product judgment. How many milliseconds of latency are you willing to trade for a 2% lift in conversion? How many basis points of crash rate increase? These aren't statistical questions. They're product decisions that require understanding the metric's business impact, its natural variability, and the strategic importance of the success-metric gain you're trading it against.
Second, non-inferiority tests require more statistical power. To confirm that a treatment's effect is bounded within a narrow margin, you need enough data to estimate the effect precisely. An experiment powered to detect a large regression (for an inferiority test) may be underpowered to confirm the treatment stays within a tight NIM. In practice, teams may need to run experiments longer or on more traffic to get conclusive non-inferiority results on their guardrails.
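To make the power cost concrete, here's a back-of-the-envelope sample size calculation for a two-sample non-inferiority test of means, assuming a one-sided test at level alpha, a known standard deviation, and (optimistically) a true effect of zero. The function name and defaults are illustrative assumptions, not any particular tool's implementation:

```python
import math
from scipy.stats import norm

def n_per_group(sigma: float, nim: float, alpha: float = 0.05,
                power: float = 0.80, true_effect: float = 0.0) -> int:
    """Sample size per arm to conclude non-inferiority at the given NIM,
    using the normal approximation for a difference in means."""
    z_alpha = norm.ppf(1 - alpha)
    z_beta = norm.ppf(power)
    # The test must resolve the gap between the true effect and the margin;
    # with true_effect = 0, that gap is the full NIM.
    gap = nim - abs(true_effect)
    return math.ceil(2 * sigma**2 * (z_alpha + z_beta) ** 2 / gap**2)

# Halving the NIM quadruples the required sample size:
print(n_per_group(sigma=400, nim=50))  # 792 users per arm
print(n_per_group(sigma=400, nim=25))  # 3166 users per arm
```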
Confidence handles the power calculation for non-inferiority tests during experiment setup, showing teams whether their planned sample size is sufficient to detect non-inferiority at the specified NIM. This prevents teams from running experiments that can't answer the guardrail question.
When should teams move from inferiority tests to non-inferiority tests?
The Confidence blog describes a graduated path. Teams new to guardrail metrics should start with inferiority tests, which require no NIM and catch obvious regressions. Once a team has run enough experiments to understand the natural variation of their guardrail metrics and can articulate the tradeoffs they're willing to make, they graduate to non-inferiority tests.
The signs that a team is ready include: they've tracked guardrail metrics for at least a dozen experiments, they can describe in concrete terms what a "tolerable" regression looks like for each metric, and they have enough traffic to power non-inferiority tests at their desired NIM.
Skipping straight to non-inferiority tests without this foundation leads to poorly chosen NIMs. A NIM set too loose provides false confidence: every experiment passes, including ones that are causing meaningful harm. A NIM set too tight blocks most experiments, frustrating product teams and incentivizing them to remove the guardrail.
How do non-inferiority tests interact with multiple guardrails?
When an experiment has multiple guardrail metrics, the probability of missing a real regression on at least one of them increases with the number of guardrails. The Confidence decision framework addresses this by adjusting the false negative rate (not the false positive rate) for the number of guardrail metrics.
The logic is specific to the guardrail context. A false positive on a guardrail means investigating a regression that doesn't exist: costly in time, but it doesn't ship harm. A false negative means missing a real regression and shipping it: directly harmful to users. The asymmetry means you want to control the probability of missing any real guardrail regression, which requires adjusting the power calculation rather than the significance level.
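One simple way to implement that adjustment is Bonferroni-style splitting of the type II error: if the overall chance of missing any real guardrail regression should stay at beta, power each of the K guardrails at 1 - beta/K. The sketch below reuses the normal-approximation sizing from above under that assumption; it is not necessarily the exact correction Confidence applies:

```python
import math
from scipy.stats import norm

def n_per_group_guardrails(sigma: float, nim: float, num_guardrails: int,
                           alpha: float = 0.05, beta: float = 0.20) -> int:
    """Size an experiment so the chance of missing a real regression on
    ANY guardrail stays at most beta (union bound over guardrails)."""
    per_metric_beta = beta / num_guardrails   # split the miss probability
    z_alpha = norm.ppf(1 - alpha)
    z_beta = norm.ppf(1 - per_metric_beta)    # higher power per guardrail
    return math.ceil(2 * sigma**2 * (z_alpha + z_beta) ** 2 / nim**2)

# Three guardrails at overall beta = 0.20 power each one at ~93.3%,
# which needs noticeably more traffic than a single guardrail at 80%:
print(n_per_group_guardrails(sigma=400, nim=50, num_guardrails=1))  # 792
print(n_per_group_guardrails(sigma=400, nim=50, num_guardrails=3))  # 1267
```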