A non-inferiority margin (NIM) is the maximum amount of deterioration in a guardrail metric that a team is willing to accept in exchange for a gain on their success metric. If a treatment degrades a guardrail by less than the NIM, it's considered non-inferior: the tradeoff is acceptable. If the degradation exceeds the NIM, the treatment fails the guardrail check.
Setting a NIM forces a team to answer a question most experimentation programs leave implicit: how much harm to this metric are we actually willing to tolerate? That question has no purely statistical answer. It's a product decision.
How does a non-inferiority margin work in practice?
A NIM is defined per guardrail metric, per experiment (or per experiment surface). Suppose a team's success metric is conversion rate and their guardrail is app startup time. They might set a NIM of 50 milliseconds: any treatment that increases startup time by less than 50ms passes the guardrail, even if the increase is statistically significant.
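In practice this amounts to a small per-experiment table of tolerances. The sketch below is purely illustrative; the metric names, units, and structure are assumptions of this example, not Confidence's actual configuration schema:

```python
# Hypothetical guardrail NIMs for one experiment (illustrative names and units).
# Each guardrail metric gets its own margin, expressed in that metric's units.
GUARDRAIL_NIMS = {
    "app_startup_time_ms": 50.0,        # tolerate up to a 50 ms slowdown
    "crash_free_session_rate_pp": 0.1,  # tolerate up to a 0.1 point drop
}
```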
The non-inferiority test then checks whether the treatment's effect on the guardrail metric is worse than control by more than the NIM. If the confidence interval for the treatment effect lies entirely above the negative NIM threshold (−NIM), the treatment is declared non-inferior. If the interval crosses −NIM, the result is inconclusive; if it lies entirely below −NIM, the treatment fails the guardrail.
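As a sketch of that check, here is a minimal normal-approximation version. The function name and sign convention are assumptions of this example, not any particular library's API; effects are oriented so that negative values mean degradation:

```python
import numpy as np
from scipy import stats

def non_inferior(treatment, control, nim, alpha=0.05):
    """Non-inferiority check for a guardrail metric (normal approximation).

    Sign convention (an assumption of this sketch): the metric is oriented
    so that a negative treatment effect means degradation, and `nim` is a
    positive tolerance. The treatment passes when the confidence interval
    for the effect lies entirely above -nim.
    """
    treatment, control = np.asarray(treatment), np.asarray(control)
    diff = treatment.mean() - control.mean()
    se = np.sqrt(treatment.var(ddof=1) / len(treatment)
                 + control.var(ddof=1) / len(control))
    z = stats.norm.ppf(1 - alpha / 2)  # two-sided (1 - alpha) interval
    lower, upper = diff - z * se, diff + z * se
    return {"effect": diff, "ci": (lower, upper), "non_inferior": lower > -nim}
```

For the startup-time example, where an increase is the degradation, you would negate the measurements first (or, equivalently, mirror the check so that the upper bound of the time increase must stay below 50 ms).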
This is stricter than a simple inferiority test, which only asks: "is the treatment worse than control by a meaningful amount?" The non-inferiority test goes further and asks: "can we positively confirm that the treatment is not meaningfully worse?" The distinction matters. An inferiority test that fails to reject means "we didn't detect meaningful harm." A non-inferiority test that rejects means "we have evidence the harm is small."
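Written as hypotheses, with Δ the treatment's effect on the guardrail (again oriented so that negative means degradation), the two tests swap the null and the alternative; this is a sketch of the standard framing:

$$
\begin{aligned}
\text{Inferiority:} \quad & H_0\!: \Delta \ge 0 \quad \text{vs.} \quad H_1\!: \Delta < 0 \\
\text{Non-inferiority:} \quad & H_0\!: \Delta \le -\mathrm{NIM} \quad \text{vs.} \quad H_1\!: \Delta > -\mathrm{NIM}
\end{aligned}
$$

In the non-inferiority test the burden of proof is reversed: the treatment is presumed meaningfully worse until the data show otherwise, which is why rejecting its null is affirmative evidence that the harm is bounded by the NIM.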
How should teams choose a NIM?
The NIM should reflect the actual tradeoff the team is willing to make. Three inputs shape it.
Business context. How important is this guardrail metric relative to the success metric? A team launching a revenue feature might tolerate a small engagement dip. A team optimizing search might tolerate zero regression in result relevance.
Metric scale. A NIM of 0.5 percentage points means different things for a metric that averages 80% vs. one that averages 2%: it is a 0.6% relative change in the first case and a 25% relative change in the second. Express the NIM in units that map to real user impact: milliseconds, percentage points, absolute counts.
Statistical feasibility. A very tight NIM requires more statistical power to confirm non-inferiority. If the NIM is smaller than the experiment's minimum detectable effect, you'll rarely be able to declare non-inferiority even when the treatment is truly harmless. The NIM must be large enough that your available sample size gives the test a realistic chance of confirming non-inferiority, but small enough to represent a genuine tolerance for harm.
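To make the feasibility constraint concrete, a standard normal-approximation sample-size calculation for confirming non-inferiority when the treatment is truly harmless (true effect zero) looks like this; `sigma` and the function name are illustrative assumptions:

```python
import math
from scipy import stats

def n_per_arm(sigma, nim, alpha=0.05, power=0.8):
    """Approximate per-arm sample size to confirm non-inferiority,
    assuming the true treatment effect is zero.

    sigma: metric standard deviation, in the same units as nim
    nim:   non-inferiority margin (positive tolerance)
    alpha: one-sided significance level of the non-inferiority test
    """
    z_alpha = stats.norm.ppf(1 - alpha)
    z_power = stats.norm.ppf(power)
    return math.ceil(2 * ((z_alpha + z_power) * sigma / nim) ** 2)
```

Because the NIM enters squared in the denominator, halving the margin roughly quadruples the required sample, which is why very tight NIMs are often infeasible in practice.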
Confidence lets teams define NIMs per metric at the experiment surface level, so all experiments on a shared product area use the same thresholds. This prevents ad hoc NIM selection after seeing results, which would undermine the statistical guarantees.
Why do some teams start with inferiority tests instead?
Setting meaningful NIMs is hard. It requires a product judgment about acceptable tradeoffs that many teams haven't formalized. A team that's never tracked guardrail metrics before can't reasonably set a NIM because they don't yet know what normal variation looks like for their metrics.
The Confidence team recommends a graduated approach: start with inferiority tests, which detect whether a treatment is causing meaningful harm without requiring a NIM. Once teams have experience interpreting guardrail results and understand the natural variation in their metrics, they can move to non-inferiority tests with explicit NIMs. This progression is described in detail in the Confidence blog post on guardrail metrics.
Skipping straight to non-inferiority testing before the organization is ready usually results in NIMs that are either too loose (rubber-stamping everything) or too tight (blocking changes that aren't actually harmful).