What is an inferiority test?

An inferiority test checks whether a treatment is worse than control by more than a specified margin on a guardrail metric. It answers a focused question: is this change causing enough harm to block the ship decision? If the test rejects, the treatment has a statistically detectable negative effect that exceeds the acceptable threshold. If it doesn't reject, the evidence is insufficient to conclude the treatment is causing meaningful harm.
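
To make the mechanics concrete, here is a minimal sketch of an inferiority test for a continuous guardrail metric, using a large-sample z approximation. The function, its `margin` parameter, and the assumption that lower metric values are worse are illustrative choices for this sketch, not Confidence's API.

```python
import numpy as np
from scipy import stats

def inferiority_test(treatment, control, margin=0.0, alpha=0.05):
    """One-sided test: is treatment worse than control by more than `margin`?

    Assumes lower metric values are worse (e.g., streaming quality).
    H0: mu_treatment - mu_control >= -margin  (no harm beyond the margin)
    H1: mu_treatment - mu_control <  -margin  (harm exceeds the margin)
    Rejecting H0 is evidence of a regression large enough to block the ship.
    With margin=0, the test simply asks whether the treatment is detectably
    worse at all.
    """
    t = np.asarray(treatment, dtype=float)
    c = np.asarray(control, dtype=float)
    diff = t.mean() - c.mean()
    se = np.sqrt(t.var(ddof=1) / len(t) + c.var(ddof=1) / len(c))
    z = (diff + margin) / se        # distance from the null boundary
    p = stats.norm.cdf(z)           # one-sided, lower tail
    return {"effect": diff, "p_value": p, "reject": p < alpha}
```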

Inferiority tests are the starting point for teams building guardrail metric practices. They require less organizational maturity than non-inferiority tests because they catch clear regressions without demanding that teams pre-specify how much deterioration they're willing to tolerate.

How does an inferiority test differ from a non-inferiority test?

The two tests ask complementary questions.

An inferiority test asks: "is the treatment meaningfully worse?" If it rejects, you have evidence of harm. If it doesn't reject, you haven't proven the treatment is harmful, but you also haven't proven it's safe.

A non-inferiority test asks: "can we confirm the treatment is not meaningfully worse?" If it rejects (the null hypothesis of inferiority), you have positive evidence that any degradation is within the acceptable non-inferiority margin (NIM). If it doesn't reject, you can't confirm the treatment is safe enough.

The practical difference: an inferiority test that doesn't reject leaves you uncertain. A non-inferiority test that does reject gives you confidence that the harm, if any, is bounded. For shipping decisions, the non-inferiority test provides stronger assurance. But it requires a NIM, which requires the team to define "how much harm is acceptable" before seeing results.
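
Under the same illustrative assumptions as the sketch above (lower values are worse, large-sample z approximation), the non-inferiority version flips which side of the margin carries the burden of proof:

```python
import numpy as np
from scipy import stats

def non_inferiority_test(treatment, control, nim, alpha=0.05):
    """One-sided test: can we confirm treatment is NOT worse by more than `nim`?

    H0: mu_treatment - mu_control <= -nim  (treatment is inferior)
    H1: mu_treatment - mu_control >  -nim  (degradation is within the margin)
    Rejecting H0 is positive evidence that any harm is bounded by the NIM.
    """
    t = np.asarray(treatment, dtype=float)
    c = np.asarray(control, dtype=float)
    diff = t.mean() - c.mean()
    se = np.sqrt(t.var(ddof=1) / len(t) + c.var(ddof=1) / len(c))
    z = (diff + nim) / se           # distance above the null boundary
    p = stats.norm.sf(z)            # one-sided, upper tail
    return {"effect": diff, "p_value": p, "reject": p < alpha}
```

Note that the same data can fail to reject both tests: no harm is detected, but safety within the NIM is not confirmed either. That is exactly the uncertain middle ground described above.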

This is why Confidence recommends starting with inferiority tests. They catch the obvious regressions. Once a team has enough experience with their guardrail metrics to set meaningful NIMs, they graduate to non-inferiority tests.

When does an inferiority test trigger a no-ship decision?

When the inferiority test rejects on a guardrail metric, the treatment is causing statistically detectable harm that exceeds the threshold. The standard response is to not ship the treatment, investigate the regression, and either fix the underlying cause or accept that this particular idea has an unacceptable tradeoff.
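
As a toy illustration of this decision rule (assumed logic for this sketch, not Confidence's actual implementation), a rejecting guardrail vetoes the ship regardless of the success-metric result:

```python
def ship_decision(success_metric_win, guardrail_results):
    """Toy ship decision: any rejecting guardrail inferiority test blocks
    the ship, no matter how the success metric looks.

    `guardrail_results` maps metric name -> result dict with a "reject"
    flag, e.g. the output of inferiority_test() above.
    """
    flagged = [name for name, res in guardrail_results.items() if res["reject"]]
    if flagged:
        return "no-ship: investigate regressions on " + ", ".join(flagged)
    return "ship" if success_metric_win else "inconclusive"
```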

At Spotify, 42% of experiments are rolled back after guardrail metrics detect regressions. Many of those catches come from inferiority tests on metrics like app startup time, crash rate, and streaming quality. Without those tests, the regressions would ship silently. The success metric might look fine while a different part of the user experience degrades.

Confidence runs inferiority tests on all guardrail metrics by default. When a guardrail is flagged, the platform surfaces the effect size and confidence interval alongside the success metric results, giving the team the full picture for their ship decision.

What are the limits of inferiority tests?

The main limitation is that a non-significant inferiority test doesn't prove safety. If the experiment is underpowered for the guardrail metric, the test may fail to detect a real regression simply because there isn't enough data. A treatment could be causing moderate harm and the inferiority test won't catch it if the sample size is too small.

This is the gap that non-inferiority tests fill. By pre-specifying a NIM and designing the experiment with enough power to test against it, you can get positive evidence of safety rather than just absence of detected harm.
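
A back-of-the-envelope sample-size calculation shows what "enough power to test against the NIM" means in practice. The formula is the standard large-sample one for a one-sided two-sample test; the standard deviation and margin below are assumed numbers, not recommendations.

```python
import numpy as np
from scipy import stats

def n_per_arm_for_nim(sd, nim, alpha=0.05, power=0.8):
    """Approximate sample size per arm so a one-sided non-inferiority test
    at level `alpha` reaches `power` when the true effect is zero.

    Standard large-sample formula: n = 2 * sd^2 * (z_alpha + z_beta)^2 / nim^2.
    """
    z_alpha = stats.norm.ppf(1 - alpha)
    z_beta = stats.norm.ppf(power)
    return int(np.ceil(2 * (sd * (z_alpha + z_beta) / nim) ** 2))

# e.g. sd = 1.0 and NIM = 0.02 (2% of a standardized metric)
print(n_per_arm_for_nim(sd=1.0, nim=0.02))  # ~31k users per arm
```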

Another limitation: inferiority tests evaluate each experiment in isolation. A treatment that causes a tiny, non-detectable regression on a guardrail metric will pass the inferiority test. If hundreds of experiments each cause the same tiny regression, the cumulative effect can be substantial even though no individual experiment was flagged. Longitudinal guardrails address this by tracking guardrail metrics across experiments over time.
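
A quick worked example (with assumed numbers) shows how fast undetectable regressions compound:

```python
# Back-of-the-envelope: 200 shipped experiments, each causing a 0.1%
# relative regression too small for any single inferiority test to flag.
per_experiment_regression = 0.001
n_experiments = 200
remaining = (1 - per_experiment_regression) ** n_experiments
print(f"cumulative metric level: {remaining:.1%}")  # ~81.9% of baseline
```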

How should teams set the detection threshold?

Most inferiority tests use a significance level of 0.05, meaning the test flags a regression when harm at least as large as the observed effect would occur less than 5% of the time by chance alone if the treatment had no real effect. The Confidence decision framework recommends adjusting the false negative rate (the probability of missing a real regression) based on the number of guardrail metrics, because missing a real regression on any guardrail metric is the operational risk that matters.

The false positive rate on guardrail metrics typically doesn't need adjustment for multiple testing. A false positive on a guardrail means investigating a regression that doesn't exist, which costs time but doesn't ship harm. A false negative means shipping harm you didn't detect, which is worse. This asymmetry is formalized in Confidence's metric type framework.
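
One simple, conservative way to operationalize this asymmetry (an assumption for illustration, not necessarily the exact correction Confidence applies) is to split an overall false negative budget across the guardrails while leaving the significance level untouched:

```python
def per_metric_power(overall_beta=0.2, n_guardrails=5):
    """Bonferroni-style split of the overall false negative budget across
    guardrail metrics: a simple, conservative scheme.

    With 5 guardrails and a 20% overall miss budget, each metric's test
    must run at beta = 0.04, i.e. 96% power, while alpha stays at 0.05.
    """
    beta_per_metric = overall_beta / n_guardrails
    return 1 - beta_per_metric

print(per_metric_power())  # 0.96
```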