Lesson 6: Guardrail metrics and NIMs

Summary

In this lesson, you learn what guardrail metrics are and how to read their status labels. You see how adding a non-inferiority margin (NIM) changes what the labels mean, and why a NIM gives you stronger evidence of safety than no NIM.

Guardrail metrics are not the ones you are trying to improve. They are the ones you want to make sure you do not damage. Examples: session length, error rate, revenue per user. The question for a guardrail is not "did this go up?" but "did this go in the wrong direction?"

Because the question differs from the one you ask of a success metric, the status labels differ too.

Without a NIM

The CI for a guardrail metric sits on the same axis as for a success metric. The zero line is still the reference. But the question is now about the harmful direction, not the positive direction.

For a metric where increases are harmful (such as time in checkout, error rate, or support contacts):

Has deteriorated: the CI is entirely on the harmful side of zero, that is, the side opposite to the metric's improvement direction (entirely above zero for the metrics listed here). There is statistical evidence the metric moved in the harmful direction. This is a serious signal.
Has not deteriorated: the CI crosses zero. No statistical evidence of movement in the wrong direction. The guardrail is holding.
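
The two labels above can be sketched as a small function. This is a minimal illustration, not Confidence's implementation: the function name, the `improves_when` parameter, and the strict inequality at zero are all assumptions made for the example.

```python
def guardrail_status_no_nim(ci_lower: float, ci_upper: float,
                            improves_when: str = "decreases") -> str:
    """Classify a guardrail metric without a NIM, using zero as the reference.

    ci_lower and ci_upper are the bounds of the CI for the relative effect.
    improves_when is hypothetical: "decreases" means increases are harmful.
    """
    if ci_lower > ci_upper:
        raise ValueError("ci_lower must not exceed ci_upper")
    if improves_when == "decreases":
        # Increases are harmful: harm means the whole CI is above zero.
        deteriorated = ci_lower > 0
    elif improves_when == "increases":
        # Decreases are harmful: harm means the whole CI is below zero.
        deteriorated = ci_upper < 0
    else:
        raise ValueError("improves_when must be 'increases' or 'decreases'")
    # Assumption: anything that is not "Has deteriorated" is labeled
    # "Has not deteriorated", including a CI entirely on the good side.
    return "Has deteriorated" if deteriorated else "Has not deteriorated"


print(guardrail_status_no_nim(-1.5, 9.9))   # CI crosses zero
print(guardrail_status_no_nim(4.3, 15.7))   # CI entirely above zero
```

The first call uses an interval that contains zero, so no harm is detected. The second interval lies entirely in the harmful direction.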

Use the interactive below with the "No NIM" checkbox checked to explore these labels.

[Interactive: CI and status for guardrail metrics]

Adjust the point estimate to see how the status changes. Use the direction toggle and NIM to explore different configurations.

Controls: direction toggle ("Metric improves when it:"), point estimate (-15% to +15%), sample size per group (100 to 10,000 users), metric standard deviation σ (10, low noise, to 100, high noise), non-inferiority margin (1% to 15%), and a "No NIM" checkbox.

Example state: point estimate +4.2%, 300 users per group, σ = 50 units, "No NIM" checked. Status: Has not deteriorated. With high confidence, the true effect is between -1.5% and +9.9%. Since zero is in the interval, there is no statistical evidence of harm to this metric. The result for this metric is in line with recommending to ship.

Try the following with "No NIM" checked:

With the direction set to "Decreases" (harmful = increase), drag the point estimate from +15% to -15% and watch both states: "Has deteriorated" when the CI sits entirely above zero, and "Has not deteriorated" when the CI crosses zero.
Move it to +3% and reduce the sample size to 200. The wide CI crosses zero: "Has not deteriorated."
Move the point estimate above +5% with a small sample size. The CI may sit entirely above zero: "Has deteriorated."
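
To see why sample size and noise change the label, it can help to sketch how they drive the width of the CI. The code below is a simplified illustration and not the calculation used by the interactive or by Confidence. It assumes a normal approximation, two equal-sized groups with the same standard deviation, a 95% level (z = 1.96), and a hypothetical `baseline_mean` that is treated as a fixed constant to convert units into percent.

```python
import math


def relative_ci(point_estimate_pct: float, n_per_group: int, sigma: float,
                baseline_mean: float, z: float = 1.96) -> tuple[float, float]:
    """Approximate CI for a relative effect, in percent.

    Assumptions (not from the lesson): equal group sizes, a common sigma,
    and a baseline mean treated as known rather than estimated.
    """
    # Standard error of the difference between two group means.
    se_abs = sigma * math.sqrt(2.0 / n_per_group)
    # Express the half-width relative to the assumed baseline mean.
    half_width_pct = z * se_abs / baseline_mean * 100.0
    return (point_estimate_pct - half_width_pct,
            point_estimate_pct + half_width_pct)


# Hypothetical baseline mean of 100 units; sigma of 50 units.
small = relative_ci(3.0, n_per_group=200, sigma=50.0, baseline_mean=100.0)
large = relative_ci(3.0, n_per_group=10_000, sigma=50.0, baseline_mean=100.0)
print(small)  # wide interval that contains zero
print(large)  # much narrower interval around the same point estimate
```

Under these assumptions the half-width shrinks with the square root of the sample size, so quadrupling the users per group halves the width of the interval.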

What a NIM is

A non-inferiority margin (NIM) defines how much deterioration is acceptable. Rather than asking "did this metric move at all?", a NIM lets you say "we accept a deterioration of up to X%" (for example, up to an X% increase in a metric where increases are harmful).

Adding a NIM changes the question from "did this harm?" to "did this stay within bounds?"

In Confidence

In Confidence, the NIM appears as a solid vertical line on the results page, with an arrow pointing toward the safe zone. Whether a metric is non-inferior or possibly inferior is determined by the CI's position relative to the NIM line. Has deteriorated still uses the zero line as its threshold.

With a NIM

Non-inferior: the CI is entirely within the NIM boundary. This is positive evidence of safety: even the worst-case bound is within the acceptable tolerance.
Possibly inferior: the CI crosses the NIM boundary. Not enough evidence to confirm the metric stayed within the tolerance.
Has deteriorated: the CI is entirely on the harmful side of zero, that is, the side opposite to the metric's improvement direction. There is statistical evidence that the metric moved in the harmful direction. The threshold is the same as without a NIM.
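
The three labels above can be sketched as follows. This is a minimal illustration under stated assumptions, not Confidence's implementation: the function and parameter names are hypothetical, the NIM is passed as a positive tolerance in percent, boundaries use strict inequalities, and "Has deteriorated" is checked first when a CI would satisfy more than one rule.

```python
def guardrail_status_with_nim(ci_lower: float, ci_upper: float, nim: float,
                              improves_when: str = "decreases") -> str:
    """Classify a guardrail metric when a NIM is set.

    nim is the accepted deterioration as a positive number (e.g. 5.0 for 5%).
    improves_when is hypothetical: "decreases" means increases are harmful.
    """
    if ci_lower > ci_upper or nim <= 0:
        raise ValueError("expected ci_lower <= ci_upper and nim > 0")
    if improves_when == "decreases":
        deteriorated = ci_lower > 0      # whole CI above zero
        within_nim = ci_upper < nim      # worst case still inside tolerance
    elif improves_when == "increases":
        deteriorated = ci_upper < 0      # whole CI below zero
        within_nim = ci_lower > -nim     # worst case still inside tolerance
    else:
        raise ValueError("improves_when must be 'increases' or 'decreases'")
    # Assumption: the zero-line check takes precedence over the NIM check.
    if deteriorated:
        return "Has deteriorated"
    if within_nim:
        return "Non-inferior"
    return "Possibly inferior"


for ci in [(9.3, 20.7), (-1.5, 9.9), (-6.0, 4.0)]:
    print(ci, guardrail_status_with_nim(*ci, nim=5.0))
```

With a 5% NIM and increases being harmful, the three intervals walk through "Has deteriorated", "Possibly inferior", and "Non-inferior" in that order.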

If a metric is possibly inferior, you can still interpret the CI the usual way: with high confidence, the true effect is somewhere between the lower and upper bound. That means you can use the CI to assess the worst case. Look at the bound in the harmful direction: the upper bound for a metric where increases are harmful, or the lower bound for a metric where decreases are harmful. That bound tells you how bad the effect could plausibly be, even though you cannot yet confirm that the metric stayed within the NIM.

To explore the with-NIM states, clear "No NIM" in the interactive above and try the following:

With the direction set to "Decreases" (harmful = increase), drag the point estimate from +15% to -15% and watch all three states appear in sequence: "Has deteriorated" when the CI is entirely above zero, "Possibly inferior" as the CI crosses the NIM boundary, and "Non-inferior" when the entire CI is within the acceptable range.
Set the point estimate near 0%. The CI sits well below the NIM: "Non-inferior."
Move the point estimate toward the NIM value. The CI starts to cross the NIM line: "Possibly inferior."
Move the point estimate well above the NIM. The whole CI is above zero: "Has deteriorated."

The key difference

Note

Without a NIM, "Has not deteriorated" tells you only that you could not detect harm: absence of evidence, not evidence of absence. With a NIM, "Non-inferior" is positive evidence of safety. The CI is entirely within the acceptable range, so even the most pessimistic estimate is acceptable. The second approach is more rigorous when safety matters.

Notes for nerds

The distinction between "Has not deteriorated" and "Non-inferior" maps onto a broader framework for thinking about what level of evidence an experiment actually needs to provide. Rather than treating every experiment as requiring the same strength of evidence, you can think of experimentation as a ladder of risk mitigation: each rung offers progressively stronger statistical guarantees, but also requires more data to reach. Using guardrail metrics without NIMs sits at a lower rung: you are ruling out obvious harm, but not positively bounding how bad things could be. Adding a NIM moves you up the ladder: you are now producing positive evidence of safety within an acceptable tolerance.

This framing has practical implications for how you design experiments when sample sizes are limited. The Confidence blog post on experimenting with smaller samples develops this idea in full.