The false negative rate, also called the Type II error rate or beta, is the probability of failing to detect a real treatment effect. When an experiment commits a false negative, the change genuinely improved (or harmed) the metric, but the test didn't produce a significant result. The team concludes "no effect" and discards or ignores a change that actually worked.
False negatives are the quiet failure mode of experimentation programs. False positives get attention because teams ship something and eventually notice it didn't help. False negatives are invisible: you never learn about the improvement you missed. At Spotify, the win rate across experiments is around 12%, but the learning rate is 64%. The gap between those numbers reflects experiments that don't produce a positive shipping result but still generate understanding. If experiments are underpowered, you don't even get the learning. The test simply fails to detect anything.
How does the false negative rate relate to power?
Statistical power equals 1 minus the false negative rate. If power is 80%, the false negative rate is 20%. If power is 50%, the false negative rate is 50%: a coin flip on whether the test detects a real effect.
Most teams target 80% power (20% false negative rate) as a minimum, and 90% power (10% false negative rate) for high-stakes decisions. Anything below 80% produces null results so frequently that teams lose confidence in the testing process itself. When most experiments come back inconclusive, the organization starts treating experimentation as slow and unhelpful rather than recognizing the problem is underpowered tests.
The false negative rate depends on the same four parameters as power: significance level (alpha), sample size, minimum detectable effect (MDE), and metric variance. Reducing the false negative rate means increasing sample size, designing for a larger MDE, relaxing the significance level, or reducing variance through techniques like CUPED.
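The effect of each lever can be explored with an off-the-shelf power calculator. The sketch below assumes a two-sample t-test and uses hypothetical numbers (a standardized effect of 0.05 and 2,000 users per group):

```python
# Sketch: how the false negative rate responds to each lever,
# using statsmodels' power calculator for a two-sample t-test.
# All numbers are hypothetical.
from statsmodels.stats.power import TTestIndPower

calc = TTestIndPower()

def fnr(effect_size, nobs1, alpha):
    """False negative rate = 1 - power."""
    return 1 - calc.power(effect_size=effect_size, nobs1=nobs1, alpha=alpha)

print(f"baseline FNR:      {fnr(0.05, 2_000, 0.05):.0%}")  # ~65%: badly underpowered

# Lever 1: more users per group.
print(f"4x sample size:    {fnr(0.05, 8_000, 0.05):.0%}")
# Lever 2: design for a larger MDE.
print(f"2x larger MDE:     {fnr(0.10, 2_000, 0.05):.0%}")
# Lever 3: relax the significance level.
print(f"alpha = 0.10:      {fnr(0.05, 2_000, 0.10):.0%}")
# Lever 4: halve the metric variance (e.g. via CUPED);
# the standardized effect grows by a factor of 1/sqrt(0.5).
print(f"half the variance: {fnr(0.05 / 0.5**0.5, 2_000, 0.05):.0%}")
```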
Why do false negatives matter more for guardrail metrics?
In Confidence's decision framework, the consequences of errors differ by metric type. For a success metric, a false negative means missing an improvement. The cost is an opportunity lost. For a guardrail metric, a false negative means missing a regression. The cost is shipping something that actively harms user experience.
This asymmetry has a practical implication. The Spotify team's research on risk-aware product decisions with multiple metrics showed that for guardrail metrics, the false negative rate requires explicit adjustment for the number of guardrails being monitored. If you have five guardrail metrics, each with a 20% false negative rate, the probability of missing a real regression on at least one of them is much higher than 20%. Confidence's framework adjusts guardrail power requirements to account for this, keeping the overall probability of missing harm at an acceptable level.
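To make the arithmetic concrete, here is a sketch assuming five independent guardrails and a Šidák-style correction; Confidence's actual adjustment may differ in its details:

```python
# Hypothetical arithmetic for five independent guardrail metrics,
# each monitored with a 20% false negative rate (80% power).
k = 5
beta_per_metric = 0.20

# Probability of missing a real regression on at least one guardrail,
# if all five genuinely regressed.
miss_any = 1 - (1 - beta_per_metric) ** k
print(f"family-wise miss probability: {miss_any:.0%}")  # ~67%

# Sidak-style fix: choose a per-metric beta so the family-wise
# false negative rate stays at 20% overall.
beta_family = 0.20
beta_adjusted = 1 - (1 - beta_family) ** (1 / k)
print(f"per-metric FNR needed:   {beta_adjusted:.1%}")      # ~4.4%
print(f"per-metric power needed: {1 - beta_adjusted:.1%}")  # ~95.6%
```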
This is also why Confidence uses inferiority testing for guardrails rather than standard superiority testing. An inferiority test is designed to detect whether the treatment is worse than control, with the false negative rate calibrated to the cost of shipping a harmful change.
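As a rough sketch of the idea (not Confidence's implementation), a plain one-sided test pointed at the harmful direction looks like this, here for a higher-is-better guardrail on simulated data:

```python
# Sketch of a one-sided "is the treatment worse?" test for a
# higher-is-better guardrail. Data are simulated; this illustrates
# the idea, not Confidence's actual inferiority test.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(7)
control = rng.normal(loc=1.00, scale=1.0, size=10_000)
treatment = rng.normal(loc=0.97, scale=1.0, size=10_000)  # true regression

# alternative='less': evidence that mean(treatment) < mean(control).
result = ttest_ind(treatment, control, alternative="less")
print(f"p-value: {result.pvalue:.4f}")

# A false negative here means the p-value stays above alpha even
# though the treatment genuinely harms the metric.
if result.pvalue < 0.05:
    print("regression detected: do not ship")
else:
    print("no significant regression detected (possible false negative)")
```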
What causes high false negative rates?
Insufficient sample size. The experiment didn't collect enough data to distinguish the true effect from noise. This is the most common cause and the most preventable, since sample size can be calculated before the test starts.
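For example, a pre-launch sample size calculation for a conversion metric might look like the following sketch; the 10% baseline and one-point MDE are hypothetical:

```python
# Sketch: compute the required sample size before the test starts.
# Hypothetical conversion metric: 10% baseline, 1pp absolute MDE.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

effect = proportion_effectsize(0.11, 0.10)  # Cohen's h for 10% -> 11%
n_per_group = NormalIndPower().solve_power(
    effect_size=effect,
    alpha=0.05,
    power=0.80,  # i.e. a 20% false negative rate
    ratio=1.0,
)
print(f"required users per group: {n_per_group:,.0f}")  # ~14,700
```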
MDE set too large. The experiment was designed to detect a 5% lift, but the true effect was 2%. The test had no chance of detecting it. Teams sometimes set large MDEs to keep experiments short, but the trade-off is that they'll miss smaller, real effects.
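A quick sketch of the collapse, again with hypothetical numbers (a 10% baseline conversion rate, a test sized for a 5% relative lift when the true lift is 2%):

```python
# Sketch: a test sized for a 5% relative lift on a 10% baseline,
# when the true lift is only 2%. All numbers are hypothetical.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

calc = NormalIndPower()

# Size the test for the planned MDE: 10.0% -> 10.5%.
mde_effect = proportion_effectsize(0.105, 0.100)
n = calc.solve_power(effect_size=mde_effect, alpha=0.05, power=0.80)

# Power actually achieved against the true effect: 10.0% -> 10.2%.
true_effect = proportion_effectsize(0.102, 0.100)
achieved = calc.power(effect_size=true_effect, nobs1=n, alpha=0.05)
print(f"power against the true effect: {achieved:.0%}")  # ~20%
print(f"false negative rate:           {1 - achieved:.0%}")  # ~80%
```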
High metric variance. Noisy metrics require larger samples. Revenue per user, for example, is typically high-variance because of outlier spenders. Applying CUPED or capping extreme values reduces variance and lowers the false negative rate.
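A minimal CUPED sketch on simulated data, using pre-experiment values of the same metric as the covariate:

```python
# Minimal CUPED sketch on simulated data: adjust the experiment
# metric with a pre-experiment covariate to shrink its variance.
import numpy as np

rng = np.random.default_rng(42)
n = 100_000
pre = rng.gamma(shape=2.0, scale=10.0, size=n)   # pre-period revenue per user
post = 0.8 * pre + rng.gamma(2.0, 10.0, size=n)  # correlated in-experiment revenue

# theta is the OLS slope of the metric on the covariate.
theta = np.cov(post, pre)[0, 1] / np.var(pre, ddof=1)
post_cuped = post - theta * (pre - pre.mean())

reduction = 1 - post_cuped.var() / post.var()
print(f"variance reduction: {reduction:.0%}")
# Lower variance means a smaller standard error, higher power,
# and therefore a lower false negative rate at the same sample size.
```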
Dilution from untriggered users. If only 15% of users encounter the changed feature but the analysis includes everyone, the observed effect is diluted by the 85% who weren't exposed. Trigger analysis, supported by Confidence, restricts the analysis to exposed users and recovers the undiluted effect size.
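The dilution arithmetic is easy to sketch. Assuming untriggered users are completely unaffected, with hypothetical numbers:

```python
# Sketch of dilution arithmetic, assuming users who never see the
# feature are completely unaffected. Numbers are hypothetical.
trigger_rate = 0.15       # 15% of users encounter the change
effect_triggered = 0.020  # 2pp absolute lift among exposed users

# Analyzing everyone dilutes the absolute effect.
effect_overall = trigger_rate * effect_triggered
print(f"effect among triggered users:  {effect_triggered:.2%}")
print(f"diluted effect over all users: {effect_overall:.3%}")  # 0.3pp

# Required sample size scales roughly with 1/effect^2, so analyzing
# all users inflates it by about (1/trigger_rate)^2, ~44x here.
inflation = (1 / trigger_rate) ** 2
print(f"approximate sample size inflation: {inflation:.0f}x")
```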