The false positive rate, also called the Type I error rate, is the probability of concluding that a treatment had an effect when it actually didn't. In an A/B test, a false positive means declaring a winner when the difference between treatment and control was just noise. The false positive rate is controlled by the significance level (alpha): setting alpha at 0.05 means accepting a 5% chance of a false positive on any single test.
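To make the 5% concrete, here is a minimal A/A simulation sketch (an illustration, not how Confidence computes anything; the sample sizes and metric distribution are assumptions): both groups are drawn from the same distribution, so every significant result is by definition a false positive.

```python
# Illustrative A/A simulation: with no true effect, a single test at
# alpha = 0.05 declares a "winner" in roughly 5% of runs.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
alpha = 0.05
n_experiments = 5_000
false_positives = 0

for _ in range(n_experiments):
    control = rng.normal(0.0, 1.0, size=500)
    treatment = rng.normal(0.0, 1.0, size=500)  # same distribution: any "effect" is pure noise
    _, p_value = stats.ttest_ind(treatment, control)
    false_positives += p_value < alpha

print(f"false positive rate: {false_positives / n_experiments:.3f}")  # ~0.05
```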
False positives are expensive. A team ships a feature believing it improved conversion, allocates engineering resources to extend it, and builds follow-up plans around an effect that doesn't exist. At Spotify, where teams run over 10,000 experiments per year, even a 5% false positive rate on individual tests means hundreds of experiments could produce misleading results if teams don't account for the ways false positive rates compound.
How do false positives accumulate across metrics?
The 5% false positive rate applies to a single statistical test on a single metric. Most experiments evaluate multiple metrics: a success metric, several guardrail metrics, and sometimes secondary metrics. Each additional test is another opportunity for a false positive.
With 10 independent metrics and no correction, the probability of at least one false positive is 1 - (0.95)^10, which is about 40%. With 20 metrics, it's 64%. This is the multiple testing problem.
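The arithmetic is easy to check directly; this short loop reproduces the numbers above for a few metric counts.

```python
# Probability of at least one false positive across m independent metrics,
# each tested at alpha = 0.05 with no correction: 1 - (1 - alpha)^m.
alpha = 0.05
for m in (1, 5, 10, 20):
    fwer = 1 - (1 - alpha) ** m
    print(f"{m:>2} metrics: P(at least one false positive) = {fwer:.0%}")
# 10 metrics -> ~40%, 20 metrics -> ~64%
```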
Confidence addresses this with multiple testing corrections. For success metrics, the platform applies Bonferroni correction by default, dividing the significance level by the number of success metrics. With 5 success metrics and alpha = 0.05, each is tested at alpha = 0.01. This keeps the familywise false positive rate at or below 5% across the whole set. The power cost of Bonferroni compared to more complex corrections like Holm or Hommel is only about 4-5 percentage points in typical experimentation settings, and Bonferroni's advantage is that it produces valid simultaneous confidence intervals for every metric.
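A minimal sketch of what that correction looks like (not Confidence's API; the metric names, effect estimates, standard errors, and p-values are made up): each metric is tested at alpha divided by the number of success metrics, and the confidence intervals are widened to the same adjusted level so they remain valid simultaneously.

```python
# Bonferroni correction across 5 success metrics.
from scipy import stats

alpha = 0.05
# metric: (estimated effect, standard error, p-value) -- made-up illustrations
success_metrics = {
    "conversion":     (0.012, 0.004, 0.003),
    "minutes_played": (0.008, 0.005, 0.110),
    "retention_d7":   (0.004, 0.003, 0.180),
    "saves":          (0.015, 0.005, 0.003),
    "shares":         (0.001, 0.006, 0.870),
}

adjusted_alpha = alpha / len(success_metrics)   # 0.05 / 5 = 0.01 per metric
z = stats.norm.ppf(1 - adjusted_alpha / 2)      # wider critical value than 1.96

for name, (effect, se, p) in success_metrics.items():
    lower, upper = effect - z * se, effect + z * se
    verdict = "significant" if p < adjusted_alpha else "not significant"
    print(f"{name:15s} effect={effect:+.3f}  CI=({lower:+.3f}, {upper:+.3f})  {verdict}")
```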
How does peeking inflate false positive rates?
Checking experiment results before the planned sample size is reached, and stopping when a significant result appears, is the most common way teams inadvertently inflate false positive rates.
A fixed-horizon test designed for one analysis at the end has a 5% false positive rate at that single analysis point. If you check daily for 14 days and stop at the first significant result, the effective false positive rate can exceed 20%. The reason: early in the experiment the sample is small, so the observed difference fluctuates widely, and every look is another chance for one of those fluctuations to cross the significance threshold.
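A simulation makes the inflation visible. The sketch below (an illustration under assumed sample sizes, not Confidence's analysis code) runs many A/A experiments, once analyzing only at day 14 and once peeking every day and stopping at the first significant result.

```python
# How daily peeking on an A/A test inflates the false positive rate.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n_simulations = 2_000
days = 14
users_per_day = 200   # per arm, assumed
alpha = 0.05

peeking_fp = 0   # stop at the first significant daily look
fixed_fp = 0     # single analysis at day 14 only

for _ in range(n_simulations):
    control = rng.normal(0.0, 1.0, size=(days, users_per_day))
    treatment = rng.normal(0.0, 1.0, size=(days, users_per_day))  # no true effect

    stopped_early = False
    for day in range(1, days + 1):
        _, p_value = stats.ttest_ind(treatment[:day].ravel(), control[:day].ravel())
        if p_value < alpha:
            stopped_early = True
            break
    peeking_fp += stopped_early

    _, p_final = stats.ttest_ind(treatment.ravel(), control.ravel())
    fixed_fp += p_final < alpha

print(f"fixed-horizon false positive rate: {fixed_fp / n_simulations:.3f}")   # ~0.05
print(f"daily-peeking false positive rate: {peeking_fp / n_simulations:.3f}") # > 0.20
```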
Sequential testing methods solve this problem. Confidence supports Group Sequential Tests (GST) and always-valid inference. GSTs allocate portions of the significance budget to each planned interim analysis, so the total false positive rate stays at the target. Always-valid inference provides confidence intervals that are valid at any stopping time. Both approaches let teams look at results early without paying the false positive penalty of uncontrolled peeking.
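To see what "allocating portions of the significance budget" means in practice, here is a small sketch (not Confidence's implementation) using the Lan-DeMets O'Brien-Fleming-type spending function; the four equally spaced looks are an illustrative assumption. Turning each look's spend into an actual significance boundary also requires the joint distribution of the interim test statistics, so this only shows how the budget is scheduled.

```python
# Alpha spending across planned interim analyses (O'Brien-Fleming-type).
from scipy import stats

alpha = 0.05
looks = [0.25, 0.50, 0.75, 1.00]   # fraction of planned sample size at each look

z = stats.norm.ppf(1 - alpha / 2)
cumulative = [2 * (1 - stats.norm.cdf(z / t**0.5)) for t in looks]

previous = 0.0
for t, spent in zip(looks, cumulative):
    print(f"look at {t:.0%} of sample: cumulative alpha {spent:.4f}, "
          f"spent at this look {spent - previous:.4f}")
    previous = spent
# The cumulative spend reaches 0.05 exactly at the final analysis, so the
# overall false positive rate stays at the 5% target.
```

Most of the budget is saved for the final analysis, which is why early stops under this kind of boundary require very strong evidence.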
How does Confidence treat false positives differently for different metric types?
The cost of a false positive depends on the type of metric. For a success metric, a false positive means you shipped a feature that didn't actually move the metric you care about. For a guardrail metric, a false positive means you flagged a regression that didn't actually happen, potentially blocking a beneficial change.
Confidence's decision framework recognizes this asymmetry. Success metrics get strict false positive control because the cost of shipping a non-effect is wasted resources and diluted product quality. Guardrail metrics are handled differently: the primary concern there is the false negative rate (missing a real regression), because the cost of shipping harm is typically higher than the cost of a false alarm. This distinction, formalized in the risk-aware decision framework Spotify developed, means each metric type gets the error rate calibration appropriate to its role in the decision.
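One way to see what that calibration difference implies is a standard sample-size calculation under the two priorities; the thresholds and effect sizes below are assumptions for illustration, not Confidence defaults.

```python
# Two-sample sample-size formula for a mean difference:
# n per group = 2 * (z_{1-alpha/2} + z_{power})^2 * (sd / effect)^2
from scipy import stats

def n_per_group(alpha: float, power: float, effect: float, sd: float) -> int:
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)
    return int(round(2 * (z_alpha + z_beta) ** 2 * (sd / effect) ** 2))

# Success metric: strict false positive control (alpha = 0.01 after correction).
print("success metric  :", n_per_group(alpha=0.01, power=0.80, effect=0.02, sd=1.0))

# Guardrail metric: prioritize catching real regressions (power = 0.95).
print("guardrail metric:", n_per_group(alpha=0.05, power=0.95, effect=0.02, sd=1.0))
```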