Statistical Methods

What is a Significance Level (Alpha)?

The significance level, commonly called alpha, is the maximum false positive rate you're willing to accept in an experiment. Setting alpha at 0.05 means you accept a 5% chance of concluding that a change had an effect when it actually didn't. It's the threshold against which p-values are compared: if the p-value falls below alpha, the result is statistically significant.
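
As a concrete illustration, here is a minimal sketch of that decision rule in Python, using a two-proportion z-test from statsmodels on made-up conversion counts. The metric, the numbers, and the test choice are illustrative assumptions, not Confidence's implementation.

```python
# Illustrative decision rule: compare the p-value to a pre-chosen alpha.
# The conversion counts below are made up for the example.
from statsmodels.stats.proportion import proportions_ztest

ALPHA = 0.05  # chosen before the experiment runs, never after

conversions = [530, 584]       # [control, treatment]
exposures = [10_000, 10_000]   # users per group

_, p_value = proportions_ztest(conversions, exposures)

if p_value < ALPHA:
    print(f"p = {p_value:.4f} < {ALPHA}: statistically significant")
else:
    print(f"p = {p_value:.4f} >= {ALPHA}: not significant")
```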

Alpha is a design decision, not a statistical constant. Choosing 0.05 is convention, not law. The right alpha depends on the cost of a false positive relative to a false negative. In high-stakes decisions where shipping a harmful change is expensive to reverse, a stricter alpha (0.01) makes sense. For exploratory experiments where missing a real effect is worse than occasionally shipping a dud, a more relaxed alpha (0.10) may be appropriate. Confidence lets teams configure significance levels per experiment, and the platform's decision framework applies different thresholds to success metrics versus guardrail metrics based on the asymmetric costs of different types of errors.

How does alpha interact with other experiment design parameters?

Alpha is one of four connected parameters in experiment design. The others are statistical power, sample size, and the minimum detectable effect (MDE). Changing any one of them affects the others.

Lowering alpha (from 0.05 to 0.01, say) reduces false positives but requires a larger sample to maintain the same power. Raising alpha buys more power at a given sample size but increases the chance of false positives. Most teams default to alpha = 0.05 and adjust sample size or MDE instead.
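
A rough sketch of that trade-off, using statsmodels' power solver for a two-sample comparison of means. The standardized MDE and the 80% power target are placeholder assumptions, not Confidence defaults.

```python
# Required sample per group as a function of alpha, holding MDE and power fixed.
from statsmodels.stats.power import NormalIndPower

solver = NormalIndPower()
mde_effect_size = 0.05  # standardized minimum detectable effect (placeholder)
power = 0.80            # desired power (placeholder)

for alpha in (0.10, 0.05, 0.01):
    n_per_group = solver.solve_power(
        effect_size=mde_effect_size,
        alpha=alpha,
        power=power,
        alternative="two-sided",
    )
    print(f"alpha={alpha:.2f}: ~{n_per_group:,.0f} users per group")
```

Running this shows the direction of the trade-off directly: the stricter the alpha, the more users are needed to keep power constant.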

At Spotify, where teams run over 10,000 experiments per year across 300+ teams, Confidence provides sample size calculators that show these trade-offs explicitly. When you set your alpha, MDE, and desired power, the calculator tells you how many users and how much time you need. If the required runtime is too long, you can adjust alpha, accept a larger MDE, or apply variance reduction techniques like CUPED to shrink the required sample.
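
For intuition, here is a minimal sketch of the CUPED idea on simulated data: each user's in-experiment metric is adjusted using their pre-experiment value, which shrinks variance and therefore the sample required for a given alpha, power, and MDE. The simulation and correlation strength are made up; this is not Confidence's implementation.

```python
# CUPED sketch: Y_adj = Y - theta * (X - mean(X)), with X a pre-experiment covariate.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(100, 20, size=50_000)           # pre-experiment metric per user
y = 0.8 * x + rng.normal(0, 10, size=50_000)   # in-experiment metric, correlated with x

theta = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
y_cuped = y - theta * (x - x.mean())

print(f"variance before: {y.var():.1f}, after CUPED: {y_cuped.var():.1f}")
# Lower variance means fewer users are needed for the same alpha, power, and MDE.
```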

Why does alpha need adjustment for multiple metrics?

Running a single test at alpha = 0.05 gives you a 5% false positive rate. Running 20 independent tests at alpha = 0.05 each gives you a ~64% chance that at least one produces a false positive. This is the multiple testing problem.
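
The arithmetic is easy to verify directly:

```python
# With m independent tests each run at alpha, the chance of at least one
# false positive is 1 - (1 - alpha)^m.
alpha = 0.05
for m in (1, 5, 20):
    fwer = 1 - (1 - alpha) ** m
    print(f"{m:>2} tests at alpha={alpha}: P(at least one false positive) = {fwer:.0%}")
# 20 tests -> ~64%
```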

When an experiment evaluates several success metrics, the significance level for each individual metric needs to be adjusted so that the overall false positive rate stays controlled. Confidence applies Bonferroni correction to success metrics by default: with 5 success metrics and an overall alpha of 0.05, each metric is tested at alpha = 0.01. This is conservative but has a practical advantage: it produces valid simultaneous confidence intervals for every metric, which means teams can interpret all metric results together without worrying about which comparisons are legitimate.
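
A minimal sketch of what that correction looks like in practice, with hypothetical metric names and p-values (not Confidence's API):

```python
# Bonferroni correction: each success metric is tested at overall_alpha / m.
overall_alpha = 0.05
p_values = {           # hypothetical metrics and p-values for illustration
    "retention": 0.004,
    "minutes_played": 0.03,
    "skips": 0.20,
    "saves": 0.008,
    "shares": 0.60,
}

per_metric_alpha = overall_alpha / len(p_values)  # 0.05 / 5 = 0.01

for metric, p in p_values.items():
    verdict = "significant" if p < per_metric_alpha else "not significant"
    print(f"{metric}: p={p} vs alpha={per_metric_alpha} -> {verdict}")
```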

The framework treats guardrail metrics differently. For guardrails, the primary concern is false negatives: missing a real regression that harms user experience. The significance level for guardrails is calibrated to control the risk of shipping harm, with power requirements adjusted for the number of guardrails being monitored.

What happens when alpha is set incorrectly?

Setting alpha too high means you'll ship changes that don't actually work. At alpha = 0.10, one in ten experiments with no real effect will still come back looking like a winner. Over dozens of experiments, those accumulated false wins erode trust in the experimentation program.

Setting alpha too low means you'll miss real improvements. If you require p < 0.01 for everything, experiments that would have shown a significant result at 0.05 now come back inconclusive. Teams wait longer, use more traffic, and still don't get answers on borderline cases.

The discipline is in choosing alpha before the experiment runs and not changing it after seeing the data. Adjusting the significance level post hoc to make a result cross the threshold is a form of p-hacking that invalidates the statistical guarantees.