What is a Significance Level (Alpha)?

The significance level, commonly called alpha, is the maximum false positive rate you're willing to accept in an experiment. Setting alpha at 0.05 means that, when a change truly has no effect, you accept a 5% chance of concluding that it did. It's the threshold against which p-values are compared: if the p-value falls below alpha, the result is declared statistically significant.
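One way to see what alpha controls is to simulate experiments in which the treatment does nothing and count how often the test still comes back significant. The sketch below is illustrative only: the function names, the simulated normal data, and the large-sample z-test are assumptions made for the example, not a description of how Confidence computes p-values.

```python
import random
from statistics import NormalDist, fmean, variance


def two_sided_p_value(control, treatment):
    """Two-sample z-test p-value for a difference in means (large-sample approximation)."""
    std_error = (variance(control) / len(control) + variance(treatment) / len(treatment)) ** 0.5
    z = (fmean(treatment) - fmean(control)) / std_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulated_false_positive_rate(alpha, n_experiments=2000, n_per_group=100, seed=7):
    """Share of no-effect (A/A) experiments that are declared significant at `alpha`."""
    rng = random.Random(seed)
    significant = 0
    for _ in range(n_experiments):
        # Both groups are drawn from the same distribution, so any "win" is a false positive.
        control = [rng.gauss(0, 1) for _ in range(n_per_group)]
        treatment = [rng.gauss(0, 1) for _ in range(n_per_group)]
        if two_sided_p_value(control, treatment) < alpha:
            significant += 1
    return significant / n_experiments


rate_at_05 = simulated_false_positive_rate(alpha=0.05)
rate_at_01 = simulated_false_positive_rate(alpha=0.01)
print(f"alpha=0.05: {rate_at_05:.3f} of A/A tests significant")
print(f"alpha=0.01: {rate_at_01:.3f} of A/A tests significant")
```

The simulated rates should land close to the chosen alpha in each case, which is the sense in which alpha caps the false positive rate.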

Alpha is a design decision, not a statistical constant. Choosing 0.05 is convention, not law. The right alpha depends on the cost of a false positive relative to a false negative. In high-stakes decisions where shipping a harmful change is expensive to reverse, a stricter alpha (0.01) makes sense. For exploratory experiments where missing a real effect is worse than occasionally shipping a dud, a more relaxed alpha (0.10) may be appropriate. Confidence lets teams configure significance levels per experiment, and the platform's decision framework applies different thresholds to success metrics versus guardrail metrics based on the asymmetric costs of different types of errors.

How does alpha interact with other experiment design parameters?

Alpha is one of four connected parameters in experiment design. The others are statistical power, sample size, and the minimum detectable effect (MDE). Changing any one of them affects the others.

Lowering alpha (from 0.05 to 0.01, say) reduces false positives but requires a larger sample to maintain the same power. Raising alpha gives you more power at a given sample size, but increases the chance of false positives. Most teams default to alpha = 0.05 and adjust sample size or MDE instead.
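The trade-off can be made concrete with the textbook sample size formula for a two-sided, two-sample z-test on a difference in means. This is a minimal sketch under stated assumptions: equal group sizes, a known standard deviation, and hypothetical input values. The function name is made up for the example and the formula is not necessarily the one Confidence's calculators use.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(alpha, power, mde, std_dev):
    """Users per group needed to detect an absolute difference `mde` in means."""
    standard_normal = NormalDist()
    z_alpha = standard_normal.inv_cdf(1 - alpha / 2)  # critical value for a two-sided test
    z_power = standard_normal.inv_cdf(power)          # quantile matching the desired power
    return ceil(2 * ((z_alpha + z_power) * std_dev / mde) ** 2)


# Hypothetical inputs: 80% power, MDE of 0.1 standard deviations.
n_at_05 = sample_size_per_group(alpha=0.05, power=0.8, mde=0.1, std_dev=1.0)
n_at_01 = sample_size_per_group(alpha=0.01, power=0.8, mde=0.1, std_dev=1.0)
n_half_mde = sample_size_per_group(alpha=0.05, power=0.8, mde=0.05, std_dev=1.0)

print(f"alpha=0.05: {n_at_05} users per group")
print(f"alpha=0.01: {n_at_01} users per group")
print(f"alpha=0.05, half the MDE: {n_half_mde} users per group")
```

Under these assumptions, tightening alpha from 0.05 to 0.01 raises the required sample by roughly half, while halving the MDE roughly quadruples it. That asymmetry is one reason teams tend to adjust the MDE before they touch alpha.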

At Spotify, where more than 300 teams run over 10,000 experiments per year, Confidence provides sample size calculators that show these trade-offs explicitly. When you set your alpha, MDE, and desired power, the calculator tells you how many users and how much time you need. If the required runtime is too long, you can adjust alpha, accept a larger MDE, or apply variance reduction techniques like CUPED to shrink the required sample.
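To illustrate why variance reduction shortens experiments, here is a rough sketch of the basic CUPED adjustment: subtract the part of the metric that a pre-experiment covariate already explains. The data is simulated, the helper name is hypothetical, and a production implementation would estimate the coefficient across all experiment groups rather than on a single list.

```python
import random
from statistics import fmean, variance


def cuped_adjust(metric, pre_metric):
    """Return metric - theta * (pre_metric - mean(pre_metric)), with theta = cov / var."""
    mean_pre, mean_metric = fmean(pre_metric), fmean(metric)
    covariance = sum(
        (x - mean_pre) * (y - mean_metric) for x, y in zip(pre_metric, metric)
    ) / (len(metric) - 1)
    theta = covariance / variance(pre_metric)
    return [y - theta * (x - mean_pre) for x, y in zip(pre_metric, metric)]


# Simulated users whose in-experiment metric correlates with their pre-experiment metric.
rng = random.Random(11)
pre_period = [rng.gauss(10, 2) for _ in range(5000)]
in_experiment = [0.8 * x + rng.gauss(0, 1.5) for x in pre_period]

adjusted = cuped_adjust(in_experiment, pre_period)
variance_ratio = variance(adjusted) / variance(in_experiment)
print(f"variance after CUPED / variance before: {variance_ratio:.2f}")
```

Because required sample size scales with the metric's variance, a variance ratio of about one half means roughly half the users are needed for the same alpha, power, and MDE. The adjustment leaves the mean of the metric unchanged.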

Why does alpha need adjustment for multiple metrics?

Running a single test at alpha = 0.05 gives you a 5% false positive rate. Running 20 independent tests at alpha = 0.05 each, when none of the changes has a real effect, gives you a ~64% chance that at least one produces a false positive (1 - 0.95^20 ≈ 0.64). This is the multiple testing problem.
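The 64% figure follows from the probability that none of the independent tests produces a false positive. A small sketch, assuming independent tests with no true effects (the function name is illustrative):

```python
def family_wise_error_rate(alpha, n_tests):
    """Chance of at least one false positive across independent tests with no true effects."""
    return 1 - (1 - alpha) ** n_tests


for n_tests in (1, 5, 20):
    rate = family_wise_error_rate(alpha=0.05, n_tests=n_tests)
    print(f"{n_tests:>2} tests at alpha=0.05: {rate:.1%} chance of at least one false positive")
```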

When an experiment evaluates several success metrics, the significance level for each individual metric needs to be adjusted so that the overall false positive rate stays controlled. Confidence applies Bonferroni correction to success metrics by default: with 5 success metrics and an overall alpha of 0.05, each metric is tested at alpha = 0.01. This is conservative but has a practical advantage: it produces valid simultaneous confidence intervals for every metric, which means teams can interpret all metric results together without worrying about which comparisons are legitimate.
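The Bonferroni arithmetic, and the wider intervals it implies, can be sketched as follows. The metric names, p-values, estimate, and standard error are made up for illustration, the helper functions are hypothetical, and the normal-approximation interval is an assumption rather than Confidence's actual computation.

```python
from statistics import NormalDist


def bonferroni_alpha(overall_alpha, n_metrics):
    """Per-metric significance level that keeps the overall false positive rate controlled."""
    return overall_alpha / n_metrics


def simultaneous_interval(estimate, std_error, overall_alpha, n_metrics):
    """Normal-approximation interval at the Bonferroni-adjusted level."""
    per_metric_alpha = bonferroni_alpha(overall_alpha, n_metrics)
    z = NormalDist().inv_cdf(1 - per_metric_alpha / 2)
    return estimate - z * std_error, estimate + z * std_error


# Hypothetical p-values for five success metrics.
p_values = {"metric_a": 0.004, "metric_b": 0.03, "metric_c": 0.20, "metric_d": 0.009, "metric_e": 0.6}
threshold = bonferroni_alpha(overall_alpha=0.05, n_metrics=len(p_values))
significant = [name for name, p in p_values.items() if p < threshold]
print(f"per-metric alpha: {threshold:.3f}, significant: {significant}")

# The same estimate gets a wider interval once it is one of five metrics.
single = simultaneous_interval(estimate=1.2, std_error=0.4, overall_alpha=0.05, n_metrics=1)
adjusted = simultaneous_interval(estimate=1.2, std_error=0.4, overall_alpha=0.05, n_metrics=5)
print(f"unadjusted interval: ({single[0]:.2f}, {single[1]:.2f})")
print(f"Bonferroni interval: ({adjusted[0]:.2f}, {adjusted[1]:.2f})")
```

In this made-up example, metric_b would have been significant at 0.05 on its own but does not clear the adjusted threshold of 0.01, which is the conservatism the correction trades for valid joint interpretation.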

The framework treats guardrail metrics differently. For guardrails, the primary concern is false negatives: missing a real regression that harms user experience. The significance level for guardrails is calibrated to control the risk of shipping harm, with power requirements adjusted for the number of guardrails being monitored.

What happens when alpha is set incorrectly?

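Setting alpha too high means you'll ship changes that don't actually work. At alpha = 0.10, roughly one in ten experiments on changes with no real effect will still look like a win, and the share of "winning" results that are mirages can be higher still when few tested changes truly work. Over dozens of experiments, the accumulated false wins erode trust in the experimentation program.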
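How many significant results turn out to be false wins depends on alpha together with power and with how often tested changes have a real effect. The sketch below works through that arithmetic. The power and the share of changes with a true effect are assumed values chosen for illustration, not figures from Spotify.

```python
def false_win_share(alpha, power, true_effect_rate):
    """Expected share of significant results that come from changes with no real effect."""
    false_positives = alpha * (1 - true_effect_rate)
    true_positives = power * true_effect_rate
    return false_positives / (false_positives + true_positives)


# Assumed: 80% power, and either 20% or 50% of tested changes truly work.
for alpha in (0.10, 0.05, 0.01):
    few_work = false_win_share(alpha, power=0.8, true_effect_rate=0.2)
    half_work = false_win_share(alpha, power=0.8, true_effect_rate=0.5)
    print(f"alpha={alpha:.2f}: {few_work:.0%} false wins if 20% of ideas work, "
          f"{half_work:.0%} if 50% work")
```

The lower the share of ideas that truly work, the more a relaxed alpha inflates the proportion of shipped "wins" that are not real.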

Setting alpha too low means you'll miss real improvements. If you require p < 0.01 for everything, experiments that would have shown a significant result at 0.05 now come back inconclusive. Teams wait longer, use more traffic, and still don't get answers on borderline cases.
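The cost of an overly strict alpha shows up as lost power at a fixed sample size. A minimal sketch, assuming a two-sided, two-sample z-test with known standard deviation and hypothetical inputs (the function name is illustrative):

```python
from statistics import NormalDist


def power_two_sided(alpha, n_per_group, effect, std_dev):
    """Probability that a two-sided two-sample z-test detects a true difference `effect`."""
    standard_normal = NormalDist()
    std_error = std_dev * (2 / n_per_group) ** 0.5
    critical = standard_normal.inv_cdf(1 - alpha / 2)
    shift = effect / std_error
    # Reject when the z statistic lands in either tail beyond the critical value.
    return standard_normal.cdf(shift - critical) + standard_normal.cdf(-shift - critical)


# Hypothetical experiment sized for about 80% power at alpha = 0.05.
power_at_05 = power_two_sided(alpha=0.05, n_per_group=1570, effect=0.1, std_dev=1.0)
power_at_01 = power_two_sided(alpha=0.01, n_per_group=1570, effect=0.1, std_dev=1.0)
print(f"power at alpha=0.05: {power_at_05:.2f}")
print(f"power at alpha=0.01: {power_at_01:.2f}")
```

With the sample size held fixed, moving from 0.05 to 0.01 in this example drops power from about 80% to under 60%, so a real effect of the planned size would be missed noticeably more often.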

The discipline is in choosing alpha before the experiment runs and not changing it after seeing the data. Adjusting the significance level post hoc to make a result cross the threshold is a form of p-hacking that invalidates the statistical guarantees.

Related terms

Statistical Methods
P-value

A p-value is the probability of observing a result at least as extreme as the one measured, assuming the null hypothesis is true (that is, assuming the change had no real effect).

Statistical Methods
Statistical Significance

Statistical significance is the determination that an observed difference between experiment groups is unlikely to have occurred by chance alone.

Statistical Methods
False Positive Rate (Type I Error)

The false positive rate, also called the Type I error rate, is the probability of concluding that a treatment had an effect when it actually didn't.

Statistical Methods
Statistical Power

Statistical power is the probability that an experiment will detect a real effect when one exists.

Statistical Methods
Sample Size

Sample size is the number of experimental units (typically users) needed in an A/B test to detect a given effect with a specified level of confidence and power.

Multiple Testing
Multiple Testing Correction

A multiple testing correction is an adjustment to significance thresholds that accounts for evaluating more than one hypothesis in the same experiment.

Multiple Testing
Bonferroni Correction

The Bonferroni correction adjusts significance thresholds for multiple testing by dividing the target alpha by the number of tests.

© 2026 Spotify

The Confidence name and logo are registered trademarks of Spotify.