Statistical Methods

What is statistical significance?

Statistical significance is the determination that an observed difference between experiment groups is unlikely to have occurred by chance alone. When an A/B test result crosses the significance threshold, it means the data provides enough evidence to reject the null hypothesis: the assumption that no real difference exists between treatment and control.

Significance matters because product teams make shipping decisions based on experiment results. Without a principled threshold for "this is real, not noise," teams either ship changes that didn't actually help (false positives) or discard changes that did (false negatives). At Spotify, where 300+ teams run over 10,000 experiments per year, statistical significance is the gate that separates validated product decisions from guesses. 42% of Spotify experiments are rolled back after guardrail metrics catch regressions. Those rollback decisions depend on significance thresholds being correctly set and honestly interpreted.

How does statistical significance work in an A/B test?

After an experiment collects enough data, a statistical test compares the metric values between the control and treatment groups. The test produces a p-value: the probability of seeing a difference at least as large as the observed one, assuming there's actually no difference. If that p-value falls below a pre-set threshold called the significance level (alpha, typically 0.05), the result is declared statistically significant.

The logic works like a probabilistic proof by contradiction. You start by assuming the change had no effect. If the observed data would be extremely unlikely under that assumption, you reject it and conclude the change probably did have an effect. A significance level of 0.05 means you're willing to accept a 5% chance of incorrectly concluding an effect exists when it doesn't.
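
As a concrete sketch of that mechanic, here is a minimal two-proportion z-test in Python. The user and conversion counts are invented, and Confidence's production methodology differs in its details; this only illustrates how a p-value falls out of the comparison.

```python
# Minimal sketch: fixed-horizon significance check for a conversion-rate A/B test
# using a two-proportion z-test. All counts below are invented.
import numpy as np
from scipy.stats import norm

ALPHA = 0.05  # significance level: the false positive rate we accept

control_conversions, control_users = 1_210, 24_000
treatment_conversions, treatment_users = 1_325, 24_100

p_c = control_conversions / control_users
p_t = treatment_conversions / treatment_users

# Pooled conversion rate under the null hypothesis of "no real difference"
p_pool = (control_conversions + treatment_conversions) / (control_users + treatment_users)
se = np.sqrt(p_pool * (1 - p_pool) * (1 / control_users + 1 / treatment_users))

z = (p_t - p_c) / se
# Two-sided p-value: probability of a difference at least this large under the null
p_value = 2 * norm.sf(abs(z))

print(f"observed lift: {p_t - p_c:+.4%}, z = {z:.2f}, p = {p_value:.4f}")
print("statistically significant" if p_value < ALPHA else "not significant")
```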

Confidence calculates significance using frequentist methods, including sequential testing frameworks like Group Sequential Tests that allow teams to check results at multiple points during an experiment without inflating the false positive rate. This matters because in practice, teams don't want to wait until the very end of a planned experiment to learn whether a change is harmful.

What does statistical significance not tell you?

A statistically significant result tells you an effect probably exists. It doesn't tell you the effect is large enough to matter. A test with millions of users can detect a 0.01% improvement in conversion rate with high confidence, but that improvement might not be worth the engineering cost to maintain the feature.

This is why Confidence reports confidence intervals alongside significance. The interval shows the plausible range of the true effect size. A result can be significant but practically meaningless if the entire confidence interval sits near zero. Teams that focus only on "is it significant?" miss the more important question: "is the effect big enough to care about?"
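
To make the distinction concrete, here is a small sketch with invented counts: the confidence interval excludes zero (statistically significant) yet sits entirely below a hypothetical minimum effect the team would consider worth shipping.

```python
# Sketch contrasting statistical significance with practical significance.
# Counts and the minimum effect of interest are invented; this is not
# Confidence's methodology.
import numpy as np

control_conversions, control_users = 524_000, 10_000_000
treatment_conversions, treatment_users = 529_000, 10_000_000

p_c = control_conversions / control_users
p_t = treatment_conversions / treatment_users
diff = p_t - p_c

# 95% confidence interval for the difference in conversion rates
se = np.sqrt(p_c * (1 - p_c) / control_users + p_t * (1 - p_t) / treatment_users)
ci_low, ci_high = diff - 1.96 * se, diff + 1.96 * se

MIN_EFFECT = 0.002  # hypothetical smallest lift worth the cost of keeping the feature

print(f"lift: {diff:+.4%}, 95% CI: [{ci_low:+.4%}, {ci_high:+.4%}]")
print("significant" if ci_low > 0 or ci_high < 0 else "not significant")
if ci_high < MIN_EFFECT:
    print("but the entire interval sits below the minimum effect of interest")
```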

Significance also doesn't protect you from problems upstream. If your experiment has a sample ratio mismatch, if users weren't properly randomized, or if you're measuring the wrong metric, a significant result can be confidently wrong.
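
A sample ratio mismatch, for instance, can be caught with a chi-square goodness-of-fit test on the observed group sizes before any metric result is trusted. The sketch below uses invented counts and a conventional 0.001 alert threshold, not whatever Confidence applies internally.

```python
# Sketch of a sample ratio mismatch (SRM) check: compare observed group sizes
# against the intended 50/50 split. Counts are invented for illustration.
from scipy.stats import chisquare

control_users, treatment_users = 50_910, 49_090  # intended split: 50/50
total = control_users + treatment_users

stat, p_value = chisquare(
    f_obs=[control_users, treatment_users],
    f_exp=[total / 2, total / 2],
)

# A tiny p-value suggests randomization or logging is broken, so even a
# "significant" metric difference should not be trusted.
if p_value < 0.001:
    print(f"possible sample ratio mismatch (p = {p_value:.2e})")
else:
    print(f"no evidence of SRM (p = {p_value:.3f})")
```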

Why do experiments sometimes reach significance too early?

The most common way teams misuse significance is peeking: checking results repeatedly before the experiment has reached its planned sample size and stopping as soon as significance appears. In a fixed-horizon test designed for a single analysis, each additional look inflates the false positive rate. A test designed for 5% false positives can easily exceed 20% if checked daily.
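
A small simulation (not from the article) makes the inflation concrete: it runs A/A experiments on a simple continuous metric under the null, peeks once per day with an unadjusted fixed-horizon test, and stops at the first "significant" result.

```python
# Sketch: how daily peeking inflates the false positive rate when there is
# truly no difference between the groups.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
ALPHA = 0.05
N_SIMS = 2_000
USERS_PER_DAY, DAYS = 1_000, 14  # each group gains this many users per day

false_positives = 0
for _ in range(N_SIMS):
    # Under the null, both groups draw from the same distribution
    a = rng.normal(size=(DAYS, USERS_PER_DAY))
    b = rng.normal(size=(DAYS, USERS_PER_DAY))
    for day in range(1, DAYS + 1):
        n = day * USERS_PER_DAY
        se = np.sqrt(a[:day].var(ddof=1) / n + b[:day].var(ddof=1) / n)
        p = 2 * norm.sf(abs(a[:day].mean() - b[:day].mean()) / se)
        if p < ALPHA:  # stop as soon as "significance" appears
            false_positives += 1
            break

print(f"false positive rate with daily peeking: {false_positives / N_SIMS:.1%}")
# Typically far above the nominal 5% for this many looks.
```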

Sequential testing solves this. Confidence supports Group Sequential Tests and always-valid inference, both of which adjust the significance boundary at each interim analysis to keep the overall false positive rate at the intended level. The cost is a small reduction in statistical power compared to a single-look test, but the benefit is that teams can make faster decisions on experiments with clear results without compromising the integrity of the conclusion.
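
As an illustration of the boundary adjustment, here is a sketch of an O'Brien-Fleming-style rule (not Confidence's exact implementation). The z-statistic boundary at look k of K is c * sqrt(K / k): very strict early, close to the fixed-horizon threshold at the final look. The constant c is calibrated by simulation under the null so the overall false positive rate across all looks stays near 5%.

```python
# Sketch: calibrating an O'Brien-Fleming-style group sequential boundary by
# Monte Carlo so the overall type I error across K looks stays near 5%.
import numpy as np

rng = np.random.default_rng(1)
K = 5           # number of planned looks at the data
N_SIMS = 20_000

def crossing_rate(c: float) -> float:
    """Fraction of null experiments whose z statistic ever crosses the boundary."""
    # Equal information increments: z at look k is the standardized cumulative sum,
    # which reproduces the correlation between interim z statistics under the null.
    increments = rng.normal(size=(N_SIMS, K))
    z = np.cumsum(increments, axis=1) / np.sqrt(np.arange(1, K + 1))
    boundary = c * np.sqrt(K / np.arange(1, K + 1))
    return float(np.mean(np.any(np.abs(z) > boundary, axis=1)))

# Crude calibration: scan c until the overall false positive rate is about 0.05
for c in np.arange(1.90, 2.30, 0.05):
    print(f"c = {c:.2f}: overall false positive rate = {crossing_rate(c):.3f}")
```

The calibrated boundary at the final look lands slightly above the single-look 1.96 cutoff, which is the small power cost described above, while the much stricter early boundaries let clearly harmful or clearly winning treatments be stopped sooner.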