A p-value is the probability of observing a result at least as extreme as the one measured, assuming the null hypothesis is true (that is, assuming the change had no real effect). A small p-value means the data would be surprising if there were no actual difference between treatment and control. It does not tell you the probability that the change worked. It tells you how incompatible the data is with the assumption that nothing happened.
P-values drive the ship/don't-ship decision in most A/B testing frameworks. When a p-value falls below the significance level (typically 0.05), the result is declared statistically significant and teams treat the observed effect as real. At Spotify, this threshold gates decisions across 10,000+ experiments per year. Getting it wrong in either direction is costly: too lenient, and you ship changes that don't actually help; too strict, and you discard features that would have improved the product.
How is a p-value calculated in an A/B test?
In a standard two-sample test, the p-value comes from comparing the observed difference in metrics between treatment and control to the sampling distribution you'd expect under the null hypothesis. The steps:
- Compute the difference in means (or proportions) between the two groups.
- Estimate the standard error of that difference, which depends on the variance of the metric and the sample sizes.
- Divide the difference by the standard error to get a test statistic (a z-score for large samples).
- Look up the probability of seeing a test statistic at least that extreme (in absolute value, for a two-sided test) under the null distribution.
That probability is the p-value. With a significance level of 0.05, you reject the null hypothesis when the p-value is below 0.05. Confidence automates this calculation inside your data warehouse, applying the appropriate test for each metric type and adjusting for sequential testing when experiments are monitored over time.
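As a concrete illustration, here is a minimal sketch of those four steps for a conversion-rate metric, using made-up counts and a plain two-sided z-test. It is not Confidence's internal implementation, which also handles other metric types and sequential adjustments.

```python
from statistics import NormalDist

# Hypothetical counts: conversions and users in each group
control_conv, control_n = 1_020, 50_000
treatment_conv, treatment_n = 1_130, 50_000

p_control = control_conv / control_n        # 2.04% conversion rate
p_treatment = treatment_conv / treatment_n  # 2.26% conversion rate

# Step 1: difference in proportions
diff = p_treatment - p_control

# Step 2: standard error of the difference (unpooled variance estimate)
se = (p_control * (1 - p_control) / control_n
      + p_treatment * (1 - p_treatment) / treatment_n) ** 0.5

# Step 3: test statistic (a z-score, valid for large samples)
z = diff / se

# Step 4: probability of a statistic at least this extreme (two-sided)
#         under the null distribution
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

print(f"lift = {diff:.4%}, z = {z:.2f}, p = {p_value:.4f}")
```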
What are common mistakes when interpreting p-values?
Treating the p-value as the probability the treatment worked. A p-value of 0.03 does not mean there's a 97% chance the change had an effect. It means that if the change truly had zero effect, you'd see data this extreme only 3% of the time. The distinction matters because a small p-value from an underpowered test with a noisy metric can be misleading.
Ignoring effect size. A p-value of 0.001 on a 0.02% improvement in click-through rate is statistically significant but practically irrelevant. The p-value says the data are hard to explain by chance alone; it says nothing about whether the effect is large enough to justify shipping. Always look at the confidence interval alongside the p-value.
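A minimal sketch of that scenario, with assumed numbers (reading the 0.02% as an absolute lift of 0.0002, and a standard error small enough to only be plausible with enormous traffic): the p-value looks impressive, but the confidence interval makes the practical irrelevance obvious.

```python
from statistics import NormalDist

diff = 0.0002     # assumed absolute lift of 0.02 percentage points
se = 0.00006      # assumed standard error (plausible only with huge traffic)

z = diff / se
p_value = 2 * (1 - NormalDist().cdf(abs(z)))   # ~0.001: "significant"
z_crit = NormalDist().inv_cdf(0.975)           # ~1.96 for a 95% interval
ci_low, ci_high = diff - z_crit * se, diff + z_crit * se

print(f"p = {p_value:.4f}")
print(f"95% CI for the lift: [{ci_low:.4%}, {ci_high:.4%}]")  # tiny either way
```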
Cherry-picking metrics. Running an experiment with 20 metrics and declaring victory on whichever one has the smallest p-value produces a ~64% chance of at least one false positive, even if nothing real changed. Confidence applies multiple testing corrections (Bonferroni by default for success metrics) to keep the overall false positive rate controlled.
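The arithmetic behind that ~64% figure, plus the Bonferroni adjustment, in a short sketch. It assumes the 20 metrics are independent for the calculation; Bonferroni itself does not require that assumption.

```python
alpha = 0.05
n_metrics = 20

# Chance of at least one false positive across 20 independent null metrics
fwer_uncorrected = 1 - (1 - alpha) ** n_metrics             # ~0.64
# Bonferroni: test each metric at alpha / n_metrics instead
alpha_per_metric = alpha / n_metrics                         # 0.0025
fwer_bonferroni = 1 - (1 - alpha_per_metric) ** n_metrics    # ~0.049

print(f"{fwer_uncorrected:.0%} uncorrected vs {fwer_bonferroni:.1%} with Bonferroni")
```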
Peeking without correction. Checking p-values daily on a fixed-horizon test and stopping when one drops below 0.05 inflates false positive rates well beyond the intended 5%. Sequential testing methods, which Confidence supports through Group Sequential Tests and always-valid inference, are designed for exactly this use case. They allow repeated looks at the data while maintaining valid p-values at every analysis point.
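A small simulation sketch (hypothetical parameters, plain z-test with known variance) shows the inflation: each simulated experiment is an A/A test with no true effect, analyzed after every batch of users, and stopped as soon as any look dips below 0.05.

```python
import random
from statistics import NormalDist

random.seed(7)
norm = NormalDist()

n_experiments = 1_000     # simulated A/A tests (no true difference)
looks = 10                # number of times the p-value is checked
users_per_look = 100      # new users per group between looks
alpha = 0.05

stopped_early = 0
for _ in range(n_experiments):
    sum_a = sum_b = 0.0
    n = 0
    for _ in range(looks):
        sum_a += sum(random.gauss(0, 1) for _ in range(users_per_look))
        sum_b += sum(random.gauss(0, 1) for _ in range(users_per_look))
        n += users_per_look
        diff = sum_b / n - sum_a / n
        se = (2 / n) ** 0.5               # known unit variance in both groups
        p = 2 * (1 - norm.cdf(abs(diff / se)))
        if p < alpha:                     # "ship it" on the first lucky peek
            stopped_early += 1
            break

print(f"False positive rate with peeking: {stopped_early / n_experiments:.1%}")
# Well above the nominal 5%; typically in the 15-20% range for 10 looks.
```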
How does the p-value relate to the significance level?
The significance level (alpha) is the threshold you set before the experiment runs. The p-value is the result you observe after data comes in. If the p-value is less than alpha, you reject the null hypothesis.
Alpha is a design choice, not a law of nature. Setting alpha at 0.05 means you accept a 5% chance of a false positive when the change truly has no effect. Some teams use 0.01 for high-stakes decisions or 0.10 for exploratory tests where false negatives are more costly than false positives. At Spotify, the Confidence platform's decision framework distinguishes between success metrics and guardrail metrics, applying different error rate requirements to each. Guardrail metrics, which protect against harm, use significance thresholds calibrated to the cost of missing a regression.