A confidence interval is a range of values that, at a given confidence level, is expected to contain the true treatment effect. A 95% confidence interval means that if you repeated the experiment many times, 95% of the resulting intervals would contain the actual effect. It tells you not just whether an effect exists, but how large it plausibly is.
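To make the repeated-experiments interpretation concrete, here is a minimal simulation sketch. The numbers are hypothetical and it uses a plain difference-in-means with a normal approximation rather than Confidence's own estimator; roughly 95% of the simulated intervals end up covering the true effect.

```python
import numpy as np

rng = np.random.default_rng(7)

true_effect = 0.5      # hypothetical true lift, in percentage points
n_per_group = 2_000    # users per variant in each simulated experiment
sigma = 5.0            # hypothetical per-user standard deviation
z = 1.96               # two-sided 95% critical value

covered = 0
n_experiments = 10_000
for _ in range(n_experiments):
    control = rng.normal(0.0, sigma, n_per_group)
    treatment = rng.normal(true_effect, sigma, n_per_group)
    estimate = treatment.mean() - control.mean()
    se = np.sqrt(control.var(ddof=1) / n_per_group + treatment.var(ddof=1) / n_per_group)
    lower, upper = estimate - z * se, estimate + z * se
    covered += lower <= true_effect <= upper

print(f"coverage: {covered / n_experiments:.3f}")  # close to 0.95
```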
In practice, confidence intervals answer the question that matters most after an experiment: "how much did this change move the metric?" A p-value tells you whether the observed effect is statistically distinguishable from noise. The confidence interval tells you how large the effect plausibly is. A narrow interval around a meaningful positive number is a strong signal to ship. A wide interval that spans zero tells you the experiment didn't collect enough evidence to decide. Confidence reports confidence intervals for every metric in every experiment, and they're the primary tool teams use to evaluate results.
How do you read a confidence interval?
A 95% confidence interval of [+0.5%, +2.1%] on a conversion rate metric means the true effect of the change is plausibly somewhere between a 0.5% and 2.1% increase. Both bounds are positive, so you can be confident the change improved conversion.
If the interval is [-0.3%, +1.8%], it crosses zero. The effect might be positive, but it might also be negative or exactly zero. The experiment hasn't produced enough evidence to distinguish a real improvement from noise.
Three properties determine how wide the interval is: the variance of the metric (noisier metrics produce wider intervals), the sample size (more users produce narrower intervals), and the confidence level (99% intervals are wider than 95% intervals for the same data). Variance reduction techniques like CUPED, which Confidence applies using the Negi-Wooldridge full regression estimator, tighten intervals by removing pre-experiment noise. This lets teams reach conclusive results with less traffic or detect smaller effects with the same traffic.
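As a rough sketch of how each property moves the width, the snippet below uses the standard normal-approximation formula for a difference in means between two equally sized groups. The variances, sample sizes, and confidence levels are illustrative, and this is not the exact estimator Confidence uses.

```python
import numpy as np
from scipy import stats

def ci_width(variance: float, n: float, confidence: float) -> float:
    """Width of a two-sided normal-approximation interval for a difference
    in means, with the same variance and n users in each of two groups."""
    z = stats.norm.ppf(1 - (1 - confidence) / 2)
    se = np.sqrt(2 * variance / n)
    return 2 * z * se

print(f"baseline width:      {ci_width(4.0, 10_000, 0.95):.4f}")
print(f"4x the variance:     {ci_width(16.0, 10_000, 0.95):.4f}")  # twice as wide
print(f"4x the sample size:  {ci_width(4.0, 40_000, 0.95):.4f}")   # half as wide
print(f"99% instead of 95%:  {ci_width(4.0, 10_000, 0.99):.4f}")   # wider
```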
Why are confidence intervals more useful than p-values alone?
A p-value answers a binary question: is the effect statistically significant? A confidence interval answers a quantitative one: what's the range of plausible effect sizes?
This distinction has real consequences for product decisions. Suppose two experiments both clear the 0.05 significance threshold. The first has a 95% confidence interval of [+0.1%, +0.3%] on revenue per user. The second has a confidence interval of [+2.5%, +8.0%]. Both are statistically significant, but the second represents a meaningfully larger opportunity. Without confidence intervals, both results look the same.
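The sketch below makes the comparison concrete: from an effect estimate and its standard error it computes both the p-value and the interval, using hypothetical summary numbers chosen to roughly reproduce the two intervals above. The p-values come out nearly identical; only the intervals reveal the difference in opportunity.

```python
from scipy import stats

def p_value_and_ci(estimate: float, se: float, confidence: float = 0.95):
    """Two-sided p-value and confidence interval from an effect estimate
    and its standard error, using a normal approximation."""
    p = 2 * stats.norm.sf(abs(estimate / se))
    half_width = stats.norm.ppf(1 - (1 - confidence) / 2) * se
    return p, (estimate - half_width, estimate + half_width)

# Hypothetical summary statistics chosen to reproduce the intervals above.
for name, estimate, se in [("experiment 1", 0.2, 0.051), ("experiment 2", 5.25, 1.403)]:
    p, (lo, hi) = p_value_and_ci(estimate, se)
    print(f"{name}: p = {p:.4f}, 95% CI = [{lo:+.2f}%, {hi:+.2f}%]")
```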
At Spotify, where 300+ teams run experiments simultaneously, confidence intervals also help with prioritization. A team deciding between two follow-up directions can compare the intervals from prior experiments to see which lever has a larger plausible effect range. The interval width itself is informative: a very wide interval on a metric you care about signals that you need more traffic, lower variance, or a bolder treatment to learn anything useful.
How do confidence intervals work with sequential testing?
In fixed-horizon experiments (where you analyze once at the end), confidence intervals follow standard formulas. But most teams don't want to wait until the end. They want to check results as data accumulates.
Sequential testing frameworks produce valid confidence intervals at every interim analysis. Confidence supports Group Sequential Tests, which provide confidence intervals at each pre-planned look, and always-valid inference, which produces intervals that are valid at any stopping time. The trade-off is that sequential confidence intervals are wider than their fixed-horizon counterparts at each individual look, because they account for the multiple opportunities to stop. As more data accumulates, the intervals narrow.
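A minimal sketch of why interim intervals are wider: the snippet splits the alpha budget evenly across the planned looks, a crude Bonferroni-style adjustment that is more conservative than the group sequential boundaries Confidence actually uses, but it shows the direction of the effect. More looks mean a larger critical value and therefore a wider interval at each look.

```python
from scipy import stats

estimate, se = 1.2, 0.5        # hypothetical lift and standard error at an interim look
alpha, planned_looks = 0.05, 5

# Fixed-horizon interval: one analysis, one critical value.
z_fixed = stats.norm.ppf(1 - alpha / 2)

# Crude sequential adjustment: spend alpha evenly across the planned looks.
# Real group sequential boundaries are less conservative, but the direction
# is the same: more opportunities to stop -> larger critical value.
z_sequential = stats.norm.ppf(1 - alpha / (2 * planned_looks))

for label, z in [("fixed-horizon", z_fixed), ("sequential, 5 looks", z_sequential)]:
    print(f"{label}: [{estimate - z * se:+.2f}%, {estimate + z * se:+.2f}%]")
```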
This is what makes sequential testing practical: you can make a shipping decision the moment the confidence interval excludes the region of practical indifference, rather than waiting for a calendar date.
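A small sketch of that stopping rule, assuming a hypothetical indifference region of ±0.2 percentage points on the metric:

```python
def ship_decision(lower: float, upper: float, rope: tuple[float, float] = (-0.2, 0.2)) -> str:
    """Decision once a confidence interval clears the region of practical
    indifference (ROPE); bounds and ROPE are in the metric's units."""
    rope_low, rope_high = rope
    if lower > rope_high:
        return "ship: even the worst plausible effect is a meaningful improvement"
    if upper < rope_low:
        return "roll back: even the best plausible effect is a meaningful regression"
    return "keep collecting data: the interval still overlaps the indifference region"

print(ship_decision(0.5, 2.1))    # ship
print(ship_decision(-0.3, 1.8))   # keep collecting data
```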