Lesson 4: Confidence intervals and precision

Summary

In this lesson, you learn what confidence intervals are and how to read them. You learn that CI width tells you how precisely the effect has been measured, and how sample size and metric noise affect that precision. You also learn how to read CIs from sequential and non-sequential tests.

The relative effect shown for a metric (say, +4.2%) is a single number, often called a point estimate. It is your best estimate of the true effect of the treatment variant on that metric. But a single number cannot tell you how trustworthy that estimate is. That is what the confidence interval is for.

A confidence interval (CI) gives you two things at once: where the effect likely is and how precisely you have measured it. These are not separate ideas; they are two ways of reading the same interval.

In Confidence

In Confidence, each metric result is shown as a horizontal bar. The dot in the middle is the point estimate (the +4.2% in our example). The bar extends to the left and right of that dot, and those endpoints are the lower bound and upper bound of the confidence interval.

Interpretation: correct or practical

Take an interval like [−15%, +6%]. The technically correct interpretation of a 95% CI is this: if you ran this experiment 100 times and computed a 95% CI each time, approximately 95 of those 100 intervals would contain the true effect. Notice that this is a statement about the procedure, not about the specific interval in front of you.
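The "statement about the procedure" can be checked with a small simulation. The sketch below is illustrative only: the true effect, noise level, group size, and the function name `coverage` are all made-up assumptions, and it uses a known-variance z-interval rather than whatever Confidence computes internally.

```python
import math
import random
from statistics import NormalDist


def coverage(true_effect=1.0, sigma=4.0, n=100, runs=2000, level=0.95, seed=7):
    """Fraction of simulated experiments whose CI contains the true effect."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% interval
    se = sigma * math.sqrt(2 / n)              # standard error, sigma assumed known
    hits = 0
    for _ in range(runs):
        # Simulate one experiment: n users per group, same noise in both groups.
        control = sum(rng.gauss(0.0, sigma) for _ in range(n)) / n
        treatment = sum(rng.gauss(true_effect, sigma) for _ in range(n)) / n
        estimate = treatment - control
        # Does this experiment's interval cover the (normally unknown) truth?
        if estimate - z * se <= true_effect <= estimate + z * se:
            hits += 1
    return hits / runs


print(f"Share of intervals containing the true effect: {coverage():.3f}")
```

The printed share should land close to 0.95. Any single interval either contains the true effect or it does not; the 95% describes how often the procedure succeeds across repetitions.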

The practical interpretation you should use: With high confidence, the true effect lies between −15% and +6%. Reading it that way is not technically exact, but it leads to rational product decisions, and that is what matters. Read more about this in the Notes for nerds section.

CI width as precision

The width of the CI tells you how precisely you have measured the effect. A narrow CI means a precise estimate: you know roughly what the effect is. A wide CI means high uncertainty: the effect could be anywhere in a large range.

A wide CI that crosses zero is not just "not significant." It means you do not yet know whether there is an effect at all. If you have not yet reached the required sample size, the right response is to collect more data. If you have already met it, the wide CI is itself the finding: the effect is likely smaller than what the experiment was powered to detect, and concluding that nothing meaningful is happening may be exactly right. How required sample size is determined is covered in the Sample Size I course.

Slightly simplified, two factors drive CI width in the most basic setting.

Sample size: more users means a narrower CI. Width shrinks in proportion to one over the square root of the sample size, so halving the width takes roughly four times as many users. This is the most controllable factor; running longer or on a larger audience directly buys precision.
Metric variance: some metrics are inherently noisier than others. Higher variance means you need more data to achieve the same precision.

In practice, CI width also depends on whether you are running sequential or non-sequential testing, the number of metrics, the significance level (alpha), and other factors covered in the Sample Size and Hypothesis Testing courses.
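To make the two drivers concrete, here is a minimal sketch of the textbook interval for a difference in means. It assumes equal group sizes, a known common standard deviation, and a two-sided z-interval on the absolute (not relative) effect; the function names and input values are hypothetical, and this is not a description of how Confidence computes its intervals.

```python
import math
from statistics import NormalDist


def ci_for_difference(estimate, sigma, n_per_group, level=0.95):
    """Two-sided z-interval for a difference in means with known sigma."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * sigma * math.sqrt(2 / n_per_group)
    return estimate - half_width, estimate + half_width


def width(ci):
    return ci[1] - ci[0]


def crosses_zero(ci):
    return ci[0] <= 0.0 <= ci[1]


small = ci_for_difference(estimate=2.0, sigma=50.0, n_per_group=300)
large = ci_for_difference(estimate=2.0, sigma=50.0, n_per_group=10_000)
noisy = ci_for_difference(estimate=2.0, sigma=100.0, n_per_group=300)

print("n=300:    ", small, "crosses zero:", crosses_zero(small))
print("n=10,000: ", large, "crosses zero:", crosses_zero(large))
print("Doubling sigma multiplies the width by", width(noisy) / width(small))
```

With these made-up inputs, the same point estimate goes from an interval that includes zero to one that excludes it purely because the sample grew, while doubling the standard deviation doubles the width.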

Use the interactive below to build intuition for how these two factors interact.

Confidence interval explorer

The point estimate is fixed at +4.2%. Adjust sample size and standard deviation to see how the CI width changes.

Sample size per group: 300 users (slider range: 100 to 10,000)

Metric standard deviation (σ): 50 units (slider range: 10, low noise, to 100, high noise)

With high confidence, the true effect is between −1.5% and +9.9%. Since zero is in the interval, the effect is not statistically significant.

Try the following:

Set sample size to 100 and σ to 40. Notice how wide and uncertain the CI is.
Increase sample size to 5,000 with the same σ. The CI narrows significantly.
Set sample size to 1,000 and move σ from 10 to 50. The CI widens as the metric becomes noisier.

The key insight: collecting more data is the primary lever you have to narrow the CI. Variance reduction (Lesson 8) is another lever; it effectively gives you a narrower CI for the same sample size by removing predictable noise from the data. Choosing a less noisy metric in the first place is a third: metric sensitivity and how to assess it are covered in the Feasibility and sensitivity lesson.

CIs in sequential and non-sequential tests

The CI you see on a results page always reflects the data collected up to that moment. How to interpret it depends on the evaluation strategy the experiment uses.

With a non-sequential (fixed-horizon) test, the CI is only statistically valid at the pre-specified end point. Intervals you see mid-experiment have not been corrected for repeated looks. They are informational but should not be acted on.
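Why uncorrected mid-experiment looks are risky can be illustrated with a simulation of A/A experiments, where the true effect is zero by construction. The sketch below is an assumption-laden toy: ten equally spaced looks, a known standard deviation of 1, a two-sided z-test at alpha 0.05, and hypothetical names throughout. It does not model any sequential method used by Confidence.

```python
import math
import random
from statistics import NormalDist


def false_positive_rates(experiments=1000, looks=10, batch=50, alpha=0.05, seed=11):
    """Return (rate when acting on any look, rate when acting on the final look only)."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    any_look = 0
    final_look = 0
    for _exp in range(experiments):
        sum_c = sum_t = 0.0
        n = 0
        rejected_somewhere = False
        rejected_now = False
        for _look in range(looks):
            # Both groups come from the same distribution: no true effect.
            sum_c += sum(rng.gauss(0.0, 1.0) for _ in range(batch))
            sum_t += sum(rng.gauss(0.0, 1.0) for _ in range(batch))
            n += batch
            z = (sum_t / n - sum_c / n) / math.sqrt(2 / n)
            rejected_now = abs(z) > z_crit
            rejected_somewhere = rejected_somewhere or rejected_now
        any_look += rejected_somewhere   # acted on the first "significant" look
        final_look += rejected_now       # waited for the pre-specified end point
    return any_look / experiments, final_look / experiments


peeking, fixed = false_positive_rates()
print(f"False positives when acting on any look:  {peeking:.3f}")
print(f"False positives at the planned end point: {fixed:.3f}")
```

The end-point rate should sit near the nominal 5%, while the any-look rate comes out several times higher. That gap is the cost of treating an uncorrected fixed-horizon interval as actionable before the end point.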

With a sequential test, the CI is valid at any point in time. That is precisely what sequential testing is designed for. The interval starts deliberately wide when data is sparse, because the test distributes its false-positive budget across all future looks, and narrows as data accumulates, just as any CI does with more data.

Reading a sequential result is straightforward: the CI at the latest time point is your current result. Interpret it exactly like any other confidence interval. The time-series view shows how the result developed over the course of the experiment, but the actionable number is always the most recent one.

In Confidence

In Confidence, the CI bar and summary shown in the results always reflect the latest time point. Expand a metric row to see the full time-series plot of how the interval has evolved. Move the pointer across the graph to inspect the CI at any earlier point in time.

Notes for nerds

Two-sided intervals, one-sided tests

Although confidence intervals in Confidence are displayed as symmetric two-sided intervals, the hypothesis tests underlying them are always one-sided. Each displayed CI is best understood as two back-to-back one-sided confidence bounds: the lower bound is a one-sided lower bound (if it lies above zero, the effect is significantly positive), and the upper bound is a one-sided upper bound (if it lies below zero, the effect is significantly negative). The actual significance decision for any metric is made in a single, pre-specified direction: the direction in which the metric is intended to move. At the same significance level, one-sided tests require fewer observations than two-sided tests to achieve the same power for a given effect size, which is one reason Confidence can detect effects with fewer users than tools that use two-sided tests by default.
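The size of the saving can be estimated from the standard normal-approximation formula, in which required sample size is proportional to the squared sum of the significance quantile and the power quantile. The sketch below assumes alpha 0.05 and 80% power as example values and ignores the negligible far-tail rejection probability of the two-sided test; `relative_sample_size` is a hypothetical helper, not Confidence's sample size calculation.

```python
from statistics import NormalDist


def relative_sample_size(alpha=0.05, power=0.80):
    """Required n for a one-sided test divided by required n for a two-sided test."""
    q = NormalDist().inv_cdf
    one_sided = (q(1 - alpha) + q(power)) ** 2       # all of alpha in one tail
    two_sided = (q(1 - alpha / 2) + q(power)) ** 2   # alpha split across both tails
    return one_sided / two_sided


ratio = relative_sample_size()
print(f"One-sided test needs about {ratio:.0%} of the two-sided sample size")
```

Under these assumptions the one-sided test needs roughly a fifth fewer users for the same effect size and power. The ratio does not depend on the metric's variance or on the effect size, since both cancel.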

Z-tests, t-tests, and the Bayesian connection

Confidence uses z-tests throughout, not t-tests. For the sample sizes it targets (typically well above 1,000 users per group), the z and $t_{2n-2}$ quantiles are within 0.1% of each other, so the normal approximation costs nothing in practice and tracking degrees of freedom adds complexity for no gain.

Despite this, there is a precise argument for why Bayesian language is appropriate when interpreting results. The thread runs as follows. With a non-informative prior $p(\mu, \sigma^2) \propto 1/\sigma^2$ on each group, the Bayesian credible interval for the difference in means is numerically identical to the frequentist t-test confidence interval (see the derivation below). That Bayesian interval has a formal probabilistic guarantee: "there is a 95% probability that the true effect lies in this interval" is literally correct for it. The frequentist t-test interval converges to the z-test interval as $n$ grows, and by the Bernstein-von Mises theorem the Bayesian posterior converges to the same normal distribution. At the sample sizes Confidence targets, the z-test CI and the Bayesian credible interval are equal to any practically relevant precision, so the Bayesian probabilistic interpretation carries over.

The technically correct frequentist statement is: "95% of confidence intervals constructed using this procedure, across repeated experiments, would contain the true effect." This describes the long-run behavior of the procedure, not the probability that this specific interval is correct. For reading an experiment results page, treat the CI as a range of plausible values for the effect and you will make sound decisions.

If you are comfortable saying "95% probability the effect is in this range" for a Bayesian interval with a non-informative prior, you can use the same language for the frequentist CI without meaningfully misleading yourself or your colleagues.

The interactive below shows the convergence. Green is the frequentist z-test CI ( $z = 1.96$ , fixed); indigo is the Bayesian credible interval using $t_{2n-2}$ , which starts slightly wider at small $n$ and converges to the z-test CI as sample size grows.

Frequentist CI vs Bayesian credible interval

Fictitious metric: daily active minutes (control: 22.0 min, treatment: 23.1 min, pooled SD: 4.0 min). Drag the slider to see how both intervals narrow and converge as sample size grows.

Sample size per group: 200 users (slider range: 100 to 2,000)

Frequentist: [+0.316 min, +1.884 min]

Bayesian: [+0.314 min, +1.886 min]

The Bayesian interval is 0.31% wider than the frequentist one. Increase the sample size to watch the t-quantile converge to z = 1.960 and the two intervals become indistinguishable.
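The two intervals shown for 200 users per group can be reproduced with a few lines of code. The sketch below uses only the standard library, so the t-quantile comes from a two-term Cornish-Fisher expansion, which is an approximation chosen here for self-containment (a statistics library would give the exact quantile). The helper names are hypothetical.

```python
import math
from statistics import NormalDist


def t_quantile(p, df):
    """Approximate Student-t quantile (two-term Cornish-Fisher expansion)."""
    z = NormalDist().inv_cdf(p)
    g1 = (z**3 + z) / 4
    g2 = (5 * z**5 + 16 * z**3 + 3 * z) / 96
    return z + g1 / df + g2 / df**2


def intervals(diff, pooled_sd, n):
    """Return (z-interval with z fixed at 1.96, t-interval with 2n-2 degrees of freedom)."""
    se = pooled_sd * math.sqrt(2 / n)
    z_half = 1.96 * se
    t_half = t_quantile(0.975, 2 * n - 2) * se
    return (diff - z_half, diff + z_half), (diff - t_half, diff + t_half)


freq, bayes = intervals(diff=23.1 - 22.0, pooled_sd=4.0, n=200)
print("Frequentist:", [round(x, 3) for x in freq])   # [0.316, 1.884]
print("Bayesian:   ", [round(x, 3) for x in bayes])  # [0.314, 1.886]

for n in (100, 200, 1000, 2000):
    extra = t_quantile(0.975, 2 * n - 2) / 1.96 - 1
    print(f"n={n:>5}: t-interval is {extra:.2%} wider")
```

At 1,000 users per group the relative difference is already below 0.1%, which is the sense in which the two intervals become indistinguishable.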

The full derivation of why the Bayesian credible interval equals the frequentist t-test CI, and why both converge to the z-test CI:

Step 1: Non-informative prior

For a single normal group with unknown mean $\mu$ and variance $\sigma^2$ , the standard non-informative prior (often called the independence Jeffreys prior) is:

$p(\mu, \sigma^2) \propto \frac{1}{\sigma^2}$

For two independent groups, control $(\mu_c, \sigma_c^2)$ and treatment $(\mu_t, \sigma_t^2)$ , the joint prior is the product:

$p(\mu_c, \sigma_c^2, \mu_t, \sigma_t^2) \propto \frac{1}{\sigma_c^2 \sigma_t^2}$

Step 2: Marginal posterior for each group mean

For group $j$ with $n$ observations $x_{j1}, \ldots, x_{jn}$ , the normal likelihood is:

$p(\mathbf{x}_j \mid \mu_j, \sigma_j^2) \propto (\sigma_j^2)^{-n/2} \exp\!\left(-\frac{(n-1)s_j^2 + n(\bar{x}_j - \mu_j)^2}{2\sigma_j^2}\right)$

Multiplying by the prior $1/\sigma_j^2$ and integrating out the nuisance parameter $\sigma_j^2$ yields the marginal posterior for $\mu_j$ :

$\mu_j \mid \mathbf{x}_j \sim t_{n-1}\!\left(\bar{x}_j,\ \frac{s_j^2}{n}\right)$

where $\bar{x}_j$ is the sample mean and $s_j^2$ is the unbiased sample variance. The result is a Student's $t$ -distribution rather than a normal, because integrating out the unknown variance introduces extra tail weight that shrinks with more data.
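For readers who want the integration step spelled out, a sketch follows. The shorthand $A(\mu_j)$ is introduced here for convenience and is not part of the lesson's notation. The joint posterior is:

$p(\mu_j, \sigma_j^2 \mid \mathbf{x}_j) \propto (\sigma_j^2)^{-(n/2+1)} \exp\!\left(-\frac{A(\mu_j)}{2\sigma_j^2}\right), \quad A(\mu_j) = (n-1)s_j^2 + n(\bar{x}_j - \mu_j)^2$

Integrating over $\sigma_j^2$ is an inverse-gamma integral, $\int_0^\infty u^{-(a+1)} e^{-b/u}\,du = \Gamma(a)\,b^{-a}$ , with $a = n/2$ and $b = A(\mu_j)/2$ :

$p(\mu_j \mid \mathbf{x}_j) \propto \int_0^\infty (\sigma_j^2)^{-(n/2+1)} \exp\!\left(-\frac{A(\mu_j)}{2\sigma_j^2}\right) d\sigma_j^2 \propto A(\mu_j)^{-n/2}$

Factoring out $(n-1)s_j^2$ gives the kernel of a $t$ -distribution with $n-1$ degrees of freedom, location $\bar{x}_j$ , and scale $s_j/\sqrt{n}$ :

$p(\mu_j \mid \mathbf{x}_j) \propto \left[1 + \frac{1}{n-1}\left(\frac{\mu_j - \bar{x}_j}{s_j/\sqrt{n}}\right)^2\right]^{-\frac{(n-1)+1}{2}}$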

Step 3: Posterior for the difference

Under the assumption of equal variances ( $\sigma_c^2 = \sigma_t^2 = \sigma^2$ ), with the corresponding prior $p(\mu_c, \mu_t, \sigma^2) \propto 1/\sigma^2$ on the shared variance, the posterior for the difference $\Delta = \mu_t - \mu_c$ is:

$\Delta \mid \mathbf{x}_c, \mathbf{x}_t \sim t_{2n-2}\!\left(\bar{x}_t - \bar{x}_c,\ s_p^2 \cdot \frac{2}{n}\right)$

where the pooled sample variance is $s_p^2 = \frac{(n-1)s_c^2 + (n-1)s_t^2}{2n-2}$ . Writing $\hat{\Delta} = \bar{x}_t - \bar{x}_c$ for the point estimate and $t_{2n-2,\,0.975}$ for the 97.5% quantile of the $t_{2n-2}$ distribution, the 95% credible interval is:

$\hat{\Delta} \pm t_{2n-2,\,0.975} \cdot s_p\sqrt{2/n}$

In the frequentist framework, the pivotal quantity under the same assumption is $T = (\hat{\Delta} - \Delta) / (s_p\sqrt{2/n}) \sim t_{2n-2}$ , giving the 95% confidence interval:

$\hat{\Delta} \pm t_{2n-2,\,0.975} \cdot s_p\sqrt{2/n}$

The formulas are the same. The frequentist CI and the Bayesian credible interval with a non-informative prior have numerically identical bounds.
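The equality can also be checked numerically by sampling from the posterior instead of using the closed form. The sketch below assumes the shared-variance model of Step 3 with a flat prior on the means and a $1/\sigma^2$ prior on the variance, and reuses the fictitious daily-active-minutes numbers (difference 1.1, pooled SD 4.0, 200 users per group). The function name and the number of draws are arbitrary choices.

```python
import math
import random


def posterior_interval(diff, pooled_sd, n, draws=200_000, seed=3):
    """Monte Carlo 95% credible interval for the difference in means."""
    rng = random.Random(seed)
    df = 2 * n - 2
    samples = []
    for _ in range(draws):
        # sigma^2 | data is scaled inverse chi-square: df * s_p^2 / chi2_df.
        chi2 = rng.gammavariate(df / 2, 2.0)  # chi-square draw with df degrees of freedom
        sigma2 = df * pooled_sd**2 / chi2
        # Delta | sigma^2, data is normal around the observed difference.
        samples.append(rng.gauss(diff, math.sqrt(sigma2 * 2 / n)))
    samples.sort()
    return samples[int(0.025 * draws)], samples[int(0.975 * draws)]


low, high = posterior_interval(diff=1.1, pooled_sd=4.0, n=200)
print(f"Sampled 95% credible interval: [{low:.3f}, {high:.3f}]")
```

Up to Monte Carlo noise in the third decimal, the sampled bounds should match the closed-form $t_{2n-2}$ interval for the same inputs, which is approximately [0.314, 1.886].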

Step 4: Convergence to normal

As $n \to \infty$ , the Bernstein-von Mises theorem guarantees, under standard regularity conditions and for any prior that places positive density around the true parameter value, that the posterior converges to a normal distribution centered at the maximum likelihood estimate. Since $t_{2n-2} \to \mathcal{N}(0,1)$ , the 97.5% quantile of $t_{2n-2}$ tends to 1.96 and both intervals converge to $\hat{\Delta} \pm 1.96 \cdot s_p\sqrt{2/n}$ .

Informative priors

Informative priors can offer genuine value. If you have reliable prior evidence from previous experiments or strong domain knowledge, encoding it as an informative prior will shift and shrink the posterior, producing a narrower interval when that prior is accurate. That is a real statistical benefit. In practice, though, the benefit rarely justifies the complexity. Specifying a prior well requires choosing a distribution family, setting its parameters, and verifying that the inference is not too sensitive to those choices. There is also a deeper problem: for experiment results to change decisions, they need to travel from the analyst to the product manager, designer, and leadership who act on them. The bottleneck is rarely statistical precision. It is whether people trust and understand the evidence. As Sebastian Andersson discusses in How Experimental Evidence Travels Through Your Organization, that chain of inference is fragile, and adding layers of Bayesian machinery tends to obscure rather than clarify.
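The "shift and shrink" effect is easiest to see in the simplest conjugate case. The sketch below assumes a normal prior on the effect and a normal likelihood with a known standard error; the prior values are invented for illustration and the function name is hypothetical.

```python
import math


def normal_posterior(prior_mean, prior_sd, estimate, se):
    """Conjugate normal-normal update: returns (posterior mean, posterior sd)."""
    prior_precision = 1 / prior_sd**2
    data_precision = 1 / se**2
    post_var = 1 / (prior_precision + data_precision)
    post_mean = post_var * (prior_precision * prior_mean + data_precision * estimate)
    return post_mean, math.sqrt(post_var)


# Observed difference 1.1 with standard error 0.4; hypothetical prior N(1.0, 0.5^2).
mean, sd = normal_posterior(prior_mean=1.0, prior_sd=0.5, estimate=1.1, se=0.4)
print(f"Posterior mean {mean:.3f}, posterior sd {sd:.3f} (data alone: 1.100, 0.400)")
```

The posterior mean is pulled toward the prior and the posterior standard deviation is smaller than the standard error from the data alone. The result changes with the prior mean and prior standard deviation, and those are the choices that have to be defended.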