Confidence

Lesson 4: Confidence intervals and precision

Summary

In this lesson, you learn what confidence intervals are and how to read them. You learn that CI width tells you how precisely the effect has been measured, and how sample size and metric noise affect that precision. You also learn how to read CIs from sequential and non-sequential tests.

The relative effect shown for a metric (say, +4.2%) is a single number, often called a point estimate. It is your best estimate of the true effect of the treatment variant on that metric. But a single number cannot tell you how trustworthy that estimate is. That is what the confidence interval is for.

A confidence interval (CI) gives you two things at once: where the effect likely is and how precisely you have measured it. These are not separate ideas; they are two ways of reading the same interval.

In Confidence

In Confidence, each metric result is shown as a horizontal bar. The dot in the middle is the point estimate (the +4.2% in our example). The bar extends to the left and right of that dot, and those endpoints are the lower bound and upper bound of the confidence interval.

A metric result row in Confidence showing the CI bar, point estimate dot, and zero line

Interpretation: correct or practical

Take an interval like [−15%, +6%]. The technically correct interpretation of a 95% CI is this: if you ran this experiment 100 times and computed a 95% CI each time, approximately 95 of those 100 intervals would contain the true effect. Notice that this is a statement about the procedure, not about the specific interval in front of you.
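To make the "statement about the procedure" concrete, here is a minimal simulation sketch. It assumes normally distributed data, a known standard deviation, equal group sizes, and an absolute (not relative) effect; the function name `coverage` and all parameter values are illustrative and do not come from Confidence.

```python
import random
from statistics import NormalDist, mean

def coverage(true_effect=0.5, sigma=4.0, n=100, runs=2000, level=0.95, seed=1):
    """Fraction of simulated experiments whose CI contains the true effect."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.5 + level / 2)   # about 1.96 for a 95% interval
    half_width = z * sigma * (2 / n) ** 0.5     # z interval with known sigma (assumption)
    hits = 0
    for _ in range(runs):
        control = mean(rng.gauss(0.0, sigma) for _ in range(n))
        treatment = mean(rng.gauss(true_effect, sigma) for _ in range(n))
        estimate = treatment - control
        # Each run builds its own interval; we count how often it covers the truth.
        if estimate - half_width <= true_effect <= estimate + half_width:
            hits += 1
    return hits / runs

print(f"Share of intervals containing the true effect: {coverage():.3f}")
```

The share should land close to 0.95. Any single interval either contains the true effect or it does not; the 95% describes how often the procedure succeeds across many repetitions.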

The practical interpretation you should use: With high confidence, the true effect lies between −15% and +6%. Reading it that way is not technically exact, but it leads to rational product decisions, and that is what matters. Read more about this in the Notes for nerds section.

CI width as precision

The width of the CI tells you how precisely you have measured the effect. A narrow CI means a precise estimate: you know roughly what the effect is. A wide CI means high uncertainty: the effect could be anywhere in a large range.

A wide CI that crosses zero is not just "not significant." It means you do not yet know whether there is an effect at all. If you have not yet reached the required sample size, the right response is to collect more data. If you have already met it, the wide CI is itself the finding: the effect is likely smaller than what the experiment was powered to detect, and concluding that nothing meaningful is happening may be exactly right. How required sample size is determined is covered in the Sample Size I course.

A bit simplified: two factors drive CI width in the most basic setting.

  • Sample size: more users means a narrower CI. This is the most controllable factor; running longer or on a larger audience directly buys precision.
  • Metric variance: some metrics are inherently noisier than others. Higher variance means you need more data to achieve the same precision.

In practice, CI width also depends on whether you are running sequential or non-sequential testing, the number of metrics, the significance level (alpha), and other factors covered in the Sample Size and Hypothesis Testing courses.
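For the basic setting described above, the relationship can be sketched with the textbook formula for a difference in means: half-width = z · σ · √(2/n). This is an illustration under stated assumptions (equal group sizes, known standard deviation σ, absolute rather than relative effects, a non-sequential 95% interval); it is not necessarily the calculation Confidence performs, and `ci_half_width` is a hypothetical helper.

```python
from statistics import NormalDist

def ci_half_width(sigma, n_per_group, level=0.95):
    """Half-width of a z-based CI for a difference in means (textbook case)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return z * sigma * (2 / n_per_group) ** 0.5

# More users narrows the interval; a noisier metric widens it.
for n, sigma in [(100, 40), (400, 40), (5000, 40), (1000, 5), (1000, 50)]:
    print(f"n={n:>5}  sigma={sigma:>2}  half-width={ci_half_width(sigma, n):.2f}")
```

Because the width scales with 1/√n, quadrupling the sample size halves the interval, while the width grows in direct proportion to σ.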

Use the interactive below to build intuition for how these two factors interact.

Confidence interval explorer

The point estimate is fixed at +4.2%. Adjust sample size and standard deviation to see how the CI width changes.

[Interactive: a CI bar on an axis from -20% to +20%, with sliders for sample size (100 to 10,000) and standard deviation (10, low noise, to 100, high noise).]
With high confidence, the true effect is between -1.5% and +9.9%. Since zero is in the interval, the effect is not statistically significant.

Try the following:

  • Set sample size to 100 and σ to 40. Notice how wide and uncertain the CI is.
  • Increase sample size to 5,000 with the same σ. The CI narrows significantly.
  • Set sample size back to 1,000 and move σ from 5 to 50. The CI widens as the metric becomes noisier.

The key insight: collecting more data is the primary lever you have to narrow the CI. Variance reduction (Lesson 8) is another lever; it effectively gives you a narrower CI for the same sample size by removing predictable noise from the data. Choosing a less noisy metric in the first place is a third: metric sensitivity and how to assess it are covered in the Feasibility and sensitivity lesson.

CIs in sequential and non-sequential tests

The CI you see on a results page always reflects the data collected up to that moment. How to interpret it depends on the evaluation strategy the experiment uses.

With a non-sequential (fixed-horizon) test, the CI is only statistically valid at the pre-specified end point. Intervals you see mid-experiment have not been corrected for repeated looks. They are informational but should not be acted on.
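The reason mid-experiment intervals should not be acted on can be sketched with a simulation of A/A tests, where the true effect is zero. The setup is an assumption made for illustration (normal data, known standard deviation, ten evenly spaced looks, an uncorrected 95% z interval at each look); the function name and defaults are hypothetical.

```python
import random
from statistics import NormalDist

def false_positive_rates(looks=10, batch=50, runs=1000, sigma=1.0, seed=7):
    """Simulate A/A tests and check an uncorrected 95% CI at every look.

    Returns (rate when acting on any look, rate when acting only on the final look).
    """
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.975)
    any_look = final_look = 0
    for _run in range(runs):
        sum_c = sum_t = 0.0
        n = 0
        excluded_ever = False
        excluded_now = False
        for _look in range(looks):
            # Add one batch of users per group, then recompute the interval.
            sum_c += sum(rng.gauss(0.0, sigma) for _ in range(batch))
            sum_t += sum(rng.gauss(0.0, sigma) for _ in range(batch))
            n += batch
            estimate = sum_t / n - sum_c / n
            half_width = z * sigma * (2 / n) ** 0.5
            excluded_now = abs(estimate) > half_width   # CI excludes zero at this look
            excluded_ever = excluded_ever or excluded_now
        any_look += excluded_ever
        final_look += excluded_now
    return any_look / runs, final_look / runs

peeking, final_only = false_positive_rates()
print(f"false positives when acting on any look:  {peeking:.3f}")
print(f"false positives when acting at the end:   {final_only:.3f}")
```

Acting only at the planned end point keeps the false positive rate near the nominal 5%, while acting on the first look that excludes zero pushes it well above that.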

With a sequential test, the CI is valid at any point in time. That is precisely what sequential testing is designed for. The interval starts deliberately wide when data is sparse, because the test distributes its false-positive budget across all future looks, and narrows as data accumulates, just as any CI does with more data.

Reading a sequential result is straightforward: the CI at the latest time point is your current result. Interpret it exactly like any other confidence interval. The time-series view shows how the result developed over the course of the experiment, but the actionable number is always the most recent one.

In Confidence

In Confidence, the CI bar and summary shown in the results always reflect the latest time point. Expand a metric row to see the full time-series plot of how the interval has evolved. Move the pointer across the graph to inspect the CI at any earlier point in time.

A sequential CI time-series in Confidence, showing the interval narrowing as data accumulates

Reader exercise

A confidence interval for a metric shows [-2%, +9%]. What is the best interpretation of this result?

Reader exercise

What does a wide confidence interval tell you about a metric result?

Reader exercise

Which of the following actions directly leads to a narrower confidence interval?

Notes for nerds

Two-sided intervals, one-sided tests

Although confidence intervals in Confidence are displayed as symmetric two-sided intervals, the hypothesis tests underlying them are always one-sided. Each displayed CI is best understood as two back-to-back one-sided confidence bounds: the lower bound is the one-sided lower bound (testing whether the effect is negative), and the upper bound is the one-sided upper bound (testing whether the effect is positive). The actual significance decision for any metric is made in a single, pre-specified direction: the direction in which the metric is intended to move. One-sided tests require fewer observations than two-sided tests to achieve the same power for a given effect size, which is one reason Confidence can detect effects with fewer users than tools that use two-sided tests by default.
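The claim that one-sided tests need fewer observations can be checked with the standard normal-approximation sample size formula, n per group = 2 · ((z_alpha + z_power) · σ / δ)². The sketch below compares the two designs at the same alpha, which is an assumption; the text above does not state which alpha Confidence applies to each bound. The helper name and the example values (δ = 1, σ = 4, alpha = 0.05, power = 0.8) are hypothetical.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(delta, sigma, alpha=0.05, power=0.8, one_sided=True):
    """Users per group needed to detect an absolute effect `delta` (z-test, equal groups)."""
    nd = NormalDist()
    # A one-sided test spends all of alpha in one tail; a two-sided test splits it.
    z_alpha = nd.inv_cdf(1 - alpha) if one_sided else nd.inv_cdf(1 - alpha / 2)
    z_power = nd.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) * sigma / delta) ** 2)

one = n_per_group(delta=1.0, sigma=4.0, one_sided=True)
two = n_per_group(delta=1.0, sigma=4.0, one_sided=False)
print(f"one-sided: {one} per group, two-sided: {two} per group")
```

Under these assumptions the one-sided design needs roughly 20% fewer users per group for the same power.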

Z-tests, t-tests, and the Bayesian connection

Confidence uses z-tests throughout, not t-tests. For the sample sizes it targets (typically well above 1,000 users per group), the $z$ and $t_{2n-2}$ quantiles are within 0.1% of each other, so the normal approximation costs nothing in practice and tracking degrees of freedom adds complexity for no gain.

Despite this, there is a precise argument for why Bayesian language is appropriate when interpreting results. The thread runs as follows. With a non-informative prior $p(\mu, \sigma^2) \propto 1/\sigma^2$ on each group, the Bayesian credible interval for the difference in means is numerically identical to the frequentist t-test confidence interval (see the derivation below). That Bayesian interval has a formal probabilistic guarantee: "there is a 95% probability that the true effect lies in this interval" is literally correct for it. The frequentist t-test interval converges to the z-test interval as $n$ grows, and by the Bernstein-von Mises theorem the Bayesian posterior converges to the same normal distribution. At the sample sizes Confidence targets, the z-test CI and the Bayesian credible interval are equal to any practically relevant precision, so the Bayesian probabilistic interpretation carries over.

The technically correct frequentist statement is: "95% of confidence intervals constructed using this procedure, across repeated experiments, would contain the true effect." This describes the long-run behavior of the procedure, not the probability that this specific interval is correct. For reading an experiment results page, treat the CI as a range of plausible values for the effect and you will make sound decisions.

If you are comfortable saying "95% probability the effect is in this range" for a Bayesian interval with a non-informative prior, you can use the same language for the frequentist CI without meaningfully misleading yourself or your colleagues.

The interactive below shows the convergence. Green is the frequentist z-test CI ($z = 1.96$, fixed); indigo is the Bayesian credible interval using $t_{2n-2}$, which starts slightly wider at small $n$ and converges to the z-test CI as sample size grows.

Frequentist CI vs Bayesian credible interval

Fictitious metric: daily active minutes (control: 22.0 min, treatment: 23.1 min, pooled SD: 4.0 min). Drag the slider to see how both intervals narrow — and converge — as sample size grows.

[Interactive: the frequentist and Bayesian intervals plotted on an axis from 0 to +2.5 minutes, with a sample size slider from 100 to 2,000.]
Frequentist: [+0.316 min, +1.884 min]
Bayesian: [+0.314 min, +1.886 min]
The Bayesian interval is 0.31% wider than the frequentist one. Increase the sample size to watch the t-quantile converge to z = 1.960 and the two intervals become indistinguishable.
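The readout above can be reproduced with a short sketch. Two things here are assumptions: the sample size of 200 users per group is inferred from the displayed interval widths rather than stated, and the t quantile is computed with a standard series expansion around the normal quantile instead of an exact routine, so the script needs only the standard library. The helper names are hypothetical.

```python
from statistics import NormalDist

def t_quantile_approx(p, df):
    """Approximate Student t quantile via a series expansion around the normal quantile.

    Accurate for large df; this is an approximation, not an exact t quantile.
    """
    z = NormalDist().inv_cdf(p)
    g1 = (z**3 + z) / 4
    g2 = (5 * z**5 + 16 * z**3 + 3 * z) / 96
    return z + g1 / df + g2 / df**2

def intervals(diff, pooled_sd, n_per_group, level=0.95):
    """Return (z-based interval, t-based interval) for a difference in means."""
    p = 0.5 + level / 2
    se = pooled_sd * (2 / n_per_group) ** 0.5
    z = NormalDist().inv_cdf(p)
    t = t_quantile_approx(p, 2 * n_per_group - 2)
    return (diff - z * se, diff + z * se), (diff - t * se, diff + t * se)

# Fictitious metric from the text: control 22.0 min, treatment 23.1 min, pooled SD 4.0 min.
freq, bayes = intervals(diff=23.1 - 22.0, pooled_sd=4.0, n_per_group=200)
print(f"Frequentist (z): [{freq[0]:+.3f}, {freq[1]:+.3f}]")
print(f"Bayesian (t):    [{bayes[0]:+.3f}, {bayes[1]:+.3f}]")
for n in (100, 200, 1000, 2000):
    f, b = intervals(1.1, 4.0, n)
    print(f"n={n:>4}: t-based interval is {((b[1] - b[0]) / (f[1] - f[0]) - 1) * 100:.2f}% wider")
```

The gap between the two intervals shrinks roughly in proportion to 1/n, which is why it is negligible at the sample sizes discussed in this section.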

The full derivation of why the Bayesian credible interval equals the frequentist t-test CI, and why both converge to the z-test CI:

Step 1: Non-informative prior

For a single normal group with unknown mean $\mu$ and variance $\sigma^2$, the standard objective (Jeffreys) prior is:

$$p(\mu, \sigma^2) \propto \frac{1}{\sigma^2}$$

For two independent groups, control $(\mu_c, \sigma_c^2)$ and treatment $(\mu_t, \sigma_t^2)$, the joint prior is the product:

$$p(\mu_c, \sigma_c^2, \mu_t, \sigma_t^2) \propto \frac{1}{\sigma_c^2 \sigma_t^2}$$

Step 2: Marginal posterior for each group mean

For group $j$ with $n$ observations $x_{j1}, \ldots, x_{jn}$, the normal likelihood is:

$$p(\mathbf{x}_j \mid \mu_j, \sigma_j^2) \propto (\sigma_j^2)^{-n/2} \exp\!\left(-\frac{(n-1)s_j^2 + n(\bar{x}_j - \mu_j)^2}{2\sigma_j^2}\right)$$

Multiplying by the prior $1/\sigma_j^2$ and integrating out the nuisance parameter $\sigma_j^2$ yields the marginal posterior for $\mu_j$:

$$\mu_j \mid \mathbf{x}_j \sim t_{n-1}\!\left(\bar{x}_j,\ \frac{s_j^2}{n}\right)$$

where $\bar{x}_j$ is the sample mean and $s_j^2$ is the unbiased sample variance. The result is a Student's $t$-distribution rather than a normal, because integrating out the unknown variance introduces extra tail weight that shrinks with more data.
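For readers who want the integration step in Step 2 spelled out, here is a sketch using the standard inverse-gamma integral $\int_0^\infty y^{-(a+1)} e^{-b/y}\,dy = \Gamma(a)\,b^{-a}$ with $a = n/2$. The shorthand $b(\mu_j)$ is introduced here for convenience and is not part of the original derivation.

```latex
\begin{aligned}
p(\mu_j, \sigma_j^2 \mid \mathbf{x}_j)
  &\propto (\sigma_j^2)^{-(n/2 + 1)} \exp\!\left(-\frac{b(\mu_j)}{\sigma_j^2}\right),
  \qquad b(\mu_j) = \frac{(n-1)s_j^2 + n(\bar{x}_j - \mu_j)^2}{2} \\
p(\mu_j \mid \mathbf{x}_j)
  &\propto \int_0^\infty (\sigma_j^2)^{-(n/2 + 1)}
     \exp\!\left(-\frac{b(\mu_j)}{\sigma_j^2}\right) d\sigma_j^2
   = \Gamma\!\left(\tfrac{n}{2}\right) b(\mu_j)^{-n/2} \\
  &\propto \left[1 + \frac{1}{n-1} \cdot \frac{(\mu_j - \bar{x}_j)^2}{s_j^2 / n}\right]^{-n/2}
\end{aligned}
```

The last line is the kernel of a Student's $t$-distribution with $n-1$ degrees of freedom, location $\bar{x}_j$, and squared scale $s_j^2/n$, since the exponent $-n/2$ equals $-(\nu + 1)/2$ for $\nu = n - 1$.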

Step 3: Posterior for the difference

Under the assumption of equal variances ($\sigma_c^2 = \sigma_t^2 = \sigma^2$), the posterior for the difference $\Delta = \mu_t - \mu_c$ is:

$$\Delta \mid \mathbf{x}_c, \mathbf{x}_t \sim t_{2n-2}\!\left(\bar{x}_t - \bar{x}_c,\ s_p^2 \cdot \frac{2}{n}\right)$$

where the pooled sample variance is $s_p^2 = \frac{(n-1)s_c^2 + (n-1)s_t^2}{2n-2}$. The 95% credible interval is:

$$\hat{\Delta} \pm t_{2n-2} \cdot s_p\sqrt{2/n}$$

In the frequentist framework, the two-sample $t$-statistic under the same assumption is $T = (\bar{x}_t - \bar{x}_c) / (s_p\sqrt{2/n}) \sim t_{2n-2}$, giving the 95% confidence interval:

$$\hat{\Delta} \pm t_{2n-2} \cdot s_p\sqrt{2/n}$$

The formulas are the same. The frequentist CI and the Bayesian credible interval with a non-informative prior have numerically identical bounds.

Step 4: Convergence to normal

As $n \to \infty$, the Bernstein-von Mises theorem guarantees that the posterior converges to a normal distribution centered at the maximum likelihood estimate, regardless of the prior. Since $t_{2n-2} \to \mathcal{N}(0,1)$, both intervals converge to $\hat{\Delta} \pm 1.96 \cdot s_p\sqrt{2/n}$.

Informative priors

Informative priors can offer genuine value. If you have reliable prior evidence from previous experiments or strong domain knowledge, encoding it as an informative prior will shift and shrink the posterior, producing a narrower interval when that prior is accurate. That is a real statistical benefit. In practice, though, the benefit rarely justifies the complexity. Specifying a prior well requires choosing a distribution family, setting its parameters, and verifying that the inference is not too sensitive to those choices. There is also a deeper problem: for experiment results to change decisions, they need to travel from the analyst to the product manager, designer, and leadership who act on them. The bottleneck is rarely statistical precision. It is whether people trust and understand the evidence. As Sebastian Andersson discusses in How Experimental Evidence Travels Through Your Organization, that chain of inference is fragile, and adding layers of Bayesian machinery tends to obscure rather than clarify.
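The statistical benefit mentioned above, a prior that shifts and shrinks the posterior, can be sketched with the simplest conjugate case: a normal prior on the effect combined with a normally distributed estimate whose standard error is treated as known. The estimate (1.1 min) and standard error (0.4 min) follow the fictitious metric at 200 users per group; the prior mean and standard deviation are made up for illustration, and the helper names are hypothetical.

```python
from statistics import NormalDist

def posterior_normal(prior_mean, prior_sd, estimate, std_error):
    """Normal prior + normal estimate with known standard error -> normal posterior."""
    w_prior = 1 / prior_sd**2     # precision of the prior
    w_data = 1 / std_error**2     # precision of the experiment's estimate
    post_var = 1 / (w_prior + w_data)
    post_mean = post_var * (w_prior * prior_mean + w_data * estimate)
    return post_mean, post_var**0.5

def interval(center, sd, level=0.95):
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return center - z * sd, center + z * sd

estimate, std_error = 1.1, 0.4
post_mean, post_sd = posterior_normal(prior_mean=1.0, prior_sd=0.5,
                                      estimate=estimate, std_error=std_error)
lo, hi = interval(estimate, std_error)
print(f"data only:  [{lo:+.3f}, {hi:+.3f}]")
lo, hi = interval(post_mean, post_sd)
print(f"with prior: [{lo:+.3f}, {hi:+.3f}]")
```

The posterior is always narrower than the data-only interval in this model, whether or not the prior is accurate, which is why the sensitivity checks described above matter.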


© Copyright 2026. All rights reserved.
