Lesson 3: Sampling distribution of the difference-in-means estimator

Summary

Using probability theory, we know how the difference-in-means estimator varies across all possible samples and treatment assignments, without going through every combination.

Using probability theory, we can calculate how the difference-in-means estimator will vary across all possible samples and treatment assignments, without going through each of them. In fact, we even know the distribution that the estimator will approximately follow. For means and differences in means, the result that lets us do this is called the Central Limit Theorem. It states that if the sample size is large enough, the difference-in-means estimator is approximately normally distributed around the true average treatment effect across random samples and treatment assignments.
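As a rough sketch, the statement can be written out in symbols. The notation below is an assumption introduced for illustration and is not taken from the course: $\bar{Y}_T$ and $\bar{Y}_C$ are the sample means in treatment and control, $n_T$ and $n_C$ the group sizes, $\sigma_T^2$ and $\sigma_C^2$ the outcome variances in each group, and $\Delta$ the true average treatment effect. The variance term assumes independent observations.

```latex
\hat{\Delta} = \bar{Y}_T - \bar{Y}_C,
\qquad
\hat{\Delta} \;\overset{\text{approx.}}{\sim}\;
\mathcal{N}\!\left(\Delta,\; \frac{\sigma_T^2}{n_T} + \frac{\sigma_C^2}{n_C}\right)
\quad \text{for large } n_T \text{ and } n_C .
```

The square root of the variance term is the standard error of the estimator: it shrinks as the groups grow, which is why larger samples give more precise estimates.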

Note

The difference-in-means estimator is approximately normally distributed around the true treatment effect, regardless of the distribution of the data. In other words, even if the data is not even remotely close to normally distributed, the difference-in-means estimator will be approximately normally distributed around the true treatment effect if the sample size is large enough.
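One way to see this is a small simulation sketch with strongly skewed data. Everything here is an assumption chosen for illustration (the exponential distribution, the group size, the number of repeated experiments, and the `skewness` helper); it is not the course's own code. Skewness is used as a simple symmetry check: it is far from zero for the raw data but should be close to zero for the estimates.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so the sketch is reproducible

n_per_group = 500      # users per group (assumed value)
n_experiments = 5_000  # number of repeated experiments (assumed value)


def skewness(x):
    """Sample skewness: roughly 0 for a symmetric, normal-like distribution."""
    centered = x - x.mean()
    return (centered**3).mean() / (centered**2).mean() ** 1.5


# Heavily right-skewed outcome data. Both groups share the same distribution,
# so the true treatment effect is 0.
treatment = rng.exponential(scale=1.0, size=(n_experiments, n_per_group))
control = rng.exponential(scale=1.0, size=(n_experiments, n_per_group))

# One difference-in-means estimate per simulated experiment.
estimates = treatment.mean(axis=1) - control.mean(axis=1)

raw_skew = skewness(control.ravel())
estimate_skew = skewness(estimates)
print(f"skewness of raw data:  {raw_skew:.2f}")
print(f"skewness of estimates: {estimate_skew:.2f}")
```

Skewness only checks symmetry, so this is a partial sanity check rather than a proof of normality; a histogram of `estimates` would give a fuller picture.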

You only observe a point estimate

Importantly, the observed difference in means in a given sample is not normally distributed, since it's just a fixed value. It is the difference-in-means estimator across random samples and treatment assignments that is approximately normally distributed. This means that if you ran the experiment many times, the differences in means you observed would be approximately normally distributed.

Simulation

In this simulation, we draw a random sample, split it randomly into treatment and control, and calculate the difference in means. We do this many times to see how the difference in means varies across random samples and treatment assignments. Note that there is no treatment effect in this simulation. The variation in the difference in means is only due to random variation in the sample and treatment assignment. The observed distribution is called the sampling distribution of the difference-in-means estimator, as it is the distribution this estimator has across random samples and treatment assignments.

[Interactive simulation: panels for the random sample, treatment, control, and difference in means, plus a histogram of the difference in means. Samples: 0 / 500]

The magic that probability theory and statistics bring us is that we know which distribution will be a good approximation of the 500 simulated difference-in-means estimates under the null before we have run the simulation. It works because of math!
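The claim that the distribution is known in advance can be checked with a minimal sketch of such a simulation. The population (exponential with standard deviation 10), the sample size, and the variable names are all assumptions made for illustration; only the 500 repetitions and the absence of a treatment effect follow the text. The standard error is computed from the population before any data is drawn, then compared with the spread of the simulated estimates.

```python
import numpy as np

rng = np.random.default_rng(seed=1)  # fixed seed so the sketch is reproducible

n_simulations = 500    # number of repetitions, as in the simulation described above
sample_size = 1_000    # users per simulated experiment (assumed value)
half = sample_size // 2
population_sd = 10.0   # an exponential with scale 10 has standard deviation 10

# Prediction made BEFORE simulating: standard error of the difference in means.
predicted_se = np.sqrt(
    population_sd**2 / half + population_sd**2 / (sample_size - half)
)

diffs = np.empty(n_simulations)
for i in range(n_simulations):
    sample = rng.exponential(scale=population_sd, size=sample_size)  # random sample
    shuffled = rng.permutation(sample)                               # random assignment
    treatment, control = shuffled[:half], shuffled[half:]
    diffs[i] = treatment.mean() - control.mean()  # no treatment effect is applied

simulated_se = diffs.std(ddof=1)
# Under a normal approximation, about 95% of estimates fall within 1.96 SE of 0.
share_within = np.mean(np.abs(diffs) <= 1.96 * predicted_se)

print(f"predicted SE: {predicted_se:.3f}, simulated SE: {simulated_se:.3f}")
print(f"share within 1.96 SE of zero: {share_within:.3f}")
```

If the normal approximation is good, the two standard errors should be close and the share should land near 0.95, up to simulation noise from using only 500 repetitions.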

The value of knowing the distribution of the difference-in-means estimator can't be overstated. It lets us observe one sample and still draw conclusions (make inferences) about the full population. More on that in the next lesson.

Reader exercise

According to the Central Limit Theorem, what is the shape of the sampling distribution for the difference-in-means estimator when the sample size is large enough?

Reader exercise

What exactly is normally distributed according to the lesson?

Reader exercise

In the simulation described in the lesson, what causes the variation in the difference-in-means estimates when there is no treatment effect?

Notes for nerds

We have glossed over some technicalities in the Central Limit Theorem. Throughout this lesson we have said that the difference-in-means estimator is approximately normally distributed around the true treatment effect if the sample size is large enough. The exact conditions under which the theorem holds are a bit more nuanced, but for the purposes of this course we can assume that it holds when the sample size is large enough. In principle, as long as the tails of the underlying data are not too fat (for example, as long as the data has finite variance), the Central Limit Theorem will hold.

There are ways of making inferences that are not based on the Central Limit Theorem. One example is the bootstrap, a resampling method that can be used to estimate the distribution of an estimator without making assumptions about the distribution of the data. The bootstrap is a powerful tool that can be used in many situations where the Central Limit Theorem doesn't hold. It is more computationally intensive, but there are some tricks to make it faster. See for example our blog post on bootstrap for quantiles.
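To make the idea concrete, here is a minimal sketch of a plain nonparametric bootstrap for the difference in means. It is an illustration under assumed data, group sizes, and number of resamples, not the faster method from the blog post. Each group is resampled with replacement many times, and the spread of the recomputed differences serves as an estimate of the sampling distribution.

```python
import numpy as np

rng = np.random.default_rng(seed=2)  # fixed seed so the sketch is reproducible

# One observed experiment, simulated here (data and sizes are assumptions).
treatment = rng.exponential(scale=10.0, size=400)
control = rng.exponential(scale=10.0, size=400)
observed_diff = treatment.mean() - control.mean()

n_bootstrap = 2_000  # number of bootstrap resamples (assumed value)
boot_diffs = np.empty(n_bootstrap)
for b in range(n_bootstrap):
    # Resample each group with replacement, keeping the group sizes fixed.
    boot_t = rng.choice(treatment, size=treatment.size, replace=True)
    boot_c = rng.choice(control, size=control.size, replace=True)
    boot_diffs[b] = boot_t.mean() - boot_c.mean()

# Bootstrap standard error and a simple 95% percentile interval.
bootstrap_se = boot_diffs.std(ddof=1)
ci_low, ci_high = np.percentile(boot_diffs, [2.5, 97.5])

# For comparison: the standard error implied by the normal approximation.
clt_se = np.sqrt(
    treatment.var(ddof=1) / treatment.size + control.var(ddof=1) / control.size
)

print(f"observed difference: {observed_diff:.3f}")
print(f"bootstrap SE: {bootstrap_se:.3f}, normal-approximation SE: {clt_se:.3f}")
print(f"95% percentile interval: [{ci_low:.3f}, {ci_high:.3f}]")
```

For a well-behaved statistic like the mean, the two standard errors should be close; the bootstrap is more useful for estimators such as quantiles, where a simple normal-approximation formula is harder to come by.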


© Copyright 2026. All rights reserved.
