
Lesson 2: True versus estimated effects

Summary

In an experiment, you don't observe the treatment effect in the full population. You only observe a random sample of the population, and that sample is randomly split into treatment and control.

Within that sample, you only observe the mean of the non-treated outcome for the users in your control group, and the mean of the treated outcome for the users in your treatment group.

Clearly, the estimated difference in means between the treatment and control groups is not the same as the true treatment effect in the population. The estimate varies depending on which users end up in your random sample, and on which of those users end up in the treatment group and which in the control group.

In the illustration below, three random samples are drawn from a population. Each sample is split into treatment and control, and the two groups are exposed to different variants of a mobile app. The samples are small, which makes the estimates very uncertain. In some samples, the estimated difference in means is larger than zero; in others, it is smaller than zero. This variation is referred to as the sampling variation of the difference-in-means estimator: it's the variation of the estimator across random samples and treatment assignments.

[Illustration: Experimentation Flow Several Samples]
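The sampling variation described above can be reproduced in a small simulation. The following is a minimal sketch, not the setup behind the illustration: the population, its mean and spread, the constant treatment effect of 0.5, and the sample size of 20 are all hypothetical values chosen for demonstration.

```python
import random
from statistics import mean

def estimated_difference(population, true_effect, sample_size, rng):
    """Return treatment mean minus control mean for one random sample."""
    # rng.sample draws without replacement and returns users in random order,
    # so slicing the result in half acts as a random treatment assignment.
    sample = rng.sample(population, sample_size)
    half = sample_size // 2
    control = sample[:half]
    # Assumption: the treatment adds the same constant to every treated user.
    treatment = [outcome + true_effect for outcome in sample[half:]]
    return mean(treatment) - mean(control)

rng = random.Random(7)  # fixed seed so the run is repeatable
population = [rng.gauss(10.0, 3.0) for _ in range(10_000)]  # hypothetical outcomes
TRUE_EFFECT = 0.5  # hypothetical true treatment effect

# Three small samples, as in the illustration: each gives a different estimate.
estimates = [estimated_difference(population, TRUE_EFFECT, 20, rng) for _ in range(3)]
for number, estimate in enumerate(estimates, start=1):
    print(f"sample {number}: estimated difference = {estimate:+.2f}")
```

Each run of `estimated_difference` uses a different sample and assignment, so the printed estimates scatter around the true effect rather than matching it.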

A treatment effect estimator is said to be unbiased if the average of all estimates across all possible random samples and treatment assignments is equal to the true population treatment effect.
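Unbiasedness can be checked approximately by simulation. The sketch below averages many estimates instead of enumerating all possible samples and assignments, so it only approximates the definition. The population, the effect size of 0.5, and the number of repetitions are hypothetical.

```python
import random
from statistics import mean

rng = random.Random(11)  # fixed seed so the run is repeatable
population = [rng.gauss(10.0, 3.0) for _ in range(10_000)]  # hypothetical outcomes
TRUE_EFFECT = 0.5  # hypothetical constant treatment effect

def one_estimate(sample_size=20):
    """Difference in means for one random sample and one random assignment."""
    sample = rng.sample(population, sample_size)
    half = sample_size // 2
    control = sample[:half]
    treatment = [outcome + TRUE_EFFECT for outcome in sample[half:]]
    return mean(treatment) - mean(control)

# Repeat the experiment many times; single estimates vary widely,
# but their average should land close to the true effect.
estimates = [one_estimate() for _ in range(10_000)]
average_estimate = mean(estimates)

print(f"smallest estimate: {min(estimates):+.2f}")
print(f"largest estimate:  {max(estimates):+.2f}")
print(f"average estimate:  {average_estimate:+.2f} (true effect {TRUE_EFFECT:+.2f})")
```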

Separate the signal from the noise

So how do you know if the observed difference is due to random variation? Did users with a high value of the outcome metric by chance end up in the treatment group, or did the treatment actually have an effect? This is where statistics comes in.

Because the sample and the treatment assignment are random, probability theory lets us quantify the uncertainty in the estimated difference in means under the null hypothesis. If the treatment has no effect, then any variation in difference-in-means estimates across random samples only occurs because different users with different outcome values happen to be placed in different treatment groups.
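To see what variation under the null hypothesis looks like, the following sketch simulates experiments in which the treatment does nothing, so both groups keep their original outcomes. The population and sample size are hypothetical, and simulation stands in here for the probability theory that the course develops later.

```python
import random
from statistics import mean, quantiles

rng = random.Random(3)  # fixed seed so the run is repeatable
population = [rng.gauss(10.0, 3.0) for _ in range(10_000)]  # hypothetical outcomes

def null_difference(sample_size=20):
    """Difference in means when the treatment has no effect at all."""
    sample = rng.sample(population, sample_size)
    half = sample_size // 2
    # No effect is added: the two groups differ only by who landed where.
    return mean(sample[half:]) - mean(sample[:half])

null_differences = [null_difference() for _ in range(5_000)]

# With n=40, quantiles returns cut points at 2.5%, 5%, ..., 97.5%.
cuts = quantiles(null_differences, n=40)
lower, upper = cuts[0], cuts[-1]

print(f"average null difference: {mean(null_differences):+.3f}")
print(f"middle 95% of null differences: {lower:+.2f} to {upper:+.2f}")
```

The differences center on zero, and the printed range shows how large a difference can get from random assignment alone at this sample size.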

Note

Probability theory lets us quantify how likely a certain mean difference is if the treatment has no effect (meaning that the null hypothesis is true). If the observed difference is very unlikely under the null hypothesis, we reject the null hypothesis and conclude that the treatment has an effect.

The idea of rejecting the null because the observed outcome is unlikely under the null can be challenging to digest, but understanding it is important for building intuition for experimentation.

We say that a mean difference is statistically significant if it's among the alpha percent most unlikely mean differences under the null hypothesis. If that's the case, we reject the null hypothesis and say that "we found evidence for the alternative hypothesis". Alpha is a parameter that the experimenter sets. We return to alpha in Lesson 5.

The logic of rejecting the null is this: a large mean difference would be much more likely if the treatment did have an effect, so we would rather believe that the treatment has an effect than believe that the null hypothesis is true and that we just happened to observe a very unlikely mean difference by chance.
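The decision rule can be sketched as a comparison between an observed difference and a simulated set of null differences. This is an illustration under stated assumptions, not the test the course goes on to use: alpha is written as a proportion (0.05 for 5 percent), "most unlikely" is read as two-sided (largest in absolute size), and the population, sample size, and observed difference of 3.0 are hypothetical.

```python
import random
from statistics import mean

def share_at_least_as_extreme(observed, null_differences):
    """Share of null differences at least as large in absolute size as the observed one."""
    extreme = sum(1 for d in null_differences if abs(d) >= abs(observed))
    return extreme / len(null_differences)

def is_significant(observed, null_differences, alpha=0.05):
    """True if the observed difference is among the alpha share of most extreme null differences."""
    return share_at_least_as_extreme(observed, null_differences) <= alpha

# Build a null distribution by simulating experiments with no treatment effect.
rng = random.Random(5)
population = [rng.gauss(10.0, 3.0) for _ in range(10_000)]  # hypothetical outcomes

def null_difference(sample_size=20):
    sample = rng.sample(population, sample_size)
    half = sample_size // 2
    return mean(sample[half:]) - mean(sample[:half])

null_differences = [null_difference() for _ in range(5_000)]

observed = 3.0  # hypothetical observed difference in means
share = share_at_least_as_extreme(observed, null_differences)
print(f"share of null differences at least as extreme: {share:.4f}")
print(f"significant at alpha = 0.05: {is_significant(observed, null_differences)}")
```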

Recommendation

If this is your first time hearing about statistical significance and rejecting the null hypothesis, don't worry. This is a concept that takes time to understand. Come back to this lesson tomorrow or in a week and go through it again. Most new experimenters need a few attempts before it clicks.

But how can we know if an observed mean difference is among the alpha percent most unlikely mean differences under the null hypothesis? We certainly can't go through every possible sample and treatment assignment and compute the difference in means with no treatment applied (in the example above alone, there are more than 800 million combinations of samples and treatment assignments). In the next section, we dig into how math lets us know the distribution of the difference-in-means estimator under the null hypothesis without going through all possible samples and treatment assignments.
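To get a feel for why enumeration is impractical, the number of combinations can be counted directly. The sizes below are hypothetical and are not the ones behind the figure of 800 million, which the text does not state. The count is the number of ways to pick the sample multiplied by the number of ways to pick the treatment group within it.

```python
from math import comb

def count_combinations(population_size, sample_size, treatment_size):
    """Number of distinct (sample, treatment assignment) pairs."""
    samples = comb(population_size, sample_size)      # ways to draw the sample
    assignments = comb(sample_size, treatment_size)   # ways to pick the treated users
    return samples * assignments

# Hypothetical sizes: even tiny populations give millions of combinations.
for population_size, sample_size in [(10, 4), (20, 8), (40, 16)]:
    total = count_combinations(population_size, sample_size, sample_size // 2)
    print(f"population {population_size}, sample {sample_size}: {total:,} combinations")
```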

Reader exercise

Why might the estimated difference in means from an experiment differ from the true treatment effect in the population?

Reader exercise

What does it mean when we say a mean difference is "statistically significant"?

Reader exercise

What causes sampling variation in the difference-in-means estimator?


© Copyright 2026. All rights reserved.
