Lesson 6: Why do we need statistics?

Summary

In this lesson, you learn about the role that statistical analysis plays in experimentation.

Statistics helps to:

  • Quantify the uncertainty in the metric results.
  • Manage and bound the risk of making the wrong product decisions.

Statistics is the mathematical language for quantifying uncertainty. In experiments, there is always random variation between the treatment groups, even before you make a change for any group of users. Every user is unique, so the treatment groups almost never have exactly the same average.

Gym example

Let's go back to the weight-loss program trial.

You carried out your experiment as planned, randomizing participants into a control group and a treatment group. You measured the weight of the participants in both groups at the start of the trial. The treatment group started the 6-week weight-loss program right away, while the control group waited 6 weeks. At the end of the 6 weeks, you measured the weight of everyone again.

You ended up with 10 participants in each group. The table below shows each participant's change in weight over the 6 weeks. People in the control group lost 1.3 kg on average, and people who participated in the weight-loss program lost 3.1 kg on average.

Weight change (kg)   Group average   1      2      3      4      5      6      7      8      9      10
Control              -1.3            +2.1   -3.3   -2.7   -1.8   +2.1   -1.8   +0.9   -1.2   -4.5   -3.0
Treatment            -3.1            -3.9   -2.7   -3.3   -7.5   -0.9   -4.2   -1.8   -2.1   -3.9   -0.6

So, the people in your program lost 1.8 kg more weight than the people in the control group. Does that mean that your program works? Or could it just be a coincidence?
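
As a quick check, the group averages and the 1.8 kg difference can be recomputed from the individual values in the table. The sketch below uses only Python's standard library; the variable names are illustrative and not part of the lesson.

```python
from statistics import mean

# Weight change in kg for each participant, copied from the table.
control = [2.1, -3.3, -2.7, -1.8, 2.1, -1.8, 0.9, -1.2, -4.5, -3.0]
treatment = [-3.9, -2.7, -3.3, -7.5, -0.9, -4.2, -1.8, -2.1, -3.9, -0.6]

control_avg = mean(control)
treatment_avg = mean(treatment)

# Negative values mean weight loss, so a negative difference means the
# treatment group lost more weight than the control group.
difference = treatment_avg - control_avg

print(f"Control average:   {control_avg:.1f} kg")
print(f"Treatment average: {treatment_avg:.1f} kg")
print(f"Difference:        {difference:.1f} kg")
```

The printed values are rounded to one decimal, matching the precision used in the table.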

Quantify the noise to detect the signal

People's weights fluctuate somewhat over time. If you just divided 20 people randomly into two groups and measured their change in weight over time, it's quite unlikely that the averages of the two groups would be exactly the same. This random variation causes noise in your measurement.

How can you detect the signal among the noise?

To answer this question, you can think about how much "noise" in the weight measurements you would expect to see, even if your program had no impact at all. You can then compare the difference that you found to the amount of noise that you would expect purely from random variation. The amount of noise in the average measurements depends on the number of people in each group and on the variance, that is, how much the individuals differ from each other. Based on this, you can use statistical theory to calculate how much noise you can expect in the measurement. In other words: how likely would it be to find a difference this large, just due to random variation?
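
One common way to put a number on this noise is the standard error of the difference between two group averages. It shrinks as the groups get larger and grows with the variance. The sketch below assumes this textbook formula and applies it to the weight-loss data; the lesson does not state which formula it uses, and the function name `standard_error` is hypothetical.

```python
from math import sqrt
from statistics import variance

# Weight change in kg for each participant, copied from the table.
control = [2.1, -3.3, -2.7, -1.8, 2.1, -1.8, 0.9, -1.2, -4.5, -3.0]
treatment = [-3.9, -2.7, -3.3, -7.5, -0.9, -4.2, -1.8, -2.1, -3.9, -0.6]


def standard_error(var_a, n_a, var_b, n_b):
    """Standard error of the difference between two independent group means."""
    return sqrt(var_a / n_a + var_b / n_b)


var_c = variance(control)    # sample variance of the control group
var_t = variance(treatment)  # sample variance of the treatment group

# Noise with the 10 people per group from the trial.
se_10 = standard_error(var_c, 10, var_t, 10)
# Hypothetical: the same variances with four times as many people per group.
se_40 = standard_error(var_c, 40, var_t, 40)

print(f"n = 10 per group: {se_10:.2f} kg")
print(f"n = 40 per group: {se_40:.2f} kg")
```

With 10 people per group, the expected noise is just under 1 kg, which is not much smaller than the 1.8 kg difference. Quadrupling the group size halves the noise.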

We'll skip the math here. For the weight-loss example, it turns out that there is an 8% chance of finding a difference of 1.8 kg or more between the two groups, based purely on random variation. This calculation assumes that your weight-loss program had absolutely no effect on people's weight.
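
For readers who want to see one way such a probability can be computed, the sketch below runs a permutation test on the table data. It reassigns the 20 measured weight changes to two groups of 10 in every possible way, and counts how often the difference in averages is at least as large as the observed one, in either direction. This is an illustrative method chosen for this example, not necessarily the calculation behind the figure quoted above, so the result may differ slightly from it.

```python
from itertools import combinations

# Weight change in kg for each participant, copied from the table.
control = [2.1, -3.3, -2.7, -1.8, 2.1, -1.8, 0.9, -1.2, -4.5, -3.0]
treatment = [-3.9, -2.7, -3.3, -7.5, -0.9, -4.2, -1.8, -2.1, -3.9, -0.6]

n_t, n_c = len(treatment), len(control)
observed = sum(treatment) / n_t - sum(control) / n_c

# If the program had no effect, the group labels are arbitrary, so every
# way of picking 10 of the 20 values as "treatment" is equally likely.
pooled = control + treatment
total = sum(pooled)

extreme = 0
count = 0
for group in combinations(pooled, n_t):
    group_sum = sum(group)
    diff = group_sum / n_t - (total - group_sum) / n_c
    # Small tolerance so floating-point rounding does not drop exact ties.
    if abs(diff) >= abs(observed) - 1e-9:
        extreme += 1
    count += 1

p_value = extreme / count
print(f"Observed difference: {observed:.2f} kg")
print(f"Share of reassignments at least this extreme: {p_value:.1%}")
```

The loop visits all 184,756 possible splits, which takes well under a minute in plain Python. For larger groups, a random sample of reassignments is the usual substitute for full enumeration.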

Use statistics to make a decision

You can use this calculation to make a decision about your program. If you find a small effect, you could plausibly have seen it even if your program did nothing. In that case, there is not enough evidence to conclude that your program works. If you find a larger effect, it is less likely to have come purely from random variation. You can define a threshold beforehand that states how "unlikely to be seen purely based on random variation" your result needs to be for you to consider it strong enough evidence.

For example, if you had decided beforehand that you would consider your results strong enough only if they are less than 5% likely to be seen based purely on random variation, then your trial, at 8%, would not have produced strong enough evidence. If, on the other hand, you had decided beforehand to set that threshold at 10%, then your result would have passed the test. We call a result that has passed such a test "statistically significant".
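
The decision rule itself is a single comparison. The sketch below applies it to the 8% figure from the weight-loss example with the two thresholds mentioned above. The function name and the parameter name `alpha` are hypothetical choices for this illustration.

```python
def is_statistically_significant(p_value, alpha):
    """Return True when the result is less likely under 'no effect' than the threshold."""
    return p_value < alpha


# Chance of seeing a difference this large from random variation alone.
p_value = 0.08

# Thresholds that would have to be chosen before looking at the results.
for alpha in (0.05, 0.10):
    significant = is_statistically_significant(p_value, alpha)
    verdict = "significant" if significant else "not significant"
    print(f"threshold {alpha:.0%}: {verdict}")
```

The same result gives opposite verdicts under the two thresholds, which is why the threshold has to be fixed in advance.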

Don't change the target after you see the results

It is tempting to change the threshold after seeing the results: "OK, we said 5% beforehand, but 8% is still not that bad. If we decide that a threshold of 10% is good enough, then the experiment passes the test!" But to avoid confirmation bias, it is important to define the threshold before seeing the results. Changing the threshold after seeing the results is like shooting an arrow at a wall and then drawing a target around the place where it landed. Drawing a target around the results of an experiment is cheating, and cheating in experimentation leads to worse product decisions. The same applies to changing the metrics after the end of the experiment.

A note on statistical uncertainty for the curious

Statistics doesn't magically know how different any two groups are. However, you have one trick up your sleeve: randomization. By randomizing the treatment assignment, you know how the difference in means between two treatment groups varies across different random treatment assignments. In other words, randomly assigning the treatment to users serves two purposes: it makes the groups similar in all aspects other than the treatment (as discussed in the scientific method lesson), and it "structures" the noise in the difference-in-means estimator so that statistical inference is possible.
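
A small simulation can illustrate this point. The sketch below takes a fixed, made-up population of 20 weight changes with no treatment applied, repeatedly splits it at random into two groups of 10, and records the difference in averages each time. The population values, the seed, and the number of repetitions are all assumptions for illustration. It then compares the spread of the simulated differences with the spread that the standard formula for a random split of a fixed population predicts.

```python
import random
from math import sqrt
from statistics import mean, pstdev, variance

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: 20 weight changes in kg, no treatment applied.
population = [rng.gauss(-1.0, 2.0) for _ in range(20)]
N = len(population)
n = 10  # people per group

diffs = []
for _ in range(20_000):
    shuffled = population[:]
    rng.shuffle(shuffled)  # one random treatment assignment
    group_a, group_b = shuffled[:n], shuffled[n:]
    diffs.append(sum(group_a) / n - sum(group_b) / (N - n))

# Spread of the difference in means observed across random assignments.
simulated_sd = pstdev(diffs)
# Spread predicted by theory when nothing differs between the groups.
predicted_sd = sqrt(variance(population) * N / (n * (N - n)))

print(f"Average difference: {mean(diffs):.3f} kg")
print(f"Simulated spread:   {simulated_sd:.3f} kg")
print(f"Predicted spread:   {predicted_sd:.3f} kg")
```

The average difference is close to zero and the two spreads agree closely. That agreement comes from the random assignment alone, without any further assumption about who the participants are.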

Reader exercise

What is the purpose of statistical testing?
