Lesson 1: Introduction to hypothesis tests

Summary

This lesson introduces hypothesis testing by looking at the relation between the experiment hypothesis and the hypotheses in a hypothesis test. You learn about the null and alternative hypotheses in a hypothesis test, and how they relate to the experiment hypothesis. You also learn why you often test the mean of metrics in experiments, and how the randomness in the experiment helps us reason about the true average treatment effect.

Experiment hypotheses

Most people are familiar with the concept of a hypothesis: it's a somewhat formal statement about what you believe will be the outcome of an experiment. Introductions to experimentation (including our introductory course) often stress how important it is to have a clear hypothesis. It's worth noting that the hypothesis in an experiment is not the same as the hypotheses in a hypothesis test. The hypotheses in a hypothesis test are mathematically precise statements about an aspect of a metric of interest. The test hypotheses should of course reflect the experiment hypothesis, but they are not the same.

An example of a good experiment hypothesis is:

Based on user research, we believe that having to create a username creates friction in the sign-up process. We think that removing the step to enter a username for users signing up in the app will lead to faster completion of the sign-up flow. We will know this is true when we see a decrease in the mean sign-up completion time.

The last part, about decreasing the mean sign-up completion time, is the part we test using hypothesis testing. To do that, we need to translate the experiment hypothesis into the two hypotheses of a hypothesis test.

The null and alternative hypotheses of a hypothesis test

A hypothesis test has a null hypothesis and an alternative hypothesis. Hypotheses are specific statements about aspects of a metric of interest, such as the mean sign-up completion time that you want to improve. For example, a hypothesis might say "we think the mean sign-up completion time will decrease by 10 seconds".

Note

An experiment hypothesis is often a less mathematically precise version of the alternative hypothesis in a hypothesis test. In our example, "faster completion of the sign-up flow" becomes a precise statement about decreasing mean sign-up completion time.

Simply put:

  • The alternative hypothesis describes what happens to the metric if the treatment has an effect (in our case, that removing the username field decreases mean sign-up completion time).
  • The null hypothesis describes what happens to the metric if the treatment has no effect (in our case, that removing the username field doesn't change mean sign-up completion time).

With these hypotheses in mind, we have a clear definition of what we expect to happen to the metric if the treatment does or does not have an effect.

Hypotheses for means

When talking about hypothesis tests in online A/B tests, we almost always talk about a test of the difference in means between treatment and control. You have an outcome metric of interest (like sign-up completion time), and this metric has some mean before any experiment starts. As a product team, you are trying to improve this metric by iterating on your product. When you are about to run an experiment, you want to know whether a treatment (like removing the username field) improves the mean of this metric.

For our sign-up completion time example, we want the mean time to decrease after the treatment, and the null and alternative hypotheses would be:

  • $H_0$: Mean sign-up completion time in control group = Mean sign-up completion time in treatment group (null hypothesis)
  • $H_1$: Mean sign-up completion time in control group > Mean sign-up completion time in treatment group (one-sided, negative direction)

This example shows how we can have directionality in our hypothesis test. We specifically want to see a decrease in the mean time. In other contexts, where we might want to see an increase in a metric (like mean revenue per user), we would adjust the alternative hypothesis accordingly:

  • $H_0$: Mean revenue in control group = Mean revenue in treatment group (null hypothesis)
  • $H_1$: Mean revenue in control group < Mean revenue in treatment group (one-sided, positive direction)

Note

Because the hypotheses are used in the test formulas, it's common to present them in a mathematical form. For the purpose of this course, and to build intuition about hypothesis testing, it suffices to understand them as precise statements about some aspect of a metric.

There are always two hypotheses in a hypothesis test: the null hypothesis and the alternative hypothesis. You can be mathematically fancy and write $H_0$ for the null hypothesis and $H_1$ for the alternative.
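
As a preview of how the direction of the alternative hypothesis enters a test, here is a minimal sketch using only the Python standard library. It assumes a one-sided z-test on the difference in means (the test itself is covered in a later lesson). The function name `one_sided_z_test`, the `direction` argument, and the simulated sign-up times are all made up for illustration and are not part of any Confidence API.

```python
import math
import random
from statistics import NormalDist, mean, variance


def one_sided_z_test(control, treatment, direction):
    """Return (difference in means, one-sided p-value) for H0: equal means.

    direction="decrease": H1 is control mean > treatment mean.
    direction="increase": H1 is control mean < treatment mean.
    """
    diff = mean(treatment) - mean(control)
    # Standard error of the difference between two independent sample means.
    se = math.sqrt(
        variance(control) / len(control) + variance(treatment) / len(treatment)
    )
    z = diff / se
    if direction == "decrease":
        # Small p-value when the treatment mean is far below the control mean.
        p_value = NormalDist().cdf(z)
    elif direction == "increase":
        # Small p-value when the treatment mean is far above the control mean.
        p_value = 1 - NormalDist().cdf(z)
    else:
        raise ValueError("direction must be 'decrease' or 'increase'")
    return diff, p_value


# Made-up sign-up completion times in seconds (illustrative only).
rng = random.Random(7)
control_times = [rng.gauss(60, 15) for _ in range(2000)]
treatment_times = [rng.gauss(55, 15) for _ in range(2000)]

diff, p_value = one_sided_z_test(control_times, treatment_times, "decrease")
print(f"difference in means: {diff:.2f} s, one-sided p-value: {p_value:.4f}")
```

For a metric you want to increase, such as mean revenue per user, the same sketch would be called with `direction="increase"`.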

Why look at the mean?

You can test parameters other than the mean of the metric, but in experimentation the mean is the most common. The mean is a good summary of a metric to base decisions on, and it has attractive statistical properties that make it easier to manage the risk of wrong decisions with statistics.

The steps for running an experiment and testing the mean difference between treatment and control are illustrated in the figure below. We:

  1. take a random sample of users
  2. split the sample randomly into treatment and control groups
  3. give the treatment to the users in the treatment group
  4. observe the mean of the metric in the two groups
  5. compare the means using a hypothesis test.

Figure: Experimentation Flow
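
The five steps above can be sketched as a small simulation. This is only an illustration under stated assumptions: the population, its metric values, and the `TRUE_EFFECT` constant are invented, and the treatment is assumed to shift every treated user's metric by the same amount, which real treatments rarely do.

```python
import random
from statistics import mean

rng = random.Random(42)

# Hypothetical population: each user's metric value without the treatment.
population = [max(0.0, rng.gauss(100, 20)) for _ in range(100_000)]
TRUE_EFFECT = 2.0  # assumed (made-up) effect of the treatment on every user

# 1. Take a random sample of users (here: user indices).
sample = rng.sample(range(len(population)), 10_000)

# 2. Split the sample randomly into treatment and control groups.
rng.shuffle(sample)
half = len(sample) // 2
control_ids, treatment_ids = sample[:half], sample[half:]

# 3. Give the treatment to the users in the treatment group.
control_outcomes = [population[i] for i in control_ids]
treatment_outcomes = [population[i] + TRUE_EFFECT for i in treatment_ids]

# 4. Observe the mean of the metric in the two groups.
control_mean = mean(control_outcomes)
treatment_mean = mean(treatment_outcomes)

# 5. Compare the means; this difference is what the hypothesis test evaluates.
diff_in_means = treatment_mean - control_mean
print(f"control mean:   {control_mean:.2f}")
print(f"treatment mean: {treatment_mean:.2f}")
print(f"difference:     {diff_in_means:.2f} (true effect: {TRUE_EFFECT})")
```

The observed difference lands near the assumed true effect but not exactly on it, because both the sample and the split are random.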

A Spotify example

For a more concrete Spotify example, suppose the metric of interest is audiobook minutes played. The current mean is 48 minutes per user, and we are interested in whether a new version of a recommendation algorithm increases the mean minutes played from 48 minutes to a higher number. In other words, does the new version of the recommendation algorithm cause an increase in the average minutes played? In an experiment, we test whether the treatment affects the mean of the metric by taking a random sample, splitting it randomly into treatment and control, giving the treatment to the users in the treatment group, and then observing the mean of the metric in the two groups and comparing the means.

How the randomness helps us manage risks

When the goal is to affect the mean, the hypothesis refers to the difference in the means between the treatment and control groups. If the treatment has no effect, the treatment and the control group should have the same mean and the mean difference should be zero. If the treatment increases the mean, the treatment group should have a higher mean than the control group, which would lead to a mean difference larger than zero. The null hypothesis is that the difference in means is zero. The alternative hypothesis is that the difference in means is greater than zero.

Since the sample and the treatment assignment are random and no person is exactly like any other person, the difference in means between treatment and control will not be exactly zero even if the treatment truly has no effect. The difference-in-means estimator gives an estimate of the true average treatment effect in the population, and it will vary across random samples and treatment assignments. In the next lesson we will dig deeper into the variation of the difference-in-means estimator across random samples and treatment assignments.
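
To see this variation, the sketch below repeats a simulated experiment many times with a true effect of exactly zero. The mean of 48 minutes comes from the example above. The standard deviation of 20, the normal distribution, the group size, and the helper name `simulate_difference` are assumptions made purely for illustration.

```python
import random
from statistics import mean, stdev

rng = random.Random(0)


def simulate_difference(true_effect, n_per_group=1_000):
    """Run one simulated experiment and return the observed difference in means."""
    control = [rng.gauss(48, 20) for _ in range(n_per_group)]
    treatment = [rng.gauss(48 + true_effect, 20) for _ in range(n_per_group)]
    return mean(treatment) - mean(control)


# 500 simulated experiments in which the treatment truly has no effect.
null_diffs = [simulate_difference(0.0) for _ in range(500)]

print(f"smallest observed difference: {min(null_diffs):.2f}")
print(f"largest observed difference:  {max(null_diffs):.2f}")
print(f"average observed difference:  {mean(null_diffs):.2f}")
print(f"spread (standard deviation):  {stdev(null_diffs):.2f}")
```

No single simulated experiment produces a difference of exactly zero, yet the differences scatter around zero. A hypothesis test uses knowledge of this scatter to judge whether an observed difference is larger than randomness alone would plausibly produce.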

Reader exercise

What is the primary purpose of hypothesis testing in experimentation?

Reader exercise

In a typical experiment, what do we usually test?

Notes for nerds

It's common to call the population means in the control and treatment groups $\mu_0$ and $\mu_1$ to keep notation succinct. The null hypothesis is then $H_0: \mu_0 = \mu_1$ and the alternative hypothesis is, for example, $H_1: \mu_0 < \mu_1$.

The hypotheses in a hypothesis test refer to so-called population parameters. The population parameters are the true values of the metric in the population. In other words, they describe what the true average treatment effect would be if all users in the population got this treatment. In an experiment, we can only observe the metric in a sample of the population, and we use the sample mean as an estimate of the population mean. The hypothesis test is then a test of the null hypothesis that the population means are equal, based on the sample means.
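
The relation between the population parameters and their sample estimates can be written compactly. The symbols $\tau$, $\hat{\tau}$, $\bar{Y}_g$, $n_g$, and $Y_{g,i}$ are notation assumed here for illustration, not taken from the course, and $\mu_0$ is taken to be the control mean and $\mu_1$ the treatment mean, consistent with an alternative of $\mu_0 < \mu_1$ for an increase.

```latex
% True average treatment effect (population parameter, never observed directly)
\tau = \mu_1 - \mu_0

% Sample mean in group g (0 = control, 1 = treatment), with n_g users
\bar{Y}_g = \frac{1}{n_g} \sum_{i=1}^{n_g} Y_{g,i}

% Difference-in-means estimator of the true effect
\hat{\tau} = \bar{Y}_1 - \bar{Y}_0

% Hypotheses for a metric we want to increase
H_0 : \tau = 0 \qquad \text{versus} \qquad H_1 : \tau > 0
```

The hypotheses are statements about $\tau$, while the data only give us $\hat{\tau}$.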

The following lessons are devoted to unpacking the relations between the sample and the population.

© Copyright 2026. All rights reserved.