Lesson 1: Introduction to hypothesis tests

Experiment hypotheses

Most people are familiar with the concept of a hypothesis: a somewhat formal statement about what you believe the outcome of an experiment will be. Introductions to experimentation (including our introductory course) often stress how important it is to have a clear hypothesis. It's worth noting that the hypothesis of an experiment is not the same as the hypotheses in a hypothesis test. The hypotheses in a hypothesis test are mathematically precise statements about an aspect of a metric of interest. The test hypotheses should of course reflect the experiment hypothesis, but they are not the same.

An example of a good experiment hypothesis is:

Based on user research, we believe that having to create a username creates friction in the sign-up process. We think that removing the step to enter a username for users signing up in the app will lead to faster completion of the sign-up flow. We will know this is true when we see a decrease in the mean sign-up completion time.

The last part, about decreasing the mean sign-up completion time, is the part we test using hypothesis testing. To do that, we need to translate the experiment hypothesis into the two hypotheses of a hypothesis test.

The null and alternative hypotheses of a hypothesis test

A hypothesis test has a null hypothesis and an alternative hypothesis. Hypotheses are specific statements about aspects of a metric of interest, such as the mean sign-up completion time that you want to improve. For example, a hypothesis might say "we think the mean sign-up completion time will decrease by 10 seconds".

Simply put:

  • The alternative hypothesis describes what happens to the metric if the treatment has an effect (in our case, that removing the username field decreases mean sign-up completion time).
  • The null hypothesis describes what happens to the metric if the treatment has no effect (in our case, that removing the username field doesn't change mean sign-up completion time).

With these hypotheses in mind, we have a clear definition of what we expect to happen to the metric depending on whether or not the treatment has an effect.

Hypotheses for means

When talking about hypothesis tests in online A/B tests, we almost always talk about a test of the difference in means between treatment and control. You have an outcome metric of interest (like sign-up completion time), and this metric has some mean before any experiment starts. As a product team, you are trying to improve this metric by iterating on your product. When you are about to run an experiment, you want to know whether a treatment (like removing the username field) moves the mean of this metric in the desired direction.

For our sign-up completion time example, we want the mean time to decrease after the treatment, and the null and alternative hypotheses would be:

  • H_0: Mean sign-up completion time in control group = Mean sign-up completion time in treatment group (null hypothesis)
  • H_1: Mean sign-up completion time in control group > Mean sign-up completion time in treatment group (one-sided, negative direction)

This example shows how we can have directionality in our hypothesis test. We specifically want to see a decrease in the mean time. In other contexts, where we might want to see an increase in a metric (like mean revenue per user), we would adjust the alternative hypothesis accordingly:

  • H_0: Mean revenue in control group = Mean revenue in treatment group (null hypothesis)
  • H_1: Mean revenue in control group < Mean revenue in treatment group (one-sided, positive direction)

Why look at the mean?

You can test parameters other than the mean of the metric, but in experimentation the mean is the most common choice. The mean is a good summary of a metric to base decisions on, and it has attractive statistical properties that make it easier to manage the risks of basing decisions on statistics.

The steps for running an experiment and testing the mean difference between treatment and control are illustrated in the figure below. We:

  1. take a random sample of users
  2. split the sample randomly into treatment and control groups
  3. give the treatment to the users in the treatment group
  4. observe the mean of the metric in the two groups
  5. compare the means using a hypothesis test.

[Figure: Experimentation Flow]
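The five steps above can be sketched as a small simulation. This is only an illustration under stated assumptions: the population is made-up sign-up completion times (in seconds) drawn from a normal distribution, and the treatment is assumed to shift every treated user's time by the same amount. The function name `run_experiment` and all numbers are hypothetical and do not come from a real experiment.

```python
import random
import statistics


def run_experiment(population, treatment_effect, sample_size, seed=0):
    """Simulate the five steps for a hypothetical metric.

    population: baseline metric values, one per user (made-up data).
    treatment_effect: amount added to each treated user's value
        (assumed constant across users, which real effects rarely are).
    """
    rng = random.Random(seed)

    # 1. Take a random sample of users.
    sample = rng.sample(population, sample_size)

    # 2. Split the sample randomly into treatment and control groups.
    rng.shuffle(sample)
    half = sample_size // 2
    control, treatment = sample[:half], sample[half:]

    # 3. Give the treatment to the users in the treatment group.
    treatment = [value + treatment_effect for value in treatment]

    # 4. Observe the mean of the metric in the two groups.
    control_mean = statistics.fmean(control)
    treatment_mean = statistics.fmean(treatment)

    # 5. Compare the means. Only the raw difference is returned here;
    #    a hypothesis test is what judges it against chance variation.
    return control_mean, treatment_mean, treatment_mean - control_mean


# Hypothetical population: sign-up completion times in seconds.
population_rng = random.Random(42)
population = [population_rng.gauss(60.0, 15.0) for _ in range(100_000)]

control_mean, treatment_mean, diff = run_experiment(
    population, treatment_effect=-5.0, sample_size=10_000
)
print(f"control mean:   {control_mean:.2f} s")
print(f"treatment mean: {treatment_mean:.2f} s")
print(f"difference:     {diff:.2f} s")
```

The observed difference lands near the assumed effect of -5 seconds but not exactly on it, because both the sample and the split are random.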

A Spotify example

For a more concrete example from Spotify, suppose the metric of interest is audiobook minutes played. The current mean is 48 minutes per user, and we want to know whether a new version of a recommendation algorithm increases the mean minutes played from 48 minutes to a higher number. In other words, does the new version of the recommendation algorithm cause an increase in the average minutes played? In an experiment, we test whether the treatment affects the mean of the metric by taking a random sample, splitting it randomly into treatment and control, giving the treatment to the users in the treatment group, and then observing the mean of the metric in the two groups and comparing the means.

How the randomness helps us manage risks

When the goal is to affect the mean, the hypothesis refers to the difference in the means between the treatment and control groups. If the treatment has no effect, the treatment and the control group should have the same mean and the mean difference should be zero. If the treatment increases the mean, the treatment group should have a higher mean than the control group, which would lead to a mean difference larger than zero. The null hypothesis is that the difference in means is zero. The alternative hypothesis is that the difference in means is greater than zero.
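To make the one-sided hypotheses concrete, here is a minimal sketch of a test of "difference in means is zero" against "difference in means is greater than zero". It uses a large-sample normal approximation (a z-test), which is an assumption made for this sketch and not necessarily the method used later in the course. The function name and the simulated minutes-played data are hypothetical.

```python
import math
import random
import statistics


def one_sided_mean_difference_test(control, treatment):
    """Return (difference, z, p_value) for H1: treatment mean > control mean.

    Uses a normal approximation, which is reasonable for large samples.
    """
    diff = statistics.fmean(treatment) - statistics.fmean(control)

    # Standard error of the difference between two independent sample means.
    se = math.sqrt(
        statistics.variance(control) / len(control)
        + statistics.variance(treatment) / len(treatment)
    )
    z = diff / se

    # Probability of a z at least this large if the true difference is zero.
    p_value = 1.0 - statistics.NormalDist().cdf(z)
    return diff, z, p_value


# Made-up data: minutes played, with an assumed true increase of 3 minutes.
rng = random.Random(7)
control = [rng.gauss(48.0, 20.0) for _ in range(5_000)]
treatment = [rng.gauss(51.0, 20.0) for _ in range(5_000)]

diff, z, p_value = one_sided_mean_difference_test(control, treatment)
print(f"difference in means: {diff:.2f}, z: {z:.2f}, p-value: {p_value:.4g}")
```

A small p-value means a difference this large would be unlikely if the null hypothesis were true. For a metric that should decrease, such as sign-up completion time, the same sketch applies with the sign of the difference flipped.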

Since the sample and the treatment assignment are random, and no person is exactly like any other person, the difference in means between treatment and control will not be exactly zero even if the treatment truly has no effect. The difference-in-means estimator gives an estimate of the true average treatment effect in the population, and it will vary across random samples and treatment assignments. In the next lesson we will dig deeper into the variation of the difference-in-means estimator across random samples and treatment assignments.
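A quick way to see this variation is to simulate a treatment with no effect at all and re-randomize the assignment many times. The sketch below holds one made-up sample fixed and only reshuffles who is in treatment and who is in control, so it shows the variation from treatment assignment alone, not from drawing new samples. The function name and data are hypothetical.

```python
import random
import statistics


def null_differences(values, n_assignments, seed=0):
    """Difference in means under repeated random splits, with no treatment effect."""
    rng = random.Random(seed)
    values = list(values)  # copy so the caller's list is not reordered
    half = len(values) // 2
    diffs = []
    for _ in range(n_assignments):
        # Randomly assign users to groups; nobody's metric changes.
        rng.shuffle(values)
        control, treatment = values[:half], values[half:]
        diffs.append(statistics.fmean(treatment) - statistics.fmean(control))
    return diffs


# Made-up sample: minutes played for 1,000 users.
sample_rng = random.Random(1)
sample = [sample_rng.gauss(48.0, 20.0) for _ in range(1_000)]

diffs = null_differences(sample, n_assignments=2_000)
print(f"mean of differences:    {statistics.fmean(diffs):.3f}")
print(f"spread (std deviation): {statistics.stdev(diffs):.3f}")
print(f"largest difference:     {max(diffs):.3f}")
```

The differences center on zero, yet individual splits can land noticeably above or below it. That chance variation is what a hypothesis test has to account for.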

Notes for nerds

It's common to call the population means in the control and treatment groups μ_0 and μ_1, respectively, to keep notation succinct. The null hypothesis is then H_0: μ_0 = μ_1 and the alternative hypothesis is, for example, H_1: μ_0 < μ_1.

The hypotheses in a hypothesis test refer to so-called population parameters. The population parameters are the true values for the whole population, such as the true mean of the metric. The true average treatment effect, for example, is the effect we would see if all users in the population got the treatment. In an experiment, we can only observe the metric in a sample of the population, and we use the sample mean as an estimate of the population mean. The hypothesis test is then a test of the null hypothesis that the population means are equal, based on the sample means.
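One way to write down the distinction between population and sample is shown below. The symbols τ (the true average treatment effect), τ̂ (its estimate), and Ȳ_0, Ȳ_1 (the sample means in control and treatment) are notation assumed here for illustration and are not defined elsewhere in this lesson.

```latex
\[
\underbrace{\tau = \mu_1 - \mu_0}_{\text{population parameter (unobserved)}}
\qquad
\underbrace{\hat{\tau} = \bar{Y}_1 - \bar{Y}_0}_{\text{difference-in-means estimate (observed)}}
\]
\[
H_0: \tau = 0
\qquad \text{vs.} \qquad
H_1: \tau > 0
\]
```

The hypotheses are statements about τ, while the data only ever give us τ̂.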

The following lessons are devoted to unpacking the relationship between the sample and the population.