Lesson 6: Why do we need statistics?
In this lesson, you learn about the role that statistical analysis plays in experimentation.
Statistics helps to:
- Quantify the uncertainty in the metric results.
- Manage and bound the risk of making the wrong product decisions.
Statistics is the mathematical language for quantifying uncertainty. In experiments, there is always random variation between the treatment groups, even before you make a change for any group of users. Every user is unique, so two groups will almost never have exactly the same average.
Gym example
Let's go back to the weight-loss program trial.
You carried out your experiment as planned, randomizing participants into a control group and a treatment group. You measured the weight of the participants in both groups at the start of the trial. The treatment group started the 6-week weight-loss program right away, while the control group waited 6 weeks. At the end of the 6 weeks, you measured the weight of everyone again.
You ended up with 10 participants in each group. The table below shows each participant's change in weight over the 6 weeks. People in the control group lost 1.3 kg on average, and people who participated in the weight-loss program lost 3.1 kg on average.
| Weight change (kg) | Group average | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Control | -1.3 | +2.1 | -3.3 | -2.7 | -1.8 | +2.1 | -1.8 | +0.9 | -1.2 | -4.5 | -3.0 |
| Treatment | -3.1 | -3.9 | -2.7 | -3.3 | -7.5 | -0.9 | -4.2 | -1.8 | -2.1 | -3.9 | -0.6 |
So, on average, the people in your program lost 1.8 kg more than the people in the control group. Does that mean that your program works? Or could it just be a coincidence?
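As a quick check, the group averages and their difference can be recomputed from the table. This is a minimal Python sketch; the variable names are illustrative and not part of the lesson.

```python
from statistics import mean

# Weight change (kg) per participant, copied from the table above.
control = [2.1, -3.3, -2.7, -1.8, 2.1, -1.8, 0.9, -1.2, -4.5, -3.0]
treatment = [-3.9, -2.7, -3.3, -7.5, -0.9, -4.2, -1.8, -2.1, -3.9, -0.6]

control_avg = mean(control)
treatment_avg = mean(treatment)

# Negative values mean weight lost, so a negative difference favors the program.
difference = treatment_avg - control_avg

print(f"control: {control_avg:.1f} kg, treatment: {treatment_avg:.1f} kg")
print(f"difference: {difference:.1f} kg")
```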
Quantify the noise to detect the signal
People's weights fluctuate somewhat over time. If you just divided 20 people randomly into two groups and measured their change in weight over time, it's quite unlikely that the averages of the two groups would be exactly the same. This random variation causes noise in your measurement.
How can you detect the signal among the noise?
To answer this question, you can think about how much "noise" in the weight measurements you would expect to see even if your program had no impact at all. You can then compare the difference that you found to the amount of noise that you would expect purely from random variation. The amount of noise in the group averages depends on two things: the number of people in each group, and the variance, that is, how much individuals differ from each other. Based on this, you can use statistical theory to calculate how much noise to expect in the measurement. In other words: how likely would it be to find a difference this large just due to random variation?
We'll skip the math here. For the weight-loss example, it turns out that there is an 8% chance of finding a difference of 1.8 kg or more between the two groups based purely on random variation. This calculation assumes that your weight-loss program had absolutely no effect on people's weight.
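One common way to combine group size and variance into a single noise estimate is the standard error of the difference in means, the square root of (variance of group A / size of A + variance of group B / size of B). The lesson does not say which formula it relies on, so treat the following as a sketch under that assumption; the function name is hypothetical.

```python
from math import sqrt
from statistics import variance

# Weight change (kg) per participant, copied from the table above.
control = [2.1, -3.3, -2.7, -1.8, 2.1, -1.8, 0.9, -1.2, -4.5, -3.0]
treatment = [-3.9, -2.7, -3.3, -7.5, -0.9, -4.2, -1.8, -2.1, -3.9, -0.6]


def standard_error_of_difference(a, b):
    """Unpooled standard error of mean(a) - mean(b)."""
    # variance() is the sample variance: how much individuals differ
    # from each other within a group. Dividing by the group size
    # reflects that averages over more people are less noisy.
    return sqrt(variance(a) / len(a) + variance(b) / len(b))


se = standard_error_of_difference(treatment, control)
print(f"standard error of the difference: {se:.2f} kg")
```

With more people per group, or with less variation between individuals, this number shrinks, and a difference of a given size stands out more clearly from the noise.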
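The lesson does not say which test produced the 8% figure. One way to get a number of this kind without much math is an exact permutation test: regroup the 20 measured changes into two groups of 10 in every possible way, and count how often the regrouped difference is at least as large as the observed one. The sketch below assumes a two-sided comparison and uses the table's values; the function name is hypothetical, and the result should be in the same ballpark as the quoted 8% without necessarily matching it exactly.

```python
from itertools import combinations

# Weight changes in tenths of a kg, copied from the table above.
# Integers keep the comparisons exact (no floating-point ties).
control = [21, -33, -27, -18, 21, -18, 9, -12, -45, -30]
treatment = [-39, -27, -33, -75, -9, -42, -18, -21, -39, -6]


def permutation_p_value(a, b):
    """Share of all possible regroupings whose difference in means is at
    least as large, in either direction, as the observed difference."""
    pooled = a + b
    total = sum(pooled)
    n, m = len(a), len(b)
    # mean(a) - mean(b) equals (sum(a) * m - sum(b) * n) / (n * m), so
    # comparing the numerators avoids division altogether.
    observed = abs(sum(a) * m - sum(b) * n)
    extreme = 0
    count = 0
    for group in combinations(pooled, n):
        s = sum(group)
        count += 1
        if abs(s * m - (total - s) * n) >= observed:
            extreme += 1
    return extreme / count


p_value = permutation_p_value(treatment, control)
print(f"chance of a difference this large with no effect: {p_value:.1%}")
```

Enumerating all 184,756 regroupings is feasible here because the groups are small. For larger experiments, a random sample of regroupings or a formula-based test is the usual substitute.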
Use statistics to make a decision
You can use this calculation to make a decision about your program. If you find a small effect, you could easily have seen it even if your program did nothing, so there is not enough evidence to conclude that your program works. If you find a larger effect, it is less likely to be purely due to random variation. You can define a threshold beforehand that decides how "unlikely to be seen purely based on random variation" your result needs to be for you to consider it strong enough evidence.
For example, if you had decided beforehand that you would consider your results strong enough only if they were less than 5% likely to be seen based purely on random variation, then your trial, at 8%, would not have produced strong enough evidence. If, on the other hand, you had set that threshold at 10% beforehand, then your result would have passed the test. We call a result that has passed such a test "statistically significant".
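The decision rule amounts to a single comparison against a threshold chosen in advance. A minimal sketch, with a hypothetical function name and the 8% chance from the weight-loss example:

```python
def is_statistically_significant(p_value, threshold):
    """True when the result is less likely than the pre-set threshold
    to arise from random variation alone."""
    return p_value < threshold


p_value = 0.08  # chance quoted in the lesson for the weight-loss trial

# The threshold must be fixed before looking at the results.
for threshold in (0.05, 0.10):
    passed = is_statistically_significant(p_value, threshold)
    verdict = "significant" if passed else "not significant"
    print(f"threshold {threshold:.0%}: {verdict}")
```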
Don't change the target after you see the results
It is tempting to change the threshold after seeing the results: "OK, we said 5% beforehand, but 8% is still not that bad. If we decide that a threshold of 10% is good enough, then the experiment passes the test!" But to avoid confirmation bias, it is important to define the threshold before seeing the results. Changing the threshold after seeing the results is like shooting an arrow at a wall and then drawing a target around the place where it landed. Drawing a target around the results of an experiment is cheating, and cheating in experimentation leads to worse product decisions. The same applies to changing the metrics after the end of the experiment.
A note on statistical uncertainty for the curious
Statistics doesn't magically know how different any two groups are. However, you have one trick up your sleeve: randomization. By randomizing the treatment assignment, you know how the difference in means between two treatment groups varies across different random treatment assignments. In other words, randomly assigning the treatment to users serves two purposes: it makes the groups similar in all aspects other than the treatment (as discussed in the scientific method lesson), and it 'structures' the noise in the difference-in-means estimator, which allows statistical inference.
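To see this variation directly, you can redo the random assignment many times on the same 20 measurements and look at how much the difference in means moves around. The sketch below treats the weight changes from the gym example as fixed, which amounts to assuming the program had no effect, and only redraws the group labels; the function name, the number of repeats, and the seed are arbitrary choices for illustration.

```python
import random
from statistics import mean, stdev

# All 20 weight changes (kg) from the gym example, control then treatment.
changes = [2.1, -3.3, -2.7, -1.8, 2.1, -1.8, 0.9, -1.2, -4.5, -3.0,
           -3.9, -2.7, -3.3, -7.5, -0.9, -4.2, -1.8, -2.1, -3.9, -0.6]


def rerandomized_differences(values, group_size, repeats, seed=0):
    """Difference in group means for many fresh random assignments."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    diffs = []
    for _ in range(repeats):
        shuffled = rng.sample(values, len(values))  # a new random assignment
        first, second = shuffled[:group_size], shuffled[group_size:]
        diffs.append(sum(first) / len(first) - sum(second) / len(second))
    return diffs


diffs = rerandomized_differences(changes, group_size=10, repeats=5000)
print(f"average difference across assignments: {mean(diffs):.2f} kg")
print(f"typical spread of the difference: {stdev(diffs):.2f} kg")
```

The average difference sits close to zero while the spread does not, which is the noise that an observed difference has to be compared against.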