Lesson 4: Z-tests and how to reject the null hypothesis
Because the difference in means is approximately normally distributed under the null hypothesis when samples are large, we can calculate how uncommon certain differences in means would be under the null hypothesis.
To simplify examples, metrics are assumed to improve when they increase. In all hypothesis tests in this lesson, the aim is to find evidence that the metric has increased as opposed to not moved at all.
Because the difference in means is approximately normally distributed under the null hypothesis when samples are large, we can calculate how unlikely a certain observed difference in means is under the null hypothesis. We locate the observed difference on the normal distribution and calculate the area in the tail beyond it (which computers happily do for us). That area lets us say that the observed difference is among the x% most unlikely differences under the null hypothesis.
This is how hypothesis testing works: we calculate how uncommon the observed difference in means is under the null hypothesis, based on the quantiles of the theoretically known sampling distribution. If the observed difference is among the alpha % most unlikely differences in means that you can observe under the null hypothesis, we reject the null hypothesis.
Z-tests
A Z-test is a statistical test used to determine whether there is a significant difference between the means of two treatment groups. It is used when the sample size is large enough for the Central Limit Theorem to make the sampling distribution of the mean difference approximately normal.
The Z-test calculates the Z-score, which is a standardized value that tells us how many standard deviations the observed difference in means is from the expected value under the null hypothesis. This Z-score can then be used to evaluate hypotheses in several ways.
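For a Z-test, we use the following formula for the Z-score:
$$Z = \frac{\text{Observed statistic} - \text{Value under } H_0}{\text{Standard error of the statistic}}$$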
- Observed statistic: the sample mean (or difference in means) that we observe in our data.
- Value under $H_0$: the value of the statistic we would expect under the null hypothesis. For example, in a two-sample test, the expected difference in means under the null hypothesis is usually 0.
- Standard error of the statistic: a measure of the variability in the sampling distribution of the statistic under the null hypothesis. A smaller standard error means we are more confident in our estimate of the mean or difference in means.
Once the Z-score ($Z$) is calculated, it can be used in three ways:
- To compare $Z$ to a critical value based on the significance level alpha (denoted $Z_{\alpha}$).
- To calculate a confidence interval around the point estimate, indicating how much uncertainty the estimate has.
- To determine the p-value by finding the probability of observing a Z-score as extreme as (or more extreme than) $Z$ under the null hypothesis.
The Z-score ($Z$) is the distance between the observed mean difference and the value under $H_0$, measured in number of standard errors.
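As a minimal sketch of the Z-score formula for a two-sample test, the snippet below divides the observed difference in means (minus the value under the null hypothesis) by its standard error. The standard error formula used here (the square root of the sum of each group's variance divided by its sample size) is an assumption that is not spelled out in this lesson, and the function name and all numbers are hypothetical.

```python
import math


def z_score(mean_treatment, mean_control, var_treatment, var_control,
            n_treatment, n_control, value_under_h0=0.0):
    """Z-score for a difference in means between two independent groups."""
    observed_diff = mean_treatment - mean_control
    # Assumed standard error of the difference in means (unpooled variances).
    standard_error = math.sqrt(var_treatment / n_treatment
                               + var_control / n_control)
    # Distance from the null value, measured in standard errors.
    return (observed_diff - value_under_h0) / standard_error


# Hypothetical experiment: 1000 users per group, variance 4.0 in both groups.
z = z_score(mean_treatment=10.3, mean_control=10.0,
            var_treatment=4.0, var_control=4.0,
            n_treatment=1000, n_control=1000)
print(round(z, 3))  # roughly 3.35 standard errors above the null value
```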
Critical Z values
One way to reject the null hypothesis in a Z-test is to compare the observed Z-score with a so-called critical Z value, denoted $Z_{\alpha}$. $Z_{\alpha}$ depends on alpha: it is the Z value such that, across random sampling under the null hypothesis, only a fraction alpha of the observed Z-scores are larger than it.
For example, $Z_{\alpha}$ for alpha 0.05 is 1.645, meaning that only 5% of the observed differences in means are more than 1.645 standard errors above the value under the null hypothesis when the null hypothesis is true.
You reject the null hypothesis if $Z > Z_{\alpha}$.
$Z_{\alpha}$ is the number of standard errors that an observed statistic needs to be above the value under the null hypothesis to be significantly different from it.
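The critical value can be looked up numerically. The sketch below uses `NormalDist` from Python's standard `statistics` module to find the one-sided critical Z value for a given alpha and applies the rejection rule. The helper names `critical_z` and `reject_null` are hypothetical.

```python
from statistics import NormalDist


def critical_z(alpha):
    """One-sided critical value: the (1 - alpha) quantile of the standard normal."""
    return NormalDist().inv_cdf(1 - alpha)


def reject_null(z, alpha):
    """Reject the null hypothesis when the observed Z-score exceeds the critical value."""
    return z > critical_z(alpha)


print(round(critical_z(0.05), 3))  # 1.645
print(reject_null(2.0, 0.05))      # True: 2.0 is above the critical value
print(reject_null(1.2, 0.05))      # False: 1.2 is below the critical value
```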
Confidence intervals
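A one-sided lower-bound confidence interval for the difference-in-means point estimate is calculated as
$$L = \text{Point estimate} - Z_{\alpha} \times \text{Standard error of the statistic},$$
where $Z_{\alpha}$ is the critical Z value for the chosen alpha.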
Note that the theoretical interval is now between L and + infinity. It's common practice to show two-sided confidence intervals, but for the purpose of building intuition for hypothesis testing in online experimentation, we stick to the one-sided.
This is called a $100(1-\alpha)$% confidence interval. For example, if alpha is 5%, then we call it a 95% confidence interval.
The definition of a confidence interval is: across random samples and treatment assignments, the confidence interval covers the true population treatment effect at least $100(1-\alpha)$% of the time.
With a one-sided confidence interval, you reject the null hypothesis if Lower bound > Value under $H_0$. For example, if the difference-in-means is zero under $H_0$, you reject the null hypothesis if $L > 0$.
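To make the lower-bound rule concrete, here is a small sketch that computes the one-sided lower bound as the point estimate minus the critical Z value times the standard error, and then compares it with the value under the null hypothesis. The function name and the input numbers are hypothetical.

```python
from statistics import NormalDist


def lower_bound(point_estimate, standard_error, alpha):
    """Lower bound of a one-sided confidence interval [L, +infinity)."""
    z_crit = NormalDist().inv_cdf(1 - alpha)
    return point_estimate - z_crit * standard_error


# Hypothetical estimate: difference in means 0.3 with standard error 0.0894.
L = lower_bound(point_estimate=0.3, standard_error=0.0894, alpha=0.05)
value_under_h0 = 0.0

print(round(L, 3))         # roughly 0.153
print(L > value_under_h0)  # True, so the null hypothesis is rejected
```

Note that `L > 0` holds exactly when the Z-score (here 0.3 / 0.0894) exceeds the critical value, so the two decision rules agree.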
P-values
You might have heard about p-values. A p-value quantifies how likely it is to obtain the observed mean difference (or a larger difference) if the null hypothesis is true.
You reject the null hypothesis if the p-value is smaller than the selected alpha.
For example, a p-value of 0.055 means that the observed difference is among the 5.5% most unlikely differences under the null hypothesis. If alpha is 10% we reject the null since 5.5% is less than 10%. If alpha is instead 5% we fail to reject the null since 5.5% is larger than 5%.
The smaller the alpha we use for the test, the more unlikely the observed difference must be (so that the p-value is small enough) for us to reject the null hypothesis.
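The p-value rule can be sketched in a few lines. The one-sided p-value is the area in the upper tail of the standard normal distribution beyond the observed Z-score. The Z-score of 1.598 below is a hypothetical value chosen so that the p-value lands near the 0.055 used in the example above.

```python
from statistics import NormalDist


def one_sided_p_value(z):
    """Probability of a Z-score at least as large as z under the null hypothesis."""
    return 1 - NormalDist().cdf(z)


p = one_sided_p_value(1.598)  # hypothetical observed Z-score
print(round(p, 3))            # roughly 0.055

for alpha in (0.10, 0.05):
    decision = "reject" if p < alpha else "fail to reject"
    print(f"alpha={alpha}: {decision} the null hypothesis")
```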
An interactive example
Use this playground to build intuition for how to reject the null using the confidence interval, the observed Z-score, and the p-value. It can also help answer the questions below.
Let's summarize
Hypothesis testing is a mouthful the first time you dig into it. Before we move on to the last two lessons, let's review what we've talked about.
Hypothesis tests let us quantify how likely or unlikely an observed difference-in-means estimate is if the null hypothesis of a zero effect is true. If the observed estimate is very unlikely under the null hypothesis, and much more likely under the alternative hypothesis, we reject the null hypothesis and conclude that the treatment has an effect.
To test if the mean of an outcome metric has improved due to a treatment, we test if the difference in means between treatment and control is larger than zero. For mean differences, we know that, with large samples, the distribution of the difference in means under the null hypothesis across random samples and treatment assignments is approximately normal. This lets us calculate how unlikely a certain difference in means is under the null hypothesis.
One thing that might bother you at this point is that, if we reject the null hypothesis whenever we observe a mean difference that is more unlikely than alpha, then we will reject a true null hypothesis in alpha % of all experiments where there truly is no effect. Statistically significant really means "the difference is probably not by chance", or "the difference is unlikely to be only due to random variation".
This is where risk management comes into the picture. Hypothesis testing cannot help us reach the right conclusion in any given experiment with complete certainty. It can only help us limit the risks across many experiments. More on this in the following lessons.
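The claim that a true null hypothesis gets rejected in roughly alpha of all experiments can be checked with a simulation. The sketch below runs many hypothetical experiments in which treatment and control are drawn from the same distribution (so there is truly no effect) and counts how often the one-sided Z-test rejects. The data-generating process, sample sizes, seed, and function name are all assumptions made for illustration.

```python
import math
import random
from statistics import NormalDist, fmean, variance


def false_positive_rate(alpha, n_experiments=2000, n_per_group=100, seed=1):
    """Share of no-effect experiments in which the one-sided Z-test rejects."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha)
    rejections = 0
    for _ in range(n_experiments):
        # Both groups come from the same distribution: the null is true.
        control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        treatment = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        diff = fmean(treatment) - fmean(control)
        se = math.sqrt(variance(treatment) / n_per_group
                       + variance(control) / n_per_group)
        if diff / se > z_crit:
            rejections += 1
    return rejections / n_experiments


rate = false_positive_rate(alpha=0.05)
print(rate)  # should land close to 0.05, up to simulation noise
```

The result varies with the seed and the number of simulated experiments, but it should stay close to the chosen alpha.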