Lesson 4: Z-tests and how to reject the null hypothesis

Since we know that the mean difference is approximately normally distributed under the null hypothesis when samples are large, we can calculate how unlikely a certain observed difference in means is under the null hypothesis. We can see where on the normal distribution the observed difference lies and say that it's among the x% most unlikely differences under the null hypothesis by calculating the area in the tail of the distribution (which computers happily do for us).
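As a minimal sketch of the tail-area calculation described above, the following uses Python's standard-library `statistics.NormalDist`. The observed difference and the standard error are made-up numbers chosen only for illustration, and the null distribution is assumed to be centered at zero.

```python
from statistics import NormalDist

# Hypothetical numbers, chosen only for illustration.
observed_diff = 0.8   # observed difference in means
standard_error = 0.4  # spread of the sampling distribution under H0

# Assumed sampling distribution of the mean difference under H0 (centered at 0).
null_distribution = NormalDist(mu=0.0, sigma=standard_error)

# Area in the upper tail: the share of differences under H0 that are
# at least as large as the one we observed.
upper_tail_area = 1.0 - null_distribution.cdf(observed_diff)

print(f"Upper tail area: {upper_tail_area:.4f}")  # Upper tail area: 0.0228
```

Here the observed difference sits two standard errors above zero, so it is among roughly the 2.3% most unlikely differences under the null hypothesis.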

This is how hypothesis testing works: we calculate how uncommon the observed difference in means is under the null hypothesis, based on the quantiles of the theoretically known sampling distribution. If the observed difference is among the alpha % most unlikely differences in means you can observe under the null hypothesis, we reject the null hypothesis.

Z-tests

A Z-test is a statistical test used to determine whether there is a significant difference between the means of two treatment groups. It is used when the sample size is large enough for the Central Limit Theorem to make the sampling distribution of the mean difference approximately normal.

The Z-test calculates the Z-score, which is a standardized value that tells us how many standard deviations the observed difference in means is from the expected value under the null hypothesis. This Z-score can then be used to evaluate hypotheses in several ways.

For a Z-test, we use the following formula for the Z-score:

$$Z_{obs} = \frac{\text{Observed statistic} - \text{Value under } H_0}{\text{Standard error of the statistic}}$$

  • Observed statistic: the sample mean (or difference in means) that we observe in our data.
  • Value under $H_0$: the value of the statistic we would expect under the null hypothesis. For example, in a two-sample test, the expected difference in means under the null hypothesis is usually 0.
  • Standard error of the statistic: the description of the variability in the sampling distribution of the statistic under the null hypothesis. A smaller standard error means we are more confident in our estimate of the mean or difference in means.
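The Z-score formula can be sketched in a few lines of Python. Note the assumptions: the lesson does not say how the standard error is obtained, so this sketch uses the common estimate for two independent groups, sqrt(sd_t^2 / n_t + sd_c^2 / n_c). The function names and all numbers are hypothetical.

```python
import math

def standard_error_diff_in_means(sd_treatment, n_treatment, sd_control, n_control):
    """Assumed standard error of a difference in means for two independent groups."""
    return math.sqrt(sd_treatment**2 / n_treatment + sd_control**2 / n_control)

def z_score(observed_statistic, value_under_h0, standard_error):
    """Number of standard errors between the observed statistic and its H0 value."""
    return (observed_statistic - value_under_h0) / standard_error

# Hypothetical summary statistics for illustration.
mean_treatment, mean_control = 10.5, 10.0
se = standard_error_diff_in_means(sd_treatment=5.0, n_treatment=1000,
                                  sd_control=5.0, n_control=1000)
z_obs = z_score(mean_treatment - mean_control, value_under_h0=0.0, standard_error=se)

print(f"Z_obs = {z_obs:.3f}")  # Z_obs = 2.236
```

With these numbers the observed difference of 0.5 lies about 2.2 standard errors above the value expected under the null hypothesis.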

When the Z-score ($Z_{obs}$) is calculated, it can be used in three ways:

  • To compare $Z_{obs}$ to a critical value based on the significance level alpha (denoted $Z_{crit}$).
  • To calculate a confidence interval around the point estimate, indicating how much uncertainty the estimate has.
  • To determine the p-value by finding the probability of observing a Z-score as extreme as (or more extreme than) $Z_{obs}$ under the null hypothesis.

Critical Z values

One way to reject the null hypothesis in a Z-test is to compare the observed Z-score with a so-called critical Z value, denoted $Z_{crit}$. $Z_{crit}$ depends on alpha: it is the Z value such that, across random sampling under the null hypothesis, only a share alpha of the observed Z-scores are larger than it.

For example, $Z_{crit}$ for alpha 0.05 is 1.645, meaning that only 5% of the observed differences in means lie more than 1.645 standard errors above the value under the null hypothesis when the null hypothesis is true.

You reject the null hypothesis if $Z_{obs} > Z_{crit}$.
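The critical value can be looked up as a quantile of the standard normal distribution. The sketch below assumes the one-sided (upper tail) test used in this lesson; the function names are hypothetical.

```python
from statistics import NormalDist

def critical_z(alpha):
    """One-sided critical value: the (1 - alpha) quantile of the standard normal."""
    return NormalDist().inv_cdf(1.0 - alpha)

def reject_by_critical_value(z_obs, alpha):
    """Reject H0 when the observed Z-score exceeds the critical value."""
    return z_obs > critical_z(alpha)

print(f"{critical_z(0.05):.3f}")            # 1.645
print(reject_by_critical_value(2.0, 0.05))  # True
print(reject_by_critical_value(1.2, 0.05))  # False
```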

Confidence intervals

A one-sided lower-bound confidence interval for the difference-in-means point estimate is calculated as

$$L = \text{Observed statistic} - Z_{crit} \times \text{Standard error of the statistic}.$$

Note that the theoretical interval is now between $L$ and $+\infty$. It's common practice to show two-sided confidence intervals, but for the purpose of building intuition for hypothesis testing in online experimentation, we stick to the one-sided version.

This is called a $1-\alpha$ confidence interval. For example, if alpha is 5%, then we call it a 95% confidence interval.

The definition of a confidence interval is: across random samples and treatment assignments, the confidence interval covers the true population treatment effect in at least a share $1-\alpha$ of cases (for example, at least 95% of the time when alpha is 5%).

With a one-sided confidence interval, you reject the null hypothesis if the lower bound is greater than the value under $H_0$. For example, if the difference in means is zero under $H_0$, you reject the null hypothesis if $L > 0$.
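A minimal sketch of the lower-bound calculation and the corresponding rejection rule, using made-up numbers. The function name is hypothetical, and the critical value is taken as the one-sided standard normal quantile.

```python
from statistics import NormalDist

def lower_confidence_bound(observed_statistic, standard_error, alpha):
    """One-sided lower bound L = observed statistic - Z_crit * standard error."""
    z_crit = NormalDist().inv_cdf(1.0 - alpha)
    return observed_statistic - z_crit * standard_error

# Hypothetical estimate and standard error for illustration.
lower_bound = lower_confidence_bound(observed_statistic=0.5,
                                     standard_error=0.2,
                                     alpha=0.05)
value_under_h0 = 0.0

print(f"L = {lower_bound:.3f}")      # L = 0.171
print(lower_bound > value_under_h0)  # True, so we reject H0
```

The 95% interval runs from about 0.171 to plus infinity. Because it excludes zero, the null hypothesis of no difference is rejected.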

P-values

You might have heard about p-values. They quantify how likely it is to obtain the observed mean difference (or a larger difference) if the null hypothesis is true.

You reject the null hypothesis if the p-value is smaller than the selected alpha.

For example, a p-value of 0.055 means that the observed difference is among the 5.5% most unlikely differences under the null hypothesis. If alpha is 10% we reject the null since 5.5% is less than 10%. If alpha is instead 5% we fail to reject the null since 5.5% is larger than 5%.

The smaller the alpha we use for the test, the more unlikely the observed difference must be (so that the p-value is small enough) for us to reject the null hypothesis.
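The p-value rule can be sketched as follows, assuming the one-sided test used throughout this lesson. The Z-score of 1.6 is a hypothetical value picked because it gives a p-value close to the 0.055 in the example above.

```python
from statistics import NormalDist

def one_sided_p_value(z_obs):
    """Probability of a Z-score at least as large as z_obs under H0."""
    return 1.0 - NormalDist().cdf(z_obs)

p_value = one_sided_p_value(1.6)  # hypothetical observed Z-score

print(f"p = {p_value:.3f}")  # p = 0.055
print(p_value < 0.10)        # True: reject H0 at alpha = 10%
print(p_value < 0.05)        # False: fail to reject H0 at alpha = 5%
```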

An interactive example

Use this playground to build intuition for how to reject the null hypothesis using the confidence interval, the observed Z-score, and the p-value. It can also help answer the questions below.
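The three rejection rules can also be compared side by side in code. This is a sketch under the lesson's one-sided setup, with a hypothetical function name and made-up inputs. It illustrates that the Z-score rule, the confidence interval rule, and the p-value rule lead to the same decision.

```python
from statistics import NormalDist

def z_test_decisions(observed_diff, standard_error, alpha, value_under_h0=0.0):
    """Return the reject/not-reject decision from each of the three rules."""
    std_normal = NormalDist()
    z_obs = (observed_diff - value_under_h0) / standard_error
    z_crit = std_normal.inv_cdf(1.0 - alpha)
    lower_bound = observed_diff - z_crit * standard_error
    p_value = 1.0 - std_normal.cdf(z_obs)
    return {
        "z_rule": z_obs > z_crit,                 # Z_obs > Z_crit
        "ci_rule": lower_bound > value_under_h0,  # L > value under H0
        "p_rule": p_value < alpha,                # p-value < alpha
    }

# Hypothetical inputs: a clear effect and a weak one.
print(z_test_decisions(0.5, 0.2, alpha=0.05))
# {'z_rule': True, 'ci_rule': True, 'p_rule': True}
print(z_test_decisions(0.2, 0.2, alpha=0.05))
# {'z_rule': False, 'ci_rule': False, 'p_rule': False}
```

Try changing the observed difference, the standard error, or alpha to see when the decision flips.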

Let's summarize

Hypothesis testing is a mouthful the first time you dig into it. Before we move on to the last two lessons, let's review what we've talked about.

Hypothesis tests let us quantify how likely or unlikely an observed difference-in-means estimate is if the null hypothesis of a zero effect is true. If the observed estimate is very unlikely under the null hypothesis, and much more likely under the alternative hypothesis, we reject the null hypothesis and conclude that the treatment has an effect.

To test if the mean of an outcome metric has improved due to a treatment, we test if the difference in means between treatment and control is larger than zero. For mean differences, we know the probability distribution of the difference in means under the null hypothesis across random samples and treatment assignments is the normal distribution. This lets us calculate how unlikely a certain difference in means is under the null hypothesis.

One thing that might bother you at this point is that, if we reject the null hypothesis whenever we observe a mean difference that is more unlikely than alpha, then we will reject a true null hypothesis in alpha % of all experiments where there truly is no effect. Statistically significant really means "the difference is probably not by chance", or "the difference is unlikely to be only due to random variation".
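A small simulation can make the alpha % false rejection rate concrete. The sketch below is illustrative only: it assumes normally distributed outcomes, equal group sizes, no true effect, and a standard error estimated from the sample variances. The function names, sample sizes, and seed are arbitrary choices.

```python
import math
import random
from statistics import NormalDist

def mean_and_variance(values):
    """Sample mean and unbiased sample variance."""
    n = len(values)
    m = sum(values) / n
    var = sum((v - m) ** 2 for v in values) / (n - 1)
    return m, var

def simulate_false_positive_rate(n_experiments=2000, n_per_group=100,
                                 alpha=0.05, seed=1):
    """Share of experiments with no true effect where a one-sided Z-test rejects H0."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1.0 - alpha)
    rejections = 0
    for _ in range(n_experiments):
        # Both groups come from the same distribution, so H0 is true.
        control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        treatment = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        mean_c, var_c = mean_and_variance(control)
        mean_t, var_t = mean_and_variance(treatment)
        standard_error = math.sqrt(var_t / n_per_group + var_c / n_per_group)
        z_obs = (mean_t - mean_c) / standard_error
        if z_obs > z_crit:
            rejections += 1
    return rejections / n_experiments

rate = simulate_false_positive_rate()
print(f"Share of true nulls rejected: {rate:.3f}")  # expected to be close to 0.05
```

The exact share varies with the seed, but it should land near alpha, which is the risk we accept by choosing that significance level.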

This is where risk management comes into the picture. Hypothesis testing cannot help us reach the right conclusion in any given experiment with complete certainty. It can only help us limit the risks across many experiments. More on this in the following lessons.