
Lesson 4: Z-tests and how to reject the null hypothesis

Summary

Since we know that the distribution of the mean difference under the null hypothesis is normally distributed with large samples, we can calculate how uncommon certain differences in means would be under the null hypothesis.

Note

To simplify examples, metrics are assumed to improve when they increase. In all hypothesis tests in this lesson, the aim is to find evidence that the metric has increased as opposed to not moved at all.

Since we know that the distribution of the mean difference under the null hypothesis is normally distributed with large samples, we can calculate how unlikely a certain observed difference in means is under the null hypothesis. We can see where on the normal distribution the observed difference lies and say that it's among the x% most unlikely differences under the null hypothesis by calculating the area in the tail of the distribution (which computers happily do for us).

This is how hypothesis testing works: we calculate how uncommon the observed difference in means is under the null hypothesis, based on the quantiles of the theoretically known sampling distribution. If the observed difference is among the alpha % most unlikely differences in means you can observe under the null hypothesis, we reject the null hypothesis.

Z-tests

A Z-test is a statistical test used to determine whether there is a significant difference between the means of two treatment groups. It is used when the sample size is large enough for the Central Limit Theorem to make the sampling distribution of the mean difference approximately normal.

The Z-test calculates the Z-score, which is a standardized value that tells us how many standard deviations the observed difference in means is from the expected value under the null hypothesis. This Z-score can then be used to evaluate hypotheses in several ways.

For a Z-test, we use the following formula for the Z-score:

$$Z_{obs} = \frac{\text{Observed statistic} - \text{Value under } H_0}{\text{Standard error of the statistic}}$$

  • Observed statistic: the sample mean (or difference in means) that we observe in our data.
  • Value under $H_0$: the value of the statistic we would expect under the null hypothesis. For example, in a two-sample test, the expected difference in means under the null hypothesis is usually 0.
  • Standard error of the statistic: a measure of the variability in the sampling distribution of the statistic under the null hypothesis. A smaller standard error means we are more confident in our estimate of the mean or difference in means.
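As a minimal sketch of the formula above, the following Python computes the observed Z-score for a difference in means. The function name `z_score`, the toy data, and the choice of the unpooled standard error (the square root of the sum of each group's sample variance divided by its size) are assumptions made for illustration; the lesson itself does not prescribe a particular standard error estimator. The samples are tiny only to keep the arithmetic easy to follow, whereas a real Z-test relies on large samples.

```python
import math
from statistics import mean, variance


def z_score(treatment, control, value_under_h0=0.0):
    """Return (difference in means, standard error, observed Z-score)."""
    diff = mean(treatment) - mean(control)
    # Assumed estimator: unpooled standard error of a difference in means
    # between two independent samples, using sample variances.
    se = math.sqrt(
        variance(treatment) / len(treatment) + variance(control) / len(control)
    )
    return diff, se, (diff - value_under_h0) / se


# Toy data for illustration only; real experiments need far larger samples.
diff, se, z_obs = z_score([2, 4, 6, 8], [1, 3, 5, 7])
print(f"difference={diff:.3f}, standard error={se:.3f}, Z_obs={z_obs:.3f}")
```

The Z-score is simply the observed difference re-expressed in units of standard errors, so halving the standard error doubles the Z-score for the same observed difference.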

When the Z-score ($Z_{obs}$) is calculated, it can be used in three ways:

  • To compare $Z_{obs}$ to a critical value based on the significance level alpha (denoted $Z_{crit}$).
  • To calculate a confidence interval around the point estimate, indicating how much uncertainty the estimate has.
  • To determine the p-value by finding the probability of observing a Z-score as extreme as (or more extreme than) $Z_{obs}$ under the null hypothesis.

Note

The Z-score ($Z_{obs}$) is the distance between the observed mean difference and the value under $H_0$ in terms of number of standard errors.

Critical Z values

One way to reject the null hypothesis in a Z-test is to compare the observed Z-score with a so-called critical Z value, denoted $Z_{crit}$. $Z_{crit}$ depends on alpha: it is the Z value that only a fraction alpha of observed Z-scores exceed across random sampling under the null hypothesis.

For example, $Z_{crit}$ for alpha 0.05 is 1.645, meaning that only 5% of observed differences in means are more than 1.645 standard errors above the value under the null hypothesis when the null hypothesis is true.

You reject the null hypothesis if $Z_{obs} > Z_{crit}$.

Note

$Z_{crit}$ is the number of standard errors that an observed statistic needs to be above the value under the null hypothesis to be significantly different from it.
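To make the critical value concrete, here is a small sketch using `statistics.NormalDist` from the Python standard library. The helper names `z_crit` and `reject_by_critical_value` are hypothetical, and the test is the one-sided (upper-tail) version used throughout this lesson.

```python
from statistics import NormalDist


def z_crit(alpha):
    """One-sided critical value: the point with area alpha in the upper tail."""
    return NormalDist().inv_cdf(1 - alpha)


def reject_by_critical_value(z_obs, alpha=0.05):
    """Reject H0 when the observed Z-score exceeds the critical value."""
    return z_obs > z_crit(alpha)


for alpha in (0.10, 0.05, 0.01):
    print(f"alpha={alpha:.2f} -> Z_crit={z_crit(alpha):.4f}")
```

A smaller alpha pushes the critical value further into the tail, so the observed difference has to be more standard errors above the null value before it counts as significant.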

Confidence intervals

The lower bound of a one-sided confidence interval for the difference-in-means point estimate is calculated as

$$L = \text{Observed statistic} - Z_{crit} \times \text{Standard error of the statistic}$$

Note that the theoretical interval is now between $L$ and $+\infty$. It's common practice to show two-sided confidence intervals, but for the purpose of building intuition for hypothesis testing in online experimentation, we stick to the one-sided interval.

This is called a $1-\alpha$ confidence interval. For example, if alpha is 5%, then we call it a 95% confidence interval.

The definition of a confidence interval is: across random samples and treatment assignments, the confidence interval covers the true population treatment effect at least $100(1-\alpha)$% of the time.

With a one-sided confidence interval, you reject the null hypothesis if the lower bound is greater than the value under $H_0$. For example, if the difference in means is zero under $H_0$, you reject the null hypothesis if $L > 0$.
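The lower bound can be computed directly from the observed statistic and its standard error. The sketch below assumes a hypothetical helper named `lower_bound` and reuses the standard error of roughly 2.24 (the square root of a variance of 5) from the interactive example later in this lesson.

```python
import math
from statistics import NormalDist


def lower_bound(observed, standard_error, alpha=0.05):
    """Lower bound L of the one-sided (1 - alpha) confidence interval."""
    z_crit = NormalDist().inv_cdf(1 - alpha)
    return observed - z_crit * standard_error


se = math.sqrt(5)  # standard error used in the lesson's playground
for observed in (0.0, 4.0):
    L = lower_bound(observed, se)
    decision = "reject H0" if L > 0 else "fail to reject H0"
    print(f"observed={observed:.1f}, L={L:.4f} -> {decision}")
```

With an observed difference of 0, the bound is about -3.678, so the interval contains 0 and the null hypothesis is not rejected.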

P-values

You might have heard about p-values. They quantify how likely it is to obtain the observed mean difference (or a larger difference) if the null hypothesis is true.

You reject the null hypothesis if the p-value is smaller than the selected alpha.

For example, a p-value of 0.055 means that the observed difference is among the 5.5% most unlikely differences under the null hypothesis. If alpha is 10% we reject the null since 5.5% is less than 10%. If alpha is instead 5% we fail to reject the null since 5.5% is larger than 5%.

The smaller the alpha we use for the test, the more unlikely the observed difference must be (so that the p-value is small enough) for us to reject the null hypothesis.
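For a one-sided test, the p-value is the area of the standard normal distribution to the right of the observed Z-score. A minimal sketch, with `one_sided_p_value` as an illustrative (not official) helper name:

```python
from statistics import NormalDist


def one_sided_p_value(z_obs):
    """Probability of a Z-score at least as large as z_obs under H0."""
    return 1 - NormalDist().cdf(z_obs)


p_value = one_sided_p_value(1.598)  # a Z-score chosen to give p close to 0.055
for alpha in (0.10, 0.05):
    decision = "reject H0" if p_value < alpha else "fail to reject H0"
    print(f"p-value={p_value:.3f}, alpha={alpha:.2f} -> {decision}")
```

The same p-value leads to different decisions at alpha 10% and alpha 5%, which mirrors the 0.055 example above.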

An interactive example

Use this playground to build intuition for how to reject the null hypothesis using the confidence interval, the observed Z-score, and the p-value. It can also help you answer the questions below.

Sampling distribution under H₀ with variance = 5 (the standard error of the estimator is 2.24). The playground state captured here corresponds to an observed difference of 0 with alpha = 0.05:

  • P-value test: p-value = 0.5000 > 0.050, so fail to reject H₀.
  • Critical value test: Z_obs = 0.0000 < 1.6449, so fail to reject H₀.
  • Confidence interval test: L = -3.6780 ≤ 0.0, so fail to reject H₀.
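For readers without access to the interactive playground, the following sketch runs all three tests side by side under the same setup (variance 5, so a standard error of about 2.24, and alpha 0.05). The function `run_all_tests` and the chosen observed differences are assumptions for illustration; this is not the playground's actual implementation.

```python
import math
from statistics import NormalDist


def run_all_tests(observed, standard_error, alpha=0.05, value_under_h0=0.0):
    """Evaluate one observed difference with all three one-sided decision rules."""
    std_normal = NormalDist()
    z_obs = (observed - value_under_h0) / standard_error
    z_crit = std_normal.inv_cdf(1 - alpha)
    p_value = 1 - std_normal.cdf(z_obs)
    lower = observed - z_crit * standard_error
    return {
        "z_obs": z_obs,
        "p_value": p_value,
        "lower_bound": lower,
        "reject_by_z": z_obs > z_crit,
        "reject_by_p": p_value < alpha,
        "reject_by_ci": lower > value_under_h0,
    }


se = math.sqrt(5)
for observed in (0.0, 2.0, 4.0, 6.0):
    r = run_all_tests(observed, se)
    print(
        f"observed={observed:.1f}: Z_obs={r['z_obs']:.4f}, "
        f"p={r['p_value']:.4f}, L={r['lower_bound']:.4f}, "
        f"reject: {r['reject_by_z']}/{r['reject_by_p']}/{r['reject_by_ci']}"
    )
```

The three rules are different views of the same comparison, so they give the same decision for every observed difference.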

Let's summarize

Hypothesis testing is a mouthful the first time you dig into it. Before we move on to the last two lessons, let's review what we've talked about.

Hypothesis tests let us quantify how likely or unlikely an observed difference-in-means estimate is if the null hypothesis of a zero effect is true. If the observed estimate is very unlikely under the null hypothesis, and much more likely under the alternative hypothesis, we reject the null hypothesis and conclude that the treatment has an effect.

To test if the mean of an outcome metric has improved due to a treatment, we test if the difference in means between treatment and control is larger than zero. For mean differences, we know the probability distribution of the difference in means under the null hypothesis across random samples and treatment assignments is the normal distribution. This lets us calculate how unlikely a certain difference in means is under the null hypothesis.

One thing that might bother you at this point is that, if we reject the null hypothesis whenever the observed mean difference is among the alpha % most unlikely differences under the null hypothesis, then we will reject a true null hypothesis in alpha % of all experiments where there truly is no effect. Statistically significant really means "the difference is probably not by chance", or "the difference is unlikely to be only due to random variation".
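A quick simulation can illustrate this point. The sketch below draws many observed differences from the sampling distribution under a true null hypothesis (zero effect) and counts how often the one-sided Z-test rejects. The function name, the number of simulated experiments, the seed, and the standard error are all arbitrary choices made for illustration.

```python
import random
from statistics import NormalDist


def false_positive_rate(n_experiments=20_000, alpha=0.05, standard_error=5 ** 0.5, seed=1):
    """Share of simulated experiments that reject H0 although H0 is true."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha)
    rejections = 0
    for _ in range(n_experiments):
        # H0 is true: the observed difference is pure noise around zero.
        observed = rng.gauss(0.0, standard_error)
        if observed / standard_error > z_crit:
            rejections += 1
    return rejections / n_experiments


print(f"Share of true nulls rejected: {false_positive_rate():.4f}")
```

The printed share should land close to alpha, which is the sense in which alpha limits risk across many experiments rather than guaranteeing the right call in any single one.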

This is where risk management comes into the picture. Hypothesis testing cannot help us reach the right conclusion in any given experiment with complete certainty. It can only help us limit the risks across many experiments. More on this in the following lessons.

Reader exercise

What does a p-value represent?

Reader exercise

When do we reject the null hypothesis?

Reader exercise

What is the relation between the observed z-score, the observed mean difference, and the observed standard error?


© Copyright 2026. All rights reserved.
