Lesson 2: A Refresher on Alpha and Power

Summary

This lesson is a brief summary of lessons 5 and 6 from the Hypothesis Testing course, to make sure you have what you need to understand the sample size calculation course.

Possible outcomes in experiments

In an experiment, a treatment effect either exists or it doesn't, and you either detect an effect or you don't. This gives four possible outcomes, depicted below.

[Figure: Possible outcomes]

Across many experiments, these four outcomes will occur with some rates. That is, if we run 100 experiments, some number of them will end up in each of the four quadrants.

[Figure: Possible outcomes]

In hypothesis testing, alpha is the parameter that controls the rate of false positive results among the experiments that have no effect, and power is the parameter that controls the rate of true positive results among the experiments where there is an effect. Together, alpha and power set the intended error rates of the test: alpha is the intended false positive rate, and 1 - power is the intended false negative rate.
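
To make the four quadrants concrete, here is a minimal sketch in Python. All numbers are hypothetical and chosen only for illustration: 100 experiments, of which 30 are assumed to have a true effect exactly equal to the MDE, evaluated with alpha = 0.05 and power = 0.8. The function name expected_outcomes is ours, not part of any library.

```python
def expected_outcomes(n_experiments, n_with_effect, alpha, power):
    """Expected number of experiments in each of the four quadrants.

    Assumes every experiment with an effect has an effect of exactly MDE,
    so that its chance of detection equals `power`.
    """
    n_without_effect = n_experiments - n_with_effect
    return {
        "true_positive": n_with_effect * power,
        "false_negative": n_with_effect * (1 - power),
        "false_positive": n_without_effect * alpha,
        "true_negative": n_without_effect * (1 - alpha),
    }

# Hypothetical portfolio: 100 experiments, 30 with a true effect.
outcomes = expected_outcomes(n_experiments=100, n_with_effect=30, alpha=0.05, power=0.8)
for name, count in outcomes.items():
    print(f"{name:>15}: {count:.1f}")
```

Under these assumptions we would expect about 24 true positives, 6 false negatives, 3.5 false positives, and 66.5 true negatives. Note that alpha applies only to the 70 experiments without an effect, and power only to the 30 with one.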

Our goal with experimentation is to control the rates of incorrect and correct results. We can trade off between the rates of false positives and false negatives by changing the alpha, power, and sample size of our test. In fact, there are several things that affect the risk handling in experiments, which we will cover in future courses. But for now, let's not get ahead of ourselves.

By using a statistically valid test with a certain alpha, and a sample size large enough for a certain MDE to achieve a certain power, we can:

  • Bound the proportion of experiments without an effect that falsely detect an effect to be at most alpha.
  • Bound the proportion of experiments with an effect of MDE (or larger) that correctly detect that effect to be at least power.
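
The two bounds can be checked with a small simulation. The sketch below is illustrative only and rests on several assumptions that are not from this lesson: a two-sided Z-test, normally distributed data with a known standard deviation, equal group sizes, and the common textbook per-group sample size formula n = 2 * sigma^2 * (z_alpha + z_power)^2 / MDE^2. The values of ALPHA, POWER, SIGMA, and MDE are hypothetical.

```python
import random
from math import ceil, sqrt
from statistics import NormalDist, fmean

ALPHA, POWER = 0.05, 0.80   # intended rates (illustrative values)
SIGMA, MDE = 1.0, 0.5       # assumed known std dev and hypothetical MDE

std_normal = NormalDist()
z_alpha = std_normal.inv_cdf(1 - ALPHA / 2)  # two-sided critical value (assumption)
z_power = std_normal.inv_cdf(POWER)

# Per-group sample size from the textbook two-sample Z-test formula.
n = ceil(2 * SIGMA**2 * (z_alpha + z_power) ** 2 / MDE**2)

def rejects_null(true_effect, rng):
    """Simulate one experiment and run a two-sided Z-test on it."""
    control = [rng.gauss(0.0, SIGMA) for _ in range(n)]
    treatment = [rng.gauss(true_effect, SIGMA) for _ in range(n)]
    z = (fmean(treatment) - fmean(control)) / (SIGMA * sqrt(2 / n))
    return abs(z) > z_alpha

def rejection_rate(true_effect, n_experiments=2000, seed=1):
    """Share of simulated experiments in which the null is rejected."""
    rng = random.Random(seed)
    hits = sum(rejects_null(true_effect, rng) for _ in range(n_experiments))
    return hits / n_experiments

false_positive_rate = rejection_rate(true_effect=0.0)
true_positive_rate = rejection_rate(true_effect=MDE)
print(f"n per group: {n}")
print(f"false positive rate: {false_positive_rate:.3f} (alpha = {ALPHA})")
print(f"true positive rate:  {true_positive_rate:.3f} (power = {POWER})")
```

Because the rates are estimated from a finite number of simulated experiments, they fluctuate around alpha and power rather than matching them exactly. The bounds concern the long-run rates.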

Video recap

If you haven't already, watch this 4-minute and 31-second video to quickly review what we've learned so far:


Win rate across all experiments

Having powered tests does not bound the true positive rate across all experiments you run. It only bounds the true positive rate for the subset of experiments that have a true treatment effect of MDE or larger.

In practice, some experiments will have a non-zero effect smaller than the MDE for which we have designed the test. In those experiments, our chance to detect the treatment effect will be smaller than power.

The best we can do is to make sure that we select MDEs that map to the smallest effect size that is practically relevant for our business. By powering all experiments to detect that effect, we can ensure that our true positive rate is at least power for all experiments in which the true effect is of a relevant size.
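
As a rough illustration of how quickly the chance of detection drops below power, the sketch below computes the power of a test that was sized for a given MDE when the true effect is only a fraction of it. It assumes a two-sided Z-test and the textbook relation MDE / standard error = z_alpha + z_power at the planned sample size. The function name power_at and the default values are ours.

```python
from statistics import NormalDist

std_normal = NormalDist()

def power_at(true_effect, mde, alpha=0.05, design_power=0.80):
    """Power of a two-sided Z-test sized for `mde`, evaluated at `true_effect`."""
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)
    z_power = std_normal.inv_cdf(design_power)
    # At the planned sample size, mde / standard_error = z_alpha + z_power,
    # so the standardized true effect scales with true_effect / mde.
    shift = (true_effect / mde) * (z_alpha + z_power)
    # Probability that the test statistic lands in either rejection region.
    return std_normal.cdf(shift - z_alpha) + std_normal.cdf(-shift - z_alpha)

for fraction in (1.0, 0.75, 0.5, 0.25, 0.0):
    chance = power_at(true_effect=fraction, mde=1.0)
    print(f"true effect = {fraction:.2f} x MDE -> chance to detect = {chance:.3f}")
```

At a true effect equal to the MDE the result is approximately the design power, and at a true effect of zero it equals alpha. Everything in between is detected with a probability between those two values.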


The nonlinearity of alpha and power

It is important to understand how the alpha and power parameters affect the sample size. Because we use a Z-test to evaluate experiments in most cases, a normal distribution underlies the relation between the required sample size, alpha, and power. As a result, the required sample size does not change linearly with alpha or power, which makes it harder to reason about how a given change to alpha or power affects the required sample size.

The alpha z-value

In the sample size calculation, the alpha parameter enters the equation via a z-value. This is the same type of z-score that we have discussed in previous lessons. Here, let's focus not on why the z-value is in the equation, but on the relation between alpha and the z-value.

Alpha enters the sample size formula via a z-value because we are using a Z-test that is based on the asymptotic normality of the difference-in-means estimator. It is good to know that the relation between alpha and z-alpha is nonlinear. This implies that changing alpha by a fixed amount changes the required sample size by different amounts depending on the alpha you started with. For example, changing alpha from 0.02 to 0.01 increases the required sample size more than changing it from 0.1 to 0.09.
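
The comparison between the two alpha changes can be reproduced numerically. The sketch below assumes the textbook per-group formula for a two-sided, two-sample Z-test with equal group sizes, n = 2 * sigma^2 * (z_alpha + z_power)^2 / MDE^2. The helper n_per_group and the values for power, sigma, and mde are hypothetical, and the exact formula used in the course may differ.

```python
from statistics import NormalDist

std_normal = NormalDist()

def n_per_group(alpha, power=0.80, sigma=1.0, mde=0.1):
    """Approximate per-group sample size for a two-sided Z-test (textbook formula)."""
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)
    z_power = std_normal.inv_cdf(power)
    return 2 * sigma**2 * (z_alpha + z_power) ** 2 / mde**2

# Same absolute change in alpha (0.01), two different starting points.
increase_low = n_per_group(0.01) - n_per_group(0.02)
increase_high = n_per_group(0.09) - n_per_group(0.10)

print(f"alpha 0.02 -> 0.01: +{increase_low:.0f} per group")
print(f"alpha 0.10 -> 0.09: +{increase_high:.0f} per group")
```

Under these assumptions, the same 0.01 reduction in alpha costs several times more additional sample when starting from 0.02 than when starting from 0.1.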

Note that the asymptotic normality that the Z-test (and therefore the sample size calculations) is based on doesn't require the underlying data to be normally distributed. Instead, it's the difference-in-means estimator that needs to be approximately normally distributed under the null hypothesis, which it is for many underlying data distributions thanks to the central limit theorem. Learn more about the distribution of the difference-in-means estimator in the Hypothesis Testing course.

The plot below shows how the z-value changes with alpha.
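
In text form, the relation can be tabulated with a few lines of Python. This is a minimal sketch using the standard library's NormalDist. The two_sided flag is our assumption, since the lesson does not state whether the test is one-sided or two-sided.

```python
from statistics import NormalDist

std_normal = NormalDist()

def z_alpha(alpha, two_sided=True):
    """Critical z-value for a given alpha."""
    tail = alpha / 2 if two_sided else alpha
    return std_normal.inv_cdf(1 - tail)

for alpha in (0.20, 0.10, 0.05, 0.01, 0.001):
    print(f"alpha = {alpha:<5} -> z = {z_alpha(alpha):.3f}")
```

The z-value grows slowly for large alphas and increasingly fast as alpha approaches zero, which is the nonlinearity described above.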

The power z-value

The same relation holds for how power enters the sample size formula: it comes in via a z-value, and the relation between power and that z-value is nonlinear.
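
A short sketch of the power z-value, again using only the Python standard library. The helper name z_power and the chosen power levels are ours, for illustration.

```python
from statistics import NormalDist

std_normal = NormalDist()

def z_power(power):
    """z-value corresponding to a given power."""
    return std_normal.inv_cdf(power)

# Equal steps in power lead to unequal steps in the z-value.
steps = [(0.80, 0.85), (0.85, 0.90), (0.90, 0.95)]
for low, high in steps:
    increase = z_power(high) - z_power(low)
    print(f"power {low:.2f} -> {high:.2f}: z increases by {increase:.3f}")
```

Each additional 0.05 of power adds more to the z-value than the previous one, so raising power becomes more expensive in sample size the higher the power already is.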


Reader exercise

What is the primary goal of experimentation in terms of risks?

Reader exercise

What does a statistically valid test with a certain alpha guarantee?

Reader exercise

What happens in experiments where the true effect is smaller than the MDE?

Reader exercise

Which changes the required sample size the most (in absolute numbers): increasing alpha by one unit or decreasing it by one unit?

© Copyright 2026. All rights reserved.