Confidence

Lesson 4: Number of comparisons

Summary

This lesson explains how the number of comparisons affects the required sample size in experiments. All else being equal, the more treatment groups (and therefore comparisons) you have, the smaller the alpha you must use for each metric to keep the false positive rate of the overall decision at or below alpha. A smaller per-metric alpha leads to a larger required sample size. The number of comparisons does not affect the power you need to use per metric.

The impact of multiple comparisons

The most common pattern in product A/B tests is to compare all treatment groups against a control group. This means there are as many comparisons as there are treatment groups being tested.

In principle, it is also possible to compare all treatment groups against each other. This would mean the number of comparisons equals the number of pairs of treatments.
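The two comparison patterns above can be counted with a short sketch. The function names are hypothetical, and the all-pairs count assumes "pairs of treatments" means unordered pairs among the treatment groups only; if the control group is also part of the pairwise comparison, pass the number of treatments plus one.

```python
from math import comb


def comparisons_vs_control(num_treatments: int) -> int:
    """Each treatment group is compared once against the control group."""
    return num_treatments


def comparisons_all_pairs(num_groups: int) -> int:
    """Every unordered pair of groups is compared once: k * (k - 1) / 2."""
    return comb(num_groups, 2)


if __name__ == "__main__":
    for k in (2, 3, 5, 10):
        print(k, comparisons_vs_control(k), comparisons_all_pairs(k))
```

The all-pairs count grows quadratically with the number of groups, while the against-control count grows linearly.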

The number of comparisons affects the required sample size: the more comparisons, the more samples are required. This is because the probability of making at least one Type I error (false positive) increases with the number of comparisons. To counter this, we use a smaller alpha for each comparison, and a smaller alpha requires a larger sample to reach the same power.
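To make the link between the adjusted alpha and the sample size concrete, here is a minimal sketch. It rests on assumptions the lesson does not state: a Bonferroni correction (alpha divided by the number of comparisons), a two-sided two-sample z-test with equal group sizes, and a known standard deviation. The function name and parameters are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def required_n_per_group(alpha: float, power: float, sigma: float,
                         mde: float, num_comparisons: int = 1) -> int:
    """Per-group sample size for a two-sided two-sample z-test.

    Assumption: alpha is Bonferroni-adjusted, i.e. divided by the number
    of comparisons. The power is left unchanged.
    """
    adjusted_alpha = alpha / num_comparisons
    z_alpha = NormalDist().inv_cdf(1 - adjusted_alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    # n = 2 * sigma^2 * (z_alpha + z_power)^2 / mde^2
    return ceil(2 * (sigma * (z_alpha + z_power) / mde) ** 2)


if __name__ == "__main__":
    for m in (1, 2, 5, 10):
        n = required_n_per_group(alpha=0.05, power=0.8, sigma=1.0,
                                 mde=0.1, num_comparisons=m)
        print(f"{m} comparison(s): {n} users per group")
```

Under these assumptions, only the alpha term changes with the number of comparisons, so the required sample size per group grows as comparisons are added while the power target stays fixed.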

The intuition behind this adjustment is that the more tests we run, the more chances there are to find a significant result by random chance. For example, if we run an experiment with 100 treatments and an alpha of 10%, even if no treatment has any effect, we would expect to see 10 treatments with a (false positive) significant result just by chance.
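The arithmetic in this example can be checked with a small sketch. The expected count of false positives is the number of comparisons times alpha. The probability of at least one false positive is computed here under the simplifying assumption that the comparisons are independent, which does not hold exactly when all treatments share one control group. The Bonferroni adjustment shown is one common choice, not necessarily the adjustment used by any particular tool.

```python
def expected_false_positives(num_comparisons: int, alpha: float) -> float:
    """Expected number of significant results when no treatment has an effect."""
    return num_comparisons * alpha


def familywise_error_rate(num_comparisons: int, alpha: float) -> float:
    """Probability of at least one false positive, assuming independent tests."""
    return 1 - (1 - alpha) ** num_comparisons


if __name__ == "__main__":
    m, alpha = 100, 0.10
    print("expected false positives:", expected_false_positives(m, alpha))
    print("P(at least one), unadjusted:", familywise_error_rate(m, alpha))
    # Assumed Bonferroni adjustment: test each comparison at alpha / m.
    print("P(at least one), adjusted:", familywise_error_rate(m, alpha / m))
```

With the per-comparison alpha divided by the number of comparisons, the probability of at least one false positive stays below the original alpha.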

Reader exercise

How does the number of comparisons in an experiment affect the required sample size?

Reader exercise

Why is the alpha level adjusted when there are multiple comparisons in an experiment?

Notes for nerds

Some people wonder what to do if more than one treatment is significantly better than the control group. This is a deep question. You can test the treatments against each other to see if one is better than the other. However, the difference between two treatment groups is likely smaller than the difference between either of them and the control group. At the same sample size, this makes the power to detect a difference between treatments lower than the power to detect a difference between a treatment and the control group.
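The drop in power for smaller differences can be illustrated with a minimal sketch, assuming a two-sided two-sample z-test with equal group sizes and a known standard deviation. The function name and the effect sizes in the example are hypothetical and chosen only for illustration.

```python
from math import sqrt
from statistics import NormalDist


def power_two_sided_z(effect: float, sigma: float,
                      n_per_group: int, alpha: float) -> float:
    """Power of a two-sided two-sample z-test for a given true difference."""
    std_error = sigma * sqrt(2 / n_per_group)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    shift = abs(effect) / std_error
    # Probability that the test statistic falls in either rejection region.
    upper = 1 - NormalDist().cdf(z_crit - shift)
    lower = NormalDist().cdf(-z_crit - shift)
    return upper + lower


if __name__ == "__main__":
    n = 1570  # per-group size that gives roughly 80% power for effect 0.1
    print("treatment vs control (effect 0.10):",
          round(power_two_sided_z(0.10, 1.0, n, 0.05), 3))
    print("treatment vs treatment (effect 0.03):",
          round(power_two_sided_z(0.03, 1.0, n, 0.05), 3))
```

A sample size planned for the treatment-versus-control effect gives much lower power when the true difference between two treatments is a fraction of that effect.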

There are more advanced multiple-comparison methods that can help when comparing many treatments, such as Tukey's method, Scheffé's method, and Dunnett's test. However, we don't recommend using these methods because of the complexity involved in learning how to use them. Instead, you should gather stakeholders to decide which of the significant treatments to implement based on factors such as:

  • Complexity
  • Cost
  • Future extensibility


© Copyright 2026. All rights reserved.
