Core Experimentation

What is a Null Hypothesis?

The null hypothesis is the default assumption in a statistical test that there is no difference between the treatment and control groups. In A/B testing, it states: the change you made had no effect on the metric. The experiment's job is to gather enough evidence to either reject this assumption or fail to reject it.

Adopting the null hypothesis is not a claim that you believe the change does nothing. It's a starting point for the statistical test. You assume no effect, then ask: given this assumption, how unlikely is data at least as extreme as what I observed? If such data would be very unlikely under the null hypothesis (typically less than 5% probability), you reject the null and conclude the treatment had a real effect. This framework is the foundation of frequentist hypothesis testing, which is the statistical approach Confidence uses for experiment analysis.

How does the null hypothesis work in an A/B test?

Every A/B test in Confidence implicitly tests a null hypothesis. When you set up an experiment with a success metric, the platform frames the analysis as:

  • Null hypothesis (H0): The treatment has no effect on the success metric. The true difference between groups is zero.
  • Alternative hypothesis (H1): The treatment has a non-zero effect on the success metric.

The statistical test then computes the probability of seeing a result at least as extreme as the one observed, assuming the null is true. This probability is the p-value. If the p-value falls below the significance level (typically 0.05), the result is statistically significant, and you reject the null hypothesis.
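
As a rough illustration of the p-value calculation, here is a minimal sketch of a two-sided z-test on conversion rates. It uses a textbook pooled two-proportion test, which is an assumption and not necessarily the method Confidence runs. The function name and the example counts are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(conv_c, n_c, conv_t, n_t):
    """Two-sided p-value for H0: the true conversion rates are equal."""
    p_c = conv_c / n_c
    p_t = conv_t / n_t
    # Under H0 both groups share one rate, so pool them to estimate it.
    pooled = (conv_c + conv_t) / (n_c + n_t)
    se = sqrt(pooled * (1 - pooled) * (1 / n_c + 1 / n_t))
    z = (p_t - p_c) / se
    # Probability of a z-statistic at least this extreme in either direction.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


ALPHA = 0.05  # significance level (hypothetical choice)

# Hypothetical data: 10.0% conversion in control, 11.0% in treatment.
z, p = two_proportion_p_value(conv_c=1000, n_c=10000, conv_t=1100, n_t=10000)
decision = "reject H0" if p < ALPHA else "fail to reject H0"
print(f"z = {z:.2f}, p = {p:.4f}, {decision}")
```

With these made-up counts the p-value lands a little above 0.02, so the sketch rejects the null at the 0.05 level. Identical counts in both groups would give a p-value of 1.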

Failing to reject the null doesn't prove the change had no effect. It means the experiment didn't find enough evidence to conclude that it did. The distinction matters. An underpowered experiment will fail to reject the null most of the time, even when the treatment has a real but small effect. This is why statistical power and sample size calculations are part of the experiment design in Confidence: they help ensure the experiment has a high probability of detecting the effect size you care about.
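
To make the link between power and sample size concrete, the sketch below uses the standard normal-approximation formula for comparing two conversion rates. It is a simplified textbook calculation, not Confidence's actual power analysis. The function names, the 10% baseline, and the one-percentage-point effect are assumptions chosen for illustration.

```python
from math import ceil, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def rate_variance(baseline, effect):
    """Sum of the per-user variances of the control and treatment rates."""
    treated = baseline + effect
    return baseline * (1 - baseline) + treated * (1 - treated)


def sample_size_per_group(baseline, effect, alpha=0.05, power=0.8):
    """Users per group needed to detect an absolute lift `effect` (two-sided)."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = STD_NORMAL.inv_cdf(power)
    return ceil((z_alpha + z_power) ** 2 * rate_variance(baseline, effect) / effect ** 2)


def approx_power(baseline, effect, n_per_group, alpha=0.05):
    """Approximate chance of rejecting H0 when the true lift is `effect`."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha / 2)
    se = sqrt(rate_variance(baseline, effect) / n_per_group)
    # Ignores the negligible chance of rejecting in the wrong direction.
    return STD_NORMAL.cdf(abs(effect) / se - z_alpha)


needed = sample_size_per_group(baseline=0.10, effect=0.01)
small_test_power = approx_power(baseline=0.10, effect=0.01, n_per_group=2000)
print(f"needed per group for 80% power: {needed}")
print(f"power with 2,000 per group: {small_test_power:.0%}")
```

Under these assumptions the experiment needs on the order of 15,000 users per group. With only 2,000 per group the power is below 20%, so the test would fail to reject the null in most runs even though the effect is real.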

How does the null hypothesis apply to guardrail metrics?

For success metrics, the null hypothesis is "no difference" and you're looking for evidence of improvement. For guardrail metrics, the framing is different.

Confidence uses non-inferiority testing for guardrails. The null hypothesis for a guardrail metric is: the treatment is worse than control (or worse by more than a specified margin). You're looking for evidence to reject this null, meaning you want to confirm the treatment doesn't cause harm. This inversion is deliberate. With success metrics, a false positive means shipping a change that doesn't actually help. Annoying, but not destructive. With guardrails, a false negative means missing a real regression. That can be destructive.
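
The inverted null can be sketched as a one-sided z-test against a margin. This is a generic textbook non-inferiority test for a metric where higher is better, not Confidence's implementation. The function name, the margin of one percentage point, and the counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def non_inferiority_p_value(conv_c, n_c, conv_t, n_t, margin):
    """One-sided p-value for H0: treatment rate <= control rate - margin."""
    p_c = conv_c / n_c
    p_t = conv_t / n_t
    # Unpooled standard error, because H0 no longer says the rates are equal.
    se = sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
    # Shift the observed difference by the margin: z is large when the
    # treatment is clearly better than "worse by the full margin".
    z = (p_t - p_c + margin) / se
    return 1 - NormalDist().cdf(z)


ALPHA = 0.05    # hypothetical significance level
MARGIN = 0.01   # hypothetical tolerance: one percentage point

# Treatment is 0.1 points lower: well inside the margin.
p_small_drop = non_inferiority_p_value(1000, 10000, 990, 10000, MARGIN)
# Treatment is 0.8 points lower: too close to the margin to rule out harm.
p_large_drop = non_inferiority_p_value(1000, 10000, 920, 10000, MARGIN)

for label, p in [("small drop", p_small_drop), ("large drop", p_large_drop)]:
    verdict = "non-inferior" if p < ALPHA else "cannot rule out harm"
    print(f"{label}: p = {p:.3f}, {verdict}")
```

Rejecting the null here is the good outcome: it is evidence that any harm is smaller than the margin. Failing to reject leaves the guardrail unresolved, even when the observed drop is itself smaller than the margin.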

This is why Confidence's decision framework, described in the risk-aware product decisions paper, doesn't apply multiple testing corrections to guardrail metrics. The cost of missing a guardrail violation outweighs the cost of a false alarm on a guardrail. The false negative rate for guardrails is controlled instead, which is the opposite trade-off from success metrics.

What are common misunderstandings about the null hypothesis?

"Failing to reject the null means the treatment doesn't work." It means you didn't find sufficient evidence. Maybe the treatment has a small effect your test wasn't powered to detect. At Spotify, where experiments produce a 12% win rate, most experiments fail to reject the null on the success metric. That doesn't mean 88% of product ideas have zero effect. It means many effects are too small to detect at the chosen power and significance levels, or the hypothesis was wrong about the direction of the effect.

"A p-value of 0.04 means there's a 4% chance the treatment doesn't work." The p-value is the probability of seeing data this extreme if the null is true. It says nothing about the probability that the null is true. This is a subtle but important distinction. The p-value is a property of the data under an assumption, not a probability of the assumption.

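A short calculation shows why a probability computed under the null differs from the probability that the null is true. The sketch below assumes, purely for illustration, that 20% of tested ideas have a real effect, that tests run at 80% power, and that the significance level is 0.05. None of these figures come from Confidence or Spotify.

```python
def share_of_false_positives(share_real, power, alpha):
    """Among significant results, the share that come from true nulls."""
    # Experiments with no real effect that still reach significance.
    false_positives = alpha * (1 - share_real)
    # Experiments with a real effect that the test detects.
    true_positives = power * share_real
    return false_positives / (false_positives + true_positives)


# Hypothetical inputs: 20% of ideas work, 80% power, alpha of 0.05.
share = share_of_false_positives(share_real=0.2, power=0.8, alpha=0.05)
print(f"{share:.0%} of significant results come from true nulls")
```

Under these assumptions about one in five significant results comes from a treatment with no effect, even though every one of them cleared the 0.05 threshold. The answer depends on how often ideas work and on power, which is information the p-value alone does not contain.
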
"A significant result on one metric means you can ignore the null results on others." Each metric has its own null hypothesis. If the test on the success metric rejects its null but a guardrail metric shows a statistically significant deterioration, the experiment found both an improvement and a regression. Confidence displays all metric results together precisely so teams don't cherry-pick the favorable ones.

Related terms

Core Experimentation
Hypothesis

A hypothesis is a testable prediction about the effect of a specific product change on a specific metric.

Statistical Methods
Statistical Significance

Statistical significance is the determination that an observed difference between experiment groups is unlikely to have occurred by chance alone.

Statistical Methods
P-value

A p-value is the probability of observing a result at least as extreme as the one measured, assuming the null hypothesis is true (that is, assuming the change had no real effect).

Statistical Methods
Significance Level (Alpha)

The significance level, commonly called alpha, is the maximum false positive rate you're willing to accept in an experiment.

Statistical Methods
Statistical Power

Statistical power is the probability that an experiment will detect a real effect when one exists.

Statistical Methods
False Positive Rate (Type I Error)

The false positive rate, also called the Type I error rate, is the probability of concluding that a treatment had an effect when it actually didn't.

Statistical Methods
False Negative Rate (Type II Error)

The false negative rate, also called the Type II error rate or beta, is the probability of failing to detect a real treatment effect.

Metrics
Guardrail Metric

A guardrail metric is a metric monitored during an experiment to ensure the change doesn't cause unintended harm, even when the success metric improves.

© 2026 Spotify

The Confidence name and logo are registered trademarks of Spotify.