The null hypothesis is the default assumption in a statistical test: that there is no difference between the treatment and control groups. In A/B testing, it states: the change you made had no effect on the metric. The experiment's job is to gather enough evidence to either reject this assumption or fail to reject it.
The null hypothesis is not a claim that the change does nothing. It's a starting point for the statistical test. You assume no effect, then ask: given this assumption, how unlikely is the data I observed? If the data would be very unlikely under the null hypothesis (typically less than 5% probability), you reject the null and conclude the treatment had a real effect. This framework is the foundation of frequentist hypothesis testing, which is the statistical approach Confidence uses for experiment analysis.
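To make the framework concrete, here is a minimal sketch in Python using a coin-flip analogy (the counts are invented for illustration):

```python
# Under the null hypothesis the coin is fair (p = 0.5). We observe 62
# heads in 100 flips and ask: how unlikely is a result this extreme
# if the null is true?
from scipy.stats import binomtest

result = binomtest(k=62, n=100, p=0.5, alternative="two-sided")
print(f"p-value: {result.pvalue:.4f}")  # roughly 0.02

# The p-value is below 0.05, so at the conventional 5% significance
# level we reject the null hypothesis that the coin is fair.
```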
How does the null hypothesis work in an A/B test?
Every A/B test in Confidence implicitly tests a null hypothesis. When you set up an experiment with a success metric, the platform frames the analysis as:
- Null hypothesis (H0): The treatment has no effect on the success metric. The true difference between groups is zero.
- Alternative hypothesis (H1): The treatment has a non-zero effect on the success metric.
The statistical test then computes the probability of seeing a result at least as extreme as the one observed, assuming the null is true. This probability is the p-value. If the p-value falls below the significance level (typically 0.05), the result is statistically significant and you reject the null hypothesis.
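As an illustration of the test described above, here is a sketch of a two-proportion z-test on a conversion-style success metric. The counts are invented, and this is not necessarily the exact test Confidence runs:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical experiment results (illustrative numbers).
control_conversions, control_users = 1_000, 20_000
treatment_conversions, treatment_users = 1_100, 20_000

p_c = control_conversions / control_users      # 0.050
p_t = treatment_conversions / treatment_users  # 0.055

# Pooled proportion and standard error under H0: the true difference is zero.
p_pool = (control_conversions + treatment_conversions) / (control_users + treatment_users)
se = np.sqrt(p_pool * (1 - p_pool) * (1 / control_users + 1 / treatment_users))

z = (p_t - p_c) / se
p_value = 2 * norm.sf(abs(z))  # two-sided, matching H1: a non-zero effect

print(f"z = {z:.2f}, p-value = {p_value:.4f}")
# If p_value < 0.05, the result is statistically significant: reject H0.
```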
Failing to reject the null doesn't prove the change had no effect. It means the experiment didn't find enough evidence to conclude that it did. The distinction matters. An underpowered experiment will fail to reject the null most of the time, even when the treatment has a real but small effect. This is why statistical power and sample size calculations are part of the experiment design in Confidence: they ensure the experiment can detect the effect size you care about.
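A sketch of the sample-size side of that design step, using statsmodels. The baseline rate and minimum detectable lift here are assumptions for illustration, not Confidence defaults:

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline, target = 0.050, 0.055                        # detect a lift from 5.0% to 5.5%
effect_size = proportion_effectsize(target, baseline)  # Cohen's h

n_per_group = NormalIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,         # significance level
    power=0.80,         # chance of detecting the effect if it is real
    alternative="two-sided",
)
print(f"Required sample size per group: {n_per_group:,.0f}")
```

Run the calculation before the experiment starts: if the required sample size exceeds the traffic you can realistically collect, the experiment is underpowered for that effect size.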
How does the null hypothesis apply to guardrail metrics?
For success metrics, the null hypothesis is "no difference" and you're looking for evidence of improvement. For guardrail metrics, the framing is different.
Confidence uses non-inferiority testing for guardrails. The null hypothesis for a guardrail metric is: the treatment is worse than control (or worse by more than a specified margin). You're looking for evidence to reject this null, meaning you want to confirm the treatment doesn't cause harm. This inversion is deliberate. With success metrics, a false positive means shipping a change that doesn't actually help. Annoying, but not destructive. With guardrails, a false negative means missing a real regression. That can be destructive.
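Here is a hedged sketch of what a non-inferiority z-test can look like for a guardrail where higher is better (a retention-style metric). The margin, counts, and test form are assumptions for illustration, not Confidence's exact implementation:

```python
import numpy as np
from scipy.stats import norm

# H0: the treatment is worse than control by more than the margin
#     (p_t - p_c <= -margin)
# H1: the treatment is not worse by more than the margin (p_t - p_c > -margin)
# Rejecting H0 is the "confirm no harm" outcome described above.
control_retained, control_users = 9_000, 20_000
treatment_retained, treatment_users = 8_980, 20_000
margin = 0.01  # largest acceptable degradation: 1 percentage point

p_c = control_retained / control_users
p_t = treatment_retained / treatment_users

# Unpooled standard error of the difference in proportions.
se = np.sqrt(p_c * (1 - p_c) / control_users + p_t * (1 - p_t) / treatment_users)

# Shift the observed difference by the margin and test one-sided.
z = ((p_t - p_c) + margin) / se
p_value = norm.sf(z)

print(f"z = {z:.2f}, one-sided p-value = {p_value:.4f}")
# p_value < 0.05: reject H0 and conclude the treatment is not
# meaningfully worse than control on this guardrail.
```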
This is why Confidence's decision framework, described in the risk-aware product decisions paper, doesn't apply multiple testing corrections to guardrail metrics. The cost of missing a guardrail violation outweighs the cost of a false alarm, so it's the false negative rate that is controlled for guardrails, the opposite trade-off from success metrics.
What are common misunderstandings about the null hypothesis?
"Failing to reject the null means the treatment doesn't work." It means you didn't find sufficient evidence. Maybe the treatment has a small effect your test wasn't powered to detect. At Spotify, where experiments produce a 12% win rate, most experiments fail to reject the null on the success metric. That doesn't mean 88% of product ideas have zero effect. It means many effects are too small to detect at the chosen power and significance levels, or the hypothesis was wrong about the direction of the effect.
"A p-value of 0.04 means there's a 4% chance the treatment doesn't work." The p-value is the probability of seeing data this extreme if the null is true. It says nothing about the probability that the null is true. This is a subtle but important distinction. The p-value is a property of the data under an assumption, not a probability of the assumption.
"A significant result on one metric means you can ignore the null results on others." Each metric has its own null hypothesis. If the success metric rejects the null but a guardrail metric also rejects the null (in the wrong direction), the experiment found both an improvement and a regression. Confidence displays all metric results together precisely so teams don't cherry-pick the favorable ones.