Lesson 5: Significance for success metrics
In this lesson, you learn what the status labels on success metric results mean and how to read a confidence interval to determine whether a result is significant. You also learn what "not significant" really means, and why it is not the same as "no effect."
Every metric result in Confidence has a status label: a short phrase that tells you what to conclude about that metric.
Success metrics: did the treatment variant improve things?
For success metrics, Confidence is asking: is there statistical evidence that the treatment variant changed this metric in the desired direction?
The zero line on the results page is the reference point. A confidence interval (CI) that sits entirely on one side of zero, for example entirely on the positive side, means the data are inconsistent with there being no effect: the result is statistically significant. A CI that crosses zero means "no effect" remains a plausible value.
- Significant: the CI lies entirely on the desired side of zero. There is statistical evidence that the treatment variant moved the metric in the intended direction. This is not certainty; it means the data would be unlikely to look the way they do if there were truly no effect.
- Not significant: the CI crosses zero. There is not enough statistical evidence to conclude that the treatment variant affected this metric. The data are consistent with there being no effect.
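The two rules above, combined with the direction in which the metric should move, can be sketched as a small function. This is a minimal illustration and not how Confidence computes its labels: the function name and parameters are hypothetical, and treating a CI that touches zero exactly as "Not significant" is an assumption. It also covers the case where the CI sits entirely on the wrong side of zero, labelled "Has deteriorated" here.

```python
def success_metric_status(ci_lower, ci_upper, desired_direction="increase"):
    """Classify a success metric from the CI of the treatment-vs-control change.

    Hypothetical helper for illustration only.
    """
    if ci_lower > ci_upper:
        raise ValueError("ci_lower must not exceed ci_upper")
    if desired_direction not in ("increase", "decrease"):
        raise ValueError("desired_direction must be 'increase' or 'decrease'")

    # Assumption: an interval that contains or touches zero is not significant.
    if ci_lower <= 0 <= ci_upper:
        return "Not significant"

    moved_up = ci_lower > 0  # the whole interval is above zero
    wants_up = desired_direction == "increase"
    return "Has improved" if moved_up == wants_up else "Has deteriorated"


print(success_metric_status(0.4, 5.8))                # Has improved
print(success_metric_status(-0.9, 4.3))               # Not significant
print(success_metric_status(-5.8, -0.4))              # Has deteriorated
print(success_metric_status(-5.8, -0.4, "decrease"))  # Has improved
```

Note that the point estimate never enters the function: only the position of the whole interval relative to zero decides the status.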
"Not significant" does not mean "no effect." It means the current data do not provide strong enough evidence to conclude that there is an effect. When the confidence interval is wide, you simply do not yet have enough data to tell either way. Do not interpret "not significant" as proof that the treatment variant did nothing.
Use the interactive below to build intuition for how the CI position determines significance.
CI and significance for success metrics
Move the point estimate slider to see how the significance status changes. Use the direction toggle to set which way the metric should move.
Try the following:
- With the direction set to "Increases", drag the point estimate from +15% to -15% and watch all three states appear in sequence: "Has improved" when the CI clears zero on the positive side, "Not significant" as it crosses zero, and "Has deteriorated" when the CI sits entirely below zero.
- Move the point estimate back to a moderate positive value, then set the direction to "Decreases." The same sweep now works in reverse: "Has improved" appears on the negative side, and "Has deteriorated" on the positive side.
- Set sample size to 500 and move the point estimate (PE) to +5%. The wider CI may still cross zero even with a positive estimate.
- Increase sample size to 5,000. The CI narrows, and a smaller PE becomes sufficient for significance.
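The last two steps can be reproduced numerically. The sketch below is not the calculation behind the interactive; it assumes a normal approximation for a difference in means, equal group sizes, and a hypothetical per-user standard deviation of 0.5 in the same units as the point estimate. Under those assumptions the CI half-width shrinks with the square root of the sample size.

```python
from math import sqrt
from statistics import NormalDist


def ci_for_difference(point_estimate, std_dev, n_per_group, confidence_level=0.95):
    """Normal-approximation CI for a difference in means (illustrative only).

    Assumes both groups have n_per_group users and the same std_dev.
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)
    half_width = z * std_dev * sqrt(2 / n_per_group)
    return point_estimate - half_width, point_estimate + half_width


# Same point estimate, two sample sizes. The std_dev of 0.5 is an assumed value.
for n in (500, 5000):
    lower, upper = ci_for_difference(0.05, 0.5, n)
    verdict = "crosses zero" if lower <= 0 <= upper else "clears zero"
    print(f"n per group = {n}: CI = [{lower:+.3f}, {upper:+.3f}] ({verdict})")
```

With these assumed inputs, the interval at 500 users per group crosses zero, while the interval at 5,000 clears it, even though the point estimate is the same.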
A note on adjusted significance thresholds
When an experiment has multiple success metrics, the significance threshold is corrected to control the overall false positive rate across all metrics. Adding more success metrics makes each individual metric slightly harder to call significant. This is the statistically correct approach: without it, the more metrics you add, the more likely you are to find a false positive by chance.
Confidence applies this adjustment automatically. In the Detailed results view, each metric shows its own adjusted alpha value.
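To see why the correction matters, the sketch below computes the chance of at least one false positive across several metrics and the per-metric threshold under a Bonferroni correction. Bonferroni is used here only because it is the simplest common correction; this lesson does not state which method Confidence applies. The function names are hypothetical, and the first formula assumes the metrics are independent.

```python
def familywise_false_positive_rate(alpha, n_metrics):
    """Chance of at least one false positive across independent metrics."""
    return 1 - (1 - alpha) ** n_metrics


def bonferroni_alpha(alpha, n_metrics):
    """Per-metric threshold that keeps the overall rate at or below alpha."""
    return alpha / n_metrics


overall_alpha = 0.05
for m in (1, 3, 5, 10):
    uncorrected = familywise_false_positive_rate(overall_alpha, m)
    per_metric = bonferroni_alpha(overall_alpha, m)
    corrected = familywise_false_positive_rate(per_metric, m)
    print(
        f"{m:>2} metrics: uncorrected overall rate = {uncorrected:.3f}, "
        f"adjusted alpha = {per_metric:.4f}, corrected overall rate = {corrected:.3f}"
    )
```

With ten independent metrics each tested at 0.05, the uncorrected chance of at least one false positive is roughly 40%, which is why each metric's threshold becomes stricter as metrics are added.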
A success metric (where increases are desirable) shows a +3.1% change with a status of 'Not significant'. What is the correct interpretation?
A success metric CI shows [+1.2%, +6.8%]. The metric improves with increases. What is the status of this result?
Notes for nerds
Two decisions, two false positive rates
What looks like a single significance test is actually two distinct decisions, each answering a different question. The shipping decision asks: is there enough evidence that this treatment variant improves the metrics we care about? The abort decision asks: is there evidence that this treatment variant is causing harm right now, and should we stop the experiment early to prevent it?
These are not the same question, and there is no statistical reason to answer them using the same false positive rate. The consequences of getting them wrong are asymmetric. A false positive on the shipping decision means you ship a variant that does not actually improve things, which is a recoverable mistake. A false negative on the abort decision means you keep running an experiment that is harming users, which is a much more urgent problem. It follows that the abort decision should use a more sensitive threshold: you want to catch deterioration early, even if it means occasionally stopping an experiment that would have recovered.
There is also a fundamental difference in how time enters each decision. The shipping decision is made at a single point in time (when the experiment ends), which is the setting upon-conclusion evaluation is designed for. The abort decision, by contrast, must be made continuously throughout the experiment: you cannot wait until the end to find out whether users were harmed. This means the abort decision should always use a sequential test, regardless of which evaluation strategy was chosen for the shipping decision. Upon-conclusion evaluation gives you no valid way to act on mid-experiment results; a sequential test is designed for exactly that.
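A small simulation illustrates why a test designed for a single look cannot simply be reused for continuous monitoring. The sketch below does not implement a sequential test and is not Confidence's deterioration check. It assumes no true effect, unit-variance observations, a one-sided z-test at alpha = 0.05, and ten equally spaced looks; all names and parameter values are hypothetical. It compares how often deterioration is flagged at any look with how often it is flagged at the final look only.

```python
import random
from math import sqrt
from statistics import NormalDist


def simulate_false_alarm_rates(n_sims=2000, n_looks=10, batch_size=100,
                               alpha=0.05, seed=7):
    """Return (rate flagged at any look, rate flagged at the final look only)."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(alpha)  # negative cutoff for a one-sided test
    any_look = 0
    final_only = 0
    for _ in range(n_sims):
        total = 0.0
        flagged = False
        z = 0.0
        for look in range(1, n_looks + 1):
            # The sum of batch_size independent N(0, 1) observations is
            # N(0, batch_size), so each batch is drawn in a single call.
            total += rng.gauss(0.0, sqrt(batch_size))
            z = total / sqrt(look * batch_size)
            if z < z_crit:
                flagged = True  # a fixed-threshold test would stop here
        any_look += flagged
        final_only += z < z_crit  # z now holds the final look's statistic
    return any_look / n_sims, final_only / n_sims


peeking_rate, final_rate = simulate_false_alarm_rates()
print(f"Flagged at any of 10 looks: {peeking_rate:.3f}")
print(f"Flagged at the final look:  {final_rate:.3f}")
```

The final-look rate stays near the nominal 5%, while the any-look rate is clearly higher, because every extra look is another chance for noise to cross the threshold. A sequential test adjusts its thresholds so that the false positive rate holds across all looks combined.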
Confidence implements this separation by default. The deterioration check that feeds into the Abort recommendation always runs as a sequential test, even when the experiment uses upon-conclusion evaluation for its main results. The shipping decision and the abort decision are evaluated independently, with different statistical properties suited to each. The theoretical grounding for this approach is developed in Schultzberg, Ankargren, and Frånberg (2024).