Lesson 5: Significance for success metrics

Every metric result in Confidence has a status label: a short phrase that tells you what to conclude about that metric.

Success metrics: did the treatment variant improve things?

For success metrics, Confidence is asking: is there statistical evidence that the treatment variant changed this metric in the desired direction?

The zero line on the results page is the reference point. A confidence interval (CI) that sits entirely on one side of zero means the data are inconsistent with there being no effect. When that side is the desired direction (the positive side for a metric that should increase), the result is statistically significant. A CI that crosses zero means "no effect" remains a plausible value.

  • Significant: the CI lies entirely on the desired side of zero. There is statistical evidence that the treatment variant moved the metric in the intended direction. This is not certainty; it means the data would be unlikely to look the way they do if there were truly no effect.
  • Not significant: the CI crosses zero. There is not enough statistical evidence to conclude that the treatment variant affected this metric. The data are consistent with there being no effect.
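The rule in the two bullets above can be written down as a small function. This is a minimal sketch, not Confidence's implementation: the function name, the parameter names, and the "increase"/"decrease" values are hypothetical. The third label, for a CI that sits entirely on the undesired side of zero, is borrowed from the "Has deteriorated" state shown in the interactive on this page.

```python
def success_metric_status(ci_lower, ci_upper, desired_direction="increase"):
    """Classify a success metric from the CI of the treatment effect.

    desired_direction is "increase" or "decrease" (hypothetical parameter values).
    """
    if ci_lower > ci_upper:
        raise ValueError("ci_lower must not exceed ci_upper")
    if desired_direction not in ("increase", "decrease"):
        raise ValueError("desired_direction must be 'increase' or 'decrease'")
    if desired_direction == "decrease":
        # Mirror the interval so that improvement is always the positive side.
        ci_lower, ci_upper = -ci_upper, -ci_lower
    if ci_lower > 0:
        return "Significant"        # CI entirely on the desired side of zero
    if ci_upper < 0:
        return "Has deteriorated"   # CI entirely on the undesired side of zero
    return "Not significant"        # zero is inside the interval


print(success_metric_status(-1.5, 9.9))               # Not significant
print(success_metric_status(1.2, 7.4))                # Significant
print(success_metric_status(-7.4, -1.2, "decrease"))  # Significant
```

Mirroring the interval for "decrease" metrics keeps a single comparison against zero, which is the same idea as the direction toggle in the interactive.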

Use the interactive below to build intuition for how the CI position determines significance.

CI and significance for success metrics

Move the point estimate slider to see how the significance status changes. Use the direction toggle to set which way the metric should move.

[Interactive widget. Controls: a direction toggle ("Metric improves when it: Increases / Decreases"), a point estimate slider (-15% to +15%), a sample size slider (100 to 10,000), and a noise slider (10, low noise, to 100, high noise). The CI is drawn on an axis from -20% to +20%.]

Example readout at a point estimate of +4.2%:

Not significant: With high confidence, the true effect is between -1.5% and +9.9%. Since zero is in the interval, we cannot conclude whether the treatment improved or worsened this metric. The result for this metric is inconclusive: collect more data, or if you have reached the required sample size, end the experiment.

Try the following:

  • With the direction set to "Increases", drag the point estimate from +15% to -15% and watch all three states appear in sequence: "Has improved" when the CI clears zero on the positive side, "Not significant" as it crosses zero, and "Has deteriorated" when the CI sits entirely below zero.
  • Move the point estimate back to a moderate positive value, then set the direction to "Decreases." The same sweep now works in reverse: "Has improved" appears on the negative side, and "Has deteriorated" on the positive side.
  • Set the sample size to 500 and move the point estimate to +5%. The wide CI may still cross zero even with a positive estimate.
  • Increase the sample size to 5,000. The CI narrows, so a smaller point estimate is enough to reach significance.
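The last two steps can also be checked with a little arithmetic. The sketch below assumes a textbook normal-approximation CI for a difference between two equally sized groups, and it treats the noise setting as a standard deviation in percentage points. Both are assumptions for illustration; the interactive's actual formula is not stated here, and `relative_ci` is a hypothetical helper.

```python
from math import sqrt
from statistics import NormalDist


def relative_ci(point_estimate_pct, noise_sd_pct, n_per_group, confidence=0.95):
    """Two-sided CI for a difference between two group means, in percent."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    # Standard error of a difference between two independent, equally sized groups.
    se = noise_sd_pct * sqrt(2 / n_per_group)
    return point_estimate_pct - z * se, point_estimate_pct + z * se


for n in (500, 5000):
    lower, upper = relative_ci(point_estimate_pct=5.0, noise_sd_pct=50.0, n_per_group=n)
    verdict = "clears zero" if lower > 0 else "crosses zero"
    print(f"n={n}: CI = ({lower:+.1f}%, {upper:+.1f}%), {verdict}")
```

Under these assumptions the CI half-width shrinks with the square root of the sample size, so a tenfold increase in sample size narrows the interval by a factor of about 3.2.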

A note on adjusted significance thresholds

When an experiment has multiple success metrics, the significance threshold is corrected to control the overall false positive rate across all metrics. Adding more success metrics makes each individual metric slightly harder to call significant. This is the statistically correct approach: without it, the more metrics you add, the more likely you are to find a false positive by chance.
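To see why the correction matters, the sketch below compares the chance of at least one false positive with and without an adjustment. It assumes a Bonferroni-style correction (divide the threshold by the number of success metrics) and, for the uncorrected case, independent metrics. The text above does not name the correction method, so both function names and the choice of Bonferroni are illustrative assumptions.

```python
def adjusted_alpha(alpha, n_success_metrics):
    """Per-metric threshold under a Bonferroni-style correction (assumed method)."""
    if n_success_metrics < 1:
        raise ValueError("need at least one success metric")
    return alpha / n_success_metrics


def uncorrected_false_positive_rate(alpha, n_success_metrics):
    """Chance of at least one false positive if every metric is tested at alpha.

    Assumes the metrics are independent, which is a simplification.
    """
    return 1 - (1 - alpha) ** n_success_metrics


alpha = 0.05
for m in (1, 2, 5, 10):
    print(
        f"{m} metrics: per-metric threshold {adjusted_alpha(alpha, m):.4f}, "
        f"uncorrected overall rate {uncorrected_false_positive_rate(alpha, m):.3f}"
    )
```

With five independent metrics each tested at 0.05, the uncorrected chance of at least one false positive is about 0.23, which is why each individual threshold has to tighten as metrics are added.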

Notes for nerds

Two decisions, two false positive rates

What looks like a single significance test is actually two distinct decisions, each answering a different question. The shipping decision asks: is there enough evidence that this treatment variant improves the metrics we care about? The abort decision asks: is there evidence that this treatment variant is causing harm right now, and should we stop the experiment early to prevent it?

These are not the same question, and there is no statistical reason to answer them using the same false positive rate. The consequences of getting them wrong are asymmetric. A false positive on the shipping decision means you ship a variant that does not actually improve things, which is a recoverable mistake. A false negative on the abort decision means you keep running an experiment that is harming users, which is a much more urgent problem. It follows that the abort decision should use a more sensitive threshold: you want to catch deterioration early, even if it means occasionally stopping an experiment that would have recovered.

There is also a fundamental difference in how time enters each decision. The shipping decision is made at a single point in time (when the experiment ends), which is the setting upon-conclusion evaluation is designed for. The abort decision, by contrast, must be made continuously throughout the experiment: you cannot wait until the end to find out whether users were harmed. This means the abort decision should always use a sequential test, regardless of which evaluation strategy was chosen for the shipping decision. Upon-conclusion evaluation gives you no valid way to act on mid-experiment results; a sequential test is designed for exactly that.
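A small simulation illustrates why a fixed-horizon threshold cannot simply be reused for mid-experiment checks. The sketch below simulates experiments with no true effect and applies the same one-sided z-test threshold at every look. It is an illustration of the problem only, not the sequential test Confidence uses; the function name, number of looks, and seed are arbitrary assumptions.

```python
import random
from math import sqrt
from statistics import NormalDist


def simulate_false_alarms(n_experiments=2000, n_looks=5, alpha=0.05, seed=7):
    """Simulate no-effect experiments and count false deterioration alarms.

    Returns (alarm rate when checking at every look,
             alarm rate when checking only at the final look).
    """
    rng = random.Random(seed)
    # One-sided critical value for a fixed-horizon z-test at level alpha.
    z_crit = NormalDist().inv_cdf(1 - alpha)
    alarms_any_look = 0
    alarms_final_look = 0
    for _ in range(n_experiments):
        cumulative = 0.0
        alarmed = False
        z = 0.0
        for look in range(1, n_looks + 1):
            # Each equally sized batch adds one standard-normal increment, so
            # with no true effect the z-statistic at this look is S / sqrt(look).
            cumulative += rng.gauss(0.0, 1.0)
            z = cumulative / sqrt(look)
            if z < -z_crit:
                alarmed = True
        alarms_any_look += int(alarmed)
        alarms_final_look += int(z < -z_crit)
    return alarms_any_look / n_experiments, alarms_final_look / n_experiments


peeking_rate, final_only_rate = simulate_false_alarms()
print(f"alarm rate when checking every look:      {peeking_rate:.3f}")
print(f"alarm rate when checking only at the end: {final_only_rate:.3f}")
```

Checking only at the end keeps the false alarm rate near the nominal level, while checking at every look with the same threshold inflates it. A sequential test adjusts the threshold so that repeated checks stay within the intended false positive rate.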

Confidence implements this separation by default. The deterioration check that feeds into the Abort recommendation always runs as a sequential test, even when the experiment uses upon-conclusion evaluation for its main results. The shipping decision and the abort decision are evaluated independently, with different statistical properties suited to each. The theoretical grounding for this approach is developed in Schultzberg, Ankargren, and Frånberg (2024).