Metrics

What is a Success Metric?

A success metric is the primary metric an experiment is designed to move.

It represents the outcome you believe your change will improve: the specific, measurable signal that tells you whether your hypothesis was right. Every experiment in Confidence requires at least one success metric, because without one, there's no definition of what "working" means.

Choosing the wrong success metric is one of the most common ways experiments produce misleading results. If the metric doesn't reflect the actual value your change creates, a positive result gives you confidence in the wrong thing. A team that picks "button clicks" as their success metric when the real goal is completed purchases can ship a change that drives more clicks and fewer purchases.

How does a success metric differ from other metric types?

In Confidence's decision framework, metrics fall into distinct roles. The success metric is the one that drives the ship-or-don't-ship decision. Guardrail metrics protect against unintended harm. Secondary metrics provide context. Each type has different statistical implications.

The key distinction: false positive control matters most for success metrics. If you falsely conclude that a change improved your success metric, you ship something that doesn't work. Confidence applies multiple testing corrections (Bonferroni by default) to success metrics when an experiment tracks more than one. This keeps the probability of a false positive ship decision under control. For guardrail metrics, the critical risk is false negatives, not false positives, because the cost of missing a regression outweighs the cost of a false alarm.
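For intuition, here is a minimal sketch of how a Bonferroni correction gates significance across multiple success metrics. This is illustrative only, not Confidence's internal implementation, and the metric names and p-values are hypothetical:

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Flag each success metric whose p-value clears the
    Bonferroni-adjusted threshold alpha / m, where m is the
    number of success metrics tested together."""
    m = len(p_values)
    threshold = alpha / m
    return {name: p <= threshold for name, p in p_values.items()}

# Three success metrics share one 5% false positive budget,
# so each is tested at 0.05 / 3 ≈ 0.0167.
results = bonferroni_significant(
    {"completed_purchases": 0.004, "add_to_cart": 0.03, "revisit_rate": 0.20}
)
# Only completed_purchases clears the adjusted threshold; add_to_cart
# would have passed at 0.05 but not after the correction.
```

Note how a metric that looks significant in isolation (p = 0.03) fails once the shared false positive budget is split three ways; this is the mechanism that keeps the probability of a false positive ship decision under control.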

What makes a good success metric?

Three properties matter.

Sensitivity. The metric needs to move when a real change happens. A metric that is too noisy buries the treatment effect in variance, and a metric that barely responds to product changes won't register the effect at all, even when it's real. Spotify's experimentation teams evaluate metric sensitivity before committing to an experiment design, because an insensitive success metric turns a well-powered experiment into an ambiguous one.
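One simple way to evaluate sensitivity before committing to a design is simulation: inject a known lift into a binary metric and measure how often a standard test detects it. A rough sketch under assumed parameters (the function name, rates, and sample sizes are hypothetical, and the test used is a plain two-proportion z-test, not Confidence's methodology):

```python
import random
from statistics import NormalDist

def simulated_power(baseline_rate, lift, n_per_arm, alpha=0.05, runs=500):
    """Estimate a binary metric's sensitivity: the fraction of
    simulated experiments in which a true relative lift is detected
    by a two-proportion z-test at significance level alpha."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    treated_rate = baseline_rate * (1 + lift)
    detected = 0
    for _ in range(runs):
        c = sum(random.random() < baseline_rate for _ in range(n_per_arm))
        t = sum(random.random() < treated_rate for _ in range(n_per_arm))
        p1, p2 = c / n_per_arm, t / n_per_arm
        pooled = (c + t) / (2 * n_per_arm)
        se = (2 * pooled * (1 - pooled) / n_per_arm) ** 0.5
        if se > 0 and abs(p2 - p1) / se > z_crit:
            detected += 1
    return detected / runs

random.seed(7)
# A 10% baseline conversion rate with a 50% relative lift and 1,000
# users per arm: the detection rate estimates the metric's sensitivity
# at this sample size.
power_big = simulated_power(baseline_rate=0.10, lift=0.5, n_per_arm=1000)
```

If the simulated detection rate comes back low at the traffic you can realistically allocate, the metric is too insensitive to serve as the success metric for that experiment.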

Alignment with user value. The metric should measure something your users actually care about. Proxy metrics (like page views or session length) can serve as success metrics, but only when the relationship between the proxy and the underlying outcome is validated and stable. The Confidence blog documents cases where optimizing a proxy directly broke the relationship between the proxy and the outcome it represented.

Measurability within the experiment window. Some outcomes take months to observe. If your success metric requires six months of data, your experiment will either run too long to be practical or end too early to produce a reliable result. When the real outcome is slow, teams often use a validated proxy metric as the success metric and monitor the long-term outcome separately.

How many success metrics should an experiment have?

One is ideal. Two or three are common. More than five is a sign that the experiment lacks a clear hypothesis.

Each additional success metric requires a multiple testing correction, which reduces the statistical power for each individual metric. An experiment with ten success metrics needs substantially more traffic to detect the same effect size on any one of them. The discipline of limiting success metrics forces a team to articulate what they're actually trying to learn.
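The traffic cost of extra success metrics can be made concrete with the standard normal-approximation sample-size formula for a two-sample comparison of means, with the significance level split Bonferroni-style across metrics. This is a textbook approximation, not Confidence's exact internals, and the numbers are illustrative:

```python
from statistics import NormalDist

def required_n_per_arm(effect, sd, n_success_metrics, alpha=0.05, power=0.8):
    """Approximate sample size per arm to detect an absolute effect
    of the given size, with Bonferroni-adjusted alpha shared across
    all success metrics."""
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - (alpha / n_success_metrics) / 2)
    z_beta = nd.inv_cdf(power)
    return 2 * ((z_alpha + z_beta) * sd / effect) ** 2

# Same effect size and variance; only the number of success metrics changes.
n_one = required_n_per_arm(effect=0.5, sd=5.0, n_success_metrics=1)
n_ten = required_n_per_arm(effect=0.5, sd=5.0, n_success_metrics=10)
```

With these assumed parameters, going from one success metric to ten inflates the per-arm sample size by roughly 70%, which is the quantitative version of the argument above: every extra success metric dilutes the power available to each one.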

Confidence's decision framework formalizes this: the paper "Risk-Aware Product Decisions in A/B Tests with Multiple Metrics" shows that adjusting false positive rates across success metrics, while adjusting false negative rates across guardrail metrics, produces the right tradeoff between shipping bad changes and missing real improvements.