What is Multi-Metric Decision Making?

Multi-metric decision making is the practice of evaluating experiment results across multiple metrics simultaneously rather than basing ship decisions on a single success metric. In real product decisions, a change that improves one metric often degrades another. The decision to ship, iterate, or roll back depends on the full picture, not just one number.

At Spotify, experiments are evaluated against three metric types: success metrics (what you're trying to improve), guardrail metrics (what you're trying not to break), and quality metrics (secondary indicators that provide context). At Spotify, 42% of experiments are rolled back after guardrail regressions. Any of those changes that also improved its target metric would have shipped with undetected harm if the team had looked at only one number.

Why is a single metric not enough?

Product decisions have tradeoffs. A recommendation algorithm change that increases streams per session might decrease the diversity of artists played, reducing long-term user satisfaction. A checkout flow redesign that increases conversion rate might increase support tickets because users feel pushed through the process. A page speed improvement that reduces load time might reduce ad revenue.

None of these tradeoffs are visible if you only look at the success metric. The whole point of guardrail metrics is to make the cost of the change visible alongside the benefit.

The statistical challenge is that evaluating multiple metrics simultaneously increases the risk of false discoveries. If you test 10 metrics at a significance level of 0.05 and none of them truly moved, you have roughly a 40% chance of at least one false positive (1 - 0.95^10 ≈ 0.40, assuming the tests are independent). Multiple testing corrections (Bonferroni being the most common in practice) adjust for this. Confidence applies multiple testing corrections to success metrics while treating guardrail metrics differently: as Spotify's research shows, the risk that needs controlling for guardrails is false negatives (missing a real regression), not false positives.
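
The arithmetic behind the 40% figure can be checked with a short sketch. It assumes independent tests and no true effects, and the function names are illustrative only; they are not part of Confidence.

```python
def family_wise_error_rate(num_tests: int, alpha: float) -> float:
    """Chance of at least one false positive across independent tests
    when none of the metrics truly moved."""
    return 1.0 - (1.0 - alpha) ** num_tests


def bonferroni_alpha(num_tests: int, alpha: float) -> float:
    """Per-test significance level under a Bonferroni correction."""
    return alpha / num_tests


# 10 metrics, each tested at 0.05 with no correction.
uncorrected = family_wise_error_rate(10, 0.05)

# The same 10 metrics, each tested at the Bonferroni-adjusted level.
per_test_alpha = bonferroni_alpha(10, 0.05)
corrected = family_wise_error_rate(10, per_test_alpha)

print(f"uncorrected FWER: {uncorrected:.3f}")
print(f"Bonferroni per-test alpha: {per_test_alpha:.4f}")
print(f"corrected FWER: {corrected:.3f}")
```

With the correction, the family-wise error rate stays at or below the intended 0.05, at the cost of a stricter threshold for each individual metric.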

How should different metric types be treated?

Spotify's published decision framework formalizes the answer.

Success metrics answer "did the change achieve what we intended?" False positives are the primary risk: you don't want to ship a change that appears to have helped but actually didn't. Multiple testing corrections control this risk.

Guardrail metrics answer "did the change cause harm we didn't intend?" False negatives are the primary risk: you don't want to miss a real regression. The framework recommends controlling the false negative rate across guardrails, which is a different statistical problem from controlling false positives across success metrics.

Quality metrics provide context but don't drive the ship/no-ship decision directly. A quality metric that moves unexpectedly warrants investigation, but it doesn't automatically trigger a rollback.

This distinction matters because applying the same correction procedure to all metrics is wrong in both directions. Over-correcting success metrics means you miss real improvements. Under-correcting guardrail metrics means you miss real harm.
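
One way to picture the asymmetry is to keep two separate error budgets: a false positive budget shared by the success metrics and a false negative budget shared by the guardrails. The sketch below assumes a simple Bonferroni-style split for both; the exact correction in Spotify's published framework may differ, and `ErrorBudgets` and `split_error_budgets` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudgets:
    success_alpha: float   # per-metric false positive rate for success metrics
    guardrail_beta: float  # per-metric false negative rate for guardrail metrics


def split_error_budgets(
    alpha: float, beta: float, n_success: int, n_guardrail: int
) -> ErrorBudgets:
    """Split alpha across success metrics and beta across guardrails.

    Assumption: an even, Bonferroni-style division of each budget.
    """
    if n_success < 1 or n_guardrail < 1:
        raise ValueError("need at least one success and one guardrail metric")
    return ErrorBudgets(
        success_alpha=alpha / n_success,
        guardrail_beta=beta / n_guardrail,
    )


# One success metric and three guardrails, with overall alpha = 0.05
# and overall beta = 0.20 (that is, 80% power before correction).
budgets = split_error_budgets(alpha=0.05, beta=0.20, n_success=1, n_guardrail=3)
print(budgets)
```

Adding guardrails in this sketch leaves the success threshold untouched and instead tightens the power requirement on each guardrail, which is the direction the text describes.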

What does multi-metric decision making look like in practice?

A team running an experiment to improve search result relevance might define their metrics as follows:

  • Success metric: click-through rate on the first search result
  • Guardrail metrics: search latency (p95), user-reported search quality (monthly survey), streams initiated from search
  • Quality metrics: number of searches per session, scroll depth on search results page

The experiment shows a 1.2 percentage point increase in click-through rate (statistically significant after multiple testing correction). Search latency p95 is unchanged. But streams initiated from search dropped by 0.8% (significant at the guardrail threshold). The team now faces a real decision: the change makes users more likely to click the first result, but those clicks lead to fewer streams. Users might be clicking out of curiosity rather than genuine interest, or the new ranking might be surfacing popular but less personally relevant results.

This is the kind of decision that a single-metric framework can't support. The multi-metric view makes the tradeoff explicit and forces the team to reason about what's happening, not just whether a number went up.
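
The search example can be expressed as a small decision sketch. The rule below is a simplified illustration, not Confidence's API or Spotify's exact decision rule: `MetricResult`, `recommend`, and the returned messages are hypothetical, and only the metrics whose results the example reports are included.

```python
from dataclasses import dataclass
from enum import Enum


class MetricType(Enum):
    SUCCESS = "success"
    GUARDRAIL = "guardrail"
    QUALITY = "quality"


@dataclass(frozen=True)
class MetricResult:
    name: str
    metric_type: MetricType
    improved: bool = False   # significant move in the desired direction
    regressed: bool = False  # significant move in the harmful direction


def recommend(results: list[MetricResult]) -> str:
    """Combine per-metric outcomes into one recommendation (illustrative rule)."""
    regressions = [
        r.name for r in results
        if r.metric_type is MetricType.GUARDRAIL and r.regressed
    ]
    wins = [
        r.name for r in results
        if r.metric_type is MetricType.SUCCESS and r.improved
    ]
    if regressions:
        # A guardrail regression always surfaces, even next to a success win.
        return "review tradeoff: guardrail regression in " + ", ".join(regressions)
    if not wins:
        return "do not ship: no success metric improved"
    return "ship candidate"


results = [
    MetricResult("click-through rate", MetricType.SUCCESS, improved=True),
    MetricResult("search latency p95", MetricType.GUARDRAIL),
    MetricResult("streams initiated from search", MetricType.GUARDRAIL, regressed=True),
]
decision = recommend(results)
print(decision)
```

A single-metric rule would see only the first entry and recommend shipping; here the regression in streams is reported alongside the win so the team has to weigh the two.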

How does Confidence support multi-metric decisions?

Confidence structures experiment analysis around this framework. When you set up an experiment, you designate metrics by type: success, guardrail, or quality. The platform applies the appropriate statistical corrections to each type. Guardrail metrics are monitored with inferiority tests (testing whether the treatment is worse than control) rather than the two-sided tests used for success metrics, because the question is different: you're not asking "did it change?" but "did it get worse?"
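
The difference between the two questions shows up in the p-value. The sketch below uses a plain z-test with a hypothetical z-statistic and is not how Confidence necessarily computes its tests; it only illustrates why a one-sided test is more sensitive to a regression than a two-sided test at the same significance level.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def two_sided_p_value(z: float) -> float:
    """P-value for 'did it change?' (either direction)."""
    return 2.0 * (1.0 - _STD_NORMAL.cdf(abs(z)))


def inferiority_p_value(z: float, higher_is_better: bool = True) -> float:
    """P-value for 'did it get worse?'.

    z is (treatment - control) / standard error. For a metric where higher
    is better, 'worse' means a negative z, so the lower tail is used.
    """
    return _STD_NORMAL.cdf(z) if higher_is_better else 1.0 - _STD_NORMAL.cdf(z)


# Hypothetical guardrail result: treatment is 1.8 standard errors below control.
z = -1.8
p_two_sided = two_sided_p_value(z)
p_inferiority = inferiority_p_value(z)

print(f"two-sided p-value:   {p_two_sided:.4f}")
print(f"inferiority p-value: {p_inferiority:.4f}")
```

At a 0.05 threshold, this drop is flagged by the one-sided inferiority test but not by the two-sided test, because the one-sided test spends its whole error budget on the harmful direction.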

The analysis view presents all metric types together so the decision-maker sees the complete picture. A green success metric next to a red guardrail metric is a common and important outcome, and the platform is designed to make that tension visible rather than hidden.

Related terms

  • Guardrail Metric: A guardrail metric is a metric monitored during an experiment to ensure the change doesn't cause unintended harm, even when the success metric improves.
  • Success Metric: A success metric is the primary metric an experiment is designed to move.
  • Multiple Testing Correction: A multiple testing correction is an adjustment to significance thresholds that accounts for evaluating more than one hypothesis in the same experiment.
  • Statistical Significance: Statistical significance is the determination that an observed difference between experiment groups is unlikely to have occurred by chance alone.
  • Family-Wise Error Rate: Family-wise error rate (FWER) is the probability of making at least one false positive across a set of hypothesis tests.

© 2026 Spotify

The Confidence name and logo are registered trademarks of Spotify.